12 Jun 08

High Performance Computing with CUDA

This article will probably not inspire everyone, but I hope it highlights some of the interesting developments in computer hardware and how computers are going to evolve over the next couple of years.

CUDA is a new technology from NVIDIA that gives you as much access as possible to the computing power of an off-the-shelf graphics card. Quite a number of NVIDIA cards now support it.

Basically, CUDA is an API to your graphics card. A GPU doesn’t look like the main CPU in your computer. It is composed of a number of multiprocessors, each containing smaller processors that work in unison on the chip.

The power of CUDA lies in the fact that the hardware is entirely geared towards heavy math calculation. And fast! Graphics applications rely on a lot of math operations, mostly floating-point arithmetic and often involving matrices, so GPUs are designed around matrix math and the ability to execute calculations in parallel.

Another very strong point of a graphics card compared with a CPU is the speed of the memory bus. Standard computer memory generally moves about 8 GB/s to and from the chip. Some older NVIDIA cards manage 20 GB/s, and newer cards go up to 80 GB/s. The peak figure is determined by the memory clock speed and the bus width (the number of bits that can travel across the bus at once). Memory latency then decides how much of that peak you actually see.
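To make the relation between clock speed, bus width and bandwidth concrete, here is a rough sketch of the arithmetic. It is not part of the NVIDIA SDK: the helper name `peakBandwidthGBs` is made up for illustration, and the sample figures in the usage note (a 128-bit bus at 700 MHz with double-data-rate memory) are assumptions chosen only because they land in the 20 GB/s class mentioned above.

```cuda
#include <cassert>

// Hypothetical helper (not a CUDA API call): theoretical peak memory
// bandwidth in GB/s, where 1 GB = 1e9 bytes.
//   memClockMHz        memory clock in MHz
//   busWidthBits       width of the memory bus in bits
//   transfersPerClock  2 for double-data-rate memory (assumed default)
double peakBandwidthGBs(double memClockMHz, int busWidthBits,
                        int transfersPerClock = 2)
{
    assert(memClockMHz > 0.0 && busWidthBits > 0 && transfersPerClock > 0);
    double transfersPerSecond = memClockMHz * 1.0e6 * transfersPerClock;
    double bytesPerTransfer   = busWidthBits / 8.0;
    return transfersPerSecond * bytesPerTransfer / 1.0e9;
}
```

With these assumed figures, `peakBandwidthGBs(700.0, 128)` works out to 700e6 × 2 × 16 bytes, or about 22.4 GB/s. This is a ceiling; real code reaches only part of it, depending on its memory access pattern.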

Another very important part of GPU design is that its threads of execution are supported in hardware. On your CPU and operating system, for example, threads are created and destroyed by the operating system and therefore cost CPU cycles to handle. Because you only have one CPU, every time your computer triggers a timer event, it may decide to switch to a different thread. This generally happens about 100 to 300 times per second, and on some computers 1000 times per second.

But the GPU has this built into the hardware. It’s more or less a dumb co-processor card that doesn’t have the same complexity as a full computer. You typically “load” a program onto it and it launches its hardware threads at that program. Each hardware thread executes exactly the same program or “function”.
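As a minimal sketch of what “every thread runs the same function” looks like in CUDA C: the kernel below is executed by every thread, and only the built-in index variables (`blockIdx`, `blockDim`, `threadIdx`) differ between them. The names `writeOwnIndex` and `fillWithIndices` and the choice of 256 threads per block are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>

// The "function" loaded onto the card. Every hardware thread runs this
// same code; each one derives its own position from the built-in indices.
__global__ void writeOwnIndex(int *out, int n)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;  // unique per thread
    if (pos < n)            // the last block may contain spare threads
        out[pos] = pos;
}

// Hypothetical host wrapper: fills hostOut[0..n) using the GPU.
cudaError_t fillWithIndices(int *hostOut, int n)
{
    int *devOut = nullptr;
    cudaError_t err = cudaMalloc((void **)&devOut, n * sizeof(int));
    if (err != cudaSuccess)
        return err;

    const int threadsPerBlock = 256;  // assumed block size
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    writeOwnIndex<<<blocks, threadsPerBlock>>>(devOut, n);

    err = cudaGetLastError();         // catches a bad launch configuration
    if (err == cudaSuccess)           // this copy waits for the kernel to finish
        err = cudaMemcpy(hostOut, devOut, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(devOut);
    return err;
}
```

The host code never loops over the elements. It only says how many threads to launch, and the hardware schedules them.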

The power here, though, is that you are not limited to the number of processors that are available: you can actually launch millions of threads at the program. So, although it is important to understand the architecture and design of the hardware if you wish to do this kind of programming, you’ll also be thinking in mathematical terms to compose your functions and operations in such a way that they make sense and produce the correct result. This means most of the thinking goes into the algorithm, the order in which it calculates, and how you synchronize the results before you continue with the next step.
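One common way to see why ordering and synchronization matter is a sum computed per block. This is a sketch of a standard shared-memory reduction, not code from the speed test described in this article. The names `blockSum` and `sumOnGpu` and the block size of 256 are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 256  // threads per block; the halving loop needs a power of two

// Each block adds up its own slice of `in` and writes one partial sum.
__global__ void blockSum(const int *in, int *partial, int n)
{
    __shared__ int cache[THREADS];
    int tid = threadIdx.x;
    int pos = blockIdx.x * blockDim.x + tid;

    cache[tid] = (pos < n) ? in[pos] : 0;  // step 1: each thread loads one value
    __syncthreads();                       // wait until the whole block has loaded

    // step 2: halve the number of active threads every round
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();                   // finish this round before the next one
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: returns true and stores the total in *result.
bool sumOnGpu(const int *hostIn, int n, long long *result)
{
    if (n <= 0) return false;
    int blocks = (n + THREADS - 1) / THREADS;
    int *devIn = nullptr, *devPartial = nullptr;
    std::vector<int> partial(blocks);

    bool ok = cudaMalloc((void **)&devIn, n * sizeof(int)) == cudaSuccess
           && cudaMalloc((void **)&devPartial, blocks * sizeof(int)) == cudaSuccess
           && cudaMemcpy(devIn, hostIn, n * sizeof(int),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        blockSum<<<blocks, THREADS>>>(devIn, devPartial, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(partial.data(), devPartial, blocks * sizeof(int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(devIn);
    cudaFree(devPartial);
    if (!ok) return false;

    long long total = 0;   // last step on the host, after the copy has completed
    for (int p : partial) total += p;
    *result = total;
    return true;
}
```

Without the `__syncthreads()` calls, a thread could read a neighbour’s slot before that neighbour has written it, and the result would be wrong only some of the time, which is exactly the kind of bug that is hard to track down. Note that each per-block partial sum is an `int`, so this sketch assumes the inputs are small enough not to overflow.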

Many people are already using CUDA for their research, applications or innovative ideas. But be aware: even though it’s super cool, it doesn’t necessarily mean that you would benefit from running things on your graphics card. Whether it helps really depends on the problem you are facing, and by using CUDA you introduce a good deal of complexity into your code, which makes it more difficult to debug later.

Why is CUDA so cool? Because for something like $250, give or take, you get the power of a supercomputer in your house. My 8600GT (which is outdated already) has 128 processors and 256 MB of memory, some of which is consumed by running the operating system. I ran a speed test on it by storing consecutive numbers in 96 MB of memory (each an int, thus 4 bytes), retrieving each number from memory (costing clock cycles and memory latency), adding 16, then storing the result in a different section of memory, which was also 96 MB. In effect, it was code like this:

g_out[ pos ] = g_in[ pos ] + 16;

When running this on the host CPU (a dual-core Intel E6850), it completed the entire loop in 48 seconds with no other applications running. Pushing the data to the graphics device through DMA, then launching 512 * 512 * 96 = 25165824 threads on it, it all completed in half a second. That is a performance difference of roughly a factor of 100 (48 s against 0.5 s). And this power is easily yours with the NVIDIA SDK and its bundle of demos and documentation!
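For readers who want to try something similar, here is a sketch of how such a test could be written. It is a reconstruction under assumptions, not the original test program: I am guessing that 512 * 512 * 96 means 512 threads per block on a grid of 512 x 96 blocks, the names `addSixteen` and `runAddSixteen` are made up, and the timing (host clock around copy in, kernel and copy out) is only one of several ways to measure it.

```cuda
#include <cuda_runtime.h>
#include <chrono>

const int THREADS_PER_BLOCK = 512;  // assumed: the first factor of 512 * 512 * 96

// One thread per element: read, add 16, write to the other buffer.
__global__ void addSixteen(const int *g_in, int *g_out)
{
    int block = blockIdx.y * gridDim.x + blockIdx.x;  // flatten the 2D grid
    int pos   = block * blockDim.x + threadIdx.x;
    g_out[pos] = g_in[pos] + 16;
}

// Hypothetical wrapper. hostIn and hostOut must each hold
// 512 * gridX * gridY ints. Returns the elapsed milliseconds for
// copy in + kernel + copy out, or -1.0 on any CUDA error.
double runAddSixteen(const int *hostIn, int *hostOut, int gridX, int gridY)
{
    size_t bytes = (size_t)THREADS_PER_BLOCK * gridX * gridY * sizeof(int);
    int *g_in = nullptr, *g_out = nullptr;
    double ms = -1.0;

    if (cudaMalloc((void **)&g_in, bytes) == cudaSuccess &&
        cudaMalloc((void **)&g_out, bytes) == cudaSuccess) {
        auto start = std::chrono::steady_clock::now();
        bool ok = cudaMemcpy(g_in, hostIn, bytes,
                             cudaMemcpyHostToDevice) == cudaSuccess;
        if (ok) {
            dim3 grid(gridX, gridY);
            addSixteen<<<grid, THREADS_PER_BLOCK>>>(g_in, g_out);
            // The device-to-host copy blocks until the kernel has finished.
            ok = cudaGetLastError() == cudaSuccess
              && cudaMemcpy(hostOut, g_out, bytes,
                            cudaMemcpyDeviceToHost) == cudaSuccess;
        }
        auto stop = std::chrono::steady_clock::now();
        if (ok)
            ms = std::chrono::duration<double, std::milli>(stop - start).count();
    }
    cudaFree(g_in);   // cudaFree accepts a null pointer
    cudaFree(g_out);
    return ms;
}
```

Calling `runAddSixteen(in, out, 512, 96)` launches the 25165824 threads mentioned above and needs two 96 MB buffers on the card, so it may not fit on a card whose memory is also driving the display. The kernel has no bounds check because the buffer size is defined as exactly one element per thread.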

And to top this all off, many computers already have room to plug in several video cards, not just one, and the driver allows you to use each card separately. Some architectures even allow you to copy memory from one card to another directly. The University of Antwerp built a 1500W machine with four high-grade graphics cards that runs tomographic calculations. It’s said to have the same power as a 250-node PC cluster that costs 3.5 million, except they only paid 4,000 EUR to build it.
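As a small sketch of what using several cards looks like from code, the function below lists the CUDA devices in a machine and reports which pairs can reach each other’s memory directly. It uses calls from the current CUDA runtime, which may not match what was available when this was written. The name `listCudaDevices` is made up for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints one line per CUDA device and returns the
// number of devices found, or -1 if the runtime reports an error.
int listCudaDevices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
            return -1;
        printf("device %d: %s, %d multiprocessors, %zu MB\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem >> 20);

        // Check which other cards this one can address directly.
        for (int other = 0; other < count; ++other) {
            int canAccess = 0;
            if (other != dev &&
                cudaDeviceCanAccessPeer(&canAccess, dev, other) == cudaSuccess &&
                canAccess)
                printf("  can access memory on device %d directly\n", other);
        }
    }
    return count;
}
```

To run work on a particular card you would call `cudaSetDevice(dev)` before allocating memory or launching kernels, and `cudaMemcpyPeer` is the runtime call for copying from one card’s memory to another’s.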

So there you go… High Performance Computing at your fingertips and at an affordable price!


2 Responses to “High Performance Computing with CUDA”


  1. J. Longoria, Jun 19th, 2008 at 4:36 am

    That’s incredible, Gerard. I think you undervalued the concept in this article. I had no idea these cards were capable of this. I’ve got two at home here… maybe I’ll take a crack at it. Good stuff!

