Chris Umbel

GPU-Based Computing in C with NVIDIA's CUDA

At work we do plenty of video and image manipulation, particularly video encoding. While it's certainly not a specialty of mine, I've had plenty of exposure to enterprise-level transcoding for projects like transcode.it, our free public transcoding system; uEncode, a RESTful encoding service; and our own in-house solutions (we're DVD Empire, BTW).

Of course, my exposure is rather high-level and large portions of the process still elude me, but I've certainly developed an appreciation for boxes with multiple GPUs chugging away at complex computation.

For a while now I've been hoping to dedicate some time to peer into the inner workings of GPUs and explore the possibility of using them for general-purpose, highly-parallel computing.

Well, I finally got around to it.

Most of the machines I use on a regular basis have some kind of NVIDIA card so I decided to see what resources they had to offer for general-purpose work. Turns out they set you up quite well!

They offer an architecture called CUDA which does a great job of rather directly exposing the compute resources of GPUs to developers. It supports Windows, Linux and Macs equally well as far as I can tell, and while it has bindings for many higher-level languages it's primarily accessible via a set of C extensions.

Like I said, I'm relatively new to this so I'm in no position to profess, but figured I might as well share one of the first experiments I did while familiarizing myself with CUDA.

I'd also appreciate some feedback, as I'm just getting my feet wet.

Getting Started & Documentation

Before going any further check out the "Getting Started Guide" for your platform on the CUDA download page. It will indicate what you specifically have to download and how to install it. I've only done so on Macs but the process was simple, I assure you.
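
If you'd like a quick sanity check that the toolkit and driver can actually talk to each other, a tiny program along these lines should print the driver and runtime versions. This is just a sketch of my own, but cudaDriverGetVersion and cudaRuntimeGetVersion are standard runtime API calls; compile it with nvcc just like the example below.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int driver_version = 0, runtime_version = 0;

  /* query the installed driver and the runtime this program was built against */
  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);

  printf("CUDA driver version: %d\n", driver_version);
  printf("CUDA runtime version: %d\n", runtime_version);

  return 0;
}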

Example

Ok, here's a little example C program that performs two parallel executions (the advantage of using GPUs is parallelism after all) of the "Internet Checksum" algorithm on some hard-coded sample data.

First I'll blast you with the full source then I'll walk through it piece by piece.

#include <stdio.h>
#include <cuda_runtime.h>

/* "kernel" to compute an "internet checksum" */
__global__ void inet_checksum(unsigned char *buff, size_t pitch,
    int len, unsigned short *checksums) {
  int i;
  long sum = 0;

  /* advance to where this thread's data starts. the pitch
     ensures optimal alignment. */
  buff += threadIdx.x * pitch;
  unsigned short datum;

  for(i = 0; i < len / 2; i++) {
    datum = *buff++ << 8;
    datum |= *buff++;
    sum += datum;
  }

  while (sum >> 16)
    sum = (sum & 0xffff) + (sum >> 16);

  sum = ~sum;
  /* write data back for host */
  checksums[threadIdx.x] = (unsigned short)sum;
}

int main (int argc, char **argv) {
  int device_count;
  int size = 8;
  int count = 2;
  unsigned short checksums[count];
  int i; 

  unsigned char data[16] = {
           /* first chunk */
           0xe3, 0x4f, 0x23, 0x96, 0x44, 0x27, 0x99, 0xf3,
           /* second chunk */
           0xe4, 0x50, 0x24, 0x97, 0x45, 0x28, 0x9A, 0xf4};

  /* ask cuda how many devices it can find */
  cudaGetDeviceCount(&device_count);

  if(device_count < 1) {
    /* if it couldn't find any, fail out */
    fprintf(stderr, "Unable to find CUDA device\n");
    return 1;
  } else {
    /* for the sake of this example just use the first one */
    cudaSetDevice(0);

    unsigned short *gpu_checksum;
    /* create a place for the results to be stored in the GPU's
       memory space. */
    cudaMalloc((void **)&gpu_checksum, count * sizeof(short));

    unsigned char *gpu_buff;
    size_t gpu_buff_pitch;

    /* create a 2d pointer in the GPU's memory space */
    cudaMallocPitch((void**)&gpu_buff, &gpu_buff_pitch,
      size * sizeof(unsigned char), count);

    /* copy our hard-coded data from above into the GPU's
       memory space, correctly aligned for 2d access. */
    cudaMemcpy2D(gpu_buff, gpu_buff_pitch, &data,
      sizeof(unsigned char) * size,
      size, count, cudaMemcpyHostToDevice);

    /* execute the checksum operation. two threads
       of execution will be launched due to the count param. */
    inet_checksum<<<1, count>>>(gpu_buff, gpu_buff_pitch, size,
      gpu_checksum);

    /* copy the results from the GPU's memory to the host's */
    cudaMemcpy(&checksums, gpu_checksum,
      count * sizeof(short), cudaMemcpyDeviceToHost);

    /* clean up the GPU's memory space */
    cudaFree(gpu_buff);
    cudaFree(gpu_checksum);

    for(i = 0; i < count; i++)
        printf("Checksum #%d 0x%x\n", i + 1, checksums[i]);
  }

  return 0;
}

Dissection

Phew, alright. There wasn't really all that much to it, but I'm sure many of you will appreciate some explanation.

I'm sure you know the first directive. The second pulls in the CUDA runtime.

#include <stdio.h>
#include <cuda_runtime.h>

The following is what's referred to as a "kernel" in CUDA. It's basically a function that can execute on a GPU. Note that the function is declared __global__ and returns void. The details of the function really aren't the subject of this article. In this case it calculates the "internet checksum" of the incoming buff, but here's where you'd put your highly-parallelizable, computationally intensive code.

The pitch will make more sense later. It helps deal with the memory alignment of multi-dimensional data, which is what buff turns out to be despite being passed as a one-dimensional pointer: one row per thread, each eight bytes wide.
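
If it helps, the general pattern for reaching row r, column c of a pitched buffer looks roughly like the sketch below (r and c are hypothetical names here, not part of the example).

  /* pitch is the allocated width of a row in bytes, possibly padded
     beyond the row's useful data to keep every row optimally aligned */
  unsigned char *row = buff + r * pitch;
  unsigned char value = row[c];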

Also have a look at threadIdx.x. That's how you determine which thread you're running in, and you can use it to read from and write to the correct indexes in vectors, etc.
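
We only launch a single block in this example, so threadIdx.x by itself identifies a thread. If you were to spread work across multiple blocks, the usual pattern (as I understand it) is to combine it with the block index, something like:

  /* a conventional way to compute a unique index across all blocks */
  int idx = blockIdx.x * blockDim.x + threadIdx.x;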

/* "kernel" to compute an "internet checksum" */
__global__ void inet_checksum(unsigned char *buff, size_t pitch,
    int len, unsigned short *checksums) {
  int i;
  long sum = 0;

  /* advance to where this thread's data starts. the pitch
     ensures optimal alignment. */
  buff += threadIdx.x * pitch;
  unsigned short datum;

  for(i = 0; i < len / 2; i++) {
    datum = *buff++ << 8;
    datum |= *buff++;
    sum += datum;
  }

  while (sum >> 16)
    sum = (sum & 0xffff) + (sum >> 16);

  sum = ~sum;
  /* write data back for host */
  checksums[threadIdx.x] = (unsigned short)sum;
}

Getting the party started. Note that the size and count variables indicate that we have two elements of eight bytes apiece. They'll carve up our hard-coded data.

int main (int argc, char **argv) {
  int device_count;
  int size = 8;
  int count = 2;
  unsigned short checksums[count];
  int i; 

Now here's the data we're going to checksum. We'll actually be treating this as two distinct values: the first eight bytes will be checksummed in one GPU thread while the second eight bytes are checksummed in another.

  unsigned char data[16] = {
           /* first chunk */
           0xe3, 0x4f, 0x23, 0x96, 0x44, 0x27, 0x99, 0xf3,
           /* second chunk */
           0xe4, 0x50, 0x24, 0x97, 0x45, 0x28, 0x9A, 0xf4};

The comment says it all. We're just asking CUDA how many devices it can find. We could then use that information later to distribute load to GPUs.

  /* ask cuda how many devices it can find */
  cudaGetDeviceCount(&device_count);
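
Incidentally, if you want to know what those devices actually are, cudaGetDeviceProperties will tell you. A sketch like the following could drop in right after the call above (properties is just a name I picked):

  struct cudaDeviceProp properties;

  for(i = 0; i < device_count; i++) {
    /* fills in the device's name, memory sizes, multiprocessor count, etc. */
    cudaGetDeviceProperties(&properties, i);
    printf("Device %d: %s (%d multiprocessors)\n",
      i, properties.name, properties.multiProcessorCount);
  }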

For the most part we'll ignore it, however. We will make sure at least one was found, as there's no point to all this if we can't slap our load on a GPU! Assuming a GPU was found, we'll call cudaSetDevice to direct CUDA to run our GPU routines there.

  if(device_count < 1) {
    /* if it couldn't find any, fail out */
    fprintf(stderr, "Unable to find CUDA device\n");
    return 1;
  } else {
    /* for the sake of this example just use the first one */
    cudaSetDevice(0);

Now I'll create a vector for the checksums to be written into by our "kernel". Think of cudaMalloc as a typical malloc call except the memory is reserved in the GPU's space. We won't directly access that memory. Instead we'll copy in and out of it. The use of count indicates that it'll have room for two unsigned short values.

    unsigned short *gpu_checksum;
    /* create a place for the results to be stored in the GPU's
       memory space. */
    cudaMalloc((void **)&gpu_checksum, count * sizeof(short));
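
One thing I'm glossing over: every one of these cuda* calls returns a cudaError_t, and in anything beyond a toy example you'd want to check it. A minimal sketch of what that might look like for the allocation above:

    cudaError_t err = cudaMalloc((void **)&gpu_checksum,
      count * sizeof(short));

    if(err != cudaSuccess) {
      /* cudaGetErrorString turns the error code into something readable */
      fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
      return 1;
    }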

Here's some more allocation, but in this case it's using a pitch. This is for the memory we'll write our workload into. We're using cudaMallocPitch because this data is essentially two-dimensional and the pitch facilitates optimal alignment in memory. It's basically allocating two rows of eight bytes each.

    unsigned char *gpu_buff;
    size_t gpu_buff_pitch;

    /* create a 2d pointer in the GPU's memory space */
    cudaMallocPitch((void**)&gpu_buff, &gpu_buff_pitch,
      size * sizeof(unsigned char), count);

Now cudaMemcpy2D will shove the workload into the two-dimensional buffer we allocated above. Think memcpy for the GPU. Care is taken to specify the dimensions of the data with the pitch, size and count. The cudaMemcpyHostToDevice parameter directs the data to the GPU's memory space rather than from it.

    /* copy our hard-coded data from above into the GPU's
       memory space, correctly aligned for 2d access. */
    cudaMemcpy2D(gpu_buff, gpu_buff_pitch, &data,
      sizeof(unsigned char) * size,
      size, count, cudaMemcpyHostToDevice);

Here's the money. See the <<<..., ...>>> business? The first argument is "blocks per grid", but I'll leave NVIDIA to explain that one to you in the CUDA C Programming Guide. The second argument indicates how many threads per block will be spawned. Like I said, this is all about parallelism. Consider our inet_checksum "kernel" hereby invoked twice in parallel!

    /* execute the checksum operation. two threads
       of execution will be launched due to the count param. */
    inet_checksum<<<1, count>>>(gpu_buff, gpu_buff_pitch, size,
      gpu_checksum);
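
One caveat worth knowing: kernel launches are asynchronous and don't return an error directly. The cudaMemcpy below will implicitly wait for the kernel to finish, but if you wanted to check the launch explicitly you'd do something roughly like the sketch below (cudaThreadSynchronize is the call in the toolkit I'm using; newer toolkits spell it cudaDeviceSynchronize):

    /* cudaGetLastError catches launch failures (a bad configuration, etc.)
       while cudaThreadSynchronize waits for the kernel and surfaces any
       errors that occurred while it ran */
    cudaError_t launch_err = cudaGetLastError();
    cudaError_t run_err = cudaThreadSynchronize();

    if(launch_err != cudaSuccess || run_err != cudaSuccess)
      fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(
        launch_err != cudaSuccess ? launch_err : run_err));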

Once the "kernel" executions finish we've successfully executed our logic on a GPU! The results are still sitting in the GPU's memory space, however. We'll simply copy them out with cudaMemcpy while specifying cudaMemcpyDeviceToHost for the direction. The results then land in the checksums vector.

    /* copy the results from the GPU's memory to the host's */
    cudaMemcpy(&checksums, gpu_checksum,
      count * sizeof(short), cudaMemcpyDeviceToHost);

CUDA has its own allocation and copying, and of course its own clean-up. We'll be good citizens and use it here.

    /* clean up the GPU's memory space */
    cudaFree(gpu_buff);
    cudaFree(gpu_checksum);

Might as well let the user know the results, no?

    for(i = 0; i < count; i++)
        printf("Checksum #%d 0x%x\n", i + 1, checksums[i]);
  }

  return 0;
}

Compiling and Execution

Assuming you've installed the CUDA SDK according to the documentation, you can compile with:

> nvcc -o yourprogram yoursourcefile.cu

and execution produces:

> ./yourprogram
Checksum #1 0x1aff
Checksum #2 0x16fb

.cu is the preferred extension for source files handled by nvcc, the CUDA compiler driver.
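
If nvcc complains about the target architecture, or you simply want to build for your card's compute capability, you can pass an architecture flag. For example (sm_11 is just a placeholder here, substitute whatever your GPU supports):

> nvcc -arch=sm_11 -o yourprogram yoursourcefile.cu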

Conclusion

There you have it. Execution of your own logic on a GPU.

Where to go from here? Well, this barely scratched the surface, but NVIDIA's CUDA Zone site is the starting point for much more.

GPGPU.org is also a good, more platform-independent source of information on general-purpose GPU computing.

Sun Jan 09 2011 03:01:45 GMT+0000 (UTC)
