Tips/tricks related to Computer Graphics, GPUs and other programming

GitHub Repository here

I’ve been meaning to get into OpenCL for a while, so I thought I would write a simple program which would evaluate Bicubic Bezier patches (surfaces) in parallel.
I also wanted to program in Python more, so I decided to use PyOpenCL. You can find more info about the OpenCL implementation in Python from the link.

I am not going to go over what a Bezier patch is and how it can be evaluated. There is a straightforward equation for evaluating one which I will be showing later on, but for more info about Bezier patches you can find other info online.

Let’s begin.


If you have a surface made of multiple Bezier patches they can be evaluated in parallel. The new shaders in DirectX 11 (Hull and Domain shaders) make this easier for graphics applications and rendering. There is a straightforward method of evaluating these patches using a simple equation. I am going to demonstrate the simple, straightforward way first ( and then another way which I thought would work faster (

What you’ll need:

  • Python (Preferably 2.7)
  • NumPy
  • PyOpenCL
  • A Graphics card which supports OpenCL. Any recent card should work fine (I used an old Nvidia 8800 GT).


This is not intended to be an OpenCL tutorial, but I will be going over what I did in PyOpenCL (getting the context, setting up the input buffers, writing the kernels, getting the output, etc.) For a more detailed look into OpenCL I would suggest OpenCL tutorials by Nvidia.
I will not be going over in detail what a work-item is, what a work-group is, etc.

Quick note for CUDA users: For some reason CUDA and OpenCL use different terminology. I will be using OpenCL terminology throughout this post. So here’s a quick translation for CUDA users:

  • CUDA thread = OpenCL work-item
  • CUDA block = OpenCL work-group
  • CUDA shared memory = OpenCL local memory
  • CUDA local memory = OpenCL private memory


The input to both programs is a BezierView file with the control point coordinates of all the patches. BezierView is a program developed by my graduate research lab (SurfLab) for the purposes of rendering different kinds of surfaces and viewing their properties (like Gaussian curvature, highlight lines, etc.)

Here is a sample file:

The input method is called readBezierFile and returns a 1-D list with all the vertices in it (16 vertices per patch).


PyOpenCL requires the use of NumPy arrays as input buffers to the OpenCL kernels.

NumPy is a python module used for scientific computing, and it allows better creation of multi-dimensional arrays which are widely used for OpenCL. I won’t go over in much detail about what NumPy is, but I will show you how I use it.

First, we need to convert the regular Python list with all the vertices in it to a NumPy array:

import numpy


# try to read from a file here - returns the array
vertices = readBezierFile("")

# create numpy array
npVertices = array( vertices, dtype = float32)

The function “array” converts a Python list to a NumPy array. The “dtype” represents the type of values we will be storing in the array (32-bit floating numbers in this case).

Initial Setup:

I read the file with all the OpenCL kernels in it first (

After that I setup the UV-buffer. Evaluating a Bicubic patch requires U and V values. The more UV-value pairs there are the more detailed surface is output (from 0.0 to 1.0). The UV-value generation code is:

# Array of UV values - 36 in total (detail by 0.2)
uvValues = empty( (36, 2)).astype(numpy.float32)

index = 0
for u in range(0, 12, 2): # step = 2
    for v in range(0, 12, 2):
        # conver the ints to floats
        fU = float(u)
        fV = float(v)
        uvValues[index] = [ fU/10.0, fV/10.0]
        index = index + 1

The “empty” function is used for creating a NumPy array with empty values. We are going to be generating an array of size 36 * 2. (UV values spaced by 0.2)

OpenCL Setup:

Before we do computation we have to do some OpenCL setup. This can be a little tedious when writing in C/C++, but PyOpenCL makes it much easier.

First, we have to create an OpenCL Context. This can be done in two ways (I used the second one for my clarification):

  1. This simple way will automatically choose a device for you (usually the GPU if it’s available). Very convenient.
    ctx = cl.create_some_context()
  2. This way goes through all the platforms/devices and creates a context specifically with that device.
    # Platform test
    for found_platform in cl.get_platforms():
        if == 'NVIDIA CUDA':
            my_platform = found_platform
            print "Selected platform:",
    for device in my_platform.get_devices():
      dev_type = cl.device_type.to_string(device.type)
      if dev_type == 'GPU':
            dev = device
            print "Selected device: ", dev_type
    # context
    ctx = cl.Context([dev])

Next we need to create the Command Queue:

cq = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

You can just pass in the Context to the function, but I want to point out that I passed in the PROFILING_ENABLE flag, which allows us to determine how much time the operation took. This will allow us to compare the different methods.

Now finally we setup the input/output buffers to be sent to the GPU:

# memory flags
mf = cl.mem_flags

# input buffers
# the control point vertices
vertex_buffer = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=npVertices)

# the uv buffer
uv_buffer = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=uvValues)

# final output
output_buffer = cl.Buffer(ctx, mf.WRITE_ONLY, uvValues.nbytes * 2 * numPatches)

The output buffer is to be written to (WRITE_ONLY), and we specify the size of the output buffer in terms of bytes. the “nbytes” property in a NumPy array lets us know the size in terms of bytes of that array.
For each patch we will have 36 * 4 * 4 bytes.
36 = number of UV pairs
4 = number of floating-point values output for each UV-value
4 = number of bytes for a float32 value
Hopefully that is clear.

Now we can start evaluating!

Method #1: The simple, straightforward way:

The first method directly evaluates the patch for 1 UV-pair. The kernel takes in the UV-pair, reads in the control points for the corresponding patch and then outputs it to the output buffer. The Python evaluation code is as follows:

# the global size (number of uv-values * number of patches): 36 * numPatches
numUVs = uvValues.size/2 # 72/2 = 36
globalSize = numUVs * numPatches

localSize = numUVs # 36 work-items per work-group

# evaluate
exec_evt = prg.bezierEval2Multiple(cq, (globalSize,), (localSize,), vertex_buffer, uv_buffer, output_buffer)

First we calculate the global memory size (total number of kernels needed), and then the local memory size (# of work-items per work-group). The global memory size is basically the number of UV-value pairs.

We will need this information to send to the evaluation kernel. The kernel name is “bezierEval2Multiple” (apologies for the incoherent names).
The kernel needs to take in the Command Queue, the global memory size and the local memory size first.

After that we pass in the input and output buffers as parameters. Once the kernel is called we wait for it to finish (exec_evt.wait()).

The kernel is as follows (in

__kernel void bezierEval2Multiple(__global const float4 *controlPoints, __global float2 *uvValues,
                        __global float4 *output) {

   // get the global id and the corresponding patch number
   int gid = get_global_id(0);

   // get the patch number - the patch info we will have to read
   int numPatch = gid/get_local_size(0);

   // get the uv values - local id goes from 0-35
   int lid = get_local_id(0);
   float2 uv = uvValues[lid];

   // evaluate row1
   float4 b00 = evalCubicCurve(controlPoints[numPatch * 16 + 0], controlPoints[numPatch * 16 + 1], controlPoints[numPatch * 16 + 2], controlPoints[numPatch * 16 + 3], uv.x);

   float4 b01 = evalCubicCurve(controlPoints[numPatch * 16 + 4], controlPoints[numPatch * 16 + 5], controlPoints[numPatch * 16 + 6], controlPoints[numPatch * 16 + 7], uv.x);

   float4 b02 = evalCubicCurve(controlPoints[numPatch * 16 + 8], controlPoints[numPatch * 16 + 9], controlPoints[numPatch * 16 + 10], controlPoints[numPatch * 16 + 11], uv.x);

   float4 b03 = evalCubicCurve(controlPoints[numPatch * 16 + 12], controlPoints[numPatch * 16 + 13], controlPoints[numPatch * 16 + 14], controlPoints[numPatch * 16 + 15], uv.x);

  // evaluated point
  output[gid] = evalCubicCurve(b00, b01, b02, b03, uv.y);

For each kernel we need to figure out which Bezier patch we are going to evaluate, what UV-pair we are going to use and where we need to place our output in the output buffer.

The global id let’s us know where we need to place our output.
To get the patch we just divide our global id by the local memory size.
The UV-pair we want to evaluate is our local id. The local memory size is 36, which is the # of UV-value pairs, so we can easily find out which one we need to use.

Finally we evaluate the Bezier patch and place it in the output buffer. A simple kernel.

Method #2: The complex, trying-to-be-more-parallel way:

The “problem” I noticed with the first kernel is how all the work-items in each work-group have to read the same control points from global memory again and again. At this point I was still iffy on the concept of “local memory” so I read up on it and found out that reading from local memory is faster than global memory (obviously). So will using local memory allow me to make the whole process faster? I decided to come up with a different way which will use a lot more work-items but where each work-item does “less” work (only simple bilinear interpolation across selective control points).

This method did not turn out to be faster. My advisor already told me before I fully implemented this that the first method will be much faster, and he was right. Turns out it was 3 times slower. In any case here is the concept:

Bezier Patches can be evaluated using what is known as De Casteljau’s Algorithm (The link is for curves but can be easily adapted to surfaces). It basically involves linearly interpolating (bilinearly interpolating for surfaces) the control points and their results until you get the evaluated point. I try to parallelize the interpolation part so it can be done faster.

Here are a few images showing the steps for a UV-pair of (0.5, 0.5):


First, here are the 16 control points of a patch:

Bicubic Control Points

16 Bicubic Control Points


The next step would be to do a bilinear interpolation across each “square” (4) of control points per kernel. The results are the blue points shown below: (since U and V are both 0.5, the point lies in the middle)

First bilinear interpolation pass (results in biquadratic patch)


We repeat the process for this biquadratic patch: (results are points in red)

Second interpolation pass (results in linear patch)


Finally we do one last bilinear interpolation to get the evaluated point (in green):

The point at (U,V) = (0.5, 0.5)


You have probably figured it out by now, but the way this method works in parallel is that it generates 9 work-items per UV-pair. Each work-item does a bilinear interpolation and stores it’s result in local memory. This completes the first pass.

After the first pass we need to repeat the process, but this time we only need 4 work-items. This is where I encounter a problem: the other 5 work-items don’t do anything. Then for the final evaluation I only use 1 work-item, with others not doing work.

The kernel (bezierEvalMutiple) shown below:

__kernel void bezierEvalMultiple(int numPatches, __global const float4 *controlPoints, __global float2 *uvValues,
                     __global float4 *output, __local float4 *local_points, __local float4 *local_points2) {

	// get the global id
	int gid = get_global_id(0)/get_local_size(0); // divide by 9
	// get the patch number
	int numPatch = gid/36; //get_local_size(0);
	// get the uv values 
	float2 uv = uvValues[gid];//get_global_id(0)];
	//output[gid] = (float4)(numPatch, numPatch, 0, 0);
	// get the row
	int lid = get_local_id(0);
	int patchNum = numPatch * 16; 	// the patch to deal with
	int row = lid/3;
	int index = row + numPatch + lid;
	// get the 4 control points you'll need
	   |   |
	float4 b00 = controlPoints[index];
	float4 b01 = controlPoints[index+1];
	float4 b10 = controlPoints[index+4];
	float4 b11 = controlPoints[index+5];
	// do linear interpolation across u first
	float4 val1 = lerp(b00, b01, uv.x);
	float4 val2 = lerp(b10, b11, uv.x);
	// then across v
	float4 newPoint = lerp(val1, val2, uv.y);
	// store it in the local memory
	local_points[lid] = newPoint;
	// synchronize in work-group
	// only do it for certain work-items (now evaluating quadratic)
	if (lid < 4) {
		row = lid/2; 		// 2 = degree
		float4 b002 = local_points[row + lid];
		float4 b012 = local_points[row + lid+1];
		float4 b102 = local_points[row + lid+3];
		float4 b112 = local_points[row + lid+4];
		// do linear interpolation across u first
		float4 val12 = lerp(b002, b012, uv.x);
		float4 val22 = lerp(b102, b112, uv.x);
		// then across v
		float4 newPoint2 = lerp(val12, val22, uv.y);
		local_points2[lid] = newPoint2;
		// output - only do it for the first work-item
		if (lid == 1) {
			float4 val13 = lerp(local_points2[0], local_points2[1], uv.x);
			float4 val23 = lerp(local_points2[2], local_points2[3], uv.x);
			float4 newPoint3 = lerp(val13, val23, uv.y);
			output[gid] = newPoint3;

I won’t go over much of this (you can probably figure it out), but I just wanted to mention how I synchronize between work-items (in a work-group) at each pass. You have to call barrier(CLK_LOCAL_MEM_FENCE); to ensure synchronization at each point.

This method runs 3 times slower – I would guess mainly because it requires 9 times more work-items and just straightforward multiplication is fast enough. At the very least I learned how to use local memory and synchronize between work-items.


In the end we want to find out the time elapsed and read back the output. This is how it’s done in Python:

elapsed = exec_evt.profile.end - exec_evt.profile.start
print("Execution time of test: %g " % elapsed)

# read back the result
eval = empty( numUVs * 4 * numPatches).astype(numpy.float32)
cl.enqueue_read_buffer(cq, output_buffer, eval).wait()

We create an empty NumPy buffer and then call enqueue_read_buffer to read from the output buffer into the array.


I showed a simple example of using (Py)OpenCL for evaluating Bezier patches. The next step would be to render them using PyOpenGL, and perhaps allow user manipulation of the control points. This project helped me get a start on OpenCL and learn some of the basics. I would recommend using PyOpenCL since it allows you to write applications quickly.

If you have any questions, comments or suggestions please feel free to ask.

GitHub Repository here


Comments on: "Evaluating Bicubic Bézier Surfaces using (Py)OpenCL" (1)

  1. I don’t understand the implementation entirely.

    But at least for the above example you may be optimize it further.
    You can use read_imagef() functions if CL_R, CL_FLOAT format is supported on your system.
    Idea is to do the first pass just by reading pixels. After that I guess you have to use your approach (images can either readonly on writeonly all the passes could be written using image operations)

    Hmm, now I realize image read functions doesn’t guarantee correct values 😦 interpolation might be done in lower precision

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: