Jacob Vosmaer's blog

Ray Tracing in One Weekend in C

2024-02-19

In this post I reflect on following the Ray Tracing in One Weekend course in C.

The book

"Ray Tracing in One Weekend" is a free online book about 3D computer graphics. It tells you how to build a self-contained program that uses ray tracing to generate a 3D image. Each chapter ends with an image. If you wrote your program correctly, its output matches the example in the book (up to randomness).

The first image you produce looks like this:

a square image with a color gradient from red to green

By the time you finish the last chapter your program is able to generate an image like this:

spheres of different size, color and material

The book encourages you to write your program in C++ and it contains partial C++ listings that show the changes you need to make at each step. The successive improvements are explained and motivated by drawings and mathematical formulas but it is possible to follow along and write your program without really understanding the math.

Of course you are free to use a programming language other than C++ and I chose C.

My impressions

The book is very well structured and has a good pace. I chose to write my program in C instead of C++ because I am currently interested in getting better at C. This required me to constantly "translate" the listings in the book but that is not too difficult because the languages are so similar.

At times it is frustrating to write a bunch of code without really understanding what it's doing but it is still rewarding to reach all the intermediate goals. There is the conceptual goal of getting each iteration of the program running but also the visual goal of the image you generate. It feels like a little triumph each time the images from your program matches the image in the book.

Also, in spite of the occasional frustration of merely doing what I am told, it does feel like I learned something in the process.

Adapting the book to C

I will share some observations about how I adapted the C++ code of the book to C.

The graphics code uses a lot of vector arithmetic on 3-dimensional vectors for both points in 3D space and RGB colors. The book uses C++ operator overloading to make it possible to write things like v + w for vectors v, w. In my C version that ends up as a regular function call v3add(v, w). While this is more verbose I did not find it difficult to work with.

The book introduces a C++ abstract class for "hittable objects", meaning physical objects in the 3D scene that can be hit by a ray of light. The only hittable objects in the book are spheres but the idea is you could also add different shapes. C does not have an equivalent built-in construct for an abstract class. Because we only have spheres anyway, I just did not create this abstraction in my version. Wherever the book has a hittable object, my program has a sphere.

Another abstraction introduced by the book is that of a sphere material. The book implements three materials: a matte material it calls Lambertian, metal, and dielectric (e.g. glass). To factor out the different behaviors of the materials the book again uses an abstract class material with one virtual method scatter which determines if and how a ray of light bounces off the material. Because there is only that one method on the material class I could get away with a normal scatter function in C which performs a switch on the material type enum.

Finally, the book makes a point of using shared pointers to manage the lifecycle of C++ objects. Maybe this does something useful in the C++ version but my program uses stack allocation for practically all objects and this works fine. In the few places where I use malloc, the allocation exists for the life of the process so there is no need to call free, let alone worry about when or where to call free. In other words memory management was a non-issue for me.

Performance

The final image in the book is 1200 * 675 = 810,000 pixels large. Each pixel is computed 500 times and then averaged so the program has to trace 810,000 * 500 = 405,000,000 rays. Computing each pixel more than once is needed to prevent stair stepping artifacts (aliasing) and to correctly implement dielectric materials (glass). When a ray hits a dielectric material it effectively splits in two: a reflected part (the mirror image) and a refracted part (the transparent image). The ray tracer randomly chooses to either reflect or refract. By repeating this many times and averaging, you get the combined effect of both reflection and refraction.

It speaks to the good structure of the book and the speed of modern computers and compilers that my first, single threaded implementation of the image ran in under 12 minutes on my M1 MacBook Pro. On the other hand, 12 minutes is slow enough for me to want to make the program go faster.

SIMD

I recently learned how to use ARM Neon SIMD instructions in C. Because SIMD is really about vector operations, and because my ray tracing program does so much vector arithmetic, it seemed like an obvious thing to try.

This was probably the first time I was glad I used a typedef for a complex data structure. My code already used vec3 instead of struct vec3 everywhere. This made it possible to change the underlying type of vec3 at compile time to float32x4_t which is a special Neon type which holds 4 float values. I only need 3 so I set the 4th value to 0.

My vector operations mapped directly to Neon intrinsics.


vec3 v3add(vec3 v, vec3 w) { return vaddq_f32(v, w); }
vec3 v3sub(vec3 v, vec3 w) { return vsubq_f32(v, w); }
vec3 v3mul(vec3 v, vec3 w) { return vmulq_f32(v, w); }
vec3 v3scale(vec3 v, float c) { return vmulq_n_f32(v, c); }
float v3dot(vec3 v, vec3 w) { return vaddvq_f32(vmulq_f32(v, w)); }

For comparison, the struct version of v3add.


vec3 v3add(vec3 v, vec3 w) {
  v.x += w.x;
  v.y += w.y;
  v.z += w.z;
  return v;
}

My program was still single threaded when I added SIMD. It made the program run about 10% faster with practically no changes outside the vec3 functions. It may be possible to go even faster if you write the program for SIMD from the ground up but I don't know how to do that or how much time that would take. Moreover you would end up with either a program that only runs on ARM processors, or with multiple separate implementations of the same program. Right now the combined amount of duplicated code is 66 lines.

Multi-threading

Although it is fun to play with SIMD, a more promising optimization is to use multi-threading. The program performs its raytracing algorithm for 405M rays which are all independent: it has a high degree of parallelism.

After adding SIMD my single-threaded program ran in about 620 seconds.

The book structures the program to write the output file one pixel at a time. That means you do 500 ray tracing operations, take their average, then write 1 pixel. My first attempt at multi-threading followed this structure by fanning out the raytracing work to 8 worker threads, wait for them to trace 500 rays, then fan in and compute the result pixel. This got the runtime down to about 175 seconds but I could see on the macOS Activity Monitor that the process used only about 600% CPU. With 8 threads you would expect 800%. By using the 'Sample' button in Activity Monitor I could also see my threads were spending a lot of time locking and unlocking mutexes. This was because I had to do the fan-out and fan-in 810,000 times!

It took me a while but I eventually realized I needed to move away from "pixel at a time". I modified the program to create a big buffer for the output image. I then changed the threading code so that each thread fills in its own set of scan lines. Thread 0 renders line 0, 8, 16 etc. into the output buffer. Thread 1 renders line 1, 9, 17, etc. This way the main thread only needs to create the 8 worker threads and wait for them to finish. Once all threads are done the main threads write the image file from the output buffer in one pass. There is no coordination between the threads so they can spend all their time doing ray tracing. After this change my program did reach 800% CPU, and it completed the work in about 125 seconds.

There is still room for improvement because with 8 threads you would want the work to take 620 / 8 = 77 seconds. The book suggests giving each thread its own random number generator. If I were to try to make my version faster I think that is what I would try next.

Conclusion

I learned about this course because other Recurse participants in my batch went through the book. Even though my batch is over I decided I could still try it out myself. I am happy to see that I am now fluent enough in C to actually do the work in a weekend as the title suggests.

It was fun to also do some performance tweaking with my experience with the One Billion Row Challenge still fresh in my mind.

If you like programming and are curious about 3D graphics you might enjoy this book too!

Tags: recurse c

IndexContact