Taking on the One Billion Row Challenge in C
2024-02-06
I will share below what I have done and learned by taking on the One Billion Row Challenge in C.
Am I allowed to do this challenge?
Earlier this year I read the announcement of the One Billion Row Challenge. The challenge was to parse and summarize 1 billion rows of CSV as fast as possible... in Java.
I remember reading the blog post and thinking "this sounds like something I would enjoy" until I hit the requirement to do it in Java. Not because I have something against Java, but because I have virtually no experience in that language, so the challenge would become "using Java" rather than "processing data quickly".
I decided it wasn't for me and I moved on. Then several weeks later I found out that many people did the challenge for themselves in whatever language they felt like, and I felt stupid for not realizing I could have done the same. When I was done feeling stupid I started doing the challenge for myself after all, in C.
A big lesson here is that I need to be more critical when I tell myself I cannot do something or that something is not for me.
Generating the inputs
The official repository for the challenge contains Java code that can generate random data to test your solution on. There is no download link for the test data, you have to generate it yourself.
I tried generating the data on my local machine but I kept running into Java compiler errors that I did not understand. I decided it would be more fun to implement my own test data generator than to fix the compiler errors.
The test data purports to be a series of temperature readings from weather stations. The challenge repository contains a 45-thousand-line CSV file of (weather station, mean temperature) pairs. The test generator selects up to 10,000 rows from the weather station CSV and then generates N random readings for those weather stations. The temperature values are generated by the Java nextGaussian standard library function.
The C standard library has no nextGaussian function. The Java documentation however helpfully explains that the underlying algorithm is the Marsaglia polar method. The Wikipedia page for that algorithm has C++ example code that I copy-pasted and modified to become C and fit in my program.
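For reference, the polar method only takes a few lines of C. This is a sketch along the lines of the Wikipedia example rather than the exact code in my generator, and it uses plain rand() for brevity:
#include <math.h>
#include <stdlib.h>

/* Marsaglia polar method: return a normally distributed value
   with the given mean and standard deviation. The method produces
   two independent values per iteration; this sketch keeps one. */
double gaussian(double mean, double stddev) {
    double u, v, s;
    do {
        u = 2.0 * rand() / RAND_MAX - 1.0;
        v = 2.0 * rand() / RAND_MAX - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    return mean + stddev * u * sqrt(-2.0 * log(s) / s);
}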
This data generator program may have been the first time I got to use the trick of a growable array that starts as a null pointer and then grows as needed by calling realloc. I have seen this trick plenty of times, but it was good to actually use it so that it becomes part of my vocabulary.
The realloc trick
struct city *cities = 0;
int ncities = 0, maxcities = 0;
while (fgets(buf, sizeof(buf), stdin)) {
    if (ncities == maxcities) {
        maxcities = maxcities ? 2 * maxcities : 1;
        assert(cities = realloc(cities, maxcities * sizeof(*cities)));
    }
    /* Parse buf and store at cities[ncities++] */
}
The end result is a 100-line C program that only uses the standard library (and getpid() from unistd.h to seed the random number generator). All this had nothing to do with the challenge proper, but it was fun for me and very satisfying. Generating the test data should not be hard, and my generator is now very easy to compile and use (no dependencies).
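The seeding itself is a one-liner, something along these lines (a sketch; the exact expression in my generator may differ):
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Mix the PID into the seed so that two runs started within the
   same second still produce different data. */
srand(time(0) ^ getpid());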
Exploratory solutions
If this had been a "real" question I wanted to answer for myself ("what is the average of these numbers?") the first tool I would have reached for is Awk. My Awk solution is short and took very little time to write. The flip side is that it needs 17 minutes to process 1B rows. The best Java solutions in the official contest run under 2 seconds. Still, if you add up the time to run the Awk code once and the time to write the code it's a clear winner.
Core of my Awk solution
awk -F';' '
{
    if(!num[$1] || $2>max[$1]) max[$1]=$2
    if(!num[$1] || $2<min[$1]) min[$1]=$2
    total[$1]+=$2
    num[$1]++
}
END {
    for(x in total) printf("%s=%.1f/%.1f/%.1f\n", x, min[x], total[x]/num[x], max[x])
}
'
All statements of the form "program X runs in Y seconds" are of course unscientific measurements on my laptop. It is the relative differences that matter.
After Awk I tried Go. I am an experienced Go programmer so I can write a solution without having to fight the language. Go has a useful standard library and it's usually fast. In my Go solution I mostly took a naive approach except for the use of bufio.Reader.ReadSlice in an attempt to avoid spurious allocations. I still saw some garbage collector activity when I took a CPU profile so I am not sure my attempt worked.
Still, the Go version processes 1B rows in about 1:50 minutes and it did not take me long to write. This also "matches" the baseline number quoted by the original challenge. I put "matches" in quotes because that was 2 minutes on a different computer, so it's not really comparable. :)
My first C hash map!
I am used to working with languages that have a built-in hash map data structure. Hash maps are incredibly useful. C, however, does not have hash maps built in. You have to bring or write your own.
In my journey of learning C, I have been able to avoid using hash maps until now. In my embedded projects I use relatively small data structures that work perfectly well as arrays with linear search. For example, a MIDI keyboard has at most 128 keys. If I want to track which keys are pressed down I need a data structure with only 128 elements. And because keys are usually pressed by fingers, the total number of pressed keys is at most 10 in practice. My MIDI instruments also do not have to handle one billion key presses as fast as possible.
But here, where one of 10,000 records gets looked up one billion times, a hash map is worthwhile. I chose to use open addressing as suggested by Chris Wellons.
Writing a general-purpose hash map implementation is a lot of work, but when you are solving a specific problem you only have to implement the bits you need. For example, my implementation can only find or insert elements. I never delete elements so I don't have to implement that.
Open addressing hash map
#define EXP 16
struct city {
    char *name;
    double total, min, max;
    int num;
} cities[1 << 14], *cityindex[1 << EXP];
/* The size of cities is 1 << 14 = 16384, which is well over
   the maximum 10,000 records we need to be able to handle.
   The index is 4x bigger than that, which keeps the load
   factor of the hash map at 1/4 or less. */
int ncities;

/* From https://nullprogram.com/blog/2022/08/08/ */
int ht_lookup(uint64_t hash, int exp, int idx) {
    uint32_t mask = ((uint32_t)1 << exp) - 1;
    uint32_t step = (hash >> (64 - exp)) | 1;
    return (idx + step) & mask;
}

/* upsert returns the existing record in cities for name, or
   it creates one and returns that. hash() and the nelem()
   array-size macro are defined elsewhere in the program. */
struct city *upsert(char *name) {
    uint64_t h = hash(name);
    int i = h;
    while (1) {
        i = ht_lookup(h, EXP, i);
        if (!cityindex[i]) {
            assert(ncities < nelem(cities));
            cityindex[i] = cities + ncities++;
            assert(cityindex[i]->name = strdup(name));
            return cityindex[i];
        } else if (!strcmp(name, cityindex[i]->name)) {
            return cityindex[i];
        }
    }
}
I also don't need to bring in a general-purpose hash function that can resist adversarial inputs, such as SipHash. The hash function only has to be "good enough", and I can compensate for lack of uniformity (i.e. an increased chance of collisions) by making the index bigger, lowering the load factor of my hash map.
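Any simple multiplicative hash does the job. Here is a sketch in the style of 64-bit FNV-1a, not necessarily the exact function my solution uses:
#include <stdint.h>

/* FNV-1a: a few lines, fast, and uniform enough for
   non-adversarial input like city names. */
uint64_t hash(char *name) {
    uint64_t h = 0xcbf29ce484222325u;
    for (; *name; name++) {
        h ^= (unsigned char)*name;
        h *= 0x100000001b3u;
    }
    return h;
}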
My first C solution with a hash map runs in close to the same time as my naive Go implementation: 2 minutes. Apart from achieving this performance, this felt like a big win because now I feel like I can use hash maps in C.
Faster number parsing
Some quick profiling on Linux with perf record showed that my program was spending a lot of time in sscanf, which I was using to parse floating point numbers from the input. Because the input is actually fixed-precision fixed-point (the exponent is always 0 and there is always exactly one digit after the .), you can also treat the values as integers and divide by 10.0 before printing them. This reduced the running time to 1:06.
Optimistic number parsing
int64_t temp = 0, sign = 1;
/* p points to the next character we're parsing */
if (*p == '-') {
    sign = -1;
    p++;
}
for (; *p && *p != '\n'; p++)
    if (*p != '.')
        temp = 10 * temp + (*p - '0');
temp *= sign;
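The statistics can then stay in tenths of a degree all the way through, and the only division by 10.0 happens at output time, roughly like this (a sketch that assumes c points at a per-station record whose fields are kept in tenths):
printf("%s=%.1f/%.1f/%.1f\n", c->name,
       c->min / 10.0, c->total / 10.0 / c->num, c->max / 10.0);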
My first mmap
I did not look at other people's solutions in detail, but I did notice that mmap (memory-mapped I/O) and multi-threading seemed to be worth using. I chose to try mmap first.
My first attempt made the program slower: 1:14 minutes. But after adapting the design some more to better take advantage of the input data being there in memory, that went down to 1:04.
This does not seem like a big improvement. On small-scale benchmarks (10M rows instead of 1B), the improvement was bigger: from 655ms to 388ms. It would be interesting to understand the discrepancy between the 10M row run times and the 1B row run times. I don't have an answer right now. After writing this I tried using madvise with MADV_SEQUENTIAL, but that did not change anything for the better.
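For reference, the mmap setup itself is only a handful of lines. This is a minimal sketch rather than my exact code; it assumes the input file is called measurements.txt and reduces error handling to asserts:
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole input file read-only. The parser can then walk
   the buffer directly instead of copying through a read buffer. */
int fd = open("measurements.txt", O_RDONLY);
assert(fd != -1);
struct stat st;
assert(fstat(fd, &st) == 0);
char *data = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
assert(data != MAP_FAILED);
/* madvise(data, st.st_size, MADV_SEQUENTIAL); -- no measurable win for me */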
I guess my lesson here is that mmap is at least convenient, and that micro-benchmarks (processing 10M rows instead of 1B rows) can be misleading?
Multi-threading
The final thing I wanted to try for now was multi-threading. To facilitate this I refactored the code that updates the statistics to work both with a single row as well as with aggregated data for the current weather station.
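Roughly, each thread processes its own chunk of the input into its own set of per-station statistics, and those get merged at the end. The core pieces are a merge step that folds one thread's aggregate into the global record, and making sure each chunk of the mmap'd buffer starts at a row boundary. A sketch of both, reusing struct city from above (the names next_row and merge are illustrative):
/* Advance p past the next newline so a thread's chunk starts at
   the beginning of a row rather than in the middle of one. */
char *next_row(char *p, char *end) {
    while (p < end && *p != '\n')
        p++;
    return p < end ? p + 1 : end;
}

/* Fold the statistics one thread collected for a station into the
   global record. This is the same update as for a single row,
   except num may be greater than 1 and total is already a sum. */
void merge(struct city *global, struct city *partial) {
    if (!global->num || partial->min < global->min)
        global->min = partial->min;
    if (!global->num || partial->max > global->max)
        global->max = partial->max;
    global->total += partial->total;
    global->num += partial->num;
}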
This now runs in 0:14 minutes, which is 8 times faster than my first C-with-hash-map implementation. Woohoo!
Benefits of doing this work while at Recurse
I did the work I described above by myself. I could also have done it while not at Recurse. But what was nice here was that I was able to show my work to other participants and get interesting questions, reactions and suggestions. I am very grateful for this.
Conclusion and next steps
Even though I did not get to participate in the original challenge, this has been a great opportunity for me to grow as a C programmer and it gave me another thing over which I could connect with my fellow Recurse participants.
To get an even faster solution, I may try to implement some sort of SIMD approach to make the parsing faster.