If there’s something any backend engineer loves, it’s when they manage to solve a problem in an O(log(n)) way. I did that this week.
As the number of requests per second goes up exponentially, the number of servers needed to handle them goes up linearly (until we reach a bandwidth bottleneck). Amazing!
First, let me recap. We were having performance issues in an application. We needed a lot of big and expensive EC2 compute-optimized instances to handle fairly low traffic by our standards (250mb/s). Those instances were burning CPU like crazy. The codebase is built with performance in mind and we couldn’t find a single inefficiency in it. That means it’s time for flame graphs!
By load testing the code in profiling mode, the time spent in each function is recorded and displayed on the graph. We found out that 85% of all the CPU time was spent doing JSON schema validation. We’re already using the fastest validator around, called is-my-json-valid.
We have to validate very complicated JSON objects. They’re about 1200 bytes each and contain dates, IPs, references, custom formats, etc. Each object took 30ms to validate! We figured it was time to bring out the big C++ guns. Since Node.JS runs on the v8 engine, built in C++, it can call C++ code transparently and vice versa. The time to validate each object dropped to 5ms, including the back and forth through the Node.JS-C++ bridge. And since it’s done outside of the v8 even loop, we could even detach from it and run the computation asynchronously in a thread pool! Success? Alas, no.
- Now the code is vastly more complicated
- It involves manually editing the CMakeList.txt file of a C++ library prior to building it
- It involves other ugly ELF hacks to make it usable that won’t work on OS X, which most devs at the company use.
- The thread pool doesn’t even help us here, since the code is already fully multi-process and distributed behind a load balancer. All the available cores are already used.
It’s a mess.
For 1 object: 30ms / object
For 10 objects: 3ms / object
For 100 objects: 0.4ms / object
For 1000 objects: 0.07ms / object
For 10000 objects: 0.04ms / object
Awesome. Each request we serve already goes through an asynchronous flow. Now, our validation module can register the callback and put the object into an array. Every second, the entire array is validated at once and the errors are separated per object. The callbacks are all called concurrently with or without an error, depending on the result for that object.
I then tried the same in C++, but the crazy magic imjv uses to leverage the JIT compiler just couldn’t be matched by the C++ validator, where each object always took 5ms, no matter how many were batched together.
Some will say that it’s a bad idea, because it can lock up the event loop for relatively long periods of time (up to 200ms for us in the most extreme cases). This is irrelevant to this codebase, which serves no data back to clients and acts as a data sink. All it does is process data and it needs to do it as cheaply as possible.
It’s not a technique for every performance problem there is, but it solved ours brilliantly.