And also diabetes, Alzheimer’s and countless other genetic diseases. The short answer: relatively recent software advances, namely general-purpose distributed computation systems (Spark, Hadoop and others) and machine learning. We’re currently putting together the pieces necessary to identify the exact root causes of those diseases.
It won’t be this year or the next, but the people who will find those cures are most likely not only already alive but also already working on them.
It definitely won’t be easy, though. That whole industry is built on top of an extremely shaky software foundation. The people doing the groundwork are mostly unpaid grad students, creating standard formats to hold genomic data and converting existing data into those formats. Currently, genetics studies each have their own standards, usually stored in an ad hoc relational schema. Those datasets are then often patched together, and the result is not only hard to work with but also error-prone. All this makes it nearly impossible to combine them into a larger dataset where more subtle patterns and answers could be discovered.
The researchers in biology and genetics are not trained in software best practices. They are assisted by technicians who often don’t have any formal training in programming, and who simply picked up Bash, Python and SQL on the fly. Published results contain the output of those messy Python scripts, which are built mostly from copy-pasted incantations taken from StackOverflow and the like. Those scripts are then stashed away in folders and forgotten. In the meantime, the original dataset used to generate the results has “grown organically”: tables have been added, others have changed, and the original patient data has been updated and augmented. In other words, it becomes nearly impossible to reproduce the studies.
I have no idea whether I’m paraphrasing this from someone or if I made it up, but these design guidelines have had a tremendous impact on how I design software and how I judge it, especially programming languages.
Make the good easy, short and rewarding
Make the exceptional explicit and verbose, but possible
Make the bad impossible
Even better, when only a small number of actions are good, it’s easy to keep them all in memory and build using only these tools. Any task that ends up looking ugly sends a clear signal that it’s time to go dig into the specialty toolbox. Those exceptional actions, being explicit, document the intent and the original reasons behind the solution. Being verbose ensures that their usage remains sparse and calculated.
Anything that’s possible will end up being used. No number of “X Considered Harmful” blog posts will prevent it, and the bad will never truly go away without drastic breaking changes.
Using libraries and tools that follow these guidelines in turn helps shape code in a way that makes it safer and easier to build upon. Your code then passes on the benefits to the layer above, whether it’s another library, or the end user.
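To make the guidelines concrete, here is a small illustration of my own (not from the original post) of “make the bad impossible” using OCaml’s option type: a lookup that can fail returns `'a option`, and the compiler forces every caller to handle the failure case.

```ocaml
(* Illustration: forgetting the None branch below is a compile-time
   warning/error, so "I forgot to check for missing data" is designed
   out of the API rather than documented around. *)

let find_user users id =
  (* List.assoc_opt returns Some value or None instead of raising *)
  List.assoc_opt id users

let greeting users id =
  match find_user users id with
  | Some name -> "Hello, " ^ name
  | None -> "Unknown user"

let () =
  let users = [ (1, "Ada"); (2, "Grace") ] in
  print_endline (greeting users 1);
  print_endline (greeting users 3)
```

The good (handling both cases) is short and rewarding; the bad (ignoring a missing value) simply doesn’t typecheck.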
Things I’ve learned about making portable binaries
I don’t claim to be a master of linking and ELF (Linux) executables, but there are some tricks I’ve learned that I wish someone had explained to me when I started.
There are two problems to solve to make cleanly distributable binaries: dependency hell and libc compatibility. By solving both, we can get an executable that runs on any recent Linux system, regardless of the distribution and installed packages.
Compiling to a binary is a two-step process: the actual compiling, then the linking. In the first step, each source file is turned into an object file (.o extension). Then all the .o files and .a files are put together into a binary, and the ELF meta-information is added. .a files are static library files: if your code uses an external library and that library has previously been statically compiled, its .a file can be embedded directly into the binary. The ELF headers contain data such as “where is the main?” and “where should I look for the dynamically linked libraries (.so files)?”
Logarithmic scaling with the fastest JSON validator
If there’s something any backend engineer loves, it’s when they manage to solve a problem in an O(log(n)) way. I did that this week.
As the number of requests per second goes up exponentially, the number of servers needed to handle them goes up linearly (until we reach a bandwidth bottleneck). Amazing!
First, let me recap. We were having performance issues in an application. We needed a lot of big, expensive EC2 compute-optimized instances to handle fairly low traffic by our standards (250 Mb/s), and those instances were burning CPU like crazy. The codebase is built with performance in mind and we couldn’t find a single inefficiency in it. That means it’s time for flame graphs!
A quick overview of the OCaml programming language and ecosystem. This is not a tutorial.
I’ll cover the basics of the OCaml language with examples and some theory sprinkled here and there. The goal is to give you a “feel” of the language.
If you’ve ever wondered
What functional programming looks like in practice
How functional languages represent objects
How they do asynchronous operations
How they even write real world code when everything is immutable
Then you’re about to get answers to all those questions and a whole lot more.
First of all, a quick description. OCaml is a functional programming language. It’s pragmatic: it aims for beautiful, expressive, immutable high-level code, but recognizes that sometimes it’s necessary to drop down to imperative code in hot sections. Its syntax can appear radical at first, but it suits the language well. It’s a high-level language that retains support for low-level operations and has excellent facilities for calling C code transparently when needed. OCaml uses strong static typing.
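A small taste of what that pragmatism looks like (my example, not from the post): the same computation written in the idiomatic immutable style and again with the imperative escape hatch.

```ocaml
(* Functional style: no mutation, recursion over an immutable list *)
let rec sum = function
  | [] -> 0
  | x :: rest -> x + sum rest

(* Imperative style: a mutable reference and a for loop over an array,
   the kind of thing you might reach for in a hot section *)
let sum_imperative arr =
  let total = ref 0 in
  for i = 0 to Array.length arr - 1 do
    total := !total + arr.(i)
  done;
  !total

let () =
  Printf.printf "%d %d\n" (sum [1; 2; 3]) (sum_imperative [| 1; 2; 3 |])
```

Both compile to fast native code; the language just makes the immutable version the path of least resistance.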
At Mashape, we’ve been building a new product, APIAnalytics, to let anyone easily track and monitor anything related to HTTP in their applications. To be successful, it needs to be extremely easy to plug any application into it. There are many ways to achieve that with as little friction as possible, such as agents and libraries for every popular language, but for now let’s talk about HTTP proxies.
Some of our requirements for HARchiver, our official lightweight proxy for APIAnalytics are:
We’re going to be running HARchiver on the live Mashape.com API hub, so any failure is out of the question: it could take down the apps of the hundreds of thousands of developers leveraging the Mashape platform.
After eliminating all the languages that require some sort of runtime (Node, Java, etc.) and the ones that don’t prioritize safety and correctness (C, C++, etc.), we’re not left with many options. Rust is still way too immature and untested. Fortunately, it turns out to be a perfect use case for my new favorite language, OCaml.
For those unfamiliar with it, OCaml is a functional programming language that values (in this order): speed, correctness and safety. It sits close to D, Dart and ASP on the 2014/Q1 programming language rankings; it was released in 1996 and has been gaining a fair amount of popularity in the last few years.
Surprisingly, reaching feature complete took less time than expected. It’s the performance tuning that turned out to be more challenging, through no fault of the language itself. OCaml currently lives in a weird state, split between two standard libraries (the included one is too minimal): one made by Jane Street, called Core, and a community-built one, called Batteries. The exact same thing happens when picking an async/IO engine: Jane Street makes Async, but the community has mostly rallied around Lwt. I’m using both Core and Lwt.
Making it faster
All the following numbers are from a fairly weak laptop (i5-4200U CPU), testing against a local nginx server serving a small (110-byte) static file. I’ll be using siege for the load testing. Each test is run 3 times, keeping only the median result.
To provide a baseline, let’s set up a second nginx server and configure it as a reverse proxy. HARchiver will need to achieve similar performance despite doing a lot more processing.
All the siege tests are run with
siege -c 500 -b -t 10S
That means: 500 concurrent client connections, each trying to complete as many GET calls as possible, for 10 seconds.
siege -> nginx proxy -> nginx server
And now, let’s test it with the first, unoptimized version of HARchiver between the client and the server, all three running on the same machine.
siege -> HARchiver -> nginx server
Briefly, this is what HARchiver needs to do:
Parse the incoming request, grab the headers and the query
Open a connection to the remote host and pass everything through
Parse the response, grab the headers and the status
Compute some timers, sizes, count the length of the request and response bodies, then create a JSON representation of the whole exchange
Send it off via ZMQ
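Step 4 above is the most involved one, so here is a heavily simplified, synchronous sketch of it (the record and all names are mine, and the real HARchiver does this asynchronously with Lwt before shipping the result over ZMQ):

```ocaml
(* A minimal stand-in for one proxied exchange *)
type exchange = {
  meth : string;
  uri : string;
  status : int;
  req_body : string;
  resp_body : string;
  started_at : float;   (* Unix timestamps, in seconds *)
  finished_at : float;
}

(* Compute the timer and the body sizes, then render a tiny JSON
   representation of the whole exchange (%S quotes and escapes the
   strings, which is close enough to JSON for simple values) *)
let to_json ex =
  let time_ms = (ex.finished_at -. ex.started_at) *. 1000. in
  Printf.sprintf
    "{\"method\":%S,\"url\":%S,\"status\":%d,\"reqSize\":%d,\"respSize\":%d,\"timeMs\":%.1f}"
    ex.meth ex.uri ex.status
    (String.length ex.req_body)
    (String.length ex.resp_body)
    time_ms
```

The real HAR format carries a lot more detail, but the shape of the work is the same: measure, count, serialize, send.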
Still, it doesn’t look so great considering that the average response time was doubled.
After plenty of small performance improvements, I used the excellent Lwt_pool module to maintain a group of system DNS resolvers and split the load between them.
After noticing insane DNS traffic to the system resolver, I wrote a homemade generic async caching system, leveraging the fantastic OCaml type system.
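To show the general idea, here is a stripped-down, synchronous sketch of such a generic cache (my illustration; the real one in HARchiver is asynchronous, built on Lwt). Because it is polymorphic in both key and value, the same few lines cover DNS lookups or anything else without a rewrite.

```ocaml
(* Wrap any expensive function in a memoizing cache backed by a Hashtbl.
   The type system infers ('k -> 'v) -> 'k -> 'v, so this works for any
   hashable key and any value type. *)
let make_cache (lookup : 'k -> 'v) : 'k -> 'v =
  let table : ('k, 'v) Hashtbl.t = Hashtbl.create 64 in
  fun key ->
    match Hashtbl.find_opt table key with
    | Some value -> value              (* cache hit *)
    | None ->
      let value = lookup key in        (* cache miss: do the real work *)
      Hashtbl.add table key value;
      value

let () =
  let calls = ref 0 in
  let slow_resolve host = incr calls; "10.0.0.1 (" ^ host ^ ")" in
  let resolve = make_cache slow_resolve in
  ignore (resolve "example.com");
  ignore (resolve "example.com");
  Printf.printf "lookups performed: %d\n" !calls
```

The production version additionally has to deal with in-flight lookups and expiry, which is where Lwt comes in, but the polymorphic skeleton is the same.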
Unfortunately, since the beginning of this project, under heavy load (800+ client threads), HARchiver would crash with some really ugly errors:
(Unix.Unix_error "Invalid argument" select "")
Fatal error: exception (Unix.Unix_error "Operation not permitted" send "")
That, plus noticing that it would happen with fewer threads if it was restarted immediately after crashing, led me to realize that Lwt was using the Linux kernel’s select() syscall as its engine to keep track of async tasks. After a few tweaks to switch to the libev engine, all the stability problems disappeared and the performance improvements were great:
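The switch itself boils down to one configuration line at startup (assuming Lwt was built with libev support, e.g. via the conf-libev opam package):

```ocaml
(* Replace Lwt's default select()-based engine with libev (epoll on
   Linux), which doesn't hit select()'s fixed limit on watched file
   descriptors. Requires an Lwt build with libev support. *)
let () = Lwt_engine.set (new Lwt_engine.libev ())
```

select() can only track a fixed number of file descriptors (FD_SETSIZE, typically 1024), which is exactly why the crashes appeared around 800+ concurrent client threads.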
And there we go: all the objectives are met, and the performance is decently close to the extremely performant nginx while doing way more processing for each request. If you’ve been paying attention, you might have noticed that the average response time is 5ms longer than the baseline. This is because HARchiver is more taxed at our current benchmark settings than nginx is, and the response times suffer as a result. Under reasonable production load (up to 500 req/sec), the response time compared to going straight to the server is no more than 1ms longer.
A Node.js implementation
Finally, let’s compare it to a quick and dirty Node.js test implementation that does all the same processing that the OCaml proxy does.
Even though it hasn’t been optimized as much as the OCaml version, the difference is massive.
The code is hosted on GitHub. Check out HARchiver and the Node.js version.