How cancer will be cured.

And also diabetes, Alzheimer’s and countless other genetic diseases. The short answer is by the relatively recent software advances that are general purpose distributed computation systems (Spark, Hadoop and others) and Machine Learning. We’re currently putting together the pieces necessary to identify the exact root causes of those diseases.

It won’t be this year or the next, but the people that will find those cures are most likely not only already alive but also already working on them.

It definitely won’t be easy, though. That whole industry is built on top of an extremely shaky software foundation. The people doing the ground work are mostly unpaid grad students creating standard formats to hold genomic data and converting existing data into those formats. Currently, genetics studies all have their own standards, usually stored in an ad-hoc relational schema. Those datasets are then often patched together and the result is not only hard to work with, it’s also error-prone. All this makes it impossible to combine them into a larger dataset where more subtle patterns and answers could be discovered.

The researchers in biology and genetics are not trained in software best practices. They are assisted by technicians that often don’t have any formal training in programming, who simply picked up Bash, Python and SQL on the fly. Published results contain the output of those messy Python scripts, which are built mostly from copy pasted incantations taken from StackOverflow and the like. Those scripts are then stashed away into folders and forgotten. In the mean time, the original dataset used to generate the results has “grown organically”. Tables have been added, others have changed and the original patient data has been updated and augmented. In other words, it becomes nearly impossible to reproduce the studies.

Continue reading