How cancer will be cured.

And also diabetes, Alzheimer’s and countless other genetic diseases. The short answer is by the relatively recent software advances that are general purpose distributed computation systems (Spark, Hadoop and others) and Machine Learning. We’re currently putting together the pieces necessary to identify the exact root causes of those diseases.

It won’t be this year or the next, but the people that will find those cures are most likely not only already alive but also already working on them.

It definitely won’t be easy, though. That whole industry is built on top of an extremely shaky software foundation. The people doing the ground work are mostly unpaid grad students creating standard formats to hold genomic data and converting existing data into those formats. Currently, genetics studies all have their own standards, usually stored in an ad-hoc relational schema. Those datasets are then often patched together and the result is not only hard to work with, it’s also error-prone. All this makes it impossible to combine them into a larger dataset where more subtle patterns and answers could be discovered.

The researchers in biology and genetics are not trained in software best practices. They are assisted by technicians that often don’t have any formal training in programming, who simply picked up Bash, Python and SQL on the fly. Published results contain the output of those messy Python scripts, which are built mostly from copy pasted incantations taken from StackOverflow and the like. Those scripts are then stashed away into folders and forgotten. In the mean time, the original dataset used to generate the results has “grown organically”. Tables have been added, others have changed and the original patient data has been updated and augmented. In other words, it becomes nearly impossible to reproduce the studies.

Currently, the challenge is in porting those datasets to a format that can be used by Spark and similar technologies suited to large clusters of computers. The most promising is ADAM, made by the Berkeley AmpLab. They’re also the creators of Spark. Once there will be sufficient patient genomic data, it’ll be become possible to run Machine Learning algorithms on hundreds -to a few thousands- of genomes and find previously unseen patterns and root causes for those diseases.

As of Summer 2016, we are just beginning to have enough data in the same place to start looking for patterns. As I’ve previously mentionned, the work is done mostly in academia. There is simply no incentive for the pharmaceutical industry to accelerate a process that will greatly reduce its sales.

I strongly believe that within the next 10 years, routine trips to the hospital or health clinic will involve a genetic test. A doctor will be able to enter symptoms into a system that will identify which segments of the genome need to be genotyped to validate the hypotheses the doctor has come up with. Another major upcoming change is in the way drugs are prescribed. Certain treaments, from chemotherapy to antidepressants, have a relatively low chance of success. Patients often need to try several different drugs before landing on one that’s effective for them. Machine Learning is a formidable tool that’s better at recognizing patterns than our limited human brains ever will be. Being able to know exactly which drugs will effectively treat a patient by looking at that specific patient’s genetic makeup is not science fiction, it’s something being developed right now.

This is simply a matter of time and money. There is no funding outside of a few government grants and the funds certain universities will allocate to these projects over others.

When I hear programmers complain that their work has no meaning because all they do is develop new ways to share cat pictures, I remind them that these sort of projects badly need skilled developers. There’s a constant churn of students and having a few stable and serious contributors from the open source community would have a major impact. It’s hard and tedious work and the total opposite of fun. Personally, I ported what is probably the largest and highest quality dataset of patient data to the ADAM format and I almost burned out in the process, but that’s a story for another post.

Leave a Reply