Katie-Rose Skelly
March 10, 2021

Building a Useful Biological Dataset

The purpose of science is to explain the natural world. The purpose of machine learning, or at least the most common use, is to allow us to predict the future based on the past. It stands to reason that combining biology and machine learning will enable a better understanding of our own present and future. So far, the results have rarely been as well-matched as we all hoped.


Our goal at Known Medicine is to predict which treatments every cancer patient will and won’t respond to. In order to succeed, we need the right data, collected the right way.


The Right Kind of Data

Even in cases where there is a trove of patient data, as in genetic testing to match tumors to treatments, the vast majority of patients are left behind. In 2018, 15.44% of patients who found out what mutations their tumors had were eligible for treatments based on their test. Fewer than half of those patients, just 6.62% of the patients tested, actually benefited from a treatment tailored to genetics.
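To make the arithmetic behind those two figures explicit, here is a minimal sketch. The two percentages come from the text above; the variable names are only illustrative.

```python
# Percentages quoted in the text, both as a share of all patients tested.
eligible_pct = 15.44   # eligible for a treatment matched to their mutations
benefited_pct = 6.62   # actually benefited from a genetically tailored treatment

# Share of *eligible* patients who benefited (the "fewer than half" claim).
benefit_among_eligible = benefited_pct / eligible_pct

# Share of *tested* patients who did not benefit from a tailored treatment.
no_benefit_pct = 100 - benefited_pct

print(f"{benefit_among_eligible:.1%} of eligible patients benefited")
print(f"{no_benefit_pct:.2f}% of tested patients did not benefit")
```

Roughly 43% of eligible patients benefited, which leaves more than 93% of the patients tested without a benefit from genetically matched treatment.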


The information we need to make optimal treatment decisions often isn’t contained in the patient’s genetic code. We need data that is downstream of genetic mutations, data that takes into account the incredible complexity of the cells themselves, their interactions, and their environment in order to elucidate treatment response.


Uniquely Biology

One of the problems any biologist is familiar with is the tendency for the unexpected to wreak havoc on experiments. Perhaps my favorite illustration of this is a story a coworker told me. Their team was seeing fluctuations in the endpoints of certain experiments: average readings differed markedly from week to week, and even more from season to season. Almost as a joke, one of their colleagues plotted the instrument measurements against each day’s outdoor humidity and found a nearly perfect correlation. In retrospect this made sense: high humidity leads to less evaporation, which leads to less concentrated nutrients in the cell media, meaning there’s less cell growth. It is not, however, an intuitive relationship. It’s just one of the thousands of things that can affect the results of a biological experiment. Even after years of development, batch-to-batch variation is clearly visible in Recursion’s published test dataset (see Figure 1).


Struggles to ensure consistency reach beyond high-throughput labs like Known Medicine’s. The inability to reproduce results, often referred to as the replication crisis, is one of the most pervasive and problematic issues in biological research. It is one that we will have to contend with continuously as we build our data and platform.


It is impossible to control for everything. Because of this, it is imperative to make sure conclusions are correct despite variation. Lab design can play a huge role, from incubators that minimize exposure, to robotics that move things quickly and efficiently around the lab, to automatic liquid handling that eliminates human error. Designing each experiment with appropriate controls and layouts, then standardizing results on the back end, can allow you to account for variation that couldn’t be prevented. Beyond the infrastructure, the dataset needs constant evaluation to make sure inferences remain valid across different times and places.
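One common way to standardize results on the back end is to express every reading relative to the control wells run in the same batch. The sketch below is only an illustration of that general idea, not Known Medicine’s actual pipeline: the field names (`batch`, `is_control`, `reading`) and the numbers are made up for the example.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_to_controls(wells):
    """Z-score each reading against the control wells of its own batch.

    `wells` is a list of dicts with the hypothetical keys
    'batch', 'is_control', and 'reading'.
    """
    # Collect the control readings for each batch.
    controls = defaultdict(list)
    for well in wells:
        if well["is_control"]:
            controls[well["batch"]].append(well["reading"])

    normalized = []
    for well in wells:
        ctrl = controls[well["batch"]]
        if len(ctrl) < 2:
            raise ValueError(f"batch {well['batch']!r} needs at least 2 controls")
        mu, sigma = mean(ctrl), stdev(ctrl)
        if sigma == 0:
            raise ValueError(f"controls in batch {well['batch']!r} do not vary")
        # Express the reading in units of control standard deviations.
        normalized.append({**well, "z": (well["reading"] - mu) / sigma})
    return normalized


# Made-up readings: batch B reads about twice as high as batch A overall.
example = [
    {"batch": "A", "is_control": True, "reading": 10.0},
    {"batch": "A", "is_control": True, "reading": 12.0},
    {"batch": "A", "is_control": False, "reading": 8.0},
    {"batch": "B", "is_control": True, "reading": 20.0},
    {"batch": "B", "is_control": True, "reading": 24.0},
    {"batch": "B", "is_control": False, "reading": 16.0},
]

result = normalize_to_controls(example)
treated = [w for w in result if not w["is_control"]]
for w in treated:
    print(w["batch"], round(w["z"], 3))
```

In this toy example the raw treated readings (8 and 16) look very different, but once each is scaled against its own batch’s controls they land on the same score, which is the point of putting controls in every batch.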


Onward and Upward

Known Medicine is building the first large-scale dataset of 3D tumor cell images. We create an environment as similar to the patient’s body as possible for these cells to thrive, from maintaining both support and tumor cell interactions to growing the cells in a matrix that replicates the tissue in a patient’s body. We have already begun a clinical study to receive tumor samples from over 200 patients at several top cancer centers, generating hundreds of thousands of pictures of patient cells responding to different treatments. These pictures are taken on our high-throughput system across different time points, enabling us to see and account for myriad unintentional changes in environmental conditions. As we continue our study, we collect ground truth data about how patients actually respond to cancer treatments.


The more data we get, the better our predictions will become. We will be able to see smaller and smaller changes (ways that the cells elongate, cluster together to form micro-tissues, or secrete immune factors) that can tell us how the patient will respond to treatment. Ultimately, the dataset we generate and the models created from it drive our machine and allow us to find the best treatment for every cancer patient.


