Most often, when people talk about artificial intelligence (AI) they mean a particular kind of AI called machine learning (ML). This is the process of using algorithms to uncover patterns in data and then using those patterns to make predictions about new data. For example, recent advances in mammography for the diagnosis of breast cancer involve taking a collection of scan images and trying to find the features that discriminate between benign and malignant tumours. Once the algorithm has been ‘trained’ on this dataset, it can be applied to a newly obtained scan to make a prediction about a new patient’s tumour. Similar methods could be applied to predict a patient’s risk of mortality from a particular disease, identify the best treatment pathway for them, or estimate their likelihood of experiencing adverse events after taking a particular drug.
So, what is difficult about doing machine learning? Perhaps surprisingly, it’s not usually the algorithms. Most of the current cutting-edge algorithms, such as neural networks, were originally invented decades ago. Best-in-class implementations of these algorithms are ‘open-source’ and available to anyone to use for free. A more significant challenge is usually the availability and quality of data to train them on. In medicine, this problem is particularly acute, since medical data is challenging in three significant ways.
Firstly, it is messy. There is a huge range of data models and medical terminologies, and much data is captured in ‘unstructured’ forms such as doctors’ notes. Secondly, it is ‘siloed’, with pharmacy, diagnosis, radiology and lab data all being collected and held on different systems across hundreds of hospitals and thousands of GP surgeries and community care settings. Finally, it is highly sensitive. Whilst there are robust methods for removing ‘identifying’ information such as names and addresses, true anonymisation of patient-level data is difficult or impossible given how specific a particular sequence of healthcare events can be to a particular individual.
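To make the anonymisation point concrete, here is a minimal sketch in Python. The records, event names and the helper `unique_sequence_fraction` are all invented for illustration and do not come from any real dataset. It simply counts how many patients have a sequence of healthcare events that nobody else shares, even though names and addresses have already been removed.

```python
from collections import Counter

# Hypothetical de-identified records: no names or addresses, but each patient
# still carries an ordered sequence of (event, month) pairs.
records = {
    "p1": (("gp_visit", "2020-01"), ("blood_test", "2020-02")),
    "p2": (("gp_visit", "2020-01"), ("blood_test", "2020-02")),
    "p3": (("gp_visit", "2020-01"), ("mri", "2020-03"), ("surgery", "2020-04")),
    "p4": (("a_and_e", "2020-05"),),
}


def unique_sequence_fraction(records):
    """Return the fraction of patients whose event sequence nobody else shares."""
    counts = Counter(records.values())
    unique = sum(1 for sequence in records.values() if counts[sequence] == 1)
    return unique / len(records)


# Patients p3 and p4 each have a one-of-a-kind sequence, so anyone who knows
# their history could pick them out of this 'anonymised' dataset.
fraction = unique_sequence_fraction(records)
print(f"patients with a unique event sequence: {fraction:.0%}")
```

In this toy example half of the patients are uniquely identifiable from their event history alone. Real records are far longer and more detailed, so the share of unique sequences would be expected to be higher still.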
The greatest challenge with developing machine learning models for medicine is therefore not coming up with smarter algorithms. It is in enabling researchers to obtain secure access to clean, integrated datasets in a way that doesn’t compromise patient privacy and enables good governance of the resulting models. Good governance involves validating models against unbiased and previously unseen testing data; enabling them to update safely as new, richer data becomes available; monitoring their performance over time; and integrating them into care pathways. It requires building transparency into the end-to-end process of generating the models, so that regulators and the public can be assured that they are safe and effective.
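As an illustration of what validating against previously unseen data means in practice, the following Python sketch holds back part of a dataset before training and only uses it to score the finished model. Everything here is an assumption made for the example: the data is synthetic, each case has a single numeric feature, and the ‘model’ is a deliberately simple nearest-centroid rule rather than anything that would be used clinically.

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible

# Synthetic stand-in for real data: one numeric feature per case,
# label 0 = benign, label 1 = malignant.
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(100)]
rng.shuffle(data)

# Hold back 30% of the cases; the model never sees these during training.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]


def fit_centroids(rows):
    """Learn the mean feature value of each class from the training rows."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Assign the class whose centroid lies closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for x, label in rows if predict(centroids, x) == label)
    return correct / len(rows)


model = fit_centroids(train)           # training uses the training rows only
test_accuracy = accuracy(model, test)  # scoring uses the held-out rows only
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The important design choice is the order of operations: the split happens before any training, so the reported accuracy reflects cases the model has never encountered. Monitoring a model over time amounts to repeating the scoring step as new labelled data arrives.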
The software platforms where these models are developed therefore need more than just the analytical tools used to train the models. They also require robust technical capabilities that can enable good governance. They should have a granular access control system to ensure users can only access resources that they have permission to interact with. They need to automatically track the provenance of data and models. This enables users or auditors to understand how a particular model was derived, which other information it was derived from, and when and by whom it was produced. In order to prove that these controls have the expected effect, they need robust audit trails that provide clear insight into who had access to what data and why the access was granted.
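The three capabilities described above (granular access control, provenance tracking and audit trails) can be sketched together in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not a description of any real platform: the `Platform` class, its methods, and the user and dataset names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Platform:
    """Hypothetical in-memory platform; a real one would use durable storage."""
    grants: dict = field(default_factory=dict)      # (user, resource) -> reason
    audit_log: list = field(default_factory=list)   # append-only access records
    provenance: dict = field(default_factory=dict)  # artefact -> derivation info

    def grant(self, user, resource, reason):
        """Record that a user may read a resource, and why."""
        self.grants[(user, resource)] = reason

    def read(self, user, resource):
        """Check permission and log the attempt, whether or not it succeeds."""
        reason = self.grants.get((user, resource))
        allowed = reason is not None
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user, "resource": resource,
            "allowed": allowed, "reason": reason,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {resource}")
        return f"<contents of {resource}>"

    def record_derivation(self, artefact, inputs, user):
        """Read each input (with access checks), then store who derived what."""
        for source in inputs:
            self.read(user, source)
        self.provenance[artefact] = {
            "inputs": list(inputs), "by": user,
            "when": datetime.now(timezone.utc).isoformat(),
        }

    def lineage(self, artefact):
        """Return every upstream artefact that contributed to this one."""
        seen = []
        stack = list(self.provenance.get(artefact, {}).get("inputs", []))
        while stack:
            item = stack.pop()
            if item not in seen:
                seen.append(item)
                stack.extend(self.provenance.get(item, {}).get("inputs", []))
        return seen


platform = Platform()
platform.grant("alice", "scans_2020", "approved study (hypothetical)")
platform.record_derivation("features_v1", ["scans_2020"], "alice")
platform.grant("alice", "features_v1", "derived by this user")
platform.record_derivation("model_v1", ["features_v1"], "alice")
print(platform.lineage("model_v1"))
```

Because every read passes through the same method, the audit log records refused attempts as well as successful ones, along with the reason each grant was made. The lineage query then lets an auditor trace a model back through its intermediate features to the original data.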
As well as these technical challenges, developing such platforms also entails overcoming governance, regulatory, and communication hurdles. In the UK, Genomics England has commissioned work to help understand public attitudes around the appropriate use of genomic data and has built a secure platform for genomic research. Health Data Research UK (HDRUK) has developed a set of principles that ‘safe settings’ for trusted health research environments should adhere to, and is piloting the development of such environments through its series of ‘innovation hubs’.
If we can get this right, the potential is enormous. Chronic diseases such as diabetes could be diagnosed earlier or avoided completely, complications arising from care gaps reduced, and prescribing errors eliminated. But effective AI requires good data. It is critical that the systems for managing this data are secure, transparent, and well-governed.