The healthcare space is seeing rapid deployment of machine learning algorithms for prediction, pattern recognition and data visualization. Some of these are already making an impact and changing the way medical professionals apply their expertise. Systems that automate parts of the diagnostic procedure and assist clinicians in suggesting a course of therapy are called diagnostic decision support systems (DDSS). Those that use learning-based approaches can often work with limited data and uncover patterns by looking for structure in the underlying data. DDSS are often associated with applications in ‘differential diagnosis’.

Diagnosing rare disorders is a daunting task that requires involved methods to interpret symptoms and genetic information. Fewer cases mean fewer validated approaches to tackle these disorders, and family physicians can have a hard time diagnosing them. Because disease-specific experts are few and geographically localised, getting access to the required treatment is a major hurdle for most patients. Overlapping and infrequent clinical manifestations make correct disease classification harder, further compounding the intricacy of the task.

Enter PEDIA, a DDSS that uses face recognition to help sort genomic data. Face recognition is very useful for hereditary diseases, since 30–40% of the more than seven thousand Mendelian disorders display craniofacial dysmorphism. Facial dysmorphism is often suggestive of rare intellectual disabilities.

PEDIA (Prioritization of Exome Data by Image Analysis) is an algorithm that uses face recognition on facial images of the patient to help identify the genetic background of rare disorders. It quantifies phenotypic features from the images and combines them with genotypic information from the exome to obtain the most likely affected gene.

In case the name is baffling, let me explain. The exome refers to the protein-coding region of our DNA. It might sound perplexing, but only about 1.5% of the entire genome codes for proteins, i.e. the exome is about 40 million base pairs as against the 3 billion base pairs that each of our cells ordinarily carries. Whole Exome Sequencing (WES) is thus a popular alternative to sequencing the whole genome (Whole Genome Sequencing, WGS) for explaining the underlying genetic cause of many monogenic or Mendelian disorders.

Prioritization: When researchers try to implicate a gene mutation in a rare or novel disorder, they often find many variants (>30,000) in the exome when it is compared against a healthy reference human genome. An average healthy person carries ~100 authentic variations, so clearly not all variants are pathogenic. A pathogenic variant is a genuine cause of loss in gene function. To narrow the search space for genuine gene candidates, different prioritization approaches have been used so far. Some look at gene function and follow the mantra ‘guilt-by-association’: if a gene plays a particular role, all genes interacting with it or with its gene products will be prioritized. Other strategies use pedigree information (linkage) to prioritize genes behind diseases found in family members, or use gene expression data.

The word prioritization is used instead of filtering to indicate that this task does not discard any variants. It simply ranks some variants as more worth looking into, given a certain set of criteria.
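The difference between the two is easy to see in a few lines of Python. This is only a toy sketch: the variant names, the scores and the threshold are all invented for illustration and have nothing to do with PEDIA's actual data.

```python
# Toy illustration of prioritization versus filtering.
# Variant names and scores are invented for this example.
variants = [
    {"id": "variant_A", "score": 3.2},
    {"id": "variant_B", "score": 27.5},
    {"id": "variant_C", "score": 14.1},
    {"id": "variant_D", "score": 0.4},
]


def filter_variants(variants, threshold):
    """Filtering: drop every variant that scores below the threshold."""
    return [v for v in variants if v["score"] >= threshold]


def prioritize_variants(variants):
    """Prioritization: keep every variant, most suspicious first."""
    return sorted(variants, key=lambda v: v["score"], reverse=True)


filtered = filter_variants(variants, threshold=10.0)
prioritized = prioritize_variants(variants)

print([v["id"] for v in filtered])     # only the variants above the threshold
print([v["id"] for v in prioritized])  # every variant, reordered
```

Filtering throws information away for good, while the prioritized list still contains every variant, so a low-ranked candidate can always be revisited later.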

Turns out, PEDIA is the first prioritization tool that uses face recognition algorithms to organise variant data. Quite cool, tbh.

Now that the name makes some sense, we can begin to understand what goes on under the hood.

PEDIA essentially ranks genes based on five scores it obtains from phenotypic and genotypic analysis. These are:

Gestalt Score
Feature Score
BOQA
Phenomizer
CADD

The first two are part of an existing service, Face2Gene, developed by FDNA. Gestalt Scores are obtained by running a CNN (Convolutional Neural Network) on input patient images. It scores each region of the face against a list of syndromes that cause facial dysmorphism and generates an aggregate score for the disorder the patient is most likely to have. It can also differentiate between mutations in different genes leading to the same disorder.

Feature Score, BOQA (Bayesian Ontology Query Algorithm) and Phenomizer use clinical descriptions to find the likeliest disease. They are semantic search algorithms that use a set of descriptor terms (a phenotype ontology) to map visible phenotypes to a disease. BOQA is based on a Bayesian network that uses the structured ontology relations between terms (nodes). The BOQA and Phenomizer scores are added while pre-processing data for PEDIA, i.e. before the classification step occurs.
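To get a feel for what "mapping descriptor terms to a disease" means, here is a deliberately simple sketch that ranks diseases by how much their term sets overlap with the patient's terms (Jaccard similarity). This is not how BOQA or Phenomizer work internally; they use the ontology structure and statistical models. The disease names and terms below are placeholders, not real annotations.

```python
# Toy phenotype-to-disease matching by term overlap (Jaccard similarity).
# Disease names and descriptor terms are placeholders, not real annotations.
disease_terms = {
    "disease_X": {"wide-set eyes", "short stature", "intellectual disability"},
    "disease_Y": {"cleft palate", "short stature"},
    "disease_Z": {"hearing loss", "heart defect"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diseases(patient_terms, disease_terms):
    """Return (disease, similarity) pairs, best match first."""
    scores = {d: jaccard(patient_terms, t) for d, t in disease_terms.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


patient = {"wide-set eyes", "short stature"}
ranking = rank_diseases(patient, disease_terms)

for disease, similarity in ranking:
    print(f"{disease}: {similarity:.2f}")
```

A real semantic search would also give credit for related terms (for example, a parent and child term in the ontology), which plain set overlap cannot do.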

Lastly, the genomic variations are scored. CADD (Combined Annotation Dependent Depletion) is one of the many quantitative tools to evaluate the impact of SNPs (Single Nucleotide Polymorphisms), insertions/deletions and other kinds of mutations. Over the last few years, it has been increasingly preferred since it combines the predictive power of many similar tools. It utilizes an underlying SVM (Support Vector Machine, a machine learning algorithm) to score the variants in an input sequence. Most similar computational tools disagree with each other, and CADD, so to speak, irons out these differences. Further, CADD performs well on unseen de novo mutations (mutations that are not inherited but arise from scratch in an individual).

So what is the PEDIA score? It is simply the distance between the datapoint corresponding to a gene and the hyperplane (a technical term for the boundary separating the disease-causing class from the non-causing class; in two dimensions it is just a line). For each individual, the five scores mentioned earlier place each gene's datapoint in a higher-dimensional space. Data that is not separable in fewer dimensions can become separable by this process. The output is a list of genes ranked by descending PEDIA score, each with a label signifying whether the gene is disease causing (1) or not (0).
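The "distance to the hyperplane" idea can be sketched in plain Python. Everything specific here is an assumption made for illustration: the weights and bias are made up (a trained SVM would learn them), the gene names and feature values are invented, the scores are assumed to be scaled to the 0 to 1 range, and the decision function is a simple linear one. It is not PEDIA's actual model.

```python
import math

# Hypothetical hyperplane parameters; a trained linear SVM would learn these.
# Feature order: gestalt, feature, boqa, phenomizer, cadd (assumed scaled to 0..1).
WEIGHTS = [1.5, 0.5, 0.5, 0.5, 1.0]
BIAS = -2.0


def signed_distance(features, weights=WEIGHTS, bias=BIAS):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(w * x for w, x in zip(weights, features))
    norm = math.sqrt(sum(w * w for w in weights))
    return (dot + bias) / norm


def rank_genes(gene_features):
    """Return (gene, distance, label) tuples, largest distance first."""
    scored = []
    for gene, features in gene_features.items():
        distance = signed_distance(features)
        label = 1 if distance > 0 else 0  # 1 = predicted disease causing
        scored.append((gene, distance, label))
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Invented genes with invented five-score vectors for one patient.
genes = {
    "GENE_A": [0.9, 0.8, 0.7, 0.6, 0.9],
    "GENE_B": [0.1, 0.2, 0.1, 0.3, 0.8],
    "GENE_C": [0.5, 0.4, 0.6, 0.5, 0.2],
}

ranking = rank_genes(genes)
for gene, distance, label in ranking:
    print(f"{gene}: distance={distance:+.3f} label={label}")
```

Genes on the positive side of the hyperplane get label 1, and the further a gene sits from the boundary on that side, the higher it lands in the ranked list.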

PEDIA is a classifier? Yes! It uses an SVM too (like CADD). In case you’re wondering how accurate it is, PEDIA placed the correct disease-causing gene within the top 10 of its output a whopping 98.7% of the time!
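A "top-10" figure like this is a top-k accuracy: the fraction of patients for whom the true gene appears among the first k entries of the ranked list. Here is a minimal sketch of that metric; the cases and gene names are invented, and k is set to 2 only to keep the toy example small.

```python
def top_k_accuracy(ranked_lists, true_genes, k=10):
    """Fraction of cases whose true gene is among the first k ranked genes."""
    if not true_genes:
        raise ValueError("need at least one case")
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_genes) if truth in ranked[:k]
    )
    return hits / len(true_genes)


# Three invented cases: each has a ranked gene list and a known causal gene.
ranked_lists = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_D", "GENE_E", "GENE_F"],
    ["GENE_G", "GENE_H", "GENE_I"],
]
true_genes = ["GENE_A", "GENE_F", "GENE_H"]

accuracy = top_k_accuracy(ranked_lists, true_genes, k=2)
print(f"top-2 accuracy: {accuracy:.3f}")
```

In this toy data, the second case misses because its true gene is ranked third, outside the top 2.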

Can face recognition reliably replace clinicians? No. It can act as a support system in making decisions and suggesting a course of therapy. One reason is that many rare disorders have symptoms that are not visible externally, e.g. congenital¹ heart problems. Face recognition for disease diagnosis is therefore a guidance system, and it definitely leaves room for incorporating more kinds of medical data that could make it far more accurate. Still, the time and cost benefits are likely to make such systems more common soon!

Summing up, only a handful of DDSS based on phenotypes have been developed so far. Results from various groups of doctors in different parts of the world have been promising. Besides, patient privacy is protected by following the principle of ‘privacy-by-design’: confidentiality is maintained at all steps of software development, safeguarding the data used to train and test the algorithm. Since more data will only enhance the software’s accuracy, the future does seem upbeat!

Do check out the publication and/or GitHub page if you’re interested in going further.

EDIT NOTES -

[1] A doctor pointed out that congenital heart problems can also be diagnosed from mild cyanosis in babies. Cyanosis refers to the bluish tint that appears on the skin due to the mixing of oxygenated and deoxygenated blood, so AI systems can detect it as well. To clarify, I’m talking about acyanotic congenital heart disease, where this phenomenon isn’t that common.

PEDIA — More power to face recognition tech in disease diagnosis
