TL;DR: We made a bacteria identifier app whose scope, unlike that of conventional identifiers, is not capped by a mass spectrum (MS) library. It leverages the massive and exponentially growing open-source UniProt proteomic database.
What is PigletMS?
PigletMS enables microbiologists to identify bacterial species from mass spectrum (MS) samples beyond the limited scope of existing MS databases by harnessing massive open-source proteomic databases. For our industry partner DSO, this means enhancing the nation’s biological defences by expanding the range of identifiable bacteria.
Why bacteria identification?
Bacteria identification is crucial to preserving the health and safety of our community. Harmful bacterial outbreaks, such as the infamous bubonic plague, are not a thing of the past. Yemen is currently experiencing a large-scale cholera outbreak (caused by the bacterium Vibrio cholerae) that has claimed more than 3,900 lives. Bacteria identification has several use cases, such as:
1. Clinical diagnosis, where effective treatment depends on accurate and rapid diagnostics, and
2. Environmental monitoring to detect and assess outbreak threats quickly.
How did we approach this problem?
1. We first identified larger databases, such as UniProt and RefSeq, that we could use as reference databases.
2. Next, we designed algorithms that could effectively classify species from input MS with minimal reliance on the MS database.
3. Finally, we interviewed users to discern the essential User Interface (UI) features and implemented them.
PigletMS: The User-Friendly Two-Pronged Bacteria Identifier
PigletMS was designed to serve as a parallel diagnostic tool for microbiologists. In practice, an unknown MS sample may contain a species that is either present in or absent from the MS databases. At its core, PigletMS consists of two classifiers that target these two different prediction spaces.
Classifier 1 targets species that are present in the MS databases and uses the idea that biologically closer species are likely to express similar proteins and hence have similar MS. A Graph Convolutional Network (GCN) is employed for this purpose.
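To give a feel for what a graph convolution does, here is a minimal sketch of one generic GCN layer in plain Python. It is not the PigletMS architecture: the way the species graph is built, the node features, the weights and the number of layers are not described here, so the toy graph, feature values and function names below are all assumptions made for illustration. The layer shown is the standard propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), in which each node averages information from itself and its neighbours.

```python
import math


def normalized_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph convolution: mix neighbour features, project, apply ReLU."""
    propagated = matmul(normalized_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]


# Hypothetical toy graph: nodes 0 and 1 are "close" species (connected),
# node 2 is unrelated to both (isolated).
adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
node_features = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
identity_weights = [
    [1.0, 0.0],
    [0.0, 1.0],
]

hidden = gcn_layer(adjacency, node_features, identity_weights)
print(hidden)
```

In this toy example the two connected nodes end up with identical representations because each one averages its own features with its neighbour's, while the isolated node keeps its own features. That is the intuition behind letting related species share signal.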
Classifier 2 targets species that are absent from the MS databases and relates proteins found in the large proteomic database to those expressed in MS, thus expanding the total search space of PigletMS beyond MS databases. Classifier 2 combines TF-IDF*, a classic Information Retrieval model, and ProtTrans, a state-of-the-art protein embedding model, to identify important proteins for similarity scoring. This innovative combination is our primary contribution to the field of microbiology.
*Term Frequency – Inverse Document Frequency model
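As a rough illustration of the TF-IDF half of this idea, the sketch below treats each species as a "document" made of protein identifiers and ranks species by cosine similarity to a sample. This framing, the identifiers `P1` to `P5`, the species names and all function names are assumptions made for the example. The actual Classifier 2, including how it maps MS peaks to proteins and how it uses ProtTrans embeddings, is not reproduced here.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed inverse document frequency: rarer proteins weigh more."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0
            for term, count in doc_freq.items()}


def tfidf_vector(terms, idf):
    """Sparse TF-IDF vector; terms unseen in the reference set are ignored."""
    counts = Counter(term for term in terms if term in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in counts.items()}


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_species(sample_terms, documents):
    """Return (species, score) pairs sorted from most to least similar."""
    idf = idf_weights(documents)
    query = tfidf_vector(sample_terms, idf)
    scores = {name: cosine(query, tfidf_vector(terms, idf))
              for name, terms in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference proteomes: P1 is shared by all species,
# P3 and P5 are each unique to one species.
reference = {
    "species_A": ["P1", "P2", "P3"],
    "species_B": ["P1", "P2", "P4"],
    "species_C": ["P1", "P5", "P4"],
}
sample = ["P1", "P3"]  # proteins assumed to be detected in the sample

ranking = rank_species(sample, reference)
for name, score in ranking:
    print(name, round(score, 3))
```

The point of the weighting is that P1, found in every species, contributes little to the ranking, whereas P3, found only in species_A, dominates it. An embedding model could then be used to score proteins that are similar but not identical, which exact identifier matching cannot do.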
How well does PigletMS perform?
PigletMS outperforms the current state-of-the-art MSLF model in MS library-free bacteria identification. When tested on a diverse and balanced test set of species, PigletMS achieved 66.7% Top-3 Genus^ accuracy (versus 14.9% by MSLF, a 4.5x improvement) and 25.0% Top-1 Species accuracy (versus 2.1% by MSLF, an 11.9x improvement).
This gap is plausible because MSLF uses only 10 genes in its predictions, while our Classifier 2 uses 71 genes.
^ A sample is considered “Top-k species (or genus) accurate” if its true species (or genus) appears among the model’s top k ranked outputs.
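The Top-k definition can be written out in a few lines. The sketch below is only an illustration of the metric: the ranked outputs and true labels are made-up examples, not PigletMS results, and the helper names, as well as reading the genus off as the first word of a species name, are assumptions.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need exactly one ranked list per true label")
    if not true_labels:
        return 0.0
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


def genus_of(species_name):
    """Assume binomial names, so the genus is the first word."""
    return species_name.split()[0]


# Hypothetical ranked model outputs for three samples (best guess first).
ranked_species = [
    ["Vibrio cholerae", "Vibrio mimicus", "Escherichia coli"],
    ["Bacillus cereus", "Bacillus anthracis", "Bacillus subtilis"],
    ["Escherichia coli", "Shigella flexneri", "Salmonella enterica"],
]
true_species = ["Vibrio mimicus", "Bacillus anthracis", "Yersinia pestis"]

# Genus-level evaluation maps both predictions and truths to their genus.
ranked_genera = [[genus_of(name) for name in ranked] for ranked in ranked_species]
true_genera = [genus_of(name) for name in true_species]

top1_species = top_k_accuracy(ranked_species, true_species, k=1)
top3_species = top_k_accuracy(ranked_species, true_species, k=3)
top1_genus = top_k_accuracy(ranked_genera, true_genera, k=1)
print(top1_species, top3_species, top1_genus)
```

In this made-up example no sample has the right species in first place, but two of the three have it within the top three, and the same two already have the right genus in first place. This is why genus-level and larger-k scores are always at least as high as Top-1 species accuracy.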
Reviews of our system
Informed by our user research and several co-creative design iterations with our end users, the PigletMS UI streamlined workflows and proved intuitive for first-time users.
Chieu Hai Leong, Distinguished Member of Technical Staff, DSO National Laboratories, commended:
“The software and model implemented have exceeded our expectations of the project.”
A DSO scientist, our end user, said:
“This software coupled with our BRUKER software makes a dream team.”
PigletMS and the future of bacteria identification
As massive open-source databases such as UniProt grow exponentially, so do the range of bacteria that PigletMS can identify and the accuracy with which it identifies them. PigletMS enables microbiologists to tap into the world’s collaborative effort to study and fight bacterial threats, and to do so quickly and for free.
P.S. Why are we called PigletMS?
We honestly had a hard time coming up with a good name that could represent our model, that wasn’t boring like most academic titles and that hinted at our team’s love of fun. One day, while taking a break from doing ground-breaking research that makes bacteria everywhere tremble in fear, one of our team members doodled cute pigs on a whiteboard to a song. The adorable piglet caught the attention of our mentor, who then started calling the project PigletMS, because we work with peaks in MS and bacteria are small like piglets. It caught on. It’s even in our source code.