Identification of Bacteria from MALDI-TOF Mass Spectrometry

Team members

Ku Wee Tiong (ESD), Choo Yan Guang (ESD), Lee En Qi Amanda (ESD), Chan Luo Qi (ISTD), Seow Xu Liang (ISTD), Zhang Jingyu (ISTD)

Instructors:

Kenny Choo, Wang Xingyin, Lynette Cheah

Writing Instructor:

Grace Kong

Teaching Assistants:

Cheong Rui Zhi Jeremy, Anirudh Gajendra Rathi

Why bacteria identification?

Bacteria identification is crucial to preserving the health and safety of our community. Outbreaks of harmful bacteria, such as the infamous bubonic plague, are not a thing of the past. Currently, Yemen is experiencing a large-scale cholera outbreak (caused by the bacterium Vibrio cholerae) which has caused more than 3,900 deaths. Bacteria identification has several use cases, such as:

1. Clinical diagnosis, where effective treatment depends on accurate and rapid diagnostics, and
2. Environmental monitoring, to detect and assess outbreak threats quickly.

PigletMS: The User-Friendly Two-Pronged Bacteria Identifier

PigletMS was designed to serve as a parallel diagnostic tool for microbiologists. In practice, an unknown MS sample may contain a species that is either present in or absent from the MS databases. At its core, PigletMS consists of two classifiers, one for each of these two prediction spaces.

Classifier 1 targets species that are present in the MS databases. It relies on the idea that species that are biologically closer are likely to express similar proteins and hence produce similar mass spectra. A Graph Convolutional Network (GCN) is employed for this purpose.
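To illustrate why a GCN suits this idea, here is a minimal sketch of a single graph-convolution step in plain Python. It is not the PigletMS implementation: the write-up does not describe how the graph is built, so the sketch assumes that nodes are species, that an edge links two closely related species, and that each node carries a small feature vector. The adjacency matrix, features and weights below are all hypothetical toy values.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    out = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]

# Hypothetical toy graph: species 0 and 1 are related, species 2 is isolated.
adj = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
# Identity weights, chosen so the output shows only the neighbourhood mixing.
weights = [[1.0, 0.0],
           [0.0, 1.0]]

hidden = gcn_layer(adj, features, weights)
```

In this toy setup, the two linked species start with different features but end up with the same hidden representation, while the isolated species is left unchanged. A trained model would learn the weights rather than use an identity matrix.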

Classifier 2 targets species that are absent from the MS databases. It relates proteins found in the large proteomic database to those expressed in MS, thus expanding the total search space of PigletMS beyond the MS databases. Classifier 2 combines TF-IDF*, a classic information retrieval model, with ProtTrans, a state-of-the-art protein embedding model, to identify important proteins for similarity scoring. This innovative combination is our primary contribution to the field of microbiology.

*Term Frequency – Inverse Document Frequency model
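As a rough illustration of the TF-IDF half of this idea, the sketch below weights proteins by how discriminative they are and ranks species by cosine similarity to a query. It rests on assumptions that are not in the write-up: each species is treated as a "document" whose "terms" are protein names, the query is a list of proteins matched from a spectrum, and the ProtTrans embedding step is left out entirely. The species and protein names are hypothetical placeholders.

```python
import math
from collections import Counter

def inverse_document_frequency(corpus):
    """corpus: {species: [protein, ...]} -> {protein: log(N / doc frequency)}."""
    n_docs = len(corpus)
    doc_freq = Counter(p for proteins in corpus.values() for p in set(proteins))
    return {p: math.log(n_docs / d) for p, d in doc_freq.items()}

def tfidf(proteins, idf):
    """Weight each protein by its relative frequency times its IDF."""
    counts = Counter(proteins)
    return {p: (c / len(proteins)) * idf.get(p, 0.0) for p, c in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical proteomes; the names are placeholders, not real database entries.
corpus = {
    "species_A": ["ribosomal_L7", "ribosomal_S1", "toxin_X"],
    "species_B": ["ribosomal_L7", "ribosomal_S1", "flagellin_Y"],
    "species_C": ["ribosomal_L7", "chaperone_Z", "flagellin_Y"],
}
idf = inverse_document_frequency(corpus)

# Hypothetical proteins matched from an unknown sample's spectrum.
query = ["ribosomal_L7", "toxin_X"]
query_vec = tfidf(query, idf)

scores = {s: cosine(query_vec, tfidf(p, idf)) for s, p in corpus.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

A protein shared by every species, such as the placeholder ribosomal_L7, receives an IDF of zero and so contributes nothing to the score, while a protein unique to one species dominates the ranking. That is the sense in which TF-IDF can pick out "important" proteins.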

How well does PigletMS perform?

PigletMS outperforms the current state-of-the-art MSLF model in MS library-free bacteria identification. When tested on a diverse and balanced test set of species, PigletMS achieved 66.7% Top-3 Genus^ accuracy (versus 14.9% by MSLF, a 4.5x improvement) and 25.0% Top-1 Species accuracy (versus 2.1% by MSLF, an 11.9x improvement).

This gap is plausible, as MSLF uses only 10 genes in its predictions while our Classifier 2 uses 71 genes.

^ A sample is considered “Top-K species (or genus) accurate” if its true species (or genus) is found within the Top-K ranked output of the model.
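The Top-K metric defined above can be sketched in a few lines of Python. The function names and the toy predictions are illustrative assumptions, not PigletMS code or results. The genus is taken as the first word of the binomial species name.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k predictions."""
    if not true_labels or len(ranked_predictions) != len(true_labels):
        raise ValueError("need one non-empty ranked list per true label")
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)

def genus_of(species_name):
    """Genus of a binomial name, e.g. 'Vibrio cholerae' -> 'Vibrio'."""
    return species_name.split()[0]

# Hypothetical ranked outputs for three samples (not PigletMS results).
ranked = [
    ["Vibrio cholerae", "Vibrio vulnificus", "Escherichia coli"],
    ["Escherichia coli", "Salmonella enterica", "Vibrio cholerae"],
    ["Bacillus cereus", "Bacillus subtilis", "Escherichia coli"],
]
truth = ["Vibrio cholerae", "Salmonella enterica", "Bacillus anthracis"]

top1_species = top_k_accuracy(ranked, truth, k=1)
top3_species = top_k_accuracy(ranked, truth, k=3)

# Genus-level scoring: map both predictions and truth to their genus first.
ranked_genus = [[genus_of(s) for s in r] for r in ranked]
truth_genus = [genus_of(s) for s in truth]
top1_genus = top_k_accuracy(ranked_genus, truth_genus, k=1)
top3_genus = top_k_accuracy(ranked_genus, truth_genus, k=3)
```

In this toy data the third sample is never species-accurate, because Bacillus anthracis does not appear in its ranked list, yet it is genus-accurate at K=1. This shows why genus-level scores are typically higher than species-level ones.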

Reviews of our system

Informed by our user research and several co-creative design iterations with our end users, the PigletMS UI successfully streamlined workflows and proved intuitive for first-time users.

Chieu Hai Leong, Distinguished Member of Technical Staff, DSO National Laboratories, commended:

“The software and model implemented have exceeded our expectations of the project.”

A DSO scientist, our end user, said:

“This software coupled with our BRUKER software makes a dream team.”

P.S. Why are we called PigletMS?

We honestly had a hard time coming up with a good name that could represent our model, that wasn’t boring like most academic titles, and that hinted at our team’s love for fun. One day, while taking a break from doing ground-breaking research that makes bacteria everywhere tremble in fear, one of our team members doodled cute pigs on a whiteboard to a song. The adorable piglet caught the attention of our mentor, who then started calling the project PigletMS, because we work with peaks in MS and bacteria are small, like piglets. It caught on. It’s even in our source code.