Oral Presentation MedVetPATHOGENS 2018

Shiga-toxin producing Escherichia coli (STEC) serotype, lineage and host prediction using machine learning models (#14)

Chad R Laing 1, Matthew D Whiteside 1, Akiff Manji 1, Rylan Boothman 1, Victor Gannon 1
  1. Public Health Agency of Canada, Lethbridge, AB, Canada

Shiga-toxin producing Escherichia coli (STEC) are zoonotic pathogens associated with food- and waterborne outbreaks of disease in humans. However, certain STEC serotypes, and lineages within these serotypes, are more frequently associated with human disease than others. Further, the precise roles of many serotype- and clade-specific genes in phenotypes that influence bacterial survival and virulence are unknown. In this study, we examined 143 STEC from 36 serotypes, using whole-genome sequencing (WGS), phenotypic microarray (PM) analyses, and machine learning (ML) models to explore these linkages. The phylogeny based on single nucleotide polymorphisms (SNPs) among the 143 genomes was highly concordant with that based on the PM data. STEC were largely divided among O- and H-type-specific subgroups using both data sources. ML models trained on the PM data correctly predicted serotype 98.6% of the time with the artificial neural network (ANN) models, and 68.34% of the time with the linear support vector machine (SVM) models. Classification of host source as human or non-human from the PM data was correct 73.6% of the time with the ANN model, and 62.1% of the time with the SVM. The same models applied to k-mer analyses of the corresponding WGS data gave serotype prediction accuracies of 98.4% for the ANN, and 85.3% for the SVM. In conclusion, predictive phenotypic and genomic markers were identified for all of the major phylogenetic clades, and for serotype-specific groups. PM and WGS data were found to produce highly concordant phylogenies when used as input for ML models. ANNs in particular show promise for predictive classification of STEC. Potential implications of this work include the development of selective media for specific serotypes or lineages, and the rapid classification of bacteria into subgroups most frequently associated with severe human disease.
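
The abstract does not say how the k-mer features were computed, so the following is only a minimal sketch of one common approach, not the authors' pipeline. It counts overlapping k-mers in a DNA sequence and arranges them into a fixed-length vector that could serve as input to a classifier such as an ANN or SVM. The function names (kmer_counts, kmer_vector), the choice of k, the handling of ambiguous bases, and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping any that contain non-ACGT characters.

    Hypothetical helper for illustration; real pipelines often also merge
    reverse complements and use much larger k.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(sequence, k=3):
    """Return one count per possible k-mer, in lexicographic order (4**k entries)."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


# Toy sequence (an assumption, not data from the study).
vec = kmer_vector("ATGCGATGCA", k=2)
```

Each genome would yield one such vector, and the stacked vectors together with serotype or host labels would form the training matrix. The vector length grows as 4**k, so larger k values usually call for sparse storage or feature selection.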