Poster Presentation 21st Lancefield International Symposium for Streptococci and Streptococcal Diseases 2022

Lessons learnt from using a machine learning (ML)-based genotype-phenotype association study (GPAS) to predict the metadata of group A Streptococcus (GAS) genomes (#115)

Sean J Buckley 1 , Robert J Harvey 1 2
  1. School of Health and Behavioural Sciences, University of the Sunshine Coast, Maroochydore, Queensland, Australia
  2. Sunshine Coast Health Institute, University of the Sunshine Coast, Birtinya, Queensland, Australia

Background:

GAS is a globally significant pathogen. In the era of serology, the typing of GAS based on the immuno-protective antigenicity of the surface-exposed Emm (that is, ‘M-typing’) was mandatory. In the era of classical nucleotide sequencing, calibration of the nucleotide-based emm-type against the M-type was inevitable. In the whole genome sequencing (WGS) era, however, we contend that an assessment of how essential emm-typing remains for characterising GAS phylogenetic delineation and epidemiology is warranted. Such an assessment justifies feasibility testing of alternative WGS-amenable typing schemes, of which our transcription regulator (TR)-based scheme is one. Quantifying the strain (or emm-type) dependency of variation in the DNA that encodes GAS TRs indirectly measures the backwards compatibility of our TR-based scheme with the vast body of emm-based epidemiological research.

Methods:

We catalogued the distribution and diversity of GAS TRs using phylogenetic and concordance metrics, and applied GPAS in the prediction of GAS genome metadata (including emm-type, invasiveness, and clinical outcome). We developed an ML-based workflow incorporating a novel WGS-amenable TR-based phenotype prediction scheme and a protocol for collecting GAS phenotype metadata.
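
The abstract does not name the concordance metrics used. As an illustration only, the sketch below computes the Wallace coefficient, a measure widely used to compare microbial typing schemes: of all pairs of genomes that share a label under scheme A, it gives the fraction that also share a label under scheme B. The function name, the TR allele labels, and the emm-type assignments are hypothetical toy values, not data from this study.

```python
from collections import Counter
from math import comb


def wallace(labels_a, labels_b):
    """Wallace coefficient W(A->B).

    Of all genome pairs that share a label under scheme A, return the
    fraction that also share a label under scheme B.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both schemes must label the same genomes")
    # Number of genome pairs grouped together by scheme A.
    pairs_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    # Number of genome pairs grouped together by both schemes.
    pairs_ab = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    if pairs_a == 0:
        raise ValueError("scheme A groups no two genomes together")
    return pairs_ab / pairs_a


# Hypothetical toy labels for six genomes (not data from the study).
tr_alleles = ["t1", "t1", "t1", "t2", "t2", "t3"]
emm_types = ["emm1", "emm1", "emm12", "emm89", "emm89", "emm4"]

print(wallace(tr_alleles, emm_types))  # how well the TR allele predicts emm-type
print(wallace(emm_types, tr_alleles))  # how well emm-type predicts the TR allele
```

The coefficient is directional, so computing it both ways shows which scheme is the finer partition. In this toy example, genomes with the same emm-type always share a TR allele, but the reverse does not hold.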

Results:

We predicted emm-type (97%), country of origin (88.6%), and invasiveness (84.7%) with high accuracy. Interpreting the inaccurate emm-type predictions led to the development of biological models for characterising emm-switching, mga2-switching, two types of emm-enn chimerisation, and the putative rapid evolution and time-dependent excision of genes in the mga regulon of clinically significant GAS strains.

Conclusions:

Our workflows have advanced the understanding of GAS phylogenetic delineation and epidemiology, and stand as templates for the testing of hitherto untested GAS phenotypic traits.