Background:
Group A Streptococcus (GAS) is a globally significant pathogen. In the era of serology, typing GAS by the immuno-protective antigenicity of the surface-exposed Emm protein (that is, ‘M-typing’) was mandatory. In the era of classical nucleotide sequencing, calibration of the nucleotide-based emm-type against the M-type was inevitable. In the whole genome sequencing (WGS) era, however, we contend that an assessment of how essential emm-typing is for characterising GAS phylogenetic delineation and epidemiology is warranted. This, in turn, justifies feasibility testing of alternative WGS-amenable typing schemes, of which our transcription regulator (TR)-based scheme is one. Quantifying the strain (or emm-type)-dependency of variation in the DNA that encodes GAS TRs provides an indirect measure of the backwards compatibility of our TR-based scheme with the vast body of emm-based epidemiological research.
Methods:
We catalogued the distribution and diversity of GAS TRs using phylogenetic and concordance metrics, and applied GPAS to predict GAS genome metadata (including emm-type, invasiveness, and clinical outcome). We developed a machine learning (ML)-based workflow incorporating a novel WGS-amenable TR-based phenotype prediction scheme and a protocol for collecting GAS phenotype metadata.
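As an illustration of what a concordance metric between two typing schemes can look like, the sketch below computes the Wallace coefficient, a standard measure of how often two isolates that share a type under one scheme also share a type under another. This is an assumption for illustration only: the abstract does not name the specific metrics used, and the function name, type labels, and data are hypothetical.

```python
from collections import Counter
from math import comb


def wallace(a_labels, b_labels):
    """Wallace coefficient W(A -> B): the probability that two genomes
    sharing a type under scheme A also share a type under scheme B."""
    if len(a_labels) != len(b_labels):
        raise ValueError("both schemes must label the same genomes")
    # Pairs of genomes placed in the same type by scheme A.
    same_a = sum(comb(n, 2) for n in Counter(a_labels).values())
    # Pairs placed in the same type by both schemes.
    same_both = sum(comb(n, 2) for n in Counter(zip(a_labels, b_labels)).values())
    if same_a == 0:
        raise ValueError("scheme A places no two genomes in the same type")
    return same_both / same_a


# Hypothetical labels for six genomes (not data from this study).
tr_types = ["TR1", "TR1", "TR2", "TR2", "TR3", "TR3"]
emm_types = ["emm1", "emm1", "emm12", "emm12", "emm12", "emm12"]

tr_to_emm = wallace(tr_types, emm_types)   # how well TR-type predicts emm-type
emm_to_tr = wallace(emm_types, tr_types)   # how well emm-type predicts TR-type
```

In this toy case the TR-based labels fully determine the emm-type (coefficient of 1.0), while the reverse direction is weaker (3/7), which is the kind of asymmetry a backwards-compatibility assessment would look for.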
Results:
We predicted emm-type (97%), country of origin (88.6%), and invasiveness (84.7%) with high accuracy. Interpreting the inaccurate emm-type predictions led us to develop biological models characterising emm-switching, mga2-switching, two types of emm-enn chimerisation, and the putative rapid evolution and time-dependent excision of genes in the mga regulon of clinically significant GAS strains.
Conclusions:
Our workflows have advanced the understanding of GAS phylogenetic delineation and epidemiology, and stand as templates for the testing of hitherto untested GAS phenotypic traits.