The seasonal influenza (flu) vaccine is designed to protect against those influenza viruses predicted to circulate during the upcoming flu season, but identifying which viruses are likely to circulate is challenging. We use features from phylogenetic trees reconstructed from hemagglutinin (HA) and neuraminidase (NA) sequences, together with a support vector machine, to predict future circulation. We obtain accuracies of 0.75 to 0.89 (AUC 0.83 to 0.91) over 2016–2020. We explore ways to select potential candidates for a seasonal vaccine and find that the machine learning model has a moderate ability to select strains that are close to future populations.
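The two performance measures reported above can be made concrete with a minimal sketch. The labels and scores below are hypothetical and are not taken from the study; the function names `accuracy` and `auc` are illustrative. The sketch assumes a classifier that emits a signed score per strain, such as the distance from a support vector machine decision boundary.

```python
def accuracy(labels, scores, threshold=0.0):
    """Fraction of strains whose thresholded score matches the true label."""
    predictions = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise definition of the
    area under the ROC curve.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = the strain's lineage circulated later, 0 = it did not.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical signed classifier scores, one per strain.
scores = [1.2, 0.4, -0.3, 0.1, -0.8, -1.5]

print(accuracy(labels, scores))
print(auc(labels, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two are often reported together.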
Serial intervals – the time between symptom onset in infector and infectee – are a fundamental quantity in infectious disease control. However, their estimation requires knowledge of individuals’ exposures, typically obtained through resource-intensive contact tracing efforts. We introduce an alternative framework using virus sequences to inform who infected whom and thereby estimate serial intervals. We apply our technique to SARS-CoV-2 sequences from case clusters in the first two COVID-19 waves in Victoria, Australia. We find that our approach offers high-resolution, cluster-specific serial interval estimates that are comparable with those obtained from contact data, despite requiring no knowledge of who infected whom and relying on incompletely sampled data. Relative to using a single published serial interval, using cluster-specific serial intervals can change estimates of the effective reproduction number by a factor of 2–3. We find that serial interval estimates in settings such as schools and meat processing/packing plants are shorter than those in healthcare facilities.
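To illustrate why the assumed serial interval matters for reproduction number estimates, the following sketch uses the standard relation between an exponential growth rate r and the reproduction number for a gamma-distributed interval, R = (1 + r·θ)^k, where k and θ are the gamma shape and scale. This is not necessarily the estimator used in the study. It also assumes that the serial interval is an adequate proxy for the generation interval. All numerical values are hypothetical, as is the function name `reproduction_number`.

```python
def reproduction_number(growth_rate, mean_si, sd_si):
    """Reproduction number implied by a growth rate and a gamma-distributed interval.

    growth_rate: exponential epidemic growth rate r (per day)
    mean_si, sd_si: mean and standard deviation of the interval (days)
    Requires 1 + r * scale > 0.
    """
    shape = (mean_si / sd_si) ** 2   # gamma shape k = mean^2 / variance
    scale = sd_si ** 2 / mean_si     # gamma scale theta = variance / mean
    return (1.0 + growth_rate * scale) ** shape


# Hypothetical values, for illustration only.
r = 0.1  # growth rate per day
short_interval = reproduction_number(r, mean_si=3.0, sd_si=2.0)
long_interval = reproduction_number(r, mean_si=6.0, sd_si=3.0)

print(short_interval, long_interval)
```

For the same positive growth rate, a longer assumed interval implies a larger reproduction number, so applying one published interval to clusters whose true intervals differ will bias the estimate.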
Genomic data are increasingly being used to understand infectious disease epidemiology. Isolates from a given outbreak are sequenced, and the patterns of shared variation are used to infer which isolates within the outbreak are most closely related to each other. Unfortunately, the phylogenetic trees typically used to represent this variation are not directly informative about who infected whom: a phylogenetic tree is not a transmission tree. However, a transmission tree can be inferred from a phylogeny while accounting for within-host genetic diversity by coloring each branch of the phylogeny according to the host in which that lineage resided.
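The coloring idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the inference procedure itself: the coloring is taken as given rather than inferred, colors are attached to nodes (so a color changes at most once along a branch), and the coloring is assumed to be valid, meaning each host's color forms one connected region of the tree. The node names, host names, and the function `transmission_edges` are hypothetical.

```python
def transmission_edges(parent_of, host_of):
    """Return the (infector, infectee) pairs implied by a colored phylogeny.

    parent_of: dict mapping each non-root node to its parent node
    host_of: dict mapping every node to the host (color) it resides in
    A branch whose two ends have different colors marks a transmission event.
    """
    edges = set()
    for child, parent in parent_of.items():
        if host_of[parent] != host_of[child]:
            edges.add((host_of[parent], host_of[child]))
    return edges


# Hypothetical phylogeny: root n1, internal node n2, tips sampled from hosts A, B, C.
parent_of = {"n2": "n1", "t_A": "n1", "t_B": "n2", "t_C": "n2"}
# Hypothetical coloring: n2 is a coalescence inside host B (within-host diversity).
host_of = {"n1": "A", "n2": "B", "t_A": "A", "t_B": "B", "t_C": "C"}

print(sorted(transmission_edges(parent_of, host_of)))
```

In this example the internal node `n2` lies inside host B, so the lineage sampled from C diverged from B's sampled lineage before B transmitted to C. That is the kind of within-host diversity that makes the phylogeny differ from the transmission tree.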
Evolutionary relationships are frequently described by phylogenetic trees, but a central barrier in many fields is the difficulty of interpreting data containing conflicting phylogenetic signals. We present a metric-based method for comparing trees that extracts distinct alternative evolutionary relationships embedded in data. We demonstrate detection and resolution of phylogenetic uncertainty in a recent study of anole lizards, leading to alternative hypotheses about their evolutionary relationships. We use our approach to compare trees derived from different genes of Ebolavirus and find that the VP30 gene has a distinct phylogenetic signature composed of three alternatives that differ in the deep branching structure.
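As an illustration of what a metric on trees can look like, the sketch below represents each rooted tree by a vector that records, for every pair of tips, the number of edges between the root and the pair's most recent common ancestor, and then takes the Euclidean distance between vectors. The abstract does not specify the metric, so this is one simple topological choice offered as an assumption, not necessarily the authors' definition. The tree encoding and function names are hypothetical.

```python
import math
from itertools import combinations


def path_to_root(node, parent_of):
    """List of nodes from `node` up to and including the root."""
    path = [node]
    while node in parent_of:
        node = parent_of[node]
        path.append(node)
    return path


def mrca_depth_vector(parent_of, tips):
    """For each tip pair (in sorted order), edges from the root to their MRCA."""
    vector = []
    for a, b in combinations(sorted(tips), 2):
        ancestors_b = set(path_to_root(b, parent_of))
        # The first node on a's path to the root that is also above b is the MRCA.
        mrca = next(n for n in path_to_root(a, parent_of) if n in ancestors_b)
        vector.append(len(path_to_root(mrca, parent_of)) - 1)
    return vector


def tree_distance(tree1, tree2, tips):
    """Euclidean distance between the MRCA-depth vectors of two rooted trees."""
    v1 = mrca_depth_vector(tree1, tips)
    v2 = mrca_depth_vector(tree2, tips)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))


# Hypothetical trees on the same four tips, encoded as child -> parent maps.
tips = ["A", "B", "C", "D"]
tree1 = {"x": "r", "y": "r", "A": "x", "B": "x", "C": "y", "D": "y"}  # ((A,B),(C,D))
tree2 = {"x": "r", "y": "r", "A": "x", "C": "x", "B": "y", "D": "y"}  # ((A,C),(B,D))

print(tree_distance(tree1, tree2, tips))
```

Once every pair of trees in a collection has a distance, the trees can be embedded in a low-dimensional space and clustered, so that groups of trees supporting different evolutionary relationships become visible.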