Michael B. Hall
Second Year Report
bash course
Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.
Benchmark Nanopore versus Illumina SNP calling and show our algorithms meet the needs of clinical and public health users.
Improve upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.
Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.
Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.
80%
Bacterial genomes are incredibly diverse.

In an “open” pan-genome, such as Salmonella enterica, two individuals could share as little as 16% of their genes



Nanopore variant calling is in its infancy (medaka and nanopolish), and no extensive benchmark has been completed
Pandora - pan-genome inference and genotyping with long noisy or short accurate reads, by Rachel Colquhoun.

Pandora map
Infer consensus sequence for a single sample and genotype with respect to this consensus sequence

Pandora compare
Infer consensus sequence for a collection of samples and genotype with respect to this consensus sequence

Pandora can only genotype based on variation within the graph
The work in my first chapter outlines a method for removing this limitation and provides
an analysis of the gain in recall and precision by incorporating de novo variant discovery
into the pandora workflow.
pandora (~850/~3200 lines of source/test code)
snakemake pipeline of ~3500 lines of code to orchestrate the entire evaluation and simulations.


Aim to show that the addition of de novo discovery allows pandora to improve its
ability to discover and call variants correctly (precision/recall).
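
As a rough illustration of the precision/recall measure mentioned above, the following is a minimal sketch, assuming that called and true variants can each be reduced to a set of (contig, position, ref, alt) tuples and matched exactly. The function name and the tuple representation are hypothetical and are not taken from pandora or the evaluation pipeline.

```python
from typing import Set, Tuple

# Hypothetical representation: (contig, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def precision_recall(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of calls against a truth set.

    Assumption: a call is correct only if it matches a truth variant exactly.
    """
    true_positives = len(called & truth)
    # Precision: fraction of calls that are correct.
    precision = true_positives / len(called) if called else 0.0
    # Recall: fraction of true variants that were called.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


truth = {("chr", 10, "A", "T"), ("chr", 50, "G", "C"),
         ("chr", 90, "T", "TA"), ("chr", 120, "C", "G")}
called = {("chr", 10, "A", "T"), ("chr", 50, "G", "C"), ("chr", 70, "A", "G")}
precision, recall = precision_recall(called, truth)
```

In this toy example two of the three calls are correct and two of the four true variants are found. Under this scheme, de novo discovery would be expected to raise recall by recovering variants absent from the graph, and the evaluation then checks that precision does not suffer.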
pandora with reads from mutated genome

pandora and de novo evaluation
compare routine to show the power of the reference-graph approach
snippy, medaka, nanopolish - using a variety of references
Difficulty in evaluating four-way is “truth”
nucmer to get differences
pandora VCF
Benchmark Nanopore versus Illumina SNP calling and show our algorithms meet the needs of clinical and public health users.
10%
The first step towards clustering a set of genomes is determining a distance matrix.
We define genetic distance to be the sum of genetic discordances, where missing data and heterozygosity do not cause discordance (unless the zygosity does not include the reference allele) and study the clustering this definition generates.
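
The distance definition above can be sketched as follows. This is one possible reading, not the implementation used in this work: a genotype is modelled as a set of allele indices (0 is the reference allele, None is missing data), a heterozygous call that includes the reference allele never causes discordance, and any other pair of calls is discordant when the two allele sets share no allele. All names are hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence

# None = missing data; allele index 0 = reference allele (assumed encoding).
Genotype = Optional[FrozenSet[int]]


def discordant(a: Genotype, b: Genotype) -> bool:
    """Return True if two genotype calls at one site count as a discordance."""
    if a is None or b is None:
        return False  # missing data never causes discordance
    for g in (a, b):
        if len(g) > 1 and 0 in g:
            return False  # heterozygous call that includes the reference
    # Homozygous calls, or heterozygous calls without the reference allele:
    # assumed discordant when the allele sets do not overlap.
    return a.isdisjoint(b)


def genetic_distance(x: Sequence[Genotype], y: Sequence[Genotype]) -> int:
    """Sum of discordances over sites genotyped in the same order."""
    if len(x) != len(y):
        raise ValueError("samples must be genotyped at the same sites")
    return sum(discordant(a, b) for a, b in zip(x, y))


def distance_matrix(samples: Sequence[Sequence[Genotype]]) -> List[List[int]]:
    """Symmetric pairwise distance matrix, the input to clustering."""
    return [[genetic_distance(x, y) for y in samples] for x in samples]


hom = lambda allele: frozenset({allele})
s1 = [hom(0), hom(1), None, frozenset({0, 1}), frozenset({1, 2})]
s2 = [hom(1), hom(1), hom(1), hom(2), hom(0)]
matrix = distance_matrix([s1, s2])
```

Here only the first and last sites contribute to the distance between s1 and s2: the third site is missing in s1 and the fourth is a heterozygous call that includes the reference allele.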
M. tuberculosis public health applications
How can pandora be used to help meet these requirements?
Show that Nanopore sequencing is now capable of performing these tasks
Same isolate DNA extraction sequenced on both Illumina and Nanopore
clockwork, combining best of samtools and cortex
samtools with some filtering and masking
Baseline Illumina/Nanopore concordance, using PacBio as a validation (where we have it)
Using four PRGs of varying complexity:
pandora
pandora with “best” PRG
Improve upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.
Mykrobe
Uses a panel of resistance markers to predict drug resistance from WGS data for M. tuberculosis and Staphylococcus aureus.
The predictive power of Mykrobe is likely to expand during this PhD due to the CRyPTIC consortium.
Nanopore concordance with Illumina for phenotype prediction in M. tuberculosis
Given the collection of SNPs and indels identified as being necessary for resistance to the 14 major drugs tested, we want to show that we can detect them as well with Nanopore data as we can with Illumina.
Mykrobe and TBProfiler - small sample sizes (n<6) used to validate
TBProfiler uses pileup approach - poor indel power. Indels are important for resistance to some drugs
Mykrobe uses k-mer mapping - requires high coverage. K-mers considered in isolation
Pandora
Can use smaller k-mer size than Mykrobe as it takes context into account. Therefore it theoretically requires less coverage.
Can call novel variants (Chapter 1)
pandora
pandora output and produces resistance predictions or flag for phenotyping
Mykrobe for Illumina and Nanopore
Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.

The ability to accurately map sequencing reads to these genes would likely improve our ability to perform variant calling in M. tuberculosis and therefore better determine how isolates relate to each other.
Build a high-quality pan-genome for M. tuberculosis, to allow variant discovery in all genes - ideally including the pe/ppe genes.
Produce a collection of high-quality pe/ppe PRGs with information about what read length will provide reliable mapping, and whether Illumina data can be reliably mapped to them.
Re-analyse data from Chapter 2 and see if we are better able to cluster samples with this new pan-genome with pe/ppe map
Assess variation in pe/ppe genes across 10,000 samples from CRyPTIC
pandora and the work in Chapter 1 is currently in preparation
pandora.
Too far off to say at this stage
