Michael B. Hall
Second Year Report
Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.
Benchmark Nanopore versus Illumina SNP calling and show our algorithms meet the needs of clinical and public health users.
Improve upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.
Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.
Develop algorithms and software for variant discovery using bacterial genome graphs, building on work of a previous student in the lab.
80%
Bacterial genomes are incredibly diverse.
In an “open” pan-genome, such as that of Salmonella enterica, two individuals could share as little as 16% of their genes.
Nanopore variant calling is in its infancy (medaka and nanopolish), but no extensive benchmark has been completed.
Pandora: pan-genome inference and genotyping with long, noisy or short, accurate reads, from Rachel Colquhoun.
Pandora map: infer a consensus sequence for a single sample and genotype with respect to this consensus sequence.
Pandora compare: infer a consensus sequence for a collection of samples and genotype with respect to this consensus sequence.
Pandora can only genotype based on variation within the graph.
The work in my first chapter outlines a method for removing this limitation and provides an analysis of the gain in recall and precision by incorporating de novo variant discovery into the pandora workflow.
pandora (~850/~3200 lines of source/test code)
snakemake pipeline of ~3500 lines of code to orchestrate the entire evaluation and the simulations.
The aim is to show that adding de novo discovery improves pandora's ability to discover and call variants correctly (precision/recall).
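As a minimal sketch of how precision and recall could be computed for a call set, assuming variants are represented as hypothetical (chromosome, position, ref, alt) tuples and that a call counts as correct only when it matches a truth variant exactly (real evaluations usually need a more tolerant comparison). The variant sets below are invented for illustration and are not results.

```python
def precision_recall(truth, called):
    """Return (precision, recall) for a set of called variants against a truth set.

    Assumption: exact tuple equality defines a correct call.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # called and true
    fp = len(called - truth)   # called but not true
    fn = len(truth - called)   # true but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: the second truth variant is absent from the graph,
# so it can only be recovered when de novo discovery is enabled.
truth = {("chr", 100, "A", "G"), ("chr", 250, "C", "T"), ("chr", 400, "G", "GA")}
graph_only = {("chr", 100, "A", "G")}
with_denovo = {("chr", 100, "A", "G"), ("chr", 250, "C", "T"), ("chr", 900, "T", "C")}

for name, calls in (("graph only", graph_only), ("with de novo", with_denovo)):
    p, r = precision_recall(truth, calls)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy case de novo discovery raises recall but also introduces a false positive, which is the trade-off the evaluation needs to quantify.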
pandora with reads from mutated genome
pandora and de novo evaluation
compare routine to show the power of the reference-graph approach
snippy, medaka, nanopolish - using a variety of references
Difficulty in evaluating four-way is “truth”
nucmer to get differences
pandora VCF
Benchmark Nanopore versus Illumina SNP calling and show our algorithms meet the needs of clinical and public health users.
10%
The first step towards clustering a set of genomes is determining a distance matrix.
We define genetic distance to be the sum of genetic discordances, where missing data and heterozygosity do not cause discordance (unless the zygosity does not include the reference allele) and study the clustering this definition generates.
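The definition above can be sketched as a pairwise distance matrix. This is one possible reading, not necessarily the one used in the report: genotypes are hypothetical tuples of allele indices (0 is the reference allele, `None` is missing data), a heterozygous call that includes the reference allele never causes discordance, and a heterozygous call without the reference allele is discordant only when it shares no allele with the other sample.

```python
REF = 0  # assumed index of the reference allele


def site_discordant(a, b):
    """True if genotypes a and b are discordant at one site (assumed interpretation)."""
    if a is None or b is None:
        return False  # missing data never causes discordance
    sa, sb = set(a), set(b)
    het_a, het_b = len(sa) > 1, len(sb) > 1
    if not het_a and not het_b:
        return sa != sb  # two homozygous calls: discordant if alleles differ
    # At least one heterozygous call: tolerated if any het call includes the reference
    for alleles, het in ((sa, het_a), (sb, het_b)):
        if het and REF in alleles:
            return False
    return sa.isdisjoint(sb)


def distance_matrix(samples):
    """Sum of per-site discordances for every pair of samples."""
    n = len(samples)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(site_discordant(a, b) for a, b in zip(samples[i], samples[j]))
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical genotypes for three samples at four sites
samples = [
    [(0, 0), (1, 1), None, (0, 1)],
    [(1, 1), (1, 1), (0, 0), (1, 1)],
    [(0, 0), (0, 0), (1, 1), (1, 2)],
]
print(distance_matrix(samples))
```

The resulting matrix can be passed to any standard clustering routine.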
M. tuberculosis public health applications
How can pandora be used to better meet these requirements?
Show that Nanopore sequencing is now capable of performing these tasks
Same isolate DNA extraction sequenced on both Illumina and Nanopore
clockwork, combining the best of samtools and cortex
samtools with some filtering and masking
Baseline Illumina/Nanopore concordance, using PacBio as a validation (where we have it)
Using four PRGs of varying complexity:
pandora
pandora
with “best” PRGImprove upon current whole-genome sequencing-based drug resistance prediction for M. tuberculosis using genome graphs.
Mykrobe uses a panel of resistance markers to predict drug resistance from WGS data for M. tuberculosis and Staphylococcus aureus.
The predictive power of Mykrobe is likely to expand during this PhD due to the CRyPTIC consortium.
Nanopore concordance with Illumina for phenotype prediction in M. tuberculosis
Given the collection of SNPs and indels identified as being necessary for resistance to the 14 major drugs tested, we want to show that we can detect them as well with Nanopore data as we can with Illumina.
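One simple way to summarise Illumina/Nanopore agreement is per-drug concordance of the predicted phenotypes. The sketch below assumes predictions are stored as hypothetical nested dictionaries (sample, then drug, then "R" or "S"); the sample names, drugs and calls are invented for illustration.

```python
def per_drug_concordance(illumina, nanopore):
    """Fraction of samples with identical predictions on both platforms, per drug.

    Only samples and drugs present in both inputs are compared.
    """
    agree, total = {}, {}
    for sample in illumina.keys() & nanopore.keys():
        for drug in illumina[sample].keys() & nanopore[sample].keys():
            total[drug] = total.get(drug, 0) + 1
            if illumina[sample][drug] == nanopore[sample][drug]:
                agree[drug] = agree.get(drug, 0) + 1
    return {drug: agree.get(drug, 0) / total[drug] for drug in total}


# Hypothetical predictions for two isolates sequenced on both platforms
illumina = {
    "s1": {"rifampicin": "R", "isoniazid": "S"},
    "s2": {"rifampicin": "S", "isoniazid": "R"},
}
nanopore = {
    "s1": {"rifampicin": "R", "isoniazid": "S"},
    "s2": {"rifampicin": "S", "isoniazid": "S"},
}
print(per_drug_concordance(illumina, nanopore))
```

The same comparison could be applied at the level of individual resistance-associated SNPs and indels rather than final phenotype calls.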
Mykrobe and TBProfiler - small sample sizes (n<6) used to validate
TBProfiler uses pileup approach - poor indel power. Indels are important for resistance to some drugs
Mykrobe uses k-mer mapping - requires high coverage. K-mers considered in isolation
Pandora
Can use smaller k-mer size than Mykrobe as it takes context into account. Therefore it theoretically requires less coverage.
Can call novel variants (Chapter 1)
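A back-of-envelope illustration of why a smaller k-mer size can reduce the read coverage needed, assuming independent, uniformly distributed sequencing errors (a simplification, especially for Nanopore). Under that assumption a k-mer is error-free with probability (1 - e)^k, so shorter k-mers survive errors more often. The function name and all parameter values are hypothetical and chosen only for illustration.

```python
def expected_error_free_kmer_coverage(read_coverage, read_length, k, error_rate):
    """Expected number of error-free k-mers covering a genome position.

    Assumes independent per-base errors at rate `error_rate`.
    """
    # Each read of length L contributes L - k + 1 k-mers
    positional = (read_length - k + 1) / read_length
    # Probability that all k bases of a k-mer are correct
    error_free = (1.0 - error_rate) ** k
    return read_coverage * positional * error_free


# Illustrative values only: 30x coverage, 5 kb reads, 10% per-base error rate
for k in (15, 21, 31):
    cov = expected_error_free_kmer_coverage(30, 5000, k, 0.10)
    print(f"k={k}: expected error-free k-mer coverage ~ {cov:.2f}")
```

This captures only the effect of k on error-free k-mer coverage; it does not model the use of context described above.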
pandora
pandora output and produces resistance predictions or flag for phenotyping
Mykrobe for Illumina and Nanopore
Curate a high-quality reference pan-genome for M. tuberculosis that includes a detailed map of the pe/ppe genes.
The ability to accurately map sequencing reads to these genes would likely improve our ability to perform variant calling in M. tuberculosis and therefore better determine how isolates relate to each other.
Build a high-quality pan-genome for M. tuberculosis, to allow variant discovery in all genes - ideally including the pe/ppe genes.
Produce a collection of high-quality pe/ppe PRGs with information about what read length will provide reliable mapping, and whether Illumina data can be reliably mapped to them.
Re-analyse data from Chapter 2 and see if we are better able to cluster samples with this new pan-genome with pe/ppe map
Assess variation in pe/ppe genes across 10,000 samples from CRyPTIC
pandora and the work in Chapter 1 is currently in preparation
pandora.
Too far off to say at this stage