Gardylab – A sort-of open science thing.

A long overdue update

So much for blogging on a regular basis.

Having recently launched our public health infovis-specific blog (check it out here), I realized I was horribly remiss in updating my own blog. I will blame this largely on the fact that a) I went off to make science television for a bit and b) we’ve been in more of a data-gathering phase than a data-analyzing phase. That being said, there are a few things worth reporting on.

First, our preprint for TransPhylo: the Next Generation is up! You can find it on biorxiv as we await feedback on our initial journal submission. In this version, we’ve added the ability to account for unsampled cases within an outbreak, which is HUGELY useful because we can’t always identify and sequence every case, even for fairly tidy outbreaks like TB in a low-incidence setting like Canada. We’ve also made it possible to run the analysis as an outbreak is unfolding – no need to wait til it’s over. This is, obviously, rather useful.

Second, the last of our M. tuberculosis genomes from our big provincial sequencing project finally came through! The data came in fits and starts – the 276 I previously mentioned here came back in March, a good batch of them came in throughout May, and then the final plate, which maybe got lost in the couch cushions or something over at the sequencing centre, came in just two weeks ago. Our chief genome-pusher is currently on a post-wedding vacation; as soon as she’s back, we’ll get these out into a public repository, tagged with some useful metadata.

Everything else that’s on my plate at the moment is small busy-work sorts of things as we wind down the summer. I’ve been rooting around in an unusual NTM genome sequence, playing with our first nanopore data, analysing a 140M point dataset (human genomic data, the horror!), and getting a load of manuscripts from various projects I collaborate on out the door. And cat videos. Been watching a lot of cat videos.

Uncategorized

Hackathon!

It has been a busy few days! After teaching my course last Thursday, I flew back over the ocean again (the same steward who had worked my Monday flight LHR-YVR was surprised to see me again. He said “what do you do for work? Human pinball?”). After a one-day mini-break which involved taking in a Premiership match featuring my favourite team (Chelsea!), it was up to Birmingham for a day to hang out with the fabulous Nick Loman. We are working on a couple of reviews together at the moment.

From Birmingham it was on to Oxford yesterday morning, to launch a hackathon event. Genome BC and Genomics England signed a data-sharing MOU last year, and as part of this, teams in three areas (one of which is infectious disease) are demonstrating that we can share data/protocols/pipelines/resources/training materials/etc… across the two jurisdictions. In the ID group, which is led by me in BC and Derrick Crook at Oxford, we have a few tasks on our plate: benchmark the pipeline the Oxford team is currently using for TB variant calling and explore alternatives, redesign the clinical reporting form that the pipeline outputs, and work towards harmonizing the metadata we collect for TB isolates and the ontology we use to store that data. The first objective is best achieved through a hackathon model, where everyone gets in a room and works on a common problem.

We started on Monday – Zam Iqbal arranged a space for us at the Wellcome Trust Centre for Human Genetics in Oxford. His team and I both brought a few different datasets to the table – an in silico “spike-in” dataset where bases have been artificially mutated in the sequence file, then used to generate virtual reads; a series of clustered isolates and technical replicates from Oxford’s earlier work on TB genomics; an F11 resequencing dataset graciously provided by Ashlee Earl at the Broad; and a lab contamination dataset I had handy. Zam’s group got a head start running the spike and cluster datasets through three pipelines of theirs: the current Oxford TB pipeline, and two newer reference mappers/variant callers, Cortex and Platypus. Torsten Seemann is part of the hackathon and is testing out his Nullarbor on the data, and I’m looking at my basic pipeline (BWA mem, samtools, filter on quality and repetitive regions) as well as Pilon-polished outputs of the basic model. By comparing the VCFs we’re getting out of each method, we can get a sense for which is giving us the most reliable data with a minimum of parameterizing and fussing about, and we can also explore how to best filter VCFs that come out. There’s about an order of magnitude difference in the # of SNPs coming out of stringently-tuned-and-filtering pipelines versus well-tuned-but-not-filtering pipelines, and I think the F11 data will be key to letting us know whether we’ve got undercalling in the stringent pipelines or overcalling in the more flexible methods.
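For anyone who wants to picture the VCF comparison, here is a minimal Python sketch of scoring one pipeline's calls against the spike-in truth set. It assumes uncompressed VCF text and only counts simple SNPs (single-base REF and ALT); the function names and the toy records are made up for illustration and are not taken from any of the pipelines mentioned above.

```python
from typing import Dict, Iterable, Set, Tuple

Snp = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def snps_from_vcf_lines(lines: Iterable[str]) -> Set[Snp]:
    """Collect simple SNPs (single-base REF and ALT) from VCF text lines."""
    snps: Set[Snp] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                snps.add((chrom, pos, ref, alt))
    return snps


def score_against_truth(called: Set[Snp], truth: Set[Snp]) -> Dict[str, float]:
    """Count shared and pipeline-specific SNPs relative to the truth set."""
    tp = len(called & truth)   # called and really spiked in
    fp = len(called - truth)   # called but not spiked in (possible overcalling)
    fn = len(truth - called)   # spiked in but missed (possible undercalling)
    return {
        "true_pos": tp,
        "false_pos": fp,
        "false_neg": fn,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy records, invented for illustration (tab-separated VCF columns).
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
truth_vcf = [
    HEADER,
    "ref\t100\t.\tA\tG\t.\tPASS\t.",
    "ref\t250\t.\tC\tT\t.\tPASS\t.",
    "ref\t900\t.\tG\tA\t.\tPASS\t.",
]
called_vcf = [
    HEADER,
    "ref\t100\t.\tA\tG\t.\tPASS\t.",
    "ref\t250\t.\tC\tT\t.\tPASS\t.",
    "ref\t400\t.\tT\tC\t.\tPASS\t.",   # not in the truth set
    "ref\t500\t.\tAT\tA\t.\tPASS\t.",  # indel, ignored by this sketch
]

called = snps_from_vcf_lines(called_vcf)
truth = snps_from_vcf_lines(truth_vcf)
scores = score_against_truth(called, truth)
print(scores)
```

Running the same scoring over each pipeline's output gives a like-for-like table of missed versus extra calls, which is the undercalling/overcalling question in a nutshell. A real comparison would also need to handle indels, multi-sample files, and filter flags, none of which this sketch attempts.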

Ultimately, this will help inform the design of the “final” pipeline that is used by both the UK and BC to call SNPs in the TB genomes we’re sequencing as part of routine clinical work. While the ultimate architecture of that may not be completely open-source (i.e. the world won’t be able to query the same databases with the same metadata that we do), the “middle bit”, i.e. the recipe for mapping to a reference and calling high-quality variants, will be. We have chatted with David Aanensen about making this quite interactive, so expect something cool when he and Zam and I next meet up in June.

Most of the hackathon guts are spilling out in our Slack channel, but we’ve got a GitHub repo going and will eventually be using the Wiki there to report on what we’re doing. You can visit the repo and have a poke around but it’s not much yet – expect a more fulsome narrative come Thursday!

Okay, time to pop over to the Microbiology 2016 conference. Nick and I along with some others are on a lunchtime panel on data sharing. I haven’t entirely firmed up my brilliant soundbites, but at the very least I’ll be advocating for making data accessible and, if you can’t talk about or release the product, at least be open about the process.

Scientific snapshot

The Guardian ran a day-in-the-life piece today covering a typical day for a university scientist. It’s fairly spot on, though the accompanying photograph indicates the researcher clearly needs MOAR SCREENZ and there’s much more cake and much less emailing in the day than I would have expected.

My day yesterday was quite typical too – a constant battle to stay on top of emails interspersed with various meetings and prepping some slides for this week’s class. It meant that I didn’t have time for any data analysis (I didn’t even pee til about 3:45), but I’m hoping to launch a couple of big jobs on the cluster today as well as possibly maybe think about sorting out the MD5 checksum error that has been plaguing a recent ENA submission.

I am also in the final stages of planning next week’s little hackathon in Oxford – Genome BC and Genomics England recently signed a genomics data-sharing MOU, and Derrick Crook and I are jointly leading the infectious disease genomics pilot project. We have a few goals, including tweaking aspects of a mycobacterial genomics pipeline as well as designing an intuitive, interpretable report to share TB genomic data in the public health lab context (and that also happens to be regulatory-compliant). We’re tackling the pipeline work first with a little in-person hackathon next week, so I am flying over to the UK tomorrow night to launch that, as well as attend the Microbial Genomics editorial board meeting and be on a panel about data sharing at the Microbiology Society conference.

Kicking tires

IMMEM XI has just wound up and it’s been a fantastic meeting – check out the Twitter stream for people’s recaps. The best part of meetings like these is the time you get to spend with your colleagues, kicking the tires on each other’s pipelines and tools together. Adam Phillippy’s MASH was a big hit (I first used it when he and I were both part of a hackathon right after ABPHM2015 and it led to an interesting story around Mycobacterium chimaera that we’re following up using the Oxford Nanopore), and I’m super excited for WGSA out of the Aanensen group, makers of one of my other favourite platforms, Microreact. WGSA is just for Staph aureus now, but I’m thinking a TB version isn’t far off. Nullarbor from Torsten Seemann is also a standout and builds on his excellent suite of tools.

The conference hasn’t been all listening to talks and trying a variety of Portuguese beers – we are actually doing some work while the meeting is happening, and a smaller group of folks is staying on in Lisbon for a bit for a hackathon. Damion Dooley, the scientific programmer who works for the IRIDA project, is currently getting Nullarbor set up on our BCCDC genomics server, as Torst and David and I are keen to demonstrate how we can couple an analytical pipeline like Nullarbor to a visualization platform like Microreact to share near real-time TB genomic data on a public-facing website, as opposed to some internal system. It sounds easy, but there are a few hurdles:

You have to get your pipeline set up first. For the Nullarbor install, Damion had to get brew set up on our cluster, and brew required ruby. Once ruby was on, there were some brew install issues that Torst and Damion are sorting out. 95% of your time in bioinformatics is spent trying to install stuff and get all the dependencies sorted; 5% of your time is actually analyzing data.
I’m sure we’ll run into something once we get Nullarbor talking to Microreact, but we’ll blog about that when it happens.
We’ve been releasing TB genomic data to SRA/ENA for years, with no metadata attached. This time, we’ll be working with some limited metadata. This shouldn’t create a problem, but I’m not looking forward to writing the email that says “hey, we’re doing this and this is what we’re doing”. I’d rather send an email saying “what do you all think we should do” (because Canadian. Must be polite), but that’s like opening a door that’s previously held back a million bureaucrats, and then suddenly they all burst through the door like a zombie horde on The Walking Dead and everything gets held up for months. What do we need to think about releasing?
- An identifier associated with each genome. This needs to be something we’ve created ourselves that appears nowhere on any of the clinical/lab data associated with a patient. It also needs to have some sort of logic to it. We’ve released outbreak genomes before with made-up IDs, but they were all done as one-off retrospective studies so we just created names that made sense at the time. Our first outbreak was the first large outbreak ever sequenced, so we just called things MTXXX, where XXX was a number starting at 001. I think the ordering was based on when the genomes were processed in the lab too. The latest outbreak was KXXX, and I won’t even try to explain the ordering there. So, how do we move forward labelling these and the rest of our retrospective complete genomic survey, and how do we label moving forward?
- A date. We want to use the sampling date for each specimen, as the day/month/year of sampling is very useful for us when we do analyses like BEAST, but how fine-grained can that date be for public-facing metadata as opposed to the data that we see behind our public health doors? For this first pass, we are likely to go with just year, but moving forward, we will have to see whether the month-year combo is sufficiently general to prevent people going “hey, that’s my TB bacterium” if they happened to be the only entry for a given time period.
- A location. This one is straightforward enough for now – we have regional health authorities in BC, and each one is further subdivided into health service delivery areas. We’ll keep the HSDA data behind our doors, as it’s too close to identifiable. Placing people at centroids on the regional health authority level is doable though.
- Lab resistance data. This one is a no-brainer. We’re interested in TB genomics to resolve transmission patterns in BC, but a larger global issue is being able to rapidly diagnose and resistance-type isolates in a point-of-care fashion to get people on the right therapy right away, and for that we need massive amounts of genomic data annotated with resistance data so we can find new targets. We don’t have a lot of dramatic resistance in BC isolates, but that means our data makes great controls in the sort of bacterial GWAS studies that can mine these targets. We’ve pledged our annotated genomes to the global ReSeqTB project so they’ll be a part of a very high-quality international dataset of TB genomes.
- Anything else? I can’t think of anything that would be especially useful, but we’ll probably chuck a few things in like MIRU-VNTR genotype, lineage, and maybe site of disease as a simple pulmonary/extrapulmonary field.
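To make the identifier/date/location decisions above a bit more concrete, here is a rough Python sketch of turning internal records into public-facing rows. Everything named in it is an assumption for illustration: the `BC0001`-style IDs, the HSDA and region names, the centroid coordinates, and the CSV column names (which are not claimed to be what Microreact requires). It coarsens the sampling date to a year, swaps HSDA for a region centroid, and flags any year/region cell holding a single isolate, which is the “hey, that’s my TB bacterium” worry.

```python
import csv
import io
from collections import Counter

# Hypothetical lookups; real HSDA names and centroids would replace these.
HSDA_TO_REGION = {"HSDA-A1": "Region A", "HSDA-A2": "Region A", "HSDA-B1": "Region B"}
REGION_CENTROID = {"Region A": (50.0, -120.0), "Region B": (54.0, -125.0)}
PUBLIC_FIELDS = ["id", "year", "region", "latitude", "longitude"]


def public_rows(records, prefix="BC"):
    """Build public rows: new sequential ID, year only, region centroid."""
    rows = []
    for n, rec in enumerate(records, start=1):
        region = HSDA_TO_REGION[rec["hsda"]]
        lat, lon = REGION_CENTROID[region]
        rows.append({
            "id": f"{prefix}{n:04d}",          # never derived from the lab ID
            "year": rec["sample_date"][:4],    # ISO date coarsened to year
            "region": region,
            "latitude": lat,
            "longitude": lon,
        })
    return rows


def small_cells(rows, keys=("year", "region"), minimum=2):
    """Return cells with fewer than `minimum` isolates (re-identification risk)."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {cell: n for cell, n in counts.items() if n < minimum}


def to_csv(rows):
    """Serialise public rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=PUBLIC_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up internal records.
internal = [
    {"lab_id": "L-001", "sample_date": "2012-03-14", "hsda": "HSDA-A1"},
    {"lab_id": "L-002", "sample_date": "2012-11-02", "hsda": "HSDA-A2"},
    {"lab_id": "L-003", "sample_date": "2013-06-20", "hsda": "HSDA-B1"},
]
rows = public_rows(internal)
risky = small_cells(rows)
csv_text = to_csv(rows)
print(csv_text)
print("cells needing a second look:", risky)
```

One design note: sequential IDs assigned in input order can leak the order in which records were processed, so the input order itself should not be tied to anything clinical. That is a choice for the people who own the data, not something this sketch settles.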

As I pointed out in my IMMEM talk (pointers to my slides are in my last post), the genomic epidemiology field is at the point where we need to get on board with ontologies for this sort of metadata, and while most of the world dislikes making ontologies, there are – thank your lucky cats – people who like constructing them and are extremely skilled at doing so, like the excellent Emma Griffiths from the IRIDA project. So, when we get around to constructing the CSV file of metadata that Microreact will read in, you can bet your proverbial bottom dollar that we will be talking to Emma to make sure we’re doing it right.

The genomes are also running through Mykrobe right now – a great resistance prediction tool from Phelim Bradley and Zam Iqbal. Data from that to come! Come to think of it, I will include the Mykrobe predictions of resistance and lineage in the Nullarbor-Microreact example.

Okay, time to sleep off all the fish here. It’s almost bedtime in Lisbon, the place where you can survive on a single meal a day because that one meal basically comprised most of the fish in the ocean.

IMMEM XI Slides and zombies

Post author By jlgardy
Post date March 10, 2016

The IMMEM XI meeting in Portugal is fantastic – lots of good friends here, and by eating grilled fish every night I can convince myself that eating a custard pastry every day is still okay.

My keynote followed a tour de force presentation by Ed Feil, who basically covered the entire history of bacterial diversity in 45 minutes. I was somewhat less ambitious, but did cover the ten simple rules I outlined in my earlier blog post. The slides are now up at my Slideshare page, along with slides from my SMBE and IUATLD NAR talks from the last couple of weeks.

I was a little surprised by my survey question in the talk – I asked how many people worked with epidemic models like SIR. A tiny number of hands (a handful of hands? Is that even a thing?) went up. While I don’t think everybody needs to be able to do the math, I find that thinking about cases and how they move through compartments during an outbreak (e.g. susceptible, exposed, infected, removed, etc…) is very helpful for understanding transmission. If you want a gentle introduction to compartment models (and zombies), try Robert Smith?’s (yes, that is a ?) excellent 2009 paper When Zombies Attack: Mathematical Modelling of an Outbreak of Zombie Infection.
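If you have never played with a compartment model, a toy version fits in a few lines. The sketch below is a plain deterministic SIR model stepped forward with simple Euler updates; the parameter values (`beta`, `gamma`, population size) are arbitrary numbers picked for illustration, not estimates for any real pathogen, and it leaves out the exposed compartment for brevity.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Euler steps for dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I.

    Returns a list of (time, S, I, R) tuples.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # S -> I during this step
        new_removals = gamma * i * dt           # I -> R during this step
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((k * dt, s, i, r))
    return history


# Arbitrary illustrative values: one infected person in a population of 1000.
history = simulate_sir(s0=999, i0=1, r0=0, beta=0.3, gamma=0.1, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of about {peak_infected:.0f} infected around day {peak_time:.0f}")
```

The useful habit is the bookkeeping rather than the math: every case has to be in exactly one compartment at any time, and the only way to change the picture is to move someone from one box to the next.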

Ten Simple Rules – genomic epidemiology edition

Post author By jlgardy
Post date March 7, 2016

Monday morning status update: attempting to clear out most of the inbox backlog before leaving this afternoon for IMMEM XI in sunny Portugal. And I suppose I should write my IMMEM keynote talk at some point before Wednesday afternoon too. My assigned topic is integrating genomic data with epidemiological data for outbreak investigation, and I’m thinking of doing a PLoS Comp Bio “Ten Simple Rules” sort of thing, with anecdotes from our work and other key studies in the genomic epidemiology space.

What have I got so far? Here are a few; feel free to comment with more (if I use them in the talk, you’ll get a shout-out on the slide):

Clinical bits to remember:

1. Know your pathogen. Different bugs have different quirks that will affect an outbreak reconstruction, like latency, asymptomatic carriage, and varying levels of infectiousness.
2. Think SIIR – Susceptible, Infected, Infectious, Removed. Each patient will move through these stages in different ways. For each individual, try to plot their potential trajectory. For some bugs with well-defined stages, like measles, this is easy. For others, mehhhhh, not so much.
3. Not all hosts are created equal. Within-host genetic diversity already bit us in the @$$ (see this post for a bit of background on that), and both recent and older work has shown that in diseases like TB, the within-host mutation rate fluctuates rather wildly. Stringent thresholds for person-to-person transmission might miss events involving a mutational-hotspot host.

Epidemiological bits to remember:

1. People, places, food/water, and things all need to be considered. In any given outbreak, at least two of these things are probably in play, if not more. Your reconstruction diagram shouldn’t be just a bunch of people connected by arrows – there’s always more to the story.
2. When the genomic data says “maybe” and the field epis say “definitely”, believe the epis. Genomics is great for ruling out transmission events, but in complex outbreaks, the ruling-in part is harder. I’d trust a public health nurse over a model any day, especially for outbreaks like TB.
3. The line list is only half the story. Talk to the epis and nurses who collected the data about each case, especially if you’re trying to link two cases and the line list isn’t giving you much to work with. There’s always way more to the story than what ends up in the Excel spreadsheet.
4. Following on from #3, learn to ~~love~~ tolerate Excel. Epidemiologists love a good spreadsheet. Tools like Microreact (David Aanensen) and NextFlu (Trevor Bedford) are helping move them into the next generation, but for now, expect to be sent about a million Excel sheets.
5. Spread the ontology gospel. Groups like IRIDA (van Domselaar/Brinkman/Hsiao) are working towards developing a shared vocabulary for genomic epidemiology – try repackaging what you dig up out of those Excel files into these standardized vocabularies.

Big picture bits to remember:

1. Genomics and phylogenetics is a strange and foreign landscape for most epidemiologists, but they’re enthusiastic to learn. We need to make our work accessible and open, and provide more opportunities to train our colleagues in understanding and interpreting genomic data.
2. Public health agencies change slowly. Don’t expect to march in with a new ontology and a new platform for collecting and displaying data and revamp the whole system overnight. Public health changes in response to evidence – when you can show that something works better, cheaper, and faster, then you can start making changes. Moving genomics to routine clinical use takes years of relationship-building.

So, uh, that’s ten. But my “Ten Simple Rules” can totally turn into 12, 15, 18, 36, 47, etc simple rules, so feel free to chime in with new ones below, or comments/feedback/examples on the ones above. I should probably add the caveat that for each of those, your experience may vary depending on your pathogen and the team you’re working with.

Gardylab goes live

Here begins the era of the open Gardylab notebook.

This week, we received the fastq data from 276 M. tuberculosis genomes we had sent down to the BCGSC for sequencing. A few years back, we started a project to resurrect all the TB isolates we had in our freezers (~4500, going back to 1999), MIRU-VNTR genotype a bunch of them (we were able to do the complete decade from 01/2005 through 12/2014, ~2500 bugs in all), and then sequence all those isolates with a genotype in common with at least one other isolate – these represent potential cases of recent transmission, though only genomic data can provide high enough resolution to confirm/refute said recent transmission.
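The selection step (“sequence every isolate whose genotype is shared with at least one other isolate”) boils down to a group-by. Here is a small Python sketch of that logic; the isolate names and pattern labels are placeholders, and treating a MIRU-VNTR genotype as a single string that must match exactly is an assumption of the sketch.

```python
from collections import defaultdict
from typing import Dict, List


def clustered_isolates(genotypes: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each genotype shared by two or more isolates to those isolates."""
    by_pattern = defaultdict(list)
    for isolate, pattern in genotypes.items():
        by_pattern[pattern].append(isolate)
    # Keep only genotypes seen in at least two isolates (potential clusters).
    return {p: sorted(ids) for p, ids in by_pattern.items() if len(ids) >= 2}


# Placeholder data: isolate name -> genotype pattern.
genotypes = {
    "iso1": "pattern-A",
    "iso2": "pattern-B",
    "iso3": "pattern-A",
    "iso4": "pattern-C",
    "iso5": "pattern-A",
}
clusters = clustered_isolates(genotypes)
to_sequence = sorted(i for ids in clusters.values() for i in ids)
print(clusters)
print("send for sequencing:", to_sequence)
```

Isolates with a unique genotype (iso2 and iso4 here) drop out, which is the point: a shared genotype only marks a candidate for recent transmission, and the genomes do the confirming or refuting.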

The extracted DNA varied rather wildly in concentration – our Nanodrop was giving readings that were sometimes orders of magnitude higher than what the GSC saw with Quant-iT. On average, the Nanodrop reported concentrations (ng/uL) that were 13x higher than what Quant-iT showed. A table comparing the quants from both systems on seven plates’ worth of isolates is up at Figshare.
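For the curious, this is the kind of calculation behind an “on average, 13x higher” statement, sketched in Python. The concentrations below are invented, and reading “on average” as the mean of per-sample ratios is an assumption; the median is shown alongside because a handful of wildly off samples can drag a mean around.

```python
from statistics import mean, median


def fold_differences(nanodrop, quant_it):
    """Per-sample ratio of Nanodrop to Quant-iT concentration (same units)."""
    # Skip samples with a zero or negative Quant-iT reading to avoid dividing by zero.
    return [n / q for n, q in zip(nanodrop, quant_it) if q > 0]


# Invented concentrations in ng/uL for three samples.
nanodrop = [120.0, 60.0, 45.0]
quant_it = [10.0, 4.0, 5.0]

ratios = fold_differences(nanodrop, quant_it)
print("per-sample fold difference:", ratios)
print("mean:", mean(ratios), "median:", median(ratios))
```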

We looked through the 644 isolates that had initial QC data and identified three plates’ worth of isolates that we wanted to test out on the HiSeq before going any further with the sequencing – we made up one plate of high-concentration samples, one of medium, and one of low. The TechDev team at the BCGSC tried out a new library prep method for low-input samples and the samples went into the queue a few months ago.

The fastq data arrived in our inboxes this week, and we were curious to see how well the low-concentration samples fared. GOOD NEWS. Everything worked brilliantly. When we assembled against the H37Rv reference, the minimum % of reference covered was 98.89% and the max 99.99%. All but two samples had average depth of coverage >100x, and the other two were 27x and 35x – enough for good variant calling.
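The two QC numbers quoted here can be computed from per-position depths. The Python sketch below assumes input in the three-column, tab-separated form (reference name, position, depth) that `samtools depth -a` writes, with zero-depth positions included; defining “covered” as depth of at least 1 is an assumption of the sketch, and the toy lines are invented.

```python
def coverage_summary(depth_lines, min_depth=1):
    """Percent of reference positions covered and mean depth across all positions."""
    depths = [int(line.split("\t")[2]) for line in depth_lines if line.strip()]
    if not depths:
        raise ValueError("no depth records supplied")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "pct_covered": 100 * covered / len(depths),
        "mean_depth": sum(depths) / len(depths),
    }


# Invented four-position "genome": one uncovered base, three covered.
toy_depths = ["ref\t1\t0", "ref\t2\t10", "ref\t3\t20", "ref\t4\t30"]
summary = coverage_summary(toy_depths)
print(summary)
```

If the zero-depth positions are left out of the input, both numbers come out inflated, so it is worth checking that the number of records matches the reference length.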

This is only a subset of our complete dataset, which we’ll be exploring to look at transmission dynamics of TB within BC, but over the next weeks/months as we wait for the rest of the genotypically-clustered genomes to come in, we will be rooting around in these a bit to see what tumbles out.

Other things to note this week:

I got an award! I am an SFU Outstanding Alumnus! And the Prime Minister showed up too.
Our Member Empowerment Taskforce – part of the American Society for Microbiology’s Communications Committee – had our first meeting of 2016 via phone on Wednesday. We have developed and piloted a 3-hour workshop on communicating science to the public, geared towards ASM members. This year we’ll be finalising the content and rolling it out at branch meetings, and we hope to get a train-the-trainer kit together so that at Microbe 2017, we can host a training workshop for people who want to deliver the communication workshop at their own organizations.
Lots of radio interviews this week for a science documentary I am hosting that premieres next week: While You Were Sleeping is about the science of sleep and it airs Thursday 10th March on CBC (in Canada).

Behind the preprint: Declaring a tuberculosis outbreak over with genomic epidemiology

Post author By jlgardy
Post date March 4, 2016

~~I’ve got some time~~, it’s Friday at 2:30 and I don’t want to start anything new or look at my to-do list, so, inspired by Nick Loman’s behind-the-paper blog post about his Ebola nanopore sequencing work, here’s the behind-the-scenes story of a preprint we (myself, Caroline Colijn, her student Hollie-Ann Hatherell, and Xavier Didelot, on behalf of a larger author group) recently posted to biorxiv.org.

But first, the paper…

Declaring a tuberculosis outbreak over with genomic epidemiology

We report an updated method for inferring the time at which an infectious disease was transmitted between persons from a time-labelled pathogen genome phylogeny. We applied the method to 48 Mycobacterium tuberculosis genomes as part of a real-time public health outbreak investigation, demonstrating that although active tuberculosis (TB) cases were diagnosed through 2013, no transmission events took place beyond mid-2012. Subsequent cases were the result of progression from latent TB infection to active disease and not recent transmission. This evolutionary genomic approach was used to declare the outbreak over in January 2015.

Alright, story time. Back in December 2010, my co-worker (and cat aficionado, which I can say here because I bet he won’t ever read this post) Jay Johnston and I gave a BCCDC Grand Rounds talk on our work using genomics to unravel a tuberculosis outbreak (see the NEJM article for more on that story). These things are webcast throughout BC and by the time I had packed up my computer and walked the two flights of stairs back to my office, I already had a voicemail on my phone from Rob Parker, the then-Medical Health Officer in Kelowna, BC. He had been dealing with a large outbreak of TB in the region and wanted to know if we could use our genomics approach to figure out whether their outbreak management strategy was on the right track.

In late 2011, we sequenced the first 33 cases of the outbreak, along with 7 cases from elsewhere in BC with the same MIRU-VNTR genotype but no epi-link to the outbreak. We got the data back in early 2012, and noticed something odd. Three SNVs separated cases #1 and #2, even though we know #1 had to have infected #2. 3 SNVs is around six years’ worth of evolution in TB, so we were a bit confused. Then we realized we’d not accounted for within-host diversity – variants that arose in a host and that were then transmitted on. This was also around the time that a MRSA outbreak paper came out and Twitter realized that this diversity is an issue – see Ed Feil’s guest post at Nick Loman’s blog for more on that.
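The back-of-envelope reasoning that confused us can be written down in a few lines of Python. The rate of 0.5 SNVs per genome per year is simply what “3 SNVs is around six years” implies, used here as a rule of thumb rather than a measured estimate, and the two short sequences are invented. The sketch counts differences between two aligned sequences and converts them into a naive elapsed time, which is exactly the calculation that within-host diversity breaks.

```python
def snv_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ, skipping N and gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


def naive_years(snvs, snvs_per_year=0.5):
    """Naive time implied by an SNV count under a constant rate (rule of thumb)."""
    return snvs / snvs_per_year


# Invented aligned sequences standing in for cases #1 and #2.
case_1 = "ACGTACGTAC"
case_2 = "ATGAACGCNC"

distance = snv_distance(case_1, case_2)
print(distance, "SNVs, naively about", naive_years(distance), "years apart")
```

The naive answer treats every difference as having accumulated since transmission. If some of those variants already existed within the source patient and one lineage was passed on, the clock reading overstates the time between the two cases.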

I had been working with Caroline Colijn at Imperial College London on a project investigating how different patterns of outbreak spread lead to different structures within a phylogeny, and when she saw this diversity issue, she teamed up with Xavier Didelot, also at ICL, to tackle it. The result was TransPhylo, which we published in MBE in 2014 and which is currently undergoing an update and a port to R. Expect a preprint on that soon. TransPhylo can infer potential person-to-person transmissions from pathogen genome data, and can also infer when those transmissions occurred. This will become important later.

Anyway, many years after they first asked, we were able to give the Kelowna public health team a nice outbreak reconstruction and tell them “yes, your management strategy is working”.

A few more years passed, and on one of our regular outbreak management team conference calls, the new Medical Health Officer for Kelowna, Sue Pollock, brought up the idea of declaring the outbreak over – incident cases had been declining, and only one case was diagnosed in 2014. There’s no great definition for a TB outbreak being “over”, but the generally accepted wisdom is that if you don’t have transmissions occurring for two years, you can stand down (though you’ll always get a long tail of cases that continue to activate over many years). The problem is that because TB can go latent and then wake up years after the actual transmission, it’s hard to tell whether a case you’ve just diagnosed is someone who was just infected or whether they were infected years earlier and are just progressing to active disease now. If the former, boo – your outbreak isn’t over. If the latter, congratulations.

Sue was familiar with the TransPhylo method and how it could date the time at which a person had likely become infected, and she raised the idea of using genomics to determine whether the outbreak had truly ended – were the cases we saw in 2013/14 the result of recent transmission or had they been infected earlier? Patrick Tang, who was leading BCCDC’s mycobacteriology laboratory at the time, extracted DNA from the 15 cases we had diagnosed since our 2011 genomics study and got them onto our in-house MiSeq in late 2014. The data arrived right before Christmas, and by early January 2015, Xavier and Caroline had reported some preliminary TransPhylo data back suggesting the last transmission occurred in mid-2012. We shared this with Sue and her team on a conference call in late January, and the outbreak was declared over shortly thereafter.
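Stripped of all the inference, the decision rule is simple enough to sketch in Python. The dates below are hypothetical stand-ins (only the last one echoes the “mid-2012” result), and the sketch treats each inferred infection time as a single date, whereas a method like TransPhylo reports uncertainty around those times. The two-year quiet period is the rule of thumb mentioned earlier in this post.

```python
from datetime import date


def years_between(earlier, later):
    """Elapsed time in years, using an average year length of 365.25 days."""
    return (later - earlier).days / 365.25


def outbreak_over(inferred_infection_dates, as_of, quiet_years=2.0):
    """Return (is_over, last_transmission) under a quiet-period rule."""
    last = max(inferred_infection_dates)
    return years_between(last, as_of) >= quiet_years, last


# Hypothetical inferred infection dates for a handful of cases.
inferred = [date(2010, 5, 1), date(2011, 9, 15), date(2012, 6, 30)]

over_2015, last_seen = outbreak_over(inferred, as_of=date(2015, 1, 31))
over_2013, _ = outbreak_over(inferred, as_of=date(2013, 12, 31))
print("last inferred transmission:", last_seen)
print("over as of Jan 2015?", over_2015, "| over as of Dec 2013?", over_2013)
```

Note that the rule runs on inferred infection dates, not diagnosis dates. A case diagnosed in 2014 does not reset the clock if the genomes say the infection happened years earlier.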

Caroline’s student Hollie had an idea for improving the accuracy of our timing inference by replacing TransPhylo’s SIR model with a branching model, so she set to work on that task. When we revisited the data with her updated model later in 2015, we again got the happy result that transmission ended mid-2012. Hollie presented the work as a poster at Epidemics, and as a group, we wrote it up in January 2016. That’s the preprint that’s up on biorxiv now – a brief writeup of the analysis and details on the branching model.

The genomic data for this paper is (mostly) up at ENA – I say “mostly” because all the genome files are on their server but I keep getting an MD5 checksum error for many of the files that I haven’t had a chance to fix yet. As soon as I get that sorted (next few weeks, hopefully), I’ll write up a little bit about the dataset and post it here.
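For anyone fighting the same battle, here is a minimal Python sketch for recomputing MD5 checksums locally and comparing them with the values registered for a submission. It assumes the error means the registered checksum and the uploaded file disagree; the function names and the idea of holding the expected values in a dictionary are illustrative, not part of any ENA tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_mismatches(expected):
    """Given {path: expected_md5}, return the paths whose local MD5 differs."""
    return [
        path
        for path, md5 in expected.items()
        if md5_of_file(path) != md5.strip().lower()
    ]
```

A common culprit is computing the checksum on one version of a file (say, before compression) and uploading another, so it is worth hashing exactly the bytes that were transferred.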

Tags paper, tuberculosis