Hackathon!

It has been a busy few days! After teaching my course last Thursday, I flew back over the ocean again (the same steward who had worked my Monday LHR-YVR flight was surprised to see me again; he said, “What do you do for work? Human pinball?”). After a one-day mini-break which involved taking in a Premiership match featuring my favourite team (Chelsea!), it was up to Birmingham for a day to hang out with the fabulous Nick Loman. We are working on a couple of reviews together at the moment.

From Birmingham it was on to Oxford yesterday morning, to launch a hackathon event. Genome BC and Genomics England signed a data-sharing MOU last year, and as part of this, teams in three areas (one of which is infectious disease) are demonstrating that we can share data, protocols, pipelines, resources, training materials and so on across the two jurisdictions. In the ID group, which is led by me in BC and Derrick Crook at Oxford, we have a few tasks on our plate: benchmark the pipeline the Oxford team is currently using for TB variant calling and explore alternatives, redesign the clinical reporting form that the pipeline outputs, and work towards harmonizing the metadata we collect for TB isolates and the ontology we use to store that data. The first objective is best achieved through a hackathon model, where everyone gets in a room and works on a common problem.

We started on Monday – Zam Iqbal arranged a space for us at the Wellcome Trust Human Genome Campus in Oxford. His team and I both brought a few different datasets to the table – an in silico “spike-in” dataset where bases have been artificially mutated in the sequence file, then used to generate virtual reads; a series of clustered isolates and technical replicates from Oxford’s earlier work on TB genomics; an F11 resequencing dataset graciously provided by Ashlee Earl at the Broad; and a lab contamination dataset I had handy. Zam’s group got a head start running the spike and cluster datasets through three pipelines of theirs: the current Oxford TB pipeline, and two newer reference mappers/variant callers, Cortex and Platypus. Torsten Seemann is part of the hackathon and is testing out his Nullarbor on the data, and I’m looking at my basic pipeline (BWA mem, samtools, filter on quality and repetitive regions) as well as Pilon-polished outputs of the basic model. By comparing the VCFs we’re getting out of each method, we can get a sense for which is giving us the most reliable data with a minimum of parameterizing and fussing about, and we can also explore how best to filter the VCFs that come out. There’s about an order of magnitude difference in the number of SNPs coming out of stringently-tuned-and-filtering pipelines versus well-tuned-but-not-filtering pipelines, and I think the F11 data will be key to letting us know whether we’ve got undercalling in the stringent pipelines or overcalling in the more flexible methods.
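For the curious, here is roughly what “comparing the VCFs” can look like in practice. This is only a minimal sketch in plain Python, not the code we are actually running: the function names (`snp_sites`, `compare_calls`, `toy_vcf`), the contig name `ref` and the toy positions are all made up for illustration. It pulls single-base substitutions out of VCF-formatted text and scores one call set against a truth set, such as the known positions of the spiked-in mutations.

```python
def snp_sites(vcf_text):
    """Return the set of (chrom, pos, ref, alt) single-base calls in VCF text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        for alt in fields[4].split(","):
            # keep simple substitutions only; indels and '.' are ignored
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                sites.add((chrom, pos, ref, alt))
    return sites


def compare_calls(truth, called):
    """Score a call set against a truth set (e.g. the spiked-in mutations)."""
    true_pos = len(truth & called)
    return {
        "true_pos": true_pos,
        "false_pos": len(called - truth),
        "false_neg": len(truth - called),
        "precision": true_pos / len(called) if called else 0.0,
        "recall": true_pos / len(truth) if truth else 0.0,
    }


# Toy data, invented purely for illustration
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"


def toy_vcf(positions):
    """Build a tiny VCF body with an A>G call at each position."""
    rows = [f"ref\t{pos}\t.\tA\tG\t50\tPASS\t." for pos in positions]
    return HEADER + "\n".join(rows)


truth = snp_sites(toy_vcf([100, 200, 300]))
strict = snp_sites(toy_vcf([100, 200]))                  # misses one real SNP
lenient = snp_sites(toy_vcf([100, 200, 300, 400, 500]))  # adds two extra calls

for name, calls in [("strict", strict), ("lenient", lenient)]:
    print(name, compare_calls(truth, calls))
```

In the toy data the “strict” set loses recall and the “lenient” set loses precision, which is the undercalling-versus-overcalling question in miniature. Real VCFs need more care (multi-sample columns, normalisation of indels and multi-allelic sites), so treat this as a starting point only.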

Ultimately, this will help inform the design of the “final” pipeline that is used by both the UK and BC to call SNPs in the TB genomes we’re sequencing as part of routine clinical work. While the ultimate architecture of that may not be completely open-source (i.e. the world won’t be able to query the same databases with the same metadata that we do), the “middle bit”, i.e. the recipe for mapping to a reference and calling high-quality variants, will be. We have chatted with David Aanensen about making this quite interactive, so expect something cool when he and Zam and I next meet up in June.
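To give a flavour of the filtering part of such a recipe, here is a small sketch of dropping calls that are low quality or fall in repetitive regions. Everything specific in it is an assumption for illustration: the threshold `MIN_QUAL`, the coordinates in `MASKED`, the contig name `ref` and the function names are hypothetical, and the real pipeline’s cutoffs and mask are exactly what the benchmarking is meant to settle.

```python
MIN_QUAL = 30.0  # hypothetical quality cutoff, not a recommended value

# Hypothetical repetitive regions as (chrom, start, end), 1-based inclusive
MASKED = [("ref", 1000, 1500), ("ref", 4000, 4200)]


def in_masked_region(chrom, pos, regions):
    """True if the position falls inside any masked interval."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def filter_calls(calls, min_qual=MIN_QUAL, regions=MASKED):
    """Keep (chrom, pos, qual) calls that pass the quality cutoff and lie
    outside the masked regions."""
    return [
        (chrom, pos, qual)
        for chrom, pos, qual in calls
        if qual >= min_qual and not in_masked_region(chrom, pos, regions)
    ]


# Toy calls, invented for illustration
calls = [
    ("ref", 500, 99.0),   # kept
    ("ref", 1200, 99.0),  # dropped: inside a masked region
    ("ref", 2500, 12.0),  # dropped: below the quality cutoff
    ("ref", 4200, 60.0),  # dropped: on the inclusive edge of a masked region
    ("ref", 5000, 45.0),  # kept
]
kept = filter_calls(calls)
print(kept)
```

A linear scan over the mask is fine for a handful of regions; with a long mask, a sorted interval lookup would be the obvious next step.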

Most of the hackathon guts are spilling out in our Slack channel, but we’ve got a GitHub repo going and will eventually be using the Wiki there to report on what we’re doing. You can visit the repo and have a poke around but it’s not much yet – expect a fuller narrative come Thursday!

Okay, time to pop over to the Microbiology 2016 conference. Nick and I along with some others are on a lunchtime panel on data sharing. I haven’t entirely firmed up my brilliant soundbites, but at the very least I’ll be advocating for making data accessible and, if you can’t talk about or release the product, at least be open about the process.