Kicking tires
IMMEM XI has just wound up and it’s been a fantastic meeting – check out the Twitter stream for people’s recaps. The best part of meetings like these is the time you get to spend with your colleagues, kicking the tires on each other’s pipelines and tools together. Adam Phillippy’s MASH was a big hit (I first used it when he and I were both part of a hackathon right after ABPHM2015 and it led to an interesting story around Mycobacterium chimaera that we’re following up using the Oxford Nanopore), and I’m super excited for WGSA out of the Aanensen group, makers of one of my other favourite platforms, Microreact. WGSA is just for Staph aureus now, but I’m thinking a TB version isn’t far off. Nullarbor from Torsten Seemann is also a standout and builds on his excellent suite of tools.
The conference hasn’t been all listening to talks and trying a variety of Portuguese beers – we are actually doing some work while the meeting is happening, and a smaller group of folks is staying on in Lisbon for a bit for a hackathon. Damion Dooley, the scientific programmer who works for the IRIDA project, is currently getting Nullarbor set up on our BCCDC genomics server, as Torst and David and I are keen to demonstrate how we can couple an analytical pipeline like Nullarbor to a visualization platform like Microreact to share near real-time TB genomic data on a public-facing website, as opposed to some internal system. It sounds easy, but there are a few hurdles:
- You have to get your pipeline set up first. For the Nullarbor install, Damion had to get brew set up on our cluster, and brew required ruby. Once ruby was on, there were some brew install issues that Torst and Damion are sorting out. 95% of your time in bioinformatics is spent trying to install stuff and get all the dependencies sorted; 5% of your time is actually analyzing data.
- I’m sure we’ll run into something once we get Nullarbor talking to Microreact, but we’ll blog about that when it happens.
- We’ve been releasing TB genomic data to SRA/ENA for years, with no metadata attached. This time, we’ll be working with some limited metadata. This shouldn’t create a problem, but I’m not looking forward to writing the email that says “hey, we’re doing this and this is what we’re doing”. I’d rather send an email saying “what do you all think we should do” (because Canadian. Must be polite), but that’s like opening a door that’s previously held back a million bureaucrats, and then suddenly they all burst through the door like a zombie horde on The Walking Dead and everything gets held up for months. What do we need to think about releasing?
- An identifier associated with each genome. This needs to be something we’ve created ourselves that appears nowhere on any of the clinical/lab data associated with a patient. It also needs to have some sort of logic to it. We’ve released outbreak genomes before with made-up IDs, but they were all done as one-off retrospective studies so we just created names that made sense at the time. Our first outbreak was the first large outbreak ever sequenced, so we just called things MTXXX, where XXX was a number starting at 001. I think the ordering was based on when the genomes were processed in the lab too. The latest outbreak was KXXX, and I won’t even try to explain the ordering there. So, how do we label these and the rest of our retrospective complete genomic survey, and how do we label new genomes moving forward?
- A date. We want to use the sampling date for each specimen, as the day/month/year of sampling is very useful for us when we do analyses like BEAST, but how fine-grained can that date be for public-facing metadata as opposed to the data that we see behind our public health doors? For this first pass, we are likely to go with just year, but moving forward, we will have to see whether the month-year combo is sufficiently general to prevent people going “hey, that’s my TB bacterium” if they happened to be the only entry for a given time period.
- A location. This one is straightforward enough for now – we have regional health authorities in BC, and each one is further subdivided into health service delivery areas (HSDAs). We’ll keep the HSDA data behind our doors, as it’s too close to identifiable. Placing people at centroids on the regional health authority level is doable though.
- Lab resistance data. This one is a no-brainer. We’re interested in TB genomics to resolve transmission patterns in BC, but a larger global issue is being able to rapidly diagnose and resistance-type isolates in a point-of-care fashion to get people on the right therapy right away, and for that we need massive amounts of genomic data annotated with resistance data so we can find new targets. We don’t have a lot of dramatic resistance in BC isolates, but that means our data makes great controls in the sort of bacterial GWAS studies that can mine these targets. We’ve pledged our annotated genomes to the global ReSeqTB project so they’ll be a part of a very high-quality international dataset of TB genomes.
- Anything else? I can’t think of anything that would be especially useful, but we’ll probably chuck a few things in like MIRU-VNTR genotype, lineage, and maybe site of disease as a simple pulmonary/extrapulmonary field.
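To make the identifier, date, and location questions above a bit more concrete, here is a minimal Python sketch of one way the public-facing rows could be derived from internal records. Everything in it is an assumption for illustration: the record layout, the `BCTB` prefix, the region names, and the helper names `public_rows` and `singleton_periods` are all hypothetical, not a scheme anyone has settled on. The second helper checks the worry raised under "A date": whether any time period contains only a single entry at a given level of granularity.

```python
from collections import Counter
from datetime import date

# Hypothetical internal records: (lab_id, sampling_date, health_authority, hsda).
# None of these values are real; they only illustrate the shape of the data.
records = [
    ("LAB-0193", date(2012, 3, 14), "Region A", "HSDA 11"),
    ("LAB-0007", date(2012, 3, 29), "Region A", "HSDA 12"),
    ("LAB-0452", date(2012, 7, 2), "Region B", "HSDA 21"),
]


def public_rows(records, prefix="BCTB"):
    """Assign new sequential IDs and keep only the coarse, shareable fields.

    The internal lab ID, the day of sampling, and the HSDA are dropped.
    IDs follow input order, so the caller decides what that order means.
    """
    rows = []
    for n, (_lab_id, sampled, authority, _hsda) in enumerate(records, start=1):
        rows.append({
            "id": f"{prefix}{n:04d}",
            "year": sampled.year,
            "month": sampled.month,
            "region": authority,
        })
    return rows


def singleton_periods(rows, keys):
    """Return the key combinations that occur in exactly one row."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return sorted(period for period, c in counts.items() if c == 1)


rows = public_rows(records)
print(singleton_periods(rows, ("year", "month")))  # month-year granularity
print(singleton_periods(rows, ("year",)))          # year-only granularity
```

With this toy data, month-year granularity leaves one isolate alone in its period while year-only leaves none, which is the kind of comparison that would inform how fine-grained the public date can be.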
As I pointed out in my IMMEM talk (pointers to my slides are in my last post), the genomic epidemiology field is at the point where we need to get on board with ontologies for this sort of metadata, and while most of the world dislikes making ontologies, there are – thank your lucky cats – people who like constructing them and are extremely skilled at doing so, like the excellent Emma Griffiths from the IRIDA project. So, when we get around to constructing the CSV file of metadata that Microreact will read in, you can bet your proverbial bottom dollar that we will be talking to Emma to make sure we’re doing it right.
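As a rough illustration of what building that CSV might look like, here is a minimal Python sketch using only the standard library. The column names, the example rows, and the `to_csv` helper are assumptions made up for this example; the actual headers would depend on what Microreact expects and on whatever ontology terms get agreed on. The one design choice worth noting is the fixed column whitelist: a row carrying any field outside the list raises an error instead of quietly ending up in a public file.

```python
import csv
import io

# Hypothetical whitelist of public columns; real names would follow
# the agreed ontology and the visualization platform's requirements.
FIELDS = ["id", "year", "region", "lineage", "resistance"]

# Made-up example rows, not real isolates.
rows = [
    {"id": "BCTB0001", "year": 2012, "region": "Region A",
     "lineage": "4", "resistance": "susceptible"},
    {"id": "BCTB0002", "year": 2012, "region": "Region B",
     "lineage": "2", "resistance": "INH"},
]


def to_csv(rows, fields=FIELDS):
    """Serialize rows to CSV text, refusing any field not in the whitelist."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fields,
        extrasaction="raise",  # unexpected keys raise ValueError
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


text = to_csv(rows)
print(text)
```

Passing a row that still contains, say, an internal `hsda` key makes `to_csv` raise a `ValueError`, so the mistake surfaces before anything is uploaded.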
The genomes are also running through Mykrobe right now – a great resistance prediction tool from Phelim Bradley and Zam Iqbal. Data from that to come! Come to think of it, I will include the Mykrobe predictions of resistance and lineage in the Nullarbor-Microreact example.
Okay, time to sleep off all the fish here. It’s almost bedtime in Lisbon, the place where you can survive on a single meal a day because that one meal basically comprised most of the fish in the ocean.