Deleting actual resistance when it is present in real data

Our previous posts discussed the issue of creating “fictional” amino acid variants as a result of nucleic acid mixtures in Sanger sequences or consensus NGS sequences.  Here we will discuss the OPPOSITE problem, where nucleic acid mixtures can result in the DISAPPEARANCE of actually observed mutations (and/or resistance) when the results are submitted to online databases for processing.  I am going to use examples from HIV because that is what I have been doing since I was a little boy, but the same would apply to HCV.

As we discussed below, nucleotide mixtures are indicated by IUPAC codes, such that WYC (for example, at HIV RT Codon 215) translates into four amino acids, in this case Phe/Ile/Ser/Thr.  Usually we would write this as T215F/I/S/T, and this is fair enough.   However, by the time that one starts to get quite a few nucleotide mixtures detected at a given codon, things start to get a bit unwieldy:  an NNN would translate to T215A/C/D/E/F/G/H/I/K/L/M/N/P/Q/R/S/T/V/W/Y/*.   This would obviously be absurd to report to patients, and such a high multiplicity of possible mixtures at a given codon would usually be an indication that the data quality (using Sanger Sequencing) was very poor.
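As a rough illustration of that naive translation, here is a minimal Python sketch that expands an IUPAC codon into every amino acid it could encode. The function name `expand_codon` is just something made up for this post, and it uses the standard genetic code and the standard IUPAC mixture codes; it is not the code that any particular database runs.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_codon(codon):
    """Return the sorted list of amino acids an IUPAC codon could encode."""
    options = [IUPAC[base] for base in codon.upper()]
    return sorted({CODON_TABLE["".join(c)] for c in product(*options)})


print("T215" + "/".join(expand_codon("WYC")))
print(len(expand_codon("NNN")), "possible calls for NNN (including stop)")
```

Running this on WYC gives four amino acids, and on NNN gives all twenty plus the stop codon, which is the unwieldy list above.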

As a result, most people (and programs) have decided that if they see a mixture containing “too many amino acids” – probably more than four – at a given resistance position, they declare the position an X (which is fair enough) and INTERPRET THAT THE MUTANT AMINO ACID IS EFFECTIVELY **NOT** PRESENT.    This seems to be something similar to what the major databases which report resistance have done.

This is a feature or a bug, depending on how you look at it.   I suspect that the rationale was that if there were more than 4 amino acids present, the mutation was therefore present at less than ~20% of the population and could not be detected *by Sanger Sequencing* anyway, and this makes a lot of sense.  However, this can become a problem if you have a consensus sequence created from NGS data. If you have a genuine mixture containing a mutation, and progressively add some minority species data (apparently “noise”, but actually real data), it can ACTIVELY REMOVE the mutant signal…. so for example, a database can claim that a mutation is “present” (and a sample “resistant”) at a 20% or a 5% cut-off yet absent (and a sample sensitive) at a 2% cut-off, as a result of collapsing “too many amino acids” into an “X”.  This seems to be a problem.    So a lot of studies looking at different NGS cut-offs by processing consensus sequences through these databases have probably been subtracting the actual signal (this post) as well as adding to the noise (previous post) at low prevalence of mutations.  The result is a systematic underestimate of the benefit of NGS data.
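To make the cut-off effect concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the codon frequencies are invented, the helper names (`consensus_codon`, `aa_call`) are made up, and the “more than four amino acids becomes X” rule is only my guess at the kind of rule described above, not the documented behavior of any database.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
TO_IUPAC = {bases: code for code, bases in IUPAC.items()}


def expand_codon(codon):
    """Return the sorted list of amino acids an IUPAC codon could encode."""
    options = [IUPAC[base] for base in codon]
    return sorted({CODON_TABLE["".join(c)] for c in product(*options)})


def consensus_codon(codon_freqs, cutoff):
    """Merge all codons at or above the cut-off into one IUPAC codon."""
    kept = [c for c, f in codon_freqs.items() if f >= cutoff]
    return "".join(
        TO_IUPAC["".join(sorted({c[i] for c in kept}))] for i in range(3)
    )


def aa_call(codon, max_aa=4):
    """Assumed rule: more than max_aa possible amino acids collapses to X."""
    aas = expand_codon(codon)
    return "X" if len(aas) > max_aa else "".join(aas)


# Hypothetical codon 184 frequencies from an NGS run (invented numbers).
freqs = {"ATG": 0.60, "GTG": 0.30, "GAC": 0.04, "CTC": 0.03, "CAG": 0.03}

for cutoff in (0.20, 0.05, 0.02):
    codon = consensus_codon(freqs, cutoff)
    print(f"{cutoff:.0%} cut-off: {codon} -> M184{aa_call(codon)}")
```

With these made-up numbers, the 20% and 5% cut-offs both give RTG (M184MV), while the 2% cut-off gives VWS, which expands to ten amino acids and collapses to M184X even though GTG is sitting there at 30%.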

To demonstrate this, we looked at the effects of systematically changing only codon 184 of the HIV RT in an HXB2 sequence background with increasing amounts of different simulated types of “noise”, including the M184V (example fasta included).  This mutation (and an HIV example) was chosen just because everyone universally agrees it confers resistance to 3TC/FTC, so it represents the simplest possible case.  To orient you, ATG is wildtype and GTG is resistant.

We submitted the sequences to both the Stanford Website and Geno2Pheno, and the results were very interesting.   Here’s the example fasta file, cleverly called  “test“,  if you want to play with it yourself.

Stanford Database


Codon     AA call       Interpretation
=====     =======       ==============
GTG       M184V         resistant
STG       M184LV        resistant
VTG       M184MLV       resistant
GHM       M184ADEV      resistant
GHV       M184ADEV      resistant
GWS       M184DEV       resistant
BTG       M184LV        resistant
BMG       M184X         susceptible (!)
SWS       M184X         susceptible (!)
GNN       wt (!!!)      susceptible (!)
VMG       M184X         susceptible (!)
NNN       wt (!!!)      susceptible (!)

Note that a BTG at codon 184 is resistant, but BMG is susceptible.  And in the case of NNN or GNN, it actively denies the mutation is even present as M184X at all. I think the most egregious offender is GNN…. if you have the key M184V mutation (the G in GTG) with too much noise AT THE OTHER TWO BASES, it declares you to be wildtype (!)

The results when processed through Geno2Pheno are different, but also show unexpected behavior.

Geno2Pheno Results


Codon     AA call       Interpretation
=====     =======       ==============
GTG       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
STG       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
VTG       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
GHM       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
GHV       “missing”     3TC resistant (12-fold)   (ddI partial)
GWS       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
BTG       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
BMG       wt (!!)       3TC resistant (12-fold)   (ddI partial)
SWS       M184V         3TC resistant (56-fold)   (ddI, ABC partial)
GNN       “missing”     3TC resistant (12-fold)   (ddI partial)
VHG       “missing”     3TC resistant (12-fold)   (ddI partial)
VMG       wt (!!)       3TC resistant (12-fold)   (ddI partial)
NNN       “missing”     3TC resistant (12-fold)   (ddI partial)

The implication of all this is that a lot of studies have probably been both subtracting the genuinely detected resistance signal (this post) and artificially adding to the noise at low prevalence of mutations (previous post).  As a result, we should probably think again about the effect of nucleotide mixture processing on all previous NGS studies which submitted results as consensus sequences to these databases.  This is also another argument for translating NGS data before converting it to a consensus sequence.

Inventing Fictional Resistance from Real Data (Pt 2 of 2)

This particular example of HCV being devious was brought up to us by Federico Garcia – Thanks Federico!   He points out an interesting aspect of one of the most famous mutations in HCV.  S282T is a substitution in the NS5B gene, which results in resistance to sofosbuvir.  It is an example where we can accidentally “invent” HCV drug resistance when none is actually present.

So let’s dig into the weeds a bit.  First of all, Serine is potentially encoded by a number of different nucleotide combinations, but for HCV genotype 1a and 1b, and most other genotypes, the S is almost exclusively encoded by AGC, except in genotype 3, where it is usually AGT.  (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3993261/).  This is not that weird – different organisms usually have an overall preference for which codon they use for a given amino acid, even if it is not obvious why.  And viruses tend to have preferences for which nucleotide triplet they use for a specific amino acid at any given location in their genome.

An odd thing about Serine, though, is that it is one of the few amino acids which appears on opposite sides of a Translation wheel (see picture).  So it is encoded by quite different nucleotide triplets – for example by AGT (on the left) and by TCT (on the right), which is kind of weird.  (“T” is actually “U” in this wheel, of course).

The effect of this weird encoding is that if one has a 50:50 mixture of AGT and TCT (two different S282S HCV variants which are susceptible to sofosbuvir) and performs Sanger Sequencing (or NGS sequencing followed by creating a nucleotide consensus sequence), the result is “WST”.    This will be naively translated to S282C/S/T.  One would interpret that the S282T is present and that the virus is sofosbuvir resistant, but this is not the case – we just made that up!
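The Serine artefact is easy to reproduce in a few lines of Python. This is only a sketch under the usual assumptions (standard genetic code, standard IUPAC codes); the helper names `merge_codons` and `expand_codon` are invented for this post.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
TO_IUPAC = {bases: code for code, bases in IUPAC.items()}


def merge_codons(*codons):
    """Collapse the codons seen in a sample into one IUPAC codon."""
    return "".join(
        TO_IUPAC["".join(sorted({c[i] for c in codons}))] for i in range(3)
    )


def expand_codon(codon):
    """Return the set of amino acids an IUPAC codon could encode."""
    options = [IUPAC[base] for base in codon]
    return {CODON_TABLE["".join(c)] for c in product(*options)}


real_codons = ("AGT", "TCT")                    # both encode Serine
mixed = merge_codons(*real_codons)              # what the consensus shows
real = {CODON_TABLE[c] for c in real_codons}    # amino acids really present
fictional = expand_codon(mixed) - real          # amino acids we invented

print(mixed, sorted(real), sorted(fictional))
```

The two real codons both translate to S, but the merged codon WST also expands to C and T, so the “S282T” comes entirely from the mixture codes.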

The good news is that this artefact must be pretty rare, or everyone would be reporting detecting the S282T all the time!   S282T is pretty rare.  But this is definitely something to look out for.

Inventing “Fictional” Amino Acids out of real data (Pt 1)

Our initial few blog posts will focus on some of the “interesting” or less well-known details of using DNA sequencing methods to monitor drug resistance.   A warning here that only a few obsessive people will care about these details!  I’ll be mixing and matching examples from HIV and HCV.

Usually sequencing is described as something like “determining the sequence of A, C, T, and G” in the virus (after converting to DNA). But in reality the IUPAC codes are not just A, C, T and G, but also include 11 codes for positions where DNA mixtures are observed.   For example, a mixture of an A and a G, where viruses with an A and viruses with a G are present simultaneously, is denoted by an R, for “purine”. (See picture).

This is all straightforward enough.   Where it gets a bit trickier is when there are two (or more) base changes in a codon, which can lead to some initially unexpected behavior.   For example, one of the earliest codons of interest in HIV was position 215 of the HIV Reverse Transcriptase, commonly a Threonine.  At the nucleotide level this is usually ACC in most western HIV variants, but when drug resistance is selected by AZT it can change to a Phenylalanine (usually TTC) or a Tyrosine (usually TAC).

In the example above, a mixture of an A and a G gives an R, which corresponds to both a Histidine and an Arginine.   Now, a mixture containing both drug resistant virus (“TTC”) and drug susceptible virus (“ACC”), when sequenced by Sanger sequencing, has TWO mixed positions and is depicted as the nucleic acid mixture WYC.  That is great.   BUT the tricky part is that when we go to translate that sequence back to amino acids, we cannot tell that there is only the starting TTC (Phe) and the ACC (Thr) present in the “WYC” nucleotide codes, so we translate this back to include two “fictional” amino acids that don’t actually exist! These are TCC (Ser) and ATC (Ile) in this case.

The result of all this is that since the 1990s we have cheerfully been reporting the existence of a lot of “drug resistant” HIV variants that may literally not exist in a given sample, including 215(Serine) and 215(Isoleucine).  As it turns out, this has not been a big problem, just a bit embarrassing.  Similar “fictional” amino acids can happen with the translation of mixtures of other drug resistant and wild-type viruses, as long as there is a two-base change.

What can we do about this?  It depends on the sequencing methods you are using. If one is using Sanger sequencing, there is really nothing one can do, short of cloning or diluting out the individual virus strains present one at a time.  Life is too short for that.  However, if using NGS sequencing methods, you CAN avoid creating these fictional amino acids by doing translations to amino acids BEFORE creating consensus sequences.  There is a new format called AAVF which allows one to deal rationally with NGS sequences and prevents the invention of “fictional” DNA sequences.
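Here is a minimal Python sketch of the “translate first” idea, assuming you already have the codon observed at a position in each individual NGS read. The read counts are invented and `aa_frequencies` is a made-up helper name; this only illustrates the principle and is not the AAVF format itself.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}


def aa_frequencies(read_codons):
    """Translate each read's codon separately, then tally the amino acids."""
    counts = Counter(CODON_TABLE[codon] for codon in read_codons)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical codon 215 calls from ten reads: six wild-type, four mutant.
reads = ["ACC"] * 6 + ["TTC"] * 4

for aa, freq in sorted(aa_frequencies(reads).items()):
    print(f"215{aa}: {freq:.0%}")
```

Because each read is translated on its own, only Thr and Phe ever show up; the Ser and Ile that a WYC consensus would imply never get created.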

Our next post will deal with a particularly devious example of this popping up as a potential problem for HCV analyses.

The first presentation from SHARED! At AASLD 2018

Anita Howe was at AASLD over the weekend with the first presentation from the SHARED collaboration.   The start of something good – have a look!

AASLD 2018_Final Presentation

SHARED is an international research collaboration with the goal of better understanding and avoiding Hepatitis C drug resistance.  Our site will be located at https://hcvdb.ubc.ca but this is currently under construction.  Sorry!

Here we will give an outline of our future plans and have space for general discussions.

Working on logos

Our secondary logo: