Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for targeted sequencing #163

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

mbhall88
Copy link
Member

@mbhall88 mbhall88 commented Oct 7, 2022

This PR will (eventually/hopefully) close #28

Added

  • option to compress JSON output (-z/--compress)
  • specific support for targeted sequencing (-T/--targeted). Takes a BED file of the regions targeted to aid in the read depth estimation [[Add support for targeted sequencing #28][28]]
  • The script scripts/combine.py to combine mykrobe reports into a single CSV file

The major feature addition in this PR is the targeted option. It takes a BED file of the regions the user has amplified and only calculates the expected depth from variants that fall within these regions.

Motivation

So this implementation has been motivated by some Nanopore targeted sequencing data I have been working with lately. There are two amplification strategies used: PCR and RPA. In particular, RPA cannot amplify an entire gene, so regions of genes that have the most variants in a WHO catalogue were chosen. One consequence of this when running vanilla mykrobe is that there are lots of variant with median depth 0/1, which end up drowing out the median depth calculate. To make matters more interesting, some of these samples were sequenced for up to 3 days, so the read depth is crazy high (~100k for some genes).

For example, here is a histogram of variant median depths for one sample (subsampled to theoretical coverage 10,000x)

image

The median depth (expected depth) from mykrobe is 1,074. Using this targeted approach, we get the following histogram

image

The median depth (expected depth) from mykrobe is 4,208. Using this targeted approach, we get the following histogram.

In addition to this, comparing DST predictions, before this option, I was finding lots of poor quality indels being called (e.g. #139; see our discussion in the slack channel too). After this change, I didn't see any of those.

One last thing to consider is whether we need to change the min. proportion expected depth. This is potentially just because of our high depth, but I have come across a case where we filter out a good call because it only has 14% of the expected depth. This isn't crazy low in the context of targeted sequencing where it is possible different genes amplify differently. I'm also not sure whether it is normal for targeted sequencing to produce much higher depths than WGS (I suspect it does) and so lowering the default from 0.3 to something like 0.1 when using the targeted option might not be insane? Thoughts?

@iqbal-lab
Copy link
Collaborator

The bed file is an excellent idea, and indeed i think the only way to support targetted sequencing well is (IMO) to know the targets and estimate depth on each independently. I also think it makes sense to set some limit, and once coverage goes above that, subsample reads randomly per target/amplicon.

Yes, it is typical to get much higher depth with amplicon seq; it is sometimes also an intention to use this to spot minor alleles. I think we do need to set some boundaries on what we are trying to do though, we can't support everything and handling everything between 10x and 10000x depth is complicated.

If people want to do exploratory work with super-high depth, they can always select reads for each amplicon, subsample those, use racon to get a consensus, and then remap all of the reads to that consensus to spot minor alleles.

personally, i'd vote for something like a max read depth of 100x, and then feed straight into standard mykrobe.

What do you think of that?

@mbhall88
Copy link
Member Author

mbhall88 commented Oct 8, 2022

Are you saying to set a max read depth on a variant, or are you talking about subsampling the reads within mykrobe?

@iqbal-lab
Copy link
Collaborator

I'm talking about subsampling

@mbhall88
Copy link
Member Author

mbhall88 commented Oct 8, 2022

That seems well outside of the scope of mykrobe. Besides, subsampling in this way is not trivial and I don't think we want to go there. You can't just do a random thing like rasusa because if there's uneven coverage on your targets, it's quite possible you will lose reads from one or more targets. See mbhall88/rasusa#17

@iqbal-lab
Copy link
Collaborator

Don't see how it's more outside scope than taking a bedfile, or coping with 100000x depth

@iqbal-lab
Copy link
Collaborator

When we originally added "support " for targeted sequencing we were rather cavalier, hoping a rough average depth would work. After the covid work, I'm more inclined to just accept we cannot properly support targeted sequencing and just offer something limited , if at all. Proper handling of multiple (>1) minor alleles when you have huge depth, and distinguishing nanopore homopolymer noise is going to be hard. That's why I proposed subsampling. Which, yes, requires specific code to subsample per target.

@iqbal-lab
Copy link
Collaborator

Open to suggestions/disagreement! Could have a wrapper repo that does the bedfile and subsampling? Have a collaborator wanting to do targeted sequencing and am about to help them do subsampling plus racon (ie completely avoid Mykrobe), so interested in your opinion

@mbhall88
Copy link
Member Author

mbhall88 commented Oct 8, 2022

We are definitely going to be needing to use this in the future so I am happy to work up a solution (I might also try drprg on this data and see what happens).

So we're going to need to do some kind of alignment to be able to subsample targets right?

We could do minimap2 as it has a python API? Only pitfall here is I don't know whether or not windows can deal with that?

@iqbal-lab
Copy link
Collaborator

My suggestion is this.

  1. Create a new repo which does the following
  2. Take bed file of targets, and create one mykrobe panel for each (eg single gene mykrobe)
  3. Map reads, assign to each target, estimate depth, subsample if necessary
  4. Run mini mykrobe on each target independently passing in those reads only
  5. Combine results

@mbhall88
Copy link
Member Author

Alright. That will be a decent amount of work. Before I do that, I might take a look at how drprg goes on the amplicon data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for targeted sequencing
2 participants