Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mykrobe overcounts homopol deletions when making probes #148

Open
iqbal-lab opened this issue Feb 24, 2022 · 3 comments
Open

Mykrobe overcounts homopol deletions when making probes #148

iqbal-lab opened this issue Feb 24, 2022 · 3 comments

Comments

@iqbal-lab
Copy link
Collaborator

The probe generator is making one probe for each 1bp and 2bp deletions in katG and pncA. But if you make a all 1bp deletions in a homopolymer, they are essentially the same.
eg deleting one A from AAA could happen in 3 places and give the same sequence.
This leads to annoying things where we report >1 mutation being detected, when it is basicvally the same thing detected twice/whatever.

Workaround suggestions

  1. compare all the indel/frame-shift probes against each other and dedup
  2. pay attention to whether you are in a homopolymer when creating the 1bp and 2bp deletions, and don't overcreate probes
@mbhall88
Copy link
Member

For context. We have this example

            "Isoniazid": {
                "predict": "R",
                "called_by": {
                    "katG_GC1037G-GC2155074C": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -711.0227088947968,
                            -417.9483717274669
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 175,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 293
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1038C-GG2155073G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -727.5071540961122,
                            -377.76816030583126
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 160,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 350
                        },
                        "_cls": "Call.VariantCall"
                    },
                    "katG_CC1039C-GG2155072G": {
                        "variant": null,
                        "genotype": [
                            1,
                            1
                        ],
                        "genotype_likelihoods": [
                            -729.808268296947,
                            -372.51398695693916
                        ],
                        "info": {
                            "coverage": {
                                "reference": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 7,
                                    "min_non_zero_depth": 6,
                                    "kmer_count": 158,
                                    "klen": 21
                                },
                                "alternate": {
                                    "percent_coverage": 100.0,
                                    "median_depth": 15,
                                    "min_non_zero_depth": 13,
                                    "kmer_count": 253,
                                    "klen": 18
                                }
                            },
                            "expected_depths": [
                                24
                            ],
                            "contamination_depths": [],
                            "filter": [],
                            "conf": 357
                        },
                        "_cls": "Call.VariantCall"
                    }
                }
            },

The three mutations in the JSON above could concieveably be the same deletion...or a 2bp deletion I guess.

GC1037G
CC1038C
CC1039C

I guess it probably is a single 1bp deletion - at 1038.

Although I'm a bit confused about how the mutations work. Which of the two bases is the one at the position?

If I pull out these three bases, with one base flanking, from the katG reference, I get G GCC T. So if the position describes the first base, then 1039 should read CT1039C right? But if the second base describes the position, it's the same problem right?

@martinghunt
Copy link
Member

The sequence at 1038-1040 in katG is CCC.
image

@mbhall88
Copy link
Member

Hmmm, wonder why I got a different sequence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants