UMItools extract with nanopore long reads #661

Open
ctarpey opened this issue Sep 4, 2024 · 2 comments

Comments

@ctarpey

ctarpey commented Sep 4, 2024

Hello!

I would love your opinion or advice with our method, and I have a few questions about how UMItools extract with a regex works.

I have nanopore long reads with dual 18bp UMIs, one on each end of sequence reads that are variable in length between 2000-2800bp. I would like to extract both UMIs, concatenate them and append them to the sequence read name.

The anatomy of my sequence read is:

Barcode - enrichment primer - 18bp UMI - target primer - genomic region of interest - reverse target primer - 18bp UMI - enrichment primer - Barcode

The regex that I have been using looks like this:
"^.(?P<umi_1>.{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3})(?P<discard_1>CAGTGGCTCC){e<=1}.{1000,}(?P<discard_2>GCACATGCAG){e<=1}(?P<umi_2>.{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3})."

Here I ignore any number of bases before the first instance of my UMI pattern, then search for the flanking primer sequence with fuzzy matching (edit distance of at most 1), then at least a thousand bases of the genomic region of interest, then the second flanking primer sequence (again with an edit distance of at most 1), then the second UMI pattern and any number of bases after it. I am also running the reverse complement of this regex, as you suggest at the end of #610.

Using this regex and the reverse complement, I am only finding matches in a cumulative 45% of my reads. I understand there may be some molecular issues I have yet to rule out, but can you think of a reason that the regex may not be performing as I expect it to? Or is there some characteristic of the nanopore data that I may be overlooking? Or part of the program I am misunderstanding?

How does the tool use the regex? Is the first match of the regex pattern in the read reported? What does it do when the pattern can match more than once? I believe our current regex is specific enough to avoid this issue, but I started with a regex of just the two UMI patterns flanking at least 100bp. That matched in more than 95% of my reads, only for me to realize that the UMI pattern occurs more than the expected 2 times per read, so I couldn't trust that the UMIs being pulled out by extract were true UMIs.
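A back-of-the-envelope estimate suggests why the bare UMI pattern turns up far more than twice per read. The sketch below assumes that bases are independent and uniformly distributed (an assumption, not something stated in the thread), so each of the six [CT]/[GA] positions is satisfied by chance half of the time. The function name is hypothetical.

```python
# Sketch: expected number of chance matches to the bare UMI pattern in one read.
# Assumption (not from the thread): bases are independent and uniformly distributed.

UMI_LENGTH = 18            # length of one UMI, from the read anatomy above
CONSTRAINED_POSITIONS = 6  # three [CT][GA] pairs per UMI
P_POSITION = 0.5           # each constrained position accepts 2 of the 4 bases


def expected_chance_matches(read_length: int) -> float:
    """Expected count of start positions where random sequence fits the UMI pattern."""
    start_positions = read_length - UMI_LENGTH + 1
    if start_positions <= 0:
        return 0.0
    return start_positions * P_POSITION ** CONSTRAINED_POSITIONS


for length in (2000, 2800):
    print(length, round(expected_chance_matches(length), 1))
# prints:
# 2000 31.0
# 2800 43.5
```

Under that assumption, a read of 2000-2800bp would be expected to contain dozens of positions that fit the UMI pattern by chance, which is consistent with the pattern alone not being specific enough.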

Thank you for your help!

@IanSudbery
Member

UMI-tools applies the regex through the Python regex module, using match. This means that, yes, the first match will be pulled out. However, this is the first match using greedy matching (as is standard in regex). I think that means that the longest possible insert will be captured, although I could be wrong.
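The greedy behaviour described here can be checked on a toy example. The sketch below uses Python's standard-library re module rather than the regex module, on the assumption that the two backtrack the same way for plain (non-fuzzy) patterns. The read and the two-base stand-in "UMIs" are made up for illustration.

```python
import re

# Toy read: "AA" stands in for the first UMI, "GG" for the second,
# and "GG" also occurs once by chance inside the insert.
read = "AATTTTGGTTTTGG"

# Greedy quantifier: the insert grows as long as the overall match allows.
greedy = re.match(r"^(?P<umi_1>AA)(?P<insert>.{3,})(?P<umi_2>GG)", read)

# Lazy quantifier: the insert stops at the first place the rest can match.
lazy = re.match(r"^(?P<umi_1>AA)(?P<insert>.{3,}?)(?P<umi_2>GG)", read)

print(greedy.group("insert"))  # TTTTGGTTTT (longest insert that still matches)
print(lazy.group("insert"))    # TTTT (shortest insert that still matches)
```

In this toy case the greedy version does capture the longest possible insert, so the second group is taken from the last position that fits. Fuzzy matching may change which alternative is preferred, so this only illustrates the non-fuzzy case.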

As for why it's not matching...
One thing here is that you say the regex will "ignore any number of bases before the first instance of my UMI pattern", but that isn't true of the regex you posted. ^ represents the start of the sequence, and . any single character. So the regex ignores a single character before the first UMI group, not any number of bases. If you wanted it to ignore any number of bases at the start, you could put ^.*(?P<umi_1>....).
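The difference between ^. and ^.* can be seen on a small made-up read. This is a minimal sketch using the standard-library re module, so the {e<=1} fuzzy constraint on the primer is left out (re does not support it) and only the first half of the pattern is tested. The read, its five leading bases and the example UMI are hypothetical.

```python
import re

# UMI pattern from the thread; the primer is matched exactly here because
# the standard-library re module has no fuzzy matching.
UMI = r".{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3}"
PRIMER = "CAGTGGCTCC"

one_base = re.compile(r"^.(?P<umi_1>" + UMI + r")(?P<discard_1>" + PRIMER + r")")
any_bases = re.compile(r"^.*(?P<umi_1>" + UMI + r")(?P<discard_1>" + PRIMER + r")")

# Hypothetical read start: 5 leading bases, an 18 bp UMI, then the primer.
leading = "GGGGG"
umi = "AAACGAAACGAAACGAAA"
read = leading + umi + PRIMER + "TTTT"

print(one_base.match(read))                  # None: only one leading base is skipped
print(any_bases.match(read).group("umi_1"))  # AAACGAAACGAAACGAAA
```

With ^. the UMI group is forced to start at the second base of the read, so any read with more than one base ahead of the UMI fails to match.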

Secondly, you've got this string of [CT][GA] bases in your UMIs. That's sort of equivalent to having a 12nt fixed sequence. Whereas you've allowed a mismatch in the fixed sequences in the discard groups, that isn't the case with the fixed bases in the UMI groups.

@ctarpey
Author

ctarpey commented Sep 17, 2024

Thanks @IanSudbery!

That is really helpful, I see adding the "*" after the "." at the start of the regex will make a big difference.

I spoke with a bioinformatics specialist at ONT about the pattern we're seeing, and they believe it essentially comes down to higher rates of sequencing error at the start of the read, due to strand slippage through the pore and fewer kmers at the start of the reads. So I agree with your comment that the string of [CT][GA] positions, which are essentially fixed bases for which I am not allowing any mismatches, might be a big culprit for the lack of matches. Based on nothing but intuition, I am thinking of allowing a single edit distance for each of my 2 UMI patterns, which would allow one base to deviate among the 6 sort-of-fixed positions of each UMI.
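As a rough illustration of what "one base may deviate among the 6 sort-of-fixed positions" means, the sketch below counts how many constrained positions of an already extracted 18bp UMI violate the [CT]/[GA] rule. It is not how UMI-tools or the regex module implements fuzzy matching. It handles substitutions only, not the insertions and deletions an edit distance would also cover, and the function names and threshold are hypothetical.

```python
# 0-based offsets of the constrained positions in the 18 bp pattern
# .{3}[CT][GA].{3}[CT][GA].{3}[CT][GA].{3}
CONSTRAINTS = {3: "CT", 4: "GA", 8: "CT", 9: "GA", 13: "CT", 14: "GA"}
UMI_LENGTH = 18


def constrained_mismatches(umi: str) -> int:
    """Count constrained positions whose base is not one of the allowed two."""
    if len(umi) != UMI_LENGTH:
        raise ValueError(f"expected {UMI_LENGTH} bases, got {len(umi)}")
    return sum(1 for pos, allowed in CONSTRAINTS.items() if umi[pos] not in allowed)


def passes(umi: str, max_mismatches: int = 1) -> bool:
    """Hypothetical filter: accept a UMI with at most max_mismatches deviations."""
    return constrained_mismatches(umi) <= max_mismatches


print(constrained_mismatches("AAACGAAACGAAACGAAA"))  # 0: fits the pattern exactly
print(constrained_mismatches("AAAAGAAACGAAACGAAA"))  # 1: offset 3 is A, not C or T
print(passes("AAAAAAAAAAAAAAAAAA"))                  # False: three deviations
```

Because a shifted UMI caused by an insertion or deletion would look like several substitutions to this check, it is only a way to reason about the threshold, not a replacement for fuzzy matching in the regex itself.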

I didn't include any mismatches in the UMI pattern query because I was worried about being too lenient with the UMI matching and ultimately grouping UMIs that were unrelated. If I use one edit distance of mismatch allowed between the UMI pattern and the sequence pattern being matched, but also want to account for the fact that there may be other sequence errors in the N / random bases of the UMI pattern, do you suggest any specific ways of running the UMItools group to ensure that the program will cluster the UMIs appropriately?

Thanks again for your help on this
