provide an explicit list of cell barcodes to whitelist? #642

Open
bbimber opened this issue May 7, 2024 · 5 comments


@bbimber

bbimber commented May 7, 2024

Hello -

In some workflows that could use umi-tools, we already have an explicit whitelist of the corrected cell barcodes. There is still a need to identify the non-error-corrected cell barcodes.

As I understand umi-tools, one can run the whitelist command and either give it a cell number, or let the tool infer the cell #. Is there any way to provide a list of allowable cell barcodes, and to let umi-tools generate the whitelist TSV to map cellbarcode to error-corrected barcodes?

@TomSmithCGAT
Member

Hi @bbimber. Just to clarify, you have a whitelist of cell barcodes and would like UMI-tools to automatically identify the observed barcodes which should be error-corrected to these cell barcodes. Is that correct?

If so, I'm afraid this isn't currently supported by UMI-tools.

When running umi_tools extract with a whitelist, the error barcodes need to be supplied in the format indicated here: cell barcode in column 1 and barcodes to correct to it in comma separated list in column 2.
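As a minimal sketch of the two-column layout described above (cell barcode in column 1, comma-separated barcodes to correct to it in column 2), the following helpers write and read one row. The function names are hypothetical and are not part of UMI-tools; the tab delimiter is assumed from the "whitelist TSV" wording, and any further columns are simply ignored.

```python
def format_whitelist_row(cell_barcode, error_barcodes):
    """Return one whitelist line: barcode, TAB, comma-separated error barcodes."""
    return cell_barcode + "\t" + ",".join(sorted(error_barcodes))


def parse_whitelist_row(line):
    """Split one whitelist line into (cell_barcode, set_of_error_barcodes)."""
    fields = line.rstrip("\n").split("\t")
    cell_barcode = fields[0]
    # Column 2 may be empty or absent when nothing corrects to this barcode.
    if len(fields) > 1 and fields[1]:
        errors = set(fields[1].split(","))
    else:
        errors = set()
    return cell_barcode, errors


# Toy barcodes for illustration only.
row = format_whitelist_row("AAAA", {"AAAT", "AAAC"})
print(repr(row))
print(parse_whitelist_row(row))
```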

It should be relatively trivial to determine for yourself which error barcodes you wish to correct if you already have a list of whitelisted cell barcodes. However, one issue is that if you specify all the possible error corrections without reference to whether each barcode is actually observed, column 2 of the whitelist will get excessively long.
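To illustrate why column 2 grows, here is a rough sketch that enumerates every one-mismatch neighbour of each whitelisted barcode without looking at the data. It assumes a Hamming-distance threshold of 1, an ACGT alphabet, and that a neighbour shared by two whitelisted barcodes is dropped as ambiguous; the function names are hypothetical and this is not how UMI-tools itself derives corrections. Each barcode of length n has 3n such neighbours, so a 16-base barcode has 48.

```python
def hamming1_neighbours(barcode, alphabet="ACGT"):
    """Return every sequence differing from `barcode` at exactly one position."""
    neighbours = set()
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                neighbours.add(barcode[:i] + alt + barcode[i + 1:])
    return neighbours


def all_possible_corrections(whitelist):
    """Map each whitelisted barcode to its unambiguous one-mismatch neighbours."""
    whitelist = set(whitelist)
    # Never propose correcting one whitelisted barcode to another.
    candidates = {cb: hamming1_neighbours(cb) - whitelist for cb in whitelist}
    # Count how many whitelisted barcodes claim each neighbour.
    claims = {}
    for errors in candidates.values():
        for err in errors:
            claims[err] = claims.get(err, 0) + 1
    # Keep only neighbours claimed by exactly one whitelisted barcode.
    return {cb: {e for e in errors if claims[e] == 1}
            for cb, errors in candidates.items()}


# Toy 4-base barcodes for illustration only.
corrections = all_possible_corrections(["AAAA", "CCCC"])
print(len(corrections["AAAA"]))  # 3 alternatives x 4 positions = 12
```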

You could run umi_tools whitelist to generate the error mappings in the whitelist and then subset the whitelist using your pre-defined whitelist, but that seems a bit hacky and could run into issues where your pre-defined whitelist barcode wasn't in the umi-tools output.
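The subsetting workaround described above could look roughly like the sketch below. It assumes the generated whitelist has already been loaded into a dict of cell barcode to error barcodes; `subset_whitelist` is a hypothetical helper, not a UMI-tools function. The returned `missing` set makes the stated problem visible: pre-defined barcodes that UMI-tools did not select (or folded into another barcode as an error) have no row to keep.

```python
def subset_whitelist(generated, predefined):
    """Keep generated whitelist rows whose cell barcode is in `predefined`.

    generated:  dict mapping cell barcode -> set of error barcodes
    predefined: iterable of accepted cell barcodes
    Returns (kept, missing), where `missing` holds pre-defined barcodes
    that have no row in the generated whitelist.
    """
    predefined = set(predefined)
    kept = {
        # Assumption: never correct one accepted barcode into another.
        cb: set(errors) - predefined
        for cb, errors in generated.items()
        if cb in predefined
    }
    missing = predefined - set(generated)
    return kept, missing


# Toy data for illustration only.
generated = {"AAAA": {"AAAT"}, "CCCC": {"CCCG"}, "GGGG": {"GGGT"}}
kept, missing = subset_whitelist(generated, ["AAAA", "CCCG", "TTTT"])
print(kept)
print(sorted(missing))
```

In this toy case `CCCG` is on the pre-defined list but appears in the generated whitelist only as an error of `CCCC`, so it ends up in `missing` along with the never-seen `TTTT`.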

Hmm.. answering your question, I see the issue now!

@TomSmithCGAT
Member

TomSmithCGAT commented May 8, 2024

@IanSudbery, any objections to an option being added to whitelist to accept a pre-defined whitelist and then derive a sensible whitelist + error-corrections from the fastq?

It should be a simple addition of a new knee_method, perhaps with that parameter renamed. Other than some sanity checking for the presence of the pre-defined whitelisted CBs in the observed CBs, I can't see any other gotchas. Thoughts?

def getCellWhitelist(cell_barcode_counts,
                     knee_method="distance",
                     expect_cells=False,
                     cell_number=False,
                     error_correct_threshold=0,
                     plotfile_prefix=None):
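A minimal sketch of what the proposed option might do, assuming the observed barcode counts are available as a dict or `Counter` (as `cell_barcode_counts` suggests). Everything else here is hypothetical: the function name, the `predefined` and `threshold` parameters, and the rule that an observed barcode is assigned only when exactly one whitelisted barcode lies within the threshold. It is not the UMI-tools implementation. The pairwise comparison is quadratic and only meant to show the logic, including the sanity check for whitelisted barcodes absent from the data.

```python
def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def whitelist_from_predefined(cell_barcode_counts, predefined, threshold=1):
    """Derive error corrections for a pre-defined whitelist from observed counts.

    Returns (corrections, unobserved):
      corrections: whitelisted barcode -> set of observed error barcodes
      unobserved:  whitelisted barcodes that never appear in the data
    """
    predefined = set(predefined)
    # Sanity check: which whitelisted barcodes are absent from the data?
    unobserved = {cb for cb in predefined if cell_barcode_counts.get(cb, 0) == 0}
    corrections = {cb: set() for cb in predefined}
    for barcode, count in cell_barcode_counts.items():
        if barcode in predefined or count == 0:
            continue
        matches = [cb for cb in predefined
                   if len(cb) == len(barcode)
                   and hamming_distance(cb, barcode) <= threshold]
        # Only correct when the target is unambiguous.
        if len(matches) == 1:
            corrections[matches[0]].add(barcode)
    return corrections, unobserved


# Toy counts for illustration only.
counts = {"AAAA": 100, "AAAT": 3, "CCCC": 50, "GGGG": 7, "TTTA": 2}
corrections, unobserved = whitelist_from_predefined(counts, ["AAAA", "CCCC", "TTTT"])
print(corrections)
print(unobserved)
```

Here `GGGG` is left unassigned because no whitelisted barcode is within one mismatch, and `TTTT` is reported as unobserved even though an apparent error of it (`TTTA`) is present.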

There is an option to derive the error corrections from just the whitelist CB sequences when reading in the whitelist in extract, but that's going to run into issues by creating an excessively broad set of possible error corrections, since there is no checking that the error CBs are actually present in the data. I imagine the excessively broad whitelist might impact runtime.

def getUserDefinedBarcodes(whitelist_tsv, whitelist_tsv2=None,
                           getErrorCorrection=False,
                           deriveErrorCorrection=False,
                           threshold=1):

@IanSudbery
Member

I've no objection, other than to add that I'm not really all that au fait with whitelist and its methods, so I won't really be able to help with support.

There is an option already to read a supplied whitelist into whitelist. What does this do?

@bbimber
Author

bbimber commented May 8, 2024

@TomSmithCGAT: yes, your description is pretty accurate. I considered the options you were suggesting, including making the TSV whitelist format myself. Like you said, the utility of having umi-tools generate the error-corrected barcodes is that the result would be empirical, based on the data.

@IanSudbery
Member

I think it would only be empirical in that a list of all possible barcodes that could be corrected would be filtered to those actually present.

I don't think it would make any difference to the results. Where it might have a benefit is that the lists would be smaller, and therefore the extract process might be quicker and use less memory.
