IEDB database cdr3_aa stored as junction_aa #469

racng · 2023-12-14T23:57:01Z

Describe the bug
IEDB database provides cdr3_aa sequences instead of junction_aa sequences. scirpy database import code puts this in the slot for junction_aa. The sequence would then be missing the flanking amino acids. This makes it inaccurate to do identity string matching with ir_dist/ir_query.

To Reproduce
https://github.com/scverse/scirpy/blob/d862cf35740a79e91f95c49ce6da1fbe280f8c1c/src/scirpy/datasets/__init__.py#L338C1-L371C10

Expected behaviour
Is there a way to format cdr3 into junction sequence? like add the missing "C" at the beginning based on V call? Do you have advice on how to do this?

grst · 2023-12-15T07:30:16Z

Hi Rachel,

thanks for bringing this up. This was meant as a workaround to make it work with ir_dist (which is currently hardcoded to work on the junction sequence).

A proper solution is probably to

convert junction to cdr3 (which means just removing the constant amino acids)
provide an option in ir_dist to use the cdr3 column instead of junction_aa

For now, as a workaround, I would suggest to store CDR3 sequences in junction_aa in both query and database object.
You can trim the string with awkward array functions:

import awkward as ak
mdata["airr"].obsm["airr"]["junction_aa"] = ak.str.slice(mdata["airr"].obsm["airr"]["junction_aa"], 1, -1)

grst · 2023-12-15T07:33:14Z

@zktuong, just wanted to ask you to be sure

junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?
cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

zktuong · 2023-12-21T10:30:52Z

junction_aa -> cdr3 should be trivial, right? Just clip off the first and last amino acid. Any caveats?

yes should be correct.

cdr3 -> junction_aa: Is it correct that the first AA is always C and the last is always W/F? Can this be robustly inferred from the V call?

yes first is always C and the last should always be either F/W. Should be able to infer the F/W from the second codon from start of the J call (TTT/TTC for F or TGG for W).

grst · 2024-01-09T20:53:39Z

Hi @racng,

I believe I fixed this in #476, by extracting the junction_aa sequence from the full protein sequence based on the start/end coordinates provieded in IEDB. Would be great if you could give this a try!

You can install that version using

pip install git+https://github.com/scverse/scirpy@issue-469

Please make sure to remove the cached version of iedb before rerunning ir.datasets.iedb by deleting the data directory that was created.

racng · 2024-01-10T03:45:51Z

@grst Thank you for working on a fix for this issue!
I have copied the new function into a script and tested it.
For some reason I am getting a lot of NaN values for both junction_aa and cdr3_aa.

adata = iedb(cached=True, cache_path='test/iedb.h5ad')

ir.get.airr(adata, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 30080

ir.get.airr(adata, 'cdr3_aa')['VDJ_1_cdr3_aa'].isna().sum()
# 30110

# Old copy of iedb reference
ir.get.airr(adata_old, 'junction_aa')['VDJ_1_junction_aa'].isna().sum()
# 118

grst · 2024-01-11T18:20:10Z

Bad news! I took another look and it seems that indeed the Start/End position and and Protein sequence are only available for ~5000 receptors:

>>> iedb_df["Chain 1 Protein Sequence"].dropna().size
5083

For the rest, there is "CDR3 Curated", but not "CDR3 Calculated" available. Taking a closer look at "CDR3 Curated", it seems that some (but not all) sequences there are actually junction sequences, including the C and W/F.

So I'm afraid this would take quite some cleanup to get it right!

On the scirpy side, I could consider adding the option to use cdr3_aa sequences, instead of junction_aa, for sequence comparisons. Then the CDR3 sequences in the database could remain as they are.

racng · 2024-01-11T22:33:34Z

I think I found a solution. We can use the J-motif sequences in J-region-motifs.tsv generated by IMGTgeneDL run in stitchr mode to determine the terminal residue of the J gene. It "contains automatically inferred CDR3 junction ending motifs and residues (using the process established in the autoDCR TCR assignation tool)".

I am attaching it here:
J-region-motifs.txt

grst · 2024-01-12T12:48:06Z

@zktuong, what do you say about this one? Is that wrong or the always existing exception to the rule in Biology?

(L and V are "not confident" but C is)
(this is from above J-junction-motifs.txt)

zktuong · 2024-01-18T13:00:08Z

Looks like they are correct and are treated as exceptions, but probably non-functional

For TRAJ35*01:

(6) noncanonical J-MOTIF: Cys-Gly-X-Gly instead of Phe-Gly-X-Gly. The functionality of TRAJ35*01, initially considered as ORF owing to the presence of Cys (C118) instead of Phe (F118) in the J-MOTIF, has been changed to F (functional) as TRAJ35*01 has been found rearranged and expressed in several specific cDNA clones (comments added by G. Folch and M.-P. Lefranc on the 23/01/2019, following the identification of an additional clone and a question by C. Joos (Germany)).

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRAJ
Genecard says it's non-functional
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRAJ35

TRBJ2-2P*01 also looks like an exception and non-functional open reading frame

(1) This sequence has diverged considerably from that of other TRBJ2-2: it has no canonical J-NONAMER [2], J-PHE 118 is replaced by Gly (G) and G 119 by R in J-MOTIF.

https://www.imgt.org/IMGTrepertoire/index.php?section=LocusGenes&repertoire=genetable&species=human&group=TRBJ
https://www.genecards.org/cgi-bin/carddisp.pl?gene=TRBJ2-2P

TRBJ2-7*02

(2) J-PHE replaced 118 replaced by Val in J-MOTIF.

zktuong · 2024-01-18T13:02:57Z

i just checked the IgBLAST auxiliary files that indicate where the CDR3 end is and crossed it to the fasta files and it looks like the codons are correct for those 3 genes as well. so should be alright to use the J-junction-motifs.txt

racng added the bug Something isn't working label Dec 14, 2023

grst mentioned this issue Jan 9, 2024

Calculate junction_aa in IEDB #476

Closed

3 tasks

grst mentioned this issue Jan 12, 2024

Hackathon: add generic_ir_from_biocypher() function to ingest TCR data #404

Open

grst mentioned this issue Jan 12, 2024

Mistake in AIRR epitope database annotation theislab/single-cell-best-practices#258

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IEDB database cdr3_aa stored as junction_aa #469

IEDB database cdr3_aa stored as junction_aa #469

racng commented Dec 14, 2023

grst commented Dec 15, 2023

grst commented Dec 15, 2023

zktuong commented Dec 21, 2023

grst commented Jan 9, 2024

racng commented Jan 10, 2024

grst commented Jan 11, 2024

racng commented Jan 11, 2024

grst commented Jan 12, 2024

zktuong commented Jan 18, 2024

zktuong commented Jan 18, 2024

IEDB database cdr3_aa stored as junction_aa #469

IEDB database cdr3_aa stored as junction_aa #469

Comments

racng commented Dec 14, 2023

grst commented Dec 15, 2023

grst commented Dec 15, 2023

zktuong commented Dec 21, 2023

grst commented Jan 9, 2024

racng commented Jan 10, 2024

grst commented Jan 11, 2024

racng commented Jan 11, 2024

grst commented Jan 12, 2024

zktuong commented Jan 18, 2024

zktuong commented Jan 18, 2024