Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scverse datastructure for AIRR data #327

Closed
1 task done
grst opened this issue Mar 9, 2022 · 57 comments · Fixed by #356
Closed
1 task done

scverse datastructure for AIRR data #327

grst opened this issue Mar 9, 2022 · 57 comments · Fixed by #356

Comments

@grst
Copy link
Collaborator

grst commented Mar 9, 2022

Now that scirpy is part of scverse, we could think of an improved data structure for scAIRR data. See also the discussion at scverse/scanpy#1387.

The challenge with scAIRR data is that

  • 1 cell can have n chains. Up to four of them are biologically meaningful but there could be more for technical reasons.
  • Each chain has a lot of fields. See the AIRR rearrangement standard.

The current pragmatic solution is to store all fields in adata.obs.

  • All columns from the airr rearrangement schema are repeated four times
  • Excess chains are serialized into JSON and stored in an extra column. These chains are not used by scirpy, but enable lossless conversions.
  • The downside is that there can easily be 100+ columns in adata.obs. Also serializing excess chains is not really elegant.
  • The advantage is that it works really well with scanpy, i.e. any AIRR variable can immediately be used for grouping, plotting etc.

New options are

  • mudata. AIRR data could be saved as a separate modality. Even if we keep the current reprepsentation of a wide data frame, it would at least not clutter the rest of adata.obs.
  • awkward array support. Allows storing an arbitrary number of values per row. See first attempt to support awkward arrays anndata#647

The new representation should also aim at being a community standard for the scverse ecosystem and should build upon the AIRR rearrangement standard. Ideally, we could get additional stakeholders onboard, including conga, dandelion, tcrdist3 and possibly members of the AIRR community.

  • what's the state of the AIRR single-cell schema? And what are its advantates over the rearrangement schema.
@bcorrie
Copy link

bcorrie commented May 20, 2022

Hi All, I am active in the AIRR Standards group, can try and answer questions. We are looking at integrating scirpy into the iReceptor Gateway (gateway.ireceptor.org) to do analysis on AIRR Cell data - so have a vested interest... 8-)

The AIRR Cell schema was just announce at the AIRR meeting this week (https://www.antibodysociety.org/the-airr-community/meetings/airr-community-meeting-vi-exploring-new-frontiers/), with a tagged (v1.4) release expected on github in the next few weeks. This is an "Experimental" schema, meaning that we think it is pretty stable, but will probably need some minor changes before it is finalized, probably in an AIRR Standard v2.0 release sometime in the "near" future.

@bcorrie
Copy link

bcorrie commented May 20, 2022

The Cell schema is intended to represent a Cell as an object. It is a light weight object, and contains a cell_id that is used to cross reference information in other collections such as Rearrangement and CellExpression. So if you have a cell_id from a Cell you can find all of the Rearrangements for the paired chains associated with the cell in an AIRR TSV file by searching for that same cell_id. Similarly you can find all of the CellExpression information by searching a CellExpression file for that same cell_id.

@bcorrie
Copy link

bcorrie commented May 20, 2022

Here are the current docs on the master branch, which will be tagged with a v1.4 release soon.

https://docs.airr-community.org/en/latest/datarep/cell.html

@bcorrie
Copy link

bcorrie commented May 20, 2022

One of the outstanding things we need to resolve in the AIRR Standard is what if any "efficient" file format should be supported. Currently the default standard for all AIRR objects is a JSON representation of the standard objects. For Rearrangement there is a defined TSV file. Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

We have an about to be release version of our AIRR repository (https://github.com/sfu-ireceptor/turnkey-service-php/tree/production-v4) that can store Cell and CellExpression data, but currently what you get out is a very big, verbose JSON file. It works, but...

@bcorrie
Copy link

bcorrie commented May 20, 2022

BTW, I think it would be interesting to decorate a cell's adata.obs with AIRR Repertoire metadata. That is, it would be quite easy (I think) to decorate the adata.obs with all sorts of subject/sample metadata for each cell such as subject.disease.disease_daignosis, subject.age_min, subject.sex, sample.tissue etc so that one can use these criteria for coloring, group, etc as well. It seems to me that these fields in adata.obs would be more valuable than the most of the Rearrangement fields. I could see wanting some Rearrangement fields, maybe VDJ calls, CDR3 length, locus for the paired chains, but it seems like most of the Rearrangement fields would not be very interesting from a slicing/coloring/visualization point of view.

@grst
Copy link
Collaborator Author

grst commented May 21, 2022

Hi @bcorrie,

thanks for reaching out, I'm definitely interested in working with the AIRR community to make this as interoperable as possible.
Your description of the Cell schema is really helpful, before I wasn't sure it was to replace the Rearrangement schema, but now I understood it's built on top of it.

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

The internal representation is one side of the medal, the other is representing the AIRR data as part of either an AnnData or MuData container, which allows to link the data to other modalities, allows efficient file storage in h5ad/h5mu and enables the interaction with scanpy or other scverse tools. The current way is to dump everything in .obs which is not ideal.


Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Note that reading/writing gene expression formats would be out-of-scope for scirpy, this should be handled by scanpy or AnnData, or possibly by a future scverse IO package.

It's also worth noting here that the CZI is currently looking into standardizing Feature Observation Matrices (FOM) for single-cell data. Maybe it would be worth for the AIRR community and them getting in touch for talking about a AIRR-specific extension. Their public presence is still very sparse, but maybe @ivirshup can tell more about this initiative and make the contact.

Best,
Gregor

@bcorrie
Copy link

bcorrie commented May 24, 2022

Such a file format does not yet exist for Cell and more importantly CellExpression. We could probably use some advice from this community...

Well, how about AnnData? If you are looking for a lower level of abstraction, I think hdf5, zarr or tiledb can handle these sparse data efficiently, but @ivirshup or @gtca are the experts here. Is it really necessary to define yet another format for single-cell expression?

Yes, AnnData seems logical and along the lines of what I was thinking... Definitely do not want YAF (yet another format) 8-)

Like you, this is likely out of scope for the AIRR Community - at least in the sense that the obvious thing for us to do would be to have the AIRR python library (https://github.com/airr-community/airr-standards/tree/master/lang/python) use scanpy to read and write these formats.

The CZI FOM is interesting as well...

@bcorrie
Copy link

bcorrie commented May 24, 2022

Actually scirpy internally uses something very very similar to the Cell schema already. Every file format read into scirpy is first represented as a list of AirrCell objects that have a cell_id, a list of Rearrangement entries and arbitrary cell-level attributes that are added as columns to adata.obs. Once the Cell schema is final, I could simply adjust my AirrCell implementation to match the cell-schema, which would make reading and writing AIRR-compliant data trivial.

Very nice - one would hope that these would be closely aligned, and hopefully we can now make sure they are completely aligned 8-)

@grst
Copy link
Collaborator Author

grst commented Jun 20, 2022

Draft proposal for an awkward-array-based data structure

I've been playing around with the draft implementation of awkward array support in AnnData (scverse/anndata#647). This allows to store in .obsm (or in the future maybe also in .obs, see the comments there) ragged arrays of mixed types for each cell -- exactely what we need to represent the 1:n relationship of cells with chains.

An "awkward array" can be directly created from a list of AirrRearrangement dictionaries, e.g.

import awkward as ak
import scirpy as ir
airr_cells = ir.io.to_airr_cells(adata)
chains_as_awkarr = ak.Array([cell.chains for cell in airr_cells])
# an array of length 3000 (= number of cells), variable width (= number of chains) and 
# AIRR Rearrangement fields on the 3rd dimension
> chains_as_awkarr
<Array [[{c_call: 'TRBC2', ... v_cigar: None}]] type='3000 * var * {"c_call": op...'>

To get, e.g. all junction_aa sequences of the first chain of each cell, this awkward array can be sliced:

> chains_as_awkarr[:, 0, "junction_aa"]
<Array ['CASSLMRLAGDTQYF', ... 'CATEEYGNKLVF'] type='3000 * string'>

It is therefore straightforward and computationally efficient to retrieve individual elements.

In summary: We simply take AIRR compliant data, put it in an efficient data structure and store it an AnnData. Not a lot new to define from our side

Caveats and possible solutions

1. Primary and Secondary chains

The list of chains does not know the concept of primary and secondary chains. This can be solved by adding an index array:

# the index arrays point to the element in the list of chains
primary_vj = [0, 0, 1, 2, 0, 1, ...]
primary_vdj = [1, 3, 5, 3, 2, ...]

The awkward array can then be sliced

> chains_as_awkarr[:, primary_vj, "junction_aa"]

to retrieve e,g. the junction AA sequences for the primary VJ chains.

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

If they are, here are some solutions to do so

  • implement a ir.get function to retrieve data from the awkward array (see also Easy-access getter functions #184). The result can be re-assigned to adata.obs for plotting and/or grouping
  • re-implement all scanpy plotting functions as e.g. ir.pl.umap, where color is retrieved from the awkward array instead of .obs.

3. Where to store the data

The current draft implementation only supports awkward arrays in .obsm. IMO it would be completely fine to have an .obsm["airr"]. In the future, it may be possible to have a column .obs["airr"].

It was also previously suggested to use mudata as AIRR data can be considered a modality. Since we don't have any data for .X and .var, I don't see a clear advantage of using mudata, though.


I'm looking forward to feedback, especially from people that develop ecosystem packages for
AIRR analysis (@zktuong, ..., ?)

@grst
Copy link
Collaborator Author

grst commented Jun 20, 2022

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

  • Cell metadata -> adata.obs
  • Rearrangement chains -> adata.obsm["airr"] (as awkward array)
  • CellExpression -> adata.X or mudata (in case of CITE-seq)

@ivirshup
Copy link
Member

The CZI FOM is interesting as well...

@bcorrie from discussion with the SOMA team, the intent was to keep anndata and SOMA (formerly FOM) pretty consistent with each other. Ideally they are the same thing eventually.

Though I haven't heard much on SOMA development recently, so hopefully we are still on the same page here.

Where to store the data

@grst, I've recently heard people wanting to treat their re-arrangment data as a separate modality. E.g. have an AnnData with just re-arrangement info which could then be merged with other modalities. I think the main point here was to avoid having to arbitrarily associate AIRR data with another modality.

I think @bio-la and @crichgriffin could elaborate on this.

@grst
Copy link
Collaborator Author

grst commented Jun 21, 2022

I've recently heard people wanting to treat their re-arrangment data as a separate modality.

Wouldn't it anyway work with both anndata and mudata to use .obsm["airr"].

# anndata
ir.tl.something(adata)
# anndata, override obsm_key
ir.tl.something(adata, airr_key="ir")
# mudata
ir.tl.something(mdata.mod["AIRR"])

Or would you prefer something like

# mudata only, use `.mod["AIRR"]` by default
ir.tl.something(mdata)
# ... or override default modality
ir.tl.something(mdata, mod="myairr")

@ivirshup
Copy link
Member

I don't remember the details, but one of the issues was with subsetting. Wanting to filter AIRR obs, but not the RNA.

I think it would work with AnnData right now, I think the issue was more about want to use scirpy with a MuData where RNA and AIRR were in separate modalities? I think this sorta worked, but there were issues with mudata operations.

@crichgriffin
Copy link

In the context of multimodal data and using MuData, I like the idea of keeping the AIRR data separate from any specific modality. My current workflow is to store the scirpy anndata object as a separate modality in muon, where I also have rna and adt data, as separate modalities. But I agree that it doesn't make a lot of sense, since the scirpy anndata object is mostly empty.

I agree with @grst that storing airr data could potentially work in both AnnData and MuData by using .obsm["airr"] except (iirc) in AnnData .obsm row dimensions match .obs row dimensions, and I assume the same applies to MuData.
What would you do in the instances of cells not having AIRR data? E.g. in PBMC data, monocytes would not have airr data, but you would want to keep them in your AnnData/MuData.

@grst
Copy link
Collaborator Author

grst commented Jun 21, 2022

What would you do in the instances of cells not having AIRR data?

fill the corresponding rows with NaNs? This is how scirpy also handles this currently.
Although this is a good point in favor of mudata. It would also make ir.pp.merge_with_ir superfluous.

I think it should be easy to support both. Just need to think what to advertise as default in the tutorial.

@zktuong
Copy link
Contributor

zktuong commented Jun 24, 2022

all of this sounds awesome. Just have some naive question about how well/fast does it deal with situation when there's >100k+ cells? would chains_as_awkarr take a while to retrieve the entries?

@grst
Copy link
Collaborator Author

grst commented Jun 25, 2022

That's a fair point, I need to try this out, but it should be very fast. Awkward array claims to be similarly scalable as numpy arrays.

@bcorrie
Copy link

bcorrie commented Jul 2, 2022

Hi All, sorry for the lack of input - both fighting "the big C" and travelling to meetings (I don't recommend trying to both at the same time) the last several weeks.

I am not an AnnData expert - yet - but my observations thus far.

We are experimenting with Conga and it annotates each AnnData cell with heavy and light chain VDJ/CDR3 in the .obs of the object. Other tools like CellTypist also populate the .obs with other data.

This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice. It seems to make sense to me that the community might encourage this???

@bcorrie
Copy link

bcorrie commented Jul 2, 2022

2. Getting values for plotting

The current solution to put everything in obs was amongst others a pragmatic solution to make all AIRR data immediately available to scanpy plotting functions. This is not possible anymore with the proposed new implementation. Arguably, most of those columns are used very rarely for those purposes anyway.

Yes, it occurred to me that plotting might be challenging using obsm. I personally think this would be quite important.

I would anticipate using the obsm['airr'] to store not only rearrangement annotations, but also what we call repertoire annotations (by repertoire AIRR means study/subject/sample metadata). For example, it is quite easy for us to generate a pool of cell data from many subjects in a study, possibly across disease conditions (healthy, mild covid, severe covid). We would pool that data and create a single h5ad file. We would want the 'disease_state' and 'subject_id' to be in the AIRR obsm object so we could visualize cells based on these fields. Same for the AIRR 'tissue' field and a range of others potentially.

It seems like a bad idea to have this difficult.

@bcorrie
Copy link

bcorrie commented Jul 2, 2022

A quick example - this is from a combined Conga/CellTypist analysis, with the .obs annotated with Conga and Cell Typist fields. This is trivial to visualize VDJ annotations per cell next to CellTypist majortiy voting - at the same time - because the columns are added to .obs. If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

image

'vb', 'jb', and 'va' are from Conga, while 'majority_voting' is from CellTypist.

I just threw these data sets together today to try and explore this data for our early experiments with CellTypist. So please be patient with the data oddities with TR call against projected B-cells. I have yet to validate anything from this image in terms of it making sense - and any and all errors are mine 8-)

The main point is that these experimentations are easy because the visualizations are easy.

@bcorrie
Copy link

bcorrie commented Jul 2, 2022

@bcorrie, this should make it possible that we merely define a mapping how the AIRR standard is represented as AnnData object rather than converting between a scirpy and an AIRR representation. Need to hash that out in more detail, but roughly

  • Cell metadata -> adata.obs
  • Rearrangement chains -> adata.obsm["airr"] (as awkward array)
  • CellExpression -> adata.X or mudata (in case of CITE-seq)

This seems to make sense to me as a first pass, although I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see #327 (comment)) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

One could have adata.obsm['airr_rearrangement'] and adata.obsm['airr_repertoire'] - but maybe too clunky? Another related link that cells potentially have is to "Receptors" - which are known B/T cell receptors that have a specific antigen/epitope specificity (think of the B and T cell specificity info in IEDB). Again, maybe worth capturing.

@javh
Copy link

javh commented Jul 2, 2022

For what it's worth, on the R side we've been using MultiAssayExperiments with a "rearrangement" experiment that includes the AIRR Rearrangement data for storing multimodal single-cell data that includes AIRR data, GEX, CITE-seq, and/or whatever people dream up. The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

If I'm understanding the awkward array correctly, and I may not be, this would be the same as using the "record" array implementation to populate .X with an awkward array in a mudata object. A pandas DataFrame with multi-indexing seems like the most natural fit for working cellular Rearrangement data (eg, key on something like ['cell_id', 'locus', 'sequence_id']) and it looks like the conversion from awkward array to multi-indexed DataFrame is trivial. Unfortunately, my python is a bit rusty these days, so I could be misunderstanding.

PS:
"We" is not the AIRR Standards WG in this case. I don't think we should have an official opinion on implementation.

@grst
Copy link
Collaborator Author

grst commented Jul 5, 2022

where to store the data (.obs vs. .obsm vs. .X)

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix. We have a primary key (row names) derived from the locus field to easily separate out IGH/TRB, but I don't think that's necessary. Sample/cell level metadata goes into the colData (.obs) as is typical.

@javh this sounds pretty much like what I would imagine! Interesting to have the BumpyMatrix in .X.
In combination with mudata this actually makes a lot of sense and would be consistent to other modalities.

I would probably suggest that adata.obsm['airr'] might contain more than just info about rearrangement chains (see #327 (comment)) as AIRR Repertoire metadata (disease_state, tissue, age, sex, ...) on a cell basis will often be very valuable.

Other tools like CellTypist also populate the .obs with other data.
This seems messy, and .obsm seems like a good option for adding specific annotations to a cell object on a per "tool" basis. So there might be a 'AIRR' obsm object, but also a 'Conga' and 'CellTypist' object. This would separate these cell annotations cleanly - would that be considered good practice.

@bcorrie, I'll try to summarize here a bit the (typical) purpose of the different AnnData slots.

  • .obs has been designed to hold all cell-level metadata. I would stick to that and add the repertoire fields (disease state, tissue, etc.) to .obs to make them readily available for plotting and other scanpy functions. So far, AnnData does not distinguish between columns added by different tools. I like the idea of having more structure in .obs, but that's something we would need to discuss on the AnnData side.

  • .obsm has been designed to hold data matrices aligned to .obs, e.g. UMAP or PCA. One could argue that Rearrangement data is not a data matrix either. An alternative would be to add an "awkward array column" to adata.obs:

    adata.obs["airr_rearrangement"] = ak.Array(...)

    Yet another option would be to put the awkward array in .X as @javh suggests.

    The thing here is that having awkward columns requires some additional effort on the AnnData side and it is unclear how long this would take (see first attempt to support awkward arrays anndata#647). Awkward arrays in obsm already work, but the PR needs to be wrapped up. Awkward arrays in X are not implemented either, but proably easier to fix than the pandas extension.


accessing data for plotting

If they were in .obsm and the visualizations could not make use of this easily, this would be challenging...

I agree that cell-type, patient info etc. must be directly accessible for plotting (which they would be if they are in .obs).
One can argue how useful a UMAP plot with V/D/J genes really is, similar to other information from the rearrangement schema. I would envisage something like this:

# if one really wants to plot rearrangement fields, add them to adata.obs explicitly
adata.obs["my_v_gene_col"] = ir.get("v_call", receptor_arm="VJ", dual_ir="primary")
sc.pl.umap(adata, color="my_v_gene_col")

That would be easy enough IMO and makes it explicit which "chains" to choose from the possibly multiple chains that are available for each cell.

@grst
Copy link
Collaborator Author

grst commented Sep 6, 2022

To broaden the scope even further, looping more people into the discussion:

@ncborcherding is maintaining scRepertoire, the scirpy-equivalent in the Seurat world. Nick, maybe you could give us your perspective on how you store AIRR data in Seurat-objects?

@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.

I'm aware that this is quite a thread to read through, but the top-level comment describes what this is all about, and in the comment just before are the latest proposals of how this could be implemented.

@bcorrie
Copy link

bcorrie commented Sep 6, 2022

@bcorrie, from when you mentioned:

The AIRR assay data (equivalent to .X) is stored as a BumpyMatrix, specifically a BumpyDataFrameMatrix

Could you link to a vignette or some code where you are using a bumpy matrix to work with AIRR data?

I think that was @javh that was talking about "bumpy matrix" - ahh, I see he answered already...

@bcorrie
Copy link

bcorrie commented Sep 6, 2022

@kmayerb and @phbradley, with TCRdist3 and conga you maintain other immune-cell receptor analysis packages in the python universe that are reasonably popular and complementary to scirpy. In an ideal world we would be using the same data structure for better interoperability. Maybe you are interested in joining this discussion.

One of the tools we are integrating with (and therefore exporting to) is Conga - that is exactly our use case 8-)

@javh
Copy link

javh commented Sep 6, 2022

@grst said:

You introduce a separate dimension for locus, whereas I would have gone for putting all chains into a single dimension, and store separate indices that allow to retrieve VJ (i.e. TRA/TRG/IGK/IGL) or VDJ (i.e. TRB/TRD/IGH) loci. Both is possible and maybe worth another discussion.

If I had my druthers, I would also have that index be the chain type instead of the locus, because that more naturally fits with how you're likely to work with the data (avoiding the awkwardness of always indexing on c("IGK", "IGL")). Which just means doing this instead:

df <- read_rearrangement("~/Downloads/pbmc3_ig_db-pass.tsv.gz") %>%
    mutate(chain=alakazam::getChain(locus))
chain <- factor(df$chain, levels=c("VH", "VL"))
bumpy <- splitAsBumpyMatrix(DataFrame(df), row=chain, column=df$cell_id, sparse=TRUE)

However, this has nomenclature problems and is very B cell centric. There's no comparable terminology to VH and VL for T cells, which instead uses alpha-beta and gamma-delta. Hence, there's no field in the AIRR Rearrangement standard that defines standard nomenclature for an abstract chain type. That's trivial to overcome. But, that does mean whatever your solution, it won't necessarily be consistent across tools. Eg, I have minor reservations about VDJ vs VJ as the chain type. We decided to use VH and VL in alakazam, which is also fine as a key, but still incorrect terminology in the case of TCRs.

I'm also not certain in makes sense to combine analyses for TRA and TRG.

Hence, locus ended up being the index. Not ideal, because downstream analysis often starts with adding the chain field. But, it doesn't suffer from any wrongness upon import either.

What I didn't understand yet: Is this format something you use in-house only or is it part of the AIRR stndard, some other community standard or any R package?

In-house only. It doesn't appear in a public package (that I'm aware of). And it is not part of the AIRR Standards efforts or the AIRR Data Commons stuff @bcorrie is talking about. So there's no need to conform to this data structure if you have something better in mind.

@ncborcherding
Copy link

Very interesting discussion and thanks for looping me in!

As of right now, scRepertoire is appending a portion of the filtered TCR/BCR alignments to the meta data of a single cell object (either Seurat or SingleCellExperiment). It is not necessarily ideal as it does not preserve AIRR format or other information that users have been requesting, like cdr1/2 sequences.

In terms of implementing a change for consistency - from the R side of things, SingleCellExperiment functions as a SummarizeExperiment and is similar to the python equivalent. However, I don't think Seurat data format would be compatible with the array you are proposing. But I can do some investigating

@grst
Copy link
Collaborator Author

grst commented Oct 11, 2022

I decided to go with the adata.obsm variant described in #327 (comment), storing all chains in a single dimension rather than making an additional dimension for loci.
This makes IO agnostic of loci, which aren't standardized or even a mandatory field in rearrangement data.
Calling "primary" and "secondary" chains would then happen in a separate step, which would also allow to implement different strategies for chain ranking in the future.

If anyone has reservations against this approach, now would be a good time to speak up, otherwise it might be too late.

@grst
Copy link
Collaborator Author

grst commented Feb 2, 2023

I'm finally making some progress on the new datastructure (#356). Here is how it is currently working out:

  1. IO functions read in data into an awkward array in adata.obsm["airr"]. At this point no filtering happens, all information from an AIRR rearrangement file is preserved.

    # adata.obsm["airr"]
    [
        # cell0: 2 chains
        [{"locus": "TRA", "junction_aa": "CADASGT..."}, {"locus": "TRB", "junction_aa": "CTFDD..."}],
        # cell1: 1 chain
        [{"locus": "IGH", "junction_aa": "CDGFFA..."}],
        # cell2: 0 chains
        [],
    ]
  2. The new function scirpy.pp.index_chains is called. It adds another awkward array to adata.obsm["chain_indices"]. This function applies what is described as the scirpy receptor model, i.e. it flags multichain cells, and calls primary and secondary VJ and VDJ cells.

    # adata.obsm["chain_indices"]
    [
        {"VJ": [0], "VDJ": [1], "multichain": False}, # single pair
        {"VJ": [0, 2], "VDJ": [1,3], "multichain": False}, # dual IR
    ]

    The beauty of this is that this can easily be adapted to other receptor models. For instance, the index_chains function can be called with productive=False to keep non-productive chains. It would also be thinkable to develop a completely different receptor model with an arbitrary number of chains per obs. This could be relevant for spatial or bulk TCR data. This would require new analysis functions in scirpy (or another library), but it all fits into the same data structure.

  3. Get AIRR data by slicing the adata.obsm["airr"] with the indices in adata.obsm["chain_indices"]. For convenience, there is now a helper function that allows to request one, or multiple columns from the AIRR data, e.g.

    >>> scirpy.get.airr(adata, "locus", "VJ_1")
    AAACCTGAGAGTGAGA-1     TRA 
    AAACCTGAGGCATTGG-1     TRA
    AAACCTGCACCTCGTT-1    None
    ...
  4. Work with MuData. MuData will be the new recommended way of combining gene expression and AIRR data.

    mdata = md.MuData({"airr": adata_airr, 'gex': adata_gex})
    ir.tl.chain_qc(mdata["airr"])

I still need to iron out a few details, and most importantly, update the documentation and tutorial to reflect all these changes. There will be an automatic check and a conversion function to transfer data from the previous format into the new data structure.

@grst
Copy link
Collaborator Author

grst commented Apr 7, 2023

The new datastructure is now available in the v0.13.0rc1 pre-release 🎉!
I'm happy about everyone who tries it out and gives feedback!

You can install it using

pip install --upgrade --pre scirpy

Please also check out

@gszep
Copy link

gszep commented Apr 14, 2023

@grst this is amazing thank you so much for implementing this ❤️ I can see in the tutorials how I can access Rearrangement fields which is excellent. What about Subject or Sample fields from AIRR such as the age of donor/patient or disease_diagnosis etc; metadata that would be the same for a subset or all of the cells under obs

Happy to open a PR to make this happen 🙂

@bcorrie
Copy link

bcorrie commented Apr 14, 2023

I agree - great to see this moving forward - thanks @grst and others.

@gszep we talked about Repertoire metadata earlier (#327 (comment)) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case.

@bcorrie
Copy link

bcorrie commented Apr 14, 2023

There is an example in the docs (https://scverse.org/scirpy/tags/v0.13.0rc1/tutorials/tutorial_io.html#Combining-multiple-samples) where multiple samples are combined, and the equivalent of the AIRR sample_id is set for multiple samples in the obs data. In this case the obs_name is used to assign a sample obs.

So the trick would be to load in an AIRR Repertoire file and process it to do something similar based on the repertoire_id in the Rearrangement data.

One day soon I hope to get to doing something like this for the Single Cell downloads from the iReceptor Gateway. Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data.

@jday1
Copy link

jday1 commented Apr 14, 2023

Thank you for your reply.

I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so.

In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the airr path and the AIRR Repertoire stored in uns or similar?

@bcorrie
Copy link

bcorrie commented Apr 14, 2023

I am unfortunately not an anndata/scirpy expert either, I come from the AIRR side. My simple answer would be it depends on if you want to compare the repertoires. If yes, then having them in a single object is probably best (as in the example - you have to worry about batch effects). That way you can slice and dice across any of the metadata fields from the Repertoire.

I don't yet have any experience with how well scirpy and related tools will scale when you are throwing together data sets of this size. Any scirpy experts want to comment.

@jday1
Copy link

jday1 commented Apr 14, 2023

Thanks. Scaling is my main concern. If I want to filter my Repertoires before using their Rearrangements I'm not sure it's efficient to use a single object. I guess one might be able to use the concatenate feature which will put multiple AnnData objects into a single object if you require it. It would be great if someone from the scirpy side could offer a more informed view than mine!

@bcorrie
Copy link

bcorrie commented Apr 14, 2023

Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object.

@zktuong
Copy link
Contributor

zktuong commented Apr 14, 2023

i can help answer some of these questions (disclaimer - i still need time to familiarise myself with the new scirpy data structure but i'm familiar with the other main bits):

@gszep we talked about Repertoire metadata earlier (#327 (comment)) but I think currently scverse libraries only load AIRR Rearrangements. Would love to hear that I am wrong, but I believe that to be the case.

The main thing is that the anndata.obs slot index is cell_id, whereas the airr data's index is sequence_id. So scirpy stores the airr data separately in the .uns and uses the awkward arrays to both store and retrieve the information. Although i believe the awkward arrays can theoretically hold any information in the AIRR row for each contig. So you could store those information if you have it, just that it won't be retrieved automatically.

I am still learning the best practices for anndata. Say we have a usecase of 20 repertoires each with 100,000 cells or so.

In this case, would it make sense to have 20 Anndata objects and in each anndata object there would be the 100,000 'Rearrangement' rows stored under the airr path and the AIRR Repertoire stored in uns or similar?

Yes. the thing to look out for in the airr table is that your sequence_id are unique. cell_id can be repeated in the airr table but i think internally the individual contigs will still be ranked by umi count to sort them from largest to smallest, so that the highest expressing pairs of heavy/light or long/short contigs would be the main bcr/tcr for that cell.

Yes, I would guess keeping them separate until you want to compare them, then concatenate them as required for specific comparisons. Still an interesting scalability question as to how many samples can you practically compare in a single scirpy object.

The io step and retrieval steps should be fast i believe?

@gszep
Copy link

gszep commented Apr 15, 2023

The HDF5 file format was designed to have lazy constant time random access to any contiguous slices of your data. If these advantages are exposed in anndata.AnnData class then we shouldn't have to worry about running out of memory and can concatenate and compare as much as we want. However as far as I can tell:

  1. Lazy concatentation of different files is still an experimental feature available via anndata.experimental.AnnCollection and there are issues being tracked for this anndata.concat with backed anndata objects anndata#793
  2. Lazy loading of partial data within a single file is supported for .X (maybe also .layers and .obsm?) using the backed="r" mode but it appears that .obs and .var are still all loaded into memory. I've create an issue to suggest supporting lazy dataframes lazy dataframes in .obs and .var with backed="r" mode anndata#981
  3. After the above two issues are solved, the lazy versions of plotting and analysis functions must be written. Hopefully some of them will work out of the box if the lazy dataframe closesly matches the pandas api.

@grst
Copy link
Collaborator Author

grst commented Apr 17, 2023

Thanks for all your comments!

What about Subject or Sample fields from AIRR such as the age of donor/patient or disease_diagnosis etc; metadata that would be the same for a subset or all of the cells under obs

These are usually just stored as additional columns in .obs. Represented as categoricals, this is quite memory-efficient.

Ideally one could request an h5ad download of multiple samples and get a single h5ad file with all the GEX, Cell, Rearrangement, and Repertoire metadata embedded in that single h5ad file. Currently when you do a download you get 4 separate files, one for each type of data.

I'm definitely open to add more reader functions to load other AIRR schemas than Rearrangement. Could you maybe point me to an example dataset that has all four data types? Then I'll take a look how to best represent them as AnnData.

I'll separately comment on scalability later.

@grst grst pinned this issue Apr 18, 2023
@grst
Copy link
Collaborator Author

grst commented Apr 18, 2023

Regarding scalability:

Different steps in the scirpy pipeline are subject to different limitations. My goal is to enable improve scalability of scirpy such that analysis of (few) millions of cells is conveniently possible on a single workstation (e.g. >200GB RAM, >30 cores). This is tracked here: #370.

For anything >10M cells, I believe we need to move to out-of-memory and out-of-core approaches. Solutions for this are still being figured out on the AnnData side as @gszep has pointed out. AnnData's current "backed mode" does not support layers and obsm. I believe random access of serialized awkward arrays requires some additional effort, but will be possible.


To get a better idea of current limitations of scalability, I'll share here my experiments with omniscope's longitudonal COVID19 dataset with 8M TCR-beta chains (and no gene expression data):

  • reading the rearrangement files: ~15min (can be further optimized & parallelized)
  • the index_chains function: 2:30h (I think I can speed this up significantly using numba, Speed up index_chains #386).
  • retrieving a data frame of all "junction_aa" sequences: 13s
    ir.get.airr(adata, "junction_aa")
  • subsetting anndata and retrieving 10k "junction_aa": 34ms
    ir.get.airr(adata[200000:210000], "junction_aa")
  • size of AnnData object on-disk (uncompressed): 23GB

The real bottlenecks of the scirpy workflow are further downstream:

  • levenshtein or alignment distance take ~24h on 44 cpus
  • clonotype calling takes ~24h on 1 cpu, parallelization as currently implemented is not effective.

@bcorrie
Copy link

bcorrie commented Apr 18, 2023

I'm definitely open to add more reader functions to load other AIRR schemas than Rearrangement. Could you maybe point me to an example dataset that has all four data types? Then I'll take a look how to best represent them as AnnData.

Any chance you have an iReceptor Gateway account 8-)

I just did a download of the data from one subject from a cancer study from Yost et al (http://doi.org/DOI:10.1038/s41591-019-0522-3). The download contains three Repertoires from a single subject at different time points, two pre-treatment and one post-treatment. The data will have a rearrangement file, a cell file (AIRR JSON format), and a GEX file (AIRR JSON format) (as well as some other files). Each will contain all of the data of one type from all three Repertoires. That is the rearrangement file will contain rearrangements from all three repertoires (actually six repertoires in the rearrangement case, as they are split into TRA and TRB for each time point). It is a 10X study with ~4500 Cells across all three time points.

The ZIP file is 180MB so probably want to share it outside of github.

@grst
Copy link
Collaborator Author

grst commented Jun 9, 2023

Happy to announce that the new datastructure is now rolled out as part of scirpy v0.13.
Please check out the release notes.

@javh
Copy link

javh commented Jun 9, 2023

Excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

10 participants