(feat): read_lazy for whole AnnData lazy-loading + xarray reading + read_elem_as_dask -> read_elem_lazy #1247

Open · wants to merge 395 commits into main
Conversation

@ilan-gold (Contributor) commented on Nov 30, 2023

This PR is a lighter-weight version of #947 that uses the original AnnData class to hold obs and var as xr.Dataset objects.


codecov bot commented Dec 7, 2023

Codecov Report

Attention: Patch coverage is 94.10377% with 25 lines in your changes missing coverage. Please review.

Project coverage is 85.04%. Comparing base (3260222) to head (310191c).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/anndata/experimental/backed/_lazy_arrays.py | 91.46% | 7 Missing ⚠️ |
| src/anndata/_core/storage.py | 37.50% | 5 Missing ⚠️ |
| src/anndata/experimental/backed/_compat.py | 84.21% | 3 Missing ⚠️ |
| src/anndata/experimental/backed/_xarray.py | 95.52% | 3 Missing ⚠️ |
| src/anndata/tests/helpers.py | 85.00% | 3 Missing ⚠️ |
| src/anndata/_io/specs/registry.py | 87.50% | 2 Missing ⚠️ |
| src/anndata/_io/specs/lazy_methods.py | 98.50% | 1 Missing ⚠️ |
| src/anndata/experimental/backed/_io.py | 97.87% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1247      +/-   ##
==========================================
- Coverage   86.93%   85.04%   -1.90%     
==========================================
  Files          40       45       +5     
  Lines        6039     6419     +380     
==========================================
+ Hits         5250     5459     +209     
- Misses        789      960     +171     
| Files with missing lines | Coverage Δ |
|---|---|
| src/anndata/_core/aligned_df.py | 97.87% <100.00%> (+0.09%) ⬆️ |
| src/anndata/_core/anndata.py | 83.77% <100.00%> (+0.04%) ⬆️ |
| src/anndata/_core/index.py | 94.03% <100.00%> (+0.70%) ⬆️ |
| src/anndata/_core/merge.py | 85.73% <100.00%> (-9.25%) ⬇️ |
| src/anndata/_core/views.py | 85.71% <100.00%> (-5.40%) ⬇️ |
| src/anndata/_io/specs/__init__.py | 100.00% <ø> (ø) |
| src/anndata/_io/zarr.py | 83.75% <100.00%> (+0.20%) ⬆️ |
| src/anndata/_types.py | 86.11% <100.00%> (+0.81%) ⬆️ |
| src/anndata/experimental/__init__.py | 100.00% <100.00%> (ø) |
| src/anndata/experimental/backed/__init__.py | 100.00% <100.00%> (ø) |

... and 9 more

... and 4 files with indirect coverage changes

@ilan-gold added this to the 0.11.0 milestone on Jul 2, 2024
@ilan-gold self-assigned this on Jul 2, 2024
@ilan-gold changed the base branch from main to ig/read_dask_elem on Jul 9, 2024 at 15:44
@ilan-gold mentioned this pull request on Jul 10, 2024
@flying-sheep (Member) left a comment
Please make a PR into the notebook repo; there are some things broken in the notebook that I’d like to properly review.

Also I’m still not a fan of the exports from private namespaces, but I’m OK with it if there are at least minimal docstrings for them.

This is very close! Thanks for taking on this huge thing!

Comment on lines +152 to +154
experimental.backed._lazy_arrays.MaskedArray
experimental.backed._lazy_arrays.CategoricalArray
experimental.backed._xarray.Dataset2D
Yeah, I think they only work for param docs. I should get around to checking whether I can extend it to work for pure Sphinx as well.

src/anndata/experimental/backed/_xarray.py (resolved)
assert index.dtype != int, msg
from ..experimental.backed._compat import xr

# TODO: why is this here? All tests pass without it and it seems at the minimum not strict enough.
asserts in runtime code are purely there to make debugging easier in case the asserts fail.

Hmm, so it’s not possible that _normalize_index ever gets called with the wrong type of index as a result of user action?

If that’s impossible, feel free to remove. Otherwise we should probably update this check to a TypeError or so.
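
If the check stays, here is a minimal sketch of the suggested upgrade (hypothetical code, not from the PR): raising a TypeError instead of asserting means a user-supplied integer index fails loudly even under `python -O`, which strips asserts.

```python
import numpy as np

def _check_not_integer_index(index: np.ndarray) -> None:
    # Hypothetical replacement for `assert index.dtype != int, msg`:
    # an explicit exception survives `python -O` and gives users a clear error.
    if np.issubdtype(index.dtype, np.integer):
        raise TypeError(f"Expected a non-integer index, got dtype {index.dtype!r}")
```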

dtype = "object"
else:
dtype = col.dtype.numpy_dtype
# TODO: get good chunk size?
small reminder in case you have an idea for this TODO
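
One possible direction for the TODO, sketched under assumptions (`col` stands in for the on-disk column; this is not the PR's code): let dask derive chunk sizes from its configured target chunk bytes rather than hard-coding them.

```python
import dask.array as da

# chunks="auto" sizes chunks from the dtype and dask's array.chunk-size
# config (default 128 MiB) — a reasonable default when nothing better is known.
arr = da.from_array(col, chunks="auto")
```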

index_label: str,
index_key: str,
index: np.NDArray,
) -> Generator[tuple[str, DataArray]]:
Suggested change
- ) -> Generator[tuple[str, DataArray]]:
+ ) -> Generator[tuple[str, DataArray], None, None]:
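
For context on the suggestion (my summary, not part of the review): `Generator` is parameterized as `Generator[YieldType, SendType, ReturnType]`, and only Python 3.13 (via PEP 696) gave `SendType` and `ReturnType` defaults of `None`. On the versions anndata supports, all three parameters must be spelled out, or `Iterator` used when send/return values are unused:

```python
from collections.abc import Generator, Iterator

def pairs_explicit() -> Generator[tuple[str, int], None, None]:
    yield ("a", 1)  # yields tuples, sends nothing, returns nothing

def pairs_short() -> Iterator[tuple[str, int]]:
    yield ("a", 1)  # equivalent annotation for a plain generator
```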

id="consecutive integer array",
),
pytest.param(
np.random.choice(np.arange(800, 1100), 500),
isn’t that randint?
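
Illustrating the quip (not code from the PR): `choice` over an `arange` samples with replacement, which is what `randint` does directly, without materializing the intermediate array.

```python
import numpy as np

a = np.random.choice(np.arange(800, 1100), 500)  # uniform ints in [800, 1100), with replacement
b = np.random.randint(800, 1100, 500)            # same distribution, one call, no temp array
```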

Comment on lines +523 to +528
maybe_warning_context = (
pytest.warns(UserWarning, match=r"Concatenating with a pandas numeric")
if not load_annotation_index
else nullcontext()
)
with maybe_warning_context:
thanks to Python 3.10, we can now do this:

Suggested change
- maybe_warning_context = (
-     pytest.warns(UserWarning, match=r"Concatenating with a pandas numeric")
-     if not load_annotation_index
-     else nullcontext()
- )
- with maybe_warning_context:
+ with (
+     pytest.warns(UserWarning, match=r"Concatenating with a pandas numeric")
+     if not load_annotation_index
+     else nullcontext()
+ ):



# remote has object dtype, need to convert back for integers booleans etc.
def correct_extension_dtype_differences(remote: pd.DataFrame, memory: pd.DataFrame):
why does remote have object dtypes? can that be fixed ahead of time? will that affect users or is that just here in the tests?

also for clarity, please call it fix_extension_dtype_differences or unify_extension_dtypes and make the comment a docstring. I was unsure if correct was meant as a verb or something like make_correct_...

@ilan-gold (Contributor, Author) commented on Oct 25, 2024
I will change the name and add a better comment, since it's only used for concat. There is no way of concatenating these lazy pandas adapter arrays we have, so to make them usable we wrap each in a dask array when they are concatenated. Practically, this doesn't change anything in terms of IO, but it means that we lose the original data type on the round trip. We could maybe store the original data type in uns or something, but:

  1. People concatenating remote/large datasets probably won't be reading them into memory very often.
  2. This gets complicated quickly with mixed data types: if you have two similarly named columns with different numerical data types, we need to start dealing with upcasting/downcasting.

Feels like a cross-that-bridge-when-we-come-to-it issue
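
A hedged sketch of what the renamed test helper could look like (the name `unify_extension_dtypes` is the reviewer's suggestion above; the body is illustrative, not the merged code): cast the round-tripped frame's object columns back to the in-memory frame's dtypes before comparing.

```python
import pandas as pd

def unify_extension_dtypes(
    remote: pd.DataFrame, memory: pd.DataFrame
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Cast `remote`'s object-dtype columns to `memory`'s dtypes for comparison."""
    object_cols = remote.columns[remote.dtypes == object]
    return remote.astype(memory.dtypes[object_cols].to_dict()), memory
```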

Comment on lines +166 to +173
@pytest.fixture
def concatenation_objects(
adatas_paths_var_indices_for_concatenation,
) -> tuple[list[AnnData], list[pd.Index], list[AccessTrackingStore], list[AnnData]]:
adatas, paths, var_indices = adatas_paths_var_indices_for_concatenation
stores = [AccessTrackingStore(path) for path in paths]
lazys = [read_lazy(store) for store in stores]
return adatas, var_indices, stores, lazys
please split this into 4 fixtures; that way the typing below is much simpler, and you can just do

def mystest(objs1, objs2, objs3, objs4): ...

instead of

def mystest(concatenation_objects):
    objs1, objs2, objs3, objs4 = concatenation_objects
    ...
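
A sketch of the requested split, reusing names from the fixture above (assumed; the actual refactor in the PR may differ). Each piece becomes its own, individually typed fixture, chained off the combined one:

```python
import pytest

@pytest.fixture
def stores(adatas_paths_var_indices_for_concatenation) -> list[AccessTrackingStore]:
    # AccessTrackingStore is imported in the test module, as in the snippet above
    _, paths, _ = adatas_paths_var_indices_for_concatenation
    return [AccessTrackingStore(path) for path in paths]

@pytest.fixture
def lazy_adatas(stores) -> list[AnnData]:
    return [read_lazy(store) for store in stores]
```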

return request.param


@pytest.fixture(params=[True, False], scope="session")
why did you mark this as solved?
