Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calling np.asarray on lazy indexing classes #6874

Merged
merged 59 commits into from
Mar 31, 2023
Merged

Conversation

dcherian
Copy link
Contributor

@dcherian dcherian commented Aug 3, 2022

This is motivated by https://docs.rapids.ai/api/kvikio/stable/api.html#kvikio.zarr.GDSStore which on read loads the data directly into GPU memory.

Currently we rely on np.asarray to convert a BackendArray wrapped with a number of lazy indexing classes to a real array but this breaks for GDSStore because the underlying array is a cupy array, so using np.asarray raises an error.

np.asarray will raise if a non-numpy array is returned so we need to use something else.

Here I added get_array which like np.array recurses down until it receives a duck array.

Quite a few things are broken I think , but I'd like feedback on the approach.

I considered np.asanyarray(..., like=...) but that would require the lazy indexing classes to know what they're wrapping which doesn't seem right.

Ref: xarray-contrib/cupy-xarray#10 which adds a kvikio backend entrypoint

dcherian and others added 6 commits August 3, 2022 09:08
This returns the underlying array type instead of always casting
to np.array. This is necessary for Zarr stores where the
Zarr Array wraps a cupy array (for example kvikio.zarr.GDSStoree).
In that case, we cannot call np.asarray because __array__ is
expected to always return a numpy array.

We use get_array in Variable.data to make sure we don't load arrays
from such GDSStores.
instead of always casting to np.asarray
@shoyer
Copy link
Member

shoyer commented Aug 3, 2022

As I understand it, the main purpose here is to remove Xarray lazy indexing class.

Maybe call this get_duck_array(), just to be a little more descriptive?

@dcherian dcherian changed the title Avoid calling np.asarray on backend arrays Avoid calling np.asarray on lazy indexing classes Aug 3, 2022
xarray/core/indexing.py Outdated Show resolved Hide resolved
@dcherian dcherian marked this pull request as ready for review August 3, 2022 20:57
@dcherian dcherian requested a review from shoyer August 3, 2022 20:58
xarray/core/common.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
# so we need the explicit check for ExplicitlyIndexed
if isinstance(array, ExplicitlyIndexed):
array = array.get_duck_array()
return _wrap_numpy_scalars(array)
Copy link
Contributor Author

@dcherian dcherian Jan 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding _wrap_numpy_scalars allows us to handle scalars being returned by the backend. This seems OK to me in that we place fewer restrictions on the backend (and is backward compatible).

def _wrap_numpy_scalars(array):
"""Wrap NumPy scalars in 0d arrays."""
if np.isscalar(array):
return np.array(array)
else:
return array

But now the issue is that we should pass an appropriate like argument to np.array but I don't see how to that from a scalar array

Good news is that backends can avoid this complication by returning arrays, so we could just ignore this ugly bit for now.

xarray/backends/common.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
xarray/core/indexing.py Outdated Show resolved Hide resolved
xarray/tests/test_dataset.py Outdated Show resolved Hide resolved
xarray/tests/__init__.py Outdated Show resolved Hide resolved
xarray/tests/__init__.py Outdated Show resolved Hide resolved
dcherian and others added 2 commits February 16, 2023 12:51
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
@dcherian
Copy link
Contributor Author

@Illviljan feel free to push any typing changes to this PR. I think that would really help clarify the interface. I tried adding a DuckArray type but that didn't go to far.

@Illviljan
Copy link
Contributor

I don't have a better idea than to do DuckArray = Any # ndarray/cupy/sparse etc. and add that as output, but that wont change anything mypy-wise besides making it easier for us to read the code.

@dcherian
Copy link
Contributor Author

T_ExplicitlyIndexed may be a different thing to add

xarray/core/indexing.py Outdated Show resolved Hide resolved
dcherian and others added 9 commits March 3, 2023 16:49
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
* main: (40 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
@dcherian dcherian added the plan to merge Final call for comments label Mar 26, 2023
@dcherian
Copy link
Contributor Author

I'd like to merge this at the end of next week.

It now has tests and should be backwards compatible with external backends.

A good next step would be to finish up #7020

@Illviljan Illviljan mentioned this pull request Mar 31, 2023
4 tasks
@dcherian dcherian merged commit aa4361d into pydata:main Mar 31, 2023
@dcherian dcherian deleted the kvikio branch March 31, 2023 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-indexing topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants