Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future changes to Awkward Array behavior class resolution #1035

Open
agoose77 opened this issue Jun 28, 2023 · 14 comments · Fixed by #1040 · May be fixed by #1070
Open

Future changes to Awkward Array behavior class resolution #1035

agoose77 opened this issue Jun 28, 2023 · 14 comments · Fixed by #1040 · May be fixed by #1070

Comments

@agoose77
Copy link

agoose77 commented Jun 28, 2023

This is not an existing bug, but I'm raising it now so that we're on the same page.

Over at Awkward Array, we've come to the conclusion that the __array__ mechanism, used to facilitate categorical and string arrays, should be a built-in feature, rather than implemented using Awkward behaviors. We were planning upon introducing a new parameter that mirrors __record__ to indicate the nominal type of an array. This change would mean __array__ no longer affects the loaded class (type(arr)) of an Awkward Array. Anndata uses the __array__ parameter to wrap Awkward Arrays in custom view wrappers.

The mechanism that we were originally thinking of was a new parameter __list__ that can only attach to list-types (arrays with >1 dimension). Clearly, this is less general than the existing functionality, so I was wondering if you'd be able to clarify the kinds of Awkward Arrays that anndata needs to support.

Before we get into the details, the most important point is that we do not want to make existing use cases impossible. So, I'm hoping that we can figure out what anndata needs, and rework our proposal.

When we were conceiving of these changes, we did not have any examples of custom arrays in the forefront of our minds. We did know of the use of __array__ in anndata, but it certainly slipped my mind during these discussions. Pinging @jpivarski for visiblity.

@agoose77
Copy link
Author

I took another look at

try:
from ..compat import awkward as ak
import weakref
# Registry to store weak references from AwkwardArrayViews to their parent AnnData container
_registry = weakref.WeakValueDictionary()
_PARAM_NAME = "_view_args"
class AwkwardArrayView(_ViewMixin, AwkArray):
@property
def _view_args(self):
"""Override _view_args to retrieve the values from awkward arrays parameters.
Awkward arrays cannot be subclassed like other python objects. Instead subclasses need
to be attached as "behavior". These "behaviors" cannot take any additional parameters (as we do
for other data types to store `_view_args`). Therefore, we need to store `_view_args` using awkward's
parameter mechanism. These parameters need to be json-serializable, which is why we can't store
ElementRef directly, but need to replace the reference to the parent AnnDataView container with a weak
reference.
"""
parent_key, attrname, keys = self.layout.parameter(_PARAM_NAME)
parent = _registry[parent_key]
return ElementRef(parent, attrname, keys)
def __copy__(self) -> AwkArray:
"""
Turn the AwkwardArrayView into an actual AwkwardArray with no special behavior.
Need to override __copy__ instead of `.copy()` as awkward arrays don't implement `.copy()`
and are copied using python's standard copy mechanism in `aligned_mapping.py`.
"""
array = self
# makes a shallow copy and removes the reference to the original AnnData object
array = ak.with_parameter(self, _PARAM_NAME, None)
array = ak.with_parameter(array, "__array__", None)
return array
@as_view.register(AwkArray)
def as_view_awkarray(array, view_args):
parent, attrname, keys = view_args
parent_key = f"target-{id(parent)}"
_registry[parent_key] = parent
# TODO: See https://github.com/scverse/anndata/pull/647#discussion_r963494798_ for more details and
# possible strategies to stack behaviors.
# A better solution might be based on xarray-style "attrs", once this is implemented
# https://github.com/scikit-hep/awkward/issues/1391#issuecomment-1412297114
if type(array).__name__ != "Array":
raise NotImplementedError(
"Cannot create a view of an awkward array with __array__ parameter. "
"Please open an issue in the AnnData repo and describe your use-case."
)
array = ak.with_parameter(array, _PARAM_NAME, (parent_key, attrname, keys))
array = ak.with_parameter(array, "__array__", "AwkwardArrayView")
return array
ak.behavior["AwkwardArrayView"] = AwkwardArrayView

It seems like the point of this array class is to intercept __setitem__ for copy-on-write semantics. Could you clarify whether my understanding is correct. If so, then we don't need the array class at all? Awkward Arrays are (mostly) immutable, as the structure of the array is stored in a separate object to the high-level interface. I'm thinking that we just need to ensure that any view produces a new outer ak.Array via ak.Array(array).

@agoose77
Copy link
Author

Tagging @ivirshup for visibility :)

@ivirshup
Copy link
Member

ivirshup commented Jul 6, 2023

Hey, sorry about the delayed response.

It seems like the point of this array class is to intercept setitem for copy-on-write semantics.

Yes.

Awkward Arrays are (mostly) immutable, as the structure of the array is stored in a separate object to the high-level interface. I'm thinking that we just need to ensure that any view produces a new outer ak.Array via ak.Array(array).

Yes, but... It's not just copy-on-write for the ak.Array, it's also copy on write for the parent AnnData object.

cc: @grst

@ivirshup
Copy link
Member

ivirshup commented Jul 6, 2023

Also @grst, for some context I believe the change mentioned here happened in awkward 2.3 and is causing test failures. I think the fix is super easy, but would appreciate your eyes on it (#1040).


@agoose77 could you also take a look at the linked PR?

@agoose77
Copy link
Author

agoose77 commented Jul 6, 2023

Yes, but... It's not just copy-on-write for the ak.Array, it's also copy on write for the parent AnnData object.

@ivirshup Is this required? Would it break your invariants if accessing an ak.Array always returned a shallow copy?

@ivirshup
Copy link
Member

ivirshup commented Jul 6, 2023

The overall idea is that subsetting an anndata does not actually subset all of its contents until you need them. E.g. subsetting is lazy. However, we track this at the AnnData object level, not at each element level.

Since we don't want to write back updates, we just want it to act like you took a subset, we have copy on write behavior for each element of the AnnData. Since this is just tracked at the AnnData level, making one element "actual" means we have to make all elements actual. So we need to have a way for the parent anndata to know a modification has been made to a child element – hence our view classes.

Is this required? Would it break your invariants if accessing an ak.Array always returned a shallow copy?

How would you set any values on an awkward array? Or do you mean only when it's an anndata style view?

@agoose77
Copy link
Author

agoose77 commented Jul 6, 2023

I can see that this is confusing! I'll reply exclusively on #1040 from now on.

@ivirshup
Copy link
Member

Going to keep this open until we are happy with a solution. Current status summarized here: #1040 (comment)

@grst grst linked a pull request Jul 24, 2023 that will close this issue
3 tasks
@grst
Copy link
Contributor

grst commented Aug 22, 2023

I met with @jpivarski and @agoose77 yesterday.

We came to the conclusion that we might not need any copy-on-write behavior for views of awkward arrays, because a slice of an awkward array is always a (shallow) copy anyway. There's therefore no risk of modifying the original awkward array when setting a record on an awkward array within an AnnDataView. We could therefore get rid of all the custom code around awkward array views which is, ultimately, a hack that abuses custom behaviors for something they were not designed for.

Only downside is that AnnDataView behaves differently for awkward arrays than for other backends. Updating the awkward array will not trigger an init_as_actual.

In any case I'll first come up with a bunch of test cases to make sure everything works as expected.

@grst grst self-assigned this Aug 22, 2023
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@github-actions github-actions bot added the stale label Oct 23, 2023
@grst
Copy link
Contributor

grst commented Oct 23, 2023

This is still an issue. The proposed fix is in #1070, but pending some more fundamental decision on how to cache views.

@grst grst removed the stale label Oct 23, 2023
@agoose77
Copy link
Author

@grst I'm a little lost with where things are! My understanding from #1035 is that we don't need to create array classes, and instead can simply shallow copy arrays where appropriate. Is that no longer the case?

@grst
Copy link
Contributor

grst commented Oct 23, 2023

@agoose77, it is still the case and this is what is proposed in #1070.

However, @ivirshup pointed out in #1070 (comment) that when creating the view only when accessing the data, it is impossible to assign any data to a record, e.g. using

v.obsm["awk"]["b"] = [5, 6]

because a new shallow copy will be created every time v.obsm["awk"] is accessed.

This is why in the most recent version I suggested to create the shallow copy already during creating of the AnnDataView, i.e. on

v = a[:2]

in the same example, which Isaac referred to as "caching" the view.

But apparently this might conflict with the proposed fix for an issue with garbage collection (#1119).

In any case, I don't think we need anything from the awkward side, but just need to find a way how to reconcile cached views with garbage collection.

Copy link

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@github-actions github-actions bot added the stale label Dec 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants