Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add known forth for ATLAS #1282

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

nikoladze
Copy link
Contributor

@nikoladze nikoladze commented Aug 30, 2024

Since edge cases still keep popping up for types in PHYSLITE that actually have always the same interpretation (most prominently vector<vector<ElementLink<...>>> we were discussing to include a mechanism to pre-define known forth code for special cases.

This PR attempts to add such a mechanism and define a special treatment for vector<vector<ElementLink<...>>> in ATLAS PHYSLITE files. One needs to define 2 things:

  • A string containing the full forth code used to deserialize the binary data
  • An awkward form that references the correct array names used in the forth code via form_key

These 2 properties are picked up in the constructor of the generic AsObjects interpretation, ensuring that no forth code needs to be generated (and therefore also no python loop through any of the basket data) when the array is read with library="ak" (default).

In the current state it seems to be working, but at the moment breaks a lot of tests, mainly because the hardcoded form now does not match anymore excactly what it was before. Looking into how to resolve this.

tagging @matthewfeickert, @alexander-held

@nikoladze nikoladze changed the title Add known forth for ATLAS feat: Add known forth for ATLAS Sep 2, 2024
@matthewfeickert
Copy link
Member

@nikoladze is this ready for review? Or were there additional things that you needed to add to it?

@nikoladze
Copy link
Contributor Author

@nikoladze is this ready for review? Or were there additional things that you needed to add to it?

I think i'm getting there. Before merging i also want to write a bit more info into the docstrings, e.g. how to add further stuff in the future.

Current status:

  • i now implemented this based on a lookup in the AsObjects constructor as discussed with @jpivarski. I think this was necessary (as opposed to putting it into interpretation_of) since other places in the code need the Model of the interpretation. So now only the forth code is special cased, not the whole interpretation
  • In addition to that, also the awkward form needs to be provided since this has to reference the names of the buffers that come out of running the forth code. I implemented this now as a member function of a class that knows about the typename since the awkward forms coming out of uproot also carry the typename that corresponds to the RecordArray (other tests would fail if the form is different because of this)
  • For start i implemented this now for some of the vector<vector<ElementLink<DataVector<T>>>> types which cause most of the problems.

Some problems that this PR should fix (for vector<vector<ElementLink...):

  • In some cases all vectors of a basket are empty, causing the discovery mechanism to loop (in python) through the whole data of the basket which may be slow (performance issue). If the type for these is now part of the special cases then the predefined forth code will be used instead
  • In some cases there will be a CannotBeAwkward exception, e.g. in the test file some streamer info for std::vector<std::vector<ElementLink<DataVector<xAOD::NeutralParticle_v1>>>> is missing. Now also here the special cased forth code (and awkward form) will be picked up.

@@ -122,6 +130,13 @@ def awkward_form(
tobject_header=False,
breadcrumbs=(),
):
awkward = uproot.extras.awkward()
if self._form is not None: # TODO: is this really fine?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski i'm a bit uncomfortable with the piece of code i added here. The _form attribute already existed before, but was never directly returned here. The tests seem to pass, but do you know if there are potential problems that may be caused by directly returning _form if it is not None?

One thing that i had to do (and also don't quite feel comfortable with) was to convert the form from a dict representation which seems to happen sometimes (but i'm a bit unsure where and why not always).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still trying to understand your first paragraph, but about the second paragraph: there are three ways that a Form can be represented,

  • as a ak.forms.Form object, which is immutable and therefore needs to be entirely known or at least built from the leaves up to the root
  • as a Python dict (JSON represented as dicts and lists), which isn't type-safe, but it can be modified in place
  • as a JSON string, which doesn't even ensure that pairs of square brackets are closed.

Thus, they have decreasing levels of safety and we prefer the safer ones when possible. The problem is that the process of discovering the type involves starting at the root and walking down toward the leaves, filling in a ListOffsetArray's content if we see a non-empty example of it. So the ideal, ak.forms.Form, isn't possible and the Form starts its life as a Python dict. When it gets converted into an object, that's final. The Forth is fully discovered at that point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i now realized that the self._form attribute will always be in the Python dict form (either from generated or known forth code). So at the place where it is returned in AsObjects.awkward_form(...) it has to be converted to the final ak.forms.Form object. In principle the form from the known forth path could have been provided directly as ak.forms.Form, but to be consistent with the one created from the ForthGenerator it is now also a dict.

@nikoladze nikoladze marked this pull request as ready for review September 17, 2024 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants