Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: desperate attempts to unlink output arrays #1305

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jpivarski
Copy link
Member

No description provided.

@martindurant
Copy link

--- a/src/uproot/behaviors/TBranch.py
+++ b/src/uproot/behaviors/TBranch.py
@@ -212,10 +212,12 @@ def iterate(
                             report = report.to_global(global_offset)
                             popper = [arrays]
                             del arrays
+                            del item
                             yield popper.pop(), report

                         else:
                             popper = [library.global_index(item, global_offset)]
+                            del item
                             yield popper.pop()

produces objgraph output
Screenshot 2024-10-11 at 13 39 21

I'm happy to run your memory-over-time graph with this change, if you have a script. The inner code from the slack thread was this - you did memray run?

import gc
import uproot

filenames = [{"/tmp/zlib9-jagged0.root": "tree"}] * 5

for batch in uproot.iterate(filenames, step_size="1 GB"):
    # gc.collect()
    print(repr(batch))

@jpivarski
Copy link
Member Author

I can't believe I missed that! If I'm reading this right, item is the second-to-last step in the transformation that produces the final arrays (by mapping a local index to a global one).

I don't have scripts to produce the memray plots from Slack—I'd lose them anyway, if I kept them—so the messages in Slack include the code.

@jpivarski jpivarski marked this pull request as ready for review October 11, 2024 18:57
@jpivarski jpivarski marked this pull request as draft October 11, 2024 18:57
@jpivarski
Copy link
Member Author

I made a to-do item to finish this off. I'll include your diff.

@martindurant
Copy link

My loop for lowest memory usage was like this:

for i, batch in enumerate(uproot.iterate(filenames, step_size="1 GB")):
    # gc.collect()
    if i:
        objgraph.show_growth()
    else:
        objgraph.growth()
    print(repr(batch))
    del batch
  • note the final del, means only one array is live at a time, otherwise the previous one doesn't get cleaned up until the next iteration is ready to be assigned into batch
  • including objgraph in the loop magically stops leak/accumulation; does it cause a gc collect? Without objgraph, I still get big memory growth.
  • objgraph n a stable run shows growth as follows, but most iterations have no growth at all
ReferenceType                  4400       +23
builtin_function_or_method     4269       +23
tuple                         12593        +1

(so not very useful; I don't know what ReferenceType is)

I am thinking there might be many of these non-functional-programming invocations down the graph (with or without generators) that is making refcounting hard for python.

@jpivarski
Copy link
Member Author

  • including objgraph in the loop magically stops leak/accumulation; does it cause a gc collect?

I discovered the same thing with Pympler: it was clearly calling gc.collect() internally. Either that or it allocated so many temporaries that it triggered GC collection the normal way.

@jpivarski
Copy link
Member Author

jpivarski commented Oct 14, 2024

Well, I added that commit (7cfc360), but it didn't fix the original memory issue.

Here's memray of

import gc
import uproot

filenames = [{"/tmp/zlib9-jagged0.root": "tree"}] * 5

for batch in uproot.iterate(filenames, step_size="1 GB"):
    # gc.collect()
    print(repr(batch))

in this PR, before the commit:

image

Here's after the commit:

image

And, just for reference, here it is with the gc.collect() line uncommented (i.e. explicitly collect garbage in every iteration).

image

Even if objgraph isn't showing it, the big arrays are still held until garbage collection.

I'm setting up this PR to be merged, anyway, since it's a step forward, at least.

@jpivarski jpivarski marked this pull request as ready for review October 14, 2024 16:46
@martindurant
Copy link

Yeah; I spent some time trying to find out where the cycle(s) is, but no luck. The code goes deep, and I couldn't figure out which tbranch/basket/etc refer to one-enother. Doing more pop()-tricks helps remove one object from a generator at a time, but doesn't account for accumulation of cyclic references.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants