
Dataset Edit Performance Improvements #10890

Draft
wants to merge 44 commits into develop

Conversation

qqmyers
Member

@qqmyers qqmyers commented Sep 27, 2024

What this PR does / why we need it: This PR includes multiple changes to the UpdateDatasetVersionCommand to improve performance and scalability when editing datasets with large numbers of files. Key changes include:

  • Adding a feature flag to allow disabling the edit-draft logging (separate log files that report changes being made by the current user)
  • Changing functionality to not update the lastmodifieddate on existing files (since they do not change)
  • The DatasetVersionDifference optimizations from IQSS/10814 Improve dataset version differencing #10818 (these only improve the time when edit-draft reporting is still enabled)
  • Doing an initial merge of the dataset and avoiding subsequent merge/flush operations
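
The first bullet above can be sketched as a simple gate around the edit-draft reporting path. This is an illustrative sketch only: the flag name, the property mechanism, and the method names here are placeholders, not the actual Dataverse FeatureFlags API, which reads flags via MicroProfile Config.

```java
// Hypothetical sketch of the feature-flag gate described above. The real
// flag name and lookup mechanism in Dataverse differ; a JVM system property
// stands in for the configuration source here.
public class EditDraftLoggingGate {

    // Placeholder for the feature-flag lookup; returns true iff the
    // system property exists and equals "true".
    static boolean editDraftLoggingDisabled() {
        return Boolean.getBoolean("dataverse.feature.disable-edit-draft-logging");
    }

    // Placeholder for the per-edit change report. When the flag is set,
    // the expensive work (computing the version difference and writing a
    // file under the edit-drafts folder) is skipped entirely.
    static String maybeWriteEditDraftLog(String changeSummary) {
        if (editDraftLoggingDisabled()) {
            return null; // no difference computation, no log file I/O
        }
        return "edit-drafts/change.log: " + changeSummary;
    }

    public static void main(String[] args) {
        System.setProperty("dataverse.feature.disable-edit-draft-logging", "true");
        System.out.println(maybeWriteEditDraftLog("description changed") == null
                ? "logging skipped" : "logging performed");
    }
}
```

The point of gating at this level is that none of the downstream cost (differencing, serialization, file writes) is incurred when the flag is on.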

Which issue(s) this PR closes:

Closes #10138

Special notes for your reviewer: In my testing on a dataset with 10K files, the time required for the UpdateDatasetVersionCommand in the DatasetPage.save() method to complete (as measured by logging in the save method) when a one-character change was made to the description averaged ~30 seconds. With all the changes in this PR, it now takes ~12-13 seconds. In general, verifying the impact of individual changes is hard:

  • I see variations of ~2 seconds between repeat runs
  • The first run after deployment can be ~3-4 seconds longer
  • Simply logging the time a statement takes can be misleading: in one iteration, I saw that calculating the md5 hash of the :CVocConf setting was taking 2 seconds! While moving the retrieval of that setting, as done in this PR, reduced that time to ~1 ms and produced an overall improvement, the overall change was much smaller than 2 seconds - it looks like parallel operations were just slowing that step.
  • Similarly, while IQSS/10814 Improve dataset version differencing #10818 reduced the differencing time from ~12 seconds to < 1 second when run after other operations, trying to do it early led to a ~4-5 second run time - my guess is that some of the time is in lazy loading elements used in the differencing, but I'm not sure.
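
The :CVocConf observation above is an instance of hoisting an invariant lookup out of a per-file loop. A minimal sketch, with placeholder names (fetchCvocConf and processFiles are hypothetical, not the actual Dataverse methods):

```java
import java.util.List;

// Illustrative only: fetch an invariant setting once per command instead of
// re-reading (and re-hashing) it for every file in the dataset.
public class SettingHoisting {

    static int lookups = 0;

    // Stands in for the expensive settings fetch (e.g. the :CVocConf JSON,
    // whose md5 hash was being recomputed repeatedly before the change).
    static String fetchCvocConf() {
        lookups++;
        return "{\"fields\": []}";
    }

    static void processFiles(List<String> files) {
        String cvocConf = fetchCvocConf(); // hoisted: one lookup per command
        for (String file : files) {
            // per-file work consults the cached value instead of re-fetching
            if (cvocConf.isEmpty()) {
                throw new IllegalStateException("missing config for " + file);
            }
        }
    }

    public static void main(String[] args) {
        processFiles(List.of("a.csv", "b.csv", "c.csv"));
        System.out.println("lookups=" + lookups);
    }
}
```

With 10K files, turning an O(files) lookup into O(1) is exactly the kind of change whose benefit only shows up at scale.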

That said, I would estimate that the first two changes contribute ~4-second reductions each (the feature flag alone would save ~12 seconds, but the differencing PR already saves ~8 seconds there).
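
Given the ~2-second run-to-run variance and the slower first run noted above, single timings are unreliable; averaging repeated runs after a warm-up is the usual mitigation. A self-contained sketch (the Runnable here is a stand-in, not the real update command):

```java
import java.util.concurrent.TimeUnit;

// Illustrative timing harness: discard warm-up runs (deployment/JIT effects),
// then average several timed runs to smooth out run-to-run variance.
public class CommandTimer {

    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    static double averageMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // untimed warm-up
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += timeMillis(task);
        }
        return (double) total / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for UpdateDatasetVersionCommand.
        Runnable fakeUpdateCommand = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) { sum += i; }
        };
        double avg = averageMillis(fakeUpdateCommand, 1, 5);
        System.out.println("avg >= 0: " + (avg >= 0));
    }
}
```

Even with averaging, the caveat about parallel operations still applies: a step's measured time can include waiting on unrelated concurrent work.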

Suggestions on how to test this: All the automated tests should pass. Any/all variants of making changes to a dataset should work as before, and there should be no changes w.r.t. the db-level updates except for the change to not update datafile lastmodified dates. Both overall performance and scaling should be improved. The simplest way to test that might be to turn on fine logging for the DatasetPage, where I've added logging of the time to run the update command. (Note that the overall time seen in the UI includes both the time to save the changes and the time to reload the page. The latter, with 10K files, is still many seconds and hasn't been improved in this PR.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Probably one for any/all performance updates going into 6.5 along with announcing the feature flag and change to file last modified behavior.

Additional documentation: to be added

@coveralls

coveralls commented Sep 27, 2024

Coverage Status

coverage: 21.315% (+0.4%) from 20.868%
when pulling d3ea94d on GlobalDataverseCommunityConsortium:DANS_Performance2
into d039a10 on IQSS:develop.

@qqmyers qqmyers marked this pull request as draft September 27, 2024 16:54
it apparently causes new datafiles, which don't yet have create dates, to start being persisted, causing a failure with an SQL insert into the dataset table with a null createdate. Moving it back to this location assures the datafiles have create dates and avoids the exception. It's not clear to me why trying to get the authenticatedUser in the updateDatasetUser() call causes this.
CacheFactoryBeanTest.testAuthenticatedUserGettingRateLimited:171
expected: <120> but was: <122>
@qqmyers qqmyers added the GDCC: DANS related to GDCC work for DANS label Sep 30, 2024
previously gave a "null encountered in unit of work clone" error
Labels
GDCC: DANS related to GDCC work for DANS Type: Feature a feature request
Development

Successfully merging this pull request may close these issues.

Number of log files under edit-drafts folder is over growing
3 participants