
Dataset Edit Performance Improvements #10890

Draft
wants to merge 44 commits into develop

Conversation

qqmyers
Member

@qqmyers qqmyers commented Sep 27, 2024

What this PR does / why we need it: This PR includes multiple changes to the UpdateDatasetVersionCommand to improve performance and scalability when editing datasets with large numbers of files. Key changes include:

  • Adding a feature flag to allow disabling the edit-draft logging (separate log files that report changes being made by the current user)
  • Changing functionality to not update the lastmodifieddate on existing files (since they do not change)
  • The DatasetVersionDifference optimizations from IQSS/10814 Improve dataset version differencing #10818 (these only improve the time when edit-draft reporting is still enabled)
  • Doing an initial merge of the dataset and avoiding subsequent merge/flush operations
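
The first bullet above can be sketched as a simple gate around the edit-draft reporting path. This is an illustrative sketch only: the flag name, the property mechanism, and the method names here are placeholders, not the actual Dataverse FeatureFlags API, which reads flags via MicroProfile Config.

```java
// Hypothetical sketch of the feature-flag gate described above. The real
// flag name and lookup mechanism in Dataverse differ; a JVM system property
// stands in for the configuration source here.
public class EditDraftLoggingGate {

    // Placeholder for the feature-flag lookup; returns true iff the
    // system property exists and equals "true".
    static boolean editDraftLoggingDisabled() {
        return Boolean.getBoolean("dataverse.feature.disable-edit-draft-logging");
    }

    // Placeholder for the per-edit change report. When the flag is set,
    // the expensive work (computing the version difference and writing a
    // file under the edit-drafts folder) is skipped entirely.
    static String maybeWriteEditDraftLog(String changeSummary) {
        if (editDraftLoggingDisabled()) {
            return null; // no difference computation, no log file I/O
        }
        return "edit-drafts/change.log: " + changeSummary;
    }

    public static void main(String[] args) {
        System.setProperty("dataverse.feature.disable-edit-draft-logging", "true");
        System.out.println(maybeWriteEditDraftLog("description changed") == null
                ? "logging skipped" : "logging performed");
    }
}
```

The point of gating at this level is that none of the downstream cost (differencing, serialization, file writes) is incurred when the flag is on.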

Which issue(s) this PR closes:

Closes #10138

Special notes for your reviewer: In my testing on a dataset with 10K files, the time required for the UpdateDatasetVersionCommand in the DatasetPage.save() method to complete (as measured by logging in the save method) when a one-character change was made to the description averaged ~30 seconds. With all the changes in this PR, it now takes ~12-13 seconds. In general, verifying the impact of individual changes is hard:

  • I see variations of ~2 seconds between repeat runs
  • The first run after deployment can be ~3-4 seconds longer
  • Simply logging the time a statement takes can be misleading: in one iteration, I saw that calculating the md5 hash of the :CVocConf setting was taking 2 seconds! While moving the retrieval of that setting, as done in this PR, reduced that time to ~1 ms and produced an overall improvement, the overall change was much smaller than 2 seconds - it looks like parallel operations were just slowing that step.
  • Similarly, while IQSS/10814 Improve dataset version differencing #10818 reduced the differencing time from ~12 seconds to < 1 second when run after other operations, trying to do it early led to a ~4-5 second run time - my guess is that some of the time is in lazy loading elements used in the differencing, but I'm not sure.
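
The :CVocConf observation above is an instance of hoisting an invariant lookup out of a per-file loop. A minimal sketch, with placeholder names (fetchCvocConf and processFiles are hypothetical, not the actual Dataverse methods):

```java
import java.util.List;

// Illustrative only: fetch an invariant setting once per command instead of
// re-reading (and re-hashing) it for every file in the dataset.
public class SettingHoisting {

    static int lookups = 0;

    // Stands in for the expensive settings fetch (e.g. the :CVocConf JSON,
    // whose md5 hash was being recomputed repeatedly before the change).
    static String fetchCvocConf() {
        lookups++;
        return "{\"fields\": []}";
    }

    static void processFiles(List<String> files) {
        String cvocConf = fetchCvocConf(); // hoisted: one lookup per command
        for (String file : files) {
            // per-file work consults the cached value instead of re-fetching
            if (cvocConf.isEmpty()) {
                throw new IllegalStateException("missing config for " + file);
            }
        }
    }

    public static void main(String[] args) {
        processFiles(List.of("a.csv", "b.csv", "c.csv"));
        System.out.println("lookups=" + lookups);
    }
}
```

With 10K files, turning an O(files) lookup into O(1) is exactly the kind of change whose benefit only shows up at scale.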

That said, I would estimate that the first two changes contribute ~4-second reductions each (the feature flag alone would save ~12 seconds, but the differencing PR already saves ~8 seconds there).
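
Given the ~2-second run-to-run variance and the slower first run noted above, single timings are unreliable; averaging repeated runs after a warm-up is the usual mitigation. A self-contained sketch (the Runnable here is a stand-in, not the real update command):

```java
import java.util.concurrent.TimeUnit;

// Illustrative timing harness: discard warm-up runs (deployment/JIT effects),
// then average several timed runs to smooth out run-to-run variance.
public class CommandTimer {

    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    static double averageMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // untimed warm-up
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            total += timeMillis(task);
        }
        return (double) total / runs;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for UpdateDatasetVersionCommand.
        Runnable fakeUpdateCommand = () -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) { sum += i; }
        };
        double avg = averageMillis(fakeUpdateCommand, 1, 5);
        System.out.println("avg >= 0: " + (avg >= 0));
    }
}
```

Even with averaging, the caveat about parallel operations still applies: a step's measured time can include waiting on unrelated concurrent work.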

Suggestions on how to test this: All the automated tests should pass. Any/all variants of making changes to a dataset should work as before, and there should be no changes w.r.t. the db-level updates except for the change to not update datafile lastmodified dates. Both overall performance and scaling should be improved. The simplest way to test that might be to turn on fine logging for the DatasetPage, where I've added logging of the time to run the update command. (Note that the overall time seen in the UI includes both the time to save the changes and the time to reload the page. The latter, with 10K files, is still many seconds and hasn't been improved in this PR.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: Probably one for any/all performance updates going into 6.5 along with announcing the feature flag and change to file last modified behavior.

Additional documentation: to be added

@coveralls

coveralls commented Sep 27, 2024

Coverage Status

coverage: 21.315% (+0.4%) from 20.868%
when pulling d3ea94d on GlobalDataverseCommunityConsortium:DANS_Performance2
into d039a10 on IQSS:develop.

@qqmyers qqmyers marked this pull request as draft September 27, 2024 16:54
it apparently causes new datafiles, which don't yet have create dates, to start being persisted, causing a failure with an SQL insert into the dataset table with a null createdate. Moving it back to this location assures the datafiles have create dates and avoids the exception. It's not clear to me why trying to get the authenticatedUser in the updateDatasetUser() call causes this.
CacheFactoryBeanTest.testAuthenticatedUserGettingRateLimited:171
expected: <120> but was: <122>
@qqmyers qqmyers added the GDCC: DANS related to GDCC work for DANS label Sep 30, 2024
previously gave a "null encountered in unit of work clone" error
Labels
GDCC: DANS related to GDCC work for DANS Type: Feature a feature request
Development

Successfully merging this pull request may close these issues.

Number of log files under edit-drafts folder is over growing
3 participants