Refactor reindexing of harvested datasets #10734
Labels: Feature: Harvesting, FY25 Sprint 5, FY25 sprint 5, GREI 3, Search and Browse, Size: 30 (a percentage of a sprint; 21 hours; formerly size:33)
Among the indexing improvements in 6.3 is new logic that avoids deleting Solr documents when an existing, already-indexed dataset is updated. Unfortunately, we still cannot take advantage of this improvement when reindexing (re-)harvested datasets, because the harvesting framework relies on completely deleting an existing harvested dataset and then re-creating it from scratch. So we still go to the trouble of deleting all the Solr documents associated with the dataset and then rebuilding them, even for a minor metadata update (documents, plural, because harvested datasets can have files).
The most obvious way to solve this is to make some straightforward modifications to the dataset destroy framework so that it spares the existing Solr documents during a re-harvesting workflow (a rough sketch of the idea is below). An alternative would be to modify the harvesting framework itself and figure out how to avoid destroying the dataset while still avoiding the creation of multiple versions... but that may be more difficult (?).
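A minimal sketch of the first approach, in deliberately simplified form. All names here are illustrative, not the actual Dataverse classes; in the real codebase this would mean passing a similar flag into the destroy-dataset command and checking it before the Solr delete. It relies on the fact that Solr replaces documents in place when they are indexed under the same unique ID:

```java
// Hypothetical, simplified sketch (not actual Dataverse code): thread a
// "re-harvest" flag through the destroy/re-create cycle so that the existing
// Solr documents survive and are overwritten in place by the re-index.
import java.util.List;

class HarvestReindexSketch {

    interface SolrIndex {
        // Deletes the dataset document plus one document per file.
        void deleteDatasetAndFileDocs(String datasetId);
        // Indexing under the same Solr document IDs replaces the existing
        // documents, so a prior delete is not strictly required.
        void indexDataset(String datasetId, List<String> fileIds);
    }

    static void destroyDataset(String datasetId, boolean partOfReharvest, SolrIndex solr) {
        // ... delete the database rows for the dataset and its files ...
        if (!partOfReharvest) {
            // Normal destroy: the dataset is really going away, so its
            // Solr documents must go too.
            solr.deleteDatasetAndFileDocs(datasetId);
        }
        // During a re-harvest, skip the delete: the dataset is about to be
        // re-created under the same persistent identifier, and re-indexing
        // it will simply overwrite the existing documents.
    }

    static void reharvest(String datasetId, List<String> fileIds, SolrIndex solr) {
        destroyDataset(datasetId, /* partOfReharvest = */ true, solr);
        // ... re-create the dataset from the freshly harvested metadata ...
        solr.indexDataset(datasetId, fileIds); // overwrites docs in place
    }
}
```

One caveat with this approach: the flag should presumably only be set when the re-created dataset is guaranteed to get the same Solr document IDs (i.e., the same persistent identifier), and the per-file documents may still need to be reconciled if the set of files changes between harvests, so that no orphaned documents are left behind.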
By its nature, harvesting often involves modifying a large number of datasets in quick succession, so this can still be a serious performance penalty in production. It would be great to address this before we restart serious-scale harvesting at HDV.