A current bottleneck in the pipeline is merging the individual organism datasets into a single multi-organism dataset. Historically the data design placed items such as genes and networks for all organisms into shared database tables identified by an internal ID. This created a unified ID space for looking up the data, but it also created interdependencies between the organisms when building a dataset. The revised data pipeline allows each organism to be built independently and adds a merge step that takes care of the interdependencies.
However, the merge step duplicates data (it just re-indexes files to map them into a unified ID space), wasting time and disk space during the build, and it forces re-execution of the data processing steps that follow it (the merge runs at the generic_db level).
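For illustration, a minimal sketch of what the merge step effectively does: re-index each organism's locally assigned IDs into one unified ID space, which means every record that references those IDs has to be rewritten. The class and field names below are made up for the sketch, not the actual pipeline code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MergeSketch {
    record Network(long id, long organismId, String name) {}

    // Re-assign IDs so that networks from all organisms end up in one unified ID space.
    static List<Network> mergeNetworks(List<List<Network>> perOrganism) {
        long nextId = 1;                          // next free ID in the unified space
        List<Network> merged = new ArrayList<>();
        for (List<Network> networks : perOrganism) {
            Map<Long, Long> idMap = new HashMap<>();  // old local ID -> new unified ID
            for (Network n : networks) {
                long newId = nextId++;
                idMap.put(n.id(), newId);
                merged.add(new Network(newId, n.organismId(), n.name()));
            }
            // In the real pipeline, every file that references these IDs must also be
            // rewritten using idMap, which is where the duplicated time and disk space go.
        }
        return merged;
    }
}
```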
This could be improved in a couple of ways:
The 'right' way (but could affect users, requires care):
- update application APIs and binary data products so data is always retrieved by organism, allowing IDs to clash across organisms (IDs only need to be unique within one organism); see the sketch after this list
- then we could skip the merge step entirely and just distribute the individual organism datasets together
- would make it easy to update an individual organism without touching the others
- but requires changes to application code
- but will change the format of the data distributed to users via the plugin, requiring compatibility workarounds
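A minimal sketch of what "retrievable by organism" could look like; this is an assumed shape, not the existing application API. Lookups are keyed by (organism ID, local ID), so the same numeric ID can exist in two organisms without conflict.

```java
import java.util.HashMap;
import java.util.Map;

class PerOrganismIndex<T> {
    // organismId -> (localId -> item); IDs only need to be unique within one organism.
    private final Map<Long, Map<Long, T>> byOrganism = new HashMap<>();

    void put(long organismId, long localId, T item) {
        byOrganism.computeIfAbsent(organismId, k -> new HashMap<>()).put(localId, item);
    }

    T get(long organismId, long localId) {
        Map<Long, T> items = byOrganism.get(organismId);
        return items == null ? null : items.get(localId);
    }
}
```

With this shape, rebuilding or replacing one organism just swaps its inner map, which is what would make per-organism updates cheap.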
The 'wrong' way (but won't affect users):
- data duplication could still be reduced by reserving an ID space for each organism in the organism.cfg properties file (e.g. network ID range X-Y, node ID range W-Z, and so on for attributes and other record types); a sketch of the clash check follows this list
- the merge step could then be simplified to checking that IDs don't clash and simply copying the data
- no changes to user data or application code, only to the pipeline
- but manual and likely error-prone
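A rough sketch of the simplified merge under this approach, assuming hypothetical organism.cfg property names like network_id_min/network_id_max (the real file may use different keys): read the reserved range, verify that every assigned ID falls inside it, then copy the per-organism files through unchanged.

```java
import java.util.List;
import java.util.Properties;

class RangeCheck {
    record IdRange(long min, long max) {
        boolean contains(long id) { return id >= min && id <= max; }
    }

    // e.g. an organism.cfg might declare: network_id_min=10000, network_id_max=19999
    static IdRange networkRange(Properties organismCfg) {
        return new IdRange(
            Long.parseLong(organismCfg.getProperty("network_id_min")),
            Long.parseLong(organismCfg.getProperty("network_id_max")));
    }

    // Fail fast if any network ID escapes its reserved range; otherwise the
    // per-organism files can be copied into the merged dataset unchanged.
    static void checkNetworkIds(List<Long> networkIds, IdRange range) {
        for (long id : networkIds) {
            if (!range.contains(id)) {
                throw new IllegalStateException("network id " + id + " outside reserved range "
                    + range.min() + "-" + range.max());
            }
        }
    }
}
```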