simplify organism data merging #19

Open
kzuberi opened this issue Jan 8, 2015 · 0 comments

kzuberi commented Jan 8, 2015

A current bottleneck in the pipeline processing is merging the separate per-organism datasets into a single multi-organism dataset. Historically the data design placed items like genes and networks for all organisms into shared database tables identified by an internal ID. This created a unified ID space for looking up the data, but also created interdependencies between the organisms when building a dataset. The revised data pipeline allows independent builds of the individual organisms, and adds a merge step that takes care of the interdependencies.

However, the merge step results in duplication of data (it just re-indexes files to map them into the unified ID space), wasting time and disk space during the build, and it forces re-execution of the data processing steps that follow it (the merge step runs at the generic_db level).
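
For illustration, here is a minimal sketch of what that re-indexing amounts to; the function name and data shapes are hypothetical, not the pipeline's actual code:

```python
# Hypothetical sketch of the current merge: every per-organism ID is
# shifted into one unified ID space, so whole files get rewritten even
# though no actual content changes.

def merge_into_unified_space(per_organism_data):
    """per_organism_data: {organism: {local_id: record}}"""
    merged = {}
    next_id = 1
    for organism in sorted(per_organism_data):
        for local_id in sorted(per_organism_data[organism]):
            # copy + re-index: pure duplication of the same record
            merged[next_id] = per_organism_data[organism][local_id]
            next_id += 1
    return merged
```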

This could be improved in a couple of ways:

The 'right' way (but could affect users, requires care):

  • update application APIs and binary data products so that everything is retrieved by organism, allowing IDs from different organisms to clash (each ID only needs to be unique within its own organism); a sketch follows this list
  • then we could simply skip the merge step and just distribute the aggregation of the individual organism datasets
  • would make it easy to update an individual organism without touching the others
  • but requires changes to application code
  • but will change the format of data distributed to users via the plugin, requiring compatibility workarounds
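
As a rough illustration of the first bullet (hypothetical types, not the actual application API), lookups keyed by (organism, local ID) remove the need for a global ID space entirely:

```python
# Sketch only: keying lookups by (organism, local ID) means IDs never
# have to be globally unique, so no merge/re-index step is needed.

from typing import Dict, Tuple

class MultiOrganismDataSet:
    def __init__(self):
        self._networks: Dict[Tuple[str, int], dict] = {}

    def add_network(self, organism: str, local_id: int, network: dict) -> None:
        # The same local_id may appear under several organisms.
        self._networks[(organism, local_id)] = network

    def get_network(self, organism: str, local_id: int) -> dict:
        return self._networks[(organism, local_id)]

ds = MultiOrganismDataSet()
ds.add_network("human", 1, {"name": "coexpression"})
ds.add_network("mouse", 1, {"name": "coexpression"})  # ID 1 clashes across organisms: fine
```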

The 'wrong' way (but won't affect users):

  • could still reduce data duplication by reserving ID space for each organism in the organism.cfg properties file (e.g. network ID range X-Y, node ID range W-Z, and so on for attributes)
  • then the merge step could be simplified to checking that IDs don't clash and just copying the data (see the sketch after this list)
  • no changes to user data or application code, only the pipeline
  • but manual and likely error-prone
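
A rough sketch of what the simplified clash check might look like; the organism.cfg keys and the helper names below are made up for illustration, not the pipeline's actual configuration:

```python
# Hypothetical organism.cfg entries (real keys may differ):
#
#   network.id.range = 10000-19999
#   node.id.range    = 100000-199999
#
# With ranges reserved per organism, "merging" reduces to verifying
# that no two organisms overlap, then copying the files as-is.

def parse_range(value: str):
    lo, hi = value.split("-")
    return int(lo), int(hi)

def check_no_clashes(ranges):
    """ranges: list of (organism, (lo, hi)) tuples for one ID type."""
    ordered = sorted(ranges, key=lambda r: r[1][0])
    for (org_a, (_, hi_a)), (org_b, (lo_b, _)) in zip(ordered, ordered[1:]):
        if lo_b <= hi_a:
            raise ValueError(f"ID ranges for {org_a} and {org_b} overlap")

check_no_clashes([("human", parse_range("10000-19999")),
                  ("mouse", parse_range("20000-29999"))])
```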