Galaxy: allow tool_data #49
Also, it will be necessary to add test data to the tool, e.g. to enable testing tools with data inputs.
For the test data: https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/remote_test_data_location.xml Since we (almost) assume that the input files are in the repo (so that notebooks run from top to bottom), their raw paths can be used.
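For illustration, a test section using a remote test file might look like the sketch below, modeled on the linked example. The parameter names, output file, and URL are placeholders, not taken from the actual tool:

```xml
<tests>
    <!-- Sketch: "input", "output" and the URL are hypothetical placeholders.
         The "location" attribute tells Galaxy to stage the remote file
         as test data instead of looking in the local test-data/ directory. -->
    <test>
        <param name="input" location="https://raw.githubusercontent.com/example-org/example-tool/main/test-data/input.fits"/>
        <output name="output" file="expected_output.txt"/>
    </test>
</tests>
```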
@bgruening could you give advice? Am I missing something simple here?
Have you checked that you have a recent planemo/Galaxy version? Remote test files are a new feature. |
While I checked, of course, that the Galaxy release I use is at least 23.1 (where this feature was introduced), I was indeed using an older version of Planemo. I thought Planemo's role in running tool tests was more restricted, and that it is Galaxy itself that takes care of the test data staging. Getting back to the first point in this issue: if we have a tool that always uses the same static dataset (not big data), which is always referenced and used as a whole, what is the best (or most "galactic") way to deal with such a situation?
This is more of an extended note for myself, but I appreciate any comments and thoughts... Also, regardless of the implementation, if this small dataset has some kind of persistent identifier (DOI, etc.), it needs to be expressed somewhere.
Cool, glad it works now!
In the IUC, which holds our best-practice recommendations, we recommend keeping test data below 1 MB. For everything bigger, I would recommend using external resources. There are two reasons remote test data is maybe not ideal (that I'm aware of): increased network traffic and restricted network access. Once Galaxy's dataset caching is enabled by default, the increased network traffic might not be such a big problem, but the restricted network access remains.
Not sure I understand how you would like to use a Data Manager (DM) with tool tests. Or is this about some reference/model data? DMs are used to populate location (.loc) files automatically, so that an admin has less work. Let's forget about DMs for the moment and just talk about .loc files. A location file is a simple tabular file that a tool can read to populate a select box. We use this to distribute large reference data/models to all users of a Galaxy instance, so that those models don't need to be downloaded by every user. It also makes it easy to discover those models, etc. In tool tests you can also test location files; however, here we also usually recommend using tiny versions of the models.
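As a sketch of the mechanism described above (the table name, column names, and paths are made up for illustration): a .loc file is tab-separated, e.g.

```
#value	name	path
model_v1	Example model v1	/data/models/model_v1
```

and a tool can then populate a select box from the corresponding tool data table:

```xml
<!-- Sketch: "example_models" is a hypothetical data table name that an
     admin would register in tool_data_table_conf.xml, pointing at the
     .loc file above. -->
<param name="model" type="select" label="Model">
    <options from_data_table="example_models"/>
</param>
```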
I assume I'm missing something here. You can use remote-data for all
Galaxy does support some special protocol schemas: https://github.com/galaxyproject/galaxy/blob/06eada1c2ce5694d92aaa0bbf312258ff66398e2/client/src/utils/upload-payload.js#L2 I guess we could add a DOI resolver; that would be useful for many use cases, I think. The catch is: if the DOI points to a tarball and not a single file, I don't know what Galaxy should do with that.
Yes, all this second part is not about the test data (I think we are happy enough with the remote test data approach, or with putting test data inside the tool if it's small). This is about the data that's actually needed to run a tool; I wasn't clear enough about this, sorry. Some of it is model or reference data in your terms; some is a real observational dataset. But in both cases at hand, we don't expect this data to be variable. The model data is fixed for the given tool release, and the observational data is sometimes a comparably small (~100 MB to 1 GB) data release in the form of one archive, used as a whole. Given that both are static, there is no need for the user to select what to use.
That's why I started to look into it: the same data is used by all users of the specific tool. The difference is that there isn't really a need for any select box; that's why I thought of a hidden parameter. All this is the opposite of the use cases where we have a "big data" problem.
Oh, interesting use case. How do you currently distribute those data to the users? What we can do, without any changes to Galaxy, is have a hidden select box that is linked to a location file. The location file has only one entry, and the hidden select box's single entry is always taken into account. (The admin can set up this tool by modifying this location file.)
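A minimal sketch of this pattern (all names here are hypothetical, and whether `hidden="true"` is accepted on a select param should be checked against the current tool XML schema):

```xml
<!-- Sketch: single-entry data table "hess_public_data" behind a hidden
     select; the command section could then reference the staged path
     via ${static_data.fields.path}. -->
<param name="static_data" type="select" hidden="true" label="Static dataset">
    <options from_data_table="hess_public_data"/>
</param>
```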
Yes, thanks, that's exactly the mechanism I imagined when talking about DMs. But it requires some action from an admin if we add such a tool to the usegalaxy.eu instance? Or is there a CD action to fetch the data / run the DM tool?
Yes, we have this: https://github.com/galaxyproject/idc But for you, this might be overkill. Most of our reference data requires more than a download; we actually need to build indices on genomes and so on. If you just need a download, maybe we can simply host all of your data on CVMFS, the admins mount this data in, and done.
Closing in favour of #55
Currently, only scripts and XML end up in the tool.
Some static data may be needed, like a small, static public dataset in the HESS tool or model grids in Phosphoros photo-z.
It may either live in the repo (via LFS) or be downloaded from a given URL upon tool creation.