ExeTera Questions #289
Posting responses here for continuity with the issue:

Hi Netta,

Thank you for your interest in ExeTera. My apologies for my slightly delayed response; it has been a busy week and now I am catching up on emails.

Firstly, yes, we are still working on ExeTera. ExeTera continues to have an application within our school and we are also looking for other applications that require the kind of scale it provides. We haven't yet had the opportunity to rewrite the backend using dask, but we are in the process of scheduling the future work, including dask integration. I cannot give you a timescale for that right now, however.

I don't recall the precise errors that were raised during the dask benchmarking with the artificial dataset, but we tried a number of approaches to make it work and were unable to do so. I'll rerun the experiment for dask and then give you instructions on how to run the artificial join code in the evaluation repo so you can experiment for yourself (it will need a machine with a lot of RAM).

Using dask arrays means that we can make use of the dask compute engine, which builds directed graphs of the operations to be performed and parallelises / distributes them as appropriate. We need to implement a number of key operations ourselves, as dask does not natively support the operations that we want to perform outside of dask dataframe, but its API permits the specification of custom graph fragments through which those can be implemented.

Are you able to give me more details about your project and its aims? I might be able to assist your enquiries more effectively if you can give me an overview of what you want to do.

Yours,
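To illustrate the task-graph model described above: dask represents computations as a graph mapping keys to tasks, and its schedulers walk that graph, which is what allows custom operations to be expressed as graph fragments. The following is a deliberately simplified toy sketch of that idea (not dask's actual scheduler), with invented key names:

```python
# Toy illustration of a dask-style task graph: keys map either to
# literal values or to tasks (a callable followed by the keys of its
# arguments). A scheduler resolves dependencies before running each
# task; independent branches could be run in parallel. This is a
# simplified sketch, not dask's real implementation.

def get(graph, key):
    """Recursively evaluate `key` in the task graph."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *arg_keys = task
        return func(*(get(graph, k) for k in arg_keys))
    return task  # a literal value

# 'c' depends on 'b', which depends on 'a'.
graph = {
    "a": 1,
    "b": (lambda v: v + 10, "a"),
    "c": (lambda v: v * 2, "b"),
}
print(get(graph, "c"))  # → 22
```

A custom graph fragment in real dask is the same shape: a dict of keys and tasks spliced into the larger graph, which is how operations outside dask's native API can be plugged in.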
Here is a stacktrace from a failing dask merge scenario:
If you look at the last line, you have the ValueError where the table structure appears to be inconsistent across the subtables that dask generates while performing the merge. We noted that this occurred for some join sizes and not others. We were not able to find a workaround at the time, and we felt that it represented the kind of technical hurdle that a data analyst should not have to solve. Also, given the performance disparity between dask dataframe merge and our merge, we considered that further time spent working around the issue was not productive.

To summarise, our view is that the dask dataframe implementation was a "low hanging fruit" implementation that gave dask-dataframe-like functionality, but it is sufficiently problematic to be worth tackling from a completely different design direction.
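For intuition about the scenario described above: dask dataframe performs a merge partition by partition, so each chunk of the left table is joined independently and the per-partition results must agree on structure before they are stitched back together. The sketch below mimics that strategy with plain pandas (column names and sizes are invented for illustration; it does not reproduce the original error):

```python
import pandas as pd

# Hypothetical partition-wise merge, mimicking dask dataframe's
# strategy: split the left table into chunks, merge each chunk
# against the right table independently, then concatenate. If the
# per-partition results disagreed on columns or dtypes, the final
# combination step would raise an error like the ValueError above.
left = pd.DataFrame({"id": range(6), "x": range(6)})
right = pd.DataFrame({"id": [1, 3, 5], "y": [10, 30, 50]})

partitions = [left.iloc[:3], left.iloc[3:]]  # stand-in for dask chunks
merged = pd.concat(
    p.merge(right, on="id", how="inner") for p in partitions
).reset_index(drop=True)
print(merged)  # three matching rows, columns: id, x, y
```

The join-size sensitivity mentioned above is consistent with this design: whether the partition boundaries produce structurally consistent subtables depends on how the data divides.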
You can clone the ExeTeraEval repository and do the following if you want to replicate the dask evaluation. You'll need to run two commands:
If you are using numbers like the above, I suggest you use a machine with a large amount of RAM (>64GB).
Hello,
My name is Netta Shemesh and I'm 23 years old, from Israel. I recently read your article "Accessible data curation and analytics for international-scale citizen science datasets" and I am doing a university project (at the Israel Institute of Technology) on your paper.
I have several questions about ExeTera, the Python-based software package you wrote.
First, I would like to know whether you are still working on the software, and whether you have succeeded in applying Dask arrays in your code.
Secondly, I haven't understood precisely why the Dask DataFrame could not import the data, and what exceptions were raised during the artificial joins that caused the Dask program to fail.
In addition, I was wondering why applying Dask arrays in your software would make all operations on ExeTera fields streaming by default.
Thank you in advance,
Netta Shemesh