Roadmap
ExeTera is under active development. Our primary goals for the software are as follows:
- Provide a full, rich API that is familiar to users of Pandas
- Provide an engine that can execute in environments from laptops to clusters
- Provide a powerful set of data curation features, such as journalling and versioning of algorithms for reproducibility
These correspond to the three pillars of functionality that we believe can create an innovative, world-beating software package combining data curation and analytics.
The Roadmap page is divided into subsections, each dealing with an aspect of the software:
- API
- Execution
- Serialization
- Data curation
Priority: High
ExeTera is currently a Python-only library. R users can access ExeTera through Reticulate, an R library written to interop with Python packages, but this is onerous for two reasons:
- Syntactic sugar in Python such as slices does not work in R, and so data fetching involves the creation of explicit slice objects
- ExeTera (like most Python libraries) uses 0-based indices, whereas R uses 1-based indices, so the user must perform this conversion correctly every time
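To illustrate both points, Python's slice sugar expands to an explicit `slice` object, which is the form an R user working through Reticulate has to construct by hand, and the selected range is 0-based. A minimal Python sketch (plain lists stand in for a field's backing values; this is illustrative, not ExeTera's API):

```python
# Plain lists stand in for a field's backing values (illustrative, not ExeTera's API)
data = list(range(100))

# Python users write slice sugar...
sugar = data[10:20]

# ...which expands to an explicit slice object, the form an R user going
# through Reticulate has to construct by hand.
explicit = data[slice(10, 20)]

assert sugar == explicit

# The 0-based/1-based hazard: Python's data[10:20] selects elements 10..19;
# the equivalent 1-based R range is 11:20.
print(sugar[0], sugar[-1])  # 10 19
```

An R wrapper can hide both conversions so users never build slice objects or shift indices themselves.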
We propose to write an R library that wraps Sessions, Datasets, DataFrames and Fields with their R equivalents. This should enable R users to work with ExeTera using the syntax and conventions they are used to.
Pandas' DataFrame API (itself based on the R data.frame) has a rich set of functionality that is not yet supported by ExeTera. This missing functionality ranges from 'implement as soon as humanly possible' to 'implement when we have time to do so'. We should be guided by our users in this matter, but we have an initial set of priorities to be refined with their feedback.
ExeTera has some critical differences from Pandas, so there are areas in which we must necessarily deviate from its API. The largest difference is that ExeTera doesn't keep DataFrame field values in memory but reads them from disk when they are required. This changes `DataFrame.merge`, for example, as it is necessary to specify a DataFrame instance to which merged values should be written.
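The shape of this deviation can be sketched in plain Python. Here a dict-of-lists stands in for an on-disk DataFrame, and the name `merge_into` and its signature are illustrative, not ExeTera's actual API; the point is that the output columns are written into a caller-supplied destination rather than returned as a new in-memory frame:

```python
# Toy sketch of a destination-writing merge. Dict-of-lists stands in for
# on-disk DataFrames; the name and signature are illustrative, not ExeTera's.
def merge_into(left, right, dest, on):
    # Index the right table by join key for O(1) lookup
    right_index = {k: i for i, k in enumerate(right[on])}
    for name in set(left) | set(right):
        dest[name] = []
    for i, key in enumerate(left[on]):
        j = right_index.get(key)
        if j is None:
            continue  # inner-join semantics: drop unmatched left rows
        for name in left:
            dest[name].append(left[name][i])
        for name in right:
            if name != on:
                dest[name].append(right[name][j])
    return dest

left = {"id": [1, 2, 3], "age": [30, 40, 50]}
right = {"id": [2, 3, 4], "score": [0.5, 0.7, 0.9]}
out = {}  # the caller supplies the destination "frame"
merge_into(left, right, out, on="id")
print(out["id"], out["score"])  # [2, 3] [0.5, 0.7]
```

Writing into a destination lets the merged columns be streamed to disk chunk by chunk instead of materialising the whole result in memory.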
We still have a number of areas where ExeTera's API can be argued to be unnecessarily complicated:
- `.data` on Field
- Access to scalable sorts
- Access to journalling
- Moving functionality out of Session to the appropriate classes (Dataset / DataFrame / Field)
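On the first item: reading values currently goes through a field's `.data` accessor, where indexing the field directly would be more natural. A toy sketch of the simplification (class and names are illustrative, not ExeTera internals):

```python
class Field:
    """Toy field whose values sit behind a .data accessor, as in ExeTera today."""
    def __init__(self, values):
        self.data = list(values)

    # The proposed simplification: make the field itself indexable so users
    # can write field[10:20] instead of field.data[10:20].
    def __getitem__(self, item):
        return self.data[item]

    def __len__(self):
        return len(self.data)

f = Field(range(100))
assert f[10:20] == f.data[10:20]  # both forms read the same values
```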
HDF5 has provided us with a good initial format for the serialized dataset, but it has a number of serious issues that require workarounds:
- Data cannot be properly deleted from a HDF5 dataset
- The format is fragile, and interruptions at the wrong point can irretrievably corrupt a dataset
- It has some restrictions on read-based concurrency
- The `h5py` Python library that wraps the underlying C library is very inefficient, particularly for iteration
When an HDF5 dataset (the HDF5 equivalent of a numpy array) is deleted, the series is marked as having been deleted but the space is not reclaimed. The space can only be recovered by running the `h5repack` command, which takes many minutes to run on datasets the size of the Covid Symptom Study. It is thus very easy for a user to consume their entire drive by creating and deleting temporary fields. As a result, the user has to adopt unnecessarily complex workflows that categorise datasets as 'source', 'temporary', and 'destination', which adds to the cognitive load of using ExeTera.
As any write-based interaction with an HDF5 dataset can leave it in an invalid and irretrievable state if one of the HDF5 commands is interrupted mid-execution, we have to protect all HDF5 write commands so that they cannot be interrupted. We currently use threads for this purpose. This is another reason why we suggest to users that, once a dataset is generated, they treat it as a read-only 'source' dataset.
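The thread-based protection works because Python delivers `KeyboardInterrupt` to the main thread only: if the write runs on a worker thread and the main thread blocks on `join()`, a Ctrl-C cannot land in the middle of the HDF5 call. A minimal sketch of the pattern (the wrapped function here is a stand-in, not ExeTera's actual code):

```python
import threading

def run_uninterruptible(write_fn, *args):
    """Run write_fn on a worker thread. KeyboardInterrupt is only raised in
    the main thread, so the write itself cannot be interrupted mid-call."""
    result = {}

    def worker():
        result["value"] = write_fn(*args)

    t = threading.Thread(target=worker)
    t.start()
    interrupted = False
    while t.is_alive():
        try:
            t.join()
        except KeyboardInterrupt:
            # Defer the interrupt: let the write finish so the dataset stays
            # in a consistent state, then re-raise below.
            interrupted = True
    if interrupted:
        raise KeyboardInterrupt
    return result.get("value")

# sum stands in for an HDF5 write command here
print(run_uninterruptible(sum, range(1000)))  # 499500
```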
TODO: qualify this statement with HDF5 documentation. Read based concurrency appears to be unnecessarily limited in HDF5 files.
Whilst iteration will always be slower than performing numpy-style operations on fields, `h5py` adds several further orders of magnitude of overhead to iterative access. This can certainly be improved upon.
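The usual mitigation is to read large contiguous chunks and iterate over the in-memory buffers, rather than indexing the dataset element by element. A pure-Python sketch of the access pattern (a plain list stands in for an `h5py` dataset; the helper name is illustrative):

```python
def iter_chunked(dataset, chunk_size=1 << 16):
    """Yield elements by reading large contiguous chunks, instead of issuing
    one (expensive) read per element as naive per-item iteration does."""
    n = len(dataset)
    for start in range(0, n, chunk_size):
        chunk = dataset[start:start + chunk_size]  # one bulk read
        yield from chunk  # iterate in memory

# A plain list stands in for an h5py dataset here
data = list(range(100000))
assert sum(iter_chunked(data, chunk_size=4096)) == sum(data)
```

With a real `h5py` dataset, each slice becomes a single bulk read through the C library, amortising the per-call overhead across thousands of elements.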
Performance and scale improvements can be delivered through several mechanisms:
- Selection of the appropriate fast code generation technology (Numba, Cython or native C/C++)
- Increase of the field API footprint, so that all operations can be performed scalably under the hood
- Adoption of a multi-core/multi-node graph compiler / scheduler / executor to carry out ExeTera processing (Dask / Arrow / Spark)
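As a flavour of the third item, chunked field operations decompose naturally into independent tasks that a scheduler can fan out across cores. A minimal stdlib sketch, with `concurrent.futures` standing in for a Dask/Spark-style executor (names and chunk sizes are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(values, size):
    # Split the input into independent work units
    for start in range(0, len(values), size):
        yield values[start:start + size]

def parallel_sum(values, chunk_size=10000, workers=4):
    """Map a per-chunk reduction over a pool, then combine the partials:
    the same shape of computation a graph scheduler would distribute."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, chunked(values, chunk_size))
    return sum(partials)

data = list(range(100000))
assert parallel_sum(data) == sum(data)
```

The same map-then-combine structure applies whether the executor is a local thread pool, a Dask graph, or a Spark cluster; only the scheduler changes.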