-
Notifications
You must be signed in to change notification settings - Fork 4
Roadmap
ExeTera is under active development. Our primary goals for the software are as follows:
- Provide a full, rich API that is familiar to users of Pandas
- Provide an engine that can execute in environments from laptops to clusters
- Provide a powerful set of data curation features, such as journalling and versioning of algorithms for reproducibility
These correspond to the three pillars of functionality that we believe can create an innovative, world beating software package combining data curation and analytics.
The Roadmap page is divided into subsections, each dealing with an aspect of the software:
- R wrappers for ExeTera
- API
- Serialization
- Execution
- Data curation
Target: v0.6
ExeTera is currently a Python-only library. R users can access ExeTera through Reticulate, an R library written to interop with Python packages, but this is onerous for two of reasons:
- Syntactic sugar in Python such as slices does not work in R, and so data fetching involves the creation of explicit slice objects
- ExeTera (like most Python libraries) uses 0-based indices, whereas R uses 1-based indices, so the user must perform this conversion correctly every time
We propose to write an R library that wraps Sessions, Datasets, DataFrames and Fields with their R equivalents. This should enable R users to write using syntax and conventions that they are used to while using ExeTera.
Target: v0.6 - v0.9
Pandas' DataFrame API (itself based on the R data.frame) has a rich set of functionality that is not yet supported by ExeTera. This missing functionality ranges from 'implement as soon as humanly possible' to 'implement when we have time to do so'. We should be guided by our users in this matter, but we have an initial set of priorities that should be refined upon.
As ExeTera has some critical differences to Pandas and so there are areas in which we must necessarily deviate from the API. The largest difference is that ExeTera doesn't keep the DataFrame field values in memory but reads them from drive when they are required. This changes DataFrame.merge
for example, as it is necessary to specify a DataFrame instance to which merged values should be written.
We should tackle this work in a series of stages so that for each new release we have broadened the API.
Target: v0.7 onwards
We still have a number of areas where ExeTera's API can be argued to be unnecessarily complicated:
-
.data
on Field - Use highly-scalable sorts by default
- Move journalling up into DataFrame API
- Moving functionality out of Session to the appropriate classes (Dataset / DataFrame / Field) and retiring old API
Highly scalable sorts are implemented in the ops level but not called by default in the DataFrame / Session API as yet. They should be incorporated into the DataFrame sort, although they are not currently required for datasets below around a billion rows.
Journalling functionality should be incorporated into the DataFrame namespace, so that a journalled DataFrame can be generated by compatible dataframes. At the moment is must be accessed through the ops layer
Target: v0.8
HDF5 has worked to provide us a good initial implementation for the serialized dataset, but it has a number of serious issues that require workarounds:
- Data cannot be properly deleted from a HDF5 dataset
- The format is fragile and interrupts at the wrong point can irretrievably corrupt a dataset
- It has some restrictions on read-based concurrency
- The
h5py
python library that wraps the underlying C library is very inefficient, particularly for iteration
These points are described in more detail in Roadmap: Move away from HDF5
Priority Medium to Low
Performance and scale improvements can be delivered through several mechanisms:
- Selection of the appropriate fast code generation technology (Numba, Cython or native C/C++)
- Increase of the field API footprint, so that all operations can be performed scalably under the hood
- Adoption of a multi-core/multi-node graph compiler / scheduler / executor to carry out ExeTera processing (Dask / Arrow / Spark)