Skip to content

Roadmap

Ben Murray edited this page Jun 22, 2021 · 27 revisions

ExeTera roadmap

ExeTera is under active development. Our primary goals for the software are as follows:

  1. Provide a full, rich API that is familiar to users of Pandas
  2. Provide an engine that can execute in environments from laptops to clusters
  3. Provide a powerful set of data curation features, such as journalling and versioning of algorithms for reproducibility

These correspond to the three pillars of functionality that we believe can create an innovative, world beating software package combining data curation and analytics.

The Roadmap page is divided into subsections, each dealing with an aspect of the software:

  1. R wrappers for ExeTera
  2. API
  3. Serialization
  4. Execution
  5. Data curation

R wrappers for ExeTera

Target: v0.6

ExeTera is currently a Python-only library. R users can access ExeTera through Reticulate, an R library written to interop with Python packages, but this is onerous for two of reasons:

  1. Syntactic sugar in Python such as slices does not work in R, and so data fetching involves the creation of explicit slice objects
  2. ExeTera (like most Python libraries) uses 0-based indices, whereas R uses 1-based indices, so the user must perform this conversion correctly every time

We propose to write an R library that wraps Sessions, Datasets, DataFrames and Fields with their R equivalents. This should enable R users to write using syntax and conventions that they are used to while using ExeTera.

API

DataFrame API

Target: v0.6 - v0.9

Pandas' DataFrame API (itself based on the R data.frame) has a rich set of functionality that is not yet supported by ExeTera. This missing functionality ranges from 'implement as soon as humanly possible' to 'implement when we have time to do so'. We should be guided by our users in this matter, but we have an initial set of priorities that should be refined upon.

As ExeTera has some critical differences to Pandas and so there are areas in which we must necessarily deviate from the API. The largest difference is that ExeTera doesn't keep the DataFrame field values in memory but reads them from drive when they are required. This changes DataFrame.merge for example, as it is necessary to specify a DataFrame instance to which merged values should be written.

We should tackle this work in a series of stages so that for each new release we have broadened the API.

Cleaner Syntax

Target: v0.7 onwards

We still have a number of areas where ExeTera's API can be argued to be unnecessarily complicated:

  1. .data on Field
  2. Use highly-scalable sorts by default
  3. Move journalling up into DataFrame API
  4. Moving functionality out of Session to the appropriate classes (Dataset / DataFrame / Field) and retiring old API

Highly-scalable sorts

Highly scalable sorts are implemented in the ops level but not called by default in the DataFrame / Session API as yet. They should be incorporated into the DataFrame sort, although they are not currently required for datasets below around a billion rows.

Move journalling up into DataFrame API

Journalling functionality should be incorporated into the DataFrame namespace, so that a journalled DataFrame can be generated by compatible dataframes. At the moment is must be accessed through the ops layer

Serialization

Move away from HDF5

Target: v0.8

HDF5 has worked to provide us a good initial implementation for the serialized dataset, but it has a number of serious issues that require workarounds:

  1. Data cannot be properly deleted from a HDF5 dataset
  2. The format is fragile and interrupts at the wrong point can irretrievably corrupt a dataset
  3. It has some restrictions on read-based concurrency
  4. The h5py python library that wraps the underlying C library is very inefficient, particularly for iteration

These points are described in more detail in Roadmap: Move away from HDF5

Performance and scale improvements

Priority Medium to Low

Performance and scale improvements can be delivered through several mechanisms:

  1. Selection of the appropriate fast code generation technology (Numba, Cython or native C/C++)
  2. Increase of the field API footprint, so that all operations can be performed scalably under the hood
  3. Adoption of a multi-core/multi-node graph compiler / scheduler / executor to carry out ExeTera processing (Dask / Arrow / Spark)
Clone this wiki locally