Skip to content

Roadmap

Ben Murray edited this page Jun 14, 2021 · 27 revisions

ExeTera roadmap

ExeTera is under active development. Our primary goals for the software are as follows:

  1. Provide a full, rich API that is familiar to users of Pandas
  2. Provide an engine that can execute in environments from laptops to clusters
  3. Provide a powerful set of data curation features, such as journalling and versioning of algorithms for reproducibility

These correspond to the three pillars of functionality that we believe can create an innovative, world beating software package combining data curation and analytics.

The Roadmap page is divided into subsections, each dealing with an aspect of the software:

  1. API
  2. Execution
  3. Serialization
  4. Data curation

API

R wrappers for ExeTera

Priority: High

ExeTera is currently a Python-only library. R users can access ExeTera through Reticulate, an R library written to interop with Python packages, but this is onerous for two of reasons:

  1. Syntactic sugar in Python such as slices does not work in R, and so data fetching involves the creation of explicit slice objects
  2. ExeTera (like most Python libraries) uses 0-based indices, whereas R uses 1-based indices, so the user must perform this conversion correctly every time

We propose to write an R library that wraps Sessions, Datasets, DataFrames and Fields with their R equivalents. This should enable R users to write using syntax and conventions that they are used to while using ExeTera.

DataFrame API

Pandas' DataFrame API (itself based on the R data.frame) has a rich set of functionality that is not yet supported by ExeTera. This missing functionality ranges from 'implement as soon as humanly possible' to 'implement when we have time to do so'. We should be guided by our users in this matter, but we have an initial set of priorities that should be refined upon.

As ExeTera has some critical differences to Pandas and so there are areas in which we must necessarily deviate from the API. The largest difference is that ExeTera doesn't keep the DataFrame field values in memory but reads them from drive when they are required. This changes DataFrame.merge for example, as it is necessary to specify a DataFrame instance to which merged values should be written.

Cleaner syntax

We still have a number of areas where ExeTera's API can be argued to be unnecessarily complicated:

  1. .data on Field
  2. Access to scalable sorts
  3. Access to journalling
  4. Moving functionality out of Session to the appropriate classes (Dataset / DataFrame / Field)

Move away from HDF5

HDF5 has worked to provide us a good initial implementation for the serialized dataset, but it has a number of serious issues that require workarounds:

  1. Data cannot be properly deleted from a HDF5 dataset
  2. The format is fragile and interrupts at the wrong point can irretrievably corrupt a dataset
  3. It has some restrictions on read-based concurrency
  4. The h5py python library that wraps the underlying C library is very inefficient, particularly for iteration

Deleting data

When HDF5 datasets (the HDF5 equivalent of a numpy array) is deleted, the series is marked as having been deleted but the space is not reclaimed. This can only be done by running the h5repack command, which takes many minutes to run on datasets the size of the Covid Symptom Study. It is very easy for a user to thus consume their entire drive by creating and deleting temporary fields. As a result, the user has to adopt unnecessarily complex workflows that categorise datasets as 'source', 'temporary', and 'destination', which adds to the cognative load of using ExeTera.

Fragility

As any write-based interaction with a HDF5 dataset is capable of leaving a dataset in an invalid and irretrievable state simply by interrupting the execution of one of the HDF5 commands, we have to protect all HDF5 write commands in such a way that they cannot be interrupted. We currently use threads for this purpose. This is another reason why we suggest to users that once a dataset is generated, they treat it as a read-only 'source' dataset.

Read-based concurrency

TODO: qualify this statement with HDF5 documentation. Read based concurrency appears to be unnecessarily limited in HDF5 files.

h5py iterator Performance

Whilst iteration will always be slower than performing numpy-style operations on fields, h5py adds several further orders of magnitude of overhead to iterative access. This can certainly be improved upcoming

Performance and scale improvements

Performance and scale improvements can be delivered through several mechanisms:

  1. Selection of the appropriate fast code generation technology (Numba, Cython or native C/C++)
  2. Increase of the field API footprint, so that all operations can be performed scalably under the hood
  3. Adoption of a multi-core/multi-node graph compiler / scheduler / executor to carry out ExeTera processing (Dask / Arrow / Spark)
Clone this wiki locally