
Initial draft of the DataSet spec #476

Merged Mar 8, 2017 (8 commits; changes shown from 7 commits)
specs/DataSet.rst (316 additions, 0 deletions)
=====================
DataSet Specification
=====================

Introduction
============

The DataSet class is used in QCoDeS to hold measurement results.
It is the destination for measurement loops and the source for plotting and data analysis.
As such, it is a central component of QCoDeS.

The DataSet class should be usable on its own, without other QCoDeS components.
In particular, the DataSet class should not require the use of Loop and parameters, although it should integrate with those components seamlessly.
This will significantly improve the modularity of QCoDeS by allowing users to plug into and extend the package in many different ways.
As long as a DataSet is used for data storage, users can freely select the QCoDeS components they want to use.
Comment (Contributor):

Does this imply the other components will not work if the DataSet is not being used or is the modularity intended to work both ways?

Reply (Author):

Modularity should work both ways, when it makes sense. Specifically, modularity should respect layering: it should always be possible to use lower-level, more "base" components without higher-level components, but it may be acceptable to have higher-level components rely on base components.

As an example, DataSet should be usable stand-alone, but Loop can require DataSet if that makes sense. An important corollary is that anything else that uses DataSet, like plotting, should work just as well if the DataSet is filled in by hand as when it is filled in using Loop.


Terminology
================

Metadata
Many items in this spec have metadata associated with them.
In all cases, we expect metadata to be represented as a dictionary with string keys.
While the values are arbitrary and up to the user, in many cases we expect metadata to be nested, string-keyed dictionaries
with scalars (strings or numbers) as the final values.
Comment (Contributor):

I would argue that the final values need to support anything that a parameter can return.

The most direct example is some parameters that are arrays (like a vector of integration weights). We want to store this in the metadata (and are currently able to do so). I think it is important to note this here.

In some cases, we specify particular keys or paths in the metadata that other QCoDeS components may rely on.
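As a sketch of the expected shape, a metadata dictionary along these lines might look as follows (all key names and values here are purely illustrative, not part of the spec):

```python
# Illustrative only: a nested, string-keyed metadata dictionary with
# scalar (string or number) leaf values, as the spec describes.
metadata = {
    "instrument": {
        "name": "dac1",          # hypothetical instrument name
        "settings": {
            "range_v": 10.0,
            "channels": 4,
        },
    },
    "notes": "cooldown 3",
}

# Every key, at every nesting level, is a string.
def keys_are_strings(d):
    return all(
        isinstance(k, str) and (not isinstance(v, dict) or keys_are_strings(v))
        for k, v in d.items()
    )

print(keys_are_strings(metadata))  # True
```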

Parameter
A logically-single value input to or produced by a measurement.
A parameter need not be a scalar, but can be an array or a tuple or an array of tuples, etc.
A DataSet parameter corresponds conceptually to a QCoDeS parameter, but does not have to be defined by or associated with a QCoDeS Parameter.
Roughly, a parameter represents a column in a table of experimental data.

Result
A result is the collection of parameter values associated to a single measurement in an experiment.
Roughly, a result corresponds to a row in a table of experimental data.
Comment (Contributor):

I would always put them in columns, but maybe that's just my transposed intuition. I do think columns are more readable if the dataset itself is intended to be directly viewable.

Reply (Author):

I guess I think of a parameter as a column and a result as a row. Too much relational database history...

I also tend to think of columns as more-or-less fixed, while you can add rows forever.
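The row/column analogy from the terminology above can be sketched with plain Python structures (the parameter names are purely illustrative):

```python
# Parameters play the role of columns; each result is one row.
parameters = ["gate_voltage", "current"]   # hypothetical parameter names

results = [                                # one dict per result (row)
    {"gate_voltage": 0.0, "current": 1.2e-9},
    {"gate_voltage": 0.1, "current": 1.5e-9},
    {"gate_voltage": 0.2, "current": 2.1e-9},
]

# A "column" is the sequence of values one parameter takes across results;
# you can keep adding rows, while the set of columns stays mostly fixed.
current_column = [r["current"] for r in results]
print(current_column)
```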


DataSet
A DataSet is a QCoDeS object that stores the results of an experiment.
Roughly, a DataSet corresponds to a table of experimental data, along with metadata that describes the data.
Depending on the state of the experiment, a DataSet may be "in progress" or "completed".

ExperimentContainer
An ExperimentContainer is a QCoDeS object that stores all information about an experiment.
Comment (Contributor):

What is an experiment?

Mostly because we can either argue that an experiment runs for several weeks/months and has many datasets or it can be a single run of some experiment.

I'm mostly asking this because how DataSets are grouped into experiments, and how we can combine and split multiple of these ExperimentContainers, has far-reaching consequences for the usability of the DataSet.

Reply (Contributor):

I can quickly answer on this point as I have been very involved in the Experiment container.
The experiment is a collection of datasets; it is up to the user to "flip the switch" to a new experiment, or load an old one and keep going.
The dataset itself has no notion of being part of an experiment container.

Does this clarify ?

Reply (Contributor):

Hi @giulioungaretti can you please clarify the following about the Experiment container.

  • can an individual dataset be part of multiple experiments?
  • can individual datasets be split out from multiple experiment containers? e.g. for comparison purposes
  • can multiple experiments run simultaneously?
  • are the datasets available without an Experiment container?
  • what exactly does the experiment store? links to the datasets, or actual datasets? default parameters? instrument drivers? executable scripts?

To me, these features are all absolutely crucial for running long-term, complex experiments. We have to be able to easily compare data from all points in history, as well as between different Station Q experimental sites.

Reply (Contributor):

@majacassidy thanks for the feedback.

  • I would say yes, although it's more of an implementation than a concept issue (i.e. how to make it so that it's easy to do this).

  • Datasets live on their own, so yes. The container just acts like a file system on steroids (allowing you to perform searches and so on). I guess we would also want to add this feature: select x from container z and y from container w, and compare them.

  • Yes. Although one is always limited by hardware that can't operate simultaneously.

  • Yes.

  • The first implementation will store a pointer to the dataset. But it may in the long run be way more convenient to store the actual data (but this opens up a lot of accessibility problems).
    The container would store:
    - references to all the datasets generated with a unique hash and a timestamp
    - metadata of all the datasets linked to the unique hash of the dataset they describe
    - metadata of the experiment ( something that a user would want to stick inside, literally anything)
    - GitHub hash of the qcodes version one is using (easy to trace bugs, make references), and a diff of what is changed locally (custom drivers and so on)

      - scripts (but to do that we'd have to agree on a standard way to include them)
    

re: parameters those are saved in the metadata of every dataset.

Reply (Contributor):

Great - thanks for the clarification @giulioungaretti !

This includes items such as the equipment on which the experiment was run, the configuration of the equipment, graphs and other analytical output, and arbitrary notes, as well as the DataSet that holds the results of the experiment.

Requirements
============

The DataSet class should meet the following requirements:

Basics
---------

#. A DataSet can store data of (reasonably) arbitrary types and shapes. Basically, any type and shape that can fit in a NumPy array should be supported.
#. The results stored in a completed DataSet should be immutable; no new results may be added to a completed DataSet.
#. Each DataSet should have a unique identifying string that can be used to create references to DataSets.

Creation
------------

#. It should be possible to create a DataSet without knowing the final item count of the various values it stores.
In particular, the number of loop iterations for a sweep should not be required to create the DataSet.
#. The list of parameters in each result to be stored in a DataSet may be specified at creation time.
This includes the name, role (set-point or output), and type of each parameter.
Parameters may be marked as optional, in which case they are not required for each result.
#. It should be possible to add a new parameter to an in-progress DataSet.
#. It should be possible to define a result parameter that is independent of any QCoDeS Parameter or Instrument.
#. A QCoDeS Parameter should provide sufficient information to define a result parameter.
#. A DataSet should allow storage of relatively arbitrary metadata describing the run that
generated the results and the parameters included in the results.
Essentially, DataSet metadata should be a string-keyed dictionary at the top,
and should allow storage of any JSON-encodable data.
#. The DataSet identifier should be automatically stored in the DataSet's metadata under the "id" tag.


Writing
----------

#. It should be possible to add a single result or a sequence of results to an in-progress DataSet.
#. It should be possible to add an array of values for a new parameter to an in-progress DataSet.
#. A DataSet should maintain the order in which results were added.
#. An in-progress DataSet may be marked as completed.

Access
---------

#. Values in a DataSet should be easily accessible for plotting and analysis, even while the DataSet is in progress.
In particular, it should be possible to retrieve full or partial results as a NumPy array.
#. It should be possible to define a cursor that specifies a location in a specific value set in a DataSet.
It should be possible to get a cursor that specifies the current end of the DataSet when the DataSet is "in progress".
It should be possible to read "new data" in a DataSet; that is, to read everything after a cursor.
#. It should be possible to subscribe to change notifications from a DataSet.
It is acceptable if such subscriptions must be in-process until QCoDeS multiprocessing is redone.

Storage and Persistence
-----------------------

#. Storage and persistence should be defined outside of the DataSet class.

The following items are no longer applicable:

#. A DataSet object should allow writing to and reading from storage in a variety of formats.
#. Users should be able to define new persistence formats.
#. Users should be able to specify where a DataSet is written.
Comment (Contributor):

I would like to suggest some helper functions which should be possible (though I realize that this may not be for the DataSet itself).

  • It should be possible to load (parameter) settings from a DataSet onto the currently active instruments
  • It should be possible to easily compare parameter settings between different datasets and between datasets and the active environment.

Reply (Author):

Both make sense. I agree that they should be separate helper functions, though, in order to keep DataSet from being dependent on Instrument.


Interface
=========

Creation
--------

ParamSpec
~~~~~~~~~

A ParamSpec object specifies a single parameter in a DataSet.

ParamSpec(name, type, metadata=)
Creates a parameter specification with the given name and type.
The type should be a NumPy dtype object.
Comment (Contributor):

I guess this directly excludes nesting datasets within datasets?
This is a concept I have discussed quite extensively with @damazter. I respect the choice for wanting to keep the DataSet simple and browsable. However, there are very direct use cases when starting to automate things where this becomes relevant. To give an example, suppose some experiment requires several calibration experiments before doing the final experiment. Every time some underlying parameter is changed, the calibration "experiments" (which generate DataSets) are repeated and have to be stored. The top-level experiment is only interested in the effect of the final "experiment". The simplest way is to just ignore the relation between these DataSets and only look at the top-level experiment. However, some form of relating these DataSets (of which nesting seems an obvious candidate) is required.

Reply (Contributor):

something with DataSet ids and having the final 'experiment' put the ids of the calibration datasets in the metadata of the 'final experiment dataset'? Might be cleaner than having nested DataSets. Also perhaps this can be somewhat abstracted to the ExperimentContainer, which I know nothing about but could in my imagination know about these things and be able to do things like 'update_calibration_dataset_ids()'.

Reply (Author):

I like Natalie's idea of putting such information into the experiment container.

Perhaps we need another container in between DataSet and Experiment?

  • DataSet contains the results of a single run
  • TBD contains the results of a single "session" (prep, calibration, run) (sequence of related runs)
  • Experiment contains the results of a single experiment (many TBDs)

Reply (Contributor):

@AdriaanRol I think nesting is not such a good idea in terms of design for the dataset.
I would say that all of the use cases you describe can be made nice and simple just by using a list of data_sets and a container.

much like @nataliejpg is saying.

I don't know how much database knowledge you have @AdriaanRol, but if you do, think of the container as an SQL db, where every data_set is a table, and another table describes relations between datasets, for example.

Reply (Contributor):

@giulioungaretti the thing I was thinking about is probably more similar to your idea than I may make it seem :).

Let's see if I can explain it more concisely by an example:

There is an ExperimentContainer (to be renamed as per @majacassidy's arguments) that contains several DataSets, let's say 7.
We label these datasets A, B, and C0, C1, C2, C3, C4.

DataSet A is a simple dataset that is independent of all the others.
DataSet B is an experiment containing 5 datapoints of some parameter y, corresponding to a specific setting for some parameter x. For each of those datapoints a calibration was done before measuring parameter y.
DataSets C0-C4 contain the DataSets corresponding to the calibration done for each point in DataSet B.

I would propose that instead of just linking the DataSets relationally (as proposed by @giulioungaretti) we go one step further, putting the GUIDs of DataSets Cx as entries in DataSet B.
This will achieve several goals.

  • It is possible to relate datasets to specific points in Datasets (as in relational database)
  • DataSets are still flat, you can access DataSet C0-C4 directly
  • It is possible to infinitely nest or group DataSets (as I would like)
  • A viewer could then allow you to browse through nested files.

This would then also make the notion of the ExperimentContainer a lot simpler: you can use it as a filesystem on steroids, and it can contain any number of DataSets (or any part of your DataSets) you want. It will only give a "DataSet not available" error if you happen to link to a DataSet that is not part of your container.


If metadata is provided, it is included in the overall metadata of the DataSet.
The metadata can be any JSON-able object.

ParamSpec.name
The name of this parameter.

ParamSpec.type
The dtype of this parameter.

ParamSpec.metadata
The metadata of this parameter.
This should be an empty dictionary as a default.

Either the QCoDeS Parameter class should inherit from ParamSpec, or the Parameter class should provide
a simple way to get a ParamSpec for the Parameter.
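A minimal sketch of a ParamSpec along the lines described above (this is an illustration of the spec, not the eventual QCoDeS implementation):

```python
import numpy as np

class ParamSpec:
    """Specifies a single parameter in a DataSet (illustrative sketch)."""

    def __init__(self, name, type, metadata=None):
        self.name = name                 # the parameter's name
        self.type = np.dtype(type)       # per the spec, a NumPy dtype
        # metadata defaults to an empty dictionary, as the spec requires
        self.metadata = metadata if metadata is not None else {}

v = ParamSpec("voltage", np.float64, metadata={"unit": "V"})
print(v.name, v.type, v.metadata)
```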

DataSet
~~~~~~~

Construction
------------

DataSet(name)
Creates a DataSet with no parameters.
The name should be a short string that will be part of the DataSet's identifier.

DataSet(name, specs)
Creates a DataSet for the provided list of parameter specifications.
The name should be a short string that will be part of the DataSet's identifier.
Each item in the list should be a ParamSpec object.

DataSet(name, specs, values)
Creates a DataSet for the provided list of parameter specifications and values.
The name should be a short string that will be part of the DataSet's identifier.
Each item in the specs list should be a ParamSpec object.
Each item in the values list should be a NumPy array or a Python list of values for the corresponding ParamSpec.
There should be exactly one item in the values list for every item in the specs list.
All of the arrays/lists in the values list should have the same length.
The values list may intermix NumPy arrays and Python lists.

DataSet.add_parameter(spec)
Adds a parameter to the DataSet.
The spec should be a ParamSpec object.
If the DataSet is not empty, then existing results will have the type-appropriate null value for the new parameter.

It is an error to add parameters to a completed DataSet.

DataSet.add_parameters(specs)
Adds a list of parameters to the DataSet.
Each item in the list should be a ParamSpec object.
If the DataSet is not empty, then existing results will have the type-appropriate null value for the new parameters.

It is an error to add parameters to a completed DataSet.

DataSet.add_metadata(tag=, metadata=)
Adds metadata to the DataSet.
The metadata is stored under the provided tag.
If there is already metadata under the provided tag, the new metadata replaces the old metadata.
The metadata can be any JSON-able object.

Writing
-------
Comment (Contributor):

I understand from the separation that it is not possible to directly create a dataset from an array of values but that first a dataset must be created followed by writing where "results" are added. Is this correct?

Reply (Author):

That's correct with the current design.

Is creating a DataSet from a set of arrays useful? It's not hard to add, if it's needed.

Reply (Contributor):

yes, this would be useful.


DataSet.add_result(**kwargs)
Comment (Contributor):

I think it would greatly benefit the usability of the eventual implementation if the add_result method has explicit arguments rather than the generic **kw. I find that it greatly helps me if I have access to that information in the notebook workflow.

Reply (Author):

I think all of the keywords would be parameter names, so not known at compile time anyway. At least, that was what I had in mind. That means that each DataSet might have different arguments to add_result.

The alternative would be for add_result to take a single (positional) parameter that is a dictionary of parameter name/parameter value pairs. I'm fine with that, but it's no less opaque than just keyword arguments.

If there's a better way, please tell me!! My last heavy Python coding was in version 1.4...

Reply (Contributor):

I think this is the only way that matches the api design :D

Reply (Contributor):

which one @giulioungaretti?

Reply (Contributor):

Using kwargs

Adds a result to the DataSet.
Keyword parameters should have the name of a parameter as the keyword and the value to associate as the value.
If there is only one positional parameter and it is a dictionary, then it is interpreted as a map from parameter name to parameter value.

It is an error to provide a value for a key or keyword that is not the name of a parameter in this DataSet.

It is an error to add a result to a completed DataSet.

DataSet.add_results(args)
Adds a sequence of results to the DataSet.
The single argument should be a sequence of dictionaries, where each dictionary provides the values for all of the parameters in that result.
See the add_result method for a description of such a dictionary.
The order of dictionaries in the sequence will be the same as the order in which they are added to the DataSet.

It is an error to add results to a completed DataSet.

DataSet.add_parameter_values(spec, values)
Adds a parameter to the DataSet and associates result values with the new parameter.
The values must be a NumPy array or a Python list, with each element holding a single result value that matches the parameter's data type.
If the DataSet is not empty, then the count of provided values must equal the current count of results in the DataSet, or an error will result.

It is an error to add parameters to a completed DataSet.

DataSet.mark_complete()
Marks the DataSet as completed.
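The writing semantics above (keyword-per-parameter results, preserved insertion order, completion as a one-way switch) can be illustrated with a tiny in-memory sketch. This is not the proposed implementation, just the behavior the spec asks for:

```python
class MiniDataSet:
    """Toy illustration of the writing semantics in this spec."""

    def __init__(self, name, param_names):
        self.name = name
        self.param_names = list(param_names)
        self.results = []            # insertion order is preserved
        self.completed = False

    def add_result(self, **kwargs):
        # It is an error to add a result to a completed DataSet.
        if self.completed:
            raise RuntimeError("cannot add results to a completed DataSet")
        # It is an error to provide a value for an unknown parameter.
        unknown = set(kwargs) - set(self.param_names)
        if unknown:
            raise ValueError(f"unknown parameters: {unknown}")
        self.results.append(dict(kwargs))

    def add_results(self, dicts):
        # Results keep the order in which they appear in the sequence.
        for d in dicts:
            self.add_result(**d)

    def mark_complete(self):
        self.completed = True

ds = MiniDataSet("sweep", ["x", "y"])
ds.add_result(x=0.0, y=1.0)
ds.add_results([{"x": 0.1, "y": 1.1}, {"x": 0.2, "y": 1.4}])
ds.mark_complete()
print(len(ds.results))  # 3
```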

Access
------

DataSet.id
Returns the unique identifying string for this DataSet.
This string will include the date and time that the DataSet was created and the name supplied to the constructor,
as well as additional content to ensure uniqueness.

DataSet.length
This attribute holds the current number of results in the DataSet.
Comment (Contributor):

I would wager this can be a rather ambiguous quantity. Specifically when considering more complex measurements that may not have identical shapes for the different arrays.

Reply (Contributor):

can you give an example @AdriaanRol? If each 'row' is a measurement (of which the parameters set/measured might vary in shape and number etc.) then although the shape of a DataSet wouldn't be easily definable, I can't think of a scenario where the length is ambiguous.

Reply (Contributor, @AdriaanRol, Feb 7, 2017):

You measure single shots for a 2qubit tomography.
Two rows are all the raw shots (1000s of them)
You then add three other rows containing the correlations for each bin (depending on how you choose it 36*3 values).

Now, is the length the number of shots?
Or the number of different quantities (36*3+2)?

Not clear to me.

Reply (Contributor):

I would have expected the length in that case to be 2+3, which I understand isn't then a particularly useful measure in this case (since results are not of the same kind of information), but I wouldn't have called it ambiguous. But if I'm wrong in interpreting your 'rows' as 'results' then disregard the above. It seems like a value that, when building a dataset without using a qc.Loop, would depend entirely on how the user decides to structure the data (and so loses its general meaning to anyone else). For Loop-built datasets though I think it makes sense.

Reply (Contributor):

if you think in terms of nested datasets, the answer is clear: 2+3 as @nataliejpg pointed out


DataSet.is_empty
This attribute will be true if the DataSet is empty (has no results), or false if at least one result has been added to the DataSet.
It is equivalent to testing if the length is zero.

DataSet.is_marked_complete
This attribute will be true if the DataSet has been marked as complete or false if it is in progress.

DataSet.get_data(*params, start=, end=)
Returns the values stored in the DataSet for the specified parameters.
The values are returned as a list of parallel NumPy arrays, one array per parameter.
The data type of each array is based on the data type provided when the DataSet was created.

The parameter list may contain a mix of string parameter names, QCoDeS Parameter objects, and ParamSpec objects.

If provided, the start and end parameters select a range of results by result count (index).
Start defaults to 0, and end defaults to the current length.

If the range is empty -- that is, if the end is less than or equal to the start, or if the start is after the current end of the DataSet --
then a list of empty arrays is returned.
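The start/end range semantics can be sketched with NumPy slicing. Here the per-parameter columns are hypothetical stand-ins for the DataSet's internal storage:

```python
import numpy as np

# Hypothetical stored columns, parallel across parameters.
columns = {
    "x": [0.0, 0.1, 0.2, 0.3],
    "y": [1.0, 1.1, 1.4, 1.9],
}

def get_data(*params, start=0, end=None):
    """Return parallel NumPy arrays for the named parameters (sketch)."""
    length = len(next(iter(columns.values())))
    if end is None:
        end = length                       # end defaults to current length
    if end <= start or start >= length:
        # Empty range: a list of empty arrays, one per requested parameter.
        return [np.array([]) for _ in params]
    return [np.asarray(columns[p][start:end]) for p in params]

xs, ys = get_data("x", "y", start=1, end=3)
print(xs, ys)
```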

DataSet.get_parameters()
Returns a list of ParamSpec objects that describe the parameters stored in this DataSet.

DataSet.get_metadata(tag=)
Returns metadata for this DataSet.
Comment (Contributor):

Would it make sense to specify examples of metadata that should be supported and some kind of structure for that in this document?

I would think of at least the following

  • Instrument settings in the form of station.snapshot(),a (nested) dictionary
  • Analysis results added at a later time
  • Images as the result of the analysis, or a link to them (though arguably that should not be part of the dataset itself).
  • Some format for adding arbitrary user metadata

Reply (Author):

Yes, that makes sense.

My basic idea is that anything that is JSON-able should be valid as metadata, and that metadata should be thought of as a big, nested dictionary.

Images feel like something that should be associated with the DataSet through the experiment container or some equivalent container -- although it also seems to me that every plot should include a pointer to the data set somehow. Perhaps each DataSet should have a GUID that uniquely identifies it?


If a tag string is provided, only metadata stored under that tag is returned.
Otherwise, all metadata is returned.

Subscribing
----------------

DataSet.subscribe(callback, min_wait=, min_count=, state=)
Subscribes the provided callback function to result additions to the DataSet.
As results are added to the DataSet, the subscriber is notified by having the callback invoked.

- min_wait is the minimum amount of time between notifications for this subscription, in milliseconds. The default is 100.
- min_count is the minimum number of results for which a notification should be sent. The default is 1.

When the callback is invoked, it is passed the DataSet itself, the current length of the DataSet, and the state object provided when subscribing.
If no state object was provided, then None is passed as the state argument.

The callback is invoked when the DataSet is completed, regardless of the values of min_wait and min_count.

This method returns an opaque subscription identifier.

DataSet.unsubscribe(subid)
Removes the indicated subscription.
The subid must be the same object that was returned from a DataSet.subscribe call.
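The bookkeeping behind these subscriptions can be sketched as follows (min_count only; min_wait and completion notification are omitted for brevity, and all names are illustrative):

```python
import itertools

class SubscribableDataSet:
    """Toy sketch of subscribe/unsubscribe with a min_count threshold."""

    _ids = itertools.count()

    def __init__(self):
        self.results = []
        self._subs = {}   # subid -> [callback, min_count, state, last_len]

    def subscribe(self, callback, min_count=1, state=None):
        subid = next(self._ids)
        self._subs[subid] = [callback, min_count, state, 0]
        return subid       # an opaque subscription identifier

    def unsubscribe(self, subid):
        del self._subs[subid]

    def add_result(self, result):
        self.results.append(result)
        for sub in self._subs.values():
            callback, min_count, state, last_len = sub
            # Only notify once at least min_count new results have arrived.
            if len(self.results) - last_len >= min_count:
                # Callback gets the DataSet, its current length, and state.
                callback(self, len(self.results), state)
                sub[3] = len(self.results)

seen = []
ds = SubscribableDataSet()
sid = ds.subscribe(lambda d, n, s: seen.append(n), min_count=2)
for i in range(5):
    ds.add_result({"x": i})
ds.unsubscribe(sid)
print(seen)  # notified at lengths 2 and 4
```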

Storage
-------

DataSet persistence is handled externally to this class.

The existing QCoDeS storage subsystem should be modified so that some object has two methods:

- A write_dataset method that takes a DataSet object and writes it to the appropriate storage location in an appropriate format.
- A read_dataset method that reads from the appropriate location, either with a specified format or inferring the format, and returns
a DataSet object.
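Since persistence lives outside DataSet, a writer/reader pair could be as simple as a JSON round-trip for JSON-able content. This is only a sketch; the real storage subsystem will choose formats and locations, and the dict shape and id format below are made up:

```python
import json
import os
import tempfile

def write_dataset(dataset_dict, path):
    """Write a dict-shaped dataset snapshot to JSON (illustrative)."""
    with open(path, "w") as f:
        json.dump(dataset_dict, f)

def read_dataset(path):
    """Read the snapshot back (illustrative)."""
    with open(path) as f:
        return json.load(f)

snapshot = {
    "name": "sweep",
    "results": [{"x": 0.0, "y": 1.0}],
    "metadata": {"id": "sweep-2017-03-08-0001"},   # made-up id format
}

path = os.path.join(tempfile.mkdtemp(), "ds.json")
write_dataset(snapshot, path)
print(read_dataset(path) == snapshot)  # True
```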

Metadata
========

While in general the metadata associated with a DataSet is free-form, it is useful to specify a set of "well-known" tags and paths that components can rely on to contain specific information.
Other components are free to specify new well-known metadata tags and paths, as long as they don't conflict with the set defined here.

parameters
This tag contains a dictionary from the string name of each parameter to information about that parameter.
Thus, if DataSet ds has a parameter named "foo", there will be a key "foo" in the dictionary returned from ds.get_metadata("parameters").
The value associated with this key will be a string-keyed dictionary.

parameters/__param__/spec
This path contains a string-keyed dictionary with (at least) the following two keys:
The "type" key is associated with the NumPy dtype for the values of this parameter.
The "metadata" key is associated with the metadata that was passed to the ParamSpec constructor that defines this parameter, or an empty dictionary if no metadata was set.

Utilities
=========

There are many utility routines, defined outside of the DataSet class, that may be useful.
We collect several of them here, with the note that these functions will not be part of the DataSet class
and will not be required by it.

dataframe_from_dataset(dataset)
Creates a Pandas DataFrame object from a DataSet that has been marked as completed.
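Assuming the parameter names and parallel value arrays are retrievable from a completed DataSet (via get_parameters and get_data), such a helper could be sketched as follows; the function signature here is illustrative, not the spec's:

```python
import pandas as pd

def dataframe_from_dataset(param_names, columns):
    """Build a DataFrame from parallel per-parameter value lists (sketch).

    `param_names` and `columns` stand in for what get_parameters and
    get_data would provide on a completed DataSet.
    """
    return pd.DataFrame({name: col for name, col in zip(param_names, columns)})

df = dataframe_from_dataset(["x", "y"], [[0.0, 0.1], [1.0, 1.1]])
print(df.shape)  # (2, 2)
```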

Open Issues
===========

#. Should it be possible to "reopen" a DataSet that has been marked as completed?

This is convenient for adding data analysis results after the experiment has ended, but could potentially lead to accidentally mixing data from different experimental runs.
It is already possible to modify metadata after the DataSet has been marked as completed, but sometimes that may not be sufficient.