Initial draft of the DataSet spec #476

Merged: 8 commits from dataset-spec into master, Mar 8, 2017
Conversation

@alan-geller commented Feb 3, 2017:

This is an initial draft of the specification for a new DataSet class. Please post comments, suggestions, and other feedback!

@peendebak (Contributor):

@alan-geller Thanks for the spec. I am assuming this is for the next generation DataSet. I am missing a couple of points in the specification that are important for our workflow:

  • The data in the DataSet should be very easy to access. Right now we can access the raw numpy array using dataset.myarray.ndarray, which is very convenient; in particular, we can pass a DataArray straight into many numpy functions. It would also be great if we could apply operations to DataArray objects, e.g.
array = 2 * dataset.myarray + 100

The array would then either be a numpy array or a DataArray.

  • Related to that: it should be easy to add arrays back to an existing dataset. E.g.
dataset = mymeasurement()
array = myprocessing(dataset.measured)
dataset.add_array(array, setpoints = ...)
  • Making a subset of a DataSet should be possible (see DataArray subset, QCoDeS/Qcodes_loop#21).
  • Storage of datasets should use standard Python frameworks such as pickle or JSON formatters. An example:
dataset = mymeasurement()
params = myfit(dataset)

with open(filename, 'wb') as f:
    pickle.dump({'dataset': dataset, 'params': params}, f)
  • The metadata in the DataSet should be identical after writing and reading the DataSet for all reasonable objects in the metadata. The set of reasonable objects would include at least: str, int, float, a python list (recursive), a python dict (recursive), numpy arrays.

  • In the spec there is a function DataSet.read_from(location, formatter=). Specifying the formatter is inconvenient for many users. Suppose I want to read in a dataset from two years ago, or from a different user: I then just want to write DataSet.read_from(location) and let the system figure out automatically which formatter to use (see the sketch after this list).

  • Open issue 1: I would make the call to .write explicit, and perhaps leave the storage functionality out of the dataset altogether (in order to make the DataSet as standalone as possible).
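
A minimal sketch of what such formatter auto-detection could look like. The registry, the read() method, and FormatterError are illustrative assumptions, not part of the spec:

    class FormatterError(Exception):
        """Raised by a formatter that cannot parse the given location."""

    def read_from(location, formatter=None, known_formatters=()):
        # Explicit formatter given: use it directly (current spec behavior).
        if formatter is not None:
            return formatter.read(location)
        # Otherwise try each registered formatter until one succeeds.
        for candidate in known_formatters:
            try:
                return candidate.read(location)
            except FormatterError:
                continue
        raise ValueError("no registered formatter recognizes %r" % (location,))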

@AdriaanRol (Contributor) left a comment:

Hi Alan,

Thanks for sharing this spec with us. I have made quite a few comments while reading, and there are some higher-level remarks I would like to make in this comment as well. I think we should define some use cases to test the DataSet against, mostly because I have found that a Loop built on the core assumption of setting and getting single values is too simple and leads to a lot of problems further on. I asked some questions related to this in the review, but it boils down to supporting non-matching shapes in a DataSet and potentially nesting DataSets within DataSets, a concept which I think would be very powerful but not free of controversy.

I am looking forward to your thoughts.

The DataSet class should be usable on its own, without other QCoDeS components.
In particular, the DataSet class should not require the use of Loop and parameters, although it should integrate with those components seamlessly.
This will significantly improve the modularity of QCoDeS by allowing users to plug into and extend the package in many different ways.
As long as a DataSet is used for data storage, users can freely select the QCoDeS components they want to use.
Contributor:

Does this imply the other components will not work if the DataSet is not being used or is the modularity intended to work both ways?

Author:

Modularity should work both ways, when it makes sense. Specifically, modularity should respect layering: it should always be possible to use lower-level, more "base" components without higher-level components, but it may be acceptable to have higher-level components rely on base components.

As an example, DataSet should be usable stand-alone, but Loop can require DataSet if that makes sense. An important corollary is that anything else that uses DataSet, like plotting, should work just as well if the DataSet is filled in by hand as when it is filled in using Loop.
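
As a concrete illustration of that corollary, here is a minimal sketch of filling a DataSet by hand, with no Loop, Instrument, or Parameter involved. DataSet, ParamSpec, add_result, and mark_complete follow the draft spec's API; the exact signatures are assumptions:

    import numpy as np

    ds = DataSet("iv_curve",
                 specs=[ParamSpec("voltage", "input", "float"),
                        ParamSpec("current", "output", "float")])
    for v in np.linspace(-1.0, 1.0, 21):
        ds.add_result({"voltage": v, "current": v / 50.0})  # 50-ohm stand-in
    ds.mark_complete()
    # Plotting should now treat ds exactly like a Loop-generated DataSet.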


Result
A result is the collection of parameter values associated to a single measurement in an experiment.
Roughly, a result corresponds to a row in a table of experimental data.
Contributor:

I would always put them in columns, but maybe that's just my transposed intuition. I do think columns are more readable if the dataset itself is intended to be directly viewable.

Author:

I guess I think of a parameter as a column and a result as a row. Too much relational database history...

I also tend to think of columns as more-or-less fixed, while you can add rows forever.

Role
Parameters may play different roles in a measurement.
Specifically, they may be input to the measurement (set-points) or outputs of the measurement (measured or computed values).
This distinction is important for plotting and for replicating an experiment.
Contributor:

I am afraid of the problems this distinction might cause. I can think of three examples where the relevant distinction between setpoints and outputs is not natural.

  • The R&S vector network analyzer (VNA) returns S-parameters as a function of frequency. However, instead of specifying the frequencies, you specify start, step, and stop frequency (or some similar parameterization). The VNA returns the frequencies and the measured S-parameters all at once, so the frequencies (your intended set-parameter) are not known beforehand, nor do you set them directly.
  • There are measurements where you want to plot certain quantities against each other, e.g., qubit frequency vs. T1. Both are measured values, not set-points. For the purpose of plotting, having to specify this distinction will only hinder the analysis.
  • Lastly, there are adaptive measurements in which the set-points are generated during the measurement but are also outputs of it; I'm not sure how to fit these into this distinction.

All of the above can be worked around, and as such are not blocking. However, I think we should design an architecture in which hacking is not required.

Contributor:

I very much agree with this. In my experience the difficulty is the difference between handling 'real' setpoints (things actually set on the instrument), calculated setpoints (exactly the frequency example above, which you neither set on the instrument nor read from it, but rather calculate from the values of other parameters), and actual measured values (returned by the instrument). I think some examples should be worked out before too much time is spent writing this aspect of the data set, to avoid (as much as possible) problems coming up later that result in unpleasant hacking.

Contributor:

Further to this: should one parameter be able to have both setpoints AND measured values? Currently they can, but from the above it doesn't sound like the ones you imagine have that functionality.

Contributor:

Another example would be a 'monitor' loop that makes measurements for an indeterminate amount of time, e.g. measure the current through the device until the fridge is below 100 mK. You don't set anything (or everything is set at the beginning of the loop), and the size of the dataset is not known initially.

Author:

I don't have any problem dropping this distinction. It doesn't affect the behavior of the DataSet at all, as far as I can tell.

The primary use overall seems to be in plotting: inputs are X axis, outputs are Y, so knowing the role of a parameter allows some defaulting. I don't know how much this is ever used, though.

So I would be in favor of dropping this notion entirely, as long as it doesn't break plotting expectations.

Contributor:

Simplest would maybe be to have no 'setpoints' or 'measured' at all, and just have the user specify what to plot...

Collaborator (@jenshnielsen):

I get the impression that some users would strongly prefer a default plot representation for 1D and 2D datasets. However, this could probably just as well be handled by writing some metadata to the dataset naming the default x and y axes (naturally overridable for more advanced plots).

Contributor:

@jenshnielsen I like your suggestion as it is far less constraining and would also work in more general cases. Moreover, I understand this would be optional, is that correct?

Collaborator (@jenshnielsen):

I guess it could be optional. It would make sense for Loop, or whatever stores the data, to flag data as sweep axis 1, 2, ..., but some other mechanism might not do that.

Author:

@jenshnielsen I like your idea as well. The plotting package can specify what metadata it looks for, and then it's up to the code that creates and fills the DataSet (e.g., Loop) to make sure that the right metadata is there.
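
A sketch of what that convention might look like, using the spec's metadata dictionary. The "plot" tag and its keys are hypothetical; the plotting package would define the actual names it looks for:

    ds.add_metadata({
        "plot": {
            "default_x": "gate_voltage",   # sweep axis 1
            "default_y": "frequency",      # sweep axis 2 (for 2D data)
            "default_z": "s21_magnitude",  # measured value to color-map
        },
    })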

Depending on the state of the experiment, a DataSet may be "in progress" or "completed".

ExperimentContainer
An ExperimentContainer is a QCoDeS object that stores all information about an experiment.
Contributor:

What is an experiment?

Mostly because we can either argue that an experiment runs for several weeks/months and has many datasets or it can be a single run of some experiment.

I'm mostly asking because how DataSets are grouped into experiments, and how we can combine and split multiple ExperimentContainers, has far-reaching consequences for the usability of the DataSet.

Contributor:

I can quickly answer on this point, as I have been very involved in the Experiment container.
The experiment is a collection of datasets; it is up to the user to "flip the switch" to a new experiment, or load an old one and keep going.
The dataset itself has no notion of being part of an experiment container.

Does this clarify?

@majacassidy:

Hi @giulioungaretti, can you please clarify the following about the Experiment container?

  • Can an individual dataset be part of multiple experiments?
  • Can individual datasets be split out from multiple experiment containers, e.g. for comparison purposes?
  • Can multiple experiments run simultaneously?
  • Are the datasets available without an Experiment container?
  • What exactly does the experiment store? Links to the datasets, or actual datasets? Default parameters? Instrument drivers? Executable scripts?

To me, these features are all absolutely crucial for running long-term, complex experiments. We have to be able to easily compare data from all points in history, as well as between different Station Q experimental sites.

Contributor (@giulioungaretti):

@majacassidy thanks for the feedback.

  • I would say yes, although it's more of an implementation issue than a conceptual one (i.e. how to make it easy to do this).

  • Datasets live on their own, so yes. The container just acts like a file system on steroids (allowing one to perform searches and so on). I guess we would also want to add this feature: select x from container z and y from container w, and compare them.

  • Yes. Although one is always limited by hardware that can't operate simultaneously.

  • Yes.

  • The first implementation will store a pointer to the dataset. It may in the long run be more convenient to store the actual data (but this opens up a lot of accessibility problems).
    The container would store:
    - references to all the datasets generated, with a unique hash and a timestamp
    - metadata of all the datasets, linked to the unique hash of the dataset they describe
    - metadata of the experiment (something a user would want to stick inside; literally anything)
    - the GitHub hash of the qcodes version in use (easy to trace bugs and make references), and a diff of what has changed locally (custom drivers and so on)
    - scripts (but to do that we'd have to agree on a standard way to include them)
    

Re: parameters, those are saved in the metadata of every dataset.
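
An illustrative sketch of what one container record might hold, per the list above; the field names are not from the spec:

    entry = {
        "dataset_ref": "/data/2017-02-03/exp42",   # pointer to the dataset
        "dataset_hash": "d41d8cd9",                # unique hash
        "timestamp": "2017-02-03T14:22:05",
        "dataset_metadata": {"sample": "nw7"},     # linked by the same hash
        "qcodes_git_hash": "48d663a",
        "local_diff": "",                          # custom drivers and so on
    }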

@majacassidy:

Great, thanks for the clarification @giulioungaretti!

---------

#. A DataSet can store data of (reasonably) arbitrary types.
#. A completed DataSet should be immutable; neither its metadata nor its results may be modified.
Contributor:

We like to add metadata, such as fitting results and figures, to a DataSet after it has been marked as completed. You also list similar purposes below. I think it makes sense to lock the main DataSet after the experiment is complete; however, it should remain possible to add (and also modify) post-experiment metadata such as fitting results.

Maybe a distinction is needed here, but I sense conflicting requirements.

Contributor:

Also, just being able to add comments to a DataSet afterwards, like 'oops, power cut midway through'? This seems like a good use case for adding user-generated metadata after a data set is complete, which would add a lot of functionality. Whether analysis should be something other than 'metadata' I am a bit conflicted about, but it's worth a conversation.

Author:

I have no problem with allowing metadata to be modified post-completion.

In general, I've tried to start with the most restrictive possible requirements, because it's better to add functionality than to remove it.

This is a static method in the DataSet class.
It returns a new DataSet object.

DataSet.read_updates()
Contributor:

I am a bit worried about the implications of having a read_updates() command.

Such a feature is only useful if the data in the underlying location is expected to change, and as such it suggests the possibility of having multiple copies of the same dataset open. This may be exactly what we want (I can think of numerous applications where it is useful to have the DataSet open in another process), but we should then also ensure we properly manage any potential for conflicting changes to the data.

Author:

Agreed, having multiple in-memory copies of the same persistent object can be problematic.

The current design tries to address this by making DataSet append-only, so you can't overwrite or modify data written by some other process. You could step on metadata, though, and you could also add multiple copies of the same result.

We could require just a single copy of the DataSet, but for applications like plotting that might cause performance problems.

We could also tag one copy as the master copy and only allow updates at the master. In many ways that's the simplest way to solve the problem.

Thoughts?


At least for now, it seems useful to maintain the current behavior of the DataSet flushing to disk periodically.

#. Should there be a DataSet method similar to add_result that automatically adds a new result by calling the get() method on all parameters that are defined by QCoDeS Parameters?
Contributor:

No, there should not be such a method.

That is exactly what the Loop (or analogous function) is intended to do. I think the DataSet should be as simple and modular as possible.

Author:

I agree. In the best of all possible worlds, DataSet should be compilable and usable without any other QCoDeS files at all.

Right now that's not quite true; the ability to construct a DataSet from a QCoDeS Parameter breaks that layering. We could fix it by having Parameter inherit from ParamSpec, which might be worth the effort.

It should be possible to read "new data" in a DataSet; that is, to read everything after a cursor.
#. It should be possible to subscribe to change notifications from a DataSet.
It is acceptable if such subscriptions must be in-process until QCoDeS multiprocessing is redone.
Change notifications should include the results that were added to the DataSet that triggered the notification.
Contributor:

I think this at least should be optional. I can think of cases where the size of the data would just blow up the required memory if everything is sent along for every change.

Think of experiments where raw traces are stored and multiple of these come in at the same time, or just think of the Alazar card where speed and performance are critical. Having the computer slow down because of some change notification is not desired.

Author:

That makes sense. I think I actually changed this in the API, but forgot to modify the requirement; the notification indicates which results were added, but doesn't include the actual new results.
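
A sketch of a subscriber under this lighter contract: the callback learns which results arrived and pulls them on its own schedule. The min_count and min_wait options appear in the commit notes further down; the other names are assumptions:

    def on_results_added(dataset, first_index, last_index):
        # Pull only the new rows; the notification itself carries no data.
        new_rows = dataset.get_data(start=first_index, stop=last_index)
        print("received", len(new_rows), "new results")  # stand-in for a plot update

    ds.subscribe(on_results_added, min_count=100, min_wait=0.5)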

Basics
---------

#. A DataSet can store data of (reasonably) arbitrary types.
Contributor:

I have not read anything about the shapes of these data in the document. However, I think it is important to specify some use cases that do not involve the typical column of values coming in one at a time, as this has consequences for a lot of the methods defined (for instance, what is the length of a dataset when it contains arrays of different shapes?).

To give a few examples

  • Simple loops: values come in one by one for one or more parameters (here basic columns would suffice).
  • Array-based measurements: values come in in chunks of n values that may or may not match the corresponding set-points (already non-matching shapes).
  • Array-based measurements with metadata: think of hardware that gives you back raw traces but also the result of integrating them with some weight function. Depending on the experiment, the saving of these raw traces may be turned on or off, but they certainly belong in the same dataset.

I realize these descriptions may be a bit vague, as I try to describe them in terms of their consequences for the DataSet, but they relate to experiments we do on a daily basis. Let me know if you have any questions.

Contributor:

I will finish up the final Alazar9360 tweaks, but this is very relevant, and I think some thinking about how drivers should shape and label data (which relates to my comment above) is important. Ideally the 'shape' of the QCoDeS parameter being measured/set/saved should be mirrored as intuitively as possible in the dataset shapes.

Author:

My thinking is that anything you can represent in a NumPy data type object should be allowable as the value of a single parameter in a single result. I think this covers simple scalars, tuples of scalars, arrays of scalars, tuples of arrays, arrays of tuples, etc.

For the array-based measurement with metadata, I would probably model that as a tuple containing an array and one or more scalars. I think storing more flexible (JSON-ish) metadata with each result is likely to cause problems accessing the data in a simple and efficient way.
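
A sketch of that model with a real NumPy structured dtype: each result value holds a raw trace plus its integrated scalar. The field names are illustrative:

    import numpy as np

    shot_dtype = np.dtype([("raw_trace", np.float64, (1024,)),
                           ("integrated", np.float64)])
    shots = np.zeros(3, dtype=shot_dtype)          # three measurement shots
    shots[0]["raw_trace"] = np.random.randn(1024)
    shots[0]["integrated"] = shots[0]["raw_trace"].mean()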

Contributor:

I'm still not really clear how the metadata fits into it, but perhaps that is best demonstrated with an example when we progress to that point. Thanks.

Author:

@nataliejpg @AdriaanRol One thing I'm not clear on: do we need result-level metadata as well as DataSet-level metadata? We could certainly have metadata on each result, but that feels like it might be overkill -- and it might be really hard to use effectively.


#. A DataSet object should allow writing to and reading from storage in a variety of formats.
#. Users should be able to define new persistence formats.
#. Users should be able to specify where a DataSet is written.
Contributor:

I would like to suggest some helper functions which should be possible (though I realize this may not belong in the DataSet itself):

  • It should be possible to load (parameter) settings from a DataSet onto the currently active instruments
  • It should be possible to easily compare parameter settings between different datasets and between datasets and the active environment.

Author:

Both make sense. I agree that they should be separate helper functions, though, in order to keep DataSet from being dependent on Instrument.

Creates a parameter specification from a QCoDeS Parameter.
If optional is provided and is true, then the parameter is optional in each result.

ParamSpec(name, role, type, desc=, optional=)
Contributor:

An example of this would definitely be really great in terms of helping clarify how these ParamSpecs that don't correspond to actual Parameters would work. I'm trying to envisage a use case, and wondering whether it's meant to replace calculated setpoints, or whether it's for data calculated from the measured values when you want to store both raw data and some calculated data. Is either of those what is envisaged?

Author:

Yes, calculated values certainly fit this scenario.

It also allows you to create a DataSet without using QCoDeS Instruments and Parameters at all, and then use QCoDeS plotting and persistence for your data. For instance, when Damaz was using QCoDeS to drive simulations using LIQUi|>, this might have been an easier way (although it might have made it harder, too).
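
A sketch of a ParamSpec for a calculated value with no backing QCoDeS Parameter, using the signature quoted above; the role and type strings are assumptions:

    raw = ParamSpec("raw_signal", "output", "array",
                    desc="trace as returned by the digitizer")
    demod = ParamSpec("demodulated", "output", "float",
                      desc="software demodulation of raw_signal")
    ds = DataSet("demod_run", specs=[raw, demod])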

@AdriaanRol (Contributor):

I would like to start a discussion on the ExperimentContainer and nesting of Datasets.

Besides previous discussions the following comments are related to this idea.
@alan-geller

Perhaps each DataSet should have a GUID that uniquely identifies it?

@nataliejpg

something with DataSet ids and having the final 'experiment' put the ids of the calibration datasets in the metadata of the 'final experiment dataset'? Might be cleaner than having nested DataSets. Also, perhaps this can be somewhat abstracted to the ExperimentContainer, which I know nothing about, but which could in my imagination know about these things and be able to do things like 'update_calibration_dataset_ids()'.

I think there is a general consensus that it should be possible to group DataSets. @giulioungaretti proposes an Experiment Container, which is a "file system on steroids". The current discussions show that we want to be able to group some closely related experiments.

I would propose that instead of creating 3 or more different layers for experiments (Container, the to-be-discussed grouping, and DataSet), we instead make it possible to nest DataSets within DataSets. This could be as simple as allowing the GUID of a DataSet (as a string identifier) to be a valid entry in a DataSet. The viewer can then take care of opening the desired other dataset when required.

The different DataSets can still be different files, allowing easy transfer of only part of the data and grouping into one or more Experiment Containers.

Another advantage of solving this problem by nesting, as opposed to adding another layer, is that one only has to learn one abstraction, and the superpowers @giulioungaretti is talking about will apply to searching the original dataset directly.

Looking forward to everyone's thoughts

@nataliejpg (Contributor):

@AdriaanRol I like the idea of the DataSet being as simple as possible, so that if you have a DataSet you know how deep it can be and what sort of objects it can contain (i.e. nothing more complicated than some labels and numpy arrays, sensibly organised, plus an id/tag used to identify it), and you don't need information about its depth before you can interact with it. I appreciate the desire not to have too many layers, though, so I would support having only one container object (which I wouldn't have a big problem with being nestable, although I think that in general the SQL relational-table version is cleaner). I do think that there should be a simple, well-defined DataSet at the lowest level, though. I've included the relevant quote from @giulioungaretti.

@giulioungaretti

@AdriaanRol

I think nesting is not such a good idea in terms of design for the dataset.
I would say that all of the use cases you describe can be made nice and simple just by using a list of data_sets and a container,

much like @nataliejpg is saying.

I don't know how much database knowledge you have @AdriaanRol, but if you do, think of the container as an SQL db, where every data_set is a table, and another table describes relations between datasets, for example.

Also, I'm not sure how metadata fits into this. Perhaps a JSON-like object with some pretty-print and search functions, with a 1:1 relationship between metadata and dataset objects which share an id (or each store their own id and the corresponding object's id). A container could then hold the ids of the datasets plus its own metadata, analysis, plots etc., and you could have a 'calibration' container inside an 'experiment' container (if we went with nestable containers). This would also be nice because multiple containers could know about/use the same DataSet just by holding its id, rather than a copy of the DataSet.

@majacassidy:

@AdriaanRol great discussion.

Some thoughts

  • I feel like the metadata is what belongs in an SQL-type database structure, even more than the dataset. Eventually we want to slice and dice the metadata to, for example, pull up a list of all pinch-off traces for nanowires from batch QDEV43, or a list of all transmon T1 measurements in a certain date range. Imagine if you could also add in the fabrication information for searching. A PhD in condensed matter physics would take all of 6 months instead of 6 years. This has the potential to be extremely powerful if all the metadata and datasets from all Station Q sites are searchable and can have some advanced analytics applied to them. I think it will revolutionize science. (I think I made the claim to Charlie in 2010 that the 5/2 state would be solved if we could do this; let's just say it's been on my mind for a while.)

  • Maybe the nesting of the DataSet can just be contained within the metadata of the dataset, in some sort of parent_of, child_of thing that links DataSets together.

  • The name ExperimentContainer is perhaps not the best name and could be a source of confusion. The term 'experiment' means all sorts of things to different people. We have to think: is this something that's going to be created every day, every month, or every year? Do people take their experiment container to a different fridge when they switch setups? We need a clear idea of how the structure will be used, and then we can design it.

  • Maybe we should look at how the astronomy community handles their data and metadata.

@alan-geller (Author):

I've loaded a new version of the spec. Mostly I've removed things that don't seem to be required, and added a few new things.
One issue I forgot to add: right now, I say that if you write new metadata for a given top-level tag, it replaces the old metadata. An alternative would be to do some intelligent merging, which is in a sense safer (harder to lose metadata), but in my experience never works the way you want in all circumstances.

I really appreciate the feedback!!

@akhmerov (Contributor) commented Feb 8, 2017:

I like the spec.

Overall it reminds me quite strongly of a pandas DataFrame, except that, like most array types, a DataFrame doesn't easily allow appending without memory reallocation.

Should there be metadata associated with single measurements? Time comes to mind as a good example.

Would a typical measurement result contain most parameters from a dataset, or only a handful?

@alan-geller (Author):

@AdriaanRol (and everyone else)

On DataSets containing DataSets: To me, this doesn't really fit my conceptual model of what a data set is, but that is probably because my mental model of an experiment is not accurate.

My (simplistic) mental model is that you do some basic set-up, then you run your experiment by setting a few parameters to different values and measuring a few variables at the different settings.

From your example, it sounds like this is incorrect, and that a more accurate description of an experiment would allow for different stages, where each stage is a sweep more or less as I've described (potentially adaptive, if it's a calibration sweep, but that's not conceptually a problem), but where you may perform a particular stage (e.g., calibration) more than once, and indeed might want to interrupt one sweep (the main measurement sweep) to interject another.

I fear that trying to glue together DataSets within DataSets to handle all of these possibilities will turn into a nightmare of complexity. I'd be much more inclined to use references (by GUID or filename or whatever) from one DataSet to another.

One somewhat related question: it sounds like the values in each result may be different for different sweep points; e.g., you might only do certain computations every N points, but you want to store the computed values in the DataSet. This was my thinking behind optional parameters: values that were in some results but not all. In the latest version I've dropped "optional" and effectively all parameters are optional. Does this give you enough flexibility?
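
A sketch of what "all parameters are optional" enables: an expensive computed value that is only present in every tenth result. measure() and run_fit() are hypothetical stand-ins:

    import numpy as np

    for i, v in enumerate(np.linspace(0.0, 1.0, 100)):
        result = {"gate": v, "signal": measure(v)}
        if i % 10 == 0:
            result["fit_quality"] = run_fit(ds)   # computed every N points only
        ds.add_result(result)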

Removed API reference to QCoDeS Parameters, allowed addition of
parameters with value arrays, dropped "optional", added min_count and
min_wait to subscriptions, and some RST cleanup.
@nataliejpg (Contributor):

@alan-geller I like references as a way to get the nesting-type functionality; it would be nice to come up with a way to visualise these, so you could look at your experiment/data structure etc., but that's not a first-iteration necessity by any means.
I also like having all parameters optional.

@akhmerov good point; time would definitely be a good thing to have for all data points, maybe even with a config option for whether, in an experiment, you want all results or even data points to be timestamped (and have this be an array in the dataset, i.e. have time as a ParamSpec).

Metadata is the still unresolved issue in my mind, do other people have a clear picture of how it is/should be structured (especially at what level we want to have it: datapoint, result or dataset)?

@AdriaanRol (Contributor):

@alan-geller

I fear that trying to glue together DataSets within DataSets to handle all of these possibilities will turn into a nightmare of complexity. I'd be much more inclined to use references (by GUID or filename or whatever) from one DataSet to another.

For me, allowing the GUID of a DataSet as a valid entry for a value in a DataSet would solve the problem and avoid the fears you are listing. It would allow nesting to arbitrary levels and thus allow constructing arbitrarily complex nested DataSets (which of course should be avoided whenever possible). At the same time, it would not require us to rethink anything about the DataSet itself (as a GUID string is a valid NumPy data type).

A requirement that follows from this is to have some way to browse/view these nested GUID references as @nataliejpg notes.

I think a suitable way to manage these things would be some DataSet container that can then include some DataSets. I think it is important to explicitly allow not including all linked DataSets in this container to ensure portability of files.

Some more small points.

  • Per-datapoint timestamps: I like the idea, but I have performance concerns.
  • GUID: I think the GUID should be partly human-readable (e.g. start with yymmdd_hhmmss_userdefinedname_unique_hash) and allow selection with incomplete GUIDs, à la Git's hash selection.
  • I like to associate one or more images with datasets and see them when browsing through my data. I'm not sure how to integrate that idea with our abstract discussion, but I think it is important for usability.

@giulioungaretti (Contributor):

I would not let anything that is not data inside a dataset. It's too complex and dangerous. I am not sure I have ever seen a numpy array with a pointer to another numpy array; the same goes for pandas.
It requires runtime introspection to figure out what kind of "thing" is in the dataset.

What you may want is a way to relate different data_sets.

- per-datapoint timestamps, I like the idea but I have performance concerns.

Don't worry about implementation; this is about specs.
Having a function that checks whether it should stop at every point in your loop is a lot more expensive than getting a timestamp.

- GUID, I think the GUID should be partly human readable (e.g. start with yymmdd_hhmmss_userdefinedname_unique_hash) and allow selecting with incomplete GUID's al la GIT's hash selection

Let's see; this is more of an implementation detail.

- I like to associate one or more images with datasets and see them when browsing through my data. I'm not sure how to integrate that idea with our abstract discussion but I think it is important for usability.

Yes, part of the container not the dataset.

@damazter (Contributor) commented Feb 9, 2017:

I have not finished reading the spec yet, but I think it would be good to add my two cents about nesting datasets.
I think it is crucial to be able to nest them. In my personal opinion it will make everything a lot easier, because it is a natural extension of any looping structure in the first place (for every item in iterable: do something), where you don't care what that something is, as long as it is executable Python code.

It is exactly the same with datasets: for each setpoint in a list, the datum (my attempt at the singular of data) is either a primitive (e.g. int, float, string, etc.) or another dataset.
This would be very convenient for experiments that contain multiple measurement looping structures.
With this nesting property, nothing would stop you from nesting further and therefore stringing more connected data together.

As @AdriaanRol said, just having a GUID as an entry in the dataset would suffice to do this. All datasets contained in the experiment could then be stored in a single container.
@giulioungaretti

I shall not let anything that is not data inside a dataset.

A GUID is a proper datum, imho; it is not as iffy as a pointer.

It requires runtime introspection to figure out what kind of "thing" is in the dataset.

Wouldn't it be good if there were entries in the dataset that tell you what is in there in the first place?

I am really quite convinced that nesting datasets is the right way to go, so I think we should spend some time figuring out what problems this might cause and thinking towards solutions for them.

Alan Geller added 2 commits February 9, 2017 16:33
Added a name parameter to the constructor and an id attribute that
returns a unique identifier for the DataSet, suitable for use as a
reference.
Specified that the identifier should be automatically stored in the
DataSet's metadata.
@alan-geller (Author):

I've added a unique identifier as a fundamental attribute of the DataSet. I think this addresses many of the nesting scenarios.

@damazter I think complex result data can be handled without nesting. My assumption is that a result is a collection of values, and each value can hold arbitrary (NumPy-compatible) data -- not just scalars, but anything you can define a dtype object for, so records holding arrays of records of arrays...

@akhmerov @nataliejpg @AdriaanRol @giulioungaretti On time-stamping results: I think that functionality belongs to the layer that is filling in the DataSet, rather than to the DataSet itself. It would be easy to build a TimedDataSet extension that adds a "time" parameter to the passed-in list of parameters at construction time and adds the current date/time to the values in the result dictionary in add_result. I would prefer to keep that functionality out of the core class, because you might not always want it.
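
A minimal sketch of that extension; DataSet and ParamSpec follow the draft spec, and the details are assumptions:

    import time

    class TimedDataSet(DataSet):
        def __init__(self, name, specs=(), **kwargs):
            time_spec = ParamSpec("time", "output", "float",
                                  desc="UNIX timestamp of the result")
            super().__init__(name, specs=list(specs) + [time_spec], **kwargs)

        def add_result(self, values, **kwargs):
            # Stamp every result as it is added.
            return super().add_result(dict(values, time=time.time()), **kwargs)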

@damazter (Contributor):

@alan-geller
Natively storing any numpy data is a very good idea; I was not trying to be restrictive in my primitive data types. This does not, however, completely preempt the nesting of datasets: if many calibration measurements need to be done during a measurement, it would be nice to separate them out into different sub-datasets. It will make the resulting dataset a lot cleaner and would natively support thinking about the calibration measurement as relatively independent.

This would mean that any measurement code that produces a dataset could then be used as a calibration measurement in a more convoluted experiment.

@alan-geller (Author):

@damazter For the calibration scenario, is it sufficient to have the more convoluted experiment's dataset contain a reference (by unique ID) to the calibration dataset? You could put this in the metadata. Alternatively, if the calibration is performed multiple times during the experiment, you might want to add a result value that holds a reference to the calibration dataset, perhaps along with the resulting calibration values, and every time you do a calibration, insert a result with only those values filled in (leaving them empty the rest of the time).

It occurs to me that a really useful helper function (not part of this spec) will take a dataset identifier and return the dataset, possibly looking up the dataset's location in a database or in a formatted text file or using a web service or...

@alan-geller (Author):

It occurs to me that we'll need to figure out how to plot (and analyze) optional parameters that don't always have values. There may need to be a helper function that takes a NumPy array and strips out all of the "empty" entries. If for some reason that's not feasible, we might have to store in the DataSet itself which parameters were given values in each result, and have a somewhat more complicated version of get_data that allows you to skip empty or partially empty results, or somehow mark empties.
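
A minimal sketch of the first option, assuming empty entries are stored as NaN; if empties are marked some other way, only the predicate changes:

    import numpy as np

    def drop_empty(values):
        """Return only the entries of a 1-D array that actually hold data."""
        values = np.asarray(values, dtype=float)
        return values[~np.isnan(values)]

    drop_empty([1.0, np.nan, 2.5, np.nan])  # -> array([1. , 2.5])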

Does anyone have any insight here?

@damazter (Contributor):

I have finally caught up with you all, and finished reading the document.
A few points which come to mind:

  1. Can a dataset be re-marked as incomplete? This would be useful for adding analysis to a dataset after the measurement is finished.

  2. Can a measurement_container contain multiple datasets? E.g. a dataset for the measurement and multiple datasets for calibration.

  3. I think somebody said it before, but it would be really useful to add a function that can open a dataset from its GUID only. It might also be useful to be able to open a dataset based on name only (or a function that returns all GUIDs of datasets with that name).

Finally, about nesting: if measurement containers can hold an arbitrary number of datasets, then I do not see how nesting is a bad thing (or how it could even be prevented); it would just amount to cross-referencing between datasets.

@nataliejpg

@AdriaanRol I like the idea of the DataSet being as simple as possible and so if you have a DataSet you know how deep it can be and what sort of objects it can contain

One could adopt the simple convention that a dataset stops when it contains a GUID of another dataset, so that the top dataset does not actually contain any measurement data, only a collection of references. How we solve plotting is a different issue (but I think this is not so important, because for many of the cases I have in mind, plotting data from different datasets at the same time via this nesting structure is not directly needed).

@alan-geller

For the calibration scenario, is it sufficient to have the more convoluted experiment's dataset contain a reference (by unique ID) to the calibration dataset? You could put this in metadata;

I am not a big fan of this, because it is less general than your suggestion below, but a user would always be free to add this reference in the metadata, I guess.

alternatively, if the calibration is performed multiple times during the experiment, you might want to add a result value that holds a reference to the calibration dataset, perhaps with the resulting calibration values, and every time you do a calibration insert a result with only those values filled in (and leave them empty the rest of the time)?

I don't see how this is different from what I had in mind (am I missing something, @AdriaanRol?). It would be good if all these datasets were part of the same measurement_container, for mess prevention.

@alan-geller (Author):

@damazter On the very last item: yes, I think having a result value that holds a DataSet reference by GUID is exactly the same as what you're suggesting.

@alan-geller (Author) commented Feb 13, 2017:

A general question for everyone:
On Slack, a suggestion was made to use a pandas dataframe as the DataSet. I'm not familiar with pandas, so I went and did some reading. On the one hand, it looks like dataframes are massive and complex objects -- but someone else does all the work to make them massive and complex, so that's not necessarily an issue. I think we would need a mechanism to hold metadata, as well as a dataframe.

Would it be sufficient for DataSet to simply be a class that holds a metadata dictionary and a pandas dataframe? If sufficient, would it be usable?

Never mind -- dataframes don't grow in place. When you append a new row, an entire new dataframe is created, copying all of the data. This won't work for us, for perf reasons.

@peendebak (Contributor):

@alan-geller My suggestion to consider pandas dataframes or other structures was a suggestion to use existing known python packages and structures as much as possible. For example, right now the DataArray is a wrapper around the numpy array object. I would not use a pandas dataframe as a DataSet, but perhaps as a DataArray.

@alan-geller (Author):

@peendebak I agree on the suggestion -- the more we can leverage from existing packages, the less work we have to do.

My mental model of the new DataSet class would use a NumPy array for each parameter. I don't anticipate bringing the DataArray class forward; I'm not sure what value it has, beyond letting you get at the underlying NumPy array.

Looking at pandas, though, it does look like adding a helper function that creates a dataframe from a (completed) DataSet might be useful.
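
A sketch of such a helper; the accessors (parameters, get_values) are assumptions about the eventual DataSet API:

    import pandas as pd

    def dataset_to_dataframe(ds):
        # One column per parameter, built once the DataSet is complete.
        return pd.DataFrame({spec.name: ds.get_values(spec.name)
                             for spec in ds.parameters})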

Removed persistence, added more metadata details, added utility
function section
@alan-geller (Author):

I uploaded a new draft:

  • I dropped all of the persistence methods; I think they belong outside.
  • I added a section on "well-known" metadata.
  • I added a section on useful utility functions that wouldn't be part of this package or required to use DataSet or QCoDeS, but that might be helpful to users.

I'm going to plan to send out a "public" request for feedback on Friday afternoon (Seattle time), and try to close this spec and start implementation by the end of February.

Many items in this spec have metadata associated with them.
In all cases, we expect metadata to be represented as a dictionary with string keys.
While the values are arbitrary and up to the user, in many cases we expect metadata to be nested, string-keyed dictionaries
with scalars (strings or numbers) as the final values.
Contributor:

I would argue that the final values need to support anything that a parameter can return.

The most direct example is parameters that are arrays (like a vector of integration weights). We want to store these in the metadata (and are currently able to do so). I think it is important to note this here.
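
An example of the nested, string-keyed shape described in the spec text, including an array leaf as argued above (all values are illustrative):

    metadata = {
        "station": {
            "instruments": {
                "digitizer": {
                    "integration_weights": [0.0, 0.12, 0.37, 0.41],  # array-valued leaf
                },
            },
        },
        "sample": {"batch": "QDEV43", "name": "nw7"},
    }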

@AdriaanRol (Contributor):

@alan-geller, I did not notice any reference to the ideas of nesting and the use of GUIDs. I think they solve a very basic problem in experiments (as also pointed out by @damazter).

A simple workaround for me would be to store the GUID string as an entry in a dataset (which is currently allowed). It would then only amount to having the right helper functions to show the nested relation.

Is there any consensus or verdict on this topic?


@damazter (Contributor):

@giulioungaretti
Can we add multiple datasets to an instrument container?
Secondly: I think we agreed on nesting via GUID, right?

@alan-geller (Author):

@AdriaanRol @damazter @giulioungaretti The GUID idea is there -- it's the DataSet identifier (see the last requirements in Basics and Creation, and the DataSet.id attribute).
Since persistence is no longer part of the DataSet class itself, there's no "open_by_id" method, but I could add one to the Utilities or Storage sections if you want.

@nulinspiratie (Contributor):

Hey @alan-geller, I've just read over the dataset specs, and it feels like this would definitely be very useful in our measurements.
I was wondering whether any thought has been given to including some sort of logging in a dataset. I've noticed it can be quite useful to look through the log, especially when looking back at a dataset a few months later. For instance, in complicated measurement routines that include calibration etc., a log can provide information on whether a calibration ever failed, or whether an instrument got stuck at some point.

@giulioungaretti merged commit 48d663a into master on Mar 8, 2017.

giulioungaretti pushed a commit that referenced this pull request on Mar 8, 2017:
Author: Alan Geller <alan.geller@microsoft.com>

    Add DataSet spec (#476)

@alan-geller deleted the dataset-spec branch on May 9, 2017.