
Proposal for packet storage data format (take a look!) #84

Open
samkohn opened this issue Dec 15, 2017 · 1 comment
samkohn commented Dec 15, 2017

I wrote up a proposal for data storage that's intended to be flexible, accessible, stable, and efficient. This "issue" post is not for actually implementing the proposal, but rather for soliciting comments and suggestions for it.

I think we should continue using Dan's storage format for now. (I think he might have called it v1.0 so I might have to change my proposal to v2.0 or some such thing.) But after the current rush is over and we can breathe again, I'd like to transition to a more standard format with the associated bells and whistles.

So take a look here and let me know what you think in the comments of this post.

dadwyer commented Dec 18, 2017

Hey Sam,

Quick initial thoughts:

  1. Use a format-agnostic class structure for serialization
    e.g.: use dataformatter classes to provide methods that serialize the data to files of different types, and enhance datalogger/dataloader to dynamically determine the appropriate serializer or deserializer.
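One possible shape for this dispatch idea, as a minimal Python sketch. The class names, the fixed 8-byte packet size, and the extension-to-formatter table are all illustrative assumptions, not anything that exists in the codebase:

```python
import os
from abc import ABC, abstractmethod

class DataFormatter(ABC):
    """Format-agnostic base class; concrete subclasses handle one file type."""
    @abstractmethod
    def serialize(self, packets, filename): ...
    @abstractmethod
    def deserialize(self, filename): ...

class RawBinaryFormatter(DataFormatter):
    """Bare binary format: packets written back-to-back as raw bytes."""
    PACKET_SIZE = 8  # illustrative fixed packet width

    def serialize(self, packets, filename):
        with open(filename, 'wb') as f:
            for p in packets:
                f.write(p)

    def deserialize(self, filename):
        with open(filename, 'rb') as f:
            data = f.read()
        n = self.PACKET_SIZE
        return [data[i:i + n] for i in range(0, len(data), n)]

# Registry mapping file extension -> formatter class (hypothetical convention)
_FORMATTERS = {'.dat': RawBinaryFormatter}

def get_formatter(filename):
    """Dynamically pick the appropriate serializer from the file extension."""
    ext = os.path.splitext(filename)[1]
    return _FORMATTERS[ext]()
```

An HDF5 formatter would then be another `DataFormatter` subclass registered under `.h5`, and datalogger/dataloader would only ever call `get_formatter(...)`.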

  2. HDF5:
    Using a standard format like this is a good idea. I did not use it initially, since I lack experience with HDF5. When I looked at it long ago, I ran into two problems. First, HDF5 support was definitely 'batteries-not-included': it required building a Boost-dependent external C package and installing a Python interface, which inhibited general access to the serialized data files. Hopefully this has changed. Second, there were some issues in dealing with large ntuples. In particular, one should examine memory management and the capability to selectively deserialize or iterate over ntuple entries. This may require some careful programming of an interface class that sits on top of the standard HDF5 API. My quick-and-dirty bare binary format avoided these complications (platform-portable, batteries included, memory-efficient). Make sure the better is not the enemy of the good enough.
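For the memory-management concern specifically: modern h5py supports chunked datasets and partial reads, so block-wise iteration over a large ntuple is possible without loading the whole file. A small sketch, assuming a hypothetical packet record type (the field names are made up for illustration):

```python
import numpy as np
import h5py

# Hypothetical packet record layout; field names are illustrative only.
packet_dtype = np.dtype([('chipid', 'u1'), ('channel', 'u1'), ('adc', 'u2')])

def write_packets(filename, packets):
    with h5py.File(filename, 'w') as f:
        # Chunked, resizable, compressed dataset: h5py reads it piecewise.
        f.create_dataset('packets', data=packets, maxshape=(None,),
                         chunks=True, compression='gzip')

def iter_packets(filename, blocksize=100000):
    """Yield blocks of packets; only one block is in memory at a time."""
    with h5py.File(filename, 'r') as f:
        dset = f['packets']
        for start in range(0, len(dset), blocksize):
            yield dset[start:start + blocksize]
```

Slicing a chunked dataset only pulls the touched chunks off disk, which is essentially the "selective deserialization" behavior the bare binary format provided by hand.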

  3. Avoid python datetime
    Python's datetime objects are timezone-naive by default. Unix epoch time [time.time()] is timezone-independent and transportable, and can be flexibly reformatted for presentation. (One warning: time.time() on Linux and OS X returns a number with 1 microsecond precision, but is not guaranteed to be sub-second precise on all platforms.) Addendum: we should implement a specific 'high-precision timestamp' data object for LArPix data. This would combine the information from system time [second to microsecond precision], the ADC timestamp (~200 ns precision), and the photon system time (ns precision) into a coherent ns-precision global timestamp. In Daya Bay, we used a simple dual-uint object (uint epoch_seconds, uint nanoseconds) for this purpose. Peter is currently wrestling with this problem for the analysis of the LArPix data taken in the refrigerator, so he should be included in the discussion.
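The dual-uint idea above translates directly into a NumPy structured dtype; a minimal sketch (the conversion helpers are assumptions, and the real object would also fold in the ADC and photon-system timestamps):

```python
import numpy as np

# Dual-uint timestamp, following the Daya Bay convention described above.
timestamp_dtype = np.dtype([('epoch_seconds', 'u4'), ('nanoseconds', 'u4')])

def make_timestamp(unix_time):
    """Split a float Unix time (e.g. from time.time()) into (s, ns).

    Note: a float64 Unix time only carries ~100 ns resolution at current
    epoch values, so true ns precision must come from the hardware clocks.
    """
    seconds = int(unix_time)
    nanoseconds = int(round((unix_time - seconds) * 1e9))
    return np.array((seconds, nanoseconds), dtype=timestamp_dtype)

def to_float(ts):
    """Recombine into a float for presentation / coarse comparisons."""
    return ts['epoch_seconds'] + ts['nanoseconds'] * 1e-9
```

Storing the two fields as fixed-width unsigned ints also makes the timestamp trivially serializable in either the bare binary format or HDF5.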

  4. To parse or not to parse:
    Should the serialized data file contain the bytes after parsing into packets, or before? My general feeling is that we may need to avoid parsing until the serial system communication is fully debugged and established; otherwise we lose the ability to properly diagnose and characterize problems with serial I/O. If the system is stable enough that we only very rarely encounter corrupted or otherwise damaged serial data, then we should move to a packet-based representation for convenience in data analysis. (Issue: how should we handle partially read packets in the future? Should we discard this data?) Until that time, it may be best to maintain two serialized formats: one that relies on unparsed bytes for data storage, and another that is packet-ntuple-ized for analysis. The latter could be generated by a helper script running on the former, or dynamically by parsing the former (e.g. LogAnalyzer). Alternatively, there may be a way to have our cake and eat it too: construct a 'packet-ntuple-ized' format that has sufficient contextual information to reconstruct the original bytestream, including errors.
    In the long run, we will likely need to define a much richer analysis-level format which allows joint description of all relevant info (LArPix data, run conditions, photon system info, etc.), with each ntuple 'atom' representing a unique 'physical interaction' of interest in the TPC.
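The "cake and eat it too" option above amounts to storing, for each record, the raw bytes alongside the parsed fields plus a validity flag. A toy sketch (fixed 8-byte packets and a trivial length check stand in for the real packet format and parity/format checks):

```python
def parse_stream(stream, packet_size=8):
    """Split a bytestream into records; flag chunks that fail to parse.

    Keeping 'raw' on every record, valid or not, is what makes the
    format lossless with respect to the original serial stream.
    """
    records = []
    for i in range(0, len(stream), packet_size):
        raw = stream[i:i + packet_size]
        # Stand-in for a real parity / format-word check:
        valid = len(raw) == packet_size
        records.append({'raw': raw, 'valid': valid})
    return records

def reconstruct(records):
    """Rebuild the original bytestream exactly, damaged regions included."""
    return b''.join(r['raw'] for r in records)
```

Analysis code would read only the parsed fields of `valid` records, while serial-I/O debugging would run `reconstruct` to recover the byte-level stream.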

  5. One-to-one runs-to-configs-to-dataset mapping:
    I can see how this will likely be appropriate for standard data collection. During commissioning and testing, though, I'm not sure there is a one-to-one mapping. For example, during a threshold run or noise tests, we repeatedly change configs. How will this be represented?

  6. Multiple runs per data file?
    This is an interesting choice. I like this structure for representing data in memory; in this manner many runs can be seen as a single database. As to whether to actually put these data into the same file on disk or into separate files, I might be inclined to use separate files (to control the extent of damage in case a file is corrupted, and to ease portability of data subsets).
    How should we handle big runs? This should not be a problem on the bench, but multiple files per ~day-long run are a common feature in experiments. For this case, I've seen the addition of a 'file_number' or 'sub_run_number' to identify the serial order of files associated with a single run.
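A possible naming convention for the sub-run scheme, as a one-function sketch (the prefix, field widths, and extension are all placeholders, not an agreed format):

```python
def data_filename(run_number, sub_run_number, prefix='larpix'):
    """Zero-padded run/sub-run filename so files sort in serial order."""
    return '%s_run%05d_sub%03d.h5' % (prefix, run_number, sub_run_number)
```

Zero-padding keeps a plain lexicographic sort (e.g. `ls`) in the same order as the data-taking sequence.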
