
collect.py usage hints for converting time series observation data

Ellyn Montgomery edited this page Dec 5, 2016 · 18 revisions

collect.py is a program that reads EPIC-format netCDF(3) files (grouped by the experiment in which they were collected) and converts them to CF-1.6 compliant (discrete sampling geometries) netCDF(4) files. The output CF-1.6 files are incorporated into the portal and harvested into the IOOS database.

Ignored Files

By default, the following experiment/sensor/file types are ignored and not converted:

  • hourly averaged data (files containing a1h, A1H, A1h, or a1H) - e.g. 8543sc-a1h.nc
  • low-pass filtered data (files containing alp) - e.g. 3971-alp.nc
  • burst variance data (files matching *var-*) - e.g. 8545advbvar-cal.nc
  • b-cal data (files containing b-cal) - e.g. 8545advb-cal.nc
  • files that don't match any of these patterns:
    • *-a* (basic sample interval)
    • *-A* (same)
    • *s-cal* (burst averages)
    • *d-cal* (aqd-cal and a few sgtid-cal files)
    • *tide-cal* (seagauge tide files; burst averages)
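The skip/keep rules above can be sketched in Python with glob-style patterns. This is an illustration of the rules as stated, not collect.py's actual implementation:

```python
# Illustrative sketch of the skip/keep rules above -- not collect.py's
# actual implementation.
from fnmatch import fnmatchcase

# Ignored: hourly averages, low-pass filtered, burst variances, b-cal files.
SKIP = ["*a1h*", "*A1H*", "*A1h*", "*a1H*", "*alp*", "*var-*", "*b-cal*"]
# Converted: only files matching one of these patterns.
KEEP = ["*-a*", "*-A*", "*s-cal*", "*d-cal*", "*tide-cal*"]

def is_converted(name):
    """Return True if a file with this name would be converted."""
    if any(fnmatchcase(name, p) for p in SKIP):
        return False
    return any(fnmatchcase(name, p) for p in KEEP)

print(is_converted("8543sc-a1h.nc"))      # False: hourly averaged
print(is_converted("8545advbvar-cal.nc")) # False: burst variance
print(is_converted("7172sbe-a.nc"))       # True: basic sample interval
```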

collect.py requires a file that provides experiment-level metadata. The default name of this file is project_metadata.csv, and it must be in the same directory as collect.py. The columns must contain:

    1. Experiment Name [project_name] - string; must match the directory name where the data files are (case sensitive)
    2. Scientist Name [contributor_name] - name of the PI conducting the research
    3. Title [project_title] - longer version of the Experiment Name
    4. Abstract [project_summary] - experiment summary: what was collected and why
    5. Default server location [catalog_xml] - where to download the data from

A different metadata file name may be specified using the -c option.

To see the help for collect.py, enter:

 python collect.py -h

If you don't specify any projects with the -p option, collect.py will try to process all the experiments it finds in the .csv file. The default command to do this is below (it's always a good idea to git pull first):

python collect.py --download --output=../../CF-1.6new/ 

When this is complete, ALL the files will have been placed in a directory called download under the current working directory. I usually cd to ...emontgomery/stellwagen/usgs-cmg-portal/woods_hole_obs_data before working, so the download directory is there. In theory it will run all the experiments after downloading and put the results into sub-directories under whatever was specified as the output. Should it fail and you need to re-run one or more experiments, use a command like this to re-do just DIAMONDSHOALS:

 python collect.py -p DIAMONDSHOALS --folder download -o ../../CF-1.6new
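If several experiments failed, a small shell loop can generate the re-run commands. This dry run only echoes each command so you can inspect it first (the experiment names here are examples):

```shell
# Dry run: print (do not execute) one re-run command per failed experiment.
# Remove the leading "echo" to actually execute them.
for p in DIAMONDSHOALS MBAY_LT; do
    echo python collect.py -p "$p" --folder download -o ../../CF-1.6new
done
```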

In some instances (say, when a new experiment is being added), you may want to download from just one directory and convert only that. The command below downloads and converts the HURRIRENE_BB files. The last column of project_metadata.csv holds the URL from which to get the data for the experiment directory listed in column 1.

python collect.py --projects HURRIRENE_BB --download --output=../../CF-1.6/

The default location for downloaded files is the download directory in the current working directory. If we use this, we accumulate every file we've ever collected in that one directory. To retain our directory structure, a command like the one below puts the EPIC data in ../../../tmp/HURRIRENE_BB (a location not in the datasetScan path) and the CF output in ../../CF-1.6 (where it will make a subdirectory named for the project_name). Do not include the experiment directory name in the -o path, or you'll get an output directory structure like CF-1.6/RCNWR/RCNWR, which is not what we want!

python collect.py -p HURRIRENE_BB --folder ../../../tmp/HURRIRENE_BB --download -o ../../CF-1.6

The output directory CF-1.6 is in the datasetScan path. If you write to a different output location, you need to copy or move the files under CF-1.6. It's best to save the originals when a major revision is done.

You MUST do a --download on each dataset initially, because it adds an id global attribute (containing the string given to --projects and the filename_root) to each file. If that attribute isn't there, subsequent runs of collect.py using the --folder option will fail. Therefore, if you have a local set of files that you want to convert, collect.py won't work until you either a) put the data on a TDS and use the --download option, or b) add an id attribute to each file.
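For option b), an id can be built from the --projects string and the filename_root and written into each file. This is a sketch: the exact id format is an assumption based on the description above, and the netCDF4 write step is shown commented out since it needs the netCDF4 library and a real file:

```python
# Hypothetical helper: build the "id" attribute collect.py expects.
# The "<project>/<filename_root>" format is an assumption, not confirmed
# against collect.py's source.
import os

def make_id(project, filename):
    """Return '<project>/<filename_root>' with the extension stripped."""
    root, _ext = os.path.splitext(os.path.basename(filename))
    return "{}/{}".format(project, root)

print(make_id("HURRIRENE_BB", "9205advs-cal.nc"))  # HURRIRENE_BB/9205advs-cal

# With the netCDF4 library installed, the attribute could be set like this:
# from netCDF4 import Dataset
# with Dataset("9205advs-cal.nc", "a") as nc:
#     nc.id = make_id("HURRIRENE_BB", "9205advs-cal.nc")
```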

In the case above, since we've already downloaded HURRIRENE_BB, if we want to update the CF-1.6 output by running a more recent version of collect.py, we can skip --download and use the local folder we created in the previous step.

python collect.py -p HURRIRENE_BB --folder ../../../tmp/HURRIRENE_BB -o ../../CF-1.6

In situations where a single file needs to be repaired, as was the case with MBAY_LT/7172sbe-a.nc, you can use -l to specify one file. In this case, the original file in Data had bad times in the first 14 samples, and replacing it with the version from stellwagen was all that was needed. The version in download then needed updating too; this command does both:

python collect.py -p MBAY_LT -l 7172sbe-a.nc --download -o ../../CF-1.6