In this challenge you are asked to group galaxies into tomographic bins using only the quantities we generate with the metacalibration method. These are the only quantities for which we can compute the shear bias correction associated with the division.
We provide training and validation sets of data, and once everyone has added their methods, we will run them all on the secret testing data.
This test is highly idealised: we have a huge complete training sample, simple noise models, no outliers, and no variation in depth or any other observing conditions.
Galaxy magnitudes are generated by adding noise to the CosmoDC2 data and applying a preliminary selection cut (SNR>10, metacal flag=0, metacal size > PSF size / 2).
All entrants will be on the author list for the upcoming TXPipe-CosmoDC2 paper, which these results will go into. The winner will additionally receive glory.
The deadline to enter the challenge is Friday September 4th 2020 (11:59 pm AoE), i.e. pull requests describing your entry need to be open by that time to be considered. Participants who have entered by that date will then have until Monday September 14th (11:59 pm AoE) to fine-tune their entry, at which point it will be considered final.
To get started right away, try this live colab notebook:
In general you can install the python requirements with `pip install -r requirements.txt`
On NERSC it's easiest to use shifter (I've had problems with CCL there):
shifter --image=joezuntz/txpipe-tomo bash
This will put you in a shell with all requirements.
Run the following from the challenge directory:
python -m tomo_challenge.data
This will download the full set of challenge data, about 6 GB. You can also get the individual files from here if you prefer: https://portal.nersc.gov/project/lsst/txpipe/tomo_challenge_data/
You will get two datasets, based on two different simulations, which will allow us to test different assumptions about galaxy SEDs.
The first dataset, found under `data` after download, is based on the CosmoDC2 simulation. The second, found under `data_buzzard`, is based on the Buzzard simulations.
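As a quick-start, here is a minimal sketch of reading a few columns from one of the downloaded files. It assumes the files are HDF5 with one dataset per quantity; the file and dataset names below are illustrative only, so check `tomo_challenge/data.py` (or the download page above) for the real ones.

```python
# Minimal sketch of reading challenge data. ASSUMPTIONS: HDF5 layout with one
# dataset per quantity; the file/column names here are illustrative only.
import h5py

def load_columns(path, columns):
    """Read the named datasets from an HDF5 file into a dict of numpy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][:] for name in columns}

# Hypothetical names -- check tomo_challenge/data.py for the real ones.
data = load_columns("data/training.hdf5", ["r_mag", "i_mag", "z_mag", "redshift_true"])
r_minus_i = data["r_mag"] - data["i_mag"]  # colors often carry the redshift signal
```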
The first metric is the S/N on the spectra generated with the method:
score = sqrt(mu^T . C^{-1} . mu) - baseline
where mu is the theory spectrum and C the Gaussian covariance.
The second is a Fisher-based Figure of Merit, currently for sigma8-omega_c, though we will later add w0-wa.
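As a rough illustration of both scores (not the challenge's actual implementation, which lives in the `tomo_challenge` package), here is a small numpy sketch, assuming you already have the stacked theory spectrum vector, its covariance, and the parameter derivatives:

```python
import numpy as np

def snr_score(mu, C, baseline=0.0):
    # S/N = sqrt(mu^T C^{-1} mu); use solve() rather than an explicit inverse
    return np.sqrt(mu @ np.linalg.solve(C, mu)) - baseline

def fisher_fom(dmu_dp, C):
    # dmu_dp has shape (2, n_ell): derivatives of mu w.r.t. the two parameters.
    # F_ij = (dmu/dp_i)^T C^{-1} (dmu/dp_j); the inverse area of the
    # confidence ellipse scales as sqrt(det F).
    F = dmu_dp @ np.linalg.solve(C, dmu_dp.T)
    return np.sqrt(np.linalg.det(F))
```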
You can enter the contest by pull request: add a python file with your method in it. In `tomo_challenge/classifiers/random_forest.py` you can find an example that uses a scikit-learn classifier with a simple galaxy split to assign objects. Run it by doing, e.g.:
$ python bin/challenge.py example/example.yaml
This will compute the metrics and write an output file for some test methods.
You are welcome to adapt any part of that code in your methods.
The example random forest implementation gets the following scores using the spectrum S/N metric, for the griz and riz band combinations:

# nbin   griz   riz
     1    0.0   0.1
     2   28.2  21.6
     3   35.4  26.4
     4   37.9  28.5
     5   39.7  29.7
     6   40.3  30.1
     7   40.5  30.1
     8   41.1  29.4
     9   41.5  29.1
    10   41.8  30.1
- How do I enter?
Create a pull request that adds a python file to the `tomo_challenge/classifiers` directory containing a class with `train` and `apply` methods, following the random forest example as a template.
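For orientation, here is a bare-bones skeleton of what such a class might look like. The constructor signature and option handling are assumptions based on the description in this README; copy the exact interface from `tomo_challenge/classifiers/random_forest.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class MyMethod:
    valid_options = ["bins"]  # options settable from the yaml file

    def __init__(self, bands, options):  # assumed signature -- check the example
        self.bands = bands
        self.opt = options

    def train(self, training_data, training_z):
        # Label training galaxies by equal-number redshift bins, then fit a
        # classifier mapping features (e.g. magnitudes/colors) to bin labels.
        n = self.opt["bins"]
        edges = np.percentile(training_z, np.linspace(0, 100, n + 1))
        labels = np.digitize(training_z, edges[1:-1])  # labels 0 .. n-1
        self.model = RandomForestClassifier(n_estimators=100)
        self.model.fit(training_data, labels)

    def apply(self, data):
        # Return one bin index per galaxy.
        return self.model.predict(data)
```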
- Can I use a different programming language?
Only if it can be easily called from python.
- What general software requirements are needed?
- Relatively recent compilers
- MPI
- Lapack
- Python 3.6+
- cmake
- swig
- If you find others, please let us know
On Ubuntu 20.04 you can install these packages with apt:
gfortran
cmake
swig
libopenmpi-dev
liblapack3
liblapack64-dev
libopenblas-dev
- What are the metrics?
- The total S/N (including covariance) of all the weak lensing power spectra made using your bins
- The inverse area of the w0-wa Fisher matrix (due to a technical problem, the current metric uses the sigma8-omega_c Fisher matrix instead)
Each can be run on ww (lensing-lensing), gg (LSS-LSS), and 3x2 (both plus their cross-correlation), so the full list is: SNR_ww, SNR_gg, SNR_3x2, FOM_ww, FOM_gg, FOM_3x2.
- How can I change hyper-parameters or otherwise configure my method?
Add a class variable `valid_options` to your method; the variables listed in it are then accessible in the dictionary `self.opt`. See the random forest file for an example. You can then set those variables for a given run in your yaml file, as `bins` is set in the example yaml file.
- Why is this needed?
The metacal method can correct galaxy shear biases associated with putting galaxies into bins (which arise because noise on magnitudes correlates with noise on shape), but only if the selection is done with quantities measured jointly with the shear.
This only affects shear catalogs - for lens bins we can do what we like.
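To make the idea concrete, here is a schematic of how a metacal selection response can be estimated: apply the same bin selection to catalogs whose measured properties come from positively and negatively sheared images, and see how the mean ellipticity of the selected sample shifts. The `_1p`/`_1m` convention and `delta_gamma = 0.01` are standard metacal usage and an assumption here, not necessarily how the challenge data are laid out.

```python
import numpy as np

def selection_response(g1, sel_1p, sel_1m, delta_gamma=0.01):
    """Schematic selection response for one bin.

    g1:      unsheared ellipticity measurements
    sel_1p:  boolean mask from applying the bin selection to quantities
             measured on +sheared images
    sel_1m:  the same mask, built from -sheared measurements
    """
    return (g1[sel_1p].mean() - g1[sel_1m].mean()) / (2 * delta_gamma)
```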
- What is the input data?
- CosmoDC2 galaxies with mock noise added.
- Buzzard galaxies with mock noise added.
- What is the mcal_T column?
This is a measurement of the squared radius of the galaxy. It is the trace of the moments matrix, T = Q_xx + Q_yy, where Q_xx = int I(x, y) (x - x0)^2 dx dy and similarly for Q_yy; here I(x, y) is the flux in the pixel at location (x, y) and (x0, y0) is the centroid.
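As a toy illustration, here are the same moments computed on a pixelated postage stamp (following the unnormalised integrals as written; real measurement codes additionally apply a weight function and normalise by the flux):

```python
import numpy as np

def trace_moment(image):
    """Compute T = Q_xx + Q_yy for a 2D image array, per the formulas above."""
    ny, nx = image.shape
    y, x = np.mgrid[0:ny, 0:nx]
    flux = image.sum()
    x0 = (image * x).sum() / flux  # flux-weighted centroid
    y0 = (image * y).sum() / flux
    Qxx = (image * (x - x0) ** 2).sum()
    Qyy = (image * (y - y0) ** 2).sum()
    return Qxx + Qyy
```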
- How many bins should I use, and what are the target distributions?
As many as you like - it's likely that more bins will add to your score as long as they're well-separated in redshift, so you probably want to push the number upwards. You can experiment with which edges give you the best metrics; historically most approaches have tried to divide galaxies into bins with roughly equal numbers, so that may be a good place to start - see the sketch below.
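Here is a sketch of that equal-number starting point, using percentiles of the training redshifts to place the edges. Note that binning by true redshift like this only bounds what a classifier could achieve with those edges; your method still has to predict the bins from photometry. The scoring call in the comments is an assumption - check the challenge's metrics module for the real helper.

```python
import numpy as np

def equal_number_edges(training_z, n_bins):
    # Percentile edges give bins with roughly equal numbers of galaxies.
    return np.percentile(training_z, np.linspace(0, 100, n_bins + 1))

# Scan over bin counts (scoring helper name assumed -- see tomo_challenge):
# for n in range(2, 11):
#     edges = equal_number_edges(training_z, n)
#     tomo_bin = np.digitize(training_z, edges[1:-1])  # 0 .. n-1
#     print(n, compute_scores(tomo_bin, training_z))
```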
- What do I get out of this?
You can be an author on the paper we write if you submit a working method.
- Can non-DESC people enter?
Yes - we have now been told by the publication board that this is fine, and the results paper can be non-DESC.
- How realistic is this?
This is the easiest possible challenge - the training set is large and drawn from the same population as the test data, and the data selection is relatively simple.
If you think it's too unrealistic then you should do really, really well.
- Do I have to use machine learning methods?
No - we call the methods `train` and `apply`, but that's just terminology; you can train however you like.
- Do I have to assign every galaxy to a bin?
No, you can leave out galaxies if you want. If you leave out too many, though, the decrease in number density will start to hit your score.
- Can I use a simpler metric?
Yes, you can train however you like, including with your own metrics. The final score will be on a suite of metrics including the two here. We reserve the right to add more metrics to better understand things.
- When does the challenge close?
Pull requests must be open by Friday September 4th 2020 (11:59 pm AoE); see the deadline details above. (We have extended this from the original July 2020 deadline since we felt it was too soon.)
- What does the winner get?
Recognition.