\documentclass[DM,toc]{lsstdoc}
% Package imports go here
% Local commands go here
\title[Deblender Data Products]{Proposal for Deblender Outputs as Level 2 Data Products}
\author{
James~Bosch
}
\setDocRef{LDM-513}
\date{2017-06-02}
\setDocAbstract{%
Most catalog measurements produced in an LSST Data Release Production will be derived from pixel values that have been \emph{deblended} -- we have apportioned the flux in each pixel to each of the \Objects that overlap that pixel. These deblended pixel values are thus a critical link in our provenance chain, and should be provided to users as a first-class data product. This document provides further justification for the proposal to add this data product and provides a preliminary estimate of the impact on data storage requirements of an additional $83\,\mathrm{TB}$ in DR1 and $221\,\mathrm{TB}$ in DR11.
}
% Change history defined here. Will be inserted into
% correct place with \maketitle
% OLDEST FIRST: VERSION, DATE, DESCRIPTION, OWNER NAME
\setDocChangeRecord{%
\addtohist{}{2017-04-01}{Initial release.}{James Bosch}
\addtohist{}{2017-04-28}{Science Rationale}{David Ciardi}
\addtohist{1.0}{2017-06-02}{Approved in \jira{RFC-329}}{W.~O'Mullane}
}
\begin{document}
% Create the title page
% Table of contents will be added automatically if "toc" class option
% is used.
\maketitle
\section{Introduction}
The LSST Survey is built upon four major science pillars: the nature of dark energy and matter, Milky Way structure and formation, exploring the changing sky, and cataloging the Solar System. Each of these pillars relies on the precise and accurate measurement of the size, shape, and brightness of stars and galaxies. However, a natural consequence of any deep telescope survey is that \Objects, particularly stars and galaxies, will be blended in the images -- both the deep coadds and the individual processed visit images (PVIs). In order to achieve the science goals of the project, the data processing pipeline must make an effort to ``deblend'' all \Objects on the sky.
Essentially all measurements in LSST's \Object table (\DPDD) are directly or indirectly derived from running the ``deblender'' algorithm on coadd images. The deblender apportions the flux in each pixel between the \Objects that overlap that pixel (with \Object extent defined by a surface-brightness threshold). Subsequent measurements on the coadds themselves use a combination of the original pixels and the deblended pixels as their inputs. The deblended pixels are then used for Forced \Source measurements as well as in the generation of the Bulge-Disk Model and Moving Point \Source Models. For these models, fits are made to individual per-epoch images and rely upon the per-epoch deblended pixels that are computed from the deblended coadd pixels.
As a primary product of the LSST Survey, the ``deblended'' \Objects are a necessary component in achieving its science goals. The ``deblending'' enables the separation of blended \Objects on the sky so that the shape, flux, and surface brightness distribution of the \Objects can be ascertained. Further, the ``deblended'' shape is used to perform forced photometry on the CCD-level Processed Visit Images. The Forced \Source photometry is used to assess the brightness variability of the \Objects throughout the LSST Survey. The deblended pixel values are thus a critical link in understanding the scientific integrity of the \Object and \ForcedSource photometry as well as in understanding the provenance chain between the individual PVIs, the coadd images, and the measurements in the \Object and \ForcedSource tables.
Deblending is inherently a scientific task with associated uncertainties; as a result, an assessment of the shape, flux, and surface brightness distribution of an \Object by a science user requires that the user be able to examine the output of the deblending algorithm. Without access to that output, a user has no insight into the quality, accuracy, and precision of the algorithm's performance for the \Objects in which the user is interested. Science users (especially those interested in rare \Objects) will need to access deblended pixel values to understand any measurement outliers that may be due to deblending problems. Access to coadded images or the PVIs is not sufficient for these purposes, as the deblender assigns fractional values to individual \Objects on a per-pixel level, and that information is only available in the deblender output products. Deblended pixel values are also an important input to Level 3 algorithms that perform new measurements on individual \Objects.
In principle, a science user could rerun the deblender and recreate the output products for \Objects of interest, but the deblender is a computationally expensive algorithm that will run simultaneously on multiple coadds. It will use a combination of \emph{deep} and \emph{best-seeing} coadds, and it may also utilize \emph{short-period} or \emph{PSF-matched} coadds. All of this is difficult for a general user to replicate, and for assessment of thousands or millions of \Objects, the regeneration of the deblender output products is potentially extremely resource intensive and impractical.
We are, therefore, proposing the storage and serving of the deblender output products known as ``Heavy Footprints.'' Access to the ``Heavy Footprints'' by community science users will greatly improve the ability of the community to assess the scientific merit and quality of any analysis they may perform and will lead to a better understanding of the data and its limitations by the community.
\section{Proposal and Implementation Options}
We propose adding deblended \emph{deep} coadd pixel values (in each band) for each \Object as a first-class Level 2 data product. Our recommendation for the data structure for this information is one \texttt{HeavyFootprint} for each \Object. A \texttt{HeavyFootprint} combines a run-length encoding of the region covered by an \Object (which we call a \texttt{Footprint}) with a 1-d array of 32-bit floating point values containing just the pixels in that region. \texttt{Footprint}s themselves are small: three 32-bit integers for each pixel \emph{row} in the region. \Objects that were not deblended (because they had no neighbors) need only have a \texttt{Footprint}; their \texttt{HeavyFootprint} can be generated efficiently on the fly from the \texttt{Footprint} and the coadd images.
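As a rough illustration of the storage model implied by this structure, the following Python sketch estimates the size of a single \texttt{HeavyFootprint} from the description above. The function names are hypothetical, and the accounting is simplified (it ignores container and serialization overhead):
\begin{verbatim}
# Simplified size model, assuming only what is stated above: a
# Footprint costs three 32-bit integers per pixel row, and a
# HeavyFootprint adds one 32-bit float per pixel in the region.

def footprint_bytes(n_rows):
    """Run-length-encoded region: three 32-bit ints per pixel row."""
    return n_rows * 3 * 4

def heavy_footprint_bytes(n_rows, n_pixels):
    """Footprint plus one 32-bit float per pixel."""
    return footprint_bytes(n_rows) + n_pixels * 4

# Example: an Object covering 40 rows and 1200 pixels in one band.
print(heavy_footprint_bytes(40, 1200))  # 5280 bytes; pixels dominate
\end{verbatim}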
Although they are tied to \Objects, the access patterns for \texttt{HeavyFootprint}s should be assumed to follow the access patterns for coadd images, but will be even less frequent: users who request \texttt{HeavyFootprint}s will almost always want coadd images as well (though the converse is not true). We anticipate that in most cases, these requests will be for single \Objects (and cut-outs of coadd images near them), not all \Objects in a field.
We do not believe regenerating \texttt{HeavyFootprint}s by re-running the deblender on demand is a viable way to implement this proposal; even if running it does not require regenerating some coadds, the deblender will still likely be one of our most computationally expensive algorithms, and we currently have no requirement to make it executable on a per-\Object basis. So while regenerating \texttt{HeavyFootprint}s may technically represent a tradeoff between computation and storage, the storage cost will likely be substantially lower (as well as lower-latency). The deblender algorithm \emph{may} permit a highly compressed output (e.g.\ an analytic model) that allows the true \texttt{HeavyFootprint}s to be regenerated from the coadd data (which would be an acceptable implementation), but this cannot be guaranteed today.
We do not propose storing \texttt{HeavyFootprint}s for \emph{best-seeing} or other coadd flavors besides \emph{deep}. It is likely that these could be derived efficiently from those coadd images and the \emph{deep} coadd \texttt{HeavyFootprint}s, but since most measurements will be done on the \emph{deep} coadds, we anticipate the need for deblender outputs for other flavors to be much lower.
We also do not propose storing \texttt{HeavyFootprint}s for \Source. We expect \Source measurements to be much more lightly used than \Object or \ForcedSource measurements, and \Source \texttt{HeavyFootprint}s can be regenerated more efficiently because they are produced by a simpler, faster form of the deblender.
\section{Impact on Storage}
Because the pixels included in the set of \texttt{HeavyFootprint}s are defined by a surface-brightness threshold, the storage costs for \texttt{HeavyFootprint}s are a function of depth. We have attempted to estimate this scaling by using LSST DM Stack processing of Hyper Suprime-Cam Subaru Strategic Program (HSC SSP) survey data \citep{2017arXiv170208449A}. The HSC SSP data includes both a Wide layer ($1200\,s$ in $i$) and an UltraDeep layer ($9000\,s$ in $i$ to date).
\begin{figure}
\includegraphics[width=\textwidth]{regression.png}
\caption{Per-patch \texttt{HeavyFootprint} storage costs in LSST processing of HSC SSP data. The red vertical lines indicate the approximate depth of LSST data releases. The green line is the linear regression discussed in the text.}
\label{fig:regression}
\end{figure}
As shown in Figure~\ref{fig:regression}, we fit a linear regression to the total storage cost of the \texttt{Footprint}s and \texttt{HeavyFootprint}s in 117 patches (each $0.03484\,\mathrm{deg}^2$) of Wide data and 25 patches of UltraDeep data. While the scaling is likely nonlinear, we do not currently have processed data available from the intermediate SSP Deep fields that could constrain a higher-order fit. In the current DM stack, the size of the \texttt{HeavyFootprint}s in all bands is mostly determined by the depth of the deepest band. This should remain at least approximately true in future versions of the pipeline.
The regression yields the following formula:
$$
\frac{\mathrm{storage}}{(\mathrm{MB})(N_\mathrm{patches})(N_\mathrm{bands})} =
N_{\mathrm{visits},i} \times 0.149 + 18.26
$$
The conversion from HSC exposure time to LSST visits assumes $30\,s$ for each LSST visit and simply accounts for the difference in effective primary mirror size.
The patch area of $0.03484\,\mathrm{deg}^2$ represents only the inner, non-overlapping area of the patch, but the footprint storage estimate comes from the full area of $0.03841\,\mathrm{deg}^2$. The inner area should be used when estimating the number of patches from the area of a survey, which will automatically account (in part; tracts also overlap) for the current overlap fraction. This yields
$$
\frac{\mathrm{storage}}{(\mathrm{MB})(\mathrm{deg}^2)(N_\mathrm{bands})} =
N_{\mathrm{visits},i} \times 4.277 + 524.11
$$
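For reference, the per-square-degree coefficients follow directly from dividing the per-patch coefficients by the inner patch area:
$$
\frac{0.149}{0.03484} = 4.277, \qquad \frac{18.26}{0.03484} = 524.11
$$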
The effect of the difference in seeing between HSC and LSST is not clear; LSST's larger PSF will both increase the area covered by bright \Objects and prevent \Objects at the edge of HSC's detection limit from being detected at all. HSC has smaller ($0.168\,\mathrm{arcsec}$) pixels as well, however, which should make this an overestimate (and hence a conservative estimate) for LSST storage costs.
Assuming a $24,000\,\mathrm{deg}^2$ survey and $236$ ($12$) $i$-band visits in DR11 (DR1), the total projected storage cost for all six bands is $220.8\,\mathrm{TB}$ ($82.9\,\mathrm{TB}$).
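For concreteness, these totals can be reproduced from the per-square-degree regression above with the short Python sketch below (assuming decimal units, $1\,\mathrm{TB} = 10^6\,\mathrm{MB}$):
\begin{verbatim}
# Projected HeavyFootprint storage from the per-deg^2 regression,
# assuming a 24,000 deg^2 survey, six bands, and 1 TB = 1e6 MB.
AREA_DEG2 = 24000
N_BANDS = 6

def total_tb(n_visits_i):
    mb_per_deg2_per_band = n_visits_i * 4.277 + 524.11
    return mb_per_deg2_per_band * N_BANDS * AREA_DEG2 / 1e6

print(round(total_tb(12), 1))   # DR1  (12 i-band visits):  82.9 TB
print(round(total_tb(236), 1))  # DR11 (236 i-band visits): 220.8 TB
\end{verbatim}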
\begin{figure}
\includegraphics[width=\textwidth]{histogram.png}
\caption{Histogram of per-\Object storage costs for \texttt{HeavyFootprints} in LSST processing of HSC SSP data.}
\label{fig:histogram}
\end{figure}
It is also worth noting that the dynamic range of per-\Object \texttt{HeavyFootprint} storage sizes spans several orders of magnitude, as shown in Figure~\ref{fig:histogram}. Storage cannot be assumed to be even approximately constant across \Objects.
\bibliography{lsst,refs_ads}
\end{document}