The pipeline design is influenced by the data tiers that we establish, inspired by the Medallion Architecture:
- All new data coming from the Ingestion Portal starts out as `raw`, which may have:
  - Columns with the wrong data types.
  - Column names that do not match what is in `gold`. Each raw file has an associated column mapping, filled in by the uploader in Giga Sync, which maps its columns to the correct names.
  - Missing columns which are in `gold`, but are nullable.
  - Extra columns which are not in `gold`.
- At the `bronze` tier:
  - Column mapping is applied.
  - Missing, nullable columns are added.
  - Extra columns are dropped.
  - Data quality checks are performed.
  - The output is split into 2 tables according to rows that `passed` or `failed` the data quality checks.
  - In the `passed` table, columns are cast into the correct data types.
  - The data quality report is generated and emailed to the uploader.
- At the `staging` tier, the `passed` table from the previous tier undergoes a human-in-the-loop review process, where users with the appropriate permissions can approve or reject individual rows before they proceed to the next tier. This process takes place in Giga Sync. The output is again split into 2 tables according to `approved` or `rejected` rows.
- `approved` rows are merged into the `silver` tier.
- `silver` is merged into `gold`, which is then split into `master` and `reference` tables.
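
The bronze-tier steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the platform's actual implementation: the `GOLD_SCHEMA` columns, the `apply_bronze` function, the `dq_errors` field, and the two checks (required values present, values castable) are all hypothetical, and the real data quality checks are likely far more extensive.

```python
from typing import Any

# Hypothetical gold schema: column name -> (caster, nullable).
GOLD_SCHEMA = {
    "school_id": (str, False),
    "latitude": (float, False),
    "longitude": (float, False),
    "education_level": (str, True),
}


def _is_blank(value: Any) -> bool:
    return value is None or value == ""


def check_row(row: dict) -> list:
    """Illustrative data quality checks; returns a list of error messages."""
    errors = []
    for col, (caster, nullable) in GOLD_SCHEMA.items():
        value = row.get(col)
        if _is_blank(value):
            if not nullable:
                errors.append(f"{col}: required value is missing")
            continue
        try:
            caster(value)
        except (TypeError, ValueError):
            errors.append(f"{col}: cannot cast to {caster.__name__}")
    return errors


def apply_bronze(raw_rows: list, column_mapping: dict) -> tuple:
    """Return (passed, failed) row lists for one raw upload."""
    passed, failed = [], []
    for raw in raw_rows:
        # Apply the uploader's column mapping (raw name -> gold name).
        renamed = {column_mapping.get(k, k): v for k, v in raw.items()}
        # Keep only gold columns (extra columns are dropped) and
        # add missing nullable columns as None.
        row = {}
        for col, (_, nullable) in GOLD_SCHEMA.items():
            if col in renamed:
                row[col] = renamed[col]
            elif nullable:
                row[col] = None
        errors = check_row(row)
        if errors:
            # Failed rows keep their original, uncast values.
            failed.append({**row, "dq_errors": errors})
        else:
            # Only passed rows are cast into the gold data types.
            passed.append({
                col: None if _is_blank(row[col]) else GOLD_SCHEMA[col][0](row[col])
                for col in row
            })
    return passed, failed
```

Casting happens only after a row has passed its checks, mirroring the order of the steps in the list: the `failed` table keeps the values exactly as uploaded, which makes the data quality report easier to relate back to the original file.
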
In the context of the current implementation of the platform, we are ingesting school geolocation data and school coverage data separately, so their tiers are separate up until `silver`. The silver geolocation and silver coverage tables are then joined to form the `gold` School Master table.
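
As a rough sketch of that final join, the snippet below combines two silver tables represented as lists of dictionaries. The join key `school_id`, the column names, and the choice of a left join that keeps every geolocation row are assumptions made for illustration; the text above does not specify the key or the join type.

```python
def build_school_master(silver_geolocation: list, silver_coverage: list,
                        key: str = "school_id") -> list:
    """Left-join coverage onto geolocation rows (assumed join semantics)."""
    coverage_by_key = {row[key]: row for row in silver_coverage}
    # Collect every coverage column so unmatched schools still get the columns.
    coverage_columns = sorted(
        {col for row in silver_coverage for col in row if col != key}
    )
    master = []
    for geo in silver_geolocation:
        coverage = coverage_by_key.get(geo[key], {})
        merged = dict(geo)
        for col in coverage_columns:
            # Schools without coverage data get None in coverage columns.
            merged[col] = coverage.get(col)
        master.append(merged)
    return master
```

With this choice, a school that has geolocation data but no coverage data still appears in the result, with its coverage columns left empty.
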