BFD RIF Export
The BFD RIF exporter produces files that conform to the following specifications:
- RIF Layout and FHIR Mapping defines each file type and the fields contained within it
- CODEBOOK: Medicare Beneficiary Summary File (MBSF) Base with Medicare Part A, B, C, and D defines each of the data dictionaries and the included code values used in the beneficiary file
- CODEBOOK: Medicare Fee For Service (FFS) Claims defines each of the data dictionaries and the included code values used in the claim files
- CODEBOOK: Medicare Part D Event (PDE)/Drug Characteristics defines each of the data dictionaries and the included code values used in the Part D claim file
The exporter is configured via a set of properties, shown below with their default values:
- `exporter.bfd.bene_id_start = -1000000` defines the start value of `BENE_ID`. The first exported patient gets the specified value; subsequent IDs are monotonically decremented from that value.
- `exporter.bfd.clm_id_start = -100000000` defines the start value of `CLM_ID`. The first exported claim gets the specified value; subsequent IDs are monotonically decremented from that value.
- `exporter.bfd.clm_grp_id_start = -100000000` defines the start value of `CLM_GRP_ID`. The first exported group gets the specified value; subsequent IDs are monotonically decremented from that value.
- `exporter.bfd.pde_id_start = -100000000` defines the start value of `PDE_ID`. The first exported PDE claim gets the specified value; subsequent IDs are monotonically decremented from that value.
- `exporter.bfd.mbi_start = 1S00-E00-AA00` defines the start value of `MBI_NUM`. The first exported patient uses that value; subsequent IDs monotonically increase from that value.
- `exporter.bfd.hicn_start = T01000000A` defines the start value of `BENE_CRNT_HIC_NUM`. The first exported record uses that value; subsequent IDs monotonically increase from that value.
- `exporter.bfd.partc_contract_start = Y0001` defines the start value of the Part C contract IDs that are used in `PTC_CNTRCT_JAN_ID` to `PTC_CNTRCT_DEC_ID`. The first contract uses that ID; subsequent IDs monotonically increase from that value.
- `exporter.bfd.partc_contract_count = 10` defines the number of Part C contracts that Synthea will use in exports. Each year, each patient is randomly assigned to one of the contracts (or to no contract).
- `exporter.bfd.partd_contract_start = Z0001` defines the start value of the Part D contract IDs that are used in `PLAN_CNTRCT_REC_ID`. The first contract uses that ID; subsequent IDs monotonically increase from that value.
- `exporter.bfd.partd_contract_count = 10` defines the number of Part D contracts that Synthea will use in exports. Each year, each patient is randomly assigned to one of the contracts (or to no contract).
- `exporter.bfd.plan_benefit_package_start = 800` defines the starting value of plan benefit package identifiers.
- `exporter.bfd.plan_benefit_package_count = 5` defines the number of plan benefit package identifiers. Each Part C and Part D plan shares the same set of plan benefit package identifiers.
- `exporter.bfd.clia_labs_start = 00A0000000` defines the start value of the CLIA lab numbers that are used to populate `CARR_LINE_CLIA_LAB_NUM`.
- `exporter.bfd.clia_labs_count = 10` defines the number of CLIA lab numbers that will be used.
- `exporter.bfd.cutoff_date = 20140529` defines the earliest date for any exported claims.
- `generate.thread_pool_size = -1` defines the number of threads to use for the generator. Set the value to -1 (the default) to match the number of available processor cores (as per `Runtime.getRuntime().availableProcessors()`).
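The decrementing identifiers described above can be pictured with a small sketch. This is an illustration only, not Synthea's implementation: the class name `IdSequence` and the step size of 1 are assumptions. An `AtomicLong` is used here because the generator may run on several threads.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not part of Synthea) illustrating a decrementing ID sequence. */
final class IdSequence {
  private final AtomicLong next;

  /** Creates a sequence from a configured start value, e.g. exporter.bfd.bene_id_start. */
  IdSequence(long start) {
    this.next = new AtomicLong(start);
  }

  /** Returns the current ID, then steps down by one (the step size is an assumption). */
  long nextId() {
    return next.getAndDecrement();
  }

  public static void main(String[] args) {
    IdSequence beneIds = new IdSequence(-1000000L);
    System.out.println(beneIds.nextId()); // the first patient gets the start value
    System.out.println(beneIds.nextId()); // the next patient gets start - 1
  }
}
```

Negative start values keep synthetic identifiers visibly distinct from positive ones, and decrementing moves each new ID further from zero.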
The BFD output files will be found at `output/bfd`:
- `beneficiary_YYYY.csv`: beneficiary information, one file per year, where `YYYY` is the year
- `carrier.csv`: carrier claims
- `dme.csv`: durable medical equipment claims
- `end_state.properties`: see below
- `export_summary.csv`: summarizes the number of claims of each type per beneficiary
- `hha.csv`: home health claims
- `hospice.csv`: hospice claims
- `inpatient.csv`: inpatient claims
- `manifest.xml`: an XML list of generated files
- `missing_codes.csv`: list of Synthea codes that could not be mapped to HCPCS or CPT
- `npi.tsv`: synthetic provider list
- `outpatient.csv`: outpatient claims
- `pde.csv`: Part D prescription claims
- `snf.csv`: skilled nursing facility claims
The `end_state.properties` file captures the final value of each of the configuration options listed above that requires a monotonically increasing or decreasing value per beneficiary or claim. The values in this file can be used (via the `-c` command line switch) to override the configured values, so that subsequent runs of Synthea start where the prior run ended. An example file is shown below.
exporter.bfd.hicn_start=T01000020A
exporter.bfd.mbi_start=1S00E00AA20
exporter.bfd.clm_grp_id_start=-100003266
exporter.bfd.pde_id_start=-100000996
exporter.bfd.fi_doc_cntl_num_start=-100000575
exporter.bfd.bene_id_start=-1000020
exporter.bfd.carr_clm_cntl_num_start=-100001695
exporter.bfd.clm_id_start=-100002270
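Because `end_state.properties` uses the standard Java properties format, it can be inspected with `java.util.Properties`. The sketch below is a hypothetical reader, not part of Synthea; the class and method names are assumptions. It compares the final `bene_id_start` from the example above with the default start value, assuming each exported beneficiary consumes exactly one ID.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader (not part of Synthea) for end_state.properties content. */
final class EndStateReader {
  /** Parses properties text and returns the value of the named key as a long. */
  static long readLong(String propertiesText, String key) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Long.parseLong(props.getProperty(key).trim());
  }

  public static void main(String[] args) {
    String text = String.join("\n",
        "exporter.bfd.bene_id_start=-1000020",
        "exporter.bfd.clm_id_start=-100002270");
    long start = -1000000L; // default exporter.bfd.bene_id_start
    long end = readLong(text, "exporter.bfd.bene_id_start");
    // IDs decrement, so the difference is the number of IDs consumed by the run.
    System.out.println("Beneficiary IDs consumed: " + (start - end)); // prints 20
  }
}
```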
The following shell script generates records for a set of beneficiaries across all 50 states and Washington, DC. The desired total population size is supplied as a command line argument; the number of beneficiaries in each location is proportional to the population of each state (based on census data). An optional second integer argument specifies the number of months of future medical history. For example, to generate 1000 patients with 24 months of future claims, run the script as `./national_bfd.sh 1000 24`.
#!/bin/bash
if [[ $# -eq 0 || $# -gt 2 ]]; then
  echo "Usage: $0 size [months]"
  echo "where 'size' is an integer specifying the target population size and 'months' is an integer specifying the number of months of future medical history"
  exit 1
fi
if [[ $# -eq 1 ]]; then
  end_date=
else
  # BSD date (macOS) and GNU date use different options for relative dates
  case "$(uname -s)" in
    Darwin*) date_args="-v+${2}m +%Y%m%d";;
    *) date_args="-d ${2}months +%Y%m%d";;
  esac
  # date_args is deliberately unquoted so that it splits into separate arguments
  future_date=`date $date_args`
  end_date="-e ${future_date}"
fi
# Weights are based on 2019 census data:
#
# https://data.census.gov/cedsci/table?q=Total%20Population&g=0400000US01,02,04,05,06,08,09,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56&tid=ACSDP1Y2019.DP05&hidePreview=true&moe=false
#
# Each value represents the number of state residents aged 62 or more divided by the
# total number of USA state residents aged 62 or more expressed as a percentage.
#
states=( ); weights=( )
states+=( "Alabama" ); weights+=( "1.578" )
states+=( "Alaska" ); weights+=( "0.178" )
states+=( "Arizona" ); weights+=( "2.357" )
states+=( "Arkansas" ); weights+=( "0.958" )
# states+=( "California" ); weights+=( "10.801" ) # California is handled separately at the end and is used to absorb any rounding errors
states+=( "Colorado" ); weights+=( "1.586" )
states+=( "Connecticut" ); weights+=( "1.170" )
states+=( "Delaware" ); weights+=( "0.351" )
states+=( "District of Columbia" ); weights+=( "0.161" )
states+=( "Florida" ); weights+=( "8.044" )
states+=( "Georgia" ); weights+=( "2.836" )
states+=( "Hawaii" ); weights+=( "0.492" )
states+=( "Idaho" ); weights+=( "0.536" )
states+=( "Illinois" ); weights+=( "3.796" )
states+=( "Indiana" ); weights+=( "2.016" )
states+=( "Iowa" ); weights+=( "1.016" )
states+=( "Kansas" ); weights+=( "0.891" )
states+=( "Kentucky" ); weights+=( "1.401" )
states+=( "Louisiana" ); weights+=( "1.399" )
states+=( "Maine" ); weights+=( "0.530" )
states+=( "Maryland" ); weights+=( "1.801" )
states+=( "Massachusetts" ); weights+=( "2.179" )
states+=( "Michigan" ); weights+=( "3.288" )
states+=( "Minnesota" ); weights+=( "1.712" )
states+=( "Mississippi" ); weights+=( "0.905" )
states+=( "Missouri" ); weights+=( "1.963" )
states+=( "Montana" ); weights+=( "0.382" )
states+=( "Nebraska" ); weights+=( "0.580" )
states+=( "Nevada" ); weights+=( "0.916" )
states+=( "New Hampshire" ); weights+=( "0.472" )
states+=( "New Jersey" ); weights+=( "2.753" )
states+=( "New Mexico" ); weights+=( "0.698" )
states+=( "New York" ); weights+=( "6.092" )
states+=( "North Carolina" ); weights+=( "3.210" )
states+=( "North Dakota" ); weights+=( "0.220" )
states+=( "Ohio" ); weights+=( "3.804" )
states+=( "Oklahoma" ); weights+=( "1.175" )
states+=( "Oregon" ); weights+=( "1.406" )
states+=( "Pennsylvania" ); weights+=( "4.413" )
states+=( "Rhode Island" ); weights+=( "0.351" )
states+=( "South Carolina" ); weights+=( "1.713" )
states+=( "South Dakota" ); weights+=( "0.285" )
states+=( "Tennessee" ); weights+=( "2.098" )
states+=( "Texas" ); weights+=( "7.031" )
states+=( "Utah" ); weights+=( "0.686" )
states+=( "Vermont" ); weights+=( "0.234" )
states+=( "Virginia" ); weights+=( "2.523" )
states+=( "Washington" ); weights+=( "2.247" )
states+=( "West Virginia" ); weights+=( "0.679" )
states+=( "Wisconsin" ); weights+=( "1.903" )
states+=( "Wyoming" ); weights+=( "0.185" )
END_STATE_PROPS_FILE="./output/bfd/end_state.properties"
total_generated=0
for i in "${!states[@]}"
do
  state=${states[$i]}
  weight=${weights[$i]}
  count=`echo "${1}*${weight}/100" | bc`
  total_generated=`echo "${total_generated}+${count}" | bc`
  if [[ $count -eq "0" ]]
  then
    echo "Skipping generating ${state}, requested patients is ${count} "
    continue
  fi
  if [[ -f "${END_STATE_PROPS_FILE}" ]]
  then
    load_props="-c ${END_STATE_PROPS_FILE}"
  else
    load_props=
  fi
  echo "Generating ${count} patients for ${state}"
  ./run_synthea -s ${i} -cs ${i} -r 20230224 ${end_date} ${load_props} -p ${count} --exporter.fhir.export=false --exporter.fhir.transaction_bundle=false --exporter.hospital.fhir.export=false --exporter.practitioner.fhir.export=false --exporter.bfd.export=true --exporter.years_of_history=10 --generate.only_alive_patients=true --generate.providers.selection_behavior=medicare "${state}"
done
# Generate remaining requested population for California to handle any rounding errors
if [[ -f "${END_STATE_PROPS_FILE}" ]]
then
  load_props="-c ${END_STATE_PROPS_FILE}"
else
  load_props=
fi
remaining=`echo "${1}-${total_generated}" | bc`
echo "Generating ${remaining} patients for California"
total_generated=`echo "${total_generated}+${remaining}" | bc`
./run_synthea -s 51 -cs 51 -r 20230224 ${end_date} ${load_props} -p ${remaining} --exporter.fhir.export=false --exporter.fhir.transaction_bundle=false --exporter.hospital.fhir.export=false --exporter.practitioner.fhir.export=false --exporter.bfd.export=true --exporter.years_of_history=10 --generate.only_alive_patients=true --generate.providers.selection_behavior=medicare California
echo "Finished generating ${total_generated} of ${1} requested patients"
The number of patients generated for each state is based on 2019 Census data. The target population for each state is calculated as:
`target_state_population = (target_total_population * census_state_population) / census_all_states_population`
where:
- `target_total_population` is the target population specified on the command line, e.g. 1000 in the example above,
- `census_state_population` is the number of state residents aged 62 or more, and
- `census_all_states_population` is the total number of USA state residents aged 62 or more
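The per-state arithmetic can be checked with a short sketch. The script's `bc` call uses the default scale of 0, so the division truncates toward zero; the hypothetical helper below (not part of Synthea) reproduces that with `BigDecimal`. The weights are the percentage values listed in the script.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Hypothetical helper (not part of Synthea) mirroring the script's per-state arithmetic. */
final class StateAllocation {
  /** Computes total * weightPercent / 100, truncated toward zero like bc with scale 0. */
  static long stateCount(long total, String weightPercent) {
    return new BigDecimal(weightPercent)
        .multiply(BigDecimal.valueOf(total))
        .divide(BigDecimal.valueOf(100), 0, RoundingMode.DOWN)
        .longValueExact();
  }

  public static void main(String[] args) {
    long total = 1000;
    System.out.println(stateCount(total, "1.578")); // Alabama: 15
    System.out.println(stateCount(total, "0.178")); // Alaska: 1
    System.out.println(stateCount(total, "2.357")); // Arizona: 23
  }
}
```

Because every count is truncated, the per-state counts sum to less than the requested total. The script assigns the shortfall to California, and it skips any state whose count truncates to 0.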
Note that the script fixes the values of a number of Synthea command line arguments. Depending on requirements, it may be desirable to edit these values, e.g. the reference date (`-r 20230224`).
Prior to making any changes it is recommended to fork the Synthea repository and create a new branch for the changes.
The `src/main/java/org/mitre/synthea/export/rif/BB2RIFStructure.java` file contains an `enum` for each of the output files: `BENEFICIARY`, `CARRIER`, `DME`, `HHA`, `HOSPICE`, `INPATIENT`, `OUTPATIENT`, `PDE` and `SNF`. Each value in these enumerations defines the name of a field in the corresponding output file (the column header in the CSV file). The order of the values in the enumeration defines the order in which the fields are output in the file.
The above file also contains:
- Additional enumerations that are used to define other file structures, e.g. `EXPORT_SUMMARY` defines the structure of the `export_summary.csv` file.
- Static arrays of enumeration values that are used to group and loop over related fields, e.g. `beneficiaryMedicareStatusFields` includes each of the Medicare status field names (`MDCR_STUS_JAN_CD` ... `MDCR_STUS_DEC_CD`, one field for each calendar month).

These additional enumerations and arrays can be ignored for the purposes of this document.
To add a new field to a BFD file, first edit the corresponding enumeration to add a new value for the field in the desired location. For example, to add an `EYE_COLOR` field to the beneficiary file following the `AGE` field, edit the `BENEFICIARY` enumeration as shown below:
public enum BENEFICIARY {
  DML_IND,
  BENE_ID,
  ...
  CRNT_BIC,
  AGE,
  EYE_COLOR, // new field added for this example
  COVSTART,
  ...
}
Once rebuilt, Synthea will output the new field with a blank value.
The `src/main/resources/export/bfd_field_values.tsv` tab-separated value (TSV) file contains fixed or random values for fields in each of the BFD output files. The columns of this file are:
- `Line`: a unique index for the line in the file, used for reporting errors when processing the file
- `Field`: the BFD field name; values must match a value in an output file enumeration
- `BENEFICIARY` ... `SNF`: one column per BFD file enumeration (names must match); each column entry specifies the desired value of the corresponding field in that BFD file
- `Optional`: specifies whether the BFD field is optional (`TRUE`) or required (`FALSE`)
- `Comment`: provides any comments related to the field
The order of rows in the TSV file is not significant. The table below shows a small extract from the file.
Line | Field | BENEFICIARY | INPATIENT | OUTPATIENT | CARRIER | ... |
---|---|---|---|---|---|---|
0 | ADJSTMT_DLTN_CD | N/A | N/A | N/A | N/A | ... |
1 | ADMTG_DGNS_CD | N/A | Coded | N/A | N/A | ... |
2 | ADMTG_DGNS_VRSN_CD | N/A | 0 | N/A | N/A | ... |
3 | AGE | Coded | N/A | N/A | N/A | ... |
4 | AT_PHYSN_NPI | N/A | Coded | Coded | N/A | ... |
5 | AT_PHYSN_UPIN | N/A | [Blank] | [Blank] | N/A | ... |
Each cell at the intersection of a field and an output file can contain one of the following:
- `N/A` means that the field is not included in the file
- `Coded` means that the field value is set dynamically in Java code (see next section)
- `[Blank]` means that the field value is explicitly set to be blank
- A single value (e.g. `0` for the `ADMTG_DGNS_VRSN_CD` field of `INPATIENT` in the table above). Values can be numbers or strings, and they are copied literally after removing any leading or trailing whitespace.
- Multiple values separated by commas, e.g. `1,2,3`, where one value is selected at random from the list with equal weight applied to all values
Comments can be added to cells using parentheses, e.g. `1,2,3 (this is a comment)` is functionally equivalent to `1,2,3`. These can provide a helpful reminder of the meaning of coded values.
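The cell rules above can be summarised in code. The sketch below is a hypothetical parser written for illustration; it is not the code Synthea uses to read the TSV file, and the class and method names are assumptions. It handles `[Blank]`, single values, comma-separated lists and parenthesised comments, and leaves `N/A` and `Coded` to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical parser (not Synthea's own code) for a bfd_field_values.tsv cell. */
final class FieldValueCell {
  /** Removes any parenthesised comment, then splits the remainder on commas. */
  static List<String> candidates(String cell) {
    String withoutComment = cell.replaceAll("\\(.*?\\)", "").trim();
    List<String> values = new ArrayList<>();
    if (withoutComment.equals("[Blank]")) {
      values.add(""); // an explicitly blank value
      return values;
    }
    for (String value : withoutComment.split(",")) {
      values.add(value.trim());
    }
    return values;
  }

  /** Picks one candidate with equal weight applied to all values. */
  static String pick(String cell, Random random) {
    List<String> values = candidates(cell);
    return values.get(random.nextInt(values.size()));
  }

  public static void main(String[] args) {
    System.out.println(candidates("1,2,3 (this is a comment)")); // prints [1, 2, 3]
    System.out.println(pick("brown,blue,green", new Random()));
  }
}
```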
To assign random values to the example `EYE_COLOR` field above, add a new row to the TSV file, increment the `Line` number value, use `EYE_COLOR` for the field column value, enter the desired values in the `BENEFICIARY` column and `N/A` in every other column. See below for an example:
Line | Field | BENEFICIARY | INPATIENT | OUTPATIENT | CARRIER | ... |
---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... |
125 | EYE_COLOR | brown,blue,green | N/A | N/A | N/A | ... |
If the field value needs to be computed, you will need to edit the Java source code. Each BFD file is written by a separate Java class in the `org.mitre.synthea.export.rif` package. The class is named after the BFD file, e.g. the beneficiary BFD file is written by the `BeneficiaryExporter` class. Each of these classes implements an `export` method that is responsible for exporting all rows for a given synthetic patient at the end of the simulation. An example of adding a statistically weighted value for eye color is shown below:
// org.mitre.synthea.export.rif.BeneficiaryExporter.java
private static final RandomCollection<String> eyeColors = new RandomCollection<>();
static {
  // the first argument is the relative weight of each value
  eyeColors.add(45, "brown");
  eyeColors.add(27, "blue");
  eyeColors.add(18, "hazel");
  eyeColors.add(9, "green");
}

public String export(Person person, long startTime, long stopTime) throws IOException {
  ...
  fieldValues.put(BB2RIFStructure.BENEFICIARY.EYE_COLOR, eyeColors.next(person));
  ...
}
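To see how weighted selection of this kind works outside Synthea, the following self-contained sketch uses a cumulative-weight `TreeMap`. It is a stand-in written for illustration and makes no claim about how `RandomCollection` is implemented; the class name `WeightedChoice` and its methods are assumptions.

```java
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical stand-in (not Synthea's RandomCollection) for weighted selection. */
final class WeightedChoice {
  private final TreeMap<Double, String> cumulative = new TreeMap<>();
  private double total = 0;

  /** Registers a value; its chance of selection is weight divided by the total weight. */
  void add(double weight, String value) {
    total += weight;
    cumulative.put(total, value);
  }

  /** Maps a uniform draw in [0, 1) to the value whose cumulative range contains it. */
  String pick(double uniformDraw) {
    return cumulative.higherEntry(uniformDraw * total).getValue();
  }

  public static void main(String[] args) {
    WeightedChoice eyeColors = new WeightedChoice();
    eyeColors.add(45, "brown");
    eyeColors.add(27, "blue");
    eyeColors.add(18, "hazel");
    eyeColors.add(9, "green");
    System.out.println(eyeColors.pick(new Random().nextDouble()));
  }
}
```

The weights do not need to sum to 100; in this example they sum to 99, so brown is selected with probability 45/99.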
Once all changes have been made and tested:
- Run the Synthea test suite via `./gradlew clean check`
- Fix any test failures
- Submit a pull request via GitHub to request review of your changes and merging into the Synthea master branch