Impact of data source diversity on the distribution of key variables in pregnancy cohorts based on the ConcePTION pregnancy algorithm leveraging a random forest imputation model

Girardi A, Limoncella G, Hyeraci G, Roberto G, Bartolini C, Paoletti O, Messina D, Villalobos F, Bissacco C, Van den Burg J, Houben E, Santaca K, Lentile V, Ingrasciotta Y, Trifiro G, Hoxhaj V, Duran C, Riera-Arnau J, Garcia P, Martin-Perez M, Huerta-Alvarez C, Llorente-Garcia A, Sanchez-Saez F, Rodriguez-Bernal C, Lassalle R, Jove J, Bernard M, Thurin N, Jordan S, Thayer D, Evans H, Coldea A, Manfrini M, Van Gelder M, Hayati S, Schink T, Tari M, Pajouheshnia R, Afonso A, Noan-Laine M, Molgaard-Nielsen D, Cunnington M, Dodd C, Sturkenboom M, Nordeng H, Gini R. Impact of data source diversity on the distribution of key variables in pregnancy cohorts based on the ConcePTION pregnancy algorithm leveraging a random forest imputation model. Presentation to be given at the 2024 ISPE Annual Meeting; August 28, 2024. Berlin, Germany.

BACKGROUND: In the IMI ConcePTION project, aiming to build a European ecosystem for medication safety in pregnancy, researchers developed a meta-algorithm identifying the most exhaustive list of pregnancies from diverse European data sources.

OBJECTIVES: To describe the impact of the diversity of 7 European data sources on the results of the ConcePTION pregnancy algorithm.

METHODS: Data from 2015 to 2019 were extracted from 7 data sources originating from Italy, Spain, the Netherlands and Norway. Any record implying a pregnancy on the record date was retrieved from various combinations of data banks in the data sources, such as Birth Registry, Malformation Registry, Hospital Discharge Records, Primary Care Medical Records, and others. Some data could not be retrieved due to governance restrictions. Records of the same person were ordered longitudinally and grouped into distinct pregnancy episodes. Three variables were defined: start date of pregnancy (SP), end date of pregnancy (EP) and type of end. Type of end was assigned with hierarchical rules from among the following: live birth (LB), stillbirth, spontaneous abortion (SA), elective termination, ectopic or molar, unknown (UNK), unfavourable, lost to follow-up. If data on SP and EP were missing or conflicting, they were imputed with hierarchical rules, with one exception: in data sources with a Birth Registry (6 out of 7), missing SP was imputed record-wise by two distinct predictive models based on the random forest (RF) method, the first for records carrying information on EP and the second for all other records. The root mean squared error (RMSE) of the prediction was computed. The pregnancy-level SP was then calculated by inverse variance weighting.
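
As an illustration only, the RMSE and the inverse variance weighting step could be sketched as follows. This is not the ConcePTION implementation: the function names are hypothetical, dates are represented as integer day offsets, and each record's prediction variance is assumed to be approximated by the squared RMSE of the RF model that produced it, which the abstract does not state.

```python
import math


def rmse(predicted, observed):
    """Root mean squared error between predicted and observed values (in days)."""
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def inverse_variance_weighted_start(record_estimates):
    """Pool record-level start-of-pregnancy predictions into one estimate.

    record_estimates: list of (predicted_start_day, error_sd_days) tuples.
    ASSUMPTION: error_sd_days is the RMSE of the model that produced the
    prediction, used here as a stand-in for the prediction's standard error.
    Returns (pooled_start_day, pooled_sd_days).
    """
    # Each record is weighted by the inverse of its assumed variance.
    weights = [1.0 / (sd ** 2) for _, sd in record_estimates]
    total_weight = sum(weights)
    pooled = sum(w * day for w, (day, _) in zip(weights, record_estimates)) / total_weight
    # Standard deviation of the pooled estimate under independent errors.
    pooled_sd = math.sqrt(1.0 / total_weight)
    return pooled, pooled_sd


# Hypothetical pregnancy with two records: one predicted by the model for
# records with EP information (sd 20 days), one by the other model (sd 40 days).
pooled_day, pooled_sd = inverse_variance_weighted_start([(100, 20.0), (110, 40.0)])
```

In this sketch the more precise prediction dominates the pooled value, which is the general property that motivates inverse variance weighting when record-level predictions differ in reliability.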

RESULTS: The number of pregnancies identified by the algorithm ranged from about 40,000 to 400,000 across the participating data sources, yielding a total of about 1.5 million pregnancies. The most common type of end was LB, whose share ranged from 49% in data sources based on Primary Care Medical Records to 83% in those based on Hospital Discharge Records. SA prevalence ranged from 10% to 13%. In data sources with a low share of LB, a large share of pregnancies had UNK type of end (up to 28%). For sources using RF imputation, RMSE ranged from 17 to 22 days for the first RF model and from 28 to 50 days for the second.

CONCLUSIONS: A large cohort of pregnancies was extracted systematically across diverse data sources in Europe. Diversity across data sources resulted in differences in the distribution of type of end. Our methods permitted identification of pregnancies with all types of end, where data governance allowed. When using the pregnancy algorithm for pharmaco-epidemiological studies, sensitivity analyses must assess the impact of restricting pregnancy cohorts to selected types of end and the impact of imputed information.
