5.2018 - Record linkage of medical data
Supervisors:
• G. Chauvet (Professor, ENSAI)
• V. Gares (Assistant professor, INSA)
• André Happe (REPERES team)
Research unit: IRMAR-INSA (Rennes)
Contact: valerie.gares@insa-rennes.fr, Guillaume.CHAUVET@ensai.Fr
Keywords: Statistics, record linkage, probabilistic models, optimal transportation.
The National Health Data System ("Système national des données de santé", SNDS) gathers the main national health databases existing in France, i.e. the health information of more than 65 million French people. It is currently one of the largest health databases in the world. The SNDS includes health insurance data, hospitalization data, medical cause-of-death data, disability data and sampled data from supplementary health insurance organizations.
The SNDS data can be used to enrich existing cohorts or medical registers. The objective is to link de-identified research datasets at the patient level, when no personal health identifiers such as name or date of birth are available.
Deterministic approaches may be satisfactory when the combination of several individual covariates yields a unique identifier per patient. When no unique patient identifier is available, alternative approaches are needed. Optimal transport is a promising method for that purpose and will be thoroughly investigated in this internship. Optimal transport aims at minimizing the cost of transporting the joint distribution of all available patient explanatory variables from one dataset to the other. Individual matching probabilities are computed in a second step. This method requires adaptations in order to be applied to SNDS data, especially because of the large variety of data types present in the datasets (dates, numerical values, categories). These adaptations constitute the objective of this work. A PhD is possible after the internship.
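To make the idea concrete, here is a minimal sketch of the first step (cost minimization) on a toy example. It is not the method of the references: the records, the field layout (date, age, sex), the scaling constants and the Gower-style cost are all hypothetical choices made for illustration. It relies on the fact that, for two datasets of equal size with uniform weights, the discrete optimal transport problem admits an optimal solution that is a one-to-one matching, found here by brute force. The second step (matching probabilities) is not shown.

```python
from datetime import date
from itertools import permutations

# Hypothetical de-identified records: (admission date, age in years, sex).
dataset_a = [
    (date(2017, 3, 14), 62, "F"),
    (date(2017, 6, 2), 45, "M"),
    (date(2017, 9, 21), 71, "M"),
]
dataset_b = [
    (date(2017, 9, 20), 71, "M"),
    (date(2017, 3, 14), 63, "F"),
    (date(2017, 6, 5), 45, "M"),
]

# Assumed scaling constants, used to bring each field onto [0, 1].
DATE_RANGE_DAYS = 365.0
AGE_RANGE_YEARS = 100.0


def record_cost(rec_a, rec_b):
    """Gower-style dissimilarity in [0, 1] mixing date, numeric and categorical fields."""
    date_term = min(abs((rec_a[0] - rec_b[0]).days) / DATE_RANGE_DAYS, 1.0)
    age_term = min(abs(rec_a[1] - rec_b[1]) / AGE_RANGE_YEARS, 1.0)
    sex_term = 0.0 if rec_a[2] == rec_b[2] else 1.0
    return (date_term + age_term + sex_term) / 3.0


def optimal_matching(records_a, records_b):
    """Minimum-cost one-to-one matching by exhaustive search (tiny inputs only)."""
    n = len(records_a)
    if n != len(records_b):
        raise ValueError("this sketch assumes datasets of equal size")
    cost = [[record_cost(a, b) for b in records_b] for a in records_a]

    def total(perm):
        # Total transport cost when record i of A is sent to record perm[i] of B.
        return sum(cost[i][perm[i]] for i in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


matching, total_cost = optimal_matching(dataset_a, dataset_b)
for i, j in enumerate(matching):
    print(f"A[{i}] -> B[{j}]")
```

Exhaustive search grows factorially with the number of records, so a dataset of realistic size would require a dedicated assignment or optimal transport solver; the sketch only illustrates how dates, numerical values and categories can enter a single transport cost.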
This internship will be carried out in association with the REPERES team (REcherche en Pharmaco-Épidémiologie et REcours aux Soins), which works on the analysis of the consumption, use and impact of care, including the prescription of health products (drugs, medical devices), at the population level.
REFERENCES
[1] Dimeglio C.*, Garès V.*, Kosorok M. R., Guernec G., Fantin R., Lepage B. and Savy N. On the use of optimal transportation theory to merge databases. Application to clinical trials. Submitted to the International Journal of Biostatistics.
[2] Boris P. Hejblum, Griffin M. Weber, Katherine P. Liao, Nathan P. Palmer, Susanne Churchill, Peter Szolovits, Shawn N. Murphy, Isaac S. Kohane, Tianxi Cai. Probabilistic Record Linkage of De-Identified Research Datasets with Discrepancies Using Diagnosis Codes. Journal of the American Statistical Association.
[3] Fellegi, I. P. and Sunter, A. B. A Theory for Record Linkage. Journal of the American Statistical Association. 64, 1183-1210 (1969)