IADB meeting 12-12-2017

Elena Cabrio, Olivier Corby, David Darmon, Catherine Faron Zucker, Edson Florez, Raphael Gazzotti, Amina Ghrissi, Johan Montagnat (remote), Céline Poudat (remote), Frédéric Precioso, Michel Riveill, Pascal Staccini (remote), Serena Villata

- Feedback on the data anonymization meeting (Pascal) and first work on anonymization (Serena / Elena)

See introduction slides attached for general objectives and organization of the project.

- Two PhD students hired: Amina Ghrissi (image data) and Tobias Mayer (text data). Amina arrived in December. Tobias started in October and is working on an anonymization procedure for textual clinical reports.

- First contacts with the Medical Data Centre (now renamed Medical Data Institute) of UCA.

- Franck Michel (I3S) started working on a PMSI data integration procedure using MongoDB, a flexible-format database. He wrote scripts to export the Access PMSI data (as CSV) and import it into MongoDB, proposed some example MongoDB queries and deployed a server on a virtual machine inside the I3S private network. Tests were run on the PMSI MCO data.
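As an illustration of the CSV-to-document step mentioned above, here is a minimal sketch. It is not Franck's script: the column names, the `;` delimiter and the sample values are invented, and no MongoDB server is contacted. It only shows how each CSV row can become a schema-less document in which empty fields are simply omitted.

```python
import csv
import io
import json


def csv_to_documents(csv_text, delimiter=";"):
    """Turn each CSV row into a dict, dropping empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    documents = []
    for row in reader:
        # A flexible-format store does not need placeholders for missing values.
        documents.append(
            {key: value for key, value in row.items() if value not in (None, "")}
        )
    return documents


# Hypothetical export: invented column names and values.
SAMPLE_CSV = "stay_id;diagnosis_code;age\n1;I10;64\n2;;37\n"

documents = csv_to_documents(SAMPLE_CSV)
for document in documents:
    print(json.dumps(document))
```

Dictionaries of this shape could then be passed to a MongoDB driver (for instance the `insert_many` method of a pymongo collection), assuming a server is reachable.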

=== Feedback on the data anonymization meeting (Pascal) and first work on anonymization (Serena / Elena)

Data anonymization is needed to experiment with medical data outside of the CHU network or without direct supervision from the DIIM. However, anonymization degrades data and destroys links between data items: a trade-off needs to be found between the acceptable level of anonymization and the kind of inferences that can be made on the data. In the exported PMSI data, for instance, the link between several hospital stays of the same patient is lost. To fully exploit PMSI data, non-anonymized data (or at least loosely anonymized data with preserved data links) will be needed in the end. It remains to be seen whether VPN access to the CHUN network from an outside institution is an acceptable solution to work with the raw data, or whether all work on raw data needs to be carried out inside the CHU network.
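To make the trade-off concrete, here is a minimal sketch of one way to keep the link between several stays of a patient while hiding the identifier: replacing it with a keyed hash. This is pseudonymization rather than full anonymization, it is not a procedure that was presented at the meeting, and the key, identifiers and field names are all invented for the example.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key: it would have to stay inside the hospital network.
SECRET_KEY = b"hypothetical-key-kept-inside-the-hospital"

# Invented records: patient P001 has two stays.
stays = [
    {"patient_id": "P001", "stay": "2016-03"},
    {"patient_id": "P002", "stay": "2016-05"},
    {"patient_id": "P001", "stay": "2017-01"},
]

# The original identifier is dropped; only the pseudonym is exported.
pseudonymized = [
    {"pseudonym": pseudonymize(s["patient_id"], SECRET_KEY), "stay": s["stay"]}
    for s in stays
]

for record in pseudonymized:
    print(record)
```

The two stays of the same patient receive the same pseudonym, so they can still be linked, while the identifier cannot be recomputed without the key. The remaining fields may still allow re-identification, which is why this only covers the "loosely anonymized" end of the trade-off.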

Exported PMSI data is strongly anonymised, but data links are preserved in the raw database. Well-established anonymization tools exist for textual medical data (Tobias is visiting the LMSI laboratory in Paris, which develops the "Medina" tool). A reference terminology for medical documents (CDA) is available, in addition to all the clinical term vocabularies. For other kinds of data, e.g. biology, there are neither automatic extraction nor automatic anonymization tools.

Michel is trying to get access to the SNIIRAM data (national data from the health insurance), which contains the medical prescriptions and medical acts that have been paid for by the health insurance (only paid-for acts are known; prescriptions are not always applied). Some data sets exist internationally, in particular in the US (CDC Atlanta), which has a more open legislation than France. Producing a linked research medical data set would be a strong added value for the project. This work needs to be synchronised with the medical data centre and may involve other medical institutions (contacts in Rennes and Grenoble in particular).

The PRIMEGE data set contains town medicine prescriptions from 13 general practitioners from 2012. The data contains both structured text with codes and notes in free text. Complete reports, which would need to be anonymized, are excluded. The data set is declared to the CNIL and ready to be exploited. The data is formatted in H' (an HL7-compatible format). Linking these data with biology data would be of high interest, but no tool currently exists for this work. Some statistical tools search for links between PRIMEGE and SNIIRAM data (with 93% accuracy).

- data encoding: how to represent data (e.g. age can be a number but also an age range…)
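As an illustration of the encoding question, here is a minimal sketch of three ways to represent the same age: as a scaled number, as a range label, and as a one-hot vector over ranges. The 10-year bin width and the 120-year cap are arbitrary assumptions, not choices made in the project.

```python
def scale_age(age, max_age=120):
    """Numeric encoding: age scaled to the [0, 1] interval."""
    return min(age, max_age) / max_age


def age_to_range(age, width=10):
    """Categorical encoding: label of the range that contains the age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def one_hot_age(age, width=10, max_age=120):
    """One-hot encoding over age ranges; ages above max_age fall in the last bin."""
    n_bins = max_age // width + 1
    vector = [0] * n_bins
    vector[min(age // width, n_bins - 1)] = 1
    return vector


print(scale_age(60))
print(age_to_range(64))
print(one_hot_age(64))
```

The numeric form keeps the ordering and distance between ages, while the range forms lose precision but give a coarser representation.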

The model size grows in proportion to the dataset size, so the right trade-off between model complexity and computational cost has to be found.

Auto-encoder: the input data is corrupted with noise and then reconstructed, which reduces the sparsity of the input vector without any supervision.

This work on the PRIMEGE database aims at detecting Adverse Drug Reactions (ADR) by finding correlations in medical notes, which contain information on medication, diseases and observed disorders. Example: "the patient has internal bleeding secondary to warfarin" establishes a correlation between the disorder (internal bleeding) and the medication (warfarin).
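To illustrate the kind of correlation meant here, the following toy sketch counts medication/disorder pairs that appear in the same sentence. It is a deliberately naive stand-in, not the tool developed in the project: the two lexicons and the notes are invented, and neither negation nor causality is handled.

```python
import re
from collections import Counter

# Hypothetical, tiny lexicons; a real system would rely on clinical vocabularies.
MEDICATIONS = {"warfarin", "aspirin"}
DISORDERS = {"internal bleeding", "nausea"}


def find_terms(sentence, lexicon):
    """Return the lexicon terms that occur as whole words in the sentence."""
    text = sentence.lower()
    return sorted(
        term for term in lexicon
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    )


def cooccurrences(notes):
    """Count (medication, disorder) pairs mentioned in the same sentence."""
    counts = Counter()
    for note in notes:
        for sentence in re.split(r"[.!?]+", note):
            for medication in find_terms(sentence, MEDICATIONS):
                for disorder in find_terms(sentence, DISORDERS):
                    counts[(medication, disorder)] += 1
    return counts


notes = [
    "The patient has internal bleeding secondary to warfarin.",
    "Aspirin prescribed. No nausea reported.",
]

pairs = cooccurrences(notes)
print(pairs)
```

Only the warfarin sentence yields a pair, because the second note mentions the medication and the disorder in separate sentences. Same-sentence co-occurrence is only a weak signal, since a phrase such as "secondary to" carries the causal meaning that simple counting ignores.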

Current status: a first version of the tool extracts clinical entities from the notes and builds a time series for each patient with notes, medication, diagnosis and symptoms. The LSTM deep network is being built and trained for ADR detection.

A spin-off project named HELP was accepted and funded by Academy 1 to pay for a low-power computing platform using NVIDIA GPUs for deep learning.

- Case study specification at M3. Significant progress was made on the textual data access and exploitation plan (PMSI, PRIMEGE), but everything remains to be done on the imaging data (a meeting is planned with the CHUN cardiology department in January).

- State of the art in LSTM at M6. Edson is working on an LSTM implementation and Frédéric plans further work on LSTM this year. Although no written report is required, we need to assess the progress made on the LSTM network bibliography and exploitation.

The next planned milestone is the data access infrastructure setup with the MSI at M12 (June 2018). Work on data access is progressing, but much remains to be done in coordination with the Medical Data Institute of the MSI.

The following indicators have been given to the IDEX project reporting and follow-up officer. Partners are welcome to give more details. In particular please let me know:

ERC	project aiming to apply to the ERC	Serena Villata submitted an ERC application at the end of 2017
International	international recruitment: student, PhD student or researcher	Recruitment of one German student and one Tunisian student + 2 Colombians
	setting up international cooperation (e.g. hosting a visiting researcher, jointly supervised thesis, hosting an international student…)	Planned collaboration with the University and the hospital of Da Nang (Vietnam)
	setting up a lasting international structure
	other	High-performance computing school on medical data processing (Da Nang). List of international collaborations?
Transdisciplinary	co-publication across scientific domains	Reminder: acknowledgement in publications, and a list to be drawn up. Identify the co-publications.
	thesis co-supervision	All 3 subjects are planned as co-supervised
	other
Target university	collaboration between IDEX members	I3S, Inria, BCL, CHUN, MSI, MDC… Others?
	co-publication between IDEX members	See the publications item above
	contribution to the EUR	One EUR accepted! IADB is within its thematic scope.
	other
Economic, cultural and societal impacts	start-up creation
	patent filing
	technology transfer (licences, transfer…)
	other
Leverage effect	co-funding / public funds	PhDs, postdoc and engineer funded from other sources (ENS, Labex, ANR)
	co-funding / private funds	CIFRE PhD
	co-funding / European funds

Important reminder: all publications related to the project should be acknowledged with the following sentence:

This work is partly funded by the French government labelled PIA program under its IDEX UCAJEDI project (ANR-15-IDEX-0001).

IADB includes funding for master's internships and for travel to present project papers: do not hesitate to let me know if you have specific funding needs in these areas.

The next meeting will be held on February 19, from 10am to 1pm. Subsequent plenary meetings will be organised in May, September and November 2018.
