What is OpenData4Health?

OpenData4Health is a worldwide, open-ended hackathon that explores how to reduce deaths and diseases based on open data.
Anyone can participate at any time. The project started in 2021 as a project of the International Longevity Alliance and Epidemium. We encourage universities and institutions from every country to join: health must be improved everywhere. This is a serious project that can be part of studies related to public health, environment, graphics, communication, data science, data management, mathematics, statistics, actuarial science, and much more. The data is free and unpatented; anyone can contribute freely, put it on one's resume, and help build better health.

Interestingly, we found that the epidemiologist Antoine Flahault had expressed his desire for such a citizen-science project back in 2011!

How to join?

OpenData4Health is related to another project launched in the framework of Epidemium: NEOS, described below. Please do not hesitate to apply to it as well!


Essential achievements so far


The project was initiated in 2015 by superimposing cancer risks and geographic factors worldwide at a coarse granularity level (country/state/region/department). This proved that known cancer risk factors could be found with this approach, and that finer geographic granularity was needed. The project was therefore continued in 2021 at fine granularity, focusing on all-cause mortality (both for easier access to fine-granularity data and for a broader health scope), and starting with France to keep the project tractable. One country is not enough, however, to capture the whole picture of what impacts health globally, so the project needed to be extended. This is how OpenData4Health was born. What follows is a snapshot of the project.


Mortality, when accounting for age and gender, is not homogeneous: [Pol Sans (Barcelona, Spain) and Edouard Debonneuil (Paris, France)]

Of note, the two rows of graphs are the same except that there is one dot per city or one dot per department/canton/city. The graphs on the right are even identical, up to point sizes. This highlights the difficulty of visually appreciating nuances, and the need to use mathematics (based on mixtures of binomial distributions) to refine the superposition between Y (mortality) and X (geographic factors) once visual analysis helps grossly choose X, just like with most database analyses. We developed an interactive map at city level (slow to appear). The data on these maps average deaths and populations over 2014-2019; more granular statistics are being precisely collected by Oscar Garibo Orts (Valencia, Spain) for modeling.
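The binomial-mixture idea mentioned above can be sketched in a few lines. The city data below is invented for illustration, and the two mixture rates are fixed by hand rather than fitted (a real analysis would estimate them, e.g. with EM):

```python
import numpy as np
from scipy.stats import binom

# Hypothetical city-level data: population and deaths averaged over a period
population = np.array([50_000, 120_000, 8_000, 300_000])
deaths     = np.array([   450,   1_500,   110,   2_400])

# Single-rate model: one mortality rate shared by all cities
p_single = deaths.sum() / population.sum()
ll_single = binom.logpmf(deaths, population, p_single).sum()

# Two-component binomial mixture: each city drawn from a "low" or "high"
# mortality rate (rates and equal mixing weights fixed by hand here)
rates = np.array([0.008, 0.013])
comp_ll = binom.logpmf(deaths[:, None], population[:, None], rates[None, :])
ll_mixture = np.log(np.exp(comp_ll).mean(axis=1)).sum()

print(ll_single, ll_mixture)
```

Comparing the two log-likelihoods indicates whether mortality rates are better described as heterogeneous across cities than by a single national rate.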

Why is mortality low in Paris, Bordeaux, and Lyon, and high in Lille? Why is mortality high in the North of France?

Air pollution, water pollution, behaviors, wealth, access to healthcare...
Does a first look at maps of geographic factors in France provide potential clues?

Of note, this part will evolve as we focus on more factors.
  • Air pollution: it would be a key cause of mortality after alcohol and tobacco. PM2.5 (particulate matter with an aerodynamic diameter smaller than 2.5 µm) would account for 9% of deaths. The high concentration in the northern half of France correlates with overmortality (even though it would also suggest overmortality in Paris). Looking at neighboring countries, the link is visually far from obvious, possibly due to confounding factors: one would notably expect the North of Belgium and the North of Italy to have a low life expectancy.
  • Alcohol: it would cause cancer even at small doses.
  • Tobacco: smoking would be more prevalent at the borders of France. See how a slightly different color gives the impression of a different map. This shows the importance of knowing the thresholds at which a substance is mildly and strongly dangerous, and of finding fine-granularity geographic data.
  • Water/soil pollution: arsenic, above 25 mg/kg of soil, would cause various diseases.
  • Water pollution: pesticides by city.
  • Access to healthcare:
  • Alcohol, tobacco, obesity: [Adrien Helary (Paris, France)]
  • Wealth:
  • Other:
Of note, it may also be useful to decompose Y:

The impact of air pollution on mortality was already studied

Facing multiple possible explanations for mortality, it is good to rely on the scientific literature and to use macro-data to adjust "Y=f(X)", rather than fully discovering which X should be considered and how they should be considered (mean/max...).

A study that focused on fine particles versus mortality in France happens to be a good basis for OpenData4Health. We go through various parts of their report and synthesis here, highlighting what serves OpenData4Health more generally, and we add comments in italics:

The NEOS project

Is mortality risk linked with the average PM2.5 concentration over the past 10 years? With PM2.5 peaks 3 to 5 years earlier?
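To make the question above concrete, here is a tiny sketch of the two candidate exposure features, computed from an invented yearly PM2.5 series (all values are assumptions, for illustration only):

```python
import numpy as np

# Hypothetical yearly mean PM2.5 (µg/m³) for one city, oldest to newest;
# the last entry is the most recent year
pm25 = np.array([14.2, 15.1, 13.8, 16.4, 18.0, 17.2, 15.9, 14.7, 13.5, 12.9, 12.1])

# Candidate exposure feature 1: average concentration over the past 10 years
mean_10y = pm25[-10:].mean()

# Candidate exposure feature 2: highest concentration 3 to 5 years earlier
# (pm25[-6] is 5 years ago, pm25[-4] is 3 years ago)
peak_3_5y = pm25[-6:-3].max()

print(mean_10y, peak_3_5y)
```

Each city would get such features, which can then be compared against its mortality Y.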

This "PM2.5" example (fine particulate matter in the air) shows how important it is to gather such knowledge to compare Y with the many potential risk factors X. Luckily, the International Agency for Research on Cancer (IARC) investigates the link between cancer risks and various agents, and freely provides monographs that contain such knowledge.
The NEOS collaborative project gathers such knowledge for the use of each X and tries to define a standard for interoperable medical data that embeds such agents. Please apply here for the NEOS project! As you can see, the two projects are closely linked.
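As a purely illustrative sketch (not the NEOS standard itself, which is still being defined), a record linking a geographic unit to an IARC-listed agent could look like this; every field name here is an assumption:

```python
from dataclasses import dataclass, asdict

@dataclass
class AgentExposure:
    """Hypothetical record linking a geographic unit to an agent.

    Field names are illustrative assumptions, not the NEOS standard.
    """
    area_code: str    # geographic unit, e.g. an INSEE city code
    agent: str        # agent name as used in IARC monographs
    iarc_group: str   # IARC carcinogenicity group, e.g. "1" or "2A"
    mean_level: float # average measured level of the agent
    unit: str         # unit of mean_level
    period: str       # years the measurement covers

# Example record (invented values): PM2.5 exposure for Paris (INSEE 75056)
record = AgentExposure("75056", "PM2.5", "1", 14.5, "µg/m³", "2014-2019")
print(asdict(record))
```

An interoperable schema of this kind would let each X be joined with mortality data Y by area and period.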

NEOS project, with Pascal Deschaseaux (Lyon, France), Sébastien de Longeaux (Lyon, France), Rachel Aronoff (Lausanne, Switzerland), Edouard Debonneuil (Paris, France) and others

How to model Y=f(X)

As seen above, our eyes do not know how to use micro-data: to see red or green parts of France, mortality had to be aggregated over time and space. Models, however, do. Also, many factors X can grossly explain why mortality is particular in some parts of France; models can disentangle risk factors more precisely than our eyes.

It is important for models to be well guided, with the right appreciation of the uncertainty of each data point. Then, simple models such as a logistic regression can be used, or strong machine learning models with SHAP values and/or ACV to show the main risk explanations that the model found for each city [Adrien Helary (Paris, France)]
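As a minimal sketch of the "simple model" route, city-level death counts can be fitted with a weighted logistic regression (each city contributes its deaths as positive cases and its survivors as negative cases). The data below is simulated with invented effect sizes; a SHAP or ACV analysis would come on top of a fitted model like this one:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_cities = 200

# Hypothetical standardized per-city factors X, e.g. PM2.5, alcohol, wealth
X = rng.normal(size=(n_cities, 3))
pop = rng.integers(5_000, 200_000, n_cities)

# Simulate deaths with a known dependence on the first two factors only
logit = -4.6 + 0.3 * X[:, 0] + 0.2 * X[:, 1]
p = 1 / (1 + np.exp(-logit))
deaths = rng.binomial(pop, p)

# Binomial trick: duplicate each city's row, once as "death" (y=1, weight =
# deaths) and once as "survival" (y=0, weight = population - deaths)
y  = np.concatenate([np.ones(n_cities), np.zeros(n_cities)])
Xw = np.vstack([X, X])
w  = np.concatenate([deaths, pop - deaths])

# Large C -> negligible regularization given the huge total weight
model = LogisticRegression(C=1e6, max_iter=1000).fit(Xw, y, sample_weight=w)
print(model.coef_)  # should recover roughly the simulated effects
```

The fitted coefficients recover the simulated effect of each factor; on real data, the same fit would quantify how strongly each geographic factor is associated with mortality.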


Some data was prepared for 2022

