SHARE-ENV: A Data Set to Advance Our Knowledge of the Environment–Wellbeing Relationship

Climate change interacts with other environmental stressors and vulnerability factors. Some places and, owing to socioeconomic conditions, some people are far more at risk. The data behind current assessments of the environment–wellbeing nexus are either coarse and regionally aggregated, when multiple regions or groups are considered, or granular but drawn from ad hoc samples with few variables. To assess the impacts of climate change, we require data that are granular and comprehensive in both the variables and the population studied. We build a publicly accessible data set, SHARE-ENV, which fulfills these criteria. We expand EU-representative, individual-level, longitudinal data (the SHARE survey) with environmental exposure information on temperature, radiation, precipitation, pollution, and flood events. We illustrate, through four simplified multilevel linear regressions, cross-sectional and longitudinal, how full-fledged studies can use SHARE-ENV to contribute to the literature. Such studies would help assess climate impacts and estimate the effectiveness and fairness of several climate adaptation policies. Other surveys can be expanded with environmental information to unlock different research avenues.


Climate data
The E-OBS gridded datasets [10] on temperature, radiation and precipitation are the starting point for the climate data we generate.

Temperature Bins
For yearly measures of the full temperature distribution, we focus on temperature bins, i.e., the number of days in a year on which the minimum (TN variable in E-OBS), mean (TG variable in E-OBS) and maximum (TX variable in E-OBS) temperature fall in each of sixteen temperature intervals, 2.5°C wide except for the two open-ended extremes: <-5, -5 to -2.5, -2.5 to 0, 0 to 2.5, 2.5 to 5, 5 to 7.5, 7.5 to 10, 10 to 12.5, 12.5 to 15, 15 to 17.5, 17.5 to 20, 20 to 22.5, 22.5 to 25, 25 to 27.5, 27.5 to 30 and >30, computed at the grid cell level. The use of temperature bins allows flexibility in considering the non-linear impacts of temperature on health and other variables of interest. We then assign the grid cells to the SHARE regions by employing a shapefile of the SHARE regions and geospatial routines from the R packages sf and raster. We constructed the shapefile of the SHARE regions from EUROSTAT NUTS shapefiles (downloadable from EUROSTAT) and a shapefile of Luxembourg cantons (downloadable from data.public.lu).
Once the bins are computed at the grid cell level and georeferenced to a SHARE region, we aggregate them into two regional measures: median and mean. We also calculate the standard deviation across the cells of a SHARE region, given that, especially for larger regions, spatial variability might be substantial. Accordingly, the variable names end with '_median', '_mean' or '_std'.
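The binning and regional aggregation described above can be sketched as follows. This is a minimal Python illustration, not the authors' R routines. The function names are hypothetical. The treatment of values falling exactly on a bin edge (assigned to the upper bin) and the use of the population standard deviation are assumptions that the text does not specify.

```python
from bisect import bisect_right
from statistics import mean, median, pstdev

# The 15 interior bin edges in °C (-5, -2.5, ..., 30) delimit the 16 bins.
EDGES = [-5 + 2.5 * i for i in range(15)]


def yearly_bin_counts(daily_temps):
    """Count the days of one year falling in each of the 16 bins, for one grid cell.

    Assumption: a value exactly on an edge is assigned to the upper bin.
    """
    counts = [0] * 16
    for temp in daily_temps:
        counts[bisect_right(EDGES, temp)] += 1
    return counts


def aggregate_region(cell_counts):
    """Aggregate per-cell bin counts to one region (mean, median, std per bin).

    Assumption: population standard deviation across the region's cells.
    """
    per_bin = list(zip(*cell_counts))
    return {
        "mean": [mean(b) for b in per_bin],
        "median": [median(b) for b in per_bin],
        "std": [pstdev(b) for b in per_bin],
    }


# Two toy grid cells with four "days" each, assumed to lie in the same region.
cell_a = yearly_bin_counts([-6.0, -5.0, 1.0, 31.0])
cell_b = yearly_bin_counts([-6.0, -7.0, 1.0, 12.0])
region = aggregate_region([cell_a, cell_b])
```

In this toy example, `region["mean"][0]` is the regional mean number of days below -5°C.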

Average (seasonal) temperature
We calculate the average annual temperature and the average seasonal temperatures (spring, summer, fall and winter) in the SHARE region where the respondent lived in a given year. These are calculated for each grid cell as the average of the mean temperature (TG variable in E-OBS) over all days of the year, or over the days pertaining to each season (December, January and February are allocated to winter; March, April and May to spring; June, July and August to summer; and September, October and November to fall). These grid cell values are aggregated to the SHARE region through both the median and the mean.
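A minimal Python sketch of the seasonal averaging at one grid cell is given below. The function name and input layout are hypothetical. The sketch assumes that December is grouped with January and February of the same calendar year, which the text does not state.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Month-to-season allocation as described in the text.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def seasonal_means(daily):
    """Average TG per season and over the whole year for one grid cell.

    daily: iterable of (date, mean_temperature) pairs for one calendar year.
    Assumption: December belongs to the winter of the same calendar year.
    """
    by_period = defaultdict(list)
    for day, tg in daily:
        by_period[SEASON_OF_MONTH[day.month]].append(tg)
        by_period["year"].append(tg)
    return {period: mean(values) for period, values in by_period.items()}


sample = seasonal_means([
    (date(2010, 1, 15), 0.0),
    (date(2010, 12, 15), 2.0),
    (date(2010, 7, 1), 20.0),
])
```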

Heating and Cooling Degree days
Following the EUROSTAT definitions (https://ec.europa.eu/eurostat/cache/metadata/en/nrg_chdd_esms.htm), at each grid cell we calculate the number of heating degree days (HDD) and cooling degree days (CDD) using the mean temperature from the E-OBS dataset (TG variable). Thus, for HDD, we sum over a year, for each grid cell, the differences between 18°C and the recorded mean daily temperature, for every day on which the mean temperature in that grid cell was equal to or below 15°C. For CDD, the process is analogous, except that we sum the differences between the recorded mean daily temperature and 21°C, only for those days on which the mean temperature was above 24°C.
Each grid cell thus has, for each year, an HDD and a CDD index. These are aggregated to the SHARE regions through both the median and the mean, as with the remaining variables.
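The two indices can be written compactly. The sketch below is a Python illustration with a hypothetical function name, using the thresholds exactly as stated in the text (at or below 15°C for HDD, strictly above 24°C for CDD).

```python
def degree_days(daily_mean_temps):
    """Return (HDD, CDD) for one grid cell and one year of daily mean temperatures in °C."""
    # HDD: shortfall below 18°C, counted only on days at or below 15°C.
    hdd = sum(18.0 - t for t in daily_mean_temps if t <= 15.0)
    # CDD: excess above 21°C, counted only on days above 24°C.
    cdd = sum(t - 21.0 for t in daily_mean_temps if t > 24.0)
    return hdd, cdd


# Toy series: two cold days (10 and 15), two neutral days and one hot day (26).
hdd, cdd = degree_days([10.0, 15.0, 16.0, 24.0, 26.0])
```

Here the cold days contribute 8 + 3 to HDD and the hot day contributes 5 to CDD. The days at 16°C and 24°C contribute to neither index.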

Radiation
The 0.1° gridded E-OBS dataset provides data on daily radiation starting in 1950 through variable QQ.
For each grid cell, we calculate, for any given year, the average radiation over all the days in that year, or over the days pertaining to each season. These grid cell values are aggregated to the SHARE region through both the median and the mean.

Precipitation
For precipitation, we likewise provide yearly variables and cumulative variables calculated from them, starting from the E-OBS dataset and using daily near-surface precipitation (E-OBS variable RR). At each grid cell, we calculate the number of days in each year on which the daily precipitation sum exceeds 10 mm and 20 mm (heavy and very heavy precipitation days, respectively), as defined in the Agroclimatic indicators dataset, part of the C3S Global Agriculture Sectoral Information Systems (SIS). As with the temperature variables, these are georeferenced to SHARE regions and aggregated using the median and mean, alongside the standard deviation to analyze intra-region variation.
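As an illustration, the two yearly counts at one grid cell could be computed as below. This is a Python sketch with a hypothetical function name. It reads "exceeds" as a strict inequality, which is an assumption about the boundary case.

```python
def heavy_precipitation_days(daily_rr_mm):
    """Return (heavy, very_heavy) day counts for one grid cell and one year.

    daily_rr_mm: daily precipitation sums in mm (E-OBS variable RR).
    Assumption: a day counts only if precipitation is strictly above the threshold.
    """
    heavy = sum(1 for rr in daily_rr_mm if rr > 10.0)
    very_heavy = sum(1 for rr in daily_rr_mm if rr > 20.0)
    return heavy, very_heavy


counts = heavy_precipitation_days([0.0, 10.0, 10.5, 25.0])
```

Note that very heavy days are, by construction, also counted among the heavy days.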

Pollution data
The variables considered for pollution relate to the four most explored pollutants in the context of health: particulate matter of 2.5 microns in diameter (PM2.5), particulate matter of 10 microns (PM10), ozone (O3) and nitrogen dioxide (NO2), as put forward in the WHO Review of evidence on health aspects of air pollution [S1].

Concentration
For PM2.5, PM10 and NO2, there is limited evidence for the existence of a threshold below which health effects are negligible. Negative health outcomes have been found at very low concentrations (WHO 2014). We therefore resort to yearly average exposures, starting from the CAMS global reanalysis (EAC4) dataset of monthly averaged fields, whose first year is 2003.
The original CAMS EAC4 monthly dataset resolution is 0.75° × 0.75°. We disaggregate the dataset to 0.1° × 0.1° through bilinear interpolation and, at the grid cell level, take the average of the 12 months of each year. As with the temperature dataset, each grid cell is associated with a SHARE region when its centroid falls within the region boundary, and the three variables (mean, median and standard deviation) are then constructed.
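To make the disaggregation step concrete, the sketch below shows bilinear interpolation of a coarse regular grid at an arbitrary point, in plain Python. It is an illustration of the technique, not the authors' routine. The function name, argument layout and toy grid are hypothetical, and the target point is assumed to lie within the extent of the coarse grid.

```python
def bilinear(grid, lat0, lon0, step, lat, lon):
    """Bilinearly interpolate a regular grid at (lat, lon).

    grid: list of rows; row i is at latitude lat0 + i * step and
          column j is at longitude lon0 + j * step.
    Assumption: (lat, lon) lies inside the grid extent.
    """
    fi = (lat - lat0) / step  # fractional row index
    fj = (lon - lon0) / step  # fractional column index
    i = min(int(fi), len(grid) - 2)
    j = min(int(fj), len(grid[0]) - 2)
    di, dj = fi - i, fj - j
    # Interpolate along longitude on the two bracketing rows, then along latitude.
    lower = grid[i][j] * (1 - dj) + grid[i][j + 1] * dj
    upper = grid[i + 1][j] * (1 - dj) + grid[i + 1][j + 1] * dj
    return lower * (1 - di) + upper * di


# Toy 2 x 2 coarse grid with a 0.75° step.
coarse = [[0.0, 3.0], [6.0, 9.0]]
centre = bilinear(coarse, 0.0, 0.0, 0.75, 0.375, 0.375)
corner = bilinear(coarse, 0.0, 0.0, 0.75, 0.75, 0.75)
```

Evaluating such a function at the centroids of the 0.1° cells yields the finer grid. The yearly value of a cell is then the mean of its 12 monthly values.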
For O3, the literature documents mixed evidence on the existence of thresholds. Several papers find an association between health outcomes and summer ozone concentration, but not winter concentration, a finding attributed by some studies to the existence of a threshold, or to confounding effects or seasonal behavioral differences [S2]. Other studies that specifically analyze the threshold question arrive at different conclusions (e.g., evidence of thresholds is found in some studies [S3] but not in others [S4]). We follow the recent literature on long-term effects of ozone exposure and operate with yearly averages of daily maxima and warm-season averages of daily maxima [S5, S6, S7]. The dataset used is CAMS EAC4 [12], from which we use the average surface-level O3 concentration at 3-hour intervals of each day, with 2004 as the first year. For each day, we keep the maximum of the 6 observations reported, at the grid cell level (after disaggregating the spatial resolution from 0.75° to 0.1° as mentioned above). We then take either the yearly average or the warm-months average (April to September) of the daily maxima, for each grid cell. The grid cells are overlapped with the SHARE regions, as with the temperature datasets, and we calculate the mean, median and standard deviation at the SHARE region level.
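The two ozone summaries for one grid cell can be sketched as follows in Python. The function name and input layout are hypothetical, and the sketch makes no assumption about how many sub-daily observations are available per day.

```python
from datetime import datetime
from statistics import mean

WARM_MONTHS = {4, 5, 6, 7, 8, 9}  # April to September, as in the text


def ozone_summaries(observations):
    """Return (yearly, warm_season) averages of daily maxima for one grid cell.

    observations: iterable of (datetime, concentration) pairs within one year.
    The warm-season value is None if no warm-month day is present.
    """
    daily_max = {}
    for timestamp, value in observations:
        day = timestamp.date()
        daily_max[day] = max(value, daily_max.get(day, value))
    yearly = mean(list(daily_max.values()))
    warm = [v for d, v in daily_max.items() if d.month in WARM_MONTHS]
    return yearly, (mean(warm) if warm else None)


yearly_o3, warm_o3 = ozone_summaries([
    (datetime(2010, 1, 1, 0), 10.0),
    (datetime(2010, 1, 1, 3), 30.0),
    (datetime(2010, 7, 1, 12), 80.0),
    (datetime(2010, 7, 1, 15), 60.0),
])
```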

Emissions
The pollution concentration datasets mentioned above begin in 2003 (or in 2004 for O3), thus enabling coverage of the regular SHARE waves (which start in 2004), but not of cumulative exposure. To go further back in time, we use a dataset not on pollution concentration but on pollutant emissions, the EDGAR v5.0 Global Air Pollutant Emissions dataset [i], which covers the period 1970-2015 [13]. The relevant variable for direct health effects is concentration; the health impacts of emissions will therefore differ across regions, depending, notably, on meteorological conditions and topography. Even so, especially given that emissions are the variables that policy can affect, considering their (indirect) effects on other variables can be of interest. The variables obtained from EDGAR are estimates of yearly emissions of PM2.5 and PM10 at the grid cell level, which we overlap with SHARE regions to obtain the yearly mean, median and standard deviation at the region level. Information on concentration could also be derived from the EDGAR dataset if combined with advanced chemical transport models (CTMs). The original dataset is available at a 0.1° × 0.1° resolution.

Flood events data
For floods, we resort to the DFO dataset [11], which provides information on flood events from 1985 until the present. We report 6 variables: the number of flood events, the number of casualties, the number of displaced individuals, a weighted number of flood events (weighted by an indicator of 1, 1.5 or 2 representing the severity of the flood event), the total number of days during which there were flood events, and the weighted total days (weighted by the same severity indicator).
The variables correspond to whether the individual was living in a region considered in the dataset to be affected by the flood event (more specifically, whether the region where the individual was living overlaps with the area given as 'affected' in the DFO dataset). Since, depending on the country, individuals might report a NUTS2 or a NUTS1 region, 12 further variables are created. The first 6 refer to whether the NUTS1 region where the individual resided was affected by flood events, and the latter 6 to whether the NUTS2 region where the individual resided was affected by flood events.
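The six flood variables for one region and year can be illustrated with the Python sketch below. The record layout and all variable names are hypothetical and do not correspond to DFO or SHARE-ENV column names. The sketch also assumes that the weighted measures are plain sums of severity and of severity times duration over the events overlapping the region.

```python
from dataclasses import dataclass


@dataclass
class FloodEvent:
    """One flood event overlapping the region (hypothetical record layout)."""
    dead: int
    displaced: int
    severity: float  # 1, 1.5 or 2
    days: int


def flood_variables(events):
    """Build the six flood variables from the events affecting one region in one year."""
    return {
        "n_events": len(events),
        "casualties": sum(e.dead for e in events),
        "displaced": sum(e.displaced for e in events),
        # Assumption: each event counts with its severity as weight.
        "n_events_weighted": sum(e.severity for e in events),
        "days": sum(e.days for e in events),
        # Assumption: each flood day counts with the event's severity as weight.
        "days_weighted": sum(e.severity * e.days for e in events),
    }


floods = flood_variables([
    FloodEvent(dead=2, displaced=100, severity=1.5, days=4),
    FloodEvent(dead=0, displaced=10, severity=2, days=3),
])
```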

Regional aggregation and population weighting
We identify households' location through the SHARE regions reported in the retrospective accommodation waves 3 and 7, or through the NUTS region in which the household was located at the moment of sampling in the regular waves. The latter is reported in the housing modules of the regular panel waves. We use information from the housing modules on whether individuals changed house to extend regional information forward. SHARE regions are mostly NUTS2 (Austria, Bulgaria, Croatia, Czechia, Denmark, Finland, Greece, Hungary except for Budapest and Pest, which are reported together as the NUTS1 region of Central Hungary, Italy, Latvia, Lithuania, Poland, Portugal, Romania, Slovakia, Slovenia, Spain and Sweden), with a few countries reporting NUTS1 only (Belgium, France, Germany and one region of Hungary, Central Hungary) [ii].
Whenever individuals lived in a country different from the one in which they are now sampled, we know only the country, not the region, in which they lived. Country-level information is considered too aggregated to provide useful environmental exposure measures. Thus, for periods when respondents were outside the country, we do not have any environmental information. Cumulative exposure variables, therefore, do not consider such years. Averages that explicitly account for this can be calculated by dividing cumulative exposures by the number of years for which there is information (which excludes the years when individuals were abroad). We provide the variables necessary for users to build said averages.
From gridded raw datasets, we generate transformed variables at the grid cell level, as explained in the previous sections. We finally aggregate them to the SHARE regions: we detect in which SHARE region the grid cells are located by overlaying them with a shapefile of the SHARE regions, constructed from EUROSTAT NUTS shapefiles (downloadable from EUROSTAT) and a shapefile of Luxembourg cantons (downloadable from data.public.lu; see SI for more details on the NUTS classifications used). For climate and pollution variables we provide unweighted variables and population-weighted variables. For population-weighted variables, we resort to the historical gridded population dataset from ISIMIP [iii], which provides annual population estimates for 1901-2020.
Weighting is done at the moment of regional aggregation.
A second version of the dataset, currently undergoing further robustness checks, explores more granular geographical data. Using the Degree of Urbanization (DEGURBA) methodology (the EU/OECD standard for urbanization classification), we classify each grid cell within a SHARE region as being part of a city, of towns and suburbs, or of a rural area. We compute population-weighted exposure variables for each SHARE region-DEGURBA region pair. With estimated country-specific weights, we transform these into averages for the five types of area indicated by SHARE respondents: big cities, suburbs, large towns, small towns and rural areas.
[i] https://edgar.jrc.ec.europa.eu/gallery?release=v50_AP&substance=PM10&sector=TOTALS
[ii] Wave 3 was conducted in almost all countries in 2008/2009, while Wave 7 was conducted in 2017. This would, at first, lead us to use NUTS2006 and NUTS2016 respectively. In practice, the SHARE regions indicated by respondents are consistent from Wave 3 to Wave 7, i.e., they do not change even if there were changes in the NUTS structure. France and Poland are the two examples: there are region changes in the NUTS, but not in the SHARE regions, which retain a direct correspondence of names to NUTS2006. Therefore, for these two countries, we resort to NUTS2006 (shapefile NUTS2013, since there was no change to the NUTS boundaries of the two countries from NUTS2006 to NUTS2013). For the remaining countries, we resort to the NUTS2016 shapefile.
[iii] https://data.isimip.org/datasets/fc1e4a06-bd4a-4044-b8e6-46ce86346489/

Cumulative variables
The SHARE dataset is a panel dataset. Environmental hazards might have a cumulative impact on health. Exposures that took place at a young age might also only later translate into health consequences.
We therefore construct cumulative variables of exposure to environmental hazards, reflecting not the exposure to, for instance, extreme temperatures in the year of a wave, but exposure from an individual's birth until the wave in question, amongst other cumulative indicators.
If a variable has no prefix, it refers to the exposure to the environmental hazard in the year of the wave. Prefixes starting with 's' correspond to a rolling sum of exposure, with the simple 's_' corresponding to the rolling sum of exposure from birth (or from the oldest year available) up until the year of the wave in question. The prefixes starting with 'y' are simple sums instead of rolling sums; they correspond to total exposure during certain relevant years. For early-age exposure, 'y5_', 'y10_' and 'y15_' correspond to total exposure during the first 5, 10 and 15 years of age. 'yjob_' corresponds to exposure during the years at the current or most recent job. We also generate variables for exposure to the environment in the years preceding periods of ill health during adulthood. Respondents indicate up to 3 periods in which they experienced ill health, specifying the start and end (more details in Appendix 2). For individuals indicating illness periods, we construct variables with prefixes 'yill1_', 'yill2_' and 'yill3_' denoting exposure during the years of illness periods 1, 2 and 3 respectively. We construct variables with prefixes 'y1bf_', 'y3bf_' and 'y5bf_' to represent exposure to hazards during the 1 year, the 3 years and the 5 years preceding the start of each illness period.
We generate cumulative variables since birth for 6 of the 16 temperature bins, at the low and high extremes, i.e., for temperatures below -5°C, between -5°C and -2.5°C and between -2.5°C and 0°C; and for temperatures between 25°C and 27.5°C, between 27.5°C and 30°C, and above 30°C. Other bins can be made available on request. Among the temperature variables, we also report cumulative exposure since birth for CDD and HDD. Cumulative variables since birth are also available for precipitation. We report cumulative variables for flood variables as well.
As auxiliary variables, we report the rolling sum of the number of years for which cumulative measures were computed. We choose to provide both cumulative exposures and the number of years for which cumulative exposure is available, instead of only averages, since, even for the same variable, information is not available for the same number of years for all individuals. This is for two reasons: i) some individuals were born before the year in which the environmental variables start and ii) there are periods in which individuals were outside their country of interview. By providing both the cumulative exposure and the years available, averages can be readily computed as their ratio, if averages are the variables of interest, and, simultaneously, subsets of the sample based on the number of years available (e.g., only individuals with information for all years since birth) can be analyzed separately.
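The relation between cumulative exposure, years available and averages can be sketched as follows. This Python illustration covers only the 's_' and 'y5_' prefixes for a single respondent and variable. The function name and the '_years' and '_avg' names are hypothetical, and taking the first 5 years of age as the birth year plus the four following calendar years is an assumption.

```python
def cumulative_exposure(yearly, birth_year, wave_year):
    """Build 's_' and 'y5_' style sums for one respondent and one variable.

    yearly: dict mapping year -> exposure in the region of residence.
    Years without information (before the data start, or spent abroad)
    are simply absent from the dict.
    """
    def total(first, last):
        values = [yearly[y] for y in range(first, last + 1) if y in yearly]
        return sum(values), len(values)

    s_sum, s_years = total(birth_year, wave_year)
    # Assumption: the first 5 years of age span birth_year .. birth_year + 4.
    y5_sum, y5_years = total(birth_year, birth_year + 4)
    return {
        "s_": s_sum,
        "s_years": s_years,    # hypothetical name for the years-available count
        "y5_": y5_sum,
        "y5_years": y5_years,  # hypothetical name
        # Average over the years with information only; None if there are none.
        "s_avg": s_sum / s_years if s_years else None,
    }


# Respondent born in 1950, interviewed in 1960, with several years missing.
example = cumulative_exposure(
    {1950: 1, 1951: 2, 1952: 0, 1954: 3, 1955: 1, 1960: 4},
    birth_year=1950,
    wave_year=1960,
)
```

Keeping the years-available count next to each sum also makes it easy to restrict an analysis to respondents with complete histories, for instance those whose count equals the number of years since birth.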
We also report average spring, summer, fall, winter and yearly temperatures and average radiation, since birth and during the first 5, 10 and 15 years of life. For these, we directly provide the averages alongside the rolling sum of the number of years, instead of cumulative exposure as we do for the remaining (count) variables. The cumulative variables are created from the yearly variables; therefore, their names are the same, but with added prefixes that indicate the period over which the cumulative measures are taken.

Table S8 Summary statistics of variables in Table S5
Old age subsample with at least two health status observations, 50+ in the first obs.