Dataset
Synthetic Hourly Air Pollution Prediction Averages for England (SynthHAPPE) version 2
Abstract
This dataset contains synthetic estimates of ambient air pollution concentrations across England, provided as hourly averages representing typical conditions. The data cover major pollutants, including Nitrogen Dioxide (NO2), Nitric Oxide (NO), Nitrogen Oxides (NOx), Ozone (O3), Particulate Matter smaller than 10 micrometres (PM10) and smaller than 2.5 micrometres (PM2.5), and Sulphur Dioxide (SO2). For each pollutant, the dataset provides the predicted mean concentration together with estimates at the 5th percentile (lower), 50th percentile (median), and 95th percentile (upper), which indicate typical and potentially extreme pollution levels.
The spatial coverage of the dataset includes the entire area of England, structured as an evenly spaced grid, with each grid square covering an area of 1 square kilometre (1 km^2). Data points correspond to the centre of these grid squares. Temporally, the dataset does not represent actual hourly measurements from specific dates; instead, it provides aggregated "typical day" profiles constructed by averaging observations collected over multiple years (2014-2018) for each combination of month, day of the week, and hour. This method offers representative insights into typical air pollution patterns, avoiding the complexity of handling large-scale raw datasets.
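The "typical day" aggregation described above can be illustrated with a minimal sketch. It assumes a plain list of (timestamp, concentration) pairs for a single grid cell; the function name `typical_day_profile` and the toy values are hypothetical and are not taken from the dataset or its accompanying software.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def typical_day_profile(records):
    """Average hourly values by (month, weekday, hour).

    `records` is an iterable of (timestamp, concentration) pairs for one
    grid cell. Weekday follows datetime.weekday(), so Monday is 0.
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        key = (timestamp.month, timestamp.weekday(), timestamp.hour)
        buckets[key].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Toy hourly series (made-up values) spanning 2014-2018 for one cell.
start = datetime(2014, 1, 1)
end = datetime(2019, 1, 1)
n_hours = int((end - start).total_seconds() // 3600)
toy_series = [
    (start + timedelta(hours=i), 20.0 + (i % 24)) for i in range(n_hours)
]

profile = typical_day_profile(toy_series)

# 12 months x 7 weekdays x 24 hours = 2016 time slots per grid cell.
print(len(profile))
```

Under this scheme, five years of hourly values for a grid cell collapse to 2016 entries per pollutant and statistic, which is what keeps the published files small relative to the raw hourly predictions.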
These pollution estimates were produced using a supervised machine learning method, which is a computational approach where algorithms are trained to identify patterns in historical data and apply these learned patterns to predict new data points. The predictions incorporated various environmental factors including weather conditions (e.g., temperature, wind, precipitation), human activities (traffic patterns), satellite measurements, land-use types (urban, rural, industrial areas), and emission inventories (datasets detailing pollutants released into the atmosphere). Additionally, the dataset provides uncertainty intervals through percentile-based estimates, giving users insights into the reliability of the predictions.
The dataset was developed to give diverse stakeholders, such as researchers, policymakers, urban planners, and health professionals, easier access to high-quality air pollution information. By providing clear, simplified air quality scenarios, it helps users make informed decisions in urban planning, public health, environmental management, and policy development, and assess the potential impacts of air pollution and of interventions to reduce it.
The dataset was created by Liam J. Berrisford at the University of Exeter during his PhD studies, supported by the UK Research and Innovation (UKRI) Centre for Doctoral Training in Environmental Intelligence. Full methodological details and data validation information are available in the associated open-access scientific publication. For more information about the data, see the README.md archived alongside this dataset.
In terms of completeness, this dataset intentionally provides representative hourly pollution estimates rather than exact historical measurements or specific pollution events. While it extensively covers typical pollution scenarios across England, direct measurements from specific air quality monitoring stations are not included. Users requiring detailed historical observations or data about specific events should refer to original monitoring station datasets.
Details
| Previous Info: | No news update for this record |
|---|---|
| Previously used record identifiers: | No related previous identifiers. |
| Access rules: | Public data: access to these data is available to both registered and non-registered users. Use of these data is covered by the following licence(s): http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/ When using these data you must cite them correctly using the citation given on the CEDA Data Catalogue record. |
| Data lineage: | This dataset was produced through a research initiative at the University of Exeter. It was created using a supervised machine learning model, a type of artificial intelligence that learns patterns from existing data and applies those patterns to predict unknown values. Historical data from ambient air pollution monitoring stations across England were used to train the machine learning model. Input data included meteorological factors (such as wind speed and temperature), traffic activity levels, land-use information, satellite-based measurements, and emissions data. After the initial creation of hourly air pollution predictions, these data were aggregated to generate representative "typical day" hourly averages for each month and weekday. Predictions were validated for accuracy and reliability, with uncertainty intervals (5th, 50th, and 95th percentiles) calculated to indicate confidence in the estimates. |
| Data Quality: | The quality of this dataset was ensured through rigorous validation and uncertainty quantification practices. The data were generated using supervised machine learning methods, which involved comprehensive testing and validation against real-world measurements from official ambient air quality monitoring stations across England. Quality assessments included comparisons between predicted concentrations and actual measured values, statistical analyses of prediction errors, and evaluations of the reliability of the estimates provided. Predictions were produced alongside uncertainty intervals (expressed as the 5th, 50th, and 95th percentiles), enabling transparent communication about the confidence associated with each pollution estimate. All datasets were formatted according to established scientific data standards (NetCDF format), ensuring consistency, interoperability, and ease of use for stakeholders. |
| File Format: | NetCDF |
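The Data Quality entry above mentions comparing predictions with measured values and analysing prediction errors. The sketch below shows one common way such a comparison can be summarised. The choice of metrics (root mean squared error, mean bias, and the share of observations falling inside the 5th to 95th percentile interval) is an assumption made for illustration, not a statement of the metrics used by the dataset's authors, and all numbers are made up.

```python
from math import sqrt


def validation_summary(observed, predicted_mean, lower_5th, upper_95th):
    """Summarise prediction errors and interval coverage (illustrative)."""
    n = len(observed)
    # Error is prediction minus observation, so positive means over-prediction.
    errors = [p - o for p, o in zip(predicted_mean, observed)]
    rmse = sqrt(sum(e * e for e in errors) / n)
    mean_bias = sum(errors) / n
    # Count observations that fall inside the 5th-95th percentile interval.
    inside = sum(
        1
        for o, low, high in zip(observed, lower_5th, upper_95th)
        if low <= o <= high
    )
    return {"rmse": rmse, "mean_bias": mean_bias, "coverage": inside / n}


# Hypothetical station measurements and matching predictions.
observed = [18.0, 25.0, 40.0, 12.0]
predicted = [20.0, 24.0, 35.0, 13.0]
lower = [10.0, 15.0, 20.0, 5.0]
upper = [30.0, 35.0, 38.0, 20.0]

summary = validation_summary(observed, predicted, lower, upper)
print(summary)
```

If the percentile estimates are well calibrated, roughly 90% of measurements should fall between the 5th and 95th percentile values, so a coverage figure far from 0.9 would suggest intervals that are too narrow or too wide.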
Process overview
| Title | Machine-Learning-Based Prediction and Aggregation of Air Pollution Estimates into "Typical Day" Profiles |
| Abstract | The dataset was created using a supervised machine-learning pipeline. The pipeline generates air pollution concentration predictions across a grid of 1 km^2 cells covering England, subsequently aggregated to form representative "typical" hourly cycles for each day of the week and month. This approach simplifies downstream use cases such as policy assessment and public communication. The underlying methodology is implemented in the accompanying open-source Python package Environmental Insights, available at https://github.com/berrli/Environmental-Insights |
| Input Description | None |
| Output Description | None |
| Software Reference | None |
No variables found.
Temporal Range
| Start time | 2014-01-01T00:00:00 |
| End time | 2018-12-31T00:00:00 |
Geographic Extent
| North | 55.8100° |
| West | -5.7200° |
| East | 1.7600° |
| South | 49.9600° |
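As a quick illustration of how the geographic extent might be used, the sketch below checks whether a point lies inside the bounding box. It reads the four values as the northern (55.81), western (-5.72), eastern (1.76), and southern (49.96) limits in decimal degrees, which is an assumption based on the usual layout of such records. The `BoundingBox` class and the example coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Latitude/longitude rectangle in decimal degrees."""

    north: float
    south: float
    west: float
    east: float

    def contains(self, lat, lon):
        """Return True if the point lies inside or on the edge of the box."""
        return (
            self.south <= lat <= self.north
            and self.west <= lon <= self.east
        )


# Limits taken from the record, with the assumed orientation noted above.
ENGLAND_EXTENT = BoundingBox(north=55.81, south=49.96, west=-5.72, east=1.76)

# Approximate coordinates of two cities, used only as examples.
exeter_inside = ENGLAND_EXTENT.contains(50.73, -3.53)
aberdeen_inside = ENGLAND_EXTENT.contains(57.15, -2.11)
print(exeter_inside, aberdeen_inside)
```

A point inside this rectangle is not guaranteed to have data: the box also encloses sea and areas outside England, so the check is only a first filter before looking up a grid cell.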