Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# ERA-NUTS (1980-2021)
This dataset contains a set of time-series of meteorological variables based on the Copernicus Climate Change Service (C3S) ERA5 reanalysis. The data files can be downloaded from here, while notebooks and other files can be found in the associated GitHub repository.
This data has been generated with the aim of providing hourly time-series of the meteorological variables commonly used for power system modelling and, more generally, for studies on energy systems.
An example of the analysis that can be performed with ERA-NUTS is shown in this video.
Important: this dataset is still a work in progress; we will add more analyses and variables in the near future. If you spot an error or something strange in the data, please tell us by sending an email or opening an issue in the associated GitHub repository.
## Data
The time-series have hourly/daily/monthly frequency and are aggregated following the NUTS 2016 classification. NUTS (Nomenclature of Territorial Units for Statistics) is a European Union standard for referencing the subdivisions of countries (member states, candidate countries and EFTA countries).
This dataset contains NUTS0/1/2 time-series for the following variables obtained from the ERA5 reanalysis data (in brackets: the variable name on the Copernicus Data Store and its unit of measure):
- t2m: 2-meter temperature (`2m_temperature`, degrees Celsius)
- ssrd: Surface solar radiation (`surface_solar_radiation_downwards`, Watt per square meter)
- ssrdc: Surface solar radiation clear-sky (`surface_solar_radiation_downward_clear_sky`, Watt per square meter)
- ro: Runoff (`runoff`, millimeters)
There is also a set of derived variables:
- ws10: Wind speed at 10 meters (derived from `10m_u_component_of_wind` and `10m_v_component_of_wind`, meters per second)
- ws100: Wind speed at 100 meters (derived from `100m_u_component_of_wind` and `100m_v_component_of_wind`, meters per second)
- CS: Clear-Sky index (the ratio between the surface solar radiation and the clear-sky surface solar radiation)
- HDD/CDD: Heating/Cooling Degree Days (derived from the 2-meter temperature following the EUROSTAT definition)
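For reference, these derivations are straightforward to sketch in NumPy. The snippet below is illustrative only: it assumes float arrays in the units listed above, and the degree-day thresholds follow the EUROSTAT definition (heating degree days accrue when the daily mean temperature is at or below 15 °C, cooling degree days when it is at or above 24 °C):

```python
import numpy as np

def wind_speed(u, v):
    # Speed magnitude from the zonal (u) and meridional (v) components
    return np.sqrt(u**2 + v**2)

def clear_sky_index(ssrd, ssrdc):
    # Ratio of actual to clear-sky radiation; zero where the clear-sky
    # value is zero (e.g. at night) to avoid division by zero
    return np.divide(ssrd, ssrdc, out=np.zeros_like(ssrd), where=ssrdc > 0)

def degree_days(t2m_daily_mean):
    # EUROSTAT definition: HDD = 18 - Tm if Tm <= 15 degC, else 0;
    #                      CDD = Tm - 21 if Tm >= 24 degC, else 0
    hdd = np.where(t2m_daily_mean <= 15.0, 18.0 - t2m_daily_mean, 0.0)
    cdd = np.where(t2m_daily_mean >= 24.0, t2m_daily_mean - 21.0, 0.0)
    return hdd, cdd
```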
For each variable we have 367 440 hourly samples (from 01-01-1980 00:00:00 to 31-12-2021 23:00:00) for 34/115/309 regions (NUTS 0/1/2).
The data is provided in two formats:
- NetCDF version 4 (all the variables hourly and CDD/HDD daily). NOTE: the variables are stored as `int16` type using a `scale_factor` to minimise the size of the files.
- Comma-Separated Values ("single index" format for all the variables and time frequencies, "stacked" format only for daily and monthly)
All the CSV files are stored in a zipped file for each variable.
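Note that the `int16` packing is transparent when reading: `xarray` applies the stored `scale_factor` during decoding. A minimal sketch (the file name is hypothetical):

```python
import xarray as xr

# Hypothetical file name; decoding happens automatically because
# mask_and_scale=True is the default in xarray
ds = xr.open_dataset("era-nuts-t2m-hourly.nc")
print(ds["t2m"].dtype)      # floating point after decoding
print(ds["t2m"].encoding)   # shows the on-disk int16 dtype and scale_factor

# To inspect the raw packed integers instead, disable the decoding:
raw = xr.open_dataset("era-nuts-t2m-hourly.nc", mask_and_scale=False)
```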
## Methodology
The time-series have been generated using the following workflow:
1. The NetCDF files are downloaded from the Copernicus Data Store from the "ERA5 hourly data on single levels from 1979 to present" dataset
2. The data is read in R with the climate4r packages and aggregated using the function `/get_ts_from_shp` from panas. All the variables are aggregated at the NUTS boundaries using the average except for the runoff, which consists of the sum of all the grid points within the regional/national borders.
3. The derived variables (wind speed, CDD/HDD, clear-sky) are computed and all the CSV files are generated using R
4. The NetCDF files are created using `xarray` in Python 3.8.
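A hedged sketch of what step 4 can look like using `xarray`'s encoding mechanism; the file and variable names are illustrative, and the 0.01 scale factor is the value documented in an earlier release of this dataset:

```python
import xarray as xr

ds = xr.open_dataset("t2m_aggregated.nc")  # hypothetical intermediate file

# Pack the float time-series as int16 with a scale factor, as described
# in the Data section above; _FillValue marks missing samples
encoding = {"t2m": {"dtype": "int16", "scale_factor": 0.01,
                    "_FillValue": -32767, "zlib": True}}
ds.to_netcdf("era-nuts-t2m-hourly.nc", encoding=encoding)
```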
## Example notebooks
In the folder `notebooks` on the associated GitHub repository there are two Jupyter notebooks which show how to deal effectively with the NetCDF data in `xarray` and how to visualise it in several ways using matplotlib or the enlopy package.
There are currently two notebooks:
- exploring-ERA-NUTS: shows how to open the NetCDF files (with Dask) and how to manipulate and visualise them.
- ERA-NUTS-explore-with-widget: explore the datasets interactively with Jupyter and ipywidgets.
The notebook `exploring-ERA-NUTS` is also available rendered as HTML.
## Additional files
In the folder `additional files` on the associated GitHub repository there is a map showing the spatial resolution of the ERA5 reanalysis and a CSV file specifying the number of grid points within each NUTS0/1/2 region.
## License
This dataset is released under the CC BY 4.0 license.
A bike-sharing system is a service in which bikes are made available for shared use to individuals on a short-term basis, for a price or for free. Many bike-sharing systems allow people to borrow a bike from a "dock", which is usually computer-controlled; the user enters payment information and the system unlocks a bike. The bike can then be returned to another dock belonging to the same system.
BoomBikes, a US bike-sharing provider, has recently suffered a considerable dip in revenue due to the COVID-19 pandemic. The company is finding it very difficult to sustain itself in the current market scenario, so it has decided to come up with a mindful business plan to accelerate its revenue.
In such an attempt, BoomBikes aspires to understand the demand for shared bikes among people. It plans to prepare itself to cater to people's needs once the situation improves, to stand out from other service providers, and to make large profits.
They have contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, they want to understand the factors affecting the demand for shared bikes in the American market. The company wants to know:
- which variables are significant in predicting the demand for shared bikes, and
- how well those variables describe the bike demand.
Based on various meteorological surveys and people's lifestyles, the service-provider firm has gathered a large dataset on daily bike demand across the American market.
You are required to model the demand for shared bikes using the available independent variables. The model will be used by management to understand how exactly demand varies with different features, so they can adjust the business strategy to meet demand levels and customer expectations. Further, the model will help management understand the demand dynamics of a new market.
In the dataset provided, you will notice three columns named 'casual', 'registered', and 'cnt'. The variable 'casual' indicates the number of casual users who have made a rental. The variable 'registered', on the other hand, shows the total number of registered users who have made a booking on a given day. Finally, the 'cnt' variable indicates the total number of bike rentals, including both casual and registered users. The model should be built taking 'cnt' as the target variable.
When you're done with model building and residual analysis and have made predictions on the test set, just make sure you use the following two lines of code to calculate the R-squared score on the test set.
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
```
- where `y_test` is the test data for the target variable, and `y_pred` is the variable containing the predicted values of the target variable on the test set.
- Please perform this step, as the R-squared score on the test set serves as a benchmark for your model.
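For context, here is a minimal end-to-end sketch leading up to that benchmark, assuming the daily data sits in a CSV with the 'cnt', 'casual', and 'registered' columns described above; the file name, the split parameters, and the assumption that the remaining columns are numeric are all illustrative:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("day.csv")  # hypothetical file name

# 'casual' and 'registered' sum to the target 'cnt', so they would leak
# the answer; drop them along with the target itself. Assumes the
# remaining columns are numeric (encode dates/categoricals first).
X = df.drop(columns=["cnt", "casual", "registered"])
y = df["cnt"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # the benchmark required above
```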
[NOTE - 11/24/2021: this dataset supersedes an earlier version, https://doi.org/10.15482/USDA.ADC/1518654]
Data sources. Time series data on cattle fever tick incidence, 1959-2020, and climate variables from January 1950 through December 2020 form the core information in this analysis. All variables are monthly averages or sums over the fiscal year, October 01 (of the prior calendar year, y-1) through September 30 of the current calendar year (y). Annual records on monthly new detections of Rhipicephalus microplus and R. annulatus (cattle fever tick, CFT) on premises within the Permanent Quarantine Zone (PQZ) were obtained from the Cattle Fever Tick Eradication Program (CFTEP), maintained jointly by the United States Department of Agriculture (USDA) Animal and Plant Health Inspection Service and the USDA Agricultural Research Service in Laredo, Texas. Details of tick survey procedures, CFTEP program goals and history, and the geographic extent of the PQZ are given in the main text and in the Supporting Information (SI) of the associated paper. Data sources on oceanic indicators and local meteorology, and their pretreatment, are detailed in the SI.
Data pretreatment. To address the low signal-to-noise ratio and non-independence of observations common in time series, we transformed all explanatory and response variables using six consecutive steps: (i) first differences (year y minus year y-1) were calculated; (ii) these were converted to z-scores (z = (x − μ) / σ, where x is the raw value, μ is the population mean, and σ is the population standard deviation); (iii) linear regression was applied to remove any directional trends; (iv) moving averages (typically 11-year point-centered moving averages) were calculated for each variable; (v) a lag was applied when deemed necessary; and (vi) statistics were calculated (r, n, df, P<, p<).
Principal component analysis (PCA). A matrix of z-score first differences of the 13 climate variables and CFT (1960-2020) was entered into the XLSTAT principal components analysis routine; we used Pearson correlation of the 14 x 60 matrix and Varimax rotation of the first two components.
Autoregressive Integrated Moving Average (ARIMA). An ARIMA(2,0,0) model was selected among 7 test models in which the p, d, and q terms were varied; selection was made on the basis of the lowest RMSE and AIC statistics and reduction of partial autocorrelation outcomes. A best-model linear regression of CFT values on ARIMA-predicted CFT was developed using XLSTAT linear regression software with the objective of examining statistical properties (r, n, df, P<, p<), including the Durbin-Watson index of order-1 autocorrelation and Cook's Di distance index. Cross-validation of the model was made by withholding the last 30, and then the first 30, observations in a pair of regressions.
Forecast of the next major CFT outbreak. It is generally recognized that the onset year of the first major CFT outbreak was not 1959 but may have occurred earlier in the decade. We postulated that the actual underlying pattern is fully 44 years from the start to the end of a CFT cycle linked to external climatic drivers (SI Appendix, Hypothesis on CFT cycles). The hypothetical reconstruction was projected one full CFT cycle into the future.
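A hedged Python/pandas sketch of the six pretreatment steps (the original analysis was carried out in XLSTAT; the function name and the default 11-year window here are illustrative):

```python
import numpy as np
import pandas as pd

def pretreat(series: pd.Series, window: int = 11, lag: int = 0) -> pd.Series:
    # (i) first differences: year y minus year y-1
    d = series.diff().dropna()
    # (ii) z-scores with population statistics, z = (x - mu) / sigma
    z = (d - d.mean()) / d.std(ddof=0)
    # (iii) remove any directional (linear) trend
    t = np.arange(len(z))
    slope, intercept = np.polyfit(t, z.to_numpy(), 1)
    detrended = z - (slope * t + intercept)
    # (iv) point-centered moving average (typically 11 years)
    smoothed = detrended.rolling(window, center=True).mean()
    # (v) apply a lag if deemed necessary; (vi) statistics are then
    # computed downstream on the returned series
    return smoothed.shift(lag)
```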
To substantiate the projected trend, we generated a power spectrum analysis based on 1-year values of the 1959-2020 CFT dataset using SYSTAT AutoSignal software. The outcome included a forecast to 2100; this was compared to the hypothetical reconstruction and projection. Any differences were noted, and the start and end dates of the next major CFT outbreak were identified.
Resources in this dataset:
- CFT and climate data. File name: climate-cft-data2.csv. Main dataset; see the data dictionary for information on each column.
- Data dictionary (metadata). File name: climate-cft-metadata2.csv. Information on variables and their origin.
- Fitted models. File name: climate-cft-models2.xlsx. Recommended software: Microsoft Excel (https://www.microsoft.com/en-us/microsoft-365/excel); XLSTAT (https://www.xlstat.com/en/); SYSTAT AutoSignal (https://www.systat.com/products/AutoSignal/)
CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in five U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit. Additional tables contain information on the impact of sampling effort on richness, a rareness table of species per dataset, and two summary tables covering bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript, but are also reproduced here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R Foundation for Statistical Computing, 2019), and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R to yield a final dataset with all species observations and environmental variables for each observation unit.
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Scripts used for analysis of V1 and V2 datasets:
- seurat_v1.R: initialize a Seurat object from 10X Genomics Cell Ranger outputs. Includes filtering, normalization, regression, variable gene identification, PCA analysis, clustering, and tSNE visualization. Used for v1 datasets.
- merge_seurat.R: merge two or more Seurat objects into one. Performs linear regression to remove batch effects from separate objects. Used for v1 datasets.
- subcluster_seurat_v1.R: subcluster clusters of interest from a Seurat object. Determines variable genes, performs regression and PCA. Used for v1 datasets.
- seurat_v2.R: initialize a Seurat object from 10X Genomics Cell Ranger outputs. Includes filtering, normalization, regression, variable gene identification, and PCA analysis. Used for v2 datasets.
- clustering_markers_v2.R: clustering and tSNE visualization for v2 datasets.
- subcluster_seurat_v2.R: subcluster clusters of interest from a Seurat object. Determines variable genes, performs regression and PCA analysis. Used for v2 datasets.
- seurat_object_analysis_v1_and_v2.R: downstream analysis and plotting functions for the Seurat object created by seurat_v1.R or seurat_v2.R.
- merge_clusters.R: merge clusters that do not meet a gene threshold. Used for both v1 and v2 datasets.
- prepare_for_monocle_v1.R: subcluster cells of interest and perform linear regression, but not scaling, in order to input normalized, regressed values into Monocle with monocle_seurat_input_v1.R.
- monocle_seurat_input_v1.R: Monocle script using Seurat batch-corrected values as input for v1 merged timecourse datasets.
- monocle_lineage_trace.R: Monocle script using nUMI as input for the v2 lineage-traced dataset.
- monocle_object_analysis.R: downstream analysis for the Monocle object, including BEAM and plotting.
- CCA_merging_v2.R: script for merging v2 endocrine datasets with canonical correlation analysis and determining the number of CCs to include in downstream analysis.
- CCA_alignment_v2.R: script for downstream alignment, clustering, tSNE visualization, and differential gene expression analysis.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on time series analysis.
Introduction
Time series are a special class of dataset, where a response variable is tracked over time. The frequency of measurement and the timespan of the dataset can vary widely. At its simplest, a time series model includes an explanatory time component and a response variable. Mixed models can include additional explanatory variables (check out the `nlme` and `lme4` R packages). We will be covering a few simple applications of time series analysis in these lessons.
Opportunities
Analysis of time series presents several opportunities. In aquatic sciences, some of the most common questions we can answer with time series modeling are:
Can we forecast conditions in the future?
Challenges
Time series datasets come with several caveats, which need to be addressed in order to effectively model the system. A few common challenges that arise (and can occur together within a single dataset) are:
Autocorrelation: Data points are not independent from one another (i.e., the measurement at a given time point is dependent on previous time point(s)).
Data gaps: Data are not collected at regular intervals, necessitating interpolation between measurements. There are often gaps between monitoring periods. For many time series analyses, we need equally spaced points.
Seasonality: Cyclic patterns in variables occur at regular intervals, impeding clear interpretation of a monotonic (unidirectional) trend. For example, we can expect summer temperatures to be consistently higher than winter temperatures.
Heteroscedasticity: The variance of the time series is not constant over time.
Covariance: The covariance of the time series is not constant over time. Many time series models assume that the variance and covariance are constant over time (compare heteroscedasticity above).
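Although these lessons work in R, the same diagnostics are easy to sketch in Python with statsmodels: a seasonal decomposition exposes trend and seasonality, while an autocorrelation plot reveals dependence between time points. The file and column names below are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf

# Hypothetical monthly series; any regularly spaced pandas Series works
ts = pd.read_csv("lake_temperature.csv", parse_dates=["date"],
                 index_col="date")["temp_C"]
ts = ts.asfreq("MS").interpolate()   # enforce equal spacing, fill small gaps

# Trend / seasonal / residual panels make seasonality visible
seasonal_decompose(ts, model="additive", period=12).plot()

# Autocorrelation plot: significant bars at low lags indicate
# non-independent observations
plot_acf(ts, lags=36)
plt.show()
```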
Learning Objectives
After successfully completing this notebook, you will be able to:
Choose appropriate time series analyses for trend detection and forecasting
Discuss the influence of seasonality on time series analysis
Interpret and communicate results of time series analyses
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the MODIS cloud data analyzed in Sengupta et al. (2012) and Sengupta et al. (2016). The variables that were modeled are the final Q-values, which reflect the confidence of clear sky for a pixel located at the specified latitude and longitude. For more details about the MODIS cloud data, please see Sengupta et al. (2016).
Format: the file mod35_2006180_1245_test_output.out contains the following columns, in this order, for each pixel:
- scan line number
- pixel number
- surface type
- latitude
- longitude
- initial Q values
- final Q values
The surface types are coded as follows: 0 = water, 1 = coast, 2 = desert, 3 = land. Sengupta et al. (2016) also made use of a sun-glint indicator variable, provided in the file SG.mat.
References:
Sengupta, A., Cressie, N., Frey, R., and Kahn, B. H. (2012). "Statistical modeling of MODIS cloud data using the Spatial Random Effects model." In Proceedings of the 2012 Joint Statistical Meetings, Alexandria, VA: American Statistical Association, 3111-3123.
Sengupta, A., Cressie, N., Kahn, B. H., and Frey, R. (2016). "Predictive inference for big, spatial, non-Gaussian data: MODIS cloud data and its change-of-support." Australian & New Zealand Journal of Statistics, 58, 15-45.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: a new time-series dataset from ERA5 has been published; this one will no longer be updated or maintained.
Country averages of meteorological variables generated using the R routines available in the package panas, based on the Copernicus Climate Change Service ERA5 reanalysis. The time-series are at hourly resolution and the included variables are:
2-meter temperature (t2m),
snow depth (snow_depth),
mean sea-level pressure (mslp),
runoff,
surface solar radiation (ssrd),
surface solar radiation with clear-sky (ssrdc),
temperature at 850hPa (t850),
total precipitation (total_prec),
zonal (west-east direction) wind speed at 10m (u10) and 100m (u100),
meridional (north-south direction) wind speed at 10m (v10) and 100m (v100),
dew point temperature (dew)
The original gridded data has been averaged considering the national borders of the following countries (European 2-letter country codes are used, i.e. ISO 3166 alpha-2 codes, with the exceptions GB->UK and GR->EL): AL, AT, BA, BE, BG, BY, CH, CY, CZ, DE, DK, DZ, EE, EL, ES, FI, FR, HR, HU, IE, IS, IT, LT, LU, LV, MD, ME, MK, NL, NO, PL, PT, RO, RS, SE, SI, SK, UA, UK.
The units of measure used here are listed on the official page: https://cds.climate.copernicus.eu/cdsapp#!/dataset/era5-hourly-data-on-single-levels-from-2000-to-2017?tab=overview
The script used to generate the files is available on GitHub here
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Material for
Human Wellbeing and Machine Learning
by Ekaterina Oparina (r) Caspar Kaiser (r) Niccolò Gentile; Alexandre Tkatchenko, Andrew E. Clark, Jan-Emmanuel De Neve and Conchita D'Ambrosio
This repository contains the list of variables used in the Extended Set analysis for the German Socio-Economic Panel (SOEP), the UK Household Longitudinal Study (UKHLS), and the American Gallup Daily Poll. The variables are grouped into categories; a summary table is reported at the beginning of the document. We use the 2013 wave of Gallup and SOEP, and Wave 3 of the UKHLS (which covers 2011-2012). Our dataset includes all of the available variables, apart from direct measures of subjective wellbeing (such as domain satisfaction, happiness, or subjective health), mental health measures, and technical variables (e.g. ID numbers). We also exclude variables with more than 50% missing values.
The presented lists include the variables before processing. For the analysis, we convert categorical variables into a set of dummies, one for each category. We then drop all perfectly collinear variables.
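The dummy-coding step described above can be illustrated with pandas (the column names are hypothetical; the authors' actual pipeline is not included in this repository):

```python
import pandas as pd

# Hypothetical survey extract with one categorical predictor
df = pd.DataFrame({"employment": ["full-time", "part-time", "unemployed",
                                  "full-time"],
                   "age": [34, 51, 28, 45]})

# One dummy per category, as described above
dummies = pd.get_dummies(df, columns=["employment"])

# With all K category dummies present, they sum to one and are perfectly
# collinear with an intercept; drop_first=True removes one level
dummies = pd.get_dummies(df, columns=["employment"], drop_first=True)
print(dummies)
```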
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset supplements the publication:
Schütz, I. & Einhäuser, W. (2018). Visual awareness in binocular rivalry modulates induced pupil fluctuations.
Raw data are available in two formats: the actual raw EDF data files as returned by the eyetracking device (.edf) for reference, and as MATLAB files into which all relevant information has been extracted (.mat) for analysis.
Variables in the pft_*.mat files include:
- raw eye position in the variable scan (t, x, y, p), where t is the timestamp of the eyetracker, x/y the position on the screen, and p the pupil diameter in arbitrary units
- fixation, saccade and blink events in the variables fix, sac and blink
- raw button presses in the cell arrays buttonDn and buttonUp (6 = left button, 7 = right button), including timestamps
- stimulus presentation cycle timestamps in the variable stimperiod
- timestamps for the auditory attentional instruction in the variable attends
For analysis, data from all participants and sessions is imported into pftdata.mat, where it is stored in cell arrays of the format "samples{subject_no, condition}".
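A minimal sketch for reading the extracted .mat files in Python, assuming they are stored in a pre-v7.3 MAT format that scipy can read; the file name follows the pft_* pattern but is otherwise hypothetical:

```python
from scipy.io import loadmat

# Assumes the .mat files use a pre-v7.3 format readable by scipy;
# 'pft_01.mat' is a hypothetical name following the pft_* pattern
data = loadmat("pft_01.mat")

scan = data["scan"]      # columns: t, x, y, p (pupil diameter, a.u.)
t, x, y, p = scan.T      # assumes one sample per row
print(p.mean())          # e.g. mean pupil diameter over the recording
```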
rawdata.zip
analysis.zip
Run the following functions in the listed order to reproduce figures and data in results/.
preprocessing.m:
figure1_methods.m:
figure2_example_plot.m:
figure3_averaged_response.m
figure4_complex_plane.m:
stats_auc_decoding.m:
pft_statistics.R:
results/:
- Paper figures (not layouted): figure1.tif, figure2.png, figure3.png, figure4.png
- stats_responses.txt: behavioral analyses results
- stats_Zvalues.txt: complex plane analysis results
- stats_AUC.txt: moment-by-moment AUC decoding results
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available, but it has not previously been released as a documented, useful aggregation. The information includes: (a) the names of schools, (b) staffing levels, including full-time and part-time teaching and non-teaching staff, (c) student enrolments, including the number of boys and girls, (d) school financial information, including Commonwealth government, state government, and private funding, and (e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme known by the trademark 'NAPLAN'.
Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.
The database was constructed directly from the other included files by running the following command at a command-line prompt: sqlite3 school-data-2016-1.db < school-data-2016-1.sql. Note that a few non-consequential errors will be reported if you run this command yourself. The reason for the errors is that the SQLite database is created by importing a series of '.csv' files, each of which contains a header line with the names of the variables relevant to each column. This information is useful for many statistical packages, but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.
Briefly, the data are organized as follows. (1) The .csv files ('comma separated values') do not actually use a comma as the field delimiter; instead, the vertical bar character '|' (ASCII octal 174, decimal 124, hex 7C) is used. If you read the .csv files using Microsoft Excel, OpenOffice, or LibreOffice, you will need to set the field separator to '|'; check your software documentation to understand how to do this. (2) Each school-related record is indexed by an identifier called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOIN-ing records in different data files. For example, the first school-related record after the header line in the file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the ageid 40000 corresponds to a school called 'Corpus Christi Catholic School'. (3) In addition to the variable 'ageid', each record is also identified by one or two 'year' variables. The most important purpose of a year identifier is to indicate the year that is relevant to the record. For example, turning again to the file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid 40000. The variable that identifies the important differences between these seven records is 'studentsyear', which shows the year to which the student data refer. One can see, for example, that in 2008 there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line). (4) The variables relating to years are given different names in each of the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information across files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might wish to JOIN records using 'ageid' in the two files and, secondarily, matching 'studentsyear' with 'financesummaryyear'. (5) The manipulation of the data is most readily done using the SQL language with the SQLite database, but it can also be done in a variety of statistical packages. (6) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form the user most wants. (7) Geocoding of the schools is not available in this edition. (8) Some files, such as 'sector-headed-bar.csv', are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy. (9) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org), can be seen in the file 'school-data-with-sqlite.R'.
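As an illustration of the two-level JOIN described in points (2)-(4), here is a hedged sketch using Python's built-in sqlite3 module; the table names are inferred from the .csv file names and should be verified against 'school-data-2016-1.sql':

```python
import sqlite3

# Table and column names are inferred from the description above and must
# be verified against school-data-2016-1.sql; this is a sketch, not the
# dataset's official example query.
con = sqlite3.connect("school-data-2016-1.db")
rows = con.execute("""
    SELECT s.ageid, s.studentsyear, f.financesummaryyear
    FROM students AS s
    JOIN financesummary AS f
         ON s.ageid = f.ageid                      -- first-level join key
        AND s.studentsyear = f.financesummaryyear  -- second-level year match
    WHERE s.ageid = 40000                          -- the example school above
""").fetchall()
for row in rows:
    print(row)
con.close()
```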