MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This folder contains processed and derived data and the script for the manuscript 'Detecting synthetic population bias using a spatially-oriented framework and independent validation data'.
Abstract: Models of human mobility can be broadly applied to find solutions addressing diverse topics such as public health policy, transportation management, emergency management, and urban development. However, many mobility models require individual-level data that is limited in availability and accessibility. Synthetic populations are commonly used as the foundation for mobility models because they provide detailed individual-level data representing the different types and characteristics of people in a study area. Thorough evaluation of synthetic populations is required to detect data biases before they are transferred to subsequent applications. Although synthetic populations are commonly used for modeling mobility, they are conventionally validated by their sociodemographic characteristics rather than their mobility attributes. Mobility microdata provide an opportunity to independently (externally) validate the mobility attributes of synthetic populations. This study demonstrates a spatially-oriented data validation framework and independent data validation to assess the mobility attributes of two synthetic populations at different spatial granularities. Validation using independent data (SafeGraph) and the validation framework replicated the spatial distribution of errors detected using source data (LODES) and total absolute error. Spatial clusters of error exposed the locations of underrepresented and overrepresented communities. This information can guide bias mitigation efforts to generate a more representative synthetic population.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Motivation
Increasing heat stress due to climate change poses significant risks to human health and can lead to widespread social and economic consequences. Evaluating these impacts requires reliable datasets of heat stress projections.
Data Record
We present a global dataset projecting future dry-bulb, wet-bulb, and wet-bulb globe temperatures under 1-4°C global warming scenarios (at 0.5°C intervals) relative to the preindustrial era, using outputs from 16 CMIP6 global climate models (GCMs) (Table 1). All variables were retrieved from the historical and SSP585 scenarios which were selected to maximize the warming signal.
The dataset was bias-corrected against ERA5 reanalysis by incorporating the GCM-simulated climate change signal onto the ERA5 baseline (1950-1976) at a 3-hourly frequency. It therefore includes a 27-year sample for each GCM under each warming target.
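As a rough illustration of what adding a GCM-simulated change signal to an observational baseline can look like, the sketch below shifts every baseline value by the difference between a GCM's warmed-period mean and its reference-period mean. This additive mean shift is an assumption made for illustration only; the procedure actually used for the dataset is documented in the data descriptor and may differ (for example, by operating on quantiles). The function name and the sample values are hypothetical.

```python
from statistics import mean


def add_change_signal(era5_baseline, gcm_reference, gcm_warmed):
    """Shift each ERA5 baseline value by the GCM's mean change.

    Assumption: the change signal is a single additive offset
    (warmed-period mean minus reference-period mean). This is a
    simplification, not the dataset's documented method.
    """
    delta = mean(gcm_warmed) - mean(gcm_reference)
    return [value + delta for value in era5_baseline]


# Hypothetical temperatures in degrees Celsius.
era5 = [20.0, 25.0, 30.0]
gcm_ref = [18.0, 22.0, 26.0]
gcm_warm = [20.0, 24.0, 28.0]

shifted = add_change_signal(era5, gcm_ref, gcm_warm)
print(shifted)
```

Because only the offset comes from the GCM, the shifted series keeps the timing and variability of the ERA5 baseline, which is consistent with the dataset providing a 27-year sample for each GCM and warming target.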
The data is provided at a fine spatial resolution of 0.25° x 0.25° and a temporal resolution of 3 hours, and is stored in a self-describing NetCDF format. Filenames follow the pattern "VAR_bias_corrected_3hr_GCM_XC_yyyy.nc", where:
"VAR" represents the variable (Ta, Tw, WBGT for dry-bulb, wet-bulb, and wet-bulb globe temperature, respectively),
"GCM" denotes the CMIP6 GCM name,
"X" indicates the warming target compared to the preindustrial period,
"yyyy" represents the year index (0001-0027) of the 27-year sample
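The naming pattern above can be parsed and generated programmatically. The following is a minimal sketch, assuming that the warming target "X" is written as a plain decimal number (for example "2" or "1.5") and that GCM names contain only letters, digits, and hyphens, as in Table 1. The names `HeatFile`, `parse_filename`, and `build_filename` are hypothetical helpers, not part of the dataset; check the repository inventory for the exact spelling of the warming target before relying on this.

```python
import re
from typing import NamedTuple

# Assumed layout: VAR_bias_corrected_3hr_GCM_XC_yyyy.nc
FILENAME_RE = re.compile(
    r"^(?P<var>Ta|Tw|WBGT)_bias_corrected_3hr_"
    r"(?P<gcm>[A-Za-z0-9-]+)_"
    r"(?P<target>\d+(?:\.\d+)?)C_"   # assumption: decimal target such as 1.5
    r"(?P<year>\d{4})\.nc$"
)


class HeatFile(NamedTuple):
    variable: str            # Ta, Tw, or WBGT
    gcm: str                 # CMIP6 model name
    warming_target_c: float  # warming relative to preindustrial
    year_index: int          # 1..27 within the 27-year sample


def parse_filename(name: str) -> HeatFile:
    """Split a dataset filename into its components."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    year = int(match["year"])
    if not 1 <= year <= 27:
        raise ValueError(f"year index out of range: {year}")
    return HeatFile(match["var"], match["gcm"], float(match["target"]), year)


def build_filename(variable: str, gcm: str, target: float, year: int) -> str:
    """Compose a filename following the assumed pattern."""
    return f"{variable}_bias_corrected_3hr_{gcm}_{target:g}C_{year:04d}.nc"


example = build_filename("WBGT", "ACCESS-CM2", 1.5, 3)
print(example)
print(parse_filename(example))
```

Parsing into a named tuple makes it straightforward to filter an inventory listing by variable, model, or warming target before requesting a transfer.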
Table 1 CMIP6 GCMs used for generating the dataset for Ta, Tw and WBGT.
GCM | Realization | GCM grid spacing | Ta | Tw | WBGT
ACCESS-CM2 | r1i1p1f1 | 1.25° x 1.875° | ✓ | ✓ | ✓
BCC-CSM2-MR | r1i1p1f1 | 1.1° x 1.125° | ✓ | ✓ | ✓
CanESM5 | r1i1p2f1 | 2.8° x 2.8° | ✓ | ✓ | ✓
CMCC-CM2-SR5 | r1i1p1f1 | 0.94° x 1.25° | ✓ | ✓ | ✓
CMCC-ESM2 | r1i1p1f1 | 0.94° x 1.25° | ✓ | ✓ | ✓
CNRM-CM6-1 | r1i1p1f2 | 1.4° x 1.4° | ✓ | ✓ |
EC-Earth3 | r1i1p1f1 | 0.7° x 0.7° | ✓ | ✓ | ✓
GFDL-ESM4 | r1i1p1f1 | 1.0° x 1.25° | ✓ | ✓ | ✓
HadGEM3-GC31-LL | r1i1p1f3 | 1.25° x 1.875° | ✓ | ✓ | ✓
HadGEM3-GC31-MM | r1i1p1f3 | 0.55° x 0.83° | ✓ | ✓ | ✓
KACE-1-0-G | r1i1p1f1 | 1.25° x 1.875° | ✓ | ✓ | ✓
KIOST-ESM | r1i1p1f1 | 1.9° x 1.9° | ✓ | ✓ | ✓
MIROC-ES2L | r1i1p1f2 | 2.8° x 2.8° | ✓ | ✓ | ✓
MIROC6 | r1i1p1f1 | 1.4° x 1.4° | ✓ | ✓ | ✓
MPI-ESM1-2-HR | r1i1p1f1 | 0.93° x 0.93° | ✓ | ✓ | ✓
MPI-ESM1-2-LR | r1i1p1f1 | 1.85° x 1.875° | ✓ | ✓ | ✓
Data Access
An inventory of the dataset is available in this repository. The complete dataset, approximately 57 TB in size, is freely accessible via Purdue Fortress' long-term archive through Globus at Globus Link. After clicking the link, users may be prompted to log in with a Purdue institutional Globus account. You can switch to your institutional account, or log in via a personal Globus ID, Gmail, GitHub handle, or ORCID ID. Alternatively, the dataset can be accessed by searching for the universally unique identifier (UUID): "6538f53a-1ea7-4c13-a0cf-10478190b901" in Globus.
Dataset Validation
We validate the bias-correction method and show that it significantly enhances the GCMs' accuracy in reproducing both the annual average and the full range of quantiles for all metrics within an ERA5 reference climate state. This dataset is expected to support future research on projected changes in mean and extreme heat stress and the assessment of related health and socio-economic impacts.
For a detailed introduction to the dataset and its validation, please refer to our data descriptor currently under review at Scientific Data. We will update this information upon publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The EWEMBI dataset was compiled to support the bias correction of climate input data for the impact assessments carried out in phase 2b of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP2b; Frieler et al., 2017), which will contribute to the 2018 IPCC special report on the impacts of global warming of 1.5°C above pre-industrial levels and related global greenhouse gas emission pathways. The EWEMBI data cover the entire globe at 0.5° horizontal and daily temporal resolution from 1979 to 2013. Data sources of EWEMBI are ERA-Interim reanalysis data (ERAI; Dee et al., 2011), WATCH forcing data methodology applied to ERA-Interim reanalysis data (WFDEI; Weedon et al., 2014), eartH2Observe forcing data (E2OBS; Calton et al., 2016) and NASA/GEWEX Surface Radiation Budget data (SRB; Stackhouse Jr. et al., 2011). The SRB data were used to bias-correct E2OBS shortwave and longwave radiation (Lange, 2018). Variables included in the EWEMBI dataset are Near Surface Relative Humidity, Near Surface Specific Humidity, Precipitation, Snowfall Flux, Surface Air Pressure, Surface Downwelling Longwave Radiation, Surface Downwelling Shortwave Radiation, Near Surface Wind Speed, Near-Surface Air Temperature, Daily Maximum Near Surface Air Temperature, Daily Minimum Near Surface Air Temperature, Eastward Near-Surface Wind and Northward Near-Surface Wind. For data sources, units and short names of all variables see Frieler et al. (2017, Table 1).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Assessment of emotional states is becoming an increasingly important part of animal welfare research, but emotional state is hard to measure and often requires time-consuming or complicated tests. A threat perception test has been developed as a measure of anxiety in sheep and validated using pharmacological models of anxiety. While it appears that the responses we measure in the test are directed towards the threat of a dog, a controlled study had not been conducted to confirm this. The main objective of this study was therefore to further investigate the behavioural responses of sheep in the threat perception test to differentiate between responses to the dog and responses to the novel testing environment itself. A secondary aim of this study was to automate some of the behavioural measures taken during the test. The collection of key threat perception measures (vigilance and attention to threat) from videos is a long and labour-intensive process. Accelerometers or similar devices have been used previously on sheep, attached via halters or collars, to monitor animal movements and feeding behaviours. This study investigated the use of such devices to automate the collection of vigilance and attention-to-threat data, with the goal of making the test faster, more practical and more accurate. Importantly, this study aimed to determine whether the attachment of data loggers to sheep would alter the behaviour of the animals during testing.
Lineage: Details of the methods used to produce this data have been published and can be found at https://doi.org/10.1371/journal.pone.0190404 (see methods for Experiment 2).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This archive provides the full dataset and reproducibility materials for the EMDBC (Empirical Mode Decomposition-Based Bias Correction) method. It includes:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data for application of an attention bias test in free-range laying hens including pharmacological validation of the test using the anxiogenic drug m-CPP. Lineage: All data were obtained by staff and students employed within the Agriculture and Food business unit at the FD McMaster Laboratory, Chiswick or through the University of New England, Armidale, NSW.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The Tempe Police Department prides itself in its continued efforts to reduce harm within the community and is providing this dataset on hate crime incidents that occur in Tempe.
The Tempe Police Department documents the type of bias that motivated a hate crime according to those categories established by the FBI. These include crimes motivated by biases based on race and ethnicity, religion, sexual orientation, disability, gender and gender identity.
The Bias Type categories provided in the data come from the Bias Motivation Categories as defined in the Federal Bureau of Investigation (FBI) National Incident-Based Reporting System (NIBRS) manual, version 2020.1 dated 4/15/2021. The FBI NIBRS manual can be found at https://www.fbi.gov/file-repository/ucr/ucr-2019-1-nibrs-user-manua-093020.pdf with the Bias Motivation Categories found on pages 78-79.
Although data is updated monthly, there is a delay of one month to allow for data validation and submission.
Information about Tempe Police Department's collection and reporting process for possible hate crimes is included in https://storymaps.arcgis.com/stories/a963e97ca3494bfc8cd66d593eebabaf.
Additional Information
Source: Data are from the Law Enforcement Records Management System (RMS)
Contact: Angelique Beltran
Contact E-Mail: angelique_beltran@tempe.gov
Data Source Type: Tabular
Preparation Method: Data from the Law Enforcement Records Management System (RMS) are entered by the Tempe Police Department into a GIS mapping system, which automatically publishes to open data.
Publish Frequency: Monthly
Publish Method: New data entries are automatically published to open data.
Data Dictionary
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
CV errors for the 5-Fold-CV.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
VERSION HISTORY:
- On June 26, 2018 all files were republished due to the incorporation of additional observational data covering years 2014 to 2016. Prior to that date, the dataset only covered years 1979 to 2013. Data for all years prior to 2014 are identical in this and the original version of the dataset.
DATA DESCRIPTION:
The EWEMBI dataset was compiled to support the bias correction of climate input data for the impact assessments carried out in phase 2b of the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP2b; Frieler et al., 2017), which will contribute to the 2018 IPCC special report on the impacts of global warming of 1.5°C above pre-industrial levels and related global greenhouse gas emission pathways. The EWEMBI data cover the entire globe at 0.5° horizontal and daily temporal resolution from 1979 to 2013. Data sources of EWEMBI are ERA-Interim reanalysis data (ERAI; Dee et al., 2011), WATCH forcing data methodology applied to ERA-Interim reanalysis data (WFDEI; Weedon et al., 2014), eartH2Observe forcing data (E2OBS; Calton et al., 2016) and NASA/GEWEX Surface Radiation Budget data (SRB; Stackhouse Jr. et al., 2011). The SRB data were used to bias-correct E2OBS shortwave and longwave radiation (Lange, 2018). Variables included in the EWEMBI dataset are Near Surface Relative Humidity, Near Surface Specific Humidity, Precipitation, Snowfall Flux, Surface Air Pressure, Surface Downwelling Longwave Radiation, Surface Downwelling Shortwave Radiation, Near Surface Wind Speed, Near-Surface Air Temperature, Daily Maximum Near Surface Air Temperature, Daily Minimum Near Surface Air Temperature, Eastward Near-Surface Wind and Northward Near-Surface Wind. For data sources, units and short names of all variables see Frieler et al. (2017, Table 1).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Mapping of environmental variables often relies on map accuracy assessment through cross-validation with the data used for calibrating the underlying mapping model. When the data points are spatially clustered, conventional cross-validation leads to optimistically biased estimates of map accuracy. Several papers have promoted spatial cross-validation as a means to tackle this over-optimism. Many of these papers blame spatial autocorrelation as the cause of the bias and propagate the widespread misconception that spatial proximity of calibration points to validation points invalidates classical statistical validation of maps. In the paper related to these data, we present and evaluate alternative cross-validation approaches for assessing map accuracy from clustered sample data.
The study area is western Europe, bounded in the north at 52° latitude and in the west and east at -10° and 24° longitude. The projection is IGNF:ETRS89LAEA (Lambert azimuthal equal area projection).
Files:
agb.tif = above ground biomass (AGB) map from version 3 of the 2017 CCI-Biomass product (https://catalogue.ceda.ac.uk/uuid/5f331c418e9f4935b8eb1b836f8a91b8)
AGBstack.tif = covariates used for predicting AGB
aggArea.tif = coarse grid used for simulation in the model-based methods
ocs.tif = soil organic carbon stock (OCS) map (0-30 cm) from Soilgrids (https://www.isric.org/explore/soilgrids)
OCSstack.tif = covariates used for predicting OCS
strata.xxx = 100 compact geo-strata (ESRI shape) created with the spcosa package; used for generating clustered samples
TOTmask.tif = mask of the area covered by the covariates
Details and data sources of the covariates in AGBstack.tif and OCSstack.tif:
Name | Description | Source | Note
ai | Aridity Index | https://chelsa-climate.org/downloads/ | Version 2.1
bio1 | Mean annual air temperature [°C] | https://chelsa-climate.org/downloads/ | Version 2.1
bio5 | Mean daily maximum air temperature of the warmest month [°C] | https://chelsa-climate.org/downloads/ | Version 2.1
bio7 | Annual range of air temperature [°C] | https://chelsa-climate.org/downloads/ | Version 2.1
bio12 | Annual precipitation [kg/m2] | https://chelsa-climate.org/downloads/ | Version 2.1
bio15 | Precipitation seasonality [kg/m2] | https://chelsa-climate.org/downloads/ | Version 2.1
gdd10 | Growing degree days heat sum above 10°C | https://chelsa-climate.org/downloads/ | Version 2.1
clay | Clay content [g/kg] of the 0-5cm layer | | Only used for AGB
sand | Sand content [g/kg] of the 0-5cm layer | https://soilgrids.org/ | as above
pH | Acidity (pH(water)) of the 0-5cm layer | https://soilgrids.org/ | as above
glc2017 | Landcover 2017 | https://land.copernicus.eu/global/products/lc, reclassified to: closed forest, open forest, natural non-forest veg., bare & sparse veg., cropland, built-up, water | Categorical variable
dem | Elevation | https://www.eea.europa.eu/data-and-maps/data/copernicus-land-monitoring-service-eu-dem |
cosasp | Cosine of slope aspect | Computed with the terra package from elevation | Computed @25m resolution; next aggregated to 0.5km
sinasp | Sine of slope aspect | Computed with the terra package from elevation | as above
slope | Slope | Computed with the terra package from elevation | as above
TPI | Topographic position index | Computed with the terra package from elevation | as above
TRI | Terrain ruggedness index | Computed with the terra package from elevation | as above
TWI | Topographic wetness index | Computed with SAGA from 500m resolution (aggregated) dem |
gedi | Forest height | https://glad.umd.edu/dataset/gedi | Zone: NAFR
xcoord | X coordinate | | Using a mask created from the other covariates
ycoord | Y coordinate | | Using a mask created from the other covariates
Dcoast | Distance from coast | | Using a land mask created from the other covariates
Metabarcoding can rapidly determine the species composition of bulk samples and thus aids biodiversity and ecosystem assessment. However, it is essential to use primer sets that minimize amplification bias among taxa to maximize species recovery. Despite this fact, the performance of primer sets employed for metabarcoding terrestrial arthropods has not been sufficiently evaluated. This study tests the performance of 36 primer sets on a mock community containing 374 insect species. Amplification success was assessed with gradient PCRs and the 21 most promising primer sets selected for metabarcoding. These 21 primer sets were also tested by metabarcoding a Malaise trap sample. We identified eight primer sets, mainly those including inosine and/or high degeneracy, that recovered more than 95% of the species in the mock community. Results from the Malaise trap sample were congruent with the mock community, but primer sets generating short amplicons produced potential false positives. Taxon ...
http://www.apache.org/licenses/LICENSE-2.0
This compressed file contains all datasets made for the validation of MUBDsyn.
datasets_int_val: 17 cases in this folder are derived from MUBD for GPCRs. MUBDreal was made by MUBD-DecoyMaker2.0 and MUBDsyn was made by MUBD-DecoyMakersyn.
datasets_ext_val_classical_VS: Five cases in this folder are derived from the shared cases of MUV and DUD-E. The active sets of MUV were taken as the input to make corresponding MUBD datasets. Files in SBVS are raw molecular docking results by smina.
datasets_ext_val_SI_classical_VS: DeepCoy and TocoDecoy were used to make the datasets corresponding to the same five cases above. The data of DeepCoy was directly retrieved from DeepCoy resources at OPIG while topology decoys of TocoDecoy_9W were made based on the scripts provided at TocoDecoy GitHub Repository. Files in SBVS are raw molecular docking results by smina.
datasets_ext_val_ML_VS: Ten cases in this folder are derived from NRLiSt-BDB. Corresponding MUBD datasets were made as described above.
All these datasets can be used for the reproduction of validation performed in the manuscript or to benchmark various virtual screening methods.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Data sharpening in kernel regression has been shown to be an effective method of reducing bias while having minimal effects on variance. Earlier efforts to iterate the data sharpening procedure have been less effective, due to the employment of an inappropriate sharpening transformation. In this article, an iterated data sharpening algorithm is proposed which reduces the asymptotic bias at each iteration, while having modest effects on the variance. The efficacy of the iterative approach is demonstrated theoretically and via a simulation study. Boundary effects persist and the affected region successively grows when the iteration is applied to local constant regression. By contrast, boundary bias successively decreases for each iteration step when applied to local linear regression. This study also shows that after iteration, the resulting estimates are less sensitive to bandwidth choice, and a further simulation study demonstrates that iterated data sharpening with data-driven bandwidth selection via cross-validation can lead to more accurate regression function estimation. Examples with real data are used to illustrate the scope of change made possible by using iterated data sharpening and to also identify its limitations. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This dataset contains detailed synthetic records of credit card applications, including applicant demographics, financial profiles, application outcomes, and risk assessments. It is ideal for validating credit scoring models, detecting bias, and supporting regulatory compliance or fairness analysis in financial services. The flat schema design enables seamless integration with analytics and machine learning workflows.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
Abstract: Home range estimation is routine practice in ecological research. While advances in animal tracking technology have increased our capacity to collect data to support home range analysis, these same advances have also resulted in increasingly autocorrelated data. Consequently, the question of which home range estimator to use on modern, highly autocorrelated tracking data remains open. This question is particularly relevant given that most estimators assume independently sampled data. Here, we provide a comprehensive evaluation of the effects of autocorrelation on home range estimation. We base our study on an extensive dataset of GPS locations from 369 individuals representing 27 species distributed across 5 continents. We first assemble a broad array of home range estimators, including Kernel Density Estimation (KDE) with four bandwidth optimizers (Gaussian reference function, autocorrelated-Gaussian reference function (AKDE), Silverman's rule of thumb, and least squares cross-validation), Minimum Convex Polygon, and Local Convex Hull methods. Notably, all of these estimators except AKDE assume independent and identically distributed (IID) data. We then employ half-sample cross-validation to objectively quantify estimator performance, and the recently introduced effective sample size for home range area estimation ($\hat{N}_\mathrm{area}$) to quantify the information content of each dataset. We found that AKDE 95% area estimates were larger than conventional IID-based estimates by a mean factor of 2. The median number of cross-validated locations included in the holdout sets by AKDE 95% (or 50%) estimates was 95.3% (or 50.1%), confirming the larger AKDE ranges were appropriately selective at the specified quantile. Conversely, conventional estimates exhibited negative bias that increased with decreasing $\hat{N}_\mathrm{area}$.
To contextualize our empirical results, we performed a detailed simulation study to tease apart how sampling frequency, sampling duration, and the focal animal's movement conspire to affect range estimates. Paralleling our empirical results, the simulation study demonstrated that AKDE was generally more accurate than conventional methods, particularly for small $\hat{N}_\mathrm{area}$. While 72% of the 369 empirical datasets had >1000 total observations, only 4% had an $\hat{N}_\mathrm{area}$ >1000, where 30% had an $\hat{N}_\mathrm{area}$ <30. In this frequently encountered scenario of small $\hat{N}_\mathrm{area}$, AKDE was the only estimator capable of producing an accurate home range estimate on autocorrelated data.
Usage notes
Empirical GPS tracking data: Anonymised, empirical tracking data used to estimate home range areas based on various home range estimators. File: Anonymised_Data.zip
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Description
This repository contains all data used in "Training data composition affects performance of protein structure analysis algorithms", published in the Pacific Symposium on Biocomputing 2022 by A. Derry, K. A. Carpenter, & R. B. Altman.
The data consists of the following files:
Details on dataset construction can be found in our paper and dataloaders can be found in our Github repo.
Reference
A. Derry*, K. A. Carpenter*, & R. B. Altman, "Training data composition affects performance of protein structure analysis algorithms", 2021.
Dataset References
Datasets used were derived from the following works:
Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., & Moult, J. (2019). Critical assessment of methods of protein structure prediction (CASP)—Round XIII. In Proteins: Structure, Function and Bioinformatics (Vol. 87, Issue 12, pp. 1011–1020). https://doi.org/10.1002/prot.25823
Ingraham, J., Garg, V. K., Barzilay, R., & Jaakkola, T. (2019). Generative Models for Graph-Based Protein Design. https://openreview.net/pdf?id=SJgxrLLKOE
Furnham, N., Holliday, G. L., de Beer, T. A. P., Jacobsen, J. O. B., Pearson, W. R., & Thornton, J. M. (2014). The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Research, 42 (Database issue), D485–D489.
Squirrel locations: Fox squirrel locations entered on website from public and professionals. File contains x,y coordinates and associated covariates. File: presencedata_final.csv
Squirrel validation points: Fox squirrel occurrence data from camera trapping. File: valid_final.csv
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Training and validation data for the PAN @ SemEval 2019 Task 4: Hyperpartisan News Detection.
The data is split into multiple files. The articles are contained in the files with names starting with "articles-" (which validate against the XML schema article.xsd). The ground-truth information is contained in the files with names starting with "ground-truth-" (which validate against the XML schema ground-truth.xsd).
The first part of the data (filename contains "bypublisher") is labeled by the overall bias of the publisher as provided by BuzzFeed journalists or MediaBiasFactCheck.com. It contains a total of 750,000 articles, half of which (375,000) are hyperpartisan and half of which are not. Half of the articles that are hyperpartisan (187,500) are on the left side of the political spectrum, half are on the right side. This data is split into a training set (80%, 600,000 articles) and a validation set (20%, 150,000 articles), where no publisher that occurs in the training set also occurs in the validation set. Similarly, none of the publishers in those sets will occur in the test set.
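The publisher-disjoint property described above (no publisher appears in both the training and the validation set) can be obtained by assigning whole publishers, rather than individual articles, to each side of the split. The sketch below is one simple way to do this and is not the procedure used by the task organizers, which is not described here; the function name, the `(article_id, publisher)` input format, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_publisher(articles, validation_fraction=0.2, seed=0):
    """Split (article_id, publisher) pairs so that no publisher is shared.

    Publishers are shuffled and assigned whole to the validation set
    until it holds roughly `validation_fraction` of all articles; the
    remaining publishers go to the training set.
    """
    by_publisher = defaultdict(list)
    for article_id, publisher in articles:
        by_publisher[publisher].append(article_id)

    publishers = sorted(by_publisher)          # deterministic starting order
    random.Random(seed).shuffle(publishers)

    target = validation_fraction * sum(len(ids) for ids in by_publisher.values())
    train, validation = [], []
    for publisher in publishers:
        bucket = validation if len(validation) < target else train
        bucket.extend(by_publisher[publisher])
    return train, validation


# Toy data: 100 articles spread evenly over 10 hypothetical publishers.
articles = [(i, f"pub{i % 10}") for i in range(100)]
train, validation = split_by_publisher(articles)

publisher_of = dict(articles)
train_publishers = {publisher_of[a] for a in train}
validation_publishers = {publisher_of[a] for a in validation}
print(len(train), len(validation), train_publishers & validation_publishers)
```

Because publishers differ in size, the realized split only approximates the requested fraction; an exact 80/20 division such as 600,000 and 150,000 articles would require choosing publishers whose article counts add up accordingly.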
The second part of the data (filename contains "byarticle") is labeled through crowdsourcing on an article basis. The data contains only articles for which a consensus among the crowdsourcing workers existed. It contains a total of 645 articles. Of these, 238 (37%) are hyperpartisan and 407 (63%) are not. We will use a similar (but balanced!) test set. Again, none of the publishers in this set will occur in the test set.
Note that article IDs are only unique within the parts.
The collection (including labels) is licensed under a Creative Commons Attribution 4.0 International License.
Acknowledgements: Thanks to Jonathan Miller for his assistance in cleaning the data!
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
Data set for: P.G. Martins, A.; Köbrich, M.V.; Carstengerdes, N. & Biella, M. (submitted). All’s Well That Ends Well? Outcome Bias in Pilots During Instrument Flight Rules. Applied Cognitive Psychology. Data set for the two conditions, including the codebook: data Validation Condition