CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset abstract: This dataset contains the results from 40 language and speech researchers who completed a survey. In the first part of the survey, respondents were asked to complete a demographic (e.g., age, gender, first language) and professional background questionnaire (e.g., current academic position, research interests). In addition, they were asked several open-ended questions about their familiarity with and understanding of the term ‘ecological validity’ (e.g., which words come to mind when you hear this term, how the ecological validity of a study can be measured, and how ecological validity applies to your area of research). In the second part of the survey, respondents were presented with 24 short speech excerpts, representing 12 different stimulus types. They were asked to rate each speech excerpt on its degree of casualness (i.e., spontaneity) and naturalness, and on how likely they would be to encounter each excerpt in everyday listening situations.

Article abstract: This paper explores how researchers in the field of language and speech sciences understand and apply the concept of ecological validity. It also assesses the ecological validity of various stimulus materials, ranging from isolated word productions to sentences taken from authentic interviews. Forty researchers participated in a survey, which contained (i) a demographic and professional background questionnaire with open-ended questions about the definition, feasibility and desirability of ecological validity, and (ii) a speech rating task. In the rating task, respondents evaluated 24 speech excerpts, representing 12 types of stimulus materials, on their casualness, naturalness, and likelihood of occurrence in real-life contexts. The results showed that while most researchers acknowledge the importance of ecological validity, defining the necessary and sufficient criteria for evaluating or achieving it remains challenging. Regarding stimulus types, unscripted sentences from interviews and Map Task dialogues were rated as the most casual and natural. In contrast, carefully read sentences and digitally modified stimuli were viewed as the least casual and natural, although individual differences in rating were noticeable. Similarly, ratings for the likelihood of occurrence in everyday listening situations were highest for various types of extemporaneous speech. The survey responses not only enhance our theoretical understanding of ecological validity, but also raise awareness of the implications of methodological choices, such as the selection of tasks and stimulus materials, for the ecological validity of a study.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a synthetic smart card data set that can be used to test pattern detection methods for the extraction of temporal and spatial data. The data set is tab-separated and based on a stylized travel pattern description for the city of Utrecht in the Netherlands; it was developed and used in Chapter 6 of the PhD thesis of Paul Bouman.
This dataset contains the following files:
journeys.tsv: the actual data set of synthetic smart card data
utrecht.xml: the activity pattern definition that was used to randomly generate the synthetic smart card data
validate.ref: a file derived from the activity pattern definition that can be used for validation purposes. It specifies which activity types occur at each location in the smart card data set; a minimal cross-check of the two files is sketched below.
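As an illustration, a minimal validation pass might cross-check the activity types observed per location in journeys.tsv against the expectations in validate.ref. The column names and the validate.ref layout below are assumptions for illustration, not the actual file schemas.

```python
import csv
from collections import defaultdict

# Hypothetical schemas: journeys.tsv is assumed to carry 'location' and
# 'activity_type' columns; validate.ref is assumed to map each location to a
# comma-separated list of allowed activity types. Adjust to the real layouts.

def load_expected(path="validate.ref"):
    expected = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            location, types = line.rstrip("\n").split("\t")
            expected[location] = set(types.split(","))
    return expected

def observed_activities(path="journeys.tsv"):
    observed = defaultdict(set)
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            observed[row["location"]].add(row["activity_type"])
    return observed

if __name__ == "__main__":
    expected = load_expected()
    for location, activities in observed_activities().items():
        unexpected = activities - expected.get(location, set())
        if unexpected:
            print(f"{location}: unexpected activity types {sorted(unexpected)}")
```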
The GPM Ground Validation KTYX NEXRAD GCPEx dataset was collected from January 9, 2012 to March 12, 2012 for the GPM Cold-season Precipitation Experiment (GCPEx). GCPEx addressed shortcomings in the GPM snowfall retrieval algorithm by collecting microphysical properties, associated remote sensing observations, and coordinated model simulations of precipitating snow. These data were collected toward the overarching goal of GCPEx, which was to characterize the ability of multi-frequency active and passive microwave sensors to detect and estimate falling snow. The Next Generation Weather Radar system (NEXRAD) comprises 160 Weather Surveillance Radar-1988 Doppler (WSR-88D) sites throughout the United States and at select overseas locations. The GPM Ground Validation NEXRAD GCPEx data files are available as level 2 binary files and level 3 compressed binary files.
Coastal marshes are highly dynamic and ecologically important ecosystems that are subject to pervasive and often harmful disturbances, including shoreline erosion. Shoreline erosion can result in an overall loss of coastal marsh, particularly in estuaries with moderate or high wave energy. Not only can waves be important physical drivers of shoreline change, they can also influence shore-proximal vertical accretion through sediment delivery. For these reasons, estimates of wave energy can provide a quantitative measure of wave effects on marsh shorelines. Since wave energy is difficult to measure at all locations, scientists and managers often rely on hydrodynamic models to estimate wave properties at different locations. The Wave Exposure Model (WEMo) is a simple tool that uses linear wave theory to estimate wave energy characteristics for enclosed and semi-enclosed estuaries (Malhotra and Fonseca, 2007). The interpretation of hydrodynamic models is improved if model results can be validated against measured data. The data presented in this publication are input and validation data for modeled and observed mean wave height for two temporary oceanographic stations established by the U.S. Geological Survey (USGS) in the Grand Bay National Estuarine Research Reserve, Mississippi.
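For orientation, linear wave theory gives the wave energy per unit surface area as E = (1/8) ρ g H², with ρ the water density, g gravitational acceleration, and H the wave height. The snippet below evaluates this textbook relation for an observed mean wave height; it is provided for context and is not a reimplementation of WEMo.

```python
RHO = 1025.0  # seawater density (kg/m^3)
G = 9.81      # gravitational acceleration (m/s^2)

def wave_energy_density(wave_height_m: float) -> float:
    """Wave energy per unit surface area (J/m^2) from linear wave theory:
    E = (1/8) * rho * g * H^2."""
    return 0.125 * RHO * G * wave_height_m ** 2

# Example: a 0.3 m mean wave height, plausible for a fetch-limited estuary,
# corresponds to roughly 113 J/m^2.
print(wave_energy_density(0.3))
```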
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With recent technological advancements, quantitative analysis has become an increasingly important area within professional sports. However, the manual process of collecting data on relevant match events like passes, goals and tackles comes with considerable costs and limited consistency across providers, affecting both research and practice. In football, while automatic detection of events from positional data of the players and the ball could alleviate these issues, it is not entirely clear what accuracy current state-of-the-art methods realistically achieve, because there is a lack of high-quality validations on realistic and diverse data sets. This paper adds context to existing research by validating a two-step rule-based pass and shot detection algorithm on four different data sets using a comprehensive validation routine that accounts for the temporal, hierarchical and imbalanced nature of the task. Our evaluation shows that pass and shot detection performance is highly dependent on the specifics of the data set. In accordance with previous studies, we achieve F-scores of up to 0.92 for passes, but only when there is an inherent dependency between event and positional data. We find a significantly lower accuracy, with F-scores of 0.71 for passes and 0.65 for shots, if event and positional data are independent. This result, together with a critical evaluation of existing methodologies, suggests that the accuracy of current football event detection algorithms operating on positional data is overestimated. Further analysis reveals that the temporal extraction of passes and shots from positional data poses the main challenge for rule-based approaches. Our results further indicate that the classification of plays into shots and passes is a relatively straightforward task, achieving F-scores between 0.83 and 0.91 for rule-based classifiers and up to 0.95 for machine learning classifiers. We show that there exist simple classifiers that accurately differentiate shots from passes in different data sets using a low number of human-understandable rules. Operating on basic spatial features, our classifiers provide a simple, objective event definition that can be used as a foundation for more reliable event-based match analysis.
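As background, evaluating event detection of this kind typically matches each detected event to at most one ground-truth event within a temporal tolerance and derives precision, recall, and the F-score from the resulting counts. The sketch below illustrates that logic on assumed lists of event timestamps; it is a generic illustration, not the paper's validation routine.

```python
def f_score(detected, ground_truth, tolerance_s=2.0):
    """Greedy one-to-one matching of detected event times to ground-truth
    event times within a tolerance, followed by precision/recall/F1."""
    unmatched = sorted(ground_truth)
    true_positives = 0
    for t in sorted(detected):
        match = next((g for g in unmatched if abs(g - t) <= tolerance_s), None)
        if match is not None:
            unmatched.remove(match)
            true_positives += 1
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Event times in seconds: two of three detections match, one event is missed.
print(f_score(detected=[10.1, 35.0, 80.2], ground_truth=[10.0, 34.0, 60.0]))
```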
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This fileset provides supporting data and corpora for the empirical study described in: Rafael S. Gonçalves and Mark A. Musen. The variable quality of metadata about biological samples used in biomedical experiments. Scientific Data, in press (2019).

Description of files

Analysis spreadsheet files:
- ncbi-biosample-metadata-study.xlsx contains data to support the analysis of the quality of metadata in the NCBI BioSample.
- ebi-biosamples-metadata-study.xlsx contains data to support the analysis of the quality of metadata in the EBI BioSamples.

Validation data files:
- ncbi-biosample-validation-data.tar.gz is an archive containing the validation data for the analysis of the entire NCBI BioSample dataset.
- ncbi-biosample-packaged-validation-data.tar.gz is an archive containing the validation data for the analysis of the subset of metadata records in the NCBI BioSample that use a BioSample package definition.
- ebi-ncbi-shared-records-validation-data.tar.gz is an archive containing the validation data for the analysis of the set of metadata records that exist both in EBI BioSamples and NCBI BioSample.

Corpus files:
- ebi-biosamples-corpus.xml.gz corresponds to the EBI BioSamples corpus.
- ncbi-biosample-corpus.xml.gz corresponds to the NCBI BioSample corpus.
- ncbi-biosample-packaged-records-corpus.tar.gz corresponds to the NCBI BioSample metadata records that declare a package definition.
- ebi-ncbi-shared-records-corpus.tar.gz corresponds to the corpus of metadata records that exist both in NCBI BioSample and EBI BioSamples.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This section presents a discussion of the research data. The data were received as secondary data; however, they were originally collected using time study techniques. Data validation is a crucial step in the data analysis process to ensure that the data are accurate, complete, and reliable. Descriptive statistics were used to validate the data. The mean, mode, standard deviation, variance, and range provide a summary of the data distribution and assist in identifying outliers or unusual patterns. The dataset reports the measures of central tendency: the mean, the median, and the mode. The mean is the average value of each of the factors presented in the tables; it is the balance point of the dataset and describes its typical value and behaviour. The median is the middle value of the dataset for each factor: half of the values lie below it and half lie above it, which is important for skewed distributions. The mode is the most common value in the dataset and was used to describe the most typical observation. Together, these values describe the central value around which the data are distributed. Because the mean, median, and mode are neither similar nor close to one another, they indicate a skewed distribution.

The dataset also presents the results and a discussion of them. This section focuses on the customisation of the DMAIC (Define, Measure, Analyse, Improve, Control) framework to address the specific concerns outlined in the problem statement. To gain a comprehensive understanding of the current process, value stream mapping was employed, further enhanced by measuring the factors that contribute to inefficiencies. These factors were then analysed and ranked based on their impact, using factor analysis. To mitigate the impact of the most influential factor on project inefficiencies, a solution is proposed using the EOQ (Economic Order Quantity) model. The implementation of the 'CiteOps' software facilitates improved scheduling, monitoring, and task delegation in the construction project through digitalisation; project progress and efficiency are also monitored remotely and in real time. In summary, the DMAIC framework was tailored to the requirements of the specific project, incorporating techniques from inventory management, project management, and statistics to effectively minimise inefficiencies within the construction project.
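As an illustration of this validation step, the snippet below computes the summary statistics discussed above and flags a skewed distribution using Pearson's second skewness coefficient, 3(mean − median)/σ; the 0.5 cutoff and the example values are illustrative assumptions. The EOQ step mentioned above similarly reduces to the classical formula Q* = sqrt(2DS/H), with D the demand rate, S the ordering cost, and H the holding cost per unit.

```python
import statistics

def validate_distribution(values, skew_cutoff=0.5):
    """Summarize a sample and flag skew via Pearson's second skewness
    coefficient, 3 * (mean - median) / stdev (0.5 is a common rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)
    skew = 3 * (mean - median) / stdev
    return {
        "mean": mean,
        "median": median,
        "mode": statistics.mode(values),
        "stdev": stdev,
        "variance": statistics.variance(values),
        "range": max(values) - min(values),
        "skewness": skew,
        "skewed": abs(skew) > skew_cutoff,
    }

# Example: task durations (minutes) from a time study; the outlier 45 pulls
# the mean away from the median, so the sample is flagged as skewed.
print(validate_distribution([12, 14, 14, 15, 16, 18, 45]))
```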
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic data of Dataset 1 (test–retest variability dataset for simulated VF series).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Demographic data of Dataset 2 (real VF series from clinics).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Main content and developer: The Chinese Air Quality Reanalysis dataset was produced by the Institute of Atmospheric Physics, Chinese Academy of Sciences (IAP/CAS), in collaboration with the China National Environmental Monitoring Centre (CNEMC) and other research institutes. It provides surface gridded fields of six conventional air pollutants (PM2.5, PM10, SO2, NO2, CO, and O3) and WRF-simulated surface fields of wind speed (u, v), pressure (psfc), relative humidity (RH), and temperature (temp). The spatial and temporal resolutions are 15 km and 1 hour, respectively. The dataset currently covers 2013 to 2019 and will be updated irregularly.

Data assimilation method: The dataset was produced by the chemical data assimilation system (ChemDAS) developed by IAP/CAS, which assimilates observations from over 1,000 surface air quality monitoring sites operated by CNEMC, based on the ensemble Kalman filter (EnKF) and the Nested Air Quality Prediction Modeling System (NAQPMS). This method addresses the problems of instability, insufficient adjustment, and negative assimilation effects in atmospheric chemistry data assimilation, and implements collaborative assimilation of multiple air pollutants, including automatic quality control of monitoring data, adaptive model error estimation, and other advanced algorithms. It has been published in Earth System Science Data, where detailed descriptions and validation of this dataset are available (https://doi.org/10.5194/essd-13-529-2021).

Data accuracy: The dataset was evaluated by cross-validation and independent data validation. For 2013-2018, the root mean square error (RMSE) at assimilation (validation) sites for hourly concentrations was estimated to be 15.2 (21.3) μg/m3 for PM2.5, 28.0 (39.3) μg/m3 for PM10, 16.9 (24.9) μg/m3 for SO2, 12.7 (16.4) μg/m3 for NO2, 0.38 (0.54) mg/m3 for CO, and 17.5 (21.9) μg/m3 for O3. For 2019, the RMSE at assimilation (validation) sites for hourly concentrations was estimated to be 10.2 (13.3) μg/m3 for PM2.5, 19.1 (24.5) μg/m3 for PM10, 6.1 (7.7) μg/m3 for SO2, 10.0 (12.4) μg/m3 for NO2, 0.24 (0.30) mg/m3 for CO, and 14.0 (17.2) μg/m3 for O3.

Dataset versions: The first version of this dataset (V1) covers 2013 to 2018 and consists of 72 zip files, each containing one month of reanalysis data. The second version (V2) covers 2013 to 2018, split by day into 2,191 zip files in total. The third version (V3) was extended to 2019 using the same algorithm and validation as V1 and V2 and contains seven folders, one per year; each folder contains the daily reanalysis data compression files. A description of the content of each data file is available in README.txt.
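As an illustration of the evaluation metric, the RMSE at a set of sites is the square root of the mean squared difference between reanalysis and observed concentrations. The sketch below shows the computation; the array names and values are placeholders.

```python
import numpy as np

def rmse(reanalysis: np.ndarray, observed: np.ndarray) -> float:
    """Root mean square error between reanalysis values sampled at monitoring
    sites and the observed hourly concentrations at those sites."""
    return float(np.sqrt(np.mean((reanalysis - observed) ** 2)))

# Example: hourly PM2.5 (ug/m^3) at validation sites withheld from assimilation.
observed = np.array([35.0, 60.0, 42.0, 80.0])
reanalysis = np.array([30.0, 66.0, 45.0, 70.0])
print(rmse(reanalysis, observed))  # ~6.5 ug/m^3
```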
The GPM Ground Validation NOAA S-Band Profiler Minute Data MC3E dataset was gathered during the Midlatitude Continental Convective Clouds Experiment (MC3E) in Oklahoma from April to June 2011. The overarching goal was to provide the most complete characterization of convective cloud systems, precipitation, and the environment ever obtained, providing constraints for model cumulus parameterizations and space-based rainfall retrieval algorithms over land that had never before been available. The S-band 2.8 GHz profiler measured the backscattered power from raindrops and ice particles as precipitating cloud systems passed overhead. After calibration, the instrument provided an unattenuated reflectivity estimate through the precipitation. Spectra and moment files are included in netCDF format.
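The netCDF moment files can be inspected with any netCDF library. A minimal sketch follows; the file and variable names are assumptions, since the actual names are defined inside the files themselves.

```python
from netCDF4 import Dataset  # pip install netCDF4

# Hypothetical file name; list ds.variables to discover the real contents.
with Dataset("sband_profiler_moments.nc") as ds:
    print(list(ds.variables))  # e.g., time, height, reflectivity moments
    # reflectivity = ds.variables["reflectivity"][:]  # assumed variable name
```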
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying the cause of death is important for the study of end-of-life patients using claims data in Japan. However, the validity of how cause of death is identified using claims data remains unknown. Therefore, this study aimed to verify the validity of the method used to identify the cause of death based on Japanese claims data. Our study population included patients who died at two institutions between January 1, 2018 and December 31, 2019. Claims data consisted of medical data and Diagnosis Procedure Combination (DPC) data, and five definitions developed from disease classification in each dataset were compared with death certificates. Nine causes of death, including cancer, were included in the study. The definition with the highest positive predictive values (PPVs) and sensitivities in this study was the combination of “main disease” in both medical and DPC data. For cancer, these definitions had PPVs and sensitivities of > 90%. For heart disease, these definitions had PPVs of > 50% and sensitivities of > 70%. For cerebrovascular disease, these definitions had PPVs of > 80% and sensitivities of > 70%. For other causes of death, PPVs and sensitivities were < 50% for most definitions. Based on these results, we recommend definitions with a combination of “main disease” in both medical and DPC data for cancer and cerebrovascular disease. However, a clear argument cannot be made for other causes of death because of the small sample size. Therefore, the results of this study can be used with confidence for cancer and cerebrovascular disease but should be used with caution for other causes of death.
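For reference, PPV and sensitivity follow directly from the confusion counts of a claims-based definition scored against the death certificate: PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN). A minimal sketch with illustrative counts:

```python
def ppv_and_sensitivity(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """PPV: of the deaths the claims definition attributes to a cause, the
    share confirmed by the death certificate. Sensitivity: of the
    certificate-confirmed deaths, the share the claims definition finds."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return ppv, sensitivity

# Illustrative counts: 90 true positives, 8 false positives, 10 false negatives.
print(ppv_and_sensitivity(90, 8, 10))  # ~(0.92, 0.90)
```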
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additionally, the effect size (mean divided by standard deviation) is reported. GCD_discov denotes the largest GCD on the discovery data and GCD_valid the GCD resulting from the corresponding method combination on the validation data. The quantities ASW_discov and ASW_valid (average silhouette width) are defined analogously.
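As context, the effect size here is simply the mean divided by the standard deviation, and the average silhouette width (ASW) can be computed with scikit-learn. The data and labels below are placeholders.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def effect_size(values: np.ndarray) -> float:
    """Effect size as defined above: mean divided by standard deviation."""
    return float(np.mean(values) / np.std(values, ddof=1))

# ASW of a clustering on placeholder data with a toy two-cluster assignment.
X = np.random.default_rng(0).normal(size=(40, 2))
labels = (X[:, 0] > 0).astype(int)
print(effect_size(X[:, 0]), silhouette_score(X, labels))
```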
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose: To investigate the clinical validity of the Guided Progression Analysis definition (GPAD) and cluster-based definition (CBD) with the Humphrey Field Analyzer (HFA) 10–2 test in retinitis pigmentosa (RP).

Methods: Ten non-progressive RP visual fields (VFs) (HFA 10–2 test) were simulated for each of 10 VFs of 111 eyes (10 simulations × 10 VF sequences × 111 eyes = 111,000 VFs; Dataset 1). Using these simulated VFs, the specificity of GPAD for the detection of progression was determined. Using this dataset, similar analyses were conducted for the CBD, in which the HFA 10–2 test was divided into four quadrants. Subsequently, the Hybrid Definition was designed by combining the GPAD and CBD; various conditions of the GPAD and CBD were altered to approach a specificity of 95.0%. Actual HFA 10–2 tests of 116 RP eyes (10 VFs each) were then collected (Dataset 2), and the true positive rate, true negative rate, false positive rate, and time required to detect VF progression were evaluated and compared across the GPAD, CBD, and Hybrid Definition.

Results: Specificity values were 95.4% and 98.5% for GPAD and CBD, respectively. There were no significant differences in true positive rate, true negative rate, or false positive rate between the GPAD, CBD, and Hybrid Definition. The GPAD and Hybrid Definition detected progression significantly earlier than the CBD (at 4.5, 5.0, and 4.5 years, respectively).

Conclusions: The GPAD and the optimized Hybrid Definition exhibited similar ability for the detection of progression, with specificity reaching 95.4%.
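For context, specificity in this design is the fraction of simulated non-progressing VF series that a definition correctly leaves unflagged. A minimal sketch with the progression detector left abstract; the toy detector and its threshold are illustrative, not GPAD or CBD:

```python
from typing import Callable, Sequence

def specificity(stable_series: Sequence, flags_progression: Callable) -> float:
    """Share of known-stable (simulated, non-progressing) VF series that the
    definition does NOT flag as progressing: TN / (TN + FP)."""
    false_positives = sum(1 for s in stable_series if flags_progression(s))
    return (len(stable_series) - false_positives) / len(stable_series)

# Toy detector: flag a series if any sensitivity slope is steeper than
# -1 dB/year (purely illustrative threshold).
def toy_detector(slopes_db_per_year):
    return min(slopes_db_per_year) < -1.0

print(specificity([[-0.2, 0.1], [-1.5, -0.3], [0.0, 0.2]], toy_detector))  # ~0.67
```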
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SPOG 2015 FN Definition study (NCT02324231) recruited patients from April 2016 to August 2018 in six of nine pediatric oncology centers in Switzerland. 269 patients were observed, and 360 episodes of fever in neutropenia (FN) were diagnosed in 158 of them. Data on these 360 FN episodes are published here. The data are fully anonymized; in order not to compromise anonymization, information on dates and times is not given. A key file explains all variables, and a data file contains the data of the 360 FN episodes.
The GPM Ground Validation KICT NEXRAD MC3E dataset was collected from April 22, 2011 to June 6, 2011 for the Midlatitude Continental Convective Clouds Experiment (MC3E), which took place in central Oklahoma. The overarching goal of MC3E was to provide the most complete characterization of convective cloud systems, precipitation, and the environment ever obtained, providing constraints for model cumulus parameterizations and space-based rainfall retrieval algorithms over land that had never before been available. The Next Generation Weather Radar system (NEXRAD) comprises 160 Weather Surveillance Radar-1988 Doppler (WSR-88D) sites throughout the United States and at select overseas locations. The GPM Ground Validation NEXRAD MC3E data files are available as compressed binary files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Item statistics including mean score, standard deviation, factor loadings, and corrected item-total correlation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The averaged values of the validity indices for all clustering methods across all simulated data experiments.