100+ datasets found
  1. High School Heights Dataset

    • kaggle.com
    Updated Aug 11, 2022
    Cite
    Yashmeet Singh (2022). High School Heights Dataset [Dataset]. https://www.kaggle.com/datasets/yashmeetsingh/high-school-heights-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yashmeet Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    High School Heights Dataset

    You will find three datasets containing the heights of high school students.

    All heights are in inches.

    The data is simulated. The heights are generated from a normal distribution with different sets of mean and standard deviation for boys and girls.

    Height Statistics (inches):

                          Boys   Girls
    Mean                  67     62
    Standard Deviation    2.9    2.2

    There are 500 measurements for each gender.

    Here are the datasets:

    • hs_heights.csv: contains a single column with heights for all boys and girls. There's no way to tell which of the values are for boys and which are for girls.

    • hs_heights_pair.csv: has two columns. The first column has boys' heights. The second column contains girls' heights.

    • hs_heights_flag.csv: has two columns. The first column has the flag is_girl. The second column contains a girl's height if the flag is 1. Otherwise, it contains a boy's height.

    To see how I generated this dataset, check this out: https://github.com/ysk125103/datascience101/tree/main/datasets/high_school_heights
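    The generation procedure described above can be sketched in a few lines of Python (a minimal illustration only; the column headers and rounding used here are assumptions, and the author's actual script is at the link above):

    ```python
    import csv
    import os
    import random
    import statistics
    import tempfile

    random.seed(0)

    # Parameters stated in the dataset description
    N = 500
    boys = [round(random.gauss(67, 2.9), 1) for _ in range(N)]
    girls = [round(random.gauss(62, 2.2), 1) for _ in range(N)]

    out_dir = tempfile.mkdtemp()  # write somewhere disposable

    def write_csv(name, header, rows):
        path = os.path.join(out_dir, name)
        with open(path, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(header)
            w.writerows(rows)
        return path

    # hs_heights.csv: one anonymous column, boys and girls mixed together
    write_csv("hs_heights.csv", ["height"], [[h] for h in boys + girls])
    # hs_heights_pair.csv: boys in the first column, girls in the second
    write_csv("hs_heights_pair.csv", ["boy", "girl"], zip(boys, girls))
    # hs_heights_flag.csv: is_girl flag, then the height
    write_csv("hs_heights_flag.csv", ["is_girl", "height"],
              [(0, h) for h in boys] + [(1, h) for h in girls])

    # Sanity check: sample means land near 67 and 62
    mean_boys = statistics.mean(boys)
    mean_girls = statistics.mean(girls)
    ```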

    Image by Gillian Callison from Pixabay

  2. Example Groundwater-Level Datasets and Benchmarking Results for the...

    • catalog.data.gov
    • data.usgs.gov
    Updated Oct 13, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Example Groundwater-Level Datasets and Benchmarking Results for the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) Software Package [Dataset]. https://catalog.data.gov/dataset/example-groundwater-level-datasets-and-benchmarking-results-for-the-automated-regional-cor
    Dataset updated
    Oct 13, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY) and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012). The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein, referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR) defined as the proportion of holdout observations falling within the estimated PI. 
The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one.

This data release contains ten tables formatted as tab-delimited text files. The “CA_data.txt” and “NY_data.txt” tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for CA and NY datasets, respectively. The “CA_sites.txt” and “NY_sites.txt” tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The “CA_NRMSE.txt” and “NY_NRMSE.txt” tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using CA and NY datasets, respectively. The “CA_CR.txt” and “NY_CR.txt” tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The “CA_p_per_n.txt” and “NY_p_per_n.txt” tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models, compared to training error for the same models on the entire CA and NY datasets, respectively.

References Cited

Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE.

Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118, https://doi.org/10.1093/bioinformatics/btr597.
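The two benchmarking metrics described above are simple to state in code. The sketch below (stdlib Python, not part of the ARCHI package) follows the definitions given in the description: NRMSE is the root mean square error divided by the standard deviation of the observed holdout values, and CR is the proportion of holdouts falling within their prediction intervals:

```python
import math

def nrmse(observed, predicted):
    """Root mean square error divided by the standard deviation
    of the observed holdout values (sample SD)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / (n - 1))
    return rmse / sd_obs

def coverage_rate(observed, lower, upper):
    """Proportion of holdout observations falling inside the
    estimated prediction interval [lower, upper]."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return inside / len(observed)
```

A perfect imputation gives NRMSE of 0, while predicting the observed mean everywhere gives NRMSE near 1, which is what makes the metric comparable across the NY and CA datasets.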

  3. Data from: Correlation matrices.

    • plos.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl (2023). Correlation matrices. [Dataset]. http://doi.org/10.1371/journal.pone.0261718.s001
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Bálint Maczák; Gergely Vadai; András Dér; István Szendi; Zoltán Gingl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our analyses are based on 148×148 time- and frequency-domain correlation matrices. A correlation matrix covers all the possible use cases of every activity metric listed in the article. With these activity metrics and different preprocessing methods, we were able to calculate 148 different activity signals from multiple datasets of a single measurement. Each cell of a correlation matrix contains the mean and standard deviation of the calculated Pearson’s correlation coefficients between two types of activity signals based on 42 different subjects’ 10-days-long motion. The small correlation matrices presented both in the article and in the appendixes are derived from these 148 × 148 correlation matrices. This published Excel workbook contains multiple sheets labelled according to their content. The mean and standard deviation values for both time- and frequency-domain correlations can be found on their own separate sheet. Moreover, we reproduced the correlation matrix with an alternatively parametrized digital filter, which doubled the number of sheets to 8. In the Excel workbook, we used the same notation for both the datasets and activity metrics as presented in this article with an extension to the PIM metric: PIMs denotes the PIM metric where we used Simpson’s 3/8 rule integration method, PIMr indicates the PIM metric where we calculated the integral by simple numerical integration (Riemann sum). (XLSX)
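    Each cell of such a matrix can be illustrated with a short sketch (hypothetical data layout; this is not the study's pipeline): the Pearson correlation between two activity signals is computed per subject, then summarized by its mean and standard deviation across subjects:

    ```python
    import math

    def pearson(x, y):
        """Pearson's correlation coefficient between two equal-length signals."""
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x)
        vy = sum((b - my) ** 2 for b in y)
        return cov / math.sqrt(vx * vy)

    def matrix_cell(per_subject, i, j):
        """One cell of the correlation matrix: the mean and SD, across
        subjects, of the correlation between activity signals i and j.
        per_subject[s][k] is signal k for subject s (an assumed layout)."""
        rs = [pearson(subj[i], subj[j]) for subj in per_subject]
        n = len(rs)
        mean = sum(rs) / n
        sd = math.sqrt(sum((r - mean) ** 2 for r in rs) / (n - 1))
        return mean, sd
    ```

    In the published workbook each of the 148 × 148 cells summarizes such per-subject correlations over the 42 subjects' 10-day recordings.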

  4. Bounds testing approaches to the analysis of level relationships...

    • journaldata.zbw.eu
    • jda-test.zbw.eu
    .dat, txt
    Updated Dec 8, 2022
    Cite
    M. Hashem Pesaran; Yongcheol Shin; Richard Smith; M. Hashem Pesaran; Yongcheol Shin; Richard Smith (2022). Bounds testing approaches to the analysis of level relationships (replication data) [Dataset]. http://doi.org/10.15456/jae.2022314.0708242076
    Available download formats: .dat (13572), txt (3949), .dat (77432)
    Dataset updated
    Dec 8, 2022
    Dataset provided by
    ZBW - Leibniz Informationszentrum Wirtschaft
    Authors
    M. Hashem Pesaran; Yongcheol Shin; Richard Smith; M. Hashem Pesaran; Yongcheol Shin; Richard Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper develops a new approach to the problem of testing the existence of a level relationship between a dependent variable and a set of regressors, when it is not known with certainty whether the underlying regressors are trend- or first-difference stationary. The proposed tests are based on standard F- and t-statistics used to test the significance of the lagged levels of the variables in a univariate equilibrium correction mechanism. The asymptotic distributions of these statistics are non-standard under the null hypothesis that there exists no level relationship, irrespective of whether the regressors are I(0) or I(1). Two sets of asymptotic critical values are provided: one when all regressors are purely I(1) and the other if they are all purely I(0). These two sets of critical values provide a band covering all possible classifications of the regressors into purely I(0), purely I(1) or mutually cointegrated. Accordingly, various bounds testing procedures are proposed. It is shown that the proposed tests are consistent, and their asymptotic distribution under the null and suitably defined local alternatives are derived. The empirical relevance of the bounds procedures is demonstrated by a re-examination of the earnings equation included in the UK Treasury macroeconometric model.

  5. Supplementary data for the paper "Why psychologists should not default to...

    • data.4tu.nl
    zip
    Updated Apr 28, 2025
    + more versions
    Cite
    Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
    Available download formats: zip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
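    The two statistics under comparison differ only in how they pool variance and count degrees of freedom. A stdlib-Python sketch (not the paper's simulation code) makes the contrast concrete; note that with equal sample sizes the two t values coincide, while Welch's test adjusts the degrees of freedom downward:

    ```python
    import math

    def student_t(x, y):
        """Independent-samples (Student's) t statistic with pooled variance."""
        nx, ny = len(x), len(y)
        mx, my = sum(x) / nx, sum(y) / ny
        vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
        vy = sum((b - my) ** 2 for b in y) / (ny - 1)
        sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
        t = (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
        return t, nx + ny - 2

    def welch_t(x, y):
        """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
        nx, ny = len(x), len(y)
        mx, my = sum(x) / nx, sum(y) / ny
        vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
        vy = sum((b - my) ** 2 for b in y) / (ny - 1)
        se2 = vx / nx + vy / ny  # unpooled standard error squared
        t = (mx - my) / math.sqrt(se2)
        df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
        return t, df
    ```

    With equal group sizes the pooled and unpooled standard errors are algebraically identical, so the two tests differ only through the Welch-Satterthwaite df, which never exceeds the pooled df of nx + ny - 2.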

  6. Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data

    • datasets.ai
    • data.usgs.gov
    • +1more
    Updated Sep 11, 2024
    + more versions
    Cite
    Department of the Interior (2024). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://datasets.ai/datasets/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
    Available download formats
    Dataset updated
    Sep 11, 2024
    Dataset authored and provided by
    Department of the Interior
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, on a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here for fix attempts by hour. This table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): This raster gives the modeled probability of fix acquisition from the predictive model described in Part 2. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-sq = 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Locations of the stationary test collar sites used to train and evaluate the fix acquisition model described in Part 2.

    Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used; satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour-of-day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Only direct GPS download datasets are included; satellite-delivered data were omitted for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
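    As a small illustration of the kind of summary behind the fix-success-rate-by-hour table, an FSR per hour can be computed from raw attempt records as follows (a hypothetical input layout, not the release's actual processing code):

    ```python
    from collections import defaultdict

    def fsr_by_hour(fix_attempts):
        """fix_attempts: iterable of (hour_of_day, success_flag) pairs,
        an assumed layout for per-attempt collar records. Returns the
        fix success rate for each hour with at least one attempt."""
        tried = defaultdict(int)
        acquired = defaultdict(int)
        for hour, ok in fix_attempts:
            tried[hour] += 1
            acquired[hour] += int(ok)
        return {h: acquired[h] / tried[h] for h in tried}
    ```

    Systematic dips in the hourly rates (for example during daylight resting bouts) are the circadian signal the description refers to.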

  7. Google Patent Phrase Similarity Dataset

    • kaggle.com
    Updated Jul 22, 2022
    Cite
    Google (2022). Google Patent Phrase Similarity Dataset [Dataset]. https://www.kaggle.com/datasets/google/google-patent-phrase-similarity-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Google
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents. In addition to the similarity scores typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain-related. The dataset was used in the U.S. Patent Phrase to Phrase Matching competition.

    The dataset was generated with a focus on the following:

    • Phrase disambiguation: certain keywords and phrases can have multiple meanings. For example, the phrase "mouse" may refer to an animal or a computer input device. To help disambiguate the phrases, we have included Cooperative Patent Classification (CPC) classes with each pair of phrases.

    • Adversarial keyword match: there are phrases that have matching keywords but are otherwise unrelated (e.g. “container section” → “kitchen container”, “offset table” → “table fan”). Many models (e.g. bag-of-words models) will not do well on such data. Our dataset is designed to include many such examples.

    • Hard negatives: we created our dataset with the aim of improving upon current state-of-the-art language models. Specifically, we used the BERT model to generate some of the target phrases, so our dataset contains many human-rated examples of phrase pairs that BERT may identify as very similar but which in fact are not.

    Each entry of the dataset contains two phrases (anchor and target), a context CPC class, a rating class, and a similarity score. The rating classes have the following meanings:

    • 4 - Very high.
    • 3 - High.
    • 2 - Medium.
    • 2a - Hyponym (broad-narrow match).
    • 2b - Hypernym (narrow-broad match).
    • 2c - Structural match.
    • 1 - Low.
    • 1a - Antonym.
    • 1b - Meronym (a part of).
    • 1c - Holonym (a whole of).
    • 1d - Other high-level domain match.
    • 0 - Not related.

    The dataset is split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes, and all of them are represented in the training set.
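    A grouped split of this kind, in which every row sharing an anchor lands in the same set, can be sketched by hashing the anchor (an illustrative scheme only; the published split is fixed and cannot be re-derived this way):

    ```python
    import hashlib

    def split_for_anchor(anchor):
        """Deterministically assign every row with the same anchor to the
        same split, in roughly 75/5/20 proportions, by hashing the anchor.
        This is an assumed scheme for illustration, not the official split."""
        bucket = int(hashlib.md5(anchor.encode("utf-8")).hexdigest(), 16) % 100
        if bucket < 75:
            return "train"
        if bucket < 80:
            return "validation"
        return "test"
    ```

    Hashing the grouping key (rather than individual rows) is what prevents leakage of an anchor's targets from the training set into validation or test.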

    More details about the dataset are available in the corresponding paper. Please cite the paper if you use the dataset.

  8. Data from: Triple Dissociation Revisited

    • openneuro.org
    Updated May 31, 2022
    Cite
    Julie Van; Samuel Nielson; C. Brock Kirwan (2022). Triple Dissociation Revisited [Dataset]. http://doi.org/10.18112/openneuro.ds004086.v1.2.0
    Dataset updated
    May 31, 2022
    Dataset provided by
    OpenNeuro (https://openneuro.org/)
    Authors
    Julie Van; Samuel Nielson; C. Brock Kirwan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    README

    DETAILS FOR ACCESSING DATA

    CONTACT PERSON (Corresponding Author)

    C. Brock Kirwan, 1001 KMBL, Brigham Young University, Provo, UT 84602. Email: kirwan@byu.edu. Phone: 801-422-2532. Fax: 801-422-0602. ORCID ID: 0000-003-0768-1446

    OVERVIEW

    PROJECT NAME

    Limited Evidence for a Triple Dissociation in the Medial Temporal Lobe: an fMRI Recognition Memory Replication Study

    YEARS THAT PROJECT RAN

    2020-2021

    BRIEF OVERVIEW OF EXPERIMENTAL TASKS

    The present experiment aims to replicate two previous papers (cited below) in which authors present two analysis paths for a dataset in which participants underwent fMRI while performing a recognition memory test for old and new words. Both studies found activation in the hippocampus, with the first (Daselaar, Fleck, & Cabeza, 2006) demonstrating a distinction in hippocampus activation corresponding to true and perceived oldness of stimuli and the second (Daselaar, Fleck, Prince, & Cabeza, 2006) demonstrating that hippocampus activation reflects the subjective experience of the participant.

    We replicated behavioral and MRI acquisition parameters reported in these two target articles with N=53 participants and focused fMRI analyses on regions of interest reported in those articles looking at fMRI activation for differences corresponding with true and perceived oldness and those associated with subjective memory experiences of recollection, familiarity, and novelty.

    References:
    (1) Daselaar, S. M., Fleck, M. S., & Cabeza, R. (2006). Triple dissociation in the medial temporal lobes: Recollection, familiarity, and novelty. J Neurophysiol, 96(4), 1902–1911. https://doi.org/10.1152/jn.01029.2005
    (2) Daselaar, S. M., Fleck, M. S., Prince, S. E., & Cabeza, R. (2006). The medial temporal lobe distinguishes old from new independently of consciousness. J Neurosci, 26(21), 5835–5839. https://doi.org/10.1523/JNEUROSCI.0258-06.2006

    DATASET CONTENTS

    This dataset includes raw data from all scanned participants, acquired on a Siemens Trio 3T MRI scanner (12-channel head coil). Each participant's data consist of the following folders: /anat, /fmap, and /func. /anat includes structural imaging data in the form of .nii.gz and .json files. /fmap includes field mapping data in the form of .nii.gz and .json files. /func includes functional imaging data in the form of .nii.gz and .json files, along with event.tsv files for each run (total runs = 4). Data for a total of N=53 participants are included in the present dataset.

    INDEPENDENT VARIABLES

    True vs Perceived Oldness: Mean activity (mean parameter estimates) for each individual trial in the anterior/posterior MTL regions was identified by true-oldness and perceived-novelty contrasts. The resulting values were entered into a logistic regression model with activations in the MTL regions set as independent variables. Subjective Confidence: Mean activity for each individual trial from different MTL regions was identified and entered into a multiple regression model with activations in the different MTL regions (i.e., recollection-related, familiarity-related, and novelty-related activity) as independent variables.

    DEPENDENT VARIABLES

    True vs Perceived Oldness: A binary variable reflecting whether participants correctly recognized an old item as old (hit) or incorrectly classified an old item as new (miss) was set as the dependent variable. Subjective Confidence: The 6-point oldness scale was entered as the dependent variable.

    CONTROL VARIABLES

    N/A

    QUALITY ASSESSMENT OF DATA

    Data were preprocessed with the fMRIPrep software, which included spatial motion correction and spatial normalization. Following fMRIPrep preprocessing, functional data were scaled to a mean of 100 and blurred with an 8 mm FWHM Gaussian kernel to account for inter-subject anatomical variation. Analysis scripts are available here: https://osf.io/ctvsw/. Data were acquired for N=60 participants, with data from n=7 participants excluded for reasons of ineligibility (left-handedness, n=1), failure to comply with study procedures (n=2), excessive motion (n=3), and equipment error (n=1).

    METHODS

    STUDY PHASE

    In our experimental task, participants completed a study phase in which they were presented with a randomized list of 120 real English words and 80 pseudowords at a rate of 2000 ms per item. Participants indicated whether each presented stimulus was a word or a pseudoword; a fixation cross was presented between words for a random time interval varying between 0-5500 ms. They were not informed at this time that their memory for the words would be tested. After the completion of the study phase, researchers situated participants in the MRI scanner and obtained localizer, field map, and T1-weighted structural MRI scans before initiating the test phase of the experiment.

    TEST PHASE

    During the test phase, the task paradigm was presented as four experimental runs lasting between 435-442 seconds each. Participants saw an equal number of target stimuli (words shown during the study phase) and foil stimuli (novel words), at 60 words per run. Target and foil stimuli were presented in randomized order for 3.4 seconds each, during which participants judged whether the word had been presented on the study list. Confidence ratings for those judgments were then collected on a scale from 1 (lowest confidence) to 4 (highest confidence), with a prompt displayed for 1.7 seconds.

    PARTICIPANTS

    Recruitment: To determine sample size, an a priori power analysis was performed by extracting values from Figure 1 of Daselaar, Fleck, Prince, et al. (2006) in the right hippocampus via Web Plot Digitizer, given that the region showed smaller differences. We computed main effects by averaging hits and misses, and CRs and FAs, prior to SEM-to-SD conversion and averaging again. The resulting values were entered into G*Power to estimate an effect size of 0.46, indicating that a sample of N=54 would achieve a power of 0.95 with an error probability of 0.05 (t(1,53)=1.67). Participants were recruited from the campus community and met MRI compliance screening requirements. Exclusion: non-native English speakers, history of drug use, previous psychiatric or neurologic diagnosis, or contraindications for MRI (e.g., ferromagnetic implant). Compensation: Participants were compensated with a choice of $20, course credit, or a 3D-printed 1/4-scale model of their brain.

    APPARATUS

    Localizer, field map, and T1-weighted structural MRI scans were obtained once the participants were situated in the scanner. MRI data were collected using a Siemens Trio 3T MRI scanner (12-channel head coil) and behavioral responses were collected using a four-key fiber-optic response cylinder (Current Designs, n.d.). Structural scanning was done at the beginning of the scan session (256 x 215 matrix, TR 1900 ms, TE 2.26 ms, FOV 250 x 218 mm, 176 slices, 1 mm slice thickness, 0 mm spacing) and functional scanning was done during all experimental runs (64 x 64 image matrix, TR 1800 ms, TE 31 ms, FOV 240 mm, 34 slices, 3.8 mm slice thickness). An MR-compatible LCD monitor displayed stimuli from the head of the bore, which participants viewed through a mirror mounted on the head coil. MRI data are available at: https://openneuro.org/datasets/ds004086.

    INITIAL SETUP

    [See above under STUDY PHASE and TEST PHASE for procedures performed once the participant arrived.]

    TASK ORGANIZATION

    Behavioral and imaging data were collected for each participant over the course of four (4) experimental runs. Behavioral data were used to create event.tsv files for each participant per run, indicating the onset, duration, trial type, stimulus response, correct answer, and reaction time of responses. Each experimental run lasted between 435 and 442 seconds, and participants saw an equal number of target stimuli (words shown in the study phase) and foil stimuli (novel words), at 60 words per run.
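    A per-run events file like the one described can be assembled with the standard csv module. This is a minimal sketch with hypothetical trial values; the column names follow the description above, not an actual export from this dataset.

    ```python
    import csv
    import io

    # Hypothetical trials; columns follow the description above (onset, duration,
    # trial type, response, correct answer, reaction time).
    trials = [
        {"onset": 0.0, "duration": 3.4, "trial_type": "target",
         "response": "old", "correct": "old", "response_time": 1.21},
        {"onset": 5.1, "duration": 3.4, "trial_type": "foil",
         "response": "new", "correct": "new", "response_time": 0.98},
    ]

    def write_events_tsv(rows, fh):
        """Write one run's trials as a tab-separated events file."""
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)

    buf = io.StringIO()
    write_events_tsv(trials, buf)
    print(buf.getvalue().splitlines()[0])  # header line
    ```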

    TASK DETAILS

    Stimuli were presented for 3.4 seconds, during which participants were asked to judge whether the word had been presented on the study list. Confidence ratings for those judgments, reflecting perceived oldness of the stimuli, were then collected on a scale from 1 (lowest confidence) to 4 (highest confidence). The prompt for the confidence ratings was displayed for 1.7 seconds, with each trial separated by an inter-trial interval (ITI) consisting of a fixation cross with a randomly distributed duration of 0-5.4 seconds (mean ITI = 2.7 seconds).

    ADDITIONAL DATA ACQUIRED

    Behavioral data were classified as hits, misses, correct rejections (CRs), and false alarms (FAs). Hits indicated correct judgments of “old” for words that were actually old. Misses reflected incorrect judgments of “new” for words that were actually old. Correct rejections indicated correct judgments of “new” for new words, and false alarms represented incorrect judgments of “old” for new words.
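    These four outcomes follow the standard signal-detection coding and can be expressed as a small helper; the function below is our own sketch, not the authors' analysis code.

    ```python
    def classify_response(word_is_old: bool, judged_old: bool) -> str:
        """Map one recognition trial to hit / miss / correct rejection / false alarm."""
        if word_is_old:
            return "hit" if judged_old else "miss"
        return "false_alarm" if judged_old else "correct_rejection"
    ```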

    EXPERIMENTAL LOCATION

    The study was performed in the MRI Research Facility at the Brigham Young University campus in Provo, UT.

    MISSING DATA

    The following subjects may be missing data and/or are not included in analyses for the following reasons:
    Sub-001: Ineligible; left-handedness
    Sub-005: Failure to comply; completed only 10% of entries compared to other subjects
    Sub-026: Excessive motion
    Sub-034: Failure to comply; did not provide a response other than a “1” or none
    Sub-050: Excessive motion
    Sub-052: Excessive motion
    Sub-056: Equipment error

    NOTES

    Sub-054 restarted their testing and completed the study protocol in full in the second session.

  9. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    csv, png, binAvailable download formats
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies (an open-access journal aiming at enhancing data transparency and reusability) and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
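    The long-to-wide rule described here (one row per unique film, with `fest` set to the first sample festival where the film appeared) can be sketched with a first-occurrence de-duplication on the film ID. The rows below are toy values, not the real csv files.

    ```python
    # Toy long-format rows: (film_id, sample_festival); a film appears once per
    # sample festival it was sampled from, as in the long-format table.
    long_rows = [
        ("f001", "Berlinale"),
        ("f001", "Frameline"),   # same film, second sample festival
        ("f002", "Berlinale"),
    ]

    def long_to_wide(rows):
        """Keep one entry per film; the value is the first sample festival seen,
        mirroring the rule described for the wide-format file."""
        wide = {}
        for film_id, festival in rows:
            wide.setdefault(film_id, festival)
        return wide

    print(long_to_wide(long_rows))  # {'f001': 'Berlinale', 'f002': 'Berlinale'}
    ```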


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written in R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods: “cosine” and “osa”. Cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
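    The two string-distance methods named here can be sketched in a few lines. The original scripts are in R; the Python version below is our own illustration, implementing character-bigram cosine similarity and the restricted Damerau-Levenshtein (OSA) distance.

    ```python
    import math
    from collections import Counter

    def bigrams(s: str) -> Counter:
        """Character-bigram counts of a lowercased string."""
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    def cosine(a: str, b: str) -> float:
        """Cosine similarity between the bigram profiles of two titles."""
        va, vb = bigrams(a), bigrams(b)
        dot = sum(va[k] * vb[k] for k in va)
        na = math.sqrt(sum(v * v for v in va.values()))
        nb = math.sqrt(sum(v * v for v in vb.values()))
        return dot / (na * nb) if na and nb else 0.0

    def osa(a: str, b: str) -> int:
        """Optimal string alignment distance: edits plus adjacent transpositions."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]
    ```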

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” in order to scrape the IMDb data for the identified matches. This script does so for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  10. PAN22 Authorship Analysis: Style Change Detection

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 6, 2023
    + more versions
    Tschuggnall, Michael (2023). PAN22 Authorship Analysis: Style Change Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6334244
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Tschuggnall, Michael
    Stein, Benno
    Zangerle, Eva
    Mayerl, Maximilian
    Potthast, Martin
    Description

    This is the dataset for the Style Change Detection task of PAN 2022.

    Task

    The goal of the style change detection task is to identify text positions within a given multi-author document at which the author switches. Hence, a fundamental question is the following: If multiple authors have written a text together, can we find evidence for this fact; i.e., do we have a means to detect variations in the writing style? Answering this question is among the most difficult and most interesting challenges in author identification: style change detection is the only means of detecting plagiarism in a document when no comparison texts are given; likewise, style change detection can help to uncover gift authorships, to verify a claimed authorship, or to develop new technology for writing support.

    Previous editions of the Style Change Detection task aimed at, e.g., detecting whether a document is single- or multi-authored (2018), determining the actual number of authors within a document (2019), detecting whether there was a style change between two consecutive paragraphs (2020, 2021), and locating the actual style changes (2021). Based on the progress made towards this goal in previous years, we again extend the set of challenges to entice novices and experts alike:

    Given a document, we ask participants to solve the following three tasks:

    [Task1] Style Change Basic: for a text written by two authors that contains a single style change only, find the position of this change (i.e., cut the text into the two authors’ texts on the paragraph-level),

    [Task2] Style Change Advanced: for a text written by two or more authors, find all positions of writing style change (i.e., assign all paragraphs of the text uniquely to some author out of the number of authors assumed for the multi-author document)

    [Task3] Style Change Real-World: for a text written by two or more authors, find all positions of writing style change, where style changes now not only occur between paragraphs, but at the sentence level.

    All documents are provided in English and may contain an arbitrary number of style changes, resulting from at most five different authors.

    Data

    To develop and then test your algorithms, three datasets including ground truth information are provided (dataset1 for task 1, dataset2 for task 2, and dataset3 for task 3).

    Each dataset is split into three parts:

    training set: Contains 70% of the whole dataset and includes ground truth data. Use this set to develop and train your models.

    validation set: Contains 15% of the whole dataset and includes ground truth data. Use this set to evaluate and optimize your models.

    test set: Contains 15% of the whole dataset, no ground truth data is given. This set is used for evaluation (see later).
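    A 70/15/15 split like the one described can be reproduced with a seeded shuffle. A minimal sketch, using integer IDs in place of the real problem files:

    ```python
    import random

    def split_dataset(ids, seed=0):
        """Shuffle and cut into 70% train / 15% validation / 15% test."""
        ids = list(ids)
        random.Random(seed).shuffle(ids)
        n = len(ids)
        n_train, n_val = int(n * 0.70), int(n * 0.15)
        return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))  # 70 15 15
    ```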

    You are free to use additional external data for training your models. However, we ask you to make the additional data utilized freely available under a suitable license.

    Input Format

    The datasets are based on user posts from various sites of the StackExchange network, covering different topics. We refer to each input problem (i.e., the document for which to detect style changes) by an ID, which is subsequently also used to identify the submitted solution to this input problem. We provide one folder for train, validation, and test data for each dataset, respectively.

    For each problem instance X (i.e., each input document), two files are provided:

    problem-X.txt contains the actual text, where paragraphs are denoted by for tasks 1 and 2. For task 3, we provide one sentence per paragraph (again, split by ).

    truth-problem-X.json contains the ground truth, i.e., the correct solution in JSON format. An example file is listed in the following (note that we list keys for the three tasks here):

    { "authors": NUMBER_OF_AUTHORS, "site": SOURCE_SITE, "changes": RESULT_ARRAY_TASK1 or RESULT_ARRAY_TASK3, "paragraph-authors": RESULT_ARRAY_TASK2 }

    The result for task 1 (key "changes") is represented as an array, holding a binary value for each pair of consecutive paragraphs within the document (0 if there was no style change, 1 if there was a style change). For task 2 (key "paragraph-authors"), the result is the order of authors contained in the document (e.g., [1, 2, 1] for a two-author document), where the first author is "1", the second author appearing in the document is referred to as "2", etc. Furthermore, we provide the total number of authors and the StackExchange site the texts were extracted from (i.e., the topic). The result for task 3 (key "changes") is structured similarly to the results array for task 1. However, for task 3, the changes array holds a binary value for each pair of consecutive sentences, and there may be multiple style changes in the document.

    An example of a multi-author document with a style change between the third and fourth paragraph (or sentence for task 3) could be described as follows (we only list the relevant key/value pairs here):

    { "changes": [0,0,1,...], "paragraph-authors": [1,1,1,2,...] }
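    As the example shows, the task-1 changes array is determined by the paragraph-authors array: there is a change exactly where two consecutive paragraphs have different authors. A small consistency-check helper (our own sketch, not part of the task's tooling):

    ```python
    def changes_from_authors(paragraph_authors):
        """Binary change indicator for each pair of consecutive paragraphs."""
        return [int(a != b) for a, b in zip(paragraph_authors, paragraph_authors[1:])]

    # Matches the example above: a change between the third and fourth paragraph.
    print(changes_from_authors([1, 1, 1, 2]))  # [0, 0, 1]
    ```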

    Output Format

    To evaluate the solutions for the tasks, the results have to be stored in a single file for each of the input documents and each of the datasets. Please note that we require a solution file to be generated for each input problem for each dataset. The data structure during the evaluation phase will be similar to that in the training phase, with the exception that the ground truth files are missing.

    For each given problem problem-X.txt, your software should output the missing solution file solution-problem-X.json, containing a JSON object holding the solution to the respective task. The solution for tasks 1 and 3 is an array containing a binary value for each pair of consecutive paragraphs (task 1) or sentences (task 3). For task 2, the solution is an array containing the order of authors contained in the document (as in the truth files).

    An example solution file for tasks 1 and 3 is featured in the following (note again that for task 1, changes are captured on the paragraph level, whereas for task 3, changes are captured on the sentence level):

    { "changes": [0,0,1,0,0,...] }

    For task 2, the solution file looks as follows:

    { "paragraph-authors": [1,1,2,2,3,2,...] }
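    Producing the solution files amounts to serialising one JSON object per problem. A minimal sketch, with a temporary output directory and predictions of our own invention:

    ```python
    import json
    from pathlib import Path
    from tempfile import TemporaryDirectory

    def write_solution(out_dir, problem_id, changes=None, paragraph_authors=None):
        """Write solution-problem-X.json with whichever key the task requires."""
        solution = {}
        if changes is not None:
            solution["changes"] = changes                      # tasks 1 and 3
        if paragraph_authors is not None:
            solution["paragraph-authors"] = paragraph_authors  # task 2
        path = Path(out_dir) / f"solution-problem-{problem_id}.json"
        path.write_text(json.dumps(solution))
        return path

    tmp = TemporaryDirectory()  # stand-in for the real output directory
    path = write_solution(tmp.name, 1, changes=[0, 0, 1, 0])
    print(json.loads(path.read_text()))  # {'changes': [0, 0, 1, 0]}
    ```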

  11. QASPER: NLP Questions and Evidence

    • opendatabay.com
    Updated Jun 22, 2025
    Datasimple (2025). QASPER: NLP Questions and Evidence [Dataset]. https://www.opendatabay.com/data/ai-ml/c030902d-7b02-48a2-b32f-8f7140dd1de7
    Explore at:
    Available download formats
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    QASPER: NLP Questions and Evidence Discovering Answers with Expertise By Huggingface Hub [source]

    About this dataset QASPER is a collection of over 5,000 questions and answers on a vast range of Natural Language Processing (NLP) papers, all crowdsourced from experienced NLP practitioners. Each question in the dataset is written based only on the title and abstract of the corresponding paper, providing insight into how the experts understood and parsed various materials. The answers to each query have been expertly enriched by evidence taken directly from the full text of each paper. Moreover, QASPER comes with carefully crafted fields containing relevant information, including ‘qas’ (questions and answers), ‘evidence’ (evidence provided for answering questions), title, abstract, figures_and_tables, and full_text. All this adds up to a remarkable dataset for researchers looking to gain insights into how practitioners interpret NLP topics, while providing effective validation for finding clear-cut solutions to problems encountered in the existing literature.


    How to use the dataset This guide provides instructions on how to use the QASPER dataset of Natural Language Processing (NLP) questions and evidence. The QASPER dataset contains 5,049 questions over 1,585 papers, crowdsourced from NLP practitioners. To get the most out of this dataset, we will show you how to access the questions and evidence, as well as provide tips for getting started.

    Step 1: Accessing the Dataset To access the data you can download it from Kaggle's website or through a code version control system like GitHub. Once downloaded, you will find five files: two test data sets (test.csv and validation.csv), two train data sets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and one figure data set (figures_and_tables_.json). Each .csv file contains a different dataset, with columns representing titles, abstracts, full texts, and Q&A fields with evidence for each paper mentioned in each row.

    **Step 2: Analyzing Your Data Sets** Now would be a good time to explore your datasets using basic descriptive statistics, or more advanced predictive analytics such as logistic regression or naive Bayes models, depending on the kind of analysis you would like to undertake with this dataset. You can start simply by summarizing some basic crosstabs between any two variables in your dataset (titles, abstracts, etc.). As an example, try correlating title lengths with the number of words in the corresponding abstracts, then check whether there is anything worth investigating further.
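    The title-length-versus-abstract-length probe suggested here needs only a Pearson correlation. A stdlib sketch with made-up records standing in for the real csv rows:

    ```python
    from math import sqrt

    # Hypothetical (title, abstract) pairs standing in for rows of the dataset.
    papers = [
        ("Attention Is Enough", "A short abstract about attention models."),
        ("A Much Longer Title About Question Answering",
         "A longer abstract that spends several sentences describing the corpus, "
         "the annotation process, and the evaluation metrics in some detail."),
    ]

    def pearson(xs, ys):
        """Pearson correlation coefficient of two equal-length sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    title_lengths = [len(title) for title, _ in papers]
    abstract_words = [len(abstract.split()) for _, abstract in papers]
    print(pearson(title_lengths, abstract_words))
    ```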

    **Step 3: Define Your Research Questions & Perform Further Analysis** Once satisfied with your initial exploration, it is time to dig deeper into the underlying relationships among the different variables comprising your main documents. One way would be to use text-mining technologies such as topic modeling and machine-learning techniques, or automated processes that may help summarize any underlying patterns. Yet another approach could involve filtering terms relevant to a specific research hypothesis, then processing such terms via web crawlers, search engines, document-similarity algorithms, etc.

    Finally, once all relevant parameters are defined, analyzed, and searched, it would make sense to draw preliminary conclusions linking them back together before conducting replicable tests to ensure reproducible results.

    Research Ideas
    - Developing AI models to automatically generate questions and answers from paper titles and abstracts.
    - Enhancing machine learning algorithms by combining the answers with the evidence provided in the dataset to find relationships between papers.
    - Creating online forums for NLP practitioners that use questions from this dataset to spark discussion within the community.

    License

    CC0

    Original Data Source: QASPER: NLP Questions and Evidence

  12. Data from: Event conceptualisation and aspect in L2 English and Persian: An...

    • researchdata.se
    • demo.researchdata.se
    Updated Nov 7, 2019
    Somaje Abdollahian Barough (2019). Event conceptualisation and aspect in L2 English and Persian: An application of the Heidelberg-Paris model [Dataset]. http://doi.org/10.5878/wz3s-wt38
    Explore at:
    (10147845)Available download formats
    Dataset updated
    Nov 7, 2019
    Dataset provided by
    Stockholm University
    Authors
    Somaje Abdollahian Barough
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Time period covered
    Aug 1, 2010 - Jul 31, 2013
    Area covered
    Sweden, Iran (Islamic Republic of), United States, United Kingdom
    Description

    The data have been used in an investigation for a PhD thesis in English Linguistics on similarities and differences in the use of the progressive aspect in two different language systems, English and Persian, both of which have the grammaticalised progressive. It is an application of the Heidelberg-Paris model of investigation into the impact of the progressive aspect on event conceptualisation. It builds on an analysis of single event descriptions at sentence level and re-narrations of a film clip at discourse level, as presented in von Stutterheim and Lambert (2005) DOI: 10.1515/9783110909593.203; Carroll and Lambert (2006: 54–73) http://libris.kb.se/bib/10266700; and von Stutterheim, Andermann, Carroll, Flecken & Schmiedtová (2012) DOI: 10.1515/ling-2012-0026. However, there are system-based typological differences between these two language systems due to the absence/presence of the imperfective-perfective categories, respectively. Thus, in addition to the description of the status of the progressive aspect in English and Persian and its impact on event conceptualisation, an important part of the investigation is the analysis of the L2 English speakers’ language production as the progressives in the first languages, L1s, exhibit differences in their principles of use due to the typological differences. The question of importance in the L2 context concerns the way they conceptualise ongoing events when the language systems are different, i.e. whether their language production is conceptually driven by their first language Persian.

    The data consist of two data sets as the study includes two linguistic experiments, Experiment 1 and Experiment 2. The data for both experiments were collected by email. Separate forms of instructions, and language background questions were prepared for the six different informant groups, i.e. three speaker groups and two experimental tasks, as well as a Nelson English test https://www.worldcat.org/isbn/9780175551972 on the proficiency of English for Experiment 2 was selected and modified for the L2 English speaker group. Nelson English tests are published in Fowler, W.S. & Coe, N. (1976). Nelson English tests. Middlesex: Nelson and Sons. The test battery provides tests for all levels of proficiency. The graded tests are compiled in ten sets from elementary to very advanced level. Each set includes four graded tests, i.e. A, B, C, and D, resulting in 40 separate tests, each with 50 multiple-choice questions. The test entitled 250C was selected for this project. It belongs to the slot 19 out of the 40 slots of the total battery. The multiple-choice questions were checked with a native English professional and 5 inadequate questions relevant for pronunciation were omitted. In addition, a few modifications of the grammar questions were made, aiming at including questions that involve a contrast for the Persian L2 English learner with respect to the grammars of the two languages. The omissions and modifications provide an appropriate grammar test for very advanced Iranian learners of L2 English who have learnt the language in a classroom setting. The data set collected from the informants are characterised as follows: The data from Experiment 1 functions as the basis for the description of the progressive aspect in English, Persian and L2 English, while the data from Experiment 2 is the basis for the analysis of its use in a long stretch of discourse/language production for the three speaker groups. 
The parameters selected for the investigation comprised, first, phasal decomposition, which involves the use of the progressive in unrelated single motion events and narratives, and uses of begin/start in narratives. Second, granularity in narratives, which relates to the overall amount of language production in narratives. Third, event boundedness (encoded in the use of 2-state verbs and 1-state verbs with an endpoint adjunct) partly in single motion events and partly in temporal shift in narratives. Temporal shift is defined as follows: Events in the narrative which are bounded shift the time line via a right boundary; events with a left boundary also shift the time line, even if they are unbounded. Fourth, left boundary, comprising the use of begin/start and try in narratives. Finally, temporal structuring, which involves the use of bounded versus unbounded events preceding the temporal adverbial then in narratives (the tests are described in the documentation files aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.docx and aspectL2English_Persian_Exp2Chi-square-tests-in-SPSS.rtf). In both experiments the participants watched a video, one relevant for single event descriptions, the other relevant for re-narration of a series of events. Thus, two different videos with stimuli for the different kinds of experimental tasks were used. For Experiment 1, a video of 63 short film clips presenting unrelated single events was provided by Professor Christiane von Stutterheim, Heidelberg University Language & Cognition (HULC) Lab, Heidelberg University, Germany, https://www.hulclab.eu/. For Experiment 2, an animation called Quest, produced by Thomas Stellmach in 1996, was used. It is available online at http://www.youtube.com/watch?v=uTyev6OaThg. Both stimuli have been used in previous investigations on different languages by the research groups associated with the HULC Lab. 
The informants were asked to describe the events seen in the stimuli videos, to record their language production, and to send it to the researcher. For Experiment 2, most of the L1 English data were provided by Prof. von Stutterheim, Heidelberg University, who made available 34 re-narrations of the film Quest in English; 24 of them were selected for the present investigation. The project used six different informant groups, i.e. fully separate groups for the two experiments. The data from single event descriptions in Experiment 1 were analysed quantitatively in Excel. The re-narrations of Experiment 2 were coded in NVivo 10 (2014), providing frequencies of various parametrical features (Ltd, Nv. (2014). NVivo QSR International Pty Ltd, Version 10. Doncaster, Australia: QSR International). The numbers from NVivo 10 were analysed statistically in Excel and SPSS (2017). The tools are appropriate for this research: Excel suits the smaller data load in Experiment 1, while NVivo 10 is practical for the large amount of data and parameters in Experiment 2. Notably, NVivo 10 enabled the analysis of the three data sets to take place in the same manner once the categories of analysis and parameters had been defined under different nodes. As the results were to be extracted in the same fashion from each data set, the L1 English data received from Heidelberg for Experiment 2 were re-analysed according to the criteria employed in this project. Yet, the analysis in the project conforms to the criteria used earlier in the model.
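    The chi-square tests referenced in the documentation files can be reproduced outside SPSS. A stdlib sketch of the 2x2 Pearson chi-square statistic (without Yates' continuity correction), applied to an invented contingency table, e.g. bounded versus unbounded events in two speaker groups:

    ```python
    def chi_square_2x2(a, b, c, d):
        """Pearson chi-square statistic for a 2x2 contingency table
        [[a, b], [c, d]], without continuity correction."""
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        expected = [row1 * col1 / n, row1 * col2 / n,
                    row2 * col1 / n, row2 * col2 / n]
        observed = [a, b, c, d]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Invented counts, purely to illustrate the computation.
    print(round(chi_square_2x2(10, 20, 20, 10), 3))
    ```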

  13. Background music and cognitive task performance: systematic review dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 29, 2023
    Cite
    Yiting Cheah (2023). Background music and cognitive task performance: systematic review dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6301060
    Explore at:
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Yiting Cheah
    Hoo Keat Wong
    Michael Spitzer
    Eduardo Coutinho
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the raw data used for a systematic review of the impact of background music on cognitive task performance (Cheah et al., 2022). Our intention is to facilitate future updates to this work.

    Contents description
    This repository contains eight Microsoft Excel files, covering the synthesised data for each of the six cognitive domains analysed in the review, as well as task difficulty and population characteristics:

    • raw-data-attention
    • raw-data-inhibition
    • raw-data-language
    • raw-data-memory
    • raw-data-thinking
    • raw-data-processing-speed
    • raw-data-task-difficulty
    • raw-data--population

    Files description

    Tabs organisation
    The files pertaining to each cognitive domain include individual tabs for each cognitive task analysed (cf. Figure 2 in the original paper for the list of cognitive tasks). The file with the population characteristics data also contains separate tabs for each characteristic (extraversion, music training, gender, and working memory capacity).

    Tabs contents
    In all files and tabs, each row corresponds to the data of a test. The same article can have more than one row if it reports multiple tests. For instance, the study by Cassidy and MacDonald (2007; cf. Memory.xlsx, tab: Memory-all) contains two experiments (immediate and delayed free recall), each with multiple tests (immediate free recall: tests 25 – 32; delayed free recall: tests 58 – 61). Each test (one per row) in this experiment pertains to comparisons between conditions where the background music has different levels of arousal, between groups of participants with different extraversion levels, between different task materials (words or paragraphs), and between different combinations of the previous (e.g., high-arousal music vs. silence among extraverts completing an immediate free recall task involving paragraphs; cf. test 30). The columns are organised as follows:

    • "TESTS": the index of the test in a particular tab (for easy reference);
    • "ID": abbreviation of the cognitive tasks involved in a specific experiment (see glossary for meaning);
    • "REFERENCE": the article the data was taken from (see main publications for the list of articles);
    • "CONDITIONS": an abbreviated description of the music condition of a given test;
    • "MEANS (music)": the average performance across all participants in a given experiment with background music;
    • "MEANS (silence)": the average performance across all participants in a given experiment without background music.

    Then, in horizontal arrangement, we also include groups of two columns that break down specific comparisons related to each test (i.e., all tests comparing the same two types of condition, e.g., L-BgM vs I-BgM, will appear under the same set of columns). For each one, we indicate the mean difference between the respective conditions ("MD" column) and the direction of effect ("Standard Metric" column). Each file also contains a "Glossary" tab that explains all the abbreviations used in each document.

    Bibliography
    Cheah, Y., Wong, H. K., Spitzer, M., & Coutinho, E. (2022). Background music and cognitive task performance: A systematic review of task, music and population impact. Music & Science, 5(1), 1-38. https://doi.org/10.1177/20592043221134392
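    A minimal pandas sketch of the tab layout just described. Only the column names follow the files; the rows below are fabricated placeholders, not values from the review:

```python
import pandas as pd

# Fabricated example rows mimicking the tab layout described above;
# the numbers are placeholders, not data from the review.
df = pd.DataFrame({
    "TESTS": [25, 26],
    "ID": ["IFR", "IFR"],
    "REFERENCE": ["Cassidy & MacDonald (2007)"] * 2,
    "CONDITIONS": ["HA-BgM vs silence", "LA-BgM vs silence"],
    "MEANS (music)": [12.4, 14.1],
    "MEANS (silence)": [13.0, 13.5],
})

# The "MD" column is the mean difference between the two conditions:
# performance with music minus performance in silence.
df["MD"] = df["MEANS (music)"] - df["MEANS (silence)"]
print(df[["TESTS", "CONDITIONS", "MD"]])
```

    The same subtraction applies per comparison-column group in the real files; a positive MD indicates better average performance with background music for that test.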

  14. ERA5 hourly data on single levels from 1940 to present

    • cds.climate.copernicus.eu
    • arcticdata.io
    grib
    Updated Jun 30, 2025
    + more versions
    Cite
    ECMWF (2025). ERA5 hourly data on single levels from 1940 to present [Dataset]. http://doi.org/10.24381/cds.adbb2d47
    Explore at:
    gribAvailable download formats
    Dataset updated
    Jun 30, 2025
    Dataset provided by
    European Centre for Medium-Range Weather Forecastshttp://ecmwf.int/
    Authors
    ECMWF
    License

    https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdfhttps://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf

    Time period covered
    Jan 1, 1940 - Jun 24, 2025
    Description

    ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, all of which benefits the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; users are notified if this occurs. The dataset presented here is a regridded subset of the full ERA5 dataset at native resolution. It is online on spinning disk, which should ensure fast and easy access, and should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data at native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and on single levels (atmospheric, ocean-wave and land-surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
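    This entry can be retrieved programmatically through the Climate Data Store API. A minimal sketch, assuming the `cdsapi` package and a configured CDS API key; the variable and date choices below are illustrative, not part of the dataset description:

```python
# Illustrative CDS API request for a small ERA5 single-levels subset.
# The dataset id matches this catalogue entry; the field selection is an
# assumption for demonstration purposes.
request = {
    "product_type": "reanalysis",
    "variable": "2m_temperature",
    "year": "2020",
    "month": "01",
    "day": "01",
    "time": ["00:00", "12:00"],
    "format": "grib",
}

# With the `cdsapi` package installed and an API key in ~/.cdsapirc:
#   import cdsapi
#   cdsapi.Client().retrieve("reanalysis-era5-single-levels", request,
#                            "era5_t2m.grib")
print(sorted(request))
```

    Keeping requests small (a few variables, hours and days at a time) is the usual way to stay within CDS queue limits.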

  15. Replay-Attack

    • zenodo.org
    • data.niaid.nih.gov
    Updated Mar 6, 2023
    Cite
    Ivana Chingovska; André Anjos; André Anjos; Sébastien Marcel; Sébastien Marcel; Ivana Chingovska (2023). Replay-Attack [Dataset]. http://doi.org/10.34777/cwcg-7r82
    Explore at:
    Dataset updated
    Mar 6, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Ivana Chingovska; André Anjos; André Anjos; Sébastien Marcel; Sébastien Marcel; Ivana Chingovska
    Description

    Replay-Attack is a dataset for face recognition and presentation attack detection (anti-spoofing). The dataset consists of 1300 video clips of photo and video presentation attack (spoofing attacks) to 50 clients, under different lighting conditions.

    Spoofing Attacks Description

    The 2D face spoofing attack database consists of 1,300 video clips of photo and video attack attempts of 50 clients, under different lighting conditions.

    The data is split into 4 sub-groups comprising:

    1. Training data ("train"), to be used for training your anti-spoof classifier;
    2. Development data ("devel"), to be used for threshold estimation;
    3. Test data ("test"), with which to report error figures;
    4. Enrollment data ("enroll"), that can be used to verify spoofing sensitivity on face detection algorithms.

    Clients that appear in one of the data sets (train, devel or test) do not appear in any other set.

    Database Description

    All videos are generated by either having a (real) client trying to access a laptop through a built-in webcam or by displaying a photo or a video recording of the same client for at least 9 seconds. The webcam produces colour videos with a resolution of 320 pixels (width) by 240 pixels (height). The movies were recorded on a Macbook laptop using the QuickTime framework (codec: Motion JPEG) and saved into ".mov" files. The frame rate is about 25 Hz. Besides the native support on Apple computers, these files are *easily* readable using mplayer, ffmpeg or any other video utilities available under Linux or MS Windows systems.

    Real client accesses as well as data collected for the attacks are taken under two different lighting conditions:

    * **controlled**: The office light was turned on, blinds are down, background is homogeneous;
    * **adverse**: Blinds up, more complex background, office lights are out.

    To produce the attacks, high-resolution photos and videos from each client were taken under the same conditions as in their authentication sessions, using a Canon PowerShot SX150 IS camera, which records both 12.1 Mpixel photographs and 720p high-definition video clips. The way to perform the attacks can be divided into two subsets: the first subset is composed of videos generated using a stand to hold the client biometry ("fixed"). For the second set, the attacker holds the device used for the attack with their own hands. In total, 20 attack videos were registered for each client, 10 for each of the attacking modes just described:

    4 x mobile attacks using an iPhone 3GS screen (with resolution 480x320 pixels) displaying:

    • 1 x mobile photo/controlled
    • 1 x mobile photo/adverse
    • 1 x mobile video/controlled
    • 1 x mobile video/adverse

    4 x high-resolution screen attacks using an iPad (first generation, with a screen resolution of 1024x768 pixels) displaying:

    • 1 x high-resolution photo/controlled
    • 1 x high-resolution photo/adverse
    • 1 x high-resolution video/controlled
    • 1 x high-resolution video/adverse

    2 x hard-copy print attacks (produced on a Triumph-Adler DCC 2520 color laser printer) occupying the whole available printing surface on A4 paper for the following samples:

    • 1 x high-resolution print of photo/controlled
    • 1 x high-resolution print of photo/adverse

    The 1300 real-accesses and attacks videos were then divided in the following way:

    • Training set: contains 60 real-accesses and 300 attacks under different lighting conditions;
    • Development set: contains 60 real-accesses and 300 attacks under different lighting conditions;
    • Test set: contains 80 real-accesses and 400 attacks under different lighting conditions;

    Face Locations

    We also provide face locations automatically annotated by a cascade of classifiers based on a variant of Local Binary Patterns (LBP) referred to as the Modified Census Transform (MCT) [Face Detection with the Modified Census Transform, Froba, B. and Ernst, A., 2004, IEEE International Conference on Automatic Face and Gesture Recognition, pp. 91-96]. The automatic face localisation procedure succeeds in more than 99% of the total number of frames acquired. This means that less than 1% of the total set of frames across all videos lack annotated faces. User algorithms must account for this fact.

    Protocol for Licit Biometric Transactions

    It is possible to measure the performance of baseline face recognition systems on the 2D Face spoofing database and evaluate how well the attacks pass such systems or, conversely, how robust the systems are to attacks. Here we describe how to use the data available in the enrolment set to create a background model and client models, and how to perform scoring.

    1. Universal Background Model (UBM): To generate the UBM, subselect the training-set client videos from the enrollment videos. There should be 2 per client, which means you get 30 videos, each with 375 frames to create the model;
    2. Client models: To generate client models, use the enrollment data for clients at the development and test groups. There should be 2 videos per client (one for each light condition) once more. At the end of the enrollment procedure, the development set must have 1 model for each of the 15 clients available in that set. Similarly, for the test set, 1 model for each of the 20 clients available;
    3. For a simple baseline verification, generate scores **exhaustively** for all videos from the development and test **real-accesses** respectively, but **without** intermixing across development and test sets. The scores generated against matched client videos and models (within the subset, i.e. development or test) should be considered true client accesses, while all others impostors;
    4. If you are looking for a single number to report on the performance do the following: exclusively using the scores from the development set, tune your baseline face recognition system on the EER of the development set and use this threshold to find the HTER on the test set scores.
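    The thresholding step in item 4 can be sketched as follows. This is an illustrative NumPy implementation with toy scores, not code distributed with the database: tune the threshold at the EER on the development scores, then report the HTER (the average of false acceptance and false rejection rates) on the test scores at that fixed threshold.

```python
import numpy as np

def eer_threshold(genuine, impostor):
    """Return the score threshold where FAR and FRR are closest (approx. EER)."""
    best_t, best_gap = 0.0, np.inf
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)   # impostors wrongly accepted
        frr = np.mean(genuine < t)     # genuine accesses wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_t = abs(far - frr), t
    return best_t

def hter(genuine, impostor, t):
    """Half Total Error Rate at a fixed threshold: (FAR + FRR) / 2."""
    return (np.mean(impostor >= t) + np.mean(genuine < t)) / 2.0

# Toy, well-separated scores: the development set tunes the threshold,
# the test set reports the HTER at that threshold.
rng = np.random.default_rng(0)
dev_gen, dev_imp = rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)
tst_gen, tst_imp = rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)
t = eer_threshold(dev_gen, dev_imp)
print(hter(tst_gen, tst_imp, t))
```

    The essential point is that the threshold is fixed on the development set only; the test set is touched once, to report the final HTER.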

    Protocols for Spoofing Attacks

    Attack protocols are used to evaluate the (binary classification) performance of counter-measures to spoof attacks. The database can be split into 6 different protocols according to the type of device used to generate the attack: print, mobile (phone), high-definition (tablet), photo, video or grand test (all types). Furthermore, subsetting can be achieved on top of the previous 6 groups by classifying attacks as performed with the attacker's bare hands or using a fixed support. This classification scheme makes up a total of 18 protocols that can be used for studying the performance of counter-measures to 2D face spoofing attacks. The table below details the number of video clips in each protocol.

    Acknowledgements

    If you use this database, please cite the following publication:

    I. Chingovska, A. Anjos, S. Marcel,"On the Effectiveness of Local Binary Patterns in Face Anti-spoofing"; IEEE BIOSIG, 2012.
    https://ieeexplore.ieee.org/document/6313548
    http://publications.idiap.ch/index.php/publications/show/2447

  16. CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 4, 2024
    + more versions
    Cite
    Vilhelm von Ehrenheim (2024). CompanyKG Dataset V2.0: A Large-Scale Heterogeneous Graph for Company Similarity Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7957401
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    Armin Catovic
    Mark Granroth-Wilding
    Lele Cao
    Vilhelm von Ehrenheim
    Richard Anselmo Stahl
    Drew McCornack
    Dhiana Deva Cavacanti Rocha
    Description

    CompanyKG is a heterogeneous graph consisting of 1,169,931 nodes and 50,815,503 undirected edges, with each node representing a real-world company and each edge signifying a relationship between the connected pair of companies.

    Edges: We model 15 different inter-company relations as undirected edges, each of which corresponds to a unique edge type. These edge types capture various forms of similarity between connected company pairs. Associated with each edge of a certain type, we calculate a real-numbered weight as an approximation of the similarity level of that type. It is important to note that the constructed edges do not represent an exhaustive list of all possible edges due to incomplete information. Consequently, this leads to a sparse and occasionally skewed distribution of edges for individual relation/edge types. Such characteristics pose additional challenges for downstream learning tasks. Please refer to our paper for a detailed definition of edge types and weight calculations.

    Nodes: The graph includes all companies connected by edges defined previously. Each node represents a company and is associated with a descriptive text, such as "Klarna is a fintech company that provides support for direct and post-purchase payments ...". To comply with privacy and confidentiality requirements, we encoded the text into numerical embeddings using four different pre-trained text embedding models: mSBERT (multilingual Sentence BERT), ADA2, SimCSE (fine-tuned on the raw company descriptions) and PAUSE.

    Evaluation Tasks. The primary goal of CompanyKG is to develop algorithms and models for quantifying the similarity between pairs of companies. In order to evaluate the effectiveness of these methods, we have carefully curated three evaluation tasks:

    Similarity Prediction (SP). To assess the accuracy of pairwise company similarity, we constructed the SP evaluation set comprising 3,219 pairs of companies that are labeled either as positive (similar, denoted by "1") or negative (dissimilar, denoted by "0"). Of these pairs, 1,522 are positive and 1,697 are negative.

    Competitor Retrieval (CR). Each sample contains one target company and one of its direct competitors. The set contains 76 distinct target companies, each of which has 5.3 annotated competitors on average. For a given target company A with N direct competitors in this CR evaluation set, we expect a competent method to retrieve all N competitors when searching for companies similar to A.

    Similarity Ranking (SR) is designed to assess the ability of any method to rank candidate companies (numbered 0 and 1) based on their similarity to a query company. Paid human annotators, with backgrounds in engineering, science, and investment, were tasked with determining which candidate company is more similar to the query company. It resulted in an evaluation set comprising 1,856 rigorously labeled ranking questions. We retained 20% (368 samples) of this set as a validation set for model development.

    Edge Prediction (EP) evaluates a model's ability to predict future or missing relationships between companies, providing forward-looking insights for investment professionals. The EP dataset, derived (and sampled) from new edges collected between April 6, 2023, and May 25, 2024, includes 40,000 samples, with edges not present in the pre-existing CompanyKG (a snapshot up until April 5, 2023).

    Background and Motivation

    In the investment industry, it is often essential to identify similar companies for a variety of purposes, such as market/competitor mapping and Mergers & Acquisitions (M&A). Identifying comparable companies is a critical task, as it can inform investment decisions, help identify potential synergies, and reveal areas for growth and improvement. The accurate quantification of inter-company similarity, also referred to as company similarity quantification, is the cornerstone to successfully executing such tasks. However, company similarity quantification is often a challenging and time-consuming process, given the vast amount of data available on each company, and the complex and diversified relationships among them.

    While there is no universally agreed definition of company similarity, researchers and practitioners in the PE industry have adopted various criteria to measure similarity, typically reflecting the companies' operations and relationships. These criteria can embody one or more dimensions such as industry sectors, employee profiles, keywords/tags, customer reviews, financial performance, co-appearance in news, and so on. Investment professionals usually begin with a limited number of companies of interest (a.k.a. seed companies) and require an algorithmic approach to expand their search to a larger list of companies for potential investment.

    In recent years, transformer-based Language Models (LMs) have become the preferred method for encoding textual company descriptions into vector-space embeddings. Then companies that are similar to the seed companies can be searched in the embedding space using distance metrics like cosine similarity. The rapid advancements in Large LMs (LLMs), such as GPT-3/4 and LLaMA, have significantly enhanced the performance of general-purpose conversational models. These models, such as ChatGPT, can be employed to answer questions related to similar company discovery and quantification in a Q&A format.

    However, graph is still the most natural choice for representing and learning diverse company relations due to its ability to model complex relationships between a large number of entities. By representing companies as nodes and their relationships as edges, we can form a Knowledge Graph (KG). Utilizing this KG allows us to efficiently capture and analyze the network structure of the business landscape. Moreover, KG-based approaches allow us to leverage powerful tools from network science, graph theory, and graph-based machine learning, such as Graph Neural Networks (GNNs), to extract insights and patterns to facilitate similar company analysis. While there are various company datasets (mostly commercial/proprietary and non-relational) and graph datasets available (mostly for single link/node/graph-level predictions), there is a scarcity of datasets and benchmarks that combine both to create a large-scale KG dataset expressing rich pairwise company relations.
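    The embedding-space search described above can be sketched in a few lines of NumPy. The random vectors here are placeholders standing in for the provided mSBERT/ADA2/SimCSE/PAUSE node embeddings:

```python
import numpy as np

def top_k_similar(embeddings, seed_idx, k=3):
    """Rank companies by cosine similarity to a seed company's embedding."""
    # L2-normalise rows so the dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[seed_idx]
    sims[seed_idx] = -np.inf          # exclude the seed company itself
    return np.argsort(sims)[::-1][:k]

# Placeholder embeddings: 1000 "companies", 64 dimensions each.
rng = np.random.default_rng(42)
emb = rng.normal(size=(1000, 64))
print(top_k_similar(emb, seed_idx=0, k=5))
```

    In practice the seed index would correspond to a node in CompanyKG and the embedding matrix to one of the four released encodings; graph-based methods such as GNNs can then refine this purely textual ranking with the edge information.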

    Source Code and Tutorial: https://github.com/llcresearch/CompanyKG2

    Paper: to be published

  17. AeroSonic YPAD-0523: Labelled audio dataset for acoustic detection and...

    • zenodo.org
    csv, json, txt, zip
    Updated Sep 23, 2023
    + more versions
    Cite
    Blake Downward; Blake Downward (2023). AeroSonic YPAD-0523: Labelled audio dataset for acoustic detection and classification of aircraft [Dataset]. http://doi.org/10.5281/zenodo.8004081
    Explore at:
    zip, csv, json, txtAvailable download formats
    Dataset updated
    Sep 23, 2023
    Dataset provided by
    Zenodo
    Authors
    Blake Downward; Blake Downward
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AeroSonic YPAD-0523: Labelled audio dataset for acoustic detection and classification of aircraft
    Version 0.2 (June 2023)

    Publication
    If using this data in an academic work, please reference the DOI and version.

    Description
    AeroSonic:YPAD-0523 is a specialised dataset of ADS-B labelled audio clips for research in the fields of aircraft noise attribution and machine listening, particularly acoustic detection and classification of low-flying aircraft. Audio files in this dataset were recorded at locations in close proximity to a flight path approaching or departing Adelaide International Airport’s (ICAO code: YPAD) primary runway, 05/23. Recordings are initially labelled from radio (ADS-B) messages received from the aircraft overhead. Each recording is then human verified, and trimmed to the best (subjective) 20 seconds of audio in which the target aircraft is audible.

    A total of 1,890 audio clips are balanced across two top-level classes, "Aircraft" (3.57 hours: 642 20-second recordings) and "Silence" (3.37 hours: 1,248 5- and 10-second recordings). The aircraft class is then further broken down into four unbalanced subclasses which broadly describe an aircraft's structure and propulsion mechanism. A variety of additional "airframe" features are provided to give researchers finer control of the dataset, and the opportunity to develop ontologies specific to their own use case.

    For convenience, the dataset has been split into training (6.28 hours) and testing (0.66 hours) subsets, with the training set further split into 10 folds for cross-validation. Care has been taken to ensure the class distribution for each subset and fold does not significantly deviate from the overall distribution.

    Researchers may find applications for this dataset in a number of fields; particularly aircraft noise isolation and monitoring in an urban environment, development of passive acoustic systems to assist radar technology, and understanding the sources of aircraft noise to help manufacturers design less-noisy aircraft.

    Audio data
    ADS-B (Automatic Dependent Surveillance–Broadcast) messages transmitted directly from aircraft are used to automatically capture and label audio recordings. A 60-second recording is triggered when an aircraft transmits a message indicating it is within a specified distance of the recording device. The file is labelled with a unique ICAO identifier code for the aircraft, as well as its last recorded altitude, date and time. The recording is then human verified and trimmed to 20 seconds - with the aircraft audible for the duration of the clip.

    A balanced collection of urban background noise without aircraft (silence) is included with the dataset as a means of distinguishing location specific environmental noises from aircraft noises. 10-second background noise, or “silence” recordings are triggered only when there are no aircraft broadcasting that they are within a specified distance of the recording device. These "silence" recordings are also human verified to ensure no aircraft noise is present. The dataset contains 1,180 10-second clips, and 68 5-second clips of silence/ambient background noise.

    Location information
    Recordings have been collected from three (3) locations. GPS coordinates for each location are provided in the "locations.json" file. To protect privacy, coordinates are given for a road or public space near the recording device rather than its exact location.

    Location: 0
    Situated in a suburban environment approximately 15.5km north-east of the start/end of the runway. For Adelaide, typical south-westerly winds bring most arriving aircraft past this location on approach. Winds from the north or east will cause aircraft to take-off to the north-east, however not all departing aircraft will maintain a course to trigger a recording at this location. The "trigger distance" for this location is set for 3km to ensure small/slower aircraft and large/faster aircraft are captured within a sixty-second recording.

    "Silence" or ambient background noises at this location include; cars, motorbikes, light-trucks, garbage trucks, power-tools, lawn mowers, construction sounds, sirens, people talking, dogs barking and a wide range of Australian native birds (New Holland Honeyeaters, Wattlebirds, Australian Magpies, Australian Ravens, Spotted Doves, Rainbow Lorikeets and others).

    Location: 1
    Situated approximately 500m south-east of the south-eastern end of the runway, this location is near recreational areas (golf course, skate park and parklands), with a busy road/highway in between the location and the runway. This location features heavy wind and road traffic, as well as people talking, walking and riding, and birds such as the Australian Magpie and Noisy Miner. The trigger distance for this location is set to 1km. Due to their low altitude, aircraft are louder but audible for a shorter time compared to "Location 0".

    Location: 2
    As an alternative to "Location 1", this location is situated approximately 950m south-east of the end of the runway. This location has a wastewater facility to the north, a residential area to the south and a popular beach to the west. This location offers greater wind protection and further distance from airport and highway noises. Ambient background sounds feature close proximity cars and motorbikes, cyclists, people walking, nail guns and other construction sounds, as well as the local birds mentioned above.


    Aircraft metadata
    Supplementary "airframe" metadata for all aircraft has been gathered to help broaden the research possibilities of this dataset. Airframe information was collected and cross-checked from a number of open-source databases. The author has no reason to believe any significant errors exist in the "aircraft_meta" files; however, future versions of this dataset plan to obtain aircraft information directly from ICAO (International Civil Aviation Organization) to ensure a single, verifiable source of information.

    Class/subclass ontology (minutes of recordings)

    0. no aircraft (202)
    0: no aircraft (202)

    1. aircraft (214)
    1: piston-propeller aeroplane (12)
    2: turbine-propeller aeroplane (37)
    3: turbine-fan aeroplane (163)
    4: rotorcraft (1.6)

    The subclasses are a combination of the "airframe" and "engtype" features. Piston and Turboshaft rotorcraft/helicopters have been combined into a single subclass due to the small number of samples.

    Data splits
    Audio recordings have been split into training (90.5%) and test (9.5%) sets. The training set has further been split into 10 folds, giving researchers a common split for 10-fold cross-validation, ensuring reproducibility and comparable results. Data leakage into the test set has been avoided by ensuring recordings are disjoint from the training set by time and location, meaning samples in the test set for a particular location were recorded after any samples included in the training set for that location.


    Labelled data
    The entire dataset (training and test) is referenced and labelled in the "sample_meta.csv" file. Each row contains a reference to a unique recording and all the labels and features associated with that recording and aircraft.

    Alternatively, these labels can be derived directly from the filename of the sample (see below), plus a JSON file which accompanies each aircraft sample. The "aircraft_meta.csv" and "aircraft_meta.json" files can be used to reference aircraft specific features - such as; manufacturer, engine type, ICAO type designator etc. (see below for all 14 airframe features).

    File naming convention
    Audio samples are in WAV format, and metadata for aircraft recordings are stored in JSON files. Both files share the same name, only differing by their file extension.

    Basic Convention

    “Aircraft ID + Date + Time + Location ID + Microphone ID”

    “XXXXXX_YYYY-MM-DD_hh-mm-ss_X_X”

    Sample with aircraft

    {hex_id} _ {date} _ {time} _ {location_id} _ {microphone_id} . {file_ext}

    7C7CD0_2023-05-09_12-42-55_2_1.wav
    7C7CD0_2023-05-09_12-42-55_2_1.json

    Sample without aircraft

    “Silence” files are denoted with six (6) leading zeros rather than an aircraft hex code. All relevant metadata for “silence” samples are contained in the audio filename, and again in the accompanying “sample_meta.csv”.

    000000_{date}_{time}_{location_id}_{microphone_id}.{file_ext}

    000000_2023-05-09_12-30-55_2_1.wav
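    The filename convention above can be unpacked mechanically. A minimal sketch (the helper name and returned field names are my own; the field order follows the convention as documented):

```python
from datetime import datetime

def parse_sample_filename(filename):
    """Split a sample filename into its metadata fields.

    Convention: {hex_id}_{date}_{time}_{location_id}_{microphone_id}.{ext}
    A hex_id of '000000' marks a "silence" (no-aircraft) recording.
    """
    stem, _, ext = filename.rpartition(".")
    hex_id, date, time, location_id, microphone_id = stem.split("_")
    return {
        "hex_id": hex_id,
        "is_silence": hex_id == "000000",
        "timestamp": datetime.strptime(f"{date}_{time}", "%Y-%m-%d_%H-%M-%S"),
        "location_id": int(location_id),
        "microphone_id": int(microphone_id),
        "ext": ext,
    }

meta = parse_sample_filename("7C7CD0_2023-05-09_12-42-55_2_1.wav")
print(meta["hex_id"], meta["is_silence"], meta["location_id"])  # 7C7CD0 False 2
```

    The same parse works for "silence" files, which simply yield `is_silence=True`.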

    Columns/Labels
    (found in sample_meta.csv, aircraft_meta.csv/json and aircraft recording JSON files)

    train-test: Train-test split (train, test)

    fold: Digit from 0 to 9 assigning each training recording to one of the 10 cross-validation folds (test recordings have no fold)

    filename: The filename of the audio recording

    date: Date of the recording

    time: Time of the recording

    duration: Length of the recording (in seconds)

    location_id: ID for the location of the recording

    microphone_id: ID of the microphone used

    hex_id: Unique ICAO 24-bit address for the aircraft
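    The "train-test" and "fold" columns above are all that is needed to reproduce the prescribed split. A minimal sketch using a toy stand-in for sample_meta.csv rows (the real file has one row per recording; column names are taken from the listing above):

```python
# Toy stand-in for rows of sample_meta.csv (real file: one row per
# recording, with the columns listed above).
rows = [
    {"filename": f"rec_{i}.wav",
     "train-test": "train" if i < 10 else "test",
     "fold": i if i < 10 else None}
    for i in range(12)
]

train = [r for r in rows if r["train-test"] == "train"]
test = [r for r in rows if r["train-test"] == "test"]

# One round of the prescribed 10-fold cross-validation:
# hold out fold k for validation, fit on the remaining nine folds.
k = 0
val_fold = [r for r in train if r["fold"] == k]
fit_folds = [r for r in train if r["fold"] != k]
print(len(fit_folds), len(val_fold), len(test))  # 9 1 2
```

    Iterating k over 0-9 gives the full 10-fold procedure; the test rows are never touched until final evaluation.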

  18. 2010 American Community Survey: CP02 | SELECTED SOCIAL CHARACTERISTICS IN...

    • data.census.gov
    Updated Apr 1, 2010
    + more versions
    Cite
    ACS (2010). 2010 American Community Survey: CP02 | SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES (ACS 1-Year Estimates Comparison Profiles) [Dataset]. https://data.census.gov/table?tid=ACSCP1Y2010.CP02&g=0500000US39093
    Explore at:
    Dataset updated
    Apr 1, 2010
    Dataset provided by
    United States Census Bureauhttp://census.gov/
    Authors
    ACS
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    2010
    Area covered
    United States
    Description

    Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section. Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

    An "*" indicates that the estimate is significantly different (at a 90% confidence level) from the estimate for the most current year. A "c" indicates the estimates for that year and the current year are both controlled; a statistical test is not appropriate.

    Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, for 2010 the 2010 Census provides the official counts of the population and housing units for the nation, states, counties, cities and towns.

    Explanation of symbols:

    • An "**" entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error, and thus the margin of error. A statistical test is not appropriate.

    • An "-" entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or that a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest or upper interval of an open-ended distribution.

    • An "-" following a median estimate means the median falls in the lowest interval of an open-ended distribution.

    • An "+" following a median estimate means the median falls in the upper interval of an open-ended distribution.

    • An "***" entry in the margin of error column indicates that the median falls in the lowest or upper interval of an open-ended distribution. A statistical test is not appropriate.

    • An "*****" entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.

    • An "N" entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.

    • An "(X)" means that the estimate is not applicable or not available.

    Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2000 data. Boundaries for urban areas have not been updated since Census 2000; as a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

    While the 2010 ACS data generally reflect the December 2009 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas, in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.

    The Census Bureau introduced a new set of disability questions in the 2008 ACS questionnaire. Accordingly, comparisons of disability data from 2008 or later with data from prior years are not recommended. For more information on these questions and their evaluation in the 2006 ACS Content Test, see the Evaluation Report Covering Disability.

    Selected migration data are not available for certain geographic areas due to problems with group quarters data collection and imputation. See Errata Note #44 for details.

    Data for year of entry of the native population reflect the year of entry into the U.S. by people who were born in Puerto Rico or U.S. Island Areas, or born outside the U.S. to a U.S. citizen parent, and who subsequently moved to the U.S.

    Ancestry listed in this table refers to the total number of people who responded with a particular ancestry; for example, the estimate given for Russian represents the number of people who listed Russian as either their first or second ancestry. This table lists only the largest ancestry groups; see the Detailed Tables for more categories. Race and Hispanic origin groups are not included in this table because official data for those groups come from the race and Hispanic origin questions rather than the ancestry question (see Demographic Table).

    Starting in 2008, the Scotch-Irish category does not include Irish-Scotch. People who reported Irish-Scotch ancestry are classified under "Other groups," whereas in 2007 and earlier they were classified as Scotch-Irish.

    Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lowe...
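    The 90 percent margin of error described in the text maps directly onto a standard error and a two-estimate significance test. A minimal sketch (the 1.645 factor and the Z-comparison follow the ACS statistical-testing guidance; the numbers below are hypothetical):

```python
import math

Z90 = 1.645  # critical value for a 90% confidence level

def standard_error(moe_90):
    """Recover a standard error from a published 90% margin of error."""
    return moe_90 / Z90

def significantly_different(est1, moe1, est2, moe2):
    """Two-estimate test at the 90% level: compare |Z| against 1.645,
    where Z = (est1 - est2) / sqrt(SE1^2 + SE2^2)."""
    se = math.hypot(standard_error(moe1), standard_error(moe2))
    z = (est1 - est2) / se
    return abs(z) > Z90

# Hypothetical estimates with their 90% margins of error:
print(significantly_different(52.3, 1.1, 48.9, 1.2))  # True
```

    The same arithmetic gives the confidence interval quoted above: (estimate - MOE, estimate + MOE) covers the true value with roughly 90 percent probability.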

  19. Data set for comparing different bond wrench test procedures (set-up and specimen's type)

    • data.4tu.nl
    zip
    Updated Mar 29, 2023
    Cite
    Rita Esposito; Belen Gaggero (2023). Data set for comparing different bond wrench test procedures (set-up and specimen's type) [Dataset]. http://doi.org/10.4121/20193173.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Rita Esposito; Belen Gaggero
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2018
    Area covered
    The Netherlands, Delft
    Dataset funded by
    Nederlandse Aardolie Maatschappij (NAM)
    Polytechnic University of Turin
    Description

    The data set contains the results of a comprehensive investigation of the flexural bond behavior in brick masonry by means of bond wrench tests, with a focus on the influence of the testing procedure. In this context, the study explores the impact of the testing set-up's accuracy and the specimen type on the experimental assessment of the flexural bond strength. Additionally, the data set includes the results of a novel procedure to assess the flexural bond fracture energy.


    The data set contains the results of the following:

    • Experimental assessment of flexural bond strength with set-ups of increasing accuracy: manually operated vs. computer-controlled set-ups.

    (A) Force-controlled set-up entirely operated by hand, for in-situ applications.

    (B) Force-controlled set-up operated by hand, for lab applications.

    (C) Computer-controlled set-up operated by controlling the crack mouth opening displacement (CMOD) at the tension side of the tested specimen, for lab applications.

    • Experimental assessment of flexural bond strength of three different types of masonry specimens using the (A) in-situ bond wrench test: with/without head joints, and couplets vs. wallets.

    (D) "Standard" stack-bonded couplets (without head joints).

    (E) Running-bonded couplets (with head joints).

    (F) Running-bonded wallets (with head joints and courses).

    • Experimental assessment of flexural bond fracture energy using the (C) computer-controlled bond wrench test.


    For both force-controlled set-ups, the maximum force registered during the test is reported together with the dimensions of the tested bed joint and the weight of the portion of the specimen pulled off. For the computer-controlled set-up, the force measured at the corresponding CMOD and jack displacement is provided, along with the dimensions of the tested bed joint and the weight of the portion of the specimen pulled off. Detailed pictures of the set-ups and specimens are also included.


    Tests were performed on two traditional Dutch masonry types, i.e. calcium silicate brick masonry with a 1:3 (cement: sand proportions by volume) cement-based mortar and clay brick masonry with a 1:2:9 (cement: lime: sand proportions by volume) cement-lime mortar.
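    The reported maximum force, bed-joint dimensions, and pulled-off weight are exactly the inputs needed to reduce a flexural bond strength via simple beam flexure (f = M/Z with Z = b·d²/6 for a rectangular joint). A simplified sketch only; the lever arms and the neglect of clamp self-weight are my assumptions, not the exact expression of a test standard such as EN 1052-5:

```python
def flexural_bond_strength(f_max_n, lever_arm_m, width_m, depth_m,
                           pulled_weight_n=0.0, weight_arm_m=0.0):
    """Flexural bond strength (Pa) of a rectangular bed joint.

    The bending moment M combines the wrench force times its lever arm
    and (optionally) the weight of the pulled-off portion times its own
    lever arm; the elastic section modulus is Z = b * d^2 / 6.
    """
    moment = f_max_n * lever_arm_m + pulled_weight_n * weight_arm_m
    section_modulus = width_m * depth_m ** 2 / 6.0
    return moment / section_modulus

# Hypothetical joint: 100 N at a 0.5 m arm on a 210 mm x 100 mm bed joint
f = flexural_bond_strength(100.0, 0.5, 0.210, 0.100)
print(round(f), "Pa")  # 142857 Pa, i.e. about 0.14 MPa
```

    In the actual data set the per-specimen joint dimensions and pulled-off weights supply these inputs; the standards the authors followed may apply additional correction terms.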


  20. Data from: Tests for Scale Changes Based on Pairwise Differences

    • tandf.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Carina Gerstenberger; Daniel Vogel; Martin Wendler (2023). Tests for Scale Changes Based on Pairwise Differences [Dataset]. http://doi.org/10.6084/m9.figshare.8295638.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Carina Gerstenberger; Daniel Vogel; Martin Wendler
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In many applications it is important to know whether the amount of fluctuation in a series of observations changes over time. In this article, we investigate different tests for detecting changes in the scale of mean-stationary time series. The classical approach, based on the CUSUM test applied to the squared centered observations, is very vulnerable to outliers and impractical for heavy-tailed data, which leads us to contemplate test statistics based on alternative, less outlier-sensitive scale estimators. It turns out that the tests based on Gini’s mean difference (the average of all pairwise distances) and generalized Qn estimators (sample quantiles of all pairwise distances) are very suitable candidates. They improve upon the classical test not only under heavy tails or in the presence of outliers, but also under normality. We use recent results on the process convergence of U-statistics and U-quantiles for dependent sequences to derive the limiting distribution of the test statistics and propose estimators for the long-run variance. We show the consistency of the tests and demonstrate the applicability of the new change-point detection methods at two real-life data examples from hydrology and finance. Supplementary materials for this article are available online.

