https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).
It is not classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.
The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different ‘classes’.
This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.
The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.
This data table is updated every 3 months.
Ref: DVSA/MOT/01. Download CSV (16.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060287/dvsa-mot-01-mot-test-results-by-class-of-vehicle1.csv
These tables give data for the following classes of vehicles:
All figures are for vehicles as they were brought in for the MOT.
A failed test usually has multiple failure items.
The percentage of tests is worked out as the number of tests with one or more failure items in the defect as a percentage of total tests.
The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.
The average defects per initial test failure is worked out as the total failure items divided by the number of tests failed plus tests that passed after rectification of a minor defect at the time of the test.
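As a quick illustration of these three definitions, here is a small worked example in Python; the counts and the 'Brakes' category are purely hypothetical:

```python
# Worked example of the three derived metrics, using made-up counts for a
# hypothetical 'Brakes' defect category.
total_tests = 100_000                 # initial tests
tests_failed = 29_000                 # initial failures
passed_after_rectification = 6_000    # passed after rectifying a minor defect
tests_with_brake_defect = 12_500      # tests with >=1 failure item in the category
brake_defects = 18_000                # failure items in the category
total_defects = 95_000                # failure items across all categories

# Percentage of tests: tests with one or more failure items in the category,
# as a percentage of total tests.
pct_tests = 100 * tests_with_brake_defect / total_tests

# Percentage of defects: defects in the category as a percentage of total
# defects for all categories.
pct_defects = 100 * brake_defects / total_defects

# Average defects per initial test failure: total failure items divided by
# (tests failed + tests passed after rectification of a minor defect).
avg_defects = total_defects / (tests_failed + passed_after_rectification)

print(f"{pct_tests:.1f}% of tests, {pct_defects:.1f}% of defects, "
      f"{avg_defects:.2f} defects per initial failure")
```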
These data tables are updated every 3 months.
Ref: DVSA/MOT/02. Download CSV (19.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060255/dvsa-mot-02-mot-class-1-and-2-vehicles-initial-failures-by-defect-category-.csv
Given below are three files that you will be using for the challenge. Download all the files. The training file has a labelled data set; the test file only has the features. Build your algorithm on the training data and make predictions on the test file, after which you have to create a submissions.csv file that will be evaluated. You may refer to the sample_submission.csv file in order to understand the overall structure of your submission. The dataset consists of overall stats of players in ODIs only.
File descriptions:
- train.csv - the training set
- test.csv - the test set
- sampleSubmission.csv - a sample submission file in the correct format

Data fields:
- id - an anonymous id unique to the player
- Name - name of the player
- Age - age of the player
- 100s - number of centuries scored by the player
- 50s - number of half-centuries scored by the player
- 6s - total number of sixes hit by the player
- Balls - number of balls bowled by the player
- Bat_Average - average batting score
- Bowl_Strike_Rate - average number of balls bowled per wicket taken
- Balls faced - number of balls faced
- Economy - average number of runs conceded per over bowled
- Innings - number of innings played
- Overs - number of overs bowled
- Maidens - overs in which no run was conceded
- Runs - total runs scored by the player
- Wickets - number of wickets taken
- Ratings - final rating of the player
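A minimal baseline sketch for this challenge is shown below. It assumes the submission needs the player id plus a predicted Ratings column (check sampleSubmission.csv for the exact headers), and the model choice is only illustrative:

```python
# Minimal baseline sketch. Column names follow the field list above; check
# sampleSubmission.csv for the exact submission headers.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = [c for c in train.columns if c not in ("id", "Name", "Ratings")]
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train[features].fillna(0), train["Ratings"])

submission = pd.DataFrame({
    "id": test["id"],
    "Ratings": model.predict(test[features].fillna(0)),
})
submission.to_csv("submissions.csv", index=False)
```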
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, an open (accessible under specific permission) longitudinal dataset that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses recorded for the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
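As a small usage sketch (assuming Patient_info.csv has been extracted from T1DiabetesGranada.zip into the working directory), the file can be loaded with pandas and an approximate age derived from Birth_year, since only the year of birth is provided:

```python
# Sketch: load Patient_info.csv and derive an approximate age from
# Birth_year (only the year of birth is available, so age is approximate;
# 2023, the publication year, is used as the hypothetical reference year).
import pandas as pd

patients = pd.read_csv("Patient_info.csv")
patients["Age"] = 2023 - patients["Birth_year"]

print(patients[["Patient_ID", "Sex", "Age"]].head())
print(patients["Number_of_measurements"].describe())
```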
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
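Since the date and time of each measurement sit in separate columns, a common first step is to combine them into a single timestamp. A minimal sketch, assuming the CSV has been extracted to the working directory:

```python
# Sketch: build a single timestamp from the separate date and time columns.
# With more than 22.6 million rows, chunked reading may be preferable.
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")
glucose["Timestamp"] = pd.to_datetime(
    glucose["Measurement_date"] + " " + glucose["Measurement_time"]
)

# Example: resample one patient's series to hourly means
patient_id = glucose["Patient_ID"].iloc[0]
series = (
    glucose.loc[glucose["Patient_ID"] == patient_id]
    .set_index("Timestamp")["Measurement"]
    .resample("1h")
    .mean()
)
print(series.head())
```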
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long description (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors compare well to YSI analyzer devices (Xylem Inc.), the gold standard, with results falling within zones A and B of the consensus error grid 99.9% of the time. In addition, other studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient in the Glucose_measurements.csv file were continuous (i.e., a sample at least every 15 minutes), as they should be.
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are comma-delimited CSV files available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code, graphics and statistics that help in better understanding the dataset is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions, such as calculating the patient age, removing a list of patients from a dataset file, and keeping only a list of patients in a dataset file.
Code Availability
The dataset was generated using custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation, data transformation, and variable extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the variable with the timestamp is split into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with the data of the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook implements the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the Patient_info.csv file. It is divided into six sections: the first three extract the features from each of the mentioned files, and the last three add the extracted features to the resulting new file.
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
https://creativecommons.org/publicdomain/zero/1.0/
BigQuery provides a limited number of sample tables that you can run queries against. These tables are suited for testing queries and learning BigQuery.
gsod: Contains weather information collected by NOAA, such as precipitation amounts and wind speeds from late 1929 to early 2010.
github_nested: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a nested schema. Created in September 2012.
github_timeline: Contains a timeline of actions such as pull requests and comments on GitHub repositories with a flat schema. Created in May 2012.
natality: Describes all United States births registered in the 50 States, the District of Columbia, and New York City from 1969 to 2008.
shakespeare: Contains a word index of the works of Shakespeare, giving the number of times each word appears in each corpus.
trigrams: Contains English language trigrams from a sample of works published between 1520 and 2008.
wikipedia: Contains the complete revision history for all Wikipedia articles up to April 2010.
Fork this kernel to get started.
Data Source: https://cloud.google.com/bigquery/sample-tables
Banner Photo by Mervyn Chan from Unsplash.
How many babies were born in New York City on Christmas Day?
How many words are in the play Hamlet?
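As a sketch of how such questions can be answered with the Python client (assuming the google-cloud-bigquery package is installed and Google Cloud credentials are configured), the Hamlet question reduces to summing word_count over the shakespeare table:

```python
# Sketch: answer the Hamlet question with the BigQuery Python client.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE corpus = 'hamlet'
"""
for row in client.query(query).result():
    print(f"Hamlet contains {row.total_words} words")
```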
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time to Update the Split-Sample Approach in Hydrological Model Calibration
Hongren Shen1, Bryan A. Tolson1, Juliane Mai1
1Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, Ontario, Canada
Corresponding author: Hongren Shen (hongren.shen@uwaterloo.ca)
Abstract
Model calibration and validation are critical in hydrological model robustness assessment. Unfortunately, the commonly-used split-sample test (SST) framework for data splitting requires modelers to make subjective decisions without clear guidelines. This large-sample SST assessment study empirically assesses how different data splitting methods influence post-validation model testing period performance, thereby identifying optimal data splitting methods under different conditions. This study investigates the performance of two lumped conceptual hydrological models calibrated and tested in 463 catchments across the United States using 50 different data splitting schemes. These schemes differ in data availability, and in the length and recentness of the continuous calibration sub-periods (CSPs). A full-period CSP is also included in the experiment, which skips model validation. The assessment approach is novel in multiple ways, including framing model building decisions as a decision tree problem and viewing the model building process as a formal testing period classification problem, aiming to accurately predict model success/failure in the testing period. Results span different climate and catchment conditions across a 35-year period with available data, making conclusions quite generalizable. Calibrating to older data and then validating models on newer data produces inferior model testing period performance in every single analysis conducted and should be avoided. Calibrating to the full available data and skipping model validation entirely is the most robust split-sample decision. Experimental findings remain consistent no matter how model building factors (i.e., catchments, model types, data availability, and testing periods) are varied. Results strongly support revising the traditional split-sample approach in hydrological modeling.
Version updates
v1.1 Updated on May 19, 2022. We added hydrographs for each catchment.
The v1.1 attachment is split into eight zipped parts. Download all of them and unzip the eight parts together.
In this update, we added two zipped files in each gauge subfolder:
(1) GR4J_Hydrographs.zip and
(2) HMETS_Hydrographs.zip
Each of the zip files contains 50 CSV files. These CSV files are named with keywords of model name, gauge ID, and the calibration sub-period (CSP) identifier.
Each hydrograph CSV file contains four key columns:
(1) Date time (note that the hour column is less significant since this is daily data);
(2) Precipitation in mm that is the aggregated basin mean precipitation;
(3) Simulated streamflow in m3/s; the column is named "subXXX", where XXX is the ID of the catchment, specified in the CAMELS_463_gauge_info.txt file; and
(4) Observed streamflow in m3/s; the column is named "subXXX(observed)".
Note that these hydrograph CSV files report period-ending time-averaged flows. They were directly produced by the Raven hydrological modeling framework. More information about the format of the hydrograph CSV files can be found on the Raven webpage.
v1.0 First version published on Jan 29, 2022.
Data description
This data was used in the paper entitled "Time to Update the Split-Sample Approach in Hydrological Model Calibration" by Shen et al. (2022).
Catchment, meteorological forcing and streamflow data are provided for hydrological modeling use. Specifically, the forcing and streamflow data are archived in the Raven hydrological modeling required format. The GR4J and HMETS model building results in the paper, i.e., reference KGE and KGE metrics in calibration, validation and testing periods, are provided for replication of the split-sample assessment performed in the paper.
Data content
The data folder contains a gauge info file (CAMELS_463_gauge_info.txt), which reports basic information of each catchment, and 463 subfolders, each having four files for a catchment, including:
(1) Raven_Daymet_forcing.rvt, which contains Daymet meteorological forcing (i.e., daily precipitation in mm/d, minimum and maximum air temperature in deg_C, shortwave in MJ/m2/day, and day length in day) from Jan 1st 1980 to Dec 31 2014 in a Raven hydrological modeling required format.
(2) Raven_USGS_streamflow.rvt, which contains daily discharge data (in m3/s) from Jan 1st 1980 to Dec 31 2014 in a Raven hydrological modeling required format.
(3) GR4J_metrics.txt, which contains reference KGE and GR4J-based KGE metrics in calibration, validation and testing periods.
(4) HMETS_metrics.txt, which contains reference KGE and HMETS-based KGE metrics in calibration, validation and testing periods.
Data collection and processing methods
Data source
Catchment information and the Daymet meteorological forcing are retrieved from the CAMELS data set.
The USGS streamflow data are collected from the U.S. Geological Survey's (USGS) National Water Information System (NWIS).
The GR4J and HMETS performance metrics (i.e., reference KGE and KGE) are produced in the study by Shen et al. (2022).
Forcing data processing
A quality assessment procedure was performed. For example, daily maximum air temperature should be larger than the daily minimum air temperature; otherwise, these two values will be swapped.
Units are converted to Raven-required ones. Precipitation: mm/day, unchanged; daily minimum/maximum air temperature: deg_C, unchanged; shortwave: W/m2 to MJ/m2/day; day length: seconds to days.
Data for a catchment is archived in a RVT (ASCII-based) file, in which the second line specifies the start time of the forcing series, the time step (= 1 day), and the total time steps in the series (= 12784), respectively; the third and the fourth lines specify the forcing variables and their corresponding units, respectively.
More details of the Raven-formatted forcing files can be found in the Raven manual.
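Based solely on the header layout described above (the Raven manual is the authoritative reference), a small parsing sketch might look like this; the token positions are assumptions:

```python
# Sketch: parse the header of a Raven forcing RVT file, based only on the
# layout described above (second line: start time, time step, total steps;
# third and fourth lines: variable names and units).
def read_rvt_header(path):
    with open(path) as f:
        lines = [f.readline().strip() for _ in range(4)]
    tokens = lines[1].split()              # start time, time step, total steps
    return {
        "start": " ".join(tokens[:-2]),    # e.g. a date plus a clock time
        "time_step_days": float(tokens[-2]),  # expected to be 1 (daily data)
        "n_steps": int(tokens[-1]),        # expected to be 12784 (1980-2014)
        "variables": dict(zip(lines[2].split(), lines[3].split())),
    }

print(read_rvt_header("Raven_Daymet_forcing.rvt"))
```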
Streamflow data processing
Units are converted to Raven-required ones. Daily discharge originally in cfs is converted to m3/s.
Missing data are replaced with -1.2345 as Raven requires. Those missing time steps will not be counted in performance metrics calculation.
The streamflow series is archived in an RVT (ASCII-based) file, which opens with eight comment lines specifying relevant gauge and streamflow data information, such as gauge name, gauge ID, USGS-reported catchment area, calculated catchment area (based on the catchment shapefiles in the CAMELS dataset), streamflow data range, data time step, and missing data periods. The first line after the comment lines in the streamflow RVT files specifies the data type (default is HYDROGRAPH), the subbasin ID (i.e., SubID), and the discharge unit (m3/s), respectively. The next line specifies the start of the streamflow data, the time step (= 1 day), and the total time steps in the series (= 12784), respectively.
GR4J and HMETS metrics
The GR4J and HMETS metrics files consist of reference KGE and KGE in model calibration, validation, and testing periods, which are derived in the massive split-sample test experiment performed in the paper.
Columns in these metrics files are gauge ID, calibration sub-period (CSP) identifier, KGE in calibration, validation, testing1, testing2, and testing3, respectively.
We proposed 50 different CSPs in the experiment. The "CSP_identifier" is a unique name for each CSP. For example, the CSP identifier "CSP-3A_1990" means the model is built on Jan 1st 1990, calibrated on the first 3-year sample (1981-1983), and validated on the remaining years of the 1980 to 1989 period. Note that 1980 is always used for spin-up.
We defined three testing periods (independent of calibration and validation periods) for each CSP: the first 3 years from the model build year inclusive, the first 5 years from the model build year inclusive, and the full years from the model build year inclusive. For example, "testing1", "testing2", and "testing3" for CSP-3A_1990 are 1990-1992, 1990-1994, and 1990-2014, respectively.
Reference flow is the interannual mean daily flow based on a specific period, which is derived for a one-year period and then repeated in each year in the calculation period.
For calibration, its reference flow is based on spin-up + calibration periods.
For validation, its reference flow is based on spin-up + calibration periods.
For testing, its reference flow is based on spin-up + calibration + validation periods.
Reference KGE is calculated based on the reference flow and observed streamflow in a specific calculation period (e.g., calibration). Reference KGE is computed using the KGE equation, substituting the reference flow for the "simulated" flow in the period of calculation. Note that the reference KGEs for the three different testing periods correspond to the same historical period but are different, because each testing period spans a different time period and covers a different series of observed flow.
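For reference, below is a sketch of the standard KGE computation (Gupta et al., 2009) and of the reference KGE as described above, where the reference flow takes the place of the simulated series; equal-length numpy arrays with missing steps already removed are assumed:

```python
# Sketch of the standard KGE (Gupta et al., 2009) and of the reference KGE
# described above. Missing steps (-1.2345 in the RVT files) are assumed to
# have been removed from both arrays beforehand.
import numpy as np

def kge(sim, obs):
    r = np.corrcoef(sim, obs)[0, 1]      # linear correlation
    alpha = np.std(sim) / np.std(obs)    # variability ratio
    beta = np.mean(sim) / np.mean(obs)   # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def reference_kge(reference_flow, obs):
    # substitute the reference flow for the "simulated" flow
    return kge(reference_flow, obs)
```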
More details of the split-sample test experiment and the modeling results analysis can be found in the paper by Shen et al. (2022).
Citation
Journal Publication
This study:
Shen, H., Tolson, B. A., & Mai, J.(2022). Time to update the split-sample approach in hydrological model calibration. Water Resources Research, 58, e2021WR031523. https://doi.org/10.1029/2021WR031523
Original CAMELS dataset:
A. J. Newman, M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D. Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance. Hydrology and Earth System Sciences, 19, 209-223. https://doi.org/10.5194/hess-19-209-2015
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The Pesticide Data Program (PDP) is a national pesticide residue database program. Through cooperation with State agriculture departments and other Federal agencies, PDP manages the collection, analysis, data entry, and reporting of pesticide residues on agricultural commodities in the U.S. food supply, with an emphasis on those commodities highly consumed by infants and children. This dataset provides information on where each tested sample was collected, where the product originated from, what type of product it was, and what residues were found on the product, for calendar years 1992 through 2020. The data can measure residues of individual compounds and classes of compounds, as well as provide information about the geographic distribution of the origin of samples, from growers, packers and distributors. The dataset also includes information on where the samples were taken, what laboratory was used to test them, and all testing procedures (by sample, so can be linked to the compound that is identified). The dataset also contains a reference variable for each compound that denotes the limit of detection for a pesticide/commodity pair (LOD variable). The metadata also includes EPA tolerance levels or action levels for each pesticide/commodity pair. The dataset will be updated on a continual basis, with a new resource data file added annually after the PDP calendar-year survey data is released.

Resources in this dataset:
- CSV Data Dictionary for PDP (file: PDP_DataDictionary.csv). Machine-readable Comma Separated Values (CSV) format data dictionary for the PDP Database Zip files. Defines variables for the sample identity and analytical results data tables/files. The ## characters in the Table and Text Data File names refer to the 2-digit year of the PDP survey, like 97 for 1997 or 01 for 2001. For details on table linking, see the PDF data dictionary. Recommended software: Microsoft Excel (https://www.microsoft.com/en-us/microsoft-365/excel).
- Data dictionary for Pesticide Data Program (file: PDP DataDictionary.pdf). Data dictionary for the PDP Database Zip files. Recommended software: Adobe Acrobat (https://www.adobe.com).
- PDP Database Zip Files, one per survey year from 1992 through 2020 (files: 1992PDPDatabase.zip through 2020PDPDatabase.zip). Each contains the data and supporting files for that calendar year's PDP survey. Recommended software: Microsoft Access (https://products.office.com/en-us/access).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data are fully processed but can be used to test each pipeline component. You can download the scripts at
To use the model, unzip the freshwater_phytoplankton_model.zip and place the folder in the respective model folder in the services.
|-- services
|   |-- ProcessData.py
|   |-- config.py
|   |-- classification
|   |   |-- ObjectClassification
|   |   |   |-- models
|   |   |   |   |--
Once you unzip the data.zip file, each folder corresponds to the data export of a FlowCam run. You have the TIF collage files, a CSV file with the sample name containing all the parameters measured by the FlowCam, and a LabelChecker_
You can run the preprocessing.py script directly on the files by including the -R (reprocess) argument. Otherwise, you can do it by removing the LabelChecker CSV from the folders. The PreprocessingTrue column will remain the same.
When running the classification.py script you can get new predictions on the data. In this case, only the LabelPredicted column will be updated and the validated labels (LabelTrue column) will not be lost.
You could also use these files to try out the train_model.ipynb, although the resulting model will not be very good with so little data. We recommend trying it with your own data.
These files can be used to test LabelChecker. You can open them one by one or all together and try all functionalities. We provide a label_file.csv but you can also make your own.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset Description:
The dataset comprises a collection of photos of people, organized into folders labeled "women" and "men." Each folder contains a significant number of images to facilitate training and testing of gender detection algorithms or models.
The dataset contains a variety of images capturing female and male individuals from diverse backgrounds, age groups, and ethnicities.
This labeled dataset can be utilized as training data for machine learning models, computer vision applications, and gender detection algorithms.
The dataset is split into train and test folders. Each folder includes:
- women and men folders, containing images of people of the corresponding gender
- a .csv file with information about the images and people in the dataset
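A minimal sketch for iterating over this layout (the extraction directory name is hypothetical):

```python
# Sketch: count images per class in the documented train/test layout.
from pathlib import Path

root = Path("gender_dataset")   # hypothetical extraction directory
for split in ("train", "test"):
    for label in ("women", "men"):
        n_images = sum(1 for _ in (root / split / label).glob("*"))
        print(f"{split}/{label}: {n_images} images")
```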
keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, gender detection, supervised learning dataset, gender classification dataset, gender recognition dataset
Kickstarter is a community of more than 10 million people, comprising creative and tech enthusiasts who help bring creative projects to life. To date, members have contributed more than $3 billion to fund creative projects. The projects can be literally anything: a device, a game, an app, a film, etc.
Kickstarter works on an all-or-nothing basis, i.e., if a project doesn't meet its goal, the project owner gets nothing. For example, if a project's goal is $500, then even if it gets funded up to $499, the project won't be a success.
Recently, Kickstarter released its public data repository to allow researchers and enthusiasts like us to help them solve a problem: will a project get fully funded?
In this challenge, you have to predict if a project will get successfully funded or not.
There are three files given to download: train.csv, test.csv and sample_submission.csv. The train data consists of sample projects from May 2009 to May 2015. The test data consists of projects from June 2015 to March 2017.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction Data of Base Models from Auto-Sklearn 1 on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.
The files of this figshare item include data that was collected for the paper:
Q(D)O-ES: Population-based Quality (Diversity) Optimisation for Post Hoc Ensemble Selection in AutoML, Lennart Purucker, Lennart Schneider, Marie Anastacio, Joeran Beel, Bernd Bischl, Holger Hoos, Second International Conference on Automated Machine Learning, 2023.
The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.
In detail, the data contains the predictions of base models on validation and test data, as produced by running Auto-Sklearn 1 for 4 hours. Such prediction data is included for each model produced by Auto-Sklearn 1 on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.
The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23613624.
The data is intended for, but not limited to, use with assembled to evaluate post hoc ensembling methods for AutoML.
Details
The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.
The link resolves to a directory containing the following:
example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running Auto-Sklearn 1 for ROC AUC.
metatasks_bacc.zip: The Metatasks obtained by running Auto-Sklearn 1 for Balanced Accuracy.
The size after unzipping the entire file is:
metatasks_roc_auc.zip: ~450GB
metatasks_bacc.zip: ~330GB
We suggest extracting only files that are of interest from the .zip archive, as these can be much smaller in size and might suffice for experiments.
The metatask .zip files contain two subdirectories for Metatasks produced with TopN or SiloTopN pruning (see the paper for details). In each of these subdirectories, two files per metatask exist: one .json file with metadata information and one .hdf or .csv file containing the prediction data. Details on how these should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metatasks, we advise looking at the file content and metadata individually or parsing them with Metatasks first.
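As a rough sketch of inspecting one extracted metatask under the layout described above (the directory path is hypothetical, and the exact metadata keys and prediction-file layout are defined by the assembled framework):

```python
# Sketch: inspect one extracted metatask directory; a starting point only.
import json
from pathlib import Path
import pandas as pd

task_dir = Path("metatasks_bacc/TopN/3")          # hypothetical directory
meta = json.loads(next(task_dir.glob("*.json")).read_text())
print(sorted(meta.keys()))

pred_file = next(task_dir.glob("*.csv"), None)    # predictions may be .hdf instead
if pred_file is not None:
    predictions = pd.read_csv(pred_file)
else:
    predictions = pd.read_hdf(next(task_dir.glob("*.hdf")))
print(predictions.shape)
```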
https://creativecommons.org/publicdomain/zero/1.0/
The Supreme Court of India directed the National Testing Agency (NTA) to release the results of all the students who appeared in the National Eligibility cum Entrance Test (NEET) Undergraduate, masking their original Roll Numbers, in order to enable data analysis.
NTA in response released the results at https://neet.ntaonline.in/frontend/web/common-scorecard/index. But it released the center wise results in PDF format because of which data analysis cannot be performed directly on the data.
NEET_2024_RESULTS.csv is a compilation of all those 4750 centers' results. It has a total of 2333120 student records.
Disclaimer: dummy_srlno doesn't correspond to the actual Roll No.
The Environmental Monitoring System (EMS) test results are available for download as .csv files. Results include physical, chemical and biological analyses of samples taken from water, air and solid waste discharges, and from ambient monitoring sites throughout the province.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Raw data and descriptive statistics of the market survey performed with the XLSTAT 2009.1.02 add-in are provided as an Excel-compatible CSV file. The data include file name, sample name, area, calculated N2O amounts, test result and statistical values.
https://creativecommons.org/publicdomain/zero/1.0/
If this data set is useful, an upvote is appreciated. This data set approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features; the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
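A small sketch of the point about G1/G2 (the file name and ';' separator follow the UCI original, student-mat.csv, and are assumptions for this upload):

```python
# Sketch: quantify how much G1 and G2 help when predicting G3.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("student-mat.csv", sep=";")

with_grades = cross_val_score(
    LinearRegression(),
    df[["G1", "G2", "studytime", "failures", "absences"]], df["G3"],
    cv=5, scoring="r2",
).mean()
without_grades = cross_val_score(
    LinearRegression(),
    df[["studytime", "failures", "absences"]], df["G3"],
    cv=5, scoring="r2",
).mean()
print(f"R^2 with G1/G2: {with_grades:.2f}; without: {without_grades:.2f}")
```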
FSDKaggle2019 is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology. FSDKaggle2019 has been used for the DCASE Challenge 2019 Task 2, which was run as a Kaggle competition titled Freesound Audio Tagging 2019.
Citation
If you use the FSDKaggle2019 dataset or part of it, please cite our DCASE 2019 paper:
Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Proceedings of the DCASE 2019 Workshop, NYC, US (2019)
You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2019.
Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017
Data curators
Eduardo Fonseca, Manoj Plakal, Xavier Favory, Jordi Pons
Contact
You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.
ABOUT FSDKaggle2019
Freesound Dataset Kaggle 2019 (or FSDKaggle2019 for short) is an audio dataset containing 29,266 audio files annotated with 80 labels of the AudioSet Ontology [1]. FSDKaggle2019 has been used for Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2019. Please visit the DCASE2019 Challenge Task 2 website for more information. This task was hosted on the Kaggle platform as a competition titled Freesound Audio Tagging 2019. It was organized by researchers from the Music Technology Group (MTG) of Universitat Pompeu Fabra (UPF), and from the Sound Understanding team at Google AI Perception. The competition intended to provide insight towards the development of broadly-applicable sound event classifiers able to cope with label noise and minimal supervision conditions.
FSDKaggle2019 employs audio clips from the following sources:
Freesound Dataset (FSD): a dataset being collected at the MTG-UPF based on Freesound content organized with the AudioSet Ontology
The soundtracks of a pool of Flickr videos taken from the Yahoo Flickr Creative Commons 100M dataset (YFCC)
The audio data is labeled using a vocabulary of 80 labels from Google’s AudioSet Ontology [1], covering diverse topics: Guitar and other Musical Instruments, Percussion, Water, Digestive, Respiratory sounds, Human voice, Human locomotion, Hands, Human group actions, Insect, Domestic animals, Glass, Liquid, Motor vehicle (road), Mechanisms, Doors, and a variety of Domestic sounds. The full list of categories can be inspected in vocabulary.csv (see Files & Download below). The goal of the task was to build a multi-label audio tagging system that can predict appropriate label(s) for each audio clip in a test set.
What follows is a summary of some of the most relevant characteristics of FSDKaggle2019. Nevertheless, it is highly recommended to read our DCASE 2019 paper for a more in-depth description of the dataset and how it was built.
Ground Truth Labels
The ground truth labels are provided at the clip-level, and express the presence of a sound category in the audio clip, hence can be considered weak labels or tags. Audio clips have variable lengths (roughly from 0.3 to 30s).
The audio content from FSD has been manually labeled by humans following a data labeling process using the Freesound Annotator platform. Most labels have inter-annotator agreement but not all of them. More details about the data labeling process and the Freesound Annotator can be found in [2].
The YFCC soundtracks were labeled using automated heuristics applied to the audio content and metadata of the original Flickr clips. Hence, a substantial amount of label noise can be expected. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises. More information about some of the types of label noise that can be encountered is available in [3].
Specifically, FSDKaggle2019 features three types of label quality, one for each set in the dataset:
curated train set: correct (but potentially incomplete) labels
noisy train set: noisy labels
test set: correct and complete labels
Further details can be found below in the sections for each set.
Format
All audio clips are provided as uncompressed PCM 16 bit, 44.1 kHz, mono audio files.
DATA SPLIT
FSDKaggle2019 consists of two train sets and one test set. The idea is to limit the supervision provided for training (i.e., the manually-labeled, hence reliable, data), thus promoting approaches to deal with label noise.
Curated train set
The curated train set consists of manually-labeled data from FSD.
Number of clips/class: 75, except in a few cases (where there are fewer)
Total number of clips: 4970
Avg number of labels/clip: 1.2
Total duration: 10.5 hours
The duration of the audio clips ranges from 0.3 to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording/uploading sounds. Labels are correct but potentially incomplete. It can happen that a few of these audio clips present additional acoustic material beyond the provided ground truth label(s).
Noisy train set
The noisy train set is a larger set of noisy web audio data from Flickr videos taken from the YFCC dataset [5].
Number of clips/class: 300
Total number of clips: 19,815
Avg number of labels/clip: 1.2
Total duration: ~80 hours
The duration of the audio clips ranges from 1s to 15s, with the vast majority lasting 15s. Labels are automatically generated and purposefully noisy. No human validation is involved. The label noise can vary widely in amount and type depending on the category, including in- and out-of-vocabulary noises.
Considering the numbers above, the per-class data distribution available for training is, for most of the classes, 300 clips from the noisy train set and 75 clips from the curated train set. This means 80% noisy / 20% curated at the clip level, while at the duration level the proportion is more extreme considering the variable-length clips.
Test set
The test set is used for system evaluation and consists of manually-labeled data from FSD.
Number of clips/class: between 50 and 150
Total number of clips: 4481
Avg number of labels/clip: 1.4
Total duration: 12.9 hours
The acoustic material present in the test set clips is labeled exhaustively using the aforementioned vocabulary of 80 classes. Most labels have inter-annotator agreement, but not all of them. Barring human error, the label(s) are correct and complete considering the target vocabulary; nonetheless, a few clips could still present additional (unlabeled) acoustic content outside the vocabulary.
During the DCASE2019 Challenge Task 2, the test set was split into two subsets, for the public and private leaderboards, and only the data corresponding to the public leaderboard was provided. In this current package you will find the full test set with all the test labels. To allow comparison with previous work, the file test_post_competition.csv includes a flag to determine the corresponding leaderboard (public or private) for each test clip (see more info in Files & Download below).
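For training a multi-label tagger, the comma-separated tags typically need to be turned into a binary label matrix. A sketch, assuming a CSV with fname and labels columns as shipped for the Kaggle competition (check the post-competition CSV headers):

```python
# Sketch: turn comma-separated tags into a binary label matrix.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.read_csv("train_curated.csv")
labels = df["labels"].str.split(",")

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)           # shape: (n_clips, 80)
print(len(mlb.classes_), "classes;", y.shape[0], "clips")
```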
Acoustic mismatch
As mentioned before, FSDKaggle2019 uses audio clips from two sources:
FSD: curated train set and test set, and
YFCC: noisy train set.
While the sources of audio (Freesound and Flickr) are collaboratively contributed and pretty diverse themselves, a certain acoustic mismatch can be expected between FSD and YFCC. We conjecture this mismatch comes from a variety of reasons. For example, through acoustic inspection of a small sample of both data sources, we find a higher percentage of high quality recordings in FSD. In addition, audio clips in Freesound are typically recorded with the purpose of capturing audio, which is not necessarily the case in YFCC.
This mismatch can have an impact in the evaluation, considering that most of the train data come from YFCC, while all test data are drawn from FSD. This constraint (i.e., noisy training data coming from a different web audio source than the test set) is sometimes a real-world condition.
LICENSE
All clips in FSDKaggle2019 are released under Creative Commons (CC) licenses. For attribution purposes and to facilitate attribution of these files to third parties, we include a mapping from the audio clips to their corresponding licenses.
Curated train set and test set. All clips in Freesound are released under different modalities of Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound, some of them requiring attribution to their original authors and some forbidding further commercial reuse. The licenses are specified in the files train_curated_post_competition.csv and test_post_competition.csv. These licenses can be CC0, CC-BY, CC-BY-NC and CC Sampling+.
Noisy train set. Similarly, the licenses of the soundtracks from Flickr used in FSDKaggle2019 are specified in the file train_noisy_post_competition.csv. These licenses can be CC-BY and CC BY-SA.
In addition, FSDKaggle2019 as a whole is the result of a curation process and it has an additional license. FSDKaggle2019 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2019.doc zip file.
FILES & DOWNLOAD
FSDKaggle2019 can be downloaded as a series of zip files with the following directory structure:
root
│
└───FSDKaggle2019.audio_train_curated/ Audio clips in the curated train set
│
└───FSDKaggle2019.audio_train_noisy/ Audio clips in the noisy train set
This dataset serves as a lookup table to determine if environmental records exist in a Chicago Department of Public Health (CDPH) environmental dataset for a given address.
Data fields requiring description are detailed below.
MAPPED LOCATION: Contains the address, city, state and latitude/longitude coordinates of the facility. In instances where the facility address is a range, the lower number (the value in the “Street Number From” column) is used. For example, for the range address 1000-1005 S Wabash Ave, the Mapped Location would be 1000 S Wabash Ave. The latitude/longitude coordinate is determined through the Chicago Open Data Portal’s geocoding process. Addresses that fail to geocode are assigned the coordinates 41.88415000022252°, -87.63241000012124°. This coordinate is located approximately just south of the intersection of W Randolph and N LaSalle.
COMPLAINTS: A ‘Y’ indicates that one or more records exist in the CDPH Environmental Complaints dataset.
NESHAPS & DEMOLITION NOTICES: A ‘Y’ indicates that one or more records exist in the CDPH Asbestos and Demolition Notification dataset.
ENFORCEMENT: A ‘Y’ indicates that one or more records exist in the CDPH Environmental Enforcement dataset.
INSPECTIONS: A ‘Y’ indicates that one or more records exist in the CDPH Environmental Inspections dataset.
PERMITS: A ‘Y’ indicates that one or more records exist in the CDPH Environmental Permits dataset.
TANKS: A ‘Y’ indicates that one or more records exist in the CDPH Storage Tanks dataset. Each 'Y' is a clickable link that will download the corresponding records in CSV format.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Benchmark Dataset for Low-Dose CT Reconstruction Methods.
The following Data Descriptor article provides full documentation:
Leuschner, J., Schmidt, M., Baguer, D.O. et al. LoDoPaB-CT, a benchmark dataset for low-dose computed tomography reconstruction. Sci Data 8, 109 (2021). https://www.nature.com/articles/s41597-021-00893-z
The Python library DIVal (github.com/jleuschn/dival) can be used to download and access the dataset.
Reconstructions from the LIDC/IDRI dataset are used as a basis for this dataset.
The ZIP files included in the LoDoPaB dataset contain multiple HDF5 files. Each HDF5 file contains one HDF5 dataset named "data" that provides a number of samples (128, except for the last file in each ZIP file). For example, the n-th training sample pair is stored in the files "observation_train_%03d.hdf5" and "ground_truth_train_%03d.hdf5", where "%03d" is floor(n / 128), at row (n mod 128) of "data".
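Following this layout, the n-th training sample pair can be fetched directly with h5py; a minimal sketch (paths assume the HDF5 files sit in the current directory):

```python
# Sketch: fetch the n-th training sample pair following the file/row
# layout described above.
import h5py

def get_train_sample(n, data_dir="."):
    file_idx, row = divmod(n, 128)     # file number and row within "data"
    with h5py.File(f"{data_dir}/observation_train_{file_idx:03d}.hdf5", "r") as f:
        observation = f["data"][row]
    with h5py.File(f"{data_dir}/ground_truth_train_{file_idx:03d}.hdf5", "r") as f:
        ground_truth = f["data"][row]  # shape (362, 362)
    return observation, ground_truth

obs, gt = get_train_sample(0)
print(obs.shape, gt.shape)
```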
Note: each last ground truth file (i.e. ground_truth_train_279.hdf5, ground_truth_validation_027.hdf5 and ground_truth_test_027.hdf5) still contains an HDF5 dataset of shape (128, 362, 362), although it holds fewer than 128 valid samples. Thus, the number of valid samples needs to be determined from the total sample numbers in the part (i.e. "train": 35820, "validation": 3522, "test": 3553), or from the corresponding observation file, for which the first dimension of the HDF5 dataset matches the number of valid samples in the file.
The randomized patient IDs of the samples are provided as CSV files. The patient IDs of the train, validation and test parts are integers in the range of 0–631, 632–691 and 692–751, respectively. The ID of each sample is stored in a single row.
Acknowledgements
Johannes Leuschner, Maximilian Schmidt and Daniel Otero Baguer acknowledge the support by the Deutsche Forschungsgemeinschaft (DFG) within the framework of GRK 2224/1 “π3: Parameter Identification – Analysis, Algorithms, Applications”. We thank Simon Arridge, Ozan Öktem, Carola-Bibiane Schönlieb and Christian Etmann for the fruitful discussion about the procedure, and Felix Lucka and Jonas Adler for their ideas and helpful feedback on the simulation setup. The authors acknowledge the National Cancer Institute and the Foundation for the National Institutes of Health, and their critical role in the creation of the free publicly available LIDC/IDRI Database used in this study.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set is comprised of multiple folders:
- The corpus folder contains raw text used for training and testing in two splits, "document" and "paragraph". The spun documents and paragraphs are generated using the SpinBot tool (https://spinbot.com/API). The paragraph split is generated by selecting only paragraphs with 3 or more sentences in the document split. Each folder is divided into mg (i.e., machine generated through SpinBot) and og (i.e., original generated file).
- The human judgement folder contains the human evaluation between original and spun documents (sample). It also contains the answers (keys) and survey results.
- The models folder contains the machine learning classifier models for each word embedding technique used (only for document split training). The models were exported using pickle (Python 3.6). The grid search for hyperparameter adjustments is described in the paper.
- The vector folders (train and test) contain the average of all word vectors for each document and paragraph. Each line has the number of dimensions of the word embedding technique used (see paper for more details) followed by its respective class label (i.e., mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv); the .arff files can be read as normal .txt files.
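A sketch of using one of the vector files to fit a classifier (the file name is hypothetical, and the paper's grid search is replaced by default hyperparameters):

```python
# Sketch: load one vector file (comma-separated, class label in the last
# column) and fit a classifier. If a file carries an ARFF header, skip
# those lines before parsing.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

vectors = pd.read_csv("train/vectors_document.arff", header=None)  # hypothetical name
X = vectors.iloc[:, :-1]        # averaged word-vector dimensions
y = vectors.iloc[:, -1]         # "mg" or "og"

print(cross_val_score(SVC(), X, y, cv=5).mean())
```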