25 datasets found
  1. CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jul 28, 2024
    Cite
    Vidziunas, Linas (2024). CVEfixes Dataset: Automatically Collected Vulnerabilities and Their Fixes from Open-Source Software [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4476563
    Dataset updated
    Jul 28, 2024
    Dataset provided by
    Moonen, Leon
    Vidziunas, Linas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CVEfixes is a comprehensive vulnerability dataset that is automatically collected and curated from Common Vulnerabilities and Exposures (CVE) records in the public U.S. National Vulnerability Database (NVD). The goal is to support data-driven security research based on source code and source code metrics related to fixes for CVEs in the NVD by providing detailed information at different interlinked levels of abstraction, such as the commit-, file-, and method level, as well as the repository- and CVE level.

    This release, v1.0.8, covers all published CVEs up to 23 July 2024. All open-source projects that were reported in CVE records in the NVD in this time frame and had publicly available git repositories were fetched and considered for the construction of this vulnerability dataset. The dataset is organized as a relational database and covers 12107 vulnerability fixing commits in 4249 open source projects for a total of 11873 CVEs in 272 different Common Weakness Enumeration (CWE) types. The dataset includes the source code before and after changing 51342 files and 138974 functions. The collection took 48 hours with 4 workers (AMD EPYC Genoa-X 9684X).

    This repository includes the SQL dump of the dataset, as well as the JSON for the CVEs and XML of the CWEs at the time of collection. The complete process has been documented in the paper "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software", which is published in the Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). You will find a copy of the paper in the Doc folder.
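
    For orientation, a query against a local copy of the database might look like the following sketch. It assumes the SQL dump has been restored to SQLite as CVEfixes.db, and the table and column names (fixes, cwe_classification, cve_id, hash) follow the CVEfixes schema described in the paper; verify them against your copy before relying on the results.

    import sqlite3

    conn = sqlite3.connect("CVEfixes.db")  # restored from the SQL dump

    # Count vulnerability-fixing commits per CWE type (schema names assumed).
    query = """
        SELECT cc.cwe_id, COUNT(DISTINCT f.hash) AS n_fix_commits
        FROM fixes f
        JOIN cwe_classification cc ON cc.cve_id = f.cve_id
        GROUP BY cc.cwe_id
        ORDER BY n_fix_commits DESC
        LIMIT 10;
    """
    for cwe_id, n in conn.execute(query):
        print(cwe_id, n)
    conn.close()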

    Citation and Zenodo links

    Please cite this work by referring to the published paper:

    Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21). ACM, 10 pages. https://doi.org/10.1145/3475960.3475985

    @inproceedings{bhandari2021:cvefixes, title = {{CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software}}, booktitle = {{Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE '21)}}, author = {Bhandari, Guru and Naseer, Amara and Moonen, Leon}, year = {2021}, pages = {10}, publisher = {{ACM}}, doi = {10.1145/3475960.3475985}, copyright = {Open Access}, isbn = {978-1-4503-8680-7}, language = {en} }

    The dataset has been released on Zenodo with DOI:10.5281/zenodo.4476563. The GitHub repository containing the code to automatically collect the dataset can be found at https://github.com/secureIT-project/CVEfixes, released with DOI:10.5281/zenodo.5111494.

  2. USA Name Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Data.gov (2019). USA Name Data [Dataset]. https://www.kaggle.com/datasets/datagov/usa-names
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    Data.gov (https://data.gov/)
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Context

    Cultural diversity in the U.S. has led to great variations in names and naming traditions, and names have been used to express creativity, personality, cultural identity, and values. Source: https://en.wikipedia.org/wiki/Naming_in_the_United_States

    Content

    This public dataset was created by the Social Security Administration and contains all names from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in this data. For others who did apply, records may not show the place of birth, and again their names are not included in the data.

    All data are from a 100% sample of records on Social Security card applications as of the end of February 2015. To safeguard privacy, the Social Security Administration restricts names to those with at least 5 occurrences.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:usa_names

    https://cloud.google.com/bigquery/public-data/usa-names

    Dataset Source: Data.gov. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source — http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @dcp from Unsplash.

    Inspiration

    What are the most common names?

    What are the most common female names?

    Are there more female or male names?

    Female names by a wide margin?
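
    As a starting point, the first of these questions can be answered directly against the BigQuery table listed in the Acknowledgements above. A minimal sketch (the table name bigquery-public-data.usa_names.usa_1910_current and its name/number columns are taken from the BigQuery public-data listing; a GCP project with the google-cloud-bigquery client configured is assumed):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP credentials/project

    # Most common names overall, summed across states and years.
    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_current`
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    for row in client.query(sql).result():
        print(row.name, row.total)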

  3. US Public Schools

    • public.opendatasoft.com
    • data.smartidf.services
    csv, json, excel, geojson
    Updated Jan 6, 2023
    Cite
    (2023). US Public Schools [Dataset]. https://public.opendatasoft.com/explore/dataset/us-public-schools/
    Available download formats: csv, json, excel, geojson
    Dataset updated
    Jan 6, 2023
    License

    Public domain: https://en.wikipedia.org/wiki/Public_domain

    Area covered
    United States
    Description

    This Public Schools feature dataset is composed of all public elementary and secondary education facilities in the United States as defined by the Common Core of Data (CCD, https://nces.ed.gov/ccd/ ), National Center for Education Statistics (NCES, https://nces.ed.gov ), US Department of Education, for the 2017-2018 school year. This includes all Kindergarten through 12th grade schools as tracked by the Common Core of Data. Included in this dataset are military schools in US territories, referenced in the city field with an APO or FPO address. DOD schools represented in the NCES data that are outside of the United States or US territories have been omitted. This feature class contains all MEDS/MEDS+ as approved by NGA. Complete field and attribute information is available in the "Entities and Attributes" metadata section. Geographical coverage is detailed in the Place Keyword section of the metadata. This release includes the addition of 3065 new records, modifications to the spatial location and/or attribution of 99,287 records, and removal of 2996 records not present in the NCES CCD data.
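
    For a quick look at the data, Opendatasoft catalogs expose export endpoints. A minimal sketch, assuming the usual CSV export URL pattern and a semicolon delimiter (both worth verifying on the dataset's export tab):

    import pandas as pd

    # Export URL pattern assumed from Opendatasoft catalog conventions.
    url = ("https://public.opendatasoft.com/explore/dataset/"
           "us-public-schools/download/?format=csv")
    schools = pd.read_csv(url, sep=";")
    print(len(schools), "schools loaded")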

  4. Online Sales Dataset - Popular Marketplace Data

    • kaggle.com
    Updated May 25, 2024
    Cite
    ShreyanshVerma27 (2024). Online Sales Dataset - Popular Marketplace Data [Dataset]. https://www.kaggle.com/datasets/shreyanshverma27/online-sales-dataset-popular-marketplace-data
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShreyanshVerma27
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.

    Columns:

    • Order ID: Unique identifier for each sales order.
    • Date: Date of the sales transaction.
    • Category: Broad category of the product sold (e.g., Electronics, Home Appliances, Clothing, Books, Beauty Products, Sports).
    • Product Name: Specific name or model of the product sold.
    • Quantity: Number of units of the product sold in the transaction.
    • Unit Price: Price of one unit of the product.
    • Total Price: Total revenue generated from the sales transaction (Quantity * Unit Price).
    • Region: Geographic region where the transaction occurred (e.g., North America, Europe, Asia).
    • Payment Method: Method used for payment (e.g., Credit Card, PayPal, Debit Card).

    Insights:

    1. Analyze sales trends over time to identify seasonal patterns or growth opportunities.
    2. Explore the popularity of different product categories across regions.
    3. Investigate the impact of payment methods on sales volume or revenue.
    4. Identify top-selling products within each category to optimize inventory and marketing strategies (see the sketch below).
    5. Evaluate the performance of specific products or categories in different regions to tailor marketing campaigns accordingly.
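
    A minimal pandas sketch for insight 4, using the column names listed above (the file name online_sales.csv is hypothetical):

    import pandas as pd

    sales = pd.read_csv("online_sales.csv")

    # Top-selling product per category by units sold.
    top = (sales.groupby(["Category", "Product Name"])["Quantity"].sum()
                .sort_values(ascending=False)
                .groupby(level="Category")
                .head(1))
    print(top)
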
  5. GRACEnet Soil Biology Network

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). GRACEnet Soil Biology Network [Dataset]. https://catalog.data.gov/dataset/gracenet-soil-biology-network-a44c4
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    To help enhance USA soil health and ensure a robust living soil component that sustains essential functions for healthy plants, animals, and the environment, and ultimately provides food for a healthy society, the GRACEnet Soil Biology group is working together with the larger USDA-ARS GRACEnet community to provide soil biology component measurements across regions and to eliminate data gaps for GRACEnet and REAP efforts. The Soil Biology group is focused on efforts that foster method comparison and meta-analyses to allow researchers to better assess soil biology and soil health indicators that are most responsive to agricultural management and that reflect the ecosystem services associated with a healthy, functioning soil. The GRACEnet Soil Biology mission is to produce the soil biology data, including methods of identifying and quantifying specific organisms and the processes they govern, that are needed to evaluate impacts on agroecosystems and sustainable agricultural practices. This data collection effort is being accomplished in a highly structured manner to support current and future soil health and antimicrobial resistance research initiatives. The outcomes of this team's efforts will provide a common biological data platform for several ARS databases, including GRACEnet/REAP, the Nutrient Use and Outcome Network (NUOnet), the Long-Term Agroecosystem Research (LTAR) network, soil biology databases (e.g., MyPhyloDB), and others.

    Resources in this dataset: Resource Title: Soil Biology Data Search. File Name: Web Page. URL: https://agcros-usdaars.opendata.arcgis.com/datasets?group_ids=091b86e9e44a4e948ef2aeae3c916ca5

  6. Data from: metLinkR: Facilitating Metaanalysis of Human Metabolomics Data through Automated Linking of Metabolite Identifiers

    • figshare.com
    zip
    Updated Apr 4, 2025
    Cite
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé (2025). metLinkR: Facilitating Metaanalysis of Human Metabolomics Data through Automated Linking of Metabolite Identifiers [Dataset]. http://doi.org/10.1021/acs.jproteome.4c01051.s001
    Available download formats: zip
    Dataset updated
    Apr 4, 2025
    Dataset provided by
    ACS Publications
    Authors
    Andrew Patt; Iris Pang; Fred Lee; Chiraag Gohel; Eoin Fahy; Vicki Stevens; David Ruggieri; Steven C. Moore; Ewy A. Mathé
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Metabolites are referenced in spectral, structural and pathway databases with a diverse array of schemas, including various internal database identifiers and large tables of common name synonyms. Cross-linking metabolite identifiers is a required step for meta-analysis of metabolomic results across studies, but it is made difficult by the lack of a consensus identifier system. We have implemented metLinkR, an R package that leverages RefMet and RaMP-DB to automate and simplify cross-linking metabolite identifiers across studies and generating common names. metLinkR accepts as input metabolite common names and identifiers from five different databases (HMDB, KEGG, ChEBI, LIPIDMAPS and PubChem) to exhaustively search for possible overlap in supplied metabolites from input data sets. In an example of 13 metabolomic data sets totaling 10,400 metabolites, metLinkR identified and provided common names for 1377 metabolites in common between at least 2 data sets in less than 18 min and produced standardized names for 74.4% of the input metabolites. In another example comprising five data sets with 3512 metabolites, metLinkR identified 715 metabolites in common between at least two data sets in under 12 min and produced standardized names for 82.3% of the input metabolites. Outputs of metLinkR include tables and metrics that allow users to readily double-check the mappings and get an overview of the chemical classes represented. Overall, metLinkR provides a streamlined solution for a common task in metabolomic epidemiology and other fields that meta-analyze metabolomic data. The R package, vignette and source code are freely downloadable at https://github.com/ncats/metLinkR.

  7. HCUP State Inpatient Databases (SID) - Restricted Access File

    • catalog.data.gov
    • healthdata.gov
    Updated Jul 29, 2025
    Cite
    Agency for Healthcare Research and Quality, Department of Health & Human Services (2025). HCUP State Inpatient Databases (SID) - Restricted Access File [Dataset]. https://catalog.data.gov/dataset/hcup-state-inpatient-databases-sid-restricted-access-file
    Dataset updated
    Jul 29, 2025
    Description

    The Healthcare Cost and Utilization Project (HCUP) State Inpatient Databases (SID) are a set of hospital databases that contain the universe of hospital inpatient discharge abstracts from data organizations in participating States. The data are translated into a uniform format to facilitate multi-State comparisons and analyses. The SID are based on data from short term, acute care, nonfederal hospitals. Some States include discharges from specialty facilities, such as acute psychiatric hospitals. The SID include all patients, regardless of payer, and contain the clinical and resource-use information found in a typical discharge abstract, with safeguards to protect the privacy of individual patients, physicians, and hospitals (as required by data sources). Developed through a Federal-State-Industry partnership sponsored by the Agency for Healthcare Research and Quality (AHRQ), HCUP data inform decision making at the national, State, and community levels. Data elements include but are not limited to: diagnoses, procedures, admission and discharge status, patient demographics (e.g., sex, age), total charges, length of stay, and expected payment source, including but not limited to Medicare, Medicaid, private insurance, self-pay, or those billed as 'no charge'. In addition to the core set of uniform data elements common to all SID, some include State-specific data elements. The SID exclude data elements that could directly or indirectly identify individuals. For some States, hospital and county identifiers are included that permit linkage to the American Hospital Association Annual Survey File and county-level data from the Bureau of Health Professions' Area Resource File, except in States that do not allow the release of hospital identifiers. Restricted access data files are available with a data use agreement and brief online security training.

  8. Common Database on Designated Areas (CDDA)

    • msdi.data.gov.mt
    • testsdi.gov.mt
    ogc:wfs, ogc:wms
    Cite
    Environment and Resources Authority, Common Database on Designated Areas (CDDA) [Dataset]. https://msdi.data.gov.mt/geonetwork/srv/api/records/a81b0ae2-1446-422b-86a5-abc614e24807
    Available download formats: ogc:wms-1.1.1-http-get-capabilities, ogc:wfs
    Dataset provided by
    Environment and Resources Authority
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    No limitations on public access: http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations

    Time period covered
    Jul 19, 1933 - Dec 31, 2022
    Description

    The Common Database on Designated Areas (CDDA) is more commonly known as Nationally designated areas, and is one of the agreed Eionet priority data flows maintained by EEA with support from the European Topic Centre on Biological Diversity. It is a result of an annual data flow through Eionet countries. In fact, Malta, being a member of the EEA, submits this report on an annual basis to fulfill this requirement. The EEA publishes the data set and makes it available to the World Database of Protected Areas (WDPA). The CDDA data can also be queried online in the European Nature Information System (EUNIS).
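
    Since the record advertises OGC web services, the layer can be inspected with a plain GetCapabilities request. A sketch, assuming a standard OGC endpoint (the URL below is a placeholder; take the real service address from the record's links):

    import requests

    WFS_URL = "https://msdi.data.gov.mt/geoserver/wfs"  # placeholder endpoint
    params = {"service": "WFS", "request": "GetCapabilities", "version": "1.1.0"}
    resp = requests.get(WFS_URL, params=params, timeout=30)
    resp.raise_for_status()
    print(resp.text[:500])  # advertised feature types, including the CDDA layer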

  9. R code dataset derivation centralized.

    • plos.figshare.com
    txt
    Updated Nov 14, 2024
    Cite
    Romain Jégou; Camille Bachot; Charles Monteil; Eric Boernert; Jacek Chmiel; Mathieu Boucher; David Pau (2024). R code dataset derivation centralized. [Dataset]. http://doi.org/10.1371/journal.pone.0312697.s011
    Available download formats: txt
    Dataset updated
    Nov 14, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Romain Jégou; Camille Bachot; Charles Monteil; Eric Boernert; Jacek Chmiel; Mathieu Boucher; David Pau
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Methods

    The objective of this project was to determine whether a federated analysis approach using DataSHIELD can maintain the level of results of a classical centralized analysis in a real-world setting. This research was carried out on an anonymous synthetic longitudinal real-world oncology cohort randomly split into three local databases, mimicking three healthcare organizations, stored in a federated data platform integrating DataSHIELD. No individual data were transferred; statistics were calculated simultaneously but in parallel within each healthcare organization, and only summary statistics (aggregates) were provided back to the federated data analyst. Descriptive statistics, survival analysis, regression models and correlation were first performed with the centralized approach and then reproduced with the federated approach. The results were then compared between the two approaches.

    Results

    The cohort was split into three samples (N1 = 157 patients, N2 = 94 and N3 = 64); 11 derived variables and four types of analyses were generated. All analyses were successfully reproduced using DataSHIELD, except for one descriptive variable due to data disclosure limitations in the federated environment, showing the good capability of DataSHIELD. For descriptive statistics, exactly equivalent results were found for the federated and centralized approaches, except for some differences in position measures. Estimates of univariate regression models were similar, with a loss of accuracy observed for multivariate models due to source database variability.

    Conclusion

    Our project showed a practical implementation and use case of a real-world federated approach using DataSHIELD. The capability and accuracy of common data manipulation and analysis were satisfying, and the flexibility of the tool enabled the production of a variety of analyses while preserving the privacy of individual data. The DataSHIELD forum was also a practical source of information and support. To find the right balance between privacy and accuracy of the analysis, privacy requirements should be established prior to the start of the analysis, together with a data quality review of the participating healthcare organizations.

  10. Animal Imaging Database

    • neuinfo.org
    • scicrunch.org
    Updated Sep 7, 2012
    Cite
    (2012). Animal Imaging Database [Dataset]. http://identifiers.org/RRID:SCR_008002
    Dataset updated
    Sep 7, 2012
    Description

    THIS RESOURCE IS NO LONGER IN SERVICE, documented May 10, 2017. A pilot effort that has developed a centralized, web-based biospecimen locator that presents biospecimens collected and stored at participating Arizona hospitals and biospecimen banks, which are available for acquisition and use by researchers. Researchers may use this site to browse, search and request biospecimens to use in qualified studies. The development of the ABL was guided by the Arizona Biospecimen Consortium (ABC), a consortium of hospitals and medical centers in the Phoenix area, and is now being piloted by this Consortium under the direction of ABRC. You may browse by type (cells, fluid, molecular, tissue) or disease. Common data elements decided by the ABC Standards Committee, based on data elements in the National Cancer Institute's (NCI's) Common Biorepository Model (CBM), are displayed. These describe the minimum set of data elements that the NCI determined were most important for a researcher to see about a biospecimen. The ABL currently does not display information on whether or not clinical data are available to accompany the biospecimens. However, a requester has the ability to solicit clinical data in the request. Once a request is approved, the biospecimen provider will contact the requester to discuss the request (and the requester's questions) before finalizing the invoice and shipment. The ABL is available to the public to browse. In order to request biospecimens from the ABL, the researcher will be required to submit the requested required information. Upon submission of the information, shipment of the requested biospecimen(s) will be dependent on scientific and institutional review approval. Account required. Registration is open to everyone. Documented October 4, 2017. A sub-project of the Cell Centered Database (http://ccdb.ucsd.edu) providing a public repository for animal imaging data sets from MRI and related techniques. The public AIDB website provides the ability to browse, visualize and download the animal-subject MRI data. The AIDB is a pilot project to serve the current need for public imaging repositories for animal imaging data. The Cell Centered Database (CCDB) is a web accessible database for high resolution 2D, 3D and 4D data from light and electron microscopy. The AIDB data model is modified from the basic model of the CCDB, where microscopic images are combined to make 2D, 3D and 4D reconstructions. The CCDB has made available over 40 segmented datasets from high resolution magnetic resonance imaging of inbred mouse strains through the prototype AIDB. These data were acquired as part of the Mouse BIRN project by Drs. G. Allan Johnson and Robert Williams. More information about these data can be found in Badea et al. (2009) (Genetic dissection of the mouse CNS using magnetic resonance microscopy - PubMed: 19542887)

  11. Data from: MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis

    • data.niaid.nih.gov
    Updated Apr 19, 2023
    Cite
    Jiancheng Yang (2023). MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4269851
    Dataset updated
    Apr 19, 2023
    Dataset provided by
    Rui Shi
    Bingbing Ni
    Jiancheng Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository for MedMNIST v1 is out of date! Please check the latest version of MedMNIST v2.

    Abstract

    We present MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28x28 images, which requires no background knowledge. Covering the primary data modalities in medical image analysis, it is diverse in data scale (from 100 to 100,000 samples) and tasks (binary/multi-class, ordinal regression and multi-label). MedMNIST can be used for educational purposes, rapid prototyping, multi-modal machine learning or AutoML in medical image analysis. Moreover, the MedMNIST Classification Decathlon is designed to benchmark AutoML algorithms on all 10 datasets; we have compared several baseline methods, including open-source and commercial AutoML tools. The datasets, evaluation code and baseline methods for MedMNIST are publicly available at https://medmnist.github.io/.

    Please note that this dataset is NOT intended for clinical use.

    We recommend our official code to download, parse and use the MedMNIST dataset:

    pip install medmnist
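
    With the package installed, a subset can be downloaded and used like any image dataset. A short sketch following the medmnist API (class names such as PathMNIST correspond to the subsets listed below):

    from medmnist import PathMNIST

    # Downloads the 28x28 colon-pathology subset on first use.
    train = PathMNIST(split="train", download=True)
    img, label = train[0]  # PIL image and label array
    print(len(train), img.size, label)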

    Citation and Licenses

    If you find this project useful, please cite our ISBI'21 paper as: Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis," arXiv preprint arXiv:2010.14925, 2020.

    or using bibtex: @article{medmnist, title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis}, author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing}, journal={arXiv preprint arXiv:2010.14925}, year={2020} }

    Besides, please cite the corresponding paper if you use any subset of MedMNIST. Each subset uses the same license as that of the source dataset.

    PathMNIST

    Jakob Nikolas Kather, Johannes Krisam, et al., "Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study," PLOS Medicine, vol. 16, no. 1, pp. 1–22, 01 2019.

    License: CC BY 4.0

    ChestMNIST

    Xiaosong Wang, Yifan Peng, et al., "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases," in CVPR, 2017, pp. 3462–3471.

    License: CC0 1.0

    DermaMNIST

    Philipp Tschandl, Cliff Rosendahl, and Harald Kittler, "The ham10000 dataset, a large collection of multisource dermatoscopic images of common pigmented skin lesions," Scientific data, vol. 5, pp. 180161, 2018.

    Noel Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, Harald Kittler, and Allan Halpern: “Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)”, 2018; arXiv:1902.03368.

    License: CC BY-NC 4.0

    OCTMNIST/PneumoniaMNIST

    Daniel S. Kermany, Michael Goldbaum, et al., "Identifying medical diagnoses and treatable diseases by image-based deep learning," Cell, vol. 172, no. 5, pp. 1122 – 1131.e9, 2018.

    License: CC BY 4.0

    RetinaMNIST

    DeepDR Diabetic Retinopathy Image Dataset (DeepDRiD), "The 2nd diabetic retinopathy – grading and image quality estimation challenge," https://isbi.deepdr.org/data.html, 2020.

    License: CC BY 4.0

    BreastMNIST

    Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy, "Dataset of breast ultrasound images," Data in Brief, vol. 28, pp. 104863, 2020.

    License: CC BY 4.0

    OrganMNIST_{Axial,Coronal,Sagittal}

    Patrick Bilic, Patrick Ferdinand Christ, et al., "The liver tumor segmentation benchmark (lits)," arXiv preprint arXiv:1901.04056, 2019.

    Xuanang Xu, Fugen Zhou, et al., "Efficient multiple organ localization in ct image using 3d region proposal network," IEEE Transactions on Medical Imaging, vol. 38, no. 8, pp. 1885–1898, 2019.

    License: CC BY 4.0

  12. Iris Species Dataset and Database

    • kaggle.com
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Iris Species Dataset and Database [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/iris-species-dataset-and-database
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Iris Flower Dataset

    This is a classic and very widely used dataset in machine learning and statistics, often serving as a first dataset for classification problems. Introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems," it is a foundational resource for learning classification algorithms.

    Overview:

    The dataset contains measurements for 150 samples of iris flowers. Each sample belongs to one of three species of iris:

    • Iris setosa
    • Iris versicolor
    • Iris virginica

    For each flower, four features were measured:

    • Sepal length (in cm)
    • Sepal width (in cm)
    • Petal length (in cm)
    • Petal width (in cm)

    The goal is typically to build a model that can classify iris flowers into their correct species based on these four features.

    File Structure:

    The dataset is usually provided as a single CSV (Comma Separated Values) file, often named iris.csv or similar. This file typically contains the following columns:

    1. sepal_length (cm): Numerical. The length of the sepal of the iris flower.
    2. sepal_width (cm): Numerical. The width of the sepal of the iris flower.
    3. petal_length (cm): Numerical. The length of the petal of the iris flower.
    4. petal_width (cm): Numerical. The width of the petal of the iris flower.
    5. species: Categorical. The species of the iris flower (either 'setosa', 'versicolor', or 'virginica'). This is the target variable for classification.

    Content of the Data:

    The dataset contains an equal number of samples (50) for each of the three iris species. The measurements of the sepal and petal dimensions vary between the species, allowing for their differentiation using machine learning models.

    How to Use This Dataset:

    1. Download the iris.csv file.
    2. Load the data using libraries like Pandas in Python.
    3. Explore the data through visualization and statistical analysis to understand the relationships between the features and the different species.
    4. Build classification models (e.g., Logistic Regression, Support Vector Machines, Decision Trees, K-Nearest Neighbors) using the sepal and petal measurements as features and the 'species' column as the target variable.
    5. Evaluate the performance of your model using appropriate metrics (e.g., accuracy, precision, recall, F1-score).
    6. The dataset is small and well-behaved, making it excellent for learning and experimenting with various classification techniques.
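
    Putting steps 2-5 together, a minimal sketch with pandas and scikit-learn (the column names follow the File Structure section above; adjust them to the headers in your copy of iris.csv):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    iris = pd.read_csv("iris.csv")
    X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
    y = iris["species"]

    # Hold out 30% of samples, keeping the three species balanced.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))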

    Citation:

    When using the Iris dataset, it is common to cite Ronald Fisher's original work:

    Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.

    Data Contribution:

    Thank you for providing this classic and fundamental dataset to the Kaggle community. The Iris dataset remains an invaluable resource for both beginners learning the basics of classification and experienced practitioners testing new algorithms. Its simplicity and clear class separation make it an ideal starting point for many data science projects.

    If you find this dataset description helpful and the dataset itself useful for your learning or projects, please consider giving it an upvote after downloading. Your appreciation is valuable!

  13. Data: Self-Sacrifice for the Common Good under Risk and Competition

    • b2find.dkrz.de
    • b2find.eudat.eu
    Updated Aug 10, 2025
    Cite
    (2025). Data: Self-Sacrifice for the Common Good under Risk and Competition - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/31858c26-bb37-5978-b888-854d7db9ba81
    Dataset updated
    Aug 10, 2025
    Description

    This is the database for the article Self-Sacrifice for the Common Good under Risk and Competition. Public service-motivated individuals have a greater concern for the delivery of public services and for the societal consequence of collective inaction, and they see themselves playing a pivotal role in upholding public goods. Such self-efficacy and perceived importance of public service jointly motivate individuals to commit themselves to sacrificing for the common good. Using an incentivized laboratory experiment with a Volunteer’s Dilemma game, which is a well-established stylized variant of a common good setting, we explore the association between self-reported Public Service Motivation (PSM) and voluntary self-sacrifice under different task characteristics and social contexts. We find that risk-taking and intergroup competition negatively moderate the positive effect of PSM on volunteering. The risky situation may reduce an individual’s self-efficacy in making meaningful sacrifice, and intergroup competition may divert attention from the concern for society at large to the outcome of the competition, compromising the positive effect of PSM on the likelihood to self-sacrifice for the common good.

  14. Beat This! Spectrograms for Beat and Downbeat Tracking

    • zenodo.org
    zip
    Updated Oct 17, 2024
    Cite
    Jan Schlüter; Jan Schlüter; Francesco Foscarin; Francesco Foscarin (2024). Beat This! Spectrograms for Beat and Downbeat Tracking [Dataset]. http://doi.org/10.5281/zenodo.13922116
    Available download formats: zip
    Dataset updated
    Oct 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Schlüter; Jan Schlüter; Francesco Foscarin; Francesco Foscarin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This collection contains mel spectrograms and annotations of 16 datasets for beat and downbeat tracking. All datasets have been used in "Beat This! Accurate beat tracking without DBN postprocessing" (Foscarin/Schlüter/Widmer, ISMIR 2024) and prior publications by other authors, but for many of these datasets, audio data is not publicly available. By publishing the spectrograms, we invite other researchers to improve the state of the art in beat and downbeat tracking.

    Datasets

    Spectrograms for the following datasets are included in the collection:

    • asap: "ASAP: a dataset of aligned scores and performances for piano transcription" (Foscarin et al., ISMIR 2020)
    • ballroom: "An experimental comparison of audio tempo induction algorithms" (Gouyon et al., TASLP 2006) for the audio and "Rhythmic Pattern Modeling for Beat and Downbeat Tracking in Musical Audio" (Krebs/Böck/Widmer, ISMIR 2013) for the annotations
    • beatles: "Evaluation methods for musical audio beat tracking algorithms" (Davies/Degara/Plumbley, Tech. Rep., QMU, 2019)
    • candombe: "Beat and Downbeat Tracking Based on Rhythmic Patterns Applied to the Uruguayan Candombe Drumming" (Nunes et al., ISMIR 2015)
    • filosax: "Filosax: A dataset of annotated jazz saxophone recordings" (Foster/Dixon, ISMIR 2021)
    • groove_midi: "Learning to groove with inverse sequence transformations" (Gillick et al., ICML 2019)
    • gtzan: "Musical genre classification of audio signals" (Tzanetakis/Cook, TSAP 2002) for the audio and "Swing ratio estimation" (Marchand/Peters, DAFx 2015) for the annotations
    • guitarset: "GuitarSet: A dataset for guitar transcription" (Xi et al., ISMIR 2018)
    • hainsworth: "Particle filtering applied to musical tempo tracking" (Hainsworth/Macleod, JASP 2004)
    • harmonix: "The Harmonix set: Beats, downbeats, and functional segment annotations of western popular music" (Nieto et al., ISMIR 2019) for the original and "Modeling Beats and Downbeats with a Time-Frequency Transformer" (Hung et al., ICASSP 2022) for the version included here
    • hjdb: "One in the jungle: Downbeat detection in hardcore, jungle, and drum and bass" (Hockman/Davies/Fujinaga, ISMIR 2012)
    • jaah: "Audio-aligned jazz harmony dataset for automatic chord transcription and corpus-based research" (Eremenko et al., ISMIR 2018)
    • rwc: "RWC music database: Popular, classical and jazz music databases" (Goto et al., ISMIR 2002) for the audio and "AIST annotation for the RWC music database" (Goto, ISMIR 2006) for the annotations
    • simac: "A computational approach to rhythm description — Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing" (Gouyon, PhD thesis, UPF, 2005)
    • smc: "Selective sampling for beat tracking evaluation" (Holzapfel et al., TASLP 2012)
    • tapcorrect: "Towards Automatically Correcting Tapped Beat Annotations for Music Recordings" (Driedger et al., ISMIR 2019)

    If given, links in the above list point to locations for obtaining the original audio.

    Annotations

    The corresponding annotations are available on https://github.com/CPJKU/beat_this_annotations. A snapshot of v1.0 is included in this collection as beat_this_annotations.zip, but you may want to use a later release.

    Spectrograms

    Spectrograms are computed from monophonic audio at a sample rate of 22050 Hz with a window size of 1024 and hop size of 441 samples (yielding 50 frames per second), processed with a mel filterbank of 128 bands from 30 Hz to 11 kHz, and magnitudes scaled with ln(1+1000x). They are provided in half-precision floating-point format. Spectrograms can be reproduced with torchaudio 2.3.1 from a 22050 Hz waveform tensor (resampled with soxr.resample(), if needed) via:

    melspect = torchaudio.transforms.MelSpectrogram(
        sample_rate=22050,
        n_fft=1024,                 # 1024-sample windows
        hop_length=441,             # 50 frames per second at 22050 Hz
        f_min=30,
        f_max=11000,
        n_mels=128,
        mel_scale='slaney',
        normalized='frame_length',
        power=1,                    # magnitude (not power) spectrogram
    )(waveform).mul(1000).log1p()   # ln(1 + 1000x) magnitude scaling

    Format

    For each dataset, a compressed .zip file is provided, which in turn holds an uncompressed .npz file. The .npz file holds a set of numpy arrays in subdirectories named after the annotations. Each subdirectory contains a spectrogram of the original audio file ("track.npy"), 11 pitch-shifted versions from -5 to +6 semitones ("track_ps-5.npy" to "track_ps6.npy") and 10 time-stretched versions from -20% to +20% ("track_ts-20.npy" to "track_ts20.npy"), except for gtzan.npz, which is designated for testing and only holds the original audio files. The .npz files can be loaded in numpy via np.load(), or unzipped into a set of .npy files that can again be loaded via np.load(). We also provide code to load .npz files as memory maps for more efficiency.
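
    For example, one archive can be inspected in a few lines (the key layout is assumed from the description above; exact names depend on the annotations):

    import numpy as np

    archive = np.load("gtzan.npz")      # NpzFile; per-array lazy access
    print(archive.files[:5])            # e.g. "<track name>/track" and variants
    spect = archive[archive.files[0]]   # float16 mel spectrogram (128 bands)
    print(spect.shape, spect.dtype)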

  15. A database of eight common tomato pest images

    • data.mendeley.com
    • search.datacite.org
    Updated May 27, 2020
    Cite
    Mei-Ling Huang (2020). A database of eight common tomato pest images [Dataset]. http://doi.org/10.17632/s62zm6djd2.1
    Dataset updated
    May 27, 2020
    Authors
    Mei-Ling Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The database covers eight common tomato pests: (1) Tetranychus urticae, (2) Bemisia argentifolii, (3) Zeugodacus cucurbitae, (4) Thrips palmi, (5) Myzus persicae, (6) Spodoptera litura, (7) Spodoptera exigua, and (8) Helicoverpa armigera. The original images were collected from the IPMImages database (https://www.ipmimages.org/index.cfm), the National Bureau of Agricultural Insect Resources (NBAIR) (https://www.nbair.res.in/Databases/insectpests/index.php) and Google search. The image database contains 609 original images in 8 categories and was augmented with image enhancement techniques to a total of 4263 images. The enhancement operations are 90 degree rotation, 180 degree rotation, 270 degree rotation, horizontal flip, vertical flip and crop. Finally, the image size is unified to 299*299 and the images are stored in .JPG format.
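
    The augmentation pipeline described above is straightforward to reproduce. A sketch with Pillow (file names are hypothetical):

    from PIL import Image

    img = Image.open("pest.jpg")
    variants = {
        "rot90": img.transpose(Image.Transpose.ROTATE_90),
        "rot180": img.transpose(Image.Transpose.ROTATE_180),
        "rot270": img.transpose(Image.Transpose.ROTATE_270),
        "hflip": img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),
        "vflip": img.transpose(Image.Transpose.FLIP_TOP_BOTTOM),
    }
    for name, v in variants.items():
        # Unify size to 299x299 and save as JPG, as in the database.
        v.resize((299, 299)).save(f"pest_{name}.jpg", "JPEG")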

  16. Data from: MIT-BIH Arrhythmia Database

    • physionet.org
    • opendatalab.com
    Updated Feb 24, 2005
    Cite
    George Moody; Roger Mark (2005). MIT-BIH Arrhythmia Database [Dataset]. http://doi.org/10.13026/C2F305
    Dataset updated
    Feb 24, 2005
    Authors
    George Moody; Roger Mark
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979. Twenty-three recordings were chosen at random from a set of 4000 24-hour ambulatory ECG recordings collected from a mixed population of inpatients (about 60%) and outpatients (about 40%) at Boston's Beth Israel Hospital; the remaining 25 recordings were selected from the same set to include less common but clinically significant arrhythmias that would not be well-represented in a small random sample.
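
    The records can be read directly from PhysioNet with the wfdb Python package; a minimal sketch (record 100 is one of the 48 half-hour excerpts):

    import wfdb  # pip install wfdb

    record = wfdb.rdrecord("100", pn_dir="mitdb")   # two-channel ECG, 360 Hz
    ann = wfdb.rdann("100", "atr", pn_dir="mitdb")  # beat/rhythm annotations
    print(record.fs, record.sig_name)
    print(len(ann.sample), "annotations")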

  17. FishBase Database

    • gbif.org
    • portal.obis.org
    Updated Mar 23, 2023
    Cite
    Michael Norén; Michael Norén (2023). FishBase Database [Dataset]. http://doi.org/10.15468/wk3zk7
    Dataset updated
    Mar 23, 2023
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    FishBase (http://www.fishbase.us/)
    Authors
    Michael Norén; Michael Norén
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FishBase is a global biodiversity information system on fin-fishes. From its initial goal of providing key facts on population dynamics for 200 major commercial species, FishBase has grown to offer a wide range of information on all species currently known in the world: taxonomy, biology, trophic ecology, life history, and uses, as well as historical data reaching back 250 years.

    At present, FishBase covers >35,000 fish species compiled from >59,200 references in partnership with 2,480 collaborators, >325,700 common names and >61,700 pictures.

    The breadth and depth of information in the database, combined with the analytical and graphical tools available in the web, cater to different needs of diverse groups of stakeholders (scientists, researchers, policy makers, fisheries managers, donors, conservationists, teachers and students). Its various applications aim for sustainable fisheries management, biodiversity conservation and environmental protection.

    FishBase is currently hosted by Quantitative Aquatics, Incorporated (Q-quatics), a non-stock, non-profit, non-governmental organization engaged in the development and management of global databases on aquatic organisms, including their distribution and ecology. It is scientifically guided by a Consortium of 12 international members.

  18. Data from: GSTRIDE: A database of frailty and functional assessments with inertial gait data from elderly fallers and non-fallers populations

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jul 3, 2023
    Cite
    Guillermo García-Villamil Neira; Guillermo García-Villamil Neira; Marta Neira Álvarez; Marta Neira Álvarez; Elisabet Huertas Hoyas; Elisabet Huertas Hoyas; Luisa Ruiz Ruiz; Luisa Ruiz Ruiz; Sara García-de-Villa; Sara García-de-Villa; Antonio J. del-Ama; Antonio J. del-Ama; María Cristina Rodríguez Sánchez; María Cristina Rodríguez Sánchez; Antonio Jiménez Ruiz; Antonio Jiménez Ruiz (2023). GSTRIDE: A database of frailty and functional assessments with inertial gait data from elderly fallers and non-fallers populations [Dataset]. http://doi.org/10.5281/zenodo.6883292
    Available download formats: zip
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Guillermo García-Villamil Neira; Guillermo García-Villamil Neira; Marta Neira Álvarez; Marta Neira Álvarez; Elisabet Huertas Hoyas; Elisabet Huertas Hoyas; Luisa Ruiz Ruiz; Luisa Ruiz Ruiz; Sara García-de-Villa; Sara García-de-Villa; Antonio J. del-Ama; Antonio J. del-Ama; María Cristina Rodríguez Sánchez; María Cristina Rodríguez Sánchez; Antonio Jiménez Ruiz; Antonio Jiménez Ruiz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The GSTRIDE database contains relevant metrics and motion data of elder people for the assessment of their health status. The data correspond to 163 patients, 45 men and 118 women, between 70 and 98 years old with an average Body Mass Index (BMI) of 26.1±5.0 kg/m2 and a cognitive deterioration status index between 1 and 7, according to the Global Deterioration Scale (GDS) scale. In this way, we ensure variability among the volunteers in terms of socio-demographic and anatomic parameters and their functional and cognitive capacities. The database files are stored in TXT and CSV format to ease their usability with common data processing software.

    We provide socio-demographic data, anatomical, functional and cognitive variables, and the outcome measurements from tests commonly performed for the evaluation of elder people. The evaluation tests carried out to obtain these data are the Gait Speed Test (4-metre), the Hand Grip Strength, the Short Physical Performance Battery (SPPB), the Timed Up and Go (TUG) and the Short Falls Efficacy Scale International (FES-I). We also include the outcomes of the GDS questionnaire, the Frailty assessment and the information about falls during the year prior to the tests.

    These data are complemented with the gait parameters of a walking test recorded by an Inertial Measurement Unit (IMU) placed on the foot. The walking tests have an average duration of 21.4±7.1 minutes, which are analyzed in order to estimate the total walking distance, the number of strides and the gait spatio-temporal parameters. The results of this analysis include the following metrics: stride time duration, stride length, step speed, percentage of the gait phases (toe off, swing, heel strike, foot flat) over the strides, foot angle during the toe off and heel strike phases, cadence, step speed, 3D and 2D paths and clearance. We provide these metrics for the steps detected, as well as their average and variance values in the database record.

    The raw signals and the signals calibrated using the calibration parameters (bias vector, misalignment and scaling matrix, and the sampling rate correction factor) are included in the database to allow researchers to explore other approaches to gait analysis. These signals consist of the linear acceleration and the turn rate. The files also contain the calibration parameters and the specifications of the inertial sensors used in this work. Furthermore, these data are accompanied by the gait analysis code used to obtain the metrics given in the database, which also provides visualization tools to study the distribution of these metrics.
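
    As a sketch of how the tabular part might be explored (file and column names here are purely illustrative; the actual names are listed in the Zenodo record):

    import pandas as pd

    gait = pd.read_csv("GSTRIDE_database.csv")  # hypothetical file name
    # Compare average stride length between fallers and non-fallers,
    # assuming columns "Faller" and "Stride_Length_mean" exist.
    print(gait.groupby("Faller")["Stride_Length_mean"].describe())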

    GSTRIDE is especially focused on, but not limited to, the study of faller and non-faller elder people. The main aim of this dataset is to make data available for studying these different populations. By including the results of the health evaluation tests and questionnaires together with the inertial and spatio-temporal data, researchers can analyze different techniques for the identification of fallers. Moreover, this database allows the research community to analyze cognitive deterioration and frailty parameters of patients.

  19. U.S. Geological Survey Oceanographic Time Series Data Collection

    • datadiscoverystudio.org
    • data.usgs.gov
    html
    Updated Jun 8, 2018
    Cite
    (2018). U.S. Geological Survey Oceanographic Time Series Data Collection. [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/b7f5402ffa2a4493a15b3c75115a3733/html
    Available download formats: html
    Dataset updated
    Jun 8, 2018
    Description

    The oceanographic time series data collected by U.S. Geological Survey scientists and collaborators are served in an online database at http://stellwagen.er.usgs.gov/index.html. These data were collected as part of research experiments investigating circulation and sediment transport in the coastal ocean. The experiments (projects, research programs) are typically one month to several years long and have been carried out since 1975. New experiments will be conducted, and the data from them will be added to the collection. As of 2016, all but one of the experiments were conducted in waters abutting the U.S. coast; the exception was conducted in the Adriatic Sea. Measurements acquired vary by site and experiment; they usually include current velocity, wave statistics, water temperature, salinity, pressure, turbidity, and light transmission from one or more depths over a time period. The measurements are concentrated near the sea floor but may also include data from the water column. The user interface provides an interactive map, a tabular summary of the experiments, and a separate page for each experiment. Each experiment page has documentation and maps that provide details of what data were collected at each site. Links to related publications with additional information about the research are also provided. The data are stored in Network Common Data Format (netCDF) files using the Equatorial Pacific Information Collection (EPIC) conventions defined by the National Oceanic and Atmospheric Administration (NOAA) Pacific Marine Environmental Laboratory. NetCDF is a general, self-documenting, machine-independent, open source data format created and supported by the University Corporation for Atmospheric Research (UCAR). EPIC is an early set of standards designed to allow researchers from different organizations to share oceanographic data. The files may be downloaded or accessed online using the Open-source Project for a Network Data Access Protocol (OPeNDAP). The OPeNDAP framework allows users to access data from anywhere on the Internet using a variety of Web services including Thematic Realtime Environmental Distributed Data Services (THREDDS). A subset of the data compliant with the Climate and Forecast convention (CF, currently version 1.6) is also available.
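
    Because the files are served over OPeNDAP, they can be opened lazily without downloading. A sketch with xarray (the dataset URL is a placeholder; real endpoints are linked from each experiment page):

    import xarray as xr

    url = "http://stellwagen.er.usgs.gov/opendap/experiment/file.nc"  # placeholder
    ds = xr.open_dataset(url)  # lazy: reads metadata only
    print(ds.data_vars)        # e.g. current velocity, temperature, pressure
    ds.close()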

  20. Euroccupations dataset, codebook, and labelset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    Cite
    Esther de Ruijter (2024). Euroccupations dataset, codebook, and labelset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3985647
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Esther de Ruijter
    Kea Tijdens
    Judith de Ruijter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Title: Developing a detailed 8-country occupations database for comparative socio-economic research in the European Union
    Duration: April 2006 - March 2009
    Funded by: European Commission - 6th Framework Programme - FP6-028987

    The EurOccupations project aimed to build a publicly available database containing the most common occupations for use in multi-country data-collection, through the Internet or otherwise. It covered eight EU countries, notably Belgium, France, Germany, Italy, Netherlands, Poland, Spain, and United Kingdom. The database includes a source list of 1,594 distinct occupational titles within the ISCO-08 classification, country-specific translations and a search tree to navigate through the database.

    An updated occupational database is downloadable here: https://www.surveycodings.org/occupation-measurement

    See for information about the project: https://wageindicator.org/Wageindicatorfoundation/projects/euroccp

    See publications:

    Tijdens, K.G., De Ruijter, E. and De Ruijter, J. (2014). Comparing work tasks of 160 occupations across eight European countries. Employee Relations, 36 (2), pp. 110-127. http://www.emeraldinsight.com/toc/er/36/2

    Tijdens, K.G., De Ruijter, J. and De Ruijter, E. (2012), "Measuring work activities and skill requirements of occupations: Experiences from a European pilot study with a web‐survey", European Journal of Training and Development, Vol. 36 No. 7, pp. 751-763. https://doi.org/10.1108/03090591211255575
