100+ datasets found

u
Product Exchange/Bartering Data
cseweb.ucsd.edu
json
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Product Exchange/Bartering Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
jsonAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
These datasets contain peer-to-peer trades from various recommendation platforms.

Metadata includes

peer-to-peer trades

have and want lists

image data (tradesy)
B
Data Cleaning Sample
borealisdata.ca
dataone.org
Updated Jul 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/ZCN177
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Sample data for exercises in Further Adventures in Data Cleaning.
Sample Purchasing / Supply Chain Data
catalog.data.gov
s.cnmilf.com
+2more
Updated Jul 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2022). Sample Purchasing / Supply Chain Data [Dataset]. https://catalog.data.gov/dataset/sample-purchasing-supply-chain-data
Explore at:
Dataset updated
Jul 29, 2022
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
Sample purchasing data containing information on suppliers, the products they provide, and the projects those products are used for. Data created or adapted from publicly available sources.
Sample CVs Dataset for Analysis
kaggle.com
Updated Aug 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
lone (2024). Sample CVs Dataset for Analysis [Dataset]. https://www.kaggle.com/datasets/hussnainmushtaq/sample-cvs-dataset-for-analysis
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 19, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
lone
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset contains a small collection of 6 randomly selected CVs (Curriculum Vitae), representing various professional backgrounds. The dataset is intended to serve as a resource for research in fields such as Human Resources (HR), data analysis, natural language processing (NLP), and machine learning. It can be used for tasks like resume parsing, skill extraction, job matching, and analyzing trends in professional qualifications and experiences. Potential Use Cases: This dataset can be used for various research and development purposes, including but not limited to:

Resume Parsing: Developing algorithms to automatically extract and categorize information from resumes. Skill Extraction: Identifying key skills and competencies from text data within the CVs. Job Matching: Creating models to match candidates to job descriptions based on their qualifications and experience. NLP Research: Analyzing language patterns, sentence structure, and terminology used in professional resumes. HR Analytics: Studying trends in career paths, education, and skill development across different professions. Training Data for Machine Learning Models: Using the dataset as a sample for training and testing machine learning models in HR-related applications. Dataset Format: The dataset is available in a compressed file (ZIP) containing the 6 CVs in both PDF and DOCX formats. This allows for flexibility in how the data is processed and analyzed.

Licensing: This dataset is shared under the CC BY-NC-SA 4.0 license. This means that you are free to:

Share: Copy and redistribute the material in any medium or format. Adapt: Remix, transform, and build upon the material. Under the following terms:

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. NonCommercial: You may not use the material for commercial purposes. ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. Citation: If you use this dataset in your research or projects, please cite it as follows:

"Sample CVs Dataset for Analysis, Mushtaq et al., Kaggle, 2024."

Limitations and Considerations: Sample Size: The dataset contains only 6 CVs, which is a very small sample size. It is intended for educational and prototyping purposes rather than large-scale analysis. Anonymization: Personal details such as names, contact information, and specific locations may be anonymized or altered to protect privacy. Bias: The dataset is not representative of the entire population and may contain biases related to profession, education, and experience. This dataset is a useful starting point for developing models or conducting small-scale experiments in HR-related fields. However, users should be aware of its limitations and consider supplementing it with additional data for more robust analysis.
c
Exhibit of Datasets
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Sep 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
P.K. Doorn; L. Breure (2024). Exhibit of Datasets [Dataset]. http://doi.org/10.17026/SS/TLTMIR
Explore at:
Unique identifier
https://doi.org/10.17026/SS/TLTMIR
Dataset updated
Sep 3, 2024
Dataset provided by
DANS (retired)
Authors
P.K. Doorn; L. Breure
Description
The Exhibit of Datasets was an experimental project with the aim of providing concise introductions to research datasets in the humanities and social sciences deposited in a trusted repository and thus made accessible for the long term. The Exhibit consists of so-called 'showcases', short webpages summarizing and supplementing the corresponding data papers, published in the Research Data Journal for the Humanities and Social Sciences. The showcase is a quick introduction to such a dataset, a bit longer than an abstract, with illustrations, interactive graphs and other multimedia (if available). As a rule it also offers the option to get acquainted with the data itself, through an interactive online spreadsheet, a data sample or link to the online database of a research project. Usually, access to these datasets requires several time consuming actions, such as downloading data, installing the appropriate software and correctly uploading the data into these programs. This makes it difficult for interested parties to quickly assess the possibilities for reuse in other projects.

The Exhibit aimed to help visitors of the website to get the right information at a glance by: - Attracting attention to (recently) acquired deposits: showing why data are interesting. - Providing a concise overview of the dataset's scope and research background; more details are to be found, for example, in the associated data paper in the Research Data Journal (RDJ). - Bringing together references to the location of the dataset and to more detailed information elsewhere, such as the project website of the data producers. - Allowing visitors to explore (a sample of) the data without downloading and installing associated software at first (see below). - Publishing related multimedia content, such as videos, animated maps, slideshows etc., which are currently difficult to include in online journals as RDJ. - Making it easier to review the dataset. The Exhibit would also have been the right place to publish these reviews in the same way as a webshop publishes consumer reviews of a product, but this could not yet be achieved within the limited duration of the project.

Note (1) The text of the showcase is a summary of the corresponding data paper in RDJ, and as such a compilation made by the Exhibit editor. In some cases a section 'Quick start in Reusing Data' is added, whose text is written entirely by the editor. (2) Various hyperlinks such as those to pages within the Exhibit website will no longer work. The interactive Zoho spreadsheets are also no longer available because this facility has been discontinued.
Z
UCI and OpenML Data Sets for Ordinal Quantification
data.niaid.nih.gov
zenodo.org
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bunse, Mirko (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
Explore at:
Dataset updated
Jul 25, 2023
Dataset provided by
Bunse, Mirko
Moreo, Alejandro
Senz, Martin
Sebastiani, Fabrizio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
m
Impact of limited data availability on the accuracy of project duration...
data.mendeley.com
Updated Nov 22, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Naimeh Sadeghi (2022). Impact of limited data availability on the accuracy of project duration estimation in project networks [Dataset]. http://doi.org/10.17632/bjfdw6xbxw.3
Explore at:
Unique identifier
https://doi.org/10.17632/bjfdw6xbxw.3
Dataset updated
Nov 22, 2022
Authors
Naimeh Sadeghi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations. Number of variables: 20 Number of cases/rows: 114240 Variable List: • Experiment ID: The ID of the experiment • Experiment for network: The ID of the experiment for each of the synthesized networks • Network ID: ID of the synthesized network • #Activities: Number of activities in the network, including start and finish activities • Variability: Variance of the activities in the network (this value can be either high, low, medium or rand, where rand shows a random combination of low, high and medium variance in the network activities.) • Skewness: Skewness of the activities in the network (Skewness can be either right, left, None or rand, where rand shows a random combination of right, left, and none skewed in the network activities)
• Fitted distribution type: Distribution type used to fit on sampled data • Sample size: Number of sampled data used for the experiment resembling limited data condition • Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network • Benchmark 50th percentile: 50th project duration in the benchmark stochastic project network • Benchmark 90th percentile: 90th project duration in the benchmark stochastic project network • Benchmark mean: Mean project duration in the benchmark stochastic project network • Benchmark variance: Variance project duration in the benchmark stochastic project network • Experiment 10th percentile: 10th percentile of project duration distribution for the experiment • Experiment 50th percentile: 50th percentile of project duration distribution for the experiment • Experiment 90th percentile: 90th percentile of project duration distribution for the experiment • Experiment mean: Mean of project duration distribution for the experiment • Experiment variance: Variance of project duration distribution for the experiment • K-S: Kolmogorov–Smirnov test comparing benchmark distribution and project duration • distribution of the experiment • P_value: the P-value based on the distance calculated in the K-S test
h
autotrain-data-sample
huggingface.co
Updated Aug 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maulik Madhavi (2023). autotrain-data-sample [Dataset]. https://huggingface.co/datasets/MaulikMadhavi/autotrain-data-sample
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 4, 2023
Authors
Maulik Madhavi
Description
AutoTrain Dataset for project: sample

Dataset Description

This dataset has been automatically processed by AutoTrain for project sample.

Languages

The BCP-47 code for the dataset's language is unk.

Dataset Structure Data Instances

A sample from this dataset looks as follows: [ { "image": "<500x375 RGB PIL image>", "target": 1 }, { "image": "<378x274 RGB PIL image>", "target": 0 } ]… See the full description on the dataset page: https://huggingface.co/datasets/MaulikMadhavi/autotrain-data-sample.
u
Pinterest Fashion Compatibility
cseweb.ucsd.edu
beta.data.urbandatacentre.ca
json
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, Pinterest Fashion Compatibility [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
jsonAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
This dataset contains images (scenes) containing fashion products, which are labeled with bounding boxes and links to the corresponding products.

Metadata includes

product IDs

bounding boxes

Basic Statistics:

Scenes: 47,739

Products: 38,111

Scene-Product Pairs: 93,274
m
Building Information Modeling (BIM) adoption level in Lima and Callao,...
data.mendeley.com
Updated Jul 12, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gerson Tapia Nieto (2020). Building Information Modeling (BIM) adoption level in Lima and Callao, Dataset [Dataset]. http://doi.org/10.17632/8n2ymkttkp.4
Explore at:
Unique identifier
https://doi.org/10.17632/8n2ymkttkp.4
Dataset updated
Jul 12, 2020
Authors
Gerson Tapia Nieto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Callao
Description
This research data aims to measure the level of Building Information Modeling (BIM) adoption in urban building projects in Lima city and Callao by the end of 2017. This level helped us to determine in what category of adopter Lima city, Peru most important city in terms of urban buildings, is located according to the Diffusion of Innovations theory (Rogers, 2003). Our hypothesis was that 15% of urban building projects adopted BIM by the end of 2017. The level of adoption is estimated through sampling principles and the population data (N=1218) can be found in the publication “Urban buildings market in Lima city and Callao 2017: edition 22” (CAPECO, 2017). The survey (docx file) is divided in five sections: general data of the interviewee, BIM perception, BIM acceptance, BIM adoption and general data of the project. The final data (xlsx file) provides the results of the survey that was answer by 323 professionals related to the building industry (Civil engineers, architects and others) and each answer corresponded to a unique project. As it was mentioned the population was based on all new urban building projects in the geographical area of study and under construction process during the period of data collection. A project was considered under construction process when it was found at the beginning of earthworks or preliminary works until the delivery of the unoccupied project. In addition, remodeling projects that involved expanding their built-up area were also considered, but this expansion had to be at least 500 sm. On the other hand, all single family houses and multi-family buildings that do not had a public construction license were excluded. The data collection was carried out by a research manager and two research assistants. Each research assistant was assigned a certain number of clusters within the designed sample. The sampling frame used in this research is one of an area type which are geographical surfaces well-defined. These surfaces are clusters and in this case were districts of Lima city and Callao. The data collection was taken from October to December of 2017. The method to reach the sample size (n=323) was through an emailed virtual survey (52 answers) and by visiting building projects site (271 answers). Projects visited were found aleatory, with the only requirement to be inside a designed sample cluster. It has been considered that a project has adopted BIM if it has used any of its applications: 3D models visualization; 3D modeling; material quantification and budgets made from 3D models; structure, MEP or HVAC model coordination; 4D construction simulation; control of construction progress with BIM; procurement of precast components; or the generation of 2D drawings from 3D models. The main notable finding is that the BIM adoption level in urban building projects in Lima city and Callao in 2017 was 21.6% (70/323), which places this analyzed region in the category of “early majority” adopters (Rogers, 2003).
1000 Cannabis Genomes Project
kaggle.com
zip
Updated Feb 26, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). 1000 Cannabis Genomes Project [Dataset]. https://www.kaggle.com/bigquery/genomics-cannabis
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Feb 26, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

Cannabis is a genus of flowering plants in the family Cannabaceae.

Source: https://en.wikipedia.org/wiki/Cannabis

Content

In October 2016, Phylos Bioscience released a genomic open dataset of approximately 850 strains of Cannabis via the Open Cannabis Project. In combination with other genomics datasets made available by Courtagen Life Sciences, Michigan State University, NCBI, Sunrise Medicinal, University of Calgary, University of Toronto, and Yunnan Academy of Agricultural Sciences, the total amount of publicly available data exceeds 1,000 samples taken from nearly as many unique strains.

https://medium.com/google-cloud/dna-sequencing-of-1000-cannabis-strains-publicly-available-in-google-bigquery-a33430d63998

These data were retrieved from the National Center for Biotechnology Information’s Sequence Read Archive (NCBI SRA), processed using the BWA aligner and FreeBayes variant caller, indexed with the Google Genomics API, and exported to BigQuery for analysis. Data are available directly from Google Cloud Storage at gs://gcs-public-data--genomics/cannabis, as well as via the Google Genomics API as dataset ID 918853309083001239, and an additional duplicated subset of only transcriptome data as dataset ID 94241232795910911, as well as in the BigQuery dataset bigquery-public-data:genomics_cannabis.

All tables in the Cannabis Genomes Project dataset have a suffix like _201703. The suffix is referred to as [BUILD_DATE] in the descriptions below. The dataset is updated frequently as new releases become available.

The following tables are included in the Cannabis Genomes Project dataset:

Sample_info contains fields extracted for each SRA sample, including the SRA sample ID and other data that give indications about the type of sample. Sample types include: strain, library prep methods, and sequencing technology. See SRP008673 for an example of upstream sample data. SRP008673 is the University of Toronto sequencing of Cannabis Sativa subspecies Purple Kush.

MNPR01_reference_[BUILD_DATE] contains reference sequence names and lengths for the draft assembly of Cannabis Sativa subspecies Cannatonic produced by Phylos Bioscience. This table contains contig identifiers and their lengths.

MNPR01_[BUILD_DATE] contains variant calls for all included samples and types (genomic, transcriptomic) aligned to the MNPR01_reference_[BUILD_DATE] table. Samples can be found in the sample_info table. The MNPR01_[BUILD_DATE] table is exported using the Google Genomics BigQuery variants schema. This table is useful for general analysis of the Cannabis genome.

MNPR01_transcriptome_[BUILD_DATE] is similar to the MNPR01_[BUILD_DATE] table, but it includes only the subset transcriptomic samples. This table is useful for transcribed gene-level analysis of the Cannabis genome.

Fork this kernel to get started with this dataset.

Acknowledgements

Dataset Source: http://opencannabisproject.org/ Category: Genomics Use: This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://www.ncbi.nlm.nih.gov/home/about/policies.shtml - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset. Update frequency: As additional data are released to GenBank View in BigQuery: https://bigquery.cloud.google.com/dataset/bigquery-public-data:genomics_cannabis View in Google Cloud Storage: gs://gcs-public-data--genomics/cannabis

Banner Photo by Rick Proctor from Unplash.

Inspiration

Which Cannabis samples are included in the variants table?

Which contigs in the MNPR01_reference_[BUILD_DATE] table have the highest density of variants?

How many variants does each sample have at the THC Synthase gene (THCA1) locus?
Dataset #2: Experimental study
figshare.com
docx
Updated Jul 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Adam Baimel (2023). Dataset #2: Experimental study [Dataset]. http://doi.org/10.6084/m9.figshare.23708766.v1
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23708766.v1
Dataset updated
Jul 19, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Adam Baimel
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Project Title: Add title here

Project Team: Add contact information for research project team members

Summary: Provide a descriptive summary of the nature of your research project and its aims/focal research questions.

Relevant publications/outputs: When available, add links to the related publications/outputs from this data.

Data availability statement: If your data is not linked on figshare directly, provide links to where it is being hosted here (i.e., Open Science Framework, Github, etc.). If your data is not going to be made publicly available, please provide details here as to the conditions under which interested individuals could gain access to the data and how to go about doing so.

Data collection details: 1. When was your data collected? 2. How were your participants sampled/recruited?

Sample information: How many and who are your participants? Demographic summaries are helpful additions to this section.

Research Project Materials: What materials are necessary to fully reproduce your the contents of your dataset? Include a list of all relevant materials (e.g., surveys, interview questions) with a brief description of what is included in each file that should be uploaded alongside your datasets.

List of relevant datafile(s): If your project produces data that cannot be contained in a single file, list the names of each of the files here with a brief description of what parts of your research project each file is related to.

Data codebook: What is in each column of your dataset? Provide variable names as they are encoded in your data files, verbatim question associated with each response, response options, details of any post-collection coding that has been done on the raw-response (and whether that's encoded in a separate column).

Examples available at: https://www.thearda.com/data-archive?fid=PEWMU17 https://www.thearda.com/data-archive?fid=RELLAND14
Student Performance Data Set
kaggle.com
Updated Mar 27, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 27, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Data-Science Sean
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
If this Data Set is useful, and upvote is appreciated. This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Building and updating software datasets: an empirical assessment
zenodo.org
zip
Updated Mar 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Juan Andrés Carruthers; Juan Andrés Carruthers (2025). Building and updating software datasets: an empirical assessment [Dataset]. http://doi.org/10.5281/zenodo.15008288
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15008288
Dataset updated
Mar 11, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Juan Andrés Carruthers; Juan Andrés Carruthers
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".

Data collected

The data generated for the study it can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprised three files "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects respectively. Here is a description of the datasets:

qualitas: includes code metrics and repository metrics from the projects in the release 20130901r of the Qualitas Corpus.

currentSample: includes code metrics and repository metrics from a recent sample collected with our sampling procedure.

qualitasUpdated: includes code metrics and repository metrics from an updated version of the Qualitas Corpus applying our maintenance procedure.

Plot graphics

To plot the results and graphics in the article there is a Jupyter Notebook "Experiment.ipynb". It is initially configured to use the data in "datasets" folder.

Replication Kit

For replication purposes, the datasets containing recent projects from Github can be re-generated. To do so, the virtual environment must have installed the dependencies in "requirements.txt" file, add Github's tokens in "./token" file, re-define or leave as is the paths declared in the constants (variables written in caps) in the main method, and finally run "main.py" script. The portable versions of the source code scanner Sourcemeter are located as zip files in "./Sourcemeter/tool" directory. To install Sourcemeter the appropriate zip file must be decompressed excluding the root folder "SourceMeter-10.2.0-x64-

The script comprise 5 steps:

Project retrieval from Github: at first the sampling frame with projects complying with a specific quality criteria are retrieved from Github's API.

Create samples: with the sampling frame retrieved, the current samples are selected (currentSample and qualitasUpdated). In the case of qualitasUpdated, it is important to have first the "sample.csv" file inside the qualitas folder of the dataset originally created for the study. This file contains the metadata of the projects in Qualitas Corpus.

Project download and analysis: when all the samples are selected from the sampling frame (currentSample and qualitasUpdated), the repositories are downloaded and scanned with SourceMeter. In the cases in which the analysis is not possible, the projects are replaced with another one with similar size.

Outlier detection: once the datasets are collected, it is necessary to manually look for possible outliers in the code metrics under study. In the notebook "Experiment.ipynb" there are specific sections dedicated for it ("Outlier detection (Section 4.2.2)").

Outlier replacement: when the outliers are detected, in the same notebook there is also a section for outlier replacement ("Replace Outliers") where the outliers' url have to be listed to find the appropriate replacement.

If it is required, the metrics from the Qualitas Corpus can also be re-generated. First, it is necessary to download the release 20130901r from its official webpage. Second, decompress the .tar files downloaded. Third, make sure that the compressed files with source code from the projects (.java files) are placed in the "compressed" folder, in some cases it is necessary to read the "QC_README" file in the project's folder. Finally, run the original main script "Generate metrics for the Qualitas Corpus (QC) dataset" part of the code.
MEG-BIDS Brainstorm data sample
openneuro.org
Updated Apr 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elizabeth Bock; Peter Donhauser; Francois Tadel; Guiomar Niso; Sylvain Baillet (2024). MEG-BIDS Brainstorm data sample [Dataset]. http://doi.org/10.18112/openneuro.ds000246.v1.0.1
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds000246.v1.0.1
Dataset updated
Apr 23, 2024
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
Elizabeth Bock; Peter Donhauser; Francois Tadel; Guiomar Niso; Sylvain Baillet
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Brainstorm - Auditory Dataset

License

This dataset (MEG and MRI data) was collected by the MEG Unit Lab, McConnell Brain Imaging Center, Montreal Neurological Institute, McGill University, Canada. The original purpose was to serve as a tutorial data example for the Brainstorm software project (http://neuroimage.usc.edu/brainstorm). It is presently released in the Public Domain, and is not subject to copyright in any jurisdiction.

We would appreciate though that you reference this dataset in your publications: please acknowledge its authors (Elizabeth Bock, Peter Donhauser, Francois Tadel and Sylvain Baillet) and cite the Brainstorm project seminal publication (also in open access): http://www.hindawi.com/journals/cin/2011/879716/

Presentation of the experiment

Experiment

One subject, two acquisition runs of 6 minutes each

Subject stimulated binaurally with intra-aural earphones (air tubes+transducers)

Each run contains:

200 regular beeps (440Hz)

40 easy deviant beeps (554.4Hz, 4 semitones higher)

Random inter-stimulus interval: between 0.7s and 1.7s seconds, uniformly distributed

The subject presses a button when detecting a deviant with the right index finger

Auditory stimuli generated with the Matlab Psychophysics toolbox

The specifications of this dataset were discussed initially on the FieldTrip bug tracker

MEG acquisition

Acquisition at 2400Hz, with a CTF 275 system, subject in seating position

Recorded at the Montreal Neurological Institute in December 2013

Anti-aliasing low-pass filter at 600Hz, files saved with the 3rd order gradient

Recorded channels (340):

1 Stim channel indicating the presentation times of the audio stimuli: UPPT001 (#1)

1 Audio signal sent to the subject: UADC001 (#316)

1 Response channel recordings the finger taps in response to the deviants: UDIO001 (#2)

26 MEG reference sensors (#5-#30)

274 MEG axial gradiometers (#31-#304)

2 EEG electrodes: Cz, Pz (#305 and #306)

1 ECG bipolar (#307)

2 EOG bipolar (vertical #308, horizontal #309)

12 Head tracking channels: Nasion XYZ, Left XYZ, Right XYZ, Error N/L/R (#317-#328)

20 Unused channels (#3, #4, #310-#315, #329-340)

3 datasets:

S01_AEF_20131218_01.ds: Run #1, 360s, 200 standard + 40 deviants

S01_AEF_20131218_02.ds: Run #2, 360s, 200 standard + 40 deviants

S01_Noise_20131218_01.ds: Empty room recordings, 30s long

File name: S01=Subject01, AEF=Auditory evoked field, 20131218=date(Dec 18 2013), 01=run

Use of the .ds, not the AUX (standard at the MNI) because they are easier to manipulate in FieldTrip

Stimulation delays

Delay #1: Production of the sound.
Between the stim markers (channel UDIO001) and the moment where the sound card plays the sound (channel UADC001). This is mostly due to the software running on the computer (stimulation software, operating system, sound card drivers, sound card electronics). The delay can be measured from the recorded files by comparing the triggers in the two channels: Delay between 11.5ms and 12.8ms (std = 0.3ms) This delay is not constant, we will need to correct for it.

Delay #2: Transmission of the sound.
Between when the sound card plays the sound and when the subject receives the sound in the ears. This is the time it takes for the transducer to convert the analog audio signal into a sound, plus the time it takes to the sound to travel through the air tubes from the transducer to the subject's ears. This delay cannot be estimated from the recorded signals: before the acquisition, we placed a sound meter at the extremity of the tubes to record when the sound is delivered. Delay between 4.8ms and 5.0ms (std = 0.08ms). At a sampling rate of 2400Hz, this delay can be considered constant, we will not compensate for it.

Delay #3: Recording of the signals.
The CTF MEG systems have a constant delay of 4 samples between the MEG/EEG channels and the analog channels (such as the audio signal UADC001), because of an anti-aliasing filtered that is applied to the first and not the second. This translate here to a constant delay of 1.7ms.

Delay #4: Over-compensation of delay #1.
When correcting of delay #1, the process we use to detect the beginning of the triggers on the audio signal (UADC001) sets the trigger in the middle of the ramp between silence and the beep. We "over-compensate" the delay #1 by 1.7ms. This can be considered as constant delay of about -1.7ms.

Uncorrected delays: We will correct for the delay #1, and keep the other delays (#2, #3 and #4). After we compensate for delay #1 our MEG signals will have a constant delay of about 4.9 + 1.7 - 1.7 = 4.9 ms. We decide not to compensate for th3se delays because they do not introduce any jitter in the responses and they are not going to change anything in the interpretation of the data.

Head shape and fiducial points

3D digitization using a Polhemus Fastrak device driven by Brainstorm (S01_20131218_*.pos)

More information: Digitize EEG electrodes and head shape

The output file is copied to each .ds folder and contains the following entries:

The position of the center of CTF coils

The position of the anatomical references we use in Brainstorm: Nasion and connections tragus/helix, as illustrated here.

Around 150 head points distributed on the hard parts of the head (no soft tissues)

Subject anatomy

Subject with 1.5T MRI

Marker on the left cheek

Processed with FreeSurfer 5.3
d
Datasets of Groundwater-Quality and Select Quality-Control Data from the...
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Datasets of Groundwater-Quality and Select Quality-Control Data from the National Water-Quality Assessment Project, January 2017 through December 2019 (ver. 1.1, January 2021) [Dataset]. https://catalog.data.gov/dataset/datasets-of-groundwater-quality-and-select-quality-control-data-from-the-national-water-qu
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
Groundwater-quality data were collected from 983 wells as part of the National Water-Quality Assessment Project of the U.S. Geological Survey National Water-Quality Program and are included in this report. The data were collected from six types of well networks: principal aquifer study networks, which are used to assess the quality of groundwater used for public water supply; land-use study networks, which are used to assess land-use effects on shallow groundwater quality; major aquifer study networks, which are used to assess the quality of groundwater used for domestic supply; enhanced trends networks, which are used to evaluate the time scales during which groundwater quality changes; vertical flow-path study networks, which are used to evaluate changes in groundwater quality from shallow to deeper depths; and modeling support studies, which are used to provide data to support groundwater modeling. Groundwater samples were analyzed for a large number of water-quality indicators and constituents, including major ions, nutrients, trace elements, volatile organic compounds, pesticides, radionuclides, microbiological indicators, and some special interest constituents (arsenic speciation, chromium [VI] and perchlorate). Most of the data included were collected from wells that were sampled between January 2017 and December 2019. Microbiological indicator data for networks sampled in 2016 are included in this data release. These groundwater quality networks are described in a U.S. Geological Survey Data Series report DS####, which is available at https://dx.doi.org/XXXXXX, and the results are in this data release. Data for quality-control samples collected in 2017 through 2019 also are included in this data release. Data from samples collected between 2012 and 2016 are associated with networks described in previous reports in this data series (Arnold and others, 2016a and b; 2017a and b; 2018a and b; and 2020a and b). There are 24 data tables included in this data release and they are referenced as tables 1 through 14 and appendix tables 5-11 through 5-20 in the larger work citation (see supplemental information for descriptions). Two tables summarizing well depth and open interval are included in the data series report and were derived from table 1 in this data release. A separate table named DSR_2017-19_Description_of_Data_Fields.txt describes the 405 unique fields contained in the 24 data tables.
u
PDMX
cseweb.ucsd.edu
json
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
UCSD CSE Research Project, PDMX [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
Explore at:
jsonAvailable download formats
Dataset authored and provided by
UCSD CSE Research Project
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing, including over 250k musical scores in MusicXML format. PDMX is the largest publicly available, copyright-free MusicXML dataset in existence. PDMX includes genre, tag, description, and popularity metadata for every file.
Climate Change: Earth Surface Temperature Data
kaggle.com
redivis.com
zip
Updated May 1, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Berkeley Earth (2017). Climate Change: Earth Surface Temperature Data [Dataset]. https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
Explore at:
zip(88843537 bytes)Available download formats
Dataset updated
May 1, 2017
Dataset authored and provided by
Berkeley Earthhttp://berkeleyearth.org/
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Area covered
Earth
Description
Some say climate change is the biggest threat of our age while others say it’s a myth based on dodgy science. We are turning some of the data over to you so you can form your own view.

Even more than with other data sets that Kaggle has featured, there’s a huge amount of data cleaning and preparation that goes into putting together a long-time study of climate trends. Early data was collected by technicians using mercury thermometers, where any variation in the visit time impacted measurements. In the 1940s, the construction of airports caused many weather stations to be moved. In the 1980s, there was a move to electronic thermometers that are said to have a cooling bias.

Given this complexity, there are a range of organizations that collate climate trends data. The three most cited land and ocean temperature data sets are NOAA’s MLOST, NASA’s GISTEMP and the UK’s HadCrut.

We have repackaged the data from a newer compilation put together by the Berkeley Earth, which is affiliated with Lawrence Berkeley National Laboratory. The Berkeley Earth Surface Temperature Study combines 1.6 billion temperature reports from 16 pre-existing archives. It is nicely packaged and allows for slicing into interesting subsets (for example by country). They publish the source data and the code for the transformations they applied. They also use methods that allow weather observations from shorter time series to be included, meaning fewer observations need to be thrown away.

In this dataset, we have include several files:

Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv):

Date: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures

LandAverageTemperature: global average land temperature in celsius

LandAverageTemperatureUncertainty: the 95% confidence interval around the average

LandMaxTemperature: global average maximum land temperature in celsius

LandMaxTemperatureUncertainty: the 95% confidence interval around the maximum land temperature

LandMinTemperature: global average minimum land temperature in celsius

LandMinTemperatureUncertainty: the 95% confidence interval around the minimum land temperature

LandAndOceanAverageTemperature: global average land and ocean temperature in celsius

LandAndOceanAverageTemperatureUncertainty: the 95% confidence interval around the global average land and ocean temperature

Other files include:

Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)

Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)

Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)

Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

The raw data comes from the Berkeley Earth data page.
d
Appendix 3 - Analytical results for the environmental and replicate sample...
datasets.ai
dataone.org
+3more
55
Updated Aug 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Department of the Interior (2024). Appendix 3 - Analytical results for the environmental and replicate sample sets collected from October 2011 through September 2013 at the Triangle Area Water Supply Monitoring Project study sites, North Carolina [Dataset]. https://datasets.ai/datasets/appendix-3-analytical-results-for-the-environmental-and-replicate-sample-sets-collected-fr
Explore at:
55Available download formats
Dataset updated
Aug 6, 2024
Dataset authored and provided by
Department of the Interior
Area covered
Research Triangle Park, North Carolina
Description
The dataset contains the analytical results for environmental and quality-control replicate sample sets and the computed relative percent differences (RPD) greater than 25 percent for the data collected during the surface-water sampling for the Triangle Area Water Supply Monitoring Project. The data are from samples collected during October 2011 through September 2013. Several study sites contained in this dataset were sampled for other USGS projects during the same time frame. Unless the samples at these sites were collected in conjunction with the Triangle Area Water Supply Monitoring Project, the data for other projects are not included in the dataset.
h
kl3m-data-govinfo-sample
huggingface.co
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ALEA Institute (2025). kl3m-data-govinfo-sample [Dataset]. https://huggingface.co/datasets/alea-institute/kl3m-data-govinfo-sample
Explore at:
Dataset updated
Apr 11, 2025
Authors
ALEA Institute
Description
KL3M Data Project

Note: This page provides general information about the KL3M Data Project. Additional details specific to this dataset will be added in future updates. For complete information, please visit the GitHub repository or refer to the KL3M Data Project paper.

Description

This dataset is part of the ALEA Institute's KL3M Data Project, which provides copyright-clean training resources for large language models.

Dataset Details

Format: Parquet… See the full description on the dataset page: https://huggingface.co/datasets/alea-institute/kl3m-data-govinfo-sample.