60 datasets found
  1. starcoderdata-python-edu-lang-score

    • huggingface.co
    Updated Jul 18, 2023
    Cite
    Jan Schmitz (2023). starcoderdata-python-edu-lang-score [Dataset]. https://huggingface.co/datasets/JanSchTech/starcoderdata-python-edu-lang-score
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 18, 2023
    Authors
    Jan Schmitz
    Description

    Dataset Card for Starcoder Data with Python Education and Language Scores

      Dataset Summary
    

    The starcoderdata-python-edu-lang-score dataset contains the Python subset of the starcoderdata dataset. It augments the existing Python subset with features that assess the educational quality of code and classify the language of code comments. This dataset was created for high-quality Python education and language-based training, with a primary focus on facilitating models that can… See the full description on the dataset page: https://huggingface.co/datasets/JanSchTech/starcoderdata-python-edu-lang-score.
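    A minimal loading sketch using the Hugging Face datasets library; streaming and the split name "train" are assumptions, and the exact field names should be checked on the record itself:

    from datasets import load_dataset

    # Stream the dataset rather than downloading it in full (assumption: a "train" split exists).
    ds = load_dataset("JanSchTech/starcoderdata-python-edu-lang-score", split="train", streaming=True)

    # Inspect one record to see the code text plus the added educational-quality
    # and comment-language fields described above.
    sample = next(iter(ds))
    print(sample.keys())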

  2. ML4SE23_G8_CodeSearchNet-Python

    • huggingface.co
    Updated Nov 8, 2023
    Cite
    AISE research lab at TU Delft (2023). ML4SE23_G8_CodeSearchNet-Python [Dataset]. https://huggingface.co/datasets/AISE-TUDelft/ML4SE23_G8_CodeSearchNet-Python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset authored and provided by
    AISE research lab at TU Delft
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "ML4SE23_G8_CodeSearchNet-Python"

    Dataset used to finetune WizardCoder-1B-V1.0 on the Code Summarization task. The dataset is a cleaned version of the Python subset from the CodeXGLUE CodeSearchNet code-to-text dataset. The original Python subset included the docstring in the code column. This dataset has a cleaned code column, which contains the original code with the docstring removed. See https://github.com/ML4SE2023/G8-Codex for more details. More Information… See the full description on the dataset page: https://huggingface.co/datasets/AISE-TUDelft/ML4SE23_G8_CodeSearchNet-Python.

  3. DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply...

    • zenodo.org
    bin
    Updated Dec 31, 2024
    Cite
    Ridwan Shariffdeen (2024). DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply Chain with Program Analysis [Dataset]. http://doi.org/10.5281/zenodo.14580885
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ridwan Shariffdeen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    * MalOSS: subset of malicious packages from the MalOSS dataset [RQ1, RQ2, RQ4]
    * BackStabber: subset of malicious packages from the Backstabber's Knife Collection [RQ1, RQ2, RQ4]
    * MalRegistry: subset of malicious packages from the Python MalRegistry dataset [RQ1, RQ2, RQ4]
    * Popular: a collection of the top-100 most popular Python packages from PyPI [RQ1, RQ2, RQ3, RQ4]
    * Trusted: a collection of packages from trusted organizations hosted on PyPI [RQ1, RQ2, RQ3, RQ4]
    * DataKund: a collection of newly identified malicious packages from PyPI [Case Study]
    * Recent: a collection of packages that were recently (Oct 2024) added to PyPI [Macaron Case Study]
  4. 911 Calls Data (Subset)

    • kaggle.com
    zip
    Updated Jun 3, 2020
    Cite
    hardly_human (2020). 911 Calls Data (Subset) [Dataset]. https://www.kaggle.com/rehan1024/911-calls-data-subset
    Explore at:
    Available download formats: zip (3828316 bytes)
    Dataset updated
    Jun 3, 2020
    Authors
    hardly_human
    License

    https://www.usa.gov/government-works/

    Description

    Dataset

    This dataset was created by hardly_human

    Released under U.S. Government Works


  5. musicnet_midis_lite

    • kaggle.com
    zip
    Updated Oct 8, 2022
    Cite
    Rupak Roy/ Bob (2022). musicnet_midis_lite [Dataset]. https://www.kaggle.com/rupakroy/musicnet-midis
    Explore at:
    Available download formats: zip (18209815 bytes)
    Dataset updated
    Oct 8, 2022
    Authors
    Rupak Roy/ Bob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; a labeling error rate of 4% has been estimated. The MusicNet labels are offered to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.

    Specifically, the MusicNet labels are proposed as a tool to address the following tasks:

    • Identify the notes performed at specific times in a recording.
    • Classify the instruments that perform in a recording.
    • Classify the composer of a recording.
    • Identify precise onset times of the notes in a recording.
    • Predict the next note in a recording, conditioned on history.

    Content

    (Raw - recommended) The raw data is available in standard wav audio format, with corresponding label files in csv format. These data and label filenames are MusicNet ids, which you can use to cross-index the data, labels, and metadata files.

    (Python) The Python version of the dataset is distributed as a NumPy npz file. This is a binary format specific to Python (WARNING: if you attempt to read this data in Python 3, you need to set encoding='latin1' when you call np.load, or your process will hang without any informative error messages). This format has three dependencies (a loading sketch follows at the end of this description):

    • Python - This version of MusicNet is distributed as a Python object.
    • NumPy - The MusicNet features are stored in NumPy arrays.
    • intervaltree - The MusicNet labels are stored in an IntervalTree.

    Acknowledgements

    The MusicNet labels apply exclusively to Creative Commons and Public Domain recordings, and as such we can distribute and re-distribute the MusicNet labels together with their corresponding recordings. The music that underlies MusicNet is sourced from the Isabella Stewart Gardner Museum, the European Archive, and Musopen.

    This work was supported by the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery, and the program "Learning in Machines and Brains" (CIFAR).
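    A minimal loading sketch for the Python (npz) distribution described above; the archive name and the per-id (features, labels) layout are assumptions based on the description:

    import numpy as np

    # encoding='latin1' is needed in Python 3, as noted above; allow_pickle=True
    # is required by recent NumPy versions to read the pickled objects.
    data = np.load('musicnet.npz', encoding='latin1', allow_pickle=True)

    # Keys are MusicNet ids; each entry is assumed to hold the audio features
    # and an intervaltree.IntervalTree of note labels.
    first_id = sorted(data.files)[0]
    features, labels = data[first_id]
    print(first_id, features.shape, len(labels))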

  6. Tabular DeDuplication Synthetic

    • kaggle.com
    Updated Jan 1, 2023
    Cite
    Tyl3rDurd3n (2023). Tabular DeDuplication Synthetic [Dataset]. https://www.kaggle.com/datasets/spac84/tabular-deduplication-synthetic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tyl3rDurd3n
    Description

    This dataset was created synthetically with the Python package Faker. It is intended for practicing the deduplication of databases.

    unique_data.csv is our main data frame without duplicates. Everything starts here. The other files (01_duplicate*, 02_duplicate*, etc.) hold only duplicate values of the unique_data.csv entries. You can mix unique_data.csv with one of the duplicate CSVs (or parts of one) to get a dataset with duplicate values to practice your deduplication skills (a short example follows the generation notes below).

    unique_data.csv generation process:

    • Every entry has a unique identifier uuid4
    • The company column is generated from a subset of 35,000 unique entries. This subset is sampled via random.choice(subset)
    • The postcode and city columns are generated together from a list of tuples containing 20% as many entries as the total size, in order to inject duplicates
    • The name column is generated for each entry separately, but may contain duplicates due to the nature and name limits of the Faker generation process
    • Country is US
    • The street column is generated from a subset of 70,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)
    • The email column is generated from a subset of 40,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)
    • The phone column is generated from a subset of 55,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)

    01_duplicate_data_random-nan.csv generation process:

    Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation

    02_duplicate_data_random-nan_firstname-abbreviation.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values

    03_duplicate_data_random-nan_firstname-abbreviation_middlename-insertion.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values
    3. Inserts a random middle name into 40% of the name column values, and abbreviates the middle name in 30% of those cases

    04_duplicate_data_random-nan_firstname-abbreviation_middlename-insertion_keyboarderror.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values
    3. Inserts a random middle name into 40% of the name column values, and abbreviates the middle name in 30% of those cases
    4. Performs keyboard-error augmentation on 60% of the values in the columns ['name', 'city', 'street', 'company', 'email', 'phone'] (https://nlpaug.readthedocs.io/en/latest/augmenter/char/keyboard.html)
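    A minimal sketch of assembling a practice set as described above; the file names follow the description, while the sampling fraction is an arbitrary choice:

    import pandas as pd

    # Load the clean frame and one of the duplicate files described above.
    unique_df = pd.read_csv('unique_data.csv')
    dupes_df = pd.read_csv('01_duplicate_data_random-nan.csv')

    # Mix in a portion of the duplicates and shuffle the rows.
    practice_df = pd.concat([unique_df, dupes_df.sample(frac=0.5, random_state=0)])
    practice_df = practice_df.sample(frac=1.0, random_state=0).reset_index(drop=True)

    # Rows sharing a uuid4 refer to the same underlying entity.
    print(practice_df['uuid4'].duplicated().sum(), 'duplicate rows to resolve')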
  7. dataset-the-stack-v2-dedup-sub

    • huggingface.co
    Cite
    TempestTeam, dataset-the-stack-v2-dedup-sub [Dataset]. https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub
    Explore at:
    Dataset authored and provided by
    TempestTeam
    License

    https://choosealicense.com/licenses/other/

    Description

    The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)

    TempestTeam/dataset-the-stack-v2-dedup-sub

      Dataset Summary
    

    This dataset is a language-filtered and self-contained subset of bigcode/the-stack-v2-dedup, part of the BigCode Project. It contains only files written in the following programming languages:

    Python 🐍 Java ☕ JavaScript 📜 C ⚙️ C++ ⚙️

    Unlike the original dataset, which only includes metadata and Software Heritage IDs, this subset includes… See the full description on the dataset page: https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub.
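    A minimal streaming sketch with the Hugging Face datasets library; the configuration name "Python" and the "train" split are assumptions and should be checked against the dataset page:

    from datasets import load_dataset

    # Stream the Python portion instead of downloading everything (config/split names are assumptions).
    ds = load_dataset("TempestTeam/dataset-the-stack-v2-dedup-sub", "Python", split="train", streaming=True)

    # Inspect one record; unlike the original the-stack-v2-dedup, the file contents are included.
    sample = next(iter(ds))
    print(sample.keys())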

  8. Data from: Community Earth System Model v2 Large Ensemble (CESM2 LENS) Zarr...

    • gdex.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    • +1more
    Updated Nov 11, 2024
    + more versions
    Cite
    Gokhan Danabasoglu; Clara Deser; Keith Rodgers; Axel Timmermann (2024). Community Earth System Model v2 Large Ensemble (CESM2 LENS) Zarr Subset [Dataset]. https://gdex.ucar.edu/datasets/d010092/
    Explore at:
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    National Science Foundation (http://www.nsf.gov/)
    Authors
    Gokhan Danabasoglu; Clara Deser; Keith Rodgers; Axel Timmermann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1850 - Dec 31, 2014
    Description

    The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble, which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14, 2021. NCAR has copied a subset (currently ~500 TB) of CESM2 LENS data to Amazon S3 as part of the AWS Public Datasets Program. To optimize for large-scale analytics, we have represented the data as ~275 Zarr stores accessible through the Python Xarray library. Each Zarr store contains a single physical variable for a given model run type and temporal frequency (monthly, daily).
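    A minimal sketch of opening one of these Zarr stores with Xarray over S3; the bucket layout and store path shown here are illustrative assumptions, not actual catalog entries:

    import s3fs
    import xarray as xr

    # Anonymous access to the public AWS bucket (bucket name and store path are assumptions).
    fs = s3fs.S3FileSystem(anon=True)
    store = s3fs.S3Map(root='ncar-cesm2-lens/atm/monthly/cesm2LE-historical-smbb-TREFHT.zarr', s3=fs)

    # Each store holds a single variable for one run type and temporal frequency.
    ds = xr.open_zarr(store, consolidated=True)
    print(ds)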

  9. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Keshavarz, Hossein; Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
    Authors
    Keshavarz, Hossein; Nagappan, Meiyappan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
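    A minimal sketch of loading the train and test splits described above with pandas; the ./dataset directory layout follows the description:

    import pandas as pd

    # Balanced training set (2003-2016) and the unbalanced large test set (last 3 years).
    train = pd.read_csv('dataset/apachejit_train.csv')
    test = pd.read_csv('dataset/apachejit_test_large.csv')

    # 'buggy' marks whether a commit introduced a bug; note the differing class balance.
    print(train['buggy'].value_counts(normalize=True))
    print(test['buggy'].value_counts(normalize=True))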

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering (ASE '14), Västerås, Sweden, September 15-19, 2014. 313–324.

    2. PyDriller

    https://pydriller.readthedocs.io/en/latest/

    Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.

  10. OGBN-MAG (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-MAG (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-mag
    Explore at:
    Available download formats: zip (852576506 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    OGBN-MAG

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag

    Usage in Python

    Warning: Currently not usable.

    import torch_geometric
    from ogb.nodeproppred import PygNodePropPredDataset

    # Load the pre-processed dataset from the Kaggle input directory.
    dataset = PygNodePropPredDataset('ogbn-mag', root='/kaggle/input')

    # Standard OGB time-based split indices.
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']

    graph = dataset[0]  # PyG heterogeneous graph object

    Description

    Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.

    Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.

    Dataset splitting: The authors of this dataset follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph, i.e., training models to predict venue labels of all papers published before 2018, validating and testing the models on papers published in 2018 and since 2019, respectively.

    Summary

    Package     #Nodes      #Edges      Split Type  Task Type                    Metric
    ogb>=1.2.1  1,939,743   21,111,007  Time        Multi-class classification  Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [2] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020. [2] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors at their website or their GitHub repo.

  11. Data from: Da-TACOS: A Dataset for Cover Song Identification and...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Da-TACOS: A Dataset for Cover Song Identification and Understanding [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3520368?locale=fi
    Explore at:
    Available download formats: unknown (3513878)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely the benchmark subset (for benchmarking cover song identification systems) and the cover analysis subset (for analyzing the links among cover songs), with pre-extracted features and metadata for 15,000 and 10,000 songs, respectively. The annotations included in the metadata are obtained with the API of SecondHandSongs.com. All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files.

    For the results of our analyses on modifiable musical characteristics using the cover analysis subset, and our initial benchmarking of 7 state-of-the-art cover song identification algorithms on the benchmark subset, you can look at our publication.

    For organizing the data, we use the structure of SecondHandSongs where each song is called a 'performance', and each clique (cover group) is called a 'work'. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. P_22), and their labels with respect to their cliques are their work IDs (WID, e.g. W_14). Metadata for each song includes performance title, performance artist, work title, work artist, release year, SecondHandSongs.com performance ID, SecondHandSongs.com work ID, and whether the song is instrumental or not. In addition, we matched the original metadata with MusicBrainz to obtain MusicBrainz ID (MBID), song length and genre/style tags. We would like to note that MusicBrainz related information is not available for all the songs in Da-TACOS, and since we used just our metadata for matching, we include all possible MBIDs for a particular song.

    For facilitating reproducibility in cover song identification (CSI) research, we propose a framework for feature extraction and benchmarking in our supplementary repository: acoss. The feature extraction component is designed to help CSI researchers find the most commonly used features for CSI in a single address. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, the benchmarking component includes our implementations of 7 state-of-the-art CSI systems. We provide the performance results of an initial benchmarking of those 7 systems on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss with implementations of their favorite feature extraction algorithms and their CSI systems to build up a knowledge base where CSI research can reach larger audiences.

    The instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.

    1. Structure

    1.1. Metadata

    We provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as python dictionaries in .json format, and have the same hierarchical structure. An example to load the metadata files in python:

    import json

    with open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:
        benchmark_metadata = json.load(f)

    The python dictionary obtained with the code above will have the respective WIDs as keys. Each key will provide the song dictionaries that contain the metadata regarding the songs that belong to their WIDs. An example can be seen below:

    "W_163992": {  # work id
        "P_547131": {  # performance id of the first song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Aki Aleong",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547131",
            "instrumental": "No",
            "perf_artist_mbid": "9bfa011f-8331-4c9a-b49b-d05bc7916605",
            "mb_performances": {
                "4ce274b3-0979-4b39-b8a3-5ae1de388c4a": {
                    "length": "175000"
                },
                "7c10ba3b-6f1d-41ab-8b20-14b2567d384a": {
                    "length": "177653"
                }
            }
        },
        "P_547140": {  # performance id of the second song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Dodie Stevens",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547140",
            "instrumental": "No"
        }
    }

    1.2. Pre-extracted features

    The list of features included in Da-TACOS can be seen below. All the features are extracted with the acoss repository, which uses open-source feature extraction libraries such as Essentia, LibROSA, and Madmom.

    To facilitate the use of the dataset, we provide two options regarding the file structure.

    1- In the da-tacos_benchmark_subset_single_files and da-tacos_coveranalysis_subset_single_files folders, we organize the data based on their respective cliques, and one file contains all the features for that particular song.

    {
        "chroma_cens": numpy.ndarray,
        "crema": numpy.ndarray,
        "hpcp": numpy.ndarray,
        "key_extractor": {
            "key": numpy.str_,
            "scale": numpy.str_,
            "strength": numpy.float64
        },
        "madmom_features": {
            "novfn":

  12. python-edu

    • huggingface.co
    Updated Jan 8, 2025
    Cite
    Avelina Hadji-Kyriacou (2025). python-edu [Dataset]. https://huggingface.co/datasets/Avelina/python-edu
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2025
    Authors
    Avelina Hadji-Kyriacou
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This version is deprecated! Please use the cleaned version: Avelina/python-edu-cleaned

      SmolLM-Corpus: Python-Edu
    

    This dataset contains the python-edu subset of SmolLM-Corpus with the contents of the files stored in a new text field. All files were downloaded from the S3 bucket on January the 8th 2025, using the blob IDs from the original dataset with revision 3ba9d605774198c5868892d7a8deda78031a781f. Only 1 file was marked as not found and the corresponding row removed from the… See the full description on the dataset page: https://huggingface.co/datasets/Avelina/python-edu.

  13. Wikimedia Structured Dataset Navigator (JSONL)

    • kaggle.com
    zip
    Updated Apr 23, 2025
    Cite
    Mehranism (2025). Wikimedia Structured Dataset Navigator (JSONL) [Dataset]. https://www.kaggle.com/datasets/mehranism/wikimedia-structured-dataset-navigator-jsonl
    Explore at:
    Available download formats: zip (266196504 bytes)
    Dataset updated
    Apr 23, 2025
    Authors
    Mehranism
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📚 Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.

    🔍 What’s Inside: This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.

    Each line in the JSONL file is a JSON object with the following fields: - file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl) - file_index: the numeric row index of the file - name: the Wikipedia article title or identifier - url: a link to the full article on Wikipedia - description: a short description or abstract of the article (when available)

    🛠 Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.

    ⚡️ Benefits: - Lightweight (~MBs vs. GBs) - Easy to load and search - Great for indexing, previewing, and subsetting the Wikimedia dataset - Saves time, bandwidth, and compute resources

    📎 Example Usage (Python):
    ```python
    import kagglehub
    import json
    import pandas as pd
    import numpy as np
    import os
    from tqdm import tqdm
    from datetime import datetime
    import re

    def read_jsonl(file_path, max_records=None):
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(tqdm(f)):
                if max_records and i >= max_records:
                    break
                data.append(json.loads(line))
        return data

    file_path = kagglehub.dataset_download(
        "mehranism/wikimedia-structured-dataset-navigator-jsonl",
        path="wiki_structured_dataset_navigator.jsonl",
    )
    data = read_jsonl(file_path)
    print(f"Successfully loaded {len(data)} records")

    df = pd.DataFrame(data)
    print(f"Dataset shape: {df.shape}")
    print("Columns in the dataset:")
    for col in df.columns:
        print(f"- {col}")
    ```

    This dataset is perfect for developers working on:
    - Retrieval-Augmented Generation (RAG)
    - Large Language Model (LLM) fine-tuning
    - Search and filtering pipelines
    - Academic research on structured Wikipedia content

    💡 Tip:
    Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.

    📃 Format:
    - File: `wiki_structured_dataset_navigator.jsonl`
    - Format: JSON Lines (1 object per line)
    - Encoding: UTF-8

    ---

    ### **Tags**

    wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning

    Licensing

    CC0: Public Domain Dedication
    

    (Recommended for open indexing tools with no sensitive data.)

  14. Development of a Cambridge Structural Database Subset: A Collection of...

    • acs.figshare.com
    text/x-python
    Updated Jun 1, 2023
    Cite
    Peyman Z. Moghadam; Aurelia Li; Seth B. Wiggin; Andi Tao; Andrew G. P. Maloney; Peter A. Wood; Suzanna C. Ward; David Fairen-Jimenez (2023). Development of a Cambridge Structural Database Subset: A Collection of Metal–Organic Frameworks for Past, Present, and Future [Dataset]. http://doi.org/10.1021/acs.chemmater.7b00441.s002
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Peyman Z. Moghadam; Aurelia Li; Seth B. Wiggin; Andi Tao; Andrew G. P. Maloney; Peter A. Wood; Suzanna C. Ward; David Fairen-Jimenez
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We report the generation and characterization of the most complete collection of metal–organic frameworks (MOFs) maintained and updated, for the first time, by the Cambridge Crystallographic Data Centre (CCDC). To set up this subset, we asked the question “what is a MOF?” and implemented a number of “look-for-MOF” criteria embedded within a bespoke Cambridge Structural Database (CSD) Python API workflow to identify and extract information on 69 666 MOF materials. The CSD MOF subset is updated regularly with subsequent MOF additions to the CSD, bringing a unique record for all researchers working in the area of porous materials around the world, whether to perform high-throughput computational screening for materials discovery or to have a global view over the existing structures in a single resource. Using this resource, we then developed and used an array of computational tools to remove residual solvent molecules from the framework pores of all the MOFs identified and went on to analyze geometrical and physical properties of nondisordered structures.

  15. imagenet2012_subset

    • tensorflow.org
    Updated Oct 21, 2024
    + more versions
    Cite
    (2024). imagenet2012_subset [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_subset
    Explore at:
    Dataset updated
    Oct 21, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload it to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines corresponding to each image in the test split. Each line of integers corresponds to the rank-ordered, top 5 predictions for each test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012_subset', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png

  16. MatSim Dataset and benchmark for one-shot visual materials and textures...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jun 25, 2025
    Cite
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik (2025). MatSim Dataset and benchmark for one-shot visual materials and textures recognition [Dataset]. http://doi.org/10.5281/zenodo.7390166
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The MatSim Dataset and benchmark

    Latest version

    Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.

    MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).

    Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering

    Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.



    MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset, CGI images of materials on random objects, as described in the paper.

    MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset, CGI images of materials inside transparent containers, as described in the paper.

    *Note: these are subsets of the dataset; the full dataset can be found at:
    https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX

    or
    https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF

    Code:

    Up-to-date code for generating the dataset, reading and evaluation, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net

    Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset; this code might be old, and up-to-date code can be found at the URL above.
    Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version of this code, see the URL above.

    Further documentation can be found inside the zip files or in the paper.

  17. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and the other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  18. imagenette

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). imagenette [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenette
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    Imagenette is a subset of 10 easily classified classes from the ImageNet dataset. It was originally prepared by Jeremy Howard of FastAI. The objective behind putting together a small version of the ImageNet dataset was mainly that running new ideas/algorithms/experiments on the whole ImageNet takes a lot of time.

    This version of the dataset allows researchers/practitioners to quickly try out ideas and share with others. The dataset comes in three variants:

    • Full size
    • 320 px
    • 160 px

    Note: The v2 config corresponds to the new 70/30 train/valid split (released on Dec 6, 2019).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenette', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenette-full-size-v2-1.0.0.png

  19. Data from: tableone: An open source Python package for producing summary...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Apr 23, 2019
    Cite
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark (2019). tableone: An open source Python package for producing summary statistics for research papers [Dataset]. http://doi.org/10.5061/dryad.26c4s35
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2019
    Dataset provided by
    Dryad
    Authors
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark
    Time period covered
    Apr 19, 2018
    Description

    Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table (“Table 1”) of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

    Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

    Results: The tableone software package automatically compiles summary statistics into publishable formats such...
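    A minimal usage sketch of the tableone package described here, following its documented basic workflow; the example DataFrame, column names, and grouping variable are assumptions for illustration:

    import pandas as pd
    from tableone import TableOne

    # A small, made-up cohort purely for illustration.
    df = pd.DataFrame({
        'age': [34, 51, 27, 63, 45, 38],
        'sex': ['F', 'M', 'F', 'M', 'F', 'M'],
        'group': ['treated', 'control', 'treated', 'control', 'treated', 'control'],
    })

    # Build a "Table 1" of summary statistics, stratified by study group.
    table1 = TableOne(df, columns=['age', 'sex'], categorical=['sex'], groupby='group')
    print(table1.tabulate(tablefmt='github'))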

  20. Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage,...

    • radar.kit.edu
    • radar-service.eu
    tar
    Updated May 6, 2024
    + more versions
    Cite
    Hendrik Andersen; Philip Stier; Barbara Dietel; Jan Cermak; Corinna Hoose (2024). Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage, and mixing ratios of sea spray and dust on cloud phase over the Arctic and Southern Oceans", submitted to Geophysical Research Letters [Dataset]. http://doi.org/10.35097/VEbaqHtbXdEzreqO
    Explore at:
    Available download formats: tar (34304 bytes)
    Dataset updated
    May 6, 2024
    Dataset provided by
    Karlsruhe Institute of Technology
    Stier, Philip
    Dietel, Barbara
    Authors
    Hendrik Andersen; Philip Stier; Barbara Dietel; Jan Cermak; Corinna Hoose
    Area covered
    Southern Ocean, Arctic
    Description

    Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage, and mixing ratios of sea spray and dust on cloud phase over the Arctic and Southern Oceans", submitted to Geophysical Research Letters

    Scripts to train a machine learning model (histogram-based gradient boosting regression with scikit-learn) and calculate SHapley Additive exPlanations (SHAP) values

    The machine learning model can predict the liquid fraction in different cloud types based on four parameters, namely the cloud top temperature, the sea ice concentration, the dust mixing ratio and the sea salt mixing ratio. More information on the used dataset can be found here: Dietel et al. 2023

    Bash-scripts

    The Bash scripts are used to run the Python scripts for different cloud types and regions on a cluster. Bash scripts starting with GBR_[...] (Gradient Boosting Regression) run the Python script hist_gbr_subset_final2_with_comments.py for different regions (Arctic Ocean (AO), Southern Ocean (SO)) and different cloud types (low-level, mid-level, mid-to-low-level). Bash scripts starting with shap_values_[...] run the Python script shap_values-subset-final2_with_comments.py to calculate SHAP values based on the trained machine learning models for a 500,000-sample subset of the validation dataset.

    Python scripts

    hist_gbr_subset_final2_with_comments.py: Python script to train a histogram-based gradient boosting regression model using the scikit-learn Python package. More detailed information can be found as comments in the script. shap_values-subset-final2_with_comments.py: calculates SHAP values for a 500,000-sample subset of the validation dataset to make the machine learning model explainable. More detailed information can be found as comments in the script.
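    A minimal sketch, not the authors' scripts: it trains a scikit-learn histogram-based gradient boosting regressor on the four predictors named above and explains it with SHAP. The column names and synthetic data are illustrative assumptions.

    import numpy as np
    import pandas as pd
    import shap
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data; the real inputs come from the dataset referenced above.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        'cloud_top_temperature': rng.uniform(230, 280, 5000),
        'sea_ice_concentration': rng.uniform(0, 1, 5000),
        'dust_mixing_ratio': rng.lognormal(-20, 1, 5000),
        'sea_salt_mixing_ratio': rng.lognormal(-19, 1, 5000),
    })
    y = rng.uniform(0, 1, 5000)  # placeholder for the cloud liquid fraction

    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    model = HistGradientBoostingRegressor(random_state=0).fit(X_train, y_train)

    # Model-agnostic SHAP explanation on a small sample of the validation data.
    explainer = shap.Explainer(model.predict, X_train.sample(100, random_state=0))
    shap_values = explainer(X_val.sample(200, random_state=0))
    print(shap_values.values.shape)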
