60 datasets found
  1. starcoderdata-python-edu-lang-score

    • huggingface.co
    Updated Jul 18, 2023
    Cite
    Jan Schmitz (2023). starcoderdata-python-edu-lang-score [Dataset]. https://huggingface.co/datasets/JanSchTech/starcoderdata-python-edu-lang-score
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 18, 2023
    Authors
    Jan Schmitz
    Description

    Dataset Card for Starcoder Data with Python Education and Language Scores

      Dataset Summary
    

    The starcoderdata-python-edu-lang-score dataset contains the Python subset of the starcoderdata dataset. It augments the existing Python subset with features that assess the educational quality of code and classify the language of code comments. This dataset was created for high-quality Python education and language-based training, with a primary focus on facilitating models that can… See the full description on the dataset page: https://huggingface.co/datasets/JanSchTech/starcoderdata-python-edu-lang-score.
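    A minimal loading sketch using the Hugging Face datasets library; streaming and the split name "train" are assumptions, and the exact field names should be checked on the record itself:

    from datasets import load_dataset

    # Stream the dataset rather than downloading it in full (assumption: a "train" split exists).
    ds = load_dataset("JanSchTech/starcoderdata-python-edu-lang-score", split="train", streaming=True)

    # Inspect one record to see the code text plus the added educational-quality
    # and comment-language fields described above.
    sample = next(iter(ds))
    print(sample.keys())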

  2. ML4SE23_G8_CodeSearchNet-Python

    • huggingface.co
    Updated Nov 8, 2023
    Cite
    AISE research lab at TU Delft (2023). ML4SE23_G8_CodeSearchNet-Python [Dataset]. https://huggingface.co/datasets/AISE-TUDelft/ML4SE23_G8_CodeSearchNet-Python
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset authored and provided by
    AISE research lab at TU Delft
    License

    https://choosealicense.com/licenses/c-uda/

    Description

    Dataset Card for "ML4SE23_G8_CodeSearchNet-Python"

    Dataset used to finetune WizardCoder-1B-V1.0 on the Code Summarization task. The dataset is a cleaned version of the Python subset from the CodeXGLUE CodeSearchNet code-to-text dataset. The original Python subset included the docstring in the code column. This dataset has a cleaned code column, which contains the original code with the docstring removed. See https://github.com/ML4SE2023/G8-Codex for more details. More Information… See the full description on the dataset page: https://huggingface.co/datasets/AISE-TUDelft/ML4SE23_G8_CodeSearchNet-Python.

  3. DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply...

    • zenodo.org
    bin
    Updated Dec 31, 2024
    Cite
    Ridwan Shariffdeen (2024). DataSet for ICSE SEIP 25: Detecting Python Malware in the Software Supply Chain with Program Analysis [Dataset]. http://doi.org/10.5281/zenodo.14580885
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ridwan Shariffdeen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    * MalOSS: subset of malicious packages from the MalOSS dataset [RQ1, RQ2, RQ4]
    * BackStabber: subset of malicious packages from the Backstabber's Knife Collection [RQ1, RQ2, RQ4]
    * MalRegistry: subset of malicious packages from the Python MalRegistry dataset [RQ1, RQ2, RQ4]
    * Popular: a collection of the top-100 most popular Python packages from PyPI [RQ1, RQ2, RQ3, RQ4]
    * Trusted: a collection of packages from trusted organizations hosted on PyPI [RQ1, RQ2, RQ3, RQ4]
    * DataKund: a collection of newly identified malicious packages from PyPI [Case Study]
    * Recent: a collection of packages that were recently (Oct 2024) added to PyPI [Macaron Case Study]
  4. 911 Calls Data (Subset)

    • kaggle.com
    zip
    Updated Jun 3, 2020
    Cite
    hardly_human (2020). 911 Calls Data (Subset) [Dataset]. https://www.kaggle.com/rehan1024/911-calls-data-subset
    Explore at:
    Available download formats: zip (3828316 bytes)
    Dataset updated
    Jun 3, 2020
    Authors
    hardly_human
    License

    https://www.usa.gov/government-works/

    Description

    Dataset

    This dataset was created by hardly_human

    Released under U.S. Government Works


  5. musicnet_midis_lite

    • kaggle.com
    zip
    Updated Oct 8, 2022
    Cite
    Rupak Roy/ Bob (2022). musicnet_midis_lite [Dataset]. https://www.kaggle.com/rupakroy/musicnet-midis
    Explore at:
    Available download formats: zip (18209815 bytes)
    Dataset updated
    Oct 8, 2022
    Authors
    Rupak Roy/ Bob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition. The labels are acquired from musical scores aligned to recordings by dynamic time warping. The labels are verified by trained musicians; a labeling error rate of 4% has been estimated. The MusicNet labels are offered to the machine learning and music communities as a resource for training models and a common benchmark for comparing results.

    Specifically, the MusicNet labels are proposed as a tool to address the following tasks:

    • Identify the notes performed at specific times in a recording.
    • Classify the instruments that perform in a recording.
    • Classify the composer of a recording.
    • Identify precise onset times of the notes in a recording.
    • Predict the next note in a recording, conditioned on history.

    Content

    (Raw - recommended) The raw data is available in standard wav audio format, with corresponding label files in csv format. These data and label filenames are MusicNet ids, which you can use to cross-index the data, labels, and metadata files.

    (Python) The Python version of the dataset is distributed as a NumPy npz file. This is a binary format specific to Python (WARNING: if you attempt to read this data in Python 3, you need to set encoding='latin1' when you call np.load, or your process will hang without any informative error messages). This format has three dependencies (a loading sketch follows at the end of this description):

    • Python - This version of MusicNet is distributed as a Python object.
    • NumPy - The MusicNet features are stored in NumPy arrays.
    • intervaltree - The MusicNet labels are stored in an IntervalTree.

    Acknowledgements

    The MusicNet labels apply exclusively to Creative Commons and Public Domain recordings, and as such we can distribute and re-distribute the MusicNet labels together with their corresponding recordings. The music that underlies MusicNet is sourced from the Isabella Stewart Gardner Museum, the European Archive, and Musopen.

    This work was supported by the Washington Research Foundation Fund for Innovation in Data-Intensive Discovery, and the program "Learning in Machines and Brains" (CIFAR).
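    A minimal loading sketch for the Python (npz) distribution described above; the archive name and the per-id (features, labels) layout are assumptions based on the description:

    import numpy as np

    # encoding='latin1' is needed in Python 3, as noted above; allow_pickle=True
    # is required by recent NumPy versions to read the pickled objects.
    data = np.load('musicnet.npz', encoding='latin1', allow_pickle=True)

    # Keys are MusicNet ids; each entry is assumed to hold the audio features
    # and an intervaltree.IntervalTree of note labels.
    first_id = sorted(data.files)[0]
    features, labels = data[first_id]
    print(first_id, features.shape, len(labels))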

  6. Tabular DeDuplication Synthetic

    • kaggle.com
    Updated Jan 1, 2023
    Cite
    Tyl3rDurd3n (2023). Tabular DeDuplication Synthetic [Dataset]. https://www.kaggle.com/datasets/spac84/tabular-deduplication-synthetic
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tyl3rDurd3n
    Description

    This dataset was created synthetically with the Python package Faker. It is intended for practicing the deduplication of databases.

    unique_data.csv is our main data frame without duplicates. Everything starts here. The other files (01_duplicate*, 02_duplicate*, etc.) hold only duplicate values of the unique_data.csv entries. You can mix unique_data.csv with one of the duplicate CSVs (or parts of one) to get a dataset with duplicate values to practice your deduplication skills (a short example follows the generation notes below).

    unique_data.csv generation process:

    • Every entry has a unique identifier uuid4
    • The company column is generated from a subset of 35,000 unique entries. This subset is sampled via random.choice(subset)
    • The postcode and city columns are generated together from a list of tuples containing 20% as many entries as the total size, in order to inject duplicates
    • The name column is generated for each entry separately, but may contain duplicates due to the nature and name limits of the Faker generation process
    • Country is US
    • The street column is generated from a subset of 70,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)
    • The email column is generated from a subset of 40,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)
    • The phone column is generated from a subset of 55,000 unique entries and 30,000 NaN values. This subset is sampled via random.choice(subset) (high unique-value count; feel free to delete values to make the task harder)

    01_duplicate_data_random-nan.csv generation process:

    Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation

    02_duplicate_data_random-nan_firstname-abbreviation.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values

    03_duplicate_data_random-nan_firstname-abbreviation_middlename-insertion.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values
    3. Inserts a random middle name into 40% of the name column values, and abbreviates the middle name in 30% of those cases

    04_duplicate_data_random-nan_firstname-abbreviation_middlename-insertion_keyboarderror.csv generation process:

    1. Replaces a random fraction (50%) of cells in the dataframe with np.nan. The columns ['company', 'name', 'uuid4'] are excluded from this augmentation
    2. Abbreviates the first name in a random 70% of the name column values
    3. Inserts a random middle name into 40% of the name column values, and abbreviates the middle name in 30% of those cases
    4. Performs keyboard-error augmentation on 60% of the values in the columns ['name', 'city', 'street', 'company', 'email', 'phone'] (https://nlpaug.readthedocs.io/en/latest/augmenter/char/keyboard.html)
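    A minimal sketch of assembling a practice set as described above; the file names follow the description, while the sampling fraction is an arbitrary choice:

    import pandas as pd

    # Load the clean frame and one of the duplicate files described above.
    unique_df = pd.read_csv('unique_data.csv')
    dupes_df = pd.read_csv('01_duplicate_data_random-nan.csv')

    # Mix in a portion of the duplicates and shuffle the rows.
    practice_df = pd.concat([unique_df, dupes_df.sample(frac=0.5, random_state=0)])
    practice_df = practice_df.sample(frac=1.0, random_state=0).reset_index(drop=True)

    # Rows sharing a uuid4 refer to the same underlying entity.
    print(practice_df['uuid4'].duplicated().sum(), 'duplicate rows to resolve')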
  7. dataset-the-stack-v2-dedup-sub

    • huggingface.co
    Cite
    TempestTeam, dataset-the-stack-v2-dedup-sub [Dataset]. https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub
    Explore at:
    Dataset authored and provided by
    TempestTeam
    License

    https://choosealicense.com/licenses/other/

    Description

    The Stack v2 Subset with File Contents (Python, Java, JavaScript, C, C++)

    TempestTeam/dataset-the-stack-v2-dedup-sub

      Dataset Summary
    

    This dataset is a language-filtered and self-contained subset of bigcode/the-stack-v2-dedup, part of the BigCode Project. It contains only files written in the following programming languages:

    Python 🐍 Java ☕ JavaScript 📜 C ⚙️ C++ ⚙️

    Unlike the original dataset, which only includes metadata and Software Heritage IDs, this subset includes… See the full description on the dataset page: https://huggingface.co/datasets/TempestTeam/dataset-the-stack-v2-dedup-sub.
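    A minimal streaming sketch with the Hugging Face datasets library; the configuration name "Python" and the "train" split are assumptions and should be checked against the dataset page:

    from datasets import load_dataset

    # Stream the Python portion instead of downloading everything (config/split names are assumptions).
    ds = load_dataset("TempestTeam/dataset-the-stack-v2-dedup-sub", "Python", split="train", streaming=True)

    # Inspect one record; unlike the original the-stack-v2-dedup, the file contents are included.
    sample = next(iter(ds))
    print(sample.keys())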

  8. Data from: Community Earth System Model v2 Large Ensemble (CESM2 LENS) Zarr...

    • gdex.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    • +1more
    Updated Nov 11, 2024
    + more versions
    Cite
    Gokhan Danabasoglu; Clara Deser; Keith Rodgers; Axel Timmermann (2024). Community Earth System Model v2 Large Ensemble (CESM2 LENS) Zarr Subset [Dataset]. https://gdex.ucar.edu/datasets/d010092/
    Explore at:
    Dataset updated
    Nov 11, 2024
    Dataset provided by
    National Science Foundation (http://www.nsf.gov/)
    Authors
    Gokhan Danabasoglu; Clara Deser; Keith Rodgers; Axel Timmermann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1850 - Dec 31, 2014
    Description

    The US National Center for Atmospheric Research partnered with the IBS Center for Climate Physics in South Korea to generate the CESM2 Large Ensemble, which consists of 100 ensemble members at 1 degree spatial resolution covering the period 1850-2100 under CMIP6 historical and SSP370 future radiative forcing scenarios. Data sets from this ensemble were made downloadable via the Climate Data Gateway on June 14, 2021. NCAR has copied a subset (currently ~500 TB) of CESM2 LENS data to Amazon S3 as part of the AWS Public Datasets Program. To optimize for large-scale analytics, we have represented the data as ~275 Zarr stores accessible through the Python Xarray library. Each Zarr store contains a single physical variable for a given model run type and temporal frequency (monthly, daily).
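    A minimal sketch of opening one of these Zarr stores with Xarray over S3; the bucket layout and store path shown here are illustrative assumptions, not actual catalog entries:

    import s3fs
    import xarray as xr

    # Anonymous access to the public AWS bucket (bucket name and store path are assumptions).
    fs = s3fs.S3FileSystem(anon=True)
    store = s3fs.S3Map(root='ncar-cesm2-lens/atm/monthly/cesm2LE-historical-smbb-TREFHT.zarr', s3=fs)

    # Each store holds a single variable for one run type and temporal frequency.
    ds = xr.open_zarr(store, consolidated=True)
    print(ds)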

  9. Data from: ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 27, 2022
    Cite
    Keshavarz, Hossein; Nagappan, Meiyappan (2022). ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5907001
    Explore at:
    Dataset updated
    Jan 27, 2022
    Dataset provided by
    David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
    Authors
    Keshavarz, Hossein; Nagappan, Meiyappan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction

    This archive contains the ApacheJIT dataset presented in the paper "ApacheJIT: A Large Dataset for Just-In-Time Defect Prediction" as well as the replication package. The paper is submitted to MSR 2022 Data Showcase Track.

    The datasets are available under directory dataset. There are 4 datasets in this directory.

    1. apachejit_total.csv: This file contains the entire dataset. Commits are specified by their identifier and a set of commit metrics that are explained in the paper are provided as features. Column buggy specifies whether or not the commit introduced any bug into the system.
    2. apachejit_train.csv: This file is a subset of the entire dataset. It provides a balanced set that we recommend for models that are sensitive to class imbalance. This set is obtained from the first 14 years of data (2003 to 2016).
    3. apachejit_test_large.csv: This file is a subset of the entire dataset. The commits in this file are the commits from the last 3 years of data. This set is not balanced to represent a real-life scenario in a JIT model evaluation where the model is trained on historical data to be applied on future data without any modification.
    4. apachejit_test_small.csv: This file is a subset of the test file explained above. Since the test file has more than 30,000 commits, we also provide a smaller test set which is still unbalanced and from the last 3 years of data.
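    A minimal sketch of loading the train and test splits described above with pandas; the ./dataset directory layout follows the description:

    import pandas as pd

    # Balanced training set (2003-2016) and the unbalanced large test set (last 3 years).
    train = pd.read_csv('dataset/apachejit_train.csv')
    test = pd.read_csv('dataset/apachejit_test_large.csv')

    # 'buggy' marks whether a commit introduced a bug; note the differing class balance.
    print(train['buggy'].value_counts(normalize=True))
    print(test['buggy'].value_counts(normalize=True))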

    In addition to the dataset, we also provide the scripts using which we built the dataset. These scripts are written in Python 3.8. Therefore, Python 3.8 or above is required. To set up the environment, we have provided a list of required packages in file requirements.txt. Additionally, one filtering step requires GumTree [1]. For Java, GumTree requires Java 11. For other languages, external tools are needed. Installation guide and more details can be found here.

    The scripts are comprised of Python scripts under directory src and Python notebooks under directory notebooks. The Python scripts are mainly responsible for conducting GitHub search via GitHub search API and collecting commits through PyDriller Package [2]. The notebooks link the fixed issue reports with their corresponding fixing commits and apply some filtering steps. The bug-inducing candidates then are filtered again using gumtree.py script that utilizes the GumTree package. Finally, the remaining bug-inducing candidates are combined with the clean commits in the dataset_construction notebook to form the entire dataset.

    More specifically, git_token.py handles the GitHub API token that is necessary for requests to the GitHub API. The script collector.py performs the GitHub search. Tracing changed lines and git annotate is done in gitminer.py using PyDriller. Finally, gumtree.py applies 4 filtering steps (number of lines, number of files, language, and change significance).

    References:

    1. GumTree

    Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering (ASE '14), Västerås, Sweden, September 15-19, 2014. 313–324.

    2. PyDriller

    https://pydriller.readthedocs.io/en/latest/

    Davide Spadini, Maurício Aniche, and Alberto Bacchelli. 2018. PyDriller: Python Framework for Mining Software Repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 908–911.

  10. OGBN-MAG (Processed for PyG)

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    Redao da Taupl (2021). OGBN-MAG (Processed for PyG) [Dataset]. https://www.kaggle.com/dataup1/ogbn-mag
    Explore at:
    Available download formats: zip (852576506 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    Redao da Taupl
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    OGBN-MAG

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag

    Usage in Python

    Warning: Currently not usable.

    import torch_geometric
    from ogb.nodeproppred import PygNodePropPredDataset

    # Load the pre-processed dataset from the Kaggle input directory.
    dataset = PygNodePropPredDataset('ogbn-mag', root='/kaggle/input')

    # Standard OGB time-based split indices.
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']

    graph = dataset[0]  # PyG heterogeneous graph object

    Description

    Graph: The ogbn-mag dataset is a heterogeneous network composed of a subset of the Microsoft Academic Graph (MAG) [1]. It contains four types of entities—papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes)—as well as four types of directed relations connecting two types of entities—an author is “affiliated with” an institution, an author “writes” a paper, a paper “cites” a paper, and a paper “has a topic of” a field of study. Similar to ogbn-arxiv, each paper is associated with a 128-dimensional word2vec feature vector, and all the other types of entities are not associated with input node features.

    Prediction task: Given the heterogeneous ogbn-mag data, the task is to predict the venue (conference or journal) of each paper, given its content, references, authors, and authors’ affiliations. This is of practical interest as some manuscripts’ venue information is unknown or missing in MAG, due to the noisy nature of Web data. In total, there are 349 different venues in ogbn-mag, making the task a 349-class classification problem.

    Dataset splitting: The authors of this dataset follow the same time-based strategy as ogbn-arxiv and ogbn-papers100M to split the paper nodes in the heterogeneous graph, i.e., training models to predict venue labels of all papers published before 2018, validating and testing the models on papers published in 2018 and since 2019, respectively.

    Summary

    Package     #Nodes      #Edges      Split Type  Task Type                    Metric
    ogb>=1.2.1  1,939,743   21,111,007  Time        Multi-class classification  Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [2] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020. [2] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. For any questions, problems, or issues, please contact the original authors at their website or their GitHub repo.

  11. Data from: Da-TACOS: A Dataset for Cover Song Identification and...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). Da-TACOS: A Dataset for Cover Song Identification and Understanding [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3520368?locale=fi
    Explore at:
    Available download formats: unknown (3513878)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely the benchmark subset (for benchmarking cover song identification systems) and the cover analysis subset (for analyzing the links among cover songs), with pre-extracted features and metadata for 15,000 and 10,000 songs, respectively. The annotations included in the metadata are obtained with the API of SecondHandSongs.com. All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files.

    For the results of our analyses on modifiable musical characteristics using the cover analysis subset, and our initial benchmarking of 7 state-of-the-art cover song identification algorithms on the benchmark subset, you can look at our publication.

    For organizing the data, we use the structure of SecondHandSongs where each song is called a 'performance', and each clique (cover group) is called a 'work'. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. P_22), and their labels with respect to their cliques are their work IDs (WID, e.g. W_14). Metadata for each song includes performance title, performance artist, work title, work artist, release year, SecondHandSongs.com performance ID, SecondHandSongs.com work ID, and whether the song is instrumental or not. In addition, we matched the original metadata with MusicBrainz to obtain MusicBrainz ID (MBID), song length and genre/style tags. We would like to note that MusicBrainz related information is not available for all the songs in Da-TACOS, and since we used just our metadata for matching, we include all possible MBIDs for a particular song.

    For facilitating reproducibility in cover song identification (CSI) research, we propose a framework for feature extraction and benchmarking in our supplementary repository: acoss. The feature extraction component is designed to help CSI researchers find the most commonly used features for CSI in a single address. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, the benchmarking component includes our implementations of 7 state-of-the-art CSI systems. We provide the performance results of an initial benchmarking of those 7 systems on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss with implementations of their favorite feature extraction algorithms and their CSI systems to build up a knowledge base where CSI research can reach larger audiences.

    The instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.

    1. Structure

    1.1. Metadata

    We provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as python dictionaries in .json format, and have the same hierarchical structure. An example to load the metadata files in python:

    import json

    with open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:
        benchmark_metadata = json.load(f)

    The python dictionary obtained with the code above will have the respective WIDs as keys. Each key will provide the song dictionaries that contain the metadata regarding the songs that belong to their WIDs. An example can be seen below:

    "W_163992": {  # work id
        "P_547131": {  # performance id of the first song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Aki Aleong",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547131",
            "instrumental": "No",
            "perf_artist_mbid": "9bfa011f-8331-4c9a-b49b-d05bc7916605",
            "mb_performances": {
                "4ce274b3-0979-4b39-b8a3-5ae1de388c4a": {
                    "length": "175000"
                },
                "7c10ba3b-6f1d-41ab-8b20-14b2567d384a": {
                    "length": "177653"
                }
            }
        },
        "P_547140": {  # performance id of the second song belonging to the clique 'W_163992'
            "work_title": "Trade Winds, Trade Winds",
            "work_artist": "Aki Aleong",
            "perf_title": "Trade Winds, Trade Winds",
            "perf_artist": "Dodie Stevens",
            "release_year": "1961",
            "work_id": "W_163992",
            "perf_id": "P_547140",
            "instrumental": "No"
        }
    }

    1.2. Pre-extracted features

    The list of features included in Da-TACOS can be seen below. All the features are extracted with the acoss repository, which uses open-source feature extraction libraries such as Essentia, LibROSA, and Madmom.

    To facilitate the use of the dataset, we provide two options regarding the file structure.

    1- In the da-tacos_benchmark_subset_single_files and da-tacos_coveranalysis_subset_single_files folders, we organize the data based on their respective cliques, and one file contains all the features for that particular song.

    {
        "chroma_cens": numpy.ndarray,
        "crema": numpy.ndarray,
        "hpcp": numpy.ndarray,
        "key_extractor": {
            "key": numpy.str_,
            "scale": numpy.str_,
            "strength": numpy.float64
        },
        "madmom_features": {
            "novfn":

  12. python-edu

    • huggingface.co
    Updated Jan 8, 2025
    Cite
    Avelina Hadji-Kyriacou (2025). python-edu [Dataset]. https://huggingface.co/datasets/Avelina/python-edu
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2025
    Authors
    Avelina Hadji-Kyriacou
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This version is deprecated! Please use the cleaned version: Avelina/python-edu-cleaned

      SmolLM-Corpus: Python-Edu
    

    This dataset contains the python-edu subset of SmolLM-Corpus with the contents of the files stored in a new text field. All files were downloaded from the S3 bucket on January the 8th 2025, using the blob IDs from the original dataset with revision 3ba9d605774198c5868892d7a8deda78031a781f. Only 1 file was marked as not found and the corresponding row removed from the… See the full description on the dataset page: https://huggingface.co/datasets/Avelina/python-edu.

  13. Wikimedia Structured Dataset Navigator (JSONL)

    • kaggle.com
    zip
    Updated Apr 23, 2025
    Cite
    Mehranism (2025). Wikimedia Structured Dataset Navigator (JSONL) [Dataset]. https://www.kaggle.com/datasets/mehranism/wikimedia-structured-dataset-navigator-jsonl
    Explore at:
    Available download formats: zip (266196504 bytes)
    Dataset updated
    Apr 23, 2025
    Authors
    Mehranism
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📚 Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.

    🔍 What’s Inside: This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.

    Each line in the JSONL file is a JSON object with the following fields: - file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl) - file_index: the numeric row index of the file - name: the Wikipedia article title or identifier - url: a link to the full article on Wikipedia - description: a short description or abstract of the article (when available)

    🛠 Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.

    ⚡️ Benefits: - Lightweight (~MBs vs. GBs) - Easy to load and search - Great for indexing, previewing, and subsetting the Wikimedia dataset - Saves time, bandwidth, and compute resources

    📎 Example Usage (Python):
    ```python
    import kagglehub
    import json
    import pandas as pd
    import numpy as np
    import os
    from tqdm import tqdm
    from datetime import datetime
    import re

    def read_jsonl(file_path, max_records=None):
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(tqdm(f)):
                if max_records and i >= max_records:
                    break
                data.append(json.loads(line))
        return data

    file_path = kagglehub.dataset_download(
        "mehranism/wikimedia-structured-dataset-navigator-jsonl",
        path="wiki_structured_dataset_navigator.jsonl",
    )
    data = read_jsonl(file_path)
    print(f"Successfully loaded {len(data)} records")

    df = pd.DataFrame(data)
    print(f"Dataset shape: {df.shape}")
    print("Columns in the dataset:")
    for col in df.columns:
        print(f"- {col}")
    ```

    This dataset is perfect for developers working on:
    - Retrieval-Augmented Generation (RAG)
    - Large Language Model (LLM) fine-tuning
    - Search and filtering pipelines
    - Academic research on structured Wikipedia content

    💡 Tip:
    Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.

    📃 Format:
    - File: `wiki_structured_dataset_navigator.jsonl`
    - Format: JSON Lines (1 object per line)
    - Encoding: UTF-8

    ---

    ### **Tags**

    wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning

    Licensing

    CC0: Public Domain Dedication
    

    (Recommended for open indexing tools with no sensitive data.)

  14. Development of a Cambridge Structural Database Subset: A Collection of...

    • acs.figshare.com
    text/x-python
    Updated Jun 1, 2023
    Cite
    Peyman Z. Moghadam; Aurelia Li; Seth B. Wiggin; Andi Tao; Andrew G. P. Maloney; Peter A. Wood; Suzanna C. Ward; David Fairen-Jimenez (2023). Development of a Cambridge Structural Database Subset: A Collection of Metal–Organic Frameworks for Past, Present, and Future [Dataset]. http://doi.org/10.1021/acs.chemmater.7b00441.s002
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Peyman Z. Moghadam; Aurelia Li; Seth B. Wiggin; Andi Tao; Andrew G. P. Maloney; Peter A. Wood; Suzanna C. Ward; David Fairen-Jimenez
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    We report the generation and characterization of the most complete collection of metal–organic frameworks (MOFs) maintained and updated, for the first time, by the Cambridge Crystallographic Data Centre (CCDC). To set up this subset, we asked the question “what is a MOF?” and implemented a number of “look-for-MOF” criteria embedded within a bespoke Cambridge Structural Database (CSD) Python API workflow to identify and extract information on 69 666 MOF materials. The CSD MOF subset is updated regularly with subsequent MOF additions to the CSD, bringing a unique record for all researchers working in the area of porous materials around the world, whether to perform high-throughput computational screening for materials discovery or to have a global view over the existing structures in a single resource. Using this resource, we then developed and used an array of computational tools to remove residual solvent molecules from the framework pores of all the MOFs identified and went on to analyze geometrical and physical properties of nondisordered structures.

  15. imagenet2012_subset

    • tensorflow.org
    Updated Oct 21, 2024
    + more versions
    Cite
    (2024). imagenet2012_subset [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012_subset
    Explore at:
    Dataset updated
    Oct 21, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet; the majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split, export those results to a text file, and upload it to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the format of the text file is 100,000 lines corresponding to each image in the test split. Each line of integers corresponds to the rank-ordered, top 5 predictions for each test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012_subset', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012_subset-1pct-5.0.0.png

  16. MatSim Dataset and benchmark for one-shot visual materials and textures...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jun 25, 2025
    Cite
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik (2025). MatSim Dataset and benchmark for one-shot visual materials and textures recognition [Dataset]. http://doi.org/10.5281/zenodo.7390166
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Manuel S. Drehwald; Sagi Eppel; Jolina Li; Han Hao; Alan Aspuru-Guzik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The MatSim Dataset and benchmark

    Latest version

    Synthetic dataset and real images benchmark for visual similarity recognition of materials and textures.

    MatSim: a synthetic dataset, a benchmark, and a method for computer vision-based recognition of similarities and transitions between materials and textures focusing on identifying any material under any conditions using one or a few examples (one-shot learning).

    Based on the paper: One-shot recognition of any material anywhere using contrastive learning with physics-based rendering

    Benchmark_MATSIM.zip: contains the benchmark made of real-world images, as described in the paper.



    MatSim_object_train_split_1,2,3.zip: contains a subset of the synthetic dataset, CGI images of materials on random objects, as described in the paper.

    MatSim_Vessels_Train_1,2,3.zip: contains a subset of the synthetic dataset, CGI images of materials inside transparent containers, as described in the paper.

    *Note: these are subsets of the dataset; the full dataset can be found at:
    https://e1.pcloud.link/publink/show?code=kZIiSQZCYU5M4HOvnQykql9jxF4h0KiC5MX

    or
    https://icedrive.net/s/A13FWzZ8V2aP9T4ufGQ1N3fBZxDF

    Code:

    Up-to-date code for generating the dataset, reading and evaluation, and trained nets can be found at this URL: https://github.com/sagieppel/MatSim-Dataset-Generator-Scripts-And-Neural-net

    Dataset Generation Scripts.zip: contains the Blender (3.1) Python scripts used for generating the dataset; this code might be old, and up-to-date code can be found at the URL above.
    Net_Code_And_Trained_Model.zip: contains reference neural net code, including loaders, trained models, and evaluator scripts that can be used to read and train with the synthetic dataset or test the model with the benchmark. Note: the code in the ZIP file is not up to date and contains some bugs; for the latest version of this code, see the URL above.

    Further documentation can be found inside the zip files or in the paper.

  17. civil_comments

    • tensorflow.org
    • huggingface.co
    Updated Feb 28, 2023
    Cite
    (2023). civil_comments [Dataset]. https://www.tensorflow.org/datasets/catalog/civil_comments
    Explore at:
    Dataset updated
    Feb 28, 2023
    Description

    This version of the CivilComments Dataset provides access to the primary seven labels that were annotated by crowd workers; the toxicity and the other tags are values between 0 and 1 indicating the fraction of annotators that assigned these attributes to the comment text.

    The other tags are only available for a fraction of the input examples. They are currently ignored for the main dataset; the CivilCommentsIdentities set includes those labels, but only consists of the subset of the data with them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.

    The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 - 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text, some associated metadata such as article IDs, publication IDs, timestamps and commenter-generated "civility" labels, but does not include user ids. Jigsaw extended this dataset by adding additional labels for toxicity, identity mentions, as well as covert offensiveness. This data set is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. This dataset is released under CC0, as is the underlying comment text.

    For comments that have a parent_id also in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('civil_comments', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  18. imagenette

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Jun 1, 2024
    + more versions
    Cite
    (2024). imagenette [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenette
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    Imagenette is a subset of 10 easily classified classes from the ImageNet dataset. It was originally prepared by Jeremy Howard of FastAI. The objective behind putting together a small version of the ImageNet dataset was mainly that running new ideas/algorithms/experiments on the whole ImageNet takes a lot of time.

    This version of the dataset allows researchers/practitioners to quickly try out ideas and share with others. The dataset comes in three variants:

    • Full size
    • 320 px
    • 160 px

    Note: The v2 config corresponds to the new 70/30 train/valid split (released on Dec 6, 2019).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenette', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenette-full-size-v2-1.0.0.png

  19. Data from: tableone: An open source Python package for producing summary...

    • datadryad.org
    • search.dataone.org
    • +1more
    zip
    Updated Apr 23, 2019
    Cite
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark (2019). tableone: An open source Python package for producing summary statistics for research papers [Dataset]. http://doi.org/10.5061/dryad.26c4s35
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2019
    Dataset provided by
    Dryad
    Authors
    Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark
    Time period covered
    Apr 19, 2018
    Description

    Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table (“Table 1”) of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

    Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

    Results: The tableone software package automatically compiles summary statistics into publishable formats such...
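    A minimal usage sketch of the tableone package described here, following its documented basic workflow; the example DataFrame, column names, and grouping variable are assumptions for illustration:

    import pandas as pd
    from tableone import TableOne

    # A small, made-up cohort purely for illustration.
    df = pd.DataFrame({
        'age': [34, 51, 27, 63, 45, 38],
        'sex': ['F', 'M', 'F', 'M', 'F', 'M'],
        'group': ['treated', 'control', 'treated', 'control', 'treated', 'control'],
    })

    # Build a "Table 1" of summary statistics, stratified by study group.
    table1 = TableOne(df, columns=['age', 'sex'], categorical=['sex'], groupby='group')
    print(table1.tabulate(tablefmt='github'))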

  20. Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage,...

    • radar.kit.edu
    • radar-service.eu
    tar
    Updated May 6, 2024
    + more versions
    Cite
    Hendrik Andersen; Philip Stier; Barbara Dietel; Jan Cermak; Corinna Hoose (2024). Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage, and mixing ratios of sea spray and dust on cloud phase over the Arctic and Southern Oceans", submitted to Geophysical Research Letters [Dataset]. http://doi.org/10.35097/VEbaqHtbXdEzreqO
    Explore at:
    Available download formats: tar (34304 bytes)
    Dataset updated
    May 6, 2024
    Dataset provided by
    Karlsruhe Institute of Technology
    Stier, Philip
    Dietel, Barbara
    Authors
    Hendrik Andersen; Philip Stier; Barbara Dietel; Jan Cermak; Corinna Hoose
    Area covered
    Southern Ocean, Arctic
    Description

    Code of Dietel et al.: "Combined impacts of temperature, sea ice coverage, and mixing ratios of sea spray and dust on cloud phase over the Arctic and Southern Oceans", submitted to Geophysical Research Letters

    Scripts to train a machine learning model (histogram-based gradient boosting regression with scikit-learn) and calculate SHapley Additive exPlanations (SHAP) values

    The machine learning model can predict the liquid fraction in different cloud types based on four parameters, namely the cloud top temperature, the sea ice concentration, the dust mixing ratio and the sea salt mixing ratio. More information on the used dataset can be found here: Dietel et al. 2023

    Bash-scripts

    The Bash scripts are used to run the Python scripts for different cloud types and regions on a cluster. Bash scripts starting with GBR_[...] (Gradient Boosting Regression) run the Python script hist_gbr_subset_final2_with_comments.py for different regions (Arctic Ocean (AO), Southern Ocean (SO)) and different cloud types (low-level, mid-level, mid-to-low-level). Bash scripts starting with shap_values_[...] run the Python script shap_values-subset-final2_with_comments.py to calculate SHAP values based on the trained machine learning models for a 500,000-sample subset of the validation dataset.

    Python scripts

    hist_gbr_subset_final2_with_comments.py: Python script to train a histogram-based gradient boosting regression model using the scikit-learn Python package. More detailed information can be found as comments in the script. shap_values-subset-final2_with_comments.py: calculates SHAP values for a 500,000-sample subset of the validation dataset to make the machine learning model explainable. More detailed information can be found as comments in the script.
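    A minimal sketch, not the authors' scripts: it trains a scikit-learn histogram-based gradient boosting regressor on the four predictors named above and explains it with SHAP. The column names and synthetic data are illustrative assumptions.

    import numpy as np
    import pandas as pd
    import shap
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data; the real inputs come from the dataset referenced above.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({
        'cloud_top_temperature': rng.uniform(230, 280, 5000),
        'sea_ice_concentration': rng.uniform(0, 1, 5000),
        'dust_mixing_ratio': rng.lognormal(-20, 1, 5000),
        'sea_salt_mixing_ratio': rng.lognormal(-19, 1, 5000),
    })
    y = rng.uniform(0, 1, 5000)  # placeholder for the cloud liquid fraction

    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    model = HistGradientBoostingRegressor(random_state=0).fit(X_train, y_train)

    # Model-agnostic SHAP explanation on a small sample of the validation data.
    explainer = shap.Explainer(model.predict, X_train.sample(100, random_state=0))
    shap_values = explainer(X_val.sample(200, random_state=0))
    print(shap_values.values.shape)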
