100+ datasets found
  1. c-sharp-coding-dataset

    • huggingface.co
    Updated Dec 17, 2024
    Cite
    David Meldrum (2024). c-sharp-coding-dataset [Dataset]. https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 17, 2024
    Authors
    David Meldrum
    Description

    Dataset Card for c-sharp-coding-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset.

  2. Java-C# Code Statements

    • kaggle.com
    zip
    Updated Nov 27, 2022
    Cite
    Abdelrahman Soliman (2022). Java-C# Code Statements [Dataset]. https://www.kaggle.com/datasets/amss10/javac-code-statements
    Explore at:
    Available download formats: zip (846907 bytes)
    Dataset updated
    Nov 27, 2022
    Authors
    Abdelrahman Soliman
    Description

    Dataset

    This dataset was created by Abdelrahman Soliman

    Contents

  3. Data set statistics: Number of events e#, of failure sub classes c# and of...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Guido Schwenk; Ben Jochinke; Klaus-Robert Müller (2023). Data set statistics: Number of events e#, of failure sub classes c# and of samples s#. [Dataset]. http://doi.org/10.1371/journal.pone.0228434.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Guido Schwenk; Ben Jochinke; Klaus-Robert Müller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set statistics: Number of events e#, of failure sub classes c# and of samples s#.

  4. C# Dataset of Data Class, Feature Envy and Refused Bequest code smells

    • nde-dev.biothings.io
    Updated Jan 9, 2024
    Cite
    Luburić, Nikola (2024). C# Dataset of Data Class, Feature Envy and Refused Bequest code smells [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10475431
    Explore at:
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    Grujić, Katarina-Glorija
    Slivka, Jelena
    Kovačević, Aleksandar
    Luburić, Nikola
    Prokić, Simona
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes open-source projects written in the C# programming language, annotated for the presence of Data Class, Feature Envy and Refused Bequest code smells. Each code snippet was manually annotated by at least two annotators.

    The dataset contains three excel datasheets:

    DataSet_Data_Class.xlsx - C# classes annotated for the Data Class code smell

    DataSet_Feature_Envy.xlsx - C# methods annotated for the Feature Envy code smell

    DataSet_Refused_Bequest.xlsx - C# classes annotated for the Refused Bequest code smell

    The columns in the datasheet represent:

    Code Snippet ID - the full name of the code snippet.

    for classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).

    for methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type))

    Link - the Github link to the code snippet, including the commit and the start and end LOC.

    Code Smell - code smell for which the code snippet is examined (Data Class, Feature Envy or Refused Bequest)

    Project Link - the link to the version of the code repository that was annotated

    Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 31 class-level metrics for Data Class and Refused Bequest detection and 19 method-level metrics for Feature Envy detection. The list of metrics and their definitions is available here.

    Final annotation – a single severity score calculated by a majority vote.

    Annotators – each annotator's (1, 2, or 3) assigned severity score.
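The "Final annotation" column is described as a majority vote over the annotators' severity scores. A minimal sketch of such a vote is shown below. The function name is hypothetical, and returning None when no score has a strict majority is an assumption of this sketch: the description does not say how ties were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(scores: Sequence[int]) -> Optional[int]:
    """Return the score chosen by more than half of the annotators.

    Returns None when no score has a strict majority. That tie rule is an
    assumption of this sketch, not something stated by the dataset authors.
    """
    if not scores:
        return None
    score, count = Counter(scores).most_common(1)[0]
    return score if count > len(scores) / 2 else None


print(majority_vote([2, 2, 1]))  # two of three annotators agree on 2
print(majority_vote([0, 1, 2]))  # no majority
```

With two or three annotators per snippet, any score shared by at least two of them wins under this rule.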

    To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in three separate excel datasheets:

    DataClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Data Class code smell.

    FeatureEnvy_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Feature Envy code smell.

    RefusedBequest_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Refused Bequest code smell.

    The columns of these three datasheets are:

    Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Data_Class.xlsx, DataSet_Feature_Envy.xlsx and DataSet_Refused_Bequest.xlsx)

    Annotators – heuristics labelled by each of the annotators (1, 2, or 3).

    Heuristics – whether the heuristic is applicable to the examined code snippet or not

    Annotators annotated the dataset based on the annotation procedure and guidelines available here.

  5. cs_repl_ai_alpaca

    • huggingface.co
    Updated Aug 2, 2024
    Cite
    Abhijit (2024). cs_repl_ai_alpaca [Dataset]. https://huggingface.co/datasets/abhijitkumarjha88192/cs_repl_ai_alpaca
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2024
    Authors
    Abhijit
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    abhijitkumarjha88192/cs_repl_ai_alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Dataset of books called Real world functional programming : with examples in...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Real world functional programming : with examples in F# and C# [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Real+world+functional+programming+%3A+with+examples+in+F%23+and+C%23
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Real world functional programming : with examples in F# and C#. It features 7 columns including author, publication date, language, and book publisher.

  7. code-code-translation-java-csharp

    • huggingface.co
    Updated Jul 11, 2017
    Cite
    Semeru Lab (2017). code-code-translation-java-csharp [Dataset]. https://huggingface.co/datasets/semeru/code-code-translation-java-csharp
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 11, 2017
    Dataset authored and provided by
    Semeru Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset is imported from CodeXGLUE and pre-processed using their script.

      Where to find in Semeru:
    

    The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru

      CodeXGLUE -- Code2Code Translation

      Task Definition

    Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.

  8. Dataset of publication dates of book subjects that contain Design patterns...

    • workwithdata.com
    Updated Nov 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Work With Data (2024). Dataset of publication dates of book subjects that contain Design patterns in . NET : reusable approaches in C# and F# for object-oriented software design [Dataset]. https://www.workwithdata.com/datasets/book-subjects?col=book_subject%2Cj0-publication_date&f=1&fcol0=j0-book&fop0=%3D&fval0=Design+patterns+in+.+NET+%3A+reusable+approaches+in+C%23+and+F%23+for+object-oriented+software+design&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 2 rows and is filtered where the book is Design patterns in . NET : reusable approaches in C# and F# for object-oriented software design. It features 2 columns including publication dates.

  9. Preprocessed C# Source Codes for Machine Learning

    • data.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Pintér, Ádám; Szénási, Sándor (2020). Preprocessed C# Source Codes for Machine Learning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3264760
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Óbuda University
    Authors
    Pintér, Ádám; Szénási, Sándor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comes from the HackerRank site: 329,937 C# source codes for 22 tasks were collected, all verified by unit tests.

    During download, each source code was given a unique serial number in place of the name of the user who solved the task, and was stored in the 'task_name/origin' folder. After the data was collected, a new database was created containing cleaned-up versions of the source codes (in the 'task_name/cleaned' folders). Finally, a third data set was extracted from the cleaned-up version, with a delimiter inserted before and after each elementary expression to support easy processing and analysis (in the 'task_name/reduced' folders). Each 'task_name' folder also holds three csv files containing the equality-checking results. The compressed folder additionally contains a vector space (and related files) built from the reduced data set; these four files sit directly in the main folder.

  10. Towards a systematic approach to manual annotation of code smells - C#...

    • data.niaid.nih.gov
    Updated May 5, 2022
    Cite
    Nikola Luburić; Simona Prokić; Katarina-Glorija Grujić; Jelena Slivka; Aleksandar Kovačević; Goran Sladić; Dragan Vidaković (2022). Towards a systematic approach to manual annotation of code smells - C# Dataset of Long Method and Large Class code smells [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6520055
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    University of Novi Sad, Faculty of Technical Sciences
    Authors
    Nikola Luburić; Simona Prokić; Katarina-Glorija Grujić; Jelena Slivka; Aleksandar Kovačević; Goran Sladić; Dragan Vidaković
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes open-source projects written in the C# programming language, annotated for the presence of Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:

    Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.

    The dataset contains two excel datasheets:

    DataSet_Large Class.xlsx – C# classes annotated for the Large Class code smell severity.

    DataSet_Long Method.xlsx – C# methods annotated for the Long method code smell severity.

    The columns in the datasheet represent:

    Code Snippet ID – the full name of the code snippet.

    For classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).

    For methods, this is the full name of the class and the method’s signature (e.g., namespace.class.method(param1Type, param2Type)).

    Link – The GitHub link to the code snippet, including the commit and the start and end LOC.

    Code Smell – code smell for which the code snippet is examined (Large Class or Long Method).

    Project Link – the link to the version of the code repository that was annotated.

    Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 25 class-level metrics for Large Class detection and 18 method-level metrics for Long Method detection. The list of metrics and their definitions is available here.

    Final annotation – a single severity score calculated by a majority vote.

    Annotators – each annotator's (1, 2, or 3) assigned severity score.
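The Code Snippet ID format described above (a dotted full name for classes, plus a parenthesized parameter list for methods) can be split into its parts with a few string operations. The sketch below is illustrative only. The names SnippetId and parse_snippet_id are hypothetical, and the comma split assumes parameter types without embedded commas, so a generic type such as Dictionary<string, int> would need a real parser.

```python
from typing import NamedTuple, Optional, Tuple


class SnippetId(NamedTuple):
    class_name: str               # fully qualified, including any outer classes
    method: Optional[str]         # None for class-level snippets
    param_types: Tuple[str, ...]  # empty for classes and zero-argument methods


def parse_snippet_id(snippet_id: str) -> SnippetId:
    """Split an ID such as 'namespace.class.method(param1Type, param2Type)'."""
    snippet_id = snippet_id.strip()
    if "(" not in snippet_id:
        # No parameter list, so the ID names a class.
        return SnippetId(snippet_id, None, ())
    head, _, tail = snippet_id.partition("(")
    class_name, _, method = head.rpartition(".")
    params = tail.rstrip(")").strip()
    param_types = tuple(p.strip() for p in params.split(",")) if params else ()
    return SnippetId(class_name, method, param_types)


print(parse_snippet_id("namespace.class.method(param1Type, param2Type)"))
print(parse_snippet_id("namespace.subnamespace.outerclass.innerclass"))
```

Splitting at the first opening parenthesis before splitting on the last dot keeps dotted parameter types such as System.String intact.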

    To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate excel datasheets:

    LargeClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Large Class code smell.

    LongMethod_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Long Method code smell.

    The columns of these two datasheets are:

    Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Large Class.xlsx and DataSet_Long Method.xlsx)

    Annotators – heuristics labelled by each of the annotators (1, 2, or 3).

    Heuristics – whether the heuristic is applicable to the examined code snippet or not (Section 1.2.4 lists heuristics relevant for the Large Class detection, and Section 1.2.5 lists the heuristics relevant for the Long Method detection).

  11. C# Methods with Documentation Comments from GitHub

    • kaggle.com
    zip
    Updated Apr 7, 2018
    Cite
    Alexander Chebykin (2018). C# Methods with Documentation Comments from GitHub [Dataset]. https://www.kaggle.com/awesomelemon/csharp-commented-methods-github
    Explore at:
    Available download formats: zip (526906028 bytes)
    Dataset updated
    Apr 7, 2018
    Authors
    Alexander Chebykin
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    This dataset was gathered from open-source projects hosted on GitHub. We used it in our research to train a deep RNN Encoder-Decoder model for API recommendation. We followed the approach of the Deep API Learning paper [https://dl.acm.org/citation.cfm?id=2950334], which used a similar Java dataset that was not available. People interested in the approach can now work with our dataset.

    The paper describing dataset collection and model training is available here: https://ispras.ru/proceedings/docs/2018/30/3/isp_30_2018_3_63.pdf

    Content

    Database file contains 4 tables.

    Repo contains list of repositories processed, including URLs, numbers of forks, watchers, stars.

    Solution contains path of the Solution files found within repositories and flags marking whether they could be compiled.

    Method contains method names, full comments, summary sections of the documentation comments, and linearized sequences of API calls. The table also contains the detected languages of summaries and tokenized method names.

    MethodParameter contains types and names of method parameters.

  12. Dataset - Towards the Systematic Testing of Virtual Reality Programs

    • data.mendeley.com
    Updated Sep 16, 2021
    Cite
    Stevão Andrade (2021). Dataset - Towards the Systematic Testing of Virtual Reality Programs [Dataset]. http://doi.org/10.17632/4myfs585s9.2
    Explore at:
    Dataset updated
    Sep 16, 2021
    Authors
    Stevão Andrade
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.

    It contains an implementation of Average Clustering and Labeling (ACL), an approach for predicting defect proneness on unlabeled datasets.

    ACL models achieve good prediction performance, comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.

    This dataset also contains analyses of code smells in C# repositories. Please check the paper for further information.

  13. cars

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Евгений Маслов (2024). cars [Dataset]. https://huggingface.co/datasets/evgmaslov/cars
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2024
    Authors
    Евгений Маслов
    Description

    Cars

    This dataset contains pairs of C# code and text descriptions. Every code sample is a C# code file that is part of a computational engineering system. One of this system’s functions is to generate cars in the form of voxels. Each code file contains instructions for generating a car, so compiling the file with the whole system makes it possible to generate cars as in the picture below.

    Text column contains descriptions of cars that should be generated in form of natural… See the full description on the dataset page: https://huggingface.co/datasets/evgmaslov/cars.

  14. Data from: Roblox Dataset

    • kaggle.com
    zip
    Updated Apr 3, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jansen C. Cruz (2025). Roblox Dataset [Dataset]. https://www.kaggle.com/datasets/jansenccruz/roblox-dataset/discussion
    Explore at:
    Available download formats: zip (1071799 bytes)
    Dataset updated
    Apr 3, 2025
    Authors
    Jansen C. Cruz
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Roblox Games Dataset 📊🎮

    9,734 rows x 16 columns

    You can use the implementation on GitHub to scrape a current version; see the commit history for other implementations. This dataset contains detailed information about 9,734 Roblox games, including their creators, popularity metrics, genres, and additional metadata. The data was web-scraped using C#, and the full implementation, along with its commit history, is on GitHub (link below).

    📌 Dataset Features

    1. Title – Name of the game
    2. Creator – The developer or studio behind the game
    3. AgeRecommendation – Roblox's suggested age rating
    4. Active Players – Current number of active players
    5. Favorites – Number of users who favorited the game
    6. Visits – Total lifetime visits (in millions or billions)
    7. Voice Chat – Whether voice chat is supported
    8. Camera – Camera settings support
    9. Created & Updated – Creation and last update date
    10. Server Size – Size of the server
    11. Genre – Game genre (if available)
    12. Likes & Dislikes – Community feedback metrics
    13. Game Link – Direct link to the game
    14. DateFetched – When the data was collected

    🔧 Data Collection Process

    The data was collected through web scraping using C#. The implementation can be found on GitHub, where you can explore the commit history to see different iterations and improvements of the web scraper.

    🔗 GitHub Repository

    https://github.com/jansencruz23/roblox-webscraper-csharp

    📌 Potential Uses

    • Analyzing game popularity trends
    • Identifying top-performing game genres
    • Examining the impact of voice chat on engagement
    • Comparing game update frequency with popularity
    • Predicting game popularity

    If you have any questions or suggestions, feel free to leave a comment! 🚀

  15. LCC_csharp

    • huggingface.co
    Updated Dec 24, 2023
    Cite
    Microsoft (2023). LCC_csharp [Dataset]. https://huggingface.co/datasets/microsoft/LCC_csharp
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 24, 2023
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    Description

    Dataset Card for "LCC_csharp"

    More Information needed

  16. Data from: QScored: A Large Dataset of Code Smells and Quality Metrics

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Dec 31, 2022
    Cite
    Tushar Sharma (2022). QScored: A Large Dataset of Code Smells and Quality Metrics [Dataset]. http://doi.org/10.5281/zenodo.4468361
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    Dec 31, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tushar Sharma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains code quality information of more than 86 thousand GitHub repositories containing more than 1.1 billion lines of code mainly written in C# and Java. The code quality information contains detected 7 kinds of architecture smells, 19 kinds of design smells, and 11 kinds of implementation smells, and 27 commonly used code quality metrics computed at project, package, class, and method levels.

  17. Most used technologies in the .NET C# tech stack worldwide 2024

    • statista.com
    Updated Jul 18, 2025
    Cite
    Statista (2025). Most used technologies in the .NET C# tech stack worldwide 2024 [Dataset]. https://www.statista.com/statistics/1292362/popular-technologies-in-the-net-c-tech-stack/
    Explore at:
    Dataset updated
    Jul 18, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jan 1, 2024 - Jun 30, 2024
    Area covered
    Worldwide
    Description

    A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the .NET C# tech stack in 2023 was database connectivity, chosen by over ** percent of respondents. It was followed by MVC and REST in the second and fourth place, respectively.

  18. Musical Scale Classification Dataset using Chroma

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
    Explore at:
    Available download formats: zip (392580911 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Om Avashia
    License

    https://cdla.io/sharing-1-0/

    Description

    Dataset Description

    Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

    This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

    What’s Inside

    • chroma_tensor: A JSON-safe serialization of a PyTorch tensor with shape [1, 12, T], where:
      • 12 = the 12 pitch classes (C, C#, D, ... B)
      • T = time steps
    • scale_index: An integer label from 0–23 identifying the scale the sample belongs to

    Use Cases

    This dataset is ideal for:

    • Training deep learning models (CNNs, MLPs) to classify musical scales
    • Exploring pitch-class distributions in Western tonal music
    • Prototyping models for music key detection, chord prediction, or tonal analysis
    • Teaching or demonstrating chromagram-based ML workflows

    Labels

    Index   Scale
    0       C major
    1       C# major
    ...     ...
    11      B major
    12      C minor
    ...     ...
    23      B minor
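The label table above suggests that indices 0–11 are the major scales and 12–23 the minor scales, each ordered by pitch class. A small helper can turn a scale_index back into a name under that reading. The helper name is hypothetical, and the sharp spellings for the rows the table elides (D#, F#, and so on) are an assumption; only C, C# and B are confirmed by the table.

```python
# Chromatic pitch classes starting at C. The sharp spellings for the rows
# elided in the label table are an assumption of this sketch.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def scale_name(scale_index: int) -> str:
    """Map a label in 0..23 to a name such as 'C# major' or 'B minor'."""
    if not 0 <= scale_index <= 23:
        raise ValueError(f"scale_index must be in 0..23, got {scale_index}")
    mode = "major" if scale_index < 12 else "minor"
    return f"{PITCH_CLASSES[scale_index % 12]} {mode}"


for index in (0, 1, 11, 12, 23):
    print(index, scale_name(index))
```

The five indices printed here are the ones the table lists explicitly, so the output can be checked against it directly.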

    Quick Load Example (PyTorch)

    Chroma tensors are of shape [1, 12, T], where:

    • 1 is the channel dimension (for CNN input)
    • 12 represents the 12 pitch classes (C through B)
    • T is the number of time frames

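    import ast

    import pandas as pd
    import torch
    from tqdm import tqdm

    # Path from the original example; adjust it to where the CSV is stored.
    df = pd.read_csv("/content/scale_dataset.csv")

    # Reconstruct chroma tensors. ast.literal_eval parses the stored list
    # literal without executing arbitrary code, unlike eval.
    X = [torch.tensor(ast.literal_eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
    y = df['scale_index'].tolist()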
    

    Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

    import torch
    
    data = torch.load("chroma_tensors.pt")
    X_pt = data['X'] # list of [1, 12, 302] tensors
    y_pt = data['y'] # list of scale indices
    

    How It Was Built

    • Notes generated from random melodies using music21
    • MIDI converted to WAV via FluidSynth
    • Chromagrams extracted with librosa.feature.chroma_stft
    • Tensors flattened and saved alongside scale index labels

    File Format

    Column          Type   Description
    chroma_tensor   str    Flattened 1D chroma tensor [1×12×T]
    scale_index     int    Label from 0 to 23

    Notes

    • Data is synthetic but musically valid and well-balanced
    • Each of the 24 scales appears 300 times
    • All tensors have fixed length (T) for easy batching
  19. codebase-small

    • huggingface.co
    Updated Sep 24, 2024
    Cite
    **** ********* (2024). codebase-small [Dataset]. https://huggingface.co/datasets/grebniets123/codebase-small
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 24, 2024
    Authors
    **** *********
    License

    https://choosealicense.com/licenses/bsd/

    Description

    Dataset Summary

    This dataset was made from pieces of code from across the internet. I used multiple hosting platforms to collect the code, not only GitHub. The codebase was gathered to make it easy to collect pieces of code together and use them to train AI.

      Languages
    

    python ruby go html css c# c/c++ rust php

      Data Fields
    

    repo_name: name of repository
    path: path to file inside the repository
    content: content of file
    license: license of… See the full description on the dataset page: https://huggingface.co/datasets/grebniets123/codebase-small.

  20. Multi-language Open Source Code Identifier Dataset

    • kaggle.com
    zip
    Updated Jul 8, 2025
    Cite
    Bharat Mane (2025). Multi-language Open Source Code Identifier Dataset [Dataset]. https://www.kaggle.com/datasets/bharatmane/multi-language-open-source-code-identifier-dataset/data
    Explore at:
    Available download formats: zip (3690401 bytes)
    Dataset updated
    Jul 8, 2025
    Authors
    Bharat Mane
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was created to support research and tool development in the areas of code readability, identifier naming, program comprehension, and code mining. It contains 362,886 unique identifier names (including classes, functions/methods, and variables) extracted from 21 widely used and actively maintained open-source projects.

    Projects were carefully selected from four major programming language ecosystems: Java, Python, C#, and JavaScript/TypeScript. The repositories span popular libraries and frameworks in domains such as data science, web development, backend systems, dependency injection, and more. These projects are widely recognized as benchmarks in their respective communities, ensuring that the dataset represents industry best practices in naming and code style.

    Context & Motivation: Good identifier naming is fundamental for code readability and maintainability, yet cross-language empirical datasets are rare. This dataset enables comparative studies of naming conventions, training and benchmarking of AI models, and reproducible research on identifier readability. It is designed to be both a large-scale resource and a realistic reflection of naming in production-quality code.

    Sources:

    • commons-lang, guava, hibernate-orm, logging-log4j2, spring-framework
    • django, flask, numpy, pandas, requests
    • Autofac, Dapper, Hangfire, IdentityServer, NLog
    • react, vue, d3, lodash, express, angular, angular-cli, ngx-bootstrap, TypeScript, NestJS

    Each identifier is labelled with its project, language, type, and name. We encourage use for academic research, code intelligence, machine learning, and developer education.
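One comparative study the description mentions, naming conventions across languages, could start from a simple classifier over the name field. The sketch below is only illustrative. The convention labels, the regular expressions, and the sample (language, name) records are all assumptions made for this example and are not taken from the dataset.

```python
import re
from collections import Counter

# Ordered (label, pattern) pairs; the first full match wins. These rules
# are a rough heuristic chosen for this sketch.
_CONVENTIONS = [
    ("UPPER_SNAKE_CASE", re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)+")),
    ("snake_case", re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")),
    ("UPPERCASE", re.compile(r"[A-Z][A-Z0-9]*")),
    ("lowercase", re.compile(r"[a-z][a-z0-9]*")),
    ("camelCase", re.compile(r"[a-z][a-z0-9]*[A-Z][A-Za-z0-9]*")),
    ("PascalCase", re.compile(r"[A-Z][A-Za-z0-9]*")),
]


def naming_convention(name: str) -> str:
    """Return a coarse naming-convention label for one identifier."""
    for label, pattern in _CONVENTIONS:
        if pattern.fullmatch(name):
            return label
    return "other"


# Hypothetical records standing in for the dataset's language and name fields.
records = [
    ("Python", "read_csv"),
    ("Python", "DataFrame"),
    ("C#", "GetService"),
    ("Java", "isEmpty"),
    ("Java", "MAX_VALUE"),
]
counts = Counter((language, naming_convention(name)) for language, name in records)
for (language, label), n in sorted(counts.items()):
    print(language, label, n)
```

Names that fit none of the patterns, such as those with a leading underscore, fall into "other" so they can be inspected separately.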
