100+ datasets found

c-sharp-coding-dataset
huggingface.co
Updated Dec 17, 2024
Cite
David Meldrum (2024). c-sharp-coding-dataset [Dataset]. https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset
Explore at:
Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
Dataset updated
Dec 17, 2024
Authors
David Meldrum
Description
Dataset Card for c-sharp-coding-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI:

distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset/raw/main/pipeline.yaml"

or explore the configuration:

distilabel pipeline info --config…

See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset.
Java-C# Code Statements
kaggle.com
zip
Updated Nov 27, 2022
Cite
Abdelrahman Soliman (2022). Java-C# Code Statements [Dataset]. https://www.kaggle.com/datasets/amss10/javac-code-statements
Explore at:
Available download formats: zip (846907 bytes)
Dataset updated
Nov 27, 2022
Authors
Abdelrahman Soliman
Description
Dataset

This dataset was created by Abdelrahman Soliman

Contents
Data set statistics: Number of events e#, of failure sub classes c# and of...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Guido Schwenk; Ben Jochinke; Klaus-Robert Müller (2023). Data set statistics: Number of events e#, of failure sub classes c# and of samples s#. [Dataset]. http://doi.org/10.1371/journal.pone.0228434.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0228434.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Guido Schwenk; Ben Jochinke; Klaus-Robert Müller
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data set statistics: Number of events e#, of failure sub classes c# and of samples s#.
C# Dataset of Data Class, Feature Envy and Refused Bequest code smells
nde-dev.biothings.io
Updated Jan 9, 2024
Cite
Luburić, Nikola (2024). C# Dataset of Data Class, Feature Envy and Refused Bequest code smells [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10475431
Dataset updated
Jan 9, 2024
Dataset provided by
Grujić, Katarina-Glorija
Slivka, Jelena
Kovačević, Aleksandar
Luburić, Nikola
Prokić, Simona
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes open-source projects written in the C# programming language, annotated for the presence of Data Class, Feature Envy and Refused Bequest code smells. Each code snippet was manually annotated by at least two annotators.

The dataset contains three excel datasheets:

DataSet_Data_Class.xlsx - C# classes annotated for the Data Class code smell

DataSet_Feature_Envy.xlsx - C# methods annotated for the Feature Envy code smell

DataSet_Refused_Bequest.xlsx - C# classes annotated for the Refused Bequest code smell

The columns in the datasheet represent:

Code Snippet ID - the full name of the code snippet.

for classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).

for methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type))

Link - the GitHub link to the code snippet, including the commit and the start and end LOC.

Code Smell - code smell for which the code snippet is examined (Data Class, Feature Envy or Refused Bequest)

Project Link - the link to the version of the code repository that was annotated

Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 31 class-level metrics for Data Class and Refused Bequest detection and 19 method-level metrics for Feature Envy detection. The list of metrics and their definitions is available here.

Final annotation – a single severity score calculated by a majority vote.

Annotators – each annotator's (1, 2, or 3) assigned severity score.
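
The "Final annotation" column can be illustrated with a short sketch of a majority vote over the per-annotator severity scores. The description does not state how ties are resolved, so returning `None` on a tie is an assumption made here for illustration, and the function name `majority_vote` is hypothetical rather than part of the dataset's tooling.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(scores: Sequence[int]) -> Optional[int]:
    """Return the most frequent severity score, or None when no single score wins.

    The tie policy (None) is an assumption; the dataset description does not
    say how disagreements without a majority were settled.
    """
    if not scores:
        return None
    ranked = Counter(scores).most_common()
    top_score, top_count = ranked[0]
    # If the runner-up has the same count, there is no unique winner.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_score


if __name__ == "__main__":
    print(majority_vote([2, 2, 1]))  # two of three annotators agree on 2
    print(majority_vote([0, 1, 2]))  # all annotators disagree
```

With two or three annotators per snippet, this returns a score only when at least two of them agree.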

To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in three separate excel datasheets:

DataClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Data Class code smell.

FeatureEnvy_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Feature Envy code smell.

RefusedBequest_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Refused Bequest code smell.

The columns of these three datasheets are:

Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Data_Class.xlsx, DataSet_Feature_Envy.xlsx and DataSet_Refused_Bequest.xlsx)

Annotators – heuristics labelled by each of the annotators (1, 2, or 3).

Heuristics – whether the heuristic is applicable to the examined code snippet or not

Annotators annotated the dataset based on the annotation procedure and guidelines available here.
cs_repl_ai_alpaca
huggingface.co
Updated Aug 2, 2024
Cite
Abhijit (2024). cs_repl_ai_alpaca [Dataset]. https://huggingface.co/datasets/abhijitkumarjha88192/cs_repl_ai_alpaca
Explore at:
Croissant
Dataset updated
Aug 2, 2024
Authors
Abhijit
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
abhijitkumarjha88192/cs_repl_ai_alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset of books called Real world functional programming : with examples in...
workwithdata.com
Updated Apr 17, 2025
Cite
Work With Data (2025). Dataset of books called Real world functional programming : with examples in F# and C# [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Real+world+functional+programming+%3A+with+examples+in+F%23+and+C%23
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Real world functional programming : with examples in F# and C#. It features 7 columns including author, publication date, language, and book publisher.
code-code-translation-java-csharp
huggingface.co
Updated Jul 11, 2017
Cite
Semeru Lab (2017). code-code-translation-java-csharp [Dataset]. https://huggingface.co/datasets/semeru/code-code-translation-java-csharp
Explore at:
Croissant
Dataset updated
Jul 11, 2017
Dataset authored and provided by
Semeru Lab
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset is imported from CodeXGLUE and pre-processed using their script.

Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru

CodeXGLUE -- Code2Code Translation Task Definition

Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.
Dataset of publication dates of book subjects that contain Design patterns...
workwithdata.com
Updated Nov 7, 2024
Cite
Work With Data (2024). Dataset of publication dates of book subjects that contain Design patterns in . NET : reusable approaches in C# and F# for object-oriented software design [Dataset]. https://www.workwithdata.com/datasets/book-subjects?col=book_subject%2Cj0-publication_date&f=1&fcol0=j0-book&fop0=%3D&fval0=Design+patterns+in+.+NET+%3A+reusable+approaches+in+C%23+and+F%23+for+object-oriented+software+design&j=1&j0=books
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 2 rows and is filtered where the book is Design patterns in . NET : reusable approaches in C# and F# for object-oriented software design. It features 2 columns including publication dates.
Preprocessed C# Source Codes for Machine Learning
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Pintér, Ádám; Szénási, Sándor (2020). Preprocessed C# Source Codes for Machine Learning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3264760
Dataset updated
Jan 24, 2020
Dataset provided by
Óbuda University
Authors
Pintér, Ádám; Szénási, Sándor
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset comes from the HackerRank site: 329,937 C# source codes for 22 tasks were collected, all verified by unit tests.

During the download process, each source code received a unique serial number in place of the name of the user who solved the task, and was stored inside the 'task_name/origin' folder. After collecting the data, a new database was created containing cleaned-up versions of the source codes (the 'task_name/cleaned' folders). Finally, a third set of data was extracted from this cleaned-up version, in which a delimiter was inserted before and after each elementary expression to support easy processing and analysis (the 'task_name/reduced' folders). Each 'task_name' folder also holds three csv files containing the equality-checking results. The compressed folder also contains a vector space (and related files) made from the reduced data set. These four files are directly in the main folder.
Towards a systematic approach to manual annotation of code smells - C#...
data.niaid.nih.gov
Updated May 5, 2022
Cite
Nikola Luburić; Simona Prokić; Katarina-Glorija Grujić; Jelena Slivka; Aleksandar Kovačević; Goran Sladić; Dragan Vidaković (2022). Towards a systematic approach to manual annotation of code smells - C# Dataset of Long Method and Large Class code smells [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6520055
Dataset updated
May 5, 2022
Dataset provided by
University of Novi Sad, Faculty of Technical Sciences
Authors
Nikola Luburić; Simona Prokić; Katarina-Glorija Grujić; Jelena Slivka; Aleksandar Kovačević; Goran Sladić; Dragan Vidaković
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset includes open-source projects written in the C# programming language, annotated for the presence of Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:

Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.

The dataset contains two excel datasheets:

DataSet_Large Class.xlsx – C# classes annotated for the Large Class code smell severity.

DataSet_Long Method.xlsx – C# methods annotated for the Long method code smell severity.

The columns in the datasheet represent:

Code Snippet ID – the full name of the code snippet.

For classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).

For methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type)).

Link – The GitHub link to the code snippet, including the commit and the start and end LOC.

Code Smell – code smell for which the code snippet is examined (Large Class or Long Method).

Project Link – the link to the version of the code repository that was annotated.

Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 25 class-level metrics for Large Class detection and 18 method-level metrics for Long Method detection. The list of metrics and their definitions is available here.

Final annotation – a single severity score calculated by a majority vote.

Annotators – each annotator's (1, 2, or 3) assigned severity score.
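
As a rough illustration of the Code Snippet ID convention described above, the following sketch splits an ID into its enclosing name, its own name, and (for methods) its parameter types. The `SnippetId` type and `parse_snippet_id` function are hypothetical helpers, not part of the dataset, and the simple comma split assumes parameter types contain no commas (generic types such as `Dictionary<string, int>` would need a real parser).

```python
from typing import NamedTuple, Optional, Tuple


class SnippetId(NamedTuple):
    container: str                          # namespace/outer classes, or the full class name for a method
    name: str                               # class name or method name
    param_types: Optional[Tuple[str, ...]]  # None for classes, a tuple for methods


def parse_snippet_id(snippet_id: str) -> SnippetId:
    """Split a Code Snippet ID of the form shown in the column description."""
    text = snippet_id.strip()
    params: Optional[Tuple[str, ...]] = None
    if text.endswith(")") and "(" in text:
        # Method ID: separate the signature's parameter list from the dotted name.
        head, _, raw_params = text[:-1].partition("(")
        params = tuple(p.strip() for p in raw_params.split(",") if p.strip())
        text = head
    container, _, name = text.rpartition(".")
    return SnippetId(container, name, params)


if __name__ == "__main__":
    print(parse_snippet_id("namespace.subnamespace.outerclass.innerclass"))
    print(parse_snippet_id("namespace.class.method(param1Type, param2Type)"))
```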

To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate excel datasheets:

LargeClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Large Class code smell.

LongMethod_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Long Method code smell.

The columns of these two datasheets are:

Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Large Class.xlsx and DataSet_Long Method.xlsx)

Annotators – heuristics labelled by each of the annotators (1, 2, or 3).

Heuristics – whether the heuristic is applicable to the examined code snippet or not (Section 1.2.4 lists heuristics relevant for the Large Class detection, and Section 1.2.5 lists the heuristics relevant for the Long Method detection).
C# Methods with Documentation Comments from GitHub
kaggle.com
zip
Updated Apr 7, 2018
Cite
Alexander Chebykin (2018). C# Methods with Documentation Comments from GitHub [Dataset]. https://www.kaggle.com/awesomelemon/csharp-commented-methods-github
Explore at:
Available download formats: zip (526906028 bytes)
Dataset updated
Apr 7, 2018
Authors
Alexander Chebykin
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Context

This dataset was gathered from open-source projects hosted on GitHub. We used it in our research to train a deep RNN Encoder-Decoder model for API recommendation. We followed the Deep API Learning paper [https://dl.acm.org/citation.cfm?id=2950334], which used a similar Java dataset that was not available. Now people interested in the approach can work with our dataset.

A paper describing dataset collection and model training is available here: https://ispras.ru/proceedings/docs/2018/30/3/isp_30_2018_3_63.pdf

Content

The database file contains 4 tables.

Repo contains the list of repositories processed, including URLs and numbers of forks, watchers and stars.

Solution contains the paths of the Solution files found within repositories and flags marking whether they could be compiled.

Method contains method names, full comments, summary sections of the documentation comments, and linearized sequences of API calls. The table also contains the detected languages of summaries and tokenized method names.

MethodParameter contains types and names of method parameters.
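
To make the relationship between the Method and MethodParameter tables concrete, here is a minimal sketch using an in-memory SQLite database. Everything beyond the two table names is an assumption: the description does not give the database engine, the column names (`id`, `method_id`, `type`, `name`, `summary`) are hypothetical, and the sample rows are invented purely for illustration.

```python
import sqlite3

# In-memory stand-in; the real file's engine and schema are not documented here.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Method (id INTEGER PRIMARY KEY, name TEXT, summary TEXT);
    CREATE TABLE MethodParameter (method_id INTEGER, type TEXT, name TEXT);
    """
)
# Invented sample rows, only to exercise the query below.
conn.execute("INSERT INTO Method VALUES (1, 'ReadAllLines', 'Reads every line of a file.')")
conn.executemany(
    "INSERT INTO MethodParameter VALUES (?, ?, ?)",
    [(1, "string", "path"), (1, "Encoding", "encoding")],
)


def signature(connection: sqlite3.Connection, method_id: int) -> str:
    """Rebuild a readable signature by joining a method with its parameters."""
    (method_name,) = connection.execute(
        "SELECT name FROM Method WHERE id = ?", (method_id,)
    ).fetchone()
    rows = connection.execute(
        "SELECT type, name FROM MethodParameter WHERE method_id = ? ORDER BY rowid",
        (method_id,),
    ).fetchall()
    params = ", ".join(f"{param_type} {param_name}" for param_type, param_name in rows)
    return f"{method_name}({params})"


if __name__ == "__main__":
    print(signature(conn, 1))
```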
Dataset - Towards the Systematic Testing of Virtual Reality Programs
data.mendeley.com
Updated Sep 16, 2021
Cite
Stevão Andrade (2021). Dataset - Towards the Systematic Testing of Virtual Reality Programs [Dataset]. http://doi.org/10.17632/4myfs585s9.2
Unique identifier
https://doi.org/10.17632/4myfs585s9.2
Dataset updated
Sep 16, 2021
Authors
Stevão Andrade
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.

It contains an implementation of an approach for predicting defect proneness on unlabeled datasets: Average Clustering and Labeling (ACL).

ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.

This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
cars
huggingface.co
Updated Sep 12, 2024
Cite
Евгений Маслов (2024). cars [Dataset]. https://huggingface.co/datasets/evgmaslov/cars
Explore at:
Croissant
Dataset updated
Sep 12, 2024
Authors
Евгений Маслов
Description
Cars

This dataset contains pairs of C# code and its text description. Every code sample is a C# code file that is part of a computational engineering system. One of this system's functions is to generate cars in the form of voxels. Each code file contains instructions for generating a car, so compiling the file with the whole system makes it possible to generate cars like those in the picture below.

Text column contains descriptions of cars that should be generated in form of natural… See the full description on the dataset page: https://huggingface.co/datasets/evgmaslov/cars.
Data from: Roblox Dataset
kaggle.com
zip
Updated Apr 3, 2025
Cite
Jansen C. Cruz (2025). Roblox Dataset [Dataset]. https://www.kaggle.com/datasets/jansenccruz/roblox-dataset/discussion
Explore at:
Available download formats: zip (1071799 bytes)
Dataset updated
Apr 3, 2025
Authors
Jansen C. Cruz
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Roblox Games Dataset 📊🎮

9,734 rows x 16 columns

You can use the implementation on GitHub to scrape a current version; see the commit history for other implementations. This dataset contains detailed information about 9,734 Roblox games, including their creators, popularity metrics, genres, and additional metadata. The data was web-scraped using C#, and the full implementation, along with its commit history, is on GitHub (link below).

📌 Dataset Features

Title – Name of the game

Creator – The developer or studio behind the game

AgeRecommendation – Roblox's suggested age rating

Active Players – Current number of active players

Favorites – Number of users who favorited the game

Visits – Total lifetime visits (in millions or billions)

Voice Chat – Whether voice chat is supported

Camera – Camera settings support

Created & Updated – Creation and last update date

Server Size – Size of the server

Genre – Game genre (if available)

Likes & Dislikes – Community feedback metrics

Game Link – Direct link to the game

DateFetched – When the data was collected

🔧 Data Collection Process

The data was collected through web scraping using C#. The implementation can be found on GitHub, where you can explore the commit history to see different iterations and improvements of the web scraper.

🔗 GitHub Repository

https://github.com/jansencruz23/roblox-webscraper-csharp

📌 Potential Uses

Analyzing game popularity trends

Identifying top-performing game genres

Examining the impact of voice chat on engagement

Comparing game update frequency with popularity

Predicting game popularity

If you have any questions or suggestions, feel free to leave a comment! 🚀
LCC_csharp
huggingface.co
Updated Dec 24, 2023
Cite
Microsoft (2023). LCC_csharp [Dataset]. https://huggingface.co/datasets/microsoft/LCC_csharp
Explore at:
Croissant
Dataset updated
Dec 24, 2023
Dataset authored and provided by
Microsoft (http://microsoft.com/)
Description
Dataset Card for "LCC_csharp"

More Information needed
Data from: QScored: A Large Dataset of Code Smells and Quality Metrics
zenodo.org
data.niaid.nih.gov
bin, txt
Updated Dec 31, 2022
Cite
Tushar Sharma (2022). QScored: A Large Dataset of Code Smells and Quality Metrics [Dataset]. http://doi.org/10.5281/zenodo.4468361
Explore at:
Available download formats: txt, bin
Unique identifier
https://doi.org/10.5281/zenodo.4468361
Dataset updated
Dec 31, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Tushar Sharma
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains code quality information for more than 86 thousand GitHub repositories containing more than 1.1 billion lines of code, mainly written in C# and Java. The code quality information covers 7 kinds of detected architecture smells, 19 kinds of design smells, and 11 kinds of implementation smells, as well as 27 commonly used code quality metrics computed at project, package, class, and method levels.
Most used technologies in the .NET C# tech stack worldwide 2024
statista.com
Updated Jul 18, 2025
Cite
Statista (2025). Most used technologies in the .NET C# tech stack worldwide 2024 [Dataset]. https://www.statista.com/statistics/1292362/popular-technologies-in-the-net-c-tech-stack/
Dataset updated
Jul 18, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
Jan 1, 2024 - Jun 30, 2024
Area covered
Worldwide
Description
A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the .NET C# tech stack in 2023 was database connectivity, chosen by over ** percent of respondents. It was followed by MVC and REST in the second and fourth place, respectively.
Musical Scale Classification Dataset using Chroma
kaggle.com
zip
Updated Apr 8, 2025
Cite
Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
Explore at:
Available download formats: zip (392580911 bytes)
Dataset updated
Apr 8, 2025
Authors
Om Avashia
License
https://cdla.io/sharing-1-0/
Description
Dataset Description

Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

What’s Inside

chroma_tensor: A JSON-safe string representation of a PyTorch tensor with shape [1, 12, T], where:

12 = the 12 pitch classes (C, C#, D, ... B)

T = time steps

scale_index: An integer label from 0–23 identifying the scale the sample belongs to

Use Cases

This dataset is ideal for:

Training deep learning models (CNNs, MLPs) to classify musical scales

Exploring pitch-class distributions in Western tonal music

Prototyping models for music key detection, chord prediction, or tonal analysis

Teaching or demonstrating chromagram-based ML workflows

Labels

Index Scale
0 C major
1 C# major
... ...
11 B major
12 C minor
... ...
23 B minor
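
The label table can be turned into a small lookup. This sketch assumes the elided rows follow chromatic order using the sharp spellings listed under "What's Inside" (C, C#, D, ... B); only indices 0, 1, 11, 12 and 23 are confirmed by the table, and `scale_name` is a hypothetical helper name.

```python
# Pitch classes in chromatic order; sharp spellings are an assumption for the elided rows.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def scale_name(scale_index: int) -> str:
    """Map a scale_index (0-23) to a name: 0-11 are major scales, 12-23 minor."""
    if not 0 <= scale_index <= 23:
        raise ValueError(f"scale_index must be in 0..23, got {scale_index}")
    mode = "major" if scale_index < 12 else "minor"
    return f"{PITCH_CLASSES[scale_index % 12]} {mode}"


if __name__ == "__main__":
    for index in (0, 1, 11, 12, 23):
        print(index, scale_name(index))
```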

Quick Load Example (PyTorch)

Chroma tensors are of shape [1, 12, T], where:

1 is the channel dimension (for CNN input)

12 represents the 12 pitch classes (C through B)

T is the number of time frames

    import ast

    import pandas as pd
    import torch
    from tqdm import tqdm

    df = pd.read_csv("/content/scale_dataset.csv")

    # Reconstruct chroma tensors. ast.literal_eval parses the stored list
    # literal without executing arbitrary code, unlike eval.
    X = [torch.tensor(ast.literal_eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
    y = df['scale_index'].tolist()

Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

    import torch

    data = torch.load("chroma_tensors.pt")
    X_pt = data['X']  # list of [1, 12, 302] tensors
    y_pt = data['y']  # list of scale indices

How It Was Built

Notes generated from random melodies using music21

MIDI converted to WAV via FluidSynth

Chromagrams extracted with librosa.feature.chroma_stft

Tensors flattened and saved alongside scale index labels

File Format

Column         Type  Description
chroma_tensor  str   Flattened 1D chroma tensor [1×12×T]
scale_index    int   Label from 0 to 23

Notes

Data is synthetic but musically valid and well-balanced

Each of the 24 scales appears 300 times

All tensors have fixed length (T) for easy batching
codebase-small
huggingface.co
Updated Sep 24, 2024
Cite
**** ********* (2024). codebase-small [Dataset]. https://huggingface.co/datasets/grebniets123/codebase-small
Explore at:
Croissant
Dataset updated
Sep 24, 2024
Authors
**** *********
License
https://choosealicense.com/licenses/bsd/
Description
Dataset Summary

This dataset was made from pieces of code gathered from across the internet. I used multiple hosting platforms to collect code, not only GitHub. The codebase was gathered to make it easy to bring pieces of code together and use them to train AI.

Languages

python, ruby, go, html, css, c#, c/c++, rust, php

Data Fields

repo_name: name of repository

path: path to file inside the repository

content: content of file

license: license of… See the full description on the dataset page: https://huggingface.co/datasets/grebniets123/codebase-small.
Multi-language Open Source Code Identifier Dataset
kaggle.com
zip
Updated Jul 8, 2025
Cite
Bharat Mane (2025). Multi-language Open Source Code Identifier Dataset [Dataset]. https://www.kaggle.com/datasets/bharatmane/multi-language-open-source-code-identifier-dataset/data
Explore at:
Available download formats: zip (3690401 bytes)
Dataset updated
Jul 8, 2025
Authors
Bharat Mane
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset was created to support research and tool development in the areas of code readability, identifier naming, program comprehension, and code mining. It contains 362,886 unique identifier names (including classes, functions/methods, and variables) extracted from 21 widely used and actively maintained open-source projects.

Projects were carefully selected from four major programming language ecosystems: Java, Python, C#, and JavaScript/TypeScript. The repositories span popular libraries and frameworks in domains such as data science, web development, backend systems, dependency injection, and more. These projects are widely recognized as benchmarks in their respective communities, ensuring that the dataset represents industry best practices in naming and code style.

Context & Motivation: Good identifier naming is fundamental for code readability and maintainability, yet cross-language empirical datasets are rare. This dataset enables comparative studies of naming conventions, training and benchmarking of AI models, and reproducible research on identifier readability. It is designed to be both a large-scale resource and a realistic reflection of naming in production-quality code.

Sources:

commons-lang, guava, hibernate-orm, logging-log4j2, spring-framework

django, flask, numpy, pandas, requests

Autofac, Dapper, Hangfire, IdentityServer, NLog

react, vue, d3, lodash, express, angular, angular-cli, ngx-bootstrap, TypeScript, NestJS

Each identifier is labelled with its project, language, type, and name. We encourage use for academic research, code intelligence, machine learning, and developer education.
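
One way such identifier names might feed a comparative study of naming conventions is a simple style classifier. The sketch below is an illustration only: the style categories, the regular expressions, and the `naming_style` helper are assumptions made here, not labels or tooling shipped with the dataset.

```python
import re

# Ordered (label, pattern) pairs; the first full match wins.
_STYLES = [
    ("snake_case", re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)+")),
    ("UPPER_SNAKE_CASE", re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)+")),
    ("PascalCase", re.compile(r"[A-Z][a-z0-9]+([A-Z][a-z0-9]*)*")),
    ("camelCase", re.compile(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)+")),
    ("lowercase", re.compile(r"[a-z][a-z0-9]*")),
]


def naming_style(identifier: str) -> str:
    """Return a coarse naming-style label, or "other" when nothing matches."""
    for label, pattern in _STYLES:
        if pattern.fullmatch(identifier):
            return label
    return "other"


if __name__ == "__main__":
    for name in ("read_csv", "MAX_SIZE", "DataFrame", "getValue", "lodash", "_private"):
        print(name, "->", naming_style(name))
```

Tallying these labels per language would give a first, rough view of how conventions differ across the four ecosystems.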
