Dataset Card for c-sharp-coding-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/dmeldrum6/c-sharp-coding-dataset.
This dataset was created by Abdelrahman Soliman
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set statistics: Number of events e#, of failure sub classes c# and of samples s#.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes open-source projects written in the C# programming language, annotated for the presence of the Data Class, Feature Envy and Refused Bequest code smells. Each code snippet was manually annotated by at least two annotators.
The dataset contains three Excel datasheets:
DataSet_Data_Class.xlsx - C# classes annotated for the Data Class code smell
DataSet_Feature_Envy.xlsx - C# methods annotated for the Feature Envy code smell
DataSet_Refused_Bequest.xlsx - C# classes annotated for the Refused Bequest code smell
The columns in the datasheet represent:
Code Snippet ID - the full name of the code snippet.
for classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).
for methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type))
Link - the GitHub link to the code snippet, including the commit and the start and end LOC.
Code Smell - code smell for which the code snippet is examined (Data Class, Feature Envy or Refused Bequest)
Project Link - the link to the version of the code repository that was annotated
Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 31 class-level metrics for Data Class and Refused Bequest detection and 19 method-level metrics for Feature Envy detection. The list of metrics and their definitions is available here.
Final annotation – a single severity score calculated by a majority vote.
Annotators – each annotator's (1, 2, or 3) assigned severity score.
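The majority vote behind the Final annotation column can be illustrated with a minimal sketch. The function name `majority_vote` and the handling of ties (returning `None` when no score is chosen by more than half of the annotators) are assumptions made for illustration; the dataset description does not say how disagreements without a majority were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(scores: Sequence[int]) -> Optional[int]:
    """Return the severity score chosen by more than half of the annotators.

    Returns None when there is no strict majority (an assumption of this
    sketch, not a documented rule of the dataset).
    """
    if not scores:
        return None
    score, count = Counter(scores).most_common(1)[0]
    return score if count > len(scores) / 2 else None


if __name__ == "__main__":
    print(majority_vote([1, 1, 2]))  # two of three annotators agree
    print(majority_vote([0, 1, 2]))  # no majority
```

With two or three annotators per snippet, a strict majority means at least two matching scores.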
To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in three separate excel datasheets:
DataClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Data Class code smell.
FeatureEnvy_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Feature Envy code smell.
RefusedBequest_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Refused Bequest code smell.
The columns of these three datasheets are:
Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Data_Class.xlsx, DataSet_Feature_Envy.xlsx and DataSet_Refused_Bequest.xlsx)
Annotators – heuristics labelled by each of the annotators (1, 2, or 3).
Heuristics – whether the heuristic is applicable to the examined code snippet or not
Annotators annotated the dataset based on the annotation procedure and guidelines available here.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
abhijitkumarjha88192/cs_repl_ai_alpaca dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Real world functional programming : with examples in F# and C#. It features 7 columns including author, publication date, language, and book publisher.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset is imported from CodeXGLUE and pre-processed using their script.
Where to find in Semeru:
The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/code-to-code-trans in Semeru
CodeXGLUE -- Code2Code Translation
Task Definition
Code translation aims to migrate legacy software from one programming language in a platform to another. In CodeXGLUE, given a piece of Java (C#) code, the task is to translate the code into C#… See the full description on the dataset page: https://huggingface.co/datasets/semeru/code-code-translation-java-csharp.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 2 rows and is filtered where the book is Design patterns in .NET : reusable approaches in C# and F# for object-oriented software design. It features 2 columns including publication dates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset comes from the HackerRank site: 329,937 C# source files for 22 tasks were collected, all verified by unit tests.
During the download process, each source file received a unique serial number in place of the name of the user who solved the task and was stored in the 'task_name/origin' folder. After the data was collected, a second set was created containing cleaned-up versions of the source files (the 'task_name/cleaned' folders). Finally, a third set was extracted from the cleaned-up version, with a delimiter inserted before and after each elementary expression to make processing and analysis easier (the 'task_name/reduced' folders). Each 'task_name' folder also contains three CSV files with the equality-checking results. The compressed folder also contains a vector space (and related files) built from the reduced data set. These four files are directly in the main folder.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes open-source projects written in the C# programming language, annotated for the presence of Long Method and God Class code smells. Each instance was manually annotated by at least two annotators. We explain our motivation and methodology for creating this dataset in our preprint:
Luburić, N., Prokić, S., Grujić, K.G., Slivka, J., Kovačević, A., Sladić, G. and Vidaković, D., 2021. Towards a systematic approach to manual annotation of code smells.
The dataset contains two Excel datasheets:
DataSet_Large Class.xlsx – C# classes annotated for the Large Class code smell severity.
DataSet_Long Method.xlsx – C# methods annotated for the Long Method code smell severity.
The columns in the datasheet represent:
Code Snippet ID – the full name of the code snippet.
For classes, this is the package/namespace name followed by the class name. The full name of inner classes also contains the names of any outer classes (e.g., namespace.subnamespace.outerclass.innerclass).
For methods, this is the full name of the class and the method's signature (e.g., namespace.class.method(param1Type, param2Type)).
Link – The GitHub link to the code snippet, including the commit and the start and end LOC.
Code Smell – code smell for which the code snippet is examined (Large Class or Long Method).
Project Link – the link to the version of the code repository that was annotated.
Metrics – a list of metrics for the code snippet, calculated by our platform. Our dataset provides 25 class-level metrics for Large Class detection and 18 method-level metrics for Long Method detection. The list of metrics and their definitions is available here.
Final annotation – a single severity score calculated by a majority vote.
Annotators – each annotator's (1, 2, or 3) assigned severity score.
To help guide their reasoning for evaluating the presence and the severity of a code smell, three annotators independently annotated whether the considered heuristics apply to an evaluated code snippet. We provide these results in two separate excel datasheets:
LargeClass_Heuristics.xlsx - C# classes annotated for the presence of heuristics relevant for the Large Class code smell.
LongMethod_Heuristics.xlsx - C# methods annotated for the presence of heuristics relevant for the Long Method code smell.
The columns of these two datasheets are:
Code Snippet ID - the full name of the code snippet (matching the IDs from DataSet_Large Class.xlsx and DataSet_Long Method.xlsx)
Annotators – heuristics labelled by each of the annotators (1, 2, or 3).
Heuristics – whether the heuristic is applicable to the examined code snippet or not (Section 1.2.4 lists heuristics relevant for the Large Class detection, and Section 1.2.5 lists the heuristics relevant for the Long Method detection).
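The Code Snippet ID format described above (dot-separated full names, with a parenthesised parameter-type list for methods) can be split into its parts with a short sketch. The names `SnippetId` and `parse_snippet_id` are hypothetical helpers, not part of the dataset, and the sketch assumes parameter types do not themselves contain commas (generic types such as `Dictionary<string, int>` would need a more careful parser).

```python
from typing import NamedTuple, Optional, Tuple


class SnippetId(NamedTuple):
    container: str  # namespace plus any outer classes (or the class, for methods)
    name: str  # class name or method name
    param_types: Optional[Tuple[str, ...]]  # None for classes


def parse_snippet_id(snippet_id: str) -> SnippetId:
    """Split a Code Snippet ID into container, name and parameter types."""
    snippet_id = snippet_id.strip()
    params: Optional[Tuple[str, ...]] = None
    head = snippet_id
    if snippet_id.endswith(")") and "(" in snippet_id:
        # Method ID: everything after the first "(" is the parameter-type list.
        head, _, raw = snippet_id[:-1].partition("(")
        params = tuple(p.strip() for p in raw.split(",") if p.strip())
    # The last dot-separated component is the class or method name.
    container, _, name = head.rpartition(".")
    return SnippetId(container, name, params)


if __name__ == "__main__":
    print(parse_snippet_id("namespace.class.method(param1Type, param2Type)"))
    print(parse_snippet_id("namespace.subnamespace.outerclass.innerclass"))
```

A `param_types` value of `None` marks a class ID, while an empty tuple marks a method without parameters.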
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was gathered from open-source projects hosted on GitHub. We used it in our research to train a deep RNN Encoder-Decoder model for API recommendation. We followed the steps of the Deep API Learning paper [https://dl.acm.org/citation.cfm?id=2950334], which used a similar Java dataset that was not available. Now people interested in the approach can work with our dataset.
Paper describing dataset collection and model training is available here: https://ispras.ru/proceedings/docs/2018/30/3/isp_30_2018_3_63.pdf
Database file contains 4 tables.
Repo contains list of repositories processed, including URLs, numbers of forks, watchers, stars.
Solution contains path of the Solution files found within repositories and flags marking whether they could be compiled.
Method contains method names, full comments, summary sections of the documentation comments, and linearized sequences of API calls. The table also contains detected languages of summaries and tokenized method names.
MethodParameter contains types and names of method parameters.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data related to the experiment conducted in the paper Towards the Systematic Testing of Virtual Reality Programs.
It contains an implementation of Average Clustering and Labeling (ACL), an approach for predicting defect proneness on unlabeled datasets.
ACL models achieve good prediction performance and are comparable to typical supervised learning models in terms of F-measure. ACL offers a viable choice for defect prediction on unlabeled datasets.
This dataset also contains analyses related to code smells in C# repositories. Please check the paper for further information.
Cars
This dataset contains pairs of C# code and its text description. Every code sample is a C# code file that is part of a computational engineering system. One of this system's functions is to generate cars in the form of voxels. Each code file contains instructions for generating a car, so compiling the file with the whole system makes it possible to generate cars as in the picture below.
Text column contains descriptions of cars that should be generated in form of natural… See the full description on the dataset page: https://huggingface.co/datasets/evgmaslov/cars.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
9,734 rows x 16 columns
This dataset contains detailed information about 9,734 Roblox games, including their creators, popularity metrics, genres, and additional metadata.
The data was collected through web scraping using C#. The implementation is available on GitHub (link below), where the commit history shows earlier iterations of the scraper. You can also use it to scrape a current version of the data.
https://github.com/jansencruz23/roblox-webscraper-csharp
If you have any questions or suggestions, feel free to leave a comment! 🚀
Dataset Card for "LCC_csharp"
More Information needed
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains code quality information for more than 86 thousand GitHub repositories containing more than 1.1 billion lines of code, mainly written in C# and Java. The code quality information covers 7 kinds of detected architecture smells, 19 kinds of design smells, and 11 kinds of implementation smells, along with 27 commonly used code quality metrics computed at project, package, class, and method levels.
A tech stack represents a combination of technologies a company uses in order to build and run an application or project. The most popular technology skill in the .NET C# tech stack in 2023 was database connectivity, chosen by over ** percent of respondents. It was followed by MVC and REST in the second and fourth place, respectively.
https://cdla.io/sharing-1-0/
Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale
This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.
- chroma_tensor: A JSON-safe string representation of a PyTorch tensor with shape [1, 12, T], where:
  - 12 = the 12 pitch classes (C, C#, D, ... B)
  - T = time steps
- scale_index: An integer label from 0–23 identifying the scale the sample belongs to

This dataset is ideal for:
- Training deep learning models (CNNs, MLPs) to classify musical scales
- Exploring pitch-class distributions in Western tonal music
- Prototyping models for music key detection, chord prediction, or tonal analysis
- Teaching or demonstrating chromagram-based ML workflows
| Index | Scale |
|---|---|
| 0 | C major |
| 1 | C# major |
| ... | ... |
| 11 | B major |
| 12 | C minor |
| ... | ... |
| 23 | B minor |
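The index-to-scale mapping in the table can be expressed as a small helper. This is a minimal sketch: the function name `scale_name` is hypothetical, and the spelling of the elided rows (sharps rather than flats, e.g. "D# major") is an assumption extrapolated from the rows that are shown.

```python
# Pitch classes in chromatic order; sharp spellings are an assumption
# based on the "C# major" row of the table.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def scale_name(scale_index: int) -> str:
    """Map a scale_index in 0..23 to a name such as 'C# major' or 'B minor'."""
    if not 0 <= scale_index <= 23:
        raise ValueError(f"scale_index must be in 0..23, got {scale_index}")
    tonic = PITCH_CLASSES[scale_index % 12]
    # Indices 0-11 are major scales, 12-23 are minor scales.
    mode = "major" if scale_index < 12 else "minor"
    return f"{tonic} {mode}"


if __name__ == "__main__":
    for i in (0, 1, 11, 12, 23):
        print(i, scale_name(i))
```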
Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames
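import ast

import pandas as pd
import torch
from tqdm import tqdm

df = pd.read_csv("/content/scale_dataset.csv")
# Reconstruct chroma tensors: parse each stored string into a flat list,
# then reshape it to [1, 12, T]. ast.literal_eval only parses literals,
# so it does not execute arbitrary code the way eval would.
X = [torch.tensor(ast.literal_eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
y = df['scale_index'].tolist()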
Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.
import torch

# Load the pre-built tensors and labels from the .pt file
data = torch.load("chroma_tensors.pt")
X_pt = data['X'] # list of [1, 12, 302] tensors
y_pt = data['y'] # list of scale indices
music21, FluidSynth, librosa.feature.chroma_stft

| Column | Type | Description |
|---|---|---|
| chroma_tensor | str | Flattened 1D chroma tensor [1×12×T] |
| scale_index | int | Label from 0 to 23 |
T) for easy batching
https://choosealicense.com/licenses/bsd/
Dataset Summary
This dataset was made from pieces of code from across the internet. Code was collected from multiple hosting platforms, not only GitHub. The codebase was gathered to make it easy to bring pieces of code together and use them to train AI.
Languages
python ruby go html css c# c/c++ rust php
Data Fields
- repo_name: name of repository
- path: path to file inside the repository
- content: content of file
- license: license of…

See the full description on the dataset page: https://huggingface.co/datasets/grebniets123/codebase-small.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created to support research and tool development in the areas of code readability, identifier naming, program comprehension, and code mining. It contains 362,886 unique identifier names (including classes, functions/methods, and variables) extracted from 21 widely used and actively maintained open-source projects.
Projects were carefully selected from four major programming language ecosystems: Java, Python, C#, and JavaScript/TypeScript. The repositories span popular libraries and frameworks in domains such as data science, web development, backend systems, dependency injection, and more. These projects are widely recognized as benchmarks in their respective communities, ensuring that the dataset represents industry best practices in naming and code style.
Context & Motivation: Good identifier naming is fundamental for code readability and maintainability, yet cross-language empirical datasets are rare. This dataset enables comparative studies of naming conventions, training and benchmarking of AI models, and reproducible research on identifier readability. It is designed to be both a large-scale resource and a realistic reflection of naming in production-quality code.
Sources:
- commons-lang, guava, hibernate-orm, logging-log4j2, spring-framework
- django, flask, numpy, pandas, requests
- Autofac, Dapper, Hangfire, IdentityServer, NLog
- react, vue, d3, lodash, express, angular, angular-cli, ngx-bootstrap, TypeScript, NestJS
Each identifier is labelled with its project, language, type, and name. We encourage use for academic research, code intelligence, machine learning, and developer education.
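As one illustration of the comparative naming-convention studies mentioned above, the sketch below assigns a coarse style label to an identifier name. The function `naming_style`, the style labels, and the regular expressions are all assumptions made for this example; they are not part of the dataset, and names that fit none of the patterns (single lowercase words, acronym-heavy names such as `HTTPClient`) fall through to "other".

```python
import re

# Ordered list of (label, pattern); patterns are checked with fullmatch.
_STYLES = [
    ("UPPER_SNAKE_CASE", re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+")),
    ("snake_case", re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)+")),
    ("PascalCase", re.compile(r"(?:[A-Z][a-z0-9]+)+")),
    ("camelCase", re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)+")),
]


def naming_style(name: str) -> str:
    """Return a coarse naming-style label for an identifier name."""
    for label, pattern in _STYLES:
        if pattern.fullmatch(name):
            return label
    return "other"


if __name__ == "__main__":
    for name in ("max_value", "MAX_VALUE", "HttpClient", "parseJson", "x"):
        print(name, naming_style(name))
```

Grouping such labels by the dataset's language and type columns would give a rough per-ecosystem view of naming conventions.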