angie-chen55/python-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
- Given a piece of code, generate a description of what it does.
- Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
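A minimal loading sketch with the Hugging Face datasets library (assuming the default configuration):
from datasets import load_dataset
ds = load_dataset("Nan-Do/instructional_code-search-net-java")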
This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2
QuIP (QUick Image Processing) is an interpreter for image processing, graphics, psychophysical experimentation and general scientific computing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
Description: This dataset outlines specific zones or regions designated under a Form-Based Code (FBC) framework. Unlike traditional zoning, form-based codes emphasize the physical form of buildings and public spaces over land use. These zones guide community design through parameters such as building height, setbacks, and architectural styles. The dataset provides a spatial reference for planning, zoning, and development decisions aligned with form-based design principles. The data was created by digitizing PDFs of approved Form-Based Code plans, accessible via links listed in the Ordinance Link column of the dataset.
Applications Featuring This Dataset: Form-Based Code Explorer
Data Glossary: See the Attributes section below for details about each column in this dataset.
Update Frequency: When FBC neighborhood regions change.
Contact: City Planning Commission – Zoning and Technology
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.
Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
- repository_mining/: Contains scripts for mining the initial set of repositories.
- repository_mining/doc/: Includes documentation with the necessary information for repository mining.
- dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
- multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
- multilabel_class/model_output/: Trained models organized first by dataset, then by model variant.
- data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.
To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in repository_mining/doc/.
After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training; each notebook documents every step.
Once the dataset is prepared, convert it into a Hugging Face dataset using:
python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
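Conceptually, this step wraps the processed CSV as a Hugging Face dataset saved on disk. A minimal sketch of the idea (illustrative only, not the actual contents of create_dataset.py):
import pandas as pd
from datasets import Dataset

csv_path = "data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv"
df = pd.read_csv(csv_path)  # processed dataset produced by the notebooks
ds = Dataset.from_pandas(df)  # wrap the dataframe as a Hugging Face dataset
ds.save_to_disk("data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset")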
After processing the dataset, train the DRAGON model with the following command:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
}
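For intuition, this flag mirrors the two standard ways of feeding two texts to a BERT tokenizer; a minimal sketch (the input strings here are illustrative, not the package's field names):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text1, text2 = "repository metadata", "README excerpt"  # illustrative inputs

# use_sentence_pairs=True: encode as a sentence pair,
# i.e. [CLS] text1 [SEP] text2 [SEP], with token_type_ids separating the segments
pair_encoding = tokenizer(text1, text2, truncation=True)

# use_sentence_pairs=False: concatenate the texts into a single segment
concat_encoding = tokenizer(text1 + " " + text2, truncation=True)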
To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train DRAGON on a benchmark dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Ensure the use_sentence_pairs parameter is set to True in config.py.
To train LEGION on the DRAGON dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Ensure the use_sentence_pairs parameter is set to False in config.py:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train LEGION on a baseline dataset, run:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Once thresholds are tuned, you can evaluate the model using:
python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
This evaluation script computes standard multi-label classification metrics.
Ensure that the model variant and dataset path correspond to the previously trained model.
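For reference, metrics of this kind are commonly computed with scikit-learn; a minimal sketch with dummy label matrices (illustrative only, not the script's actual output):
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Dummy binary indicator matrices: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("micro-precision:", precision_score(y_true, y_pred, average="micro"))
print("micro-recall:", recall_score(y_true, y_pred, average="micro"))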
For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:
DRAGON_replication/multilabel_class/notebooks/
These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.
Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.
Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in place; that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).
For example:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
should be extracted to:
DRAGON_replication\data\02_processed_dataset\2024-05-22\
.zip files to extract:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
DRAGON_replication\dataset_creation\data.zip
DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
DRAGON_replication\multilabel_class\model_output\LEGION.zip
Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
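If you prefer to script the extraction, a minimal sketch using Python's zipfile module (this assumes each archive's contents sit at its root; adjust the paths to your checkout):
import zipfile
from pathlib import Path

# The .zip files listed above; extract each next to itself,
# into a folder with the same name minus the .zip extension.
archives = [
    r"DRAGON_replication\data\02_processed_dataset\2024-05-22.zip",
    r"DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip",
    r"DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip",
    r"DRAGON_replication\dataset_creation\data.zip",
    r"DRAGON_replication\multilabel_class\model_output\2024-05-22.zip",
    r"DRAGON_replication\multilabel_class\model_output\LEGION.zip",
]
for zip_path in archives:
    p = Path(zip_path)
    target = p.with_suffix("")  # same directory, same name without .zip
    with zipfile.ZipFile(p) as zf:
        zf.extractall(target)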
This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_completion_token"
Dataset Summary
CodeXGLUE CodeCompletion-token dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-token. The task is to predict the next code token given the context of previous tokens; models are evaluated by token-level accuracy. Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token.
https://choosealicense.com/licenses/unknown/
Java Code Readability Merged Dataset
This dataset contains 421 Java code snippets along with a readability score, aggregated from several scientific papers [1, 2, 3]. You can download the dataset using Hugging Face:
from datasets import load_dataset
ds = load_dataset("se2p/code-readability-merged")
The snippets are not split into train, test, and validation sets; thus, the whole dataset is in the train split:
ds = ds['train']
ds_as_list = ds.to_list()  # Convert the dataset to… See the full description on the dataset page: https://huggingface.co/datasets/se2p/code-readability-merged.
Archived as of 6/26/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information about services provided to recipients enrolled in Medicaid. It contains the total number of recipients, total number of claims, and total dollar amount, by recipient zip code. Restricted to claims with a service date between 01/2012 and 12/2017, and to patients with a Medicaid claim during this period. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA.
This operations dashboard shows historic and current data related to this performance measure. The performance measure dashboard is available at 3.01 Property Code Enforcement.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for "code-search-net-go"
Dataset Summary
This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions with comments found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English, and the functions are written in Go.
Data Splits
Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.
Housing code enforcement activities, including inspections and violations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the nature of unstructured email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch with its revised versions and grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practices of large OSS projects without additional effort in data collection and curation.
Files included are original data inputs on stream fishes (fish_data_OEPA_2012.csv), water chemistry (OEPA_WATER_2012.csv), geographic data (NHD_Plus_StreamCat); modeling files for generating predictions from the original data, including the R code (MVP_R_Final.txt) and Stan code (MV_Probit_Stan_Final.txt); and the model output file containing predictions for all NHDPlus catchments in the East Fork Little Miami River watershed (MVP_EFLMR_cooc_Final). This dataset is associated with the following publication: Martin, R., E. Waits, and C. Nietch. Empirically-based modeling and mapping to consider the co-occurrence of ecological receptors and stressors. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 613(614): 1228-1239, (2018).
https://www.datainsightsmarket.com/privacy-policy
The QR Code market is experiencing robust growth, projected to reach a market size of $10.5 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 16.67% from 2025 to 2033. This expansion is driven by several key factors. The increasing adoption of smartphones and mobile payment systems globally fuels the demand for QR codes in diverse applications, from marketing campaigns and contactless payments to information sharing and customer engagement initiatives. The shift towards digitalization across various industries, coupled with the convenience and cost-effectiveness of QR codes, contributes significantly to market growth. The dynamic nature of QR codes, allowing for updates and tracking of performance, adds to their appeal over static alternatives. Furthermore, the diversification of QR code formats, catering to different use cases like website links, menus, file downloads, and social media integration, expands the market's reach across various sectors. The market segmentation reveals a diverse landscape. Dynamic QR codes, offering greater flexibility and analytics capabilities, are gaining traction over their static counterparts. Among end-user applications, marketing and advertising dominate, leveraging QR codes for campaigns and promotions. However, significant growth is expected in payments and transactions, driven by the rising popularity of mobile wallets and contactless payment methods. Geographically, North America and Europe are anticipated to hold substantial market shares, but Asia-Pacific is poised for rapid expansion due to its burgeoning digital economy and large smartphone user base. Competition among key players, including Uniqode Phygital Inc, QR TIGER PTE LTD, and Flowcode, is intense, fostering innovation and driving down costs, further boosting market accessibility. While challenges like security concerns and potential misuse exist, technological advancements and increased awareness about secure QR code implementation are mitigating these risks. The overall outlook for the QR code market remains highly positive, indicating a sustained period of growth and innovation driven by the evolving digital landscape. Recent developments include: July 2024: Bandhan Bank launched its latest payment solution through the Bharat QR Code for its Current account and Savings account customers. The bank claimed that the solution will simplify how these self-employed segment customers make payments at any merchant outlet. An instant notification will also be received on every payment through a small speaker.June 2024: Flowcode, a marketing technology platform, unveiled a reimagined product designed for marketing and analytics teams at F1000 companies focused on measuring and maximizing offline conversions. Flowcode integrates seamlessly with data feeds, such as product catalogs, MLS listings, and more, to automate the creation of personalized, QR-enabled user journeys. This empowers brands to deliver unique, tailored consumer experiences, significantly increasing conversion rates.. Key drivers for this market are: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention. Potential restraints include: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention. Notable trends are: The Payments and Transactions Segment is Anticipated to Witness a Significant Growth.
The Stack contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).
https://choosealicense.com/licenses/odc-by/
MaLA Corpus: Massive Language Adaptation Corpus
This MaLA code and reasoning dataset (V2) is used for training the EMMA-500 Llama 3(.1) Mono/Bi model series.
🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on monolingual data mix in 500+ languages
🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs
🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on monolingual data mix in… See the full description on the dataset page: https://huggingface.co/datasets/MaLA-LM/mala-code-reasoning-v2.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
(raw data for use with the accompanying R script)
This reference table contains data elements for the 58 Counties in California that can be used to join to other data sets. This data includes the following fields: