angie-chen55/python-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
- Given a piece of code, generate a description of what it does.
- Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
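A minimal loading sketch with the Hugging Face datasets library (assuming the default configuration):
from datasets import load_dataset
ds = load_dataset("Nan-Do/instructional_code-search-net-java")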
This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2
QuIP (QUick Image Processing) is an interpreter for image processing, graphics, psychophysical experimentation and general scientific computing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
Description: This dataset outlines specific zones or regions designated under a Form-Based Code (FBC) framework. Unlike traditional zoning, form-based codes emphasize the physical form of buildings and public spaces over land use. These zones guide community design through parameters such as building height, setbacks, and architectural styles. The dataset provides a spatial reference for planning, zoning, and development decisions aligned with form-based design principles. The data was created by digitizing PDFs of approved Form-Based Code plans, accessible via links listed in the Ordinance Link column of the dataset.
Applications Featuring This Dataset: Form-Based Code Explorer
Data Glossary: See the Attributes section below for details about each column in this dataset.
Update Frequency: When FBC neighborhood regions change.
Contact: City Planning Commission – Zoning and Technology
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.
Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
- repository_mining/: Contains scripts for mining the initial set of repositories.
- repository_mining/doc/: Includes documentation with the necessary information for repository mining.
- dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
- multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
- multilabel_class/model_output/: Trained models organized first by dataset, then by model variant.
- data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.
To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in repository_mining/doc/.
After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training; each notebook documents every step.
Once the dataset is prepared, convert it into a Hugging Face dataset using:
python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
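Conceptually, this step wraps the processed CSV as a Hugging Face dataset saved on disk. A minimal sketch of the idea (illustrative only, not the actual contents of create_dataset.py):
import pandas as pd
from datasets import Dataset

csv_path = "data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv"
df = pd.read_csv(csv_path)  # processed dataset produced by the notebooks
ds = Dataset.from_pandas(df)  # wrap the dataframe as a Hugging Face dataset
ds.save_to_disk("data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset")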
After processing the dataset, train the DRAGON model with the following command:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
}
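For intuition, this flag mirrors the two standard ways of feeding two texts to a BERT tokenizer; a minimal sketch (the input strings here are illustrative, not the package's field names):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text1, text2 = "repository metadata", "README excerpt"  # illustrative inputs

# use_sentence_pairs=True: encode as a sentence pair,
# i.e. [CLS] text1 [SEP] text2 [SEP], with token_type_ids separating the segments
pair_encoding = tokenizer(text1, text2, truncation=True)

# use_sentence_pairs=False: concatenate the texts into a single segment
concat_encoding = tokenizer(text1 + " " + text2, truncation=True)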
To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train DRAGON on a benchmark dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Ensure the use_sentence_pairs parameter is set to True in config.py.
To train LEGION on the DRAGON dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Ensure the use_sentence_pairs parameter is set to False in config.py:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train LEGION on a baseline dataset, run:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Once thresholds are tuned, you can evaluate the model using:
python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
This evaluation script computes standard multi-label classification metrics.
Ensure that the model variant and dataset path correspond to the previously trained model.
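For reference, metrics of this kind are commonly computed with scikit-learn; a minimal sketch with dummy label matrices (illustrative only, not the script's actual output):
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Dummy binary indicator matrices: rows are samples, columns are labels.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("micro-precision:", precision_score(y_true, y_pred, average="micro"))
print("micro-recall:", recall_score(y_true, y_pred, average="micro"))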
For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:
DRAGON_replication/multilabel_class/notebooks/
These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.
Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.
Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in place; that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).
For example:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
should be extracted to:
DRAGON_replication\data\02_processed_dataset\2024-05-22\
.zip files to extract:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
DRAGON_replication\dataset_creation\data.zip
DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
DRAGON_replication\multilabel_class\model_output\LEGION.zip
Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
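If you prefer to script the extraction, a minimal sketch using Python's zipfile module (this assumes each archive's contents sit at its root; adjust the paths to your checkout):
import zipfile
from pathlib import Path

# The .zip files listed above; extract each next to itself,
# into a folder with the same name minus the .zip extension.
archives = [
    r"DRAGON_replication\data\02_processed_dataset\2024-05-22.zip",
    r"DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip",
    r"DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip",
    r"DRAGON_replication\dataset_creation\data.zip",
    r"DRAGON_replication\multilabel_class\model_output\2024-05-22.zip",
    r"DRAGON_replication\multilabel_class\model_output\LEGION.zip",
]
for zip_path in archives:
    p = Path(zip_path)
    target = p.with_suffix("")  # same directory, same name without .zip
    with zipfile.ZipFile(p) as zf:
        zf.extractall(target)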
This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_completion_token"
Dataset Summary
CodeXGLUE CodeCompletion-token dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-token. The task is to predict the next code token given the context of previous tokens; models are evaluated by token-level accuracy. Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token.
https://choosealicense.com/licenses/unknown/
Java Code Readability Merged Dataset
This dataset contains 421 Java code snippets along with a readability score, aggregated from several scientific papers [1, 2, 3]. You can download the dataset using Hugging Face:
from datasets import load_dataset
ds = load_dataset("se2p/code-readability-merged")
The snippets are not split into train, test, and validation sets; thus, the whole dataset is in the train split:
ds = ds['train']
ds_as_list = ds.to_list()  # Convert the dataset to… See the full description on the dataset page: https://huggingface.co/datasets/se2p/code-readability-merged.
Archived as of 6/26/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information about services provided to recipients enrolled in Medicaid. It contains the total number of recipients, total number of claims, and total dollar amount, by recipient zip code. Restricted to claims with a service date between 01/2012 and 12/2017, and to patients with a Medicaid claim during this period. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA.
This operations dashboard shows historic and current data related to this performance measure. The performance measure dashboard is available at 3.01 Property Code Enforcement.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
Dataset Card for "code-search-net-go"
Dataset Summary
This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions with comments found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English, and the functions are written in Go.
Data Splits
Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.
Housing code enforcement activities, including inspections and violations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research has focused on tool-based code reviews (e.g. the Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the nature of unstructured email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch with its revised versions and grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into the email-based code review practices of large OSS projects without additional effort in data collection and curation.
Files included are original data inputs on stream fishes (fish_data_OEPA_2012.csv), water chemistry (OEPA_WATER_2012.csv), geographic data (NHD_Plus_StreamCat); modeling files for generating predictions from the original data, including the R code (MVP_R_Final.txt) and Stan code (MV_Probit_Stan_Final.txt); and the model output file containing predictions for all NHDPlus catchments in the East Fork Little Miami River watershed (MVP_EFLMR_cooc_Final). This dataset is associated with the following publication: Martin, R., E. Waits, and C. Nietch. Empirically-based modeling and mapping to consider the co-occurrence of ecological receptors and stressors. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 613(614): 1228-1239, (2018).
https://www.datainsightsmarket.com/privacy-policy
The QR Code market is experiencing robust growth, projected to reach a market size of $10.5 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 16.67% from 2025 to 2033. This expansion is driven by several key factors. The increasing adoption of smartphones and mobile payment systems globally fuels the demand for QR codes in diverse applications, from marketing campaigns and contactless payments to information sharing and customer engagement initiatives. The shift towards digitalization across various industries, coupled with the convenience and cost-effectiveness of QR codes, contributes significantly to market growth. The dynamic nature of QR codes, allowing for updates and tracking of performance, adds to their appeal over static alternatives. Furthermore, the diversification of QR code formats, catering to different use cases like website links, menus, file downloads, and social media integration, expands the market's reach across various sectors. The market segmentation reveals a diverse landscape. Dynamic QR codes, offering greater flexibility and analytics capabilities, are gaining traction over their static counterparts. Among end-user applications, marketing and advertising dominate, leveraging QR codes for campaigns and promotions. However, significant growth is expected in payments and transactions, driven by the rising popularity of mobile wallets and contactless payment methods. Geographically, North America and Europe are anticipated to hold substantial market shares, but Asia-Pacific is poised for rapid expansion due to its burgeoning digital economy and large smartphone user base. Competition among key players, including Uniqode Phygital Inc, QR TIGER PTE LTD, and Flowcode, is intense, fostering innovation and driving down costs, further boosting market accessibility. While challenges like security concerns and potential misuse exist, technological advancements and increased awareness about secure QR code implementation are mitigating these risks. The overall outlook for the QR code market remains highly positive, indicating a sustained period of growth and innovation driven by the evolving digital landscape. Recent developments include: July 2024: Bandhan Bank launched its latest payment solution through the Bharat QR Code for its Current account and Savings account customers. The bank claimed that the solution will simplify how these self-employed segment customers make payments at any merchant outlet. An instant notification will also be received on every payment through a small speaker.June 2024: Flowcode, a marketing technology platform, unveiled a reimagined product designed for marketing and analytics teams at F1000 companies focused on measuring and maximizing offline conversions. Flowcode integrates seamlessly with data feeds, such as product catalogs, MLS listings, and more, to automate the creation of personalized, QR-enabled user journeys. This empowers brands to deliver unique, tailored consumer experiences, significantly increasing conversion rates.. Key drivers for this market are: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention. Potential restraints include: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention. Notable trends are: The Payments and Transactions Segment is Anticipated to Witness a Significant Growth.
The Stack contains over 3TB of permissively-licensed source code files covering 30 programming languages crawled from GitHub. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).
https://choosealicense.com/licenses/odc-by/
MaLA Corpus: Massive Language Adaptation Corpus
This MaLA code and reasoning dataset (V2) is used for training the EMMA-500 Llama 3(.1) Mono/Bi model series.
🤗MaLA-LM/emma-500-llama3-8b-mono: CPT model trained on monolingual data mix in 500+ languages
🤗MaLA-LM/emma-500-llama3-8b-bi: CPT model trained on monolingual data mix in 500+ languages + bilingual translation data in 2,500+ language pairs
🤗MaLA-LM/emma-500-llama3.1-8b-mono: CPT model trained on monolingual data mix in… See the full description on the dataset page: https://huggingface.co/datasets/MaLA-LM/mala-code-reasoning-v2.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
(raw data for use with the accompanying R script)
This reference table contains data elements for the 58 Counties in California that can be used to join to other data sets. This data includes the following fields: