angie-chen55/python-github-code dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2 Complete download (zip, 3.9 KiB)
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
Given a piece of code, generate a description of what it does. Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May of 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
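For quick inspection, the dataset can be loaded with the Hugging Face datasets library. A minimal sketch (the split name "train" is assumed here, since the card notes there are no separate splits):

from datasets import load_dataset

# Load the instructional Java dataset from the Hugging Face Hub
ds = load_dataset("Nan-Do/instructional_code-search-net-java")
print(ds)                 # available splits and columns
example = ds["train"][0]  # assumes the default split is named "train"
print(example)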
QuIP (QUick Image Processing) is an interpreter for image processing, graphics, psychophysical experimentation and general scientific computing.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description: This dataset outlines specific zones or regions designated under a Form-Based Code (FBC) framework. Unlike traditional zoning, form-based codes emphasize the physical form of buildings and public spaces over land use. These zones guide community design through parameters such as building height, setbacks, and architectural styles. The dataset provides a spatial reference for planning, zoning, and development decisions aligned with form-based design principles. The data was created by digitizing PDFs of approved Form-Based Code plans, accessible via links listed in the Ordinance Link column of the dataset.
Applications Featuring This Dataset: Form-Based Code Explorer
Data Glossary: See the Attributes section below for details about each column in this dataset.
Update Frequency: When FBC neighborhood regions change.
Contact: City Planning Commission – Zoning and Technology
Subscribers can access export and import data for 80 countries using HS codes or product names, ideal for informed market analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thanks for your interest in our work!
To facilitate your assessment and replication, we provide the dataset and source code (Verilog, Python models, and MATLAB) of our work (OIPUF) here.
By the way, our latest work (SOI PUF and cSOI PUF), published in IEEE TIFS (2024), is based on OIPUF.
If you have any questions, please feel free to contact us: chongyaoxu@126.com / mklaw@um.edu.mo
Full text about OIPUF can be downloaded from https://ieeexplore.ieee.org/document/10103139
Full text about SOI PUF and cSOI PUF can be downloaded from https://ieeexplore.ieee.org/document/10458688
The source code and FPGA project of SOI PUF and cSOI PUF can be downloaded from https://github.com/yg99992/SOI_PUF.
MATLAB code
matlab/Generate_OI_block.m
This is a MATLAB script used to generate the Verilog code of a random OI block.
matlab/OIPUF_64x4_placement.m
This is a MATLAB function used to generate the XDC file that constrains the placement of a (64,4)-OI block.
matlab/OIPUF_64x8_placement.m
This is a MATLAB function used to generate the XDC file that constrains the placement of a (64,8)-OI block.
matlab/OIPUF_placement_example.m
An example script demonstrating the usage of OIPUF_64x4_placement.m and OIPUF_64x8_placement.m.
Python code
python/puf_models.py
The Python models of XOR PUFs and OIPUFs, which can be used to generate CRPs.
For example:
from puf_models import oi_puf
# generate a (64,4)-OIPUF and further use the generated OIPUF to generate 1M CRPs
crps, puf_instance = oi_puf.gen_CRPs_PUF(64, 4, 1_000_000)
python/attack_pypuf.py
A script used to conduct ANN attacks on XOR PUFs and OIPUFs (the 'pypuf' package must be installed).
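Note that attack_pypuf.py depends on the 'pypuf' package. As a rough, library-agnostic illustration of what such a modeling attack looks like (not the repository's script; the challenge/response arrays below are random placeholders standing in for real CRPs), a minimal sketch with scikit-learn:

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Placeholder CRP data: 64-bit challenges and 1-bit responses.
# In practice these would come from puf_models.oi_puf or the FPGA CSV files below.
rng = np.random.default_rng(0)
challenges = rng.integers(0, 2, size=(100_000, 64))
responses = rng.integers(0, 2, size=100_000)

X_train, X_test, y_train, y_test = train_test_split(challenges, responses, test_size=0.1, random_state=0)

# A small multilayer perceptron acting as the attack model
attack_model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=50)
attack_model.fit(X_train, y_train)
print("Prediction accuracy:", attack_model.score(X_test, y_test))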
Verilog code
verilog/OIPUF_64_4/
All the verilog files of (64, 4)-OIPUF
verilog/OIPUF_64_8/
All the verilog files of (64, 8)-OIPUF
CRP datasets extracted from FPGA
It consists of 13 CRP files (all CRPs were extracted from the FPGA):
FPGA_CRPs/FPGA3_CHAL_100M.csv
The 100 million 64-bit challenges
FPGA_CRPs/FPGA3_k4_PUF0.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF0
FPGA_CRPs/FPGA3_k4_PUF1.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF1
FPGA_CRPs/FPGA3_k4_PUF2.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF2
FPGA_CRPs/FPGA3_k4_PUF3.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF3
FPGA_CRPs/FPGA3_k4_PUF4.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF4
FPGA_CRPs/FPGA3_k4_PUF5.csv
The 100 million 1-bit responses extracted from (64,4)-OIPUF5
FPGA_CRPs/FPGA3_k8_PUF0.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF0
FPGA_CRPs/FPGA3_k8_PUF1.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF1
FPGA_CRPs/FPGA3_k8_PUF2.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF2
FPGA_CRPs/FPGA3_k8_PUF3.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF3
FPGA_CRPs/FPGA3_k8_PUF4.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF4
FPGA_CRPs/FPGA3_k8_PUF5.csv
The 100 million 1-bit responses extracted from (64,8)-OIPUF5
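The exact CSV layout is not documented above. As an illustrative sketch only (assuming one challenge per row in the challenge file and one matching 1-bit response per row in each response file, with no header), challenges and responses could be paired along these lines:

import pandas as pd

# Assumed layout: one value per row, no header; row i of the response file matches row i of the challenge file.
challenges = pd.read_csv("FPGA_CRPs/FPGA3_CHAL_100M.csv", header=None, nrows=1_000_000)
responses = pd.read_csv("FPGA_CRPs/FPGA3_k4_PUF0.csv", header=None, nrows=1_000_000)

crps = pd.concat([challenges, responses], axis=1)
print(crps.shape)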
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.
Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
repository_mining/ : Contains scripts for mining the initial set of repositories.
repository_mining/doc/ : Includes documentation with the necessary information for repository mining.
dataset_creation/ : Contains all the notebooks to be run sequentially to prepare the dataset.
multilabel_class/ : Contains scripts for classification, threshold tuning, and evaluation.
multilabel_class/model_output/ : Trained models organized by dataset first, then model variant.
data/ : Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.
To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in repository_mining/doc/.
After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training. Each notebook contains the documentation needed, explaining every step.
Once the dataset is prepared, convert it into a Hugging Face dataset using:
python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
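To sanity-check the conversion, the resulting dataset can be inspected with the datasets library. A minimal sketch, assuming the script writes a standard Hugging Face dataset to the dataset/ path used by the training commands below:

from datasets import load_from_disk

# Path mirrors the --dataset_path argument used for training (assumed output location)
ds = load_from_disk("data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset")
print(ds)  # splits, number of rows, and column names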
After processing the dataset, train the DRAGON model with the following command:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
}
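As an illustration of what this flag controls (not the repository's exact preprocessing code), a BERT tokenizer handles the two settings roughly as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text1, text2 = "repository readme text", "repository metadata text"

# use_sentence_pairs = True: encode the two texts as a (text1, text2) pair
pair_encoding = tokenizer(text1, text2, truncation=True)

# use_sentence_pairs = False: concatenate the texts into a single sequence
concat_encoding = tokenizer(text1 + " " + text2, truncation=True)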
To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train DRAGON on a benchmark dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Ensure the use_sentence_pairs parameter is set to True in config.py.
To train LEGION on the DRAGON dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Ensure the use_sentence_pairs parameter is set to False in config.py:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train LEGION on a baseline dataset, run:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Once thresholds are tuned, you can evaluate the model using:
python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
This evaluation script computes standard multi-label classification metrics including:
Ensure that the model variant and dataset path correspond to the previously trained model.
For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:
DRAGON_replication/multilabel_class/notebooks/
These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.
Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.
Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in place, that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).
For example:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
should be extracted to:
DRAGON_replication\data\02_processed_dataset\2024-05-22\
.zip files to extract:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
DRAGON_replication\dataset_creation\data.zip
DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
DRAGON_replication\multilabel_class\model_output\LEGION.zip
Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
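As a convenience, here is a minimal Python sketch for extracting all of the listed archives in place (run from the directory containing DRAGON_replication/, and assuming each archive's contents should land in a folder of the same name, as described above):

import zipfile
from pathlib import Path

# Archives listed above, relative to the package root
archives = [
    "DRAGON_replication/data/02_processed_dataset/2024-05-22.zip",
    "DRAGON_replication/data/03_huggingaceV_datasets/2024-05-22.zip",
    "DRAGON_replication/data/03_huggingaceV_datasets/LEGION.zip",
    "DRAGON_replication/dataset_creation/data.zip",
    "DRAGON_replication/multilabel_class/model_output/2024-05-22.zip",
    "DRAGON_replication/multilabel_class/model_output/LEGION.zip",
]

for archive in archives:
    path = Path(archive)
    target = path.with_suffix("")  # same directory, same name without .zip
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(path) as zf:
        zf.extractall(target)  # assumes the zip does not already wrap a top-level folder of this name
    print(f"Extracted {path} -> {target}")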
This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.
https://choosealicense.com/licenses/unknown/
Java Code Readability Merged Dataset
This dataset contains 421 Java code snippets along with a readability score, aggregated from several scientific papers [1, 2, 3]. You can download the dataset using Hugging Face:
from datasets import load_dataset
ds = load_dataset("se2p/code-readability-merged")
The snippets are not split into train, test, and validation sets. Thus, the whole dataset is in the train set:
ds = ds['train']
ds_as_list = ds.to_list() # Convert the dataset to…
See the full description on the dataset page: https://huggingface.co/datasets/se2p/code-readability-merged.
https://choosealicense.com/licenses/c-uda/
Dataset Card for "code_x_glue_cc_code_completion_token"
Dataset Summary
CodeXGLUE CodeCompletion-token dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/CodeCompletion-token. The task is to predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy. Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software… See the full description on the dataset page: https://huggingface.co/datasets/google/code_x_glue_cc_code_completion_token.
Archived as of 6/26/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information on services for recipients enrolled in Medicaid. It contains information about the total number of recipients, total number of claims, and total dollar amount, by recipient zip code. Restricted to claims with a service date between 01/2012 and 12/2017. Restricted to patients with a Medicaid claim during this period. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-go"
Dataset Summary
This dataset is the Go portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset includes open-source functions that include comments found on GitHub. The summary is a short description of what the function does.
Languages
The dataset's comments are in English and the functions are coded in Go.
Data Splits
Train, test, validation labels are included in the dataset as… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-go.
This operations dashboard shows historic and current data related to this performance measure. The performance measure dashboard is available at 3.01 Property Code Enforcement. Data Dictionary
Housing code enforcement activities, including inspections and violations.
The content of the NODC Taxonomic Code, Version 8 CD-ROM (CD-ROM NODC-68) distributed by NODC is archived in this accession. Version 7 of the NODC Taxonomic Code (CD-ROM NODC-35), which does not include Integrated Taxonomic Information System (ITIS) Taxonomic Serial Numbers (TSNs), is also archived in this NODC accession. Prior to 1996, the NODC Taxonomic Code was the largest, most flexible, and most widely used of the various coding schemes which adapted the Linnean system of biological nomenclature to modern methods of data storage and retrieval. It was based on a system of code numbers that reflected taxonomic relationships. Hundreds of historic data collections archived at NODC use the NODC Taxonomic Code to encode species identification. With the development and release of ITIS in 1996, NODC published the final version (Version 8) of the NODC Taxonomic Code on CD-ROM. This CD-ROM provides NODC taxonomic codes along with the equivalent ITIS Taxonomic Serial Numbers to facilitate the transition to the new Integrated Taxonomic Information System (ITIS, http://www.itis.gov/). With the publication of NODC Taxonomic Code Version 8, the NODC code was frozen and discontinued. ITIS assumed responsibility for assigning new TSN codes and for verifying accepted scientific names and synonyms. More information about the Integrated Taxonomic Information System is available at http://www.itis.gov.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code review is an important practice that improves the overall quality of a proposed patch (i.e. code changes). While much research focused on tool-based code reviews (e.g. a Gerrit code review tool, GitHub), many traditional open-source software (OSS) projects still conduct code reviews through emails. However, due to the nature of unstructured email-based data, it can be challenging to mine email-based code reviews, hindering researchers from delving into the code review practice of such long-standing OSS projects. Therefore, this paper presents large-scale datasets of email-based code reviews of 167 projects across three OSS communities (i.e. Linux Kernel, OzLabs, and FFmpeg). We mined the data from Patchwork, a web-based patch-tracking system for email-based code review, and curated the data by grouping a submitted patch and its revised versions and grouping email aliases. Our datasets include a total of 4.2M patches with 2.1M patch groups and 169K email addresses belonging to 141K individuals. Our published artefacts include the datasets as well as a tool suite to crawl, curate, and store Patchwork data. With our datasets, future work can directly delve into an email-based code review practice of large OSS projects without additional effort in data collection and curation.
Files included are original data inputs on stream fishes (fish_data_OEPA_2012.csv), water chemistry (OEPA_WATER_2012.csv), geographic data (NHD_Plus_StreamCat); modeling files for generating predictions from the original data, including the R code (MVP_R_Final.txt) and Stan code (MV_Probit_Stan_Final.txt); and the model output file containing predictions for all NHDPlus catchments in the East Fork Little Miami River watershed (MVP_EFLMR_cooc_Final). This dataset is associated with the following publication: Martin, R., E. Waits, and C. Nietch. Empirically-based modeling and mapping to consider the co-occurrence of ecological receptors and stressors. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 613(614): 1228-1239, (2018).
https://www.datainsightsmarket.com/privacy-policyhttps://www.datainsightsmarket.com/privacy-policy
The QR Code market is experiencing robust growth, projected to reach a market size of $10.5 billion in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 16.67% from 2025 to 2033. This expansion is driven by several key factors. The increasing adoption of smartphones and mobile payment systems globally fuels the demand for QR codes in diverse applications, from marketing campaigns and contactless payments to information sharing and customer engagement initiatives. The shift towards digitalization across various industries, coupled with the convenience and cost-effectiveness of QR codes, contributes significantly to market growth. The dynamic nature of QR codes, allowing for updates and tracking of performance, adds to their appeal over static alternatives. Furthermore, the diversification of QR code formats, catering to different use cases like website links, menus, file downloads, and social media integration, expands the market's reach across various sectors.
The market segmentation reveals a diverse landscape. Dynamic QR codes, offering greater flexibility and analytics capabilities, are gaining traction over their static counterparts. Among end-user applications, marketing and advertising dominate, leveraging QR codes for campaigns and promotions. However, significant growth is expected in payments and transactions, driven by the rising popularity of mobile wallets and contactless payment methods. Geographically, North America and Europe are anticipated to hold substantial market shares, but Asia-Pacific is poised for rapid expansion due to its burgeoning digital economy and large smartphone user base. Competition among key players, including Uniqode Phygital Inc, QR TIGER PTE LTD, and Flowcode, is intense, fostering innovation and driving down costs, further boosting market accessibility. While challenges like security concerns and potential misuse exist, technological advancements and increased awareness about secure QR code implementation are mitigating these risks. The overall outlook for the QR code market remains highly positive, indicating a sustained period of growth and innovation driven by the evolving digital landscape.
Recent developments include:
July 2024: Bandhan Bank launched its latest payment solution through the Bharat QR Code for its Current account and Savings account customers. The bank claimed that the solution will simplify how these self-employed segment customers make payments at any merchant outlet. An instant notification will also be received on every payment through a small speaker.
June 2024: Flowcode, a marketing technology platform, unveiled a reimagined product designed for marketing and analytics teams at F1000 companies focused on measuring and maximizing offline conversions. Flowcode integrates seamlessly with data feeds, such as product catalogs, MLS listings, and more, to automate the creation of personalized, QR-enabled user journeys. This empowers brands to deliver unique, tailored consumer experiences, significantly increasing conversion rates.
Key drivers for this market are: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention.
Potential restraints include: Increased Smartphone Penetration, Growing Demand for Contactless Solutions; Increasing need for Security and Fraud Prevention.
Notable trends are: The Payments and Transactions Segment is Anticipated to Witness Significant Growth.
This child item describes R code used to determine whether public-supply water systems buy water, sell water, both buy and sell water, or are neutral (meaning the system has only local water supplies) using water source information from a proprietary dataset from the U.S. Environmental Protection Agency. This information was needed to better understand public-supply water use and where water buying and selling were likely to occur. Buying or selling of water may result in per capita rates that are not representative of the population within the water service area. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Output from this code was used as an input feature variable in the public supply water use machine learning model. This page includes the following files:
ID_WSA_04062022_Buyers_Sellers_DR.R - an R script used to determine whether a public-supply water service area buys water, sells water, or is neutral
BuySell_readme.txt - a README text file describing the script