License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was gathered on September 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub; see the file ManyTypes4PyDataset.spec for the repositories' URLs and their commit SHAs. The dataset is de-duplicated using the CD4Py tool, and the list of duplicate files is provided in duplicate_files.txt. All of its Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in JSONOutput.md. The dataset is split into train, validation, and test sets by source code files; the list of files and their corresponding set is provided in dataset_split.csv. Notable changes to each version of the dataset are documented in CHANGELOG.md.
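For instance, the split assignment can be inspected with pandas; a minimal sketch, assuming dataset_split.csv has two columns (file path and split name) and no header, which may differ from the actual file:

```python
# Minimal sketch: count source files per split.
# The column names "file" and "split" are assumptions; adjust to the actual
# layout of dataset_split.csv.
import pandas as pd

splits = pd.read_csv("dataset_split.csv", header=None, names=["file", "split"])
print(splits["split"].value_counts())  # counts of train / validation / test files
```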
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were tasked to modify the hypothesis in a candidate pair in a way that reflects one of the labels. The dataset is balanced since the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998. We used Slovenian pre-trained language models to create splits, thereby ensuring that difficult and easy instances are evenly distributed in all three subsets.
The dataset is released in a tabular TSV format. The README.txt file contains a description of the attributes. Only the hypothesis and premise are given in the test set (i.e. no annotations) since SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By xnli (From Huggingface) [source]
The xnli Multilingual Natural Language Inference Dataset is a comprehensive collection of data specifically curated for training and evaluating natural language inference (NLI) models in various languages. It provides a diverse range of language splits, each containing examples in different languages such as Arabic, Bulgarian, Chinese, German, English, Greek, Spanish, French, Hindi, Indonesian, Italian, Japanese and many others.
With the goal of facilitating NLI tasks across multiple languages, this dataset includes separate CSV files for each language split. The available splits cover an extensive range of languages including widely spoken ones like English and Spanish as well as less commonly used ones like Urdu and Vietnamese.
Each CSV file consists of labeled examples that are essential for training and assessing the performance of NLI models. Each example contains two main components: the premise and the hypothesis. The premise is the initial sentence or text segment that forms the foundation of the NLI task, while the hypothesis is the second sentence or text segment; its comparison to the premise determines the logical relationship between them.
One crucial aspect for effective analysis is the label assigned to each example, which indicates its logical relationship to its premise. These labels fall into three categories: entailment (the hypothesis can be inferred from the premise), contradiction (the hypothesis contradicts the premise), or neutral (no logical relationship exists between them).
Moreover, to support development across different linguistic domains, this dataset also includes specific test splits dedicated to evaluating NLI models in individual languages, such as English (en_test.csv) and Urdu (ur_test.csv), among others.
Researchers and practitioners building multilingual NLI models can use this xnli dataset, with its many language variations and labeled examples, to train their models effectively and to assess how accurately they capture logical relationships between sentences across multiple linguistic contexts.
- Cross-lingual NLI Modeling: The xnli dataset provides an opportunity to train and test natural language inference models across multiple languages. Researchers can use this dataset to develop cross-lingual NLI models that can effectively understand the logical relationship between premises and hypotheses in different languages.
- Language Transfer Learning: By training on the xnli dataset, language models can learn to transfer their knowledge across different languages. This dataset can be used for pre-training models in one language and fine-tuning them for downstream tasks in another language, improving the performance of natural language understanding models in low-resource languages.
- Multilingual Evaluation Benchmarks: The xnli dataset serves as a benchmark for evaluating NLI models' performance across various languages. It allows researchers to compare the effectiveness of different models and techniques in handling diverse linguistic expressions, enabling advancements in multilingual understanding capabilities.
If you use this dataset in your research, please credit the original authors.
Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: el_validation.csv
| Column name | Description |
|:---|:---|
| premise | The first sentence or text segment that serves as the basis for the natural language inference task. (Text) |
| hypothesis | The second sentence or text segment that is compared to the premise to determine the logical relationship between them. (Text) |
...
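As a quick illustration, one language split can be loaded with pandas; a minimal sketch in which the premise and hypothesis columns follow the table above, while the "label" column name is an assumption:

```python
# Minimal sketch: inspect one language split of the xnli CSV files.
# "premise" and "hypothesis" follow the column table above; the "label"
# column name is an assumption and may differ in the actual files.
import pandas as pd

df = pd.read_csv("el_validation.csv")
print(df[["premise", "hypothesis"]].head())
print(df["label"].value_counts())  # expected labels: entailment, contradiction, neutral
```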
License: BSD with Attribution, https://fedoraproject.org/wiki/Licensing/BSD_with_Attribution
Dataset containing scanned historical measurement table documents from ship logs and land measurement stations. The annotations provided in this dataset are designed to allow finer-grained table detection and table structure recognition models to be trained and tested. Annotations are region boundaries for tables, cells, headings, headers and captions.
This dataset release includes code to train models on a training split, to use trained model checkpoints for inference, and to evaluate inferred results on a test split. Pretrained models used in the published HIP-2021 paper are included in the dataset so results can be easily reproduced without training the model checkpoints yourself.
Instructions and code can be found in the linked GitHub repository: https://github.com/stuartemiddleton/glosat_table_dataset
A pre-print of the HIP-2021 paper can be found on the authors' website: https://www.southampton.ac.uk/~sem03/HIP_2021.pdf
Original images sourced with permission from the UK Met Office, US NOAA and weatherrescue.org (University of Reading).
This work is part of the GloSAT project https://www.glosat.org/ and supported by the Natural Environment Research Council (NE/S015604/1). The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains the complete intersection Calabi-Yau four-folds (CICY4) configuration matrices and their four Hodge numbers, designed for the problem of machine learning the Hodge numbers using the configuration matrices as inputs to a neural network model.
The original data for CICY4 is from the paper "Topological Invariants and Fibration Structure of Complete Intersection Calabi-Yau Four-Folds", arXiv:1405.2073, and can be downloaded in either text or Mathematica format from https://www-thphys.physics.ox.ac.uk/projects/CalabiYau/Cicy4folds/index.html.
The full CICY4 data included with this dataset in npy format (conf.npy, hodge.npy, direct.npy) is created by running the script 'create_data.py' from https://github.com/robin-schneider/cicy-fourfolds. From this full data, two additional datasets at 72% and 80% training ratios were created.
At the 72% data split:
- The train dataset consists of the files (conf_Xtrain.npy, hodge_ytrain.npy)
- The validation dataset consists of the files (conf_Xvalid.npy, hodge_yvalid.npy)
- The test dataset consists of the files (conf_Xtest.npy, hodge_ytest.npy)
At the 80% data split, the three datasets are:
- (conf_Xtrain_80.npy, hodge_ytrain_80.npy)
- (conf_Xvalid.npy, hodge_yvalid.npy)
- (conf_Xtest_80.npy, hodge_ytest_80.npy)
The new train and test sets were formed from the old ones: the old test set is divided into two parts with ratios (0.6, 0.4); the 0.6 partition becomes the new test set, while the 0.4 partition is merged with the old train set to form the new train set (see the sketch below).
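The split construction described above can be mirrored with a short numpy sketch; the shuffling strategy and seed here are assumptions, and the released *_80.npy files were generated by the dataset authors, not by this snippet:

```python
# Sketch of deriving the 80% split from the 72% split: the old test set is cut
# 60/40, and the 40% part is merged into the old train set.
import numpy as np

conf_test = np.load("conf_Xtest.npy")
hodge_test = np.load("hodge_ytest.npy")
conf_train = np.load("conf_Xtrain.npy")
hodge_train = np.load("hodge_ytrain.npy")

rng = np.random.default_rng(0)             # seed is an assumption
idx = rng.permutation(len(conf_test))
cut = int(0.6 * len(conf_test))
test_idx, move_idx = idx[:cut], idx[cut:]  # 0.6 stays test, 0.4 moves to train

np.save("conf_Xtest_80.npy", conf_test[test_idx])
np.save("hodge_ytest_80.npy", hodge_test[test_idx])
np.save("conf_Xtrain_80.npy", np.concatenate([conf_train, conf_test[move_idx]]))
np.save("hodge_ytrain_80.npy", np.concatenate([hodge_train, hodge_test[move_idx]]))
```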
Trained neural network models and their training/validation losses:
- 12 models were trained on the 72% dataset; their checkpoints are stored in the folder 'trained_models'. The 12 csv files containing the train and validation losses of these models are stored in the folder 'train-validation-losses'.
- At the 80% data split, the top 3 performing models trained on the 72% dataset were retrained; their checkpoints are stored in 'trained_models_80pc_split', together with the 3 csv files containing the loss values during the training phase.
Inference notebook: The inference notebook using this dataset is https://www.kaggle.com/code/lorresprz/cicy4-training-results-inference-all-models
Publication: This dataset was created for the work: Deep Learning Calabi-Yau four folds with hybrid and recurrent neural network architectures, https://arxiv.org/abs/2405.17406
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).
For each test triple, the query (h,r,?) is scored against all entities in the KG, and we compute the rank of the score of the correct completion (h,r,t), after masking out scores of other (h,r,t') triples contained in the graph.
In experimental_data.zip, the following files are provided for each dataset:
- {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID, r_ID, t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
- test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, and tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models.
- entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook).
- relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).
The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
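For example, standard ranking metrics can be computed directly from test_ranks.csv; a minimal sketch using the documented column names:

```python
# Minimal sketch: mean reciprocal rank (MRR) and Hits@10 per KGE model,
# computed from the ground-truth tail ranks in test_ranks.csv.
import pandas as pd

ranks = pd.read_csv("test_ranks.csv")
for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
    mrr = (1.0 / ranks[model]).mean()
    hits10 = (ranks[model] <= 10).mean()
    print(f"{model}: MRR = {mrr:.3f}, Hits@10 = {hits10:.3f}")
```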
All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations to Czech, and, for a subset of items from SNLI, annotations of the Czech content and explanations.
The Czech SNLI data contain both Czech and English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.
The explanation dataset contains batches of premise-hypothesis pairs; each batch contains 1499 pairs. Each pair contains the CSNLI ID, the English premise, hypothesis, and gold label, the Czech premise, hypothesis, and gold label, and the explanation annotation (explanation-hypothesis, explanation-premise, explanation-relation).
Example record:
CSNLI ID: 4857558207.jpg#4r1e
English premise: A mother holds her newborn baby.
English hypothesis: A person holding a child.
English gold label: entailment
Czech premise: Matka drží své novorozené dítě.
Czech hypothesis: Osoba, která drží dítě.
Czech gold label: Entailment
Explanation-hypothesis: Matka
Explanation-premise: Osoba
Explanation-relation: generalization
Size of the explanations dataset:
- train: 159650
- dev: 2860
- test: 2880
Inter-Annotator Agreement (IAA): Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).
The translation was performed via the LINDAT translation service. The translated pairs were then manually checked (without access to the original English gold label), with the option of consulting the original pair.
Explanations were annotated as follows: - if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked - if there are two such parts and there exists a relation between them, the relation is marked
Possible relation types: - generalization: white long skirt - skirt - specification: dog - bulldog - similar: couch - sofa - independence: they have no instruments - they belong to the group - exclusion: man - woman
Original SNLI dataset: https://nlp.stanford.edu/projects/snli/ LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
SI-NLI-en is an English translation of the SI-NLI Slovene Natural Language Inference Dataset (http://hdl.handle.net/11356/1707). The English version was compiled by first using machine translation (DeepL) to translate all the premises and hypotheses from SI-NLI into English. The machine translations were then manually checked and corrected by a group of 7 students of translation at the University of Ljubljana. Each translator was given both the Slovene premise and all its hypotheses as well as the translations of both the premise and the hypotheses, so the translations were not checked in isolation, but as units to ensure maximum semantic coherence.
Just like SI-NLI, SI-NLI-en contains 5,937 sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998.
The dataset is released in a tabular TSV format. The 00README.txt file contains a description of the attributes. Only the hypothesis and premise are provided in the test set (with no annotations) since SI-NLI-en is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are intended as user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.
System requirements
· The software can be used in any operating system where R can be installed.
· We have developed and tested the software in a Windows environment with R version 4.3.1.
· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· The use of the software does not require any non-standard hardware.
Installation guide
· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than installing R and the packages listed above.
Demo 1: Software demo with simulated data
The software demonstration consists of two R-markdown files:
· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for the data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to the environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.
Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.
Demo 2: Software demo with a small subset of the data used in the paper
The software demonstration consists of one R-markdown file:
MA_small_demo. This script uses the CORAL functions in Hmsc to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence of at least 40 and less than 50, and common species as those with prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.
Scripts and data for reproducing the results presented in the paper (Demo 3)
The input data for the script pipeline is the file "allData.RData". This file includes the metadata (meta), the response matrix (Y), and the taxonomic information (taxonomy). Each script in the pipeline below depends on the outputs of the previous ones, so they must be run in order. The first six scripts are used for fitting the backbone HMSC model and calculating the parameters of the CORAL prior:
· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – checks backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculates the CORAL prior parameters.
The remaining scripts evaluate the model:
· S07_evaluate_prior_predictionss – uses the CORAL prior to predict rare species presence/absences and evaluates the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generates train/test splits for cross-validation, ensuring that at least 40% of positive samples are in each partition.
· S09_cross-validate – fits CORAL and the baseline model to the train/test splits and calculates performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = !training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – makes plots visualizing the AUC/Tjur's R2 produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – fits the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – fits the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compares posterior climate predictions using CORAL and baseline models on selected species, as well as the variance reduction for all species. Generates Fig. 5 of the paper.
Pre-processing scripts:
· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.
· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global" and adds it to the metadata.
· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to a regular matrix. Saves "allData.RData", which includes the metadata (meta), the response matrix (Y), and the taxonomic information (taxonomy).
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
AutoGEO-Researchy Dataset
This dataset contains multiple configurations for different tasks. Use the dropdown menu above to select a specific configuration to view.
- main: Contains the primary train and test splits.
- rule_candidate: Data for rule candidate generation.
- cold_start: Data for cold-start finetuning.
- inference: Data for inference tasks.
- grpo_input: Input data for GRPO.
- grpo_eval: Evaluation data for GRPO.
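A hypothetical loading sketch with the Hugging Face datasets library; the repository id below is a placeholder, and the configuration names follow the list above:

```python
# Hypothetical sketch: load one of the configurations listed above by name.
# Replace "<org>/AutoGEO-Researchy" with the actual repository id.
from datasets import load_dataset

ds = load_dataset("<org>/AutoGEO-Researchy", name="main")  # or "rule_candidate", "cold_start", ...
print(ds)  # shows the available splits of the selected configuration
```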
Researchy-GEO Dataset
This dataset contains multiple configurations for different tasks. Use the dropdown menu above to select a specific configuration to view.
- main: Contains the primary train and test splits.
- rule_candidate: Data for rule candidate generation.
- cold_start: Data for cold-start finetuning.
- inference: Data for inference tasks.
- grpo_input: Input data for GRPO.
- grpo_eval: Evaluation data for GRPO.
License: GNU GPL 2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Recognizing lexical inference is an essential component in natural language understanding. In question answering, for instance, identifying that broadcast and air are synonymous enables answering the question "When was 'Friends' first aired?" given the text "'Friends' was first broadcast in 1994". Semantic relations such as synonymy (tall, high) and hypernymy (cat, pet) are used to infer the meaning of one term from another, in order to overcome lexical variability. This inference should typically be performed within a given context, considering both the term meanings in context and the specific semantic relation that holds between the terms.
This dataset provides annotations for fine-grained lexical inferences in context. The dataset consists of 3,750 term pairs, each given within a context sentence, built upon a subset of terms from PPDB. Each term pair is annotated with the semantic relation that holds between the terms in the given contexts.
Files:
File Structure: comma-separated file
Fields:
If you use this dataset, please cite the following paper:
Adding Context to Semantic Data-Driven Paraphrasing.
Vered Shwartz and Ido Dagan. *SEM 2016.
I hope that this dataset will motivate the development of context-sensitive lexical inference methods, which have been relatively overlooked, although they are crucial for applications.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-I. This table highlights class representation balance to ensure effective model training and evaluation.
Note: To better find the files to download, select "Change View: Tree".
The dataset contains:
- 80 video sequences from conventional pig farming with multi-object tracking annotations, together with a 'split.txt' file containing the predefined training, validation and test splits
- The original mp4 videos of the 80 video sequences
- A visualization of the annotated bounding boxes for all 80 videos
- Model weights of MOTRv2 and MOTIP trained for pig tracking
- Pre-computed bounding box priors that can be used to train MOTRv2
A thorough explanation of all files contained in this data repository can be found in ReadMe.txt. The github repository associated with this dataset can be found at https://github.com/jonaden94/PigBench. It includes commands to automatically download the files from this data repository that are required for model training, evaluation, and inference.
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
The Response Score Dataset (RSD) is the first comprehensive multimodal response quality dataset specifically designed for training and evaluating Vision-Language Model (VLM) routers in edge-cloud collaborative systems. This dataset enables scenario-aware routing between large cloud models and small edge models, optimizing the trade-off between response quality, inference latency, and computational cost.
📦 Total Samples: ~22,700 image-text pairs
🤖 Models Evaluated: 8 VLMs (2 large + 3 medium + 2 small)
📚 Source Benchmarks: 7 public VLM datasets
⭐ Score Range: 1-10 (LLM-as-a-Judge)
✅ Human Validation: 200 samples (r=0.88 correlation)
💰 Construction Cost: ~$1,000 USD
| Dataset | Samples | Difficulty | Task Type |
|---|---|---|---|
| ChartQA | 2,500 | Easy | Chart understanding & arithmetic |
| WildVision | 500 | Easy | Real-world open-ended VQA |
| GQA | 12,000 | Medium | Compositional spatial reasoning |
| VizWiz | 4,319 | Medium | Blind-assistance with noise |
| MMVet | 218 | Medium | Multi-ability composite tasks |
| MMMU-Pro | 1,730 | Hard | Professional domain knowledge |
| MMStar | 1,500 | Hard | Leak-resistant fine-grained eval |
| Total | ~22,700 | Mixed | Diverse multimodal tasks |
Large Models (LVLM - Cloud Deployment):
- Gemma 3-27B
- InternVL3-38B
Small Models (SVLM - Edge Deployment):
- InternVL3-8B
- Phi-4-Multimodal-5.6B
- Qwen2.5-VL-7B
- InternVL2.5-2B
- InternVL2.5-1B
- SmolVLM-256M
vlm_evaluation_dataset/
├── images/ # Original images for each sub-dataset
│ ├── MMVet/
│ ├── ChartQA_TEST/
│ ├── GQA_TestDev_Balanced/
│ ├── MMMU/
│ ├── MMStar/
│ ├── VizWiz/
│ └── WildVision/
│
├── metadata/ # Metadata files for each dataset (TSV format)
│ ├── MMVet.tsv
│ ├── ChartQA_TEST.tsv
│ ├── GQA_TestDev_Balanced.tsv
│ ├── MMMU.tsv
│ ├── MMStar.tsv
│ ├── VizWiz.tsv
│ └── WildVision.tsv
│
├── scoring_results/ # Model prediction and scoring results
│ ├── MMVet/
│ │ ├── InternVL3-8B/
│ │ │ └── single/
│ │ │ ├── results.csv # Aggregated scoring results
│ │ │ ├── details.json # Detailed reasoning and scoring records
│ │ │ └── log.json # Model inference logs (optional)
│ │ └── OtherModel/
│ │ └── single/
│ ├── ChartQA_TEST/
│ │ └── ...
│ └── ...
│
├── statistics.json # Dataset statistics summary (sample counts, category distribution, etc.)
└── README.md # Overall dataset documentation
| Field Name | Description |
|---|---|
| index | Unique sample ID |
| image | Image path or Base64 encoding |
| question | Input question text |
| answer | Reference answer |
| category | Question category (e.g., visual reasoning, chart understanding, etc.) |
| Field Name | Description |
|---|---|
| question_id | Question ID (corresponds to metadata.index) |
| question | Question text |
| reference_answer | Ground truth answer |
| prediction | Model predicted answer |
| score | LLM score (range 0-10) |
| reasoning | Scoring rationale (text description) |
| model_name | Model name (e.g., InternVL3-8B) |
| category | Question category |
| dataset_type | Dataset name (e.g., MMVet) |
| inference_time | Model inference time (unit: seconds) |
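As an illustration, the per-model scoring files can be aggregated with pandas; a minimal sketch following the directory layout and field table above:

```python
# Minimal sketch: summarize one model's results on one sub-dataset.
# Paths and column names follow the structure and field descriptions above.
import pandas as pd

results = pd.read_csv("scoring_results/MMVet/InternVL3-8B/single/results.csv")
print("mean score:", results["score"].mean())
print("mean inference time (s):", results["inference_time"].mean())
print(results.groupby("category")["score"].mean())  # per-category quality
```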
Overall (all models, all samples):
├── Mean: 5.58
├── Median: 6.00
├── Std Dev: 2.15
├── Min: 1.00
├── Max: 10.00
└── Mode: 6.00
By Model Type:
├── Large Models (LVLM): Mean = 5.81
└── Small Models (SVLM): Mean = 5.47
By Difficulty:
├── Easy: Mean = 7.13
├── Medium: Mean = 5.80
└── Hard: Mean = 3.36
Overall:
├── Mean: 1.31s
├── Median: 0.60s
├── P75: 1.17s
├── P90: 2.52s
└── P99: 5.45s
By Model:
├── SmolVLM-256M: 0.62s (fastest)
├── InternVL2.5-1B: 0.71s
├── InternVL2.5-2B: 0.81s
├── InternVL3-8B: 0.92s
├── Phi-4-5.6B: 1.80s
├── Qwen2.5-VL-7B: 0.90s
├── InternVL3-38B: 2.47s
└── Gemma3-27B: 2.56s (slowest)
```mermaid
graph TD
    A[7 Public Benchmarks] --> B[Sample Collection ~22k]
    B --> C[8 VLM Inference]
    C --> D[Response Generation]
    D --> E[LLM-as-a-Judge Scoring]
    E --> F[Human Validation 200 samples]
    F --> G[Quality Check r>0.85]
    G --> H[MES-based Labeling]
    H --> I[Stratified Train/Val/Test Split]
    I --> J[Final RSD Dataset]
```
...
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
Paper: link
The ANLI Adversarial Natural Language Inference dataset is a new, large-scale NLI benchmark dataset. The dataset is collected via an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is much more difficult than its predecessors such as SNLI and MNLI. It contains three rounds. Each round has train/dev/test splits. The data fields are the same among all splits.
ANLI provides a unique challenge for natural language understanding models. The dataset is collected via an iterative, adversarial human-and-model-in-the-loop procedure that makes it much more difficult than its predecessors such as SNLI and MNLI. This makes ANLI a great benchmark for assessing the progress of NLI models.
To use the ANLI dataset, download the train_r1.csv file, which contains the first round of training data; the dev_r1.csv file, which contains the first round of development data; and the test_r1.csv file, which contains the first round of test data.
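A minimal loading sketch, assuming the column layout documented in the file tables below:

```python
# Minimal sketch: load the round-1 files and inspect the label distribution.
# Columns (premise, hypothesis, label, reason) follow the file tables below.
import pandas as pd

train_r1 = pd.read_csv("train_r1.csv")
dev_r1 = pd.read_csv("dev_r1.csv")
test_r1 = pd.read_csv("test_r1.csv")
print(len(train_r1), len(dev_r1), len(test_r1))
print(train_r1["label"].value_counts())
```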
- The ANLI Adversarial Natural Language Inference dataset can be used to train models to better understand natural language.
- The dataset can be used to develop models that are more robust to adversarial examples.
- The dataset can be used to improve the accuracy of NLI systems.
The dataset was originally published on Huggingface Hub
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: dev_r2.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: test_r2.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: train_r3.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: dev_r3.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: test_r3.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: train_r2.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the sentence. (String) |
| label | The label of the sentence. (String) |
| reason | The reason for the label. (String) |

File: train_r1.csv
| Column name | Description |
|:---|:---|
| premise | The premise of the sentence. (String) |
| hypothesis | The hypothesis of the...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III. It confirms that all classes are represented, supporting fair performance evaluation despite observed precision drops.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 164K images.
This is the original version from 2014, made available here for easy access in Kaggle because it no longer seems to be available on the COCO Dataset website. It has been retrieved from the mirror that Joseph Redmon set up on his own website.
The 2014 version of the COCO dataset is an excellent object detection dataset with 80 classes, 82,783 training images and 40,504 validation images. This dataset contains all this imagery in two folders, as well as the annotations with the class and location (bounding box) of the objects contained in each image.
The initial split provides training (83K), validation (41K) and test (41K) sets. Since the split between training and validation was not optimal in the original dataset, there are also two text (.part) files with a new split that uses only 5,000 images for validation and the rest for training. The test set has no labels and can be used for visual validation or pseudo-labelling.
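A minimal sketch of using the re-split lists; the .part file names below are assumptions (use the two .part files shipped with this dataset):

```python
# Sketch: read the new-split image lists (one image path per line).
def read_image_list(part_file):
    """Return the image paths listed in a .part split file."""
    with open(part_file) as f:
        return [line.strip() for line in f if line.strip()]

# File names are assumptions based on common YOLO-style COCO setups.
train_images = read_image_list("trainvalno5k.part")  # all images except the 5k validation set
val_images = read_image_list("5k.part")              # the 5,000-image validation split
print(len(train_images), len(val_images))
```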
This is mostly inspired by Erik Linder-Norén and [Joseph Redmon](https://pjreddie.com/darknet/yolo).
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 1,784 drone images of various construction types and nature. It is split into the three sets needed for machine learning and deep learning tasks, namely train, validation, and test splits. The structure of the data is as follows:
ROOT
train
valid
test
There are 1,427, 179, and 178 images in the train, validation, and test folders, respectively. The train and validation folders have specific classes, but the test set images have no classes and must be predicted using an AI model during inference.
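A minimal PyTorch sketch for loading the labelled splits, assuming train/ and valid/ contain one subfolder per class while test/ holds unlabeled images, as described above:

```python
# Minimal sketch: load the class-labelled splits with torchvision's ImageFolder.
# The image size and transforms are assumptions, not part of the dataset.
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("ROOT/train", transform=tfm)
valid_set = datasets.ImageFolder("ROOT/valid", transform=tfm)
print(len(train_set), "train images,", len(valid_set), "validation images")
print(train_set.classes)  # class names inferred from the folder names
```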
AI-Powered Acne Detection using Ensemble Deep Learning
This project presents an AI-based acne detection and severity assessment system that combines two deep learning models to analyze facial images. The approach integrates a classification model (ResNet50) and a localization model (YOLOv5) to provide both an overall severity prediction and detailed detection of individual acne lesions.
The classification module predicts the severity of acne on the entire face and estimates the number of lesions present in each image. It uses KL divergence loss to improve training stability and outputs confidence scores for each prediction. The localization module, based on YOLOv5, detects the exact positions of acne lesions and classifies them into six types: comedones, papules, pustules, nodules, cysts, and scars. It uses bounding boxes with configurable confidence thresholds and supports real-time detection.
The dataset used in this project contains 920 high-resolution facial images, annotated with 2,847 total lesions. The dataset is split into 637 training images, 194 validation images, and 89 test images. Annotations follow the YOLO format, and the class distribution is balanced across all acne types.
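For reference, YOLO-format annotation files store one lesion per line as "class_id x_center y_center width height", with coordinates normalized to the image size. A minimal parsing sketch; the file name and the class-index order are assumptions:

```python
# Sketch: parse one YOLO-format label file into lesion boxes.
# The mapping from class index to lesion type is an assumption.
LESION_TYPES = ["comedones", "papules", "pustules", "nodules", "cysts", "scars"]

def parse_yolo_labels(path):
    """Return (lesion type, x_center, y_center, width, height) tuples from a YOLO label file."""
    boxes = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            cls, xc, yc, w, h = line.split()
            boxes.append((LESION_TYPES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes

print(parse_yolo_labels("labels/example_face_001.txt"))  # hypothetical label file
```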
The system evaluates performance using standard classification metrics such as accuracy, precision, recall, and F1-score, and uses mAP and IoU for object detection. The ensemble results are generated by combining the outputs of both models, which helps improve accuracy and reliability.
This solution is implemented using Python 3.8 with the PyTorch framework. It requires YOLOv5 for detection and ResNet50 for classification. GPU acceleration with CUDA is recommended for training and inference.
The codebase includes a complete pipeline for training, validation, testing, and inference. It is modular and easy to extend. Jupyter notebook examples are provided for quick experimentation and visualization.
This project is suitable for various use cases including dermatology research, telemedicine, skincare applications, and educational tools. It demonstrates the value of combining classification and object detection models for practical medical image analysis.