20 datasets found
  1. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Feb 28, 2021
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
    Explore at:
    unknown (395470535)
    Dataset updated
    Feb 28, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License
    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17th, 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file. All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
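
    A minimal sketch of how the split file might be consumed, assuming dataset_split.csv is a two-column CSV of (file path, split name); the exact column order and presence of a header are not documented here and should be checked against the file:

    ```python
    # Minimal sketch: group source files by their assigned split using dataset_split.csv.
    # Assumption: each row is "<file path>,<split name>"; adjust if the real layout differs.
    import csv
    from collections import defaultdict

    splits = defaultdict(list)
    with open("dataset_split.csv", newline="") as f:
        for row in csv.reader(f):
            file_path, split_name = row[0], row[1]
            splits[split_name].append(file_path)

    for name, files in sorted(splits.items()):
        print(f"{name}: {len(files)} files")
    ```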

  2. Data from: Slovene Natural Language Inference Dataset SI-NLI

    • live.european-language-grid.eu
    binary format
    Updated Nov 12, 2022
    + more versions
    Cite
    (2022). Slovene Natural Language Inference Dataset SI-NLI [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20816
    Explore at:
    binary format
    Dataset updated
    Nov 12, 2022
    License
    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    SI-NLI (Slovene Natural Language Inference Dataset) contains 5,937 human-created Slovene sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". We created the dataset using sentences that appear in the Slovenian reference corpus ccKres (http://hdl.handle.net/11356/1034). Annotators were tasked to modify the hypothesis in a candidate pair in a way that reflects one of the labels. The dataset is balanced since the annotators created three modifications (entailment, contradiction, neutral) for each candidate sentence pair. The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998. We used Slovenian pre-trained language models to create splits, thereby ensuring that difficult and easy instances are evenly distributed in all three subsets.

    The dataset is released in a tabular TSV format. The README.txt file contains a description of the attributes. Only the hypothesis and premise are given in the test set (i.e. no annotations) since SI-NLI is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.
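
    As a minimal sketch, the TSV can be read with pandas; the file name train.tsv and the label column name are assumptions here, since the authoritative attribute list is in README.txt:

    ```python
    # Minimal sketch: load the SI-NLI training split and inspect the label balance.
    # File and column names (train.tsv, "label") are assumptions; see README.txt for the real ones.
    import pandas as pd

    train = pd.read_csv("train.tsv", sep="\t")
    print(train.shape)                    # expected ~4,392 rows
    print(train["label"].value_counts())  # entailment / contradiction / neutral
    ```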

  3. XNLI - Multilingual NLI

    • kaggle.com
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). XNLI - Multilingual NLI [Dataset]. https://www.kaggle.com/datasets/thedevastator/xnli-multilingual-nli-dataset/suggestions
    Explore at:
    Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License
    CC0 1.0 Universal (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    XNLI - Multilingual NLI

    A dataset for multilingual natural language inference tasks

    By xnli (From Huggingface) [source]

    About this dataset

    The xnli Multilingual Natural Language Inference Dataset is a comprehensive collection of data specifically curated for training and evaluating natural language inference (NLI) models in various languages. It provides a diverse range of language splits, each containing examples in different languages such as Arabic, Bulgarian, Chinese, German, English, Greek, Spanish, French, Hindi, Indonesian, Italian, Japanese and many others.

    With the goal of facilitating NLI tasks across multiple languages, this dataset includes separate CSV files for each language split. The available splits cover an extensive range of languages including widely spoken ones like English and Spanish as well as less commonly used ones like Urdu and Vietnamese.

    Each CSV file consists of labeled examples that are essential for training and assessing the performance of NLI models. These examples contain two main components: the premise and the hypothesis. The premise represents the initial sentence or text segment that forms the foundation for the NLI task. The hypothesis, on the other hand, serves as the second sentence or text segment; its comparison to the premise determines the logical relationship between them.

    One crucial aspect contributing to effective analysis is the label assigned to each example, indicating its logical relationship to the premise. These labels fall into three categories: entailment (the hypothesis can be inferred from the premise), contradiction (the hypothesis contradicts the premise), or neutral (no logical relationship holds between them).

    Moreover, to support development across different linguistic domains, this dataset also includes specific test splits dedicated to evaluating NLI models in individual languages, such as English (en_test.csv) and Urdu (ur_test.csv), among others.

    Researchers and practitioners building multilingual NLI models can use this xnli dataset, which spans numerous language variations with suitably labeled examples, to train their models effectively and to accurately assess how well they capture logical relationships between sentences across multiple linguistic contexts.

    Research Ideas

    • Cross-lingual NLI Modeling: The xnli dataset provides an opportunity to train and test natural language inference models across multiple languages. Researchers can use this dataset to develop cross-lingual NLI models that can effectively understand the logical relationship between premises and hypotheses in different languages.
    • Language Transfer Learning: By training on the xnli dataset, language models can learn to transfer their knowledge across different languages. This dataset can be used for pre-training models in one language and fine-tuning them for downstream tasks in another language, improving the performance of natural language understanding models in low-resource languages.
    • Multilingual Evaluation Benchmarks: The xnli dataset serves as a benchmark for evaluating NLI models' performance across various languages. It allows researchers to compare the effectiveness of different models and techniques in handling diverse linguistic expressions, enabling advancements in multilingual understanding capabilities.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: el_validation.csv

    | Column name | Description |
    |:------------|:------------|
    | premise | The first sentence or text segment that serves as the basis for the natural language inference task. (Text) |
    | hypothesis | The second sentence or text segment that is compared to the premise to determine the logical relationship between them. (Text) |
    ...
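
    A minimal sketch for loading one language split with pandas, using the el_validation.csv file listed above; the label column name and its integer encoding are assumptions and should be verified against the files:

    ```python
    # Minimal sketch: load the Greek validation split and map assumed integer labels to names.
    import pandas as pd

    df = pd.read_csv("el_validation.csv")
    label_names = {0: "entailment", 1: "neutral", 2: "contradiction"}  # assumed encoding
    df["label_name"] = df["label"].map(label_names)
    print(df[["premise", "hypothesis", "label_name"]].head())
    ```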

  4. GloSAT Historical Measurement Table Dataset

    • zenodo.org
    • eprints.soton.ac.uk
    • +1more
    bin, zip
    Updated Jun 24, 2025
    Cite
    Stuart E. Middleton; Juliusz Ziomek (2025). GloSAT Historical Measurement Table Dataset [Dataset]. http://doi.org/10.5281/zenodo.5363457
    Explore at:
    zip, bin
    Dataset updated
    Jun 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Stuart E. Middleton; Juliusz Ziomek
    License
    BSD with Attribution, https://fedoraproject.org/wiki/Licensing/BSD_with_Attribution

    Description

    Dataset containing scanned historical measurement table documents from ship logs and land measurement stations. Annotations provided in this dataset are designed to allow finer-grained table detection and table structure recognition models to be trained and tested. Annotations are region boundaries for tables, cells, headings, headers and captions.

    This dataset release includes code to train models on a training split, to use trained model checkpoints for inference, and to evaluate inferred results on a test split. Pretrained models used in the published HIP-2021 paper are included in the dataset so results can be easily reproduced without training the model checkpoints yourself.

    Instructions and code can be found in the linked GitHub repository https://github.com/stuartemiddleton/glosat_table_dataset

    A pre-print of the HIP-2021 paper can be found on the author's website https://www.southampton.ac.uk/~sem03/HIP_2021.pdf

    Original images sourced with permission from the UK Met Office, US NOAA and weatherrescue.org (University of Reading).

    This work is part of the GloSAT project https://www.glosat.org/ and supported by the Natural Environment Research Council (NE/S015604/1). The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work.

  5. Calabi-Yau: CICY-4 folds

    • kaggle.com
    zip
    Updated Dec 4, 2024
    Cite
    lorrespz (2024). Calabi-Yau: CICY-4 folds [Dataset]. https://www.kaggle.com/datasets/lorresprz/calabi-yau-cicy-4-folds
    Explore at:
    zip (460825834 bytes)
    Dataset updated
    Dec 4, 2024
    Authors
    lorrespz
    License
    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains the complete intersection Calabi-Yau four-folds (CICY4) configuration matrices and their four Hodge numbers, designed for the problem of machine learning the Hodge numbers using the configuration matrices as inputs to a neural network model.

    The original data for CICY4 is from the paper "Topological Invariants and Fibration Structure of Complete Intersection Calabi-Yau Four-Folds", arXiv:1405.2073, and can be downloaded in either text or Mathematica format from: https://www-thphys.physics.ox.ac.uk/projects/CalabiYau/Cicy4folds/index.html

    The full CICY4 data included with this dataset in npy format (conf.npy, hodge.npy, direct.npy) is created by running the script 'create_data.py' from https://github.com/robin-schneider/cicy-fourfolds. Given this full data, the following two additional datasets at 72% and 80% training ratios were created.

    At the 72% data split:

    • The train dataset consists of the files (conf_Xtrain.npy, hodge_ytrain.npy)
    • The validation dataset consists of the files (conf_Xvalid.npy, hodge_yvalid.npy)
    • The test dataset consists of the files (conf_Xtest.npy, hodge_ytest.npy)

    At the 80% data split, the 3 datasets are:

    • (conf_Xtrain_80.npy, hodge_ytrain_80.npy)
    • (conf_Xvalid.npy, hodge_yvalid.npy)
    • (conf_Xtest_80.npy, hodge_ytest_80.npy)

    The new train and test sets were formed from the old ones: the old test set was divided into two parts with the ratio (0.6, 0.4). The 0.6 partition becomes the new test set, and the 0.4 partition is merged with the old train set to form the new train set.
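
    A minimal sketch for loading the 72% split arrays with numpy before feeding them to a model; the file names follow the description above, and no particular array shapes are assumed:

    ```python
    # Minimal sketch: load the 72% split and check array shapes.
    import numpy as np

    X_train = np.load("conf_Xtrain.npy")
    y_train = np.load("hodge_ytrain.npy")
    X_valid = np.load("conf_Xvalid.npy")
    y_valid = np.load("hodge_yvalid.npy")

    print("train:", X_train.shape, y_train.shape)
    print("valid:", X_valid.shape, y_valid.shape)
    ```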

    Trained neural network models and their training/validation losses:

    • 12 models were trained on the 72% dataset and their checkpoints are stored in the folder 'trained_models'. The 12 csv files containing the train+validation losses of these models are stored in the folder 'train-validation-losses'.
    • At the 80% data split, the top 3 performing models trained on the 72% dataset were retrained and their checkpoints are stored in 'trained_models_80pc_split', together with the 3 csv files containing the loss values during the training phase.

    Inference notebook: The inference notebook using this dataset is https://www.kaggle.com/code/lorresprz/cicy4-training-results-inference-all-models

    Publication: This dataset was created for the work: Deep Learning Calabi-Yau four folds with hybrid and recurrent neural network architectures, https://arxiv.org/abs/2405.17406

  6. Link-prediction on Biomedical Knowledge Graphs

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 25, 2024
    Cite
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec (2024). Link-prediction on Biomedical Knowledge Graphs [Dataset]. http://doi.org/10.5281/zenodo.12097377
    Explore at:
    zip
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Cattaneo; Daniel Justus; Stephen Bonner; Thomas Martynec
    License
    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jun 25, 2021
    Description

    Release of the experimental data from the paper Towards Linking Graph Topology to Model Performance for Biomedical Knowledge Graph Completion (accepted at Machine Learning for Life and Material Sciences workshop @ ICML2024).

    Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions we invite the community to build upon our work and continue improving the understanding of these crucial applications.
    Experiments were conducted on six datasets: five from the biomedical domain (Hetionet, PrimeKG, PharmKG, OpenBioLink2020 HQ, PharMeBINet) and one trivia KG (FB15k-237). All datasets were randomly split into training, validation and test set (80% / 10% / 10%; in the case of PharMeBINet, 99.3% / 0.35% / 0.35% to mitigate the increased inference cost on the larger dataset).
    On each dataset, four different KGE models were compared: TransE, DistMult, RotatE, TripleRE. Hyperparameters were tuned on the validation split and we release results for tail predictions on the test split. In particular, each test query (h,r,?) is scored against all entities in the KG and we compute the rank of the score of the correct completion (h,r,t) , after masking out scores of other (h,r,t') triples contained in the graph.
    Note: the ranks provided are computed as the average between the optimistic and pessimistic ranks of triple scores.
    Inside experimental_data.zip, the following files are provided for each dataset:
    • {dataset}_preprocessing.ipynb: a Jupyter notebook for downloading and preprocessing the dataset. In particular, this generates the custom label->ID mapping for entities and relations, and the numerical tensor of (h_ID,r_ID,t_ID) triples for all edges in the graph, which can be used to compute graph topological metrics (e.g., using kg-topology-toolbox) and compare them with the edge prediction accuracy.
    • test_ranks.csv: csv table with columns ["h", "r", "t"] specifying the head, relation, tail IDs of the test triples, and columns ["DistMult", "TransE", "RotatE", "TripleRE"] with the rank of the ground-truth tail in the ordered list of predictions made by the four models;
    • entity_dict.csv: the list of entity labels, ordered by entity ID (as generated in the preprocessing notebook);
    • relation_dict.csv: the list of relation labels, ordered by relation ID (as generated in the preprocessing notebook).

    The separate top_100_tail_predictions.zip archive contains, for each of the test queries in the corresponding test_ranks.csv table, the IDs of the top-100 tail predictions made by each of the four KGE models, ordered by decreasing likelihood. The predictions are released in a .npz archive of numpy arrays (one array of shape (n_test_triples, 100) for each of the KGE models).
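
    A minimal sketch of how the released ranks can be turned into standard link-prediction metrics (MRR, Hits@10) per model, using the test_ranks.csv columns described above; this is a generic evaluation snippet, not the authors' own code:

    ```python
    # Minimal sketch: compute MRR and Hits@10 per KGE model from the rank columns of test_ranks.csv.
    import pandas as pd

    ranks = pd.read_csv("test_ranks.csv")
    for model in ["DistMult", "TransE", "RotatE", "TripleRE"]:
        r = ranks[model]
        mrr = (1.0 / r).mean()
        hits10 = (r <= 10).mean()
        print(f"{model}: MRR={mrr:.3f}  Hits@10={hits10:.3f}")
    ```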

    All experiments (training and inference) have been run on Graphcore IPU hardware using the BESS-KGE distribution framework.

  7. Czech Natural Language Inference Dataset with Explanations

    • live.european-language-grid.eu
    binary format
    Updated Dec 31, 2023
    Cite
    (2023). Czech Natural Language Inference Dataset with Explanations [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23652
    Explore at:
    binary format
    Dataset updated
    Dec 31, 2023
    License
    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The dataset contains two parts: the original Stanford Natural Language Inference (SNLI) dataset with automatic translations to Czech, and, for some items from SNLI, annotations of the Czech content together with explanations.

    The Czech SNLI data contain both Czech and English premise-hypothesis pairs. The SNLI split into train/test/dev is preserved.

    • CZtrainSNLI.csv: 550152 pairs
    • CZtestSNLI.csv: 10000 pairs
    • CZdevSNLI.csv: 10000 pairs

    The explanation dataset contains batches of premise-hypothesis pairs. Each batch contains 1499 pairs. Each pair contains:

    • reference to original SNLI example
    • English premise and English hypothesis
    • English gold label (one of Entailment, Contradiction, Neutral)
    • automatically translated premise and hypothesis to Czech
    • Czech gold label (one of entailment, contradiction, neutral, bad translation)
    • explanations for Czech label

    Example record:

    • CSNLI ID: 4857558207.jpg#4r1e
    • English premise: A mother holds her newborn baby.
    • English hypothesis: A person holding a child.
    • English gold label: entailment
    • Czech premise: Matka drží své novorozené dítě.
    • Czech hypothesis: Osoba, která drží dítě.
    • Czech gold label: Entailment
    • Explanation-hypothesis: Matka
    • Explanation-premise: Osoba
    • Explanation-relation: generalization

    Size of the explanations dataset:

    • train: 159650
    • dev: 2860
    • test: 2880

    Inter-Annotator Agreement (IAA): Packages 1 and 12 annotate the same data. The IAA measured by the kappa score is 0.67 (substantial agreement).

    The translation was performed via LINDAT translation service. Next, the translated pairs were manually checked (without access to the original English gold label), with possible check of the original pair.

    Explanations were annotated as follows:

    • if there is a part of the premise or hypothesis that is relevant for the annotator's decision, it is marked
    • if there are two such parts and there exists a relation between them, the relation is marked

    Possible relation types:

    • generalization: white long skirt - skirt
    • specification: dog - bulldog
    • similar: couch - sofa
    • independence: they have no instruments - they belong to the group
    • exclusion: man - woman

    Original SNLI dataset: https://nlp.stanford.edu/projects/snli/ LINDAT Translation Service: https://lindat.mff.cuni.cz/services/translation/

  8. English translation of the Slovene Natural Language Inference Dataset...

    • live.european-language-grid.eu
    binary format
    Updated Mar 18, 2024
    Cite
    (2024). English translation of the Slovene Natural Language Inference Dataset SI-NLI-en 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23262
    Explore at:
    binary format
    Dataset updated
    Mar 18, 2024
    License
    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    SI-NLI-en is an English translation of the SI-NLI Slovene Natural Language Inference Dataset (http://hdl.handle.net/11356/1707). The English version was compiled by first using machine translation (DeepL) to translate all the premises and hypotheses from SI-NLI into English. The machine translations were then manually checked and corrected by a group of 7 students of translation at the University of Ljubljana. Each translator was given both the Slovene premise and all its hypotheses as well as the translations of both the premise and the hypotheses, so the translations were not checked in isolation, but as units to ensure maximum semantic coherence.

    Just like SI-NLI, SI-NLI-en contains 5,937 sentence pairs (premise and hypothesis) that are manually labeled with the labels "entailment", "contradiction", and "neutral". The dataset is split into train, validation, and test sets, with sizes of 4,392, 547, and 998.

    The dataset is released in a tabular TSV format. The 00README.txt file contains a description of the attributes. Only the hypothesis and premise are provided in the test set (with no annotations) since SI-NLI-en is integrated into the Slovene evaluation framework SloBENCH (https://slobench.cjvt.si/). If you use the dataset to train your models, please consider submitting the test set predictions to SloBENCH to get the evaluation score and see how it compares to others.

  9. Data and script pipeline for: Common to rare transfer learning (CORAL)...

    • zenodo.org
    bin, html
    Updated Mar 3, 2025
    Cite
    Otso Ovaskainen (2025). Data and script pipeline for: Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods [Dataset]. http://doi.org/10.5281/zenodo.14962497
    Explore at:
    bin, html
    Dataset updated
    Mar 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Otso Ovaskainen
    License
    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are aimed to be user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.

    System requirements

    · The software can be used in any operating system where R can be installed.

    · We have developed and tested the software in a Windows environment with R version 4.3.1.

    · Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).

    · Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).

    · The use of the software does not require any non-standard hardware.

    Installation guide

    · The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than the installation of R.

    Demo 1: Software demo with simulated data

    The software demonstration consists of two R-markdown files:

    · D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset as it is assumed to remain unknown for the user), Y (species occurrence data), phy (phylogenetic tree), studyDesign (list of sampling units). Additionally, true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor that will be estimated through a latent factor approach), and beta (species responses to environmental predictors).

    · D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.

    Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.

    Demo 2: Software demo with a small subset of the data used in the paper

    The software demonstration consists of one R-markdown file:

    MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence at least 40 and less than 50, and common species as those with prevalence at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.

    Scripts and data for reproducing the results presented in the paper (Demo 3)

    The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:

    · S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.

    · S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.

    · S03_import_posterior – imports the posterior distributions sampled by the initial model.

    · S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.

    · S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.

    · S06_construct_coral_priors – calculate CORAL prior parameters.

    The remaining scripts evaluate the model:

    · S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.

    · S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.

    · S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = ! training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.

    · S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.

    · S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.

    · S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.

    · S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.

    Pre-processing scripts:

    · P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.

    · P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.

    · P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).

    Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.

  10. Researchy-GEO

    • huggingface.co
    Updated Jul 10, 2020
    + more versions
    Cite
    yujiang wu (2020). Researchy-GEO [Dataset]. https://huggingface.co/datasets/yujiangw/Researchy-GEO
    Explore at:
    Dataset updated
    Jul 10, 2020
    Authors
    yujiang wu
    License
    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AutoGEO-Researchy Dataset

    This dataset contains multiple configurations for different tasks. Use the dropdown menu above to select a specific configuration to view.

    • main: Contains the primary train and test splits.
    • rule_candidate: Data for rule candidate generation.
    • cold_start: Data for cold-start finetuning.
    • inference: Data for inference tasks.
    • grpo_input: Input data for GRPO.
    • grpo_eval: Evaluation data for GRPO.
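
    A minimal sketch for loading one of these configurations from the Hugging Face Hub with the datasets library; the configuration names come from the list above, and the exact split names within each configuration are an assumption:

    ```python
    # Minimal sketch: load the "main" configuration of the Hub dataset.
    from datasets import load_dataset

    ds = load_dataset("yujiangw/Researchy-GEO", "main")
    print(ds)              # shows which splits are available
    print(ds["train"][0])  # first training example, assuming a "train" split exists
    ```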

  11. Researchy-GEO-old

    • huggingface.co
    Updated Jul 10, 2020
    Cite
    yujiang wu (2020). Researchy-GEO-old [Dataset]. https://huggingface.co/datasets/yujiangw/Researchy-GEO-old
    Explore at:
    Dataset updated
    Jul 10, 2020
    Authors
    yujiang wu
    Description

    Researchy-GEO Dataset

    This dataset contains multiple configurations for different tasks. Use the dropdown menu above to select a specific configuration to view.

    • main: Contains the primary train and test splits.
    • rule_candidate: Data for rule candidate generation.
    • cold_start: Data for cold-start finetuning.
    • inference: Data for inference tasks.
    • grpo_input: Input data for GRPO.
    • grpo_eval: Evaluation data for GRPO.

  12. Fine-grained Context-sensitive Lexical Inference

    • kaggle.com
    zip
    Updated Aug 12, 2017
    Cite
    Vered Shwartz (2017). Fine-grained Context-sensitive Lexical Inference [Dataset]. https://www.kaggle.com/datasets/vered1986/context-lexinf
    Explore at:
    zip (1171339 bytes)
    Dataset updated
    Aug 12, 2017
    Authors
    Vered Shwartz
    License
    GNU GPL 2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Recognizing lexical inference is an essential component in natural language understanding. In question answering, for instance, identifying that broadcast and air are synonymous enables answering the question "When was 'Friends' first aired?" given the text "'Friends' was first broadcast in 1994". Semantic relations such as synonymy (tall, high) and hypernymy (cat, pet) are used to infer the meaning of one term from another, in order to overcome lexical variability. This inference should typically be performed within a given context, considering both the term meanings in context and the specific semantic relation that holds between the terms.

    Content

    This dataset provides annotations for fine-grained lexical inference in context. The dataset consists of 3,750 term pairs, each given within a context sentence, built upon a subset of terms from PPDB. Each term pair is annotated with the semantic relation that holds between the terms in the given contexts.

    Files:

    • full_dataset.csv - the full dataset is provided, as well as the train-test-validation split.
    • train.csv, test.csv, validation.csv - A split of the dataset into 70% train, 25% test, and 5% validation sets. Each set contains different term pairs, to avoid overfitting to the most common relation of a term pair in the training set.

    File Structure: comma-separated file

    Fields:

    • x: the first term
    • y: the second term
    • context_x: the sentence in which x appears (highlighted by )
    • context_y: the sentence in which y appears (highlighted by )
    • semantic_relation: the (directional) semantic relation that holds between x and y: equivalence, forward_entailment, reverse_entailment, alternation, other-related and independence.
    • confidence: the relation annotation confidence (percentage of annotators that selected this relation), on a scale of 0-1
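
    A minimal sketch using these fields to keep only high-agreement pairs; the 0.8 confidence threshold is an arbitrary illustrative choice:

    ```python
    # Minimal sketch: load the training split and inspect relations among high-agreement pairs.
    import pandas as pd

    train = pd.read_csv("train.csv")
    confident = train[train["confidence"] >= 0.8]          # pairs most annotators agreed on
    print(confident["semantic_relation"].value_counts())   # distribution over the relations
    ```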

    Acknowledgements

    If you use this dataset, please cite the following paper:

    Adding Context to Semantic Data-Driven Paraphrasing.

    Vered Shwartz and Ido Dagan. *SEM 2016.

    Inspiration

    I hope that this dataset will motivate the development of context-sensitive lexical inference methods, which have been relatively overlooked, although they are crucial for applications.

  13. Distribution of bounding boxes for each sedimentary structure across...

    • plos.figshare.com
    xlsx
    Updated Jul 18, 2025
    Cite
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz (2025). Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-I. This table highlights class representation balance to ensure effective model training and evaluation. [Dataset]. http://doi.org/10.1371/journal.pone.0327738.s001
    Explore at:
    xlsx
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz
    License
    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-I. This table highlights class representation balance to ensure effective model training and evaluation.

  14. PigTrack: a diverse and challenging benchmark dataset for multi-object...

    • resodate.org
    Updated Apr 3, 2025
    + more versions
    Cite
    Jonathan Henrich; Christian Post; Thomas Kneib; Ramin Yahyapour; Imke Traulsen (2025). PigTrack: a diverse and challenging benchmark dataset for multi-object tracking of pigs [Dataset]. http://doi.org/10.25625/P7VQTP
    Explore at:
    Dataset updated
    Apr 3, 2025
    Dataset provided by
    Georg-August-Universität Göttingen
    GRO.data
    Authors
    Jonathan Henrich; Christian Post; Thomas Kneib; Ramin Yahyapour; Imke Traulsen
    Description

    Note: To better find the files to download, select "Change View: Tree". The dataset contains:

    • 80 video sequences from conventional pig farming with multi-object tracking annotations, together with a 'split.txt' file containing the predefined training, validation and test splits
    • The original mp4 videos of the 80 video sequences
    • A visualization of the annotated bounding boxes for all 80 videos
    • Model weights of MOTRv2 and MOTIP trained for pig tracking
    • Pre-computed bounding box priors that can be used to train MOTRv2

    A thorough explanation of all files contained in this data repository can be found in ReadMe.txt. The github repository associated with this dataset can be found at https://github.com/jonaden94/PigBench. It includes commands to automatically download the files from this data repository that are required for model training, evaluation, and inference.

  15. Response Score Dataset on VLM

    • kaggle.com
    zip
    Updated Nov 6, 2025
    Cite
    tangx_0121 (2025). Response Score Dataset on VLM [Dataset]. https://www.kaggle.com/datasets/tangx0121/vlm-response-score-dataset
    Explore at:
    zip (3554905158 bytes)
    Dataset updated
    Nov 6, 2025
    Authors
    tangx_0121
    License
    CC0 1.0 Universal (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    RSD: Response Score Dataset for Vision-Language Model Routing

    1. Overview

    Dataset Summary

    The Response Score Dataset (RSD) is the first comprehensive multimodal response quality dataset specifically designed for training and evaluating Vision-Language Model (VLM) routers in edge-cloud collaborative systems. This dataset enables scenario-aware routing between large cloud models and small edge models, optimizing the trade-off between response quality, inference latency, and computational cost.

    Key Statistics

    📦 Total Samples: ~22,700 image-text pairs
    🤖 Models Evaluated: 8 VLMs (2 Large + 3 Medium + 2 Small)
    📚 Source Benchmarks: 7 public VLM datasets
    ⭐ Score Range: 1-10 (LLM-as-a-Judge)
    ✅ Human Validation: 200 samples (r=0.88 correlation)
    💰 Construction Cost: ~$1,000 USD
    

    Dataset Composition

    | Dataset | Samples | Difficulty | Task Type |
    |:--------|--------:|:-----------|:----------|
    | ChartQA | 2,500 | Easy | Chart understanding & arithmetic |
    | WildVision | 500 | Easy | Real-world open-ended VQA |
    | GQA | 12,000 | Medium | Compositional spatial reasoning |
    | VizWiz | 4,319 | Medium | Blind-assistance with noise |
    | MMVet | 218 | Medium | Multi-ability composite tasks |
    | MMMU-Pro | 1,730 | Hard | Professional domain knowledge |
    | MMStar | 1,500 | Hard | Leak-resistant fine-grained eval |
    | Total | ~22,700 | Mixed | Diverse multimodal tasks |

    Model Coverage

    Large Models (LVLM - Cloud Deployment):

    • Gemma 3-27B
    • InternVL3-38B

    Small Models (SVLM - Edge Deployment):

    • InternVL3-8B
    • Phi-4-Multimodal-5.6B
    • Qwen2.5-VL-7B
    • InternVL2.5-2B
    • InternVL2.5-1B
    • SmolVLM-256M

    🗂️ 2. Dataset Structure

    vlm_evaluation_dataset/
    ├── images/           # Original images for each sub-dataset
    │  ├── MMVet/
    │  ├── ChartQA_TEST/
    │  ├── GQA_TestDev_Balanced/
    │  ├── MMMU/
    │  ├── MMStar/
    │  ├── VizWiz/
    │  └── WildVision/
    │
    ├── metadata/          # Metadata files for each dataset (TSV format)
    │  ├── MMVet.tsv
    │  ├── ChartQA_TEST.tsv
    │  ├── GQA_TestDev_Balanced.tsv
    │  ├── MMMU.tsv
    │  ├── MMStar.tsv
    │  ├── VizWiz.tsv
    │  └── WildVision.tsv
    │
    ├── scoring_results/      # Model prediction and scoring results
    │  ├── MMVet/
    │  │  ├── InternVL3-8B/
    │  │  │  └── single/
    │  │  │    ├── results.csv    # Aggregated scoring results
    │  │  │    ├── details.json    # Detailed reasoning and scoring records
    │  │  │    └── log.json      # Model inference logs (optional)
    │  │  └── OtherModel/
    │  │    └── single/
    │  ├── ChartQA_TEST/
    │  │  └── ...
    │  └── ...
    │
    ├── statistics.json       # Dataset statistics summary (sample counts, category distribution, etc.)
    └── README.md          # Overall dataset documentation
    

    📋 Metadata File Field Descriptions

    | Field Name | Description |
    |:-----------|:------------|
    | index | Unique sample ID |
    | image | Image path or Base64 encoding |
    | question | Input question text |
    | answer | Reference answer |
    | category | Question category (e.g., visual reasoning, chart understanding, etc.) |

    Model Scoring Results Format (CSV/JSON)

    | Field Name | Description |
    |:-----------|:------------|
    | question_id | Question ID (corresponds to metadata.index) |
    | question | Question text |
    | reference_answer | Ground truth answer |
    | prediction | Model predicted answer |
    | score | LLM score (range 0-10) |
    | reasoning | Scoring rationale (text description) |
    | model_name | Model name (e.g., InternVL3-8B) |
    | category | Question category |
    | dataset_type | Dataset name (e.g., MMVet) |
    | inference_time | Model inference time (unit: seconds) |
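
    A minimal sketch that aggregates mean score and mean inference time per model by concatenating the per-model results.csv files; the glob pattern assumes the directory layout shown in the tree above:

    ```python
    # Minimal sketch: summarize score and latency per model from scoring_results/*/*/single/results.csv.
    import glob
    import pandas as pd

    frames = [pd.read_csv(p) for p in glob.glob("scoring_results/*/*/single/results.csv")]
    results = pd.concat(frames, ignore_index=True)

    summary = results.groupby("model_name").agg(
        mean_score=("score", "mean"),
        mean_latency_s=("inference_time", "mean"),
        n_samples=("score", "size"),
    )
    print(summary.sort_values("mean_score", ascending=False))
    ```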

    Data Statistics

    Score Distribution

    Overall (all models, all samples):
    ├── Mean: 5.58
    ├── Median: 6.00
    ├── Std Dev: 2.15
    ├── Min: 1.00
    ├── Max: 10.00
    └── Mode: 6.00
    
    By Model Type:
    ├── Large Models (LVLM): Mean = 5.81
    └── Small Models (SVLM): Mean = 5.47
    
    By Difficulty:
    ├── Easy: Mean = 7.13
    ├── Medium: Mean = 5.80
    └── Hard: Mean = 3.36
    

    Latency Distribution

    Overall:
    ├── Mean: 1.31s
    ├── Median: 0.60s
    ├── P75: 1.17s
    ├── P90: 2.52s
    └── P99: 5.45s
    
    By Model:
    ├── SmolVLM-256M: 0.62s (fastest)
    ├── InternVL2.5-1B: 0.71s
    ├── InternVL2.5-2B: 0.81s
    ├── InternVL3-8B: 0.92s
    ├── Phi-4-5.6B: 1.80s
    ├── Qwen2.5-VL-7B: 0.90s
    ├── InternVL3-38B: 2.47s
    └── Gemma3-27B: 2.56s (slowest)
    

    🔨 3. Construction Pipeline & LLM-as-a-Judge Protocol

    3.1 Construction Workflow

    graph TD
      A[7 Public Benchmarks] --> B[Sample Collection ~22k]
      B --> C[8 VLM Inference]
      C --> D[Response Generation]
      D --> E[LLM-as-a-Judge Scoring]
      E --> F[Human Validation 200 samples]
      F --> G[Quality Check r>0.85]
      G --> H[MES-based Labeling]
      H --> I[Stratified Train/Val/Test Split]
      I --> J[Final RSD Dataset]
    ...
    
  16. ANLI - (Adversarial NLI Benchmark)

    • kaggle.com
    zip
    Updated Nov 20, 2022
    + more versions
    Cite
    The Devastator (2022). ANLI - (Adversarial NLI Benchmark) [Dataset]. https://www.kaggle.com/datasets/thedevastator/anli-a-large-scale-nli-benchmark-dataset/code
    Explore at:
    zip (17680012 bytes)
    Dataset updated
    Nov 20, 2022
    Authors
    The Devastator
    License
    CC0 1.0 Universal (CC0 1.0), https://creativecommons.org/publicdomain/zero/1.0/

    Description

    ANLI - (Adversarial NLI Benchmark)

    The Adversarial Natural Language Inference (ANLI, Nie et al.)

    Source

    Paper: link

    About this dataset

    The ANLI Adversarial Natural Language Inference dataset is a new, large-scale NLI benchmark dataset. The dataset is collected via an iterative, adversarial human-and-model-in-the-loop procedure. ANLI is much more difficult than its predecessors such as SNLI and MNLI. It contains three rounds. Each round has train/dev/test splits. The data fields are the same among all splits.

    ANLI provides a unique challenge for natural language understanding models. The dataset is collected via an iterative, adversarial human-and-model-in-the-loop procedure that makes it much more difficult than its predecessors such as SNLI and MNLI. This makes ANLI a great benchmark to assess the progress of NLI models.

    How to use the dataset

    To use the ANLI dataset, download the train_r1.csv file, which contains the first round of training data; the dev_r1.csv file, which contains the first round of development data; and the test_r1.csv file, which contains the first round of test data.
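
    A minimal sketch for loading the round-1 files with pandas, using the label column described in the Columns section below:

    ```python
    # Minimal sketch: load the ANLI round-1 splits and check the label balance.
    import pandas as pd

    train_r1 = pd.read_csv("train_r1.csv")
    dev_r1 = pd.read_csv("dev_r1.csv")
    test_r1 = pd.read_csv("test_r1.csv")

    print(len(train_r1), len(dev_r1), len(test_r1))
    print(train_r1["label"].value_counts())
    ```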

    Research Ideas

    • The ANLI Adversarial Natural Language Inference dataset can be used to train models to better understand natural language.
    • The dataset can be used to develop models that are more robust to adversarial examples.
    • The dataset can be used to improve the accuracy of NLI systems.

    Acknowledgements

    The dataset was originally published on Huggingface Hub

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    Files dev_r2.csv, test_r2.csv, train_r3.csv, dev_r3.csv, test_r3.csv, train_r2.csv and train_r1.csv all share the same columns:

    | Column name | Description |
    |:------------|:------------|
    | premise | The premise of the sentence. (String) |
    | hypothesis | The hypothesis of the sentence. (String) |
    | label | The label of the sentence. (String) |
    | reason | The reason for the label. (String) |

  17. Distribution of bounding boxes for each sedimentary structure across...

    • figshare.com
    xlsx
    Updated Jul 18, 2025
    Cite
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz (2025). Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III. It confirms that all classes are represented, supporting fair performance evaluation despite observed precision drops. [Dataset]. http://doi.org/10.1371/journal.pone.0327738.s003
    Explore at:
    xlsx
    Dataset updated
    Jul 18, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Ammar J. Abdlmutalib; Korhan Ayranci; Umair Bin Waheed; Hamad D. Alhajri; James A. MacEachern; Mohammed N. Al-Khabbaz
    License
    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Distribution of bounding boxes for each sedimentary structure across training and test sets in Split-III. It confirms that all classes are represented, supporting fair performance evaluation despite observed precision drops.

  18. COCO 2014 Dataset (for YOLOv3)

    • kaggle.com
    zip
    Updated Sep 9, 2021
    + more versions
    Cite
    Jeff Faudi (2021). COCO 2014 Dataset (for YOLOv3) [Dataset]. https://www.kaggle.com/datasets/jeffaudi/coco-2014-dataset-for-yolov3/discussion
    Explore at:
    zip (26852690979 bytes)
    Dataset updated
    Sep 9, 2021
    Authors
    Jeff Faudi
    License
    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 164K images.

    This is the original version from 2014, made available here for easy access on Kaggle and because it no longer seems to be available on the COCO Dataset website. It has been retrieved from the mirror that Joseph Redmon set up on his own website.

    Content

    The 2014 version of the COCO dataset is an excellent object detection dataset with 80 classes, 82,783 training images and 40,504 validation images. This dataset contains all this imagery in two folders, as well as the annotations with the class and location (bounding box) of the objects contained in each image.

    The initial split provides training (83K), validation (41K) and test (41K) sets. Since the split between training and validation was not optimal in the original dataset, there are also two text (.part) files with a new split that keeps only 5,000 images for validation and uses the rest for training. The test set has no labels and can be used for visual validation or pseudo-labelling.
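
    A minimal sketch for reading the alternative split lists; the .part file names and the one-image-path-per-line format are assumptions based on the description above and should be checked against the archive:

    ```python
    # Minimal sketch: read the new train/validation split lists (file names are assumed).
    def read_split(path):
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    train_imgs = read_split("trainvalno5k.part")  # assumed name: training image list
    val_imgs = read_split("5k.part")              # assumed name: 5,000 validation images
    print(len(train_imgs), "train images,", len(val_imgs), "val images")
    ```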

    Acknowledgements

    This is mostly inspired by Erik Linder-Norén and Joseph Redmon (https://pjreddie.com/darknet/yolo).

  19. Construction Sites and Nature Drone Images

    • kaggle.com
    zip
    Updated Jan 16, 2024
    Cite
    Bekhzod Olimov (2024). Construction Sites and Nature Drone Images [Dataset]. https://www.kaggle.com/datasets/killa92/construction-sites-and-nature-drone-images
    Explore at:
    zip (27884247 bytes)
    Dataset updated
    Jan 16, 2024
    Authors
    Bekhzod Olimov
    License
    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains 1,784 drone images of various construction types and nature. It is split into the three sets needed for machine learning and deep learning tasks, namely train, validation, and test splits. The structure of the data is as follows:

    • ROOT

      • train

        • cls_names
          • img_file;
          • img_file;
          • img_file;
          • ........
          • img_file.
      • valid

        • cls_names
          • img_file;
          • img_file;
          • img_file;
          • ........
          • img_file.
      • test

        • no_class
          • img_file;
          • img_file;
          • img_file;
          • ........
          • img_file.

    There are 1,427, 179, and 178 images in the train, validation, and test folders, respectively. The train and validation folders have specific classes, but the test set images have no class labels and must be predicted using an AI model during inference.
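
    Because train/ and valid/ use the class-subfolder layout shown above, they can be loaded directly with torchvision's ImageFolder; this is a sketch with a placeholder ROOT path and arbitrary transforms:

    ```python
    # Minimal sketch: load the train and valid splits with torchvision ImageFolder.
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_ds = datasets.ImageFolder("ROOT/train", transform=tfm)  # replace ROOT with the dataset path
    valid_ds = datasets.ImageFolder("ROOT/valid", transform=tfm)
    print(train_ds.classes, len(train_ds), len(valid_ds))  # class names and sample counts
    ```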

  20. DeepLearning Ensemble for Automated Acne Detection

    • kaggle.com
    zip
    Updated Jun 30, 2025
    Cite
    Syed Mohmmad Ali Jafri (2025). DeepLearning Ensemble for Automated Acne Detection [Dataset]. https://www.kaggle.com/datasets/asmrgaming/deeplearning-ensemble-for-automated-acne-detection/code
    Explore at:
    zip (676760577 bytes)
    Dataset updated
    Jun 30, 2025
    Authors
    Syed Mohmmad Ali Jafri
    Description

    AI-Powered Acne Detection using Ensemble Deep Learning

    This project presents an AI-based acne detection and severity assessment system that combines two deep learning models to analyze facial images. The approach integrates a classification model (ResNet50) and a localization model (YOLOv5) to provide both an overall severity prediction and detailed detection of individual acne lesions.

    The classification module predicts the severity of acne on the entire face and estimates the number of lesions present in each image. It uses KL divergence loss to improve training stability and outputs confidence scores for each prediction. The localization module, based on YOLOv5, detects the exact positions of acne lesions and classifies them into six types: comedones, papules, pustules, nodules, cysts, and scars. It uses bounding boxes with configurable confidence thresholds and supports real-time detection.

    The dataset used in this project contains 920 high-resolution facial images, annotated with 2,847 total lesions. The dataset is split into 637 training images, 194 validation images, and 89 test images. Annotations follow the YOLO format, and the class distribution is balanced across all acne types.
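
    Since the annotations follow the YOLO format, each label file holds one "class_id cx cy w h" line per lesion with coordinates normalized to 0-1; a minimal parsing sketch (the label path and the class-ID ordering are hypothetical):

    ```python
    # Minimal sketch: parse one YOLO-format label file into (class_id, cx, cy, w, h) tuples.
    def read_yolo_labels(path):
        boxes = []
        with open(path) as f:
            for line in f:
                cls, cx, cy, w, h = line.split()
                boxes.append((int(cls), float(cx), float(cy), float(w), float(h)))
        return boxes

    print(read_yolo_labels("labels/train/example_0001.txt"))  # hypothetical file name
    ```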

    The system evaluates performance using standard classification metrics such as accuracy, precision, recall, and F1-score, and uses mAP and IoU for object detection. The ensemble results are generated by combining the outputs of both models, which helps improve accuracy and reliability.

    This solution is implemented using Python 3.8 with the PyTorch framework. It requires YOLOv5 for detection and ResNet50 for classification. GPU acceleration with CUDA is recommended for training and inference.

    The codebase includes a complete pipeline for training, validation, testing, and inference. It is modular and easy to extend. Jupyter notebook examples are provided for quick experimentation and visualization.

    This project is suitable for various use cases including dermatology research, telemedicine, skincare applications, and educational tools. It demonstrates the value of combining classification and object detection models for practical medical image analysis.
