100+ datasets found
  1. Super Resolution Benchmarks

    • kaggle.com
    zip
    Updated Oct 3, 2022
    Cite
    Nanashi (2022). Super Resolution Benchmarks [Dataset]. https://www.kaggle.com/datasets/jesucristo/super-resolution-benchmarks
    Available download formats: zip (967921789 bytes)
    Dataset updated
    Oct 3, 2022
    Authors
    Nanashi
    License

    GNU General Public License v2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Image: https://github.com/mv-lab/swin2sr/raw/main/media/sr-teaser-2.png

    Please check the license of each dataset in particular and cite the corresponding authors. We borrowed some dataset information from SwinIR.

    Please check our project SWIN2SR https://github.com/mv-lab/swin2sr πŸš€ πŸ”₯

    We are not the owners of this data. We compiled the most famous benchmarks for:
    1. image super-resolution (SR)
    2. image denoising
    3. JPEG compression artifacts removal

    • Classical image super-resolution (SR): DIV2K validation + Set5 + Set14 + BSD100 + Urban100 + Manga109 - download here
    • Real-world image SR: RealSRSet and 5images - download here
    • Grayscale/color JPEG compression artifact reduction: Classic5 + LIVE1 - download here
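
    These benchmarks are typically used by comparing a model's output against the ground-truth image with fidelity metrics such as PSNR. A minimal sketch (the file names are placeholders, not files from this dataset):

    # Compute PSNR between a super-resolved image and its ground truth.
    # File names are placeholders; point them at matching images from any benchmark above.
    import numpy as np
    from PIL import Image

    def psnr(ref, test, max_val=255.0):
        mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

    gt = np.array(Image.open("ground_truth.png").convert("RGB"))
    sr = np.array(Image.open("model_output.png").convert("RGB"))
    print(f"PSNR: {psnr(gt, sr):.2f} dB")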

    Image: https://github.com/mv-lab/swin2sr/raw/main/media/swin2sr.png

  2. Credit Risk Benchmark Dataset

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Adil Shamim (2025). Credit Risk Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/credit-risk-benchmark-dataset
    Available download formats: zip (316073 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview:
    This dataset has been designed as a benchmark for AutoML and predictive modeling in the financial domain. It focuses on assessing credit risk by predicting whether a borrower will experience serious delinquency within two years. The data comprises a mix of financial metrics and personal attributes, which allow users to build and evaluate models for credit risk scoring.

    Dataset Characteristics:

    • Total Features: 10 predictors and 1 target variable.
    • Data Types: All predictors are numerical (real numbers) while the target variable is binary ({0, 1}).
    • Task: Binary classification focused on credit risk prediction.

    Column Descriptions:
    Below is a list of the available columns along with their abbreviated names for ease-of-use:

    • rev_util: Ratio of revolving credit utilization (balance/credit limit)
    • age: Age of the borrower
    • late_30_59: Number of times 30-59 days past due (worse than current)
    • debt_ratio: Debt to income (or assets) ratio
    • monthly_inc: Monthly income of the borrower
    • open_credit: Number of open credit lines and loans
    • late_90: Number of times 90 days or more late on a payment
    • real_estate: Number of real estate loans or credit lines
    • late_60_89: Number of times 60-89 days past due (worse than current)
    • dependents: Number of dependents
    • dlq_2yrs: Target variable indicating if a serious delinquency occurred in the next 2 years (0 = No, 1 = Yes)

    Use Cases and Applications:
    • Risk Management: Build and validate credit scoring models to forecast borrower default risks.
    • AutoML Benchmarking: Evaluate and compare the performance of various AutoML frameworks on a structured, financial dataset.
    • Academic Research: Explore trends and relationships in credit behavior, along with the predictive power of financial indicators.
    • Model Interpretability: Given the regulated nature of financial models, this dataset provides an excellent context for testing feature importance and creating explainable AI solutions.

    Additional Information:
    - Preprocessing & Feature Engineering: Users are encouraged to perform exploratory data analysis, handle potential missing values or outliers, and experiment with scaling techniques and feature transformations.
    - Regulatory Considerations: Since credit scoring models often require transparency, it’s important to incorporate techniques that ensure model interpretability.
    - Benchmarking: Ideal for comparing traditional modeling techniques (like logistic regression) with modern approaches (such as gradient boosting and neural networks).

    This dataset is now available on Kaggle for anyone looking to experiment with or benchmark predictive models for credit risk analysis. Whether you're a data scientist, researcher, or financial analyst, the dataset provides a straightforward yet robust framework for exploring credit-related behavior and risk factors.
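
    As a minimal illustration of the benchmarking use case, the sketch below fits a plain logistic-regression baseline. The CSV file name is an assumption; only the column names listed above are taken from the dataset description.

    # Baseline logistic regression for the dlq_2yrs target; the file name is assumed.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    df = pd.read_csv("credit_risk_benchmark.csv")   # assumed file name
    X = df.drop(columns=["dlq_2yrs"])               # the 10 numerical predictors
    y = df["dlq_2yrs"]                              # 1 = serious delinquency within 2 years

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    med = X_tr.median()                             # simple median imputation for missing values
    model = LogisticRegression(max_iter=1000).fit(X_tr.fillna(med), y_tr)
    print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te.fillna(med))[:, 1]))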

  3. Benchmarks datasets for cluster analysis

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Onthada Preedasawakul (2023). Benchmarks datasets for cluster analysis [Dataset]. https://www.kaggle.com/datasets/onthada/benchmarks-datasets-for-clustering
    Available download formats: zip (608532 bytes)
    Dataset updated
    Nov 15, 2023
    Authors
    Onthada Preedasawakul
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    25 Artificial Datasets

    The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.

    Cluster analysis is a popular machine learning technique used for segmenting datasets by placing similar data points in the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.
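
    For those working in Python rather than R, the same kind of evaluation can be sketched as follows (synthetic Gaussian clusters, not one of the 25 datasets themselves):

    # Cluster synthetic Gaussian data with K-means and score the result with
    # an external index (ARI, uses the known labels) and an internal index (silhouette).
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, true_labels = make_blobs(n_samples=600, centers=3, cluster_std=1.2, random_state=42)
    pred_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

    print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))
    print("Silhouette score:", silhouette_score(X, pred_labels))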

    For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785

    All the datasets are also available on GitHub at https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git.

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F17645646%2Fa2f87fbad212a908718535589681a703%2Frealplot.jpeg?generation=1700111724010268&alt=media

  4. NLP SOTA Benchmarks

    • kaggle.com
    zip
    Updated May 23, 2023
    Cite
    Mauro (2023). NLP SOTA Benchmarks [Dataset]. https://www.kaggle.com/datasets/mauromauro/nlp-sota-benchmarks
    Available download formats: zip (82642 bytes)
    Dataset updated
    May 23, 2023
    Authors
    Mauro
    Description

    Data Origin: The dataset consists of records containing Model-Metric-Date triplets for benchmark datasets of machine learning tasks; the data was taken from Papers With Code (paperswithcode.com). The .csv files in the dataset contain URLs to the original Papers With Code pages in which the research papers of interest are introduced.

    Disclaimer: The data is publicly accessible on the aforementioned source for free, and this subsample was collected and aggregated into a Kaggle dataset for educational and research purposes.

  5. List Benchmarks Kaggle

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    Mohammad Farid Hendianto (2025). List Benchmarks Kaggle [Dataset]. https://www.kaggle.com/datasets/ireddragonicy/list-benchmarks-kaggle
    Available download formats: zip (78480638 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    Mohammad Farid Hendianto
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Mohammad Farid Hendianto

    Released under Apache 2.0

    Contents

  6. One Billion Words Benchmark

    • kaggle.com
    zip
    Updated Dec 17, 2021
    Cite
    Alexander Renz-Wieland (2021). One Billion Words Benchmark [Dataset]. https://www.kaggle.com/datasets/alexrenz/one-billion-words-benchmark
    Available download formats: zip (1107323843 bytes)
    Dataset updated
    Dec 17, 2021
    Authors
    Alexander Renz-Wieland
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the One Billion Words Benchmark text corpus, originally published in https://arxiv.org/abs/1312.3005 and available at https://www.statmt.org/lm-benchmark/.

    We removed stop words, using the gensim stopword list.
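
    The preprocessing was presumably along these lines (a sketch using gensim's built-in stopword list, not the exact script used for this corpus):

    # Remove stop words from a line of text with gensim's stopword list.
    from gensim.parsing.preprocessing import remove_stopwords

    line = "the quick brown fox jumps over the lazy dog"
    print(remove_stopwords(line))  # e.g. "quick brown fox jumps lazy dog"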

    This is the text corpus that was used in the paper NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access (https://arxiv.org/abs/2104.00501).

  7. AI Platform Performance Dataset

    • kaggle.com
    zip
    Updated Sep 20, 2024
    Cite
    Satya Prakash Swain (2024). AI Platform Performance Dataset [Dataset]. https://www.kaggle.com/datasets/satyaprakashswain/ai-platform-performance-dataset
    Available download formats: zip (8734 bytes)
    Dataset updated
    Sep 20, 2024
    Authors
    Satya Prakash Swain
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset compares the performance of various AI platforms across different tasks and metrics. It is designed for use in Kaggle competitions and analysis.

    Columns

    • Platform Name: Name of the AI platform or framework
    • Task Type: Type of AI task (e.g., Image Classification, Natural Language Processing, Object Detection)
    • Dataset: Name of the benchmark dataset used
    • Model Architecture: The specific model architecture used for the task
    • Accuracy: Accuracy score for the given task (percentage)
    • Training Time: Time taken to train the model (in hours)
    • Inference Time: Time taken for inference (in milliseconds)
    • GPU Memory Usage: GPU memory consumed during training (in GB)
    • Energy Consumption: Energy consumed during training (in kWh)
    • Date: Date of the performance measurement

    Notes

    • This dataset is synthetic and for demonstration purposes. Real-world performance may vary.
    • Performance metrics are collected under standardized conditions, but may not reflect all use cases.
    • Regular updates are recommended to keep the dataset current with the latest AI advancements.

    Potential Uses

    • Comparing AI platform performance across different tasks
    • Analyzing trade-offs between accuracy, speed, and resource consumption
    • Tracking improvements in AI platforms over time
    • Helping data scientists choose the most suitable platform for their specific needs
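
    A minimal sketch of the trade-off analysis listed above; the CSV file name and the exact column headers are assumptions based on the column list:

    # Rank platforms by mean accuracy per kWh of training energy; file name and headers assumed.
    import pandas as pd

    df = pd.read_csv("ai_platform_performance.csv")
    df["Accuracy per kWh"] = df["Accuracy"] / df["Energy Consumption"]
    summary = (df.groupby("Platform Name")[["Accuracy", "Training Time", "Accuracy per kWh"]]
                 .mean()
                 .sort_values("Accuracy per kWh", ascending=False))
    print(summary.head())
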
  8. LexGLUE: Legal NLP Benchmark

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). LexGLUE: Legal NLP Benchmark [Dataset]. https://www.kaggle.com/datasets/thedevastator/lexglue-legal-nlp-benchmark-dataset
    Available download formats: zip (343671820 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LexGLUE: Legal NLP Benchmark

    Legal NLP Benchmark Dataset: LexGLUE

    By lex_glue (From Huggingface) [source]

    About this dataset

    The LexGLUE dataset is a comprehensive benchmark dataset specially created to evaluate the performance of natural language processing (NLP) models in various legal tasks. This dataset draws inspiration from the success of other multi-task NLP benchmarks like GLUE and SuperGLUE, as well as similar initiatives in different domains.

    The primary objective of LexGLUE is to advance the development of versatile models that can effectively handle multiple legal NLP tasks without requiring extensive task-specific fine-tuning. By providing a standardized evaluation platform, this dataset aims to foster innovation and advancements in the field of legal language understanding.

    The dataset consists of several columns that provide crucial information for each entry. The context column contains the specific text or document from which each legal language understanding task is derived, offering essential background information for proper comprehension. The endings column presents multiple potential options or choices that could complete the legal task at hand, enabling comprehensive evaluation.

    Furthermore, there are various columns related to labels and target categories associated with each entry. The label column represents the correct or expected answer for a given task, ensuring accuracy in model predictions during evaluation. The labels column provides categorical information regarding target labels or categories relevant to the respective legal NLP task.

    Another important element within this dataset is the text column, which contains the actual input text representing a particular legal scenario or context for analysis. Analyzing this text forms an integral part of conducting accurate and effective NLP tasks within a legal context.

    To facilitate efficient model performance assessment on diverse aspects of legal language understanding, additional files are included in this benchmark dataset: case_hold_test.csv comprises case contexts with multiple potential endings labeled as valid holdings or not; ledgar_validation.csv serves as a validation set specifically designed for evaluating NLP models' performance on legal tasks; ecthr_b_test.csv contains samples related to European Court of Human Rights (ECtHR) along with their corresponding labels for testing the capabilities of legal language understanding models in this domain.

    In short, the LexGLUE dataset serves as a crucial resource for researchers and practitioners to benchmark and advance the state of the art in legal NLP tasks.
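
    Since the data mirrors the lex_glue collection on Hugging Face, one convenient way to work with a task is through the datasets library; the sketch below assumes the hub copy, while the Kaggle CSVs (e.g. case_hold_test.csv) can be read with pandas instead.

    # Load the CaseHOLD task of LexGLUE from the Hugging Face hub (requires the `datasets` package).
    from datasets import load_dataset

    case_hold = load_dataset("lex_glue", "case_hold", split="test")
    example = case_hold[0]
    print(example["context"][:200])  # excerpt of the legal context
    print(example["endings"])        # candidate holdings
    print(example["label"])          # index of the correct holding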

    Research Ideas

    • Training and evaluating NLP models: The LexGLUE dataset can be used to train and evaluate natural language processing models specifically designed for legal language understanding tasks. By using this dataset, researchers and developers can test the performance of their models on various legal NLP tasks, such as legal case analysis or European Court of Human Rights (ECtHR) related tasks.
    • Developing generic NLP models: The benchmark dataset is designed to push towards the development of generic models that can handle multiple legal NLP tasks with limited task-specific fine-tuning. Researchers can use this dataset to develop robust and versatile NLP models that can effectively understand and analyze legal texts.
    • Comparing different algorithms and approaches: LexGLUE provides a standardized benchmark for comparing different algorithms and approaches in the field of legal language understanding. Researchers can use this dataset to compare the performance of different techniques, such as rule-based methods, deep learning models, or transformer architectures, on various legal NLP tasks. This allows for a fair comparison between different approaches and facilitates progress in the field by identifying effective methods for solving specific legal language understanding challenges

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: case_hold_test.csv | Column name | Description ...

  9. Performance vs. Predicted Performance

    • kaggle.com
    zip
    Updated Dec 21, 2022
    Cite
    Calathea21 (2022). Performance vs. Predicted Performance [Dataset]. https://www.kaggle.com/datasets/daphnelenders/performance-vs-predicted-performance
    Available download formats: zip (835822 bytes)
    Dataset updated
    Dec 21, 2022
    Authors
    Calathea21
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains information about high school students and their actual and predicted performance on an exam. Most of the information, including general background information about the students and their grade on an exam, was taken from an already existing dataset, while the predicted exam performance was collected in a human experiment. In this experiment, participants were shown short descriptions of the students (based on the information in the original data) and had to rank and grade them according to their expected performance. Prior to this task, some participants were exposed to a "Stereotype Activation" manipulation, suggesting that boys perform less well in school than girls.

    Description of *original_data.csv*

    Based on this dataset (which is also available on kaggle), we extracted a number of student profiles that participants had to make grade predictions for. For more information about this dataset we refer to the corresponding kaggle page: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption

    Note that we performed some preprocessing on the original data:

    • The original data consisted of two parts: the information about students following a Maths course and the information about students following a Portuguese course. Since the same type of information was recorded in both datasets, we merged them and added a column "subject" to show which course each student belongs to.

    • We excluded all data where G3 = 0 (i.e. the grade for the last exam = 0)

    • From original_data.csv we randomly sampled 856 students that participants in our study had to make grade predictions for.

    Description of *CompleteDataAndBiases.csv*

    index - this column corresponds to the indices in the file "original_data.csv". Through these indices, it is possible to add columns from the original data to the dataset with the grade predictions

    ParticipantID - the ID of the participant who made the performance predictions for the corresponding student. Predictions needed to be made for 856 students, and each participant made 8 predictions total. Thus there are 107 different participant IDs

    name - to make the prediction task more engaging for participants, each of the 8 student profiles that participants had to grade & rank was randomly matched to one of four boys' or girls' names (depending on the sex of the student)

    sex - the sex of each student, either female (F) or male (M). For benchmarking fair ML algorithms, this can be used as the sensitive attribute. We assume that in the fair version of the decision variable ("Pass"), no sex discrimination occurs. The biased versions of the variable ("Predicted Pass") are mostly discriminatory towards male students; a sketch of this comparison appears at the end of this description.

    studytime - this variable is taken from the original dataset and denotes how long a student studied for their exam. In the original data this variable consisted of four levels (less than 2 hours vs. 2-5 hours vs. 5-10 hours vs. more than 10 hours). We binned the latter two levels together and encoded this column numerically from 1-3.

    freetime - Originally, this variable ranged from 1 (very low) to 5 (very high). We binned this variable into three categories, where level 1 and 2 are binned, as well as level 4 and 5.

    romantic - Binary variable, denoting whether the student is in a romantic relationship or not.

    Walc - This variable shows how much alcohol each student consumes in the weekend. Originally it ranged from 1 to 5 (5 corresponding to the highest alcohol consumption), but we binned the last two levels together.

    goout - This variable shows how often a student goes out in a week. Originally it ranged from 1 to 5 (5 corresponding to going out very often), but we binned the last two levels together.

    Parents_edu - This variable was not present in the original dataset. Instead, the original dataset contained two variables, "mum_edu" and "dad_edu". We obtained "Parents_edu" by taking the higher of the two. The variable consists of 4 levels, where 4 = highest level of education.

    absences - This variable shows the number of absences per student. Originally it ranged from 0 - 93, but because large numbers of absences were infrequent, we binned all absences of >=7 into one level.

    reason - The reason why a student chose to go to the school in question. The levels are close to home, school's reputation, school's curriculum and other.

    G3 - The actual grade each student received for the final exam of the course, ranging from 0-20.

    Pass - A binary variable showing whether G3 is a passing grade (i.e. >=10) or not.

    Predicted Grade - The grade the student was predicted to receive in our experiment

    Predicted Rank - In our ex...
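
    A minimal sketch of comparing the fair ("Pass") and biased ("Predicted Pass") decision variables per sex group, assuming both columns are coded as 0/1 in CompleteDataAndBiases.csv:

    # Compare pass rates by sex for the actual and the human-predicted outcome.
    import pandas as pd

    df = pd.read_csv("CompleteDataAndBiases.csv")
    rates = df.groupby("sex")[["Pass", "Predicted Pass"]].mean()
    print(rates)
    print("Demographic parity gap (Predicted Pass):",
          abs(rates.loc["F", "Predicted Pass"] - rates.loc["M", "Predicted Pass"]))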

  10. πŸŸ₯AMD - CPU Benchmarks (UserBenchmark)πŸ“Š

    • kaggle.com
    zip
    Updated May 3, 2022
    Cite
    πŸ’₯AlienπŸ’₯ (2022). πŸŸ₯AMD - CPU Benchmarks (UserBenchmark)πŸ“Š [Dataset]. https://www.kaggle.com/datasets/alanjo/amd-cpu-benchmarks
    Available download formats: zip (7339 bytes)
    Dataset updated
    May 3, 2022
    Authors
    πŸ’₯AlienπŸ’₯
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Benchmarks allow for easy comparison between multiple CPUs by scoring their performance on a standardized series of tests, and they are useful in many instances, for example when buying or building a new PC.

    Content

    Newest data as of May 2nd, 2022. This dataset contains benchmarks of AMD processors.

    Acknowledgements

    Data scraped from UserBenchmark.

    Inspiration

    When Lisa Su became CEO of Advanced Micro Devices in 2014, the company was on the brink of bankruptcy. Since then, AMD's stock has soared from less than US $2 per share to more than $110, and the company is now a leader in high-performance computing. She funneled billions of dollars into research and development, while Intel funneled its R&D funds into executive pay. Now Intel is losing a large portion of the market share it once dominated.

    If you enjoyed this dataset, here are some similar datasets you may like 😎

  11. Large Language Models Comparison Dataset

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Samay Ashar (2025). Large Language Models Comparison Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset
    Available download formats: zip (5894 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Samay Ashar
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.

    Key Details:

    • File Name: llm_comparison_dataset.csv
    • Size: 14.57 kB
    • Total Columns: 15
    • License: CC0 (Public Domain)

    What’s Inside?

    Here are some of the key metrics included in the dataset:

    1. Context Window: Maximum number of tokens the model can process at once.
    2. Speed (tokens/sec): How fast the model generates responses.
    3. Latency (sec): Time delay before the model responds.
    4. Benchmark Scores: Performance ratings from MMLU (academic tasks) and Chatbot Arena (real-world chatbot performance).
    5. Open-Source: Indicates if the model is publicly available or proprietary.
    6. Price per Million Tokens: The cost of using the model for one million tokens.
    7. Training Dataset Size: Amount of data used to train the model.
    8. Compute Power: Resources needed to run the model.
    9. Energy Efficiency: How much power the model consumes.

    This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.
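
    As a small worked example of the pricing metric, the sketch below estimates the cost of a fixed workload from the per-million-token price. The file name comes from the description above, but the column headers ("Model", "Price per Million Tokens") are assumptions:

    # Estimate the cost of a 50,000-token workload per model; column headers are assumed.
    import pandas as pd

    df = pd.read_csv("llm_comparison_dataset.csv")
    tokens = 50_000
    df["Estimated Cost (USD)"] = df["Price per Million Tokens"] * tokens / 1_000_000
    print(df.sort_values("Estimated Cost (USD)")[["Model", "Estimated Cost (USD)"]].head())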

    πŸ“ŒIf you find this dataset useful, do give an upvote :)

  12. STS_Benchmark

    • kaggle.com
    zip
    Updated Mar 4, 2024
    Cite
    Astha0410 (2024). STS_Benchmark [Dataset]. https://www.kaggle.com/datasets/astha0410/sts-benchmark
    Available download formats: zip (439568 bytes)
    Dataset updated
    Mar 4, 2024
    Authors
    Astha0410
    Description

    Dataset

    This dataset was created by Astha0410

    Contents

  13. MultiOrg

    • kaggle.com
    Updated May 28, 2024
    Cite
    Christina Bukas (2024). MultiOrg [Dataset]. http://doi.org/10.34740/kaggle/ds/5097172
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 28, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Christina Bukas
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We release a large lung organoid 2D microscopy image dataset, for multi-rater benchmarking of object detection methods and to study uncertainty estimation. The dataset comprises more than 400 images of an entire microscopy plate well, along with more than 60,000 annotated organoids, deriving from different biological experimental setups, where two different types of organoids grew under varying conditions. The organoids in the dataset were annotated by two expert annotators by fitting the organoids within bounding boxes.

    Most importantly, we introduce three unique label sets for our test set images, which derive from the two annotators at different time points, allowing for quantification of label noise.

    Join our MultiOrg challenge now to develop annotation-noise-aware models!!

    Images

    All images are in the TIFF file format. Image resolution is 1.29 Β΅m in x and y.

    Labels

    The labels correspond to bounding boxes fitted around each organoid in the image.

    Train set

    All annotations in the train set are in JSON format. The end of the filename gives information on the annotator who created the labels (all annotations in the train set are from time point t0). For example, image_1_Annotator_A.json means that image 1 (in the same directory) was labelled by Annotator A.

    The annotation file comprises a dictionary where the keys are the bounding box ids and each entry contains the four corner points of the bounding box, i.e. for each bounding box: p0 = (x1, y1), p1 = (x2, y1), p2 = (x2, y2), p3 = (x1, y2). Note that x1 corresponds to the minimum row value, y1 to the minimum column value, x2 to the maximum row value and y2 to the maximum column value.
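
    A small sketch of reading one train-set annotation file and reducing each four-corner box to (x_min, y_min, x_max, y_max); the exact JSON nesting is assumed from the description above:

    # Parse a MultiOrg train annotation file; x is the row coordinate, y the column coordinate.
    import json

    with open("image_1_Annotator_A.json") as f:
        boxes = json.load(f)  # {box_id: four corner points}

    for box_id, corners in boxes.items():
        # Accept either list-of-points or dict-of-points serializations.
        pts = list(corners.values()) if isinstance(corners, dict) else list(corners)
        pts = [list(p.values()) if isinstance(p, dict) else list(p) for p in pts]
        xs, ys = [p[0] for p in pts], [p[1] for p in pts]
        print(box_id, (min(xs), min(ys), max(xs), max(ys)))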

    Test set

    This dataset consists of three label sets for the test set:

    • test0: Annotations at time point t0; for each image they can belong to Annotator A or B. Images with IDs 1-22 were annotated by A and 23-55 by B.
    • test1_A: Annotations at time point t1, annotated by Annotator A
    • test1_B: Annotations at time point t1, annotated by Annotator B

    While the label set test0 is made directly available here, to indirectly access the labels for test1_A and test1_B one must join our MultiOrg competition and submit solutions to the leaderboard!

    Object detection benchmark

    To run our object detection benchmark with MultiOrg you will need to run the notebooks we provide:

    • create-benchmark-dataset
    • multiorg-detection-benchmark

    Provided Data structure

    The dataset is structured in the following way:

    β”œβ”€ train -> The train set, consisting of 356 images
    β”‚ β”œβ”€β”€ Macros -> The experimental setup (Macros or Normal)
    β”‚ β”œβ”€β”€β”€ Plate_1 -> Contains all images from this plate; 26 plates, or experiments, are available in total
    β”‚ β”œβ”€β”€β”€β”€ image_0 -> Contains all files related to this image
    β”‚ β”œβ”€β”€β”€β”€β”€ image_0.tiff -> The image in TIFF format
    β”‚ β”œβ”€β”€β”€β”€β”€ image_0_Annotator_A.json -> The annotation in JSON format, with information on the annotator (A or B) in the file name
    β”‚ β”œβ”€β”€ Normal -> The experimental setup (Macros or Normal)
    └─ test -> The test set, consisting of 55 images; annotations provided only for label set test0
    β”‚ β”œβ”€β”€ Macros
    β”‚ β”œβ”€β”€β”€ Plate_4
    β”‚ β”œβ”€β”€β”€β”€ image_0
    β”‚ β”œβ”€β”€β”€β”€β”€ image_0.tiff
    β”‚ β”œβ”€β”€β”€β”€β”€ image_0_t0_A.json -> The annotation, with information on the time point (here t0) and the annotator (A or B)
    β”‚ β”œβ”€β”€ Normal

  14. Arcade Natural Language to Code Challenge

    • kaggle.com
    zip
    Updated Feb 22, 2023
    Cite
    Google AI (2023). Arcade Natural Language to Code Challenge [Dataset]. https://www.kaggle.com/datasets/googleai/arcade-nl2code-dataset
    Available download formats: zip (3921922 bytes)
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    Google AI
    Description

    Arcade: Natural Language to Code Generation in Interactive Computing Notebooks

    Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracies of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.

    Note πŸ‘‰ This Kaggle dataset only contains the dataset files of Arcade. Refer to our main GitHub repository for detailed instructions on using this dataset.

    Folder Structure

    Below is the structure of its content:

    └── ./
      β”œβ”€β”€ existing_tasks # Problems derived from existing data science notebooks on Github/
      β”‚  β”œβ”€β”€ metadata.json # Metadata by `build_existing_tasks_split.py` to reproduce this split.
      β”‚  β”œβ”€β”€ artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
      β”‚  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      β”œβ”€β”€ new_tasks/
      β”‚  β”œβ”€β”€ dataset.json # Original, unpreprocessed dataset
      β”‚  β”œβ”€β”€ kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
      β”‚  β”œβ”€β”€ artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
      β”‚  └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
      └── checksums.txt # Table of MD5 checksums of dataset files.
    

    Dataset File Structure

    All the dataset '*.json' files follow the same structure. Each dataset file is a Json-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:

    {
      "notebook_name": "Name of the notebook.",
      "work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
      "annotator": "Anonymized annotator Id."
      "turns": [
        # A list of natural language to code examples using the current notebook context.
        {
          "input": "Prompt to a code generation model.",
          "turn": {
            "intent": {
              "value": "Annotated NL intent for the current turn.",
              "is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
              "cell_idx": "Index of the intent Markdown cell.",
              "line_span": "Line span of the intent.",
              "not_sure": "Annotation confidence.",
              "output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
            },
            "code": {
              "value": "Reference code solution.",
              "cell_idx": "Cell index of the code cell containing the solution.",
              "num_lines": "Number of lines in the reference solution.",
              "line_span": "Line span.",
            },
            "code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
            "delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
            "metadata": {
              "annotator_id": "Annotator Id",
              "num_code_lines": "Metadata, please ignore.",
              "utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
            },
          },
          "notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
          "metadata": {
            # A dict of metadata of this turn.
            "context_cells": [ # A list of context cells before the problem.
              {
                "cell_type": "code|markdown",
                "source": "Cell content."
              },
            ],
            "delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
            # The following fields only occur in datasets inlined with schema descriptions.
            "context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
            "inten...
    
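    A small sketch of iterating over the episodes and turns in one of these dataset files; the field names follow the structure documented above, and the path is a placeholder:

    # Iterate over episodes and their NL-to-code turns in an Arcade dataset file.
    import json

    with open("new_tasks/dataset.json") as f:
        episodes = json.load(f)  # a JSON-serialized list of episodes

    for episode in episodes[:1]:
        print("Notebook:", episode["notebook_name"])
        for turn_record in episode["turns"]:
            turn = turn_record["turn"]
            print("  Intent:", turn["intent"]["value"])
            print("  Reference solution:", turn["code"]["value"])
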
  15. Industrial Dataset

    • kaggle.com
    zip
    Updated May 8, 2023
    Cite
    Be Schue (2023). Industrial Dataset [Dataset]. https://www.kaggle.com/datasets/beschue/industrial-classification-data-set
    Available download formats: zip (1287075520 bytes)
    Dataset updated
    May 8, 2023
    Authors
    Be Schue
    Description

    The dataset includes 10 object categories from the MVTEC INDUSTRIAL 3D OBJECT DETECTION DATASET as input CAD objects. The selected objects include a diverse range of industrial products:

    S.No  Object Class
    1     adapter plate triangular
    2     bracket big
    3     clamp small
    4     engine part cooler round
    5     engine part cooler square
    6     injection pump
    7     screw
    8     star
    9     tee connector
    10    thread

    The dataset contains a total of 100,000 RGB images (10,000 per object category), divided into three sets: 70,000 for training, 20,000 for testing, and 10,000 for validation. Each image has a resolution of 224 x 224 and is in JPEG format.

    To ensure the suitability of our dataset for various computer vision tasks, we included not only the class labels but also generated bounding boxes and semantic masks for each image, which are stored in COCO annotation format. Each image contains one instance of the ten selected objects.
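
    A minimal sketch of reading the COCO-format annotations with pycocotools; the annotation file name is a placeholder, since the exact file layout is not listed here:

    # Print category name and bounding box for each annotation of the first image.
    from pycocotools.coco import COCO

    coco = COCO("annotations/train.json")  # placeholder path
    img_id = coco.getImgIds()[0]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        print(coco.loadCats(ann["category_id"])[0]["name"], ann["bbox"])  # bbox = [x, y, width, height]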

    Throughout the 10,000 images for each class, we randomly varied the position of the object in x-y-z direction and the object’s rotation to provide a diverse range of images. Additionally, we changed the object’s surface to a smooth metallic texture, imitating real industrial components. Lastly, we varied the lighting conditions within each image, including the position of the light sources, their energy, and emission strength.

    Find out more about our Data Generation Tool:

    Schuerrle, B., Sankarappan, V., & Morozov, A. (2023). SynthiCAD: Generation of Industrial Image Data Sets for Resilience Evaluation of Safety-Critical Classifiers. In Proceeding of the 33rd European Safety and Reliability Conference. 33rd European Safety and Reliability Conference. Research Publishing Services. https://doi.org/10.3850/978-981-18-8071-1_p400-cd

  16. GPU and CPU benchmark

    • kaggle.com
    zip
    Updated May 22, 2022
    Cite
    sewonghwang (2022). GPU and CPU benchmark [Dataset]. https://www.kaggle.com/datasets/sewonghwang/gpu-cpu-benchmark
    Available download formats: zip (10053 bytes)
    Dataset updated
    May 22, 2022
    Authors
    sewonghwang
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by sewonghwang

    Released under CC0: Public Domain

    Contents

  17. Cyclist Dataset for Object Detection

    • kaggle.com
    zip
    Updated Mar 15, 2022
    Cite
    SemiEmptyGlass (2022). Cyclist Dataset for Object Detection [Dataset]. https://www.kaggle.com/datasets/semiemptyglass/cyclist-dataset
    Available download formats: zip (2319730694 bytes)
    Dataset updated
    Mar 15, 2022
    Authors
    SemiEmptyGlass
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    Cyclist Dataset

    Tsinghua-Daimler Cyclist Detection Benchmark Dataset in yolo format for Object Detection

    Context

    I'm not the owner of this dataset; all the credit goes to X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li and D. M. Gavrila, the creators of this dataset.

    Content

    • img size - 2048x1024
    • 13.7k labeled images (1000 images have no cyclists)
    • labels in yolo format: id center_x center_y width height (relative to image width and height)

    Example yolo bounding box:

    0 0.41015625 0.44140625 0.0341796875 0.11328125
    
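    Converting such a label line back to absolute pixel corners for the 2048x1024 images, as a quick sketch:

    # Convert a YOLO label line (class cx cy w h, all relative) to pixel corner coordinates.
    IMG_W, IMG_H = 2048, 1024

    line = "0 0.41015625 0.44140625 0.0341796875 0.11328125"
    cls, cx, cy, w, h = line.split()
    cx, cy, w, h = float(cx) * IMG_W, float(cy) * IMG_H, float(w) * IMG_W, float(h) * IMG_H
    x_min, y_min, x_max, y_max = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    print(int(cls), (round(x_min), round(y_min), round(x_max), round(y_max)))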

    Acknowledgments

    License Terms

    This dataset is made freely available for non-commercial purposes such as academic research, teaching, scientific publications, or personal experimentation. Permission is granted to use, copy, and distribute the data provided that you agree:

    • That the dataset comes "AS IS", without express or implied warranty. Although every effort has been made to ensure accuracy, Daimler (or the website host) does not accept any responsibility for errors or omissions.
    • That you include a reference to the above publication in any published work that makes use of the dataset.
    • That if you have altered the content of the dataset or created derivative work, prominent notices are made so that any recipients know that they are not receiving the original data.
    • That you may not use or distribute the dataset or any derivative work for commercial purposes as, for example, licensing or selling the data, or using the data with a purpose to procure a commercial gain.
    • That this original license notice is retained with all copies or derivatives of the dataset.
    • That all rights not expressly granted to you are reserved by Daimler.

    Cite

    X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li and D. M. Gavrila. A New Benchmark for Vision-Based Cyclist Detection. In Proc. of the IEEE Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, pp.1028-1033, 2016.
    
  18. High Freq FX benchmark submissions Mar 25

    • kaggle.com
    zip
    Updated May 2, 2025
    + more versions
    Cite
    Autonity (2025). High Freq FX benchmark submissions Mar 25 [Dataset]. https://www.kaggle.com/datasets/autonity/high-freq-fx-benchmark-submissions-mar-25
    Available download formats: zip (40095947 bytes)
    Dataset updated
    May 2, 2025
    Authors
    Autonity
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Autonity

    Released under CC0: Public Domain

    Contents

  19. The NeuroTask Benchmark Dataset

    • kaggle.com
    zip
    Updated Jan 14, 2025
    Cite
    Carolina Filipe (2025). The NeuroTask Benchmark Dataset [Dataset]. https://www.kaggle.com/datasets/carolinafilipe/neurotask-multi-tasks-benchmark-dataset
    Available download formats: zip (6088832856 bytes)
    Dataset updated
    Jan 14, 2025
    Authors
    Carolina Filipe
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    NeuroTask is a benchmark dataset designed to facilitate the development of accurate and efficient methods for analyzing multi-session, multi-task, and multi-subject neural data. NeuroTask integrates 6 datasets from motor cortical regions, covering 7 tasks across 19 subjects.

    This dataset includes:

    • Spike counts per unit
    • Behavioral data (hand/cursor position, velocity, force)
    • Indices for dataset, session, subject, and trial

    The indices are included to uniquely identify each session using datasetID, animal, and session.

    The file naming convention is as follows:

    datasetID _ bin size _ dataset name _ task.parquet

    Check out the github repository for more resources and some example notebooks: https://github.com/catniplab/NeuroTask/
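
    A small sketch of loading one of the parquet files and counting trials per session; the file name follows the naming convention above but is hypothetical, and the index column names are assumptions:

    # Count trials per (datasetID, animal, session); file name and column names are assumed.
    import pandas as pd

    df = pd.read_parquet("1_20ms_dataset_name_task.parquet")
    sessions = df.groupby(["datasetID", "animal", "session"])["trial"].nunique()
    print(sessions.head())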

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F20742846%2F85b47e421f30f4203cb97ceb78f2d2f6%2FNeuroTask3.png?generation=1716989002860465&alt=media

  20. deepfake-benchmark-dfdcp

    • kaggle.com
    zip
    Updated Mar 4, 2025
    + more versions
    Cite
    lephanminhkhoa (2025). deepfake-benchmark-dfdcp [Dataset]. https://www.kaggle.com/datasets/lephanminhkhoa/deepfake-benchmark-dfdcp
    Available download formats: zip (9612147346 bytes)
    Dataset updated
    Mar 4, 2025
    Authors
    lephanminhkhoa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by lephanminhkhoa

    Released under Apache 2.0

    Contents
