20 datasets found
  1. Data from: The regressinator: A simulation tool for teaching regression...

    • tandf.figshare.com
    txt
    Updated Jun 18, 2025
    Cite
    Alex Reinhart (2025). The regressinator: A simulation tool for teaching regression assumptions and diagnostics in R [Dataset]. http://doi.org/10.6084/m9.figshare.29361136.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 18, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Alex Reinhart
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When students learn linear regression, they must learn to use diagnostics to check and improve their models. Model-building is an expert skill requiring the interpretation of diagnostic plots, an understanding of model assumptions, the selection of appropriate changes to remedy problems, and an intuition for how potential problems may affect results. Simulation offers opportunities to practice these skills, and is already widely used to teach important concepts in sampling, probability, and statistical inference. Visual inference, which uses simulation, has also recently been applied to regression instruction. This article presents the regressinator, an R package designed to facilitate simulation and visual inference in regression settings. Simulated regression problems can be easily defined with minimal programming, using the same modeling and plotting code students may already learn. The simulated data can then be used for model diagnostics, visual inference, and other activities, with the package providing functions to facilitate common tasks with a minimum of programming. Example activities covering model diagnostics, statistical power, and model selection are shown for both advanced undergraduate and Ph.D.-level regression courses.
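    The regressinator itself is an R package, so the following is only a language-agnostic sketch of the idea it implements: simulate data from a known population model with a deliberate assumption violation, fit the standard model, and inspect the diagnostics. The names and numbers below are illustrative, not the package's API.

    ```python
    # Illustrative sketch only (the regressinator is an R package; this is not its API).
    # Simulate from a known model with heteroskedastic noise, then fit OLS.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(42)

    n = 200
    x = rng.uniform(0, 10, n)
    # Violation built into the "population": noise spread grows with x.
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x, n)

    # Fit the usual homoskedastic linear model anyway.
    fit = sm.OLS(y, sm.add_constant(x)).fit()

    # A residuals-vs-fitted plot of these values should fan out,
    # revealing the heteroskedasticity to the student.
    residuals, fitted = fit.resid, fit.fittedvalues
    print(fit.params)  # estimates stay near the true (1.0, 2.0)
    ```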

  2. Employment Of India CLeaned and Messy Data

    • kaggle.com
    Updated Apr 7, 2025
    Cite
    SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    SONIA SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – a typical unprocessed dataset, as often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
    - Missing Values: identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: unified inconsistent column naming (e.g., 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: converted columns like salary from string/object to float for numerical analysis.
    - Outliers: detected and handled based on domain logic and distribution analysis.
    - Categorization: converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
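    A minimal pandas sketch of these steps; the file name and column names below are assumptions for illustration, not the dataset's actual schema.

    ```python
    # Hypothetical cleaning pipeline mirroring the steps described above.
    import pandas as pd

    df = pd.read_csv("employment_india_messy.csv")  # assumed file name

    # Inconsistent formatting: unify column naming.
    df = df.rename(columns={"monthly_salary_(inr)": "Monthly Salary (INR)"})

    # Duplicate records: drop exact row duplicates.
    df = df.drop_duplicates()

    # Incorrect data types: coerce salary strings to floats.
    df["Monthly Salary (INR)"] = pd.to_numeric(df["Monthly Salary (INR)"],
                                               errors="coerce")

    # Missing values: drop rows missing critical fields, impute the rest.
    df = df.dropna(subset=["Employment Status"])
    df["Monthly Salary (INR)"] = df["Monthly Salary (INR)"].fillna(
        df["Monthly Salary (INR)"].median())

    # Categorization: bin a numeric age column (assumed name) into groups.
    df["Age Group"] = pd.cut(df["Age"], bins=[0, 25, 40, 60, 120],
                             labels=["<25", "25-40", "40-60", "60+"])
    ```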

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for: - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  3. The ORBIT (Object Recognition for Blind Image Training)-India Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 24, 2025
    Cite
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones (2025). The ORBIT (Object Recognition for Blind Image Training)-India Dataset [Dataset]. http://doi.org/10.5281/zenodo.12608444
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gesu India; Martin Grayson; Daniela Massiceti; Cecily Morrison; Simon Robinson; Jennifer Pearson; Matt Jones
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The ORBIT (Object Recognition for Blind Image Training)-India Dataset is a collection of 105,243 images of 76 commonly used objects, collected by 12 individuals in India who are blind or have low vision. This dataset is an "Indian subset" of the original ORBIT dataset [1, 2], which was collected in the UK and Canada. In contrast to the ORBIT dataset, which was created in a Global North, Western, and English-speaking context, the ORBIT-India dataset features images taken in a low-resource, non-English-speaking, Global South context, home to 90% of the world’s population of people with blindness. Since it is easier for blind or low-vision individuals to gather high-quality data by recording videos, this dataset, like the ORBIT dataset, contains images (each sized 224x224) derived from 587 videos. These videos were taken by our data collectors from various parts of India using the Find My Things [3] Android app. Each data collector was asked to record eight videos of at least 10 objects of their choice.

    Collected between July and November 2023, this dataset represents a set of objects commonly used by people who are blind or have low vision in India, including earphones, talking watches, toothbrushes, and typical Indian household items like a belan (rolling pin) and a steel glass. These videos were taken in various settings of the data collectors' homes and workspaces using the Find My Things Android app.

    The image dataset is stored in the ‘Dataset’ folder, organized into folders assigned to each data collector (P1, P2, ... P12). Each collector's folder includes sub-folders named with the object labels as provided by our data collectors. Within each object folder, there are two subfolders: ‘clean’ for images taken on clean surfaces and ‘clutter’ for images taken in cluttered environments where the objects are typically found. The annotations are saved inside an ‘Annotations’ folder containing a JSON file per video (e.g., P1--coffee mug--clean--231220_084852_coffee mug_224.json) with keys corresponding to all frames/images in that video (e.g., "P1--coffee mug--clean--231220_084852_coffee mug_224--000001.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, "P1--coffee mug--clean--231220_084852_coffee mug_224--000002.jpeg": {"object_not_present_issue": false, "pii_present_issue": false}, ...). The ‘object_not_present_issue’ key is true if the object is not present in the image, and the ‘pii_present_issue’ key is true if personally identifiable information (PII) is present in the image. Note that all PII in the images has been blurred to protect the identity and privacy of our data collectors. This dataset version was created by cropping images originally sized at 1080 × 1920; an unscaled version of the dataset will follow soon.
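    A short sketch of walking this layout in Python, following the folder and file naming scheme described above (paths are assumptions and may need adjusting):

    ```python
    # Walk Dataset/<collector>/<object>/<clean|clutter>/*.jpeg and read the
    # per-video annotation JSONs described above.
    import json
    from pathlib import Path

    dataset_root = Path("Dataset")
    annotations_root = Path("Annotations")

    for collector in sorted(dataset_root.iterdir()):      # P1, P2, ... P12
        for obj in sorted(p for p in collector.iterdir() if p.is_dir()):
            for condition in ("clean", "clutter"):        # surface sub-folders
                for img in sorted((obj / condition).glob("*.jpeg")):
                    print(collector.name, obj.name, condition, img.name)

    # Each JSON maps frame file names to the two quality flags.
    ann_file = next(annotations_root.glob("*.json"))
    with open(ann_file) as f:
        ann = json.load(f)
    for frame, flags in list(ann.items())[:3]:
        print(frame, flags["object_not_present_issue"], flags["pii_present_issue"])
    ```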

    This project was funded by an Engineering and Physical Sciences Research Council (EPSRC) Industrial ICASE Award with Microsoft Research UK Ltd. as the Industrial Project Partner. We would like to acknowledge and express our gratitude to our data collectors for the effort and time they invested in carefully collecting videos to build this dataset for their community. The dataset is designed for developing few-shot learning algorithms, aiming to support researchers and developers in advancing object-recognition systems. We are excited to share this dataset and would love to hear if and how you use it. Please feel free to reach out if you have any questions, comments, or suggestions.

    REFERENCES:

    1. Daniela Massiceti, Lida Theodorou, Luisa Zintgraf, Matthew Tobias Harris, Simone Stumpf, Cecily Morrison, Edward Cutrell, and Katja Hofmann. 2021. ORBIT: A real-world few-shot dataset for teachable object recognition collected from people who are blind or low vision. DOI: https://doi.org/10.25383/city.14294597

    2. microsoft/ORBIT-Dataset. https://github.com/microsoft/ORBIT-Dataset

    3. Linda Yilin Wen, Cecily Morrison, Martin Grayson, Rita Faia Marques, Daniela Massiceti, Camilla Longden, and Edward Cutrell. 2024. Find My Things: Personalized Accessibility through Teachable AI for People who are Blind or Low Vision. In Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems (CHI EA '24). Association for Computing Machinery, New York, NY, USA, Article 403, 1–6. https://doi.org/10.1145/3613905.3648641

  4. Simple time-series visual objects interaction CRM

    • kaggle.com
    Updated May 12, 2018
    Cite
    Krid Jin (2018). Simple time-series visual objects interaction CRM [Dataset]. https://www.kaggle.com/vasopikof/simple-timeseries-visual-objects-interaction-crm
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 12, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Krid Jin
    Description

    Context

    There's a story behind every dataset and here's your opportunity to share yours.

    Content

    This dataset captures a simple visual interaction between objects, provided as 100x100 px images over 25 frames.

    Each sequence shows an object moving in the frame.

  5. Data from: Music21 Dataset

    • paperswithcode.com
    Updated Feb 18, 2021
    Cite
    Michael Scott Cuthbert; Christopher Ariza (2022). Music21 Dataset [Dataset]. https://paperswithcode.com/dataset/music21
    Explore at:
    Dataset updated
    Feb 18, 2021
    Authors
    Michael Scott Cuthbert; Christopher Ariza
    Description

    Music21 is an untrimmed video dataset crawled by keyword query from YouTube. It contains music performances belonging to 21 categories. This dataset is relatively clean and collected for the purpose of training and evaluating visual sound source separation models.

  6. ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR)

    • zenodo.org
    • explore.openaire.eu
    txt, zip
    Updated May 22, 2020
    Cite
    Bart Thomee; Adrian Popescu (2020). ImageCLEF 2012 Image annotation and retrieval dataset (MIRFLICKR) [Dataset]. http://doi.org/10.5281/zenodo.1246796
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    May 22, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bart Thomee; Adrian Popescu
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    DESCRIPTION
    For this task, we use a subset of the MIRFLICKR (http://mirflickr.liacs.nl) collection. The entire collection contains 1 million images from the social photo sharing website Flickr and was formed by downloading up to a thousand photos per day that were deemed to be the most interesting according to Flickr. All photos in this collection were released by their users under a Creative Commons license, allowing them to be freely used for research purposes. Of the entire collection, 25 thousand images were manually annotated with a limited number of concepts, and many of these annotations have been further refined and expanded over the lifetime of the ImageCLEF photo annotation task. This year we used crowdsourcing to annotate all of these 25 thousand images with the concepts.

    On this page we provide you with more information about the textual features, visual features and concept features we supply with each image in the collection we use for this year's task.


    TEXTUAL FEATURES
    All images are accompanied by the following textual features:

    - Flickr user tags
    These are the tags that the users assigned to the photos they uploaded to Flickr. The 'raw' tags are the original tags, while the 'clean' tags are those collapsed to lowercase and condensed to remove spaces.

    - EXIF metadata
    If available, the EXIF metadata contains information about the camera that took the photo and the parameters used. The 'raw' exif is the original camera data, while the 'clean' exif reduces the verbosity.

    - User information and Creative Commons license information
    This contains information about the user that took the photo and the license associated with it.


    VISUAL FEATURES
    Over the previous years of the photo annotation task we noticed that often the same types of visual features are used by the participants; in particular, features based on interest points and bag-of-words are popular. To assist you we have extracted several features that you may want to use, so you can focus on concept detection instead. We additionally give some pointers to easy-to-use toolkits that will help you extract other features, or the same features with different default settings.

    - SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT
    We used the ISIS Color Descriptors (http://www.colordescriptors.com) toolkit to extract these descriptors. This package provides you with many different types of features based on interest points, mostly using SIFT. It furthermore assists you with building codebooks for bag-of-words. The toolkit is available for Windows, Linux and Mac OS X.

    - SURF
    We used the OpenSURF (http://www.chrisevansdev.com/computer-vision-opensurf.html) toolkit to extract this descriptor. The open source code is available in C++, C#, Java and many more languages.

    - TOP-SURF
    We used the TOP-SURF (http://press.liacs.nl/researchdownloads/topsurf) toolkit to extract this descriptor, which represents images with SURF-based bag-of-words. The website provides codebooks of several different sizes that were created using a combination of images from the MIR-FLICKR collection and from the internet. The toolkit also offers the ability to create custom codebooks from your own image collection. The code is open source, written in C++ and available for Windows, Linux and Mac OS X.

    - GIST
    We used the LabelMe (http://labelme.csail.mit.edu) toolkit to extract this descriptor. The MATLAB-based library offers a comprehensive set of tools for annotating images.

    For the interest point-based features above we used a Fast Hessian-based technique to detect the interest points in each image. This detector is built into the OpenSURF library. In comparison with the Hessian-Laplace technique built into the ColorDescriptors toolkit, it detects fewer points, resulting in a considerably reduced memory footprint. We therefore also provide you with the interest point locations in each image that the Fast Hessian-based technique detected, so when you would like to recalculate some features you can use them as a starting point for the extraction. The ColorDescriptors toolkit, for instance, accepts these locations as a separate parameter. Please go to http://www.imageclef.org/2012/photo-flickr/descriptors for more information on the file format of the visual features and how you can extract them yourself if you want to change the default settings.
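    The released features were extracted with the toolkits above; as a present-day illustration of recomputing a SIFT descriptor yourself, OpenCV's implementation can be used (its default detector and settings differ from the ColorDescriptors and OpenSURF configurations used for the official features):

    ```python
    # Recompute SIFT keypoints/descriptors with OpenCV (illustrative; not the
    # toolkit or settings used to produce the released features).
    import cv2

    img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    print(len(keypoints), descriptors.shape)  # N keypoints, each a 128-dim vector
    ```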


    CONCEPT FEATURES
    We have solicited the help of workers on the Amazon Mechanical Turk platform to perform the concept annotation for us. To ensure a high standard of annotation we used the CrowdFlower platform, which acts as a quality control layer by removing the judgments of workers that fail to annotate properly. We reused several concepts of last year's task; for most of these we annotated the remaining photos of the MIRFLICKR-25K collection that had not been used in the previous task, while for some concepts we reannotated all 25,000 images to boost their quality. For the new concepts we naturally had to annotate all of the images.

    - Concepts
    For each concept we indicate in which images it is present. The 'raw' concepts contain the judgments of all annotators for each image, where a '1' means an annotator indicated the concept was present whereas a '0' means the concept was not present, while the 'clean' concepts only contain the images for which the majority of annotators indicated the concept was present. Some images in the raw data for which we reused last year's annotations only have one judgment for a concept, whereas the other images have between three and five judgments; the single judgment does not mean only one annotator looked at it, as it is the result of a majority vote amongst last year's annotators.

    - Annotations
    For each image we indicate which concepts are present, so this is the reverse version of the data above. The 'raw' annotations contain the average agreement of the annotators on the presence of each concept, while the 'clean' annotations only include those for which there was a majority agreement amongst the annotators.

    You will notice that the annotations are not perfect. Especially when the concepts are more subjective or abstract, the annotators tend to disagree more with each other. The raw versions of the concept annotations should help you get an understanding of the exact judgments given by the annotators.
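    As an illustration of the raw-to-clean relationship described above, a majority vote over per-image judgments could be computed as follows (the data layout here is hypothetical):

    ```python
    # 'raw' judgments per image (1 = annotator marked the concept present);
    # 'clean' keeps only images where a strict majority said present.
    def is_present(judgments):
        return sum(judgments) > len(judgments) / 2

    raw = {
        "im1000": [1, 1, 0],        # three judgments -> present
        "im1001": [0, 1, 0, 0, 1],  # five judgments  -> absent
        "im1002": [1],              # reused single majority-vote judgment
    }
    clean = [img for img, j in raw.items() if is_present(j)]
    print(clean)  # ['im1000', 'im1002']
    ```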

  7. ShapeNetCore Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated May 14, 2021
    Cite
    Angel X. Chang; Thomas Funkhouser; Leonidas Guibas; Pat Hanrahan; Qi-Xing Huang; Zimo Li; Silvio Savarese; Manolis Savva; Shuran Song; Hao Su; Jianxiong Xiao; Li Yi; Fisher Yu (2021). ShapeNetCore Dataset [Dataset]. https://paperswithcode.com/dataset/shapenetcore
    Explore at:
    Dataset updated
    May 14, 2021
    Authors
    Angel X. Chang; Thomas Funkhouser; Leonidas Guibas; Pat Hanrahan; Qi-Xing Huang; Zimo Li; Silvio Savarese; Manolis Savva; Shuran Song; Hao Su; Jianxiong Xiao; Li Yi; Fisher Yu
    Description

    ShapeNetCore is a subset of the full ShapeNet dataset with single clean 3D models and manually verified category and alignment annotations. It covers 55 common object categories with about 51,300 unique 3D models. The 12 object categories of PASCAL 3D+, a popular computer vision 3D benchmark dataset, are all covered by ShapeNetCore.

  8. Data Sheet 1_A machine learning-based detection, classification, and...

    • frontiersin.figshare.com
    docx
    Updated May 29, 2025
    Cite
    Mallela Pruthvi Raju; Subramanian Veerasingam; V. Suneel; Fahad Syed Asim; Hana Ahmed Khalil; Mark Chatting; P. Suneetha; P. Vethamony (2025). Data Sheet 1_A machine learning-based detection, classification, and quantification of marine litter along the central east coast of India.docx [Dataset]. http://doi.org/10.3389/fmars.2025.1604055.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 29, 2025
    Dataset provided by
    Frontiers
    Authors
    Mallela Pruthvi Raju; Subramanian Veerasingam; V. Suneel; Fahad Syed Asim; Hana Ahmed Khalil; Mark Chatting; P. Suneetha; P. Vethamony
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    Globally, plastic production has grown exponentially from 1.5 million metric tons (Mt) in 1950 to 400.3 Mt in 2022, resulting in a substantial increase in marine litter along coastal regions. Presently, there is growing interest in using an artificial intelligence (AI) based, automatic, and cost-effective approach to identify marine litter for clean-up processes. This study aims to understand the spatial distribution of marine litter along the central east coast of India using both a conventional method and an AI-based object detection approach. From the field survey, a total of 4588 marine litter items were identified, with an average of 1.147 ± 0.375 items/m2. Based on the clean coast index, 37.5% of beaches were categorized as ‘dirty’ and 62.5% as ‘extremely dirty’. For the machine learning approach, the ‘You Only Look Once’ (YOLOv5) model was used to detect and classify various types of marine litter items. A total of 9714 images representing seven categories of marine litter (plastic, metal, glass, fabric, paper, processed wood, and rubber) were extracted from eight field videos recorded across diverse beach settings. The efficiency of the trained model was assessed using different metrics: Recall, Precision, mean average precision (mAP), and F1 score (a metric for forecast accuracy). The model achieved an F1 score of 0.797, mAP@0.5 of 0.95, and mAP@0.5:0.95 of 0.76; these results show that the YOLOv5 model could be used in conjunction with conventional marine litter monitoring, classification, and detection to provide quick and accurate results.
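    For reference, the F1 score reported above is the harmonic mean of precision and recall; the precision/recall values in this sketch are illustrative, not those of the paper:

    ```python
    # F1 = harmonic mean of precision and recall.
    def f1_score(precision: float, recall: float) -> float:
        return 2 * precision * recall / (precision + recall)

    # Example values only; they merely show one combination yielding ~0.797.
    print(round(f1_score(0.82, 0.775), 3))  # 0.797
    ```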

  9. Details of training and testing datasets.

    • plos.figshare.com
    xls
    Updated Sep 28, 2023
    Cite
    Anh Duy Nguyen; Huy Hieu Pham; Huynh Thanh Trung; Quoc Viet Hung Nguyen; Thao Nguyen Truong; Phi Le Nguyen (2023). Details of training and testing datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0291865.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Anh Duy Nguyen; Huy Hieu Pham; Huynh Thanh Trung; Quoc Viet Hung Nguyen; Thao Nguyen Truong; Phi Le Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Due to the significant resemblance in visual appearance, pill misuse is prevalent and has become a critical issue, responsible for one-third of all deaths worldwide. Pill identification, thus, is a crucial concern that needs to be investigated thoroughly. Recently, several attempts have been made to exploit deep learning to tackle the pill identification problem. However, most published works consider only single-pill identification and fail to distinguish hard samples with identical appearances. Also, most existing pill image datasets only feature single pill images captured in carefully controlled environments under ideal lighting conditions and clean backgrounds. In this work, we are the first to tackle the multi-pill detection problem in real-world settings, aiming at localizing and identifying pills captured by users during pill intake. Moreover, we also introduce a multi-pill image dataset taken in unconstrained conditions. To handle hard samples, we propose a novel method for constructing heterogeneous a priori graphs incorporating three forms of inter-pill relationships, including co-occurrence likelihood, relative size, and visual semantic correlation. We then offer a framework for integrating this a priori knowledge with pills’ visual features to enhance detection accuracy. Our experimental results have proved the robustness, reliability, and explainability of the proposed framework. Experimentally, it outperforms all detection benchmarks in terms of all evaluation metrics. Specifically, our proposed framework improves COCO mAP metrics by 9.4% over Faster R-CNN and 12.0% compared to vanilla YOLOv5. Our study opens up new opportunities for protecting patients from medication errors using an AI-based pill identification solution.

  10. Lens Clean Import Data | Carl Zeiss Vision Incorporated

    • seair.co.in
    Updated Mar 25, 2024
    Cite
    Seair Exim (2024). Lens Clean Import Data | Carl Zeiss Vision Incorporated [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Mar 25, 2024
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  11. Lens Clean Import Data | Vision Precision Holdings Llc

    • seair.co.in
    Updated May 21, 2025
    Cite
    Seair Exim (2025). Lens Clean Import Data | Vision Precision Holdings Llc [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    May 21, 2025
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  12. An overview of existing public datasets for the task of image-based pill...

    • plos.figshare.com
    xls
    Updated Sep 28, 2023
    Cite
    Anh Duy Nguyen; Huy Hieu Pham; Huynh Thanh Trung; Quoc Viet Hung Nguyen; Thao Nguyen Truong; Phi Le Nguyen (2023). An overview of existing public datasets for the task of image-based pill detection. [Dataset]. http://doi.org/10.1371/journal.pone.0291865.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Anh Duy Nguyen; Huy Hieu Pham; Huynh Thanh Trung; Quoc Viet Hung Nguyen; Thao Nguyen Truong; Phi Le Nguyen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An overview of existing public datasets for the task of image-based pill detection.

  13. caption-vidore-vdsid_french-clean

    • huggingface.co
    Updated Jul 7, 2025
    Cite
    CATIE (2025). caption-vidore-vdsid_french-clean [Dataset]. https://huggingface.co/datasets/CATIE-AQ/caption-vidore-vdsid_french-clean
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    CATIE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Français
    Description

    The vidore/vdsid_french dataset, processed for a visual question answering task where the answer is a caption.
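    A minimal sketch of loading this dataset with the Hugging Face `datasets` library; the split name and column layout are assumptions based on the description:

    ```python
    # Load the processed VQA-style dataset and inspect one example.
    from datasets import load_dataset

    ds = load_dataset("CATIE-AQ/caption-vidore-vdsid_french-clean", split="train")
    print(ds)            # features should include an image, a question, a caption
    print(ds[0].keys())
    ```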

      Citation
    

    @misc{faysse2024colpaliefficientdocumentretrieval, title={ColPali: Efficient Document Retrieval with Vision Language Models}, author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo}, year={2024}, eprint={2407.01449}, archivePrefix={arXiv}… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/caption-vidore-vdsid_french-clean.

  14. Chess Pieces Detection Images Dataset

    • kaggle.com
    Updated Nov 12, 2022
    Cite
    Anshul Mehta (2022). Chess Pieces Detection Images Dataset [Dataset]. https://www.kaggle.com/datasets/anshulmehtakaggl/chess-pieces-detection-images-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anshul Mehta
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This can be a great Computer Vision or Multi-Class Deep Learning Project.

    Content

    The folder names are self-explanatory: they contain the names of the pieces, and each image shows a single piece, keeping the dataset tidy. Images with multiple objects have been deleted, making this an easy dataset to work with.
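    Because each piece type sits in its own folder of single-piece images, the layout matches what torchvision's ImageFolder expects; the root path here is hypothetical:

    ```python
    # One class per piece-name folder -> ready for multi-class training.
    from torchvision import datasets, transforms

    tfm = transforms.Compose([transforms.Resize((224, 224)),
                              transforms.ToTensor()])
    ds = datasets.ImageFolder("chess-pieces/", transform=tfm)  # assumed root
    print(ds.classes)          # piece names taken from folder names
    img, label = ds[0]
    print(img.shape, ds.classes[label])
    ```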

    Inspiration

    I am an amateur chess player and a chess fan, and I did not come across any great datasets like this on the Internet, so I decided to make this one.

  15. caption-manu-tabfquad_retrieving-clean

    • huggingface.co
    Updated Jul 7, 2025
    Cite
    CATIE (2025). caption-manu-tabfquad_retrieving-clean [Dataset]. https://huggingface.co/datasets/CATIE-AQ/caption-manu-tabfquad_retrieving-clean
    Explore at:
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    CATIE
    Description

    The manu/tabfquad_retrieving dataset, processed for a visual question answering task where the answer is a caption.

      Citation
    

    https://huggingface.co/datasets/manu/tabfquad_retrieving

  16. RSNA Stage 2 Clean CSV's (IHD 2019)

    • kaggle.com
    zip
    Updated Nov 10, 2019
    Cite
    Carlo Lepelaars (2019). RSNA Stage 2 Clean CSV's (IHD 2019) [Dataset]. https://www.kaggle.com/carlolepelaars/rsna-clean-dataframes-stage-2
    Explore at:
    Available download formats: zip (4,555,953 bytes)
    Dataset updated
    Nov 10, 2019
    Authors
    Carlo Lepelaars
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Preprocessed Stage 2 CSVs so they work more easily with Image Data Generators.

    Original data came from this Kaggle competition:

    https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/data
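    A sketch of pairing the cleaned CSVs with a Keras ImageDataGenerator; the CSV file name, image directory, and column names are assumptions, not the dataset's documented schema:

    ```python
    # Feed the preprocessed CSV into flow_from_dataframe.
    import pandas as pd
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    df = pd.read_csv("stage_2_train_clean.csv")   # assumed file name
    gen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.1)
    train_flow = gen.flow_from_dataframe(
        df,
        directory="stage_2_train_images/",  # assumed image directory
        x_col="filename",                   # assumed column of image file names
        y_col="any",                        # assumed binary hemorrhage label
        class_mode="raw",
        target_size=(224, 224),
        subset="training",
    )
    ```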

  17. caption-vidore-tabfquad_test_subsampled-clean

    • huggingface.co
    Updated Jul 7, 2025
    Cite
    CATIE (2025). caption-vidore-tabfquad_test_subsampled-clean [Dataset]. https://huggingface.co/datasets/CATIE-AQ/caption-vidore-tabfquad_test_subsampled-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    CATIE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The vidore/tabfquad_test_subsampled dataset, processed for a visual question answering task where the answer is a caption.

      Citation
    

    @misc{faysse2024colpaliefficientdocumentretrieval, title={ColPali: Efficient Document Retrieval with Vision Language Models}, author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo}, year={2024}, eprint={2407.01449}… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/caption-vidore-tabfquad_test_subsampled-clean.

  18. OCR-liboaccn-OPUS-MIT-5M-clean

    • huggingface.co
    Updated May 10, 2025
    Cite
    Loïck BOURDOIS (2025). OCR-liboaccn-OPUS-MIT-5M-clean [Dataset]. https://huggingface.co/datasets/lbourdois/OCR-liboaccn-OPUS-MIT-5M-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 10, 2025
    Authors
    Loïck BOURDOIS
    Description

    This dataset is a processed version of liboaccn/OPUS-MIT-5M, made easier to use, particularly for a visual question answering task where the answer is an OCR transcription. Specifically, the original dataset has been processed to provide the image directly as a PIL image rather than as a path in an image column. We have also created a question column containing around 40 prompts based on tutoiement, vouvoiement, and imperative forms. Note that this dataset contains only the… See the full description on the dataset page: https://huggingface.co/datasets/lbourdois/OCR-liboaccn-OPUS-MIT-5M-clean.
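    A sketch of using the processed dataset as described; the split name is an assumption, and streaming is used only because the dataset is large:

    ```python
    # Stream one example; the image column holds a PIL image, not a path.
    from datasets import load_dataset

    ds = load_dataset("lbourdois/OCR-liboaccn-OPUS-MIT-5M-clean",
                      split="train", streaming=True)
    sample = next(iter(ds))
    print(sample["question"])          # one of the ~40 generated prompts
    sample["image"].save("frame.png")  # PIL image object, usable directly
    ```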

  19. OCR-neulab-PangeaInstruct-OCR-clean

    • huggingface.co
    Updated May 10, 2025
    Cite
    Loïck BOURDOIS (2025). OCR-neulab-PangeaInstruct-OCR-clean [Dataset]. https://huggingface.co/datasets/lbourdois/OCR-neulab-PangeaInstruct-OCR-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 10, 2025
    Authors
    Loïck BOURDOIS
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The French part of the neulab/PangeaInstruct dataset (OCR data only), processed for a visual question answering task where the answer is a caption.

      Citation
    

    @article{yue2024pangeafullyopenmultilingual, title={Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages}, author={Xiang Yue and Yueqi Song and Akari Asai and Seungone Kim and Jean de Dieu Nyandwi and Simran Khanuja and Anjali Kantharuban and Lintang Sutawika and Sathyanarayanan Ramamoorthy… See the full description on the dataset page: https://huggingface.co/datasets/lbourdois/OCR-neulab-PangeaInstruct-OCR-clean.

  20. caption-floschne-xm3600-clean

    • huggingface.co
    Updated Jul 7, 2025
    Cite
    CATIE (2025). caption-floschne-xm3600-clean [Dataset]. https://huggingface.co/datasets/CATIE-AQ/caption-floschne-xm3600-clean
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 7, 2025
    Dataset authored and provided by
    CATIE
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The French part of the floschne/xm3600 dataset, processed for a visual question answering task where the answer is a caption.

      Citation
    

    @inproceedings{ThapliyalCrossmodal2022,
      author    = {Ashish Thapliyal and Jordi Pont-Tuset and Xi Chen and Radu Soricut},
      title     = {{Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset}},
      booktitle = {EMNLP},
      year      = {2022}
    }
