27 datasets found
  1. Database of literature surveyed in "A survey of crowdsourcing in medical...

    • figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Silas Ørting; Veronika Cheplygina (2023). Database of literature surveyed in "A survey of crowdsourcing in medical image analysis" by Ørting et al. [Dataset]. http://doi.org/10.6084/m9.figshare.9751850.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Silas Ørting; Veronika Cheplygina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A table with a longer summary of surveyed papers on crowdsourcing in medical imaging, and a shorter summary that is included in the main paper text.

  2. Data from: Crowdsourced geometric morphometrics enable rapid large-scale...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Nov 10, 2016
    Cite
    Jonathan Chang; Michael E. Alfaro (2016). Crowdsourced geometric morphometrics enable rapid large-scale collection and analysis of phenotypic data [Dataset]. http://doi.org/10.5061/dryad.gh4k7
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2016
    Dataset provided by
    University of California, Los Angeles
    Authors
    Jonathan Chang; Michael E. Alfaro
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description
    1. Advances in genomics and informatics have enabled the production of large phylogenetic trees. However, the ability to collect large phenotypic datasets has not kept pace. 2. Here, we present a method to quickly and accurately gather morphometric data using crowdsourced image-based landmarking. 3. We find that crowdsourced workers perform similarly to experienced morphologists on the same digitization tasks. We also demonstrate the speed and accuracy of our method on seven families of ray-finned fishes (Actinopterygii). 4. Crowdsourcing will enable the collection of morphological data across vast radiations of organisms, and can facilitate richer inference on the macroevolutionary processes that shape phenotypic diversity across the tree of life.
  3. NetEaseCrowd

    • huggingface.co
    Updated Mar 21, 2024
    Cite
    Haoyu (2024). NetEaseCrowd [Dataset]. https://huggingface.co/datasets/liuhyuu/NetEaseCrowd
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 21, 2024
    Authors
    Haoyu
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🧑‍🤝‍🧑 NetEaseCrowd: A Dataset for Long-term and Online Crowdsourcing Truth Inference

    View it in GitHub

      Introduction
    

    We introduce NetEaseCrowd, a large-scale crowdsourcing annotation dataset based on a mature Chinese data crowdsourcing platform operated by NetEase Inc. The NetEaseCrowd dataset contains about 2,400 workers, 1,000,000 tasks, and 6,000,000 annotations between them, collected over a period of about 6 months. In this dataset, we provide ground truths for… See the full description on the dataset page: https://huggingface.co/datasets/liuhyuu/NetEaseCrowd.
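
    The dataset card here does not spell out a file schema, so the snippet below is only a minimal sketch of majority-vote truth inference over a generic annotation table; the file name and the column names (worker_id, task_id, answer) are assumptions, not the dataset's documented format.

    # Minimal majority-vote truth inference sketch; file and column names are assumed, not documented here.
    import pandas as pd

    # Hypothetical export of the (worker, task, answer) triples.
    annotations = pd.read_csv("netease_crowd_annotations.csv")

    # For each task, take the most frequent answer as the inferred truth.
    inferred = (
        annotations.groupby("task_id")["answer"]
        .agg(lambda answers: answers.mode().iloc[0])
        .rename("majority_answer")
    )
    print(inferred.head())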

  4. Data from: Reducing Annotation Artifacts in Crowdsourcing Datasets for...

    • figshare.com
    txt
    Updated Sep 16, 2020
    Cite
    anonymous anonymous (2020). Reducing Annotation Artifacts in Crowdsourcing Datasets for Natural Language Processing [Dataset]. http://doi.org/10.6084/m9.figshare.12962480.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 16, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    anonymous anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotation Artifact

  5. Data from: A crowdsourced dataset of aerial images with annotated solar...

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Feb 7, 2023
    + more versions
    Cite
    Gabriel Kasmi; Yves-Marie Saint-Drenan; David Trebosc; Raphaël Jolivet; Johnathan Leloux; Babacar Sarr; Laurent Dubus (2023). A crowdsourced dataset of aerial images with annotated solar photovoltaic arrays and installation metadata [Dataset]. http://doi.org/10.5281/zenodo.7347432
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Feb 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gabriel Kasmi; Yves-Marie Saint-Drenan; David Trebosc; Raphaël Jolivet; Johnathan Leloux; Babacar Sarr; Laurent Dubus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    Photovoltaic (PV) energy generation plays a crucial role in the energy transition. Small-scale PV installations are deployed at an unprecedented pace, and their integration into the grid can be challenging since stakeholders often lack quality data about these installations. Overhead imagery is increasingly used to improve the knowledge of distributed PV installations with machine learning models capable of automatically mapping these installations. However, these models cannot be easily transferred from one region or data source to another due to differences in image acquisition. To address this issue known as domain shift and foster the development of PV array mapping pipelines, we propose a dataset containing aerial images, annotations, and segmentation masks. We provide installation metadata for more than 28,000 installations. We provide ground truth segmentation masks for 13,000 installations, including 7,000 with annotations for two different image providers. Finally, we provide ground truth annotations and associated installation metadata for more than 8,000 installations. Dataset applications include end-to-end PV registry construction, robust PV installations mapping, and analysis of crowdsourced datasets.

    This dataset contains the complete records associated with the article "A crowdsourced dataset of aerial images of solar panels, their segmentation masks, and characteristics", currently under review. The preprint is accessible at this link: https://arxiv.org/abs/2209.03726. These complete records consist of:

    1. The complete training dataset containing RGB overhead imagery, segmentation masks and metadata of PV installations (folder bdappv),
    2. The raw crowdsourcing data, and the postprocessed data for replication and validation (folder data).

    Data records

    Folders are organized as follows:

    • bdappv/ Root data folder
      • google / ign: One folder for each campaign
        • img/: Folder containing all the images presented to the users. This folder contains 28,807 images for Google and 17,325 images for IGN.
        • mask/: Folder containing all segmentation masks generated from the polygon annotations of the users. This folder contains 13,303 masks for Google and 7,686 masks for IGN.
      • metadata.csv: The .csv file with the installations' metadata.

    • data/ Root data folder
      • raw/ Folder containing the raw crowdsourcing data and raw metadata;
        • input-google.json: .json input data containing all information on images and raw annotators’ contributions for both phases (clicks and polygons) during the first annotation campaign;
        • input-ign.json: .json input data containing all information on images and raw annotators’ contributions for both phases (clicks and polygons) during the second annotation campaign;
        • raw-metadata.json: .json output containing the PV systems’ metadata extracted from the BDPV database before filtering. It can be used to replicate the association between the installations and the segmentation masks, as done in the notebook metadata.
      • replication/ Folder containing the compiled data used to generate the segmentation masks;
        • campaign-google/campaign-ign: One folder for each campaign
          • click-analysis.json: .json output of the click analysis, compiling raw input into a few best-guess locations for the PV arrays. This dataset enables the replication of our annotations.
          • polygon-analysis.json: .json output of polygon analysis, compiling raw input into a best-guess polygon for the PV arrays.
      • validation/ Folder containing the compiled data used for technical validation.
        • campaign-google/campaign-ign: One folder for each campaign
          • click-analysis-thres=1.0.json: .json output of the click analysis with a lowered threshold to analyze the effect of the threshold on image classification, as done in the notebook annotation;
          • polygon-analysis-thres=1.0.json: .json output of polygon analysis, with a lowered threshold to analyze the effect of the threshold on polygon annotation, as done in the notebook annotations.
        • metadata.csv: the .csv file of filtered installations' metadata.
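
    As a rough illustration of the folder layout above, the sketch below pairs each Google-campaign image with its segmentation mask and counts the result; it is only a sketch, assuming that images and masks share the same file name and that masks are stored as .png files, which may differ from the actual layout.

    # Sketch: pair Google-campaign images with their masks (file naming and extension are assumed).
    from pathlib import Path

    root = Path("bdappv/google")

    pairs = []
    for mask_path in (root / "mask").glob("*.png"):   # masks exist only for annotated images
        img_path = root / "img" / mask_path.name      # assumes identical file names in img/ and mask/
        if img_path.exists():
            pairs.append((img_path, mask_path))

    print(f"{len(pairs)} image/mask pairs found")      # roughly 13,303 expected for the Google campaign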

    License

    We extracted the thumbnails contained in the google/img/ folder using the Google Earth Engine API, and we generated the thumbnails contained in the ign/img/ folder from high-resolution tiles downloaded from the online IGN portal accessible here: https://geoservices.ign.fr/bdortho. Images provided by Google are subject to Google's terms and conditions. Images provided by the IGN are subject to an open license 2.0.

    Access the terms and conditions of Google images at this URL: https://www.google.com/intl/en/help/legalnotices_maps/

    Access the terms and conditions of IGN images at this URL: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf

  6. Launching a campaign of annotations on Zooniverse with ChildProject

    • doi.gin.g-node.org
    Updated Sep 9, 2021
    Cite
    Lucas Gautheron (2021). Launching a campaign of annotations on Zooniverse with ChildProject [Dataset]. http://doi.org/10.12751/g-node.k2h9az
    Explore at:
    Dataset updated
    Sep 9, 2021
    Dataset provided by
    Laboratoire de Sciences Cognitives et Psycholinguistique
    Authors
    Lucas Gautheron
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Step-by-step tutorial to launch a campaign of annotations on Zooniverse based on daylong recordings managed with ChildProject.

  7. Supporting Online Toxicity Detection with Knowledge Graphs: Data

    • zenodo.org
    • datasets.ai
    • +3more
    zip
    Updated Mar 24, 2022
    Cite
    Paula Reyero Lobo (2022). Supporting Online Toxicity Detection with Knowledge Graphs: Data [Dataset]. http://doi.org/10.5281/zenodo.6379344
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 24, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paula Reyero Lobo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data repository contains the output files from the analysis of the paper "Supporting Online Toxicity Detection with Knowledge Graphs" presented at the International Conference on Web and Social Media 2022 (ICWSM-2022).

    The data contains annotations of gender and sexual orientation entities provided by the Gender and Sexual Orientation Ontology (https://bioportal.bioontology.org/ontologies/GSSO).

    We analyse demographic group samples from the Civil Comments Identities dataset (https://www.tensorflow.org/datasets/catalog/civil_comments).

  8. Data Labeling Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Labeling Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-labeling-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Labeling Market Outlook



    According to our latest research, the global Data Labeling market size reached USD 3.7 billion in 2024, reflecting robust demand across multiple industries. The market is expected to expand at a CAGR of 24.1% from 2025 to 2033, reaching an estimated USD 28.6 billion by 2033. This remarkable growth is primarily driven by the exponential adoption of artificial intelligence (AI) and machine learning (ML) solutions, which require vast volumes of accurately labeled data for training and validation. As organizations worldwide accelerate their digital transformation initiatives, the need for high-quality, annotated datasets has never been more critical, positioning data labeling as a foundational element in the AI ecosystem.




    A major growth factor for the data labeling market is the rapid proliferation of AI-powered applications across diverse sectors such as healthcare, automotive, finance, and retail. As AI models become more sophisticated, the demand for precise and contextually relevant labeled data intensifies. Enterprises are increasingly relying on data labeling services to enhance the accuracy and reliability of their AI algorithms, particularly in applications like computer vision, natural language processing, and speech recognition. The surge in autonomous vehicle development, medical imaging analysis, and personalized recommendation systems is a significant driver fueling the need for scalable data annotation solutions. Moreover, the integration of data labeling with cloud-based platforms and automation tools is streamlining workflows and reducing turnaround times, further propelling market expansion.




    Another key driver is the growing emphasis on data quality and compliance in the wake of stricter regulatory frameworks. Organizations are under mounting pressure to ensure that their AI models are trained on unbiased, ethically sourced, and well-labeled data to avoid issues related to algorithmic bias and data privacy breaches. This has led to increased investments in advanced data labeling technologies, including semi-automated and fully automated annotation platforms, which not only improve efficiency but also help maintain compliance with global data protection regulations such as GDPR and CCPA. The emergence of specialized data labeling vendors offering domain-specific expertise and robust quality assurance processes is further bolstering market growth, as enterprises seek to mitigate risks associated with poor data quality.




    The data labeling market is also experiencing significant traction due to the expanding ecosystem of AI startups and the democratization of machine learning tools. With the availability of open-source frameworks and accessible cloud-based ML platforms, small and medium-sized enterprises (SMEs) are increasingly leveraging data labeling services to accelerate their AI initiatives. The rise of crowdsourcing and managed workforce solutions has enabled organizations to tap into global talent pools for large-scale annotation projects, driving down costs and enhancing scalability. Furthermore, advancements in active learning and human-in-the-loop (HITL) approaches are enabling more efficient and accurate labeling workflows, making data labeling an indispensable component of the AI development lifecycle.




    Regionally, North America continues to dominate the data labeling market, accounting for the largest revenue share in 2024, thanks to its mature AI ecosystem, strong presence of leading technology companies, and substantial investments in research and development. Asia Pacific is emerging as the fastest-growing region, propelled by rapid digitalization, government-led AI initiatives, and a burgeoning startup landscape in countries such as China, India, and Japan. Europe is also witnessing steady growth, driven by stringent data protection regulations and increasing adoption of AI technologies across key industries. The Middle East & Africa and Latin America are gradually catching up, supported by growing awareness of AI's transformative potential and rising investments in digital infrastructure.



    Component Analysis



    The data labeling market is segmented by component into Software and Services, each playing a pivotal role in supporting the end-to-end annotation lifecycle. Data labeling software encompasses a range of platforms and tools designed to facilitate the creation, management, and validation of labeled datasets. These solutions

  9. Data from: The CrowdGleason dataset: learning the Gleason grade from crowds...

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2025
    Cite
    López-Pérez, Miguel; Morquecho, Alba; Schmidt, Arne; Pérez-Bueno, Fernando; Martín-Castro, Aurelio; Mateos, Javier; Molina, Rafael (2025). The CrowdGleason dataset: learning the Gleason grade from crowds and experts [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_14178893
    Explore at:
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Universidad de Granada
    Universitat Politècnica de València
    Hospital Universitario Virgen de las Nieves
    Authors
    López-Pérez, Miguel; Morquecho, Alba; Schmidt, Arne; Pérez-Bueno, Fernando; Martín-Castro, Aurelio; Mateos, Javier; Molina, Rafael
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository contains the associated files to replicate the study entitled "The CrowdGleason dataset: Learning the Gleason grade from crowds and experts", published in Computer Methods and Programs in Biomedicine, Volume 257, December 2024, Article 108472. For further details on the study and the dataset, please see the published article.

    CrowdGleason is a public prostate histopathological dataset which consists of 19,077 patches from 1,045 WSIs with various Gleason grades. The dataset was annotated using a crowdsourcing protocol involving seven pathologists-in-training to distribute the labeling effort.

    The whole dataset is divided into three sets for training, validation and testing. In detail, there is a training set with 13,824 patches of size 512 × 512 and a validation set with 2,327 patches, both extracted from 783 WSIs and annotated by one or more of a crowd of seven pathologists in-training, and a curated test set with 2,926 patches of size 512 × 512, extracted from another 262 WSIs and annotated by expert pathologists and all the pathologists in-training. Ground-truth labels for the curated test set were obtained by consensus between the expert pathologists and the majority of the pathologists in-training.

    Dataset files

    The dataset consists of several files:

    Patches.zip: contains the 19,077 patches of CrowdGleason in three folders: train, which contains 13,824 patches used for training; val, which contains 2327 patches to validate the models, and test, which contains 2,926 patches that form the test set.

    NormalizedPatches.zip: contains the same patches of Patches.zip after the colour normalization preprocessing step. These files are useful to directly replicate the results of the paper.

    Annotations.zip: contains three .csv files with crowdsourcing annotations. These files provide the train/val/test split as well as label information. The 'markerX' columns are the labels given by the X-th annotator; Patch filename is the patch name in the train/val/test folder; ground truth is the ground truth label in the curated test dataset; MV, DS, GLAD, MACE are the labels obtained by label aggregation of the pathologists in-training annotations using the majority voting, Dawid-Skene, GLAD and MACE methods, respectively.

    We have also included the files corresponding to the external dataset SICAPv2 to enhance the reproducibility of the experiments of our paper:

    NormalizedSICAPv2.zip: contains the normalized patches of the external dataset SICAPv2.

    NormalizedSICAPv2_Annotations.zip: contains the labels of the external dataset SICAPv2.
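
    As a sketch of how the per-annotator labels in Annotations.zip can be aggregated, the code below recomputes a simple majority vote over the 'markerX' columns and compares it with the provided MV column; the CSV file name and the exact column spellings are assumptions based on the description above.

    # Sketch: recompute majority voting from the 'markerX' annotator columns (file name assumed).
    import pandas as pd

    train = pd.read_csv("train.csv")                  # hypothetical name of one CSV inside Annotations.zip
    marker_cols = [c for c in train.columns if c.startswith("marker")]

    # Row-wise mode over the annotators who labelled each patch (missing labels are NaN).
    recomputed_mv = train[marker_cols].mode(axis=1)[0]

    if "MV" in train.columns:
        agreement = (recomputed_mv == train["MV"]).mean()
        print(f"Agreement with the provided MV labels: {agreement:.3f}")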

    Citation

    @article{LOPEZPEREZ2024108472,
      title = {The CrowdGleason dataset: Learning the Gleason grade from crowds and experts},
      journal = {Computer Methods and Programs in Biomedicine},
      volume = {257},
      pages = {108472},
      year = {2024},
      issn = {0169-2607},
      doi = {https://doi.org/10.1016/j.cmpb.2024.108472},
      url = {https://www.sciencedirect.com/science/article/pii/S0169260724004656},
      author = {Miguel López-Pérez and Alba Morquecho and Arne Schmidt and Fernando Pérez-Bueno and Aurelio Martín-Castro and Javier Mateos and Rafael Molina},
      keywords = {Computational pathology, Crowdsourcing, Prostate cancer, Gleason grade, Gaussian processes, Medical image analysis},
    }

    Funding:

    This work was supported in part by FEDER/Junta de Andalucía under project P20_00286, grant PID2022-140189OB-C22 funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU". The work by Miguel López-Pérez and Fernando Pérez-Bueno was supported by the grants JDC2022-048318-I and JDC2022-048784-I, respectively, funded by MICIU/AEI/10.13039/501100011033 and the European Union "NextGenerationEU"/PRTR.

  10. Global Automated Data Annotation Tool Market Research Report: By Application...

    • wiseguyreports.com
    Updated Sep 18, 2025
    + more versions
    Cite
    (2025). Global Automated Data Annotation Tool Market Research Report: By Application (Image Annotation, Text Annotation, Audio Annotation, Video Annotation, Sensor Data Annotation), By End Use Industry (Healthcare, Automotive, Retail, Finance, Manufacturing), By Deployment Type (Cloud-Based, On-Premise, Hybrid), By Techniques (Machine Learning, Deep Learning, Active Learning, Crowdsourcing) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/cn/reports/automated-data-annotation-tool-market
    Explore at:
    Dataset updated
    Sep 18, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 1.31 (USD Billion)
    MARKET SIZE 2025: 1.49 (USD Billion)
    MARKET SIZE 2035: 5.2 (USD Billion)
    SEGMENTS COVERED: Application, End Use Industry, Deployment Type, Techniques, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: growing AI adoption, increasing data volume, demand for accuracy, need for cost efficiency, advancements in machine learning
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Aiforia Technologies, AWS, Deepomatic, Figure Eight, Snorkel AI, Clarifai, Labelbox, Microsoft Azure, Mindtech, Google Cloud, Teledyne Technologies, Scale AI, Slyd, Appen, SuperAnnotate, DataRobot
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: AI integration for enhanced efficiency, Rising demand for machine learning, Expansion in autonomous vehicle development, Growth of healthcare data analysis, Increased focus on data privacy solutions
    COMPOUND ANNUAL GROWTH RATE (CAGR): 13.4% (2025 - 2035)
  11. Crowdsourcing Document Similarity Judgements

    • data.europa.eu
    unknown
    Cite
    Zenodo, Crowdsourcing Document Similarity Judgements [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4298976?locale=da
    Explore at:
    Available download formats: unknown (10862)
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data obtained from crowdsourcing tasks which ask workers to provide similarity metrics between pairs of documents. Each document, as well as each pair, has a unique ID. We provide crowd workers with the pairs through three different task variations:
    • Variation 1: We showed workers 5 pairs of documents and, for each, asked them to rate their similarity on a 4-level Likert scale (None, Low, Medium, High), report a confidence level of how sure they were (from 0 to 4), and give a written reason as to why they chose that similarity level. For quality reasons, two of the 5 pairs were golden standards, meaning we already knew their ratings and checked the workers' responses against them. Workers had to give the golden pair with the higher similarity a higher score than the other golden pair, otherwise their answer was rejected.
    • Variation 2: We repeated variation 1 with a slight alteration: instead of a Likert scale for the similarity score, we asked for a magnitude estimation, which is any number above 0. It could be 1, 0.0001, 1000, or 42, as long as it was coherent, meaning a more similar pair received a higher score than a less similar pair and vice versa.
    • Variation 3: We showed workers 5 rankings. Each ranking had a main document and 3 auxiliary documents to be compared against the main one. Workers also had to report a confidence score and give a short written reason, just like variation 1. The first ranking was a golden standard for which we knew the values of the 3 pairs (the main document paired with each of the 3 auxiliary documents), and workers had to rank the golden pair with the highest similarity higher than the one with the lower similarity.
    The raw results from the tasks are recorded in the JSON file CrowdResults.json. For a description of its contents, please read the file CrowdResults_README.md. These raw annotations from the crowd were then parsed into the three CSVs provided here, each corresponding to the aggregated results from one of the task variations:
    • final_scores_likert.csv: the resulting scores for each pair using the variation 1 tasks.
      • pair_id: a unique identifier for each pair;
      • similarity_alg: the similarity assigned to the pair of documents by an automated similarity algorithm;
      • relation: the type of relationship shown by the pair, where smaller values indicate more similar pairs;
      • similarity_crowd_simple_maj: the simple majority result from the crowd's annotations;
      • similarity_crowd_simple_mean: the mean of the crowd's annotations;
      • similarity_crowd_simple_median: the median of the crowd's annotations.
    • final_scores_magnitude.csv: the resulting scores for each pair using the variation 2 tasks.
      • pair_id: a unique identifier for each pair;
      • similarity_alg: the similarity assigned to the pair of documents by an automated similarity algorithm;
      • relation: the type of relationship shown by the pair, where smaller values indicate more similar pairs;
      • scaled_similarity_worker: the magnitude score scaled based on the worker's behaviour;
      • scaled_similarity_worker_docset: the magnitude score scaled based both on the worker's behaviour and on the pair.
    • final_scores_ranking.csv: the resulting scores for each pair using the variation 3 tasks.
      • pair_id: a unique identifier for each pair;
      • similarity_alg: the similarity assigned to the pair of documents by an automated similarity algorithm;
      • relation: the type of relationship shown by the pair, where smaller values indicate more similar pairs;
      • mean_similarity: the mean ranking for that pair.
    This dataset was built and used as part of the TheyBuyForYou project.
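
    As a small usage sketch of the aggregated files above, the snippet below loads final_scores_likert.csv and checks how the crowd's mean Likert scores track the automated similarity scores; it assumes the column names listed above, numeric score values, and that pandas and scipy are available.

    # Sketch: compare crowd Likert means with the automated similarity scores (columns as listed above).
    import pandas as pd
    from scipy.stats import spearmanr

    likert = pd.read_csv("final_scores_likert.csv")
    rho, p = spearmanr(likert["similarity_alg"], likert["similarity_crowd_simple_mean"])
    print(f"Spearman rho = {rho:.3f} (p = {p:.3g}) over {len(likert)} pairs")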

  12. Data from: Experiment 1 Results

    • figshare.com
    txt
    Updated Jun 14, 2017
    + more versions
    Cite
    Anggarda Prameswari (2017). Experiment 1 Results [Dataset]. http://doi.org/10.6084/m9.figshare.5106577.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 14, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anggarda Prameswari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results from the initial experiment.

  13. Crowd-Annotation Results: Identifying and Classifying User Requirements in...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 17, 2020
    Cite
    Martijn van Vliet; Eduard C. Groen; Fabiano Dalpiaz; Sjaak Brinkkemper (2020). Crowd-Annotation Results: Identifying and Classifying User Requirements in Online Feedback [Dataset]. http://doi.org/10.5281/zenodo.3626185
    Explore at:
    Dataset updated
    Apr 17, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Martijn van Vliet; Eduard C. Groen; Fabiano Dalpiaz; Sjaak Brinkkemper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Results from the Figure Eight experiment conducted as part of the paper "Identifying and Classifying User Requirements in Online Feedback via Crowdsourcing" published at REFSQ 2020.

  14. ARTigo – Social Image Tagging [Dataset and Images]

    • data.ub.uni-muenchen.de
    • data.europa.eu
    Updated Nov 15, 2018
    Cite
    Becker, Matthias; Bogner, Martin; Bross, Fabian; Bry, François; Campanella, Caterina; Commare, Laura; Cramerotti, Silvia; Jakob, Katharina; Josko, Martin; Kneißl, Fabian; Kohle, Hubertus; Krefeld, Thomas; Levushkina, Elena; Lücke, Stephan; Puglisi, Alessandra; Regner, Anke; Riepl, Christian; Schefels, Clemens; Schemainda, Corina; Schmidt, Eva; Schneider, Stefanie; Schön, Gerhard; Schulz, Klaus; Siglmüller, Franz; Steinmayr, Bartholomäus; Störkle, Florian; Teske, Iris; Wieser, Christoph (2018). ARTigo – Social Image Tagging [Dataset and Images] [Dataset]. http://doi.org/10.5282/ubm/data.136
    Explore at:
    Dataset updated
    Nov 15, 2018
    Authors
    Becker, Matthias; Bogner, Martin; Bross, Fabian; Bry, François; Campanella, Caterina; Commare, Laura; Cramerotti, Silvia; Jakob, Katharina; Josko, Martin; Kneißl, Fabian; Kohle, Hubertus; Krefeld, Thomas; Levushkina, Elena; Lücke, Stephan; Puglisi, Alessandra; Regner, Anke; Riepl, Christian; Schefels, Clemens; Schemainda, Corina; Schmidt, Eva; Schneider, Stefanie; Schön, Gerhard; Schulz, Klaus; Siglmüller, Franz; Steinmayr, Bartholomäus; Störkle, Florian; Teske, Iris; Wieser, Christoph
    Description

    ARTigo is a platform that uses crowdsourcing to gather annotations (tags) on works of art (see http://www.artigo.org/). The dataset comprises 54,497 objects, which are associated with 18,492 artists (11,519 of which are either anonymous or unknown), 295,343 German-, French-, and English-language tags, and 9,669,410 taggings. It is based on a cleansed database dump dated November 15, 2018. The cleansing concerned only the metadata of the objects; tags and taggings are provided "as is". A current but uncleansed version of the data is available via a RESTful API at: http://www.artigo.org/api.html. The data is licensed under Creative Commons BY-NC-SA 4.0. If you are unsure whether your project is a commercial use, please contact us at: hubertus.kohle@lmu.de.

  15. Disease Mention Annotation with Mechanical Turk

    • figshare.com
    xlsx
    Updated Jan 19, 2016
    Cite
    Benjamin Good (2016). Disease Mention Annotation with Mechanical Turk [Dataset]. http://doi.org/10.6084/m9.figshare.1126402.v2
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Benjamin Good
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from a study on using Amazon Mechanical Turk (AMT) to annotate disease mentions in PubMed abstracts.

  16. Data from: MAESTRO Real - Multi-Annotator Estimated Strong Labels

    • producciocientifica.uv.es
    Updated 2023
    Cite
    Morato, Irene Martin; Harju, Manu; Mesaros, Annamaria (2023). MAESTRO Real - Multi-Annotator Estimated Strong Labels [Dataset]. https://producciocientifica.uv.es/documentos/668fc44fb9e7c03b01bd97c4
    Explore at:
    Dataset updated
    2023
    Authors
    Morato, Irene Martin; Harju, Manu; Mesaros, Annamaria
    Description

    The dataset was created for studying the estimation of strong labels using crowdsourcing. It contains 49 real-life audio files from 5 different acoustic scenes, and the annotation outcome. Annotation was performed using Amazon Mechanical Turk. The total duration of the dataset is 189 minutes and 52 seconds. The audio files are a subset of the TUT Acoustic Scenes 2016 dataset, belonging to five acoustic scenes: cafe/restaurant, city center, grocery store, metro station and residential area. Each scene has 6 classes, some of which are common to all scenes, resulting in 17 classes in total.
    The dataset contains:
    • audio: the 49 real-life recordings, each 3 to 5 minutes long.
    • soft labels: estimated strong labels from the crowdsourced data; values between 0 and 1 indicate the uncertainty of the annotators.
    For more details about the real-life recordings, please see the following paper: A. Mesaros, T. Heittola and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," 2016 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1128-1132.
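
    The exact file layout of the soft labels is not given here, so the following is only a sketch of binarizing per-class soft labels into strong labels by thresholding; the file name and the CSV layout (onset, offset, one column per sound class) are assumptions.

    # Sketch: threshold crowdsourced soft labels into binary strong labels (file layout assumed).
    import pandas as pd

    soft = pd.read_csv("soft_labels_example.csv")      # hypothetical: onset, offset, <one column per class>
    class_cols = [c for c in soft.columns if c not in ("onset", "offset")]

    threshold = 0.5                                    # uncertainty cut-off; tune for the task at hand
    strong = soft.copy()
    strong[class_cols] = (soft[class_cols] >= threshold).astype(int)
    print(strong.head())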

  17. Towards A Reliable Ground-Truth For Biased Language Detection

    • data.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Gipp, Bela; Krieger, David; Plank, Manuel; Spinde, Timo (2024). Towards A Reliable Ground-Truth For Biased Language Detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4625150
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    University of Wuppertal
    University of Konstanz
    Authors
    Gipp, Bela; Krieger, David; Plank, Manuel; Spinde, Timo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reference texts such as encyclopedias and news articles can manifest biased language when objective reporting is substituted by subjective writing. Existing methods to detect linguistic cues of bias mostly rely on annotated data to train machine learning models. However, low annotator agreement and comparability is a substantial drawback in available media bias corpora. To improve available datasets, we collect and compare labels obtained from two popular crowdsourcing platforms. Our results demonstrate the existing crowdsourcing approaches' lack of data quality, underlining the need for a trained expert framework to gather a more reliable dataset. Improving the agreement from Krippendorff's α = 0.144 (crowdsourcing labels) to α = 0.419 (expert labels), we assume that trained annotators' linguistic knowledge increases data quality, improving the performance of existing bias detection systems.
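
    For reference, agreement figures like the Krippendorff's α values above can be computed with the krippendorff Python package; the toy reliability matrix below (annotators × items, NaN for missing ratings) is made up purely for illustration and is not this dataset's data.

    # Sketch: Krippendorff's alpha for a toy annotators-by-items matrix (values are illustrative only).
    import numpy as np
    import krippendorff

    # Rows = annotators, columns = items; np.nan marks items an annotator did not rate.
    ratings = np.array([
        [1,      0, 1, np.nan, 1],
        [1,      0, 0, 1,      1],
        [np.nan, 0, 1, 1,      1],
    ])
    alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal")
    print(f"Krippendorff's alpha = {alpha:.3f}")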

    The expert annotations are meant to be used to enrich the dataset MBIC – A Media Bias Annotation Dataset Including Annotator Characteristics available at https://zenodo.org/record/4474336#.YBHO6xYxmK8.

  18. Analysis of influencing factors in speech quality assessment using...

    • resodate.org
    Updated Jan 25, 2022
    Cite
    Rafael Zequeira Jiménez (2022). Analysis of influencing factors in speech quality assessment using crowdsourcing [Dataset]. http://doi.org/10.14279/depositonce-12247
    Explore at:
    Dataset updated
    Jan 25, 2022
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Rafael Zequeira Jiménez
    Description

    Crowdsourcing has emerged as a competitive mechanism to conduct user studies on the Internet. Users in crowdsourcing perform small tasks remotely from their computer or mobile device in exchange for monetary compensation. Nowadays, multiple crowdsourcing platforms offer a fast, low cost and scalable approach to collect human input for data acquisition and annotations. However, the question remains whether the collected ratings in an online platform are still valid and reliable. And if such ratings are comparable to those gathered in a constrained laboratory environment. There is a lack of control to supervise the participant and often not enough information about their playback system and background environment. Therefore, different quality control mechanisms have been proposed to ensure reliable results and monitor these factors to the extent possible. The quality of the transmitted speech signal is essential for telecommunication network providers. It is an important indicator used to evaluate their systems, services, and to counterbalance potential issues. Traditionally, subjective speech quality studies are conducted under controlled laboratory conditions with professional audio equipment. This way, good control over the experimental setup can be accomplished, but with some disadvantages: conducting laboratory-based studies is expensive, time-consuming, and the number of participants is often relatively low. Consequently, the experiment outcomes might not be representative of a broad population. In contrast, crowdsourcing represents an excellent opportunity to move such listening tests to the Internet and target a much wider and diverse pool of potential users at a fraction of the cost and time. Nevertheless, the implementation of existing subjective testing methodologies into an Internet-based environment is not straightforward. Multiple challenges arise that need to be addressed to gather valid and reliable results. This dissertation evaluates the impact of relevant factors affecting the results of speech quality assessment studies carried out in crowdsourcing. These factors relate to the test structure, the effect of environmental background noise, and the influence of language differences. To the best of the author’s knowledge, these influencing factors have not yet been addressed. The results indicate that it is better to offer test tasks with a number of speech stimuli between 10 and 20 to encourage listener participation while reducing study response times. Additionally, the outcomes suggest that the threshold level of environmental background noise for collecting reliable speech quality scores in crowdsourcing is between 43dB(A) and 50dB(A). Also, listeners were more tolerant of the TV-Show noise compared to the street traffic noise when executing the listening test. Furthermore, the feasibility of using web-audio recordings for environmental noise classification is determined. A Multi-layer Perceptron Classifier with an adam solver achieved an accuracy of 0.69 in noise classification. In contrast, a deep model based on a "Long Short-Term Memory'' architecture accomplished an RMSE of 4.58 on average (scale of 30.6dBA to 81.3dBA) on the test set for noise level estimation. Finally, an experiment was performed to determine if it is possible to gather reliable speech quality ratings for German stimuli with native English and Spanish speakers in a crowdsourcing environment. 
The Pearson correlation to the laboratory results was strong and significant, and the RMSE was low regardless of the listeners' mother tongue. However, a bias was seen in the quality scores collected from the English and Spanish crowd-workers, which was then corrected with a first-order mapping.
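
    The "first-order mapping" mentioned above is essentially a linear correction fitted between crowdsourced and laboratory scores; below is a minimal numpy sketch with made-up example values, not the dissertation's actual data.

    # Sketch: first-order (linear) mapping of crowdsourced MOS onto laboratory MOS (toy numbers).
    import numpy as np

    lab_mos = np.array([1.2, 2.1, 2.9, 3.8, 4.5])     # reference scores from the laboratory test
    crowd_mos = np.array([1.6, 2.4, 3.0, 3.7, 4.2])   # biased scores from the crowdsourcing test

    slope, intercept = np.polyfit(crowd_mos, lab_mos, deg=1)   # first-order fit
    corrected = slope * crowd_mos + intercept

    rmse = np.sqrt(np.mean((corrected - lab_mos) ** 2))
    print(f"MOS_corrected = {slope:.2f} * MOS_crowd + {intercept:.2f} (RMSE after mapping: {rmse:.3f})")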

  19. Table 1_On the construction of a large-scale database of AI-assisted...

    • frontiersin.figshare.com
    docx
    Updated Jul 17, 2025
    Cite
    Amir Jabbarpour; Eric Moulton; Sanaz Kaviani; Siraj Ghassel; Wanzhen Zeng; Ramin Akbarian; Anne Couture; Aubert Roy; Richard Liu; Yousif A. Lucinian; Nuha Hejji; Sukainah AlSulaiman; Farnaz Shirazi; Eugene Leung; Sierra Bonsall; Samir Arfin; Bruce G. Gray; Ran Klein (2025). Table 1_On the construction of a large-scale database of AI-assisted annotating lung ventilation-perfusion scintigraphy for pulmonary embolism (VQ4PEDB).docx [Dataset]. http://doi.org/10.3389/fnume.2025.1632112.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Amir Jabbarpour; Eric Moulton; Sanaz Kaviani; Siraj Ghassel; Wanzhen Zeng; Ramin Akbarian; Anne Couture; Aubert Roy; Richard Liu; Yousif A. Lucinian; Nuha Hejji; Sukainah AlSulaiman; Farnaz Shirazi; Eugene Leung; Sierra Bonsall; Samir Arfin; Bruce G. Gray; Ran Klein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Ventilation-perfusion (V/Q) nuclear scintigraphy remains a vital diagnostic tool for assessing pulmonary embolism (PE) and other lung conditions. Interpretation of these images requires specific expertise which may benefit from recent advances in artificial intelligence (AI) to improve diagnostic accuracy and confidence in reporting. Our study aims to develop a multi-center dataset combining imaging and clinical reports to aid in creating AI models for PE diagnosis.
    Methods: We established a comprehensive imaging registry encompassing patient-level V/Q image data along with relevant clinical reports, CTPA images, DVT ultrasound impressions, D-dimer lab tests, and thrombosis unit records. Data extraction was performed at two hospitals in Canada and at multiple sites in the United States, followed by a rigorous de-identification process. We utilized the V7 Darwin platform for crowdsourced annotation of V/Q images, including segmentation of V/Q-mismatched vascular defects. The annotated data was then ingested into Deep Lake, a SQL-based database, for AI model training. Quality assurance involved manual inspections and algorithmic validation.
    Results: A query of The Ottawa Hospital's data warehouse followed by initial data screening yielded 2,137 V/Q studies, with 2,238 successfully retrieved as DICOM studies. Additional contributions included 600 studies from University Health Toronto and 385 studies from the private company Segmed Inc., resulting in a total of 3,122 V/Q planar and SPECT images. The majority of studies were acquired using Siemens, Philips, and GE scanners, adhering to standardized local imaging protocols. After annotating 1,500 studies from The Ottawa Hospital, the analysis identified 138 high-probability, 168 intermediate-probability, 266 low-probability, 244 very low-probability, 669 normal, and 15 normal-perfusion studies with reversed mismatched ventilation defects. These 1,500 patients yielded 3,511 segmented vascular perfusion defects.
    Conclusion: The VQ4PEDB comprised 8 unique ventilation agents and 11 unique scanners. The VQ4PEDB database is unique in its depth and breadth in the domain of V/Q nuclear scintigraphy for PE, comprising clinical reports, imaging studies, and annotations. We share our experience in addressing challenges associated with data retrieval, de-identification, and annotation. VQ4PEDB will be a valuable resource to develop and validate AI models for diagnosing PE and other pulmonary diseases.

  20. COUGHVID V3

    • kaggle.com
    zip
    Updated Feb 27, 2025
    + more versions
    Cite
    Orvile (2025). COUGHVID V3 [Dataset]. https://www.kaggle.com/datasets/orvile/coughvid-v3
    Explore at:
    Available download formats: zip (2311668723 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    Orvile
    Description

    COUGHVID: A Crowdsourced Cough Audio Dataset for Machine Learning

    Overview

    The COUGHVID dataset is one of the largest crowdsourced cough audio collections available for research and development in cough sound classification. This dataset is particularly valuable for respiratory disease detection, including COVID-19 screening, using Machine Learning (ML) techniques.

    Dataset Description

    • Total Recordings: Over 30,000 crowdsourced cough recordings from a diverse range of subjects.
    • Labeled Data: More than 2,000 recordings annotated by experienced pulmonologists to detect medical abnormalities.
    • Demographics: Covers various subject demographics, including age, gender, geographic location, and COVID-19 status.
    • Application: Ideal for tasks such as cough audio classification, anomaly detection, and ML model training for diagnosing respiratory illnesses.

    New Semi-Supervised Labeling

    The third version of the COUGHVID dataset includes thousands of additional recordings obtained through October 2021. Additionally, cough recordings were re-labeled using a semi-supervised learning algorithm that combined user-provided labels with expert physician annotations. This model extended labels to previously unlabeled data, improving the dataset's accuracy. The newly generated labels can be found in the "status_SSL" column of the "metadata_compiled.csv" file.
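
    As a small usage sketch of the semi-supervised labels described above, the snippet below reads metadata_compiled.csv and tabulates the status_SSL column; only the file and column names come from the description, and the filter value is an assumption.

    # Sketch: inspect the semi-supervised (SSL) labels in COUGHVID v3 (file and column names from the description above).
    import pandas as pd

    meta = pd.read_csv("metadata_compiled.csv")
    print(meta["status_SSL"].value_counts(dropna=False))   # distribution of SSL-derived labels

    # Keep only recordings the semi-supervised model labelled as COVID-19 (label string assumed).
    covid_like = meta[meta["status_SSL"] == "COVID-19"]
    print(f"{len(covid_like)} recordings flagged by the SSL labels")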

    Data Structure

    • Audio Files: Cough recordings collected from crowdsourcing.
    • Metadata: Contains subject information, cough characteristics, and expert annotations.
    • Labels: Medical labels identifying abnormalities in the cough sounds, including those from the semi-supervised learning process.

    Research Applications

    • Respiratory Disease Detection: Using AI for diagnosing respiratory conditions.
    • COVID-19 Pre-screening: Early detection based on cough sounds.
    • Anomaly Detection: Detecting unusual audio patterns in cough signals.
    • Medical and Public Health Research: Furthering advancements in healthcare diagnostics using audio analysis.

    Citation

    If you use this dataset in your research or project, please cite:

    Orlandic, L., Teijeiro, T., & Atienza, D. (2021). The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms (3.0) [Data set]. Zenodo. DOI: 10.5281/zenodo.7024894

    License

    Creative Commons Attribution 4.0 International (CC BY 4.0).

    Access to Private Testing Data

    Researchers who wish to test their models on the private test dataset should contact the COUGHVID team at coughvid@epfl.ch with a brief explanation of the type of validation they intend to conduct and the results obtained through cross-validation with the public data. After reviewing the request, access to the unlabeled recordings will be provided. The predictions on these recordings should then be sent to the team for performance evaluation.

    You can also find the dataset at https://zenodo.org/records/7024894.
