100+ datasets found
  1. Global Data Labeling Tools Market Research Report: By Application (Machine...

    • wiseguyreports.com
    Updated Aug 23, 2025
    Cite
    (2025). Global Data Labeling Tools Market Research Report: By Application (Machine Learning, Natural Language Processing, Computer Vision, Data Mining, Predictive Analytics), By Labeling Type (Image Annotation, Text Annotation, Video Annotation, Audio Annotation, 3D Point Cloud Annotation), By Deployment Type (Cloud-Based, On-Premises, Hybrid), By End User (Healthcare, Automotive, Retail, Finance, Telecommunications) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/data-labeling-tools-market
    Explore at:
    Dataset updated
    Aug 23, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Aug 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 3.75 (USD Billion)
    MARKET SIZE 2025: 4.25 (USD Billion)
    MARKET SIZE 2035: 15.0 (USD Billion)
    SEGMENTS COVERED: Application, Labeling Type, Deployment Type, End User, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: increasing AI adoption, demand for accurate datasets, growing automation in workflows, rise of cloud-based solutions, emphasis on data privacy regulations
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Lionbridge, Scale AI, Google Cloud, Amazon Web Services, DataSoring, CloudFactory, Mighty AI, Samasource, TrinityAI, Microsoft Azure, Clickworker, Pimlico, Hive, iMerit, Appen
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: AI-driven automation integration, expansion in machine learning applications, increasing demand for annotated datasets, growth in autonomous vehicles sector, rising focus on data privacy compliance
    COMPOUND ANNUAL GROWTH RATE (CAGR): 13.4% (2025 - 2035)
  2. Pseudo-Label Generation for Multi-Label Text Classification

    • s.cnmilf.com
    • datasets.ai
    • +1 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well. However, in text data it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us in performing better text classification, and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although here we are proposing and evaluating a text classification technique, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide us insight into how the multi-labelity is handled in our classification process and show the effectiveness of our approach.
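    The abstract's core idea, treating combinations of existing class labels as pseudo-labels, can be sketched roughly as follows. This is a loose illustration only, not the paper's pseudo-LSC algorithm; the function name and toy label sets are invented for the example:

    ```python
    from itertools import combinations

    # Rough illustration (not pseudo-LSC itself): enumerate label
    # combinations that actually co-occur in a multi-label data set
    # and treat each combination as a candidate pseudo-label.
    def cooccurring_pseudo_labels(labelsets, max_size=2):
        seen = set()
        for labels in labelsets:
            for k in range(2, max_size + 1):
                seen.update(combinations(sorted(labels), k))
        return sorted(seen)

    # Two toy documents, each tagged with a subset of class labels.
    combos = cooccurring_pseudo_labels([{"sports", "politics"}, {"sports", "tech"}])
    # combos -> [('politics', 'sports'), ('sports', 'tech')]
    ```

    Only co-occurring combinations are kept, so the pseudo-label set stays small even when the label vocabulary is large.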

  3. Triple random ensemble method for multi-label classification

    • researchdata.edu.au
    • dro.deakin.edu.au
    Updated Sep 25, 2024
    Cite
    G Tsoumakas; G Nasierding; Abbas Z. Kouzani (2024). Triple random ensemble method for multi-label classification [Dataset]. https://researchdata.edu.au/triple-random-ensemble-label-classification/3385179
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Deakin University
    Authors
    G Tsoumakas; G Nasierding; Abbas Z. Kouzani
    Description

    Triple random ensemble method for multi-label classification

  4. Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA...

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Pseudo-Label Generation for Multi-Label Text Classification - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well. However, in text data it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us in performing better text classification, and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although here we are proposing and evaluating a text classification technique, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide us insight into how the multi-labelity is handled in our classification process and show the effectiveness of our approach.

  5. road sign recognition

    • kaggle.com
    zip
    Updated May 2, 2021
    Cite
    Said Azizov (2021). road sign recognition [Dataset]. https://www.kaggle.com/michaelcripman/road-sign-recognition
    Explore at:
    Available download formats: zip (3523596349 bytes)
    Dataset updated
    May 2, 2021
    Authors
    Said Azizov
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Data. The following input data is given:

    - archive with task data;
    - train.csv - training image annotations;
    - train_images/ - folder with training images;
    - 5_15_2_vocab.json - decoding of the attributes for the 5_15_2 sign.

    The annotations contain the fields:

    - filename - path to the sign image;
    - label - the class label for the sign in the image.

    Note: the signs 3_24, 3_25, 5_15_2, 5_31 and 6_2 carry separate attributes, appended to the label with a "+" character, for example "3_24 + 100". For sign 5_15_2 the attribute is the direction of the arrow; for the remaining signs, it is the number on the sign.
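    Splitting the composite label field could look like this (a minimal sketch; the helper name is invented, and the "+" delimiter follows the "3_24 + 100" example in the note):

    ```python
    # Split a label such as "3_24 + 100" into the sign class and its
    # optional attribute (arrow direction for 5_15_2, a number otherwise).
    def parse_label(label: str):
        parts = [p.strip() for p in label.split("+")]
        sign = parts[0]
        attribute = parts[1] if len(parts) > 1 else None
        return sign, attribute

    parse_label("3_24 + 100")  # -> ("3_24", "100")
    parse_label("1_1")         # -> ("1_1", None)
    ```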

  6. Data from: Linked Data Mining Challenge RM Set

    • search.gesis.org
    • da-ra.de
    Updated Nov 5, 2025
    Cite
    Schaible, Johann (2025). Linked Data Mining Challenge RM Set [Dataset]. https://search.gesis.org/research_data/SDN-10.7802-78
    Explore at:
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    GESIS, Köln
    GESIS search
    Authors
    Schaible, Johann
    License

    https://www.gesis.org/en/institute/data-usage-terms

    Description

    Rapid Miner Process files and XML test set including the predicted labels for the Linked Data Mining Challenge 2015.

  7. Indoor Fire Dataset with Distributed Multi-Sensor Nodes

    • data.mendeley.com
    Updated Jun 7, 2023
    Cite
    Pascal V (2023). Indoor Fire Dataset with Distributed Multi-Sensor Nodes [Dataset]. http://doi.org/10.17632/npk2zcm85h.1
    Explore at:
    Dataset updated
    Jun 7, 2023
    Authors
    Pascal V
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset comprises 4 fire experiments (repeated 3 times) and 3 nuisance experiments (Ethanol: repeated 3 times, Deodorant: repeated 2 times, Hairspray: repeated 1 time), with various background sequences interspersed between the conducted experiments. All experiments were carried out in random order to reduce the influence of prehistory. The dataset consists of a total of 305,304 rows and 16 columns, structured as a continuous multivariate time series. Each row represents the sensor measurements (CO2, CO, H2, humidity, particulate matter of different sizes, air temperature, and UV) from a unique sensor node position in the EN54 test room at a specific timestamp. The columns correspond to the sensor measurements and include additional labels: a scenario-specific label ("scenario_label"), a binary label ("anomaly_label") distinguishing between "Normal" (background) and "Anomaly" (fire or nuisance scenario), a ternary label ("ternary_label") categorizing the data as "Nuisance," "Fire," or "Background," and a progress label ("progress_label") that allows for dividing the event sequences into sub-sequences based on ongoing physical sub-processes. The dataset comprises 82.98% background data points and 17.02% anomaly data points, which can be further divided into 12.50% fire anomaly data points and 4.52% nuisance anomaly data points. The "Sensor_ID" column can be utilized to access data from different sensor node positions.
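    The label columns described above could be used along these lines. This is a sketch over a synthetic stand-in frame with the documented column names; in practice you would load the actual Mendeley files instead:

    ```python
    import pandas as pd

    # Synthetic stand-in rows carrying the documented label columns.
    df = pd.DataFrame({
        "Sensor_ID": [1, 1, 2, 2],
        "anomaly_label": ["Normal", "Anomaly", "Anomaly", "Normal"],
        "ternary_label": ["Background", "Fire", "Nuisance", "Background"],
    })

    # Binary split first, then refine anomalies with the ternary label.
    anomalies = df[df["anomaly_label"] == "Anomaly"]
    fires = anomalies[anomalies["ternary_label"] == "Fire"]

    # Per-node access via Sensor_ID, as the description suggests.
    node_1 = df[df["Sensor_ID"] == 1]
    ```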

  8. Open-Pit Mining Block Model Dataset

    • kaggle.com
    zip
    Updated Jul 15, 2025
    Cite
    Ziya (2025). Open-Pit Mining Block Model Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/open-pit-mining-block-model-dataset/data
    Explore at:
    Available download formats: zip (1812380 bytes)
    Dataset updated
    Jul 15, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is a generated representation of an open-pit mining block model, designed to reflect realistic geological, spatial, and economic conditions found in large-scale mineral extraction projects. It contains 75,000 individual blocks, each representing a unit of earth material with associated attributes that influence decision-making in mine planning and resource evaluation.

    The dataset includes essential parameters such as ore grade, tonnage, economic value, and operational costs. A calculated profit value and a corresponding binary target label indicate whether a block is considered economically viable for extraction. This setup supports various types of analysis, such as profitability assessments, production scheduling, and resource categorization.

    🔑 Key Features

    Block_ID: Unique identifier for each block in the model.

    Spatial Coordinates (X, Y, Z): 3D location data representing the layout of the deposit.

    Rock Type: Geological classification of each block (e.g., Hematite, Magnetite, Waste).

    Ore Grade (%): Iron content percentage for ore-bearing blocks; set to 0% for waste.

    Tonnage (tonnes): Total mass of the block, used in volume and value calculations.

    Ore Value (¥/tonne): Estimated revenue based on grade and market assumptions.

    Mining Cost (¥): Estimated extraction cost per block.

    Processing Cost (¥): Cost associated with refining ore-bearing blocks.

    Waste Flag: Indicates whether a block is classified as waste material (1 = Waste, 0 = Ore).

    Profit (¥): Net value after subtracting mining and processing costs from potential revenue.

    Target: Label indicating whether a block is economically profitable (1 = Yes, 0 = No).

    This dataset is ideal for applications related to mineral resource evaluation, production planning, and profitability analysis. It can also be used for teaching and demonstration purposes in mining engineering and resource management contexts.
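    The documented relationship between the economic columns and the target label can be reproduced as below. The values are illustrative; the dataset's own generation parameters are not published in this description:

    ```python
    # Net profit per the description: revenue (ore value x tonnage)
    # minus mining and processing costs; Target = 1 when profitable.
    def block_profit(ore_value_per_tonne, tonnage, mining_cost, processing_cost):
        revenue = ore_value_per_tonne * tonnage
        return revenue - mining_cost - processing_cost

    profit = block_profit(ore_value_per_tonne=12.0, tonnage=5_000,
                          mining_cost=20_000, processing_cost=15_000)
    target = 1 if profit > 0 else 0
    # profit -> 25000.0, target -> 1
    ```

    Waste blocks (Ore Grade = 0%) generate no revenue, so their profit is negative and their target is 0 by construction.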

  9. Code for Predicting MIEs from Gene Expression and Chemical Target Labels...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Apr 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Code for Predicting MIEs from Gene Expression and Chemical Target Labels with Machine Learning (MIEML) [Dataset]. https://catalog.data.gov/dataset/code-for-predicting-mies-from-gene-expression-and-chemical-target-labels-with-machine-lear
    Explore at:
    Dataset updated
    Apr 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Modeling data and analysis scripts generated during the current study are available in the GitHub repository: https://github.com/USEPA/CompTox-MIEML. RefChemDB is available for download as supplemental material from its original publication (PMID: 30570668). LINCS gene expression data are publicly available and accessible through the Gene Expression Omnibus (GSE92742 and GSE70138) at https://www.ncbi.nlm.nih.gov/geo/. This dataset is associated with the following publication: Bundy, J., R. Judson, A. Williams, C. Grulke, I. Shah, and L. Everett. Predicting Molecular Initiating Events Using Chemical Target Annotations and Gene Expression. BioData Mining, BioMed Central Ltd, London, UK, issue 7 (2022).

  10. Additional file 5: of Study of serious adverse drug reactions using...

    • springernature.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Leihong Wu; Taylor Ingle; Zhichao Liu; Anna Zhao-Wong; Stephen Harris; Shraddha Thakkar; Guangxu Zhou; Junshuang Yang; Joshua Xu; Darshan Mehta; Weigong Ge; Weida Tong; Hong Fang (2023). Additional file 5: of Study of serious adverse drug reactions using FDA-approved drug labeling and MedDRA [Dataset]. http://doi.org/10.6084/m9.figshare.7850477.v1
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    figshare
    Authors
    Leihong Wu; Taylor Ingle; Zhichao Liu; Anna Zhao-Wong; Stephen Harris; Shraddha Thakkar; Guangxu Zhou; Junshuang Yang; Joshua Xu; Darshan Mehta; Weigong Ge; Weida Tong; Hong Fang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table S5. 1164 selected SPL documents used in this study. (XLS 378 kb)

  11. URL-Phish: A Feature-Engineered Dataset for Phishing Detection

    • data.mendeley.com
    Updated Sep 29, 2025
    Cite
    Linh Dam Minh (2025). URL-Phish: A Feature-Engineered Dataset for Phishing Detection [Dataset]. http://doi.org/10.17632/65z9twcx3r.1
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Linh Dam Minh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset, named URL-Phish, is designed for phishing detection research. It contains 111,660 unique URLs divided into:

    • 100,000 benign samples (label = 0), collected from trusted sources including educational (.edu), governmental (.gov), and top-ranked domains. The benign dataset was obtained from the Research Organization Registry [1].
    • 11,660 phishing samples (label = 1), obtained from the PhishTank repository [2] between November 2024 and September 2025.

    Each URL entry was automatically processed to extract 22 lexical and structural features, such as URL length, domain length, number of subdomains, digit ratio, entropy, and HTTPS usage. In addition, three reference columns (url, dom, tld) are preserved for interpretability. One label column is included (0 = benign, 1 = phishing). A data cleaning step removed duplicates and empty entries, followed by normalization of the features to ensure consistency. The dataset is provided in CSV format, with 22 numerical feature columns, 3 string reference columns, and 1 label column (26 columns in total).

    References [1] Research Organization Registry, “ROR Data.” Zenodo, Sept. 22, 2025. doi: 10.5281/ZENODO.6347574. [2] PhishTank, “PhishTank: Join the fight against phishing.” [Online]. Available: https://phishtank.org
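    A few of the listed lexical features can be computed along these lines. This is an illustrative re-implementation with an invented helper and example URL; the dataset's exact feature definitions and normalization are not specified in the description:

    ```python
    import math
    from urllib.parse import urlparse

    def url_features(url: str) -> dict:
        """Compute a handful of lexical URL features (illustrative only)."""
        domain = urlparse(url).netloc
        n = len(url)
        # Shannon entropy over the URL's character distribution.
        counts = {ch: url.count(ch) for ch in set(url)}
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return {
            "url_length": n,
            "domain_length": len(domain),
            "num_subdomains": max(domain.count(".") - 1, 0),
            "digit_ratio": sum(ch.isdigit() for ch in url) / n,
            "entropy": entropy,
            "https": int(url.startswith("https://")),
        }

    feats = url_features("https://login.example.com/verify?id=123")
    ```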

  12. Pattern Mining for Label Ranking

    • data.4tu.nl
    • figshare.com
    zip
    Updated May 8, 2017
    Cite
    C.F. (Cláudio) Pinho Rebelo de Sá (2017). Pattern Mining for Label Ranking [Dataset]. http://doi.org/10.4121/uuid:21b1959d-9196-423e-94d0-53883fb0ff21
    Explore at:
    Available download formats: zip
    Dataset updated
    May 8, 2017
    Dataset provided by
    LIACS
    Authors
    C.F. (Cláudio) Pinho Rebelo de Sá
    License

    https://doi.org/10.4121/resource:terms_of_use

    Description

    Label Ranking datasets used in the PhD thesis "Pattern Mining for Label Ranking"

  13. US Deep Learning Market Analysis, Size, and Forecast 2025-2029

    • technavio.com
    pdf
    Updated Jul 8, 2025
    Cite
    Technavio (2025). US Deep Learning Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/us-deep-learning-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jul 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Description


    US Deep Learning Market Size 2025-2029

    The deep learning market size in the US is forecast to increase by USD 5.02 billion at a CAGR of 30.1% between 2024 and 2029.

    The deep learning market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) in various industries for advanced solutioning. This trend is fueled by the availability of vast amounts of data, which is a key requirement for deep learning algorithms to function effectively. Industry-specific solutions are gaining traction, as businesses seek to leverage deep learning for specific use cases such as image and speech recognition, fraud detection, and predictive maintenance. Alongside, intuitive data visualization tools are simplifying complex neural network outputs, helping stakeholders understand and validate insights. 
    
    
    However, challenges remain, including the need for powerful computing resources, data privacy concerns, and the high cost of implementing and maintaining deep learning systems. Despite these hurdles, the market's potential for innovation and disruption is immense, making it an exciting space for businesses to explore further. Semi-supervised learning, data labeling, and data cleaning facilitate efficient training of deep learning models. Cloud analytics is another significant trend, as companies seek to leverage cloud computing for cost savings and scalability. 
    

    What will be the size of the market during the forecast period?


    Deep learning, a subset of machine learning, continues to shape industries by enabling advanced applications such as image and speech recognition, text generation, and pattern recognition. Reinforcement learning, a type of deep learning, gains traction, with deep reinforcement learning leading the charge. Anomaly detection, a crucial application of unsupervised learning, safeguards systems against security vulnerabilities. Ethical implications and fairness considerations are increasingly important in deep learning, with emphasis on explainable AI and model interpretability. Graph neural networks and attention mechanisms enhance data preprocessing for sequential data modeling and object detection. Time series forecasting and dataset creation further expand deep learning's reach, while privacy preservation and bias mitigation ensure responsible use.

    In summary, deep learning's market dynamics reflect a constant pursuit of innovation, efficiency, and ethical considerations. The Deep Learning Market in the US is flourishing as organizations embrace intelligent systems powered by supervised learning and emerging self-supervised learning techniques. These methods refine predictive capabilities and reduce reliance on labeled data, boosting scalability. BFSI firms utilize AI image recognition for various applications, including personalizing customer communication, maintaining a competitive edge, and automating repetitive tasks to boost productivity. Sophisticated feature extraction algorithms now enable models to isolate patterns with high precision, particularly in applications such as image classification for healthcare, security, and retail.

    How is this market segmented and which is the largest segment?

    The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Application: Image recognition, Voice recognition, Video surveillance and diagnostics, Data mining

    Type: Software, Services, Hardware

    End-user: Security, Automotive, Healthcare, Retail and commerce, Others

    Geography: North America (US)
    By Application Insights

    The Image recognition segment is estimated to witness significant growth during the forecast period. In the realm of artificial intelligence (AI) and machine learning, image recognition, a subset of computer vision, is gaining significant traction. This technology utilizes neural networks, deep learning models, and various machine learning algorithms to decipher visual data from images and videos. Image recognition is instrumental in numerous applications, including visual search, product recommendations, and inventory management. Consumers can take photographs of products to discover similar items, enhancing the online shopping experience. In the automotive sector, image recognition is indispensable for advanced driver assistance systems (ADAS) and autonomous vehicles, enabling the identification of pedestrians, other vehicles, road signs, and lane markings.

    Furthermore, image recognition plays a pivotal role in augmented reality (AR) and virtual reality (VR) applications, where it tracks physical objects and overlays digital content onto real-world scenarios. The model training process involves the backpropagation algorithm, which calculates the loss function.

  14. Product data mining: entity classification&linking

    • kaggle.com
    zip
    Updated Jul 13, 2020
    Cite
    zzhang (2020). Product data mining: entity classification&linking [Dataset]. https://www.kaggle.com/ziqizhang/product-data-miningentity-classificationlinking
    Explore at:
    Available download formats: zip (10933 bytes)
    Dataset updated
    Jul 13, 2020
    Authors
    zzhang
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    IMPORTANT: Round 1 results are now released, check our website for the leaderboard. We now open Round 2 submissions!

    1. Overview

    We release two datasets that are part of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data, co-located with the 19th International Semantic Web Conference (https://iswc2020.semanticweb.org/, 2-6 Nov 2020 at Athens, Greece). The datasets belong to two shared tasks related to product data mining on the Web: (1) product matching (linking) and (2) product classification. This event is organised by The University of Sheffield, The University of Mannheim and Amazon, and is open to anyone. Systems that successfully beat the baseline of the respective task will be invited to write a paper describing their method and system and to present the method as a poster (and potentially also a short talk) at the ISWC2020 conference. Winners of each task will be awarded 500 euro as a prize (partly sponsored by Peak Indicators, https://www.peakindicators.com/).

    2. Task and dataset brief

    The challenge organises two tasks, product matching and product categorisation.

    i) Product Matching deals with identifying product offers on different websites that refer to the same real-world product (e.g., the same iPhone X model offered under different names/offer titles as well as different descriptions on various websites). A multi-million product offer corpus (16M) containing product offer clusters is released for the generation of training data. A validation set containing 1.1K offer pairs and a test set of 600 offer pairs will also be released. The goal of this task is to classify whether the offer pairs in these datasets are a match (i.e., referring to the same product) or a non-match.

    ii) Product classification deals with assigning predefined product category labels (which can be multiple levels) to product instances (e.g., iPhone X is a ‘SmartPhone’, and also ‘Electronics’). A training dataset containing 10K product offers, a validation set of 3K product offers and a test set of 3K product offers will be released. Each dataset contains product offers with their metadata (e.g., name, description, URL) and three classification labels each corresponding to a level in the GS1 Global Product Classification taxonomy. The goal is to classify these product offers into the pre-defined category labels.
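    A naive starting point for the matching task might score title similarity, as sketched below. This is purely illustrative and is not the challenge's official baseline; the function name and threshold are invented:

    ```python
    from difflib import SequenceMatcher

    # Call an offer pair a match when their titles are similar enough.
    def is_match(title_a: str, title_b: str, threshold: float = 0.8) -> bool:
        ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
        return ratio >= threshold

    is_match("Apple iPhone X 64GB", "apple iphone x 64 gb")   # near-duplicate titles
    is_match("Apple iPhone X 64GB", "Samsung 55-inch 4K TV")  # clearly different
    ```

    Real submissions would need the offer descriptions and cluster structure of the 16M corpus, not just titles; this only shows the shape of the pairwise match/non-match decision.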

    All datasets are built based on structured data that was extracted from the Common Crawl (https://commoncrawl.org/) by the Web Data Commons project (http://webdatacommons.org/). Datasets can be found at: https://ir-ischool-uos.github.io/mwpd/

    3. Resources and tools

    The challenge will also release utility code (in Python) for processing the above datasets and scoring system outputs. In addition, the following language resources for product-related data mining tasks will be released: a text corpus of 150 million product offer descriptions, and word embeddings trained on that corpus.

    4. Challenge website

    For details of the challenge please visit https://ir-ischool-uos.github.io/mwpd/

    5. Organizing committee

    Dr Ziqi Zhang (Information School, The University of Sheffield)
    Prof. Christian Bizer (Institute of Computer Science and Business Informatics, The Mannheim University)
    Dr Haiping Lu (Department of Computer Science, The University of Sheffield)
    Dr Jun Ma (Amazon Inc. Seattle, US)
    Prof. Paul Clough (Information School, The University of Sheffield & Peak Indicators)
    Ms Anna Primpeli (Institute of Computer Science and Business Informatics, The Mannheim University)
    Mr Ralph Peeters (Institute of Computer Science and Business Informatics, The Mannheim University)
    Mr Abdulkareem Alqusair (Information School, The University of Sheffield)

    6. Contact

    To contact the organising committee please use the Google discussion group https://groups.google.com/forum/#!forum/mwpd2020

  15. The Insurance Company (TIC) Benchmark

    • kaggle.com
    zip
    Updated May 27, 2020
    Cite
    Kush Shah (2020). The Insurance Company (TIC) Benchmark [Dataset]. https://www.kaggle.com/datasets/kushshah95/the-insurance-company-tic-benchmark/code
    Explore at:
    Available download formats: zip (268454 bytes)
    Dataset updated
    May 27, 2020
    Authors
    Kush Shah
    Description

    This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data.

    DETAILED DATA DESCRIPTION

    THE INSURANCE COMPANY (TIC) 2000

    (c) Sentient Machine Research 2000

    DISCLAIMER

    This dataset is owned and supplied by the Dutch data mining company Sentient Machine Research, and is based on real-world business data. You are allowed to use this dataset and accompanying information for non-commercial research and education purposes only. It is explicitly not allowed to use this dataset for commercial education or demonstration purposes. For any other use, please contact Peter van der Putten, info@smr.nl.

    This dataset has been used in the CoIL Challenge 2000 data mining competition. For papers describing results on this dataset, see the TIC 2000 homepage: http://www.wi.leidenuniv.nl/~putten/library/cc2000/

    REFERENCE

    P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Sentient Machine Research, Amsterdam. Also Leiden Institute of Advanced Computer Science Technical Report 2000-09, June 22, 2000. See http://www.liacs.nl/~putten/library/cc2000/

    RELEVANT FILES

    tic_2000_train_data.csv: Dataset to train and validate prediction models and build a description (5822 customer records). Each record consists of 86 attributes, containing sociodemographic data (attributes 1-43) and product ownership (attributes 44-86). The sociodemographic data is derived from zip codes: all customers living in areas with the same zip code have the same sociodemographic attributes. Attribute 86, "CARAVAN: Number of mobile home policies", is the target variable.

    tic_2000_eval_data.csv: Dataset for predictions (4000 customer records). It has the same format as the training data, only the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in CSV format. The meaning of the attributes and attribute values is given in dictionary.csv.

    tic_2000_target_data.csv: Targets for the evaluation set.

    dictionary.txt: Data description listing each column and the labels of its numerically encoded (dummy/labeled) categories.

    Original task description: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/problem.html
    UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
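The 86-attribute record layout described above (attributes 1-85 as features, attribute 86, CARAVAN, as the target) can be sketched as follows. This is an illustrative helper, not part of the dataset distribution; the sample rows are synthetic stand-ins for records from tic_2000_train_data.csv.

```python
def split_features_target(rows):
    """Given rows of 86 integer attributes, return (features, targets)."""
    features = [row[:85] for row in rows]   # attributes 1-85
    targets = [row[85] for row in rows]     # attribute 86: CARAVAN
    return features, targets

# Synthetic stand-ins, for illustration only
sample_rows = [
    [0] * 85 + [1],   # a customer holding a mobile home policy
    [0] * 85 + [0],   # a customer without one
]

X, y = split_features_target(sample_rows)
print(len(X[0]), y)  # prints: 85 [1, 0]
```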

  16. Additional file 4 of: Study of serious adverse drug reactions using FDA-approved drug labeling and MedDRA

    • springernature.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Leihong Wu; Taylor Ingle; Zhichao Liu; Anna Zhao-Wong; Stephen Harris; Shraddha Thakkar; Guangxu Zhou; Junshuang Yang; Joshua Xu; Darshan Mehta; Weigong Ge; Weida Tong; Hong Fang (2023). Additional file 4: of Study of serious adverse drug reactions using FDA-approved drug labeling and MedDRA [Dataset]. http://doi.org/10.6084/m9.figshare.7850468.v1
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Leihong Wu; Taylor Ingle; Zhichao Liu; Anna Zhao-Wong; Stephen Harris; Shraddha Thakkar; Guangxu Zhou; Junshuang Yang; Joshua Xu; Darshan Mehta; Weigong Ge; Weida Tong; Hong Fang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table S4. MedDRA term extraction performance of Oracle Text Search (XLS 56 kb)

  17. Data from: Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis

    • dataverse.harvard.edu
    • nde-dev.biothings.io
    • +1more
    Updated Oct 15, 2024
    + more versions
    Cite
    Nirmalya Thakur (2024). Mpox Narrative on Instagram: A Labeled Multilingual Dataset of Instagram Posts on Mpox for Sentiment, Hate Speech, and Anxiety Analysis [Dataset]. http://doi.org/10.7910/DVN/TJVSY0
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Nirmalya Thakur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Jul 23, 2022 - Sep 5, 2024
    Description

    Please cite the following paper when using this dataset: N. Thakur, "Mpox narrative on Instagram: A labeled multilingual dataset of Instagram posts on mpox for sentiment, hate speech, and anxiety analysis," arXiv [cs.LG], 2024, URL: https://arxiv.org/abs/2409.05292

    Abstract

    The world is currently experiencing an outbreak of mpox, which has been declared a Public Health Emergency of International Concern by WHO. During recent virus outbreaks, social media platforms have played a crucial role in keeping the global population informed and updated regarding various aspects of the outbreaks. As a result, in the last few years, researchers from different disciplines have focused on the development of social media datasets for different virus outbreaks. No prior work in this field has focused on the development of a dataset of Instagram posts about the mpox outbreak. The work presented in the paper stated above aims to address this research gap. It presents a multilingual dataset of 60,127 Instagram posts about mpox, published between July 23, 2022, and September 5, 2024, in 52 languages. For each of these posts, the Post ID, Post Description, Date of publication, language, and translated version of the post (translation to English was performed using the Google Translate API) are presented as separate attributes in the dataset.

    After developing this dataset, sentiment analysis, hate speech detection, and anxiety or stress detection were also performed. This process included classifying each post into:

    one of the fine-grained sentiment classes, i.e., fear, surprise, joy, sadness, anger, disgust, or neutral
    hate or not hate
    anxiety/stress detected or no anxiety/stress detected

    These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for sentiment, hate speech, and anxiety or stress detection, as well as for other applications.

    The distinct languages in which Instagram posts are present in this dataset are English, Portuguese, Indonesian, Spanish, Korean, French, Hindi, Finnish, Turkish, Italian, German, Tamil, Urdu, Thai, Arabic, Persian, Tagalog, Dutch, Catalan, Bengali, Marathi, Malayalam, Swahili, Afrikaans, Panjabi, Gujarati, Somali, Lithuanian, Norwegian, Estonian, Swedish, Telugu, Russian, Danish, Slovak, Japanese, Kannada, Polish, Vietnamese, Hebrew, Romanian, Nepali, Czech, Modern Greek, Albanian, Croatian, Slovenian, Bulgarian, Ukrainian, Welsh, Hungarian, and Latvian.

    The following is a description of the attributes present in this dataset:

    Post ID: Unique ID of each Instagram post
    Post Description: Complete description of each post in the language in which it was originally published
    Date: Date of publication in MM/DD/YYYY format
    Language: Language of the post as detected using the Google Translate API
    Translated Post Description: Translated version of the post description. All posts which were not in English were translated into English using the Google Translate API. No language translation was performed for English posts.
    Sentiment: Results of sentiment analysis (using the preprocessed version of the translated Post Description) where each post was classified into one of the sentiment classes: fear, surprise, joy, sadness, anger, disgust, and neutral
    Hate: Results of hate speech detection (using the preprocessed version of the translated Post Description) where each post was classified as hate or not hate
    Anxiety or Stress: Results of anxiety or stress detection (using the preprocessed version of the translated Post Description) where each post was classified as stress/anxiety detected or no stress/anxiety detected

    All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view them (at the time of writing the paper).
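As a minimal sketch of how the labeled attributes above might be consumed, the snippet below tallies the fine-grained sentiment classes. The dictionary keys mirror the attribute names in the description; the records are synthetic stand-ins, not rows from the dataset.

```python
from collections import Counter

def sentiment_distribution(records):
    """Count posts per sentiment class (fear, surprise, joy,
    sadness, anger, disgust, neutral)."""
    return Counter(rec["Sentiment"] for rec in records)

# Synthetic records, for illustration only
sample = [
    {"Post ID": "1", "Language": "en", "Sentiment": "fear"},
    {"Post ID": "2", "Language": "es", "Sentiment": "neutral"},
    {"Post ID": "3", "Language": "en", "Sentiment": "fear"},
]

print(sentiment_distribution(sample))  # Counter({'fear': 2, 'neutral': 1})
```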

  18. Seair Exim Solutions

    • seair.co.in
    Updated Feb 29, 2024
    Cite
    Seair Exim (2024). Seair Exim Solutions [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    Feb 29, 2024
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  19. Dataset for training classifiers of comparative sentences

    • live.european-language-grid.eu
    csv
    Updated Apr 19, 2024
    Cite
    (2024). Dataset for training classifiers of comparative sentences [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7607
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 19, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As there was no large publicly available cross-domain dataset for comparative argument mining, we create one composed of sentences, potentially annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). The BETTER sentences stand for a pro-argument in favor of the first compared object; WORSE sentences represent a con-argument and favor the second object.

    We aim to minimize dataset domain-specific biases in order to capture the nature of comparison and not the nature of the particular domains, and thus decided to control the specificity of domains by the selection of comparison targets. We hypothesized, and could confirm in preliminary experiments, that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we utilized for selection of the compared object pairs.

    The most specific domain we chose is computer science, with comparison targets like programming languages, database products, and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from "List of"-articles at Wikipedia. In the annotation process, annotators were asked to only label sentences from this domain if they had some basic knowledge in computer science.

    The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from "List of"-articles at Wikipedia.

    The third domain is not restricted to any topic: random. For each of 24 randomly selected seed words, 10 similar words were collected based on the distributional similarity API of JoBimText (http://www.jobimtext.org). Seed words were created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale.

    Especially for brands and computer science, the resulting object lists were large (4493 in brands and 1339 in computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined to pairs. For each object type (seed Wikipedia list page or the seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. The aforementioned approaches to selecting compared object pairs tend to minimize inclusion of domain-specific data, but do not solve the problem fully. We leave open the question of extending the dataset with diverse object pairs, including abstract concepts, for future work.

    As for the sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus, containing over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair. For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons, but at the same time admit comparisons that do not contain any of the anticipated cues. This was necessary as random sampling would have resulted in only a very tiny fraction of comparisons.

    Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: He's the best pet that you can get, better than a dog or cat.). It is thus especially crucial to enable a classifier to learn not to rely on the existence of cue words only (very likely in a random sample of sentences with very few comparisons). For our corpus, we keep pairs with at least 100 retrieved sentences.

    From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, which is the figure-eight-internal measure combining annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes, and for over 85% at least 4 out of 5 workers agreed, rendering the collection procedure aimed at ease of annotation successful.

    The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.

    You can browse through the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing

    A full description of the dataset is available in the workshop paper at the ACL 2019 conference. Please cite this paper if you use the data:

    Franzek, Mirco, Alexander Panchenko, and Chris Biemann. "Categorization of Comparative Sentences for Argument Mining." arXiv preprint arXiv:1809.06152 (2018).

    @inproceedings{franzek2018categorization,
      title={Categorization of Comparative Sentences for Argument Mining},
      author={Panchenko, Alexander and Bondarenko, and Franzek, Mirco and Hagen, Matthias and Biemann, Chris},
      booktitle={Proceedings of the 6th Workshop on Argument Mining at ACL'2019},
      year={2019},
      address={Florence, Italy}}
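The caveat above, that cue words alone do not make a sentence comparative, can be illustrated with a naive cue-word baseline. This is a hypothetical sketch using the cue-word list from the description, not the classifier trained on the dataset; note how it mislabels the dog-vs-cat counterexample.

```python
# Cue words from the description, split by polarity (an assumption
# for illustration; the original list is not grouped this way).
BETTER_CUES = {"better", "easier", "faster", "nicer", "wiser", "cooler",
               "decent", "safer", "superior", "solid", "terrific"}
WORSE_CUES = {"worse", "harder", "slower", "poorly", "uglier", "poorer",
              "lousy", "nastier", "inferior", "mediocre"}

def cue_word_label(sentence):
    """Label a sentence BETTER/WORSE/NONE purely from cue-word presence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    if words & BETTER_CUES:
        return "BETTER"
    if words & WORSE_CUES:
        return "WORSE"
    return "NONE"

# The counterexample from the description: 'better' appears, but the
# sentence does not compare dog and cat; the baseline labels it anyway.
print(cue_word_label("He's the best pet that you can get, better than a dog or cat."))
# prints: BETTER
```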

  20. DrivenData: Pump it Up

    • kaggle.com
    zip
    Updated Jan 21, 2021
    + more versions
    Cite
    Abid Ali Awan (2021). DrivenData: Pump it Up [Dataset]. https://www.kaggle.com/kingabzpro/drivendata-pump-it-up
    Explore at:
    Available download formats: zip (10914484 bytes)
    Dataset updated
    Jan 21, 2021
    Authors
    Abid Ali Awan
    Description

    Context

    Can you predict which water pumps are faulty?

    Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

    Content

    Problem description

    This is where you'll find all of the documentation about this dataset and the problem we are trying to solve. For this competition, there are three subsections to the problem description:

    Features: list of features, example of features
    Labels: list of labels
    Submission Format: format example

    The features in this dataset

    Your goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

    amount_tsh - Total static head (amount water available to waterpoint)
    date_recorded - The date the row was entered
    funder - Who funded the well
    gps_height - Altitude of the well
    installer - Organization that installed the well
    longitude - GPS coordinate
    latitude - GPS coordinate
    wpt_name - Name of the waterpoint if there is one
    num_private -
    basin - Geographic water basin
    subvillage - Geographic location
    region - Geographic location
    region_code - Geographic location (coded)
    district_code - Geographic location (coded)
    lga - Geographic location
    ward - Geographic location
    population - Population around the well
    public_meeting - True/False
    recorded_by - Group entering this row of data
    scheme_management - Who operates the waterpoint
    scheme_name - Who operates the waterpoint
    permit - If the waterpoint is permitted
    construction_year - Year the waterpoint was constructed
    extraction_type - The kind of extraction the waterpoint uses
    extraction_type_group - The kind of extraction the waterpoint uses
    extraction_type_class - The kind of extraction the waterpoint uses
    management - How the waterpoint is managed
    management_group - How the waterpoint is managed
    payment - What the water costs
    payment_type - What the water costs
    water_quality - The quality of the water
    quality_group - The quality of the water
    quantity - The quantity of water
    quantity_group - The quantity of water
    source - The source of the water
    source_type - The source of the water
    source_class - The source of the water
    waterpoint_type - The kind of waterpoint
    waterpoint_type_group - The kind of waterpoint

    Feature data example

    For example, a single row in the dataset might have these values:

    amount_tsh: 300.0
    date_recorded: 2013-02-26
    funder: Germany Republi
    gps_height: 1335
    installer: CES
    longitude: 37.2029845
    latitude: -3.22870286
    wpt_name: Kwaa Hassan Ismail
    num_private: 0
    basin: Pangani
    subvillage: Bwani
    region: Kilimanjaro
    region_code: 3
    district_code: 5
    lga: Hai
    ward: Machame Uroki
    population: 25
    public_meeting: True
    recorded_by: GeoData Consultants Ltd
    scheme_management: Water Board
    scheme_name: Uroki-Bomang'ombe water sup
    permit: True
    construction_year: 1995
    extraction_type: gravity
    extraction_type_group: gravity
    extraction_type_class: gravity
    management: water board
    management_group: user-group
    payment: other
    payment_type: other
    water_quality: soft
    quality_group: good
    quantity: enough
    quantity_group: enough
    source: spring
    source_type: spring
    source_class: groundwater
    waterpoint_type: communal standpipe
    waterpoint_type_group: communal standpipe

    The labels in this dataset

    Distribution of Labels

    The labels in this dataset are simple. There are three possible values:

    functional - the waterpoint is operational and there are no repairs needed
    functional needs repair - the waterpoint is operational, but needs repairs
    non functional - the waterpoint is not operational

    Submission format

    The format for the submission file is simply the row id and the predicted label (for an example, see SubmissionFormat.csv on the data download page).

    For example, if you just predicted that all the waterpoints were functional, you would have the following predictions:

    id    status_group
    50785 functional
    51630 functional
    17168 functional
    45559 functional
    49871 functional

    Your .csv file that you submit would look like:

    id,status_group
    50785,functional
    51630,functional
    17168,functional
    45559,functional
    ...
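A minimal sketch of producing the all-functional baseline submission in the id,status_group format shown above; the helper name is illustrative, and the ids are the examples from the text.

```python
import csv
import io

def write_submission(ids, label="functional"):
    """Build a submission CSV (id,status_group) with one label for all rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "status_group"])
    for row_id in ids:
        writer.writerow([row_id, label])
    return buf.getvalue()

print(write_submission([50785, 51630, 17168, 45559]))
```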

    Acknowledgements

    All rights reserved by DrivenData.
