100+ datasets found
  1. Data Collection And Labeling Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Nov 17, 2025
    Cite
    Data Insights Market (2025). Data Collection And Labeling Report [Dataset]. https://www.datainsightsmarket.com/reports/data-collection-and-labeling-1945059
    Available download formats: ppt, doc, pdf
    Dataset updated
    Nov 17, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Explore the booming data collection and labeling market, driven by AI advancements. Discover key growth drivers, market trends, and forecasts for 2025-2033, essential for AI development across IT, automotive, and healthcare.

  2. Replication Data for: Automatic Collective Behaviour Recognition

    • dataone.org
    • dataverse.harvard.edu
    Updated Nov 8, 2023
    Cite
    Abpeikar, Shadi (2023). Replication Data for: Automatic Collective Behaviour Recognition [Dataset]. http://doi.org/10.7910/DVN/S1YJOX
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Abpeikar, Shadi
    Description

    Collective behaviour, such as flocks of birds and schools of fish, has inspired computer-based systems and is widely used in agent formation. Humans can easily recognise these behaviours, but it is hard for a computer system to do so. Ground truth data on human perception of collective behaviour could therefore enable machine learning methods to mimic that perception. Such data were collected by running an online survey on collective behaviour recognition. The collective motions considered in the survey include 16 structured and unstructured behaviours. Structured collective motions are boid movements with an identifiable embedded pattern; unstructured collective motions consist of random boid movement with no pattern. The participants have diverse levels of knowledge, come from all over the world, and are over 18 years old. Each question contains a short video (around 10 seconds) captured from one of the 16 simulated movements, and the videos are shown to the participants in a randomized order. Participants were asked to label each structured motion of boids as ‘flocking’, ‘aligned’, or ‘grouped’ and the others as ‘not flocking’, ‘not aligned’, or ‘not grouped’. Averaging these human perceptions yields three binary labelled datasets of the motions. Machine learning methods can be trained on these data to recognise collective behaviour automatically.
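
    The averaging step could look like the sketch below. The 0/1 vote encoding, the 0.5 cut-off, and the video identifiers are assumptions made for illustration; the description does not state how the averaged responses were turned into binary labels.

```python
from statistics import mean


def binary_label(votes, threshold=0.5):
    """Average 0/1 votes and return True when the mean reaches the threshold.

    The threshold is an assumed value, not one taken from the dataset.
    """
    return mean(votes) >= threshold


# Hypothetical responses: 1 = participant chose 'flocking', 0 = 'not flocking'.
responses = {
    "video_01": [1, 1, 0, 1, 1],
    "video_02": [0, 0, 1, 0, 0],
}

# One binary label per video; the same step would be repeated for
# 'aligned' and 'grouped' to obtain three labelled datasets.
flocking_labels = {video: binary_label(votes) for video, votes in responses.items()}
```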

  3. Data from: A survey of image labelling for computer vision applications

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Christoph Sager; Christian Janiesch; Patrick Zschech (2023). A survey of image labelling for computer vision applications [Dataset]. http://doi.org/10.6084/m9.figshare.14445354.v1
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Christoph Sager; Christian Janiesch; Patrick Zschech
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supervised machine learning methods for image analysis require large amounts of labelled training data to solve computer vision problems. The recent rise of deep learning algorithms for recognising image content has led to the emergence of many ad-hoc labelling tools. With this survey, we capture and systematise the commonalities as well as the distinctions between existing image labelling software. We perform a structured literature review to compile the underlying concepts and features of image labelling software, such as annotation expressiveness and degree of automation. We structure the manual labelling task by its organisation of work, user interface design options, and user support techniques to derive a systematisation schema for this survey. Applying it to available software and the body of literature enabled us to uncover several application archetypes and key domains, such as image retrieval or instance identification in healthcare or television.

  4. Telecom Data Labeling Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 29, 2025
    Cite
    Growth Market Reports (2025). Telecom Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/telecom-data-labeling-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Telecom Data Labeling Market Outlook



    According to our latest research, the global Telecom Data Labeling market size reached USD 1.42 billion in 2024, driven by the exponential growth in data generation, increasing adoption of AI and machine learning in telecom operations, and the rising complexity of communication networks. The market is forecasted to expand at a robust CAGR of 22.8% from 2025 to 2033, reaching an estimated USD 10.09 billion by 2033. This strong momentum is underpinned by the escalating demand for high-quality labeled datasets to power advanced analytics and automation in the telecom sector.
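
    Growth figures of this kind can be sanity-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: treating 2024 to 2033 as nine compounding periods is an assumption, and the report does not state which base year or period count it uses, so the results need not match the quoted forecast exactly.

```python
def project(value, rate, years):
    """Compound `value` forward by `rate` per year for `years` years."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Constant annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


BASE_2024_USD_BN = 1.42    # market size quoted for 2024
QUOTED_CAGR = 0.228        # growth rate quoted for 2025 to 2033
QUOTED_2033_USD_BN = 10.09 # forecast quoted for 2033
PERIODS = 9                # assumption: nine yearly steps from 2024 to 2033

projected_2033 = project(BASE_2024_USD_BN, QUOTED_CAGR, PERIODS)
implied_rate = implied_cagr(BASE_2024_USD_BN, QUOTED_2033_USD_BN, PERIODS)

print(f"Projected 2033 size at the quoted rate: USD {projected_2033:.2f} billion")
print(f"Rate implied by the quoted endpoints: {implied_rate:.1%}")
```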




    The growth trajectory of the Telecom Data Labeling market is fundamentally propelled by the surging data volumes generated by telecom networks worldwide. With the proliferation of 5G, IoT devices, and cloud-based services, telecom operators are inundated with massive streams of structured and unstructured data. Efficient data labeling is essential to transform raw data into actionable insights, fueling AI-driven solutions for network optimization, predictive maintenance, and fraud detection. Additionally, the mounting pressure on telecom companies to enhance customer experience and operational efficiency is prompting significant investments in data labeling infrastructure and services, further accelerating market expansion.




    Another critical growth factor is the rapid evolution of artificial intelligence and machine learning applications within the telecommunications industry. AI-powered tools depend on vast quantities of accurately labeled data to deliver reliable predictions and automation. As telecom companies strive to automate network management, detect anomalies, and personalize user experiences, the demand for high-quality labeled datasets has surged. The emergence of advanced labeling techniques, including semi-automated and automated labeling methods, is enabling telecom enterprises to keep pace with the growing data complexity and volume, thus fostering faster and more scalable AI deployments.




    Furthermore, regulatory compliance and data privacy concerns are shaping the landscape of the Telecom Data Labeling market. As governments worldwide tighten data protection regulations, telecom operators are compelled to ensure that data used for AI and analytics is accurately labeled and anonymized. This necessity is driving the adoption of robust data labeling solutions that not only facilitate compliance but also enhance data quality and integrity. The integration of secure, privacy-centric labeling platforms is becoming a competitive differentiator, especially in regions with stringent data governance frameworks. This trend is expected to persist, reinforcing the market’s upward trajectory.



    AI-Powered Product Labeling is revolutionizing the telecom industry by providing more efficient and accurate data annotation processes. This technology leverages artificial intelligence to automate the labeling of large datasets, reducing the time and costs associated with manual labeling. By utilizing AI algorithms, telecom operators can ensure that their data is consistently labeled with high precision, which is crucial for training machine learning models. This advancement not only enhances the quality of labeled data but also accelerates the deployment of AI-driven solutions across various applications, such as network optimization and customer experience management. As AI-Powered Product Labeling continues to evolve, it is expected to play a pivotal role in the telecom sector's digital transformation journey, enabling operators to harness the full potential of their data assets.




    From a regional perspective, Asia Pacific is emerging as a powerhouse in the Telecom Data Labeling market, fueled by rapid digitalization, expanding telecom infrastructure, and the early adoption of 5G technologies. North America remains a significant contributor, owing to its mature telecom ecosystem and high investments in AI research and development. Europe is also witnessing steady growth, driven by regulatory mandates and increasing focus on data-driven network management. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with investments in digital transformation and telecom modernization initiatives providing new growth avenues. These regional dynamics collectively underscore the global nature

  5. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Available download formats: utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.
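
    The three restrictions above can be read as a single record filter. The following sketch assumes each article is a dictionary with hypothetical keys `abstract`, `has_parsed_fulltext`, `year`, and `field`; these names are illustrative and are not taken from the S2ORC schema.

```python
# Fields of study that the description lists as included.
INCLUDED_FIELDS = {
    "Economics", "Political Science", "Business",
    "Sociology", "Medicine", "Psychology",
}


def keep_article(article):
    """Return True if a record passes all three restrictions."""
    # Restriction 1: an abstract and a parsed PDF or LaTeX file must exist.
    has_text = bool(article.get("abstract")) and bool(article.get("has_parsed_fulltext"))
    # Restriction 2: publication year between 2000 and 2020 inclusive.
    in_window = 2000 <= article.get("year", 0) <= 2020
    # Restriction 3: field of study is one of the included fields.
    in_scope = article.get("field") in INCLUDED_FIELDS
    return has_text and in_window and in_scope


# Hypothetical records for illustration only.
sample = [
    {"abstract": "Survey evidence ...", "has_parsed_fulltext": True, "year": 2015, "field": "Economics"},
    {"abstract": "", "has_parsed_fulltext": True, "year": 2015, "field": "Economics"},
    {"abstract": "Lattice models ...", "has_parsed_fulltext": True, "year": 2010, "field": "Physics"},
    {"abstract": "Cohort study ...", "has_parsed_fulltext": True, "year": 1998, "field": "Medicine"},
]
restricted = [a for a in sample if keep_article(a)]
```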


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
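
    A minimal sketch of the regular expression approach is shown below. The short country list is an illustrative subset rather than the full ISO 3166 list, and the function name is hypothetical. The last call shows the kind of exclusion error described above, where a non-standard spelling is missed.

```python
import re

# Illustrative subset of country names; a real run would load the full list.
COUNTRY_NAMES = ["Kenya", "Brazil", "Viet Nam", "India"]

# One alternation pattern with word boundaries, matched case-insensitively.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in COUNTRY_NAMES) + r")\b",
    re.IGNORECASE,
)
CANONICAL = {name.lower(): name for name in COUNTRY_NAMES}


def countries_by_regex(*fields):
    """Return the set of listed country names found in any of the text fields."""
    found = set()
    for text in fields:
        for match in PATTERN.findall(text or ""):
            found.add(CANONICAL[match.lower()])
    return found


title_hits = countries_by_regex("Household surveys in Kenya and brazil", None)
# 'Vietnam' is not spelled like the listed name 'Viet Nam', so it is missed.
missed = countries_by_regex("Evidence from Vietnam")
```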


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
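
    The "either approach" rule amounts to a set union over detectors. The sketch below is an assumption-laden illustration: the spaCy model name `en_core_web_sm` is not stated in the description, the `GPE` label also covers cities and states so a further country filter would be needed in practice, and `keyword_detector` is a hypothetical stand-in for the regular expression search.

```python
def link_countries(text, detectors):
    """Union of the names returned by every detector for this text."""
    linked = set()
    for detect in detectors:
        linked |= set(detect(text))
    return linked


def spacy_gpe_detector(model_name="en_core_web_sm"):
    """Build a detector from spaCy NER.

    Requires spaCy and the named model to be installed; the model choice
    is an assumption. GPE entities include cities and states as well as
    countries, so results would still need filtering to country names.
    """
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)

    def detect(text):
        return {ent.text for ent in nlp(text).ents if ent.label_ == "GPE"}

    return detect


def keyword_detector(text):
    """Hypothetical stand-in for the regular expression search."""
    return {name for name in ("Kenya", "Brazil") if name in text}


# An article can be linked to more than one country.
linked = link_countries(
    "Surveys in Kenya and Brazil",
    [keyword_detector],
)
```

    With spaCy installed, `spacy_gpe_detector()` can be appended to the detector list so that a name found by either method is linked to the article.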


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. identifying whether an academic article is using data from any country

    2. identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues that a study uses data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time that a worker spent on an article, measured as the time between when the article was accepted to be classified by the worker and when the classification was submitted was 25.4 minutes. If human raters were exclusively used rather than machine learning tools, then the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review at a cost of $3,113,244, which assumes a cost of $3 per article as was paid to MTurk workers.
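
    The time and cost figures can be reproduced with a short calculation. The sketch assumes that the median time is applied to every article and that "years of human work time" means continuous 24-hour days over 365-day years; that convention is an assumption, chosen because it is consistent with the figure quoted above.

```python
ARTICLES = 1_037_748          # size of the sampled corpus
MEDIAN_MINUTES = 25.4         # median time a worker spent per article
COST_PER_ARTICLE_USD = 3      # payment per article to MTurk workers

# Total rating time, treating the median as the time for every article.
total_minutes = ARTICLES * MEDIAN_MINUTES

# Assumed convention: continuous time, 24 hours a day, 365 days a year.
years_continuous = total_minutes / 60 / 24 / 365

total_cost_usd = ARTICLES * COST_PER_ARTICLE_USD

print(f"Rating time: {years_continuous:.1f} years")
print(f"Rating cost: ${total_cost_usd:,}")
```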


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, Wolf 2019). We use PyTorch to produce a model to classify articles based on the labeled data. Of the 3,500 articles that were hand coded by the MTurk workers, 900 are fed to the machine learning model. 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
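
    The 90% confidence rule can be sketched independently of the model itself. The example below assumes a two-class output (index 0 for "no data", index 1 for "uses data") expressed as raw logits; the description does not specify the classifier head, so this layout is an assumption.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uses_data(logits, threshold=0.9, positive_index=1):
    """Assign 'uses data' only if the positive-class probability reaches the threshold."""
    return softmax(logits)[positive_index] >= threshold


# Hypothetical logits for two articles.
confident = uses_data([0.0, 3.0])   # positive probability is about 0.95
uncertain = uses_data([0.0, 1.0])   # positive probability is about 0.73
```

    A strict threshold like this trades recall for precision: articles the model is unsure about are treated as not using data.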


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model perform the same kind of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  6. AI in Semi-supervised Learning Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Jul 24, 2025
    Cite
    Research Intelo (2025). AI in Semi-supervised Learning Market Research Report 2033 [Dataset]. https://researchintelo.com/report/ai-in-semi-supervised-learning-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jul 24, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    AI in Semi-supervised Learning Market Outlook



    According to our latest research, the AI in Semi-supervised Learning market size reached USD 1.82 billion in 2024 globally, driven by rapid advancements in artificial intelligence and machine learning applications across diverse industries. The market is expected to expand at a robust CAGR of 28.1% from 2025 to 2033, reaching a projected value of USD 17.17 billion by 2033. This exponential growth is primarily fueled by the increasing need for efficient data labeling, the proliferation of unstructured data, and the growing adoption of AI-driven solutions in both large enterprises and small and medium businesses. As per the latest research, the surging demand for automation, accuracy, and cost-efficiency in data processing is significantly accelerating the adoption of semi-supervised learning models worldwide.



    One of the most significant growth factors for the AI in Semi-supervised Learning market is the explosive increase in data generation across industries such as healthcare, finance, retail, and automotive. Organizations are continually collecting vast amounts of structured and unstructured data, but the process of labeling this data for supervised learning remains time-consuming and expensive. Semi-supervised learning offers a compelling solution by leveraging small amounts of labeled data alongside large volumes of unlabeled data, thus reducing the dependency on extensive manual annotation. This approach not only accelerates the deployment of AI models but also enhances their accuracy and scalability, making it highly attractive for enterprises seeking to maximize the value of their data assets while minimizing operational costs.



    Another critical driver propelling the growth of the AI in Semi-supervised Learning market is the increasing sophistication of AI algorithms and the integration of advanced technologies such as deep learning, natural language processing, and computer vision. These advancements have enabled semi-supervised learning models to achieve remarkable performance in complex tasks like image and speech recognition, medical diagnostics, and fraud detection. The ability to process and interpret vast datasets with minimal supervision is particularly valuable in sectors where labeled data is scarce or expensive to obtain. Furthermore, the ongoing investments in research and development by leading technology companies and academic institutions are fostering innovation, resulting in more robust and scalable semi-supervised learning frameworks that can be seamlessly integrated into enterprise workflows.



    The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud environments are also contributing significantly to the expansion of the AI in Semi-supervised Learning market. Cloud-based deployment offers unparalleled scalability, flexibility, and cost-efficiency, allowing organizations of all sizes to access cutting-edge AI tools and infrastructure without the need for substantial upfront investments. This democratization of AI technology is empowering small and medium enterprises to leverage semi-supervised learning for competitive advantage, driving widespread adoption across regions and industries. Additionally, the emergence of AI-as-a-Service (AIaaS) platforms is further simplifying the integration and management of semi-supervised learning models, enabling businesses to accelerate their digital transformation initiatives and unlock new growth opportunities.



    From a regional perspective, North America currently dominates the AI in Semi-supervised Learning market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The strong presence of leading AI vendors, robust technological infrastructure, and high investments in AI research and development are key factors driving market growth in these regions. Asia Pacific is expected to witness the fastest CAGR during the forecast period, fueled by rapid digitalization, expanding IT infrastructure, and increasing government initiatives to promote AI adoption. Meanwhile, Latin America and the Middle East & Africa are also showing promising growth potential, supported by rising awareness of AI benefits and growing investments in digital transformation projects across various sectors.



    Component Analysis



    The component segment of the AI in Semi-supervised Learning market is divided into software, hardware, and services, each playing a pivotal role in the adoption and implementation of semi-s

  7. 3D Microvascular Image Data and Labels for Machine Learning

    • datasetcatalog.nlm.nih.gov
    • rdr.ucl.ac.uk
    Updated Apr 30, 2024
    Cite
    Brown, Emmeline; Pinol, Carles Bosch; Brown, Emma; Walker-Samuel, Simon; Zhang, Yuxin; Holroyd, Natalie; Walsh, Claire (2024). 3D Microvascular Image Data and Labels for Machine Learning [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001419415
    Dataset updated
    Apr 30, 2024
    Authors
    Brown, Emmeline; Pinol, Carles Bosch; Brown, Emma; Walker-Samuel, Simon; Zhang, Yuxin; Holroyd, Natalie; Walsh, Claire
    Description

    These images and associated binary labels were collected from collaborators across multiple universities to serve as a diverse representation of biomedical images of vessel structures, for use in the training and validation of machine learning tools for vessel segmentation. The dataset contains images from a variety of imaging modalities, at different resolutions, using different sources of contrast and featuring different organs and pathologies. This data was used to train, test and validate a foundational model for 3D vessel segmentation, tUbeNet, which can be found on GitHub. The paper describing the training and validation of the model can be found here.

    Filenames are structured as follows:

    Data - [Modality]_[species Organ]_[resolution].tif

    Labels - [Modality]_[species Organ]_[resolution]_labels.tif

    Sub-volumes of larger dataset - [Modality]_[species Organ]_subvolume[dimensions in pixels].tif

    Manual labelling of blood vessels was carried out using Amira (2020.2, Thermo-Fisher, UK).

    Training data:

    opticalHREM_murineLiver_2.26x2.26x1.75um.tif: A high resolution episcopic microscopy (HREM) dataset, acquired in house by staining a healthy mouse liver with Eosin B and imaged using a standard HREM protocol. NB: 25% of this image volume was withheld from training, for use as test data.

    CT_murineTumour_20x20x20um.tif: X-ray microCT images of a microvascular cast, taken from a subcutaneous mouse model of colorectal cancer (acquired in house). NB: 25% of this image volume was withheld from training, for use as test data.

    RSOM_murineTumour_20x20um.tif: Raster-Scanning Optoacoustic Mesoscopy (RSOM) data from a subcutaneous tumour model (provided by Emma Brown, Bohndiek Group, University of Cambridge). The image data has undergone filtering to reduce the background (Brown et al., 2019).

    OCTA_humanRetina_24x24um.tif: retinal angiography data obtained using Optical Coherence Tomography Angiography (OCT-A) (provided by Dr Ranjan Rajendram, Moorfields Eye Hospital).

    Test data:

    MRI_porcineLiver_0.9x0.9x5mm.tif: T1-weighted Balanced Turbo Field Echo Magnetic Resonance Imaging (MRI) data from a machine-perfused porcine liver, acquired in house.

    MFHREM_murineTumourLectin_2.76x2.76x2.61um.tif: a subcutaneous colorectal tumour mouse model was imaged in house using multi-fluorescence HREM, with Dylight 647 conjugated lectin staining the vasculature (Walsh et al., 2021). The image data has been processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 480x480x640 voxels was manually labelled (MFHREM_murineTumourLectin_subvolume480x480x640.tif).

    MFHREM_murineBrainLectin_0.85x0.85x0.86um.tif: an MF-HREM image of the cortex of a mouse brain, stained with Dylight-647 conjugated lectin, was acquired in house (Walsh et al., 2021). The image data has been downsampled and processed using an asymmetric deconvolution algorithm described by Walsh et al., 2020. NB: A sub-volume of 1000x1000x99 voxels was manually labelled. This sub-volume is provided at full resolution and without preprocessing (MFHREM_murineBrainLectin_subvol_0.57x0.57x0.86um.tif).

    2Photon_murineOlfactoryBulbLectin_0.2x0.46x5.2um.tif: two-photon data of mouse olfactory bulb blood vessels, labelled with sulforhodamine 101, was kindly provided by Yuxin Zhang at the Sensory Circuits and Neurotechnology Lab, the Francis Crick Institute (Bosch et al., 2022). NB: A sub-volume of 500x500x79 voxels was manually labelled (2Photon_murineOlfactoryBulbLectin_subvolume500x500x79.tif).

    References:

    Bosch, C., Ackels, T., Pacureanu, A., Zhang, Y., Peddie, C. J., Berning, M., Rzepka, N., Zdora, M. C., Whiteley, I., Storm, M., Bonnin, A., Rau, C., Margrie, T., Collinson, L., & Schaefer, A. T. (2022). Functional and multiscale 3D structural investigation of brain tissue through correlative in vivo physiology, synchrotron microtomography and volume electron microscopy. Nature Communications, 13(1), 1–16. https://doi.org/10.1038/s41467-022-30199-6

    Brown, E., Brunker, J., & Bohndiek, S. E. (2019). Photoacoustic imaging as a tool to probe the tumour microenvironment. DMM Disease Models and Mechanisms, 12(7). https://doi.org/10.1242/DMM.039636

    Walsh, C., Holroyd, N. A., Finnerty, E., Ryan, S. G., Sweeney, P. W., Shipley, R. J., & Walker-Samuel, S. (2021). Multifluorescence High-Resolution Episcopic Microscopy for 3D Imaging of Adult Murine Organs. Advanced Photonics Research, 2(10), 2100110. https://doi.org/10.1002/ADPR.202100110

    Walsh, C., Holroyd, N., Shipley, R., & Walker-Samuel, S. (2020). Asymmetric Point Spread Function Estimation and Deconvolution for Serial-Sectioning Block-Face Imaging. Communications in Computer and Information Science, 1248 CCIS, 235–249. https://doi.org/10.1007/978-3-030-52791-4_19
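
    The filename convention described in this entry lends itself to a small parser. The sketch below is an illustration under stated assumptions: the function name and the returned keys are hypothetical, the labels filename in the usage lines is constructed from the stated pattern rather than copied from the dataset, and the parser simply splits on underscores.

```python
from pathlib import PurePath


def parse_name(filename):
    """Split a dataset filename into modality, subject, and trailing detail."""
    stem = PurePath(filename).stem          # drop the final '.tif'
    parts = stem.split("_")
    is_label = parts[-1] == "labels"        # label volumes end in '_labels'
    if is_label:
        parts = parts[:-1]
    detail = "_".join(parts[2:])            # resolution or sub-volume descriptor
    return {
        "modality": parts[0],
        "subject": parts[1],                # species and organ, e.g. 'murineLiver'
        "detail": detail,
        "is_label": is_label,
        "is_subvolume": detail.startswith("subvol"),
    }


image = parse_name("opticalHREM_murineLiver_2.26x2.26x1.75um.tif")
labels = parse_name("CT_murineTumour_20x20x20um_labels.tif")  # built from the stated pattern
subvolume = parse_name("MFHREM_murineTumourLectin_subvolume480x480x640.tif")
```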

  8. Data from: Digit Recognizer

    • kaggle.com
    zip
    Updated Jun 3, 2025
    Cite
    Anna Furgała-Wojas (2025). Digit Recognizer [Dataset]. https://www.kaggle.com/datasets/annafurgaawojas/digit-recognizer
    Available download formats: zip (16054568 bytes)
    Dataset updated
    Jun 3, 2025
    Authors
    Anna Furgała-Wojas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Digit Recognizer dataset is a classic benchmark used in the field of computer vision and machine learning for handwritten digit classification. It is derived from the well-known MNIST dataset, which consists of grayscale images of handwritten digits (0 through 9). Each image is a 28x28 pixel square, flattened into a 784-dimensional vector, and labeled with the corresponding digit.
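
    The relationship between the 784-dimensional vector and the 28x28 image can be shown with a short reshape. The sketch assumes row-major ordering of the flattened pixels, which the description does not state, and uses a synthetic vector rather than real data.

```python
def to_image(pixels, side=28):
    """Reshape a flat pixel list into `side` rows of `side` values (row-major, assumed)."""
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(pixels)}")
    return [pixels[row * side:(row + 1) * side] for row in range(side)]


# Synthetic vector standing in for one flattened digit.
flat = list(range(784))
image = to_image(flat)

# Pixel (row, col) in the image corresponds to index row * 28 + col in the vector.
row, col = 3, 5
same_value = image[row][col] == flat[row * 28 + col]
```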

    This dataset serves as an excellent starting point for building and evaluating classification algorithms, particularly convolutional neural networks (CNNs). Due to its relatively small size and well-structured format, it allows for rapid experimentation and prototyping of models. Furthermore, the consistent quality of the images and the clear labeling make it a standard benchmark for comparing different machine learning approaches.

    In this analysis, we will explore the dataset’s structure, perform data preprocessing, visualize example digits, and apply various machine learning models to evaluate their accuracy in classifying handwritten digits.

  9. Data used in Machine learning reveals the waggle drift's role in the honey...

    • data-staging.niaid.nih.gov
    • zenodo.org
    Updated May 18, 2023
    Cite
    Dormagen, David M; Wild, Benjamin; Wario, Fernando; Landgraf, Tim (2023). Data used in Machine learning reveals the waggle drift's role in the honey bee dance communication system [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7928120
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    Universidad de Guadalajara
    Freie Universität Berlin
    Authors
    Dormagen, David M; Wild, Benjamin; Wario, Fernando; Landgraf, Tim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data and metadata used in "Machine learning reveals the waggle drift’s role in the honey bee dance communication system"

    All timestamps are given in ISO 8601 format.

    The following files are included:

    Berlin2019_waggle_phases.csv, Berlin2021_waggle_phases.csv

    Automatic individual detections of waggle phases during our recording periods in 2019 and 2021.

    timestamp: Date and time of the detection.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    x_median, y_median: Median position of the bee during the waggle phase (for 2019 given in millimeters after applying a homography, for 2021 in the original image coordinates).

    waggle_angle: Body orientation of the bee during the waggle phase in radians (0: oriented to the right, PI / 4: oriented upwards).

    Berlin2019_dances.csv

    Automatic detections of dance behavior during our recording period in 2019.

    dancer_id: Unique ID of the individual bee.

    dance_id: Unique ID of the dance.

    ts_from, ts_to: Date and time of the beginning and end of the dance.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    median_x, median_y: Median position of the individual during the dance.

    feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    Berlin2019_followers.csv

    Automatic detections of attendance and following behavior, corresponding to the dances in Berlin2019_dances.csv.

    dance_id: Unique ID of the dance being attended or followed.

    follower_id: Unique ID of the individual attending or following the dance.

    ts_from, ts_to: Date and time of the beginning and end of the interaction.

    label: “attendance” or “follower”

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).
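
    Since Berlin2019_followers.csv refers back to Berlin2019_dances.csv through dance_id, the two tables can be joined on that column. The sketch below uses only the standard library and tiny made-up rows; the ID values and the exact timestamp formatting are assumptions, and only a subset of the documented columns is shown.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows using the documented column names (subset of columns).
DANCES_CSV = """dancer_id,dance_id,ts_from,ts_to,cam_id
17,100,2019-08-01T10:00:00+00:00,2019-08-01T10:00:12+00:00,0
23,101,2019-08-01T10:05:00+00:00,2019-08-01T10:05:08+00:00,1
"""
FOLLOWERS_CSV = """dance_id,follower_id,ts_from,ts_to,label,cam_id
100,42,2019-08-01T10:00:02+00:00,2019-08-01T10:00:09+00:00,follower,0
100,43,2019-08-01T10:00:04+00:00,2019-08-01T10:00:06+00:00,attendance,0
"""


def followers_by_dance(dances_file, followers_file):
    """Map every dance_id to the (follower_id, label) pairs recorded for it."""
    dances = {row["dance_id"]: row for row in csv.DictReader(dances_file)}
    grouped = defaultdict(list)
    for row in csv.DictReader(followers_file):
        if row["dance_id"] in dances:  # ignore interactions without a known dance
            grouped[row["dance_id"]].append((row["follower_id"], row["label"]))
    # Dances without any follower or attendee map to an empty list.
    return {dance_id: grouped.get(dance_id, []) for dance_id in dances}


result = followers_by_dance(io.StringIO(DANCES_CSV), io.StringIO(FOLLOWERS_CSV))
```

    For the real files, the `io.StringIO` objects would be replaced by opened file handles.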

    Berlin2019_dances_with_manually_verified_times.csv

    A sample of dances from Berlin2019_dances.csv where the exact timestamps have been manually verified to correspond to the beginning of the first and last waggle phase down to a precision of ca. 166 ms (video material was recorded at 6 FPS).

    dance_id: Unique ID of the dance.

    dancer_id: Unique ID of the dancing individual.

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    feeder_cam_id: ID of the feeder that the bee was detected at prior to the dance.

    dance_start, dance_end: Manually verified date and times of the beginning and end of the dance.

    Berlin2019_dance_classifier_labels.csv

    Manually annotated waggle phases or following behavior for our recording season in 2019 that were used to train the dancing and following classifier. The labels can be merged with the supplied individual detections.

    timestamp: Timestamp of the individual frame the behavior was observed in.

    frame_id: Unique ID of the video frame the behavior was observed in.

    bee_id: Unique ID of the individual bee.

    label: One of “nothing”, “waggle”, “follower”

    Berlin2019_dance_classifier_unlabeled.csv

    Additional unlabeled samples of timestamp and individual ID with the same format as Berlin2019_dance_classifier_labels.csv, but without a label. The data points have been sampled close to detections of our waggle phase classifier, so behaviors related to the waggle dance are likely overrepresented in that sample.

    Berlin2021_waggle_phase_classifier_labels.csv

    Manually annotated detections of our waggle phase detector (bb_wdd2) that were used to train the neural network filter (bb_wdd_filter) for the 2021 data.

    detection_id: Unique ID of the waggle phase.

    label: One of “waggle”, “activating”, “ventilating”, “trembling”, “other”. Here “waggle” denotes a waggle phase, “activating” is the shaking signal, and “ventilating” is a bee fanning her wings. “trembling” denotes a tremble dance, but the distinction from the “other” class was often not clear, so “trembling” was merged into “other” for training.

    orientation: The body orientation of the bee that triggered the detection in radians (0: facing to the right, PI /4: facing up).

    metadata_path: Path to the individual detection in the same directory structure as created by the waggle dance detector.

    Berlin2021_waggle_phase_classifier_ground_truth.zip

    The output of the waggle dance detector (bb_wdd2) that corresponds to Berlin2021_waggle_phase_classifier_labels.csv and is used for training. The archive includes a directory structure as output by the bb_wdd2 and each directory includes the original image sequence that triggered the detection in an archive and the corresponding metadata. The training code supplied in bb_wdd_filter directly works with this directory structure.

    Berlin2019_tracks.zip

    Detections and tracks from the recording season in 2019 as produced by our tracking system. As the full data is several terabytes in size, we include the subset of our data here that is relevant for our publication which comprises over 46 million detections. We included tracks for all detected behaviors (dancing, following, attending) including one minute before and after the behavior. We also included all tracks that correspond to the labeled and unlabeled data that was used to train the dance classifier including 30 seconds before and after the data used for training. We grouped the exported data by date to make the handling easier, but to efficiently work with the data, we recommend importing it into an indexable database.

    The individual files contain the following columns:

    cam_id: Camera ID (0: left side of the hive, 1: right side of the hive).

    timestamp: Date and time of the detection.

    frame_id: Unique ID of the video frame of the recording from which the detection was extracted.

    track_id: Unique ID of an individual track (short motion path from one individual). For longer tracks, the detections can be linked based on the bee_id.

    bee_id: Unique ID of the individual bee.

    bee_id_confidence: Confidence between 0 and 1 that the bee_id is correct as output by our tracking system.

    x_pos_hive, y_pos_hive: Spatial position of the bee in the hive on the side indicated by cam_id. Given in millimeters after applying a homography on the video material.

    orientation_hive: Orientation of the bees’ thorax in the hive in radians (0: oriented to the right, PI / 4: oriented upwards).
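
    The recommendation above to import the tracks into an indexable database could be followed along these lines. This is a minimal sketch using Python's built-in sqlite3 module, not the authors' setup: the table name, index name, column types, and sample rows are all assumptions, and only the column names come from the list above.

```python
import sqlite3

# Made-up rows in the documented column order; real exports hold millions of rows.
SAMPLE_ROWS = [
    (0, "2019-08-01T10:00:00.000000+00:00", 9001, 500, 42, 0.98, 120.5, 80.2, 0.10),
    (0, "2019-08-01T10:00:00.166667+00:00", 9002, 500, 42, 0.97, 121.0, 80.9, 0.12),
    (1, "2019-08-01T10:00:00.000000+00:00", 9101, 501, 43, 0.65, 200.0, 40.0, 1.50),
]


def build_database(rows):
    """Load detections into SQLite and index them by bee and time."""
    conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
    conn.execute(
        """CREATE TABLE detections (
               cam_id INTEGER, timestamp TEXT, frame_id INTEGER, track_id INTEGER,
               bee_id INTEGER, bee_id_confidence REAL,
               x_pos_hive REAL, y_pos_hive REAL, orientation_hive REAL)"""
    )
    conn.executemany("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_bee_time ON detections (bee_id, timestamp)")
    conn.commit()
    return conn


def detections_for_bee(conn, bee_id, start, end):
    """Return (timestamp, x, y) for one bee within a time window, oldest first.

    Timestamps are compared as text, which is only valid if every timestamp
    uses the same ISO 8601 layout and time zone (an assumption here).
    """
    query = (
        "SELECT timestamp, x_pos_hive, y_pos_hive FROM detections "
        "WHERE bee_id = ? AND timestamp BETWEEN ? AND ? ORDER BY timestamp"
    )
    return conn.execute(query, (bee_id, start, end)).fetchall()


conn = build_database(SAMPLE_ROWS)
path = detections_for_bee(conn, 42, "2019-08-01T10:00:00", "2019-08-01T10:00:01")
```

    The composite index on (bee_id, timestamp) is one plausible choice for linking short tracks of the same individual over time; other access patterns would call for other indexes.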

    Berlin2019_feeder_experiment_log.csv

    Experiment log for our feeder experiments in 2019.

    date: Date given in the format year-month-day.

    feeder_cam_id: Numeric ID of the feeder.

    coordinates: Longitude and latitude of the feeder. For feeders 1 and 2 this is only given once and held constant. Feeder 3 had varying locations.

    time_opened, time_closed: Date and time when the feeder was set up or closed again.

    sucrose_solution: Concentration of the sucrose solution given as sugar:water (in terms of weight). On days where feeder 3 was open, the other two feeders offered water without sugar.

    Software used to acquire and analyze the data:

    bb_pipeline: Tag localization and decoding pipeline

    bb_pipeline_models: Pretrained localizer and decoder models for bb_pipeline

    bb_binary: Raw detection data storage format

    bb_irflash: IR flash system schematics and arduino code

    bb_imgacquisition: Recording and network storage

    bb_behavior: Database interaction and data (pre)processing, feature extraction

    bb_tracking: Tracking of bee detections over time

    bb_wdd2: Automatic detection and decoding of honey bee waggle dances

    bb_wdd_filter: Machine learning model to improve the accuracy of the waggle dance detector

    bb_dance_networks: Detection of dancing and following behavior from trajectories

  10. 🔢🖊️ Digital Recognition: MNIST Dataset

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Wasiq Ali (2025). 🔢🖊️ Digital Recognition: MNIST Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/digital-mnist-dataset
    Explore at:
    Available download formats: zip (2278207 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Wasiq Ali
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Handwritten Digits Pixel Dataset - Documentation

    Overview

    The Handwritten Digits Pixel Dataset is a collection of numerical data representing handwritten digits from 0 to 9. Unlike image datasets that store actual image files, this dataset contains pixel intensity values arranged in a structured tabular format, making it ideal for machine learning and data analysis applications.

    Dataset Description

    Basic Information

    • Format: CSV (Comma-Separated Values)
    • Total Samples: [Number of rows based on your dataset]
    • Features: 784 pixel columns (28×28 pixels) + 1 label column
    • Label Range: Digits 0-9
    • Pixel Value Range: 0-255 (grayscale intensity)

    File Structure

    Column Description

    • label: The target variable representing the digit (0-9)
    • pixel columns: 784 columns named in the format [row]x[column]
    • Each pixel column contains integer values from 0-255 representing grayscale intensity

    Data Characteristics

    Label Distribution

    The dataset contains handwritten digit samples with the following distribution:

    • Digit 0: [X] samples
    • Digit 1: [X] samples
    • Digit 2: [X] samples
    • Digit 3: [X] samples
    • Digit 4: [X] samples
    • Digit 5: [X] samples
    • Digit 6: [X] samples
    • Digit 7: [X] samples
    • Digit 8: [X] samples
    • Digit 9: [X] samples

    (Note: Actual distribution counts would be calculated from your specific dataset)

    Data Quality

    • Missing Values: No missing values detected
    • Data Type: All values are integers
    • Normalization: Pixel values range from 0-255 (can be normalized to 0-1 for ML models)
    • Consistency: Uniform 28×28 grid structure across all samples

    Technical Specifications

    Data Preprocessing Requirements

    • Normalization: Scale pixel values from 0-255 to 0-1 range
    • Reshaping: Convert 1D pixel arrays to 2D 28×28 matrices for visualization
    • Train-Test Split: Recommended 80-20 or 70-30 split for model development
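
    The train-test split recommended above can be sketched without any third-party library. The function name, its parameters, and the toy data are hypothetical; libraries such as scikit-learn provide an equivalent, more featureful utility.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle indices reproducibly, then reserve test_fraction of them for testing."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up data: 10 "images" of 784 identical pixel values, labelled 0-9.
pixels = [[(i * 25) % 256] * 784 for i in range(10)]
digits = list(range(10))
X_train, X_test, y_train, y_test = train_test_split(pixels, digits, test_fraction=0.2)
```

    With test_fraction=0.2 this yields the 80-20 split; 0.3 gives the 70-30 variant.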

    Recommended Machine Learning Approaches

    Classification Algorithms:

    • Random Forest
    • Support Vector Machines (SVM)
    • Neural Networks
    • K-Nearest Neighbors (KNN)

    Deep Learning Architectures:

    • Convolutional Neural Networks (CNNs)
    • Multi-layer Perceptrons (MLPs)

    Dimensionality Reduction:

    • PCA (Principal Component Analysis)
    • t-SNE for visualization

    Usage Examples

    Loading the Dataset

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('/kaggle/input/handwritten_digits_pixel_dataset/mnist.csv')
    
    # Separate features and labels
    X = df.drop('label', axis=1)
    y = df['label']
    
    # Normalize pixel values
    X_normalized = X / 255.0
    
  11. MAAD : Multi-Label Arabic Articles Dataset

    • data.mendeley.com
    Updated Oct 27, 2025
    Cite
    Marwah Yahya Al-Nahari (2025). MAAD : Multi-Label Arabic Articles Dataset [Dataset]. http://doi.org/10.17632/hbfc9j8hj8.2
    Explore at:
    Dataset updated
    Oct 27, 2025
    Authors
    Marwah Yahya Al-Nahari
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The MAAD dataset is a collection of Arabic news articles suitable for a range of Arabic Natural Language Processing (NLP) tasks, including classification, text generation, and summarization. It was assembled with purpose-built Python scripts targeting six prominent news platforms (Al Jazeera, BBC Arabic, Youm7, Russia Today, and Al Ummah, together with regional and local media outlets), yielding 602,792 articles. The dataset contains 29,371,439 words in total, of which 296,518 are unique; the average word length is 6.36 characters and the mean article length is 736.09 characters. Articles fall into ten categories: Political, Economic, Cultural, Arts, Sports, Health, Technology, Community, Incidents, and Local. Each record has five fields: Title, Article, Summary, Category, and Published_Date. The dataset is structured into six files, each named after the news outlet from which the data was sourced; within each directory, text files are provided, containing the number of categories represented in a single file, formatted in txt to accommodate all news articles. The dataset is intended as a large standard resource for use in our research.

  12. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
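
    Resolving the numeric semantic-type id against the mapping table amounts to a simple lookup join. In the sketch below, every column name and every row is hypothetical, since the description does not list the CSV headers; check the real files before adapting it.

```python
import csv
import io

# Made-up rows and HYPOTHETICAL column names; the real headers may differ.
MARKUP_CSV = """code_block_id,semantic_type_id,relevance
1001,3,5
1002,8,4
1003,3,2
"""
MAPPING_CSV = """semantic_type_id,semantic_type,subclass
3,example_type_a,example_subclass_a
8,example_type_b,example_subclass_b
"""


def attach_semantic_types(markup_file, mapping_file):
    """Add the semantic type and subclass names to each marked-up snippet row."""
    mapping = {
        row["semantic_type_id"]: (row["semantic_type"], row["subclass"])
        for row in csv.DictReader(mapping_file)
    }
    labeled = []
    for row in csv.DictReader(markup_file):
        # Unknown ids are kept, with None in place of the missing names.
        semantic_type, subclass = mapping.get(row["semantic_type_id"], (None, None))
        labeled.append({**row, "semantic_type": semantic_type, "subclass": subclass})
    return labeled


labeled = attach_semantic_types(io.StringIO(MARKUP_CSV), io.StringIO(MAPPING_CSV))
```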

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  13. Generative AI In Data Labeling Solution And Services Market Analysis, Size,...

    • technavio.com
    pdf
    Updated Oct 9, 2025
    Cite
    Technavio (2025). Generative AI In Data Labeling Solution And Services Market Analysis, Size, and Forecast 2025-2029 : North America (US, Canada, and Mexico), APAC (China, India, South Korea, Japan, Australia, and Indonesia), Europe (Germany, UK, France, Italy, The Netherlands, and Spain), South America (Brazil, Argentina, and Colombia), Middle East and Africa (South Africa, UAE, and Turkey), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/generative-ai-in-data-labeling-solution-and-services-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Oct 9, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    Canada, United States
    Description

    Generative AI In Data Labeling Solution And Services Market Size 2025-2029

    The generative AI in data labeling solution and services market size is forecast to increase by USD 31.7 billion, at a CAGR of 24.2% between 2024 and 2029.

    The global generative AI in data labeling solution and services market is shaped by the escalating demand for high-quality, large-scale datasets. Traditional manual data labeling methods create a significant bottleneck in the AI development lifecycle, which is addressed by the proliferation of synthetic data generation for robust model training. This strategic shift allows organizations to create limitless volumes of perfectly labeled data on demand, covering a comprehensive spectrum of scenarios. This capability is particularly transformative for generative AI in automotive applications and in the development of data labeling and annotation tools, enabling more resilient and accurate systems.

    However, a paramount challenge confronting the market is ensuring accuracy, quality control, and mitigation of inherent model bias. Generative models can produce plausible but incorrect labels, a phenomenon known as hallucination, which can introduce systemic errors into training datasets. This makes AI in data quality a critical concern, necessitating robust human-in-the-loop verification processes to maintain the integrity of generative AI in healthcare data. The market's long-term viability depends on developing sophisticated frameworks for bias detection and creating reliable generative artificial intelligence (AI) that can be trusted for foundational tasks.

    What will be the Size of the Generative AI In Data Labeling Solution And Services Market during the forecast period?


    The global generative AI in data labeling solution and services market is witnessing a transformation driven by advancements in generative adversarial networks and diffusion models. These techniques are central to synthetic data generation, augmenting AI model training data and redefining the machine learning pipeline. This evolution supports a move toward more sophisticated data-centric AI workflows, which integrate automated data labeling with human-in-the-loop annotation for enhanced accuracy. The scope of application is broadening from simple text-based data annotation to complex image-based data annotation and audio-based data annotation, creating a demand for robust multimodal data labeling capabilities. This shift across the AI development lifecycle is significant, with projections indicating a 35% rise in the use of AI-assisted labeling for specialized computer vision systems.

    Building upon this foundation, the focus intensifies on annotation quality control and AI-powered quality assurance within modern data annotation platforms. Methods like zero-shot learning and few-shot learning are becoming more viable, reducing dependency on massive datasets. The process of foundation model fine-tuning is increasingly guided by reinforcement learning from human feedback, ensuring outputs align with specific operational needs. Key considerations such as model bias mitigation and data privacy compliance are being addressed through AI-assisted labeling and semi-supervised learning. This impacts diverse sectors, from medical imaging analysis and predictive maintenance models to securing network traffic patterns against cybersecurity threat signatures and improving autonomous vehicle sensors for robotics training simulation and smart city solutions.

    How is this Generative AI In Data Labeling Solution And Services Market segmented?

    The generative AI in data labeling solution and services market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, for the following segments.

    • End-user: IT data, Healthcare, Retail, Financial services, Others
    • Type: Semi-supervised, Automatic, Manual
    • Product: Image or video based, Text based, Audio based
    • Geography:
      • North America: US, Canada, Mexico
      • APAC: China, India, South Korea, Japan, Australia, Indonesia
      • Europe: Germany, UK, France, Italy, The Netherlands, Spain
      • South America: Brazil, Argentina, Colombia
      • Middle East and Africa: South Africa, UAE, Turkey
      • Rest of World (ROW)

    By End-user Insights

    The IT data segment is estimated to witness significant growth during the forecast period.

    In the IT data segment, generative AI is transforming the creation of training data for software development, cybersecurity, and network management. It addresses the need for realistic, non-sensitive data at scale by producing synthetic code, structured log files, and diverse threat signatures. This is crucial for training AI-powered developer tools and intrusion detection systems. With South America representing an 8.1% market opportunity, the demand for localized and specia

  14. Structured Product Label Management Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Aug 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archive Market Research (2025). Structured Product Label Management Report [Dataset]. https://www.archivemarketresearch.com/reports/structured-product-label-management-557778
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Aug 21, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Structured Product Label Management market is experiencing robust growth, driven by increasing regulatory complexities and the need for efficient, accurate product labeling across diverse industries. Our analysis projects a market size of $2.5 billion in 2025, expanding at a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This significant growth reflects the rising demand for streamlined label management systems capable of handling global regulatory variations and complex product information. Key drivers include the escalating costs of non-compliance, a growing emphasis on data integrity and traceability throughout the supply chain, and the increasing adoption of digital labeling solutions. The market is segmented by industry (e.g., food & beverage, pharmaceutical, cosmetic), label type (e.g., primary, secondary), and deployment model (cloud-based, on-premise). Leading companies like I4i, Inc., Intagras, Inc., Dakota Systems, Inc., RKE Holdings, LLC., and Spectra Soft are actively shaping this evolving landscape through innovative solutions and strategic partnerships. This market growth is further fueled by advancements in technology, such as Artificial Intelligence (AI) and machine learning, which are being integrated into label management systems to improve accuracy, automate processes, and enhance compliance. However, the market also faces certain restraints, including the high initial investment costs associated with implementing new systems, the complexities of integrating various data sources, and the need for continuous software updates to keep pace with evolving regulations. Despite these challenges, the long-term prospects for the Structured Product Label Management market remain highly positive, with significant opportunities for growth in emerging markets and expanding applications across diverse sectors. The continued focus on data-driven decision-making and consumer safety will be key factors driving future market expansion.
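
    As a worked illustration of what the quoted growth rate implies, the sketch below compounds the 2025 figure forward at the stated CAGR. It assumes the standard compound-growth formula, value = base × (1 + rate)^years, applied over the eight years from 2025 to 2033; the function name is hypothetical and the result is an implied figure, not one stated in the report.

```python
def project_market_size(base_value, cagr, years):
    """Compound a base value forward: base_value * (1 + cagr) ** years."""
    return base_value * (1.0 + cagr) ** years


# Figures quoted above: USD 2.5 billion in 2025, 15% CAGR, 2025 to 2033.
size_2033 = project_market_size(2.5, 0.15, 2033 - 2025)
print(f"Implied 2033 market size: USD {size_2033:.2f} billion")
```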

  15. Human Faces and Objects Mix Image Dataset

    • data.mendeley.com
    Updated Mar 13, 2025
    Cite
    Bindu Garg (2025). Human Faces and Objects Mix Image Dataset [Dataset]. http://doi.org/10.17632/nzwvnrmwp3.1
    Explore at:
    Dataset updated
    Mar 13, 2025
    Authors
    Bindu Garg
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Dataset Description: Human Faces and Objects Dataset (HFO-5000)

    The Human Faces and Objects Dataset (HFO-5000) is a curated collection of 5,000 images, categorized into three distinct classes: male faces (1,500), female faces (1,500), and objects (2,000). This dataset is designed for machine learning and computer vision applications, including image classification, face detection, and object recognition. The dataset provides high-quality, labeled images with a structured CSV file for seamless integration into deep learning pipelines.

    Column Description: The dataset is accompanied by a CSV file that contains essential metadata for each image. The CSV file includes the following columns:

    • file_name: The name of the image file (e.g., image_001.jpg).
    • label: The category of the image, with three possible values: "male" (for male face images), "female" (for female face images), "object" (for images of various objects).
    • file_path: The full or relative path to the image file within the dataset directory.
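
    A quick sanity check on such a metadata file is to count the images per label. The sketch below reads an in-memory stand-in for the CSV; the rows and the `images/` paths are made up, and only the three column names and label values come from the description.

```python
import csv
import io
from collections import Counter

# Made-up rows using the documented columns: file_name, label, file_path.
METADATA_CSV = """file_name,label,file_path
image_001.jpg,male,images/image_001.jpg
image_002.jpg,female,images/image_002.jpg
image_003.jpg,object,images/image_003.jpg
image_004.jpg,object,images/image_004.jpg
"""
ALLOWED_LABELS = {"male", "female", "object"}


def label_counts(csv_file):
    """Count images per label, rejecting any label outside the documented set."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["label"] not in ALLOWED_LABELS:
            raise ValueError(f"unexpected label {row['label']!r} for {row['file_name']}")
        counts[row["label"]] += 1
    return counts


counts = label_counts(io.StringIO(METADATA_CSV))
```

    Run against the full file, the counts should match the class sizes given in the description (1,500 male, 1,500 female, 2,000 object).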

    Uniqueness and Key Features:

    1) Balanced Distribution: The dataset maintains an even distribution of human faces (male and female) to minimize bias in classification tasks.
    2) Diverse Object Selection: The object category consists of a wide variety of items, ensuring robustness in distinguishing between human and non-human entities.
    3) High-Quality Images: The dataset consists of clear and well-defined images, suitable for both training and testing AI models.
    4) Structured Annotations: The CSV file simplifies dataset management and integration into machine learning workflows.
    5) Potential Use Cases: This dataset can be used for tasks such as gender classification, facial recognition benchmarking, human-object differentiation, and transfer learning applications.

    Conclusion: The HFO-5000 dataset provides a well-structured, diverse, and high-quality set of labeled images that can be used for various computer vision tasks. Its balanced distribution of human faces and objects ensures fairness in training AI models, making it a valuable resource for researchers and developers. By offering structured metadata and a wide range of images, this dataset facilitates advancements in deep learning applications related to facial recognition and object classification.

  16. Learning Privacy from Visual Entities - Curated data sets and pre-computed...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro (2025). Learning Privacy from Visual Entities - Curated data sets and pre-computed visual entities [Dataset]. http://doi.org/10.5281/zenodo.15348506
    Explore at:
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alessio Xompero; Alessio Xompero; Andrea Cavallaro; Andrea Cavallaro
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description
    This repository contains the curated image privacy datasets and pre-computed visual entities used in the publication Learning Privacy from Visual Entities by A. Xompero and A. Cavallaro.

    Curated image privacy data sets

    In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.

    Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separated folders of images for each data split (training, validation, testing) and allows a flexible handling of new splits, e.g. created with a stratified K-Fold cross-validation procedure. As for the original datasets (PicAlert and PrivacyAlert), we provide the link to the images in bash scripts to download the images. Another bash script re-organises the images in sub-folders with maximum 1000 images in each folder.
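
    The re-organisation into sub-folders of at most 1000 images is done by the authors' bash script; the grouping logic itself can be sketched in Python as follows. The folder naming scheme (`batch0`, `batch1`, ...) and the image IDs are hypothetical and may not match what the script produces.

```python
def assign_batches(image_ids, batch_size=1000):
    """Group image IDs into consecutive sub-folders of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = {}
    for position, image_id in enumerate(image_ids):
        folder = f"batch{position // batch_size}"  # hypothetical folder name
        batches.setdefault(folder, []).append(image_id)
    return batches


# Made-up image IDs: 2500 images end up in three folders.
batches = assign_batches([f"img{n}" for n in range(2500)])
```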

    Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, private events. Images were annotated with a binary label denoting if the content was deemed to be public or private. As the images are publicly available, their label is mostly public. These datasets have therefore a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images already limited in PicAlert. Further details in our corresponding publication (https://doi.org/10.48550/arXiv.2503.12464).

    List of datasets and their original source:

    Notes:

    • For PicAlert and PrivacyAlert, only urls to the original locations in Flickr are available in the Zenodo record
    • Collector and authors of the PrivacyAlert dataset selected the images from Flickr under Public Domain license
    • Owners of the photos on Flickr may have removed them from the platform
    • Running the bash scripts to download the images can trigger the "429 Too Many Requests" status code
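
    One common way to cope with a "429 Too Many Requests" response is to retry with exponentially growing pauses. The sketch below is a Python illustration of that idea, not part of the authors' bash scripts; the function name, retry count, and delays are assumptions, and the `opener` and `sleep` parameters exist only so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_attempts=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Download a URL, retrying on HTTP 429 with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            # Give up on any other error, or once the attempts are used up.
            if error.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1 s, 2 s, 4 s, ...
```

    A more polite variant would also honour a Retry-After header when the server sends one.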

    Pre-computed visual entities

    Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models on these features without re-computing them on their own or at each epoch during the training of a model (faster training).

    For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.
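
To give a feel for working with detections stored in COCO data format, the sketch below counts detected objects per category for one image. It assumes the standard COCO annotation layout with top-level `images`, `categories`, and `annotations` keys; the provided files may instead use another COCO variant (such as a flat list of detection results), and the inline sample values are invented for illustration.

```python
import json
from collections import Counter

# Tiny inline example in the standard COCO annotation layout.
# All values are made up; the real files may be structured differently.
COCO_JSON = """{
  "images": [{"id": 1, "file_name": "example.jpg"}],
  "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 80]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [70, 25, 40, 90]},
    {"id": 12, "image_id": 1, "category_id": 3, "bbox": [5, 100, 120, 60]}
  ]
}"""


def count_objects(coco, image_id):
    """Count detections per category name for a single image."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(
        names[ann["category_id"]]
        for ann in coco["annotations"]
        if ann["image_id"] == image_id
    )


coco = json.loads(COCO_JSON)
counts = count_objects(coco, image_id=1)
```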

    Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.

    Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)

    Enquiries, questions and comments

    If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.

  17. Transcribed Slates

    • kaggle.com
    zip
    Updated Apr 15, 2025
    + more versions
    Cite
    Madison Courtney (2025). Transcribed Slates [Dataset]. https://www.kaggle.com/datasets/madisoncourtney/transcribed-slates
    Explore at:
    Available download formats: zip (8796498 bytes)
    Dataset updated
    Apr 15, 2025
    Authors
    Madison Courtney
    Description

    General Information

    This dataset was created for the training and testing of machine learning systems for extracting information from slates/on-screen or filmed text in video productions. The data associated with each instance was acquired by observing text on the slates in the file. There are two levels of data collected: a direct transcription and contextual information. For the direct transcription, if there was illegible text, an approximation was derived. The information is reported by the original creator of the slates and can be assumed to be accurate.

    The data was collected using software made specifically to categorize and transcribe metadata from these instances (see file directory description). The transcription was written in a natural reading order (for a western audience), so left to right and top to bottom. If the instance was labeled “Graphical”, then the reading order was also left to right and top to bottom within individual sections as well as across the work as a whole.

    This dataset was created by Madison Courtney, in collaboration with GBH Archives staff, and in consultation with researchers in the Brandeis University Department of Computer Science.

    Uniqueness and overlapping data

    Some of the slates come from different episodes of the same series; therefore, some slates have data overlap. For example, the “series-title” may be common across many slates. However, each slate instance in this dataset was labeled independently of the others. No information was removed, but not every slate contains the same information.

    Different “sub-types” of slates have different graphical features, and present unique challenges for interpretation. In general, sub-types H (Handwritten), G (Graphical), C (Clapperboard) are more complex than D (Simple digital text) and B (Slate over bars). Most instances in the dataset are D. Users may wish to restrict the set to only those with subtype D.

    Labels and annotations were created by an expert human judge. In Version 2, labels and annotations were created only once without any measure of inter-annotator agreement. In Version 3, all data were confirmed and/or edited by a second expert human judge. The dataset is self-contained, but more information about the assets from which these slates were taken can be found at the main website of the AAPB: https://www.americanarchive.org/

    Data size and structure

    The data is tabular. There are 7 columns and 503 rows. Each row represents a different labeled image. The image files themselves are included in the dataset directory. The columns are as follows:

    • 0: filename : The name of the image file for this slate
    • 1: seen : A boolean book-keeping field used during the annotation process
    • 2: type-label : The type of scene pictured in the image. All images in this set have type "S" signifying "Slate"
    • 3: subtype-label : The sub-type of scene pictured in the image. Possible subtypes are "H" (Handwritten), "C" (Clapperboard), "D" (Simple digital text), "B" (Slate over bars), "G" (Graphical).
    • 4: modifier : A boolean value indicating whether the slate was "transitional" in the sense that the still image was captured as the slate was fading in or out of view.
    • 5: note-3 : Verbatim transcription of the text appearing on the slate
    • 6: note-4 : Data in key-value structure indicating important data values presented on the slate. Possible keys are "program-title", "episode-title", "series-title", "title", "episode-no", "create-date", "air-date", "date", "director", "producer", "camera". Dates were normalized as YYYY-MM-DD. Names were normalized as Last, First Middle.

    Data format

    The directory contains the tabular data, the image files, and a small utility for viewing and/or editing labels. The Keystroke Labeler utility is a simple, serverless HTML-based viewer/editor. You can use the Keystroke Labeler by simply opening labeler.html in your web browser. The data are also provided serialized as JSON and CSV. The exact same label data appears redundantly in these 3 files:

    • img_arr_prog.js : the label data loaded by the Keystroke Labeler
    • img_labels.csv : the label data serialized as CSV
    • img_labels.json : the label data serialized as JSON
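
Since most instances have subtype D and users may wish to restrict the set to those, the sketch below shows one way to filter the CSV serialization with Python's standard `csv` module. It assumes that img_labels.csv starts with a header row using the column names listed above. The inline sample rows, including how booleans are written, are invented for illustration and do not reproduce real records.

```python
import csv
import io

# Inline stand-in for img_labels.csv. The header row and the value
# formats shown here are assumptions; the real file may differ.
SAMPLE = """filename,seen,type-label,subtype-label,modifier,note-3,note-4
slate_a.jpg,true,S,D,false,"EXAMPLE TITLE",""
slate_b.jpg,true,S,H,false,"handwritten text",""
slate_c.jpg,true,S,D,true,"ANOTHER TITLE",""
"""


def simple_digital_slates(csv_file):
    """Return only the rows whose subtype-label is "D" (Simple digital text)."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["subtype-label"] == "D"]


rows = simple_digital_slates(io.StringIO(SAMPLE))
filenames = [row["filename"] for row in rows]
```

To run this against the real file, the `io.StringIO(SAMPLE)` argument could be replaced with `open("img_labels.csv", newline="", encoding="utf-8")`, adjusting the column lookup if the file turns out to have no header row.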

    This dataset includes metadata about programs in the American Archive of Public Broadcasting. Any use of programs referenced by this dataset is subject to the terms of use set by the American Archive of Public Broadcasting.

  18. Telecom Data Labeling Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Telecom Data Labeling Market Research Report 2033 [Dataset]. https://dataintelo.com/report/telecom-data-labeling-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Telecom Data Labeling Market Outlook



    According to our latest research, the global Telecom Data Labeling market size reached USD 1.32 billion in 2024, demonstrating robust expansion driven by the rapid adoption of artificial intelligence and machine learning across the telecommunications sector. The market is expected to grow at a CAGR of 22.8% during the forecast period, with the market size forecasted to reach USD 9.98 billion by 2033. This exceptional growth trajectory is primarily attributed to the increasing need for high-quality, labeled data to train advanced AI models for network optimization, fraud detection, and customer experience management within telecom operations.




    One of the primary growth factors fueling the Telecom Data Labeling market is the exponential surge in data generated by telecom networks, devices, and users. With the proliferation of IoT devices, 5G rollouts, and the expansion of cloud-based telecom services, telecom operators are inundated with massive volumes of structured and unstructured data. To extract actionable insights and automate critical processes, these organizations are increasingly relying on labeled datasets to train and validate AI-driven algorithms. The demand for accurate and scalable data labeling solutions has thus skyrocketed, as telecom companies seek to enhance network efficiency, reduce operational costs, and deliver personalized services to their customers. Additionally, the integration of AI-powered analytics with telecom infrastructure further amplifies the necessity for precise data annotation, ensuring that predictive models and automation tools function with optimal accuracy.




    Another significant driver for the Telecom Data Labeling market is the intensifying focus on customer experience management and fraud detection. Telecom providers are leveraging AI and machine learning to proactively identify and mitigate fraudulent activities, optimize network performance, and deliver seamless user experiences. These applications demand large volumes of accurately labeled data, encompassing text, audio, image, and video formats, to train sophisticated algorithms capable of real-time decision-making. The growing complexity of telecom networks, coupled with the need for advanced analytics to interpret customer interactions and network anomalies, underscores the critical role of data labeling in achieving business objectives. As telecom operators invest heavily in digital transformation, the adoption of automated and semi-supervised labeling solutions is expected to accelerate, further propelling market growth.




    Furthermore, the emergence of regulatory frameworks and data privacy mandates across different regions has spurred telecom companies to adopt more robust data labeling practices. Compliance with international standards such as GDPR, CCPA, and other local data protection laws requires telecom operators to maintain high standards of data accuracy, transparency, and accountability. This regulatory landscape is prompting the adoption of advanced data labeling platforms that offer end-to-end traceability, auditability, and security. The integration of data labeling solutions with existing telecom workflows not only enhances regulatory compliance but also supports the deployment of ethical and bias-free AI models. As a result, the demand for secure, scalable, and customizable data labeling services continues to rise, positioning the market for sustained growth throughout the forecast period.




    From a regional perspective, Asia Pacific is emerging as a dominant force in the Telecom Data Labeling market, driven by rapid digitalization, large-scale 5G deployments, and the presence of leading telecom operators. North America and Europe also contribute significantly to market expansion, owing to advanced telecom infrastructure, high AI adoption rates, and a strong focus on innovation. Meanwhile, Latin America and the Middle East & Africa are witnessing increasing investments in telecom modernization and AI-driven solutions, albeit from a smaller base. This regional diversification not only underscores the global nature of the market but also highlights the varying adoption patterns and growth opportunities across different geographies.



    Data Type Analysis



    The Data Type segment in the Telecom Data Labeling market is categorized into text, image, audio, and video data. Among these, text data labeling holds a substantial share due to the extensive use of natural languag

  19. Classifications of auroral phenomena in THEMIS All-Sky images obtained via...

    • search.dataone.org
    • datadryad.org
    Updated Nov 29, 2024
    Cite
    Jeremiah Johnson; Dogacan Ozturk; Hyunju Connor; Donald Hampton; Matthew Blandin; Amy Keesee (2024). Classifications of auroral phenomena in THEMIS All-Sky images obtained via self-supervised learning [Dataset]. http://doi.org/10.5061/dryad.sbcc2frft
    Explore at:
    Dataset updated
    Nov 29, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jeremiah Johnson; Dogacan Ozturk; Hyunju Connor; Donald Hampton; Matthew Blandin; Amy Keesee
    Description

    We report a novel machine learning algorithm for automatically detecting and classifying aurora in all-sky images (ASI) that is largely trained without requiring ground-truth labels. By including a small number of labeled images, we are able to automatically label all of the approximately 700 million images in the Time History of Events and Macroscale Interactions during Substorms (THEMIS) ASI dataset from 2008 to 2022. We use a two-stage approach. In the first stage, we adapt the Simple framework for Contrastive Learning of Representations (SimCLR) algorithm to learn latent representations of THEMIS all-sky images. We then finetune a classifier network on the latent representations our model learns of the manually labeled Oslo aurora THEMIS (OATH) dataset. We demonstrate that this two-stage approach achieves excellent classification results on data for which there is no current ML classification benchmark. The outcome of this work will facilitate efficient information retrieval for re...

    We obtained these ASI classifications using a self-supervised machine learning model. The details of the model are described in the forthcoming paper in JGR: Machine Learning and Computation.

    Classifications of auroral phenomena in THEMIS All-Sky images obtained via self-supervised learning

    https://doi.org/10.5061/dryad.sbcc2frft

    Description of the data and file structure

    This dataset contains classifications of all THEMIS All-Sky Images (ASI) captured 2008-2022 into one of six categories: arc, diffuse, discrete, cloudy, moon, clear. For each image, a probability for each of the six categories is provided. The classifications were obtained using a self-supervised machine learning model accessible at the link below.
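
Because each image comes with one probability per category, a single label can be derived by taking the most probable category. The sketch below shows this for one record. The six category names come from the description above, while the dictionary layout and the probability values are assumptions for illustration, not the actual column format of the CSV files.

```python
# The six categories named in the dataset description.
CATEGORIES = ("arc", "diffuse", "discrete", "cloudy", "moon", "clear")


def most_likely_category(probabilities):
    """Return (label, probability) for the highest-probability category."""
    label = max(CATEGORIES, key=lambda name: probabilities[name])
    return label, probabilities[label]


# Made-up probabilities for a single image.
example = {
    "arc": 0.62,
    "diffuse": 0.20,
    "discrete": 0.10,
    "cloudy": 0.05,
    "moon": 0.02,
    "clear": 0.01,
}
label, probability = most_likely_category(example)
```

Keeping the probability alongside the label makes it possible to discard low-confidence classifications with a threshold of the user's choosing.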

    Files and variables

    File: themis-asi-predictions.zip

    Description: This compressed directory contains the classification data. The data is organized into subdirectories by date in the format YYYY-MM-DD. Each subdirectory contains all of the classification data for the corresponding date in compressed CSV files. Each compressed CSV file contains one hour's worth of data for one THEMIS A...

  20. Data-Centric AI Platforms Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). Data-Centric AI Platforms Market Research Report 2033 [Dataset]. https://researchintelo.com/report/data-centric-ai-platforms-market
    Explore at:
    Available download formats: pptx, csv, pdf
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    Data-Centric AI Platforms Market Outlook



    According to our latest research, the Global Data-Centric AI Platforms market size was valued at $4.3 billion in 2024 and is projected to reach $23.1 billion by 2033, expanding at a robust CAGR of 20.1% during the forecast period of 2024–2033. The primary driver behind this remarkable growth is the increasing need for high-quality, well-curated data to fuel artificial intelligence and machine learning applications across diverse industries. As organizations recognize that the quality of data is as critical as the sophistication of algorithms, there is a marked shift towards platforms that enable efficient data management, annotation, governance, and quality assurance. This paradigm shift is further accentuated by the rapid digital transformation initiatives, surging adoption of AI-driven analytics, and the proliferation of big data, all of which necessitate a robust foundation of reliable, labeled, and structured data for optimal AI outcomes.
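
A compound annual growth rate can be recomputed from two endpoint values with the standard formula (end / start) ** (1 / years) - 1. The sketch below applies it to the market sizes quoted above, assuming a nine-year span from 2024 to 2033. The result need not match the quoted rate exactly, since a publisher may use a different base year or rounding.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted above: $4.3 billion (2024) and $23.1 billion (2033).
implied = cagr(4.3, 23.1, 2033 - 2024)
print(f"Implied CAGR: {implied:.1%}")
```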



    Regional Outlook



    North America currently dominates the Data-Centric AI Platforms market, accounting for the largest share of the global revenue. This region’s leadership is underpinned by a mature technology ecosystem, widespread adoption of AI across major verticals such as BFSI, healthcare, and IT & telecommunications, and a strong presence of leading market players. The United States, in particular, is a hub for AI innovation, with a high concentration of data-centric startups, research institutions, and established enterprises investing heavily in AI infrastructure. Government initiatives promoting AI research, coupled with stringent data governance regulations, further drive the adoption of data-centric AI platforms. As of 2024, North America contributed approximately 41% of the global market value, reflecting its advanced digital maturity and early adoption curve.



    The Asia Pacific region is emerging as the fastest-growing market for Data-Centric AI Platforms, projected to record a remarkable CAGR of 24.5% between 2024 and 2033. This accelerated growth is fueled by rapid urbanization, digitalization efforts, and increasing investments in AI infrastructure by both governments and private enterprises. Countries like China, Japan, South Korea, and India are witnessing a surge in AI-driven projects, particularly in manufacturing, retail, and healthcare sectors. The region’s expanding data ecosystem, coupled with a growing pool of skilled AI professionals, is fostering the adoption of advanced data annotation, labeling, and quality management solutions. Furthermore, strategic initiatives such as China’s AI development plans and India’s Digital India mission are catalyzing the deployment of data-centric AI platforms, making Asia Pacific a key region to watch over the forecast period.



    Latin America, the Middle East, and Africa are gradually gaining traction in the Data-Centric AI Platforms market, albeit at a slower pace compared to North America and Asia Pacific. These emerging economies face unique challenges such as limited AI expertise, infrastructural constraints, and inconsistent regulatory frameworks. However, localized demand for AI-driven solutions in sectors like banking, agriculture, and public safety is prompting incremental adoption. Governments in these regions are beginning to recognize the strategic importance of AI, leading to policy reforms and capacity-building initiatives. While the overall market share remains modest, the potential for growth is significant, particularly as digital literacy improves, investment in cloud infrastructure increases, and global vendors expand their geographic footprint into these untapped markets.



    Report Scope





    Attributes | Details
    Report Title | Data-Centric AI Platforms Market Research Report 2033
    By Component | Software, Services
    By Deployment Mode | Cloud, On-Premises
    By Application | Data Labeling, Data Annota
