100+ datasets found
  1. Dataset of Pairs of an Image and Tags for Cataloging Image-based Records

    • data.mendeley.com
    • narcis.nl
    Updated Feb 24, 2022
    Cite
    Tokinori Suzuki (2022). Dataset of Pairs of an Image and Tags for Cataloging Image-based Records [Dataset]. http://doi.org/10.17632/msyc6mzvhg.1
    Dataset updated
    Feb 24, 2022
    Authors
    Tokinori Suzuki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Brief Explanation

    This dataset was created to develop and evaluate a cataloging system that assigns appropriate metadata to image records for database management in digital libraries. It is intended for evaluating a task in which, given an image and its assigned tags, an appropriate Wikipedia page is selected for each of the given tags.

    A main characteristic of the dataset is that it includes ambiguous tags, so the visual contents of images are not unique to their tags. For example, it includes the tag 'mouse', which can refer either to the mammal or to the computer pointing device. The annotations are the Wikipedia articles judged by humans to be the correct entities for the tags.

    The dataset offers both data and programs that reproduce experiments for the above-mentioned task. Its data consist of image sources and annotations. The image sources are URLs of 420 images uploaded to Flickr. The annotations are a total of 2,464 relevant Wikipedia pages manually judged for the tags of the images. The dataset also provides programs in a Jupyter notebook (scripts.ipynb) to conduct a series of experiments, running some baseline methods for the designated task and evaluating the results.

    Structure of the Dataset

    1. data directory

      1.1. image_URL.txt This file lists the URLs of the image files.

      1.2. rels.txt This file lists the correct Wikipedia pages for each topic in topics.txt.

      1.3. topics.txt This file lists the target pairs, each consisting of an image and a tag to be disambiguated; each pair is called a topic in this dataset.

      1.4. enwiki_20171001.xml This file contains text extracted from the titles and body sections of English Wikipedia articles as of 1 October 2017. It is derived from the Wikipedia dump data (https://archive.org/download/enwiki-20171001).

    2. img directory This is a placeholder directory into which downloaded image files are stored.

    3. results directory This is a placeholder directory for storing result files for evaluation. It includes the results of three baseline methods in sub-directories. These contain JSON files, each of which is the result for one topic, and are ready to be evaluated with the evaluation scripts in scripts.ipynb, serving as a reference for both usage and performance.

    4. scripts.ipynb The scripts for running baseline methods and evaluation are ready in this Jupyter notebook file.
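    As a quick-start illustration (not part of the dataset's own scripts), here is a minimal sketch for fetching the listed images into the img directory, assuming image_URL.txt contains one URL per line and that the URL's last path segment is a usable filename:

    ```python
    # Sketch: download the images listed in data/image_URL.txt into img/.
    # Assumes one URL per line; filenames are taken from the URL path.
    import os
    import requests

    os.makedirs("img", exist_ok=True)
    with open("data/image_URL.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        fname = os.path.join("img", url.rsplit("/", 1)[-1])
        if os.path.exists(fname):
            continue  # skip files fetched on a previous run
        resp = requests.get(url, timeout=30)
        if resp.ok:
            with open(fname, "wb") as out:
                out.write(resp.content)
    ```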

  2. ATLAS Top Tagging Open Data Set

    • opendata.cern.ch
    Updated 2022
    + more versions
    Cite
    ATLAS collaboration (2022). ATLAS Top Tagging Open Data Set [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.FG5F.96GA
    Dataset updated
    2022
    Dataset provided by
    CERN Open Data Portal
    Authors
    ATLAS collaboration
    Description

    Boosted top tagging is an essential binary classification task for experiments at the Large Hadron Collider (LHC) to measure the properties of the top quark. The ATLAS Top Tagging Open Data Set is a publicly available data set for the development of Machine Learning (ML) based boosted top tagging algorithms. The data are split into two orthogonal sets, named train and test and stored in the HDF5 file format, containing 42 million and 2.5 million jets respectively. Both sets are composed of equal parts signal (jets initiated by a boosted top quark) and background (jets initiated by light quarks or gluons). For each jet, the data set contains:

    • The four vectors of constituent particles
    • 15 high level summary quantities evaluated on the jet
    • The four vector of the whole jet
    • A training weight
    • A signal (1) vs background (0) label.

    There is one rule for using this data set: the contribution to a loss function from any jet should always be weighted by the training weight. Apart from this, a model should separate the signal jets from background by whatever means necessary.
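    As an illustration of that rule, here is a minimal sketch of a per-jet weighted binary cross-entropy in NumPy. The HDF5 key names ("labels", "weights") are assumptions for illustration only; inspect the files to confirm the actual layout:

    ```python
    # Sketch: apply the per-jet training weight to a loss function.
    # The HDF5 key names ("labels", "weights") are assumed, not confirmed.
    import h5py
    import numpy as np

    with h5py.File("train.h5", "r") as f:
        labels = f["labels"][:1000]    # 1 = top-quark jet, 0 = background
        weights = f["weights"][:1000]  # per-jet training weight

    def weighted_bce(y_true, y_pred, w, eps=1e-7):
        """Binary cross-entropy with each jet's term scaled by its weight."""
        y_pred = np.clip(y_pred, eps, 1 - eps)
        per_jet = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return np.average(per_jet, weights=w)

    # Example: score uniform 0.5 predictions against the labels.
    print(weighted_bce(labels, np.full_like(labels, 0.5, dtype=float), weights))
    ```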

    Updated on July 26th 2024: this dataset has been superseded by a new dataset which also includes systematic uncertainties. Please use the new dataset instead of this one.

  3. Public tags added to resources in Trove, 2008 to 2024

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 6, 2024
    + more versions
    Cite
    Sherratt, Tim (2024). Public tags added to resources in Trove, 2008 to 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5094313
    Dataset updated
    Jun 6, 2024
    Authors
    Sherratt, Tim
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024. I harvested the data using the Trove API and saved it as a CSV file with the following columns:

    tag – lower-cased text tag

    date – date the tag was added

    zone – API zone containing the tagged resource

    record_id – the identifier of the tagged resource

    I've documented the method used to harvest the tags in this notebook.

    Using the zone and record_id you can find more information about a tagged item. To create URLs for the resources in Trove:

    for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones add the record_id to https://trove.nla.gov.au/work/

    for resources in the 'newspaper' and 'gazette' zones add the record_id to https://trove.nla.gov.au/article/

    for resources in the 'list' zone add the record_id to https://trove.nla.gov.au/list/
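    Put together, a small helper following the three rules above might look like this (the record_id in the usage line is hypothetical):

    ```python
    # Sketch: build a Trove URL from a tagged item's zone and record_id,
    # following the rules listed above.
    WORK_ZONES = {"book", "article", "picture", "music", "map", "collection"}
    ARTICLE_ZONES = {"newspaper", "gazette"}

    def trove_url(zone: str, record_id: str) -> str:
        if zone in WORK_ZONES:
            return f"https://trove.nla.gov.au/work/{record_id}"
        if zone in ARTICLE_ZONES:
            return f"https://trove.nla.gov.au/article/{record_id}"
        if zone == "list":
            return f"https://trove.nla.gov.au/list/{record_id}"
        raise ValueError(f"unexpected zone: {zone}")

    print(trove_url("newspaper", "41697877"))  # hypothetical record_id
    ```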

    Notes:

    Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.

    A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.

    While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.

    User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.

    See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.

  4. FSDKaggle2018

    • zenodo.org
    • opendatalab.com
    • +1 more
    zip
    Updated Jan 24, 2020
    + more versions
    Cite
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra (2020). FSDKaggle2018 [Dataset]. http://doi.org/10.5281/zenodo.2552860
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eduardo Fonseca; Xavier Favory; Jordi Pons; Frederic Font; Manoj Plakal; Daniel P. W. Ellis; Xavier Serra
    Description

    FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2, which was run as a Kaggle competition titled Freesound General-Purpose Audio Tagging Challenge.

    Citation

    If you use the FSDKaggle2018 dataset or part of it, please cite our DCASE 2018 paper:

    Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, Xavier Serra. "General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline". Proceedings of the DCASE 2018 Workshop (2018)

    You can also consider citing our ISMIR 2017 paper, which describes how we gathered the manual annotations included in FSDKaggle2018.

    Eduardo Fonseca, Jordi Pons, Xavier Favory, Frederic Font, Dmitry Bogdanov, Andres Ferraro, Sergio Oramas, Alastair Porter, and Xavier Serra, "Freesound Datasets: A Platform for the Creation of Open Audio Datasets", In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 2017

    Contact

    You are welcome to contact Eduardo Fonseca should you have any questions at eduardo.fonseca@upf.edu.

    About this dataset

    Freesound Dataset Kaggle 2018 (or FSDKaggle2018 for short) is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology [1]. FSDKaggle2018 has been used for the Task 2 of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2018. Please visit the DCASE2018 Challenge Task 2 website for more information. This Task was hosted on the Kaggle platform as a competition titled Freesound General-Purpose Audio Tagging Challenge. It was organized by researchers from the Music Technology Group of Universitat Pompeu Fabra, and from Google Research’s Machine Perception Team.

    The goal of this competition was to build an audio tagging system that can categorize an audio clip as belonging to one of a set of 41 diverse categories drawn from the AudioSet Ontology.

    All audio samples in this dataset are gathered from Freesound [2] and are provided here as uncompressed PCM 16 bit, 44.1 kHz, mono audio files. Note that because Freesound content is collaboratively contributed, recording quality and techniques can vary widely.

    The ground truth data provided in this dataset has been obtained after a data labeling process which is described below in the Data labeling process section. FSDKaggle2018 clips are unequally distributed in the following 41 categories of the AudioSet Ontology:

    "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

    Some other relevant characteristics of FSDKaggle2018:

    • The dataset is split into a train set and a test set.

    • The train set is meant to be for system development and includes ~9.5k samples unequally distributed among 41 categories. The minimum number of audio samples per category in the train set is 94, and the maximum 300. The duration of the audio samples ranges from 300ms to 30s due to the diversity of the sound categories and the preferences of Freesound users when recording sounds. The total duration of the train set is roughly 18h.

    • Out of the ~9.5k samples from the train set, ~3.7k have manually-verified ground truth annotations and ~5.8k have non-verified annotations. The non-verified annotations of the train set have a quality estimate of at least 65-70% in each category. Check out the Data labeling process section below for more information about this aspect.

    • Non-verified annotations in the train set are properly flagged in train.csv so that participants can opt to use this information during the development of their systems.

    • The test set is composed of 1.6k samples with manually-verified annotations and with a category distribution similar to that of the train set. The total duration of the test set is roughly 2h.

    • All audio samples in this dataset have a single label (i.e. they are only annotated with one label). Check out the Data labeling process section below for more information about this aspect. A single label should be predicted for each file in the test set.
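    For example, here is a minimal pandas sketch for splitting the train set by annotation quality. The flag column name (manually_verified) and the label column are assumptions, so check the CSV header first:

    ```python
    # Sketch: split the train set into verified and non-verified annotations,
    # assuming a 0/1 "manually_verified" flag column (an assumption; verify
    # against the actual CSV header).
    import pandas as pd

    train = pd.read_csv("FSDKaggle2018.meta/train_post_competition.csv")
    verified = train[train["manually_verified"] == 1]      # ~3.7k clips
    non_verified = train[train["manually_verified"] == 0]  # ~5.8k clips
    print(len(verified), len(non_verified))
    print(train["label"].value_counts().head())  # per-category clip counts
    ```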

    Data labeling process

    The data labeling process started from a manual mapping between Freesound tags and AudioSet Ontology categories (or labels), which was carried out by researchers at the Music Technology Group, Universitat Pompeu Fabra, Barcelona. Using this mapping, a number of Freesound audio samples were automatically annotated with labels from the AudioSet Ontology. These annotations can be understood as weak labels since they express the presence of a sound category in an audio sample.

    Then, a data validation process was carried out in which a number of participants did listen to the annotated sounds and manually assessed the presence/absence of an automatically assigned sound category, according to the AudioSet category description.

    Audio samples in FSDKaggle2018 are only annotated with a single ground truth label (see train.csv). A total of 3,710 annotations included in the train set of FSDKaggle2018 have been manually validated as present and predominant (some with inter-annotator agreement, but not all of them). This means that in most cases there is no additional acoustic material other than the labeled category. In a few cases there may be some additional sound events, but these additional events won't belong to any of the 41 categories of FSDKaggle2018.

    The rest of the annotations have not been manually validated and therefore some of them could be inaccurate. Nonetheless, we have estimated that at least 65-70% of the non-verified annotations per category in the train set are indeed correct. It can happen that some of these non-verified audio samples present several sound sources even though only one label is provided as ground truth. These additional sources are typically out of the set of the 41 categories, but in a few cases they could be within.

    More details about the data labeling process can be found in [3].

    License

    FSDKaggle2018 has licenses at two different levels, as explained next.

    All sounds in Freesound are released under Creative Commons (CC) licenses, and each audio clip has its own license as defined by the audio clip uploader in Freesound. For attribution purposes and to facilitate attribution of these files to third parties, we include a list of the audio clips included in FSDKaggle2018 and their corresponding licenses. The licenses are specified in the files train_post_competition.csv and test_post_competition_scoring_clips.csv.

    In addition, FSDKaggle2018 as a whole is the result of a curation process and it has an additional license. FSDKaggle2018 is released under CC-BY. This license is specified in the LICENSE-DATASET file downloaded with the FSDKaggle2018.doc zip file.

    Files

    FSDKaggle2018 can be downloaded as a series of zip files with the following directory structure:

    root
    │
    └───FSDKaggle2018.audio_train/          Audio clips in the train set
    │
    └───FSDKaggle2018.audio_test/           Audio clips in the test set
    │
    └───FSDKaggle2018.meta/                 Files for evaluation setup
    │   │
    │   └───train_post_competition.csv      Data split and ground truth for the train set
    │   │
    │   └───test_post_competition_scoring_clips.csv   Ground truth for the test set
    │
    └───FSDKaggle2018.doc/
        │
        └───README.md                       The dataset description file you are reading
        │
        └───LICENSE-DATASET

  5. An Example Showcase - Datasets - Government of Jersey Open Data

    • opendata.gov.je
    Updated Sep 17, 2015
    Cite
    (2015). An Example Showcase - Datasets - Government of Jersey Open Data [Dataset]. https://opendata.gov.je/dataset/an-example-showcase
    Dataset updated
    Sep 17, 2015
    Description

    This is an example Showcase demonstrating the use of Showcase tags rather than using the more restrictive 'type' dropdown. Showcase and link to datasets in use. Datasets used in an app, website or visualization, or featured in an article, report or blog post can be showcased within the CKAN website. Showcases can include an image, description, tags and external link. Showcases may contain several datasets, helping users discover related datasets being used together. Showcases can be discovered by searching and filtered by tag. Site sysadmins can promote selected users to become 'Showcase Admins' to help create, populate and maintain showcases. ckanext-showcase is intended to be a more powerful replacement for the 'Related Item' feature.

  6. Public tags added to resources in Trove, 2008 to 2021

    • zenodo.org
    zip
    Updated May 1, 2023
    Cite
    Tim Sherratt; Tim Sherratt (2023). Public tags added to resources in Trove, 2008 to 2021 [Dataset]. http://doi.org/10.5281/zenodo.5094314
    Available download formats: zip
    Dataset updated
    May 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tim Sherratt; Tim Sherratt
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains 8,722,292 public tags added to resources in Trove between August 2008 and July 2021. I harvested the data using the Trove API and saved it as a CSV file with the following columns:

    • `tag` – lower-cased text tag
    • `date` – date the tag was added
    • `zone` – API zone containing the tagged resource
    • `record_id` – the identifier of the tagged resource

    I've documented the method used to harvest the tags in this notebook.

    Using the `zone` and `record_id` you can find more information about a tagged item. To create URLs for the resources in Trove:

    • for resources in the 'book', 'article', 'picture', 'music', 'map', and 'collection' zones add the `record_id` to https://trove.nla.gov.au/work/
    • for resources in the 'newspaper' and 'gazette' zones add the `record_id` to https://trove.nla.gov.au/article/
    • for resources in the 'list' zone add the `record_id` to https://trove.nla.gov.au/list/

    Notes:

    • Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.
    • A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the 'book', 'picture', and 'map' zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your needs, you might want to remove these duplicates.
    • While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your needs, you might want to exclude these by limiting the date range or zones.
    • User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.

    See this notebook for some examples of how you can manipulate, analyse, and visualise the tag data.

  7. Mockup WikiHow Dataset

    • kaggle.com
    Updated Dec 10, 2024
    Cite
    Vipul L Rathod (2024). Mockup WikiHow Dataset [Dataset]. https://www.kaggle.com/datasets/vipullrathod/wikihow-mockup/data
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 10, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vipul L Rathod
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset is a mockup collection of 100 WikiHow articles designed for research and educational purposes. It includes the following fields:

    • Article ID: unique numeric identifier for each article.
    • Title: the main title of the WikiHow article.
    • Summary: a brief overview of the article's content, useful for summarization tasks.
    • Tags: relevant keywords associated with the article for classification and search optimization.

    This dataset is ideal for exploring text summarization, keyword extraction, and natural language processing (NLP) applications. It is clean and well-structured, making it beginner-friendly for data analysis and machine learning projects.

  8. Face Detection - Face Recognition Dataset

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Unique Data (2023). Face Detection - Face Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/face-detection-photos-and-labels
    Available download formats: zip (1,252,666,206 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Face Detection - Object Detection & Face Recognition Dataset

    The dataset is created on the basis of the Selfies and ID Dataset.

    The dataset is a collection of images (selfies) of people with bounding-box labeling of their faces. It has been specifically curated for face detection and face recognition tasks. The dataset encompasses diverse demographics, ages, ethnicities, and genders.


    The dataset is a valuable resource for researchers, developers, and organizations working on age prediction and face recognition to train, evaluate, and fine-tune AI models for real-world applications. It can be applied in various domains like psychology, market research, and personalized advertising.

    👉 Legally sourced datasets, carefully structured for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_1_extension, photo_2_extension, …, photo_15_extension - photo extensions in the dataset
    • photo_1_resolution, photo_2_resolution, …, photo_15_resolution - photo resolutions in the dataset

    OTHER BIOMETRIC DATASETS:

    🧩 This is just an example of the data. Leave a request here to learn more

    Dataset structure

    • images - contains the original images of people
    • labels - includes visualized labeling for the original images
    • annotations.xml - contains the coordinates of the bounding boxes created for the original photos

    Data Format

    Each image from the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the polygons and their labels. For each point, the x and y coordinates are provided.

    Example of XML file structure

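    As a stand-in, here is a hedged sketch of what a CVAT-style annotations.xml entry might look like, with hypothetical tag and attribute names, plus parsing code:

    ```python
    # Hypothetical CVAT-style annotations.xml snippet and a parsing sketch.
    # Tag and attribute names are illustrative assumptions, not the dataset's
    # confirmed schema; inspect annotations.xml before relying on them.
    import xml.etree.ElementTree as ET

    xml_text = """
    <annotations>
      <image id="0" name="selfie_0001.png" width="1080" height="1920">
        <box label="face" xtl="312.5" ytl="508.0" xbr="742.1" ybr="1105.3"/>
      </image>
    </annotations>
    """

    root = ET.fromstring(xml_text)
    for image in root.iter("image"):
        for box in image.iter("box"):
            print(image.get("name"), box.get("label"),
                  [box.get(k) for k in ("xtl", "ytl", "xbr", "ybr")])
    ```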

    🚀 You can learn more about our high-quality unique datasets here

    keywords: biometric system, biometric system attacks, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision dataset, human images dataset, human faces dataset

  9. Data Labeling Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Data Labeling Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Labeling Market Outlook



    According to our latest research, the global data labeling market size reached USD 3.2 billion in 2024, driven by the explosive growth in artificial intelligence and machine learning applications across industries. The market is poised to expand at a CAGR of 22.8% from 2025 to 2033, and is forecasted to reach USD 25.3 billion by 2033. This robust growth is primarily fueled by the increasing demand for high-quality annotated data to train advanced AI models, the proliferation of automation in business processes, and the rising adoption of data-driven decision-making frameworks in both the public and private sectors.




    One of the principal growth drivers for the data labeling market is the accelerating integration of AI and machine learning technologies across various industries, including healthcare, automotive, retail, and BFSI. As organizations strive to leverage AI for enhanced customer experiences, predictive analytics, and operational efficiency, the need for accurately labeled datasets has become paramount. Data labeling ensures that AI algorithms can learn from well-annotated examples, thereby improving model accuracy and reliability. The surge in demand for computer vision applications—such as facial recognition, autonomous vehicles, and medical imaging—has particularly heightened the need for image and video data labeling, further propelling market growth.




    Another significant factor contributing to the expansion of the data labeling market is the rapid digitization of business processes and the exponential growth in unstructured data. Enterprises are increasingly investing in data annotation tools and platforms to extract actionable insights from large volumes of text, audio, and video data. The proliferation of Internet of Things (IoT) devices and the widespread adoption of cloud computing have further amplified data generation, necessitating scalable and efficient data labeling solutions. Additionally, the rise of semi-automated and automated labeling technologies, powered by AI-assisted tools, is reducing manual effort and accelerating the annotation process, thereby enabling organizations to meet the growing demand for labeled data at scale.




    The evolving regulatory landscape and the emphasis on data privacy and security are also playing a crucial role in shaping the data labeling market. As governments worldwide introduce stringent data protection regulations, organizations are turning to specialized data labeling service providers that adhere to compliance standards. This trend is particularly pronounced in sectors such as healthcare and BFSI, where the accuracy and confidentiality of labeled data are critical. Furthermore, the increasing outsourcing of data labeling tasks to specialized vendors in emerging economies is enabling organizations to access skilled labor at lower costs, further fueling market expansion.




    From a regional perspective, North America currently dominates the data labeling market, followed by Europe and the Asia Pacific. The presence of major technology companies, robust investments in AI research, and the early adoption of advanced analytics solutions have positioned North America as the market leader. However, the Asia Pacific region is expected to witness the fastest growth during the forecast period, driven by the rapid digital transformation in countries like China, India, and Japan. The growing focus on AI innovation, government initiatives to promote digitalization, and the availability of a large pool of skilled annotators are key factors contributing to the region's impressive growth trajectory.



    In the realm of security, Video Dataset Labeling for Security has emerged as a critical application area within the data labeling market. As surveillance systems become more sophisticated, the need for accurately labeled video data is paramount to ensure the effectiveness of security measures. Video dataset labeling involves annotating video frames to identify and track objects, behaviors, and anomalies, which are essential for developing intelligent security systems capable of real-time threat detection and response. This process not only enhances the accuracy of security algorithms but also aids in the training of AI models that can predict and prevent potential security breaches. The growing emphasis on public safety and

  10. OpenStack Tag Data

    • figshare.com
    txt
    Updated Feb 21, 2017
    + more versions
    Cite
    anonymous authors (2017). OpenStack Tag Data [Dataset]. http://doi.org/10.6084/m9.figshare.4670530.v1
    Available download formats: txt
    Dataset updated
    Feb 21, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In order to support future replication studies, we have used publicly available datasets and have made our manually tagged review data available here.

    Files:

    • spsn_abandon172_fse.csv: the tags for abandoned contentious patches. The total number of tags is 172.
    • spsn_integrate189_fse.csv: the tags for integrated contentious patches. The total number of tags is 189.

    Columns in a file:

    • "Review Id" is the identification number of a patch. For example, in the paper, review #33395 is Review Id 33395.
    • "Tag" is a key concern associated with abandoned contentious patches or integrated contentious patches. For example, in abandoned contentious patches, the key concern of Review Id 33395 is "Shallow Fix".

  11. Event Detection Dataset

    • data.mendeley.com
    • datosdeinvestigacion.conicet.gov.ar
    • +2 more
    Updated Jul 11, 2020
    Cite
    Mariano Maisonnave (2020). Event Detection Dataset [Dataset]. http://doi.org/10.17632/7d54rvzxkr.1
    Dataset updated
    Jul 11, 2020
    Authors
    Mariano Maisonnave
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a manually labeled data set for the task of Event Detection (ED). The task of ED consists of identifying event triggers: the words that most clearly indicate the occurrence of an event.

    The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

    Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish those events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set we only labeled the first type as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.

    Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

    Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files; the second folder contains 200 XML files. Each XML file has the following format.

    <?xml version="1.0" encoding="UTF-8"?>

    The first three tags (pubdate, file-id and sent-idx) contain metadata information. The first one is the publication date of the news article that contained that text extract. The next two tags represent a unique identifier for the text extract. The file-id uniquely identifies a news article, that can hold several text extracts. The second one is the index that identifies that text extract inside the full article.

    The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the event trigger tags; each of these tags surrounds one word that was manually labeled as an event trigger.
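    To make the format concrete, here is an illustrative sketch of one file and how it could be parsed. The metadata tag names follow the description above; the inline trigger tag is written here as <event>, which is an assumption since its exact name is not spelled out:

    ```python
    # Illustrative sketch of one data file, based on the tag names described
    # above (pubdate, file-id, sent-idx, sentence). The <event> tag and the
    # metadata values are assumptions for illustration only.
    import xml.etree.ElementTree as ET

    xml_text = """<?xml version="1.0" encoding="UTF-8"?>
    <root>
      <pubdate>1994-03-07</pubdate>
      <file-id>0674637</file-id>
      <sent-idx>4</sent-idx>
      <sentence>... it would only contribute to the current account
      <event>deficit</event> ...</sentence>
    </root>
    """.strip()

    root = ET.fromstring(xml_text)
    sentence = root.find("sentence")
    triggers = [e.text for e in sentence.iter("event")]
    print(triggers)  # ['deficit']
    ```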

  12. Website Screenshots

    • kaggle.com
    • universe.roboflow.com
    Updated May 19, 2025
    Cite
    Pooria Mostafapoor (2025). Website Screenshots [Dataset]. https://www.kaggle.com/datasets/pooriamst/website-screenshots/versions/1
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 19, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Pooria Mostafapoor
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes:

    • button - navigation links, tabs, etc.
    • heading - text that was enclosed in <h1> to <h6> tags.
    • link - inline, textual <a> tags.
    • label - text labeling form fields.
    • text - all other text.
    • image - <img>, <svg>, or <video> tags, and icons.
    • iframe - ads and 3rd party content.

    Example

    This is an example image and annotation from the dataset: a Wikipedia screenshot (https://i.imgur.com/mOG3u3Z.png).

    Usage

    Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

    The dataset contains 1,689 training images, 243 test images, and 483 validation images.

  13. Body Segmentation - 6,700 Photos

    • kaggle.com
    zip
    Updated Apr 20, 2023
    Cite
    Unique Data (2023). Body Segmentation - 6,700 Photos [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/human-segmentation-dataset
    Available download formats: zip (35,504,836 bytes)
    Dataset updated
    Apr 20, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    7 Types of Human Segmentation

    The dataset includes 7 different types of image segmentation of people in underwear. For women, 4 types of labeling are provided; for men, 3 types. The dataset addresses tasks in the fields of recommendation systems and e-commerce.

    👉 Legally sourced datasets, carefully structured for AI training and model development. Explore samples from our dataset - Full dataset

    Types of labeling

    Women I - distinctively detailed labeling of women. Special emphasis is placed on distinguishing the internal, external side, and lower breast depending on the type of underwear. The labeling also includes the face and hair, hands, forearms, shoulders, armpits, thighs, shins, underwear, accessories, and smartphones.


    Women II - labeling of images of women with attention to the side abs area (highlighted in gray on the labeling). The labeling also includes the face and hair, hands, forearms, thighs, underwear, accessories, and smartphones.


    Women III - primarily labeling of underwear. In addition to the underwear itself, the labeling includes the face and hair, abdomen, and arms and legs.


    Women IV - labeling of both underwear and body parts. It includes labeling of underwear, face and hair, hands, forearms, body, legs, as well as smartphones and tattoos.


    Men I - labeling of the upper part of men's bodies. It includes labeling of hands and wrists, shoulders, body, neck, face and hair, as well as phones and accessories.


    Men II - more detailed labeling of men's bodies. The labeling includes hands and wrists, shoulders, body and neck, head and hair, underwear, tattoos and accessories, nipple and navel area.


    Men Neuro - labeling produced by a neural network for subsequent correction by annotators.


    🧩 This is just an example of the data. Leave a request here to learn more

    🚀 You can learn more about our high-quality unique datasets here

    keywords: body segmentation dataset, human part segmentation dataset, human semantic part segmentation, human body segmentation data, human body segmentation deep learning, computer vision dataset, people images dataset, biometric data dataset, biometric dataset, images database, image-to-image, people segmentation, machine learning

  14. Multi-domain Image Characteristics Dataset

    • kaggle.com
    zip
    Updated May 24, 2020
    Cite
    Akash (2020). Multi-domain Image Characteristics Dataset [Dataset]. https://www.kaggle.com/grassknoted/multidomain-image-characteristics-dataset
    Available download formats: zip (141,084,369 bytes)
    Dataset updated
    May 24, 2020
    Authors
    Akash
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Description

    Context

    Typically, most publicly available datasets are created with the intent of testing classification or labeling algorithms, where the primary goal of a learning algorithm is to classify the data. Very few datasets exist for which the goal of a learning algorithm is to reason out why and how the data has been classified.

    Content

    The Multi-domain Image Characteristics Dataset consists of thousands of images sourced from the internet. Each image falls under one of three domains: animals, birds, or furniture. There are five types under each domain and 200 images of each type, bringing the total dataset to 3,000 images. The master file consists of two columns: the image name and the visible characteristics in that image. Every image was manually analysed and the characteristics for each image were generated, ensuring accuracy.

    Images falling under the same domain have a similar set of characteristics. For example, pictures under the birds domain will have a common set of characteristics such as color of the bird, presence of a beak, wing, eye, legs, etc. Care has been taken to ensure that each image is as unique as possible by including pictures that have different combinations of visible characteristics present. This includes pictures having variations in the capture angle, etc.

    The entire data is comprised of 3 primary classes, and further into 5 sub-classes for each primary class as follows: 1) Animals a) Cat; b) Dog; c) Fox; d) Hyena; e) Wolf

    2) Birds a) Duck; b) Eagle; c) Hawk; d) Parrot; e) Sparrow

    3) Furniture a) Bed; b) Chair; c) Sofa; d) Table; e) Nightstand

    Each subclass also contains a .csv file with the image name and the characteristics present in the corresponding image. The exhaustive list of image characteristics is divided as follows: 1) Face: Eyes, Mouth, Snout, Ears, Whiskers, Nose, Teeth, Beak, Tongue

    2) Body: Wings, Legs, Paws, Tail, Surface, Arm Rest, Base, Pillows, Cushion, Drawer, Knob, Mattress

    3) Color: Brown, Black, Grey, White, Purple, Pink, Yellow, Turquoise

  15. Codeforces Dataset

    • kaggle.com
    zip
    Updated Aug 2, 2021
    Cite
    dipkumar patel (2021). Codeforces Dataset [Dataset]. https://www.kaggle.com/datasets/immortal3/codeforces-dataset/data
    Available download formats: zip (3,750,595 bytes)
    Dataset updated
    Aug 2, 2021
    Authors
    dipkumar patel
    Description

    Context

    In competitive programming, problems are given and participants have to submit code that successfully solves a given problem within a given time limit. There can be one or many ways to solve a given problem. Top competitive programmers narrow down approaches based on their past experience (having solved similar problems) and/or intuition. After the contest is over, problems are tagged based on the approaches used and their difficulty. Problem tags are useful for narrowing down the scope of approaches.

    Dataset

    This dataset consists of Codeforces problems. It has two main columns: Problem Statement and Problem Tags. Problem Statement is the full text from the problem page; Problem Tags are comma-separated tag classes.

    Task

    The challenge is to predict the tags (multi-label classification), given the problem statement as input.
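    A minimal baseline sketch for this multi-label task; the file and column names ("codeforces.csv", "problem_statement", "problem_tags") are assumptions that should be adjusted to the actual CSV:

    ```python
    # Sketch: TF-IDF + one-vs-rest logistic regression as a tag-prediction
    # baseline. File and column names are hypothetical.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    df = pd.read_csv("codeforces.csv")
    tags = df["problem_tags"].str.split(",")  # comma-separated tag classes

    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(tags)               # one binary column per tag

    X = TfidfVectorizer(max_features=20000, stop_words="english") \
            .fit_transform(df["problem_statement"])
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

    pred = clf.predict(X[:1])
    print(mlb.inverse_transform(pred))        # predicted tag set for one problem
    ```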

    Acknowledgements

    All copyrights are attributed to Codeforces and original authors.

  16. Named Entity Recognition (NER) Corpus

    • kaggle.com
    zip
    Updated Jan 14, 2022
    Cite
    Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
    Available download formats: zip (4,343,548 bytes)
    Dataset updated
    Jan 14, 2022
    Authors
    Naser Al-qaydeh
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Task

    Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories like names of persons, locations, organizations, etc.

    Dataset

    Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.

    You can use Pandas Dataframe to read and manipulate this dataset.

    Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and try to get the tag lists by indexing, each list will come back as a string:

    ```python
    import pandas as pd

    data = pd.read_csv("ner.csv")  # path to this dataset's CSV file (adjust as needed)

    data['tag'][0]
    # "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"

    type(data['tag'][0])
    # <class 'str'>
    ```

    You can use the following to convert it back to list type:

    ```python
    from ast import literal_eval

    literal_eval(data['tag'][0])
    # ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

    type(literal_eval(data['tag'][0]))
    # <class 'list'>
    ```

    Acknowledgements

    This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.

    The Annotated Corpus for Named Entity Recognition is an annotated corpus for Named Entity Recognition built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular Natural Language Processing features applied to the data set.

    Essential info about entities:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
  17. Data from: X-ray CT data with semantic annotations for the paper "A workflow for segmenting soil and plant X-ray CT images with deep learning in Google's Colaboratory"

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +1 more
    Updated Jun 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). X-ray CT data with semantic annotations for the paper "A workflow for segmenting soil and plant X-ray CT images with deep learning in Google's Colaboratory" [Dataset]. https://catalog.data.gov/dataset/x-ray-ct-data-with-semantic-annotations-for-the-paper-a-workflow-for-segmenting-soil-and-p-d195a
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Leaves from genetically unique Juglans regia plants were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA. Soil samples were collected in Fall of 2017 from the riparian oak forest located at the Russell Ranch Sustainable Agricultural Institute at the University of California Davis. The soil was sieved through a 2 mm mesh and was air dried before imaging. A single soil aggregate was scanned at 23 keV using the 10x objective lens with a pixel resolution of 650 nanometers on beamline 8.3.2 at the ALS. Additionally, a drought stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned using a 4x lens with a pixel resolution of 1.72 µm on beamline 8.3.2 at the ALS.

    Raw tomographic image data was reconstructed using TomoPy. Reconstructions were converted to 8-bit tif or png format using ImageJ or the PIL package in Python before further processing. Images were annotated using Intel's Computer Vision Annotation Tool (CVAT) and ImageJ. Both CVAT and ImageJ are free to use and open source. Leaf images were annotated following Théroux-Rancourt et al. (2020). Specifically, hand labeling was done directly in ImageJ by drawing around each tissue, with 5 images annotated per leaf. Care was taken to cover a range of anatomical variation to help improve the generalizability of the models to other leaves. All slices were labeled by Dr. Mina Momayyezi and Fiona Duong.

    To annotate the flower bud and soil aggregate, images were imported into CVAT. The exterior border of the bud (i.e. bud scales) and flower were annotated in CVAT and exported as masks. Similarly, the exterior of the soil aggregate and particulate organic matter identified by eye were annotated in CVAT and exported as masks. To annotate air spaces in both the bud and soil aggregate, images were imported into ImageJ. A Gaussian blur was applied to the image to decrease noise, and then the air space was segmented using thresholding. After applying the threshold, the selected air space region was converted to a binary image, with white representing the air space and black representing everything else. This binary image was overlaid upon the original image, and the air space within the flower bud and aggregate was selected using the "free hand" tool. Air space outside of the region of interest for both image sets was eliminated. The quality of the air space annotation was then visually inspected for accuracy against the underlying original image; incomplete annotations were corrected using the brush or pencil tool to paint missing air space white and incorrectly identified air space black. Once the annotation was satisfactorily corrected, the binary image of the air space was saved. Finally, the annotations of the bud and flower, or aggregate and organic matter, were opened in ImageJ and the associated air space mask was overlaid on top of them, forming a three-layer mask suitable for training the fully convolutional network. All labeling of the soil aggregate and soil aggregate images was done by Dr. Devin Rippner.

    These images and annotations are for training deep learning models to identify different constituents in leaves, almond buds, and soil aggregates.

    Limitations: For the walnut leaves, some tissues (stomata, etc.) are not labeled and only represent a small portion of a full leaf. Similarly, both the almond bud and the aggregate represent just one single sample of each. The bud tissues are only divided up into bud scales, flower, and air space; many other tissues remain unlabeled. For the soil aggregate, annotated labels are done by eye with no actual chemical information, so particulate organic matter identification may be incorrect.

    Resources in this dataset:

    Resource Title: Annotated X-ray CT images and masks of a Forest Soil Aggregate. File Name: forest_soil_images_masks_for_testing_training.zip. Resource Description: This aggregate was collected from the riparian oak forest at the Russell Ranch Sustainable Agricultural Facility. The aggregate was scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 0,0,0; pore spaces have a value of 250,250,250; mineral solids have a value of 128,0,0; and particulate organic matter has a value of 0,128,0. These files were used for training a model to segment the forest soil aggregate and for testing the accuracy, precision, recall, and F1 score of the model.

    Resource Title: Annotated X-ray CT images and masks of an Almond bud (P. dulcis). File Name: Almond_bud_tube_D_P6_training_testing_images_and_masks.zip. Resource Description: A drought stressed almond flower bud (Prunus dulcis) from a plant housed at the University of California, Davis, was scanned by X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 4x lens with a pixel resolution of 1.72 µm. For masks, the background has a value of 0,0,0; air spaces have a value of 255,255,255; bud scales have a value of 128,0,0; and flower tissues have a value of 0,128,0. These files were used for training a model to segment the almond bud and for testing the accuracy, precision, recall, and F1 score of the model. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads

    Resource Title: Annotated X-ray CT images and masks of Walnut leaves (J. regia). File Name: 6_leaf_training_testing_images_and_masks_for_paper.zip. Resource Description: Stems were collected from genetically unique J. regia accessions at the 117 USDA-ARS-NCGR in Wolfskill Experimental Orchard, Winters, California, USA to use as scion, and were grafted by Sierra Gold Nursery onto a commonly used commercial rootstock, RX1 (J. microcarpa × J. regia). We used a common rootstock to eliminate any own-root effects and to simulate conditions for a commercial walnut orchard setting, where rootstocks are commonly used. The grafted saplings were repotted and transferred to the Armstrong lathe house facility at the University of California, Davis in June 2019, and kept under natural light and temperature. Leaves from each accession and treatment were scanned using X-ray micro-computed tomography (microCT) on the X-ray μCT beamline (8.3.2) at the Advanced Light Source (ALS) at Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA, USA, using the 10x objective lens with a pixel resolution of 650 nanometers. For masks, the background has a value of 170,170,170; epidermis 85,85,85; mesophyll 0,0,0; bundle sheath extension 152,152,152; vein 220,220,220; and air 255,255,255. Resource Software Recommended: Fiji (ImageJ), url: https://imagej.net/software/fiji/downloads
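    As a usage sketch, the RGB masks described above can be converted to integer class labels for model training. The color table follows the walnut-leaf mask values listed above; the file name is hypothetical:

    ```python
    # Sketch: map mask RGB values to integer class labels, using the
    # walnut-leaf color table described above. "leaf_mask.png" is a
    # placeholder file name.
    import numpy as np
    from PIL import Image

    COLOR_TO_CLASS = {
        (170, 170, 170): 0,  # background
        (85, 85, 85): 1,     # epidermis
        (0, 0, 0): 2,        # mesophyll
        (152, 152, 152): 3,  # bundle sheath extension
        (220, 220, 220): 4,  # vein
        (255, 255, 255): 5,  # air
    }

    mask = np.array(Image.open("leaf_mask.png").convert("RGB"))
    labels = np.zeros(mask.shape[:2], dtype=np.uint8)
    for rgb, cls in COLOR_TO_CLASS.items():
        labels[np.all(mask == rgb, axis=-1)] = cls
    ```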

  18. Namesakes

    • figshare.com
    json
    Updated Nov 20, 2021
    Cite
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon (2021). Namesakes [Dataset]. http://doi.org/10.6084/m9.figshare.17009105.v1
    Available download formats: json
    Dataset updated
    Nov 20, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Oleg Vasilyev; Aysu Altun; Nidhi Vyas; Vedant Dharnidharka; Erika Lampert; John Bohannon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    Motivation: creating a challenging dataset for testing Named-Entity Linking. The Namesakes dataset consists of three closely related datasets: Entities, News and Backlinks. Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names. The News were collected as random news text chunks, containing mentions that either belong to the Entities dataset or can be easily confused with them. Backlinks were obtained from Wikipedia dump data with the intention of having mentions linked to the entities of the Entities dataset. The Entities and News are human-labeled, resolving the mentions of the entities.

    Methods

    Entities were collected as Wikipedia text chunks corresponding to highly ambiguous entity names: the most popular people names, the most popular locations, and organizations with name ambiguity. In each Entities text chunk, the named entities whose name is similar to the chunk's Wikipedia page name are labeled. These entities were presented to human annotators (odetta.ai) to tag as "Same" (same as the page entity) or "Other". The labeling was done by 6 experienced annotators who had passed a preliminary trial task. The only accepted tags are those assigned in agreement by at least 5 annotators, which then passed through reconciliation with an experienced reconciliator.

    The News were collected as random news text chunks containing mentions that either belong to the Entities dataset or can easily be confused with them. In each News text chunk one mention was selected for labeling, and 3-10 Wikipedia pages from Entities were suggested as candidate labels for an annotator to choose from. The labeling was done by 3 experienced annotators (odetta.ai) who had passed a preliminary trial task; the results were reconciled by an experienced reconciliator. All the labeling was done using LightTag (lighttag.io).

    Backlinks were obtained from Wikipedia dump data (dumps.wikimedia.org/enwiki/20210701) with the intention of having mentions linked to the entities of the Entities dataset. The backlinks were filtered to keep only mentions in good-quality text; each text was cut 1000 characters after the last mention.

    Usage Notes

    Entities: File: Namesakes_entities.jsonl The Entities dataset consists of 4148 Wikipedia text chunks containing human-tagged mentions of entities. Each mention is tagged either as "Same" (the mention refers to this Wikipedia page's entity) or "Other" (the mention refers to some other entity that merely has the same or a similar name). The Entities dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘pagename’: page name of the Wikipedia page.
    • ‘pageid’: page id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘url’: URL of the Wikipedia page.
    • ‘text’: the text chunk from the Wikipedia page.
    • ‘entities’: list of the mentions in the page text; each mention is a dictionary with the keys:
      • ‘text’: the mention as a string from the page text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: annotation tag given as a string, either ‘Same’ or ‘Other’.

    News: File: Namesakes_news.jsonl The News dataset consists of 1000 news text chunks, each one with a single annotated entity mention. The annotation either points to the corresponding entity from the Entities dataset (if the mention is of that entity), or indicates that the mentioned entity does not belong to the Entities dataset. The News dataset is a jsonl list; each item is a dictionary with the following keys and values:

    • ‘id_text’: id of the sample.
    • ‘text’: the text chunk.
    • ‘urls’: list of URLs of Wikipedia entities suggested to labelers for identification of the entity mentioned in the text.
    • ‘entity’: a dictionary describing the annotated entity mention in the text:
      • ‘text’: the mention as a string found by an NER model in the text.
      • ‘start’: start character position of the mention in the text.
      • ‘end’: end (one-past-last) character position of the mention in the text.
      • ‘tag’: this key exists only if the mentioned entity is annotated as belonging to the Entities dataset; if so, the value is a dictionary identifying the Wikipedia page assigned by annotators to the mentioned entity:
        • ‘pageid’: Wikipedia page id.
        • ‘pagetitle’: page title.
        • ‘url’: page URL.
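    As an illustration only (the file names and keys are as listed above; everything else is an assumption), the two files could be loaded and the annotated mentions pulled out along these lines:

      import json

      def read_jsonl(path):
          """Read a .jsonl file into a list of dicts."""
          with open(path, encoding="utf-8") as f:
              return [json.loads(line) for line in f if line.strip()]

      entities = read_jsonl("Namesakes_entities.jsonl")
      news = read_jsonl("Namesakes_news.jsonl")

      # Mentions tagged "Same" refer to the page entity; "Other" are namesakes.
      for item in entities:
          for mention in item["entities"]:
              span = item["text"][mention["start"]:mention["end"]]
              assert span == mention["text"]  # 'end' is one-past-last
              if mention["tag"] == "Same":
                  pass  # positive example for linking to item["pageid"]

      # News items carry a 'tag' key only when the mention links into Entities.
      linked = [n for n in news if "tag" in n["entity"]]
      print(f"{len(linked)} of {len(news)} news mentions link into Entities")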

    Backlinks dataset: The Backlinks dataset consists of two parts: a dictionary Entity-to-Backlinks and Backlinks documents. The dictionary points to the backlinks of each entity in the Entities dataset (if any backlinks exist for the entity). The Backlinks documents are Wikipedia text chunks with identified mentions of the entities from the Entities dataset.

    Each mention is identified by surrounding double square brackets, e.g. "Muir built a small cabin along [[Yosemite Creek]].". If the mention differs from the exact entity name, the double square brackets wrap both the exact name and, separated by ‘|’, the mention string to the right, for example: "Muir also spent time with photographer [[Carleton E. Watkins | Carleton Watkins]] and studied his photographs of Yosemite.".
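    A hedged sketch of parsing this markup (the regular expression below is my own, not part of the dataset); it handles both the [[Entity Name]] and [[Entity Name | mention string]] forms:

      import re

      # group 1: exact entity name; optional group 2: the mention string
      MENTION = re.compile(r"\[\[([^\[\]|]+?)(?:\s*\|\s*([^\[\]]+?))?\]\]")

      def extract_mentions(content):
          """Return (entity_name, mention_text) pairs from a backlinks text chunk."""
          pairs = []
          for m in MENTION.finditer(content):
              entity = m.group(1).strip()
              mention = (m.group(2) or entity).strip()
              pairs.append((entity, mention))
          return pairs

      text = ("Muir also spent time with photographer "
              "[[Carleton E. Watkins | Carleton Watkins]] and studied his "
              "photographs of Yosemite.")
      print(extract_mentions(text))  # [('Carleton E. Watkins', 'Carleton Watkins')]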

    The Entity-to-Backlinks is a jsonl with 1527 items. File: Namesakes_backlinks_entities.jsonl Each item is a tuple:

    • Entity name.
    • Entity Wikipedia page id.
    • Backlinks ids: a list of pageids of backlink documents.

    The Backlinks documents is a jsonl with 26903 items. File: Namesakes_backlinks_texts.jsonl Each item is a dictionary:

    • ‘pageid’: id of the Wikipedia page.
    • ‘title’: title of the Wikipedia page.
    • ‘content’: text chunk from the Wikipedia page, with all mentions in double brackets; the text is cut 1000 characters after the last mention, and the cut is denoted as ‘...[CUT]’.
    • ‘mentions’: list of the mentions from the text, for convenience. Each mention is a tuple:
      • Entity name.
      • Entity Wikipedia page id.
      • Sorted list of all character indexes at which the mention occurrences start in the text.
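    Under the same assumptions as the sketches above, the dictionary and the documents could be joined on the page ids like so:

      import json

      def read_jsonl(path):
          with open(path, encoding="utf-8") as f:
              return [json.loads(line) for line in f if line.strip()]

      index = read_jsonl("Namesakes_backlinks_entities.jsonl")
      docs = {d["pageid"]: d for d in read_jsonl("Namesakes_backlinks_texts.jsonl")}

      for entity_name, entity_pageid, backlink_ids in index:
          for doc_id in backlink_ids:
              doc = docs[doc_id]
              # doc["mentions"] holds (name, page id, start offsets) per mention
              ...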

  19. Kyoushi Log Data Set

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 24, 2025
    Cite
    Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber (2025). Kyoushi Log Data Set [Dataset]. http://doi.org/10.5281/zenodo.5779411
    Explore at:
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Landauer; Maximilian Frank; Florian Skopik; Wolfgang Hotwagner; Markus Wurzenberger; Andreas Rauber
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from a testbed that was built at the Austrian Institute of Technology (AIT) following the approaches by [1], [2], and [3]. Please refer to these papers for more detailed information on the dataset, and cite them if the data is used for academic publications. Unlike the related AIT-LDSv1.1, this dataset involves a more complex network structure, uses a different attack scenario, and collects log data from multiple hosts in the network. In brief, the testbed simulates a small enterprise network including a mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise. After some days, two attack scenarios are launched against the network. Note that the AIT-LDSv2.0 extends this dataset with additional attack cases and variations of attack parameters.

    The archives have the following structure. The gather directory contains the raw log data from each host in the network, as well as their system configurations. The labels directory contains the ground truth for those log files that are labeled. The processing directory contains configurations for the labeling procedure and the rules directory contains the labeling rules. Labeling of events that are related to the attacks is carried out with the Kyoushi Labeling Framework.

    Each dataset contains traces of a specific attack scenario:

    • Scenario 1 (see gather/attacker_0/logs/sm.log for detailed attack log):
      • nmap scan
      • WPScan
      • dirb scan
      • webshell upload through wpDiscuz exploit (CVE-2020-24186)
      • privilege escalation
    • Scenario 2 (see gather/attacker_0/logs/dnsteal.log for detailed attack log):
      • DNSteal data exfiltration

    The log data collected from the servers includes

    • Apache access and error logs (labeled)
    • audit logs (labeled)
    • auth logs (labeled)
    • VPN logs (labeled)
    • DNS logs (labeled)
    • syslog
    • suricata logs
    • exim logs
    • horde logs
    • mail logs

    Note that only log files from affected servers are labeled. Label files and the directories in which they are located have the same name as their corresponding log file in the gather directory. Labels are in JSON format and comprise the following attributes: line (the number of the line in the corresponding log file), labels (the list of labels assigned to that log line), and rules (the names of the labeling rules matching that log line). Note that not all attack traces are labeled in all log files; please refer to the labeling rules if some labels are unclear.
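    As a rough sketch of how these labels could be consumed (the file paths below are hypothetical and the one-JSON-object-per-line layout is an assumption; only the attribute names line, labels, and rules come from the description above):

      import json

      def load_labels(label_path):
          """Map 1-based log line numbers to their label entries."""
          with open(label_path, encoding="utf-8") as f:
              entries = [json.loads(ln) for ln in f if ln.strip()]
          return {entry["line"]: entry for entry in entries}

      # Hypothetical pair: label files mirror the gather directory layout.
      labels = load_labels("labels/some_server/logs/apache2/access.log")
      with open("gather/some_server/logs/apache2/access.log", encoding="utf-8") as f:
          for lineno, line in enumerate(f, start=1):
              entry = labels.get(lineno)
              if entry:  # this log line is part of a labeled attack trace
                  print(lineno, entry["labels"], entry["rules"])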

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.

    [2] M. Landauer, M. Frank, F. Skopik, W. Hotwagner, M. Wurzenberger, and A. Rauber, "A Framework for Automatic Labeling of Log Datasets from Model-driven Testbeds for HIDS Evaluation". ACM Workshop on Secure and Trustworthy Cyber-Physical Systems (ACM SaT-CPS 2022), April 27, 2022, Baltimore, MD, USA. ACM.

    [3] M. Frank, "Quality improvement of labels for model-driven benchmark data generation for intrusion detection systems", Master's Thesis, Vienna University of Technology, 2021.

  20. Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...

    • datarade.ai
    Updated Jul 11, 2025
    Cite
    Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .txt, .mp3, .wav (available download formats)
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Costa Rica, Nicaragua, Paraguay, Chile, Honduras, Ecuador, Colombia, Panama, Cuba, Bolivia (Plurinational State of)
    Description

    Linguistically annotated Spanish language datasets with headwords, definitions, senses, examples, POS tags, semantic metadata, and usage info. Ideal for dictionary tools, NLP, and TTS model training or fine-tuning.

    Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

    1. Spanish Monolingual Dictionary Data
    2. Spanish Bilingual Dictionary Data
    3. Spanish Sentences Data
    4. Synonyms and Antonyms Data
    5. Audio Data
    6. Spanish Word List Data

    Key Features (approximate numbers):

    1. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary data reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Words: 73,000
    • Senses: 123,000
    • Example sentences: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annual
    2. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and offers significant coverage of the language, providing a large volume of high-quality translated words.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annual
    3. Spanish Sentences Data

    Spanish sentences retrieved from the corpus are ideal for NLP model training, comprising approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries, and each sentence is tagged with the country or dialect it belongs to.

    • Sentences volume: 1,840,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    4. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annual
    5. Spanish Audio Data (word-level)

    Curated word-level audio data covering all varieties of world Spanish, providing rich dialectal diversity.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    6. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

    About the sample:

    The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our datasets, we provide a sample in CSV format for preview purposes only.

    If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
