11 datasets found
  1. MORE: A Multimodal Relation Extraction Dataset

    • kaggle.com
    Updated Oct 23, 2024
    Cite
    Marquis03 (2024). MORE: A Multimodal Relation Extraction Dataset [Dataset]. https://www.kaggle.com/datasets/marquis03/more-a-multimodal-relation-extraction-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marquis03
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    arXiv: https://arxiv.org/abs/2312.09753

    To construct the MORE dataset, we chose to use multimodal news data as a source rather than annotating existing MRE datasets, which are primarily sourced from social media. Multimodal news data has carefully selected and well-edited images and textual titles, resulting in relatively good data quality, and often contains timely and informative knowledge. We obtained the data from The New York Times English news and Yahoo News from 2019 to 2022, resulting in a candidate set of 15,000 multimodal news instances covering various topics. We filtered out unqualified data to obtain a meticulously selected dataset for our research purposes. The candidate multimodal news was then annotated in three distinct stages.

    Stage 1: Entity Identification and Object Detection. We utilized the AllenNLP named entity recognition tool and the YOLOv5 object detection tool to identify the entities in textual news titles and the object areas in the corresponding news images. All extracted objects and entities were reviewed and corrected manually by our annotators.
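    As a rough illustration of this stage's tooling (a sketch of ours, not the authors' exact pipeline), the snippet below runs an off-the-shelf AllenNLP NER predictor on a title and YOLOv5 detection on the paired image; the model archive path and image file are placeholders.

```python
# Hedged sketch of Stage 1 tooling; not the authors' exact pipeline.
# Requires: pip install allennlp allennlp-models torch (plus YOLOv5 deps)
import torch
from allennlp.predictors.predictor import Predictor

# Placeholder archive: substitute a released AllenNLP NER model path/URL.
ner = Predictor.from_path("PATH_OR_URL_TO_ALLENNLP_NER_MODEL.tar.gz")
title = "Example news title mentioning a person and a place"
print(ner.predict(sentence=title)["tags"])  # BIO tags, one per token

# YOLOv5 via torch.hub for object areas in the corresponding news image.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")
results = detector("news_image.jpg")        # hypothetical image file
print(results.pandas().xyxy[0])             # boxes, confidences, class names
```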

    Stage 2: Object-Entity Relation Annotation. We recruited well-educated annotators to examine the textual titles and images and deduce the relations between the entities and objects. Relations were randomly assigned to annotators from the candidate set to ensure an unbiased annotation process. Data that did not clearly indicate any of the pre-defined relations was labeled as none. At least two annotators independently reviewed and annotated each instance. In cases where there were discrepancies or conflicts in the annotations, a third annotator was consulted, and their decision was considered final. Weighted Cohen's kappa was used to measure the consistency between annotators.
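    For concreteness, here is a minimal sketch of measuring inter-annotator consistency with weighted Cohen's kappa via scikit-learn; the label arrays and the quadratic weighting scheme are illustrative assumptions, not values from the dataset.

```python
# Illustrative only: weighted Cohen's kappa between two annotators.
from sklearn.metrics import cohen_kappa_score

# Hypothetical relation labels from two independent annotators
# (integer IDs into a pre-defined relation set, with 0 = "none").
annotator_a = [3, 0, 7, 7, 1, 0, 4, 3]
annotator_b = [3, 0, 7, 5, 1, 0, 4, 0]

# "linear" or "quadratic" weighting penalizes larger disagreements more;
# the weighting used by the MORE authors is not specified here.
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.3f}")
```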

    Stage 3: Object-Overlapped Data Filtering. To refine the scope of the multimodal object-entity relation extraction task, we focused only on relations in which visual objects did not co-occur with any entities mentioned in the textual news titles. This process filtered the data down from 15,000 articles to just over 3,000 articles containing more than 20,000 object-entity relational facts. This approach ensured the dataset contains only object-entity relationships illustrated in the images, rather than those already mentioned explicitly in the textual news titles, resulting in a more focused dataset for the task.

  2. Navigating News Narratives: A Media Bias Analysis Dataset

    • figshare.com
    txt
    Updated Dec 8, 2023
    + more versions
    Cite
    Shaina Raza (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.24422122.v4
    Explore at:
    txt (available download formats)
    Dataset updated
    Dec 8, 2023
    Dataset provided by
    figshare
    Authors
    Shaina Raza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset does not contain any personally identifiable information (PII).

    The data structure is tabulated as follows:

    • Text: The main content.
    • Dimension: Descriptive category of the text.
    • Biased_Words: A compilation of words regarded as biased.
    • Aspect: Specific sub-topic within the main content.
    • Label: Indicates the degree of bias; the label is ternary: highly biased, slightly biased, or neutral.
    • Toxicity: Indicates the presence (True) or absence (False) of toxicity.
    • Identity_mention: Mention of any identity, based on word match.

    Annotation scheme: The labels and annotations in the dataset are generated through a system of active learning, cycling through manual labeling, semi-supervised learning, and human verification. The scheme comprises:

    • Bias label: Specifies the degree of bias (e.g., no bias, mild, or strong).
    • Word/phrase-level biases: Pinpoints specific biased terms or phrases.
    • Subjective bias (aspect): Highlights biases pertinent to content dimensions.

    Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctly different.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS feeds to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral, slightly biased, or highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources, with attribution:

    • MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    • Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839, 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    • Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. "Toxic Comment Classification Challenge." Kaggle, 2017. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    • Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle, 2019. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    • Age bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14, 2018. Age Bias Training and Testing Data, Harvard Dataverse.
    • Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014, 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    • Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning about Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing data should be straightforward to facilitate usage.

    If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute, is licensed under CC BY-NC 4.0.
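    As a quick-start illustration (not part of the original release), the tabulated fields above can be inspected with pandas; the file name and separator below are assumptions about the figshare txt download.

```python
# Hedged loading sketch for the media bias dataset; adjust the path and
# separator to match the actual figshare download.
import pandas as pd

df = pd.read_csv("news_bias_dataset.txt", sep="\t")  # hypothetical name/sep

# Expected columns per the data description above.
expected = ["Text", "Dimension", "Biased_Words", "Aspect",
            "Label", "Toxicity", "Identity_mention"]
print([c for c in expected if c in df.columns])

# Ternary bias label: neutral / slightly biased / highly biased.
print(df["Label"].value_counts())
```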

  3. Impact of Digital Habits on Mental Health

    • kaggle.com
    Updated Jun 14, 2025
    Cite
    Shahzad Aslam (2025). Impact of Digital Habits on Mental Health [Dataset]. https://www.kaggle.com/datasets/zeesolver/mental-health
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 14, 2025
    Dataset provided by
    Kaggle
    Authors
    Shahzad Aslam
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset explores the relationship between digital behavior and mental well-being among 100,000 individuals. It records how much time people spend on screens, their use of social media (including TikTok), and how these habits may influence their sleep, stress, and mood levels.

    It includes six numerical features, all clean and ready for analysis, making it ideal for machine learning tasks like regression or classification. The data enables researchers and analysts to investigate how modern digital lifestyles may impact mental health indicators in measurable ways.

    Dataset Applications

    • Quantify how screen‑time, TikTok use, or multi‑platform engagement statistically relate to stress, sleep loss, and mood.
    • Train regression or classification models that forecast stress level or mood score from real‑time digital‑usage metrics.
    • Feed user‑specific data into recommender systems that suggest screen‑time caps or bedtime routines to improve mental health.
    • Provide evidence for guidelines on youth screen‑time limits and platform moderation based on observed stress‑sleep trade‑offs.
    • Serve as a teaching dataset for EDA, feature engineering, and model evaluation in data‑science or psychology curricula.
    • Evaluate app interventions (e.g., screen‑time nudges) by comparing predicted versus actual post‑intervention stress or mood shifts.
    • Cluster individuals into digital‑behavior personas (e.g., “heavy late‑night scrollers”) to tailor mental‑health resources.
    • Generate synthetic time‑series scenarios (what‑if reductions in TikTok hours) to estimate downstream impacts on sleep and stress.
    • Use engineered features (ratio of TikTok hours to total screen‑time, etc.) in broader wellbeing models that include diet or exercise data.
    • Assess whether mental‑health prediction models remain accurate and unbiased across different screen‑time or platform‑use segments.

    Column Descriptions
    • screen_time_hours – Daily total screen usage in hours across all devices.
    • social_media_platforms_used – Number of different social media platforms used per day.
    • hours_on_TikTok – Time spent on TikTok daily, in hours.
    • sleep_hours – Average number of sleep hours per night.
    • stress_level – Stress intensity reported on a scale from 1 (low) to 10 (high).
    • mood_score – Self-rated mood on a scale from 2 (poor) to 10 (excellent).

    Inspiration

    This dataset was inspired by growing concerns about how screen time and social media affect mental health. It enables analysis of the links between digital habits, stress, sleep, and mood, encouraging data-driven solutions for healthier online behavior and emotional well-being.

    Ethically Mined Data

    This dataset has been ethically mined and synthetically generated without collecting any personally identifiable information. All values are artificial but statistically realistic, allowing safe use in academic, research, and public health projects while fully respecting user privacy and data ethics.
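    As a usage illustration (our sketch, not the author's code), the six columns above support a simple baseline that predicts stress_level from the remaining features; the CSV file name is an assumption about the Kaggle download.

```python
# Hedged baseline: predict stress_level from the other five columns.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("digital_habits_mental_health.csv")  # hypothetical file name

features = ["screen_time_hours", "social_media_platforms_used",
            "hours_on_TikTok", "sleep_hours", "mood_score"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["stress_level"], test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R^2: {r2_score(y_test, model.predict(X_test)):.3f}")
```
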
  4. Artificial Intelligence (AI) Training Dataset Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 30, 2025
    Cite
    Growth Market Reports (2025). Artificial Intelligence (AI) Training Dataset Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/artificial-intelligence-training-dataset-market-global-industry-analysis
    Explore at:
    pptx, csv, pdf (available download formats)
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Artificial Intelligence (AI) Training Dataset Market Outlook

    According to our latest research, the global Artificial Intelligence (AI) Training Dataset market size reached USD 3.15 billion in 2024, reflecting robust industry momentum. The market is expanding at a notable CAGR of 20.8% and is forecasted to attain USD 20.92 billion by 2033. This impressive growth is primarily attributed to the surging demand for high-quality, annotated datasets to fuel machine learning and deep learning models across diverse industry verticals. The proliferation of AI-driven applications, coupled with rapid advancements in data labeling technologies, is further accelerating the adoption and expansion of the AI training dataset market globally.

    One of the most significant growth factors propelling the AI training dataset market is the exponential rise in data-driven AI applications across industries such as healthcare, automotive, retail, and finance. As organizations increasingly rely on AI-powered solutions for automation, predictive analytics, and personalized customer experiences, the need for large, diverse, and accurately labeled datasets has become critical. Enhanced data annotation techniques, including manual, semi-automated, and fully automated methods, are enabling organizations to generate high-quality datasets at scale, which is essential for training sophisticated AI models. The integration of AI in edge devices, smart sensors, and IoT platforms is further amplifying the demand for specialized datasets tailored for unique use cases, thereby fueling market growth.

    Another key driver is the ongoing innovation in machine learning and deep learning algorithms, which require vast and varied training data to achieve optimal performance. The increasing complexity of AI models, especially in areas such as computer vision, natural language processing, and autonomous systems, necessitates the availability of comprehensive datasets that accurately represent real-world scenarios. Companies are investing heavily in data collection, annotation, and curation services to ensure their AI solutions can generalize effectively and deliver reliable outcomes. Additionally, the rise of synthetic data generation and data augmentation techniques is helping address challenges related to data scarcity, privacy, and bias, further supporting the expansion of the AI training dataset market.

    The market is also benefiting from the growing emphasis on ethical AI and regulatory compliance, particularly in data-sensitive sectors like healthcare, finance, and government. Organizations are prioritizing the use of high-quality, unbiased, and diverse datasets to mitigate algorithmic bias and ensure transparency in AI decision-making processes. This focus on responsible AI development is driving demand for curated datasets that adhere to strict quality and privacy standards. Moreover, the emergence of data marketplaces and collaborative data-sharing initiatives is making it easier for organizations to access and exchange valuable training data, fostering innovation and accelerating AI adoption across multiple domains.

    From a regional perspective, North America currently dominates the AI training dataset market, accounting for the largest revenue share in 2024, driven by significant investments in AI research, a mature technology ecosystem, and the presence of leading AI companies and data annotation service providers. Europe and Asia Pacific are also witnessing rapid growth, with increasing government support for AI initiatives, expanding digital infrastructure, and a rising number of AI startups. While North America sets the pace in terms of technological innovation, Asia Pacific is expected to exhibit the highest CAGR during the forecast period, fueled by the digital transformation of emerging economies and the proliferation of AI applications across various industry sectors.

    Data Type Analysis

    The AI training dataset market is segmented by data type into Text, Image/Video, Audio, and Others, each playing a crucial role in powering different AI applications. Text da

  5. Biased Cars Dataset

    • dataverse.harvard.edu
    • opendatalab.com
    Updated Jan 26, 2022
    Cite
    Spandan Madan; Timothy Henry; Jamell Dozier; Helen Ho; Nishchal Bhandari; Tomotake Sasaki; Fredo Durand; Hanspeter Pfister; Xavier Boix (2022). Biased Cars Dataset [Dataset]. http://doi.org/10.7910/DVN/F1NQ3R
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 26, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Spandan Madan; Timothy Henry; Jamell Dozier; Helen Ho; Nishchal Bhandari; Tomotake Sasaki; Fredo Durand; Hanspeter Pfister; Xavier Boix
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We introduce a challenging, photo-realistic dataset for analyzing out-of-distribution performance in computer vision: the Biased-Cars dataset. Our dataset features outdoor scene data with fine control over scene clutter (trees, street furniture, and pedestrians), car colors, object occlusions, diverse backgrounds (building/road textures), and lighting conditions (sky maps). Biased-Cars consists of 30K images of five different car models in different colors, seen from viewpoints varying between 0-90 degrees of azimuth and 0-50 degrees of zenith, across multiple scales. We provide labels for car model, color, viewpoint, and scale. We also provide semantic label maps for background categories including road, sky, pavement, pedestrians, trees, and buildings. Our dataset offers complete control over the joint distribution of categories, viewpoints, and other scene parameters, and the use of physically based rendering ensures photo-realism.
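    To make the controlled-bias idea concrete, here is a conceptual sketch (ours; the metadata frame and file names are hypothetical) of the kind of split the dataset enables: train on a biased joint distribution of (model, color) and test on held-out combinations.

```python
# Conceptual sketch of a biased train / OOD test split over the provided
# labels (car model, color); the metadata below is hypothetical.
import pandas as pd

meta = pd.DataFrame({
    "image": [f"img_{i:05d}.png" for i in range(8)],   # placeholder names
    "model": ["sedan", "sedan", "suv", "suv", "sedan", "suv", "sedan", "suv"],
    "color": ["red", "red", "blue", "blue", "blue", "red", "red", "blue"],
})

# In-distribution (model, color) pairs seen during training.
train_pairs = {("sedan", "red"), ("suv", "blue")}
in_dist = meta.apply(lambda r: (r["model"], r["color"]) in train_pairs, axis=1)

train_set, ood_test_set = meta[in_dist], meta[~in_dist]
print(len(train_set), "train images;", len(ood_test_set), "OOD test images")
```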

  6. Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Oct 26, 2023
    Cite
    Gunther Jikeli; Gunther Jikeli; Sameer Karali; Sameer Karali; Katharina Soemer; Katharina Soemer (2023). Hate Speech and Bias against Asians, Blacks, Jews, Latines, and Muslims: A Dataset for Machine Learning and Text Analytics [Dataset]. http://doi.org/10.5281/zenodo.8147308
    Explore at:
    csv (available download formats)
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gunther Jikeli; Gunther Jikeli; Sameer Karali; Sameer Karali; Katharina Soemer; Katharina Soemer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ### Institute for the Study of Contemporary Antisemitism (ISCA) at Indiana University Dataset on bias against Asians, Blacks, Jews, Latines, and Muslims

    The ISCA project compiled this dataset using an annotation portal, which was used to label tweets as either biased or non-biased, among other labels. Note that the annotation was done on live data, including images and context, such as threads. The original data comes from annotationportal.com. It includes representative samples of live tweets from the years 2020 and 2021 with the keywords "Asians," "Blacks," "Jews," "Latinos," and "Muslims."

    A random sample of 600 tweets per year was drawn for each of the keywords. This includes retweets. Due to a sampling error, the sample for the year 2021 for the keyword "Jews" has only 453 tweets from 2021 and 147 from the first eight months of 2022 and it includes some tweets from the query with the keyword "Israel." The tweets were divided into six samples of 100 tweets, which were then annotated by three to seven students in the class "Researching White Supremacism and Antisemitism on Social Media" taught by Gunther Jikeli, Elisha S. Breton, and Seth Moller at Indiana University in the fall of 2022, see this report. Annotators used a scale from 1 to 5 (confident not biased, probably not biased, don't know, probably biased, confident biased). The definitions of bias against each minority group used for annotation are also included in the report.

    If a tweet called out or denounced bias against the minority in question, it was labeled as "calling out bias."

    The labels of whether a tweet is biased or calls out bias are based on a 75% majority vote. We considered "probably biased" and "confident biased" as biased, and "confident not biased," "probably not biased," and "don't know" as not biased.
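    A minimal sketch of that aggregation rule (our reading of the description; the annotator scores below are invented for illustration):

```python
# Collapse the 1-5 scale to biased / not biased, then apply the 75% rule.
def aggregate(scores, threshold=0.75):
    # 4 ("probably biased") and 5 ("confident biased") count as biased;
    # 1, 2, and 3 ("don't know") count as not biased.
    biased_votes = sum(1 for s in scores if s >= 4)
    return 1 if biased_votes / len(scores) >= threshold else 0

print(aggregate([5, 4, 4, 2]))  # 3/4 = 75% of votes biased -> 1
print(aggregate([4, 2, 1]))     # 1/3 of votes biased       -> 0
```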

    The types of stereotypes vary widely across the different categories of prejudice. While about a third of all biased tweets were classified as "hate" against the minority, the stereotypes in the tweets often matched common stereotypes about the minority. Asians were blamed for the Covid pandemic. Blacks were seen as inferior and associated with crime. Jews were seen as powerful and held collectively responsible for the actions of the State of Israel. Some tweets denied the Holocaust. Hispanics/Latines were portrayed as being in the country illegally and as "invaders," in addition to stereotypical accusations of being lazy, stupid, or having too many children. Muslims, on the other hand, were often collectively blamed for terrorism and violence, though often in conversations about Muslims in India.

    # Content:

    This dataset contains 5880 tweets that cover a wide range of topics common in conversations about Asians, Blacks, Jews, Latines, and Muslims. 357 tweets (6.1 %) are labeled as biased and 5523 (93.9 %) are labeled as not biased. 1365 tweets (23.2 %) are labeled as calling out or denouncing bias.

    1180 out of 5880 tweets (20.1 %) contain the keyword "Asians," 590 were posted in 2020 and 590 in 2021. 39 tweets (3.3 %) are biased against Asian people. 370 tweets (31.4 %) call out bias against Asians.

    1160 out of 5880 tweets (19.7%) contain the keyword "Blacks," 578 were posted in 2020 and 582 in 2021. 101 tweets (8.7 %) are biased against Black people. 334 tweets (28.8 %) call out bias against Blacks.

    1189 out of 5880 tweets (20.2 %) contain the keyword "Jews," 592 were posted in 2020, 451 in 2021, and, as mentioned above, 146 tweets are from 2022. 83 tweets (7 %) are biased against Jewish people. 220 tweets (18.5 %) call out bias against Jews.

    1169 out of 5880 tweets (19.9 %) contain the keyword "Latinos," 584 were posted in 2020 and 585 in 2021. 29 tweets (2.5 %) are biased against Latines. 181 tweets (15.5 %) call out bias against Latines.

    1182 out of 5880 tweets (20.1 %) contain the keyword "Muslims," 593 were posted in 2020 and 589 in 2021. 105 tweets (8.9 %) are biased against Muslims. 260 tweets (22 %) call out bias against Muslims.

    # File Description:

    The dataset is provided in a csv file format, with each row representing a single message, including replies, quotes, and retweets. The file contains the following columns:


    'TweetID': Represents the tweet ID.

    'Username': Represents the username that published the tweet (if it is a retweet, it will be the user who retweeted the original tweet).

    'Text': Represents the full text of the tweet (not pre-processed).

    'CreateDate': Represents the date the tweet was created.

    'Biased': Represents the label assigned by our annotators: biased (1) or not biased (0).

    'Calling_Out': Represents the label assigned by our annotators: calling out bias against minority groups (1) or not (0).

    'Keyword': Represents the keyword that was used in the query. The keyword can be in the text, including mentioned names, or the username.
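    A minimal loading sketch for the columns above (the CSV file name is an assumption; the expected counts come from the Content section):

```python
# Hedged sketch: read the CSV and reproduce the headline counts.
import pandas as pd

df = pd.read_csv("isca_bias_tweets.csv")  # hypothetical file name

print(len(df), "tweets")                      # expected: 5880
print(int(df["Biased"].sum()), "biased")      # expected: 357
print(int(df["Calling_Out"].sum()), "calling out bias")
print(df.groupby("Keyword")["Biased"].mean()) # bias rate per keyword
```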

    # Licences

    Data is published under the terms of the "Creative Commons Attribution 4.0 International" licence (https://creativecommons.org/licenses/by/4.0)

    # Acknowledgements

    We are grateful for the technical collaboration with Indiana University's Observatory on Social Media (OSoMe). We thank all class participants for the annotations and contributions, including Kate Baba, Eleni Ballis, Garrett Banuelos, Savannah Benjamin, Luke Bianco, Zoe Bogan, Elisha S. Breton, Aidan Calderaro, Anaye Caldron, Olivia Cozzi, Daj Crisler, Jenna Eidson, Ella Fanning, Victoria Ford, Jess Gruettner, Ronan Hancock, Isabel Hawes, Brennan Hensler, Kyra Horton, Maxwell Idczak, Sanjana Iyer, Jacob Joffe, Katie Johnson, Allison Jones, Kassidy Keltner, Sophia Knoll, Jillian Kolesky, Emily Lowrey, Rachael Morara, Benjamin Nadolne, Rachel Neglia, Seungmin Oh, Kirsten Pecsenye, Sophia Perkovich, Joey Philpott, Katelin Ray, Kaleb Samuels, Chloe Sherman, Rachel Weber, Molly Winkeljohn, Ally Wolfgang, Rowan Wolke, Michael Wong, Jane Woods, Kaleb Woodworth, and Aurora Young.

    This work used Jetstream2 at Indiana University through allocation HUM200003 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

  7. Camelyon+

    • scidb.cn
    Updated Nov 8, 2024
    Cite
    Ling Xitong; Lei Yuanyuan; Li Jiawen; Cheng Junru; Huang Wenting; Guan Tian; Guan Jian; He Yonghong (2024). Camelyon+ [Dataset]. http://doi.org/10.57760/sciencedb.16442
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Ling Xitong; Lei Yuanyuan; Li Jiawen; Cheng Junru; Huang Wenting; Guan Tian; Guan Jian; He Yonghong
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Camelyon+ dataset is accessible through ScienceDB. The original WSI data is available from the official Camelyon-16 and Camelyon-17 websites, so it has not been uploaded to the database. Slide-level labels are included in XLSX files. We provide corrected versions of the Camelyon-16 and Camelyon-17 datasets, as well as a combined Camelyon+ version with four classification labels (negative, micro, macro, ITC) and two classification labels (negative, tumor) to support different downstream tasks.

    To ensure unbiased data correction by pathologists, the original training slides from Camelyon-16, originally named "tumor" or "normal" plus an ID, have been renamed. The mapping to the original naming is recorded and shared in an XLSX file. For positive WSIs, pixel-level annotations are provided in XML format.

    To enable future comparative experiments with various feature extractors on the Camelyon+ dataset, feature files extracted at 20X magnification using ResNet-50, ViT-S, PLIP, CONCH, UNI, and Gigapath are also available. These feature files are provided in PT format for easy use.
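    A small sketch of consuming the pre-extracted features (the path and per-slide tensor layout are assumptions; a common convention in WSI pipelines is one [num_patches, feature_dim] tensor per slide):

```python
# Hedged sketch: load one pre-extracted feature file (.pt format).
import torch

features = torch.load("features/resnet50/slide_0001.pt", map_location="cpu")
print(type(features))
if torch.is_tensor(features):
    # e.g., (num_patches, feature_dim); dim depends on the extractor
    print(features.shape)
```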

  8. The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1)

    • zenodo.org
    pdf, zip
    Updated Jul 7, 2024
    Cite
    Mehdi Asadi; Mehdi Asadi; Jani Auranen; Jani Auranen (2024). The Turku UAS DeepSeaSalama - GAN dataset 1 (TDSS-G1) [Dataset]. http://doi.org/10.5281/zenodo.10714823
    Explore at:
    zip, pdf (available download formats)
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mehdi Asadi; Mehdi Asadi; Jani Auranen; Jani Auranen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 2024
    Area covered
    Turku
    Description

    The Turku UAS DeepSeaSalama-GAN dataset 1 (TDSS-G1) is a comprehensive image dataset obtained from a maritime environment. This dataset was assembled in the southwest Finnish archipelago area at Taalintehdas, using two stationary RGB fisheye cameras in the month of August 2022. The technical setup is described in the section “Sensor Platform design” in report “Development of Applied Research Platforms for Autonomous and Remotely Operated Systems” (https://www.theseus.fi/handle/10024/815628).

    The data collection and annotation process was carried out in the Autonomous and Intelligent Systems laboratory at Turku University of Applied Sciences. The dataset is a blend of original images captured by our cameras and synthetic data generated by a Generative Adversarial Network (GAN), simulating 18 distinct weather conditions.

    The TDSS-G1 dataset comprises 199 original images and a substantial addition of 3582 synthetic images, culminating in a total of 3781 annotated images. These images provide a diverse representation of various maritime objects, including motorboats, sailing boats, and seamarks.

    The creation of TDSS-G1 involved extracting images from videos recorded in MPEG format, with a resolution of 720p at 30 frames per second (FPS). An image was extracted every 100 milliseconds.

    The distribution of labels within TDSS-G1 is as follows: motorboats (62.1%), sailing boats (16.8%), and seamarks (21.1%).

    This distribution highlights a class imbalance, with motorboats being the most represented class and sailing boats the least. This imbalance is an important factor to consider during model training, as it could influence the model's ability to accurately recognize underrepresented classes. In future synthetic datasets, vision transformers will be used to tackle this problem.

    The TDSS-G1 dataset is organized into three distinct subsets for the purpose of training and evaluating machine learning models. These subsets are as follows:

    • Training Set: Located in dataset/train/images, this set is used to train the model. It learns to recognize the different classes of maritime objects from this data.
    • Validation Set: Stored in dataset/valid/images, this set is used to tune the model parameters and to prevent overfitting during the training process.
    • Test Set: Found in dataset/test/images, this set is used to evaluate the final performance of the model. It provides an unbiased assessment of how the model will perform on unseen data.

    The dataset comprises three classes (nc: 3), each representing a different type of maritime object. The classes are as follows:

    1. Motor Boat (motor_boat)
    2. Sailing Boat (sailing_boat)
    3. Seamark (seamark)

    These labels correspond to the annotated objects in the images. The model trained on this dataset will be capable of identifying these three types of maritime objects. As mentioned earlier, the distribution of these classes is imbalanced, which is an important factor to consider during the training process.
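    To make the imbalance actionable, here is a small sketch (ours, not the dataset authors') that converts the reported label shares into inverse-frequency class weights a training loss could use:

```python
# Inverse-frequency class weights from the reported label distribution.
shares = {"motor_boat": 0.621, "sailing_boat": 0.168, "seamark": 0.211}

raw = {cls: 1.0 / share for cls, share in shares.items()}
mean_raw = sum(raw.values()) / len(raw)
weights = {cls: w / mean_raw for cls, w in raw.items()}  # mean weight = 1

for cls, w in weights.items():
    print(f"{cls}: {w:.2f}")  # sailing_boat gets the largest weight
```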

  9. youtubecommentsdataset

    • kaggle.com
    Updated Jun 7, 2025
    Cite
    Afryan Fernando (2025). youtubecommentsdataset [Dataset]. https://www.kaggle.com/datasets/afryanfernando/youtubecommentsdataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Afryan Fernando
    Description

    This dataset comprises user comments collected from YouTube videos discussing Prabowo Subianto’s speech in relation to former U.S. President Donald Trump’s tariff policies. The data is organized into three separate Excel files, each representing a different sentiment distribution:

    1. Balanced Dataset: Contains an equal number of comments across all three sentiment classes — positive, negative, and neutral — to support unbiased model training and evaluation.

    2. Unbalanced Dataset: Reflects the natural distribution of sentiments as observed in the raw data, providing a realistic scenario for real-world sentiment analysis.

    3. Neutral-Inclusive Dataset: A version of the dataset that includes comments labeled as neutral, in addition to positive and negative sentiments, offering a more comprehensive view of public opinion.

    This dataset is suitable for sentiment classification tasks, public opinion mining, and research in political discourse analysis, particularly in the context of sentiment analysis.
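    A minimal sketch of comparing the three variants (our illustration; the file names and the sentiment column name are assumptions about the Excel files described above):

```python
# Hedged sketch: load the three Excel variants and compare distributions.
import pandas as pd

files = {
    "balanced": "balanced_dataset.xlsx",            # hypothetical names
    "unbalanced": "unbalanced_dataset.xlsx",
    "neutral-inclusive": "neutral_inclusive_dataset.xlsx",
}

for name, path in files.items():
    df = pd.read_excel(path)
    # Column name "sentiment" is an assumption; check the actual files.
    print(name, df["sentiment"].value_counts().to_dict())
```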

  10. Data from: Averaging Strategy for Interpretable Machine Learning on Small Datasets to Understand Element Uptake after Seed Nanotreatment

    • acs.figshare.com
    bin
    Updated Aug 18, 2023
    Cite
    Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng (2023). Averaging Strategy for Interpretable Machine Learning on Small Datasets to Understand Element Uptake after Seed Nanotreatment [Dataset]. http://doi.org/10.1021/acs.est.3c01878.s002
    Explore at:
    bin (available download formats)
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hengjie Yu; Shiyu Tang; Sam Fong Yau Li; Fang Cheng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Understanding plant uptake and translocation of nanomaterials is crucial for ensuring the successful and sustainable applications of seed nanotreatment. Here, we collect a dataset with 280 instances from experiments for predicting the relative metal/metalloid concentration (RMC) in maize seedlings after seed priming by various metal and metalloid oxide nanoparticles. To obtain unbiased predictions and explanations on small datasets, we present an averaging strategy and add a dimension for interpretable machine learning. The findings in post-hoc interpretations of sophisticated LightGBM models demonstrate that solubility is highly correlated with model performance. Surface area, concentration, zeta potential, and hydrodynamic diameter of nanoparticles and seedling part and relative weight of plants are dominant factors affecting RMC, and their effects and interactions are explained. Furthermore, self-interpretable models using the RuleFit algorithm are established to successfully predict RMC only based on six important features identified by post-hoc explanations. We then develop a visualization tool called RuleGrid to depict feature effects and interactions in numerous generated rules. Consistent parameter-RMC relationships are obtained by different methods. This study offers a promising interpretable data-driven approach to expand the knowledge of nanoparticle fate in plants and may profoundly contribute to the safety-by-design of nanomaterials in agricultural and environmental applications.
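    In the spirit of the averaging strategy described above, here is an illustrative sketch (not the authors' pipeline) that averages LightGBM feature importances over repeated train/test splits on synthetic stand-in data:

```python
# Illustrative averaging of LightGBM feature importances over splits.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(280, 6))        # 280 instances, as in the dataset
y = 0.8 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=280)

importances = []
for seed in range(20):               # average over repeated random splits
    X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = lgb.LGBMRegressor(n_estimators=200, random_state=seed)
    model.fit(X_tr, y_tr)
    importances.append(model.feature_importances_)

print(np.mean(importances, axis=0))  # averaged importance per feature
```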

  11. Summary of the negative datasets generated by different methods.

    • plos.figshare.com
    xls
    Updated Sep 6, 2024
    Cite
    Efrat Cohen-Davidi; Isana Veksler-Lublinsky (2024). Summary of the negative datasets generated by different methods. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012385.t003
    Explore at:
    xls (available download formats)
    Dataset updated
    Sep 6, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Efrat Cohen-Davidi; Isana Veksler-Lublinsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The method numbers correspond to method numbers in Fig 1. FPD denotes the full-positive-dataset.

