100+ datasets found
  1. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Available download formats: .json, .csv, .xlsx
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  2. Video Dataset of Construction Site for training AI/ML Models

    • data.macgence.com
    mp3
    Updated May 26, 2024
    Cite
    Macgence (2024). Video Dataset of Construction Site for training AI/ML Models [Dataset]. https://data.macgence.com/dataset/video-dataset-of-construction-site-for-training-ai-ml-models
    Available download formats: mp3
    Dataset updated
    May 26, 2024
    Dataset authored and provided by
    Macgence
    License

    https://data.macgence.com/terms-and-conditions

    Time period covered
    2025
    Area covered
    Worldwide
    Variables measured
    Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
    Description

    High-quality video dataset of construction sites, ideal for training AI/ML models in detection, classification, and activity recognition tasks.

  3. Labeled Image Datasets for AI & Computer Vision

    • images.cv
    Updated Apr 26, 2024
    Cite
    Images.cv (2024). Labeled Image Datasets for AI & Computer Vision [Dataset]. https://images.cv/
    Dataset updated
    Apr 26, 2024
    Dataset provided by
    Images.cv
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Explore and download labeled image datasets for AI, ML, and computer vision. Find datasets for object detection, image classification, and image segmentation.

  4. Machine Learning model data

    • ecmwf.int
    Updated Jan 1, 2023
    Cite
    European Centre for Medium-Range Weather Forecasts (2023). Machine Learning model data [Dataset]. https://www.ecmwf.int/en/forecasts/dataset/machine-learning-model-data
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    three of these models are available:

  5. Relevant Image Dataset

    • data.mendeley.com
    Updated Dec 22, 2020
    Cite
    Hayri Volkan Agun (2020). Relevant Image Dataset [Dataset]. http://doi.org/10.17632/mbk294tthf.1
    Dataset updated
    Dec 22, 2020
    Authors
    Hayri Volkan Agun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains relevant and irrelevant image tags from Web pages of 125 different domains. It records the web domain, file number, the text of the image HTML element, attributes of image elements, the size attributes, the parent HTML element of the image, and the relevancy of the image. Each Web domain contains 100 Web pages with a varying number of image elements.

  6. MIProblems: A repository of multiple instance learning datasets

    • figshare.com
    zip
    Updated Jun 21, 2018
    Cite
    Veronika Cheplygina (2018). MIProblems: A repository of multiple instance learning datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6633983.v1
    Available download formats: zip
    Dataset updated
    Jun 21, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Veronika Cheplygina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the multiple instance learning datasets previously stored at miproblems.org. As I am no longer maintaining the website, I moved the datasets to Figshare. A detailed description of the files is found in readme.pdf.

    If you use these datasets, please cite this Figshare resource rather than linking to miproblems.org, which will be offline soon.

  7. Anomaly detection dataset

    • ieee-dataport.org
    Updated Nov 14, 2020
    Cite
    Prarthi Jain (2020). Anomaly detection dataset [Dataset]. https://ieee-dataport.org/open-access/anomaly-detection-dataset
    Dataset updated
    Nov 14, 2020
    Authors
    Prarthi Jain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please refer to each dataset's website for further information.

  8. Machine Learning Frameworks for Fake News Detection and Datasets

    • dataverse.nl
    rar, text/markdown
    Updated Oct 30, 2024
    Cite
    Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang (2024). Machine Learning Frameworks for Fake News Detection and Datasets [Dataset]. http://doi.org/10.34894/CUCITF
    Available download formats: rar(133821784), text/markdown(6091)
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    DataverseNL
    Authors
    Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A web framework designed for researchers to perform comparative analysis of various machine learning algorithms in the context of fake news detection. The folder also includes several datasets for experimentation, alongside the source code.

    The rise of social media has transformed the landscape of news dissemination, presenting new challenges in combating the spread of fake news. This study addresses the automated detection of misinformation within written content, a task that has prompted extensive research efforts across various methodologies. We evaluate existing benchmarks, introduce a novel hybrid word embedding model, and implement a web framework for text classification. Our approach integrates traditional term frequency–inverse document frequency (TF–IDF) methods with sophisticated feature extraction techniques, considering linguistic, psychological, morphological, and grammatical aspects of the text. Through a series of experiments on diverse datasets, applying transfer and incremental learning techniques, we demonstrate the effectiveness of our hybrid model in surpassing benchmarks and outperforming alternative experimental setups. Furthermore, our findings emphasize the importance of dataset alignment and balance in transfer learning, as well as the utility of incremental learning in maintaining high detection performance while reducing runtime. This research offers promising avenues for further advancements in fake news detection methodologies, with implications for future research and development in this critical domain.
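
    As a rough illustration of the TF–IDF weighting mentioned in the description, the sketch below computes plain TF–IDF scores for tokenized documents. It is not the authors' hybrid embedding model; the function name `tf_idf`, the unsmoothed `log(N / df)` form, and the toy documents are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {term: (count / len(doc)) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights


# Hypothetical toy corpus, for illustration only.
docs = [["fake", "news", "spreads"], ["real", "news", "reports"]]
scores = tf_idf(docs)
```

    A term that occurs in every document (here "news") gets weight zero, while terms unique to one document receive positive weight, which is why TF–IDF is often used as a baseline text feature.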

  9. 80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...

    • data.dataseeds.ai
    Cite
    Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://data.dataseeds.ai/products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
    Dataset authored and provided by
    Data Seeds
    Area covered
    Türkiye, Isle of Man, Bosnia and Herzegovina, Jamaica, Saint Kitts and Nevis, Somalia, Mauritania, Costa Rica, Austria, Nauru
    Description

    A dataset of 80K+ construction site images sourced globally, featuring full EXIF data, including camera settings and photography details. Enriched with object and scene detection metadata, the dataset is ideal for AI model training in image recognition, classification, and segmentation.

  10. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Available download formats: .json, .csv, .xls
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Equatorial Guinea, Russian Federation, Belize, Antigua and Barbuda, Saudi Arabia, Iceland, Colombia, Qatar, Benin, Djibouti
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensures that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  11. Machine Learning Materials Datasets

    • figshare.com
    txt
    Updated Sep 11, 2018
    Cite
    Dane Morgan (2018). Machine Learning Materials Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7017254.v5
    Available download formats: txt
    Dataset updated
    Sep 11, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Dane Morgan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Three datasets are intended to be used for exploring machine learning applications in materials science. They are formatted in simple form and in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML). Each dataset is a materials property of interest and associated descriptors. For detailed information, please see the attached README text file.

    The first dataset for dilute solute diffusion can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset has been calculated with DFT methods.

    The second dataset for perovskite stability gives energies of compositions of potential perovskite materials relative to the convex hull calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure in order to perform more advanced grouping of the data.

    The third dataset is a metallic glasses dataset which has values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column is included for majority element for each alloy, which can be an interesting property to group on during tests.

  12. 80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...

    • datarade.ai
    Updated Nov 26, 2018
    Cite
    Data Seeds (2018). 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Nov 26, 2018
    Dataset authored and provided by
    Data Seeds
    Area covered
    Swaziland, Guatemala, Grenada, Russian Federation, United Arab Emirates, Senegal, Tunisia, Peru, Venezuela (Bolivarian Republic of), Kenya
    Description

    This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

    Key Features:

    1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context, making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

    2. Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.

    3. Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

    4. High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.

    5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest, which is useful for safety-focused or user-facing AI models.

    6. AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.

    7. Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.

    Use Cases: 1. Training AI for safety compliance monitoring and PPE detection. 2. Powering progress tracking and material usage analysis tools. 3. Supporting site mapping, autonomous machinery, and smart construction platforms. 4. Enhancing augmented reality overlays and digital twin models for construction planning.

    This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!

  13. Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datadryad.org
    • +1more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods

    Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.

  14. ML-CB Dataset

    • paperswithcode.com
    Updated Apr 17, 2021
    + more versions
    Cite
    Nathan Reitinger; Michelle L. Mazurek (2021). ML-CB Dataset [Dataset]. https://paperswithcode.com/dataset/ml-cb
    Dataset updated
    Apr 17, 2021
    Authors
    Nathan Reitinger; Michelle L. Mazurek
    Description

    In this paper, we develop a new privacy enhancing tool: ML-CB—a means of using distinguishable pictorial information combined with underlying website source code to produce accurate and robust machine learning classifiers able to discern fingerprinting (i.e., surreptitious tracking) from non-fingerprinting canvas-based actions.

    The data introduced in the paper is collected by scraping roughly half a million websites using a custom Google Chrome extension storing information related to the canvas.

  15. Data from: Disease Prediction Dataset

    • ieee-dataport.org
    Updated Feb 20, 2025
    Cite
    Ayush Nautiyal (2025). Disease Prediction Dataset [Dataset]. https://ieee-dataport.org/documents/disease-prediction-dataset
    Dataset updated
    Feb 20, 2025
    Authors
    Ayush Nautiyal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains symptoms and disease information. It contains a total of 1325 symptoms covering 391 diseases. The dataset is referenced from the MedLinePlus website. It includes training and testing sets and can be used to train disease prediction algorithms. It was created independently for a disease prediction project and does not involve any funding or promotional terms.

  16. Synset Boulevard: Synthetic image dataset for Vehicle Make and Model...

    • data.europa.eu
    binary data
    Updated Aug 8, 2024
    + more versions
    Cite
    Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. (2024). Synset Boulevard: Synthetic image dataset for Vehicle Make and Model Recognition (VMMR) [Dataset]. https://data.europa.eu/data/datasets/725679870677258240?locale=en
    Available download formats: binary data
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V.
    License

    http://dcat-ap.de/def/licenses/cc-by

    Description

    The Synset Boulevard dataset contains a total of 259,200 synthetically generated images of cars from a frontal traffic camera perspective, annotated by vehicle makes, models and years of construction for machine learning methods (ML) in the scope (task) of vehicle make and model recognition (VMMR).

    The dataset contains 162 vehicle models from 43 brands with 200 images each, as well as 8 sub-datasets for investigating different imaging qualities. In addition to the classification annotations, the dataset also contains label images for semantic segmentation, information on image and scene properties, and vehicle color.
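
    The stated total of 259,200 images is consistent with the counts above if each of the 8 sub-datasets holds 200 images for each of the 162 models. That reading is an assumption, not something the description spells out; the short check below (with the hypothetical helper `total_images`) just confirms the arithmetic.

```python
# Counts quoted in the dataset description.
VEHICLE_MODELS = 162
IMAGES_PER_MODEL = 200
SUB_DATASETS = 8
STATED_TOTAL = 259_200


def total_images(models: int, images_per_model: int, sub_datasets: int) -> int:
    """Image count, assuming every sub-dataset has a full set of images per model."""
    return models * images_per_model * sub_datasets


computed = total_images(VEHICLE_MODELS, IMAGES_PER_MODEL, SUB_DATASETS)
print(computed == STATED_TOTAL)
```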

    The dataset was presented in May 2024 by Anne Sielemann, Stefan Wolf, Masoud Roschani, Jens Ziehn and Jürgen Beyerer in the publication: Sielemann, A., Wolf, S., Roschani, M., Ziehn, J. and Beyerer, J. (2024). Synset Boulevard: A Synthetic Image Dataset for VMMR. In 2024 IEEE International Conference on Robotics and Automation (ICRA).

    The model information is based on information from the ADAC online database (www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle).

    The data was generated using the simulation environment OCTANE (www.octane.org), which uses the Cycles ray tracer of the Blender project.

    The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.

    The data set was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, as well as funded by the "Invest BW" funding program of the Ministry of Economic Affairs, Labour and Tourism as part of the "FeinSyn" research project.

  17. Phishing Websites Dataset

    • kaggle.com
    zip
    Updated Mar 23, 2024
    + more versions
    Cite
    Arnav Samal (2024). Phishing Websites Dataset [Dataset]. https://www.kaggle.com/datasets/arnavs19/phishing-websites-dataset
    Available download formats: zip(0 bytes)
    Dataset updated
    Mar 23, 2024
    Authors
    Arnav Samal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data consist of a collection of legitimate as well as phishing website instances. Each website is represented by a set of features that indicate whether the website is legitimate or not. The data can serve as input for a machine learning process.

    Here, the two variants of the Phishing Dataset are presented.

    1. Full variant - dataset_full.csv

      • Total number of instances: 88,647
      • Number of legitimate website instances (labeled as 0): 58,000
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
    2. Small variant - dataset_small.csv

      • Total number of instances: 58,645
      • Number of legitimate website instances (labeled as 0): 27,998
      • Number of phishing website instances (labeled as 1): 30,647
      • Total number of features: 111
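
    The class counts listed for the two variants can be cross-checked, and the class balance computed, with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper name `phishing_share` is illustrative and nothing here reads the actual CSV files.

```python
# Instance counts quoted in the listing: (legitimate, phishing, stated total).
VARIANTS = {
    "dataset_full.csv": (58_000, 30_647, 88_647),
    "dataset_small.csv": (27_998, 30_647, 58_645),
}


def phishing_share(legitimate: int, phishing: int) -> float:
    """Fraction of instances labeled 1 (phishing)."""
    return phishing / (legitimate + phishing)


for name, (legit, phish, total) in VARIANTS.items():
    # The per-class counts should add up to the stated total.
    assert legit + phish == total
    print(f"{name}: {phishing_share(legit, phish):.1%} phishing")
```

    Under these counts the full variant is roughly one-third phishing, while the small variant is close to balanced, which may matter when choosing evaluation metrics.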
  18. A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...

    • zenodo.org
    • data.niaid.nih.gov
    • +2more
    csv
    Updated Jul 20, 2024
    + more versions
    Cite
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
    Available download formats: csv
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 15, 2024
    Area covered
    YouTube
    Description

    Please cite the following paper when using this dataset:

    N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

    Abstract

    This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
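
    To make the three-way sentiment labeling concrete, the sketch below maps a VADER-style compound score in [-1, 1] to positive, negative, or neutral. The ±0.05 cut-offs are the thresholds commonly suggested in VADER's own documentation; whether the dataset authors used exactly these values is an assumption, and the function name `sentiment_class` is hypothetical.

```python
def sentiment_class(compound: float, threshold: float = 0.05) -> str:
    """Map a compound sentiment score in [-1, 1] to a three-way class.

    The default threshold is an assumed convention, not taken from the dataset.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores, for illustration only.
labels = [sentiment_class(score) for score in (0.62, -0.40, 0.01)]
print(labels)
```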

  19. MORART-3K: Moroccan Arts and Handicrafts Dataset for Computer Vision Tasks

    • zenodo.org
    Updated Feb 14, 2025
    + more versions
    Cite
    HASSAN ZEKKOURI (2025). MORART-3K: Moroccan Arts and Handicrafts Dataset for Computer Vision Tasks [Dataset]. http://doi.org/10.5281/zenodo.14862418
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    HASSAN ZEKKOURI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 13, 2025
    Description

    The Moroccan Arts and Crafts Dataset comprises a compilation of images that exhibit typical Moroccan items categorized into 26 classes. This dataset has been meticulously crafted to facilitate Content-Based Image Retrieval (CBIR), classification, and the preservation of cultural heritage.

    The images and videos in this dataset were obtained from various places in Morocco, emphasizing significant cultural and artistic centers. The dataset comprises artifacts from workshops, museums, local markets, and historical places, guaranteeing a varied depiction of Moroccan workmanship. The pictures were captured under diverse scenarios, utilizing multiple lighting settings, backgrounds, and angles to improve resilience in practical applications.

    The dataset is intended for image retrieval, object classification, and cultural study. It adheres to open-access protocols and is organized for easy integration with computer vision algorithms.

    For more information, visit the MORART-3K dataset website.

  20. Dataset used for detecting DNS over HTTPS by Machine Learning.

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 28, 2020
    Vekshin, Dmitrii (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818004
    Dataset updated
    Oct 28, 2020
    Dataset provided by
    Vekshin, Dmitrii
    Cejka, Tomas
    Hynek, Karel
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset consists of three different data sources:

    • DoH-enabled Firefox
    • DoH-enabled Google Chrome
    • Cloudflared DoH proxy

    The web browser data was captured using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla Firefox and 1,000 pages visited by Chrome.

    The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

    The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV file with flow data, and a CSV file with extracted features.

    The CSV with extracted features has the following data fields:

    • Label (1 - DoH, 0 - regular HTTPS)
    • Data source
    • Duration
    • Minimal Inter-Packet Delay
    • Maximal Inter-Packet Delay
    • Average Inter-Packet Delay
    • Variance of Incoming Packet Sizes
    • Variance of Outgoing Packet Sizes
    • Ratio of the number of incoming and outgoing bytes
    • Ratio of the number of incoming and outgoing packets
    • Average of Incoming Packet sizes
    • Average of Outgoing Packet sizes
    • The median value of Incoming Packet sizes
    • The median value of outgoing Packet sizes
    • The ratio of bursts and pauses
    • Number of bursts
    • Number of pauses
    • Autocorrelation
    • Transmission symmetry in the 1st third of connection
    • Transmission symmetry in the 2nd third of connection
    • Transmission symmetry in the last third of connection
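To make the simpler fields in this list concrete, here is a minimal sketch that computes the duration, inter-packet delay, packet-size, and ratio features for a single flow. The input format (a list of `(timestamp, size, direction)` tuples), the function name `flow_features`, the dictionary keys, and the choice of population variance are all assumptions made for illustration; the dataset description does not give the exact definitions the authors used, and the burst, autocorrelation, and symmetry features are left out.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute basic per-flow features.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    where direction is "in" or "out" (hypothetical input format).
    Requires at least two packets and at least one packet per direction.
    """
    packets = sorted(packets, key=lambda p: p[0])
    times = [t for t, _, _ in packets]
    # Inter-packet delays between consecutive packets, regardless of direction.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, d in packets if d == "in"]
    outgoing = [size for _, size, d in packets if d == "out"]
    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; sample variance is equally plausible.
        "var_in": pvariance(incoming),
        "var_out": pvariance(outgoing),
        "bytes_ratio": sum(incoming) / sum(outgoing),
        "packets_ratio": len(incoming) / len(outgoing),
        "avg_in": mean(incoming),
        "avg_out": mean(outgoing),
        "median_in": median(incoming),
        "median_out": median(outgoing),
    }


# Example with four made-up packets.
example = [(0.0, 100, "out"), (0.5, 300, "in"), (1.5, 500, "in"), (2.0, 100, "out")]
print(flow_features(example))
```

For real use, the values in the provided CSV should be preferred over recomputed ones, since the original extraction may define these quantities differently.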

    The observed network traffic does not contain privacy-sensitive information.

    The zip file structure is:

    |-- data
    |   |-- extracted-features ... extracted features used in ML for DoH recognition
    |   |   |-- chrome
    |   |   |-- cloudflared
    |   |   `-- firefox
    |   |-- flows ... exported flow data
    |   |   |-- chrome
    |   |   |-- cloudflared
    |   |   `-- firefox
    |   `-- pcaps ... raw PCAP data
    |       |-- chrome
    |       |-- cloudflared
    |       `-- firefox
    |-- LICENSE
    `-- README.md

    When using this dataset, please cite the original work as follows:

    @inproceedings{vekshin2020,
      author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
      title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
      year = {2020},
      isbn = {9781450388337},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3407023.3409192},
      doi = {10.1145/3407023.3409192},
      booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
      articleno = {87},
      numpages = {8},
      keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
      location = {Virtual Event, Ireland},
      series = {ARES '20}
    }

