100+ datasets found

Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
Available download formats: .json, .csv, .xlsx
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Data (https://brightdata.com/)
License
https://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Video Dataset of Construction Site for training AI/ML Models
data.macgence.com
mp3
Updated May 26, 2024
Cite
Macgence (2024). Video Dataset of Construction Site for training AI/ML Models [Dataset]. https://data.macgence.com/dataset/video-dataset-of-construction-site-for-training-ai-ml-models
Explore at:
Available download formats: mp3
Dataset updated
May 26, 2024
Dataset authored and provided by
Macgence
License
https://data.macgence.com/terms-and-conditions
Time period covered
2025
Area covered
Worldwide
Variables measured
Outcome, Call Type, Transcriptions, Audio Recordings, Speaker Metadata, Conversation Topics
Description
High-quality video dataset of construction sites, ideal for training AI/ML models in detection, classification, and activity recognition tasks.
Labeled Image Datasets for AI & Computer Vision
images.cv
Updated Apr 26, 2024
Cite
Images.cv (2024). Labeled Image Datasets for AI & Computer Vision [Dataset]. https://images.cv/
Explore at:
Dataset updated
Apr 26, 2024
Dataset provided by
Images.cv
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Explore and download labeled image datasets for AI, ML, and computer vision. Find datasets for object detection, image classification, and image segmentation.
Machine Learning model data
ecmwf.int
Updated Jan 1, 2023
Cite
European Centre for Medium-Range Weather Forecasts (2023). Machine Learning model data [Dataset]. https://www.ecmwf.int/en/forecasts/dataset/machine-learning-model-data
Explore at:
Dataset updated
Jan 1, 2023
Dataset authored and provided by
European Centre for Medium-Range Weather Forecasts (http://ecmwf.int/)
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
three of these models are available:
Relevant Image Dataset
data.mendeley.com
Updated Dec 22, 2020
Cite
Hayri Volkan Agun (2020). Relevant Image Dataset [Dataset]. http://doi.org/10.17632/mbk294tthf.1
Explore at:
Unique identifier
https://doi.org/10.17632/mbk294tthf.1
Dataset updated
Dec 22, 2020
Authors
Hayri Volkan Agun
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains relevant and irrelevant image tags from Web pages of 125 different domains. For each image, it records the web domain, file number, the text of the image HTML element, the attributes of the image element, the size attributes, the parent HTML element of the image, and the relevancy of the image. Each Web domain contains 100 Web pages with a varying number of image elements.
MIProblems: A repository of multiple instance learning datasets
figshare.com
zip
Updated Jun 21, 2018
Cite
Veronika Cheplygina (2018). MIProblems: A repository of multiple instance learning datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6633983.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.6633983.v1
Dataset updated
Jun 21, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Veronika Cheplygina
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the multiple instance learning datasets previously stored at miproblems.org. As I am no longer maintaining the website, I moved the datasets to Figshare. A detailed description of the files is found in readme.pdf.

If you use these datasets, please cite this Figshare resource rather than linking to miproblems.org, which will be offline soon.
Anomaly detection dataset
ieee-dataport.org
Updated Nov 14, 2020
Cite
Prarthi Jain (2020). Anomaly detection dataset [Dataset]. https://ieee-dataport.org/open-access/anomaly-detection-dataset
Explore at:
Dataset updated
Nov 14, 2020
Authors
Prarthi Jain
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Please refer to each dataset's website for further information.
Machine Learning Frameworks for Fake News Detection and Datasets
dataverse.nl
rar, text/markdown
Updated Oct 30, 2024
Cite
Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang (2024). Machine Learning Frameworks for Fake News Detection and Datasets [Dataset]. http://doi.org/10.34894/CUCITF
Explore at:
Available download formats: rar (133821784), text/markdown (6091)
Unique identifier
https://doi.org/10.34894/CUCITF
Dataset updated
Oct 30, 2024
Dataset provided by
DataverseNL
Authors
Fadi Mohsen; Bedir Chaushi; Hamed Abdelhaq; Kevin Wang
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A web framework designed for researchers to perform comparative analysis of various machine learning algorithms in the context of fake news detection. The folder also includes several datasets for experimentation, alongside the source code.

The rise of social media has transformed the landscape of news dissemination, presenting new challenges in combating the spread of fake news. This study addresses the automated detection of misinformation within written content, a task that has prompted extensive research efforts across various methodologies. We evaluate existing benchmarks, introduce a novel hybrid word embedding model, and implement a web framework for text classification. Our approach integrates traditional term frequency–inverse document frequency (TF–IDF) methods with sophisticated feature extraction techniques, considering linguistic, psychological, morphological, and grammatical aspects of the text. Through a series of experiments on diverse datasets, applying transfer and incremental learning techniques, we demonstrate the effectiveness of our hybrid model in surpassing benchmarks and outperforming alternative experimental setups. Furthermore, our findings emphasize the importance of dataset alignment and balance in transfer learning, as well as the utility of incremental learning in maintaining high detection performance while reducing runtime. This research offers promising avenues for further advancements in fake news detection methodologies, with implications for future research and development in this critical domain.
80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...
data.dataseeds.ai
Cite
Data Seeds, 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://data.dataseeds.ai/products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
Explore at:
Dataset authored and provided by
Data Seeds
Area covered
Türkiye, Isle of Man, Bosnia and Herzegovina, Jamaica, Saint Kitts and Nevis, Somalia, Mauritania, Costa Rica, Austria, Nauru
Description
A dataset of 80K+ construction site images sourced globally, featuring full EXIF data, including camera settings and photography details. Enriched with object and scene detection metadata, the dataset is ideal for AI model training in image recognition, classification, and segmentation.
TagX Data collection for AI/ ML training | LLM data | Data collection for AI...
datarade.ai
.json, .csv, .xls
Updated Jun 18, 2021
Cite
TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
Explore at:
Available download formats: .json, .csv, .xls
Dataset updated
Jun 18, 2021
Dataset authored and provided by
TagX
Area covered
Equatorial Guinea, Russian Federation, Belize, Antigua and Barbuda, Saudi Arabia, Iceland, Colombia, Qatar, Benin, Djibouti
Description
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

Our team of experienced data collectors and quality assurance professionals ensures that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
Machine Learning Materials Datasets
figshare.com
txt
Updated Sep 11, 2018
Cite
Dane Morgan (2018). Machine Learning Materials Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.7017254.v5
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.7017254.v5
Dataset updated
Sep 11, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Dane Morgan
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Three datasets are intended to be used for exploring machine learning applications in materials science. They are formatted in simple form and in particular for easy input into the MAterials Simulation Toolkit - Machine Learning (MAST-ML) package (see https://github.com/uw-cmg/MAST-ML). Each dataset is a materials property of interest and associated descriptors. For detailed information, please see the attached README text file.

The first dataset, for dilute solute diffusion, can be used to predict an effective diffusion barrier for a solute element moving through another host element. The dataset has been calculated with DFT methods.

The second dataset, for perovskite stability, gives energies of compositions of potential perovskite materials relative to the convex hull calculated with DFT. The perovskite dataset also includes columns with information about the A site, B site, and X site in the perovskite structure in order to perform more advanced grouping of the data.

The third dataset is a metallic glasses dataset, which has values of reduced glass transition temperature (Trg) for a variety of metallic alloys. An additional column is included for the majority element of each alloy, which can be an interesting property to group on during tests.
80K+ Construction Site Images | AI Training Data | Machine Learning (ML)...
datarade.ai
Updated Nov 26, 2018
Cite
Data Seeds (2018). 80K+ Construction Site Images | AI Training Data | Machine Learning (ML) data | Object & Scene Detection | Global Coverage [Dataset]. https://datarade.ai/data-products/50k-construction-site-images-ai-training-data-machine-le-data-seeds
Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset updated
Nov 26, 2018
Dataset authored and provided by
Data Seeds
Area covered
Swaziland, Guatemala, Grenada, Russian Federation, United Arab Emirates, Senegal, Tunisia, Peru, Venezuela (Bolivarian Republic of), Kenya
Description
This dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.

Key Features:

1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context, making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.

2. Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.

3. Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.

4. High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.

5. Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest, useful for safety-focused or user-facing AI models.

6. AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.

7. Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.

Use Cases:

1. Training AI for safety compliance monitoring and PPE detection.
2. Powering progress tracking and material usage analysis tools.
3. Supporting site mapping, autonomous machinery, and smart construction platforms.
4. Enhancing augmented reality overlays and digital twin models for construction planning.

This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datadryad.org
+1 more
zip
Updated May 23, 2023
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R2 score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

Methods

Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
ML-CB Dataset
paperswithcode.com
Updated Apr 17, 2021
+ more versions
Cite
Nathan Reitinger; Michelle L. Mazurek (2021). ML-CB Dataset [Dataset]. https://paperswithcode.com/dataset/ml-cb
Explore at:
Dataset updated
Apr 17, 2021
Authors
Nathan Reitinger; Michelle L. Mazurek
Description
In this paper, we develop a new privacy enhancing tool: ML-CB—a means of using distinguishable pictorial information combined with underlying website source code to produce accurate and robust machine learning classifiers able to discern fingerprinting (i.e., surreptitious tracking) from non-fingerprinting canvas-based actions.

The data introduced in the paper is collected by scraping roughly half a million websites using a custom Google Chrome extension storing information related to the canvas.
Data from: Disease Prediction Dataset
ieee-dataport.org
Updated Feb 20, 2025
Cite
Ayush Nautiyal (2025). Disease Prediction Dataset [Dataset]. https://ieee-dataport.org/documents/disease-prediction-dataset
Explore at:
Dataset updated
Feb 20, 2025
Authors
Ayush Nautiyal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains symptom and disease information. It covers a total of 1325 symptoms across 391 diseases. The dataset is referenced from the MedlinePlus website. It includes training and testing sets and can be used to train a disease prediction algorithm. It was created independently for a disease prediction project and does not involve any funding or promotional terms.
Synset Boulevard: Synthetic image dataset for Vehicle Make and Model...
data.europa.eu
binary data
Updated Aug 8, 2024
+ more versions
Cite
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V. (2024). Synset Boulevard: Synthetic image dataset for Vehicle Make and Model Recognition (VMMR) [Dataset]. https://data.europa.eu/data/datasets/725679870677258240?locale=en
Explore at:
Available download formats: binary data
Dataset updated
Aug 8, 2024
Dataset authored and provided by
Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e. V.
License
http://dcat-ap.de/def/licenses/cc-by
Description
The Synset Boulevard dataset contains a total of 259,200 synthetically generated images of cars from a frontal traffic camera perspective, annotated by vehicle makes, models and years of construction for machine learning methods (ML) in the scope (task) of vehicle make and model recognition (VMMR).

The data set contains 162 vehicle models from 43 brands with 200 images each, as well as 8 sub-data sets each to be able to investigate different imaging qualities. In addition to the classification annotations, the data set also contains label images for semantic segmentation, as well as information on image and scene properties, as well as vehicle color.
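
As a quick consistency check on the figures above, the quoted total of 259,200 images follows if "200 images each" is read as 200 images per vehicle model in each of the 8 sub-data sets. That reading is an assumption, and the variable names below are illustrative only.

```python
# Figures quoted in the dataset description.
vehicle_models = 162
images_per_model = 200   # assumed to apply to each sub-data set separately
sub_datasets = 8

# Total under the assumption that every sub-data set covers every model.
total_images = vehicle_models * images_per_model * sub_datasets
print(total_images)  # 259200
```

Under this reading the product matches the stated total exactly.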

The dataset was presented in May 2024 by Anne Sielemann, Stefan Wolf, Masoud Roschani, Jens Ziehn and Jürgen Beyerer in the publication: Sielemann, A., Wolf, S., Roschani, M., Ziehn, J. and Beyerer, J. (2024). Synset Boulevard: A Synthetic Image Dataset for VMMR. In 2024 IEEE International Conference on Robotics and Automation (ICRA).

The model information is based on information from the ADAC online database (www.adac.de/rund-ums-fahrzeug/autokatalog/marken-modelle).

The data was generated using the simulation environment OCTANE (www.octane.org), which uses the Cycles ray tracer of the Blender project.

The dataset's website provides detailed information on the generation process and model assumptions. The dataset is therefore also intended to be used for the suitability analysis of simulated, synthetic datasets.

The data set was developed as part of the Fraunhofer PREPARE program in the "ML4Safety" project with the funding code PREPARE 40-02702, as well as funded by the "Invest BW" funding program of the Ministry of Economic Affairs, Labour and Tourism as part of the "FeinSyn" research project.
Phishing Websites Dataset
kaggle.com
zip
Updated Mar 23, 2024
+ more versions
Cite
Arnav Samal (2024). Phishing Websites Dataset [Dataset]. https://www.kaggle.com/datasets/arnavs19/phishing-websites-dataset
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Mar 23, 2024
Authors
Arnav Samal
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data consist of a collection of legitimate and phishing website instances. Each website is represented by a set of features that denote whether the website is legitimate or not. The data can serve as input for a machine learning process.

Here, the two variants of the Phishing Dataset are presented.

Full variant - dataset_full.csv

Total number of instances: 88,647

Number of legitimate website instances (labeled as 0): 58,000

Number of phishing website instances (labeled as 1): 30,647

Total number of features: 111

Small variant - dataset_small.csv

Total number of instances: 58,645

Number of legitimate website instances (labeled as 0): 27,998

Number of phishing website instances (labeled as 1): 30,647

Total number of features: 111
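
The instance counts above imply different class balances for the two variants, which matters when choosing evaluation metrics. A minimal sketch that derives the totals and the phishing share from the listed counts; the dictionary layout and the function name `phishing_share` are illustrative and not part of the dataset.

```python
# Instance counts copied from the listing above.
variants = {
    "dataset_full.csv": {"legitimate": 58_000, "phishing": 30_647},
    "dataset_small.csv": {"legitimate": 27_998, "phishing": 30_647},
}

def phishing_share(counts):
    """Return the fraction of instances labeled 1 (phishing)."""
    total = counts["legitimate"] + counts["phishing"]
    return counts["phishing"] / total

for name, counts in variants.items():
    total = sum(counts.values())
    print(f"{name}: {total} instances, {phishing_share(counts):.1%} phishing")
```

The full variant is roughly one-third phishing, while the small variant is close to balanced, so accuracy alone is likely to be less informative on the full variant.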
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...
zenodo.org
data.niaid.nih.gov
+2 more
csv
Updated Jul 20, 2024
+ more versions
Cite
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.11711230
Dataset updated
Jul 20, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 15, 2024
Area covered
YouTube
Description
Please cite the following paper when using this dataset:

N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

Abstract

This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
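
The three-way sentiment labeling described above can be sketched as a threshold rule on a VADER-style compound score in the range [-1, 1]. The ±0.05 cut-offs are the values conventionally suggested in VADER's documentation; the description does not state which thresholds the authors used, so treat them, and the function name `sentiment_class`, as assumptions.

```python
def sentiment_class(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment class.

    The default threshold is an assumed, conventional cut-off, not a value
    confirmed by the dataset description.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores for three video titles.
for score in (0.62, -0.40, 0.01):
    print(score, sentiment_class(score))
```

In practice the compound score would come from a sentiment analyzer run on each video title and description; only the final mapping step is shown here.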
MORART-3K: Moroccan Arts and Handicrafts Dataset for Computer Vision Tasks
zenodo.org
Updated Feb 14, 2025
+ more versions
Cite
HASSAN ZEKKOURI (2025). MORART-3K: Moroccan Arts and Handicrafts Dataset for Computer Vision Tasks [Dataset]. http://doi.org/10.5281/zenodo.14862418
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.14862418
Dataset updated
Feb 14, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
HASSAN ZEKKOURI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 13, 2025
Description
The Moroccan Arts and Crafts Dataset comprises a compilation of images that exhibit typical Moroccan items categorized into 26 classes. This dataset has been meticulously crafted to facilitate Content-Based Image Retrieval (CBIR), classification, and the preservation of cultural heritage.

The images and videos in this dataset were obtained from various places in Morocco, emphasizing significant cultural and artistic centers. The dataset comprises artifacts from workshops, museums, local markets, and historical places, guaranteeing a varied depiction of Moroccan workmanship. The pictures were captured under diverse scenarios, utilizing multiple lighting settings, backgrounds, and angles to improve resilience in practical applications.

The purpose is image retrieval, object classification, and cultural study. Furthermore, the dataset adheres to open-access protocols and is organized to enable effortless integration with computer vision algorithms.

For more information, visit the morart-3k-dataset website.
Dataset used for detecting DNS over HTTPS by Machine Learning.
data.niaid.nih.gov
zenodo.org
Updated Oct 28, 2020
Cite
Vekshin, Dmitrii (2020). Dataset used for detecting DNS over HTTPS by Machine Learning. [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818004
Explore at:
Dataset updated
Oct 28, 2020
Dataset provided by
Vekshin, Dmitrii
Cejka, Tomas
Hynek, Karel
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset consists of three different data sources:

DoH enabled Firefox

DoH enabled Google Chrome

Cloudflared DoH proxy

The web browser data was captured using the Selenium framework, which simulated classical user browsing. The browsers received commands to visit domains taken from Alexa's top 10K most visited websites. The capture was performed on the host by listening to the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Mozilla and 1,000 pages visited by Chrome.

The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices in our university. It continuously captured the DNS/DoH traffic created by up to 20 devices for around three months.

The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide raw pcap data, a CSV with flow data, and a CSV file with extracted features.

The CSV with extracted features has the following data fields:

Label (1 - DoH, 0 - regular HTTPS)

Data source

Duration

Minimal Inter-Packet Delay

Maximal Inter-Packet Delay

Average Inter-Packet Delay

A variance of Incoming Packet Sizes

A variance of Outgoing Packet Sizes

A ratio of the number of Incoming and outgoing bytes

A ratio of the number of Incoming and outgoing packets

Average of Incoming Packet sizes

Average of Outgoing Packet sizes

The median value of Incoming Packet sizes

The median value of outgoing Packet sizes

The ratio of bursts and pauses

Number of bursts

Number of pauses

Autocorrelation

Transmission symmetry in the 1st third of connection

Transmission symmetry in the 2nd third of connection

Transmission symmetry in the last third of connection
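
As an illustration of how several of the simpler fields listed above might be derived from one flow, here is a minimal sketch. It is not the authors' extraction code: the packet representation, the function name `flow_features`, the dictionary keys, and the choice of population variance are all assumptions made for this example. The burst, pause, autocorrelation, and symmetry features are omitted because the description does not define them.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute a few per-flow statistics.

    packets: list of (timestamp_seconds, size_bytes, direction) tuples,
    sorted by time, with direction either "in" or "out" (hypothetical format).
    Needs at least two packets and at least one packet in each direction.
    """
    times = [t for t, _, _ in packets]
    # Inter-packet delays: gaps between consecutive packet timestamps.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    in_sizes = [s for _, s, d in packets if d == "in"]
    out_sizes = [s for _, s, d in packets if d == "out"]
    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; the dataset may use another definition.
        "var_in_size": pvariance(in_sizes),
        "var_out_size": pvariance(out_sizes),
        "bytes_ratio": sum(in_sizes) / sum(out_sizes),
        "packets_ratio": len(in_sizes) / len(out_sizes),
        "avg_in_size": mean(in_sizes),
        "avg_out_size": mean(out_sizes),
        "median_in_size": median(in_sizes),
        "median_out_size": median(out_sizes),
    }


# Made-up four-packet flow, for illustration only.
example = [
    (0.00, 100, "out"),
    (0.10, 500, "in"),
    (0.30, 120, "out"),
    (0.60, 700, "in"),
]
features = flow_features(example)
for name, value in features.items():
    print(name, value)
```

The direction of each ratio (incoming over outgoing) is also a guess; check it against the README in the archive before comparing values with the published CSV.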

The observed network traffic does not contain privacy-sensitive information.

The zip file structure is:

|-- data
|   |-- extracted-features ... extracted features used in ML for DoH recognition
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   |-- flows ... exported flow data
|   |   |-- chrome
|   |   |-- cloudflared
|   |   `-- firefox
|   `-- pcaps ... raw PCAP data
|       |-- chrome
|       |-- cloudflared
|       `-- firefox
|-- LICENSE
`-- README.md
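
After unpacking the archive, a short check can confirm that the nine data directories named in the structure above are present. This is a sketch under assumptions: the extraction directory name `doh-dataset` and the helper names `expected_dirs` and `missing_dirs` are hypothetical, and the file names inside each directory are not specified in the description, so only directories are checked.

```python
from pathlib import Path

# Directory names taken from the zip file structure in the description.
KINDS = ("extracted-features", "flows", "pcaps")
SOURCES = ("chrome", "cloudflared", "firefox")


def expected_dirs(root):
    """Return the nine data/<kind>/<source> paths expected under root."""
    root = Path(root)
    return [root / "data" / kind / source for kind in KINDS for source in SOURCES]


def missing_dirs(root):
    """Return the expected directories that do not exist under root."""
    return [path for path in expected_dirs(root) if not path.is_dir()]


# "doh-dataset" is a placeholder for wherever the archive was extracted.
for path in missing_dirs("doh-dataset"):
    print("missing:", path)
```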

When using this dataset, please cite the original work as follows:

@inproceedings{vekshin2020,
  author = {Vekshin, Dmitrii and Hynek, Karel and Cejka, Tomas},
  title = {DoH Insight: Detecting DNS over HTTPS by Machine Learning},
  year = {2020},
  isbn = {9781450388337},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3407023.3409192},
  doi = {10.1145/3407023.3409192},
  booktitle = {Proceedings of the 15th International Conference on Availability, Reliability and Security},
  articleno = {87},
  numpages = {8},
  keywords = {classification, DoH, DNS over HTTPS, machine learning, detection, datasets},
  location = {Virtual Event, Ireland},
  series = {ARES '20}
}
