Facebook
Twitterhttps://brightdata.com/licensehttps://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The benchmarking datasets used for deepBlink. The npz files contain train/valid/test splits inside and can be used directly. The files belong to the following challenges / classes:- ISBI Particle tracking challenge: microtubule, vesicle, receptor- Custom synthetic (based on http://smal.ws): particle- Custom fixed cell: smfish- Custom live cell: suntagThe csv files are to determine which image in the test splits correspond to which original image, SNR, and density.
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I am sharing my 28 Machine Learning, Deep Learning (Artificial Intelligence - AI) projects with their data, software and outputs on Kaggle for educational purposes as open source. It appeals to people who want to work in this field, have 0 Machine Learning knowledge, have Intermediate Machine Learning knowledge, specialize in this field (Attracts to all levels). The deep learning projects in it are for advanced level, so I recommend you to start your studies from the Machine Learning section. You can check your own outputs along with the outputs in it. I am happy to share 28 educational projects with the whole world through Kaggle. Knowledge is free and better when shared!
Algorithms used in it:
1) Nearest Neighbor
2) Naive Bayes
3) Decision Trees
4) Linear Regression
5) Support Vector Machines (SVM)
6) Neural Networks
7) K-means clustering
Kind regards, Emirhan BULUT
You can use the links below for communication. If you have any questions or comments, feel free to let me know!
LinkedIn: https://www.linkedin.com/in/artificialintelligencebulut/ Email: emirhan@novosteer.com
Emirhan BULUT. (2022). Machine Learning Tutorials - Example Projects - AI [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4361310
Facebook
TwitterAPISCRAPY's AI & ML training data is meticulously curated and labelled to ensure the best quality. Our training data comes from a variety of areas, including healthcare and banking, as well as e-commerce and natural language processing.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.
This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.
This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program brainstorming and feedback session. It was one of the ideas to create byte-sized learning material to help our KaggleX mentees learn things faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.
This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.
The mentors and mentees communicated via Discord, Trello, Google Hangout, etc... to put together these artifacts and made them public for everyone to use and contribute back.
The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.
The dataset encompasses the following attributes:
The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.
Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
Facebook
TwitterCredit to the original author: The dataset was originally published here
Hands-on teaching of modern machine learning and deep learning techniques heavily relies on the use of well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to be suitable for a large variety of different training goals, many of which are not easily giving way to unrealistically high prediction accuracy. Teachers or instructors thus can chose the difficulty of the training goals and thereby match it with the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop so that they can be used in live hands-on sessions.
The dataset can be found in the `\dataset` folder and be downloaded from zenodo: https://doi.org/10.5281/zenodo.4980359
If you make use of this dataset, in particular if this is in form of an academic contribution, then please cite the following two references:
Facebook
Twitterhttps://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy
Looking for a free Walmart product dataset? The Walmart Products Free Dataset delivers a ready-to-use ecommerce product data CSV containing ~2,100 verified product records from Walmart.com. It includes vital details like product titles, prices, categories, brand info, availability, and descriptions — perfect for data analysis, price comparison, market research, or building machine-learning models.
Complete Product Metadata: Each entry includes URL, title, brand, SKU, price, currency, description, availability, delivery method, average rating, total ratings, image links, unique ID, and timestamp.
CSV Format, Ready to Use: Download instantly - no need for scraping, cleaning or formatting.
Good for E-commerce Research & ML: Ideal for product cataloging, price tracking, demand forecasting, recommendation systems, or data-driven projects.
Free & Easy Access: Priced at USD $0.0, making it a great starting point for developers, data analysts or students.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is an iris dataset commonly used in machine learning. Accessed on 10-19-2020 from the following URL: http://faculty.smu.edu/tfomby/eco5385_eco6380/data/Iris.xls
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview This dataset provides comprehensive insights into the global popularity trends of Artificial Intelligence (AI) and Machine Learning (ML). The data has been meticulously gathered and curated to reflect the growing interest and adoption of these technologies across various regions and sectors.
Data Sources The dataset aggregates information from multiple sources, including:
Search engine query data Social media mentions and hashtags Research publication counts Online course enrolments Job postings
Facebook
TwitterThis dataset features over 80,000 high-quality images of construction sites sourced from photographers worldwide. Built to support AI and machine learning applications, it delivers richly annotated and visually diverse imagery capturing real-world construction environments, machinery, and processes.
Key Features: 1. Comprehensive Metadata: the dataset includes full EXIF data such as aperture, ISO, shutter speed, and focal length. Each image is annotated with construction phase, equipment types, safety indicators, and human activity context—making it ideal for object detection, site monitoring, and workflow analysis. Popularity metrics based on performance on our proprietary platform are also included.
Unique Sourcing Capabilities: images are collected through a proprietary gamified platform, with competitions focused on industrial, construction, and labor themes. Custom datasets can be generated within 72 hours to target specific scenarios, such as building types, stages (excavation, framing, finishing), regions, or safety compliance visuals.
Global Diversity: sourced from contributors in over 100 countries, the dataset reflects a wide range of construction practices, materials, climates, and regulatory environments. It includes residential, commercial, industrial, and infrastructure projects from both urban and rural areas.
High-Quality Imagery: includes a mix of wide-angle site overviews, close-ups of tools and equipment, drone shots, and candid human activity. Resolution varies from standard to ultra-high-definition, supporting both macro and contextual analysis.
Popularity Scores: each image is assigned a popularity score based on its performance in GuruShots competitions. These scores provide insight into visual clarity, engagement value, and human interest—useful for safety-focused or user-facing AI models.
AI-Ready Design: this dataset is structured for training models in real-time object detection (e.g., helmets, machinery), construction progress tracking, material identification, and safety compliance. It’s compatible with standard ML frameworks used in construction tech.
Licensing & Compliance: fully compliant with privacy, labor, and workplace imagery regulations. Licensing is transparent and ready for commercial or research deployment.
Use Cases: 1. Training AI for safety compliance monitoring and PPE detection. 2. Powering progress tracking and material usage analysis tools. 3. Supporting site mapping, autonomous machinery, and smart construction platforms. 4. Enhancing augmented reality overlays and digital twin models for construction planning.
This dataset provides a comprehensive, real-world foundation for AI innovation in construction technology, safety, and operational efficiency. Custom datasets are available on request. Contact us to learn more!
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data contains two sets of datasets. One for pKb, and the other for LogP machine learning prediction. The datasets contain several descriptors generated using RDKit and density functional theory (DFT).
Facebook
Twitterhttp://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.
Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.
Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.
Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.
Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.
Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.
Column Definitions:
Dataset Name: Name of the dataset. Created By: Creator(s) of the dataset. Last Updated in number of days: Time elapsed since last update. Usability Score: Score indicating the ease of use. Number of File: Quantity of files included. Type of file: Format of files (e.g., CSV, JSON). Size: Size of the dataset. Total Votes: Number of votes received. Category: Categorization of the dataset's subject matter.
Facebook
Twitter"Collection of 100,000 high-quality video clips across diverse real-world domains, designed to accelerate the training and optimization of computer vision and multimodal AI models."
Overview This dataset contains 100,000 proprietary and partner-produced video clips filmed in 4K/6K with cinema-grade RED cameras. Each clip is commercially cleared with full releases, structured metadata, and available in RAW or MOV/MP4 formats. The collection spans a wide variety of domains — people and lifestyle, healthcare and medical, food and cooking, office and business, sports and fitness, nature and landscapes, education, and more. This breadth ensures robust training data for computer vision, multimodal, and machine learning projects.
The data set All 100,000 videos have been reviewed for quality and compliance. The dataset is optimized for AI model training, supporting use cases from face and activity recognition to scene understanding and generative AI. Custom datasets can also be produced on demand, enabling clients to close data gaps with tailored, high-quality content.
About M-ART M-ART is a leading provider of cinematic-grade datasets for AI training. With extensive expertise in large-scale content production and curation, M-ART delivers both ready-to-use video datasets and fully customized collections. All data is proprietary, rights-cleared, and designed to help global AI leaders accelerate research, development, and deployment of next-generation models.
Facebook
TwitterM-ART delivers diverse AI training datasets with over 20,000 assets in 4K/6K RAW video. All content is filmed on RED cinema cameras, commercially cleared with full releases, and structured with detailed metadata. Key areas of the catalog include drone and aerial footage, people and lifestyle, healthcare and medical, food and cooking, business and finance, construction and tools, education, and nature landscapes. In addition, M-ART offers the ability to create custom datasets for clients, providing unique, high-quality video collections that help companies stand out and accelerate AI model training.
Facebook
TwitterAttribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).
The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the larges list on the site).
Original code used to retrieve the dataset can be found on github repository: github.com/scostap/goodreads_bbe_dataset
The data was retrieved in two sets, the first 30000 books and then the remainig 22478. Dates were not parsed and reformated on the second chunk so publishDate and firstPublishDate are representet in a mm/dd/yyyy format for the first 30000 records and Month Day Year for the rest.
Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
The 25 fields of the dataset are:
| Attributes | Definition | Completeness |
| ------------- | ------------- | ------------- |
| bookId | Book Identifier as in goodreads.com | 100 |
| title | Book title | 100 |
| series | Series Name | 45 |
| author | Book's Author | 100 |
| rating | Global goodreads rating | 100 |
| description | Book's description | 97 |
| language | Book's language | 93 |
| isbn | Book's ISBN | 92 |
| genres | Book's genres | 91 |
| characters | Main characters | 26 |
| bookFormat | Type of binding | 97 |
| edition | Type of edition (ex. Anniversary Edition) | 9 |
| pages | Number of pages | 96 |
| publisher | Editorial | 93 |
| publishDate | publication date | 98 |
| firstPublishDate | Publication date of first edition | 59 |
| awards | List of awards | 20 |
| numRatings | Number of total ratings | 100 |
| ratingsByStars | Number of ratings by stars | 97 |
| likedPercent | Derived field, percent of ratings over 2 starts (as in GoodReads) | 99 |
| setting | Story setting | 22 |
| coverImg | URL to cover image | 99 |
| bbeScore | Score in Best Books Ever list | 100 |
| bbeVotes | Number of votes in Best Books Ever list | 100 |
| price | Book's price (extracted from Iberlibro) | 73 |
Facebook
TwitterAttribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
AI-generated images for academic studies and machine learning research. Various demographic and age groups. Free for a link and a citation or another mention in the research paper.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four datasets are presented here. The original dataset is a collection of the COVID-19 data maintained by Our World in Data. It includes data on confirmed cases, and deaths, as well as other variables of potential interest for ten countries such as Australia, Brazil, Canada, China, Denmark, France, Israel, Italy, the United Kingdom, and the United States. The original dataset includes the data from the date of 31st December in 2019 to 31st May in 2020 with a total of 1.530 instances and 19 features. This dataset is collected from a variety of sources (the European Centre for Disease Prevention and Control, United Nations, World Bank, Global Burden of Disease, Blavatnik School of Government, etc.). After the original dataset is pre-processed by cleaning and removing some data including unnecessary and blank. Then, all strings are converted numeric values, and some new features such as continent, hemisphere, year, month, and day are added by extracting the original features. After that, the processed original dataset is organized for prediction of the number of new cases of COVID-19 for 1 day, 3 days, and 10 days ago and three datasets (Dataset-1, 2, 3) are created for that.
Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
The MNIST dataset is a dataset of handwritten digits. It is a popular dataset for machine learning and artificial intelligence research. The dataset consists of 60,000 training images and 10,000 test images. Each image is a 28x28 pixel grayscale image of a handwritten digit. The digits are labeled from 0 to 9.
Facebook
Twitterhttps://brightdata.com/licensehttps://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.