This dataset was created by Summa One
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I created this dataset as part of a data analysis project and concluded that it might be relevant for others interested in analyzing content on YouTube. This dataset is a collection of over 6,000 videos with the following columns:
Comments: comments count for the video
Using the YouTube API and Python, I collected data about the videos of some popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years; a sketch of this collection step follows the channel list below. Featured in the dataset are the following creators:
Krish Naik
Nicholas Renotte
Sentdex
DeepLearningAI
Artificial Intelligence — All in One
Siraj Raval
Jeremy Howard
Applied AI Course
Daniel Bourke
Jeff Heaton
DeepLearning.TV
Arxiv Insights
These channels are featured in multiple "top AI channels to subscribe to" lists and have seen large growth on YouTube in the last couple of years. All of them were created in or before 2018.
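For reproducibility, a minimal sketch of this collection step with the YouTube Data API v3 might look as follows (a hedged sketch, not the original script; the API key and channel ID are placeholders):

# Sketch: fetch per-video statistics for one channel via the YouTube Data API v3.
# Requires google-api-python-client; API_KEY and CHANNEL_ID are hypothetical placeholders.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"
CHANNEL_ID = "YOUR_CHANNEL_ID"

youtube = build("youtube", "v3", developerKey=API_KEY)

# A channel's uploads are exposed as a special playlist.
channel = youtube.channels().list(part="contentDetails", id=CHANNEL_ID).execute()
uploads_id = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

# Page through the uploads playlist and collect video ids.
video_ids, token = [], None
while True:
    kwargs = {"part": "contentDetails", "playlistId": uploads_id, "maxResults": 50}
    if token:
        kwargs["pageToken"] = token
    page = youtube.playlistItems().list(**kwargs).execute()
    video_ids += [it["contentDetails"]["videoId"] for it in page["items"]]
    token = page.get("nextPageToken")
    if not token:
        break

# Fetch statistics (views, likes, comment counts) in batches of 50.
for i in range(0, len(video_ids), 50):
    stats = youtube.videos().list(part="statistics,snippet",
                                  id=",".join(video_ids[i:i + 50])).execute()
    for item in stats["items"]:
        print(item["snippet"]["title"], item["statistics"].get("commentCount"))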
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You need a working Julia installation. We used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
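As a hedged illustration of how a sample can be replicated from these files (the data CSV name below is an assumption; adjust it to the file you extracted):

# Sketch: replicate the first APP validation sample from the provided index files.
# "data.csv" is a hypothetical name for one of the extracted data CSVs.
import pandas as pd

data = pd.read_csv("data.csv")                             # first column is "class_label"
indices = pd.read_csv("app_val_indices.csv", header=None)  # one row per sample (header assumption)

sample = data.iloc[indices.iloc[0].dropna().astype(int)]
print(sample["class_label"].value_counts(normalize=True))  # the label distribution to quantify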
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Summary
A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains
30 completely labeled (segmented) images
71 partly labeled images
altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30 and 60 min on average, yet a difficult one can take up to 4 hours)
To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects
A set of metrics and a novel ranking score for respective meaningful method benchmarking
An evaluation of three baseline methods in terms of the above metrics and score
Abstract
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
Dataset documentation:
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
FISBe Datasheet
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Files
fisbe_v1.0_{completely,partly}.zip
contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.
fisbe_v1.0_mips.zip
maximum intensity projections of all samples, for convenience.
sample_list_per_split.txt
a simple list of all samples and the subset they are in, for convenience.
view_data.py
a simple python script to visualize samples, see below for more information on how to use it.
dim_neurons_val_and_test_sets.json
a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.
Readme.md
general information
How to work with the image files
Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
How to open zarr files
Install the python zarr package:
pip install zarr
Open a zarr file with:
import zarr
raw = zarr.open("path/to/sample.zarr", mode='r', path="volumes/raw")
seg = zarr.open("path/to/sample.zarr", mode='r', path="volumes/gt_instances")
Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.
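For example (assuming raw was opened as shown above):

import numpy as np

# Zarr arrays slice lazily; slicing with [:] loads the data into memory as numpy.
raw_np = raw[:]         # full array, dimension order CZYX
first_channel = raw[0]  # or load a single channel to save memory
assert isinstance(raw_np, np.ndarray)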
How to view zarr image files
We recommend using napari to view the image data.
Install napari:
pip install "napari[all]"
Save the following Python script:
import zarr, sys, napari
# zarr.load has no mode argument; it reads the arrays fully into memory
raw = zarr.load(sys.argv[1], path="volumes/raw")
gts = zarr.load(sys.argv[1], path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
Execute:
python view_data.py /R9F03-20181030_62_B5.zarr
Metrics
S: Average of avF1 and C
avF1: Average F1 Score
C: Average ground truth coverage
clDice_TP: Average true positives clDice
FS: Number of false splits
FM: Number of false merges
tp: Relative number of true positives
For more information on our selected metrics and formal definitions please see our paper.
Baseline
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.
License
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Citation
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
  title         = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author        = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year          = 2024,
  eprint        = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
Acknowledgments
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.
Changelog
There have been no changes to the dataset so far. All future changes will be listed on the changelog page.
Contributing
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Main Objects Segmentation Dataset project focuses on curating a comprehensive dataset for training machine learning models in the field of computer vision.
https://doi.org/10.5281/zenodo.17555036
This dataset contains data from 100 participants that was collected between July 19, 2023 and May 01, 2025. Data from multiple modalities are included. The data in this dataset contain no protected health information (PHI). Information related to the sex and race/ethnicity of the participants as well as medication used has also been removed. A detailed description of the dataset is available in the AI-READI documentation for v3.0.0 of the dataset at https://docs.aireadi.org
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data.
See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Corpus Nummorum - Coin Image Dataset
This dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, Troad and Mysia. Due to copyright, it contains only a selection of the coins published on the CN portal.
The dataset contains 115,160 images of about 29,000 unique coins. The images are split into three main folders that organize the coins differently, each subdivided into subfolders holding the coin images. The "dataset_coins" folder contains the coin photos divided into obverse and reverse and arranged by coin type. In the "dataset_types" folder, the obverse and reverse images of each coin are concatenated and transformed to a quadratic format with black bars on the top and bottom; the images here are sorted by coin type. The last folder, "dataset_mints", contains the same concatenated images sorted by mint. A "sources" CSV file holds the source for every image. Due to copyright, the image size is limited to 299×299 pixels. However, this should be sufficient for most ML approaches.
The main purpose of this dataset in the CN project is the training of Machine Learning based image recognition models. We use three different Convolutional Neural Network architectures: VGG16, VGG19 and ResNet50. Our best model (VGG16) achieves a 79% Top-1 and a 97% Top-5 accuracy for coin type recognition on this dataset; mint recognition achieves a 79% Top-1 and 94% Top-5 accuracy. We have a Colab notebook with two models (trained on the whole CN dataset) online.
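As a rough illustration of such a setup (not the project's actual training code; the number of coin types and the data pipeline are placeholders), a transfer-learning sketch with Keras might look like this:

# Sketch: VGG16-based coin type classifier via transfer learning.
# NUM_TYPES is a hypothetical class count; train_ds/val_ds are placeholders.
import tensorflow as tf

NUM_TYPES = 1000

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(299, 299, 3))
base.trainable = False  # train only the new classification head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_TYPES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy",
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)])
# model.fit(train_ds, validation_data=val_ds, epochs=10)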
During the summer semester 2023, we held the "Data Challenge" event at our Department of Computer Science at Goethe University. We gave our students this dataset with the task of achieving better results than us. Here are their experiments:
Team 1: Voting and stacking of models
Team 4: Dockerized TIMM Computer Vision Backend & FastAPI
Now we would like to invite you to try out your own ideas and models on our coin data.
If you have any questions or suggestions, please, feel free to contact us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, 35,028 real and 37,106 fake. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.
The dataset contains four columns: Serial number (starting from 0); Title (the news heading); Text (the news content); and Label (0 = fake, 1 = real).
The CSV file contains 78,098 entries, of which only 72,134 are accessible via the data frame.
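A quick, hedged way to check these counts with pandas (the file name and the column casing are assumptions):

import pandas as pd

df = pd.read_csv("WELFake_Dataset.csv")  # hypothetical file name
print(len(df))                           # rows accessible via the data frame
label_col = "label" if "label" in df.columns else "Label"  # casing varies
print(df[label_col].value_counts())      # 0 = fake, 1 = real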
This dataset is part of our ongoing research on "Fake News Prediction on Social Media Website", conducted within the doctoral degree program of Mr. Pawan Kumar Verma, and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation program.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains information about the top-rated movies fetched from The Movie Database (TMDB) API. The data includes key movie attributes such as movie ID, title, release date, popularity, vote count, and vote average.
✅ Total Pages Scraped: 500
✅ Total Movies Included: 10,000+
✅ Source: TMDB API
✅ Purpose: Educational and non-commercial use only
The dataset can be used for:
Exploratory Data Analysis (EDA)
Machine Learning Projects
Recommendation Systems
Popularity Prediction
Sentiment and Trend Analysis
Data Visualization
Please note:
This product uses the TMDB API but is not endorsed or certified by TMDB.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset provides detailed information on top-rated movies collected from The Movie Database (TMDb) API. It contains key movie attributes such as title, popularity, average rating, vote count, overview, and an adult content flag. The dataset is designed for data analysis, visualization, and machine learning applications such as movie recommendation systems, sentiment analysis, and popularity prediction.
By exploring this dataset, users can gain insights into how audience ratings, popularity, and engagement vary across different films. It serves as a valuable resource for students, data scientists, and researchers who want to work with real-world movie data.
Information about popular open source projects related to machine learning.
The goal of this dataset is to better understand how open source machine learning projects evolve. Data collection date: early May 2018. Source: GitHub user interface and API. Contains original research.
name - name of the project.
alignment - either corporate, academia or indie. Corporate projects are developed by professional engineers, typically have a dedicated development team, and try to solve specific problems. Academic projects usually mention publications; they support research. Independent projects are often a hobby.
company - name of the company if the alignment is corporate.
forecast - expected middle-term evolution of the project. 1 means positive, 0 means negative (stagnation) and -1 means factual death.
year - when the project was created. Defaults to the GitHub repository creation date but can be earlier - this is a subject of manual adjustments.
code of conduct - whether the project has a code of conduct.
contributing - whether the project has a contributions guide.
stars - number of stargazers on GitHub.
issues - number of issues on GitHub, either open or closed.
contributors - number of contributors as reported by GitHub.
core - estimation of the core team, aka "bus factor".
team - number of people who commit to the project regularly.
commits - number of commits in the project.
team / all - ratio of the number of commits by the dedicated development team to the overall number of contributions. Indicates roughly which part of the project is owned by the internal developers.
link - URL of the project.
language - API language. multi means several languages.
implementation - the language which was mainly used for implementing the project.
license - license of the project.
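Several of these fields map onto GitHub's public REST API; the following is a hedged sketch of how such numbers could be re-collected (the example repository is illustrative, and this is not the author's original tooling):

# Sketch: re-collect a few of the listed fields for one repository
# via GitHub's public REST API (unauthenticated; rate limits apply).
import requests

repo = "tensorflow/tensorflow"  # hypothetical example project
info = requests.get(f"https://api.github.com/repos/{repo}").json()

print("stars:", info["stargazers_count"])
print("year:", info["created_at"][:4])  # repository creation year
print("license:", (info.get("license") or {}).get("spdx_id"))

contributors = requests.get(f"https://api.github.com/repos/{repo}/contributors",
                            params={"per_page": 100, "anon": "true"}).json()
print("contributors (first page):", len(contributors))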
Credit to the original author: the dataset was originally published here.
Hands-on teaching of modern machine learning and deep learning techniques heavily relies on well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to be suitable for a large variety of training goals, many of which do not easily give way to unrealistically high prediction accuracy. Teachers or instructors can thus choose the difficulty of the training goals and match it to the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop, so that they can be used in live hands-on sessions.
The dataset can be found in the `dataset` folder and can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.4980359
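As a hedged illustration of how quickly a baseline can be trained on a laptop (the file and column names below are assumptions; adjust them to the actual CSV):

# Sketch: a quick baseline on the weather prediction dataset.
# Hypothetical task: predict whether tomorrow's mean temperature in one
# location rises, from today's numeric observations at all locations.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("weather_prediction_dataset.csv")  # assumed file name
target = (df["BASEL_temp_mean"].shift(-1) > df["BASEL_temp_mean"]).iloc[:-1]  # assumed column
features = df.select_dtypes("number").iloc[:-1]

X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.3, shuffle=False)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))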
If you make use of this dataset, in particular if this is in form of an academic contribution, then please cite the following two references:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Krishna Swapnika
Released under Apache 2.0
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For More Visit: https://onlypython01.blogspot.com
This dataset contains information on 10,000 of the most popular video games, curated from multiple sources. It is designed for data science, machine learning, and analytics projects in gaming, entertainment, and recommendation systems.
The dataset includes:
ID & Name – unique identifier and game title
Release & Update Dates – when the game was originally released and last updated
Rating & Suggestions Count – aggregated player ratings and number of community recommendations
Platforms – supported consoles and systems (e.g., PC, PlayStation, Xbox, Switch, Mobile)
Developers & Publishers – companies behind the games
Genres – classification (RPG, FPS, Adventure, etc.)
Image – cover art thumbnail URL for visualization
Description – text summary of the game
Potential Use Cases
Exploratory analysis: study trends in ratings, genres, or release dates
Machine Learning: build recommender systems for games (see the sketch after this list)
NLP: analyze game descriptions & genres
Visualization projects: timeline charts, platform distribution, developer networks
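As a starting point for the recommender use case mentioned above, here is a hedged content-based sketch (the file and column names are assumptions based on the field list):

# Sketch: content-based game recommendations from descriptions and genres.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("games.csv")  # hypothetical file name
text = df["Description"].fillna("") + " " + df["Genres"].fillna("")  # assumed columns

tfidf = TfidfVectorizer(stop_words="english", max_features=20000)
matrix = tfidf.fit_transform(text)

def recommend(title, k=5):
    # Return the k games most similar to `title` by TF-IDF cosine similarity.
    idx = df.index[df["Name"] == title][0]
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    best = scores.argsort()[::-1][1:k + 1]  # skip the queried game itself
    return df.loc[best, "Name"].tolist()

print(recommend("The Witcher 3: Wild Hunt"))  # any title present in the data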
https://creativecommons.org/publicdomain/zero/1.0/
This collection of datasets was created by fetching data from the TMDB (The Movie Database) API and performing extensive cleaning to ensure usability for data analysis and machine learning projects. It comprises three distinct datasets:
tmdb_popular_movies: contains 13,144 entries featuring the most popular movies.
tmdb_top_rated_movies: contains 12,525 entries highlighting top-rated movies.
tmdb_upcoming_movies: contains 11,959 entries showcasing upcoming movie releases.
Each dataset is structured with the following columns:
id: unique identifier for each movie.
title: the title of the movie.
overview: a brief description of the movie's plot.
release_date: the movie's release date.
popularity: a numeric value indicating the movie's popularity on TMDB.
vote_average: average rating given by TMDB users.
vote_count: total number of votes received.
Key Features
Versatile Datasets: covers popular, highly rated, and upcoming movies for diverse use cases.
Cleaned and Preprocessed: free from missing or duplicate values, making it ready for immediate analysis.
Applications: ideal for building recommendation systems, sentiment analysis, popularity prediction models, and more.
These datasets were created to provide reliable resources for academic and professional projects in the fields of data science and machine learning.
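As an illustration of the fetch-and-clean step, here is a hedged sketch against the standard TMDB v3 API (the API key is a placeholder, and this is not the original collection script):

# Sketch: fetch popular movies from the TMDB v3 API and clean the result.
import pandas as pd
import requests

API_KEY = "YOUR_TMDB_API_KEY"  # hypothetical placeholder
rows = []
for page in range(1, 6):  # TMDB returns 20 movies per page
    resp = requests.get("https://api.themoviedb.org/3/movie/popular",
                        params={"api_key": API_KEY, "page": page})
    rows += resp.json()["results"]

cols = ["id", "title", "overview", "release_date",
        "popularity", "vote_average", "vote_count"]
df = (pd.DataFrame(rows)[cols]
        .drop_duplicates(subset="id")  # listings can repeat across pages
        .dropna())                     # match the "no missing values" claim
df.to_csv("tmdb_popular_movies.csv", index=False)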
https://creativecommons.org/publicdomain/zero/1.0/
Data scraped from Mangakalot. I originally decided to create this dataset for use in a recommendation system for manga titles. Other datasets that I found were either missing information that I wanted to use to build this system or contained too small a sample size to build what I deemed a useful product. This is also my first attempt at web scraping (I'm also fairly new to Python and data science), so I suppose I wanted to do a simple project at first to learn the basics. I hope it proves useful to someone.
This dataset contains metadata for the top 8,550 movies listed on The Movie Database (TMDB). Each entry includes valuable information such as:
It serves as a great resource for data scientists, analysts, machine learning practitioners, and film enthusiasts interested in movie metadata.
Here are a few ideas for how to use this dataset:
All data is sourced from the TMDB API and reflects the top-rated or most popular movies available at the time of collection.
This dataset is intended for educational and research purposes only. All movie data and assets belong to their respective copyright holders and TMDB.
https://creativecommons.org/publicdomain/zero/1.0/
V1
I have created artificial intelligence software that can predict emotion from text you have written, using a semi-supervised learning method and the RC algorithm. I used very simple code and focused the software on solving the problem. I aim to create a second version of the software using an RNN (Recurrent Neural Network). I hope I was able to create an example for you to use in your theses and projects.
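The RC algorithm itself is not described here; purely as a hedged illustration of the general semi-supervised setup (using scikit-learn's self-training with logistic regression as a stand-in base classifier, not the author's method):

# Sketch: semi-supervised emotion classification via self-training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

texts = ["i am so happy today", "this is terrible", "what a lovely surprise"]
labels = [1, 0, -1]  # -1 marks unlabeled examples

model = make_pipeline(TfidfVectorizer(),
                      SelfTrainingClassifier(LogisticRegression(max_iter=1000)))
model.fit(texts, labels)
print(model.predict(["i feel great"]))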
V2
I decided to apply a technique I had developed to the emotion dataset on which I had previously used semi-supervised machine learning methods. This technique is produced according to Quantum5 laws. I developed smart artificial intelligence software that can predict emotion with Quantum5 neuronal networks. I share this software with all humanity as open source on Kaggle. It is my first open source NLP project with Quantum technology. Developing an NLP system with Quantum technology is very exciting!
Happy learning!
Emirhan BULUT
Head of AI and AI Inventor
Emirhan BULUT. (2022). Emotion Prediction with Quantum5 Neural Network AI [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/2129637
Python 3.9.8
Keras
Tensorflow
NumPy
Pandas
Scikit-learn (SKLEARN)
https://raw.githubusercontent.com/emirhanai/Emotion-Prediction-with-Semi-Supervised-Learning-of-Machine-Learning-Software-with-RC-Algorithm---By/main/Quantum%205.png
https://raw.githubusercontent.com/emirhanai/Emotion-Prediction-with-Semi-Supervised-Learning-of-Machine-Learning-Software-with-RC-Algorithm---By/main/Emotion%20Prediction%20with%20Semi%20Supervised%20Learning%20of%20Machine%20Learning%20Software%20with%20RC%20Algorithm%20-%20By%20Emirhan%20BULUT.png
Name-Surname: Emirhan BULUT
Contact (Email) : emirhan@isap.solutions
LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/
Kaggle: https://www.kaggle.com/emirhanai
Official Website: https://www.emirhanbulut.com.tr