73 datasets found
  1. 11 Machine Learning Projects With Datasets

    • kaggle.com
    zip
    Updated Jan 12, 2024
    Cite
    Summa One (2024). 11 Machine Learning Projects With Datasets [Dataset]. https://www.kaggle.com/datasets/summaone/ml-10pro
    Explore at:
    Available download formats: zip (69465704 bytes)
    Dataset updated
    Jan 12, 2024
    Authors
    Summa One
    Description

    Dataset

    This dataset was created by Summa One

    Contents

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE

  3. AI/ML Youtube Videos

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Asmaa Hadir (2023). AI/ML Youtube Videos [Dataset]. https://www.kaggle.com/datasets/asmaahadir/aiml-youtube-channels-content-2018-2019
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Asmaa Hadir
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    YouTube
    Description

    I created this dataset as part of a data analysis project and concluded that it might be relevant for others who are interested in analyzing content on YouTube. This dataset is a collection of over 6000 videos with the following columns:

    • Channel: video's channel
    • Title: video title
    • PublishedDate: date the video was uploaded
    • Likes: likes count for the video
    • Views: views count for the video
    • Comments: comments count for the video

    Using Python and the YouTube API, I collected data on videos from popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular over the last couple of years. The dataset features the following creators:

    • Krish Naik

    • Nicholas Renotte

    • Sentdex

    • DeepLearningAI

    • Artificial Intelligence — All in One

    • Siraj Raval

    • Jeremy Howard

    • Applied AI Course

    • Daniel Bourke

    • Jeff Heaton

    • DeepLearning.TV

    • Arxiv Insights

    These channels are featured in multiple "top AI channels to subscribe to" lists and have grown considerably on YouTube in the last couple of years. They all have a creation date since or before 2018.
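
    The columns listed above lend themselves to simple per-channel aggregates. The following is a minimal sketch, assuming the data is exported as a CSV whose header uses exactly the column names from the description; the sample rows and the channel names "Channel A" and "Channel B" are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the header matches the listed columns.
SAMPLE = """Channel,Title,PublishedDate,Likes,Views,Comments
Channel A,Intro to ML,2019-01-05,120,4000,15
Channel A,Neural nets,2019-02-10,300,6000,40
Channel B,Pandas basics,2018-11-20,50,5000,5
"""


def likes_per_view_by_channel(csv_text):
    """Return total likes divided by total views for each channel."""
    totals = defaultdict(lambda: [0, 0])  # channel -> [likes, views]
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Channel"]][0] += int(row["Likes"])
        totals[row["Channel"]][1] += int(row["Views"])
    # Skip channels without views to avoid dividing by zero.
    return {ch: likes / views for ch, (likes, views) in totals.items() if views > 0}


if __name__ == "__main__":
    for channel, ratio in likes_per_view_by_channel(SAMPLE).items():
        print(f"{channel}: {ratio:.3f} likes per view")
```

    Summing likes and views before dividing weights each channel by its total audience rather than averaging per-video ratios; either choice may be reasonable depending on the question asked.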

  4. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
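
    As an illustration of how ordinal classes can stem from a binning of a continuous label, here is a minimal sketch. The bin edges and the 0-based class numbering are assumptions chosen for the example; the edges actually used by the provided scripts are not stated here.

```python
from bisect import bisect_right


def bin_to_ordinal(value, edges):
    """Map a continuous value to an ordinal class index.

    `edges` must be sorted ascending. Values below edges[0] fall in class 0,
    values in [edges[i-1], edges[i]) fall in class i, and values at or above
    the last edge fall in class len(edges).
    """
    return bisect_right(edges, value)


if __name__ == "__main__":
    edges = [10.0, 20.0, 30.0]  # hypothetical bin edges
    for value in (5.0, 10.0, 25.0, 99.0):
        print(value, "->", bin_to_ordinal(value, edges))
```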

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
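
    The APP idea of drawing label distributions with equal probability can be sketched as follows. This is not the authors' implementation (that is the Julia code referenced in this entry); it is a rough Python illustration that assumes uniform sampling from the probability simplex via normalized exponential draws, and sampling of data items with replacement. The function names are hypothetical, and the APP-OQ smoothness filter is not covered.

```python
import random


def draw_prevalence(n_classes, rng):
    """Draw a label distribution uniformly from the probability simplex.

    Normalizing independent Exponential(1) draws yields a flat Dirichlet
    sample, i.e. every distribution is equally likely.
    """
    weights = [rng.expovariate(1.0) for _ in range(n_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def draw_sample_indices(labels, prevalence, sample_size, rng):
    """Pick item indices so that class frequencies follow `prevalence`.

    `prevalence[i]` refers to the i-th class in sorted label order.
    Items are drawn with replacement (an assumption of this sketch).
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    classes = sorted(by_class)
    if len(classes) != len(prevalence):
        raise ValueError("prevalence must have one entry per class")
    chosen = rng.choices(classes, weights=prevalence, k=sample_size)
    return [rng.choice(by_class[c]) for c in chosen]


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0, 1, 2, 3, 4] * 10  # toy data set with five ordinal classes
    prevalence = draw_prevalence(5, rng)
    print(prevalence)
    print(draw_sample_indices(labels, prevalence, 20, rng))
```

    Storing the drawn indices, one sample per row, is what makes a protocol like this reproducible, which matches the role of the index files described above.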

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
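
    Given that layout, the quantity a quantifier tries to estimate (the distribution of labels in a set) can be read off directly from labeled data. A minimal sketch follows, assuming a CSV with the "class_label" header column described above; the feature column names and the rows are invented for the example.

```python
import csv
import io
from collections import Counter

# Invented rows; only the "class_label" column name comes from the description.
SAMPLE_CSV = """class_label,feature_1,feature_2
1,0.3,1.2
2,0.1,0.7
2,0.9,0.4
3,0.5,0.5
"""


def label_distribution(csv_text):
    """Return the relative frequency of each ordinal class."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(int(row["class_label"]) for row in reader)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


if __name__ == "__main__":
    print(label_distribution(SAMPLE_CSV))
```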

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  5. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Howard Hughes Medical Institute - Janelia Research Campus
    Max Delbrück Center for Molecular Medicine
    German Cancer Research Center
    Max Delbrück Center
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

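    Open a zarr file with:

    import zarr
    zarr_path = "<path_to_zarr>"  # placeholder: path to one sample's .zarr file
    raw = zarr.open(zarr_path, mode='r', path="volumes/raw")
    seg = zarr.open(zarr_path, mode='r', path="volumes/gt_instances")

    # optional: convert to a numpy array explicitly
    import numpy as np
    raw_np = np.array(raw)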

    Zarr arrays are read lazily on demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also be converted explicitly to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
    gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(
            gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.
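
    As a small worked example of the first metric in the list, S is described as the average of avF1 and C. The sketch below only encodes that one line; how avF1 and C themselves are computed is defined in the paper and is not reproduced here. The function name and the input values are hypothetical.

```python
def ranking_score(av_f1, coverage):
    """S: the average of avF1 and C, as stated in the metrics list."""
    return (av_f1 + coverage) / 2


if __name__ == "__main__":
    # Hypothetical values, not results from the benchmark.
    print(ranking_score(0.4, 0.6))
```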

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
      title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
      author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
      year = 2024,
      eprint = {2404.00130},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  6. Main Objects Segmentation Dataset

    • gts.ai
    json
    Updated Jun 19, 2024
    Cite
    GTS (2024). Main Objects Segmentation Dataset [Dataset]. https://gts.ai/case-study/main-objects-segmentation-dataset-enhance-data-annotation/
    Explore at:
    Available download formats: json
    Dataset updated
    Jun 19, 2024
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The Main Objects Segmentation Dataset project focuses on curating a comprehensive dataset for training machine learning models in the field of computer vision.

  7. Mini Version of the Flagship Dataset of Type 2 Diabetes from the AI-READI...

    • staging.fairhub.io
    application/dicom
    Updated Nov 17, 2025
    + more versions
    Cite
    AI-READI Consortium (2025). Mini Version of the Flagship Dataset of Type 2 Diabetes from the AI-READI Project [Dataset]. http://doi.org/10.60775/fairhub.4
    Explore at:
    Available download formats: application/dicom (179.68 GB)
    Dataset updated
    Nov 17, 2025
    Dataset provided by
    FAIRhub
    Authors
    AI-READI Consortium
    License

    https://doi.org/10.5281/zenodo.17555036

    Dataset funded by
    National Institutes of Health
    Description

    This dataset contains data from 100 participants that was collected between July 19, 2023 and May 01, 2025. Data from multiple modalities are included. The data in this dataset contain no protected health information (PHI). Information related to the sex and race/ethnicity of the participants as well as medication used has also been removed. A detailed description of the dataset is available in the AI-READI documentation for v3.0.0 of the dataset at https://docs.aireadi.org

  8. GIS Resource Compilation Map Package - Applications of Machine Learning...

    • gdr.openei.org
    • data.openei.org
    • +3 more
    Updated Jun 1, 2021
    + more versions
    Cite
    Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren (2021). GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. http://doi.org/10.15121/1897037
    Explore at:
    Dataset updated
    Jun 1, 2021
    Dataset provided by
    Nevada Bureau of Mines and Geology
    Geothermal Data Repository
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Renewable Power Office. Geothermal Technologies Program (EE-4G)
    Authors
    Stephen Brown; Michael Fehler; Mark Coolbaugh; Sven Treitel; James Faulds; Bridget Ayling; Cary Lindsey; Rachel Micander; Eli Mlawsky; Connor Smith; John Queen; Chen Gu; John Akerley; Jacob DeAngelo; Jonathan Glen; Drew Siler; Erick Burns; Ian Warren
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Great Basin, Nevada
    Description

    This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data.

    See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.

  9. Corpus Nummorum - Coin Image Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 7, 2023
    + more versions
    Cite
    Corpus_Nummorum (2023). Corpus Nummorum - Coin Image Dataset [Dataset]. http://doi.org/10.5281/zenodo.10033993
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Corpus_Nummorum
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Corpus Nummorum - Coin Image Dataset

    This dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, Troad and Mysia. This is a selection of the coins published on the CN portal (due to copyrights).

    The dataset contains 115,160 images of about 29,000 unique coins. The images are split into three main folders, each assigning the coins differently. Each main folder is organized into subfolders which hold the coin images. The "dataset_coins" folder contains the coin photos divided into obverse and reverse and arranged by coin type. In the "dataset_types" folder, the obverse and reverse images of each coin are concatenated and transformed to a square format with black bars on the top and bottom. The images here are sorted by their coin type. The last folder, "dataset_mints", contains the same concatenated images sorted by their mint. A "sources" CSV file holds the sources for every image. Due to copyrights the image size is limited to 299×299 pixels. However, this should be sufficient for most ML approaches.

    The main purpose of this dataset in the CN project is the training of Machine Learning based Image Recognition models. We use three different Convolutional Neural Network based architectures: VGG16, VGG19 and ResNet50. On this dataset, our best model (VGG16) achieves a 79% Top-1 and a 97% Top-5 accuracy for coin type recognition. The mint recognition achieves a 79% Top-1 and a 94% Top-5 accuracy. We have a Colab notebook with two models (trained on the whole CN dataset) online.
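
    For readers unfamiliar with the Top-1 and Top-5 figures quoted above, the following minimal sketch shows how top-k accuracy is commonly computed from per-class scores. It is a generic illustration, not the CN project's evaluation code; the function name and the toy scores are assumptions.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true class is among the k highest scores.

    `scores[i][c]` is the model's score for class c on sample i;
    `true_labels[i]` is the index of the correct class for sample i.
    """
    hits = 0
    for row, truth in zip(scores, true_labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


if __name__ == "__main__":
    # Toy scores for three samples and three classes.
    scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    labels = [1, 1, 0]
    print(top_k_accuracy(scores, labels, 1))
    print(top_k_accuracy(scores, labels, 2))
```

    Top-k accuracy can only grow as k grows, which is why a Top-5 figure is always at least as high as the corresponding Top-1 figure.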

    During the summer semester 2023, we held the "Data Challenge" event at our Department of Computer Science at the Goethe-University. We gave our students this dataset with the task to achieve better results than us. Here are their experiments:

    Team 1: Voting and stacking of models

    Team 2: Multimodal model

    Team 3: Transformer models

    Team 4: Dockerized TIMM Computer Vision Backend & FastAPI

    • Approach | Type Dataset | Mint Dataset
    • Ours | 79% | 79%
    • Team 1 | - | 86%
    • Team 2 | 86% | -
    • Team 3 | 88% | 58%
    • Team 4 | - | -

    Now we would like to invite you to try out your own ideas and models on our coin data.

    If you have any questions or suggestions, please, feel free to contact us.

  10. WELFake dataset for fake news detection in text data

    • zenodo.org
    • data.europa.eu
    csv
    Updated Apr 9, 2021
    + more versions
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    Available download formats: csv
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles with 35,028 real and 37,106 fake news. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    Dataset contains four columns: Serial number (starting from 0); Title (about the text news heading); Text (about the news content); and Label (0 = fake and 1 = real).
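
    A quick sanity check on a file with this layout is to count rows per label. The sketch below assumes a CSV header of "serial", "title", "text" and "label"; these exact header names are hypothetical, and the rows are invented. Only the label coding (0 = fake, 1 = real) comes from the description.

```python
import csv
import io
from collections import Counter

# Invented rows; header names are assumptions for this sketch.
SAMPLE = """serial,title,text,label
0,Headline A,Body of article A,1
1,Headline B,Body of article B,0
2,Headline C,Body of article C,0
"""

LABEL_NAMES = {0: "fake", 1: "real"}  # coding taken from the description


def count_labels(csv_text):
    """Count how many rows carry each label, keyed by label name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(LABEL_NAMES[int(row["label"])] for row in reader)
    return dict(counts)


if __name__ == "__main__":
    print(count_labels(SAMPLE))
```

    On the full file, the two counts would be expected to match the 35,028 real and 37,106 fake articles stated above, which together give the 72,134 total.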

    The CSV file contains 78098 data entries, of which only 72134 are accessed as per the data frame.

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.

  11. Top Rated Movies Dataset (TMDB API).csv

    • kaggle.com
    zip
    Updated Nov 9, 2025
    + more versions
    Cite
    Muhammad_Atif_Khan181 (2025). Top Rated Movies Dataset (TMDB API).csv [Dataset]. https://www.kaggle.com/datasets/muhammadatifkhan181/top-rated-movies-in-tmdb-csv
    Explore at:
    Available download formats: zip (264990 bytes)
    Dataset updated
    Nov 9, 2025
    Authors
    Muhammad_Atif_Khan181
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains information about the top-rated movies fetched from The Movie Database (TMDB) API. The data includes key movie attributes such as movie ID, title, release date, popularity, vote count, and vote average.

    ✅ Total Pages Scraped: 500
    ✅ Total Movies Included: 10,000+
    ✅ Source: TMDB API
    ✅ Purpose: Educational and non-commercial use only

    The dataset can be used for:

    • Exploratory Data Analysis (EDA)
    • Machine Learning Projects
    • Recommendation Systems
    • Popularity Prediction
    • Sentiment and Trend Analysis
    • Data Visualization

    Please note:
    This product uses the TMDB API but is not endorsed or certified by TMDB.
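
    To give a rough idea of how a page-by-page collection like the one described above might look, here is a sketch that builds request URLs and flattens one response page. It is not the dataset author's script. It assumes the TMDB v3 "top rated movies" endpoint with `api_key` and `page` query parameters, performs no network access, and parses a made-up response whose field names mirror the attributes listed in the description.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint for top-rated movies in the TMDB v3 API.
BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"

FIELDS = ("id", "title", "release_date", "popularity", "vote_count", "vote_average")


def page_url(page, api_key):
    """Build the request URL for one result page."""
    return f"{BASE_URL}?{urlencode({'api_key': api_key, 'page': page})}"


def extract_rows(payload):
    """Keep only the attributes of interest from one response page."""
    return [{f: movie.get(f) for f in FIELDS} for movie in payload.get("results", [])]


# Made-up response body for illustration; not real TMDB data.
SAMPLE_RESPONSE = json.loads("""
{"page": 1,
 "results": [{"id": 1, "title": "Example Movie", "release_date": "2000-01-01",
              "popularity": 12.5, "vote_count": 1000, "vote_average": 8.1,
              "adult": false}]}
""")

if __name__ == "__main__":
    print(page_url(3, "YOUR_API_KEY"))  # placeholder key
    print(extract_rows(SAMPLE_RESPONSE))
```

    A real collection loop would call `page_url` for pages 1 through 500, fetch each URL with an HTTP client, and append the extracted rows to a CSV file.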

  12. Top Rated Movies Dataset (TMDb API)

    • kaggle.com
    zip
    Updated Nov 2, 2025
    + more versions
    Cite
    Shuvo Kundu (2025). Top Rated Movies Dataset (TMDb API) [Dataset]. https://www.kaggle.com/datasets/shuvokundu39/top-rated-movies-dataset-tmdb-api
    Explore at:
    Available download formats: zip (141810 bytes)
    Dataset updated
    Nov 2, 2025
    Authors
    Shuvo Kundu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset provides detailed information on top-rated movies collected from The Movie Database (TMDb) API. It contains key movie attributes such as title, popularity, average rating, vote count, overview, and adult content flag. The dataset is designed for data analysis, visualization, and machine learning applications such as movie recommendation systems, sentiment analysis, and popularity prediction.

    By exploring this dataset, users can gain insights into how audience ratings, popularity, and engagement vary across different films. It serves as a valuable resource for students, data scientists, and researchers who want to work

  13. Open Machine Learning Projects

    • kaggle.com
    zip
    Updated Mar 14, 2020
    Cite
    Prashant Banerjee (2020). Open Machine Learning Projects [Dataset]. https://www.kaggle.com/prashant111/open-machine-learning-projects
    Explore at:
    Available download formats: zip (4520 bytes)
    Dataset updated
    Mar 14, 2020
    Authors
    Prashant Banerjee
    Description

    DESCRIPTION

    Information about popular open source projects related to machine learning.

    SUMMARY

    The goal of this dataset is to better understand how open source machine learning projects evolve. Data collection date: early May 2018. Source: GitHub user interface and API. Contains original research.

    Presentation

    Columns

    • name: name of the project.
    • alignment: either corporate, academia or indie. Corporate projects are developed by professional engineers, typically have a dedicated development team and try to solve specific problems. Academic projects usually mention publications and support research. Independent projects are often a hobby.
    • company: name of the company if the alignment is corporate.
    • forecast: expected mid-term evolution of the project. 1 means positive, 0 means negative (stagnation) and -1 means factual death.
    • year: when the project was created. Defaults to the GitHub repository creation date but can be earlier; this is subject to manual adjustment.
    • code of conduct: whether the project has a code of conduct.
    • contributing: whether the project has a contributions guide.
    • stars: number of stargazers on GitHub.
    • issues: number of issues on GitHub, either open or closed.
    • contributors: number of contributors as reported by GitHub.
    • core: estimate of the core team, aka "bus factor".
    • team: number of people who commit to the project regularly.
    • commits: number of commits in the project.
    • team / all: ratio of the number of commits by the dedicated development team to the overall number of contributions. Indicates roughly which part of the project is owned by the internal developers.
    • link: URL of the project.
    • language: API language. multi means several languages.
    • implementation: the language which was mainly used for implementing the project.
    • license: license of the project.

  14. Weather Prediction

    • kaggle.com
    • zenodo.org
    zip
    Updated Mar 10, 2024
    Cite
    The Devastator (2024). Weather Prediction [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-prediction
    Explore at:
    zip(958204 bytes)Available download formats
    Dataset updated
    Mar 10, 2024
    Authors
    The Devastator
    Description

    Credit to the original author: The dataset was originally published here

    Weather prediction dataset

    A dataset for teaching machine learning and deep learning

    Hands-on teaching of modern machine learning and deep learning techniques relies heavily on well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to suit a large variety of training goals, many of which do not easily yield unrealistically high prediction accuracy. Teachers or instructors can thus choose the difficulty of the training goals and match it to the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop, so that they can be used in live hands-on sessions.

    The dataset can be found in the `\dataset` folder and be downloaded from zenodo: https://doi.org/10.5281/zenodo.4980359

    References

    If you make use of this dataset, in particular if this is in form of an academic contribution, then please cite the following two references:

    • Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. of Climatol., 22, 1441-1453. Data and metadata available at http://www.ecad.eu
    • Florian Huber, Dafne van Kuppevelt, Peter Steinbach, Colin Sauze, Yang Liu, Berend Weel, "Will the sun shine? – An accessible dataset for teaching machine learning and deep learning", DOI TO BE ADDED!

    Map of the locations of the 18 weather stations from which data was collected


  15. Ml basic project

    • kaggle.com
    zip
    Updated Jul 17, 2024
    Cite
    Krishna Swapnika (2024). Ml basic project [Dataset]. https://www.kaggle.com/datasets/krishnaswapnika/ml-basic-project/data
    Explore at:
    zip(578 bytes)Available download formats
    Dataset updated
    Jul 17, 2024
    Authors
    Krishna Swapnika
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Krishna Swapnika

    Released under Apache 2.0

    Contents

  16. 10K Most Popular Gaming 2025

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Only Python (2025). 10K Most Popular Gaming 2025 [Dataset]. https://www.kaggle.com/datasets/onlypythondatasheet/10k-most-popular-gaming-2025
    Explore at:
    zip(5489826 bytes)Available download formats
    Dataset updated
    Aug 21, 2025
    Authors
    Only Python
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    For More Visit: https://onlypython01.blogspot.com

    This dataset contains information on 10,000 of the most popular video games, curated from multiple sources. It is designed for data science, machine learning, and analytics projects in gaming, entertainment, and recommendation systems.

    The dataset includes:

    ID & Name – unique identifier and game title

    Release & Update Dates – when the game was originally released and last updated

    Rating & Suggestions Count – aggregated player ratings and number of community recommendations

    Platforms – supported consoles and systems (e.g., PC, PlayStation, Xbox, Switch, Mobile)

    Developers & Publishers – companies behind the games

    Genres – classification (RPG, FPS, Adventure, etc.)

    Image – cover art thumbnail URL for visualization

    Description – text summary of the game

    Potential Use Cases

    Exploratory analysis: study trends in ratings, genres, or release dates

    Machine Learning: build recommender systems for games

    NLP: analyze game descriptions & genres

    Visualization projects: timeline charts, platform distribution, developer networks

  17. TMDB Datasets

    • kaggle.com
    zip
    Updated Dec 6, 2024
    Cite
    Roman Nihal (2024). TMDB Datasets [Dataset]. https://www.kaggle.com/datasets/romannihal/tmdb-datasets
    Explore at:
    zip(5426529 bytes)Available download formats
    Dataset updated
    Dec 6, 2024
    Authors
    Roman Nihal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This collection of datasets was created by fetching data from the TMDB (The Movie Database) API and performing extensive cleaning to ensure usability for data analysis and machine learning projects. It comprises three distinct datasets:

    • tmdb_popular_movies: Contains 13,144 entries featuring the most popular movies.
    • tmdb_top_rated_movies: Contains 12,525 entries highlighting top-rated movies.
    • tmdb_upcoming_movies: Contains 11,959 entries showcasing upcoming movie releases.

    Each dataset is structured with the following columns:

    • id: Unique identifier for each movie.
    • title: The title of the movie.
    • overview: A brief description of the movie's plot.
    • release_date: The movie's release date.
    • popularity: A numeric value indicating the movie's popularity on TMDB.
    • vote_average: Average rating given by TMDB users.
    • vote_count: Total number of votes received.

    Key Features

    • Versatile Datasets: Covers popular, highly rated, and upcoming movies for diverse use cases.
    • Cleaned and Preprocessed: Free from missing or duplicate values, making it ready for immediate analysis.
    • Applications: Ideal for building recommendation systems, sentiment analysis, popularity prediction models, and more.

    These datasets were created to provide reliable resources for academic and professional projects in the fields of data science and machine learning.
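    The seven columns listed in the description can be mapped to typed records as in the sketch below. It rests on assumptions not confirmed by the listing: that the files are CSV with a header row using exactly these column names, and that release_date is in ISO YYYY-MM-DD form. The Movie class, the parse_movies function and the sample row are hypothetical placeholders, not real dataset content.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Movie:
    """One row, using the column names given in the dataset description."""
    id: int
    title: str
    overview: str
    release_date: date
    popularity: float
    vote_average: float
    vote_count: int


def parse_movies(text: str) -> list[Movie]:
    """Parse CSV text with a header row into typed Movie records.

    Assumes ISO-formatted release dates (YYYY-MM-DD); this is an
    assumption about the files, not something the description states.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [
        Movie(
            id=int(row["id"]),
            title=row["title"],
            overview=row["overview"],
            release_date=date.fromisoformat(row["release_date"]),
            popularity=float(row["popularity"]),
            vote_average=float(row["vote_average"]),
            vote_count=int(row["vote_count"]),
        )
        for row in reader
    ]


# Placeholder row for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "id,title,overview,release_date,popularity,vote_average,vote_count\n"
    "1,Example Title,A placeholder plot summary.,2020-01-31,12.5,7.8,1500\n"
)

if __name__ == "__main__":
    for movie in parse_movies(SAMPLE_CSV):
        print(movie.title, movie.release_date.year, movie.vote_average)
```

    Converting the fields up front means a malformed value raises an error at load time, which is one way to check the "free from missing or duplicate values" claim before analysis.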

  18. Manga Dataset (title/genre/rating)

    • kaggle.com
    zip
    Updated Jul 18, 2022
    Cite
    Clyde Melton (2022). Manga Dataset (title/genre/rating) [Dataset]. https://www.kaggle.com/datasets/clydemelton/manga-dataset
    Explore at:
    zip(10027 bytes)Available download formats
    Dataset updated
    Jul 18, 2022
    Authors
    Clyde Melton
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data scraped from Mangakalot. I originally decided to create this dataset for use in a recommendation system for manga titles. Other datasets that I found were either missing information that I wanted to use to build this system or contained too small a sample to build what I deemed a useful product. This is also my first attempt at web scraping (I'm also fairly new to Python and data science), so I wanted to start with a simple project to learn the basics. I hope it proves useful to someone.

  19. TMDB Top 8550 Movies Metadata 2025

    • kaggle.com
    zip
    Updated May 24, 2025
    Cite
    Sufi Inam Ul Hassan (2025). TMDB Top 8550 Movies Metadata 2025 [Dataset]. https://www.kaggle.com/datasets/sufiinamulhassan/tmdb-top-8550-movies-metadata
    Explore at:
    zip(1239492 bytes)Available download formats
    Dataset updated
    May 24, 2025
    Authors
    Sufi Inam Ul Hassan
    Description

    📄 Description

    This dataset contains metadata for the top 8,550 movies listed on The Movie Database (TMDB). Each entry includes valuable information such as:

    • 🎬 Title
    • 📅 Release Date
    • 🎭 Genres
    • 🌐 Original Language
    • Average Rating
    • 📈 Popularity Score
    • 🗳️ Vote Count
    • 🧾 Overview / Synopsis

    ✅ The dataset is ideal for:

    1. Exploratory Data Analysis (EDA)
    2. Building Recommendation Systems
    3. Popularity Trend Analysis
    4. Sentiment or Genre-based Analysis
    5. Predictive Modeling & Machine Learning

    It serves as a great resource for data scientists, analysts, machine learning practitioners, and film enthusiasts interested in movie metadata.

    📚 Use Cases

    Here are a few ideas for how to use this dataset:

    • 📌 Build a Movie Recommender System
    • 📌 Compare Trends Over Time (Genres, Ratings, etc.)
    • 📌 Visualize Rating Distributions by Year
    • 📌 Cluster Movies Based on Metadata

    🔍 Source

    All data is sourced from the TMDB API and reflects the top-rated or most popular movies available at the time of collection.

    📢 Disclaimer

    This dataset is intended for educational and research purposes only. All movie data and assets belong to their respective copyright holders and TMDB.

  20. Emotion Prediction with Quantum5 Neural Network AI

    • kaggle.com
    zip
    Updated Oct 19, 2025
    Cite
    EMİRHAN BULUT (2025). Emotion Prediction with Quantum5 Neural Network AI [Dataset]. https://www.kaggle.com/datasets/emirhanai/emotion-prediction-with-semi-supervised-learning
    Explore at:
    zip(2332683 bytes)Available download formats
    Dataset updated
    Oct 19, 2025
    Authors
    EMİRHAN BULUT
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Emotion Prediction with Quantum5 Neural Network AI Machine Learning - By Emirhan BULUT

    V1

    I have created artificial intelligence software that predicts emotion from the text you write, using the Semi Supervised Learning method and the RC algorithm. I used very simple code, and the software focuses on solving the problem. I aim to create the second version of the software using an RNN (Recurrent Neural Network). I hope it serves as an example you can use in your thesis and projects.

    V2

    I decided to apply a technique that I had developed to the emotion dataset on which I had previously used Semi-Supervised learning. This technique is produced according to Quantum5 laws. I developed smart artificial intelligence software that can predict emotion with Quantum5 neuronal networks. I share this software with all humanity as open source on Kaggle. It is my first open source project on an NLP system with Quantum technology. Developing the NLP system with Quantum technology is very exciting!

    Happy learning!

    Emirhan BULUT

    Head of AI and AI Inventor

    Emirhan BULUT. (2022). Emotion Prediction with Quantum5 Neural Network AI [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/2129637

    The coding language used:

    Python 3.9.8

    Libraries Used:

    Keras

    Tensorflow

    NumPy

    Pandas

    Scikit-learn (SKLEARN)

    Image: Emotion Prediction with Quantum5 Neural Network on AI - Emirhan BULUT (https://raw.githubusercontent.com/emirhanai/Emotion-Prediction-with-Semi-Supervised-Learning-of-Machine-Learning-Software-with-RC-Algorithm---By/main/Quantum%205.png)

    Image: Emotion Prediction with Semi Supervised Learning of Machine Learning Software with RC Algorithm - Emirhan BULUT (https://raw.githubusercontent.com/emirhanai/Emotion-Prediction-with-Semi-Supervised-Learning-of-Machine-Learning-Software-with-RC-Algorithm---By/main/Emotion%20Prediction%20with%20Semi%20Supervised%20Learning%20of%20Machine%20Learning%20Software%20with%20RC%20Algorithm%20-%20By%20Emirhan%20BULUT.png)

    Developer Information:

    Name-Surname: Emirhan BULUT

    Contact (Email) : emirhan@isap.solutions

    LinkedIn : https://www.linkedin.com/in/artificialintelligencebulut/

    Kaggle: https://www.kaggle.com/emirhanai

    Official Website: https://www.emirhanbulut.com.tr
