100+ datasets found

Machine Learning Tutorials - Example Projects - AI
kaggle.com
zip
Updated Oct 20, 2022
Cite
EMİRHAN BULUT (2022). Machine Learning Tutorials - Example Projects - AI [Dataset]. https://www.kaggle.com/datasets/emirhanai/machine-learning-tutorials-example-projects-ai
Explore at:
Available download formats: zip (1587192509 bytes)
Dataset updated
Oct 20, 2022
Authors
EMİRHAN BULUT
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Machine Learning Tutorials - Example Projects - AI

I am sharing my 28 Machine Learning and Deep Learning (Artificial Intelligence, AI) projects, with their data, software and outputs, on Kaggle as open source for educational purposes. They are aimed at all levels: people who want to work in this field, people with no Machine Learning knowledge, people with intermediate knowledge, and specialists. The deep learning projects are advanced, so I recommend starting with the Machine Learning section. You can check your own outputs against the outputs included. I am happy to share 28 educational projects with the whole world through Kaggle. Knowledge is free and better when shared!

Algorithms used:

1) Nearest Neighbor
2) Naive Bayes
3) Decision Trees
4) Linear Regression
5) Support Vector Machines (SVM)
6) Neural Networks
7) K-means clustering

Kind regards, Emirhan BULUT

You can use the links below for communication. If you have any questions or comments, feel free to let me know!

LinkedIn: https://www.linkedin.com/in/artificialintelligencebulut/
Email: emirhan@novosteer.com

Emirhan BULUT. (2022). Machine Learning Tutorials - Example Projects - AI [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/4361310
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
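
Since NICHE is proposed as a benchmark for classifiers, it may help to know the accuracy that a trivial classifier reaches on these labels. The sketch below uses only the two label counts stated in the description (441 engineered, 131 non-engineered); the function and variable names are illustrative and are not part of NICHE.

```python
# Label counts as stated in the NICHE description.
ENGINEERED = 441
NON_ENGINEERED = 131


def majority_baseline_accuracy(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total


counts = {"engineered": ENGINEERED, "non-engineered": NON_ENGINEERED}
baseline = majority_baseline_accuracy(counts)

# Prints: 572 projects, majority baseline = 0.771
print(f"{sum(counts.values())} projects, majority baseline = {baseline:.3f}")
```

Because the classes are imbalanced, a classifier evaluated on this dataset would need to exceed roughly 77% accuracy to beat the rule "always predict engineered".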
Data from: Project Machine Learning Dataset
universe.roboflow.com
zip
Updated Jun 6, 2024
Cite
soda (2024). Project Machine Learning Dataset [Dataset]. https://universe.roboflow.com/soda-fj5ov/project-machine-learning-8sjsi
Explore at:
Available download formats: zip
Dataset updated
Jun 6, 2024
Dataset authored and provided by
soda
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Deteksi Rempah Rempah Bounding Boxes
Description
Project Machine Learning

## Overview
Project Machine Learning is a dataset for object detection tasks. It contains Deteksi Rempah Rempah (Indonesian for "spice detection") annotations for 1,270 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Data from: USGS Contributions to the Nevada Geothermal Machine Learning...
catalog.data.gov
datasets.ai
Updated Nov 19, 2025
Cite
U.S. Geological Survey (2025). USGS Contributions to the Nevada Geothermal Machine Learning Project (DE-FOA-0001956): Slip and Dilation Tendency Data [Dataset]. https://catalog.data.gov/dataset/usgs-contributions-to-the-nevada-geothermal-machine-learning-project-de-foa-0001956-slip-a
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This package contains data in a portion of northern Nevada, the extent of the ‘Nevada Machine Learning Project’ (DE-EE0008762). Slip tendency (TS) and dilation tendency (TD) were calculated for all the faults in the Nevada ML study area. TS is the ratio between the shear components of the stress tensor and the normal components of the stress tensor acting on a fault plane. TD is the ratio of all the components of the stress tensor that are normal to a fault plane. Faults with higher TD are relatively more likely to dilate and host open, conductive fractures. Faults with higher TS are relatively more likely to slip, and these fractures may be propped open and conductive. These values of TS and TD were used to update a map surface from the Nevada Geothermal Machine Learning Project (DE-FOA-0001956) that used less reliable estimates for TS and TD. The new map surface was generated using the same procedure as the old surface, just with the new TS and TD data values.
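
To make the two quantities concrete, the sketch below computes TS and TD for a single fault plane. It assumes the definitions commonly used in the literature, TS = τ/σn and TD = (σ1 − σn)/(σ1 − σ3), where σn and τ are the normal and shear stress on the plane. The description gives these only in words, so the formulas, the function name, and the stress values are assumptions for illustration and not the USGS procedure.

```python
import math


def slip_and_dilation_tendency(s1, s2, s3, normal):
    """Return (TS, TD) for a fault plane.

    s1 >= s2 >= s3 are principal compressive stresses, assumed to be aligned
    with the x, y and z axes; `normal` is the fault-plane normal in that frame.
    """
    length = math.sqrt(sum(c * c for c in normal))
    n = [c / length for c in normal]                    # unit normal
    traction = [s1 * n[0], s2 * n[1], s3 * n[2]]        # stress tensor times normal
    sigma_n = sum(t * c for t, c in zip(traction, n))   # normal stress on the plane
    traction_sq = sum(t * t for t in traction)
    tau = math.sqrt(max(traction_sq - sigma_n ** 2, 0.0))  # shear stress on the plane
    ts = tau / sigma_n                    # slip tendency (assumed definition)
    td = (s1 - sigma_n) / (s1 - s3)       # dilation tendency (assumed definition)
    return ts, td


# Illustrative stresses (arbitrary units) and a plane whose normal lies
# halfway between the s1 and s3 axes.
ts, td = slip_and_dilation_tendency(60.0, 40.0, 20.0, (1.0, 0.0, 1.0))
print(f"TS = {ts:.2f}, TD = {td:.2f}")
```

Under these assumed definitions, a plane whose normal is parallel to σ3 has TD = 1 (most likely to dilate), and a plane whose normal is parallel to σ1 has TD = 0.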
Data from: Web Data Commons Training and Test Sets for Large-Scale Product...
linkagelibrary.icpsr.umich.edu
da-ra.de
Updated Nov 26, 2020
Cite
Ralph Peeters; Anna Primpeli; Christian Bizer (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 [Dataset]. http://doi.org/10.3886/E127481V1
Explore at:
Unique identifier
https://doi.org/10.3886/E127481V1
Dataset updated
Nov 26, 2020
Dataset provided by
University of Mannheim (Germany)
Authors
Ralph Peeters; Anna Primpeli; Christian Bizer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of ids are available for each training set for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
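
The "stratified random draw" used for the validation ids can be illustrated as follows. This is a minimal sketch of the general technique, not the WDC authors' script: the function name, the 20% fraction, the seed, and the toy pair ids are all assumptions.

```python
import random
from collections import defaultdict


def stratified_validation_ids(pairs, fraction, seed=0):
    """Pick validation ids so each label keeps the same share as in `pairs`.

    `pairs` is a list of (pair_id, label) tuples.
    """
    by_label = defaultdict(list)
    for pair_id, label in pairs:
        by_label[label].append(pair_id)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = set()
    for ids in by_label.values():
        k = round(len(ids) * fraction)      # per-label sample size
        chosen.update(rng.sample(ids, k))   # draw without replacement
    return chosen


# Toy training set: 20 "match" pairs (ids 0-19) and 80 "no match" pairs (ids 20-99).
pairs = [(i, "match") for i in range(20)] + [(i, "no match") for i in range(20, 100)]
val_ids = stratified_validation_ids(pairs, fraction=0.2)
print(len(val_ids), sum(1 for i in val_ids if i < 20))
```

Sampling each label separately keeps the match/no-match ratio of the validation split equal to that of the full training set, which a plain random draw does not guarantee.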
Riga Data Science Club
kaggle.com
zip
Updated Mar 29, 2021
Cite
Dmitry Yemelyanov (2021). Riga Data Science Club [Dataset]. https://www.kaggle.com/datasets/dmitryyemelyanov/rigadsclub
Explore at:
Available download formats: zip (494849 bytes)
Dataset updated
Mar 29, 2021
Authors
Dmitry Yemelyanov
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Riga
Description
Context

Riga Data Science Club is a non-profit organisation for sharing ideas and experience and building machine learning projects together. A data science community should know its own data, so this is a dataset about ourselves: our website analytics, social media activity, Slack statistics and even meetup transcriptions!

Content

The dataset is split into several folders by context:
* linkedin - company page visitor, follower and post stats
* slack - messaging and member activity
* typeform - new member responses
* website - website visitors by country, language, device, operating system, screen resolution
* youtube - meetup transcriptions

Inspiration

Let's make Riga Data Science Club better! We expect this data to bring lots of insights on how to improve.

"Know your c̶u̶s̶t̶o̶m̶e̶r̶ member"
* Explore member interests by analysing sign-up survey (typeform) responses
* Explore messaging patterns in Slack to understand how members are retained and when they are lost

Social media intelligence
* Define LinkedIn posting strategy based on historical engagement data
* Define target user profile based on LinkedIn page attendance data

Website
* Define website localisation strategy based on data about visitor countries and languages
* Define website responsive design strategy based on data about visitor devices, operating systems and screen resolutions

Have some fun
* NLP analysis of meetup transcriptions: word frequencies, question answering, something else?
NYC_building_energy_data
kaggle.com
zip
Updated Mar 4, 2020
Cite
Maksym Dubovyi (2020). NYC_building_energy_data [Dataset]. https://www.kaggle.com/maxbrain/nyc-building-energy-data
Explore at:
Available download formats: zip (9476304 bytes)
Dataset updated
Mar 4, 2020
Authors
Maksym Dubovyi
License
http://opendatacommons.org/licenses/dbcl/1.0/
Area covered
New York
Description
In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here and the one sentence summary is:

Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.

This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.

Supervised problem: we are given both the features and the target
Regression problem: the target is a continuous variable, in this case ranging from 0-100

During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!

Machine Learning Workflow

Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:

1) Data cleaning and formatting
2) Exploratory data analysis
3) Feature engineering and selection
4) Establish a baseline and compare several machine learning models on a performance metric
5) Perform hyperparameter tuning on the best model to optimize it for the problem
6) Evaluate the best model on the testing set
7) Interpret the model results to the extent possible
8) Draw conclusions and write a well-documented report

Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure and so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!
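
The "establish a baseline" step of the workflow can be sketched in a few lines. This example assumes a naive baseline that always predicts the median training score, and mean absolute error as the performance metric. Both choices, and the toy scores, are illustrative assumptions and are not taken from the description.

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up Energy Star scores (0-100), split into training and testing sets.
train_scores = [10, 35, 50, 62, 75, 90]
test_scores = [20, 55, 80]

# Naive baseline: ignore the features and always guess the training median.
baseline_guess = median(train_scores)
baseline_mae = mean_absolute_error(test_scores, [baseline_guess] * len(test_scores))

print(f"baseline guess = {baseline_guess}, baseline MAE = {baseline_mae:.2f}")
```

Any trained model should achieve a lower error than this baseline on the same testing set; if it does not, the features are adding little beyond the overall distribution of the target.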

This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.

With this outline in place to guide us, let's get started!
M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and...
datarade.ai
Updated Sep 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
M-ART (2025). M-ART | Video Data | Global | 100,000 Stock videos | Including metadata and releases | Dataset for AI & ML [Dataset]. https://datarade.ai/data-products/m-art-video-data-global-100-000-stock-videos-includin-m-art
Explore at:
Available download formats: .csv, .jpeg, .mp4, .mov
Dataset updated
Sep 11, 2025
Dataset authored and provided by
M-ART
Area covered
Curaçao, Andorra, Benin, Chad, Paraguay, Estonia, Tunisia, Bangladesh, Saint Helena, El Salvador
Description
"Collection of 100,000 high-quality video clips across diverse real-world domains, designed to accelerate the training and optimization of computer vision and multimodal AI models."

Overview

This dataset contains 100,000 proprietary and partner-produced video clips filmed in 4K/6K with cinema-grade RED cameras. Each clip is commercially cleared with full releases, structured metadata, and available in RAW or MOV/MP4 formats. The collection spans a wide variety of domains: people and lifestyle, healthcare and medical, food and cooking, office and business, sports and fitness, nature and landscapes, education, and more. This breadth ensures robust training data for computer vision, multimodal, and machine learning projects.

The data set

All 100,000 videos have been reviewed for quality and compliance. The dataset is optimized for AI model training, supporting use cases from face and activity recognition to scene understanding and generative AI. Custom datasets can also be produced on demand, enabling clients to close data gaps with tailored, high-quality content.

About M-ART

M-ART is a leading provider of cinematic-grade datasets for AI training. With extensive expertise in large-scale content production and curation, M-ART delivers both ready-to-use video datasets and fully customized collections. All data is proprietary, rights-cleared, and designed to help global AI leaders accelerate research, development, and deployment of next-generation models.
Data from: Machine Learning Final Project Dataset
universe.roboflow.com
zip
Updated Jul 17, 2025
Cite
University (2025). Machine Learning Final Project Dataset [Dataset]. https://universe.roboflow.com/university-onsbq/machine-learning-final-project/dataset/4
Explore at:
Available download formats: zip
Dataset updated
Jul 17, 2025
Dataset authored and provided by
University
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Objects Polygons
Description
Machine Learning Final Project

## Overview
Machine Learning Final Project is a dataset for instance segmentation tasks. It contains Objects annotations for 599 images.

## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision...
datarade.ai
Updated Jul 10, 2024
Cite
FileMarket (2024). FileMarket | Dataset for Face Anti-Spoofing (Videos) in Computer Vision Applications | Machine Learning (ML) Data | Deep Learning (DL) Data [Dataset]. https://datarade.ai/data-products/filemarket-dataset-for-face-anti-spoofing-videos-in-compu-filemarket
Explore at:
Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
Dataset updated
Jul 10, 2024
Dataset authored and provided by
FileMarket
Area covered
Belarus, South Africa, Malawi, Zimbabwe, Central African Republic, Ukraine, Chad, Congo (Democratic Republic of the), Sierra Leone, Mali
Description
Live Face Anti-Spoof Dataset

A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.

Key Features:

Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.

In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:

Object Detection Data: Perfect for training AI in image and video analysis.
Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.

This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.
Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...
data.mendeley.com
Updated Apr 2, 2024
Cite
Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
Explore at:
Unique identifier
https://doi.org/10.17632/x3fz2nfm4w.3
Dataset updated
Apr 2, 2024
Authors
Md Shoaib Ahmed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

Fresh salmon 🐟
Infected Salmon 🐠

This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

So, dive in and explore the fascinating world of salmon health and disease!

The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

Fresh salmon (healthy fish with no visible signs of disease), 456 images

Infected Salmon containing disease, 752 images

Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

Data Preprocessing

The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.

Image Augmentation 📸: To overcome the small number of images, various image augmentation techniques were applied to the input images. These included:

Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
Rotation 🔄: The images were rotated to create additional samples.
Cropping 🪓: A portion of the image was randomly cropped to create additional samples.
Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
Shearing 🌆: The images were sheared to create additional samples.
Contrast Adjustment (Gamma) ⚖️: Gamma correction was applied to the images to adjust their contrast.
Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.
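
Three of the augmentations listed above can be sketched on a tiny grayscale image stored as nested lists. This is an illustration of the general techniques only: the function names, the 2x3 toy image, and the gamma value are assumptions, and the dataset authors' actual tooling and parameters are not stated.

```python
def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom by reversing the row order."""
    return [row[:] for row in image[::-1]]


def gamma_adjust(image, gamma):
    """Apply gamma correction to 8-bit pixel values (0-255)."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


# Toy 2x3 grayscale image with 8-bit pixel values.
image = [
    [0, 64, 128],
    [255, 32, 16],
]

flipped_h = horizontal_flip(image)   # [[128, 64, 0], [16, 32, 255]]
flipped_v = vertical_flip(image)     # [[255, 32, 16], [0, 64, 128]]
darker = gamma_adjust(image, 2.0)    # gamma > 1 darkens mid-tones; 0 and 255 are unchanged
```

Each function returns a new image and leaves the input untouched, so one original can yield several augmented samples.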

Usage

To use the salmon scan dataset in your ML and DL projects, follow these steps:

Clone or download the salmon scan dataset repository from GitHub.

Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.

Split the dataset into training, validation, and test sets as per your requirement.

Preprocess the data as needed, such as resizing and normalizing the images.

Train your ML/DL model using the preprocessed training data.

Evaluate the model on the test set and make predictions on new, unseen data.
Data from: FISBe: A real-world benchmark dataset for instance segmentation...
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Apr 2, 2024
Cite
Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
Explore at:
Dataset updated
Apr 2, 2024
Dataset provided by
German Cancer Research Center
Max Delbrück Center
Howard Hughes Medical Institute - Janelia Research Campus
Max Delbrück Center for Molecular Medicine
Authors
Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
General

For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

Summary

A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

30 completely labeled (segmented) images

71 partly labeled images

altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

A set of metrics and a novel ranking score for respective meaningful method benchmarking

An evaluation of three baseline methods in terms of the above metrics and score

Abstract

Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

Dataset documentation:

We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

FISBe Datasheet

Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

Files

fisbe_v1.0_{completely,partly}.zip

contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

fisbe_v1.0_mips.zip

maximum intensity projections of all samples, for convenience.

sample_list_per_split.txt

a simple list of all samples and the subset they are in, for convenience.

view_data.py

a simple python script to visualize samples, see below for more information on how to use it.

dim_neurons_val_and_test_sets.json

a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

Readme.md

general information

How to work with the image files

Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.
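
To make the CZYX layout concrete: indexing the first axis selects one channel, which leaves a ZYX volume. The sketch below shows a maximum intensity projection of such a volume along Z, using plain nested lists so it runs without the dataset. The function name and toy values are illustrative, and projecting along Z is an assumption about how the provided MIPs were made.

```python
def max_intensity_projection(volume):
    """Collapse the Z axis of a ZYX volume (nested lists) into a YX image."""
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [max(volume[z][y][x] for z in range(depth)) for x in range(width)]
        for y in range(height)
    ]


# Toy single-channel volume with Z=2, Y=2, X=2.
volume = [
    [[1, 5], [0, 2]],   # z = 0
    [[3, 4], [7, 1]],   # z = 1
]

mip = max_intensity_projection(volume)
print(mip)  # [[3, 5], [7, 2]]
```

With a NumPy array the same projection of one channel `c` would be `np.asarray(raw[c]).max(axis=0)`.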

We recommend working in a virtual environment, e.g., by using conda:

conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env

How to open zarr files

Install the python zarr package:

pip install zarr

Open a zarr file with:

import zarr
# <path_to_zarr> is a placeholder for the path to one sample's zarr file
raw = zarr.open(<path_to_zarr>, mode='r', path="volumes/raw")
seg = zarr.open(<path_to_zarr>, mode='r', path="volumes/gt_instances")

# optional:
import numpy as np
raw_np = np.array(raw)

Zarr arrays are read lazily on demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also be explicitly converted to numpy arrays.

How to view zarr image files

We recommend using napari to view the image data.

Install napari:

pip install "napari[all]"

Save the following Python script:

import zarr, sys, napari

raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(
        gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()

Execute:

python view_data.py /R9F03-20181030_62_B5.zarr

Metrics

S: Average of avF1 and C

avF1: Average F1 Score

C: Average ground truth coverage

clDice_TP: Average true positives clDice

FS: Number of false splits

FM: Number of false merges

tp: Relative number of true positives

For more information on our selected metrics and formal definitions please see our paper.

Baseline

To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

License

The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Citation

If you use FISBe in your research, please use the following BibTeX entry:

@misc{mais2024fisbe,
  title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
  author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
  year = 2024,
  eprint = {2404.00130},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV}
}

Acknowledgments

We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

Changelog

There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

Contributing

If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.

All contributions are welcome!
AS PhD data for Machine Learning-based Quantitative Grounded Theory: A New...
acquire.cqu.edu.au
researchdata.edu.au
zip
Updated Mar 26, 2025
Cite
Abhishek Sheetal (2025). AS PhD data for Machine Learning-based Quantitative Grounded Theory: A New Paradigm for Management Research [Dataset]. http://doi.org/10.25946/23577792.v1
Available download formats: zip
Unique identifier
https://doi.org/10.25946/23577792.v1
Dataset updated
Mar 26, 2025
Dataset provided by
CQUniversity
Authors
Abhishek Sheetal
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
In this project, I will analyze large publicly available datasets using machine learning to reveal new associations that can help refine existing theories or develop new theories in the social and management sciences. In the first project, I discuss some of the limitations of traditional statistical approaches and demonstrate how we can solve them using machine learning. In the second project, I demonstrate how machine learning can sieve through a large amount of data to identify patterns. In the third project, I document that machine learning models can be used to generate hypotheses that are subsequently validated by traditional methods (e.g., correlational and experimental studies). Machine learning models take a long time to build, requiring considerable software writing. However, these models are reusable. In the fourth project, I demonstrate how a machine learning model built in the third project can be reused for a different topic.
Data from: A large-scale comparative analysis of Coding Standard conformance...
figshare.com
application/x-gzip
Updated Oct 4, 2021
Cite
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.12377237.v3
Dataset updated
Oct 4, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.

results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.

notebooks_out.tar.gz: Tables and figures generated by notebooks.

source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.

The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories

Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680

Preprint: https://arxiv.org/abs/2007.08978
Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...
data.europa.eu
unknown
Updated Feb 28, 2021
Cite
Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
Available download formats: unknown (395470535)
Dataset updated
Feb 28, 2021
Dataset authored and provided by
Zenodo (http://zenodo.org/)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was gathered on Sep. 17th 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check out the file ManyTypes4PyDataset.spec for the repositories' URLs and their commit SHAs. The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file. All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
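
The split file can be read with a few lines of Python. The sketch below groups file paths by their assigned set. It assumes that each row of dataset_split.csv has the form `<set name>,<file path>` with no header row; this layout, the function name and the sample rows are assumptions, so check the actual file before relying on it.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def group_files_by_set(csv_text: str) -> Dict[str, List[str]]:
    """Group source-file paths by the split they are assigned to.

    Assumes each row is `<set name>,<file path>` with no header row.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        set_name, file_path = row[0].strip(), row[1].strip()
        groups[set_name].append(file_path)
    return dict(groups)


# Hypothetical rows standing in for the contents of dataset_split.csv.
sample = (
    "train,repos/a/x.py\n"
    "valid,repos/a/y.py\n"
    "test,repos/b/z.py\n"
    "train,repos/b/w.py\n"
)
splits = group_files_by_set(sample)
print({name: len(paths) for name, paths in splits.items()})
```

For the real file, pass the text read from disk, for example `group_files_by_set(open("dataset_split.csv", encoding="utf-8").read())`.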
Data from: USGS Contributions to the Nevada Geothermal Machine Learning...
catalog.data.gov
data.usgs.gov
+1 more
Updated Nov 20, 2025
Cite
U.S. Geological Survey (2025). USGS Contributions to the Nevada Geothermal Machine Learning Project (DE-FOA-0001956): Heat Flow Data [Dataset]. https://catalog.data.gov/dataset/usgs-contributions-to-the-nevada-geothermal-machine-learning-project-de-foa-0001956-heat-f
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Nevada
Description
This package contains a map surface that depicts the estimated spatial variation of conductive heat flow (mW/m²) in a portion of northern Nevada, the extent of the ‘Nevada Machine Learning Project’ (DE-EE0008762). It was generated using well locations that had an estimated heat flow value from a measured thermal gradient and thermal conductivity, mainly using data from Southern Methodist University, with some additional USGS data. Well data are included, along with a map surface depicting the estimated standard error of the heat flow interpolation.
Tyre FaultFindy
kaggle.com
zip
Updated Jan 9, 2025
Cite
Ranjan kumar pradhan (2025). Tyre FaultFindy [Dataset]. https://www.kaggle.com/datasets/rpjinu/tyre-faultfindy
Available download formats: zip (2859089280 bytes)
Dataset updated
Jan 9, 2025
Authors
Ranjan kumar pradhan
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
FaultFindy (Build intelligence using Machine Learning to predict the faulty tyre in manufacturing)

The objective of this project is to develop an intelligent system using deep learning to predict the faults in manufacturing processes. By analyzing various manufacturing parameters and process data, the system will predict the faulty tyre generated during production. This predictive capability will enable manufacturers to proactively optimize their processes, reduce waste, and improve overall production efficiency.

Focus Areas:

Data Collection: Gather historical manufacturing data, including the corresponding good and faulty tyre images.

Data Preprocessing: Clean, preprocess, and transform the data to make it suitable for deep learning models.

Feature Engineering: Extract relevant features and identify key process variables that impact faulty tyre generation.

Model Selection: Choose appropriate machine learning algorithms for faulty tyre prediction.

Model Training: Train the selected models using the preprocessed data.

Model Evaluation: Assess the performance of the trained models using appropriate evaluation metrics.

Hyperparameter Tuning: Optimize model hyperparameters to improve predictive accuracy.

Tasks/Activities List:

Data Collection:
- Gather historical manufacturing data, including good and faulty images.
- Ensure data quality, handle missing values, and remove outliers.

Data Preprocessing:
- Clean and preprocess the data to remove noise and inconsistencies.

Feature Engineering:
- Identify important features and process variables that influence fault.
- Engineer relevant features to capture patterns and correlations.

Model Selection:
- Choose appropriate machine and deep learning algorithms.
- Consider models like logistic regression, decision trees, random forests, or gradient boosting, CNN, computer vision.

Model Training:
- Split the data into training and testing sets.
- Train the selected machine learning models on the training data.

Model Evaluation:
- Evaluate the models' performance using relevant metrics.
- Choose the best-performing model for deployment.

Hyperparameter Tuning:
- Fine-tune hyperparameters of the selected model to optimize performance.
- Use techniques like grid search or random search for hyperparameter optimization.

Success Metrics:
- The predictive model should achieve high accuracy
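
The split, train, evaluate and tune steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: it swaps in a k-nearest-neighbour classifier on two made-up numeric features per tyre instead of a CNN on tyre images, and all names and values are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Tuple[float, float], str]  # (feature vector, label)


def holdout_split(data: Sequence[Sample], fraction: float, seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle a copy of the data and hold out `fraction` of it."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, int(len(shuffled) * fraction))
    return shuffled[n_held:], shuffled[:n_held]


def knn_predict(train: Sequence[Sample], point: Tuple[float, float], k: int) -> str:
    """Majority vote among the k training samples closest to `point`."""
    nearest = sorted(
        train,
        key=lambda s: (s[0][0] - point[0]) ** 2 + (s[0][1] - point[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train: Sequence[Sample], held_out: Sequence[Sample], k: int) -> float:
    """Fraction of held-out samples whose label is predicted correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in held_out) / len(held_out)


# Toy data: two made-up numeric features per tyre, not real measurements.
rng = random.Random(0)
data: List[Sample] = [((rng.gauss(0, 1), rng.gauss(0, 1)), "good") for _ in range(30)]
data += [((rng.gauss(3, 1), rng.gauss(3, 1)), "faulty") for _ in range(30)]

# Hold out a test set, then carve a validation set out of the remainder.
train_val, test = holdout_split(data, fraction=0.2, seed=1)
train, val = holdout_split(train_val, fraction=0.25, seed=2)

# Grid search: pick the k with the best validation accuracy.
candidates = [1, 3, 5, 7]
best_k = max(candidates, key=lambda k: accuracy(train, val, k))

# Final evaluation uses training plus validation data as the reference set.
test_accuracy = accuracy(train_val, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Tuning on a validation split and reporting on a separate test split keeps the reported accuracy from being inflated by the hyperparameter choice.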
Data from: A Neural Approach for Text Extraction from Scholarly Figures
data.uni-hannover.de
zip
Updated Jan 20, 2022
Cite
TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
Available download formats: zip
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

If you found this dataset useful, please consider citing our paper:

@inproceedings{DBLP:conf/icdar/MorrisTE19,
  author = {David Morris and Peichen Tang and Ralph Ewerth},
  title = {A Neural Approach for Text Extraction from Scholarly Figures},
  booktitle = {2019 International Conference on Document Analysis and Recognition, {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
  pages = {1438--1443},
  publisher = {{IEEE}},
  year = {2019},
  url = {https://doi.org/10.1109/ICDAR.2019.00231},
  doi = {10.1109/ICDAR.2019.00231},
  timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
  biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

Datasets

We used different sources of data for testing, validation, and training. Our testing set was assembled in the work by Böschen et al. that we cite. We excluded the DeGruyter dataset from it and used it as our validation dataset.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays Amazon S3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the GitHub repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a Tesseract script to run text extraction from detected text rows. It is included in our code archive (code.tar) as text_recognition_multipro.py.

We used a Java program provided by Falk Böschen, adapted to our file structure. It is included as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
MC-LSTM papers, model runs
search.dataone.org
hydroshare.org
Updated Dec 30, 2023
Cite
Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
Unique identifier
https://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
Dataset updated
Dec 30, 2023
Dataset provided by
Hydroshare
Authors
Jonathan Martin Frame
Time period covered
Jan 1, 1989 - Jan 1, 2015
Description
Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

The models here are trained/calibrated on three different time periods.

Standard Time Split (time split 1): The test period (1989-1999) is the same period used by previous studies, which allows us to confirm that the deep learning models (LSTM and MC-LSTM) trained for this project perform as expected relative to prior work.

NWM Time Split (time split 2): The second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995.

Return period split: The third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).
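
The return period rule can be illustrated with a short sketch. It assumes that each water year of a basin is summarized by the return period of its largest streamflow event; that summary, the function name and the sample values are assumptions for illustration and are not taken from the papers' code.

```python
from typing import Dict, List, Tuple


def split_by_return_period(
    return_period_by_year: Dict[int, float], threshold_years: float = 5.0
) -> Tuple[List[int], List[int]]:
    """Assign water years to training (<= threshold) or testing (> threshold)."""
    train_years = sorted(
        year for year, rp in return_period_by_year.items() if rp <= threshold_years
    )
    test_years = sorted(
        year for year, rp in return_period_by_year.items() if rp > threshold_years
    )
    return train_years, test_years


# Hypothetical values for one basin:
# water year -> return period (in years) of that year's largest event.
basin = {1990: 1.2, 1991: 7.5, 1992: 3.0, 1993: 5.0, 1994: 12.0}
train_years, test_years = split_by_return_period(basin)
print(train_years, test_years)  # [1990, 1992, 1993] [1991, 1994]
```

Because the split is computed per basin, the same calendar year can fall into training for one basin and testing for another.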

Also included is an ensemble of model runs for LSTM and MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/
FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Cite
Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Available download formats: csv, json, bin, png
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
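
A minimal sketch of loading the three split files with Python's standard csv module follows. The file names come from the description above; the folder path, the column names (`feature_a`, `feature_b`, `label`) and the sample rows are hypothetical, and the demo writes tiny stand-in files to a temporary directory so that it runs without the real data.

```python
import csv
import os
import tempfile
from typing import Dict, List

SPLIT_FILES = ("train_data.csv", "validation_data.csv", "test_data.csv")


def load_split(folder: str, filename: str) -> List[Dict[str, str]]:
    """Read one split file into a list of row dictionaries keyed by column name."""
    path = os.path.join(folder, filename)
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as root:
    # Stand-in files with hypothetical columns; replace `root` with the real folder.
    for name in SPLIT_FILES:
        with open(os.path.join(root, name), "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["feature_a", "feature_b", "label"])
            writer.writerow(["0.1", "0.7", "1"])
            writer.writerow(["0.4", "0.2", "0"])
    splits = {name: load_split(root, name) for name in SPLIT_FILES}

print({name: len(rows) for name, rows in splits.items()})
```

The csv module returns every value as a string, so numeric features need an explicit conversion (for example with `float`) before modelling.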

Software Requirements:
To open and work with this dataset, you need an environment such as VS Code or Jupyter, together with tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

