26 datasets found
  1. Street View House Numbers (SVHN) Dataset (numpy)

    • kaggle.com
    zip
    Updated Sep 18, 2021
    Cite
    Hugo R. V. Angulo (2021). Street View House Numbers (SVHN) Dataset (numpy) [Dataset]. https://www.kaggle.com/hugovallejo/street-view-house-numbers-svhn-dataset-numpy
    Explore at:
    zip(369259958 bytes)Available download formats
    Dataset updated
    Sep 18, 2021
    Authors
    Hugo R. V. Angulo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

    This dataset takes the data from the original SVHN dataset and converts the images to NumPy arrays to make processing the images easier.

    10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9, and '0' has label 10. There are 73257 digits for training and 26032 digits for testing. Comes in two formats:

    1. Original images with character level bounding boxes.
    2. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).
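
    Because this version ships the images as NumPy arrays, loading is straightforward. A minimal sketch, assuming the arrays are stored as .npy files (the file names here are hypothetical; check the dataset's file listing on Kaggle):

    import numpy as np

    # Hypothetical file names for the converted arrays
    X_train = np.load("X_train.npy")  # 32x32 colour digit crops
    y_train = np.load("y_train.npy")  # labels 1-10, where '0' is labeled 10
    print(X_train.shape, y_train.shape)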
    

    All the credit to:

    http://ufldl.stanford.edu/housenumbers/

    and,

    Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng Reading Digits in Natural Images with Unsupervised Feature Learning NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.

  2. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    German Cancer Research Center
    Howard Hughes Medical Institute - Janelia Research Campus
    Max Delbrück Center
    Max Delbrück Center for Molecular Medicine
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

    Open a zarr file with:

    import zarr

    raw = zarr.open("<path_to_zarr_file>", mode='r', path="volumes/raw")
    seg = zarr.open("<path_to_zarr_file>", mode='r', path="volumes/gt_instances")

    Optionally, convert to a NumPy array:

    import numpy as np
    raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    # zarr.load reads the arrays into memory (it is read-only, so no mode argument is needed)
    raw = zarr.load(sys.argv[1], path="volumes/raw")
    gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
      title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
      author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
      year = 2024,
      eprint = {2404.00130},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  3. Numpy 1.18.4 User guide

    • kaggle.com
    zip
    Updated Jul 9, 2024
    Cite
    Tanay Mehta (2024). Numpy 1.18.4 User guide [Dataset]. https://www.kaggle.com/datasets/heyytanay/numpy-1-18-4-user-guide
    Explore at:
    zip(114024 bytes)Available download formats
    Dataset updated
    Jul 9, 2024
    Authors
    Tanay Mehta
    Description

    Dataset

    This dataset was created by Tanay Mehta

    Contents

  4. DustNet - structured data and Python code to reproduce the model,...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    pdf
    Updated Jul 7, 2024
    Cite
    T. E. Nowak; T. E. Nowak; Andy T. Augousti; Andy T. Augousti; Benno I. Simmons; Benno I. Simmons; Stefan Siegert; Stefan Siegert (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. http://doi.org/10.5281/zenodo.10722953
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    T. E. Nowak; T. E. Nowak; Andy T. Augousti; Andy T. Augousti; Benno I. Simmons; Benno I. Simmons; Stefan Siegert; Stefan Siegert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 7, 2023 - Mar 31, 2023
    Description

    Data and Python code used for AOD prediction with the DustNet model - a Machine Learning/AI based forecasting approach.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised and split data into training/validation/testing sets is also provided with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
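
    For a quick sanity check of the provided arrays, a minimal sketch (the file name is hypothetical; see the archive's file listing and the included notebook for the real names and shapes):

    import numpy as np

    aod = np.load("modis_aod_daily_2003_2022.npy")  # hypothetical file name
    print(aod.shape, aod.dtype)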

    Model output data and code

    This dataset was constructed by running 'DustNet_model_code.ipynb' (see above). It consists of 1095 days of forecasted AOD data (2020-2022) by CAMS, the DustNet model, naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of predictions. It is intended for a quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.

  5. image-impeccable

    • huggingface.co
    Updated May 11, 2025
    Cite
    ThinkOnward (2025). image-impeccable [Dataset]. https://huggingface.co/datasets/thinkonward/image-impeccable
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset authored and provided by
    ThinkOnward
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Image Impeccable

      Dataset Description
    

    This data was produced by ThinkOnward for the Image Impeccable Challenge, using a synthetic seismic dataset generator called Synthoseis.

    Created by: Mike McIntire and Jesse Pisel
    License: CC BY 4.0

      Uses

      How to generate a dataset

    This dataset is provided as paired noisy and clean seismic volumes. Follow the following steps to load the data into NumPy volumes: import pandas as pd import numpy as… See the full description on the dataset page: https://huggingface.co/datasets/thinkonward/image-impeccable.

  6. The Quick, Draw! Dataset

    • github.com
    • carrfratagen43.blogspot.com
    Updated Mar 1, 2017
    Cite
    Google (2017). The Quick, Draw! Dataset [Dataset]. https://github.com/googlecreativelab/quickdraw-dataset
    Explore at:
    Dataset updated
    Mar 1, 2017
    Dataset provided by
    Googlehttp://google.com/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

    Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg

  7. Data from: Preferential concentration of non-inertial buoyant particles in...

    • search.dataone.org
    • data.griidc.org
    Updated Feb 5, 2025
    Cite
    Chor, Tomas (2025). Preferential concentration of non-inertial buoyant particles in the ocean mixed-layer under free-convection [Dataset]. http://doi.org/10.7266/N7VX0F2R
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    GRIIDC
    Authors
    Chor, Tomas
    Description

    This dataset has been generated in order to investigate how particles (i.e., oil from a spill) behave on the surface when the ocean is dominated by convection. We found that there is a preferential concentration mechanism that dominates the surface signature. The dataset is submitted as a zip file with a pair of files for each figure. Each pair contains the data (in numpy-npz format) and a small Python script to read the data.

  8. Data from: Application of a 1H brain MRS benchmark dataset to deep learning...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Mar 5, 2024
    Cite
    Craig Stark; Aaron Gudmundson (2024). Application of a 1H brain MRS benchmark dataset to deep learning for out-of-voxel artifacts [Dataset]. http://doi.org/10.7280/D1RX1T
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Dryad
    Authors
    Craig Stark; Aaron Gudmundson
    Time period covered
    Aug 28, 2023
    Description

    NumPy archive files can be opened using Python and NumPy.
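
    For example, a minimal sketch (the file name is hypothetical; see the archive's file listing for the real names):

    import numpy as np

    archive = np.load("example_benchmark.npz")  # hypothetical file name
    print(archive.files)                        # list the arrays stored in the archive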

  9. Data from: Enhancing Carrier Mobility In Monolayer MoS2 Transistors With...

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Mar 29, 2024
    Cite
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande (2024). Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain [Dataset]. http://doi.org/10.13012/B2IDB-4074704_V1
    Explore at:
    Dataset updated
    Mar 29, 2024
    Authors
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Read me file for the data repository

    This repository has raw data for the publication "Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain". We arrange the data following the figure in which it first appeared. For all electrical transfer measurements, we provide the up-sweep and down-sweep data, with voltage in units of V and conductance in units of S. All Raman modes have units of cm^-1.

    How to use this dataset

    All data in this dataset is stored in binary NumPy array format as .npy files. To read a .npy file, use the NumPy module of the Python language and the np.load() command. Example: suppose the filename is example_data.npy. To load it into a Python program, open a Jupyter notebook, or in the Python program, run:

    import numpy as np
    data = np.load("example_data.npy")

    The example file is then stored in the data object.

  10. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel; Mao, Yiwei (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    University of Sydney
    Authors
    Garske, Samuel; Mao, Yiwei
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming you put it in a folder called "data" next to the Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title = {ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author = {Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal = {arXiv preprint arXiv:2408.14947},
      year = {2024},
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title = {OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author = {Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal = {Remote Sensing},
      volume = {14},
      number = {9},
      pages = {2244},
      year = {2022},
      publisher = {MDPI}
    }

  11. Supporting Dataset and Codes for "Stratosphere-Troposphere Exchange of Water...

    • zenodo.org
    bin, zip
    Updated May 18, 2025
    Cite
    Cong Dong; Cong Dong; Qiang Fu; Qiang Fu (2025). Supporting Dataset and Codes for "Stratosphere-Troposphere Exchange of Water Vapor Based on Observations and Reanalyses" [Dataset]. http://doi.org/10.5281/zenodo.15454081
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    May 18, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Cong Dong; Cong Dong; Qiang Fu; Qiang Fu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset supports the manuscript titled "Stratosphere-Troposphere Exchange of Water Vapor Based on Observations and Reanalyses" submitted to Geophysical Research Letters.

    It includes:
    - Python and NCL scripts used for processing satellite observational data from COSMIC, CloudSat, MLS, and reanalyses data.
    - Post-processed Reanalysis data from ERA5 and MERRA-2 and observational data
    - Final data used for figures and statistical analyses.
    - Jupyter notebooks for reproducing Figures 1–3, Table 1 and supplementary figures.

    All data are stored in NetCDF or NumPy `.npz` format.

    Please refer to the included `README.md` for detailed instructions on data structure, variable definitions, and how to reproduce the results.
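
    A minimal sketch for inspecting the NumPy archives (file and key names are hypothetical; see the included README.md for the real structure):

    import numpy as np

    archive = np.load("final_figure_data.npz")  # hypothetical file name
    print(archive.files)                        # list the arrays stored in the archive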

  12. Optiver Precomputed Features Numpy Array

    • kaggle.com
    zip
    Updated Aug 14, 2021
    Cite
    Tal Perry (2021). Optiver Precomputed Features Numpy Array [Dataset]. https://www.kaggle.com/lighttag/optiver-precomputed-features-numpy-array
    Explore at:
    zip(4346274917 bytes)Available download formats
    Dataset updated
    Aug 14, 2021
    Authors
    Tal Perry
    Description

    What's In This

    This is a single NumPy array with all the Optiver data joined together. It also has some of the features from this notebook. It's designed to be mmapped (memory-mapped) so that you can read small pieces at once.

    This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is fp16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the max seconds_in_bucket and 27 is the number of columns.

    How To Use It

    Add the dataset to your notebook and then:

    import numpy as np

    ntimeids = 3830
    nstocks = 112
    ncolumns = 27
    nseq = 600
    arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array',
                    mode='r', dtype=np.float16, shape=(ntimeids, nstocks, nseq, ncolumns))

    Caveats

    Handling Varying Sequence Sizes

    There are gaps in the stock ids and time ids, which doesn't work well with an array format. So we have time and stock indexes as well (an _ix suffix instead of _id). To calculate these:

    import pandas as pd
    import numpy as np

    targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')
    ntimeids = targets.time_id.nunique()
    stock_ids = list(sorted(targets.stock_id.unique()))
    timeids = sorted(targets.time_id.unique())
    timeid_to_ix = {time_id: i for i, time_id in enumerate(timeids)}
    stock_id_to_ix = {stock_id: i for i, stock_id in enumerate(stock_ids)}

    Getting data For a particular stock id / time id

    So to get the data for stock_id 13 on time_id 146 you'd do:

    stock_ix = stock_id_to_ix[13]
    time_ix = timeid_to_ix[146]
    arr[time_ix, stock_ix]

    Notice that the third dimension is of size 600 (the max number of points for a given time_ix, stock_ix); some of these entries will be empty. To truncate a single stock's data do:

    max_seq_ix = (arr[time_ix, stock_ix, :, -1] > 0).cumsum().max()
    arr[time_ix, stock_ix, :max_seq_ix]

    Column Mappings

    There are 27 columns in the last dimension these are:

    ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
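
    A small convenience sketch, building on the array and index mappings defined above, for referring to these columns by name instead of position:

    cols = ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2',
            'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2',
            'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance',
            'price_spread', 'bid_spread', 'ask_spread', 'total_volume',
            'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y',
            'log_return_trade', 'target']
    col_ix = {name: i for i, name in enumerate(cols)}

    # e.g. the wap1 series for one stock / time bucket
    wap1 = arr[time_ix, stock_ix, :, col_ix['wap1']]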

  13. CIFAR-10 keras files cifar10.load_data()

    • kaggle.com
    zip
    Updated Jan 21, 2020
    Cite
    Justin Güse (2020). CIFAR-10 keras files cifar10.load_data() [Dataset]. https://www.kaggle.com/guesejustin/cifar10-keras-files-cifar10load-data
    Explore at:
    zip(169650179 bytes)Available download formats
    Dataset updated
    Jan 21, 2020
    Authors
    Justin Güse
    Description

    In my opinion it was horrible to import these images into Kaggle the right way. The way I was used to is to use the Keras dataset and use cifar10.load_data(), but that does not work with Kaggle.

    That is why I downloaded each of x_train, y_train, x_test, y_test, packed them into a compressed numpy array, and uploaded them here.

    How you would import them using Keras:

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    How you can import them now:

    import numpy as np

    data = np.load("/kaggle/input/cifar10-keras-files-cifar10load-data/cifar-10.npz")
    filenames = ["x_train", "y_train", "x_test", "y_test"]
    nps = []
    for filename in filenames:
        nps.append(data[filename])

    x_train, y_train, x_test, y_test = nps

    Further information regarding the dataset: https://www.cs.toronto.edu/~kriz/cifar.html

    The CIFAR-10 dataset

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

  14. Play Store Data Analysis By Vaishnavi

    • kaggle.com
    zip
    Updated Apr 30, 2021
    Cite
    Vaishnavi Sahu (2021). Play Store Data Analysis By Vaishnavi [Dataset]. https://www.kaggle.com/vaishnavisahu/play-store-data-analysis-by-vaishnavi
    Explore at:
    zip(597350 bytes)Available download formats
    Dataset updated
    Apr 30, 2021
    Authors
    Vaishnavi Sahu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    EDA using numpy and pandas

    Content

    In this task I have to predict what factors make an app perform well, whether it's size, price, category, or multiple factors together, and what makes an app rank at the top of the Google Play Store.

    Column description:

    App: name of the application
    Category: category of the application
    Rating: rating of the application
    Reviews: reviews of the application
    Size: size of the application
    Installs: how many users installed the application
    Type: type of the application
    Price: price of the application
    Content Rating: rating of the content of the application

  15. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Explore at:
    zip(168517945 bytes)Available download formats
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

    Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

    Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.

    Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

    The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

    def unpickle(file):
        import cPickle
        with open(file, 'rb') as fo:
            dict = cPickle.load(fo)
        return dict

    And a python3 version:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Loaded in this way, each of the batch files contains a dictionary with the following elements:

    data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.

    labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

    The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

    label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

    Binary version

    The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

    <1 x label><3072 x pixel>
    ...
    <1 x label><3072 x pixel>

    In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
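
    For illustration, a minimal NumPy sketch for parsing one binary batch file, following the layout described above (reading the whole file at once is assumed to be acceptable):

    import numpy as np

    # 10000 rows of 1 label byte + 3072 pixel bytes
    raw = np.fromfile("data_batch_1.bin", dtype=np.uint8).reshape(10000, 3073)
    labels = raw[:, 0]                             # values 0-9
    images = raw[:, 1:].reshape(10000, 3, 32, 32)  # channel order: red, green, blue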

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.

    The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  16. NTU60 Processed Skeleton Dataset

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Oucherif Mohammed Ouail (2025). NTU60 Processed Skeleton Dataset [Dataset]. https://www.kaggle.com/datasets/oucherifouail/ntu60-processed-skeleton-dataset
    Explore at:
    zip(3075187118 bytes)Available download formats
    Dataset updated
    Aug 29, 2025
    Authors
    Oucherif Mohammed Ouail
    Description

    NTU RGB+D 60 – Preprocessed Skeleton Dataset

    This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.

    The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.

    Preprocessing Steps

    Each skeleton sequence was processed by:

    • ✅ Removing NaN / invalid frames
    • ✅ Translating skeletons (centered spine base joint at origin)
    • ✅ Normalizing body scale using spine length
    • ✅ Aligning all sequences to 300 frames (padding or truncation)
    • ✅ Formatting sequences to include up to 2 persons per clip

    Output Files

    Two .npz files are provided, following the standard evaluation protocols:

    1. NTU60_CS.npz → Cross-Subject split
    2. NTU60_CV.npz → Cross-View split

    Each file contains:

    • x_train → Training data, shape (N_train, 300, 150)
    • y_train → Training labels, shape (N_train, 60) (one-hot)
    • x_test → Testing data, shape (N_test, 300, 150)
    • y_test → Testing labels, shape (N_test, 60) (one-hot)

    Data Format

    • 300 = max frames per sequence (zero-padded)
    • 150 = 2 persons × 25 joints × 3 coordinates (x, y, z)
    • 60 = number of action classes

    If a sequence has only 1 person, the second person’s features are zero-filled.

    Skeleton Properties

    • Centered → Spine base joint (joint-2) at origin (0,0,0)
    • Normalized → Body size scaled consistently
    • Aligned → Fixed-length sequences (300 frames)
    • Two-person setting → Always represented with 150 features

    Evaluation Protocols

    • Cross-Subject (CS): Train and test sets split by different actors. The model is evaluated on unseen subjects to measure generalization across people.
    • Cross-View (CV): Train and test sets split by different camera views. The model is evaluated on unseen viewpoints to measure viewpoint invariance.

    Usage

    These .npz files can be directly loaded in PyTorch or NumPy-based pipelines. They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.

    Example:

    import numpy as np
    
    data = np.load("NTU60_CS.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    
    print(x_train.shape) # (N_train, 300, 150)
    print(y_train.shape) # (N_train, 60)
    
  17. Drone Dataset

    • kaggle.com
    zip
    Updated Oct 6, 2025
    Cite
    Pir Ghullam Mustafa (2025). Drone Dataset [Dataset]. https://www.kaggle.com/datasets/PirMustafa/drone-dataset
    Explore at:
    zip(637174000 bytes)Available download formats
    Dataset updated
    Oct 6, 2025
    Authors
    Pir Ghullam Mustafa
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Drone Anomaly Detection Time-Series Dataset

    This dataset contains pre-processed time-series data for a binary classification task to determine whether a drone is healthy or faulty based on its motion data. The data has been windowed and is ready for use with sequence-based deep learning models like LSTMs, GRUs, or 1D CNNs.

    Dataset Description

    Source Data: The data is derived from the "DronePropA: Motion Trajectories Dataset for Defective Drones" by Ismail, Elshaar, et al. The original dataset consists of 130 .mat files, each representing a single flight experiment.

    Preprocessing Steps: The original .mat files have been processed to create a single, model-ready .npz file. The following steps were applied (a short windowing sketch follows below):

    1. Feature Extraction: For each of the 130 flights, 12 specific time-series features were extracted, focusing on the drone's core motion dynamics.
    2. Labeling: Each flight was labeled as healthy (0) or faulty (1) based on the file naming convention described in the source paper.
    3. Windowing: The time-series data from each flight was segmented into overlapping windows. Each window is 200 time-steps long with a 50% overlap between consecutive windows.
    4. Aggregation: All windows from all flights were stacked into a single dataset.
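
    To make the windowing step concrete, here is a minimal sketch (the function name and details are illustrative, not taken from the original pipeline):

    import numpy as np

    def make_windows(flight, win=200, stride=100):
        # Segment a (T, 12) flight into win-length windows with 50% overlap
        n = (flight.shape[0] - win) // stride + 1
        return np.stack([flight[i * stride : i * stride + win] for i in range(n)])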

    Dataset Structure

    The data is contained in a single compressed NumPy archive file: proceed_data.npz. This file contains two arrays: X and y.

    • X: A 3-dimensional NumPy array containing the feature data.

      • Shape: (num_windows, 200, 12)
      • First Dimension: The total number of windows aggregated from all flights.
      • Second Dimension: The number of time-steps in each window (200).
      • Third Dimension: The number of features recorded at each time-step (12).
    • y: A 1-dimensional NumPy array containing the corresponding labels for each window in X.

      • Shape: (num_windows,)
      • Values: 0 for a healthy window or 1 for a faulty window.

    How to Use

    You can load the data easily using NumPy.

    import numpy as np
    
    # Load the dataset
    data = np.load('proceed_data.npz')
    
    # Extract the features and labels
    X = data['X']
    y = data['y']
    
    print("Data loaded successfully!")
    print(f"Features shape: {X.shape}")
    print(f"Labels shape: {y.shape}")
    
    Dataset Details

    Features: The 12 features in the third dimension of the X array are in the following order:

    1. Position X (meters)
    2. Position Y (meters)
    3. Position Z (meters)
    4. Roll (radians)
    5. Pitch (radians)
    6. Yaw (radians)
    7. Roll Rate (rad/s)
    8. Pitch Rate (rad/s)
    9. Yaw Rate (rad/s)
    10. Acceleration X (m/s²)
    11. Acceleration Y (m/s²)
    12. Acceleration Z (m/s²)

    Labels: The labels in the y array are defined as:

    • 0: Healthy
    • 1: Faulty

    Citation

    If you use this dataset, please cite the original authors of the DronePropA dataset.
    
  18. London Housing Data

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Cite
    Data Science Lovers (2025). London Housing Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/london-housing-data
    Explore at:
    zip(138862 bytes)Available download formats
    Dataset updated
    Sep 15, 2025
    Authors
    Data Science Lovers
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    London
    Description

    📹Project Video available on YouTube - https://youtu.be/q-Omt6LgRLc

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    London Housing Price Dataset

    The dataset contains housing market information for different areas of London over time. It includes details such as average house prices, the number of houses sold, and crime statistics. The data spans multiple years and is organized by date and geographic area.

    This data is available as a CSV file. We are going to analyze this data set using the Pandas DataFrame.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q. 1) Convert the Datatype of 'Date' column to Date-Time format.

    Q. 2.A) Add a new column 'year' in the dataframe, which contains years only.

    Q. 2.B) Add a new column 'month' as 2nd column in the dataframe, which contains month only.

    Q. 3) Remove the columns 'year' and 'month' from the dataframe.

    Q. 4) Show all the records where 'No. of Crimes' is 0. And, how many such records are there?

    Q. 5) What is the maximum & minimum 'average_price' per year in England?

    Q. 6) What is the maximum & minimum 'No. of Crimes' recorded per area?

    Q. 7) Show the total count of records of each area, where average price is less than 100000.
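
    A minimal pandas sketch for the first few questions (the file and column names are assumptions based on the feature list below; adjust them to the actual CSV header):

    import pandas as pd

    df = pd.read_csv("london_housing.csv")  # hypothetical file name

    # Q1: convert the 'date' column to date-time format
    df['date'] = pd.to_datetime(df['date'])

    # Q2: add 'year' and 'month' columns ('month' as the 2nd column)
    df['year'] = df['date'].dt.year
    df.insert(1, 'month', df['date'].dt.month)

    # Q3: remove them again
    df = df.drop(columns=['year', 'month'])

    # Q4: records where 'no_of_crimes' is 0, and their count
    zero_crimes = df[df['no_of_crimes'] == 0]
    print(len(zero_crimes))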

    Enrol in our Udemy courses:

    1. Python Data Analytics Projects - https://www.udemy.com/course/bigdata-analysis-python/?referralCode=F75B5F25D61BD4E5F161
    2. Python For Data Science - https://www.udemy.com/course/python-for-data-science-real-time-exercises/?referralCode=9C91F0B8A3F0EB67FE67
    3. Numpy For Data Science - https://www.udemy.com/course/python-numpy-exercises/?referralCode=FF9EDB87794FED46CBDF

    These are the main Features/Columns available in the dataset :

    1) Date – The month and year when the data was recorded.

    2) Area – The London borough or area for which the housing and crime data is reported.

    3) Average_price – The average house price in the given area during the specified month.

    4) Code – The unique area code (e.g., government statistical code) corresponding to each borough or region.

    5) Houses_sold – The number of houses sold in the given area during the specified month.

    6) No_of_crimes – The number of crimes recorded in the given area during the specified month.

  19. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip(65751 bytes)Available download formats
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame (see the sketch after this list).
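
    A minimal sketch of such a helper, assuming a local PostgreSQL instance holding the classicmodels schema (the connection string and credentials are illustrative):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def read_table(table_name):
        # Load one table from PostgreSQL into a pandas DataFrame
        return pd.read_sql_table(table_name, con=engine)

    tables = ["orders", "orderdetails", "customers", "products", "employees"]
    frames = {name: read_table(name) for name in tables}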

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results

    • Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    • Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    • Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions

    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used

    • Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    • Database: PostgreSQL
    • Tools: Jupyter Notebook
    • Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  20. Dataset: Prime Numbers - First 1Lac

    • kaggle.com
    zip
    Updated May 12, 2018
    Cite
    Rehan Guha (2018). Dataset: Prime Numbers - First 1Lac [Dataset]. https://www.kaggle.com/rehanguha/dataset-prime-numbers-first-1lac
    Explore at:
    zip(872594 bytes)Available download formats
    Dataset updated
    May 12, 2018
    Authors
    Rehan Guha
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    File:
    • Contains 1000 files with 100 prime numbers in each file
    • Format: *.dat

    Data Format:
    • Python NumPy array
    • Float64

    Example (how to use): numpy.loadtxt([filename])
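
    A minimal loading sketch (the file name is hypothetical; the archive contains 1000 .dat files):

    import numpy as np

    primes = np.loadtxt("primes_0001.dat")  # hypothetical file name
    print(primes.dtype, primes.shape)       # float64, 100 values per file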
