26 datasets found
  1. Street View House Numbers (SVHN) Dataset (numpy)

    • kaggle.com
    zip
    Updated Sep 18, 2021
    Cite
    Hugo R. V. Angulo (2021). Street View House Numbers (SVHN) Dataset (numpy) [Dataset]. https://www.kaggle.com/hugovallejo/street-view-house-numbers-svhn-dataset-numpy
    Explore at:
    zip(369259958 bytes)Available download formats
    Dataset updated
    Sep 18, 2021
    Authors
    Hugo R. V. Angulo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

    This dataset takes the data from the original SVHN dataset and converts the images to NumPy arrays to make processing the images easier.

    10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9, and '0' has label 10. There are 73257 digits for training and 26032 digits for testing. Comes in two formats:

    1. Original images with character level bounding boxes.
    2. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).
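
    Because this version ships the images as NumPy arrays, loading is straightforward. A minimal sketch, assuming the arrays are stored as .npy files (the file names here are hypothetical; check the dataset's file listing on Kaggle):

    import numpy as np

    # Hypothetical file names for the converted arrays
    X_train = np.load("X_train.npy")  # 32x32 colour digit crops
    y_train = np.load("y_train.npy")  # labels 1-10, where '0' is labeled 10
    print(X_train.shape, y_train.shape)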
    

    All the credit to:

    http://ufldl.stanford.edu/housenumbers/

    and,

    Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng Reading Digits in Natural Images with Unsupervised Feature Learning NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.

  2. Data from: FISBe: A real-world benchmark dataset for instance segmentation...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Apr 2, 2024
    Cite
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar (2024). FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10875062
    Explore at:
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    German Cancer Research Center
    Howard Hughes Medical Institute - Janelia Research Campus
    Max Delbrück Center
    Max Delbrück Center for Molecular Medicine
    Authors
    Mais, Lisa; Hirsch, Peter; Managan, Claire; Kandarpa, Ramya; Rumberger, Josef Lorenz; Reinke, Annika; Maier-Hein, Lena; Ihrke, Gudrun; Kainmueller, Dagmar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General

    For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.

    Summary

    A new dataset for neuron instance segmentation in 3d multicolor light microscopy data of fruit fly brains

    30 completely labeled (segmented) images

    71 partly labeled images

    altogether comprising ∼600 expert-labeled neuron instances (labeling a single neuron takes between 30-60 min on average, yet a difficult one can take up to 4 hours)

    To the best of our knowledge, the first real-world benchmark dataset for instance segmentation of long thin filamentous objects

    A set of metrics and a novel ranking score for respective meaningful method benchmarking

    An evaluation of three baseline methods in terms of the above metrics and score

    Abstract

    Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.

    Dataset documentation:

    We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:

    FISBe Datasheet

    Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.

    Files

    fisbe_v1.0_{completely,partly}.zip

    contains the image and ground truth segmentation data; there is one zarr file per sample, see below for more information on how to access zarr files.

    fisbe_v1.0_mips.zip

    maximum intensity projections of all samples, for convenience.

    sample_list_per_split.txt

    a simple list of all samples and the subset they are in, for convenience.

    view_data.py

    a simple python script to visualize samples, see below for more information on how to use it.

    dim_neurons_val_and_test_sets.json

    a list of instance ids per sample that are considered to be of low intensity/dim; can be used for extended evaluation.

    Readme.md

    general information

    How to work with the image files

    Each sample consists of a single 3d MCFO image of neurons of the fruit fly. For each image, we provide a pixel-wise instance segmentation for all separable neurons. Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification). The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file. The segmentation mask for each neuron is stored in a separate channel. The order of dimensions is CZYX.

    We recommend working in a virtual environment, e.g., by using conda:

    conda create -y -n flylight-env -c conda-forge python=3.9
    conda activate flylight-env

    How to open zarr files

    Install the python zarr package:

    pip install zarr

    Open a zarr file with:

    import zarr

    raw = zarr.open("<path_to_zarr_file>", mode='r', path="volumes/raw")
    seg = zarr.open("<path_to_zarr_file>", mode='r', path="volumes/gt_instances")

    Optionally, convert to a NumPy array:

    import numpy as np
    raw_np = np.array(raw)

    Zarr arrays are read lazily on-demand. Many functions that expect numpy arrays also work with zarr arrays. Optionally, the arrays can also explicitly be converted to numpy arrays.

    How to view zarr image files

    We recommend using napari to view the image data.

    Install napari:

    pip install "napari[all]"

    Save the following Python script:

    import zarr, sys, napari

    # zarr.load reads the arrays into memory (it is read-only, so no mode argument is needed)
    raw = zarr.load(sys.argv[1], path="volumes/raw")
    gts = zarr.load(sys.argv[1], path="volumes/gt_instances")

    viewer = napari.Viewer(ndisplay=3)
    for idx, gt in enumerate(gts):
        viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
    viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
    viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
    viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
    napari.run()

    Execute:

    python view_data.py /R9F03-20181030_62_B5.zarr

    Metrics

    S: Average of avF1 and C

    avF1: Average F1 Score

    C: Average ground truth coverage

    clDice_TP: Average true positives clDice

    FS: Number of false splits

    FM: Number of false merges

    tp: Relative number of true positives

    For more information on our selected metrics and formal definitions please see our paper.

    Baseline

    To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al. For detailed information on the methods and the quantitative results please see our paper.

    License

    The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    Citation

    If you use FISBe in your research, please use the following BibTeX entry:

    @misc{mais2024fisbe,
      title = {FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures},
      author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
      year = 2024,
      eprint = {2404.00130},
      archivePrefix = {arXiv},
      primaryClass = {cs.CV}
    }

    Acknowledgments

    We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable discussions. P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program. This work was co-funded by Helmholtz Imaging.

    Changelog

    There have been no changes to the dataset so far. All future changes will be listed on the changelog page.

    Contributing

    If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.

    All contributions are welcome!

  3. Numpy 1.18.4 User guide

    • kaggle.com
    zip
    Updated Jul 9, 2024
    Cite
    Tanay Mehta (2024). Numpy 1.18.4 User guide [Dataset]. https://www.kaggle.com/datasets/heyytanay/numpy-1-18-4-user-guide
    Explore at:
    zip(114024 bytes)Available download formats
    Dataset updated
    Jul 9, 2024
    Authors
    Tanay Mehta
    Description

    Dataset

    This dataset was created by Tanay Mehta

    Contents

  4. DustNet - structured data and Python code to reproduce the model,...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    pdf
    Updated Jul 7, 2024
    Cite
    T. E. Nowak; T. E. Nowak; Andy T. Augousti; Andy T. Augousti; Benno I. Simmons; Benno I. Simmons; Stefan Siegert; Stefan Siegert (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. http://doi.org/10.5281/zenodo.10722953
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    T. E. Nowak; T. E. Nowak; Andy T. Augousti; Andy T. Augousti; Benno I. Simmons; Benno I. Simmons; Stefan Siegert; Stefan Siegert
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 7, 2023 - Mar 31, 2023
    Description

    Data and Python code used for AOD prediction with the DustNet model - a Machine Learning/AI based forecasting approach.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables* ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with normalised and split data into training/validation/testing sets is also provided with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.
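
    For a quick sanity check of the provided arrays, a minimal sketch (the file name is hypothetical; see the archive's file listing and the included notebook for the real names and shapes):

    import numpy as np

    aod = np.load("modis_aod_daily_2003_2022.npy")  # hypothetical file name
    print(aod.shape, aod.dtype)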

    Model output data and code

    This dataset was constructed by running 'DustNet_model_code.ipynb' (see above). It consists of 1095 days of forecasted AOD data (2020-2022) by CAMS, the DustNet model, naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of predictions. It is intended for a quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.

  5. image-impeccable

    • huggingface.co
    Updated May 11, 2025
    Cite
    ThinkOnward (2025). image-impeccable [Dataset]. https://huggingface.co/datasets/thinkonward/image-impeccable
    Explore at:
    Dataset updated
    May 11, 2025
    Dataset authored and provided by
    ThinkOnward
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Image Impeccable

      Dataset Description
    

    This data was produced by ThinkOnward for the Image Impeccable Challenge, using a synthetic seismic dataset generator called Synthoseis.

    Created by: Mike McIntire and Jesse Pisel
    License: CC BY 4.0

      Uses

      How to generate a dataset

    This dataset is provided as paired noisy and clean seismic volumes. Follow the following steps to load the data into NumPy volumes: import pandas as pd import numpy as… See the full description on the dataset page: https://huggingface.co/datasets/thinkonward/image-impeccable.

  6. The Quick, Draw! Dataset

    • github.com
    • carrfratagen43.blogspot.com
    Updated Mar 1, 2017
    Cite
    Google (2017). The Quick, Draw! Dataset [Dataset]. https://github.com/googlecreativelab/quickdraw-dataset
    Explore at:
    Dataset updated
    Mar 1, 2017
    Dataset provided by
    Googlehttp://google.com/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Quick Draw Dataset is a collection of 50 million drawings across 345 categories, contributed by players of the game "Quick, Draw!". The drawings were captured as timestamped vectors, tagged with metadata including what the player was asked to draw and in which country the player was located.

    Example drawings: https://raw.githubusercontent.com/googlecreativelab/quickdraw-dataset/master/preview.jpg

  7. Data from: Preferential concentration of non-inertial buoyant particles in...

    • search.dataone.org
    • data.griidc.org
    Updated Feb 5, 2025
    Cite
    Chor, Tomas (2025). Preferential concentration of non-inertial buoyant particles in the ocean mixed-layer under free-convection [Dataset]. http://doi.org/10.7266/N7VX0F2R
    Explore at:
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    GRIIDC
    Authors
    Chor, Tomas
    Description

    This dataset has been generated in order to investigate how particles (i.e., oil from a spill) behave on the surface when the ocean is dominated by convection. We found that there is a preferential concentration mechanism that dominates the surface signature. The dataset is submitted as a zip file with a pair of files for each figure. Each pair contains the data (in numpy-npz format) and a small Python script to read the data.

  8. Data from: Application of a 1H brain MRS benchmark dataset to deep learning...

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Mar 5, 2024
    Cite
    Craig Stark; Aaron Gudmundson (2024). Application of a 1H brain MRS benchmark dataset to deep learning for out-of-voxel artifacts [Dataset]. http://doi.org/10.7280/D1RX1T
    Explore at:
    zipAvailable download formats
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Dryad
    Authors
    Craig Stark; Aaron Gudmundson
    Time period covered
    Aug 28, 2023
    Description

    NumPy archive files can be opened using Python and NumPy.
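
    For example, a minimal sketch (the file name is hypothetical; see the archive's file listing for the real names):

    import numpy as np

    archive = np.load("example_benchmark.npz")  # hypothetical file name
    print(archive.files)                        # list the arrays stored in the archive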

  9. Data from: Enhancing Carrier Mobility In Monolayer MoS2 Transistors With...

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Mar 29, 2024
    Cite
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande (2024). Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain [Dataset]. http://doi.org/10.13012/B2IDB-4074704_V1
    Explore at:
    Dataset updated
    Mar 29, 2024
    Authors
    Yue Zhang; Helin Zhao; Siyuan Huang; Mohhamad Abir Hossain; Arend van der Zande
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Read me file for the data repository

    This repository has raw data for the publication "Enhancing Carrier Mobility In Monolayer MoS2 Transistors With Process Induced Strain". We arrange the data following the figure in which it first appeared. For all electrical transfer measurements, we provide the up-sweep and down-sweep data, with voltage in units of V and conductance in units of S. All Raman modes have units of cm^-1.

    How to use this dataset

    All data in this dataset is stored in binary NumPy array format as .npy files. To read a .npy file, use the NumPy module of the Python language and the np.load() command. Example: suppose the filename is example_data.npy. To load it into a Python program, open a Jupyter notebook, or in the Python program, run:

    import numpy as np
    data = np.load("example_data.npy")

    The example file is then stored in the data object.

  10. Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 29, 2024
    Cite
    Garske, Samuel; Mao, Yiwei (2024). Three Annotated Anomaly Detection Datasets for Line-Scan Algorithms [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13370799
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    University of Sydney
    Authors
    Garske, Samuel; Mao, Yiwei
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains two hyperspectral and one multispectral anomaly detection images, and their corresponding binary pixel masks. They were initially used for real-time anomaly detection in line-scanning, but they can be used for any anomaly detection task.

    They are in .npy file format (will add tiff or geotiff variants in the future), with the image datasets being in the order of (height, width, channels). The SNP dataset was collected using sentinelhub, and the Synthetic dataset was collected from AVIRIS. The Python code used to analyse these datasets can be found at: https://github.com/WiseGamgee/HyperAD

    How to Get Started

    All that is needed to load these datasets is Python (preferably 3.8+) and the NumPy package. Example code for loading the beach dataset, assuming you put it in a folder called "data" next to the Python script:

    import numpy as np

    # Load image file
    hsi_array = np.load("data/beach_hsi.npy")
    n_pixels, n_lines, n_bands = hsi_array.shape
    print(f"This dataset has {n_pixels} pixels, {n_lines} lines, and {n_bands} bands.")

    # Load image mask
    mask_array = np.load("data/beach_mask.npy")
    m_pixels, m_lines = mask_array.shape
    print(f"The corresponding anomaly mask is {m_pixels} pixels by {m_lines} lines.")

    Citing the Datasets

    If you use any of these datasets, please cite the following paper:

    @article{garske2024erx,
      title = {ERX - a Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line-Scanning},
      author = {Garske, Samuel and Evans, Bradley and Artlett, Christopher and Wong, KC},
      journal = {arXiv preprint arXiv:2408.14947},
      year = {2024},
    }

    If you use the beach dataset please cite the following paper as well (original source):

    @article{mao2022openhsi,
      title = {OpenHSI: A complete open-source hyperspectral imaging solution for everyone},
      author = {Mao, Yiwei and Betters, Christopher H and Evans, Bradley and Artlett, Christopher P and Leon-Saval, Sergio G and Garske, Samuel and Cairns, Iver H and Cocks, Terry and Winter, Robert and Dell, Timothy},
      journal = {Remote Sensing},
      volume = {14},
      number = {9},
      pages = {2244},
      year = {2022},
      publisher = {MDPI}
    }

  11. Supporting Dataset and Codes for "Stratosphere-Troposphere Exchange of Water...

    • zenodo.org
    bin, zip
    Updated May 18, 2025
    Cite
    Cong Dong; Cong Dong; Qiang Fu; Qiang Fu (2025). Supporting Dataset and Codes for "Stratosphere-Troposphere Exchange of Water Vapor Based on Observations and Reanalyses" [Dataset]. http://doi.org/10.5281/zenodo.15454081
    Explore at:
    zip, binAvailable download formats
    Dataset updated
    May 18, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Cong Dong; Cong Dong; Qiang Fu; Qiang Fu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset supports the manuscript titled "Stratosphere-Troposphere Exchange of Water Vapor Based on Observations and Reanalyses" submitted to Geophysical Research Letters.

    It includes:
    - Python and NCL scripts used for processing satellite observational data from COSMIC, CloudSat, MLS, and reanalyses data.
    - Post-processed Reanalysis data from ERA5 and MERRA-2 and observational data
    - Final data used for figures and statistical analyses.
    - Jupyter notebooks for reproducing Figures 1–3, Table 1 and supplementary figures.

    All data are stored in NetCDF or NumPy `.npz` format.

    Please refer to the included `README.md` for detailed instructions on data structure, variable definitions, and how to reproduce the results.
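
    A minimal sketch for inspecting the NumPy archives (file and key names are hypothetical; see the included README.md for the real structure):

    import numpy as np

    archive = np.load("final_figure_data.npz")  # hypothetical file name
    print(archive.files)                        # list the arrays stored in the archive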

  12. Optiver Precomputed Features Numpy Array

    • kaggle.com
    zip
    Updated Aug 14, 2021
    Cite
    Tal Perry (2021). Optiver Precomputed Features Numpy Array [Dataset]. https://www.kaggle.com/lighttag/optiver-precomputed-features-numpy-array
    Explore at:
    zip(4346274917 bytes)Available download formats
    Dataset updated
    Aug 14, 2021
    Authors
    Tal Perry
    Description

    What's In This

    This is a single NumPy array with all the Optiver data joined together. It also has some of the features from this notebook. It's designed to be mmapped (memory-mapped) so that you can read small pieces at once.

    This is one big array with the trade and book data joined together, plus some pre-computed features. The dtype of the array is fp16. The array's shape is (n_times, n_stocks, 600, 27), where 600 is the max seconds_in_bucket and 27 is the number of columns.

    How To Use It

    Add the dataset to your notebook and then:

    import numpy as np

    ntimeids = 3830
    nstocks = 112
    ncolumns = 27
    nseq = 600
    arr = np.memmap('../input/optiver-precomputed-features-numpy-array/data.array',
                    mode='r', dtype=np.float16, shape=(ntimeids, nstocks, nseq, ncolumns))

    Caveats

    Handling Varying Sequence Sizes

    There are gaps in the stock ids and time ids, which doesn't work well with an array format. So we have time and stock indexes as well (an _ix suffix instead of _id). To calculate these:

    import pandas as pd
    import numpy as np

    targets = pd.read_csv('/kaggle/input/optiver-realized-volatility-prediction/train.csv')
    ntimeids = targets.time_id.nunique()
    stock_ids = list(sorted(targets.stock_id.unique()))
    timeids = sorted(targets.time_id.unique())
    timeid_to_ix = {time_id: i for i, time_id in enumerate(timeids)}
    stock_id_to_ix = {stock_id: i for i, stock_id in enumerate(stock_ids)}

    Getting data For a particular stock id / time id

    So to get the data for stock_id 13 on time_id 146 you'd do:

    stock_ix = stock_id_to_ix[13]
    time_ix = timeid_to_ix[146]
    arr[time_ix, stock_ix]

    Notice that the third dimension is of size 600 (the max number of points for a given time_ix, stock_ix); some of these entries will be empty. To truncate a single stock's data do:

    max_seq_ix = (arr[time_ix, stock_ix, :, -1] > 0).cumsum().max()
    arr[time_ix, stock_ix, :max_seq_ix]

    Column Mappings

    There are 27 columns in the last dimension these are:

    ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2', 'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2', 'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance', 'price_spread', 'bid_spread', 'ask_spread', 'total_volume', 'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y', 'log_return_trade', 'target']
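
    A small convenience sketch, building on the array and index mappings defined above, for referring to these columns by name instead of position:

    cols = ['time_id', 'seconds_in_bucket', 'bid_price1', 'ask_price1', 'bid_price2',
            'ask_price2', 'bid_size1', 'ask_size1', 'bid_size2', 'ask_size2',
            'stock_id', 'wap1', 'wap2', 'log_return1', 'log_return2', 'wap_balance',
            'price_spread', 'bid_spread', 'ask_spread', 'total_volume',
            'volume_imbalance', 'price', 'size', 'order_count', 'stock_id_y',
            'log_return_trade', 'target']
    col_ix = {name: i for i, name in enumerate(cols)}

    # e.g. the wap1 series for one stock / time bucket
    wap1 = arr[time_ix, stock_ix, :, col_ix['wap1']]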

  13. CIFAR-10 keras files cifar10.load_data()

    • kaggle.com
    zip
    Updated Jan 21, 2020
    Cite
    Justin Güse (2020). CIFAR-10 keras files cifar10.load_data() [Dataset]. https://www.kaggle.com/guesejustin/cifar10-keras-files-cifar10load-data
    Explore at:
    zip(169650179 bytes)Available download formats
    Dataset updated
    Jan 21, 2020
    Authors
    Justin Güse
    Description

    In my opinion it was horrible to import these images into Kaggle the right way. The way I was used to is to use the Keras dataset and use cifar10.load_data(), but that does not work with Kaggle.

    That is why I downloaded each of x_train, y_train, x_test, y_test, packed them into a compressed numpy array, and uploaded them here.

    How you would import them using Keras:

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    How you can import them now:

    import numpy as np

    data = np.load("/kaggle/input/cifar10-keras-files-cifar10load-data/cifar-10.npz")
    filenames = ["x_train", "y_train", "x_test", "y_test"]
    nps = []
    for filename in filenames:
        nps.append(data[filename])

    x_train, y_train, x_test, y_test = nps

    Further information regarding the dataset: https://www.cs.toronto.edu/~kriz/cifar.html

    The CIFAR-10 dataset

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

  14. Play Store Data Analysis By Vaishnavi

    • kaggle.com
    zip
    Updated Apr 30, 2021
    Cite
    Vaishnavi Sahu (2021). Play Store Data Analysis By Vaishnavi [Dataset]. https://www.kaggle.com/vaishnavisahu/play-store-data-analysis-by-vaishnavi
    Explore at:
    zip(597350 bytes)Available download formats
    Dataset updated
    Apr 30, 2021
    Authors
    Vaishnavi Sahu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    EDA using numpy and pandas

    Content

    In this task I have to predict what factors make an app perform well, whether it's size, price, category, or multiple factors together, and what makes an app rank at the top of the Google Play Store.

    Column description:

    App: name of the application
    Category: category of the application
    Rating: rating of the application
    Reviews: reviews of the application
    Size: size of the application
    Installs: how many users installed the application
    Type: type of the application
    Price: price of the application
    Content Rating: rating of the content of the application

  15. original : CIFAR 100

    • kaggle.com
    zip
    Updated Dec 28, 2024
    Cite
    Shashwat Pandey (2024). original : CIFAR 100 [Dataset]. https://www.kaggle.com/datasets/shashwat90/original-cifar-100
    Explore at:
    zip(168517945 bytes)Available download formats
    Dataset updated
    Dec 28, 2024
    Authors
    Shashwat Pandey
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

    The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.

    Baseline results You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find nice settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.

    Other results Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website; click here to view.

    Dataset layout Python / Matlab versions I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.

    The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a python2 routine which will open such a file and return a dictionary:

    def unpickle(file):
        import cPickle
        with open(file, 'rb') as fo:
            dict = cPickle.load(fo)
        return dict

    And a python3 version:

    def unpickle(file):
        import pickle
        with open(file, 'rb') as fo:
            dict = pickle.load(fo, encoding='bytes')
        return dict

    Loaded in this way, each of the batch files contains a dictionary with the following elements:

    data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.

    labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

    The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

    label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.

    Binary version

    The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:

    <1 x label><3072 x pixel>
    ...
    <1 x label><3072 x pixel>

    In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.

    Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
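
    For illustration, a minimal NumPy sketch for parsing one binary batch file, following the layout described above (reading the whole file at once is assumed to be acceptable):

    import numpy as np

    # 10000 rows of 1 label byte + 3072 pixel bytes
    raw = np.fromfile("data_batch_1.bin", dtype=np.uint8).reshape(10000, 3073)
    labels = raw[:, 0]                             # values 0-9
    images = raw[:, 1:].reshape(10000, 3, 32, 32)  # channel order: red, green, blue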

    There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.

    The CIFAR-100 dataset This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...

  16. NTU60 Processed Skeleton Dataset

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    Oucherif Mohammed Ouail (2025). NTU60 Processed Skeleton Dataset [Dataset]. https://www.kaggle.com/datasets/oucherifouail/ntu60-processed-skeleton-dataset
    Explore at:
    zip(3075187118 bytes)Available download formats
    Dataset updated
    Aug 29, 2025
    Authors
    Oucherif Mohammed Ouail
    Description

    NTU RGB+D 60 – Preprocessed Skeleton Dataset

    This dataset provides preprocessed skeleton sequences from the NTU RGB+D 60 benchmark, widely used for skeleton-based human action recognition.

    The preprocessing module standardizes the raw NTU skeleton data to make it directly usable for training deep learning models.

    Preprocessing Steps

    Each skeleton sequence was processed by:

    • ✅ Removing NaN / invalid frames
    • ✅ Translating skeletons (centered spine base joint at origin)
    • ✅ Normalizing body scale using spine length
    • ✅ Aligning all sequences to 300 frames (padding or truncation)
    • ✅ Formatting sequences to include up to 2 persons per clip

    Output Files

    Two .npz files are provided, following the standard evaluation protocols:

    1. NTU60_CS.npz → Cross-Subject split
    2. NTU60_CV.npz → Cross-View split

    Each file contains:

    • x_train → Training data, shape (N_train, 300, 150)
    • y_train → Training labels, shape (N_train, 60) (one-hot)
    • x_test → Testing data, shape (N_test, 300, 150)
    • y_test → Testing labels, shape (N_test, 60) (one-hot)

    Data Format

    • 300 = max frames per sequence (zero-padded)
    • 150 = 2 persons × 25 joints × 3 coordinates (x, y, z)
    • 60 = number of action classes

    If a sequence has only 1 person, the second person’s features are zero-filled.

    Skeleton Properties

    • Centered → Spine base joint (joint-2) at origin (0,0,0)
    • Normalized → Body size scaled consistently
    • Aligned → Fixed-length sequences (300 frames)
    • Two-person setting → Always represented with 150 features

    Evaluation Protocols

    • Cross-Subject (CS): Train and test sets split by different actors. The model is evaluated on unseen subjects to measure generalization across people.
    • Cross-View (CV): Train and test sets split by different camera views. The model is evaluated on unseen viewpoints to measure viewpoint invariance.

    Usage

    These .npz files can be directly loaded in PyTorch or NumPy-based pipelines. They are fully compatible with graph convolutional networks (GCNs), transformers, and other deep learning models for skeleton-based action recognition.

    Example:

    import numpy as np
    
    data = np.load("NTU60_CS.npz")
    x_train, y_train = data["x_train"], data["y_train"]
    
    print(x_train.shape) # (N_train, 300, 150)
    print(y_train.shape) # (N_train, 60)
    
  17. Drone Dataset

    • kaggle.com
    zip
    Updated Oct 6, 2025
    Cite
    Pir Ghullam Mustafa (2025). Drone Dataset [Dataset]. https://www.kaggle.com/datasets/PirMustafa/drone-dataset
    Explore at:
    zip(637174000 bytes)Available download formats
    Dataset updated
    Oct 6, 2025
    Authors
    Pir Ghullam Mustafa
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Drone Anomaly Detection Time-Series Dataset

    This dataset contains pre-processed time-series data for a binary classification task to determine whether a drone is healthy or faulty based on its motion data. The data has been windowed and is ready for use with sequence-based deep learning models like LSTMs, GRUs, or 1D CNNs.

    Dataset Description

    Source Data: The data is derived from the "DronePropA: Motion Trajectories Dataset for Defective Drones" by Ismail, Elshaar, et al. The original dataset consists of 130 .mat files, each representing a single flight experiment.

    Preprocessing Steps: The original .mat files have been processed to create a single, model-ready .npz file. The following steps were applied (a short windowing sketch follows below):

    1. Feature Extraction: For each of the 130 flights, 12 specific time-series features were extracted, focusing on the drone's core motion dynamics.
    2. Labeling: Each flight was labeled as healthy (0) or faulty (1) based on the file naming convention described in the source paper.
    3. Windowing: The time-series data from each flight was segmented into overlapping windows. Each window is 200 time-steps long with a 50% overlap between consecutive windows.
    4. Aggregation: All windows from all flights were stacked into a single dataset.
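
    To make the windowing step concrete, here is a minimal sketch (the function name and details are illustrative, not taken from the original pipeline):

    import numpy as np

    def make_windows(flight, win=200, stride=100):
        # Segment a (T, 12) flight into win-length windows with 50% overlap
        n = (flight.shape[0] - win) // stride + 1
        return np.stack([flight[i * stride : i * stride + win] for i in range(n)])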

    Dataset Structure

    The data is contained in a single compressed NumPy archive file: proceed_data.npz. This file contains two arrays: X and y.

    • X: A 3-dimensional NumPy array containing the feature data.

      • Shape: (num_windows, 200, 12)
      • First Dimension: The total number of windows aggregated from all flights.
      • Second Dimension: The number of time-steps in each window (200).
      • Third Dimension: The number of features recorded at each time-step (12).
    • y: A 1-dimensional NumPy array containing the corresponding labels for each window in X.

      • Shape: (num_windows,)
      • Values: 0 for a healthy window or 1 for a faulty window.

    How to Use

    You can load the data easily using NumPy.

    import numpy as np
    
    # Load the dataset
    data = np.load('proceed_data.npz')
    
    # Extract the features and labels
    X = data['X']
    y = data['y']
    
    print("Data loaded successfully!")
    print(f"Features shape: {X.shape}")
    print(f"Labels shape: {y.shape}")
    
    Dataset Details

    Features: The 12 features in the third dimension of the X array are in the following order:

    1. Position X (meters)
    2. Position Y (meters)
    3. Position Z (meters)
    4. Roll (radians)
    5. Pitch (radians)
    6. Yaw (radians)
    7. Roll Rate (rad/s)
    8. Pitch Rate (rad/s)
    9. Yaw Rate (rad/s)
    10. Acceleration X (m/s²)
    11. Acceleration Y (m/s²)
    12. Acceleration Z (m/s²)

    Labels: The labels in the y array are defined as:

    • 0: Healthy
    • 1: Faulty

    Citation

    If you use this dataset, please cite the original authors of the DronePropA dataset.
    
  18. London Housing Data

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Cite
    Data Science Lovers (2025). London Housing Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/london-housing-data
    Explore at:
    zip(138862 bytes)Available download formats
    Dataset updated
    Sep 15, 2025
    Authors
    Data Science Lovers
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    London
    Description

    📹Project Video available on YouTube - https://youtu.be/q-Omt6LgRLc

    🖇️Connect with me on LinkedIn - https://www.linkedin.com/in/rohit-grewal

    London Housing Price Dataset

    The dataset contains housing market information for different areas of London over time. It includes details such as average house prices, the number of houses sold, and crime statistics. The data spans multiple years and is organized by date and geographic area.

    This data is available as a CSV file. We are going to analyze this data set using the Pandas DataFrame.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q. 1) Convert the Datatype of 'Date' column to Date-Time format.

    Q. 2.A) Add a new column 'year' in the dataframe, which contains years only.

    Q. 2.B) Add a new column 'month' as 2nd column in the dataframe, which contains month only.

    Q. 3) Remove the columns 'year' and 'month' from the dataframe.

    Q. 4) Show all the records where 'No. of Crimes' is 0. And, how many such records are there?

    Q. 5) What is the maximum & minimum 'average_price' per year in England?

    Q. 6) What is the maximum & minimum 'No. of Crimes' recorded per area?

    Q. 7) Show the total count of records of each area, where average price is less than 100000.
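
    A minimal pandas sketch for the first few questions (the file and column names are assumptions based on the feature list below; adjust them to the actual CSV header):

    import pandas as pd

    df = pd.read_csv("london_housing.csv")  # hypothetical file name

    # Q1: convert the 'date' column to date-time format
    df['date'] = pd.to_datetime(df['date'])

    # Q2: add 'year' and 'month' columns ('month' as the 2nd column)
    df['year'] = df['date'].dt.year
    df.insert(1, 'month', df['date'].dt.month)

    # Q3: remove them again
    df = df.drop(columns=['year', 'month'])

    # Q4: records where 'no_of_crimes' is 0, and their count
    zero_crimes = df[df['no_of_crimes'] == 0]
    print(len(zero_crimes))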

    Enrol in our Udemy courses:

    1. Python Data Analytics Projects - https://www.udemy.com/course/bigdata-analysis-python/?referralCode=F75B5F25D61BD4E5F161
    2. Python For Data Science - https://www.udemy.com/course/python-for-data-science-real-time-exercises/?referralCode=9C91F0B8A3F0EB67FE67
    3. Numpy For Data Science - https://www.udemy.com/course/python-numpy-exercises/?referralCode=FF9EDB87794FED46CBDF

    These are the main Features/Columns available in the dataset :

    1) Date – The month and year when the data was recorded.

    2) Area – The London borough or area for which the housing and crime data is reported.

    3) Average_price – The average house price in the given area during the specified month.

    4) Code – The unique area code (e.g., government statistical code) corresponding to each borough or region.

    5) Houses_sold – The number of houses sold in the given area during the specified month.

    6) No_of_crimes – The number of crimes recorded in the given area during the specified month.

  19. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip(65751 bytes)Available download formats
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame (see the sketch after this list).
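
    A minimal sketch of such a helper, assuming a local PostgreSQL instance holding the classicmodels schema (the connection string and credentials are illustrative):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def read_table(table_name):
        # Load one table from PostgreSQL into a pandas DataFrame
        return pd.read_sql_table(table_name, con=engine)

    tables = ["orders", "orderdetails", "customers", "products", "employees"]
    frames = {name: read_table(name) for name in tables}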

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results

    • Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    • Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    • Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions

    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used

    • Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    • Database: PostgreSQL
    • Tools: Jupyter Notebook
    • Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  20. Dataset: Prime Numbers - First 1Lac

    • kaggle.com
    zip
    Updated May 12, 2018
    Cite
    Rehan Guha (2018). Dataset: Prime Numbers - First 1Lac [Dataset]. https://www.kaggle.com/rehanguha/dataset-prime-numbers-first-1lac
    Explore at:
    zip(872594 bytes)Available download formats
    Dataset updated
    May 12, 2018
    Authors
    Rehan Guha
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    File:
    • Contains 1000 files with 100 prime numbers in each file
    • Format: *.dat

    Data Format:
    • Python NumPy array
    • Float64

    Example (how to use): numpy.loadtxt([filename])
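
    A minimal loading sketch (the file name is hypothetical; the archive contains 1000 .dat files):

    import numpy as np

    primes = np.loadtxt("primes_0001.dat")  # hypothetical file name
    print(primes.dtype, primes.shape)       # float64, 100 values per file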
