7 datasets found
  1. MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization...

    • zenodo.org
    text/x-python
    Updated Oct 5, 2025
    Cite
    ali Azadi; ali Azadi (2025). MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization Analysis.xlsx [Dataset]. http://doi.org/10.5281/zenodo.17272946
    Available download formats: text/x-python
    Dataset updated
    Oct 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    ali Azadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    This updated version includes a Python script (glucose_analysis.py) that performs statistical evaluation of the glucose normalization process described in the associated thesis. The script supports key analyses, including normality assessment (Shapiro–Wilk test), variance homogeneity (Levene’s test), mean comparison (ANOVA), effect size estimation (Cohen’s d), and calculation of confidence intervals for the mean difference. These results validate the impact of Min-Max normalization on clinical data structure and usability within CDSS workflows. The script is designed to be reproducible and complements the processed dataset already included in this repository.
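
    For readers who want to replicate this kind of evaluation, the sketch below shows how such tests can be run with SciPy and NumPy. It is an illustration only, not the repository's glucose_analysis.py: the data, group split, and Min-Max step are hypothetical stand-ins.

    import numpy as np
    from scipy import stats

    # Hypothetical glucose readings (mg/dL); the real script uses the MIMIC-IV subset
    rng = np.random.default_rng(0)
    glucose = rng.normal(110, 25, size=200)
    normalized = (glucose - glucose.min()) / (glucose.max() - glucose.min())  # Min-Max to [0, 1]

    a, b = glucose[:100], glucose[100:]  # illustrative two-group split

    w_stat, w_p = stats.shapiro(glucose)   # normality assessment (Shapiro-Wilk)
    lev_stat, lev_p = stats.levene(a, b)   # variance homogeneity (Levene)
    f_stat, f_p = stats.f_oneway(a, b)     # mean comparison (ANOVA)

    # Effect size (Cohen's d) with a pooled standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    cohens_d = (a.mean() - b.mean()) / pooled_sd

    # 95% confidence interval for the mean difference via the t distribution
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    ci_low, ci_high = stats.t.interval(0.95, df=len(a) + len(b) - 2, loc=diff, scale=se)

    print(f"Shapiro p={w_p:.3f}, Levene p={lev_p:.3f}, ANOVA p={f_p:.3f}")
    print(f"Cohen's d={cohens_d:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")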

  2. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Available download formats: zip (971,125,684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    The original dataset was published by Paul Cashman on Roboflow Universe: https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation

    Dataset Contents

    The dataset is organized into three standard splits:

    • Train set
    • Validation set
    • Test set

    Each split contains data in multiple formats:

    1. Original JPG images
    2. Segmentation mask JPG images
    3. Parquet files containing flattened image and mask data
    4. Pickle files containing serialized image and mask data
    5. CSV files containing the same flattened data in plain-text form

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
    • Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files
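
    As a sketch of reading one of these files directly (assuming pandas with a parquet engine installed; the reshape follows the split rule described above):

    import numpy as np
    import pandas as pd

    # Each row holds the flattened image pixels followed by the flattened mask pixels
    df = pd.read_parquet("train_dataset.parquet")

    image_size, image_channels = (224, 224), 3
    split_at = image_size[0] * image_size[1] * image_channels  # 224*224*3 = 150528

    data = df.to_numpy()
    images = data[:, :split_at].reshape(-1, 224, 224, 3)  # image pixel values
    masks = data[:, split_at:].reshape(-1, 224, 224, 1)   # mask pixel values
    print(images.shape, masks.shape)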

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python
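
    A correspondingly minimal sketch for the pickle files, assuming the list-of-pairs layout described above:

    import pickle

    with open("train_dataset.pkl", "rb") as f:
        pairs = pickle.load(f)  # list of [image, mask] pairs

    image, mask = pairs[0]
    print(len(pairs))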

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files

    Image Preprocessing

    All images were preprocessed with the following operations:

    • Resized to 224×224 pixels using bilinear interpolation
    • Segmentation masks resized to match the images, using nearest neighbor interpolation
    • Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

    When used with the provided PyTorch dataset class, images are normalized with:

    • Mean: [0.48235, 0.45882, 0.40784]
    • Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098] (i.e., 1/255)
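
    If you build your own pipeline instead of using the provided class, an equivalent torchvision transform would look roughly like this (a sketch, assuming torchvision is available):

    from torchvision import transforms

    # Note: a std of 1/255 effectively rescales [0, 1] tensors back to a 0-255 range
    # before mean subtraction; the values below mirror those listed above.
    normalize = transforms.Compose([
        transforms.ToTensor(),  # HxWxC uint8 -> CxHxW float in [0, 1]
        transforms.Normalize(
            mean=[0.48235, 0.45882, 0.40784],
            std=[1 / 255, 1 / 255, 1 / 255],
        ),
    ])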

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

    Loading time benchmarks from the original implementation:

    • Parquet format: ~1.29 seconds per iteration
    • Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  3. Task Scheduler Performance Survey Results

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Jakub Beránek; Stanislav Böhm; Vojtěch Cima (2020). Task Scheduler Performance Survey Results [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2630588
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    IT4Innovations
    Authors
    Jakub Beránek; Stanislav Böhm; Vojtěch Cima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Task scheduler performance survey

    This dataset contains the results of a task graph scheduler performance survey. The results are stored in the following files, which correspond to simulations performed on the elementary, irw and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384.

    elementary-result.zip

    irw-result.zip

    pegasus-result.zip

    The files contain compressed pandas DataFrames in CSV format; they can be read with the following Python code:

    import pandas as pd
    frame = pd.read_csv("elementary-result.zip")

    Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.

    graph_name - name of the benchmarked task graph

    graph_set - name of the task graph dataset from which the graph originates

    graph_id - unique ID of the graph

    cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, so 32x16 means 32 workers, each with 16 cores

    bandwidth - network bandwidth [MiB/s]

    netmodel - network model (simple or maxmin)

    scheduler_name - name of the scheduler

    imode - information mode

    min_sched_interval - minimal scheduling delay [s]

    sched_time - duration of each scheduler invocation [s]

    time - simulated makespan of the task graph execution [s]

    execution_time - real duration of all scheduler invocations [s]

    total_transfer - amount of data transferred amongst workers [MiB]
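
    As a quick sketch of how these columns can be combined (the aggregation choice here is illustrative, not from the survey itself):

    import pandas as pd

    frame = pd.read_csv("elementary-result.zip")

    # Mean simulated makespan per scheduler and bandwidth
    summary = (frame
               .groupby(["scheduler_name", "bandwidth"])["time"]
               .mean()
               .unstack("bandwidth"))
    print(summary)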

    The file charts.zip contains charts obtained by processing the datasets. The X axis is always bandwidth [MiB/s]. The archive contains the following files:

    [DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]

    [DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.

    [DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]

    [DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]

    [DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"

    [DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]

    [DATASET]-[CLUSTER]-schedtime-time - Comparison of MSD (minimal scheduling delay), absolute times [seconds]

    [DATASET]-[CLUSTER]-schedtime-score - Comparison of MSD, normalized to the average of "MSD=0.0" case

    [DATASET]-[CLUSTER]-imode-time - Comparison of Imodes, absolute times [seconds]

    [DATASET]-[CLUSTER]-imode-score - Comparison of Imodes, normalized to the average of "exact" imode

    Reproducing the results

    1. Download and install Estee (https://github.com/It4innovations/estee)

    $ git clone https://github.com/It4innovations/estee
    $ cd estee
    $ pip install .

    2. Generate task graphs

    You can either use the provided script benchmarks/generate.py to generate graphs from three categories (elementary, irw and pegasus):

    $ cd benchmarks
    $ python generate.py elementary.zip elementary
    $ python generate.py irw.zip irw
    $ python generate.py pegasus.zip pegasus

    or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.

    3. Run benchmarks

    To run a benchmark suite, you should prepare a JSON file describing the benchmark. The file that was used to run the experiments from the paper is provided in benchmark.json. Then you can run the benchmark using this command:

    $ python pbs.py compute benchmark.json

    The benchmark script can be interrupted at any time (for example with Ctrl+C). When interrupted, it stores the computed results in the result file and resumes the computation when launched again.

    4. Visualize results

    $ python view.py --all

    The resulting plots will appear in a folder called outputs.

  4. Python Energy Microscope: Benchmarking 5 Execution

    • kaggle.com
    zip
    Updated Jun 18, 2025
    Cite
    Md. Fatin Shadab Turja (2025). Python Energy Microscope: Benchmarking 5 Execution [Dataset]. https://www.kaggle.com/datasets/fatinshadab/python-energy-microscope-dataset
    Available download formats: zip (176,065 bytes)
    Dataset updated
    Jun 18, 2025
    Authors
    Md. Fatin Shadab Turja
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.

    Purpose & Motivation

    With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:

    Quantify execution time, CPU energy usage, and carbon emissions

    Enable reproducible analysis of performance–sustainability trade-offs

    Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation

    Data Collection & Tools

    All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.

    Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
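
    To illustrate the measurement pipeline, here is a minimal sketch assuming pyRAPL's Measurement API and a RAPL-capable Intel CPU; fib is a stand-in workload, not one of the paper's 15 benchmarks:

    import pyRAPL

    pyRAPL.setup()  # requires read access to the RAPL counters

    def fib(n):  # stand-in workload
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    meter = pyRAPL.Measurement("fib")
    meter.begin()
    fib(30)
    meter.end()

    energy_uj = meter.result.pkg[0]          # CPU package energy in microjoules
    grams_co2 = energy_uj * 1e-6 * 0.000475  # joules x 0.000475 gCO2/J (factor from the paper)
    print(f"{energy_uj:.0f} uJ -> {grams_co2:.6f} g CO2")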

    Dataset Structure Overview

    Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.

    Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.

    Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.

    Usage

    This dataset is ideal for:

    Reproducible software sustainability studies

    Benchmarking Python execution strategies

    Analyzing energy–performance–carbon trade-offs

    Validating green metrics and measurement tools

    Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.

  5. Student Performance and Learning Behavior Dataset

    • kaggle.com
    zip
    Updated Sep 4, 2025
    + more versions
    Cite
    Adil Shamim (2025). Student Performance and Learning Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style
    Available download formats: zip (78,897 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.

    It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.

    Key Features

    • Study behaviors & engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resources & environment: Resources, Internet, EduTech
    • Motivation & psychology: Motivation, StressLevel
    • Demographics: Gender, Age (18–30 years)
    • Learning preference: LearningStyle
    • Performance indicators: ExamScore, FinalGrade

    Objectives & Use Cases

    The dataset can be used for:

    • Predictive modeling → Regression/classification of student performance (ExamScore, FinalGrade)
    • Clustering analysis → Identifying learning behavior groups with K-Means or other unsupervised methods
    • Educational analytics → Exploring how study habits, stress, and motivation affect outcomes
    • Adaptive learning research → Linking behavioral patterns to personalized learning pathways

    Analysis Pipeline (from original study)

    The dataset was analyzed in Python using:

    • Preprocessing → Encoding, normalization (z-score, Min–Max), deduplication
    • Clustering → K-Means, Elbow Method, Silhouette Score, Davies–Bouldin Index
    • Dimensionality Reduction → PCA (2D/3D visualizations)
    • Statistical Analysis → ANOVA, regression for group differences
    • Interpretation → Mapping clusters to LearningStyle categories & extracting insights for adaptive learning
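
    A minimal scikit-learn sketch of this pipeline is shown below. The feature subset is illustrative (categorical columns such as LearningStyle would need encoding first), and k=4 stands in for a value chosen via the Elbow Method:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score
    from sklearn.decomposition import PCA

    df = pd.read_csv("merged_dataset.csv")
    features = df[["StudyHours", "Attendance", "StressLevel", "ExamScore"]]  # illustrative subset

    X = StandardScaler().fit_transform(features)   # z-score normalization

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    print("Silhouette:", silhouette_score(X, labels))
    print("Davies-Bouldin:", davies_bouldin_score(X, labels))

    coords = PCA(n_components=2).fit_transform(X)  # 2D projection for visualization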

    File

    • merged_dataset.csv → 14,003 rows × 16 columns Includes student demographics, behaviors, engagement, learning styles, and performance indicators.

    Provenance

    This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.

  6. Onset of mining operations

    • data-staging.niaid.nih.gov
    Updated Mar 17, 2024
    Cite
    Remelgado, Ruben; Meyer, Carsten (2024). Onset of mining operations [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8214548
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    German Centre for Integrative Biodiversity Research (iDiv)
    Authors
    Remelgado, Ruben; Meyer, Carsten
    Description

    Motivation

    Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet it is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

    Approach

    For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

    After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.

    Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed to emphasize differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.

    We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.

    To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex time-series shape, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
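
    A condensed sketch of this detection step, assuming kneebow's Rotor API (the RMSE series here is random stand-in data):

    import numpy as np
    from kneebow.rotor import Rotor

    rmse = np.random.rand(150)  # per-date RMSE series for one site (stand-in)
    csum = np.cumsum(rmse)      # cumulative sum: removes noise, keeps abrupt shifts
    xy = np.column_stack([np.arange(len(csum)), csum])

    if len(csum) >= 100:  # below 100 time steps the onset is assumed to predate 1990
        # Convex shape (AUC > 50% of the bounding box) -> elbow; concave -> knee
        auc = np.trapz(csum) / (len(csum) * csum.max())
        rotor = Rotor()
        rotor.fit_rotate(xy)
        idx = rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()
        print("Inferred onset index:", idx)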

    Content

    This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:

    00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

    01_analysis - Contains several outputs of our analysis:

    xy.tar.gz - Sample locations for each mining site.

    sr.tar.gz - Spectral profiles for each sample location.

    mine_start.csv - First year when we detected the start of mining.

    02_code - Includes all code used in our analysis.

    requirements.txt - Python module requirements that can be fed to pip to replicate our study.

    config.yml - Configuration file, including information on the Landsat products used.

  7. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    zip
    Updated Nov 18, 2025
    + more versions
    Cite
    Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Available download formats: zip (40,219,983 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

    Class  | Number of Images | Description
    Cats   | 1,000            | Includes multiple breeds and poses
    Dogs   | 1,000            | Covers various breeds and backgrounds
    Snakes | 1,000            | Includes multiple species and natural settings

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

    Set        | Percentage | Number of Images
    Training   | 70%        | 2,100
    Validation | 15%        | 450
    Test       | 15%        | 450

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
