7 datasets found
  1. MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization...

    • zenodo.org
    text/x-python
    Updated Oct 5, 2025
    Cite
    ali Azadi; ali Azadi (2025). MIMIC-IV Lab Events Subset - Preprocessed for Data Normalization Analysis.xlsx [Dataset]. http://doi.org/10.5281/zenodo.17272946
    Available download formats: text/x-python
    Dataset updated
    Oct 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    ali Azadi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description


    This updated version includes a Python script (glucose_analysis.py) that performs statistical evaluation of the glucose normalization process described in the associated thesis. The script supports key analyses, including normality assessment (Shapiro–Wilk test), variance homogeneity (Levene’s test), mean comparison (ANOVA), effect size estimation (Cohen’s d), and calculation of confidence intervals for the mean difference. These results validate the impact of Min-Max normalization on clinical data structure and usability within CDSS workflows. The script is designed to be reproducible and complements the processed dataset already included in this repository.
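
    For readers who want to replicate this kind of evaluation, the sketch below shows how such tests can be run with SciPy and NumPy. It is an illustration only, not the repository's glucose_analysis.py: the data, group split, and Min-Max step are hypothetical stand-ins.

    import numpy as np
    from scipy import stats

    # Hypothetical glucose readings (mg/dL); the real script uses the MIMIC-IV subset
    rng = np.random.default_rng(0)
    glucose = rng.normal(110, 25, size=200)
    normalized = (glucose - glucose.min()) / (glucose.max() - glucose.min())  # Min-Max to [0, 1]

    a, b = glucose[:100], glucose[100:]  # illustrative two-group split

    w_stat, w_p = stats.shapiro(glucose)   # normality assessment (Shapiro-Wilk)
    lev_stat, lev_p = stats.levene(a, b)   # variance homogeneity (Levene)
    f_stat, f_p = stats.f_oneway(a, b)     # mean comparison (ANOVA)

    # Effect size (Cohen's d) with a pooled standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    cohens_d = (a.mean() - b.mean()) / pooled_sd

    # 95% confidence interval for the mean difference via the t distribution
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    ci_low, ci_high = stats.t.interval(0.95, df=len(a) + len(b) - 2, loc=diff, scale=se)

    print(f"Shapiro p={w_p:.3f}, Levene p={lev_p:.3f}, ANOVA p={f_p:.3f}")
    print(f"Cohen's d={cohens_d:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")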

  2. feral-cat-segmentation_dataset

    • kaggle.com
    • universe.roboflow.com
    zip
    Updated Mar 18, 2025
    Cite
    lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
    Available download formats: zip (971,125,684 bytes)
    Dataset updated
    Mar 18, 2025
    Authors
    lu hou yang
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Feral Cat Segmentation Dataset

    Overview

    This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

    Dataset Source

    The original dataset was published by Paul Cashman on Roboflow Universe: https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation

    Dataset Contents

    The dataset is organized into three standard splits:

    • Train set
    • Validation set
    • Test set

    Each split contains data in multiple formats:

    1. Original JPG images
    2. Segmentation mask JPG images
    3. Parquet files containing flattened image and mask data
    4. Pickle files containing serialized image and mask data
    5. CSV files containing the same flattened data in plain-text form

    Data Formats

    1. Image Files

    • Format: JPG
    • Resolution: 224×224 pixels
    • Directory Structure:
      • train/: Original training images
      • valid/: Original validation images
      • test/: Original test images
      • train_mask/: Corresponding segmentation masks for training
      • valid_mask/: Corresponding segmentation masks for validation
      • test_mask/: Corresponding segmentation masks for testing

    2. Parquet Files

    • Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet
    • Content: Flattened image data and corresponding masks combined in a single table
    • Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask
    • Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels
      • Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])
      • Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])
    • Benefits: Efficient storage and faster loading compared to individual image files
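
    As a sketch of reading one of these files directly (assuming pandas with a parquet engine installed; the reshape follows the split rule described above):

    import numpy as np
    import pandas as pd

    # Each row holds the flattened image pixels followed by the flattened mask pixels
    df = pd.read_parquet("train_dataset.parquet")

    image_size, image_channels = (224, 224), 3
    split_at = image_size[0] * image_size[1] * image_channels  # 224*224*3 = 150528

    data = df.to_numpy()
    images = data[:, :split_at].reshape(-1, 224, 224, 3)  # image pixel values
    masks = data[:, split_at:].reshape(-1, 224, 224, 1)   # mask pixel values
    print(images.shape, masks.shape)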

    3. Pickle Files

    • Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl
    • Content: Serialized Python objects containing images and their corresponding masks
    • Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle
    • Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels
    • Benefits: Preserves original data structure and enables quick loading in Python
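
    A correspondingly minimal sketch for the pickle files, assuming the list-of-pairs layout described above:

    import pickle

    with open("train_dataset.pkl", "rb") as f:
        pairs = pickle.load(f)  # list of [image, mask] pairs

    image, mask = pairs[0]
    print(len(pairs))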

    4. CSV Files

    • Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv
    • Content: Same data as parquet files but in CSV format
    • Structure: No headers, raw flattened pixel values
    • Data Division: Same split point as parquet files

    Image Preprocessing

    All images were preprocessed with the following operations:

    • Resized to 224×224 pixels using bilinear interpolation
    • Segmentation masks resized to match the images, using nearest neighbor interpolation
    • Original RLE (Run-Length Encoding) segmentation data converted to binary masks

    Data Normalization

    When used with the provided PyTorch dataset class, images are normalized with:

    • Mean: [0.48235, 0.45882, 0.40784]
    • Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098] (i.e., 1/255)
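
    If you build your own pipeline instead of using the provided class, an equivalent torchvision transform would look roughly like this (a sketch, assuming torchvision is available):

    from torchvision import transforms

    # Note: a std of 1/255 effectively rescales [0, 1] tensors back to a 0-255 range
    # before mean subtraction; the values below mirror those listed above.
    normalize = transforms.Compose([
        transforms.ToTensor(),  # HxWxC uint8 -> CxHxW float in [0, 1]
        transforms.Normalize(
            mean=[0.48235, 0.45882, 0.40784],
            std=[1 / 255, 1 / 255, 1 / 255],
        ),
    ])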

    PyTorch Integration

    A custom CatDataset class is included for easy integration with PyTorch:

    from cat_dataset import CatDataset
    
    # Load from parquet format
    dataset = CatDataset(
      root="path/to/dataset",
      split="train", # Options: "train", "valid", "test"
      format="parquet", # Options: "parquet", "pkl"
      image_size=[224, 224],
      image_channels=3,
      mask_channels=1
    )
    
    # Use with PyTorch DataLoader
    from torch.utils.data import DataLoader
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    

    Performance Comparison

    Loading time benchmarks from the original implementation:

    • Parquet format: ~1.29 seconds per iteration
    • Pickle format: ~0.71 seconds per iteration

    The pickle format provides the fastest loading times and is recommended for most use cases.

    Citation

    If you use this dataset in your research or projects, please cite:

    @misc{feral-cat-segmentation_dataset,
     title = {feral-cat-segmentation Dataset},
     type = {Open Source Dataset},
     author = {Paul Cashman},
     howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
     url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
     journal = {Roboflow Universe},
     publisher = {Roboflow},
     year = {2025},
     month = {mar},
     note = {visited on 2025-03-19},
    }
    

    Sample Usage Code

    Basic Dataset Loading

    from ca...
    
  3. Task Scheduler Performance Survey Results

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Jakub Beránek; Stanislav Böhm; Vojtěch Cima (2020). Task Scheduler Performance Survey Results [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2630588
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    IT4Innovations
    Authors
    Jakub Beránek; Stanislav Böhm; Vojtěch Cima
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Task scheduler performance survey

    This dataset contains the results of a task graph scheduler performance survey. The results are stored in the following files, which correspond to simulations performed on the elementary, irw and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384.

    elementary-result.zip

    irw-result.zip

    pegasus-result.zip

    The files contain compressed pandas DataFrames in CSV format; they can be read with the following Python code:

    import pandas as pd
    frame = pd.read_csv("elementary-result.zip")

    Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.

    graph_name - name of the benchmarked task graph

    graph_set - name of the task graph dataset from which the graph originates

    graph_id - unique ID of the graph

    cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, so 32x16 means 32 workers, each with 16 cores

    bandwidth - network bandwidth [MiB/s]

    netmodel - network model (simple or maxmin)

    scheduler_name - name of the scheduler

    imode - information mode

    min_sched_interval - minimal scheduling delay [s]

    sched_time - duration of each scheduler invocation [s]

    time - simulated makespan of the task graph execution [s]

    execution_time - real duration of all scheduler invocations [s]

    total_transfer - amount of data transferred amongst workers [MiB]
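
    As a quick sketch of how these columns can be combined (the aggregation choice here is illustrative, not from the survey itself):

    import pandas as pd

    frame = pd.read_csv("elementary-result.zip")

    # Mean simulated makespan per scheduler and bandwidth
    summary = (frame
               .groupby(["scheduler_name", "bandwidth"])["time"]
               .mean()
               .unstack("bandwidth"))
    print(summary)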

    The file charts.zip contains charts obtained by processing the datasets. The X axis is always bandwidth [MiB/s]. The archive contains the following files:

    [DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]

    [DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.

    [DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]

    [DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]

    [DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"

    [DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]

    [DATASET]-[CLUSTER]-schedtime-time - Comparison of MSD (minimal scheduling delay), absolute times [seconds]

    [DATASET]-[CLUSTER]-schedtime-score - Comparison of MSD, normalized to the average of "MSD=0.0" case

    [DATASET]-[CLUSTER]-imode-time - Comparison of Imodes, absolute times [seconds]

    [DATASET]-[CLUSTER]-imode-score - Comparison of Imodes, normalized to the average of "exact" imode

    Reproducing the results

    1. Download and install Estee (https://github.com/It4innovations/estee)

    $ git clone https://github.com/It4innovations/estee
    $ cd estee
    $ pip install .

    2. Generate task graphs

    You can either use the provided script benchmarks/generate.py to generate graphs from three categories (elementary, irw and pegasus):

    $ cd benchmarks
    $ python generate.py elementary.zip elementary
    $ python generate.py irw.zip irw
    $ python generate.py pegasus.zip pegasus

    or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.

    3. Run benchmarks

    To run a benchmark suite, you should prepare a JSON file describing the benchmark. The file that was used to run the experiments from the paper is provided in benchmark.json. Then you can run the benchmark using this command:

    $ python pbs.py compute benchmark.json

    The benchmark script can be interrupted at any time (for example with Ctrl+C). When interrupted, it stores the computed results in the result file and resumes the computation when launched again.

    4. Visualize results

    $ python view.py --all

    The resulting plots will appear in a folder called outputs.

  4. Python Energy Microscope: Benchmarking 5 Execution

    • kaggle.com
    zip
    Updated Jun 18, 2025
    Cite
    Md. Fatin Shadab Turja (2025). Python Energy Microscope: Benchmarking 5 Execution [Dataset]. https://www.kaggle.com/datasets/fatinshadab/python-energy-microscope-dataset
    Available download formats: zip (176,065 bytes)
    Dataset updated
    Jun 18, 2025
    Authors
    Md. Fatin Shadab Turja
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.

    Purpose & Motivation

    With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:

    Quantify execution time, CPU energy usage, and carbon emissions

    Enable reproducible analysis of performance–sustainability trade-offs

    Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation

    Data Collection & Tools

    All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.

    Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
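
    To illustrate the measurement pipeline, here is a minimal sketch assuming pyRAPL's Measurement API and a RAPL-capable Intel CPU; fib is a stand-in workload, not one of the paper's 15 benchmarks:

    import pyRAPL

    pyRAPL.setup()  # requires read access to the RAPL counters

    def fib(n):  # stand-in workload
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    meter = pyRAPL.Measurement("fib")
    meter.begin()
    fib(30)
    meter.end()

    energy_uj = meter.result.pkg[0]          # CPU package energy in microjoules
    grams_co2 = energy_uj * 1e-6 * 0.000475  # joules x 0.000475 gCO2/J (factor from the paper)
    print(f"{energy_uj:.0f} uJ -> {grams_co2:.6f} g CO2")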

    Dataset Structure Overview

    Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.

    Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.

    Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.

    Usage

    This dataset is ideal for:

    Reproducible software sustainability studies

    Benchmarking Python execution strategies

    Analyzing energy–performance–carbon trade-offs

    Validating green metrics and measurement tools

    Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.

  5. Student Performance and Learning Behavior Dataset

    • kaggle.com
    zip
    Updated Sep 4, 2025
    + more versions
    Cite
    Adil Shamim (2025). Student Performance and Learning Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style
    Available download formats: zip (78,897 bytes)
    Dataset updated
    Sep 4, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.

    It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.

    Key Features

    • Study behaviors & engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resources & environment: Resources, Internet, EduTech
    • Motivation & psychology: Motivation, StressLevel
    • Demographics: Gender, Age (18–30 years)
    • Learning preference: LearningStyle
    • Performance indicators: ExamScore, FinalGrade

    Objectives & Use Cases

    The dataset can be used for:

    • Predictive modeling → Regression/classification of student performance (ExamScore, FinalGrade)
    • Clustering analysis → Identifying learning behavior groups with K-Means or other unsupervised methods
    • Educational analytics → Exploring how study habits, stress, and motivation affect outcomes
    • Adaptive learning research → Linking behavioral patterns to personalized learning pathways

    Analysis Pipeline (from original study)

    The dataset was analyzed in Python using:

    • Preprocessing → Encoding, normalization (z-score, Min–Max), deduplication
    • Clustering → K-Means, Elbow Method, Silhouette Score, Davies–Bouldin Index
    • Dimensionality Reduction → PCA (2D/3D visualizations)
    • Statistical Analysis → ANOVA, regression for group differences
    • Interpretation → Mapping clusters to LearningStyle categories & extracting insights for adaptive learning
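
    A minimal scikit-learn sketch of this pipeline is shown below. The feature subset is illustrative (categorical columns such as LearningStyle would need encoding first), and k=4 stands in for a value chosen via the Elbow Method:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score
    from sklearn.decomposition import PCA

    df = pd.read_csv("merged_dataset.csv")
    features = df[["StudyHours", "Attendance", "StressLevel", "ExamScore"]]  # illustrative subset

    X = StandardScaler().fit_transform(features)   # z-score normalization

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    print("Silhouette:", silhouette_score(X, labels))
    print("Davies-Bouldin:", davies_bouldin_score(X, labels))

    coords = PCA(n_components=2).fit_transform(X)  # 2D projection for visualization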

    File

    • merged_dataset.csv → 14,003 rows × 16 columns Includes student demographics, behaviors, engagement, learning styles, and performance indicators.

    Provenance

    This dataset is an excellent playground for educational data mining — from clustering and behavioral analytics to predictive modeling and personalized learning applications.

  6. Onset of mining operations

    • data-staging.niaid.nih.gov
    Updated Mar 17, 2024
    Cite
    Remelgado, Ruben; Meyer, Carsten (2024). Onset of mining operations [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8214548
    Dataset updated
    Mar 17, 2024
    Dataset provided by
    German Centre for Integrative Biodiversity Research (iDiv)
    Authors
    Remelgado, Ruben; Meyer, Carsten
    Description

    Motivation

    Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet it is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.

    Approach

    For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.

    After completing the extraction, we estimated mean spectral profiles for each acquisition date, once for the samples “inside” the mining area and once for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.

    Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed to emphasize differences in the shape of the spectral profiles rather than in specific values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.

    We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested. In this example, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.

    To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC). An elbow is characterized by a convex time-series shape, which makes the AUC greater than 50%. However, if the shape of the curve is concave, the knee is the most adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. When below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
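
    A condensed sketch of this detection step, assuming kneebow's Rotor API (the RMSE series here is random stand-in data):

    import numpy as np
    from kneebow.rotor import Rotor

    rmse = np.random.rand(150)  # per-date RMSE series for one site (stand-in)
    csum = np.cumsum(rmse)      # cumulative sum: removes noise, keeps abrupt shifts
    xy = np.column_stack([np.arange(len(csum)), csum])

    if len(csum) >= 100:  # below 100 time steps the onset is assumed to predate 1990
        # Convex shape (AUC > 50% of the bounding box) -> elbow; concave -> knee
        auc = np.trapz(csum) / (len(csum) * csum.max())
        rotor = Rotor()
        rotor.fit_rotate(xy)
        idx = rotor.get_elbow_index() if auc > 0.5 else rotor.get_knee_index()
        print("Inferred onset index:", idx)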

    Content

    This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:

    00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.

    01_analysis - Contains several outputs of our analysis:

    xy.tar.gz - Sample locations for each mining site.

    sr.tar.gz - Spectral profiles for each sample location.

    mine_start.csv - First year when we detected the start of mining.

    02_code - Includes all code used in our analysis.

    requirements.txt - Python module requirements that can be fed to pip to replicate our study.

    config.yml - Configuration file, including information on the Landsat products used.

  7. Animals (Cats, Dogs, and Snakes)

    • kaggle.com
    zip
    Updated Nov 18, 2025
    + more versions
    Cite
    Omar Rehan (2025). Animals (Cats, Dogs, and Snakes) [Dataset]. https://www.kaggle.com/datasets/aiomarrehan/animals-cats-dogs-and-snakes
    Available download formats: zip (40,219,983 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Omar Rehan
    Description

    Cats, Dogs, and Snakes Dataset

    Dataset Overview

    The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.

    Class  | Number of Images | Description
    Cats   | 1,000            | Includes multiple breeds and poses
    Dogs   | 1,000            | Covers various breeds and backgrounds
    Snakes | 1,000            | Includes multiple species and natural settings

    Total Images: 3,000

    Image Properties:

    • Resolution: 224×224 pixels (resized for consistency)
    • Color Mode: RGB
    • Format: JPEG/PNG
    • Cleaned: Duplicate, blurry, and irrelevant images removed

    Data Split Recommendation

    Set        | Percentage | Number of Images
    Training   | 70%        | 2,100
    Validation | 15%        | 450
    Test       | 15%        | 450

    Preprocessing

    Images in the dataset have been standardized to support machine learning pipelines:

    1. Resizing to 224×224 pixels.
    2. Normalization of pixel values to [0,1] or mean subtraction for deep learning frameworks.
    3. Label encoding: Integer encoding (0 = Cat, 1 = Dog, 2 = Snake) or one-hot encoding for model training.

    Example: Loading and Using the Dataset (Python)

    import os
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Path to dataset
    dataset_path = "path/to/dataset"
    
    # ImageDataGenerator for preprocessing
    datagen = ImageDataGenerator(
      rescale=1./255,
      validation_split=0.15 # 15% for validation
    )
    
    # Load training data
    train_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='training',
      shuffle=True
    )
    
    # Load validation data
    validation_generator = datagen.flow_from_directory(
      dataset_path,
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical',
      subset='validation',
      shuffle=False
    )
    
    # Example: Iterate over one batch
    images, labels = next(train_generator)
    print(images.shape, labels.shape) # (32, 224, 224, 3) (32, 3)
    

    Key Features

    • Balanced: Equal number of samples per class reduces bias.
    • Cleaned: High-quality, relevant images improve model performance.
    • Diverse: Covers multiple breeds, species, and environments to ensure generalization.
    • Ready for ML: Preprocessed and easily integrated into popular deep learning frameworks.
