Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This updated version includes a Python script (glucose_analysis.py) that performs statistical evaluation of the glucose normalization process described in the associated thesis. The script supports key analyses, including normality assessment (Shapiro–Wilk test), variance homogeneity (Levene’s test), mean comparison (ANOVA), effect size estimation (Cohen’s d), and calculation of confidence intervals for the mean difference. These results validate the impact of Min-Max normalization on clinical data structure and usability within CDSS workflows. The script is designed to be reproducible and complements the processed dataset already included in this repository.
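The script itself is not reproduced in this description, but the listed tests map directly onto SciPy. The sketch below uses illustrative data and variable names, not the thesis dataset or the script's actual interface:

```python
import numpy as np
from scipy import stats

# Illustrative stand-ins for the raw and Min-Max normalized glucose values
rng = np.random.default_rng(0)
raw = rng.normal(110.0, 25.0, 200)
normalized = (raw - raw.min()) / (raw.max() - raw.min())

# Normality assessment (Shapiro-Wilk)
w_stat, w_p = stats.shapiro(raw)

# Variance homogeneity (Levene) and mean comparison (one-way ANOVA)
lev_stat, lev_p = stats.levene(raw, normalized)
f_stat, f_p = stats.f_oneway(raw, normalized)

# Effect size (Cohen's d with pooled standard deviation)
pooled_sd = np.sqrt((raw.var(ddof=1) + normalized.var(ddof=1)) / 2)
cohens_d = (raw.mean() - normalized.mean()) / pooled_sd

# 95% confidence interval for the mean difference
diff = raw.mean() - normalized.mean()
se = np.sqrt(raw.var(ddof=1) / raw.size + normalized.var(ddof=1) / normalized.size)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
```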
Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.
The dataset is organized into three standard splits:
- Train set
- Validation set
- Test set
Each split contains data in multiple formats:
1. Original JPG images
2. Segmentation mask JPG images
3. Parquet files containing flattened image and mask data
4. Pickle files containing serialized image and mask data
Directory and file layout:
- train/: Original training images
- valid/: Original validation images
- test/: Original test images
- train_mask/: Corresponding segmentation masks for training
- valid_mask/: Corresponding segmentation masks for validation
- test_mask/: Corresponding segmentation masks for testing
- train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet: Flattened image and mask data; each row concatenates the flattened image and mask, split at split_at = image_size[0] * image_size[1] * image_channels, with images reshaped to [-1, 224, 224, 3] and masks to [-1, 224, 224, 1]
- train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl: Serialized image and mask data, using the same split_at convention
- train_dataset.csv, valid_dataset.csv, test_dataset.csv: CSV exports of the same data

All images were preprocessed with the following operations:
- Resized to 224×224 pixels using bilinear interpolation
- Segmentation masks were also resized to match the images, using nearest-neighbor interpolation
- Original RLE (Run-Length Encoding) segmentation data converted to binary masks
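Based on the split_at convention above, the flattened rows can be restored to image and mask arrays roughly as follows. The exact row layout is an assumption inferred from this description, so treat it as a sketch rather than the repository's loader:

```python
import numpy as np
import pandas as pd

image_size = [224, 224]
image_channels = 3

# Assumption: each row concatenates the flattened image and flattened mask
df = pd.read_parquet("train_dataset.parquet")
data = df.to_numpy()

split_at = image_size[0] * image_size[1] * image_channels
images = data[:, :split_at].reshape([-1, 224, 224, 3])
masks = data[:, split_at:].reshape([-1, 224, 224, 1])
```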
When used with the provided PyTorch dataset class, images are normalized with:
- Mean: [0.48235, 0.45882, 0.40784]
- Standard deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]
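For reference, this corresponds to the following torchvision transform (an illustration, not the CatDataset internals). Note that the standard deviation of 0.00392156862745098 is exactly 1/255:

```python
from torchvision import transforms

# 0.00392156862745098 == 1/255, so this normalization multiplies
# the mean-subtracted channel values by 255
normalize = transforms.Normalize(
    mean=[0.48235, 0.45882, 0.40784],
    std=[1 / 255, 1 / 255, 1 / 255],
)
```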
A custom CatDataset class is included for easy integration with PyTorch:
```python
from cat_dataset import CatDataset
from torch.utils.data import DataLoader

# Load from parquet format
dataset = CatDataset(
    root="path/to/dataset",
    split="train",      # Options: "train", "valid", "test"
    format="parquet",   # Options: "parquet", "pkl"
    image_size=[224, 224],
    image_channels=3,
    mask_channels=1,
)

# Use with PyTorch DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
```
Loading time benchmarks from the original implementation: - Parquet format: ~1.29 seconds per iteration - Pickle format: ~0.71 seconds per iteration
The pickle format provides the fastest loading times and is recommended for most use cases.
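Such timings can be reproduced with a plain timing loop over the two formats. This is a generic sketch using the CatDataset constructor shown above, not the original benchmark script:

```python
import time

from cat_dataset import CatDataset
from torch.utils.data import DataLoader

for fmt in ("parquet", "pkl"):
    dataset = CatDataset(root="path/to/dataset", split="train", format=fmt,
                         image_size=[224, 224], image_channels=3, mask_channels=1)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    start = time.perf_counter()
    for images, masks in loader:  # assumes the dataset yields (image, mask) pairs
        pass
    print(f"{fmt}: {time.perf_counter() - start:.2f} s per full pass")
```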
If you use this dataset in your research or projects, please cite:
@misc{feral-cat-segmentation_dataset,
title = {feral-cat-segmentation Dataset},
type = {Open Source Dataset},
author = {Paul Cashman},
howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}},
url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation},
journal = {Roboflow Universe},
publisher = {Roboflow},
year = {2025},
month = {mar},
note = {visited on 2025-03-19},
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Task scheduler performance survey
This dataset contains results of task graph scheduler performance survey.
The results are stored in the following files, which correspond to simulations performed on
the elementary, irw and pegasus task graph datasets published at https://doi.org/10.5281/zenodo.2630384.
elementary-result.zip
irw-result.zip
pegasus-result.zip
The files contain compressed pandas dataframes in CSV format, which can be read with the following Python code:

```python
import pandas as pd

frame = pd.read_csv("elementary-result.zip")
```
Each row in the frame corresponds to a single instance of a task graph that was simulated with a specific configuration (network model, scheduler etc.). The list below summarizes the meaning of the individual columns.
graph_name - name of the benchmarked task graph
graph_set - name of the task graph dataset from which the graph originates
graph_id - unique ID of the graph
cluster_name - type of cluster used in this instance; the format is <workers>x<cores>, e.g. 32x16 means 32 workers, each with 16 cores
bandwidth - network bandwidth [MiB/s]
netmodel - network model (simple or maxmin)
scheduler_name - name of the scheduler
imode - information mode
min_sched_interval - minimal scheduling delay [s]
sched_time - duration of each scheduler invocation [s]
time - simulated makespan of the task graph execution [s]
execution_time - real duration of all scheduler invocations [s]
total_transfer - amount of data transferred amongst workers [MiB]
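As a quick check of this schema, the frame can be summarized with pandas, for example comparing mean simulated makespans across schedulers and bandwidths (a sketch, not code shipped with the dataset):

```python
import pandas as pd

frame = pd.read_csv("elementary-result.zip")

# Mean simulated makespan per scheduler and bandwidth
makespan = (frame.groupby(["scheduler_name", "bandwidth"])["time"]
                 .mean()
                 .unstack("bandwidth"))
print(makespan)
```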
The file charts.zip contains charts obtained by processing the datasets.
On the X axis there is always bandwidth in [MiB/s].
There are the following files:
[DATASET]-schedulers-time - Absolute makespan produced by schedulers [seconds]
[DATASET]-schedulers-score - The same as above but normalized with respect to the best schedule (shortest makespan) for the given configuration.
[DATASET]-schedulers-transfer - Sums of transfers between all workers for a given configuration [MiB]
[DATASET]-[CLUSTER]-netmodel-time - Comparison of netmodels, absolute times [seconds]
[DATASET]-[CLUSTER]-netmodel-score - Comparison of netmodels, normalized to the average of model "simple"
[DATASET]-[CLUSTER]-netmodel-transfer - Comparison of netmodels, sum of transferred data between all workers [MiB]
[DATASET]-[CLUSTER]-schedtime-time - Comparison of minimal scheduling delays (MSD), absolute times [seconds]
[DATASET]-[CLUSTER]-schedtime-score - Comparison of MSD, normalized to the average of "MSD=0.0" case
[DATASET]-[CLUSTER]-imode-time - Comparison of Imodes, absolute times [seconds]
[DATASET]-[CLUSTER]-imode-score - Comparison of Imodes, normalized to the average of "exact" imode
Reproducing the results
$ git clone https://github.com/It4innovations/estee
$ cd estee
$ pip install .
Use benchmarks/generate.py to generate graphs from three categories (elementary, irw and pegasus):

$ cd benchmarks
$ python generate.py elementary.zip elementary
$ python generate.py irw.zip irw
$ python generate.py pegasus.zip pegasus
or use our task graph dataset that is provided at https://doi.org/10.5281/zenodo.2630384.
Define the benchmark configuration in benchmark.json. Then you can run the benchmark using this command:

$ python pbs.py compute benchmark.json
The benchmark script can be interrupted at any time (for example using Ctrl+C). When interrupted, it will store the computed results to the result file and resume the computation when launched again.
To generate the charts, run:

$ python view.py --all
The resulting plots will appear in a folder called outputs.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created as part of the research project “Python Under the Microscope: A Comparative Energy Analysis of Execution Methods” (2025). The study explores the environmental sustainability of Python software by benchmarking five execution strategies—CPython, PyPy, Cython, ctypes, and py_compile—across 15 classical algorithmic workloads.
With energy and carbon efficiency becoming critical in modern computing, this dataset aims to:
Quantify execution time, CPU energy usage, and carbon emissions
Enable reproducible analysis of performance–sustainability trade-offs
Introduce and validate the GreenScore, a composite metric for sustainability-aware software evaluation
All benchmarks were executed on a controlled laptop environment (Intel Core i5-1235U, Linux 6.8). Energy was measured via Intel RAPL counters using the pyRAPL library. Carbon footprint was estimated using a conversion factor of 0.000475 gCO₂ per joule based on regional electricity intensity.
Each algorithm–method pair was run 50 times, capturing robust statistics for energy (μJ), time (s), and derived CO₂ emissions.
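The measurement setup can be sketched with pyRAPL as follows. The workload function is illustrative, and the result fields assume pyRAPL's Measurement API, which reports package energy in microjoules and duration in microseconds:

```python
import pyRAPL

pyRAPL.setup()  # requires an Intel CPU with RAPL counters and read permission

GCO2_PER_JOULE = 0.000475  # conversion factor used in the study

def workload():
    # Illustrative stand-in for one of the 15 algorithmic benchmarks
    return sum(i * i for i in range(10**6))

results = []
for _ in range(50):  # 50 trials per algorithm-method pair
    meter = pyRAPL.Measurement("benchmark")
    meter.begin()
    workload()
    meter.end()
    energy_uj = sum(meter.result.pkg)            # CPU package energy [µJ]
    time_s = meter.result.duration / 1e6         # elapsed time [s]
    co2_g = (energy_uj / 1e6) * GCO2_PER_JOULE   # derived CO2 emissions [g]
    results.append((energy_uj, time_s, co2_g))
```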
Per-method folders (cpython/, pypy/, etc.) contain raw energy/ and time/ CSV files for all 15 benchmarks (50 trials each), as well as mean summaries.
Aggregate folder includes combined metric comparisons, normalized data, and carbon footprint estimations.
Analysis folder contains derived datasets: normalized scores, standard deviation, and the final GreenScore rankings used in our paper.
This dataset is ideal for:
Reproducible software sustainability studies
Benchmarking Python execution strategies
Analyzing energy–performance–carbon trade-offs
Validating green metrics and measurement tools
Researchers and practitioners are encouraged to use, extend, and cite this dataset in sustainability-aware software design.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides a comprehensive view of student performance and learning behavior, integrating academic, demographic, behavioral, and psychological factors.
It was created by merging two publicly available Kaggle datasets, resulting in a unified dataset of 14,003 student records with 16 attributes. All entries are anonymized, with no personally identifiable information.
The 16 attributes cover:
- Study behavior and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
- Resources and technology: Resources, Internet, EduTech
- Psychological factors: Motivation, StressLevel
- Demographics: Gender, Age (18–30 years)
- Learning profile: LearningStyle
- Performance indicators: ExamScore, FinalGrade

The dataset can be used for predicting performance indicators (ExamScore, FinalGrade), clustering LearningStyle categories, and extracting insights for adaptive learning. The dataset was analyzed in Python.

merged_dataset.csv → 14,003 rows × 16 columns, including student demographics, behaviors, engagement, learning styles, and performance indicators.

This dataset is an excellent playground for educational data mining, from clustering and behavioral analytics to predictive modeling and personalized learning applications.
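A minimal loading sketch with pandas, using the attribute names listed above:

```python
import pandas as pd

df = pd.read_csv("merged_dataset.csv")
print(df.shape)  # expected: (14003, 16)

# Example: average exam score per learning style
print(df.groupby("LearningStyle")["ExamScore"].mean())
```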
Motivation
Maus et al. created the first database of the spatial extent of mining areas by mobilizing nearly 20 years of Landsat data. This dataset is essential for GlobES, as mining areas are specified in the IUCN habitat class scheme. Yet it is temporally static. To address this limitation, we mined the Landsat archive to infer the first observable year of mining.
Approach
For each mining area polygon, we collected 50 random samples within it and 50 random samples along its borders. This was meant to reflect increasing spectral differences between areas within and outside a mining exploration after its onset. Then, for each sample, we used Google Earth Engine to extract spectral profiles for every available acquisition between 1990 and 2020.
After completing the extraction, we estimated mean spectral profiles for each acquisition date: one for the samples “inside” the mining area, and another for those “outside” of it. In this process, we masked pixels affected by clouds and cloud shadows using Landsat's quality information.
Using the time-series of mean profiles, at each mining site and for each unique date, we normalized the “inside” and “outside” multi-spectral averages and estimated the Root Mean Square Error (RMSE) between them. The normalization step aimed to emphasize differences in the shape of the spectral profiles rather than in absolute values, which can be related to radiometric inaccuracies or simply to differences in acquisition dates. This resulted in an RMSE time-series for each mining site.
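Under these definitions, the per-date comparison can be sketched as follows. Min-max scaling of each mean profile is an assumption here, since the exact normalization is not specified above:

```python
import numpy as np

def normalized_rmse(inside, outside):
    """RMSE between normalized mean spectral profiles of one acquisition date.

    inside/outside are 1-D arrays of per-band means; min-max scaling
    emphasizes profile shape over absolute reflectance values.
    """
    def minmax(profile):
        return (profile - profile.min()) / (profile.max() - profile.min())

    return float(np.sqrt(np.mean((minmax(inside) - minmax(outside)) ** 2)))
```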
We then used these data to infer the first mining year. To achieve this, we first derived a cumulative sum of the RMSE time-series with the intent of removing noise while preserving abrupt directional changes. For example, if a mine was introduced in a forest, it would drive an increase in the RMSE due to the removal of trees, whereas the outskirts of the mine would remain forested; here, the accumulated values would tilt upwards. However, if a mining exploration was accompanied by the removal of vegetation along its outskirts where bare land was common, a downward shift in RMSE values is more likely as the landscape becomes more homogeneous.
To detect the date marking a shift in RMSE values, we used a knee/elbow detection algorithm implemented in the Python package kneebow, which uses curve rotation to infer the inflection/deflection point of a time series. Here, downward trends correspond to the elbow and upward trends to the knee. To determine which of these metrics was the most adequate, we used the Area Under the Curve (AUC): an elbow is characterized by a convex time-series shape, which makes the AUC greater than 50%, whereas a concave curve makes the knee the more adequate metric. We limited the detection of shifts to time-series with at least 100 time steps. Below this threshold, we assumed the mine (or the conditions to sustain it) was present since 1990.
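With kneebow, the detection step reads roughly as below. The synthetic series and the trapezoidal AUC estimate (relative to the bounding box) are illustrative, following the convex/concave rule described above:

```python
import numpy as np
from kneebow.rotor import Rotor

# Synthetic RMSE time-series with an upward shift (illustrative only)
rng = np.random.default_rng(1)
rmse = np.concatenate([rng.normal(0.1, 0.02, 150), rng.normal(0.4, 0.02, 150)])
rmse_cumsum = np.cumsum(rmse)  # removes noise, keeps abrupt directional changes

t = np.arange(rmse_cumsum.size)
rotor = Rotor()
rotor.fit_rotate(np.column_stack([t, rmse_cumsum]))

# AUC above 50% of the bounding box -> elbow, otherwise -> knee
auc = np.trapz(rmse_cumsum - rmse_cumsum.min(), t)
box_area = t[-1] * (rmse_cumsum.max() - rmse_cumsum.min())
if auc > 0.5 * box_area:
    shift_index = rotor.get_elbow_index()
else:
    shift_index = rotor.get_knee_index()
print("index of first detected mining year:", shift_index)
```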
Content
This repository contains the infrastructure used to infer the start of a mining operation, organized as follows:
00_data - Contains the base data required for the operation, including a SHP file with the mining area outlines, and validation samples.
01_analysis - Contains several outputs of our analysis:
xy.tar.gz - Sample locations for each mining site.
sr.tar.gz - Spectral profiles for each sample location.
mine_start.csv - First year when we detected the start of mining.
02_code - Includes all code used in our analysis.
requirements.txt - Python module requirements that can be fed to pip to replicate our study.
config.yml - Configuration file, including information on the Landsat products used.
The dataset contains images of three animal classes: Cats, Dogs, and Snakes. It is balanced and cleaned, designed for supervised image classification tasks.
| Class | Number of Images | Description |
|---|---|---|
| Cats | 1,000 | Includes multiple breeds and poses |
| Dogs | 1,000 | Covers various breeds and backgrounds |
| Snakes | 1,000 | Includes multiple species and natural settings |
Total Images: 3,000

The dataset is divided into the following splits:
| Set | Percentage | Number of Images |
|---|---|---|
| Training | 70% | 2,100 |
| Validation | 15% | 450 |
| Test | 15% | 450 |
Images in the dataset have been standardized to support machine learning pipelines:
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Path to dataset
dataset_path = "path/to/dataset"

# ImageDataGenerator for preprocessing
datagen = ImageDataGenerator(
    rescale=1./255,
    validation_split=0.15  # 15% for validation
)

# Load training data
train_generator = datagen.flow_from_directory(
    dataset_path,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='training',
    shuffle=True
)

# Load validation data
validation_generator = datagen.flow_from_directory(
    dataset_path,
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical',
    subset='validation',
    shuffle=False
)

# Example: iterate over one batch
images, labels = next(train_generator)
print(images.shape, labels.shape)  # (32, 224, 224, 3) (32, 3)
```