76 datasets found
  1. example-data-frame

    • huggingface.co
    Cite
    AI Robotics Ethics Society (PUCRS), example-data-frame [Dataset]. https://huggingface.co/datasets/AiresPucrs/example-data-frame
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    AI Robotics Ethics Society (PUCRS)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Example DataFrame (Teeny-Tiny Castle)

    This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.

      How to Use
    

    from datasets import load_dataset

    dataset = load_dataset("AiresPucrs/example-data-frame", split="train")

  2. Flow map data of the single pendulum, double pendulum and 3-body problem

    • data.niaid.nih.gov
    Updated Apr 23, 2024
    Cite
    Horn, Philipp (2024). Flow map data of the single pendulum, double pendulum and 3-body problem [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11032351
    Explore at:
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    Horn, Philipp
    Portegies Zwart, Simon
    Saz Ulibarrena, Veronica
    Koren, Barry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed to compare the performance of various neural network architectures learning the flow maps of Hamiltonian systems. It was created for the paper: A Generalized Framework of Neural Networks for Hamiltonian Systems.

    The dataset consists of trajectory data from three different Hamiltonian systems: the single pendulum, the double pendulum and the 3-body problem. The data was generated using numerical integrators. For the single pendulum, the symplectic Euler method with a step size of 0.01 was used. The data of the double pendulum was also computed by the symplectic Euler method, however with an adaptive step size. The trajectories of the 3-body problem were calculated by the arbitrarily high-precision code Brutus.

    For each Hamiltonian system, there is one file containing the entire trajectory information (*_all_runs.h5.1). In these files, the states along all trajectories are recorded with a step size of 0.01. These files are composed of several Pandas DataFrames. One DataFrame per trajectory, called "run0", "run1", ... and finally one large DataFrame in which all the trajectories are combined, called "all_runs". Additionally, one Pandas Series called "constants" is contained in these files, in which several parameters of the data are listed.

    Also, there is a second file per Hamiltonian system in which the data is prepared as features and labels ready for neural networks to be trained (*_training.h5.1). Similar to the first type of files, they contain a Series called "constants". The features and labels are then separated into 6 DataFrames called "features", "labels", "val_features", "val_labels", "test_features" and "test_labels". The data is split into 80% training data, 10% validation data and 10% test data.
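
    These HDF5 files can be opened with pandas' HDFStore using the key names listed above. A minimal sketch, assuming a hypothetical file name that should be replaced by the actual *_training.h5.1 file:

    import pandas as pd

    # The file name is a placeholder; substitute the actual *_training.h5.1 file.
    with pd.HDFStore("single_pendulum_training.h5.1", mode="r") as store:
        constants = store["constants"]          # Series of data parameters
        features = store["features"]            # training features (80%)
        labels = store["labels"]                # training labels
        val_features = store["val_features"]    # validation features (10%)
        val_labels = store["val_labels"]        # validation labels
        test_features = store["test_features"]  # test features (10%)
        test_labels = store["test_labels"]      # test labels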

    The code used to train various neural network architectures on this data can be found on GitHub at: https://github.com/AELITTEN/GHNN.

    Already trained neural networks can be found on GitHub at: https://github.com/AELITTEN/NeuralNets_GHNN.

    Quantity                      Single pendulum                  Double pendulum   3-body problem
    Number of trajectories        500                              2000              5000
    Final time in all_runs        T (one period of the pendulum)   10                10
    Final time in training data   0.25*T                           5                 5
    Step size in training data    0.1                              0.1               0.5

  3. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Oct 20, 2022
    Cite
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
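
    For example, a minimal sketch (the file name is hypothetical; substitute any of the provided CSV files):

    import pandas as pd

    # The file name is a placeholder for one of the daily/hourly CSV files.
    df = pd.read_csv("daily_fitbit.csv")
    print(df.head())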

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  4. A Replication Dataset for Fundamental Frequency Estimation

    • zenodo.org
    • live.european-language-grid.eu
    bin
    Updated Apr 24, 2025
    Cite
    Bastian Bechtold; Bastian Bechtold (2025). A Replication Dataset for Fundamental Frequency Estimation [Dataset]. http://doi.org/10.5281/zenodo.3904389
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bastian Bechtold; Bastian Bechtold
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods.
    © 2020, Bastian Bechtold. All rights reserved.

    Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms on six speech corpora and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise.

    The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus's ground truth, the algorithms' own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, to replicate existing studies from a larger dataset, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available to download and are entirely reproducible, albeit requiring about one year of processor time.

    Included Code and Data

    • ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
      • CMU-ARCTIC (consensus truth) [1]
      • FDA (corpus truth and consensus truth) [2]
      • KEELE (corpus truth and consensus truth) [3]
      • MOCHA-TIMIT (consensus truth) [4]
      • PTDB-TUG (corpus truth and consensus truth) [5]
      • TIMIT (consensus truth) [6]
    • noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
      • NOISEX-92 [7]
      • QUT-NOISE-TIMIT [8]
    • synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.
    • noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the fundamental frequency estimation algorithms listed in the references below.
    • noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
      • Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
      • Fine Pitch Error (FPE), the mean error of grossly correct estimates.
      • High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
      • Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
      • Fine Remaining Bias (FRB), the median error of GREs.
      • True Positive Rate (TPR), the percentage of true positive voicing estimates.
      • False Positive Rate (FPR), the percentage of false positive voicing estimates.
      • False Negative Rate (FNR), the percentage of false negative voicing estimates.
      • F₁, the harmonic mean of precision and recall of the voicing decision.
    • Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

    The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.
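
    The pickled dataframes can be loaded directly with pandas; a minimal sketch:

    import pandas as pd

    # Load the pre-calculated performance measures named above.
    noisy = pd.read_pickle("noisy_speech.pkl")
    synthetic = pd.read_pickle("synthetic_speech.pkl")
    print(noisy.head())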

    References:

    1. John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
    2. Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
    3. F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
    4. Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
    5. Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
    6. John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
    7. Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
    8. David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
    9. Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
    10. Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
    11. Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
    12. Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
    13. Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018. arXiv: 1802.06182.
    14. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
    15. Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
    16. Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
    17. Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
    18. Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically

  5. Data from: Dataset for Vehicle Indoor Positioning in Industrial Environments...

    • zenodo.org
    • producciocientifica.uv.es
    zip
    Updated Mar 13, 2025
    Cite
    Ivo Silva; Ivo Silva; Cristiano Pendão; Cristiano Pendão; Joaquín Torres-Sospedra; Joaquín Torres-Sospedra; Adriano Moreira; Adriano Moreira (2025). Dataset for Vehicle Indoor Positioning in Industrial Environments with Wi-Fi, inertial, and odometry data [Dataset]. http://doi.org/10.5281/zenodo.7826540
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ivo Silva; Ivo Silva; Cristiano Pendão; Cristiano Pendão; Joaquín Torres-Sospedra; Joaquín Torres-Sospedra; Adriano Moreira; Adriano Moreira
    Description

    Dataset collected in an indoor industrial environment using a mobile unit (manually pushed trolley) that resembles an industrial vehicle equipped with several sensors, namely, Wi-Fi, wheel encoder (displacement), and Inertial Measurement Unit (IMU).

    Sensors were connected to a Raspberry Pi (RPi 3B+), which collected the data from the sensors. Ground truth information was obtained with a video camera pointed towards the floor, registering the times when the trolley passed by reference tags.

    List of sensors:

    • 4x Wi-Fi interfaces: Edimax EW7811-Un
    • 2x IMUs: Adafruit BNO055
    • 1x Absolute Encoder: US Digital A2 (attached to a wheel with a diameter of 125 mm)

    This dataset includes:

    • 1x Wi-Fi radio map that can be used for Wi-Fi fingerprinting.
    • 6x Trajectories: including sensor data + ground truth.
    • APs Information: list of APs in the building, including their position and transmission channel.
    • Floor plan: image of the building's floor plan with obstacles and non-navigable areas.
    • Python package provided for:
      • parsing the dataset into a data structure (Pandas dataframes).
      • performing statistical analysis on the data (number of samples, time difference between consecutive samples, etc.).
      • computing Dead Reckoning trajectory from a provided initial position.
      • computing Wi-Fi fingerprinting position estimates.
      • determining positioning error in Dead Reckoning and Wi-Fi fingerprinting.
      • generating plots including the floor plan of the building, dead reckoning trajectories, and CDFs.

    When using this dataset, please cite its data description paper:

    Silva, I.; Pendão, C.; Torres-Sospedra, J.; Moreira, A. Industrial Environment Multi-Sensor Dataset for Vehicle Indoor Tracking with Wi-Fi, Inertial and Odometry Data. Data 2023, 8, 157. https://doi.org/10.3390/data8100157

  6. pandas DataFrames of the DYToMuMu_M-20_CT10_TuneZ2star_v2_8TeV process

    • bonndata.uni-bonn.de
    bin, text/x-python
    Updated Jan 21, 2025
    Cite
    Timo Saala; Timo Saala (2025). pandas DataFrames of the DYToMuMu_M-20_CT10_TuneZ2star_v2_8TeV process [Dataset]. http://doi.org/10.60507/FK2/1MTTRE
    Explore at:
    Available download formats: bin(630694234), bin(595883050), text/x-python(2553), bin(642092194), txt(7203), bin(525465770), bin(637589794), bin(637555602), bin(515541514), bin(624730562), bin(635941242), bin(632160114)
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    bonndata
    Authors
    Timo Saala; Timo Saala
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains pandas DataFrames that represent filtered versions of CMS Open Data (in the form of ROOT files) available on the CERN OpenData Portal. This dataset specifically contains data from a DYToMuMu process (a Drell-Yan process resulting in two muons in the final state), which is a simulated process created during the 2012 LHC run. A total of 121 (99 for real collision data) relevant variables are contained in the filtered pandas DataFrames. A list of variables can be found below; for a full explanation of them, please refer to the following paper (PLACEHOLDER, REFERENCE PAPER HERE):

    • Event: nEvent, runNum, lumisection, evtNum
    • Muons: nMuon, vecMuon_PT, vecMuon_Eta, vecMuon_Phi, vecMuon_PTErr, vecMuon_Q, vecMuon_StaPt, vecMuon_StaEta, vecMuon_StaPhi, vecMuon_TrkIso03, vecMuon_EcalIso03, vecMuon_HcalIso03
    • Vertices: nVertex, vecVertex_nTracksfit, vecVertex_ndof, vecVertex_Chi2, vecVertex_X, vecVertex_Y, vecVertex_Z
    • Electrons: nEle, vecEle_PT, vecEle_Eta, vecEle_Phi, vecEle_Q, vecEle_TrkIso03, vecEle_EcalIso03, vecEle_HcalIso03, vecEle_D0, vecEle_Dz
    • Taus: nTau, vecTau_PT, vecTau_Eta, vecTau_Phi, vecTau_Q, vecTau_RawIso3Hits, vecTau_RawIsoMVA3oldDMwoLT, vecTau_RawIsoMVA3oldDMwLT, vecTau_RawIsoMVA3newDMwoLT, vecTau_RawIsoMVA3newDMwLT
    • Photons: nPhoton, vecPhoton_PT, vecPhoton_Eta, vecPhoton_Phi, vecPhoton_Hovere, vecPhoton_Sthovere, vecPhoton_HasPixelSeed, vecPhoton_IsConv, vecPhoton_PassElectronVeto
    • MC truth: nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second
    • Jets: nJets, vecJet_PT, vecJet_Eta, vecJet_Phi, vecJet_D0, vecJet_Dz, vecJet_nCharged, vecJet_nNeutrals, vecJet_nParticles, vecJet_Beta, vecJet_BetaStar, vecJet_dR2Mean, vecJet_Q, vecJet_Mass, vecJet_Area, vecJet_Energy, vecJet_chEmEnergy, vecJet_neuEmEnergy, vecJet_chHadEnergy, vecJet_neuHadEnergy, vecJet_ID, vecJet_Num, vecJet_mcFlavor, vecJet_GenPT, vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx
    • Particle flow: nPF, vecPF_PT, vecPF_Eta, vecPF_Phi, vecPF_Mass, vecPF_E, vecPF_Q, vecPF_PfType, vecPF_EcalE, vecPF_HcalE, vecPF_ndof, vecPF_Chi2, vecPF_pvId, vecPF_X, vecPF_Y, vecPF_Z, vecPF_JetNum
    • MET: fMET_PT, fMET_Eta, fMET_Phi
    • Triggers: HLT_Mu17_Mu8, HLT_Mu24, HLT_MET120_v, HLT_Ele27, HLT_HT350

    For the datasets containing data from real collisions at the LHC, the following variables are NOT contained: the entire MC truth group (nMctruth, vecMctruth_PT, vecMctruth_Eta, vecMctruth_Phi, vecMctruth_Id_1, vecMctruth_Id_2, vecMctruth_X_1, vecMctruth_X_2, vecMctruth_PdgId, vecMctruth_Status, vecMctruth_Y, vecMctruth_Mass, vecMctruth_Mothers.first, vecMctruth_Mothers.second) as well as vecJet_mcFlavor, vecJet_GenPT, vecJet_GenEta, vecJet_GenPhi, vecJet_GenMass, vecJet_flavorMatchPT, vecJet_JEC, vecJet_MatchIdx.

  7. SELTO Dataset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated May 23, 2023
    Cite
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Benchmark Dataset for Deep Learning for 3D Topology Optimization

    This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

    The following paper provides full documentation and examples:

    Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

    The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
    Each TAR.GZ archive consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

    For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

    • E - Young's modulus [Pa]
    • ν - Poisson's ratio [-]
    • σ_ys - a yield stress [Pa]
    • h - discretization size of the voxel grid [m]

    The columns of i.csv correspond to the following voxel-wise information:

    • x, y, z - the indices that state the location of the voxel within the voxel mesh
    • Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized
    • Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension
    • F_x, F_y, F_z - floating point variables that define the three spatial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]
    • density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

    How to Import the Dataset

    with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

    from dl4to.datasets import SELTODataset
    
    dataset = SELTODataset(root=root, name=name, train=train)

    Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See the DL4TO documentation for further details on the SELTODataset class.

    without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

    import pandas as pd
    
    root = ...
    file_path = f'{root}/{i}.csv'
    columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density']
    df = pd.read_csv(file_path, names=columns)

    Similarly, we can import an i_info.csv file via:

    file_path = f'{root}/{i}_info.csv'
    info_columns = ['E', 'ν', 'σ_ys', 'h']
    df_info = pd.read_csv(file_path, names=info_columns)

    We can extract PyTorch tensors from the Pandas dataframe df using the following function:

    import torch
    
    def get_torch_tensors_from_dataframe(df, dtype=torch.float32):
      shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1
      voxels = [df['x'].values, df['y'].values, df['z'].values]
    
      Ω_design = torch.zeros(1, *shape, dtype=int)
      Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(df['Ω_design'].values.astype(int))
    
      Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype)
      Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype)
      Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype)
      Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype)
    
      F = torch.zeros(3, *shape, dtype=dtype)
      F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype)
      F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype)
      F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype)
    
      density = torch.zeros(1, *shape, dtype=dtype)
      density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype)
    
      return Ω_design, Ω_Dirichlet, F, density

  8. Data from: Dataset with condition monitoring vibration data annotated with...

    • researchdata.se
    Updated Jun 17, 2025
    Cite
    Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel (2025). Dataset with condition monitoring vibration data annotated with technical language, from paper machine industries in northern Sweden [Dataset]. http://doi.org/10.5878/hxc0-bd07
    Explore at:
    Available download formats: (200308), (124)
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Karl Löwenmark; Fredrik Sandin; Marcus Liwicki; Stephan Schnabel
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    Labelled industry datasets are one of the most valuable assets in prognostics and health management (PHM) research. However, creating labelled industry datasets is both difficult and expensive, making publicly available industry datasets rare at best, in particular labelled ones. Recent studies have showcased that industry annotations can be used to train artificial intelligence models directly on industry data ( https://doi.org/10.36001/ijphm.2022.v13i2.3137 , https://doi.org/10.36001/phmconf.2023.v15i1.3507 ), but while many industry datasets also contain text descriptions or logbooks in the form of annotations and maintenance work orders, few, if any, are publicly available. Therefore, we release a dataset consisting of annotated signal data from two large (80 m x 10 m x 10 m) paper machines, from a Kraftliner production company in northern Sweden. The data consists of 21 090 pairs of signals and annotations from one year of production. The annotations are written in Swedish, by on-site Swedish experts, and the signals consist primarily of accelerometer vibration measurements from the two machines. The dataset is structured as a Pandas dataframe and serialized as a pickle (.pkl) file and a JSON (.json) file. The first column (‘id’) is the ID of the samples; the second column (‘Spectra’) contains the fast Fourier transform and envelope-transformed vibration signals; the third column (‘Notes’) contains the associated annotations, mapped so that each annotation is associated with all signals from ten days before the annotation date up to the annotation date; and the fourth column (‘Embeddings’) contains pre-computed embeddings using Swedish SentenceBERT. Each row corresponds to a vibration measurement sample, though there is no distinction in this data between which sensor or machine part each measurement is from.
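
    A minimal sketch for loading the serialized dataframe (the pickle file name is hypothetical; substitute the actual file):

    import pandas as pd

    # The file name is a placeholder for the provided .pkl file.
    df = pd.read_pickle("condition_monitoring_dataset.pkl")
    print(df.columns)  # expected: ['id', 'Spectra', 'Notes', 'Embeddings']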

  9. An Empirical Study on Energy Usage Patterns of Different Variants of Data...

    • figshare.com
    zip
    Updated Nov 5, 2024
    Cite
    Princy Chauhan (2024). An Empirical Study on Energy Usage Patterns of Different Variants of Data Processing Libraries [Dataset]. http://doi.org/10.6084/m9.figshare.27611421.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 5, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Princy Chauhan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As computing power grows, so does the need for data processing, which uses a lot of energy in steps like cleaning and analyzing data. This study looks at the energy and time efficiency of four common Python libraries (Pandas, Vaex, Scikit-learn, and NumPy) tested on five datasets across 21 tasks. We compared the energy use of the newest and older versions of each library. Our findings show that no single library always saves the most energy. Instead, energy use varies by task type, how often tasks are done, and the library version. In some cases, newer versions use less energy, pointing to the need for more research on making data processing more energy-efficient. A zip file accompanying this study contains the scripts, datasets, and a README file for guidance. This setup allows for easy replication and testing of the experiments described, helping to further analyze energy efficiency across different libraries and tasks.

  10. Data and codes from: Comparison of Solar Imaging Feature Extraction Methods...

    • entrepot.recherche.data.gouv.fr
    7z, application/x-h5
    Updated Jun 4, 2025
    Cite
    Maria Tahtouh; Maria Tahtouh; Guillerme Bernoux; Guillerme Bernoux; Antoine Brunet; Antoine Brunet; Denis Standarovski; Denis Standarovski; Gautier Nguyen; Gautier Nguyen; Angélica Sicard; Angélica Sicard (2025). Data and codes from: Comparison of Solar Imaging Feature Extraction Methods in the Context of Space Weather Prediction with Deep Learning-Based Models [Dataset]. http://doi.org/10.57745/DZT7DS
    Explore at:
    Available download formats: application/x-h5(2599174), 7z(4407015), bin(40653687), 7z(2335796), text/x-python(29618), text/x-python(2593), text/x-python(4013), text/x-python(11669), bin(42832463), text/x-python(2710), application/x-h5(5006388127), bin(1784082487), text/x-python(18773)
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    Recherche Data Gouv
    Authors
    Maria Tahtouh; Maria Tahtouh; Guillerme Bernoux; Guillerme Bernoux; Antoine Brunet; Antoine Brunet; Denis Standarovski; Denis Standarovski; Gautier Nguyen; Gautier Nguyen; Angélica Sicard; Angélica Sicard
    License

    etalab-2.0: https://spdx.org/licenses/etalab-2.0.html

    Time period covered
    2010 - 2020
    Description

    This dataset contains replication data for the paper "Comparison of Solar Imaging Feature Extraction Methods in the Context of Space Weather Prediction with Deep Learning-Based Models". It includes files stored in the HDF5 (Hierarchical Data Format) format using HDFStore. One file, solar_extracted_features_v01_2010-2020.h5, contains the features extracted with the 6 different techniques for the 19.3 nm wavelength; the second, serenade_predictions_v01.h5, contains the SERENADE outputs. Both files contain several datasets labeled with ‘keys’ that correspond to the extraction method:

    • gn_1024: the GoogLeNet extractor with 1024 components.
    • pca_1024: the Principal Component Analysis technique leaving 1024 components.
    • ae_1024: the AutoEncoder with a latent space of 1024.
    • gn_256 (only in solar_extracted_features_v01_2010-2020.h5): the GoogLeNet extractor with 256 components.
    • pca_256: the Principal Component Analysis technique leaving 256 components.
    • ae_256: the AutoEncoder technique with a latent space of 256.
    • vae_256 (only in solar_extracted_features_v01_2010-2020.h5): the Variational AutoEncoder technique with a latent space of 256.
    • vae_256_old (only in serenade_predictions_v01.h5): the output predictions of SERENADE using the VAE-extracted features with the hyperparameters optimized for GoogLeNet.
    • vae_256_new (only in serenade_predictions_v01.h5): the output predictions of SERENADE using the VAE-extracted features with the alternative architecture.

    All the above-mentioned models are explained and detailed in the paper. The files can be read with the Pandas package for Python as follows:

    import pandas as pd

    df = pd.read_hdf('file_name.h5', key='model_name')

    Replace file_name with either solar_extracted_features_v01_2010-2020.h5 or serenade_predictions_v01.h5, and model_name with one of the keys in the list above.

    The extracted-features file yields a pandas DataFrame indexed by datetime with either 1024 or 256 columns of features; an additional column indicates to which subset (train, validation or test) each row belongs. The SERENADE outputs file yields a DataFrame indexed by datetime with 4 columns:

    • Observations: the true daily maximum of the Kp index.
    • Predictions: the predicted mean of the daily maximum of the Kp index.
    • Standard Deviation: the standard deviation, as the predictions are probabilistic.
    • Model: the feature extractor model whose inputs were used to generate the predictions.

    We also provide the AE and VAE feature-extractor class code in AEs_class.py and VAE_class.py, together with their weights in the best_AE_1024.ckpt, best_AE_256.ckpt and best_VAE.ckpt checkpoints. The figures in the manuscript can be reproduced using the codes named after the corresponding figures. The files 6_mins_predictions and seed_variation contain the SERENADE predictions needed to reproduce figures 7, 8, 9 and 10.

  11. Dataframe of Significant Stems for: Big Data and Digital Aesthetic, Arts and...

    • b2find.eudat.eu
    Updated Jul 30, 2025
    Cite
    (2025). Dataframe of Significant Stems for: Big Data and Digital Aesthetic, Arts and Cultural Education: Hot Spots of Current Quantitative Research [Dataset]. https://b2find.eudat.eu/dataset/591b4a44-6e71-51b8-8fd8-088835793cad
    Explore at:
    Dataset updated
    Jul 30, 2025
    Description

    Systematic reviews are the method of choice to synthesize research evidence. To identify main topics (so-called hot spots) relevant to large corpora of original publications in need of a synthesis, one must address the “three Vs” of big data (volume, velocity, and variety), especially in loosely defined or fragmented disciplines. For this purpose, text mining and predictive modeling are very helpful. Thus, we applied these methods to a compilation of documents related to digitalization in aesthetic, arts, and cultural education, as a prototypical, loosely defined, fragmented discipline, and particularly to quantitative research within it (QRD-ACE). By broadly querying the abstract and citation database Scopus with terms indicative of QRD-ACE, we identified a corpus of N = 55,553 publications for the years 2013–2017. As the result of an iterative approach of text mining, priority screening, and predictive modeling, we identified n = 8,304 potentially relevant publications of which n = 1,666 were included after priority screening. Analysis of the subject distribution of the included publications revealed video games as a first hot spot of QRD-ACE. Topic modeling resulted in aesthetics and cultural activities on social media as a second hot spot, related to 4 of k = 8 identified topics. This way, we were able to identify current hot spots of QRD-ACE by screening less than 15% of the corpus. We discuss implications for harnessing text mining, predictive modeling, and priority screening in future research syntheses and avenues for future original research on QRD-ACE. Dataset for: Christ, A., Penthin, M., & Kröner, S. (2019). Big Data and Digital Aesthetic, Arts, and Cultural Education: Hot Spots of Current Quantitative Research. Social Science Computer Review, 089443931988845. https://doi.org/10.1177/0894439319888455

  12. Dataframe of Significant Stems.csv

    • psycharchives.org
    Updated Oct 8, 2019
    Cite
    (2019). Dataframe of Significant Stems.csv [Dataset]. https://www.psycharchives.org/en/item/84d5c4b2-579d-48a0-8d4e-f02f2ae99192
    Explore at:
    Dataset updated
    Oct 8, 2019
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Systematic reviews are the method of choice to synthesize research evidence. To identify main topics (so-called hot spots) relevant to large corpora of original publications in need of a synthesis, one must address the “three Vs” of big data (volume, velocity, and variety), especially in loosely defined or fragmented disciplines. For this purpose, text mining and predictive modeling are very helpful. Thus, we applied these methods to a compilation of documents related to digitalization in aesthetic, arts, and cultural education, as a prototypical, loosely defined, fragmented discipline, and particularly to quantitative research within it (QRD-ACE). By broadly querying the abstract and citation database Scopus with terms indicative of QRD-ACE, we identified a corpus of N = 55,553 publications for the years 2013–2017. As the result of an iterative approach of text mining, priority screening, and predictive modeling, we identified n = 8,304 potentially relevant publications of which n = 1,666 were included after priority screening. Analysis of the subject distribution of the included publications revealed video games as a first hot spot of QRD-ACE. Topic modeling resulted in aesthetics and cultural activities on social media as a second hot spot, related to 4 of k = 8 identified topics. This way, we were able to identify current hot spots of QRD-ACE by screening less than 15% of the corpus. We discuss implications for harnessing text mining, predictive modeling, and priority screening in future research syntheses and avenues for future original research on QRD-ACE. Dataset for: Christ, A., Penthin, M., & Kröner, S. (2019). Big Data and Digital Aesthetic, Arts, and Cultural Education: Hot Spots of Current Quantitative Research. Social Science Computer Review, 089443931988845. https://doi.org/10.1177/0894439319888455

  13. EMG and Video Dataset for sensor fusion based hand gestures recognition

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 3, 2025
    + more versions
    Cite
    Zenodo (2025). EMG and Video Dataset for sensor fusion based hand gestures recognition [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3228846?locale=fi
    Explore at:
    Available download formats: unknown(469228)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data for hand gesture recognition recorded with 3 different sensors:

    • sEMG: recorded via the Myo armband, which is composed of 8 equally spaced non-invasive sEMG sensors that can be placed approximately around the middle of the forearm. The sampling frequency of the Myo is 200 Hz and its output is in arbitrary units (a.u.).
    • DVS: Dynamic Vision Sensor, a very low-power event-based camera with 128x128 resolution.
    • DAVIS: Dynamic Vision Sensor, a very low-power event-based camera with 240x180 resolution that also acquires APS frames.

    The dataset contains recordings of 10 subjects. Each subject performed 3 sessions, in which each of the 5 hand gestures was recorded 5 times, each lasting 2 s. Between gestures there is a relaxation phase of 1 s, in which the muscles can return to the rest position, removing any residual muscular activation.

    Note: We did not upload the raw data (*.aedat) for the DAVIS, as those files are very heavy. All the information from the sensor has been extracted and can be found in the two files *.npz and *.mat. If the raw data is needed, please contact enea.ceolini@ini.uzh.ch or elisa@ini.uzh.ch.

    ==== README ====

    DATASET STRUCTURE: EMG and DVS recordings

    • 10 subjects
    • 3 sessions for each subject
    • 5 gestures in each session ('pinky', 'elle', 'yo', 'index', 'thumb')

    Data name: subjectXX_sessionYY_ZZZ

    • XX: [01, 02, 03, 04, 05, 06, 07, 08, 09, 10]
    • YY: [01, 02, 03]
    • ZZZ: [emg, ann, dvs, davis]

    Data format:

    • emg: .npy
    • ann: .npy
    • dvs: .aedat, .npy
    • davis: .mat, .npz

    DVS recordings only contain DVS events:

    • .aedat (raw data): can be imported in Matlab using https://github.com/inivation/AedatTools/tree/master/Matlab or in Python with the function aedat2numpy in converter.py (https://github.com/Enny1991/hand_gestures_cc19/tree/master/jAER_utils).
    • .npy (exported data): numpy.ndarray (xpos, ypos, ts, pol), a 2D numpy array containing the data of all events; the timestamps ts are reset to the trigger event (synchronized with the Myo) and given in seconds.

    DAVIS recordings contain DVS events and APS frames:

    • .mat (exported data): mat structure named 'aedat'; the events are inside aedat.data.polarity (aedat.data.polarity.x, aedat.data.polarity.y, aedat.data.polarity.timeStamp, aedat.data.polarity.polarity); the APS frames are inside aedat.data.frame.samples; the timestamps are in aedat.data.frame.timeStampStart (start of frame collection) or aedat.data.frame.timeStampEnd (end of frame collection).
    • .npz (exported data): npz files with keys ['frames_time', 'dvs_events', 'frames']; 'dvs_events' is a numpy.ndarray (xpos, ypos, ts, pol), a 2D numpy array containing the data of all events, with timestamps ts reset to the trigger event (synchronized with the Myo) and given in seconds; 'frames' and 'frames_time' are the APS data: 'frames' is a list of all the frames, reset at the trigger time, and 'frames_time' is the time of each frame (the start timestamp of each frame is used).
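
    A minimal sketch for loading one of the exported DAVIS .npz recordings (the subject and session in the file name are only an example):

    import numpy as np

    # The file name follows the stated pattern subjectXX_sessionYY_ZZZ.
    data = np.load("subject01_session01_davis.npz", allow_pickle=True)
    dvs_events = data["dvs_events"]    # 2D array of (xpos, ypos, ts, pol) events
    frames = data["frames"]            # list of APS frames
    frames_time = data["frames_time"]  # timestamp of each frame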

  14. Multi Room Transition Dataset

    • zenodo.org
    bin, zip
    Updated Aug 19, 2024
    Cite
    Philipp Götz; Georg Götz; Nils Meyer-Kahlen; Kyung Yun Lee; Karolina Prawda; Emanuël A. P. Habets; Sebastian J. Schlecht (2024). Multi Room Transition Dataset [Dataset]. http://doi.org/10.5281/zenodo.13341566
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Philipp Götz; Georg Götz; Nils Meyer-Kahlen; Kyung Yun Lee; Karolina Prawda; Emanuël A. P. Habets; Sebastian J. Schlecht
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    We present a dataset of room impulse response (RIR) measurements with a focus on covering a broad spectrum of room-acoustic conditions. A central motivation was to collect data from complex, real-world multi-room environments in which acoustic energy decay functions (EDFs) are characterized by multiple decay slopes. The dataset is accompanied by positional information and visual documentation in the form of 360˚ photographs. A total of 4032 room impulse responses were measured throughout three complex indoor environments using a KEMAR head and torso simulator, and a Zoom H3 ambisonics microphone array.

    Acoustic environments

    In each of the three environments considered in the dataset (hallways-lecturehall, offices, workshops), four loudspeakers were placed and remained stationary throughout all measurements. A mobile measurement platform was then moved along a predefined path that covered various measurement positions. At each position, the Room Impulse Responses (RIRs) between the mobile platform and each of the four loudspeakers were recorded.

    Mobile measurement platform

    The mobile recording platform consisted of a head and torso simulator (GRAS 45BB KEMAR), fitted with binaural microphones, and a four-channel tetrahedral microphone array (Zoom H3-VR), placed on top of the head simulator.

    Signals

    All measurements were conducted at a sampling rate of 48 kHz (24-bit). For ease of use, we provide the signals in the dataset in two formats: as a pickled pandas dataframe (mrtd_dataframe.pkl) and as SOFA files.
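
    A minimal sketch for loading the provided dataframe:

    import pandas as pd

    # mrtd_dataframe.pkl is the pickled dataframe named above.
    df = pd.read_pickle("mrtd_dataframe.pkl")
    print(df.head())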

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Publication

    A more detailed description of the dataset and a method for the blind estimation of Energy Decay Functions can be found in our IWAENC 2024 paper:

    Philipp Götz, Georg Götz, Nils Meyer-Kahlen, Kyung Yun Lee, Karolina Prawda, Emanuël A. P. Habets, and Sebastian J. Schlecht: A Multi-Room Transition Dataset for Blind Estimation of Energy Decay, IWAENC 2024, Aalborg, Denmark

    The accompanying code repository can be found on GitHub.

  15. Avokado gelişim

    • kaggle.com
    Updated May 22, 2025
    Cite
    ABDÜLKADİR UYĞUR (2025). Avokado gelişim [Dataset]. https://www.kaggle.com/datasets/abdlkadruyur/avokado-geliim
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ABDÜLKADİR UYĞUR
    Description

    This dataset has been created for educational purposes, specifically to help learners practice SQL-like operations using Python’s pandas library. It is ideal for beginners who want to improve their data manipulation, querying, and transformation skills in a notebook environment such as Kaggle.

    The dataset simulates a simple personnel and department system. It includes two tables:

    • personel: contains employee data such as names, ages, salaries, and department IDs.
    • departman: contains department IDs and corresponding department names.

    Throughout this project, key SQL operations have been demonstrated with their pandas equivalents. These include:

    • Basic commands: SELECT, INSERT, UPDATE, DELETE
    • Table structure operations: ALTER, DROP, TRUNCATE, COPY
    • Filtering and logical expressions: WHERE, AND, OR, IN, IS NULL, BETWEEN, LIKE
    • Aggregations and sorting: COUNT(), ORDER BY, LIMIT, DISTINCT
    • String functions: LOWER, TRIM, REPLACE, SPLIT, LENGTH
    • Joins: INNER JOIN, LEFT JOIN
    • Comparison operators: =, !=, <, >

    The goal is to provide a hands-on, interactive environment for practicing SQL logic using real Python code. This dataset does not represent real individuals or businesses; it is entirely fictional and meant for training, teaching, and experimentation purposes only.
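
    A minimal sketch of this SQL-to-pandas mapping (the column names are hypothetical; the real tables may differ):

    import pandas as pd

    # Hypothetical miniature versions of the personel and departman tables.
    personel = pd.DataFrame({
        "name": ["Ali", "Ayse", "Mehmet"],
        "age": [30, 25, 41],
        "salary": [5000, 6000, 7000],
        "department_id": [1, 2, 1],
    })
    departman = pd.DataFrame({
        "department_id": [1, 2],
        "department_name": ["IT", "HR"],
    })

    # SQL: SELECT * FROM personel INNER JOIN departman USING (department_id)
    joined = personel.merge(departman, on="department_id", how="inner")

    # SQL: SELECT name, salary FROM personel WHERE salary > 5500 ORDER BY age
    result = personel[personel["salary"] > 5500].sort_values("age")[["name", "salary"]]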

  16. Weather DataSet

    • kaggle.com
    Updated Jun 2, 2023
    Cite
    Vikram Kathare (2023). Weather DataSet [Dataset]. https://www.kaggle.com/datasets/vikramkathare/weather-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vikram Kathare
    Description

    This dataset accompanies a hands-on data analysis project with Python: questions are given in the project and then solved with the help of Python. You can call it a Data Analysis with Python project or, equally, Data Science with Python.

    The commands that we used in this project :

    • head() - It shows the first N rows in the data (by default, N=5).
    • shape - It shows the total no. of rows and no. of columns of the dataframe
    • index - This attribute provides the index of the dataframe
    • columns - It shows the name of each column
    • dtypes - It shows the data-type of each column
    • unique() - In a column, it shows all the unique values. It can be applied on a single column only, not on the whole dataframe.
    • nunique() - It shows the total no. of unique values in each column. It can be applied on a single column as well as on the whole dataframe.
    • count() - It shows the total no. of non-null values in each column. It can be applied on a single column as well as on the whole dataframe.
    • value_counts() - In a column, it shows all the unique values with their count. It can be applied on a single column only.
    • info() - Provides basic information about the dataframe.

    Challenges for this DataSet:

    Q. 1) Find all the unique 'Wind Speed' values in the data.
    Q. 2) Find the number of times when the 'Weather is exactly Clear'.
    Q. 3) Find the number of times when the 'Wind Speed was exactly 4 km/h'.
    Q. 4) Find out all the Null Values in the data.
    Q. 5) Rename the column name 'Weather' of the dataframe to 'Weather Condition'.
    Q. 6) What is the mean 'Visibility'?
    Q. 7) What is the Standard Deviation of 'Pressure' in this data?
    Q. 8) What is the Variance of 'Relative Humidity' in this data?
    Q. 9) Find all instances when 'Snow' was recorded.
    Q. 10) Find all instances when 'Wind Speed is above 24' and 'Visibility is 25'.
    Q. 11) What is the Mean value of each column against each 'Weather Condition'?
    Q. 12) What is the Minimum & Maximum value of each column against each 'Weather Condition'?
    Q. 13) Show all the Records where Weather Condition is Fog.
    Q. 14) Find all instances when 'Weather is Clear' or 'Visibility is above 40'.
    Q. 15) Find all instances when: A. 'Weather is Clear' and 'Relative Humidity is greater than 50', or B. 'Visibility is above 40'.
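
    A minimal sketch of a few of these challenges (the file and column names are assumptions based on the questions above):

    import pandas as pd

    # File and column names are assumptions; adjust to the actual CSV.
    df = pd.read_csv("weather_dataset.csv")

    print(df["Wind Speed"].unique())                          # Q. 1
    print((df["Weather"] == "Clear").sum())                   # Q. 2
    print(df.isnull().sum())                                  # Q. 4
    df = df.rename(columns={"Weather": "Weather Condition"})  # Q. 5
    print(df[df["Weather Condition"] == "Fog"])               # Q. 13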

  17. Simultaneous measurements of ECG, body Impedance and temperature

    • mostwiedzy.pl
    csv
    Updated Dec 17, 2020
    Cite
    Tomasz Kocejko; Piotr Przystup (2020). Simultaneous measurements of ECG, body Impedance and temperature [Dataset]. http://doi.org/10.34808/x3f9-fh19
    Explore at:
    Available download formats: csv(42431670)
    Dataset updated
    Dec 17, 2020
    Authors
    Tomasz Kocejko; Piotr Przystup
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The data are a complementary part of an experiment designed to demonstrate how to use network protocols to transmit medical data. The dataset contains biomedical signals of ECG, impedance and temperature acquired simultaneously. The data allow students to become familiar with data acquisition methods (simulating data transmission by a medical device over the UDP protocol) and with methods of decoding the data frame and reading data saved in different formats. The data frame of the dataset is as follows:

  18. Airlines Flights Data

    • kaggle.com
    Updated Jul 29, 2025
    Data Science Lovers (2025). Airlines Flights Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    📹 Project Video available on YouTube - https://youtu.be/gu3Ot78j_Gc

    Airlines Flights Dataset for Different Cities

    The flights booking dataset of various airlines was scraped date-wise, in a structured format, from a well-known travel website. It contains records of flights between cities in India, with multiple features such as source and destination city, arrival and departure time, duration, and price of the flight.

    This data is available as a CSV file. We are going to analyze this dataset using a Pandas DataFrame.

    This analysis will be helpful for those working in the airline or travel domain.

    Using this dataset, we answered multiple questions with Python in our project; a sketch of a few of them follows the list below.

    Q.1. What are the airlines in the dataset, accompanied by their frequencies?

    Q.2. Show bar graphs representing the Departure Time & Arrival Time.

    Q.3. Show bar graphs representing the Source City & Destination City.

    Q.4. Does the price vary with the airline?

    Q.5. Does the ticket price change based on the departure time and arrival time?

    Q.6. How does the price change with the source and destination?

    Q.7. How is the price affected when tickets are bought just 1 or 2 days before departure?

    Q.8. How does the ticket price vary between Economy and Business class?

    Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?
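
    A hedged sketch of how a few of these questions translate into pandas; the file name and the column names (airline, source_city, destination_city, class, price) are assumptions about this CSV's headers:

    import pandas as pd

    flights = pd.read_csv("airlines_flights_data.csv")

    flights['airline'].value_counts()               # Q.1: airlines with their frequencies

    flights.groupby('airline')['price'].describe()  # Q.4: price distribution per airline

    mask = (
        (flights['airline'] == 'Vistara')
        & (flights['source_city'] == 'Delhi')
        & (flights['destination_city'] == 'Hyderabad')
        & (flights['class'] == 'Business')
    )
    flights.loc[mask, 'price'].mean()               # Q.9: average Business fare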

    These are the main features/columns available in the dataset:

    1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

    2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

    3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

    4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.

    5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

    6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

    7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

    8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

    9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

    10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date (see the sketch after this list).

    11) Price: The target variable; it stores the ticket price.
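
    A small sketch of how the derived Days Left feature could be computed; the column names booking_date and journey_date are hypothetical:

    import pandas as pd

    dates = pd.DataFrame({
        'booking_date': pd.to_datetime(['2025-07-01', '2025-07-20']),
        'journey_date': pd.to_datetime(['2025-07-15', '2025-07-22']),
    })
    # Days Left = trip date minus booking date, in whole days.
    dates['days_left'] = (dates['journey_date'] - dates['booking_date']).dt.days
    print(dates['days_left'].tolist())  # [14, 2]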

  19. Dataset for "Short-Form Videos Degrade Our Capacity to Retain Intentions:...

    • darus.uni-stuttgart.de
    • b2find.eudat.eu
    Updated Sep 16, 2024
    Francesco Chiossi; Luke Haliburton; Changkun Ou; Andreas Butz; Albrecht Schmidt (2024). Dataset for "Short-Form Videos Degrade Our Capacity to Retain Intentions: Effect of Context Switching On Prospective Memory" [Dataset]. http://doi.org/10.18419/DARUS-3327
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 16, 2024
    Dataset provided by
    DaRUS
    Authors
    Francesco Chiossi; Luke Haliburton; Changkun Ou; Andreas Butz; Albrecht Schmidt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    Social media platforms use short, highly engaging videos to catch users' attention. While the short-form video feeds popularized by TikTok are rapidly spreading to other platforms, we do not yet understand their impact on cognitive functions. We conducted a between-subjects experiment (N = 60) investigating the impact of engaging with TikTok, Twitter, and YouTube while performing a Prospective Memory task (i.e., executing a previously planned action). The study required participants to remember intentions over interruptions. We found that the TikTok condition significantly degraded the users' performance in this task. As none of the other conditions (Twitter, YouTube, no activity) had a similar effect, our results indicate that the combination of short videos and rapid context switching impairs intention recall and execution. We contribute a quantified understanding of the effect of social media feed format on Prospective Memory and outline consequences for media technology designers so as not to harm users' memory and wellbeing.

    Description of the Dataset

    Data frames:
    • ./data/rt.csv provides the data frame of reaction times.
    • ./data/acc.csv provides the data frame of reaction accuracy scores.
    • ./data/q.csv provides the data frame collected from questionnaires.
    • ./data/ddm.csv contains the DDM features learned using ./appendix2_ddm_fitting.ipynb, which are then used in ./3.ddm_anova.ipynb.

    Figures: All figures that appear in the paper are placed in ./figures and can be reproduced using the *_vis.ipynb files.
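
    A minimal sketch for loading the four data frames listed above with pandas; only the file paths come from the description, and no assumptions are made about the columns:

    import pandas as pd

    rt = pd.read_csv("./data/rt.csv")    # reaction times
    acc = pd.read_csv("./data/acc.csv")  # reaction accuracy scores
    q = pd.read_csv("./data/q.csv")      # questionnaire responses
    ddm = pd.read_csv("./data/ddm.csv")  # fitted DDM features

    for name, frame in {'rt': rt, 'acc': acc, 'q': q, 'ddm': ddm}.items():
        print(name, frame.shape)  # quick sanity check of each table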

  20. Tajweed Dataset

    • kaggle.com
    Updated Apr 6, 2025
    Ala'a Abdu Saleh Alawdi (2025). Tajweed Dataset [Dataset]. https://www.kaggle.com/datasets/alawdisoft/tajweed-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ala'a Abdu Saleh Alawdi
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The provided code processes a Tajweed dataset, which appears to be a collection of audio recordings categorized by different Tajweed rules (Ikhfa, Izhar, Idgham, Iqlab). Let's break down the dataset's structure and the code's functionality:

    Dataset Structure:

    • Organized by Tajweed Rule and Sheikh: The dataset is structured into directories for each Tajweed rule (e.g., 'Ikhfa', 'Izhar'). Within each rule's directory, there are subdirectories representing different reciters (sheikhs). This hierarchical organization is crucial for creating a structured metadata file and for training machine learning models.
    • Audio Files: The audio files (presumably WAV or other supported formats) are stored within the sheikhs' subdirectories. The original filenames are not standardized.
    • Multiple Sheikhs per Rule: The dataset includes multiple recitations for each rule from different sheikhs, offering diversity in pronunciation.
    • Google Drive Storage: The dataset is located on Google Drive, which requires mounting the drive to access the data within a Colab environment.

    Code Functionality:

    1. Initialization and Imports: The code begins with necessary imports (pandas, pydub) and mounts Google Drive. Pydub is used for audio file format conversion.

    2. Directory Listing: It initially checks if a specified directory exists (for example, Alaa_alhsri/Ikhfa) and lists its files, demonstrating basic file system access.

    3. Metadata Creation: The core of the script is the generation of metadata, which provides essential information about each audio file. The tajweed_paths dictionary maps each Tajweed rule to a list of paths, associating each path with the reciter's name.

      • Iterating through Paths: The code iterates through each Tajweed rule and its corresponding paths.
      • File Listing: Inside each directory, it iterates through the audio files.
      • Metadata Dictionary: For each audio file, it creates a metadata dictionary that includes:
        • global_id: A unique identifier for each audio file.
        • original_filename: The original filename of the audio file.
        • new_filename: A standardized filename that incorporates the Tajweed rule (label), sheikh's ID, audio number, and a global ID.
        • label: The Tajweed rule.
        • sheikh_id: A numerical identifier for each sheikh.
        • sheikh_name: The name of the reciter.
        • audio_number: A sequential number for the audio files within a specific sheikh and Tajweed rule combination.
        • original_path: Full path to the original audio file.
        • new_path: Full path to the intended location for the renamed and potentially converted audio file.
      • Pandas DataFrame: The metadata is collected in a list of dictionaries and then converted into a Pandas DataFrame for easier viewing and processing. This DataFrame carries everything needed for the renaming, conversion, and export steps below.
    4. File Renaming and Conversion:

      • File Renaming (commented out): The code can rename the audio files to the standardized format defined in new_filename and store them in the designated directory.
      • Audio Conversion to WAV: The script then converts any files in the specified directories to .wav format, creating standardized files in a new output_dataset directory. The new filenames are based on the rule, the sheikh, and a counter.
    5. Metadata Export: Finally, the compiled metadata is saved as a CSV file (metadata.csv) in the output directory. This CSV file is crucial for training any machine learning model on this data. A condensed sketch of the whole pipeline follows.
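
    A condensed sketch of the pipeline under stated assumptions: the tajweed_paths mapping, the Drive paths and the filename pattern are illustrative, not the notebook's exact values:

    import os
    import pandas as pd
    from pydub import AudioSegment

    # One (directory, sheikh name) list per Tajweed rule; paths are hypothetical.
    tajweed_paths = {
        'Ikhfa': [('/content/drive/MyDrive/tajweed/Alaa_alhsri/Ikhfa', 'Alaa alhsri')],
        # ... entries for 'Izhar', 'Idgham' and 'Iqlab' follow the same shape
    }
    output_dir = '/content/drive/MyDrive/output_dataset'
    os.makedirs(output_dir, exist_ok=True)

    records, global_id = [], 0
    for label, paths in tajweed_paths.items():
        for sheikh_id, (path, sheikh_name) in enumerate(paths, start=1):
            for audio_number, fname in enumerate(sorted(os.listdir(path)), start=1):
                global_id += 1
                new_filename = f'{label}_s{sheikh_id}_a{audio_number}_g{global_id}.wav'
                records.append({
                    'global_id': global_id,
                    'original_filename': fname,
                    'new_filename': new_filename,
                    'label': label,
                    'sheikh_id': sheikh_id,
                    'sheikh_name': sheikh_name,
                    'audio_number': audio_number,
                    'original_path': os.path.join(path, fname),
                    'new_path': os.path.join(output_dir, new_filename),
                })
                # Convert the source audio (whatever its format) to standardized WAV.
                audio = AudioSegment.from_file(os.path.join(path, fname))
                audio.export(os.path.join(output_dir, new_filename), format='wav')

    # Export the compiled metadata for later model training.
    pd.DataFrame(records).to_csv(os.path.join(output_dir, 'metadata.csv'), index=False)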
