License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas helps you answer questions about the data, such as (see the example below):
Is there a correlation between two or more columns?
What is the average value?
Max value?
Min value?
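For instance, a minimal pandas sketch (with made-up values) that answers these questions:
```python
import pandas as pd

# A tiny illustrative DataFrame (values are made up)
df = pd.DataFrame({
    "duration": [60, 45, 45, 30],
    "pulse": [110, 117, 103, 109],
    "calories": [409.1, 479.0, 340.0, 282.4],
})

print(df.corr())           # correlation between columns
print(df["pulse"].mean())  # average value
print(df["pulse"].max())   # max value
print(df["pulse"].min())   # min value
```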
License: CDLA Permissive 1.0 (https://cdla.io/permissive-1-0/)
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data. The proportion of missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced: for numerical values in some features, the noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
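As a quick illustration, here is a hedged pandas sketch for inspecting these issues; the file name practice_dataset.csv is a placeholder for whatever the actual data file is called:
```python
import pandas as pd

# Placeholder file name; substitute the actual dataset file
df = pd.read_csv("practice_dataset.csv")

# Fraction of missing values per column (expected to range roughly from 1% to 70%)
print(df.isna().mean().sort_values(ascending=False))

# Flag outliers in one numerical column using the IQR rule
col = df.select_dtypes("number").columns[0]
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in", col)
```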
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the splits of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset            AudioSet
filename           train/---2_BBVHAA.mp3
captions_visual    [a man in a black hat and glasses.]
captions_auditory  [a man speaks and dishes clank.]
tags               [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. In case of missing files, we can provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
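Building on the example above, a small sketch (using only the fields documented here) for filtering the annotations:
```python
import pandas as pd

df = pd.read_parquet("annotation_train.parquet", engine="pyarrow")

# Keep only AudioSet entries that have visual captions
audioset = df[df["dataset"] == "AudioSet"]
with_visual = audioset[audioset["captions_visual"].notna()]
print(len(with_visual), "AudioSet clips with visual captions")
```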
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.
We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.
Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.
The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.
In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:
"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use) "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)
Instructions on their intended use can be found in the accompanying Jupyter Notebook.
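For orientation, a minimal loading sketch, assuming a pandas version compatible with the pickled files (only load pickles from sources you trust):
```python
import pandas as pd

# Load the curated, lemmatized transcripts described above
df = pd.read_pickle("unrestricted_lemmatized_df.pkl")
print(df.shape)             # expected (1873, 6)
print(df.columns.tolist())  # RG_number, text, display_date, conditions_access, conditions_use, lemmas
```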
Credits:
The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).
This dataset helps you practice the data-cleaning process using the pure Python pandas library.
Dataset Card for Census Income (Adult)
This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset:
import pandas as pd
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
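Since the script above is truncated, here is a hedged sketch of the general approach (not the authors' exact code); the column names follow the standard UCI adult.names description:
```python
import pandas as pd
from datasets import Dataset

url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# The raw file has no header row; names follow the UCI adult.names description
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
df = pd.read_csv(url1, header=None, names=columns, skipinitialspace=True)

# Wrap the DataFrame as a Hugging Face Dataset
ds = Dataset.from_pandas(df)
print(ds)
```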
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Pandas is a very useful library, probably the most useful for data munging in Python. This notebook is an attempt to collate all pandas DataFrame operations that a data scientist might use.
You'll see how to create dataframes, read in files (even ones with anomalies), check out descriptive stats on columns, filter on different values in different ways, and perform some of the more frequently used operations.
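A small, self-contained sketch of the kinds of operations covered (illustrative values only):
```python
import pandas as pd
from io import StringIO

# Create a DataFrame from scratch
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp": [3.1, 19.4, 27.8]})

# Read a CSV (inlined here for a runnable example; real files work the same via a path)
csv_df = pd.read_csv(StringIO("city;temp\nOslo;3.1\nLima;19.4"), sep=";")

# Descriptive stats, filtering, and sorting
print(csv_df.describe())
print(df[df["temp"] > 10])
print(df.sort_values("temp", ascending=False))
```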
A big "thank you" to Data School. You'll find plenty of notebooks and videos here: https://github.com/justmarkham/pandas-videos
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
3D skeletons UP-Fall Dataset
Difference between fall and impact detection
Overview
This dataset aims to facilitate research in fall detection, particularly the precise detection of impact moments within fall events. The accuracy and comprehensiveness of the 3D skeleton data make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios.
Data Collection
The skeletal data was extracted using a pose estimation algorithm, which processes image frames to determine the 3D coordinates of each joint. Sequences with fewer than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.
CSV Structure
The data is organized by subjects, and each subject contains CSV files named according to the pattern C1S1A1T1, where:
C: Camera (1 or 2)
S: Subject (1 to 5)
A: Activity (1 to N, representing different activities)
T: Trial (1 to 3)
subject1/: Contains CSV files for Subject 1.
C1S1A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 1
C1S1A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 1
C1S1A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 1
C2S1A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 1
C2S1A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 1
C2S1A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 1
subject2/: Contains CSV files for Subject 2.
C1S2A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 2
C1S2A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 2
C1S2A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 2
C2S2A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 2
C2S2A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 2
C2S2A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 2
subject3/, subject4/, subject5/: Similar structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.
Column Descriptions
Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:
| Column Name | Description |
|---|---|
| joint_1_x | X coordinate of joint 1 |
| joint_1_y | Y coordinate of joint 1 |
| joint_1_z | Z coordinate of joint 1 |
| joint_2_x | X coordinate of joint 2 |
| joint_2_y | Y coordinate of joint 2 |
| joint_2_z | Z coordinate of joint 2 |
| ... | ... |
| joint_n_x | X coordinate of joint n |
| joint_n_y | Y coordinate of joint n |
| joint_n_z | Z coordinate of joint n |
| LABEL | Label indicating impact (1) or non-impact (0) |
Example
Here is an example of what a row in one of the CSV files might look like:
| joint_1_x | joint_1_y | joint_1_z | joint_2_x | joint_2_y | joint_2_z | ... | joint_n_x | joint_n_y | joint_n_z | LABEL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.123 | 0.456 | 0.789 | 0.234 | 0.567 | 0.890 | ... | 0.345 | 0.678 | 0.901 | 0 |
Usage
This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.
Using GitHub
Clone the repository:
```bash
git clone https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset
```
Navigate to the directory:
```bash
cd 3D_skeletons-UP-Fall-Dataset
```
Examples
Here's a simple example of how to load and inspect a sample data file using Python:
```python
import pandas as pd

data = pd.read_csv('subject1/C1S1A1T1.csv')
print(data.head())
```
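As a follow-up sketch (using the LABEL column documented above), separating features from labels and checking the class balance:
```python
import pandas as pd

data = pd.read_csv("subject1/C1S1A1T1.csv")

# Separate joint coordinates from the impact label
X = data.drop(columns=["LABEL"])
y = data["LABEL"]
print(y.value_counts())  # how many impact (1) vs non-impact (0) frames
```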
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas DataFrames) derived from the Salvus project (a minimal loading sketch follows this list). We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type
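A minimal loading sketch for the pickled DataFrames, assuming a pandas version compatible with the pickles:
```python
import pandas as pd

# Load the misfit table; the station/arrival tables can be loaded the same way
df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(df_misfits.head())
print(df_misfits.columns.tolist())
```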
How to use this dataset:
To set up the conda environment:
make sure you have Anaconda or Miniconda installed
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on Salvus. You can do the analyses and create the figures without it, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes; in that case, download an older Salvus version.
Additionally in your conda env, install basemap and cartopy:
```bash
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
```
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask:
```python
import dask.dataframe as dd

ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, compute a description of the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
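For example, a minimal sketch (the file name is a placeholder for whichever daily or hourly CSV you use):
```python
import pandas as pd

# Placeholder file name; substitute the CSV you downloaded
daily = pd.read_csv("fitbit_daily.csv")
print(daily.head())
print(daily.dtypes)
```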
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
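Once restored, the collections can also be queried from Python; a minimal sketch assuming the pymongo package is installed and MongoDB runs on localhost:27017:
```python
from pymongo import MongoClient

# Connect to the restored LifeSnaps database (collections: fitbit, sema, surveys)
client = MongoClient("mongodb://localhost:27017")
db = client["rais_anonymized"]

print(db.list_collection_names())
print(db["fitbit"].find_one())  # inspect one raw Fitbit document
```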
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
License: BSD 2-Clause (http://opensource.org/licenses/BSD-2-Clause)
Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from that. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has yet gone unexplored, and where encryption does not help, is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code will then provide statistics on the efficiency of encoding and compression for the entire dataset, and attempt to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors stated in the summary table, a different TW is computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for various supervised and unsupervised methods.
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TE are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
Parquet uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:
```python
import pandas as pd

# Reading a Parquet file
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see the attached data documentation.
The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data.
The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.
The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
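A minimal pandas sketch reflecting the description above (semicolon-separated files, with a validation split drawn at random from the training data):
```python
import pandas as pd

# The CSV files are semicolon-separated
train = pd.read_csv("training.csv", sep=";")
test = pd.read_csv("test.csv", sep=";")

# There is no dedicated validation file: sample one at random from the training data
val = train.sample(frac=0.2, random_state=42)
train = train.drop(val.index)
print(train.shape, val.shape, test.shape)
```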
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This resource serves as a template for creating a curve number grid raster file, which can be used to create corresponding maps or for further analysis. Soil data and reclassified land-use raster files are created along the way. The user has to provide, or connect to, a set of shapefiles including the watershed boundary, the soil data and land-use covering this watershed, a land-use reclassification, and a curve number lookup table. The script contained in this resource mainly uses PyQGIS through a Jupyter Notebook for the majority of the processing, with a touch of Pandas for data manipulation. A detailed description of the procedure is given in comments in the script.
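To illustrate the Pandas part only (the raster work itself is done with PyQGIS), a hedged sketch of joining a curve number lookup table onto land-use/soil-group combinations; all column names and values here are hypothetical:
```python
import pandas as pd

# Hypothetical land-use / hydrologic soil group combinations and CN lookup table
combos = pd.DataFrame({"landuse": ["urban", "forest"], "hydro_group": ["B", "C"]})
cn_lookup = pd.DataFrame({
    "landuse": ["urban", "urban", "forest", "forest"],
    "hydro_group": ["B", "C", "B", "C"],
    "CN": [85, 90, 60, 70],  # illustrative values only
})

# Join the lookup onto the combinations to obtain a curve number per class
with_cn = combos.merge(cn_lookup, on=["landuse", "hydro_group"], how="left")
print(with_cn)
```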
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
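A hedged end-to-end sketch using the file names above; it assumes numeric feature columns and that the last column holds the class label (adjust to the real schema):
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

# Assumption: last column is the label, remaining columns are numeric features
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_val, y_val = val.iloc[:, :-1], val.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```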
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset consists of color and depth videos of Panda robot motions and their corresponding joint and Cartesian trajectories. The dataset also includes the trajectories of a receiver robot for the purpose of an object handover. Each motion sample comprises 6 files (RGB video, depth video and 4 giver/receiver trajectories in time series form). Total number of motion samples: 38393. Structure: MPEG-4 videos of robot motion and corresponding Python serialized (or “pickled”) files, containing joint and Cartesian trajectories. Dataset is divided into four parts: simulation dataset (PandaHandover_Sim.zip), real train dataset (PandaHandover_Real_Train.zip), real validation dataset (PandaHandover_Real_Val.zip), real test dataset (PandaHandover_Real_Test.zip). Extract using 7-Zip or similar software. Video files (.avi) can be opened using VLC media player or any other video player that supports MPEG-4 codec. The .pkl files can be loaded using Python (>=3.7) and the Python library Pandas (>=1.1.3).
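A minimal loading sketch, assuming Python >= 3.7 and pandas >= 1.1.3 as stated; the file name is a hypothetical placeholder for one of the trajectory .pkl files:
```python
import pandas as pd

# Placeholder name; substitute an actual trajectory file from the extracted archive
traj = pd.read_pickle("sample_0001_giver_joint_trajectory.pkl")
print(type(traj))
print(traj.head() if hasattr(traj, "head") else traj)
```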
"module-utilities" is a Python package of utilities to simplify working with Python packages. The main features of module-utilities are as follows:
"cached" module: A module to cache class attributes and methods. Right now, this uses a standard Python dictionary for storage. Future versions will hopefully be more robust to threading and shared caches.
"docfiller" module: A module to share documentation. This is adapted from the pandas doc decorator. There are a host of utilities built around this.
"docinhert": An interface to the "docstring-inheritance" module. This can be combined with "docfiller" to make creating related function/class documentation easy.
Overview
This Zenodo record packages a reproducible bundle of derived ocean wind data and auxiliary materials produced from Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products. The bundle contains the processed dataset as a GeoParquet data lake with a static STAC catalog, together with the exact scripts, configuration, SQL statements, and inputs used to generate it. It is intended to enable full transparency, re‑use, and re‑execution of the data generation workflow.
The dataset was generated using the HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (software record DOI: https://doi.org/10.5281/zenodo.17011823), which automates download and ingestion into partitioned GeoParquet, and builds a STAC Collection and Items describing the outputs. This record includes all relevant artifacts so that other researchers can verify the steps, inspect lineage, and rerun the pipeline if needed.
Dataset summary
Final table rows: 11,384,420
Rows with valid wind data: 7,992,086
Contents of this record
hf_eolus_sar.tar.gz: Tarball containing the processed data outputs. It includes the partitioned GeoParquet dataset (OGC GeoParquet v1.1 metadata) and a static STAC catalog (Collection and Items with the STAC Table Extension and Processing metadata) describing the Parquet assets.
pipeline.sh: The exact sequence of commands used to generate this dataset (download and ingestion), serving as an executable provenance log for reproducibility.
stac_properties_collection.json: STAC property definitions/templates applied at Collection level during catalog generation.
stac_properties_item.json: STAC property definitions/templates applied at Item level during catalog generation.
scripts/*.sql (e.g., scripts/athena_create_table.sql): SQL statements used for registering the resulting Parquet dataset as an external table (e.g., in AWS Athena) and for validating schema/partitions.
scripts/downloaded_files.txt: Manifest listing the Sentinel‑1 OCN product identifiers that were downloaded and used as inputs.
scripts/VILA_PRIO_hull.json: Area of Interest (AOI) polygon used to constrain the search and download of Sentinel‑1 scenes spatially. This file defines a convex hull bounding the intersection between the areas covering the echoes of the VILA and PRIO stations.
scripts/files_to_download.csv: Input list and/or search results for Sentinel‑1 OCN products targeted by the download stage (includes product IDs and acquisition metadata).
Reproducibility and re‑execution
Software pipeline: HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (https://doi.org/10.5281/zenodo.17011823). The pipeline uses Dockerized R and Python environments for deterministic runs (no manual dependency setup required).
How to re‑run: Review pipeline.sh to see the exact commands, arguments, and environment variables used. If desired, clone the referenced pipeline repository, ensure Docker is available, and re‑execute the same steps. The AOI and time range used are captured in scripts/area_boundary.geojson and scripts/files_to_download.csv; the precise upstream inputs are listed in scripts/downloaded_files.txt.
Optional cloud registration: The SQL files in scripts/ can be used to register the resulting Parquet dataset in AWS Athena (or adapted for other engines like Trino/Spark). This step is optional and not required to read the Parquet files directly with tools such as Python (pyarrow/GeoPandas), R (arrow), or DuckDB.
Copernicus credentials: To re‑download Sentinel‑1 OCN data, provide your own Copernicus Data Space Ecosystem account credentials. Create a plain‑text file (e.g., credentials) with one line containing your key and pass its path via --credentials-file to scripts/download_sar.sh. For account and API access information, see https://dataspace.copernicus.eu and the documentation at https://documentation.dataspace.copernicus.eu/
Data format and standards
GeoParquet: Columnar, compressed Parquet files with OGC GeoParquet v1.1 geospatial metadata (geometry column, CRS, encoding). Files are partitioned to support efficient filtering and scalable analytics.
STAC: A static STAC Collection with Items describes each Parquet asset, including schema via the STAC Table Extension and lineage via Processing metadata. The catalog is suitable for static hosting and is interoperable with common STAC tooling.
Provenance and upstream data
Upstream source: Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products provided by the European Space Agency (ESA) under the Copernicus Programme. The list of specific input products for this dataset is included in scripts/downloaded_files.txt.
Processing: All derivations (extraction of ocean wind variables, conversion to GeoParquet, and STAC catalog creation) were performed by the HF‑EOLUS pipeline referenced above. The sequence is captured verbatim in pipeline.sh.
Credit line: Contains modified Copernicus Sentinel‑1 data; we gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data.
How to cite
Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). Project HF‑EOLUS. Task 2. Sentinel‑1 SAR Derived Data Bundle (GeoParquet + STAC) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17007304
Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (GeoParquet + STAC) (v0.1.2). Zenodo. https://doi.org/10.5281/zenodo.17011788
Usage notes
Local analysis: The GeoParquet files can be opened directly with Python (pyarrow, pandas/GeoPandas), R (arrow), or SQL engines (DuckDB/Trino) without additional ingestion steps; a minimal Python reading sketch follows these notes.
Catalog discovery: The STAC catalog in the tarball is static and can be browsed with STAC tools or published on object storage or a simple web server.
AWS/Athena setup (optional): To use the GeoParquet in AWS, upload the dataset to Amazon S3 and adjust the SQL to your paths and names, then execute in Athena:
1) Upload the GeoParquet (and optionally the STAC catalog) to s3://<your-bucket>/<your-prefix>/ preserving the folder structure.
2) Edit scripts/athena_create_table.sql to set the LOCATION to your S3 path and customize the database and table (e.g., change SAR_INGEST.SAR to MY_DB.MY_TABLE).
3) In Athena, run the SQL to create the database (if needed) and the external table.
4) Load partitions with MSCK REPAIR TABLE MY_DB.MY_TABLE; (or add partitions explicitly) and validate with a quick query such as SELECT COUNT(*) FROM MY_DB.MY_TABLE;.
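As referenced above under "Local analysis", a minimal Python sketch for reading the partitioned GeoParquet dataset; the directory path is a placeholder for wherever hf_eolus_sar.tar.gz was extracted:
```python
import pandas as pd

# Read the partitioned GeoParquet dataset directly (no ingestion needed)
df = pd.read_parquet("hf_eolus_sar/", engine="pyarrow")
print(len(df))              # expected on the order of 11.4 million rows
print(df.columns.tolist())  # the geometry column arrives as WKB unless read with GeoPandas
```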
Acknowledgements
This work has been funded by the HF‑EOLUS project (TED2021‑129551B‑I00), financed by MICIU/AEI /10.13039/501100011033 and by the European Union NextGenerationEU/PRTR - BDNS 598843 - Component 17 - Investment I3. Members of the Marine Research Centre (CIM) of the University of Vigo have participated in the development. We gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data used in this work (contains modified Copernicus Sentinel data).
Disclaimer
This content is provided "as is", without warranties of any kind. Users are responsible for verifying fitness for purpose and for complying with the licensing and attribution requirements of upstream Copernicus Sentinel data.
Observational wave data and modeled wind data to accompany the article J. Davis et al. (2023) "Saturation of ocean surface wave slopes observed during hurricanes" in Geophysical Research Letters. The observations include targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). Surface wind speeds are modeled using the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The datasets are table-like and include the following: 1) hourly records of surface wave statistics in the form of scalar energy spectra, directional moments, derived products (including mean square slope), and modeled wind speeds; 2) data to reproduce the binned mean square slopes and model wind speeds presented in the article; 3) data to reproduce the mean energy density versus wind speed plot; and 4) data to reproduce the mean energy densit...
This dataset contains wave measurements collected by free-drifting Spotter buoys (Sofar Ocean) which use GPS-derived motions to report hourly records of surface wave statistics in the form of scalar energy spectra and directional moments. The observational data are combined with modeled surface wind speeds from the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC) which are interpolated onto the Spotter wave observations in time and space.
The datasets are stored as text-based JSON and CSV files which can be read without the use of special software. The JSON files are created using the Python Pandas package DataFrame.to_json method with orient='records'. Example code is provided in MATLAB and Python.
# Data to accompany the article "Saturation of ocean surface wave slopes observed during hurricanes"
Contains wave observations by free-drifting Spotter buoys from targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). The observations are co-located with modeled surface wind speeds from the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The data are used in the article to show the saturation of mean square slope at extreme wind speeds and the coincident transition from an equilibrium-dominated spectral tail to a saturation-dominated tail. This archive contains two main, table-like datasets of wind-wave data (from Ian and Fiona) and three derived datasets necessary to reproduce the results and figures in the article.
The dataset is organized into fi...
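A minimal sketch of reading one of the records-oriented JSON tables with pandas; the file name is a hypothetical placeholder:
```python
import pandas as pd

# Placeholder name; the JSON tables were written with DataFrame.to_json(orient='records')
waves = pd.read_json("ian_spotter_wave_data.json", orient="records")
print(waves.head())
```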