License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
What is Pandas?
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
What Can Pandas Do?
Pandas helps you answer questions about the data, such as (see the example below):
Is there a correlation between two or more columns?
What is the average value?
Max value?
Min value?
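For instance, a minimal pandas sketch (with made-up values) that answers these questions:
```python
import pandas as pd

# A tiny illustrative DataFrame (values are made up)
df = pd.DataFrame({
    "duration": [60, 45, 45, 30],
    "pulse": [110, 117, 103, 109],
    "calories": [409.1, 479.0, 340.0, 282.4],
})

print(df.corr())           # correlation between columns
print(df["pulse"].mean())  # average value
print(df["pulse"].max())   # max value
print(df["pulse"].min())   # min value
```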
License: CDLA Permissive 1.0 (https://cdla.io/permissive-1-0/)
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data. The proportion of missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced: for numerical values in some features, the noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
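As a quick illustration, here is a hedged pandas sketch for inspecting these issues; the file name practice_dataset.csv is a placeholder for whatever the actual data file is called:
```python
import pandas as pd

# Placeholder file name; substitute the actual dataset file
df = pd.read_csv("practice_dataset.csv")

# Fraction of missing values per column (expected to range roughly from 1% to 70%)
print(df.isna().mean().sort_values(ascending=False))

# Flag outliers in one numerical column using the IQR rule
col = df.select_dtypes("number").columns[0]
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers in", col)
```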
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.
Annotation
The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation, and test sets follows the splits of the original datasets.
Installation
pip install pandas pyarrow
Example
import pandas as pd

df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
print(df.iloc[0])
dataset            AudioSet
filename           train/---2_BBVHAA.mp3
captions_visual    [a man in a black hat and glasses.]
captions_auditory  [a man speaks and dishes clank.]
tags               [Speech]
Description
The annotation file consists of the following fields:
filename: Name of the corresponding file (video or audio file)
dataset: Source dataset associated with the data point
captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
captions_auditory: A list of captions related to the auditory content of the video
tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided
Data files
The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. In case of missing files, we can provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
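Building on the example above, a small sketch (using only the fields documented here) for filtering the annotations:
```python
import pandas as pd

df = pd.read_parquet("annotation_train.parquet", engine="pyarrow")

# Keep only AudioSet entries that have visual captions
audioset = df[df["dataset"] == "AudioSet"]
with_visual = audioset[audioset["captions_visual"].notna()]
print(len(with_visual), "AudioSet clips with visual captions")
```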
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.
We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.
Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.
The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.
In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:
"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use) "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)
Instructions on their intended use can be found in the accompanying Jupyter Notebook.
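For orientation, a minimal loading sketch, assuming a pandas version compatible with the pickled files (only load pickles from sources you trust):
```python
import pandas as pd

# Load the curated, lemmatized transcripts described above
df = pd.read_pickle("unrestricted_lemmatized_df.pkl")
print(df.shape)             # expected (1873, 6)
print(df.columns.tolist())  # RG_number, text, display_date, conditions_access, conditions_use, lemmas
```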
Credits:
The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).
This dataset helps you practice the data-cleaning process using the pure Python pandas library.
Dataset Card for Census Income (Adult)
This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset:
import pandas as pd
from datasets import Dataset, DatasetDict, Features, Value, ClassLabel
url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" url2 =… See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
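Since the script above is truncated, here is a hedged sketch of the general approach (not the authors' exact code); the column names follow the standard UCI adult.names description:
```python
import pandas as pd
from datasets import Dataset

url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# The raw file has no header row; names follow the UCI adult.names description
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
df = pd.read_csv(url1, header=None, names=columns, skipinitialspace=True)

# Wrap the DataFrame as a Hugging Face Dataset
ds = Dataset.from_pandas(df)
print(ds)
```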
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
Pandas is a very useful library, probably the most useful for data munging in Python. This notebook is an attempt to collate all pandas DataFrame operations that a data scientist might use.
You'll see how to create dataframes, read in files (even ones with anomalies), check out descriptive stats on columns, filter on different values in different ways, and perform some of the more frequently used operations.
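A small, self-contained sketch of the kinds of operations covered (illustrative values only):
```python
import pandas as pd
from io import StringIO

# Create a DataFrame from scratch
df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"], "temp": [3.1, 19.4, 27.8]})

# Read a CSV (inlined here for a runnable example; real files work the same via a path)
csv_df = pd.read_csv(StringIO("city;temp\nOslo;3.1\nLima;19.4"), sep=";")

# Descriptive stats, filtering, and sorting
print(csv_df.describe())
print(df[df["temp"] > 10])
print(df.sort_values("temp", ascending=False))
```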
A big "thank you" to Data School. You'll find plenty of notebooks and videos here: https://github.com/justmarkham/pandas-videos
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
3D skeletons UP-Fall Dataset
Difference between fall and impact detection
Overview
This dataset aims to facilitate research in fall detection, particularly the precise detection of impact moments within fall events. The accuracy and comprehensiveness of the 3D skeleton data make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios.
Data Collection
The skeletal data was extracted using a pose estimation algorithm, which processes image frames to determine the 3D coordinates of each joint. Sequences with fewer than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.
CSV Structure
The data is organized by subjects, and each subject contains CSV files named according to the pattern C1S1A1T1, where:
C: Camera (1 or 2)
S: Subject (1 to 5)
A: Activity (1 to N, representing different activities)
T: Trial (1 to 3)
subject1/: Contains CSV files for Subject 1.
C1S1A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 1
C1S1A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 1
C1S1A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 1
C2S1A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 1
C2S1A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 1
C2S1A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 1
subject2/: Contains CSV files for Subject 2.
C1S2A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 2
C1S2A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 2
C1S2A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 2
C2S2A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 2
C2S2A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 2
C2S2A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 2
subject3/, subject4/, subject5/: Similar structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.
Column Descriptions
Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:
| Column Name | Description |
|---|---|
| joint_1_x | X coordinate of joint 1 |
| joint_1_y | Y coordinate of joint 1 |
| joint_1_z | Z coordinate of joint 1 |
| joint_2_x | X coordinate of joint 2 |
| joint_2_y | Y coordinate of joint 2 |
| joint_2_z | Z coordinate of joint 2 |
| ... | ... |
| joint_n_x | X coordinate of joint n |
| joint_n_y | Y coordinate of joint n |
| joint_n_z | Z coordinate of joint n |
| LABEL | Label indicating impact (1) or non-impact (0) |
Example
Here is an example of what a row in one of the CSV files might look like:
| joint_1_x | joint_1_y | joint_1_z | joint_2_x | joint_2_y | joint_2_z | ... | joint_n_x | joint_n_y | joint_n_z | LABEL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.123 | 0.456 | 0.789 | 0.234 | 0.567 | 0.890 | ... | 0.345 | 0.678 | 0.901 | 0 |
Usage
This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.
Using GitHub
Clone the repository:
```bash
git clone https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset
```
Navigate to the directory:
```bash
cd 3D_skeletons-UP-Fall-Dataset
```
Examples
Here's a simple example of how to load and inspect a sample data file using Python:
```python
import pandas as pd

data = pd.read_csv('subject1/C1S1A1T1.csv')
print(data.head())
```
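As a follow-up sketch (using the LABEL column documented above), separating features from labels and checking the class balance:
```python
import pandas as pd

data = pd.read_csv("subject1/C1S1A1T1.csv")

# Separate joint coordinates from the impact label
X = data.drop(columns=["LABEL"])
y = data["LABEL"]
print(y.value_counts())  # how many impact (1) vs non-impact (0) frames
```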
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas DataFrames) derived from the Salvus project (a minimal loading sketch follows this list). We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type
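A minimal loading sketch for the pickled DataFrames, assuming a pandas version compatible with the pickles:
```python
import pandas as pd

# Load the misfit table; the station/arrival tables can be loaded the same way
df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(df_misfits.head())
print(df_misfits.columns.tolist())
```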
How to use this dataset:
To set up the conda environment:
make sure you have Anaconda or Miniconda installed
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on Salvus. You can do the analyses and create the figures without it, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes; in that case, download an older Salvus version.
Additionally in your conda env, install basemap and cartopy:
```bash
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
```
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebook Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py . The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
polyOne Data Set
The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format. The files start with polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask:
```python
import dask.dataframe as dd

ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```
For example, compute a description of the data set:
```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
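For example, a minimal sketch (the file name is a placeholder for whichever daily or hourly CSV you use):
```python
import pandas as pd

# Placeholder file name; substitute the CSV you downloaded
daily = pd.read_csv("fitbit_daily.csv")
print(daily.head())
print(daily.dtypes)
```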
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
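Once restored, the collections can also be queried from Python; a minimal sketch assuming the pymongo package is installed and MongoDB runs on localhost:27017:
```python
from pymongo import MongoClient

# Connect to the restored LifeSnaps database (collections: fitbit, sema, surveys)
client = MongoClient("mongodb://localhost:27017")
db = client["rais_anonymized"]

print(db.list_collection_names())
print(db["fitbit"].find_one())  # inspect one raw Fitbit document
```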
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
License: BSD 2-Clause (http://opensource.org/licenses/BSD-2-Clause)
Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from that. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has yet gone unexplored, and where encryption does not help, is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code will then provide statistics on the efficiency of encoding and compression for the entire dataset, and attempt to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.
This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data", published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290
The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.
| DatasetId | Entity | Observed Behaviour | Groundtruth | Sample Shape |
|---|---|---|---|---|
| Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
| Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
| OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
| OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
| OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
| OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
| P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
| P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
| NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
| NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
| Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
| Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |
To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build ground truth are as follows:
- Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
- IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.
For each user exposed to the behaviors stated in the summary table, a different TW is computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for various supervised and unsupervised methods.
The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TE are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.
Parquet uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:
```python
import pandas as pd

# Reading a Parquet file
df = pd.read_parquet('path_to_your_file.parquet', engine='fastparquet')
```
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see the attached data documentation.
The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data.
The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.
The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
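A minimal pandas sketch reflecting the description above (semicolon-separated files, with a validation split drawn at random from the training data):
```python
import pandas as pd

# The CSV files are semicolon-separated
train = pd.read_csv("training.csv", sep=";")
test = pd.read_csv("test.csv", sep=";")

# There is no dedicated validation file: sample one at random from the training data
val = train.sample(frac=0.2, random_state=42)
train = train.drop(val.index)
print(train.shape, val.shape, test.shape)
```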
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This resource serves as a template for creating a curve number grid raster file, which can be used to create corresponding maps or for further analysis. Soil data and reclassified land-use raster files are created along the way. The user has to provide, or connect to, a set of shapefiles including the watershed boundary, the soil data and land-use covering this watershed, a land-use reclassification, and a curve number lookup table. The script contained in this resource mainly uses PyQGIS through a Jupyter Notebook for the majority of the processing, with a touch of Pandas for data manipulation. A detailed description of the procedure is given in comments in the script.
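To illustrate the Pandas part only (the raster work itself is done with PyQGIS), a hedged sketch of joining a curve number lookup table onto land-use/soil-group combinations; all column names and values here are hypothetical:
```python
import pandas as pd

# Hypothetical land-use / hydrologic soil group combinations and CN lookup table
combos = pd.DataFrame({"landuse": ["urban", "forest"], "hydro_group": ["B", "C"]})
cn_lookup = pd.DataFrame({
    "landuse": ["urban", "urban", "forest", "forest"],
    "hydro_group": ["B", "C", "B", "C"],
    "CN": [85, 90, 60, 70],  # illustrative values only
})

# Join the lookup onto the combinations to obtain a curve number per class
with_cn = combos.merge(cn_lookup, on=["landuse", "hydro_group"], how="left")
print(with_cn)
```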
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
Structure of the Dataset:
The dataset consists of several files organized into folders by data type:
Training Data: Contains the training dataset used to train the machine learning model.
Validation Data: Used for hyperparameter tuning and model selection.
Test Data: Reserved for final model evaluation.
Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
Software Requirements:
To open and work with this dataset, you need VS Code or Jupyter, which could include tools like:
Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
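A hedged end-to-end sketch using the file names above; it assumes numeric feature columns and that the last column holds the class label (adjust to the real schema):
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = pd.read_csv("train_data.csv")
val = pd.read_csv("validation_data.csv")
test = pd.read_csv("test_data.csv")

# Assumption: last column is the label, remaining columns are numeric features
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_val, y_val = val.iloc[:, :-1], val.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```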
Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
License: Attribution 4.0 International (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset consists of color and depth videos of Panda robot motions and their corresponding joint and Cartesian trajectories. The dataset also includes the trajectories of a receiver robot for the purpose of an object handover. Each motion sample comprises 6 files (RGB video, depth video and 4 giver/receiver trajectories in time series form). Total number of motion samples: 38393. Structure: MPEG-4 videos of robot motion and corresponding Python serialized (or “pickled”) files, containing joint and Cartesian trajectories. Dataset is divided into four parts: simulation dataset (PandaHandover_Sim.zip), real train dataset (PandaHandover_Real_Train.zip), real validation dataset (PandaHandover_Real_Val.zip), real test dataset (PandaHandover_Real_Test.zip). Extract using 7-Zip or similar software. Video files (.avi) can be opened using VLC media player or any other video player that supports MPEG-4 codec. The .pkl files can be loaded using Python (>=3.7) and the Python library Pandas (>=1.1.3).
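A minimal loading sketch, assuming Python >= 3.7 and pandas >= 1.1.3 as stated; the file name is a hypothetical placeholder for one of the trajectory .pkl files:
```python
import pandas as pd

# Placeholder name; substitute an actual trajectory file from the extracted archive
traj = pd.read_pickle("sample_0001_giver_joint_trajectory.pkl")
print(type(traj))
print(traj.head() if hasattr(traj, "head") else traj)
```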
"module-utilities" is a Python package of utilities to simplify working with Python packages. The main features of module-utilities are as follows:
"cached" module: A module to cache class attributes and methods. Right now, this uses a standard Python dictionary for storage. Future versions will hopefully be more robust to threading and shared caches.
"docfiller" module: A module to share documentation. This is adapted from the pandas doc decorator. There are a host of utilities built around this.
"docinhert": An interface to the "docstring-inheritance" module. This can be combined with "docfiller" to make creating related function/class documentation easy.
Overview
This Zenodo record packages a reproducible bundle of derived ocean wind data and auxiliary materials produced from Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products. The bundle contains the processed dataset as a GeoParquet data lake with a static STAC catalog, together with the exact scripts, configuration, SQL statements, and inputs used to generate it. It is intended to enable full transparency, re‑use, and re‑execution of the data generation workflow.
The dataset was generated using the HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (software record DOI: https://doi.org/10.5281/zenodo.17011823), which automates download and ingestion into partitioned GeoParquet, and builds a STAC Collection and Items describing the outputs. This record includes all relevant artifacts so that other researchers can verify the steps, inspect lineage, and rerun the pipeline if needed.
Dataset summary
Final table rows: 11,384,420
Rows with valid wind data: 7,992,086
Contents of this record
hf_eolus_sar.tar.gz: Tarball containing the processed data outputs. It includes the partitioned GeoParquet dataset (OGC GeoParquet v1.1 metadata) and a static STAC catalog (Collection and Items with the STAC Table Extension and Processing metadata) describing the Parquet assets.
pipeline.sh: The exact sequence of commands used to generate this dataset (download and ingestion), serving as an executable provenance log for reproducibility.
stac_properties_collection.json: STAC property definitions/templates applied at Collection level during catalog generation.
stac_properties_item.json: STAC property definitions/templates applied at Item level during catalog generation.
scripts/*.sql (e.g., scripts/athena_create_table.sql): SQL statements used for registering the resulting Parquet dataset as an external table (e.g., in AWS Athena) and for validating schema/partitions.
scripts/downloaded_files.txt: Manifest listing the Sentinel‑1 OCN product identifiers that were downloaded and used as inputs.
scripts/VILA_PRIO_hull.json: Area of Interest (AOI) polygon used to constrain the search and download of Sentinel‑1 scenes spatially. This file defines a convex hull bounding the intersection between the areas covering the echoes of the VILA and PRIO stations.
scripts/files_to_download.csv: Input list and/or search results for Sentinel‑1 OCN products targeted by the download stage (includes product IDs and acquisition metadata).
Reproducibility and re‑execution
Software pipeline: HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (https://doi.org/10.5281/zenodo.17011823). The pipeline uses Dockerized R and Python environments for deterministic runs (no manual dependency setup required).
How to re‑run: Review pipeline.sh to see the exact commands, arguments, and environment variables used. If desired, clone the referenced pipeline repository, ensure Docker is available, and re‑execute the same steps. The AOI and time range used are captured in scripts/area_boundary.geojson and scripts/files_to_download.csv; the precise upstream inputs are listed in scripts/downloaded_files.txt.
Optional cloud registration: The SQL files in scripts/ can be used to register the resulting Parquet dataset in AWS Athena (or adapted for other engines like Trino/Spark). This step is optional and not required to read the Parquet files directly with tools such as Python (pyarrow/GeoPandas), R (arrow), or DuckDB.
Copernicus credentials: To re‑download Sentinel‑1 OCN data, provide your own Copernicus Data Space Ecosystem account credentials. Create a plain‑text file (e.g., credentials) with one line containing your key and pass its path via --credentials-file to scripts/download_sar.sh. For account and API access information, see https://dataspace.copernicus.eu and the documentation at https://documentation.dataspace.copernicus.eu/
Data format and standards
GeoParquet: Columnar, compressed Parquet files with OGC GeoParquet v1.1 geospatial metadata (geometry column, CRS, encoding). Files are partitioned to support efficient filtering and scalable analytics.
STAC: A static STAC Collection with Items describes each Parquet asset, including schema via the STAC Table Extension and lineage via Processing metadata. The catalog is suitable for static hosting and is interoperable with common STAC tooling.
Provenance and upstream data
Upstream source: Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products provided by the European Space Agency (ESA) under the Copernicus Programme. The list of specific input products for this dataset is included in scripts/downloaded_files.txt.
Processing: All derivations (extraction of ocean wind variables, conversion to GeoParquet, and STAC catalog creation) were performed by the HF‑EOLUS pipeline referenced above. The sequence is captured verbatim in pipeline.sh.
Credit line: Contains modified Copernicus Sentinel‑1 data; we gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data.
How to cite
Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). Project HF‑EOLUS. Task 2. Sentinel‑1 SAR Derived Data Bundle (GeoParquet + STAC) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17007304
Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (GeoParquet + STAC) (v0.1.2). Zenodo. https://doi.org/10.5281/zenodo.17011788
Usage notes
Local analysis: The GeoParquet files can be opened directly with Python (pyarrow, pandas/GeoPandas), R (arrow), or SQL engines (DuckDB/Trino) without additional ingestion steps; a minimal Python reading sketch follows these notes.
Catalog discovery: The STAC catalog in the tarball is static and can be browsed with STAC tools or published on object storage or a simple web server.
AWS/Athena setup (optional): To use the GeoParquet in AWS, upload the dataset to Amazon S3 and adjust the SQL to your paths and names, then execute in Athena:
1) Upload the GeoParquet (and optionally the STAC catalog) to s3://<your-bucket>/<your-prefix>/ preserving the folder structure.
2) Edit scripts/athena_create_table.sql to set the LOCATION to your S3 path and customize the database and table (e.g., change SAR_INGEST.SAR to MY_DB.MY_TABLE).
3) In Athena, run the SQL to create the database (if needed) and the external table.
4) Load partitions with MSCK REPAIR TABLE MY_DB.MY_TABLE; (or add partitions explicitly) and validate with a quick query such as SELECT COUNT(*) FROM MY_DB.MY_TABLE;.
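As referenced above under "Local analysis", a minimal Python sketch for reading the partitioned GeoParquet dataset; the directory path is a placeholder for wherever hf_eolus_sar.tar.gz was extracted:
```python
import pandas as pd

# Read the partitioned GeoParquet dataset directly (no ingestion needed)
df = pd.read_parquet("hf_eolus_sar/", engine="pyarrow")
print(len(df))              # expected on the order of 11.4 million rows
print(df.columns.tolist())  # the geometry column arrives as WKB unless read with GeoPandas
```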
Acknowledgements
This work has been funded by the HF‑EOLUS project (TED2021‑129551B‑I00), financed by MICIU/AEI /10.13039/501100011033 and by the European Union NextGenerationEU/PRTR - BDNS 598843 - Component 17 - Investment I3. Members of the Marine Research Centre (CIM) of the University of Vigo have participated in the development. We gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data used in this work (contains modified Copernicus Sentinel data).
Disclaimer
This content is provided "as is", without warranties of any kind. Users are responsible for verifying fitness for purpose and for complying with the licensing and attribution requirements of upstream Copernicus Sentinel data.
Observational wave data and modeled wind data to accompany the article J. Davis et al. (2023) "Saturation of ocean surface wave slopes observed during hurricanes" in Geophysical Research Letters. The observations include targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). Surface wind speeds are modeled using the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The datasets are table-like and include the following: 1) hourly records of surface wave statistics in the form of scalar energy spectra, directional moments, derived products (including mean square slope), and modeled wind speeds; 2) data to reproduce the binned mean square slopes and model wind speeds presented in the article; 3) data to reproduce the mean energy density versus wind speed plot; and 4) data to reproduce the mean energy densit...
This dataset contains wave measurements collected by free-drifting Spotter buoys (Sofar Ocean) which use GPS-derived motions to report hourly records of surface wave statistics in the form of scalar energy spectra and directional moments. The observational data are combined with modeled surface wind speeds from the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC) which are interpolated onto the Spotter wave observations in time and space.
The datasets are stored as text-based JSON and CSV files which can be read without the use of special software. The JSON files are created using the Python Pandas package DataFrame.to_json method with orient='records'. Example code is provided in MATLAB and Python.
# Data to accompany the article "Saturation of ocean surface wave slopes observed during hurricanes"
Contains wave observations by free-drifting Spotter buoys from targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). The observations are co-located with modeled surface wind speeds from the U.S. Naval Research Laboratory’s Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The data are used in the article to show the saturation of mean square slope at extreme wind speeds and the coincident transition from an equilibrium-dominated spectral tail to a saturation-dominated tail. This archive contains two main, table-like datasets of wind-wave data (from Ian and Fiona) and three derived datasets necessary to reproduce the results and figures in the article.
The dataset is organized into fi...
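A minimal sketch of reading one of the records-oriented JSON tables with pandas; the file name is a hypothetical placeholder:
```python
import pandas as pd

# Placeholder name; the JSON tables were written with DataFrame.to_json(orient='records')
waves = pd.read_json("ian_spotter_wave_data.json", orient="records")
print(waves.head())
```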