100+ datasets found
  1. Pandas Practice Dataset

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
    Explore at:
    zip(493 bytes)
    Dataset updated
    Jan 27, 2023
    Authors
    Mrityunjay Pathak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Pandas?

    Pandas is a Python library used for working with data sets.

    It has functions for analyzing, cleaning, exploring, and manipulating data.

    The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and was created by Wes McKinney in 2008.

    Why Use Pandas?

    Pandas allows us to analyze big data and make conclusions based on statistical theories.

    Pandas can clean messy data sets, and make them readable and relevant.

    Relevant data is very important in data science.

    What Can Pandas Do?

    Pandas gives you answers about the data, for example:

    Is there a correlation between two or more columns?

    What is the average value?

    What is the max value?

    What is the min value?
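
    Once the practice file is loaded into a DataFrame, these questions map directly onto pandas calls. A minimal sketch (the file name is an assumption; adjust it to the actual download):

    ```python
    import pandas as pd

    # File name is an assumption; adjust to the actual download
    df = pd.read_csv("pandas_practice.csv")

    print(df.corr(numeric_only=True))   # correlation between numeric columns
    print(df.mean(numeric_only=True))   # average value of each numeric column
    print(df.max(numeric_only=True))    # max value of each column
    print(df.min(numeric_only=True))    # min value of each column
    ```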

  2. Numpy , pandas and matplot lib practice

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Cite
    pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
    Explore at:
    zip(385020 bytes)
    Dataset updated
    Jul 16, 2023
    Authors
    pratham saraf
    License

    https://cdla.io/permissive-1-0/

    Description

    The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

    Specifics of the Dataset:

    The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

    One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

    - Certain columns are randomly selected to be populated with NaN values, effectively simulating the common challenge of missing data.
    - The proportion of these missing values in each column varies randomly between 1% and 70%.
    - Statistical noise has been introduced in the dataset. For numerical values in some features, this noise adheres to a distribution with mean 0 and standard deviation 0.1.
    - Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
    - Outliers have also been embedded in the dataset, consistent with the Interquartile Range (IQR) rule.
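
    A minimal sketch for inspecting these challenges with pandas (the file and column names are assumptions, not part of the dataset description):

    ```python
    import pandas as pd

    # File name is an assumption; adjust to the actual download
    df = pd.read_csv("practice_dataset.csv")

    # Share of missing values per column (varies between 1% and 70% by design)
    print(df.isna().mean().sort_values(ascending=False))

    # Flag outliers in one numerical column using the IQR rule (column name is hypothetical)
    col = "numeric_feature_1"
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)]
    print(f"{len(outliers)} potential outliers in {col}")
    ```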

    Context of the Dataset:

    The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

    Sources of the Dataset:

    The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.

  3. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  4. Exploratory Topic Modelling in Python Dataset - EHRI-3

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    Cite
    Dermentzi, Maria (2022). Exploratory Topic Modelling in Python Dataset - EHRI-3 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6670103
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    King's College London
    Authors
    Dermentzi, Maria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

    We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.

    Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.

    The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this GitHub repository.

    In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:

    "unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)

    "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)

    Instructions on their intended use can be found in the accompanying Jupyter Notebook.
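
    For example, the pickled DataFrames can be loaded directly with pandas (a minimal sketch; the fields follow the listing above):

    ```python
    import pandas as pd

    # Load the lemmatized transcripts (file name as listed in this upload)
    df = pd.read_pickle("unrestricted_lemmatized_df.pkl")

    print(df.shape)    # 1,873 entries with six fields, per the description above
    print(df.columns)
    print(df["lemmas"].head())
    ```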

    Credits:

    The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

  5. 💥 Data-cleaning-for-beginner-using-pandas💢💥

    • kaggle.com
    zip
    Updated Oct 16, 2022
    Cite
    Pavan Tanniru (2022). 💥 Data-cleaning-for-beginner-using-pandas💢💥 [Dataset]. https://www.kaggle.com/datasets/pavantanniru/-datacleaningforbeginnerusingpandas/code
    Explore at:
    zip(654 bytes)
    Dataset updated
    Oct 16, 2022
    Authors
    Pavan Tanniru
    Description

    This dataset helps you practice the data-cleaning process using the pure Python pandas library; a minimal cleaning sketch follows the indicator list below.

    Indicators

    1. Age
    2. Salary
    3. Rating
    4. Location
    5. Established
    6. Easy Apply
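
    A minimal cleaning sketch (the file name and the exact messiness of each column are assumptions; adjust the steps to what df.head() actually shows):

    ```python
    import pandas as pd

    # File name is an assumption; adjust to the actual download
    df = pd.read_csv("data_cleaning_practice.csv")

    print(df.head())
    print(df.dtypes)

    # Coerce numeric-looking columns, turning unparseable entries into NaN
    for col in ["Age", "Salary", "Rating"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Tidy up a text column
    df["Location"] = df["Location"].astype(str).str.strip()

    print(df.isna().sum())  # how much cleaning is still left to do
    ```
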
  6. census-income

    • huggingface.co
    Updated Jul 21, 2025
    Cite
    WC (2025). census-income [Dataset]. https://huggingface.co/datasets/cestwc/census-income
    Explore at:
    Dataset updated
    Jul 21, 2025
    Authors
    WC
    Description

    Dataset Card for Census Income (Adult)

    This dataset is a precise version of Adult or Census Income. This dataset from UCI somehow happens to occupy two links, but we checked and confirmed that they are identical. We used the following Python script to create this Hugging Face dataset:

    import pandas as pd
    from datasets import Dataset, DatasetDict, Features, Value, ClassLabel

    URLs

    url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    url2 = … See the full description on the dataset page: https://huggingface.co/datasets/cestwc/census-income.
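
    The script above is truncated; as a rough, unofficial sketch of how such a load could look with pandas (the column names follow the standard UCI Adult documentation and are an assumption here, not taken from the authors' script):

    ```python
    import pandas as pd

    url1 = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

    # Column names per the standard UCI "Adult" documentation (assumption, not the authors' script)
    columns = [
        "age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
    ]

    df = pd.read_csv(url1, header=None, names=columns, skipinitialspace=True)
    print(df.head())
    ```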

  7. All Pandas Operations Reference

    • kaggle.com
    zip
    Updated Nov 8, 2019
    Cite
    Reuben (2019). All Pandas Operations Reference [Dataset]. https://www.kaggle.com/reubenn/all-pandas-operations-reference
    Explore at:
    zip(10449 bytes)
    Dataset updated
    Nov 8, 2019
    Authors
    Reuben
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Pandas is a very useful library, probably the most useful for data munging in Python. This notebook is an attempt to collate all pandas dataframes operations that a data scientist might use.

    Content

    You'll see how to create dataframes, read in files (even ones with anomalies), check out descriptive stats on columns, and filter on different values and in different ways, as well as perform some of the more oft-used operations.
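
    A few of those oft-used operations, as a minimal illustrative sketch (the data and column names are assumptions, not taken from the notebook):

    ```python
    import pandas as pd

    # Create a DataFrame from scratch
    df = pd.DataFrame({"city": ["Oslo", "Pune", "Lima"], "temp": [4, 31, 19]})

    # Reading a file works similarly, e.g. df = pd.read_csv("some_file.csv")

    print(df.describe())            # descriptive stats on numeric columns
    print(df[df["temp"] > 10])      # filter rows on a condition
    print(df.sort_values("temp"))   # another oft-used operation
    ```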

    Acknowledgements

    A big "thank you" to Data School. You'll find plenty of notebooks and videos here: https://github.com/justmarkham/pandas-videos

  8. 3D skeletons UP-Fall Dataset

    • data.niaid.nih.gov
    Updated Jul 20, 2024
    Cite
    KOFFI, Tresor (2024). 3D skeletons UP-Fall Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12773012
    Explore at:
    Dataset updated
    Jul 20, 2024
    Dataset provided by
    CESI LINEACT
    Authors
    KOFFI, Tresor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    3D skeletons UP-Fall Dataset

    Difference between Fall and Impact detection

    Overview

    This dataset aims to facilitate research in fall detection, particularly focusing on the precise detection of impact moments within fall events. The accuracy and comprehensiveness of the 3D skeletal data make it a valuable resource for developing and benchmarking fall detection algorithms. The dataset contains 3D skeletal data extracted from fall events and daily activities of 5 subjects performing fall scenarios.

    Data Collection

    The skeletal data was extracted using a pose estimation algorithm, which processes image frames to determine the 3D coordinates of each joint. Sequences with fewer than 100 frames of extracted data were excluded to ensure the quality and reliability of the dataset. As a result, some subjects may have fewer CSV files.

    CSV Structure

    The data is organized by subjects, and each subject contains CSV files named according to the pattern C1S1A1T1, where:

    C: Camera (1 or 2)

    S: Subject (1 to 5)

    A: Activity (1 to N, representing different activities)

    T: Trial (1 to 3)

    subject1/: Contains CSV files for Subject 1.

    C1S1A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 1

    C1S1A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 1

    C1S1A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 1

    C2S1A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 1

    C2S1A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 1

    C2S1A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 1

    subject2/: Contains CSV files for Subject 2.

    C1S2A1T1.csv: Data from Camera 1, Activity 1, Trial 1 for Subject 2

    C1S2A2T1.csv: Data from Camera 1, Activity 2, Trial 1 for Subject 2

    C1S2A3T1.csv: Data from Camera 1, Activity 3, Trial 1 for Subject 2

    C2S2A1T1.csv: Data from Camera 2, Activity 1, Trial 1 for Subject 2

    C2S2A2T1.csv: Data from Camera 2, Activity 2, Trial 1 for Subject 2

    C2S2A3T1.csv: Data from Camera 2, Activity 3, Trial 1 for Subject 2

    subject3/, subject4/, subject5/: Similar structure as above, but may contain fewer CSV files due to the data extraction criteria mentioned above.

    Column Descriptions

    Each CSV file contains the following columns representing different skeletal joints and their respective coordinates in 3D space:

    joint_1_x: X coordinate of joint 1
    joint_1_y: Y coordinate of joint 1
    joint_1_z: Z coordinate of joint 1
    joint_2_x: X coordinate of joint 2
    joint_2_y: Y coordinate of joint 2
    joint_2_z: Z coordinate of joint 2
    ...
    joint_n_x: X coordinate of joint n
    joint_n_y: Y coordinate of joint n
    joint_n_z: Z coordinate of joint n
    LABEL: Label indicating impact (1) or non-impact (0)

    Example

    Here is an example of what a row in one of the CSV files might look like:

    joint_1_x  joint_1_y  joint_1_z  joint_2_x  joint_2_y  joint_2_z  ...  joint_n_x  joint_n_y  joint_n_z  LABEL
    0.123      0.456      0.789      0.234      0.567      0.890      ...  0.345      0.678      0.901      0

    Usage

    This data can be used for developing and benchmarking impact fall detection algorithms. It provides detailed information on human posture and movement during falls, making it suitable for machine learning and deep learning applications in impact fall detection and prevention.

    Using GitHub

    1. Clone the repository:

       git clone https://github.com/Tresor-Koffi/3D_skeletons-UP-Fall-Dataset

    2. Navigate to the directory:

       cd 3D_skeletons-UP-Fall-Dataset

    Examples

    Here's a simple example of how to load and inspect a sample data file using Python:

    ```python
    import pandas as pd

    # Load a sample data file for Subject 1, Camera 1, Activity 1, Trial 1
    data = pd.read_csv('subject1/C1S1A1T1.csv')
    print(data.head())
    ```

  9. Dataset for paper "Mitigating the effect of errors in source parameters on...

    • data.niaid.nih.gov
    Updated Sep 28, 2022
    Cite
    Nienke Blom; Phil-Simon Hardalupas; Nicholas Rawlinson (2022). Dataset for paper "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6969601
    Explore at:
    Dataset updated
    Sep 28, 2022
    Dataset provided by
    University of Cambridge
    Authors
    Nienke Blom; Phil-Simon Hardalupas; Nicholas Rawlinson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).

    This dataset contains:

    The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.

    A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.

    A number of Python scripts that are used in above notebooks.

    two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.

    An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.

    Datasets corresponding to the different figures.

    One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020

    One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).

    A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:

    travel-time arrival predictions from every source to all stations (df_stations...pkl)

    misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, each time comparing waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)

    addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type

    How to use this dataset:

    To set up the conda environment:

    make sure you have anaconda/miniconda

    make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.

    Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes; in that case, download an older Salvus version.

    Additionally in your conda env, install basemap and cartopy:

    conda-env create -n salvus_0_12 -f environment.yml
    conda install -c conda-forge basemap
    conda install -c conda-forge cartopy

    Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.

    To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook. It suffices to run the notebook in its entirety.

    Figure 1: separate notebook, Fig1_event_98.py

    Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py

    Figures 3-7: Figures_perturbation_study.py

    Figures 8-10: Figures_toy_inversions.py

    To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.

    To recreate the complete Salvus project: This can be done using:

    the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)

    the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py

    For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.

    References:

    Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469

    Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020

    Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902

  10. polyOne Data Set - 100 million hypothetical polymers including 29 properties...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 24, 2023
    + more versions
    Cite
    Christopher Kuenneth; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7124187
    Explore at:
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Georgia Institute of Technology
    Authors
    Christopher Kuenneth; Rampi Ramprasad
    Description

    polyOne Data Set

    The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

    Full data set including the properties

    The data files are in Apache Parquet format. The files start with polyOne_*.parquet.

    I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.

    Load the sharded data set with dask:

    ```python
    import dask.dataframe as dd
    ddf = dd.read_parquet("*.parquet", engine="pyarrow")
    ```

    For example, compute the description of the data set:

    ```python
    df_describe = ddf.describe().compute()
    df_describe
    ```

    PSMILES strings only

    generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.

    generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.

  11. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Oct 20, 2022
    + more versions
    Cite
    Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Explore at:
    zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Stefanos Efstathiou; Athena Vakali; Athena Vakali; Joao Palotti; Joao Palotti; Dimitrios Panteleimon Giakatos; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas; Šarūnas Girdzijauskas; Christina Karagianni; Andrei Kazlouski; Elena Ferrari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
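
    For example, a minimal sketch (the CSV file name is a placeholder for one of the provided daily or hourly files):

    ```python
    import pandas as pd

    # Placeholder file name; substitute one of the provided daily/hourly CSV files
    daily = pd.read_csv("lifesnaps_daily.csv")
    print(daily.head())
    print(daily.dtypes)
    ```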

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For surveys data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  12. Data from: Compromised through Compression: Python source code for DLMS...

    • phys-techsciences.datastations.nl
    text/markdown, txt +2
    Updated Dec 14, 2021
    Cite
    P.J.M. van Aubel; E. Poll; P.J.M. van Aubel; E. Poll (2021). Compromised through Compression: Python source code for DLMS compression privacy analysis & graphing [Dataset]. http://doi.org/10.17026/DANS-2BY-BNA3
    Explore at:
    xml(5795), zip(20542), text/markdown(792), txt(626), zip(12920)
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    DANS Data Station Physical and Technical Sciences
    Authors
    P.J.M. van Aubel; E. Poll; P.J.M. van Aubel; E. Poll
    License

    http://opensource.org/licenses/BSD-2-Clause

    Description

    Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".

    Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from that. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect which has yet gone unexplored — and where encryption does not help — is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could potentially leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.

    We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.

    We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.

    This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code will then provide statistics on the efficiency of encoding and compression for the entire dataset, and attempt to find the periods of multi-day absences for each household. It will also generate the graphs in the style used in the paper and presentation.

  13. RBD24 - Risk Activities Dataset 2024

    • zenodo.org
    bin
    Updated Mar 4, 2025
    Cite
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime; Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime (2025). RBD24 - Risk Activities Dataset 2024 [Dataset]. http://doi.org/10.5281/zenodo.13787591
    Explore at:
    bin
    Dataset updated
    Mar 4, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime; Calvo Albert; Escuder Santiago; Ortiz Nil; Escrig Josep; Compastié Maxime
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This repository contains a selection of behavioral datasets collected using soluble agents and labeled using realistic threat simulation and IDS rules. The collected datasets are anonymized and aggregated using time window representations. The dataset generation pipeline preprocesses the application logs from the corporate network, structures them according to entities and users inventory, and labels them based on the IDS and phishing simulation appliances.

    This repository is associated with the article "RBD24: A labelled dataset with risk activities using log applications data" published in the journal Computers & Security. For more information go to https://doi.org/10.1016/j.cose.2024.104290

    Summary of the Datasets

    The RBD24 dataset comprises various risk activities collected from real entities and users over a period of 15 days, with the samples segmented by Desktop (DE) and Smartphone (SM) devices.

    | Dataset Id | Entity | Observed Behaviour | Groundtruth | Sample Shape |
    | --- | --- | --- | --- | --- |
    | Crypto_desktop.parquet | DE | Miner Checking | IDS | 0: 738/161202, 1: 11/1343 |
    | Crypto_smarphone.parquet | SM | Miner Checking | IDS | 0: 613/180021, 1: 4/956 |
    | OutFlash_desktop.parquet | DE | Outdated software components | IDS | 0: 738/161202, 1: 56/10820 |
    | OutFlash_smartphone.parquet | SM | Outdated software components | IDS | 0: 613/180021, 1: 22/6639 |
    | OutTLS_desktop.parquet | DE | Outdated TLS protocol | IDS | 0: 738/161202, 1: 18/2458 |
    | OutTLS_smartphone.parquet | SM | Outdated TLS protocol | IDS | 0: 613/180021, 1: 11/2930 |
    | P2P_desktop.parquet | DE | P2P Activity | IDS | 0: 738/161202, 1: 177/35892 |
    | P2P_smartphone.parquet | SM | P2P Activity | IDS | 0: 613/180021, 1: 94/21688 |
    | NonEnc_desktop.parquet | DE | Non-encrypted password | IDS | 0: 738/161202, 1: 291/59943 |
    | NonEnc_smaprthone.parquet | SM | Non-encrypted password | IDS | 0: 613/180021, 1: 167/41434 |
    | Phishing_desktop.parquet | DE | Phishing email | Experimental Campaign | 0: 98/13864, 1: 19/3072 |
    | Phishing_smartphone.parquet | SM | Phishing email | Experimental Campaign | 0: 117/34006, 1: 26/8968 |

    Methodology

    To collect the dataset, we have deployed multiple agents and soluble agents within an infrastructure with
    more than 3k entities, comprising laptops, workstations, and smartphone devices. The methods to build
    ground truth are as follows:

    - Simulator: We launch different realistic phishing campaigns, aiming to expose user credentials or defeat access to a service.
    - IDS: We deploy an IDS to collect various alerts associated with behavioral anomalies, such as cryptomining or peer-to-peer traffic.

    For each user exposed to the behaviors stated in the summary table, different TWs are computed, aggregating user behavior within a fixed time interval. These TWs serve as the basis for various supervised and unsupervised methods.

    Sample Representation

    The time windows (TW) are a data representation based on aggregated logs from multimodal sources between two timestamps. In this study, logs from HTTP, DNS, SSL, and SMTP are taken into consideration, allowing the construction of rich behavioral profiles. The indicators described in the TW are a set of manually curated, interpretable features designed to describe device-level properties within the specified time frame. The most influential features are described below.

    • User: A unique hash value that identifies a user.
    • Timestamp: The timestamp of the window.
    • Features
    • Label: 1 if the user exhibits compromised behavior, 0 otherwise. -1 indicates a TW with an unknown label.

    Dataset Format

    Parquet format uses a columnar storage format, which enhances efficiency and compression, making it suitable for large datasets and complex analytical tasks. It has support across various tools and languages, including Python. Parquet can be used with the pandas library in Python, allowing pandas to read and write Parquet files through the `pyarrow` or `fastparquet` libraries. Its efficient data retrieval and fast query execution improve performance over other formats. Compared to row-based storage formats such as CSV, Parquet's columnar storage greatly reduces read times and storage costs for large datasets. Although binary formats like HDF5 are effective for specific use cases, Parquet provides broader compatibility and optimization. The provided datasets use the Parquet format. Here's an example of how to retrieve data using pandas; ensure you have the fastparquet library installed:

    ```python
    import pandas as pd

    # Reading a Parquet file
    df = pd.read_parquet(
        'path_to_your_file.parquet',
        engine='fastparquet'
    )
    ```

  14. Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • gimi9.com
    • researchdata.se
    Cite
    JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. https://gimi9.com/dataset/eu_https-doi-org-10-5878-1tv7-9x76/
    Explore at:
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data.

    The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
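
    Following that recommendation, a minimal sketch for loading the two files with pandas (note the semicolon separator described above):

    ```python
    import pandas as pd

    # Both files are semicolon-separated, as described above
    train = pd.read_csv("training.csv", sep=";")
    test = pd.read_csv("test.csv", sep=";")

    print(train.shape, test.shape)
    print(train.head())
    ```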

  15. Creating Curve Number Grid using PyQGIS through Jupyter Notebook in mygeohub...

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Apr 28, 2020
    Cite
    Sayan Dey; Shizhang Wang; Venkatesh Merwade (2020). Creating Curve Number Grid using PyQGIS through Jupyter Notebook in mygeohub [Dataset]. http://doi.org/10.4211/hs.abf67aad0eb64a53bf787d369afdcc84
    Explore at:
    zip(105.5 MB)
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    HydroShare
    Authors
    Sayan Dey; Shizhang Wang; Venkatesh Merwade
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This resource serves as a template for creating a curve number grid raster file, which could be used to create corresponding maps or for further utilization. Soil data and reclassified land-use raster files are created along the way. The user has to provide or connect to a set of shapefiles, including the watershed boundary, soil data and land-use covering this watershed, a land-use reclassification table, and a curve number lookup table. The script contained in this resource mainly uses PyQGIS through a Jupyter Notebook for the majority of the processing, with a touch of Pandas for data manipulation. A detailed description of the procedure is provided in comments in the script.

  16. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf; Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you need an environment such as VS Code or Jupyter Notebook, with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
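
    Given the folder layout described above, a minimal loading sketch might look like the following; the exact paths are assumptions and should be adjusted to the actual folder names:

    ```python
    import pandas as pd

    # Paths are assumptions based on the folder layout described above
    train = pd.read_csv("Training Data/train_data.csv")
    val = pd.read_csv("Validation Data/validation_data.csv")
    test = pd.read_csv("Test Data/test_data.csv")

    print(train.shape, val.shape, test.shape)
    ```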

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  17. Video-Trajectory Robot Dataset

    • data.europa.eu
    unknown
    Updated Mar 7, 2022
    Cite
    Zenodo (2022). Video-Trajectory Robot Dataset [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-6337847
    Explore at:
    unknown(296205835)
    Dataset updated
    Mar 7, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of color and depth videos of Panda robot motions and their corresponding joint and Cartesian trajectories. The dataset also includes the trajectories of a receiver robot for the purpose of an object handover. Each motion sample comprises 6 files (RGB video, depth video and 4 giver/receiver trajectories in time series form). Total number of motion samples: 38393. Structure: MPEG-4 videos of robot motion and corresponding Python serialized (or “pickled”) files, containing joint and Cartesian trajectories. Dataset is divided into four parts: simulation dataset (PandaHandover_Sim.zip), real train dataset (PandaHandover_Real_Train.zip), real validation dataset (PandaHandover_Real_Val.zip), real test dataset (PandaHandover_Real_Test.zip). Extract using 7-Zip or similar software. Video files (.avi) can be opened using VLC media player or any other video player that supports MPEG-4 codec. The .pkl files can be loaded using Python (>=3.7) and the Python library Pandas (>=1.1.3).
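
    For example, a pickled trajectory file can be loaded with pandas (a minimal sketch; the file name is a placeholder, not an actual file from this dataset):

    ```python
    import pandas as pd

    # Placeholder name; use a real .pkl file from one of the extracted archives
    traj = pd.read_pickle("giver_joint_trajectory.pkl")

    print(type(traj))
    print(traj)
    ```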

  18. "module-utilities": A Python package for simplify creating python modules.

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2024
    + more versions
    Cite
    National Institute of Standards and Technology (2024). "module-utilities": A Python package for simplify creating python modules. [Dataset]. https://catalog.data.gov/dataset/module-utilities-a-python-package-for-simplify-creating-python-modules
    Explore at:
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    "module-utilities" is a Python package of utilities to simplify working with Python packages. The main features of module-utilities are as follows:

    "cached" module: A module to cache class attributes and methods. Right now, this uses a standard Python dictionary for storage. Future versions will hopefully be more robust to threading and shared cache.

    "docfiller" module: A module to share documentation. This is adapted from the pandas doc decorator. There are a host of utilities built around this.

    "docinhert": An interface to the "docstring-inheritance" module. This can be combined with "docfiller" to make creating related function/class documentation easy.

  19. Project HF‑EOLUS. Task 1. Sentinel‑1 SAR Derived Data Bundle (GeoParquet +...

    • portalcientifico.uvigo.gal
    Updated 2025
    Cite
    Herrera Cortijo, Juan Luis; Fernández-Baladrón, Adrián; Rosón, Gabriel; Gil Coto, Miguel; Dubert, Jesús; Varela Benvenuto, Ramiro; Herrera Cortijo, Juan Luis; Fernández-Baladrón, Adrián; Rosón, Gabriel; Gil Coto, Miguel; Dubert, Jesús; Varela Benvenuto, Ramiro (2025). Project HF‑EOLUS. Task 1. Sentinel‑1 SAR Derived Data Bundle (GeoParquet + STAC) [Dataset]. https://portalcientifico.uvigo.gal/documentos/68c41a5a08c3ca034ca75767
    Explore at:
    Dataset updated
    2025
    Authors
    Herrera Cortijo, Juan Luis; Fernández-Baladrón, Adrián; Rosón, Gabriel; Gil Coto, Miguel; Dubert, Jesús; Varela Benvenuto, Ramiro; Herrera Cortijo, Juan Luis; Fernández-Baladrón, Adrián; Rosón, Gabriel; Gil Coto, Miguel; Dubert, Jesús; Varela Benvenuto, Ramiro
    Description

    Overview

    This Zenodo record packages a reproducible bundle of derived ocean wind data and auxiliary materials produced from Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products. The bundle contains the processed dataset as a GeoParquet data lake with a static STAC catalog, together with the exact scripts, configuration, SQL statements, and inputs used to generate it. It is intended to enable full transparency, re‑use, and re‑execution of the data generation workflow.

    The dataset was generated using the HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (software record DOI: https://doi.org/10.5281/zenodo.17011823), which automates download and ingestion into partitioned GeoParquet, and builds a STAC Collection and Items describing the outputs. This record includes all relevant artifacts so that other researchers can verify the steps, inspect lineage, and rerun the pipeline if needed.

    Dataset summary

    • Final table rows: 11,384,420

    • Rows with valid wind data: 7,992,086

    Contents of this record

    • hf_eolus_sar.tar.gz: Tarball containing the processed data outputs. It includes the partitioned GeoParquet dataset (OGC GeoParquet v1.1 metadata) and a static STAC catalog (Collection and Items with the STAC Table Extension and Processing metadata) describing the Parquet assets.

    • pipeline.sh: The exact sequence of commands used to generate this dataset (download and ingestion), serving as an executable provenance log for reproducibility.

    • stac_properties_collection.json: STAC property definitions/templates applied at Collection level during catalog generation.

    • stac_properties_item.json: STAC property definitions/templates applied at Item level during catalog generation.

    • scripts/*.sql (e.g., scripts/athena_create_table.sql): SQL statements used for registering the resulting Parquet dataset as an external table (e.g., in AWS Athena) and for validating schema/partitions.

    • scripts/downloaded_files.txt: Manifest listing the Sentinel‑1 OCN product identifiers that were downloaded and used as inputs.

    • scripts/VILA_PRIO_hull.json: Area of Interest (AOI) polygon used to constrain the search and download of Sentinel‑1 scenes spatially. This file defines a convex hull bounding the intersection between the areas covering the echos of VILA and PRIO stations.

    • scripts/files_to_download.csv: Input list and/or search results for Sentinel‑1 OCN products targeted by the download stage (includes product IDs and acquisition metadata).

    Reproducibility and re‑execution

    • Software pipeline: HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (https://doi.org/10.5281/zenodo.17011823). The pipeline uses Dockerized R and Python environments for deterministic runs (no manual dependency setup required).

    • How to re‑run: Review pipeline.sh to see the exact commands, arguments, and environment variables used. If desired, clone the referenced pipeline repository, ensure Docker is available, and re‑execute the same steps. The AOI and time range used are captured in scripts/area_boundary.geojson and scripts/files_to_download.csv; the precise upstream inputs are listed in scripts/downloaded_files.txt.

    • Optional cloud registration: The SQL files in scripts/ can be used to register the resulting Parquet dataset in AWS Athena (or adapted for other engines like Trino/Spark). This step is optional and not required to read the Parquet files directly with tools such as Python (pyarrow/GeoPandas), R (arrow), or DuckDB.

    • Copernicus credentials: To re‑download Sentinel‑1 OCN data, provide your own Copernicus Data Space Ecosystem account credentials. Create a plain‑text file (e.g., credentials) with one line containing your key and pass its path via --credentials-file to scripts/download_sar.sh. For account and API access information, see https://dataspace.copernicus.eu and the documentation at https://documentation.dataspace.copernicus.eu/

    Data format and standards

    • GeoParquet: Columnar, compressed Parquet files with OGC GeoParquet v1.1 geospatial metadata (geometry column, CRS, encoding). Files are partitioned to support efficient filtering and scalable analytics.

    • STAC: A static STAC Collection with Items describes each Parquet asset, including schema via the STAC Table Extension and lineage via Processing metadata. The catalog is suitable for static hosting and is interoperable with common STAC tooling.

    Provenance and upstream data

    • Upstream source: Copernicus Sentinel‑1 SAR Level‑2 OCN (OWI) products provided by the European Space Agency (ESA) under the Copernicus Programme. The list of specific input products for this dataset is included in scripts/downloaded_files.txt.

    • Processing: All derivations (extraction of ocean wind variables, conversion to GeoParquet, and STAC catalog creation) were performed by the HF‑EOLUS pipeline referenced above. The sequence is captured verbatim in pipeline.sh.

    • Credit line: Contains modified Copernicus Sentinel‑1 data; we gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data.

    How to cite

    • This data bundle:

    Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). Project HF‑EOLUS. Task 2. Sentinel‑1 SAR Derived Data Bundle (GeoParquet + STAC) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17007304

    • Software/pipeline used

    Herrera Cortijo, J. L., Fernández‑Baladrón, A., Rosón, G., Gil Coto, M., Dubert, J., & Varela Benvenuto, R. (2025). HF‑EOLUS Sentinel‑1 SAR Ingestion Pipeline (GeoParquet + STAC) (v0.1.2). Zenodo. https://doi.org/10.5281/zenodo.17011788

    Usage notes

    • Local analysis: The GeoParquet files can be opened directly with Python (pyarrow, pandas/GeoPandas), R (arrow), or SQL engines (DuckDB/Trino) without additional ingestion steps; a minimal pandas sketch is given after these usage notes.

    • Catalog discovery: The STAC catalog in the tarball is static and can be browsed with STAC tools or published on object storage or a simple web server.

    • AWS/Athena setup (optional): To use the GeoParquet in AWS, upload the dataset to Amazon S3 and adjust the SQL to your paths and names, then execute in Athena:

    1) Upload the GeoParquet (and optionally the STAC catalog) to s3://<your-bucket>/<your-prefix>/ preserving the folder structure.

    2) Edit scripts/athena_create_table.sql to set the LOCATION to your S3 path and customize the database and table (e.g., change SAR_INGEST.SAR to MY_DB.MY_TABLE).

    3) In Athena, run the SQL to create the database (if needed) and the external table.

    4) Load partitions with MSCK REPAIR TABLE MY_DB.MY_TABLE; (or add partitions explicitly) and validate with a quick query such as SELECT COUNT(*) FROM MY_DB.MY_TABLE;.
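
    For the local-analysis route mentioned above, a minimal sketch for reading the partitioned GeoParquet with pandas and pyarrow (the extraction directory name is an assumption):

    ```python
    import pandas as pd

    # Directory name is an assumption; point this at the folder extracted from hf_eolus_sar.tar.gz
    df = pd.read_parquet("hf_eolus_sar/", engine="pyarrow")

    print(len(df))     # the full table holds 11,384,420 rows according to the summary above
    print(df.columns)
    print(df.head())
    ```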

    Acknowledgements

    This work has been funded by the HF‑EOLUS project (TED2021‑129551B‑I00), financed by MICIU/AEI /10.13039/501100011033 and by the European Union NextGenerationEU/PRTR - BDNS 598843 - Component 17 - Investment I3. Members of the Marine Research Centre (CIM) of the University of Vigo have participated in the development. We gratefully acknowledge the Copernicus Programme and the European Space Agency (ESA) for providing free and open Sentinel‑1 data used in this work (contains modified Copernicus Sentinel data).

    Disclaimer

    This content is provided "as is", without warranties of any kind. Users are responsible for verifying fitness for purpose and for complying with the licensing and attribution requirements of upstream Copernicus Sentinel data.

  20. Data for: Saturation of ocean surface wave slopes observed during hurricanes...

    • search.dataone.org
    • zenodo.org
    • +1more
    Updated Aug 5, 2025
    Cite
    Jacob Davis; Jim Thomson; Isabel Houghton; James Doyle; Will Komaromi; Jon Moskaitis; Chris Fairall; Elizabeth Thompson (2025). Data for: Saturation of ocean surface wave slopes observed during hurricanes [Dataset]. http://doi.org/10.5061/dryad.g4f4qrfvb
    Explore at:
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jacob Davis; Jim Thomson; Isabel Houghton; James Doyle; Will Komaromi; Jon Moskaitis; Chris Fairall; Elizabeth Thompson
    Time period covered
    Jun 26, 2023
    Description

    Observational wave data and modeled wind data to accompany the article J. Davis et al. (2023) "Saturation of ocean surface wave slopes observed during hurricanes" in Geophysical Research Letters. The observations include targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). Surface wind speeds are modeled using the U.S. Naval Research Laboratory's Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The datasets are table-like and include the following: 1) hourly records of surface wave statistics in the form of scalar energy spectra, directional moments, derived products (including mean square slope), and modeled wind speeds; 2) data to reproduce the binned mean square slopes and model wind speeds presented in the article; 3) data to reproduce the mean energy density versus wind speed plot; and 4) data to reproduce the mean energy densit...

    This dataset contains wave measurements collected by free-drifting Spotter buoys (Sofar Ocean) which use GPS-derived motions to report hourly records of surface wave statistics in the form of scalar energy spectra and directional moments. The observational data are combined with modeled surface wind speeds from the U.S. Naval Research Laboratory's Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC) which are interpolated onto the Spotter wave observations in time and space.

    The datasets are stored as text-based JSON and CSV files which can be read without the use of special software. The JSON files are created using the Python Pandas package DataFrame.to_json method with orient='records'. Example code is provided in MATLAB and Python.

    # Data to accompany the article "Saturation of ocean surface wave slopes observed during hurricanes"

    Contains wave observations by free-drifting Spotter buoys from targeted aerial deployments into Hurricane Ian (2022) and opportunistic measurements from the free-drifting Sofar Ocean Spotter global network in Hurricane Fiona (2022). The observations are co-located with modeled surface wind speeds from the U.S. Naval Research Laboratory's Coupled Ocean-Atmosphere Mesoscale Prediction System for Tropical Cyclones (COAMPS-TC). The data are used in the article to show the saturation of mean square slope at extreme wind speeds and the coincident transition from an equilibrium-dominated spectral tail to a saturation-dominated tail. This archive contains two main, table-like datasets of wind-wave data (from Ian and Fiona) and three derived datasets necessary to reproduce the results and figures in the article.

    Description of the data and file structure

    The dataset is organized into fi...
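
    Since the JSON files were written with the pandas DataFrame.to_json method using orient='records', they can be read back symmetrically; a minimal sketch (the file name is a placeholder, not an actual file from this archive):

    ```python
    import pandas as pd

    # Placeholder file name; substitute one of the JSON files included in this dataset
    waves = pd.read_json("spotter_wave_records.json", orient="records")
    print(waves.head())
    ```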
