17 datasets found

polyOne Data Set - 100 million hypothetical polymers including 29 properties...
zenodo.org
data.niaid.nih.gov
bin, txt
Updated Mar 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Christopher Kuenneth; Christopher Kuenneth; Rampi Ramprasad; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. http://doi.org/10.5281/zenodo.7124188
Explore at:
bin, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7124188
Dataset updated
Mar 24, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Christopher Kuenneth; Christopher Kuenneth; Rampi Ramprasad; Rampi Ramprasad
Description
polyOne Data Set

The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

Full data set including the properties

The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.

I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.

Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of data set
```python
df_describe = ddf.describe().compute()
df_describe

```

PSMILES strings only

generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.

generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
h
kaggle-entity-annotated-corpus-ner-dataset
huggingface.co
Updated Jul 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 10, 2022
Authors
Rafael Arias Calles
License
https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Description
Date: 2022-07-10 Files: ner_dataset.csv Source: Kaggle entity annotated corpus notes: The dataset only contains the tokens and ner tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
SELTO Dataset
zenodo.org
data.niaid.nih.gov
application/gzip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch (2023). SELTO Dataset [Dataset]. http://doi.org/10.5281/zenodo.7781392
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7781392
Dataset updated
May 23, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch; Sören Dittmer; David Erzmann; Henrik Harms; Rielson Falck; Marco Gosch
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A Benchmark Dataset for Deep Learning for 3D Topology Optimization

This dataset represents voxelized 3D topology optimization problems and solutions. The solutions have been generated in cooperation with the Ariane Group and Synera using the Altair OptiStruct implementation of SIMP within the Synera software. The SELTO dataset consists of four different 3D datasets for topology optimization, called disc simple, disc complex, sphere simple and sphere complex. Each of these datasets is further split into a training and a validation subset.

The following paper provides full documentation and examples:

Dittmer, S., Erzmann, D., Harms, H., Maass, P., SELTO: Sample-Efficient Learned Topology Optimization (2022) https://arxiv.org/abs/2209.05098.

The Python library DL4TO (https://github.com/dl4to/dl4to) can be used to download and access all SELTO dataset subsets.
Each TAR.GZ file container consists of multiple enumerated pairs of CSV files. Each pair describes a unique topology optimization problem and contains an associated ground truth solution. Each problem-solution pair consists of two files, where one contains voxel-wise information and the other file contains scalar information. For example, the i-th sample is stored in the files i.csv and i_info.csv, where i.csv contains all voxel-wise information and i_info.csv contains all scalar information. We define all spatially varying quantities at the center of the voxels, rather than on the vertices or surfaces. This allows for a shape-consistent tensor representation.

For the i-th sample, the columns of i_info.csv correspond to the following scalar information:

E - Young's modulus [Pa]

ν - Poisson's ratio [-]

σ_ys - a yield stress [Pa]

h - discretization size of the voxel grid [m]

The columns of i.csv correspond to the following voxel-wise information:

x, y, z - the indices that state the location of the voxel within the voxel mesh

Ω_design - design space information for each voxel. This is a ternary variable that indicates the type of density constraint on the voxel. 0 and 1 indicate that the density is fixed at 0 or 1, respectively. -1 indicates the absence of constraints, i.e., the density in that voxel can be freely optimized

Ω_dirichlet_x, Ω_dirichlet_y, Ω_dirichlet_z - homogeneous Dirichlet boundary conditions for each voxel. These are binary variables that define whether the voxel is subject to homogeneous Dirichlet boundary constraints in the respective dimension

F_x, F_y, F_z - floating point variables that define the three spacial components of external forces applied to each voxel. All forces are body forces given in [N/m^3]

density - defines the binary voxel-wise density of the ground truth solution to the topology optimization problem

How to Import the Dataset

with DL4TO: With the Python library DL4TO (https://github.com/dl4to/dl4to) it is straightforward to download and access the dataset as a customized PyTorch torch.utils.data.Dataset object. As shown in the tutorial this can be done via:

from dl4to.datasets import SELTODataset dataset = SELTODataset(root=root, name=name, train=train)

Here, root is the path where the dataset should be saved. name is the name of the SELTO subset and can be one of "disc_simple", "disc_complex", "sphere_simple" and "sphere_complex". train is a boolean that indicates whether the corresponding training or validation subset should be loaded. See here for further documentation on the SELTODataset class.

without DL4TO: After downloading and unzipping, any of the i.csv files can be manually imported into Python as a Pandas dataframe object:

import pandas as pd root = ... file_path = f'{root}/{i}.csv' columns = ['x', 'y', 'z', 'Ω_design','Ω_dirichlet_x', 'Ω_dirichlet_y', 'Ω_dirichlet_z', 'F_x', 'F_y', 'F_z', 'density'] df = pd.read_csv(file_path, names=columns)

Similarly, we can import a i_info.csv file via:

file_path = f'{root}/{i}_info.csv' info_column_names = ['E', 'ν', 'σ_ys', 'h'] df_info = pd.read_csv(file_path, names=info_columns)

We can extract PyTorch tensors from the Pandas dataframe df using the following function:

import torch def get_torch_tensors_from_dataframe(df, dtype=torch.float32): shape = df[['x', 'y', 'z']].iloc[-1].values.astype(int) + 1 voxels = [df['x'].values, df['y'].values, df['z'].values] Ω_design = torch.zeros(1, *shape, dtype=int) Ω_design[:, voxels[0], voxels[1], voxels[2]] = torch.from_numpy(data['Ω_design'].values.astype(int)) Ω_Dirichlet = torch.zeros(3, *shape, dtype=dtype) Ω_Dirichlet[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_x'].values, dtype=dtype) Ω_Dirichlet[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_y'].values, dtype=dtype) Ω_Dirichlet[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['Ω_dirichlet_z'].values, dtype=dtype) F = torch.zeros(3, *shape, dtype=dtype) F[0, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_x'].values, dtype=dtype) F[1, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_y'].values, dtype=dtype) F[2, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['F_z'].values, dtype=dtype) density = torch.zeros(1, *shape, dtype=dtype) density[:, voxels[0], voxels[1], voxels[2]] = torch.tensor(df['density'].values, dtype=dtype) return Ω_design, Ω_Dirichlet, F, density
IMDb Top 4070: Explore the Cinema Data
kaggle.com
Updated Aug 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 15, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
K.T.S. Prabhu
Description
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.

What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling to uncover hidden patterns, trends, and themes within the realm of cinema.

Note: The data is collected as of April 2023. Future versions of this analysis include Movie recommendation system Please do connect for any queries, All Love, No Hate.
TinyStories
kaggle.com
opendatalab.com
+1more
Updated Nov 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 24, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
TinyStories

A Diverse, Richly Annotated Corpus of Short-Form Stories

By Huggingface Hub [source]

About this dataset

This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following columns: - text: The story text itself (string) - validation.csv: Contains a set of short stories for validation (dataframe) - train.csv: Contains the text of short stories used for narrative text classification (dataframe)

The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

To get started with using this dataset, begin by downloading both validation and train csv files from Kaggle datasets page and saving them on your computer or local environment. Once downloaded, you may need to preprocess both datasets by cleaning up any unnecessary/wrongly formatted values or duplicate entries if any exists within it before proceeding further on to your research work or machine learning task experimentations as these have great impacts on your research results accuracy rate which you do not want compromised!

Next step is simply loading up these two datasets into Python pandas dataframes so that they can easily be manipulated and analyzed using common tools associated with Natural Language Processing(NLP). This would require you writing few simple lines using pandas API functions like read_csv(), .append(), .concat()etc depending upon what kind of analysis/experiment you intend conducting afterwards utilizing this dataset in Python Jupyter Notebook framework as well as other machine learning frameworks popular among data scientists like scikit-learn if it will be something more complex than simple NLP task operations!

By now if done everything mentioned correctly here then we are ready now to finally get into actually working out our desired applications from exploring potential connections between different narratives or character traits via supervised Machine Learning models such as Naive Bayes Classifier among many others that could ultimately provide us useful insights revealing patterns existing underneath all those texts! With all necessary datas loaded up in supporting python platforms correctly so feel free to make interesting discoveries/predictions from extensive analyses provided by this richly annotated TinyStories Narrative Dataset!

Research Ideas

Creating a text classification algorithm to automatically categorize short stories by genre.

Developing an AI-based summarization tool to quickly summarize the main points in a story.

Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: validation.csv | Column name | Description | |:--------------|:--------------------------------| | text | The text of the story. (String) |

File: train.csv | Column name | Description | |:--------------|:----------------------------...
Tokopedia Product Reviews
kaggle.com
zip
Updated Jul 21, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Farhan (2019). Tokopedia Product Reviews [Dataset]. https://www.kaggle.com/datasets/farhan999/tokopedia-product-reviews/versions/1/code
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Jul 21, 2019
Authors
Farhan
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Tokopedia Product Reviews 2019

Dataset Description

This dataset contains 40,607 product reviews from Tokopedia, one of Indonesia's largest e-commerce platforms, scraped in 2019. The dataset provides valuable insights into customer sentiment and shopping behavior in the Indonesian e-commerce market.

Dataset Summary

Language: Indonesian (Bahasa Indonesia)

Task: Sentiment Analysis, Product Review Analysis, E-commerce Research

Size: 40,607 reviews

Categories: 5 product categories

Unique Products: 3,647

Collection Period: 2019

Dataset Structure

Column Description

text (string): The review text written by customers

rating (int): Rating given by the reviewer (typically 1-5 scale)

category (string): Product category, one of:

pertukangan (tools/hardware)

fashion (fashion)

elektronik (electronics)

handphone (mobile phones)

olahraga (sports)

product_name (string): Name of the product

product_id (string): Unique identifier for the product

sold (int): Number of items sold

shop_id (string): Unique identifier for the shop/seller

product_url (string): URL link to the product page

Data Splits

The dataset comes as a single split containing all 40,607 reviews.

Dataset Statistics

Category Count
Total Reviews 40,607
Unique Products 3,647
Product Categories 5
Language Indonesian

Usage

Loading the Dataset

python # ------------------------------------------------------------------ # Minimal example: download the "Tokopedia Product Reviews" dataset # from Kaggle and load it into a pandas DataFrame # ------------------------------------------------------------------ # --- KaggleHub (no manual kaggle.json) ------------------ # Install required packages !pip install -q --upgrade kagglehub pandas import kagglehub import os import zipfile import pandas as pd # Download the dataset (cached after the first run) dataset_path = kagglehub.dataset_download("farhan999/tokopedia-product-reviews") print("Dataset saved at:", dataset_path) # Locate the main CSV file inside the downloaded folder csv_file = None for root, _, files in os.walk(dataset_path): for f in files: if f.lower().endswith('.csv'): csv_file = os.path.join(root, f) break if csv_file: # Load CSV into a DataFrame and display the first few rows df = pd.read_csv(csv_file) display(df.head()) else: print("No CSV file found in the dataset.")

Potential Use Cases

Sentiment Analysis: Classify customer sentiment based on review text and ratings

Product Recommendation: Analyze product preferences across different categories

Market Research: Understand Indonesian e-commerce customer behavior

Natural Language Processing: Train Indonesian language models for e-commerce domain

Category Classification: Predict product categories from review text

Rating Prediction: Predict customer ratings from review text

Data Collection

The data was collected through web scraping of Tokopedia product pages in 2019. The scraping process captured genuine customer reviews across five major product categories, providing a representative sample of customer feedback on the platform.

Ethical Considerations

This dataset contains public reviews that were posted on Tokopedia's platform

Personal information has been anonymized (shop_id and product_id are anonymized identifiers)

The data reflects genuine customer opinions and experiences

Users should be mindful of potential biases in the data (e.g., selection bias, temporal bias from 2019)

Limitations

Temporal Limitation: Data is from 2019 and may not reflect current market trends

Platform Specific: Limited to Tokopedia platform, may not generalize to other Indonesian e-commerce platforms

Category Limitation: Only covers 5 product categories

Language: Primarily in Indonesian, limiting applicability to other languages

Citation

If you use this dataset in your research, please cite:

@misc{tokopedia-product-reviews-2019, title={Tokopedia Product Reviews}, url={https://www.kaggle.com/dsv/562904}, DOI={10.34740/KAGGLE/DSV/562904}, publisher={Kaggle}, author={M. Farhan}, year={2019} }

Contact

For questions or issues regarding this dataset, please open an issue in the dataset repository or contact kontak.farhan@gmail.com.

Acknowledgments

Thanks to Tokopedia for providing a platform that enables customer reviews
What you see is what you get: Delineating the urban jobs-housing spatial...
figshare.com
zip
Updated Feb 12, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan (2021). What you see is what you get: Delineating the urban jobs-housing spatial distribution at a parcel scale by using street view imagery [Dataset]. http://doi.org/10.6084/m9.figshare.12960212.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.12960212.v1
Dataset updated
Feb 12, 2021
Dataset provided by
Figsharehttp://figshare.com/
Authors
Yao Yao; Jiaqi Zhang; Chen Qian; Yu Wang; Shuliang Ren; Zehao Yuan; Qingfeng Guan
License
https://www.gnu.org/copyleft/gpl.htmlhttps://www.gnu.org/copyleft/gpl.html
Description
The compressed package (Study_code.zip) contains the code files implemented by an under review paper ("What you see is what you get: Delineating urban jobs-housing spatial distribution at a parcel scale by using street view imagery based on deep learning technique").The compressed package (input_land_parcel_with_attributes.zip) is the sampled mixed "jobs-housing" attributes data of the study area with multiple probability attributes (Only working, Only living, working and living) at the land parcel scale.The compressed package (input_street_view_images.zip) is the surrounding street view data near sampled land parcels (input_land_parcel_with_attributes.zip) with the pixel size of 240*160 obtained from Tencent map (https://map.qq.com/).The compressed package (output_results.zip) contains the result vector files (Jobs-housing pattern distribution and error distribution) and file description (Readme.txt).This project uses some Python open source libraries (Numpy, Pandas, Selenium, Gdal, Pytorch and sklearn). This project complies with the GPL license.Numpy (https://numpy.org/) is an open source numerical calculation tool developed by Travis Oliphant. Used in this project for matrix operation. This library complies with the BSD license.Pandas (https://pandas.pydata.org/) is an open source library, providing high-performance, easy-to-use data structures and data analysis tools. This library complies with the BSD license.Selenium(https://www.selenium.dev/) is a suite of tools for automating web browsers.Used in this project for getting street view images.This library complies with the BSD license.Gdal(https://gdal.org/) is a translator library for raster and vector geospatial data formats.Used in this project for processing geospatial data.This library complies with the BSD license.Pytorch(https://pytorch.org/) is an open source machine learning framework that accelerates the path from research prototyping to production deployment.Used in this project for deep learning.This library complies with the BSD license.sklearn(https://scikit-learn.org/) is an open source machine learning tool for python.Used in this project for comparing precision metrics.This library complies with the BSD license.
d
Twitch.tv Chat Log Data
search.dataone.org
Updated Nov 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kim, Jeongmin (2023). Twitch.tv Chat Log Data [Dataset]. http://doi.org/10.7910/DVN/VE0IVQ
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/VE0IVQ
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Kim, Jeongmin
Description
Collection of chat log of 2,162 Twitch streaming videos by 52 streamers. Time period of target streaming video is from 2018-04-24 to 2018-06-24. Description of columns follows below: body: Actual text for user chat channel_id: Channel identifier (integer) commenter_id: User identifier (integer) commenter_type: User type (character) created_at: Time of when chat was entered (ISO 8601 date and time) fragments: Chat text including parsing information of Twitch emote (JSON list) offset: Time offset between start time of video stream and the time of when chat was entered (float) updated_at: Time of when chat was edited (ISO 8601 date and time) video_id: Video identifier (integer) File name indicates name of Twitch stream channel. This dataset is saved as python3 pandas.DataFrame with python pickle format. import pandas as pd pd.read_pickle('ninja.pkl')
Z
Wrist-mounted IMU data towards the investigation of free-living smoking...
data.niaid.nih.gov
data.europa.eu
Updated May 3, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Delopoulos, Anastasios (2021). Wrist-mounted IMU data towards the investigation of free-living smoking behavior - the Smoking Event Detection (SED) and Free-living Smoking Event Detection (SED-FL) datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4507450
Explore at:
Dataset updated
May 3, 2021
Dataset provided by
Kyritsis, Konstantinos
Delopoulos, Anastasios
Kirmizis, Athanasios
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction

The Smoking Event Detection (SED) and the Free-living Smoking Event Detection (SED-FL) datasets were created by the Multimedia Understanding Group towards the investigation of smoking behavior, both while smoking and in-the-wild. Both datasets contain the triaxial acceleration and orientation velocity signals ( DoF) that originate from a commercial smartwatch (Mobvoi TicWatch E™). The SED dataset consists of (20) smoking sessions provided by (11) unique subjects, while the SED-FL dataset contains (10) all-day recordings provided by (7) unique subjects.

In addition, the start and end moments of each puff cycle are annotated throughout the SED dataset.

Description

SED

A total of (11) subjects were recorded while smoking a cigarette at interior or exterior areas. The total duration of the (20) sessions sums up to (161) minutes, with a mean duration of (8.08) minutes. Each participant was free to smoke naturally, with the only limitation being to not swap the cigarette between hands during the smoking session. Prior to the recording, the participant was asked to wear the smartwatch to the hand that he typically uses in his everyday life to smoke. A camera was already set facing the participant, including at least the whole length of the arms in its field of view. The purpose of video recording was to obtain ground truth information for each of the puff cycles that occur during the smoking session. Participants were also asked to perform a clapping hand movement both at the start and end of the meal, for synchronization purposes (as this movement is distinctive in the accelerometer signal). No other instructions were given to the participants. It should be noted that the SED dataset does not contain instances of electronic cigarettes (also known as vaping devices), or heated tobacco products.

SED-FL

SED-FL includes (10) in-the-wild sessions that belong to (7) unique subjects. This is achieved by recording the subjects’ meals as a small part part of their everyday life, unscripted, activities. Participants were instructed to wear the smartwatch to the hand of their preference well ahead before any smoking session and continue to wear it throughout the day until the battery is depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided from the participant by documenting the start and end moments of their smoking sessions to the best of their abilities as well as the hand they wear the smartwatch on. The total duration of the recordings sums up to (78.3) hours, with a mean duration of (7.83) hours.

For both datasets, the accompanying Python script read_dataset.py will visualize the IMU signals and ground truth for each of the recordings. Information on how to execute the Python scripts can be found below.

The script and the daataset's pickle file must be located in the same directory.

Tested with Python 3.6.4

Requirements: Pandas, Pickle and Matplotlib

Visualize signals and ground truth

python read_datasets.py

Annotation

For all recordings, we annotated the start and end points for each puff cycle (i.e., smoking gesture). The annotation process was performed in such a way that the start and end times of each smoking gesture do not overlap each other.

Technical details

SED

We provide the SED dataset as a pickle. The file can be loaded using Python in the following way:

import pickle as pkl import pandas as pd

with open('./SED.pkl','rb') as fh: dataset = pkl.load(fh)

The dataset variable in the snippet above is a dictionary with keys, each corresponding to a unique subject (numbered from to ). It should be mentioned that the subject identifier in SED is in-line with the subject identifier in the SED-FL dataset; i.e., SED’s subject with id equal to is the same person as SED-FL’s subject with id equal to .

The content of a dataset ‘s subject is a list with length equal to corresponding subject’s number of recorded smoking sessions. For example, assuming that subject has recorded smoking sessions, the command:

sessions = dataset['8']

would yield a list of length equal to . Each member of the list is a Pandas DataFrame with dimensions , where is the length of the recording.

The columns of a session’s DataFrame are:

'T': The timestamps in seconds

'AccX': The accelerometer measurements for the axis in (m/s^2)

'AccY': The accelerometer measurements for the axis in (m/s^2)

'AccZ': The accelerometer measurements for the axis in (m/s^2)

'GyrX': The gyroscope measurements for the axis in (rad/s)

'GyrY': The gyroscope measurements for the axis in (rad/s)

'GyrZ': The gyroscope measurements for the axis in (rad/s)

'GT': The manually annotated ground truth for puff cycles

The contents of this DataFrame are essentially the accelerometer and gyroscope sensor streams, resampled at a constant sampling rate of Hz and aligned with each other and with their puff cycle ground truth. All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch at the same hand with the same orientation, thusly achieving data uniformity. This transformation is in par with the signals in the SED-FL dataset. The ground truth is a signal with value during puff cycles, and elsewhere.

No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present at the processed acceleration measurements. The potential researcher can consult the article "Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).

SED-FL

Similar to SED, we provide the SED-FL dataset as a pickle. The file can be loaded using Python in the following way:

import pickle as pkl import pandas as pd

with open('./SED-FL.pkl','rb') as fh: dataset = pkl.load(fh)

The dataset variable in the snippet above is a dictionary with keys, each corresponding to a unique subject. It should be mentioned that the subject identifier in SED-FL is in-line with the subject identifier in the SED dataset; i.e., SED-FL’s subject with id equal to is the same person as SED’s subject with id equal to .

The content of a dataset ‘s subject is a list with length equal to corresponding subject’s number of recorded daily sessions. For example, assuming that subject has recorded 2 daily sessions, the command:

sessions = dataset['8']

would yield a list of length equal to (2). Each member of the list is a Pandas DataFrame with dimensions (M \times 8), where (M) is the length of the recording.

The columns of a session’s DataFrame are exactly the same with the ones in the SED dataset. However, the 'GT' column contains ground truth that relates with the smoking sessions during the day (instead of puff cycles in SED).

The contents of this DataFrame are essentially the accelerometer and gyroscope sensor streams, resampled at a constant sampling rate of (50) Hz and aligned with each other and with their smoking session ground truth. All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch at the same hand with the same orientation, thusly achieving data uniformity. This transformation is in par with the signals in the SED dataset. The ground truth is a signal with value (+1) during smoking sessions, and (-1) elsewhere.

No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present at the processed acceleration measurements. The potential researcher can consult the article "Modeling Wrist Micromovements to Measure In-Meal Eating Behavior from Inertial Sensor Data" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth and remove the gravitational component).

Ethics and funding

Informed consent, including permission for third-party access to anonymized data, was obtained from all subjects prior to their engagement in the study. The work leading to these results has received funding from the EU Commission under Grant Agreement No. 965231, the REBECCA project (H2020).

Contact

Any inquiries regarding the SED and SED-FL datasets should be addressed to:

Mr. Konstantinos KYRITSIS (Electrical & Computer Engineer, PhD candidate)

Multimedia Understanding Group (MUG) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki University Campus, Building C, 3rd floor Thessaloniki, Greece, GR54124

Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

Category	Count
Total Reviews	40,607
Unique Products	3,647
Product Categories	5
Language	Indonesian

Speed profiles of freeways in California (I5-S and I210-E)

zenodo.org

csv

Updated Jan 24, 2020

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Semin Kwak; Semin Kwak (2020). Speed profiles of freeways in California (I5-S and I210-E) [Dataset]. http://doi.org/10.5281/zenodo.3478594

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.3478594

Dataset updated

Jan 24, 2020

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Semin Kwak; Semin Kwak

License

Attribution 1.0 (CC BY 1.0)https://creativecommons.org/licenses/by/1.0/
License information was derived automatically

Area covered

California

Description

Speed profiles of freeways in California (I5-S and I210-E). Original data is retrieved from PeMS.

Each YEAR_FREEWAY.csv file contains Timestamp and Speed data.

freeway_meta.csv file contains meta information for each detector: freeway number, direction, detector ID, absolute milepost, and x y coordinates.

# Freeway speed data description

### Data loading example (single freeway: I5-S 2012)


```python
%%time
import pandas as pd

# Date time parser
mydateparser = lambda x: pd.datetime.strptime(x, "%m/%d/%Y %H:%M:%S")

# Freeway data loading (This part should be changed to a proper URL in zenodo.org)
data = pd.read_csv("dataset/2012_I5S.csv", 
          parse_dates=["Timestamp"],
          date_parser=mydateparser).pivot(index="Timestamp",columns='Station_ID', values='Speed')


# Meta data loading
meta = pd.read_csv("dataset/freeway_meta.csv").set_index(['Fwy','Dir'])
```

  CPU times: user 50.5 s, sys: 911 ms, total: 51.4 s
  Wall time: 50.9 s


### Speed data and meta data


```python
data.head()
```







 
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
 
 
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
  
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   
  
 
Station_ID 1 2 3 4 5 6 7 8 9 10 ... 80 81 82 83 84 85 86 87 88 89
Timestamp 
2012-01-01 06:00:00 70.0 69.8 70.1 69.6 69.9 70.8 70.1 69.3 69.2 68.2 ... 72.1 67.6 71.0 66.8 65.9 58.2 67.1 63.8 67.1 71.6
2012-01-01 06:05:00 69.2 69.8 69.8 69.4 69.5 69.5 68.3 67.5 67.4 67.2 ... 71.5 66.1 69.5 67.4 68.3 59.0 66.9 60.8 66.6 65.7
2012-01-01 06:10:00 69.2 69.0 68.6 68.7 68.6 68.9 61.7 68.3 67.4 67.7 ... 71.1 65.2 71.2 66.5 65.4 59.6 66.3 58.4 68.2 65.6
2012-01-01 06:15:00 69.9 69.6 69.7 69.2 69.0 69.1 65.3 67.6 67.1 66.8 ... 69.9 67.1 69.3 66.9 68.2 60.6 66.0 55.5 67.1 69.7
2012-01-01 06:20:00 68.7 68.4 68.2 67.9 68.3 69.3 67.0 68.4 68.2 68.2 ... 70.9 67.2 69.9 65.6 66.7 62.8 66.2 62.6 67.2 67.5
5 rows × 89 columns





```python
meta.head()
```







 
  
   
   
   
   
   
   
  
  
   
   
   
   
   
   
  
 
 
  
   
   
   
   
   
   
  
  
   
   
   
   
   
  
  
   
   
   
   
   
  
  
   
   
   
   
   
  
  
   
   
   
   
   
  
 
ID Abs_mp Latitude Longitude
Fwy Dir 
5 S 1 0.058 32.542731 -117.030501
S 2 0.146 32.543587 -117.031769
S 3 1.291 32.552409 -117.048120
S 4 2.222 32.558422 -117.062360
S 5 2.559 32.561106 -117.067228




### Choose a day


```python
# Sampling (2012-01-13)
myday = "2012-01-13"

# Filter the data by the day
myday_speed_data = data.loc[myday]
```

### A speed profile


```python
from matplotlib import pyplot as plt
import matplotlib.dates as mdates

# Axis value setting
mp = meta[meta.ID.isin(data.columns)].Abs_mp
hour = myday_speed_data.index

# Draw the day
fig, ax = plt.subplots()
heatmap = ax.pcolormesh(hour,mp,myday_speed_data.T, cmap=plt.cm.RdYlGn, vmin=0, vmax=80, alpha=1)
plt.colorbar(heatmap, ax=ax)

# Appearance setting
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H"))
plt.title(pd.Timestamp(myday).strftime("%Y-%m-%d [%a]"))
plt.xlabel("hour")
plt.ylabel("milepost")
plt.show()
```


![png](output_9_0.png)

Zippi_Shvartsman_et_al_2023_bmi_manual_files
figshare.com
bin
Updated Aug 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_bmi_manual_files [Dataset]. http://doi.org/10.6084/m9.figshare.23674200.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23674200.v1
Dataset updated
Aug 29, 2023
Dataset provided by
figshare
Authors
Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Included files:Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during a particular type of task control (‘bmi’ or ‘manual’) and time-locked to 500ms before or after a particular event in the task (‘go_cue’ or ‘target’) for each rewarded trial in each day of the task (‘h’: [1-13], ‘y’: [1-22]).File description:Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:import pandas as pdpd.read_feather([file name])Each DataFrame has the following columns:control_type: ‘bmi’, ‘manual’, or ‘baseline’event: go cue (‘go_cue’) or target acquisition (‘target’)subj: which animal, ‘h’ or ‘y’day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’ where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoderch: electrode channel number, only low-noise channels were included (see Methods for details)n_rewarded_trial: which trial number data segment is from, only successfully completed (rewarded) trials are includedtime_from_window_ms: go_cue: 0-500ms from go cue, for target: -500-0ms from target acquisitionlfp: local field potential value (see Methods for details)
Data from: BSRN solar radiation data for the testing, validation and...
zenodo.org
portaldelainvestigacion.uma.es
+2more
bin
Updated Feb 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jose A Ruiz-Arias; Jose A Ruiz-Arias (2024). BSRN solar radiation data for the testing, validation and benchmarking of solar irradiance components separation models [Dataset]. http://doi.org/10.5281/zenodo.10593079
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10593079
Dataset updated
Feb 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jose A Ruiz-Arias; Jose A Ruiz-Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is an excerpt of the validation dataset used in:

Ruiz-Arias JA, Gueymard CA. Review and performance benchmarking of 1-min solar irradiance components separation methods: The critical role of dynamically-constrained sky conditions. Submitted for publication to Renewable and Sustainable Energy Reviews.

and it is ready to use in the Python package splitting_models developed during that research. See the documentation in the Python package for usage details. Below, there is a detailed description of the dataset.

The data is in a single parquet file that contains 1-min time series of solar geometry, clear-sky solar irradiance simulations, solar irradiance observations and CAELUS sky types for 5 BSRN sites, one per primary Köppen-Geiger climate, namely: Minamitorishima (mnm), JP, for equatorial climate; Alice Springs (asp), AU, for dry climate; Carpentras (car), FR, for temperate climate; Bondville (bon), US, for continental climate; and Sonnblick (son), AT, for cold/polar/snow climate. It includes one calendar year per site. The BSRN data is publicly available. See download instructions in https://bsrn.awi.de/data.

The specific variables included in the dataset are:

climate: primary Köppen-Geiger climate. Values are: A (equatorial), B (dry), C (temperate), D (continental) and E (polar/snow).

longitude: longitude, in degrees east.

latitude: latitude, in degrees north.

sza: solar zenith angle, in degrees.

eth: extraterrestrial solar irradiance (i.e., top of atmosphere solar irradiance), in W/m2.

ghics: clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.

difcs: clear-sky diffuse solar irradiance, in W/m2.It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere.

ghicda: clean-and-dry clear-sky global solar irradiance, in W/m2. It is evaluated with the SPARTA clear-sky model and MERRA-2 clear-sky atmosphere, prescribing zero aerosols and zero precipitable water.

ghi: observed global horizontal irradiance, in W/m2.

dif: observed diffuse irradiance, in W/m2.

sky_type: CAELUS sky type. Values are: 1 (unknown), 2 (overcast), 3 (thick clouds), 4 (scattered clouds), 5 (thin clouds), 6 (cloudless) and 7 (cloud enhancement).

The dataset can be easily loaded in a Python Pandas DataFrame as follows:

import pandas as pd

data = pd.read_parquet(

The dataframe has a multi-index with two levels: times_utc and site. The former are the UTC timestamps at the center of each 1-min interval. The latter is each site's label.
f
Zippi_Shvartsman_et_al_2023_baseline_files
figshare.com
bin
Updated Aug 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena (2023). Zippi_Shvartsman_et_al_2023_baseline_files [Dataset]. http://doi.org/10.6084/m9.figshare.24052824.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24052824.v1
Dataset updated
Aug 29, 2023
Dataset provided by
figshare
Authors
Gabrielle Shvartsman; Ellen Zippi; Nuria Vendrell-Llopis; Joni D. Wallis; Jose M. Carmena
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Included files:Each file includes LFP (local field potential) data for both animals (‘h’, ‘y’) during rest periods for each day (‘baseline’) without any time-locking (500ms segments were randomly selected from baseline in our analyses). Separate baseline files are included for each animal.File description:Each file includes a Pandas DataFrame, saved as a .feather file. Data can be accessed using Python by calling:import pandas as pdpd.read_feather([file name])Each DataFrame has the following columns:control_type: ‘bmi’, ‘manual’, or ‘baseline’subj: which animal, ‘h’ or ‘y’day: which day of the session, ‘h’: [1-13], ‘y’: [1-22]roi: region of interest; ‘direct’, ‘dlpfc’, or ‘cd’ where ‘direct’ includes most channels from m1 each day but is specific to channels which had sufficient spiking to be used as input to the BMI decoderch: electrode channel number, only low-noise channels were included (see Methods for details)time_from_window_ms: represents every ms from start to end of the recorded rest periodlfp: local field potential value (see Methods for details)
Z
Microsensor Beverage Tasting (MicroBeTa)
data.niaid.nih.gov
data.europa.eu
Updated May 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Margarit-Taulé, Josep Maria (2023). Microsensor Beverage Tasting (MicroBeTa) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5457500
Explore at:
Dataset updated
May 15, 2023
Dataset provided by
Jiménez-Jorquera, Cecilia
Rovira, Meritxell
LeBow, Nicholas
Margarit-Taulé, Josep Maria
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MicroBeTa is a dataset for automatic, "electronic tongue" beverage classification. It includes temporal multivariate readings simultaneously acquired from a temperature sensor and solid-state electrochemical microsensors developed and manufactured by the Chemical Transducers Group at the Institute of Microelectronics of Barcelona (IMB-CNM), CSIC:

http://gtq.imb-cnm.csic.es/en

Citing the MicroBeTA dataset

The MicroBeTA is released under a Creative Commons Attribution license, so please cite it if it is used in your work in any form. Published academic papers should use the citation for our Frontiers in Neuroscience paper. Personal works, such as machine learning projects or blog posts, should provide a URL to this Zenodo page, though referencing our research paper would also be appreciated.

Academic paper citation

LeBow N, Rueckauer B, Sun P, Rovira M, Jiménez-Jorquera C, Liu S-C and Margarit-Taulé JM (2021) Real-Time Edge Neuromorphic Tasting From Chemical Microsensor Arrays. Front. Neurosci. 15:771480. http://doi.org/10.3389/fnins.2021.771480

Personal use citation

Include a link to this Zenodo page: http://doi.org/10.5281/zenodo.5457501

Description

The dataset includes seven hours of readings from a sensor array acquired every second during three sessions performed over the course of three days at the IBM-CNM. The array comprises one Pt-100 temperature sensor, one microelectrode each for electrical conductivity and oxidation-reduction potential (ORP), and six ISFET sensors sensitive to specific ions (H+, Na+, K+,Ca2+, Cl-, and NO3-).

The beverage types selected for MicroBeTa are five commercial beverage varieties of white wine, red wine, still water, sparkling water and cava. This beverage selection covers a wide range of characteristics within a limited set of classes, with several semi-overlapping sets of attributes that could be expected to provide insight into how the data from various sensors could be used by the classifier, e.g. still and sparkling water, red wine and cava covering four general cases arising from the presence or absence of carbonation and fermentation byproducts, respectively.

All sensors were read out continuously and concurrently during each session, while the sensor array was moved from one beverage sample to another at fixed intervals of five minutes. The sequence of transitions between beverage samples was chosen to cover all combinations from one beverage to another. During each transfer, the sensor array was washed with deionized water before being placed in the next sample to avoid unnecessary cross-contamination of subsequent beverages in the series.

Data Files

clean_dataset.h5: Contains a Python Pandas dataframe including the reading signals from all sensor channels and the labels ('Time', 'H+', 'K+', 'Na+', 'Cl-', 'NO3-', 'Ca2+', 'Conductivity', 'ORP', 'Temperature', and 'Label', respectively), with the washing and transfer periods as well as transient instabilities of individual sensors discarded.

preprocessed_dataset_9cols.h5: Contains a Python Pandas dataframe (['n_output_classes', 'samples_train', 'labels_train', 'samples_test', 'labels_test'] columns) of sensor samples for training and testing a classifier model. The data samples are fixed-length, overlapped time windows containing the signal values from all nine sensors ('Temperature','H+', 'K+', 'Na+', 'Cl-', 'NO3-', 'Ca2+', 'Conductivity', and 'ORP', respectively) over a contiguous range of 16 timestamps. The samples are preprocessed as follows:

Incomplete measurement cycles in which not all beverages are recorded, or measurements of specific beverage samples much shorter than others, are removed entirely. Any measurements lasting significantly longer than five minutes are truncated to that length.

A high-pass filter with a cut-off frequency of 0.5 mHz is used to attenuate level offsets in the input signals while emphasizing their dynamic components.

Outliers in which at least one sensor channel contains a value further than four standard deviations from the mean are deleted.

Each sensor channel is normalized independently using quantile normalization.

preprocessed_dataset_7cols.h5: Contains the same Pandas dataframe of sensor samples for training and testing a classifier model as preprocessed_dataset_9cols.h5, but in this case excluding the two least informative sensors ('Temperature' and 'NO3-').

Contact

Further details on the creation and validation of MicroBeTa will be disclosed in our Frontiers paper. If you have any questions or comments about the dataset, please feel free to write to:

josepmaria.margarit@imb-cnm.csic.es
Datatset: Machine-Learning Side-Channel Attacks on the GALACTICS...
zenodo.org
bin
Updated Jul 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Soundes Marzougui; Soundes Marzougui; Nils Wisiol; Nils Wisiol; Patrick Gersch; Patrick Gersch; Juliane Krämer; Juliane Krämer; Jean-Pierre Seifert; Jean-Pierre Seifert (2021). Datatset: Machine-Learning Side-Channel Attacks on the GALACTICS Constant-Time Implementation of BLISS [Dataset]. http://doi.org/10.5281/zenodo.5101343
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5101343
Dataset updated
Jul 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Soundes Marzougui; Soundes Marzougui; Nils Wisiol; Nils Wisiol; Patrick Gersch; Patrick Gersch; Juliane Krämer; Juliane Krämer; Jean-Pierre Seifert; Jean-Pierre Seifert
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset accompanies the paper "Machine-Learning Side-Channel Attacks on the GALACTICS Constant-Time Implementation of BLISS". It was used to experimentally prove the presented attack strategies on real hardware. The corresponding source code for all three attacks is also publicly available.

A detailed description of how the data was obtained can be found in the paper. Section 4 precisely describes the experimental setup.

Prerequisites:

sudo apt-get install p7zip

Extract the data:

7z x galactics_attack_data.7z

Running the attacks:

The source code to run the three presented attacks can be found on Github. The instructions on how to use the python code can be obtained from the corresponding README.

Re-using the dataset:

The dataset consists of .pickle and .bin files. The .pickle files can be read using Pythons Pandas library. Python access functions for the .bin files are also provided.
Z
glassDef dataset: metallic glass deformation
data.niaid.nih.gov
zenodo.org
Updated Dec 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kamran Karimi (2023). glassDef dataset: metallic glass deformation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7736625
Explore at:
Dataset updated
Dec 24, 2023
Dataset provided by
Stefanos Papanikolaou
Amin Esfandiarpour
Kamran Karimi
Rene Alvarez-Donado
Mikko J. Alava
License
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
The glassDef dataset contains a set of text-based LAMMPS dump files corresponding to shear deformation tests on different bulk metallic glasses. This includes FeNi, CoNiFe, CoNiCrFe, CoCrFeMn, CoNiCrFeMn, and Co5Cr2Fe40Mn27Ni26 amorphous alloys with data files that exist in relevant subdirectories. Each dump file corresponds to multiple realizations and includes the dimensions of the simulation box as well as atom coordinates, the atom ID, and associated type of nearly 50,000 atoms.

Load glassDef Dataset in Python

The glassDef dataset may be loaded in Python into Pandas DataFrame. To go into the relevant subdirectory, run cd glass{glass_name}/Run[0-3]/ where “glass_name” denotes the chemical composition. Each subdirectory contains at least three glass realizations within subfolders that are labeled as “Run[0-3]”.

cd glassFeNi/Run0; python

import pandas

df = pandas.read_csv("FeNi_glass.dump",skiprows=9)

One may display an assigned DataFrame in the form of a table:

df.head()

To learn more about further analyses performed on the loaded data, please refer to the paper cited below.

glassDef Dataset Structure

glassDef Data Fields

Dump files: “id”, “type”, “x”, “y”, “z”.

glassDef Dataset Description

Paper: Karimi, Kamran, Amin Esfandiarpour, René Alvarez-Donado, Mikko J. Alava, and Stefanos Papanikolaou. "Shear banding instability in multicomponent metallic glasses: Interplay of composition and short-range order." Physical Review B 105, no. 9 (2022): 094117.

Contact: kamran.karimi@ncbj.gov.pl
m
Dataset of Leak Simulations in Experimental Testbed Water Distribution...
data.mendeley.com
Updated Dec 12, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohsen Aghashahi (2022). Dataset of Leak Simulations in Experimental Testbed Water Distribution System [Dataset]. http://doi.org/10.17632/tbrnp6vrnj.1
Explore at:
Unique identifier
https://doi.org/10.17632/tbrnp6vrnj.1
Dataset updated
Dec 12, 2022
Authors
Mohsen Aghashahi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the first fully labeled open dataset for leak detection and localization in water distribution systems. This dataset includes two hundred and eighty signals acquired from a laboratory-scale water distribution testbed with four types of induced leaks and no-leak. The testbed was 47 m long built from 152.4 mm diameter PVC pipes. Two accelerometers (A1 and A2), two hydrophones (H1 and H2), and two dynamic pressure sensors (P1 and P2) were deployed to measure acceleration, acoustic, and dynamic pressure data. The data were recorded through controlled experiments where the following were changed: network architecture, leak type, background flow condition, background noise condition, and sensor types and locations. Each signal was recorded for 30 seconds. Network architectures were looped (LO) and branched (BR). Leak types were Longitudinal Crack (LC), Circumferential Crack (CC), Gasket Leak (GL), Orifice Leak (OL), and No-leak (NL). Background flow conditions included 0 L/s (ND), 0.18 L/s, 0.47 L/s, and Transient (background flow rate abruptly changed from 0.47 L/s to 0 L/s at the second 20th of 30-second long measurements). Background noise conditions, with noise (N) and without noise (NN), determined whether a background noise was present during acoustic data measurements. Accelerometer and dynamic pressure data are in ‘.csv’ format, and the hydrophone data are in ‘.raw’ format with 8000 Hz frequency. The file “Python code to convert raw acoustic data to pandas DataFrame.py” converts the raw hydrophone data to DataFrame in Python.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

		ID	Abs_mp	Latitude	Longitude
5	S	1	0.058	32.542731	-117.030501
S	2	0.146	32.543587	-117.031769
S	3	1.291	32.552409	-117.048120
S	4	2.222	32.558422	-117.062360
S	5	2.559	32.561106	-117.067228

Facebook

Twitter

Click to copy link

Link copied

Cite

Christopher Kuenneth; Christopher Kuenneth; Rampi Ramprasad; Rampi Ramprasad (2023). polyOne Data Set - 100 million hypothetical polymers including 29 properties [Dataset]. http://doi.org/10.5281/zenodo.7124188

polyOne Data Set - 100 million hypothetical polymers including 29 properties

Explore at:

bin, txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.7124188

Dataset updated

Mar 24, 2023

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Christopher Kuenneth; Christopher Kuenneth; Rampi Ramprasad; Rampi Ramprasad

Description

polyOne Data Set

The data set contains 100 million hypothetical polymers each with 29 predicted properties using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers but, mostly, have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.

Full data set including the properties

The data files are in Apache Parquet format. The files start with `polyOne_*.parquet`.

I recommend using dask (`pip install dask`) to load and process the data set. Pandas also works but is slower.

Load sharded data set with dask
```python
import dask.dataframe as dd
ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of data set
```python
df_describe = ddf.describe().compute()
df_describe

```

PSMILES strings only

generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.

Clear search

Close search

Google apps

Main menu

Station_ID	1	2	3	4	5	6	7	8	9	10	...	80	81	82	83	84	85	86	87	88	89
Timestamp
2012-01-01 06:00:00	70.0	69.8	70.1	69.6	69.9	70.8	70.1	69.3	69.2	68.2	...	72.1	67.6	71.0	66.8	65.9	58.2	67.1	63.8	67.1	71.6
2012-01-01 06:05:00	69.2	69.8	69.8	69.4	69.5	69.5	68.3	67.5	67.4	67.2	...	71.5	66.1	69.5	67.4	68.3	59.0	66.9	60.8	66.6	65.7
2012-01-01 06:10:00	69.2	69.0	68.6	68.7	68.6	68.9	61.7	68.3	67.4	67.7	...	71.1	65.2	71.2	66.5	65.4	59.6	66.3	58.4	68.2	65.6
2012-01-01 06:15:00	69.9	69.6	69.7	69.2	69.0	69.1	65.3	67.6	67.1	66.8	...	69.9	67.1	69.3	66.9	68.2	60.6	66.0	55.5	67.1	69.7
2012-01-01 06:20:00	68.7	68.4	68.2	67.9	68.3	69.3	67.0	68.4	68.2	68.2	...	70.9	67.2	69.9	65.6	66.7	62.8	66.2	62.6	67.2	67.5

polyOne Data Set - 100 million hypothetical polymers including 29 properties...

kaggle-entity-annotated-corpus-ner-dataset

SELTO Dataset

IMDb Top 4070: Explore the Cinema Data

TinyStories

TinyStories

A Diverse, Richly Annotated Corpus of Short-Form Stories

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Tokopedia Product Reviews

Tokopedia Product Reviews 2019

Dataset Description

Dataset Summary

Dataset Structure

Column Description

Data Splits

Dataset Statistics

Usage

Loading the Dataset

Potential Use Cases

Data Collection

Ethical Considerations

Limitations

Citation

Contact

Acknowledgments

What you see is what you get: Delineating the urban jobs-housing spatial...

Twitch.tv Chat Log Data

Wrist-mounted IMU data towards the investigation of free-living smoking...

The script and the daataset's pickle file must be located in the same directory.

Tested with Python 3.6.4

Requirements: Pandas, Pickle and Matplotlib

Visualize signals and ground truth

Speed profiles of freeways in California (I5-S and I210-E)

Zippi_Shvartsman_et_al_2023_bmi_manual_files

Data from: BSRN solar radiation data for the testing, validation and...

Zippi_Shvartsman_et_al_2023_baseline_files

Microsensor Beverage Tasting (MicroBeTa)

Datatset: Machine-Learning Side-Channel Attacks on the GALACTICS...

glassDef dataset: metallic glass deformation

Dataset of Leak Simulations in Experimental Testbed Water Distribution...

polyOne Data Set - 100 million hypothetical polymers including 29 properties