95 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
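    The splitting step described for splitting.py can be sketched as follows. This is a minimal illustration of an 80/10/10 train/validation/test split, with the function name and ratios assumed; it is not the actual splitting.py from the deposit:

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # deterministic shuffle for reproducibility
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

    Shuffling with a fixed seed keeps the split reproducible across runs.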
  2. Materials Project Time Split Data

    • figshare.com
    json
    Updated May 30, 2023
    Cite
    Sterling G. Baird; Taylor Sparks (2023). Materials Project Time Split Data [Dataset]. http://doi.org/10.6084/m9.figshare.19991516.v4
    Explore at:
    Available download formats: json
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sterling G. Baird; Taylor Sparks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Full and dummy snapshots (2022-06-04) of data for mp-time-split, encoded via matminer convenience functions and grabbed via the new Materials Project API. The dataset is restricted to experimentally verified compounds with no more than 52 sites. No other filtering criteria were applied. The snapshots were developed for sparks-baird/mp-time-split as a benchmark dataset for materials generative modeling. Compressed versions of the files (.gz) are also available.

    dtypes

    from pprint import pprint
    from matminer.utils.io import load_dataframe_from_json

    filepath = "insert/path/to/file/here.json"
    expt_df = load_dataframe_from_json(filepath)
    pprint(expt_df.iloc[0].apply(type).to_dict())

    The printed dictionary lists the type of each column; its keys are: discovery, energy_above_hull, formation_energy_per_atom, material_id, references, structure, theoretical, year.

    index/mpids (just the number for the index): note that material_id-s that begin with "mvc-" have the "mvc" dropped and the hyphen (minus sign) is kept to distinguish between "mp-" and "mvc-" types while still allowing for sorting. E.g. mvc-001 -> -1.

    {146: MPID(mp-146), 925: MPID(mp-925), 1282: MPID(mp-1282), 1335: MPID(mp-1335), 12778: MPID(mp-12778), 2540: MPID(mp-2540), 316: MPID(mp-316), 1395: MPID(mp-1395), 2678: MPID(mp-2678), 1281: MPID(mp-1281), 1251: MPID(mp-1251)}
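    The index convention above (drop "mvc", keep the minus sign) can be sketched with a small helper; the function name is hypothetical and not part of the dataset:

```python
def mpid_to_index(mpid: str) -> int:
    """Map 'mp-146' -> 146 and 'mvc-001' -> -1, per the sorting convention above."""
    if mpid.startswith("mvc-"):
        return -int(mpid[len("mvc-"):])  # keep the hyphen as a minus sign
    if mpid.startswith("mp-"):
        return int(mpid[len("mp-"):])
    raise ValueError(f"unrecognized material_id: {mpid}")

print(mpid_to_index("mp-146"), mpid_to_index("mvc-001"))  # 146 -1
```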

  3. NETFLIX Stock Data 2025

    • kaggle.com
    Updated Jun 13, 2025
    Cite
    Umer Haddii (2025). NETFLIX Stock Data 2025 [Dataset]. https://www.kaggle.com/datasets/umerhaddii/netflix-stock-data-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Umer Haddii
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Netflix, Inc. is an American media company engaged in paid streaming and the production of films and series.

    Market cap

    Market capitalization of Netflix (NFLX)
    
    Market cap: $517.08 Billion USD
    
    

    As of June 2025 Netflix has a market cap of $517.08 Billion USD. This makes Netflix the world's 19th most valuable company by market cap according to our data. The market capitalization, commonly called market cap, is the total market value of a publicly traded company's outstanding shares and is commonly used to measure how much a company is worth.

    Revenue

    Revenue for Netflix (NFLX)
    
    Revenue in 2025: $40.17 Billion USD
    

    According to Netflix's latest financial reports, the company's current revenue (TTM) is $40.17 Billion USD. In 2024 the company made a revenue of $39.00 Billion USD, an increase over the revenue of $33.72 Billion USD in 2023. The revenue is the total amount of income that a company generates by the sale of goods or services. Unlike with the earnings, no expenses are subtracted.

    Earnings

    Earnings for Netflix (NFLX)
    
    Earnings in 2025 (TTM): $11.31 Billion USD
    
    

    According to Netflix's latest financial reports, the company's current earnings (TTM) are $11.31 Billion USD. In 2024 the company made earnings of $10.70 Billion USD, an increase over its 2023 earnings of $7.02 Billion USD. The earnings displayed on this page are the company's pretax income.

    End of Day market cap according to different sources

    On Jun 12th, 2025 the market cap of Netflix was reported to be:

    $517.08 Billion USD by Yahoo Finance

    $517.08 Billion USD by CompaniesMarketCap

    $517.21 Billion USD by Nasdaq

    Content

    Geography: USA

    Time period: May 2002 to June 2025

    Unit of analysis: Netflix Stock Data 2025

    Variables

    Variable     Description
    date         The trading date.
    open         The price at market open.
    high         The highest price for that day.
    low          The lowest price for that day.
    close        The price at market close, adjusted for splits.
    adj_close    The closing price after adjustments for all applicable splits and dividend distributions. Data is adjusted using appropriate split and dividend multipliers, adhering to Center for Research in Security Prices (CRSP) standards.
    volume       The number of shares traded on that day.
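    As a worked example of using those columns, the sketch below computes close-to-close daily returns from such a CSV with only the standard library; the inline sample rows are synthetic, and the exact header spelling is assumed to match the variable list above:

```python
import csv
import io

# A tiny synthetic stand-in for the stock CSV, with the columns described above.
sample = """date,open,high,low,close,adj_close,volume
2025-06-10,1200.0,1215.0,1195.0,1210.0,1210.0,3000000
2025-06-11,1210.0,1230.0,1205.0,1222.1,1222.1,2800000
"""

def daily_returns(f):
    """Yield (date, close-to-close return) pairs from an OHLCV CSV file object."""
    rows = list(csv.DictReader(f))
    for prev, cur in zip(rows, rows[1:]):
        prev_close = float(prev["adj_close"])
        yield cur["date"], float(cur["adj_close"]) / prev_close - 1.0

for date, ret in daily_returns(io.StringIO(sample)):
    print(date, round(ret, 4))
```

    Using adj_close rather than close avoids spurious jumps at split and dividend dates.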

    Acknowledgements

    This dataset belongs to me. I’m sharing it here for free. You may do with it as you wish.

  4. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    • b2find.eudat.eu
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.
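    The overlapping sliding-window idea (windows of 2–10 seconds with a 1-second step) can be sketched as follows; this is an illustration of the segmentation scheme, not the code from overlapping_sliding_window_loop.py:

```python
def sliding_windows(n_samples, rate_hz, window_s, step_s=1.0):
    """Yield (start, end) sample indices for overlapping windows."""
    win = int(window_s * rate_hz)
    step = int(step_s * rate_hz)
    start = 0
    while start + win <= n_samples:
        yield start, start + win
        start += step

# 10 s of 100 Hz data, 2 s windows with a 1 s step -> 9 overlapping segments
segments = list(sliding_windows(1000, 100, window_s=2.0))
print(len(segments), segments[:2])  # 9 [(0, 200), (100, 300)]
```

    In the paper's setup this would be repeated for each window length from 2 to 10 seconds, producing one set of segment files per length.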

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:


    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
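    The confidence-threshold step can be illustrated generically: keep a prediction only when the maximum class probability clears a threshold, and abstain otherwise. The sketch below is a stand-in with assumed names, not the authors' code from lines 361 to 380:

```python
def confident_predictions(probas, classes, threshold=0.8):
    """Return one class label per sample, or None when the model abstains."""
    labels = []
    for row in probas:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over class probabilities
        labels.append(classes[best] if row[best] >= threshold else None)
    return labels

probas = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
print(confident_predictions(probas, ["walk", "scan"]))  # ['walk', None, 'scan']
```

    Sweeping the threshold over a validation set and inspecting accuracy versus abstention rate is one common way to pick such a value empirically.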

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  5. Data from: ParIce Dev/Test/Train Splits 20.05

    • repotest.clarin.is
    Updated May 28, 2020
    Cite
    Starkaður Barkarson; Steinþór Steingrímsson (2020). ParIce Dev/Test/Train Splits 20.05 [Dataset]. https://repotest.clarin.is/repository/xmlui/handle/20.500.12537/24
    Explore at:
    Dataset updated
    May 28, 2020
    Authors
    Starkaður Barkarson; Steinþór Steingrímsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three dev/test sets for MT quality estimation created from subcorpora of ParIce. The dev/test sets contain English-Icelandic segment pairs. One of the three sets is made up of subtitle segments from OpenSubtitles, one of segments from drug descriptions distributed by the European Medical Agency (EMA) and one from EEA documents. The sets are manually annotated so all pairs are correct.

    The goal was to create dev/test sets with a total of at least 3,000 correct translation segments from each subcorpus. All segments contain four or more words on the English side.

    The OpenSubtitles set contains 1,531/1,532 segments in dev/test. Furthermore, it contains 2,277 segment pairs that have fewer than four words on the English side and 777 segment pairs that have incorrect alignments or translations. The training set contains 1,298,489 segments, which have not been manually checked for errors. The OpenSubtitles sets are compiled using a Python script that downloads the segments and creates the splits.

    The EMA set contains 2,254/2,255 segment pairs in dev/test. Furthermore, it contains 491 segment pairs that have fewer than four words on the English side and 240 segments that have incorrect alignments or translations. The training set contains 399,093 segments, which have not been manually checked for errors.

    The EEA set contains 22 whole documents. Documents with between 100 and 200 sentences were selected at random until we reached more than 3,000 sentence pairs. Alignments and translations were manually corrected for these documents. Longer sentences were split into smaller parts where possible. The split consists of 2,292/2,396 dev/test segments and 1,697,927 training segments that have not been manually checked.
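    The four-or-more-words criterion on the English side can be expressed as a simple filter; the function and pair layout below are illustrative, not the project's actual script:

```python
def filter_pairs(pairs, min_en_words=4):
    """Keep only (English, Icelandic) segment pairs with enough English words."""
    return [(en, ic) for en, ic in pairs if len(en.split()) >= min_en_words]

pairs = [("I saw it.", "Ég sá það."),
         ("The patient should take one tablet daily.",
          "Sjúklingurinn á að taka eina töflu daglega.")]
print(len(filter_pairs(pairs)))  # 1
```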


  6. DustNet - structured data and Python code to reproduce the model,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Nowak, T. E. (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Nowak, T. E.
    Siegert, Stefan
    Augousti, Andy T.
    Simmons, Benno I.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and Python code used for AOD prediction with DustNet model - a Machine Learning/AI based forecasting.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables*, ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with the data normalised and split into training/validation/testing sets is also provided, together with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

    Model output data and code

    This dataset was constructed by running ‘DustNet_model_code.ipynb’ (see above). It consists of 1,095 days of forecast AOD data (2020-2022) produced by CAMS, the DustNet model, a naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for a quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.
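    The naïve persistence baseline mentioned above forecasts each day using the previous day's observation. A minimal NumPy sketch with a synthetic AOD array (the array shape and names are assumptions, not the files in this archive):

```python
import numpy as np

# Synthetic stand-in for an AOD time series: shape (days, lat, lon)
aod = np.random.default_rng(0).random((5, 2, 3))

# Persistence forecast: the prediction for day t is the observation at day t-1.
forecast = aod[:-1]
truth = aod[1:]

mae = np.abs(forecast - truth).mean()  # mean absolute error of the baseline
print(forecast.shape, truth.shape)  # (4, 2, 3) (4, 2, 3)
```

    Any learned model is expected to beat this baseline; it is a standard sanity check in forecasting.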

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.

  7. MD17 data for graph2mat

    • data.dtu.dk
    txt
    Updated Aug 6, 2024
    + more versions
    Cite
    Arghya Bhowmik (2024). MD17 data for graph2mat [Dataset]. http://doi.org/10.11583/DTU.26195285.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creators

    Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)

    Related publication

    The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat

    Short description

    This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).

    SIESTA 5.0.0 was used to compute the dataset.

    Contents

    The dataset has two directories:

    • pseudos: Contains the pseudopotentials used for the calculation (obtained from http://www.pseudo-dojo.org/, type NC SR (ONCVPSP v0.5), PBE, standard accuracy)
    • splits: The data splits used in the published paper. Each file "splits_X.json" contains the splits for training size X.

    And then, three directories containing the calculations with different basis sets:

    • matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
    • matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
    • matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.

    Each of the basis directories has two subdirectories:

    • basis: Contains the files specifying the basis used for each atom.
    • runs: The results of running the SIESTA simulations. Contents are discussed next.

    The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains: - RUN.fdf, geom.fdf: The input files used for the SIESTA calculation. - RUN.out: The log of the SIESTA run, which apar - siesta.TSDE: Contains the Density and Energy Density matrices. - siesta.TSHS: Contains the Hamiltonian and Overlap matrices.

    Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:

    import sisl
    
    matrix = sisl.get_sile("RUN.fdf").read_X()
    

    where X is hamiltonian, overlap, density_matrix or energy_density_matrix.

    To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).

    Cite this data

    https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark

    License

    This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

  8. KikuyuASR_trainingdataset

    • huggingface.co
    Updated Nov 20, 2024
    + more versions
    Cite
    CGIAR (2024). KikuyuASR_trainingdataset [Dataset]. https://huggingface.co/datasets/CGIAR/KikuyuASR_trainingdataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    CGIAR (http://cgiar.org/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was obtained as part of the AIEP project by Digital Green and Karya from extension workers, lead farmers and farmers. Data collection process: selected users were given the option of doing a task and getting paid for it. The users were supposed to record each sentence as it appeared on the screen. The audio files thus obtained were validated by matching them against the sentences and used to fine-tune the model. Also available are the python scripts that help in processing and splitting the data into… See the full description on the dataset page: https://huggingface.co/datasets/CGIAR/KikuyuASR_trainingdataset.

  9. CYP450 80/20 splits

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Siegle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from an NIH HTS of 17K compounds against five isozymes of cytochrome P450 screening for inhibition. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally the balanced datasets are subject to an 80/20 training/test split. Link to python script of data manipulation...
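    The balance-then-split procedure can be sketched as follows; this is a generic illustration (downsample the majority class, then split 80/20), not the referenced script:

```python
import random

def balance_and_split(actives, inactives, test_frac=0.2, seed=0):
    """Downsample the majority class to balance, then do an 80/20 train/test split."""
    rng = random.Random(seed)
    n = min(len(actives), len(inactives))
    pool = rng.sample(actives, n) + rng.sample(inactives, n)  # balanced pool
    rng.shuffle(pool)
    n_test = int(len(pool) * test_frac)
    return pool[n_test:], pool[:n_test]   # train, test

train, test = balance_and_split(list(range(100)), list(range(100, 400)))
print(len(train), len(test))  # 160 40
```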

  10. A Dataset of Outdoor RSS Measurements for Localization

    • zenodo.org
    • data.niaid.nih.gov
    json, tiff, zip
    Updated Jul 15, 2024
    Cite
    Frost Mitchell; Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. http://doi.org/10.5281/zenodo.7259895
    Explore at:
    Available download formats: tiff, json, zip
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Frost Mitchell; Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Dataset         Sample Count    Receiver Count
    No-Tx Samples   46              10 to 25
    1-Tx Samples    4822            10 to 25
    2-Tx Samples    346             11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \(RSS = \frac{10}{N} \log_{10}\left(\sum_i^N x_i^2 \right) \)
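    Transcribing the formula above directly into code, with a synthetic constant-amplitude signal of N = 10,000 samples:

```python
import math

def rss_db(samples):
    """RSS = (10 / N) * log10(sum of x_i^2), as given above."""
    n = len(samples)
    return (10.0 / n) * math.log10(sum(x * x for x in samples))

# 10,000 samples of constant amplitude 0.01: the sum of squares is ~1.0, so RSS ~ 0
print(round(rss_db([0.01] * 10_000), 6))
```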

    Measurement Parameter    Value
    Frequency                462.7 MHz
    Radio Gain               35 dB
    Receiver Sample Rate     2 MHz
    Sample Length            N=10,000
    Band-pass Filter         6 kHz
    Transmitters             0 to 2
    Transmission Power       1 W

    Receivers consist of Ettus USRP X310 and B210 radios, and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json
    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
      data = json.load(f)
    

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    • rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
    • tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
    • metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords
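    For example, the samples can be iterated to count transmitters per sample and collect RSS values; the toy record below follows the schema described above, though the exact layout of each rx_data entry is an assumption:

```python
import json
from collections import Counter

# Toy stand-in for powder_462.7_rss_data.json, following the schema described above.
data = json.loads("""{
  "2022-04-01T10:00:00": {
    "rx_data": [[-70.1, 40.76, -111.84, "bus-1"], [-65.3, 40.77, -111.85, "rooftop-2"]],
    "tx_coords": [[40.765, -111.845]],
    "metadata": [{"power": "1W"}]
  }
}""")

# Count samples by number of active transmitters, and collect all RSS readings.
tx_counts = Counter(len(s["tx_coords"]) for s in data.values())
rss_values = [rx[0] for s in data.values() for rx in s["rx_data"]]
print(tx_counts, rss_values)
```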

    File Separations and Train/Test Splits

    In the separated_data.zip folder there are several train/test separations of the data.

    • all_data contains all the data in the main JSON file, separated by the number of transmitters.
    • stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
    • train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of each splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
      • The random split is a random 80/20 split of the data.
      • special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
      • The grid split divides the campus region in to a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occuring in each grid square are assigned to train or test. One such random assignment of grid squares makes up the grid split.
      • The seasonal split contains data separated by the month of collection, in April or July.
      • The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
      • campus.json contains the on-campus data; it is equivalent to the union of all splits, excluding unused.json.
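
    The no-adjacent-test-squares constraint of the grid split can be sketched with a greedy sampler (an illustrative re-creation, not the code that produced the published split):

```python
import random

# Greedy sketch: pick 20 test squares from a 10x10 grid such that no two
# test squares are 4-neighbors; the remaining 80 squares form the training set.
rng = random.Random(0)
cells = [(r, c) for r in range(10) for c in range(10)]
test = set()
for r, c in rng.sample(cells, len(cells)):
    neighbors = {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}
    if len(test) < 20 and not neighbors & test:
        test.add((r, c))
train = [cell for cell in cells if cell not in test]

print(len(train), len(test))  # 80 20
```

Because a maximal independent set on a grid with maximum degree 4 always has at least 100/5 = 20 cells, the greedy pass is guaranteed to find 20 mutually non-adjacent test squares.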

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in python:

    import rasterio as rio
    import numpy as np
    import utm
    
    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)   # a np.array containing elevation values
    dsm_resolution = dsm_object.res   # a tuple containing x,y resolution (0.5 meters) 
    dsm_transform = dsm_object.transform   # an Affine transform for conversion to UTM-12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3,3))[:2]
    utm_top_left = utm_transform @ np.array([0,0,1])
    utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])  # (col, row, 1): the affine maps x=col, y=row
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
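
    Since the affine transform above is just a linear map from pixel (col, row) indices to UTM easting/northing, the conversion can be illustrated in plain Python (the coefficients below are made up; the real values come from dsm_object.transform):

```python
# Hypothetical affine coefficients for a 0.5 m grid (real ones come from
# dsm_object.transform): easting = a*col + b*row + c, northing = d*col + e*row + f.
a, b, c = 0.5, 0.0, 425000.0
d, e, f = 0.0, -0.5, 4515000.0

def pixel_to_utm(row, col):
    # Note the (col, row) order: the affine maps x = column, y = row.
    return (a * col + b * row + c, d * col + e * row + f)

print(pixel_to_utm(0, 0))  # (425000.0, 4515000.0), the top-left corner
print(pixel_to_utm(2, 4))  # (425002.0, 4514999.0)
```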
    

    Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ

  11. MC-LSTM papers, model runs

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Jonathan Martin Frame
    Time period covered
    Jan 1, 1989 - Jan 1, 2015
    Area covered
    Description

    Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

    The models here are trained/calibrated on three different time periods:

    • Standard Time Split (time split 1): the test period (1989-1999) is the same period used by previous studies, which allows us to confirm that the deep learning models (LSTM and MC-LSTM) trained for this project perform as expected relative to prior work.
    • NWM Time Split (time split 2): the second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995.
    • Return period split: the third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

    Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

    IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

    Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

    Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/

  12. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Roig, Gemma
    Choksi, Bhavin
    Schaumlöffel, Timothy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  13. Data from: Finding the optimal integration coefficient for a palindromic...

    • data.mendeley.com
    • research.science.eus
    • +1more
    Updated Dec 4, 2023
    Lorenzo Nagar (2023). Finding the optimal integration coefficient for a palindromic multi-stage splitting integrator in HMC applications to Bayesian inference [Dataset]. http://doi.org/10.17632/5mmh4wcdd6.1
    Explore at:
    Dataset updated
    Dec 4, 2023
    Authors
    Lorenzo Nagar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present the tables of integration coefficients for the 2- and 3-stage adaptive splitting integrators derived for Hamiltonian Monte Carlo (HMC) using the Adaptive Integration Approach s-AIA introduced in

    • Nagar, L., Fernández-Pendás, M., Sanz-Serna, J. M., Akhmatskaya, E. (2023). Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo. arXiv:2307.02096. doi:10.48550/arXiv.2307.02096 .

    The tables provide the maps that assign the optimal (in terms of the best conservation of energy for harmonic forces) integration coefficient for a k-stage palindromic splitting integrator to a nondimensional simulation step size in the stability interval (0, 2k).

    The repository includes the two tables for 2- and 3-stage s-AIA, a Python script that provides the optimal integration coefficient for a user-chosen dimensional step size, two .txt files containing the values of the optimal integration coefficients for 2- and 3-stage s-AIA used by the Python script, and a readme.pdf file describing the s-AIA methodology and the usage guidelines for the tables.
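
    Conceptually, the script maps a nondimensional step size to a tabulated coefficient by interpolation; a minimal sketch, with invented placeholder values rather than the shipped table entries (the Python script in the repository is the authoritative implementation):

```python
import bisect

# Hypothetical fragment of a 2-stage table: nondimensional step size h -> coefficient b.
# Real values come from the .txt tables shipped with the dataset.
table_h = [0.1, 0.5, 1.0, 2.0, 3.0, 4.0]
table_b = [0.2113, 0.2204, 0.2300, 0.2481, 0.2665, 0.2900]

def optimal_coefficient(h):
    # Linearly interpolate within the tabulated range of the stability interval.
    if not table_h[0] <= h <= table_h[-1]:
        raise ValueError("step size outside tabulated range")
    i = bisect.bisect_left(table_h, h)
    if table_h[i] == h:
        return table_b[i]
    h0, h1 = table_h[i - 1], table_h[i]
    b0, b1 = table_b[i - 1], table_b[i]
    return b0 + (b1 - b0) * (h - h0) / (h1 - h0)

print(round(optimal_coefficient(0.75), 4))  # 0.2252
```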

  14. Aggregated Beaufort Sea benthic infauna data from the National Oceanographic...

    • search.dataone.org
    • knb.ecoinformatics.org
    • +2more
    Updated Apr 29, 2021
    Kenneth Dunton; Susan Schonberg; Tim Whiteaker (2021). Aggregated Beaufort Sea benthic infauna data from the National Oceanographic Data Center (NODC), 1971-1980 [Dataset]. http://doi.org/10.24431/rw1k57r
    Explore at:
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Research Workspace
    Authors
    Kenneth Dunton; Susan Schonberg; Tim Whiteaker
    Time period covered
    Aug 6, 1975 - Aug 16, 1980
    Area covered
    Description

    These data were originally collected in the 1970s and early 1980s, and archived at NODC in a text format whose column-based structure varies depending on the data record type represented by a given line of text. These text files were parsed using Python code which splits the data into separate files according to record type, and stores the data in comma-separated values format. Inputs to the Python code include the original data file, CSV files with information on how to parse each record type within the data file, and any lookups required to interpret the data, such as transforming an equipment code of "8" into "EKMAN GRAB". The CSV files with information on how to parse each record type were created by referencing parsing instructions provided by NCEI. If a given record type is not included in the actual data, then no output files for that record type are created.

    This project includes a readme file, original data files from prior investigators, code lookups, CSV files of parsing instructions, optional files created by splitting original data files into separate files by record type, output CSV files created by parsing original data files into separate files by record type, and Python scripts to perform the parsing. The output CSV files represent the dataset produced from this work. Parsing instructions for original data files as well as data codes can be found at https://www.nodc.noaa.gov/access/dataformats.html. Taxon identifiers from the Integrated Taxonomic Information System can be included in the output by the parsing code; full taxonomic information for these identifiers can be retrieved from the ITIS website, https://itis.gov/.
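
    The record-type splitting described above can be sketched in miniature (the record layout, codes, and field order here are invented for illustration; the real NODC format is fixed-width and driven by the CSV parsing-instruction files):

```python
# Toy record-type splitter (hypothetical whitespace-delimited layout; the real
# NODC files are column-based and parsed per the CSV instruction files).
equipment_lookup = {"8": "EKMAN GRAB"}  # e.g. equipment code "8" -> "EKMAN GRAB"

raw = "A 1975-08-06 8\nB Polychaeta 12\nA 1975-08-07 8\n"

# Group lines by their record type (first token), one bucket per type.
records = {}
for line in raw.splitlines():
    rtype, *fields = line.split()
    records.setdefault(rtype, []).append(fields)

# Apply the lookup to decode equipment codes in type-A records.
for fields in records["A"]:
    fields[-1] = equipment_lookup.get(fields[-1], fields[-1])

print(records["A"][0])  # ['1975-08-06', 'EKMAN GRAB']
```

In the real pipeline each bucket would then be written out as its own CSV file.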

  15. Replication Data for Are Resdistricting No-Split Rules Neutral

    • dataverse.harvard.edu
    Updated Jul 21, 2025
    Geoffrey Wise (2025). Replication Data for Are Resdistricting No-Split Rules Neutral [Dataset]. http://doi.org/10.7910/DVN/92MNTJ
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Geoffrey Wise
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication data for "Are Resdistricting No-Split Rules Neutral? Post-2020 Ohio as a Case Study": Python code, input and output files.

  16. Behaviour Biometrics Dataset

    • data.mendeley.com
    Updated Jun 20, 2022
    Nonso Nnamoko (2022). Behaviour Biometrics Dataset [Dataset]. http://doi.org/10.17632/fnf8b85kr6.1
    Explore at:
    Dataset updated
    Jun 20, 2022
    Authors
    Nonso Nnamoko
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset provides a collection of behaviour biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data was collected for use in a FinTech research project undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. The project called CyberSIgnature uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed that has a graphical user interface (GUI) similar to a standard online card payment form including fields for card type, name, card number, card verification code (cvc) and expiry date. Then, user KMT dynamics were captured while they entered fictitious card information on the GUI application.

    The dataset consists of 1,760 KMT dynamic instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry in which the user is assigned a fictitious card information (drawn at random from a pool) to enter 10 times and subsequently presented with 10 additional card information, each to be entered once. The 10 additional card information is drawn from a pool that has been assigned or to be assigned to other users. A KMT data instance is collected during each data entry iteration. Thus, a total of 20 KMT data instances (i.e., 10 legitimate and 10 illegitimate) was collected during each user entry session on the GUI application.

    The raw dataset is stored in .json format within 88 separate files. The root folder, named `behaviour_biometrics_dataset`, consists of two sub-folders, `raw_kmt_dataset` and `feature_kmt_dataset`, and a Jupyter notebook file (`kmt_feature_classification.ipynb`). Their folder and file content is described below:

    -- `raw_kmt_dataset`: this folder contains 88 files, each named `raw_kmt_user_n.json`, where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card; the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user that is assigned to the card detail, while the illegitimate class corresponds to KMT dynamics data collected from other users entering the same card detail.

    -- `feature_kmt_dataset`: this folder contains two sub-folders, namely `feature_kmt_json` and `feature_kmt_xlsx`. Each folder contains 88 files (of the relevant format: .json or .xlsx), each named `feature_kmt_user_n`, where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding `raw_kmt_user_n` file, including the class labels (legitimate = 1 or illegitimate = 0).

    -- `kmt_feature_classification.ipynb`: this file contains Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
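
    As an illustration of the kind of features such a notebook might derive (the event tuples and feature choices below are invented; the dataset's actual feature extraction is defined in the notebook itself):

```python
# Hypothetical raw keystroke events: (key, press_time, release_time) in seconds.
events = [
    ("4", 0.00, 0.09),
    ("2", 0.21, 0.32),
    ("4", 0.45, 0.53),
]

# Dwell time: how long each key is held; flight time: gap between a release
# and the next press. Both are common KMT-dynamics features.
dwell = [release - press for _, press, release in events]
flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]

print(round(sum(dwell) / len(dwell), 3))    # 0.093
print(round(sum(flight) / len(flight), 3))  # 0.125
```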

  17. Dataset for "Predicting the growth trajectory and yield of greenhouse...

    • doi.org
    • explore.openaire.eu
    • +1more
    text/x-python, zip
    Updated Apr 11, 2024
    Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin (2024). Dataset for "Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision" [Dataset]. http://doi.org/10.5281/zenodo.10957909
    Explore at:
    Available download formats: zip, text/x-python
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overall

    A strawberry dataset for the paper: Qi Yang, Licheng Liu, Junxiong Zhou, Mary Rogers, Zhenong Jin, 2024. Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision. Computers and Electronics in Agriculture, 220, 108911. https://doi.org/10.1016/j.compag.2024.108911

    Plant traits measurements

    The folder "measurement.zip" includes treatment-level and fruit-level ground truth data.

    Treatment-level

    data_dryMatter_2022.csv
    data_dryMatter_2023.csv
    data_freshMatter_2022.csv
    data_freshMatter_2023.csv
    data_fruitNumber_2022.csv
    data_fruitNumber_2023.csv
    data_plantBiomass_2022.csv
    data_plantBiomass_2023.csv

    Fruit-level

    Fruit conditions with five classes, 1-5 represent Normal, Wizened, Malformed, Wizened & Malformed, and Overripe, respectively.

    data_size_freshWeight_condition_2022_0N.csv
    data_size_freshWeight_condition_2022_50N.csv
    data_size_freshWeight_condition_2022_100N.csv
    data_size_freshWeight_condition_2022_150N.csv

    Fruit size for tagged fruits

    data_taggedFruit_diameter_2022.csv
    data_taggedFruit_diameter_2023.csv
    data_taggedFruit_length_2022.csv
    data_taggedFruit_length_2023.csv

    Fresh yield and lifespan for tagged fruits (only available in experiment 2023)

    data_taggedFruit_freshMatter_2023.csv
    data_taggedFruit_lifespan_2023.csv

    Weather data

    weather_daily_2022.csv
    weather_daily_2023.csv

    Image data with label

    Object and phenology detection

    The folder "strawberry_img_random.zip" contains images and the corresponding JSON labels for object and phenological stages detection.

    Fruit size and decimal phenological stage

    The folder "strawberry_img_tagged.zip" contains images and the corresponding JSON labels for fruit size and decimal phenological stages detection.

    For example,
    "label": "small g, 8.84, 7.62, 0.4",
    This label means the fruit has an 8.84mm diameter and 7.62mm length, 
    with the main stage being small green and the decimal stage being DS-4
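
    That label string can be unpacked directly (a small illustration of the documented comma-separated format):

```python
# Unpack the documented label format:
# "<main stage>, <diameter mm>, <length mm>, <decimal stage>"
label = "small g, 8.84, 7.62, 0.4"
stage, diameter, length, decimal = [part.strip() for part in label.split(",")]
print(stage, float(diameter), float(length), float(decimal))  # small g 8.84 7.62 0.4
```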
    

    Merge and split Data

    A Python script, "datasetProcessing.py", can be used to merge and split the image data into training and testing set.

    Pre-trained models

    models.zip

    Data collector: Dr. Qi Yang, University of Minnesota, USA. Email: qiyang577@gmail.com

    All the files belong to Prof. Zhenong Jin, University of Minnesota, USA. Email: jinzn@umn.edu

  18. Ethylene carbonate data for graph2mat

    • data.dtu.dk
    txt
    Updated Aug 6, 2024
    Arghya Bhowmik (2024). Ethylene carbonate data for graph2mat [Dataset]. http://doi.org/10.11583/DTU.26193278.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creators

    • Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234)
    • Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276)
    • Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)

    Related publication

    The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat

    Short description

    This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).

    SIESTA 5.0.0 was used to compute the dataset.

    Contents

    The dataset has two directories:

    • pseudos: Contains the pseudopotentials used for the calculation (obtained from http://www.pseudo-dojo.org/, type NC SR (ONCVPSP v0.5), PBE, standard accuracy)
    • splits: The data splits used in the published paper. Each file "splits_X.json" contains the splits for training size X.

    And then, three directories containing the calculations with different basis sets:

    • matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
    • matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
    • matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.

    Each of the basis directories has two subdirectories:

    • basis: Contains the files specifying the basis used for each atom.
    • runs: The results of running the SIESTA simulations. Contents are discussed next.

    The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:

    • RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
    • RUN.out: The log of the SIESTA run.
    • siesta.TSDE: Contains the Density and Energy Density matrices.
    • siesta.TSHS: Contains the Hamiltonian and Overlap matrices.

    Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:

    import sisl
    
    matrix = sisl.get_sile("RUN.fdf").read_X()
    

    where X is hamiltonian, overlap, density_matrix or energy_density_matrix.

    To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).

    Cite this data

    https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark

    License

    This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

  19. Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

    This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

    GitHub: github.com/ChristianBirchler/sdc-scissor

    This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.

    Usage

    Demo

    YouTube Link

    Installation

    The tool can either be run with Docker or locally using Poetry.

    When running the simulations, a working installation of BeamNG.research is required. Additionally, the simulation cannot be run in a Docker container but must run locally.

    To install the application use one of the following approaches:

    • Docker: docker build --tag sdc-scissor .
    • Poetry: poetry install

    Using the Tool

    The tool can be used with the following two commands:

    • Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
    • Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

    There are multiple commands to use. For simplifying the documentation only the command and their options are described.

    • Generation of tests:
      • generate-tests --out-path /path/to/store/tests
    • Automated labeling of Tests:
      • label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
      • Note: This only works locally with BeamNG.research installed
    • Model evaluation:
      • evaluate-models --dataset /path/to/train/set --save
    • Split train and test data:
      • split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
    • Test outcome prediction:
      • predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
    • Evaluation based on random strategy:
      • evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

    The possible parameters are always documented with --help.
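
    For illustration, the split-train-test-data step can be approximated in plain Python (a hypothetical re-implementation; the actual command's shuffling and file handling may differ):

```python
import random
import shutil
from pathlib import Path

def split_scenarios(scenarios_dir, train_dir, test_dir, train_ratio=0.8, seed=0):
    """Copy scenario files into train/test folders at the given ratio (sketch)."""
    files = sorted(Path(scenarios_dir).glob("*.json"))
    random.Random(seed).shuffle(files)
    cut = int(train_ratio * len(files))
    for dest, subset in [(train_dir, files[:cut]), (test_dir, files[cut:])]:
        Path(dest).mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy(f, Path(dest) / f.name)
    return cut, len(files) - cut
```

With 10 scenario files and the default ratio, this yields 8 training and 2 test files.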

    Linting

    The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

    poetry run flake8 .
    poetry run pylint **/*.py

    License

    The software we developed is distributed under GNU GPL license. See the LICENSE.md file.

    Contacts

    Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch

    Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch

    Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch

    Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

    Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch

    References

    • Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

    If you use this tool in your research, please cite the following papers:

    @INPROCEEDINGS{Birchler2022,
     author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
     booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
     title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
     year={2022},
    }
  20. Data from: The Impact of Traffic Lights on Modal Split and Route Choice: A...

    • researchdata.tuwien.ac.at
    bin
    Updated Jun 25, 2024
    + more versions
    Ioanna Gogousou; Ioanna Gogousou (2024). The Impact of Traffic Lights on Modal Split and Route Choice: A use-case in Vienna [Dataset]. http://doi.org/10.48436/2fw81-v5j57
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Ioanna Gogousou; Ioanna Gogousou
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Area covered
    Vienna
    Description

    The data and code scripts used for the analysis in the paper entitled "The Impact of Traffic Lights on Modal Split and Route Choice: A use-case in Vienna", submitted to AGILE (Association of Geographic Information Laboratories in Europe) 2024 Conference.

    It comprises four folders within the zip file:

    1. Data: Contains the datasets for the analysis.
    2. Code: Includes script files essential for conducting the analysis. The scripts are written in Python.
    3. Results: Includes the outcomes showcased in the associated paper.
    4. Visualizations: Includes a Jupyter notebook for the plots generated in the associated paper.

    Programming Language: Python

    For reproducibility read the README.txt file included in the zip folder.

    All data files are licensed under CC BY 4.0, all software is licensed under MIT License.
