95 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
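    The splitting step described for splitting.py can be sketched as follows. This is a minimal illustration of an 80/10/10 train/validation/test split, with the function name and ratios assumed; it is not the actual splitting.py from the deposit:

```python
import random

def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # deterministic shuffle for reproducibility
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

    Shuffling with a fixed seed keeps the split reproducible across runs.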
  2. Materials Project Time Split Data

    • figshare.com
    json
    Updated May 30, 2023
    Cite
    Sterling G. Baird; Taylor Sparks (2023). Materials Project Time Split Data [Dataset]. http://doi.org/10.6084/m9.figshare.19991516.v4
    Explore at:
    Available download formats: json
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Sterling G. Baird; Taylor Sparks
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Full and dummy snapshots (2022-06-04) of data for mp-time-split, encoded via matminer convenience functions and grabbed via the new Materials Project API. The dataset is restricted to experimentally verified compounds with no more than 52 sites. No other filtering criteria were applied. The snapshots were developed for sparks-baird/mp-time-split as a benchmark dataset for materials generative modeling. Compressed versions of the files (.gz) are also available.

    dtypes

    from pprint import pprint
    from matminer.utils.io import load_dataframe_from_json

    filepath = "insert/path/to/file/here.json"
    expt_df = load_dataframe_from_json(filepath)
    pprint(expt_df.iloc[0].apply(type).to_dict())

    The printed dictionary lists the type of each column; its keys are: discovery, energy_above_hull, formation_energy_per_atom, material_id, references, structure, theoretical, year.

    index/mpids (just the number for the index): note that material_id-s that begin with "mvc-" have the "mvc" dropped and the hyphen (minus sign) is kept to distinguish between "mp-" and "mvc-" types while still allowing for sorting. E.g. mvc-001 -> -1.

    {146: MPID(mp-146), 925: MPID(mp-925), 1282: MPID(mp-1282), 1335: MPID(mp-1335), 12778: MPID(mp-12778), 2540: MPID(mp-2540), 316: MPID(mp-316), 1395: MPID(mp-1395), 2678: MPID(mp-2678), 1281: MPID(mp-1281), 1251: MPID(mp-1251)}
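    The index convention above (drop "mvc", keep the minus sign) can be sketched with a small helper; the function name is hypothetical and not part of the dataset:

```python
def mpid_to_index(mpid: str) -> int:
    """Map 'mp-146' -> 146 and 'mvc-001' -> -1, per the sorting convention above."""
    if mpid.startswith("mvc-"):
        return -int(mpid[len("mvc-"):])  # keep the hyphen as a minus sign
    if mpid.startswith("mp-"):
        return int(mpid[len("mp-"):])
    raise ValueError(f"unrecognized material_id: {mpid}")

print(mpid_to_index("mp-146"), mpid_to_index("mvc-001"))  # 146 -1
```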

  3. NETFLIX Stock Data 2025

    • kaggle.com
    Updated Jun 13, 2025
    Cite
    Umer Haddii (2025). NETFLIX Stock Data 2025 [Dataset]. https://www.kaggle.com/datasets/umerhaddii/netflix-stock-data-2025
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 13, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Umer Haddii
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Netflix, Inc. is an American media company engaged in paid streaming and the production of films and series.

    Market cap

    Market capitalization of Netflix (NFLX)
    
    Market cap: $517.08 Billion USD
    
    

    As of June 2025 Netflix has a market cap of $517.08 Billion USD. This makes Netflix the world's 19th most valuable company by market cap according to our data. The market capitalization, commonly called market cap, is the total market value of a publicly traded company's outstanding shares and is commonly used to measure how much a company is worth.

    Revenue

    Revenue for Netflix (NFLX)
    
    Revenue in 2025: $40.17 Billion USD
    

    According to Netflix's latest financial reports, the company's current revenue (TTM) is $40.17 Billion USD. In 2024 the company made a revenue of $39.00 Billion USD, an increase over the revenue of $33.72 Billion USD in 2023. The revenue is the total amount of income that a company generates by the sale of goods or services. Unlike with the earnings, no expenses are subtracted.

    Earnings

    Earnings for Netflix (NFLX)
    
    Earnings in 2025 (TTM): $11.31 Billion USD
    
    

    According to Netflix's latest financial reports, the company's current earnings (TTM) are $11.31 Billion USD. In 2024 the company made earnings of $10.70 Billion USD, an increase over its 2023 earnings of $7.02 Billion USD. The earnings displayed on this page are the company's pretax income.

    End of Day market cap according to different sources

    On Jun 12th, 2025 the market cap of Netflix was reported to be:

    $517.08 Billion USD by Yahoo Finance

    $517.08 Billion USD by CompaniesMarketCap

    $517.21 Billion USD by Nasdaq

    Content

    Geography: USA

    Time period: May 2002 to June 2025

    Unit of analysis: Netflix Stock Data 2025

    Variables

    Variable     Description
    date         The trading date.
    open         The price at market open.
    high         The highest price for that day.
    low          The lowest price for that day.
    close        The price at market close, adjusted for splits.
    adj_close    The closing price after adjustments for all applicable splits and dividend distributions. Data is adjusted using appropriate split and dividend multipliers, adhering to Center for Research in Security Prices (CRSP) standards.
    volume       The number of shares traded on that day.
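    As a worked example of using those columns, the sketch below computes close-to-close daily returns from such a CSV with only the standard library; the inline sample rows are synthetic, and the exact header spelling is assumed to match the variable list above:

```python
import csv
import io

# A tiny synthetic stand-in for the stock CSV, with the columns described above.
sample = """date,open,high,low,close,adj_close,volume
2025-06-10,1200.0,1215.0,1195.0,1210.0,1210.0,3000000
2025-06-11,1210.0,1230.0,1205.0,1222.1,1222.1,2800000
"""

def daily_returns(f):
    """Yield (date, close-to-close return) pairs from an OHLCV CSV file object."""
    rows = list(csv.DictReader(f))
    for prev, cur in zip(rows, rows[1:]):
        prev_close = float(prev["adj_close"])
        yield cur["date"], float(cur["adj_close"]) / prev_close - 1.0

for date, ret in daily_returns(io.StringIO(sample)):
    print(date, round(ret, 4))
```

    Using adj_close rather than close avoids spurious jumps at split and dividend dates.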

    Acknowledgements

    This dataset belongs to me. I’m sharing it here for free. You may do with it as you wish.

  4. Data from: Decoding Wayfinding: Analyzing Wayfinding Processes in the...

    • researchdata.tuwien.at
    • b2find.eudat.eu
    html, pdf, zip
    Updated Mar 19, 2025
    Cite
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi (2025). Decoding Wayfinding: Analyzing Wayfinding Processes in the Outdoor Environment [Dataset]. http://doi.org/10.48436/m2ha4-t1v92
    Explore at:
    Available download formats: html, zip, pdf
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    TU Wien
    Authors
    Negar Alinaghi; Ioannis Giannopoulos; Ioannis Giannopoulos; Negar Alinaghi; Negar Alinaghi; Negar Alinaghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Cite?

    Alinaghi, N., Giannopoulos, I., Kattenbeck, M., & Raubal, M. (2025). Decoding wayfinding: analyzing wayfinding processes in the outdoor environment. International Journal of Geographical Information Science, 1–31. https://doi.org/10.1080/13658816.2025.2473599

    Link to the paper: https://www.tandfonline.com/doi/full/10.1080/13658816.2025.2473599

    Folder Structure

    The folder named “submission” contains the following:

    1. “pythonProject”: This folder contains all the Python files and subfolders needed for analysis.
    2. ijgis.yml: This file lists all the Python libraries and dependencies required to run the code.

    Setting Up the Environment

    1. Use the ijgis.yml file to create a Python project and environment. Ensure you activate the environment before running the code.
    2. The pythonProject folder contains several .py files and subfolders, each with specific functionality as described below.

    Subfolders

    1. Data_4_IJGIS

    • This folder contains the data used for the results reported in the paper.
    • Note: The data analysis that we explain in this paper already begins with the synchronization and cleaning of the recorded raw data. The published data is already synchronized and cleaned. Both the cleaned files and the merged files with features extracted for them are given in this directory. If you want to perform the segmentation and feature extraction yourself, you should run the respective Python files yourself. If not, you can use the “merged_…csv” files as input for the training.

    2. results_[DateTime] (e.g., results_20240906_15_00_13)

    • This folder will be generated when you run the code and will store the output of each step.
    • The current folder contains results created during code debugging for the submission.
    • When you run the code, a new folder with fresh results will be generated.

    Python Files

    1. helper_functions.py

    • Contains reusable functions used throughout the analysis.
    • Each function includes a description of its purpose and the input parameters required.

    2. create_sanity_plots.py

    • Generates scatter plots like those in Figure 3 of the paper.
    • Although the code has been run for all 309 trials, it can be used to check the sample data provided.
    • Output: A .png file for each column of the raw gaze and IMU recordings, color-coded with logged events.
    • Usage: Run this file to create visualizations similar to Figure 3.

    3. overlapping_sliding_window_loop.py

    • Implements overlapping sliding window segmentation and generates plots like those in Figure 4.
    • Output:
      • Two new subfolders, “Gaze” and “IMU”, will be added to the Data_4_IJGIS folder.
      • Segmented files (default: 2–10 seconds with a 1-second step size) will be saved as .csv files.
      • A visualization of the segments, similar to Figure 4, will be automatically generated.
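    The overlapping sliding-window idea (windows of 2–10 seconds with a 1-second step) can be sketched as follows; this is an illustration of the segmentation scheme, not the code from overlapping_sliding_window_loop.py:

```python
def sliding_windows(n_samples, rate_hz, window_s, step_s=1.0):
    """Yield (start, end) sample indices for overlapping windows."""
    win = int(window_s * rate_hz)
    step = int(step_s * rate_hz)
    start = 0
    while start + win <= n_samples:
        yield start, start + win
        start += step

# 10 s of 100 Hz data, 2 s windows with a 1 s step -> 9 overlapping segments
segments = list(sliding_windows(1000, 100, window_s=2.0))
print(len(segments), segments[:2])  # 9 [(0, 200), (100, 300)]
```

    In the paper's setup this would be repeated for each window length from 2 to 10 seconds, producing one set of segment files per length.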

    4. gaze_features.py & imu_features.py (Note: there has been an update to the IDT function implementation in the gaze_features.py on 19.03.2025.)

    • These files compute features as explained in Tables 1 and 2 of the paper, respectively.
    • They process the segmented recordings generated by the overlapping_sliding_window_loop.py.
    • Usage: To see how the features are calculated, run these files after the sliding-window segmentation to compute the features from the segmented data.

    5. training_prediction.py

    • This file contains the main machine learning analysis of the paper: the code for training the model, evaluating it, and using it for inference on the “monitoring part”. It covers the following steps:
    a. Data Preparation (corresponding to Section 5.1.1 of the paper)
    • Prepares the data according to the research question (RQ) described in the paper. Since this data was collected with several RQs in mind, we remove parts of the data that are not related to the RQ of this paper.
    • A function named plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out, but if you want to see visually what has been changed compared to the original data, you can comment out this line.
    b. Training/Validation/Test Split
    • Splits the data for machine learning experiments (an explanation can be found in Section 5.1.1. Preparation of data for training and inference of the paper).
    • Make sure that you follow the instructions in the comments to the code exactly.
    • Output: The split data is saved as .csv files in the results folder.
    c. Machine and Deep Learning Experiments

    This part contains three main code blocks:


    • MLP Network (Commented Out): This code was used for classification with the MLP network, and the results shown in Table 3 are from this code. If you wish to use this model, please comment out the following blocks accordingly.
    • XGBoost without Hyperparameter Tuning: If you want to run the code but do not want to spend time on the full training with hyperparameter tuning (as was done for the paper), just uncomment this part. This will give you a simple, untuned model with which you can achieve at least some results.
    • XGBoost with Hyperparameter Tuning: If you want to train the model the way we trained it for the analysis reported in the paper, use this block (the plots in Figure 7 are from this block). We ran this block with different feature sets and different segmentation files and created a simple bar chart from the saved results, shown in Figure 6.

    Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2. Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.
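    The confidence-threshold step can be illustrated generically: keep a prediction only when the maximum class probability clears a threshold, and abstain otherwise. The sketch below is a stand-in with assumed names, not the authors' code from lines 361 to 380:

```python
def confident_predictions(probas, classes, threshold=0.8):
    """Return one class label per sample, or None when the model abstains."""
    labels = []
    for row in probas:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over class probabilities
        labels.append(classes[best] if row[best] >= threshold else None)
    return labels

probas = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
print(confident_predictions(probas, ["walk", "scan"]))  # ['walk', None, 'scan']
```

    Sweeping the threshold over a validation set and inspecting accuracy versus abstention rate is one common way to pick such a value empirically.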

    d. Inference (Monitoring Part)
    • Final inference is performed using the monitoring data. This step produces a .csv file containing inferred labels.
    • Figure 8 in the paper is generated using this part of the code.

    6. sequence_analysis.py

    • Performs analysis on the inferred data, producing Figures 9 and 10 from the paper.
    • This file reads the inferred data from the previous step and performs sequence analysis as described in Sections 5.2.1 and 5.2.2.

    Licenses

    The data is licensed under CC-BY, the code is licensed under MIT.

  5. Data from: ParIce Dev/Test/Train Splits 20.05

    • repotest.clarin.is
    Updated May 28, 2020
    Cite
    Starkaður Barkarson; Steinþór Steingrímsson (2020). ParIce Dev/Test/Train Splits 20.05 [Dataset]. https://repotest.clarin.is/repository/xmlui/handle/20.500.12537/24
    Explore at:
    Dataset updated
    May 28, 2020
    Authors
    Starkaður Barkarson; Steinþór Steingrímsson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three dev/test sets for MT quality estimation created from subcorpora of ParIce. The dev/test sets contain English-Icelandic segment pairs. One of the three sets is made up of subtitle segments from OpenSubtitles, one of segments from drug descriptions distributed by the European Medical Agency (EMA) and one from EEA documents. The sets are manually annotated so all pairs are correct.

    The goal was to create dev/test sets with a total of at least 3,000 correct translation segments from each subcorpus. All segments contain four or more words on the English side.

    The OpenSubtitles set contains 1,531/1,532 segments in dev/test. Furthermore, it contains 2,277 segment pairs that have fewer than four words on the English side and 777 segment pairs that have incorrect alignments or translations. The training set contains 1,298,489 segments, which have not been manually checked for errors. The OpenSubtitles sets are compiled using a Python script that downloads the segments and creates the splits.

    The EMA set contains 2,254/2,255 segment pairs in dev/test. Furthermore, it contains 491 segment pairs that have fewer than four words on the English side and 240 segments that have incorrect alignments or translations. The training set contains 399,093 segments, which have not been manually checked for errors.

    The EEA set contains 22 whole documents. Documents with between 100 and 200 sentences were selected at random until we reached more than 3,000 sentence pairs. Alignments and translations were manually corrected for these documents. Longer sentences were split into smaller parts where possible. The split consists of 2,292/2,396 dev/test segments and 1,697,927 training segments that have not been manually checked.
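    The four-or-more-words criterion on the English side can be expressed as a simple filter; the function and pair layout below are illustrative, not the project's actual script:

```python
def filter_pairs(pairs, min_en_words=4):
    """Keep only (English, Icelandic) segment pairs with enough English words."""
    return [(en, ic) for en, ic in pairs if len(en.split()) >= min_en_words]

pairs = [("I saw it.", "Ég sá það."),
         ("The patient should take one tablet daily.",
          "Sjúklingurinn á að taka eina töflu daglega.")]
print(len(filter_pairs(pairs)))  # 1
```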


  6. DustNet - structured data and Python code to reproduce the model,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 7, 2024
    Cite
    Nowak, T. E. (2024). DustNet - structured data and Python code to reproduce the model, statistical analysis and figures [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10631953
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Nowak, T. E.
    Siegert, Stefan
    Augousti, Andy T.
    Simmons, Benno I.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and Python code used for AOD prediction with DustNet model - a Machine Learning/AI based forecasting.

    Model input data and code

    Processed MODIS AOD data (from Aqua and Terra) and selected ERA5 variables*, ready to reproduce the DustNet model results or for similar forecasting with Machine Learning. These long-term daily timeseries (2003-2022) are provided as n-dimensional NumPy arrays. The Python code to handle the data and run the DustNet model** is included as the Jupyter Notebook ‘DustNet_model_code.ipynb’. A subfolder with the data normalised and split into training/validation/testing sets is also provided, together with Python code for two additional ML based models** used for comparison (U-NET and Conv2D). Pre-trained models are also archived here as TensorFlow files.

    Model output data and code

    This dataset was constructed by running ‘DustNet_model_code.ipynb’ (see above). It consists of 1,095 days of forecast AOD data (2020-2022) produced by CAMS, the DustNet model, a naïve prediction (persistence) and gridded climatology. The ground truth raw AOD data from MODIS is provided for comparison and statistical analysis of the predictions. It is intended for a quick reproduction of the figures and statistical analysis presented in the paper introducing DustNet.
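    The naïve persistence baseline mentioned above forecasts each day using the previous day's observation. A minimal NumPy sketch with a synthetic AOD array (the array shape and names are assumptions, not the files in this archive):

```python
import numpy as np

# Synthetic stand-in for an AOD time series: shape (days, lat, lon)
aod = np.random.default_rng(0).random((5, 2, 3))

# Persistence forecast: the prediction for day t is the observation at day t-1.
forecast = aod[:-1]
truth = aod[1:]

mae = np.abs(forecast - truth).mean()  # mean absolute error of the baseline
print(forecast.shape, truth.shape)  # (4, 2, 3) (4, 2, 3)
```

    Any learned model is expected to beat this baseline; it is a standard sanity check in forecasting.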

    *datasets are NumPy arrays (v1.23) created in Python v3.8.18.

    **all ML models were created with Keras in Python v3.10.10.

  7. MD17 data for graph2mat

    • data.dtu.dk
    txt
    Updated Aug 6, 2024
    + more versions
    Cite
    Arghya Bhowmik (2024). MD17 data for graph2mat [Dataset]. http://doi.org/10.11583/DTU.26195285.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creators

    Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234) Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276) Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)

    Related publication

    The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat

    Short description

    This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).

    SIESTA 5.0.0 was used to compute the dataset.

    Contents

    The dataset has two directories:

    • pseudos: Contains the pseudopotentials used for the calculation (obtained from http://www.pseudo-dojo.org/, type NC SR (ONCVPSP v0.5), PBE, standard accuracy)
    • splits: The data splits used in the published paper. Each file "splits_X.json" contains the splits for training size X.

    And then, three directories containing the calculations with different basis sets:

    • matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
    • matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
    • matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.

    Each of the basis directories has two subdirectories:

    • basis: Contains the files specifying the basis used for each atom.
    • runs: The results of running the SIESTA simulations. Contents are discussed next.

    The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains: - RUN.fdf, geom.fdf: The input files used for the SIESTA calculation. - RUN.out: The log of the SIESTA run, which apar - siesta.TSDE: Contains the Density and Energy Density matrices. - siesta.TSHS: Contains the Hamiltonian and Overlap matrices.

    Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:

    import sisl
    
    matrix = sisl.get_sile("RUN.fdf").read_X()
    

    where X is hamiltonian, overlap, density_matrix or energy_density_matrix.

    To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).

    Cite this data

    https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark

    License

    This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

  8. KikuyuASR_trainingdataset

    • huggingface.co
    Updated Nov 20, 2024
    + more versions
    Cite
    CGIAR (2024). KikuyuASR_trainingdataset [Dataset]. https://huggingface.co/datasets/CGIAR/KikuyuASR_trainingdataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2024
    Dataset authored and provided by
    CGIAR (http://cgiar.org/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was obtained as part of the AIEP project by Digital Green and Karya from extension workers, lead farmers and farmers. Data collection process: selected users were given the option of doing a task and getting paid for it. The users were supposed to record each sentence as it appeared on the screen. The audio files thus obtained were validated by matching them against the sentences and used to fine-tune the model. Also available are the python scripts that help in processing and splitting the data into… See the full description on the dataset page: https://huggingface.co/datasets/CGIAR/KikuyuASR_trainingdataset.

  9. CYP450 80/20 splits

    • figshare.com
    txt
    Updated Jan 19, 2016
    Cite
    Daniel Siegle (2016). CYP450 80/20 splits [Dataset]. http://doi.org/10.6084/m9.figshare.1066108.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Daniel Siegle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data from an NIH HTS of 17K compounds against five isozymes of cytochrome P450 screening for inhibition. The activity score is taken from the NIH assay and merged with all the 2-D descriptors from the program Molecular Operating Environment (MOE). The datasets are separated by isozyme and then balanced between actives and inactives. Finally the balanced datasets are subject to an 80/20 training/test split. Link to python script of data manipulation...
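    The balance-then-split procedure can be sketched as follows; this is a generic illustration (downsample the majority class, then split 80/20), not the referenced script:

```python
import random

def balance_and_split(actives, inactives, test_frac=0.2, seed=0):
    """Downsample the majority class to balance, then do an 80/20 train/test split."""
    rng = random.Random(seed)
    n = min(len(actives), len(inactives))
    pool = rng.sample(actives, n) + rng.sample(inactives, n)  # balanced pool
    rng.shuffle(pool)
    n_test = int(len(pool) * test_frac)
    return pool[n_test:], pool[:n_test]   # train, test

train, test = balance_and_split(list(range(100)), list(range(100, 400)))
print(len(train), len(test))  # 160 40
```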

  10. A Dataset of Outdoor RSS Measurements for Localization

    • zenodo.org
    • data.niaid.nih.gov
    json, tiff, zip
    Updated Jul 15, 2024
    Cite
    Frost Mitchell; Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. http://doi.org/10.5281/zenodo.7259895
    Explore at:
    Available download formats: tiff, json, zip
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Frost Mitchell; Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Dataset         Sample Count    Receiver Count
    No-Tx Samples   46              10 to 25
    1-Tx Samples    4822            10 to 25
    2-Tx Samples    346             11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \(RSS = \frac{10}{N} \log_{10}\left(\sum_i^N x_i^2 \right) \)
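    Transcribing the formula above directly into code, with a synthetic constant-amplitude signal of N = 10,000 samples:

```python
import math

def rss_db(samples):
    """RSS = (10 / N) * log10(sum of x_i^2), as given above."""
    n = len(samples)
    return (10.0 / n) * math.log10(sum(x * x for x in samples))

# 10,000 samples of constant amplitude 0.01: the sum of squares is ~1.0, so RSS ~ 0
print(round(rss_db([0.01] * 10_000), 6))
```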

    Measurement Parameter    Value
    Frequency                462.7 MHz
    Radio Gain               35 dB
    Receiver Sample Rate     2 MHz
    Sample Length            N=10,000
    Band-pass Filter         6 kHz
    Transmitters             0 to 2
    Transmission Power       1 W

    Receivers consist of Ettus USRP X310 and B210 radios, and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json
    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
      data = json.load(f)
    

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    • rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
    • tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
    • metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords
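    For example, the samples can be iterated to count transmitters per sample and collect RSS values; the toy record below follows the schema described above, though the exact layout of each rx_data entry is an assumption:

```python
import json
from collections import Counter

# Toy stand-in for powder_462.7_rss_data.json, following the schema described above.
data = json.loads("""{
  "2022-04-01T10:00:00": {
    "rx_data": [[-70.1, 40.76, -111.84, "bus-1"], [-65.3, 40.77, -111.85, "rooftop-2"]],
    "tx_coords": [[40.765, -111.845]],
    "metadata": [{"power": "1W"}]
  }
}""")

# Count samples by number of active transmitters, and collect all RSS readings.
tx_counts = Counter(len(s["tx_coords"]) for s in data.values())
rss_values = [rx[0] for s in data.values() for rx in s["rx_data"]]
print(tx_counts, rss_values)
```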

    File Separations and Train/Test Splits

    In the separated_data.zip folder there are several train/test separations of the data.

    • all_data contains all the data in the main JSON file, separated by the number of transmitters.
    • stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
    • train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of each splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
      • The random split is a random 80/20 split of the data.
      • special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
      • The grid split divides the campus region in to a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occuring in each grid square are assigned to train or test. One such random assignment of grid squares makes up the grid split.
      • The seasonal split contains data separated by the month of collection, in April or July.
      • The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
      • campus.json contains the on-campus data; it is equivalent to the union of all splits, excluding unused.json.
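
    The no-adjacent-test-squares constraint of the grid split can be sketched with a greedy sampler (an illustrative re-creation, not the code that produced the published split):

```python
import random

# Greedy sketch: pick 20 test squares from a 10x10 grid such that no two
# test squares are 4-neighbors; the remaining 80 squares form the training set.
rng = random.Random(0)
cells = [(r, c) for r in range(10) for c in range(10)]
test = set()
for r, c in rng.sample(cells, len(cells)):
    neighbors = {(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)}
    if len(test) < 20 and not neighbors & test:
        test.add((r, c))
train = [cell for cell in cells if cell not in test]

print(len(train), len(test))  # 80 20
```

Because a maximal independent set on a grid with maximum degree 4 always has at least 100/5 = 20 cells, the greedy pass is guaranteed to find 20 mutually non-adjacent test squares.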

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in python:

    import rasterio as rio
    import numpy as np
    import utm
    
    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)   # a np.array containing elevation values
    dsm_resolution = dsm_object.res   # a tuple containing x,y resolution (0.5 meters) 
    dsm_transform = dsm_object.transform   # an Affine transform for conversion to UTM-12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3,3))[:2]
    utm_top_left = utm_transform @ np.array([0,0,1])
    utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])  # (col, row, 1): the affine maps x=col, y=row
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
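
    Since the affine transform above is just a linear map from pixel (col, row) indices to UTM easting/northing, the conversion can be illustrated in plain Python (the coefficients below are made up; the real values come from dsm_object.transform):

```python
# Hypothetical affine coefficients for a 0.5 m grid (real ones come from
# dsm_object.transform): easting = a*col + b*row + c, northing = d*col + e*row + f.
a, b, c = 0.5, 0.0, 425000.0
d, e, f = 0.0, -0.5, 4515000.0

def pixel_to_utm(row, col):
    # Note the (col, row) order: the affine maps x = column, y = row.
    return (a * col + b * row + c, d * col + e * row + f)

print(pixel_to_utm(0, 0))  # (425000.0, 4515000.0), the top-left corner
print(pixel_to_utm(2, 4))  # (425002.0, 4514999.0)
```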
    

    Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ

  11. MC-LSTM papers, model runs

    • search.dataone.org
    • hydroshare.org
    • +1more
    Updated Dec 30, 2023
    Jonathan Martin Frame (2023). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
    Explore at:
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    Hydroshare
    Authors
    Jonathan Martin Frame
    Time period covered
    Jan 1, 1989 - Jan 1, 2015
    Area covered
    Description

    Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

    The models here are trained/calibrated on three different time periods:

    • Standard Time Split (time split 1): the test period (1989-1999) is the same period used by previous studies, which allows us to confirm that the deep learning models (LSTM and MC-LSTM) trained for this project perform as expected relative to prior work.
    • NWM Time Split (time split 2): the second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995.
    • Return period split: the third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

    Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

    IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

    Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

    Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/

  12. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Roig, Gemma
    Choksi, Bhavin
    Schaumlöffel, Timothy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the splits of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset              AudioSet
    filename             train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                 [Speech]

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags classifying the sound of a file. Can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  13. Data from: Finding the optimal integration coefficient for a palindromic...

    • data.mendeley.com
    • research.science.eus
    • +1more
    Updated Dec 4, 2023
    Lorenzo Nagar (2023). Finding the optimal integration coefficient for a palindromic multi-stage splitting integrator in HMC applications to Bayesian inference [Dataset]. http://doi.org/10.17632/5mmh4wcdd6.1
    Explore at:
    Dataset updated
    Dec 4, 2023
    Authors
    Lorenzo Nagar
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present the tables of integration coefficients for the 2- and 3-stage adaptive splitting integrators derived for Hamiltonian Monte Carlo (HMC) using the Adaptive Integration Approach s-AIA introduced in

    • Nagar, L., Fernández-Pendás, M., Sanz-Serna, J. M., Akhmatskaya, E. (2023). Adaptive multi-stage integration schemes for Hamiltonian Monte Carlo. arXiv:2307.02096. doi:10.48550/arXiv.2307.02096 .

    The tables provide the maps that assign the optimal (in terms of the best conservation of energy for harmonic forces) integration coefficient for a k-stage palindromic splitting integrator to a nondimensional simulation step size in the stability interval (0, 2k).

    The repository includes the two tables for 2- and 3-stage s-AIA, a Python script that provides the optimal integration coefficient for a user-chosen dimensional step size, two .txt files containing the values of the optimal integration coefficients for 2- and 3-stage s-AIA used by the Python script, and a readme.pdf file describing the s-AIA methodology and the usage guidelines for the tables.
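
    Conceptually, the script maps a nondimensional step size to a tabulated coefficient by interpolation; a minimal sketch, with invented placeholder values rather than the shipped table entries (the Python script in the repository is the authoritative implementation):

```python
import bisect

# Hypothetical fragment of a 2-stage table: nondimensional step size h -> coefficient b.
# Real values come from the .txt tables shipped with the dataset.
table_h = [0.1, 0.5, 1.0, 2.0, 3.0, 4.0]
table_b = [0.2113, 0.2204, 0.2300, 0.2481, 0.2665, 0.2900]

def optimal_coefficient(h):
    # Linearly interpolate within the tabulated range of the stability interval.
    if not table_h[0] <= h <= table_h[-1]:
        raise ValueError("step size outside tabulated range")
    i = bisect.bisect_left(table_h, h)
    if table_h[i] == h:
        return table_b[i]
    h0, h1 = table_h[i - 1], table_h[i]
    b0, b1 = table_b[i - 1], table_b[i]
    return b0 + (b1 - b0) * (h - h0) / (h1 - h0)

print(round(optimal_coefficient(0.75), 4))  # 0.2252
```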

  14. Aggregated Beaufort Sea benthic infauna data from the National Oceanographic...

    • search.dataone.org
    • knb.ecoinformatics.org
    • +2more
    Updated Apr 29, 2021
    Kenneth Dunton; Susan Schonberg; Tim Whiteaker (2021). Aggregated Beaufort Sea benthic infauna data from the National Oceanographic Data Center (NODC), 1971-1980 [Dataset]. http://doi.org/10.24431/rw1k57r
    Explore at:
    Dataset updated
    Apr 29, 2021
    Dataset provided by
    Research Workspace
    Authors
    Kenneth Dunton; Susan Schonberg; Tim Whiteaker
    Time period covered
    Aug 6, 1975 - Aug 16, 1980
    Area covered
    Description

    These data were originally collected in the 1970s and early 1980s, and archived at NODC in a text format whose column-based structure varies depending on the data record type represented by a given line of text. These text files were parsed using Python code which splits the data into separate files according to record type, and stores the data in comma-separated values format. Inputs to the Python code include the original data file, CSV files with information on how to parse each record type within the data file, and any lookups required to interpret the data, such as transforming an equipment code of "8" into "EKMAN GRAB". The CSV files with information on how to parse each record type were created by referencing parsing instructions provided by NCEI. If a given record type is not included in the actual data, then no output files for that record type are created.

    This project includes a readme file, original data files from prior investigators, code lookups, CSV files of parsing instructions, optional files created by splitting original data files into separate files by record type, output CSV files created by parsing original data files into separate files by record type, and Python scripts to perform the parsing. The output CSV files represent the dataset produced from this work. Parsing instructions for original data files as well as data codes can be found at https://www.nodc.noaa.gov/access/dataformats.html. Taxon identifiers from the Integrated Taxonomic Information System can be included in the output by the parsing code; full taxonomic information for these identifiers can be retrieved from the ITIS website, https://itis.gov/.
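
    The record-type splitting described above can be sketched in miniature (the record layout, codes, and field order here are invented for illustration; the real NODC format is fixed-width and driven by the CSV parsing-instruction files):

```python
# Toy record-type splitter (hypothetical whitespace-delimited layout; the real
# NODC files are column-based and parsed per the CSV instruction files).
equipment_lookup = {"8": "EKMAN GRAB"}  # e.g. equipment code "8" -> "EKMAN GRAB"

raw = "A 1975-08-06 8\nB Polychaeta 12\nA 1975-08-07 8\n"

# Group lines by their record type (first token), one bucket per type.
records = {}
for line in raw.splitlines():
    rtype, *fields = line.split()
    records.setdefault(rtype, []).append(fields)

# Apply the lookup to decode equipment codes in type-A records.
for fields in records["A"]:
    fields[-1] = equipment_lookup.get(fields[-1], fields[-1])

print(records["A"][0])  # ['1975-08-06', 'EKMAN GRAB']
```

In the real pipeline each bucket would then be written out as its own CSV file.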

  15. Replication Data for Are Resdistricting No-Split Rules Neutral

    • dataverse.harvard.edu
    Updated Jul 21, 2025
    Geoffrey Wise (2025). Replication Data for Are Resdistricting No-Split Rules Neutral [Dataset]. http://doi.org/10.7910/DVN/92MNTJ
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Geoffrey Wise
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Replication data for "Are Resdistricting No-Split Rules Neutral? Post-2020 Ohio as a Case Study": Python code, input and output files.

  16. Behaviour Biometrics Dataset

    • data.mendeley.com
    Updated Jun 20, 2022
    Nonso Nnamoko (2022). Behaviour Biometrics Dataset [Dataset]. http://doi.org/10.17632/fnf8b85kr6.1
    Explore at:
    Dataset updated
    Jun 20, 2022
    Authors
    Nonso Nnamoko
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset provides a collection of behaviour biometrics data (commonly known as Keyboard, Mouse and Touchscreen (KMT) dynamics). The data was collected for use in a FinTech research project undertaken by academics and researchers at Computer Science Department, Edge Hill University, United Kingdom. The project called CyberSIgnature uses KMT dynamics data to distinguish between legitimate card owners and fraudsters. An application was developed that has a graphical user interface (GUI) similar to a standard online card payment form including fields for card type, name, card number, card verification code (cvc) and expiry date. Then, user KMT dynamics were captured while they entered fictitious card information on the GUI application.

    The dataset consists of 1,760 KMT dynamic instances collected over 88 user sessions on the GUI application. Each user session involves 20 iterations of data entry in which the user is assigned a fictitious card information (drawn at random from a pool) to enter 10 times and subsequently presented with 10 additional card information, each to be entered once. The 10 additional card information is drawn from a pool that has been assigned or to be assigned to other users. A KMT data instance is collected during each data entry iteration. Thus, a total of 20 KMT data instances (i.e., 10 legitimate and 10 illegitimate) was collected during each user entry session on the GUI application.

    The raw dataset is stored in .json format within 88 separate files. The root folder, named `behaviour_biometrics_dataset`, consists of two sub-folders, `raw_kmt_dataset` and `feature_kmt_dataset`, and a Jupyter notebook file (`kmt_feature_classification.ipynb`). Their folder and file content is described below:

    -- `raw_kmt_dataset`: this folder contains 88 files, each named `raw_kmt_user_n.json`, where n is a number from 0001 to 0088. Each file contains 20 instances of KMT dynamics data corresponding to a given fictitious card; the data instances are equally split between legitimate (n = 10) and illegitimate (n = 10) classes. The legitimate class corresponds to KMT dynamics captured from the user that is assigned to the card detail, while the illegitimate class corresponds to KMT dynamics data collected from other users entering the same card detail.

    -- `feature_kmt_dataset`: this folder contains two sub-folders, namely `feature_kmt_json` and `feature_kmt_xlsx`. Each folder contains 88 files (of the relevant format: .json or .xlsx), each named `feature_kmt_user_n`, where n is a number from 0001 to 0088. Each file contains 20 instances of features extracted from the corresponding `raw_kmt_user_n` file, including the class labels (legitimate = 1 or illegitimate = 0).

    -- `kmt_feature_classification.ipynb`: this file contains Python code necessary to generate features from the raw KMT files and apply a simple machine learning classification task to generate results. The code is designed to run with minimal effort from the user.
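
    As an illustration of the kind of features such a notebook might derive (the event tuples and feature choices below are invented; the dataset's actual feature extraction is defined in the notebook itself):

```python
# Hypothetical raw keystroke events: (key, press_time, release_time) in seconds.
events = [
    ("4", 0.00, 0.09),
    ("2", 0.21, 0.32),
    ("4", 0.45, 0.53),
]

# Dwell time: how long each key is held; flight time: gap between a release
# and the next press. Both are common KMT-dynamics features.
dwell = [release - press for _, press, release in events]
flight = [events[i + 1][1] - events[i][2] for i in range(len(events) - 1)]

print(round(sum(dwell) / len(dwell), 3))    # 0.093
print(round(sum(flight) / len(flight), 3))  # 0.125
```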

  17. Dataset for "Predicting the growth trajectory and yield of greenhouse...

    • doi.org
    • explore.openaire.eu
    • +1more
    text/x-python, zip
    Updated Apr 11, 2024
    Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin (2024). Dataset for "Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision" [Dataset]. http://doi.org/10.5281/zenodo.10957909
    Explore at:
    Available download formats: zip, text/x-python
    Dataset updated
    Apr 11, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Qi Yang; Qi Yang; Zhenong Jin; Zhenong Jin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overall

    A strawberry dataset for the paper: Qi Yang, Licheng Liu, Junxiong Zhou, Mary Rogers, Zhenong Jin, 2024. Predicting the growth trajectory and yield of greenhouse strawberries based on knowledge-guided computer vision. Computers and Electronics in Agriculture, 220, 108911. https://doi.org/10.1016/j.compag.2024.108911

    Plant traits measurements

    The folder "measurement.zip" includes treatment-level and fruit-level ground truth data.

    Treatment-level

    data_dryMatter_2022.csv
    data_dryMatter_2023.csv
    data_freshMatter_2022.csv
    data_freshMatter_2023.csv
    data_fruitNumber_2022.csv
    data_fruitNumber_2023.csv
    data_plantBiomass_2022.csv
    data_plantBiomass_2023.csv

    Fruit-level

    Fruit conditions with five classes, 1-5 represent Normal, Wizened, Malformed, Wizened & Malformed, and Overripe, respectively.

    data_size_freshWeight_condition_2022_0N.csv
    data_size_freshWeight_condition_2022_50N.csv
    data_size_freshWeight_condition_2022_100N.csv
    data_size_freshWeight_condition_2022_150N.csv

    Fruit size for tagged fruits

    data_taggedFruit_diameter_2022.csv
    data_taggedFruit_diameter_2023.csv
    data_taggedFruit_length_2022.csv
    data_taggedFruit_length_2023.csv

    Fresh yield and lifespan for tagged fruits (only available in experiment 2023)

    data_taggedFruit_freshMatter_2023.csv
    data_taggedFruit_lifespan_2023.csv

    Weather data

    weather_daily_2022.csv
    weather_daily_2023.csv

    Image data with label

    Object and phenology detection

    The folder "strawberry_img_random.zip" contains images and the corresponding JSON labels for object and phenological stages detection.

    Fruit size and decimal phenological stage

    The folder "strawberry_img_tagged.zip" contains images and the corresponding JSON labels for fruit size and decimal phenological stages detection.

    For example,
    "label": "small g, 8.84, 7.62, 0.4",
    This label means the fruit has an 8.84mm diameter and 7.62mm length, 
    with the main stage being small green and the decimal stage being DS-4
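
    That label string can be unpacked directly (a small illustration of the documented comma-separated format):

```python
# Unpack the documented label format:
# "<main stage>, <diameter mm>, <length mm>, <decimal stage>"
label = "small g, 8.84, 7.62, 0.4"
stage, diameter, length, decimal = [part.strip() for part in label.split(",")]
print(stage, float(diameter), float(length), float(decimal))  # small g 8.84 7.62 0.4
```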
    

    Merge and split Data

    A Python script, "datasetProcessing.py", can be used to merge and split the image data into training and testing set.

    Pre-trained models

    models.zip

    Data collector: Dr. Qi Yang, University of Minnesota, USA. Email: qiyang577@gmail.com

    All the files belong to Prof. Zhenong Jin, University of Minnesota, USA. Email: jinzn@umn.edu

  18. Ethylene carbonate data for graph2mat

    • data.dtu.dk
    txt
    Updated Aug 6, 2024
    Arghya Bhowmik (2024). Ethylene carbonate data for graph2mat [Dataset]. http://doi.org/10.11583/DTU.26193278.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Technical University of Denmark
    Authors
    Arghya Bhowmik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Creators

    • Pol Febrer (pol.febrer@icn2.cat, ORCID 0000-0003-0904-2234)
    • Peter Bjorn Jorgensen (peterbjorgensen@gmail.com, ORCID 0000-0003-4404-7276)
    • Arghya Bhowmik (arbh@dtu.dk, ORCID 0000-0003-3198-5116)

    Related publication

    The dataset is published as part of the paper: "GRAPH2MAT: UNIVERSAL GRAPH TO MATRIX CONVERSION FOR ELECTRON DENSITY PREDICTION" (https://doi.org/10.26434/chemrxiv-2024-j4g21) https://github.com/BIG-MAP/graph2mat

    Short description

    This dataset contains the Hamiltonian, Overlap, Density and Energy Density matrices from SIESTA calculations of a subset of the MD17 aspirin dataset. The subset is taken from the third split in (https://doi.org/10.6084/m9.figshare.12672038.v3).

    SIESTA 5.0.0 was used to compute the dataset.

    Contents

    The dataset has two directories:

    • pseudos: Contains the pseudopotentials used for the calculation (obtained from http://www.pseudo-dojo.org/, type NC SR (ONCVPSP v0.5), PBE, standard accuracy)
    • splits: The data splits used in the published paper. Each file "splits_X.json" contains the splits for training size X.

    And then, three directories containing the calculations with different basis sets:

    • matrix_dataset_defsplit: Uses the default split-valence DZP basis in SIESTA.
    • matrix_dataset_optimsplit: Uses a split-valence DZP basis optimized for aspirin.
    • matrix_dataset_defnodes: Uses the default nodes DZP basis in SIESTA.

    Each of the basis directories has two subdirectories:

    • basis: Contains the files specifying the basis used for each atom.
    • runs: The results of running the SIESTA simulations. Contents are discussed next.

    The "runs" directory contains one directory for each run, named with the index of the run. Each directory contains:

    • RUN.fdf, geom.fdf: The input files used for the SIESTA calculation.
    • RUN.out: The log of the SIESTA run.
    • siesta.TSDE: Contains the Density and Energy Density matrices.
    • siesta.TSHS: Contains the Hamiltonian and Overlap matrices.

    Each matrix can be read using the sisl python package (https://github.com/zerothi/sisl) like:

    import sisl
    
    matrix = sisl.get_sile("RUN.fdf").read_X()
    

    where X is hamiltonian, overlap, density_matrix or energy_density_matrix.

    To reproduce the results presented in the paper, follow the documentation of the graph2mat package (https://github.com/BIG-MAP/graph2mat).

    Cite this data

    https://doi.org/10.11583/DTU.c.7310005 © 2024 Technical University of Denmark

    License

    This dataset is published under the CC BY 4.0 license. This license allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator.

  19. Dataset for Cost-effective Simulation-based Test Selection in Self-driving...

    • zenodo.org
    • data.niaid.nih.gov
    pdf, zip
    Updated Jul 17, 2024
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella (2024). Dataset for Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor [Dataset]. http://doi.org/10.5281/zenodo.5914130
    Explore at:
    Available download formats: zip, pdf
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella; Christian Birchler; Nicolas Ganz; Sajad Khatiri; Alessio Gambi; Sebastiano Panichella
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDC-Scissor tool for Cost-effective Simulation-based Test Selection in Self-driving Cars Software

    This dataset provides test cases for self-driving cars with the BeamNG simulator. Check out the repository and demo video to get started.

    GitHub: github.com/ChristianBirchler/sdc-scissor

    This project extends the tool competition platform from the Cyber-Physical Systems Testing Competition, which was part of the SBST Workshop in 2021.

    Usage

    Demo

    YouTube Link

    Installation

    The tool can either be run with Docker or locally using Poetry.

    When running the simulations, a working installation of BeamNG.research is required. Additionally, the simulation cannot be run in a Docker container but must run locally.

    To install the application use one of the following approaches:

    • Docker: docker build --tag sdc-scissor .
    • Poetry: poetry install

    Using the Tool

    The tool can be used with the following two commands:

    • Docker: docker run --volume "$(pwd)/results:/out" --rm sdc-scissor [COMMAND] [OPTIONS] (this will write all files written to /out to the local folder results)
    • Poetry: poetry run python sdc-scissor.py [COMMAND] [OPTIONS]

    There are multiple commands to use. For simplifying the documentation only the command and their options are described.

    • Generation of tests:
      • generate-tests --out-path /path/to/store/tests
    • Automated labeling of Tests:
      • label-tests --road-scenarios /path/to/tests --result-folder /path/to/store/labeled/tests
      • Note: This only works locally with BeamNG.research installed
    • Model evaluation:
      • evaluate-models --dataset /path/to/train/set --save
    • Split train and test data:
      • split-train-test-data --scenarios /path/to/scenarios --train-dir /path/for/train/data --test-dir /path/for/test/data --train-ratio 0.8
    • Test outcome prediction:
      • predict-tests --scenarios /path/to/scenarios --classifier /path/to/model.joblib
    • Evaluation based on random strategy:
      • evaluate --scenarios /path/to/test/scenarios --classifier /path/to/model.joblib

    The possible parameters are always documented with --help.
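
    For illustration, the split-train-test-data step can be approximated in plain Python (a hypothetical re-implementation; the actual command's shuffling and file handling may differ):

```python
import random
import shutil
from pathlib import Path

def split_scenarios(scenarios_dir, train_dir, test_dir, train_ratio=0.8, seed=0):
    """Copy scenario files into train/test folders at the given ratio (sketch)."""
    files = sorted(Path(scenarios_dir).glob("*.json"))
    random.Random(seed).shuffle(files)
    cut = int(train_ratio * len(files))
    for dest, subset in [(train_dir, files[:cut]), (test_dir, files[cut:])]:
        Path(dest).mkdir(parents=True, exist_ok=True)
        for f in subset:
            shutil.copy(f, Path(dest) / f.name)
    return cut, len(files) - cut
```

With 10 scenario files and the default ratio, this yields 8 training and 2 test files.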

    Linting

    The tool is verified with the linters flake8 and pylint. These are automatically enabled in Visual Studio Code and can be run manually with the following commands:

    poetry run flake8 .
    poetry run pylint **/*.py

    License

    The software we developed is distributed under GNU GPL license. See the LICENSE.md file.

    Contacts

    Christian Birchler - Zurich University of Applied Science (ZHAW), Switzerland - birc@zhaw.ch

    Nicolas Ganz - Zurich University of Applied Science (ZHAW), Switzerland - gann@zhaw.ch

    Sajad Khatiri - Zurich University of Applied Science (ZHAW), Switzerland - mazr@zhaw.ch

    Dr. Alessio Gambi - Passau University, Germany - alessio.gambi@uni-passau.de

    Dr. Sebastiano Panichella - Zurich University of Applied Science (ZHAW), Switzerland - panc@zhaw.ch

    References

    • Christian Birchler, Nicolas Ganz, Sajad Khatiri, Alessio Gambi, and Sebastiano Panichella. 2022. Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor. In 2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE.

    If you use this tool in your research, please cite the following papers:

    @INPROCEEDINGS{Birchler2022,
     author={Birchler, Christian and Ganz, Nicolas and Khatiri, Sajad and Gambi, Alessio and Panichella, Sebastiano},
     booktitle={2022 IEEE 29th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
     title={Cost-effective Simulation-based Test Selection in Self-driving Cars Software with SDC-Scissor},
     year={2022},
    }
  20. Data from: The Impact of Traffic Lights on Modal Split and Route Choice: A...

    • researchdata.tuwien.ac.at
    bin
    Updated Jun 25, 2024
    + more versions
    Ioanna Gogousou; Ioanna Gogousou (2024). The Impact of Traffic Lights on Modal Split and Route Choice: A use-case in Vienna [Dataset]. http://doi.org/10.48436/2fw81-v5j57
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Ioanna Gogousou; Ioanna Gogousou
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2024
    Area covered
    Vienna
    Description

    The data and code scripts used for the analysis in the paper entitled "The Impact of Traffic Lights on Modal Split and Route Choice: A use-case in Vienna", submitted to AGILE (Association of Geographic Information Laboratories in Europe) 2024 Conference.

    It comprises four folders within the zip file:

    1. Data: Contains the datasets for the analysis.
    2. Code: Includes script files essential for conducting the analysis. The scripts are written in Python.
    3. Results: Includes the outcomes showcased in the associated paper.
    4. Visualizations: Includes a Jupyter notebook for the plots generated in the associated paper.

    Programming Language: Python

    For reproducibility read the README.txt file included in the zip folder.

    All data files are licensed under CC BY 4.0, all software is licensed under MIT License.
