18 datasets found
  1. Replication package for: Altered Histories in Version Control System...

    • zenodo.org
    bin, zip
    Updated Jun 2, 2025
    Cite
    Anonymous; Anonymous (2025). Replication package for: Altered Histories in Version Control System Repositories: Evidence from the Trenches [Dataset]. http://doi.org/10.5281/zenodo.15558282
    Explore at:
Available download formats: bin, zip
    Dataset updated
    Jun 2, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # History Alterations - Replication Package


    This repository contains the complete replication package for the research article Altered Histories in Version Control System Repositories: Evidence from the Trenches. The package provides tools to detect, analyze, and categorize Git history alterations across software repositories, along with Jupyter notebooks to reproduce the analysis presented in the paper.

    ## 📋 Table of Contents


    ## 🔍 Overview

    This replication package enables researchers to reproduce the analysis of altered Git histories in software repositories archived by Software Heritage. The study investigates how and why Git histories are modified over time, providing insights into developer practices and repository maintenance patterns.

    Main Research Questions:

    - How prevalent are Git history alterations in open-source repositories?
    - What types of changes are most commonly made to Git histories?
    - What are the root causes of these alterations?
    - How do these practices vary across different types of repositories?

    ## 📁 Repository Structure

```
├── README.md                        # This file
├── data/                            # Pre-computed datasets
│   ├── ...
├── altered-history/                 # Main analysis tool
│   ├── src/                         # Rust source code
│   ├── notebooks/                   # Analysis notebooks
│   │   ├── analysis.ipynb           # Main analysis notebook
│   │   ├── build_analysis_dataset.ipynb
│   │   └── utils_analysis.py        # Analysis utilities
│   └── README.md
├── git-historian/                   # History checking tool
│   ├── src/                         # Rust source code
│   └── README.md
├── modified-files/                  # File modification analysis tool
│   ├── src/                         # Rust source code
│   ├── notebooks/                   # Additional analysis notebooks
│   │   ├── license_analysis.ipynb
│   │   ├── license_categorization.py
│   │   ├── secret-analysis.ipynb
│   │   └── swh_license_files.py
│   └── README.md
```

    ## 🚀 Quick Start

    ### Prerequisites

    - Rust (latest stable version)
    - Python 3.8+ with Jupyter
    - PostgreSQL (for database operations)
    - Git (for repository analysis)

    ### Installation

    1. Clone the repository:
```bash
git clone <repository-url>
cd altered-histories-tool-replication-pkg
```

    2. Unzip all directories

    3. Install Python dependencies:
```bash
pip install pandas matplotlib seaborn jupyter plotly numpy
```

    4. Build the Rust tools (optional, for dataset generation):
```bash
cd altered-history && cargo build --release && cd ..
cd git-historian && cargo build --release && cd ..
cd modified-files && cargo build --release && cd ..
```

    ## 📊 Reproducing the Analysis

    ### Option 1: Using Pre-computed Data (Recommended)

    The data/ directory contains pre-computed datasets that allow you to reproduce all analyses without running the computationally intensive data collection process.

    1. Open the main analysis notebook:
```bash
cd altered-history/notebooks
jupyter notebook analysis.ipynb
```

    2. Run all cells to reproduce the complete analysis.

    3. Explore additional analyses:

    Modify notebooks at will to explore the dataframe.
```bash
# Build analysis dataset (shows data preparation)
jupyter notebook build_analysis_dataset.ipynb

# License-related analysis
cd ../../modified-files/notebooks
jupyter notebook license_analysis.ipynb

# Security and secrets analysis
jupyter notebook secret-analysis.ipynb
```

    ### Option 2: Regenerating the Dataset

    To reproduce the complete data collection and analysis pipeline:

    1. Download Software Heritage datasets (see individual tool READMEs)
    2. Configure database connections in each tool
    3. Run the analysis pipeline following the step-by-step instructions in each tool's README
    4. Process results using the provided notebooks

    Note: Complete dataset regeneration requires significant computational resources and time (potentially weeks for large datasets).

    ## 📋 Data

    The data/ directory contains several key datasets including:

    - res.pkl: Main analysis results containing categorized alterations
    - stars_without_dup.pkl: Repository popularity metrics (GitHub stars)
    - visit_type.pkl: Classification of repository visit patterns
    - altered_histories_2024_08_23.dump: PostgreSQL database dump for git-historian tool
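
For quick inspection, a minimal pandas sketch for loading the pre-computed files (assuming the pickles were written by pandas; the column layouts are not documented here):

```python
# Minimal sketch: inspect the pre-computed datasets with pandas.
# Assumes the pickles were written by pandas; adjust paths as needed.
import pandas as pd

res = pd.read_pickle("data/res.pkl")                   # categorized alterations
stars = pd.read_pickle("data/stars_without_dup.pkl")   # GitHub star counts
visit_type = pd.read_pickle("data/visit_type.pkl")     # visit-type classification

for name, obj in [("res", res), ("stars", stars), ("visit_type", visit_type)]:
    print(name, getattr(obj, "shape", None))
    print(obj.head() if hasattr(obj, "head") else obj)
```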

    ## 🛠️ Tools Description

    ### 1. altered-history

    Purpose: Detects and categorizes Git history alterations in Software Heritage archives.

    Key Features:

    - Three-step analysis pipeline (detection → root cause → categorization)
    - Parallel processing for large datasets
    - Comprehensive alteration taxonomy

    Usage: See altered-history/README.md for detailed instructions.

    ### 2. git-historian

    Purpose: Checks individual repositories against the database of known alterations.

    Key Features:

    - PostgreSQL integration
    - Git hook integration for automated checking
    - Caching system for performance

    Usage: See git-historian/README.md for detailed instructions.

    ### 3. modified-files

    Purpose: Analyzes file-level modifications and their patterns.

    Key Features:

    - File modification tracking
    - License and security analysis
    - Integration with Software Heritage graph

    Usage: See modified-files/README.md for detailed instructions.

    ## 📋 Requirements

    ### System Requirements

    - Memory: Minimum 16GB RAM (1.5TB+ recommended for full dataset processing)
    - Storage: 600GB+ free space for complete datasets
    - CPU: Multi-core processor recommended for parallel processing

    ## 🔄 Reproducibility Notes

    1. Deterministic Results: The analysis notebooks will produce identical results when run with the provided datasets.

    2. Versioning: All tools are pinned to specific versions to ensure reproducibility.

    3. Random Seeds: Where applicable, random seeds are fixed in the analysis code.

2. Geographic Diversity in Public Code Contributions — Replication Package

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Mar 31, 2022
    Cite
    Stefano Zacchiroli (2022). Geographic Diversity in Public Code Contributions — Replication Package [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6390354
    Explore at:
    Dataset updated
    Mar 31, 2022
    Dataset provided by
    Stefano Zacchiroli
    Davide Rossi
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geographic Diversity in Public Code Contributions - Replication Package

    This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471

    This document comes with the software needed to mine and analyze the data presented in the paper.

    Prerequisites

These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:

    click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2

    Initial data

    swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.

    names.tab - forenames and surnames per country with their frequency

    zones.acc.tab - countries/territories, timezones, population and world zones

c_c.tab - ccTLD entities - world zones matches

    Data preparation

    Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst

    sh> ./export.sh

    Run the authors cleanup script to create authors--clean.csv.zst

    sh> ./cleanup.sh authors.csv.zst

    Filter out implausible names and create authors--plausible.csv.zst

    sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst

    Zone detection by email

    Run the email detection script to create author-country-by-email.tab.zst

    sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst

    Database creation and initial data ingestion

    Create the PostgreSQL DB

    sh> createdb zones-commit

Note that, from now on, commands shown with the psql> prompt are assumed to be executed in psql connected to the zones-commit database.

    Import data into PostgreSQL DB

    sh> ./import_data.sh

    Zone detection by name

Extract commit data from the DB and create commits.tab, which is used as input for the zone detection script

    sh> psql -f extract_commits.sql zones-commit

    Run the world zone detection script to create commit_zones.tab.zst

sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst

Use ./assign_world_zone.py --help if you are interested in changing the script parameters.

    Ingest zones assignment data into the DB

    psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''

    Extraction and graphs

Run the script to execute the queries that extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).

    sh> ./extract_data.sh

    Run the script to create the graphs from all the previously extracted tabfiles.

    sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf

  3. Optimal Bayesian Experimental Design Version 1.2.0

    • datasets.ai
    • catalog.data.gov
    0, 57
    Updated Oct 7, 2024
    + more versions
    Cite
    National Institute of Standards and Technology (2024). Optimal Bayesian Experimental Design Version 1.2.0 [Dataset]. https://datasets.ai/datasets/optimal-bayesian-experimental-design-version-1-2-0
    Explore at:
Available download formats: 0, 57
    Dataset updated
    Oct 7, 2024
    Dataset authored and provided by
National Institute of Standards and Technology (http://www.nist.gov/)
    Description

Python module 'optbayesexpt' uses optimal Bayesian experimental design methods to control measurement settings in order to efficiently determine model parameters. Given a parametric model - analogous to a fitting function - Bayesian inference uses each measurement 'data point' to refine model parameters. Using this information, the software suggests measurement settings that are likely to efficiently reduce uncertainties. A TCP socket interface allows the software to be used from experimental control software written in other programming languages. Code is developed in Python, and shared via GitHub's USNISTGOV organization.
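
As a rough illustration of the measure, update, and choose loop that such software automates, here is a conceptual NumPy sketch; it is not the optbayesexpt API, and the model, noise level, and setting grid are illustrative assumptions:

```python
# Conceptual sketch of sequential Bayesian experimental design (not the optbayesexpt API).
import numpy as np

rng = np.random.default_rng(0)
settings = np.linspace(0.0, 10.0, 200)     # candidate measurement settings
theta = np.linspace(0.5, 2.0, 500)         # grid over the unknown model parameter
prior = np.ones_like(theta) / theta.size   # flat prior over the grid
sigma = 0.1                                # assumed measurement noise (standard deviation)
true_theta = 1.3                           # hidden ground truth, used only to simulate data

def model(x, th):
    """Illustrative parametric model, analogous to a fitting function."""
    return np.sin(th * x)

for _ in range(20):
    # Choose the setting where plausible parameters disagree most (a variance heuristic
    # standing in for the information-gain criterion used by optimal-design software).
    preds = model(settings[:, None], theta[None, :])        # shape (n_settings, n_theta)
    pred_mean = preds @ prior
    pred_var = ((preds - pred_mean[:, None]) ** 2) @ prior
    x = settings[np.argmax(pred_var)]

    # Simulate a noisy measurement at that setting, then do the Bayes update on the grid.
    y = model(x, true_theta) + rng.normal(0.0, sigma)
    prior *= np.exp(-0.5 * ((y - model(x, theta)) / sigma) ** 2)
    prior /= prior.sum()

print("posterior mean of the parameter:", float(np.sum(theta * prior)))
```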

  4. Data from: Open-source quality control routine and multi-year power...

    • zenodo.org
    • explore.openaire.eu
    csv, text/x-python +1
    Updated Apr 28, 2024
    Cite
    Lennard Visser; Lennard Visser; Boudewijn Elsinga; Tarek AlSkaif; Tarek AlSkaif; Wilfried van Sark; Wilfried van Sark; Boudewijn Elsinga (2024). Open-source quality control routine and multi-year power generation data of 175 PV systems [Dataset]. http://doi.org/10.5281/zenodo.10953360
    Explore at:
Available download formats: text/x-python, csv, zip
    Dataset updated
    Apr 28, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Lennard Visser; Lennard Visser; Boudewijn Elsinga; Tarek AlSkaif; Tarek AlSkaif; Wilfried van Sark; Wilfried van Sark; Boudewijn Elsinga
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    The repository contains an extensive dataset of PV power measurements and a python package (qcpv) for quality controlling PV power measurements. The dataset features four years (2014-2017) of power measurements of 175 rooftop mounted residential PV systems located in Utrecht, the Netherlands. The power measurements have a 1-min resolution.

    PV power measurements

    Three different versions of the power measurements are included in three data-subsets in the repository. Unfiltered power measurements are enclosed in unfiltered_pv_power_measurements.csv. Filtered power measurements are included as filtered_pv_power_measurements_sc.csv and filtered_pv_power_measurements_ac.csv. The former dataset contains the quality controlled power measurements after running single system filters only, the latter dataset considers the output after running both single and across system filters. The metadata of the PV systems is added in metadata.csv. This file holds for each PV system a unique ID, start and end time of registered power measurements, estimated DC and AC capacity, tilt and azimuth angle, annual yield and mapped grids of the system location (north, south, west and east boundary).
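
A minimal pandas sketch for loading these files (the timestamp index and per-system column layout are assumptions; inspect the headers first):

```python
# Minimal sketch: load the PV power measurements and system metadata.
# Column layouts are assumptions; check df.columns before analysis.
import pandas as pd

raw = pd.read_csv("unfiltered_pv_power_measurements.csv", index_col=0, parse_dates=True)
filtered_sc = pd.read_csv("filtered_pv_power_measurements_sc.csv", index_col=0, parse_dates=True)
meta = pd.read_csv("metadata.csv")

print(raw.shape, filtered_sc.shape)
print(meta.head())

# Example: resample one (assumed per-system) column from 1-min to hourly mean power in W.
first_system = raw.columns[0]
hourly = raw[first_system].resample("1h").mean()
print(hourly.head())
```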

    Quality control routine

    An open-source quality control routine that can be applied to filter erroneous PV power measurements is added to the repository in the form of the Python package qcpv (qcpv.py). Sample code to call and run the functions in the qcpv package is available as example.py.

    Objective

    By publishing the dataset we provide access to high quality PV power measurements that can be used for research experiments on several topics related to PV power and the integration of PV in the electricity grid.

By publishing the qcpv package we aim to take a next step towards a standardized routine for quality control of PV power measurements. We hope to encourage others to adopt and improve the routine and to work towards a widely adopted standard.

    Data usage

    If you use the data and/or python package in a published work please cite: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.

    Units

    Timestamps are in UTC (YYYY-MM-DD HH:MM:SS+00:00).

    Power measurements are in Watt.

    Installed capacities (DC and AC) are in Watt-peak.

    Additional information

    A detailed discussion of the data and qcpv package is presented in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy. Corrections are discussed in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2024. Erratum: Open-source quality control routine and multiyear power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.

    Acknowledgements

    This work is part of the Energy Intranets (NEAT: ESI-BiDa 647.003.002) project, which is funded by the Dutch Research Council NWO in the framework of the Energy Systems Integration & Big Data programme. The authors would especially like to thank the PV owners who volunteered to take part in the measurement campaign.

  5. Vulnerability prediction using pre-trained models: An empirical evaluation...

    • zenodo.org
    csv
    Updated Mar 28, 2025
    Cite
    Ilias Kalouptsoglou; Ilias Kalouptsoglou; Miltiadis Siavvas; Miltiadis Siavvas; Apostolos Ampatzoglou; Apostolos Ampatzoglou; Dionysios Kehagias; Dionysios Kehagias; Alexander Chatzigeorgiou; Alexander Chatzigeorgiou (2025). Vulnerability prediction using pre-trained models: An empirical evaluation [Dataset] [Dataset]. http://doi.org/10.5281/zenodo.15082636
    Explore at:
Available download formats: csv
    Dataset updated
    Mar 28, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Ilias Kalouptsoglou; Ilias Kalouptsoglou; Miltiadis Siavvas; Miltiadis Siavvas; Apostolos Ampatzoglou; Apostolos Ampatzoglou; Dionysios Kehagias; Dionysios Kehagias; Alexander Chatzigeorgiou; Alexander Chatzigeorgiou
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the extension of a publicly available dataset that was published initially by Bagheri et al. in their paper:

    A. Bagheri and P. Hegedűs, "A comparison of different source code representation methods for vulnerability prediction in python", Quality of Information and Communications Technology, 2021.

    This dataset is an extension of the dataset presented by Bagheri et al., who used a version control system as a data source for collecting source code components. Specifically, they used GitHub since it has a high number of software projects. To create a labeled dataset, i.e., a dataset of files signed with a label that declares if they are vulnerable or not, they scanned the commit messages in Python GitHub projects. In particular, they searched for commits, which contain vulnerability-fixing keywords in the commit message. They gathered a large number of Python source files included in such commits. The version of each file before the vulnerability-fixing commit (i.e., parent version) is considered vulnerable, since it contains the vulnerability that required a patch, whereas the version of the file in the vulnerability-fixing commit is considered non-vulnerable. However, in their study, Bagheri et al. utilized only the fragment of the diff file, which contains the difference between the vulnerable and the fixed version, and they proposed models to separate the “bad” and the “good” parts of a file. In the current study, we extend their dataset by collecting clean (i.e., non-vulnerable) versions from GitHub. For this purpose, we retrieved files from the latest version of the dataset’s GitHub repositories, since the latest versions are the safest versions that can be considered non-vulnerable because no vulnerabilities have yet been reported for them. Hence, we can construct models to perform vulnerability prediction at the file-level of granularity. Overall, the extended dataset contains 4,184 Python files, 3,186 of which are considered vulnerable and 998 are considered neutral (i.e., non-vulnerable).
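
As a hedged illustration only, a simple file-level baseline on such a labeled dataset; the CSV name and the 'code'/'label' columns are hypothetical, and the TF-IDF model below is a stand-in for the pre-trained models evaluated in the paper:

```python
# Illustrative baseline only: TF-IDF + logistic regression on labeled Python files.
# The CSV name and its 'code'/'label' columns are assumptions, not the dataset's documented schema.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("vulnerability_dataset.csv")          # hypothetical file name
X_train, X_test, y_train, y_test = train_test_split(
    df["code"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
)

vectorizer = TfidfVectorizer(max_features=20000, token_pattern=r"\S+")
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))
```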

  6. OCEAN mailing list data from open source communities

    • figshare.com
    zip
    Updated Mar 31, 2022
    Cite
    Melanie Warrick; Samuel F. Rosenblatt; Jean-Gabriel Young; Amanda Casari; Laurent Hébert-Dufresne; James Bagrow (2022). OCEAN mailing list data from open source communities [Dataset]. http://doi.org/10.6084/m9.figshare.19082540.v2
    Explore at:
Available download formats: zip
    Dataset updated
    Mar 31, 2022
    Dataset provided by
Figshare (http://figshare.com/)
    Authors
    Melanie Warrick; Samuel F. Rosenblatt; Jean-Gabriel Young; Amanda Casari; Laurent Hébert-Dufresne; James Bagrow
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

We present the data collected as part of the Open-source Complex Ecosystem And Networks (OCEAN) partnership between Google Open Source and the University of Vermont. This includes mailing list emails with a standardized format spanning the past three decades from fourteen mailing lists across four different open source communities: Python, Angular, Node.js, and the Go language.

This data is presented in the following publication: Warrick, M., Rosenblatt, S. F., Young, J. G., Casari, A., Hébert-Dufresne, L., & Bagrow, J. P. (2022). The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). IEEE.

  7. Data from: Climate Policy Database

    • data.niaid.nih.gov
    Updated Mar 28, 2024
    + more versions
    Cite
    Wageningen University and Research (2024). Climate Policy Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7774109
    Explore at:
    Dataset updated
    Mar 28, 2024
    Dataset provided by
Netherlands Environmental Assessment Agency (https://www.pbl.nl/)
    NewClimate Institute
    Wageningen University and Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recommended Citation

    Citing this version

    NewClimate Institute, Wageningen University and Research & PBL Netherlands Environmental Assessment Agency. (2023). Climate Policy Database. DOI: 10.5281/zenodo.10869734

    Citing all CPDB versions

    NewClimate Institute, Wageningen University and Research & PBL Netherlands Environmental Assessment Agency. (2016). Climate Policy Database. DOI: 10.5281/zenodo.7774109

    Peer reviewed publication

    Description

The Climate Policy Database (CPDB) is an open, collaborative tool to advance data collection on the implementation status of climate policies. This project is funded by the European Union H2020 ELEVATE and ENGAGE projects and was, in its previous phase, funded under CD-Links. The database is maintained by NewClimate Institute with support from PBL Netherlands Environmental Assessment Agency and Wageningen University and Research.

Although the CPDB has existed since 2016, annual versions of the database have only been stored since 2019.

    The Climate Policy Database is updated periodically. The latest version of the database can be downloaded on the CPDB website or accessed through a Python API. Each year, we also create a static database, which is included here for version control.

  8. Employee Turnover at TECHCO

    • kaggle.com
    Updated Jun 30, 2020
    Cite
    Ryan A (2020). Employee Turnover at TECHCO [Dataset]. http://doi.org/10.34740/kaggle/ds/49465
    Explore at:
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2020
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Ryan A
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    These are simulated data based on employee turnover data in a real technology company in India (we refer to this company by a pseudonym, 'TECHCO'). These data can be used to analyze drivers of turnover at TECHCO. The original dataset was analyzed in the paper Machine Learning for Pattern Discovery in Management Research (SSRN version here). This publicly offered dataset is simulated based on the original data for privacy considerations. Along with the accompanying Python Kaggle code and R Kaggle code, this dataset will help readers learn how to implement the ML techniques in the paper. The data and code demonstrate how ML can be useful for discovering nonlinear and interactive patterns between variables that may otherwise have gone unnoticed.

    Content

This dataset includes 1,191 entry-level employees who were quasi-randomly deployed to any of TECHCO’s nine geographically dispersed production centers in 2007. The data are structured as a panel with one observation for each month that an individual is employed at the company, for up to 40 months. The data include 34,453 observations from 1,191 employees in total; the dependent variable, Turnover, indicates whether the employee left or stayed during that time period.
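
A minimal sketch for a first look at the panel (the file name and all column names other than Turnover are hypothetical; check the Kaggle file listing):

```python
# Minimal sketch: inspect the monthly panel and the turnover rate by tenure month.
# File and column names are assumptions about the Kaggle dataset, not its documented schema.
import pandas as pd

panel = pd.read_csv("turnover_techco.csv")       # hypothetical file name
print(panel.shape)                               # expected around 34,453 rows
print(panel.columns.tolist())

# Hypothetical columns: 'time' is the month on the job, 'turnover' is 1 in the month
# the employee leaves and 0 otherwise.
hazard = panel.groupby("time")["turnover"].mean()
print(hazard.head(12))
```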

    Objectives

    The objective in the original paper was to explore patterns in the data that would help us learn more about the drivers of employee turnover. Another objective could be to find the best predictive model to estimate when a specific employee will leave.

9. Instrumentino: An open-source modular Python framework for controlling...

    • narcis.nl
    • data.mendeley.com
    Updated Mar 14, 2019
    Cite
    Koenka, I (via Mendeley Data) (2019). Instrumentino: An open-source modular Python framework for controlling Arduino based experimental instruments [Dataset]. http://doi.org/10.17632/k7j4982sp2.1
    Explore at:
    Dataset updated
    Mar 14, 2019
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Koenka, I (via Mendeley Data)
    Description

    This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2018)

    Abstract Instrumentino is an open-source modular graphical user interface framework for controlling Arduino based experimental instruments. It expands the control capability of Arduino by allowing instruments builders to easily create a custom user interface program running on an attached personal computer. It enables the definition of operation sequences and their automated running without user intervention. Acquired experimental data and a usage log are automatically saved on the computer for furthe...

    Title of program: Instrumentino, Controlino Catalogue Id: AETJ_v1_0

    Nature of problem Control and monitor purpose-made experimental instruments

    Versions of this program held in the CPC repository in Mendeley Data AETJ_v1_0; Instrumentino, Controlino; 10.1016/j.cpc.2014.06.007

  10. Self Driving Car

    • kaggle.com
    zip
    Updated Mar 8, 2023
    Cite
    Aslan Ahmedov (2023). Self Driving Car [Dataset]. https://www.kaggle.com/aslanahmedov/self-driving-carbehavioural-cloning
    Explore at:
Available download formats: zip (18420532 bytes)
    Dataset updated
    Mar 8, 2023
    Authors
    Aslan Ahmedov
    Description


    SELF-DRIVING CAR USING UDACITY’S CAR SIMULATOR ENVIRONMENT AND TRAINED BY DEEP NEURAL NETWORKS COMPLETE GUIDE

    Table of Contents

    Introduction

    • Problem Definition
    • Solution Approach
    • Technologies Used
    • Convolutional Neural Networks (CNN)
    • Time-Distributed Layers

    Udacity Simulator and Dataset

    The Training Process

    Augmentation and image pre-processing

    Experimental configurations

    Network architectures

    Results

    • Value loss or Accuracy
    • Why We Use ELU Over RELU

    The Connection Part

    Files

    Overview

    References

    Introduction

Self-driving cars have become a trending subject, with significant improvements in the underlying technologies over the last decade. The purpose of the project is to train a neural network to drive an autonomous car agent on the tracks of Udacity’s Car Simulator environment. Udacity has released the simulator as open source software, and enthusiasts have hosted a competition (challenge) to teach a car how to drive using only camera images and deep learning. Driving a car autonomously requires learning to control steering angle, throttle and brakes. A behavioral cloning technique is used to mimic human driving behavior in the training mode on the track: a dataset is generated in the simulator by a user driving the car in training mode, and the deep neural network model then drives the car in autonomous mode. Ultimately, the car was able to run on Track 1, generalizing well. The project aims to reach the same accuracy on real-time data in the future.

    Problem Definition

    Udacity released an open source simulator for self-driving cars to depict a real-time environment. The challenge is to mimic the driving behavior of a human on the simulator with the help of a model trained by deep neural networks. The concept is called Behavioral Cloning, to mimic how a human drives. The simulator contains two tracks and two modes, namely, training mode and autonomous mode. The dataset is generated from the simulator by the user, driving the car in training mode. This dataset is also known as the “good” driving data. This is followed by testing on the track, seeing how the deep learning model performs after being trained by that user data.

    Solution Approach


    The problem is solved in the following steps:

• The simulator can be used to collect data by driving the car in the training mode using a joystick or keyboard, providing the so-called “good-driving” behavior input data in the form of a driving_log (.csv file) and a set of images. The simulator acts as a server and pipes these images and the data log to the Python client.
• The client (Python program) is the machine learning model built using deep neural networks. These models are developed with Keras (a high-level API over TensorFlow). Keras provides sequential models to build a linear stack of network layers. Such models are used in the project, as the second step, to train over the datasets. A detailed description of the CNN models experimented with and used can be found in the chapter on network architectures.
• Once the model is trained, it provides steering angles and throttle for driving in autonomous mode to the server (simulator).
• These outputs are piped back to the server and are used to drive the car autonomously in the simulator and keep it from falling off the track.

    Technologies Used

    Technologies that are used in the implementation of this project and the motivation behind using these are described in this section.

TensorFlow: This is an open-source library for dataflow programming. It is widely used for machine learning applications. It is also used as both a math library and for large-scale computation. For this project, Keras, a high-level API that uses TensorFlow as the backend, is used. Keras facilitates building the models easily as it is more user friendly.

Different libraries are available in Python that help in machine learning projects. Several of those libraries have improved the performance of this project. A few of them are mentioned in this section. First, “Numpy” provides a collection of high-level math functions to support multi-dimensional matrices and arrays. This is used for faster computations over the weights (gradients) in neural networks. Second, “scikit-learn” is a machine learning library for Python which features different algorithms and machine learning function packages. Another one is OpenCV (Open Source Computer Vision Library), which is designed for computational efficiency with a focus on real-time applications. In this project, OpenCV is used for image preprocessing and augmentation techniques.

The project makes use of a Conda environment; Conda is an open-source package and environment management system for Python which simplifies package management and deployment, and it works well for large-scale data processing. The machine on which this project was built is a personal computer.

    Convolutional Neural Networks (CNN)

A CNN is a type of feed-forward neural network computing system that can be used to learn from input data. Learning is accomplished by determining a set of weights or filter values that allow the network to model the behavior according to the training data. The desired output and the output generated by a CNN initialized with random weights will be different. This difference (the generated error) is backpropagated through the layers of the CNN to adjust the weights of the neurons, which in turn reduces the error and allows us to produce output closer to the desired one.

A CNN is good at capturing hierarchical and spatial features from images. It utilizes filters that look at regions of an input image with a defined window size and map them to some output. It then slides the window by some defined stride to other regions, covering the whole image. Each convolution filter layer thus captures the properties of the input image hierarchically in a series of subsequent layers, capturing details like lines in the image, then shapes, then whole objects in later layers. A CNN can be a good fit to feed the images of a dataset and classify them into their respective classes.
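
To make this concrete, here is a minimal Keras sketch of a behavioral-cloning style CNN that maps a camera frame to a steering angle; the input size and layer widths are illustrative assumptions in the spirit of common steering-prediction networks, not the exact architecture used in this project:

```python
# Illustrative behavioral-cloning CNN in Keras (architecture details are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(66, 200, 3)),             # cropped/resized camera frame
    layers.Lambda(lambda x: x / 127.5 - 1.0),     # normalize pixels to [-1, 1]
    layers.Conv2D(24, 5, strides=2, activation="elu"),
    layers.Conv2D(36, 5, strides=2, activation="elu"),
    layers.Conv2D(48, 5, strides=2, activation="elu"),
    layers.Conv2D(64, 3, activation="elu"),
    layers.Conv2D(64, 3, activation="elu"),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(100, activation="elu"),
    layers.Dense(50, activation="elu"),
    layers.Dense(10, activation="elu"),
    layers.Dense(1),                              # predicted steering angle
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
model.summary()
```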

    Time-Distributed Layers

Another type of layer sometimes used in deep learning networks is the Time-Distributed layer. Time-Distributed layers are provided in Keras as wrapper layers: the wrapped layer is applied to every temporal slice of the input. The input is required to be at least three-dimensional, with the first index treated as the temporal dimension. A Time-Distributed wrapper can be applied to a Dense layer so that it acts on each timestep independently, or it can be used with convolutional layers. The way they are written in Keras is also simple, as shown in Figure 1 and Figure 2.


    Fig. 1: TimeDistributed Dense layer


    Fig. 2: TimeDistributed Convolution layer
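
Since Figures 1 and 2 are not reproduced here, a minimal Keras sketch of both wrappers (shapes and layer sizes are illustrative assumptions):

```python
# Illustrative TimeDistributed usage in Keras; shapes are assumptions.
from tensorflow.keras import layers, models

# TimeDistributed Dense: apply the same Dense layer to each of 10 timesteps independently.
dense_model = models.Sequential([
    layers.Input(shape=(10, 16)),                  # (timesteps, features)
    layers.TimeDistributed(layers.Dense(8, activation="relu")),
])

# TimeDistributed Convolution: apply the same Conv2D to each frame of an image sequence.
conv_model = models.Sequential([
    layers.Input(shape=(10, 64, 64, 3)),           # (timesteps, height, width, channels)
    layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu")),
    layers.TimeDistributed(layers.GlobalAveragePooling2D()),
    layers.LSTM(32),                               # combine per-frame features over time
    layers.Dense(1),
])

dense_model.summary()
conv_model.summary()
```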

    Udacity Simulator and Dataset

We will first download the simulator to start our behavioural training process. Udacity has built a simulator for self-driving cars and made it open source for enthusiasts, so they can work on something close to a real-time environment. It is built on Unity, the video game development platform. The simulator offers configurable resolution and control settings and is very user friendly. The graphics and input configurations can be changed according to user preference and machine configuration, as shown in Figure 3. The user pushes the “Play!” button to enter the simulator user interface. You can open the Controls tab to explore the keyboard controls, which are quite similar to a racing game, as can be seen in Figure 4.


    Fig. 3: Configuration screen


    Fig. 4: Controls Configuration

The first actual screen of the simulator can be seen in Figure 5, and its components are discussed below. The simulator involves two tracks. One of them can be considered simple and the other complex, as is evident in the screenshots attached in Figure 6 and Figure 7. The word “simple” here just means that it has fewer curvy stretches and is easier to drive on (refer to Figure 6). The “complex” track has steep elevations, sharp turns, and a shadowed environment, and is tough to drive on, even for a user doing it manually (refer to Figure 7). There are two modes for driving the car in the simulator: (1) training mode and (2) autonomous mode. The training mode gives you the option of recording your run and capturing the training dataset. The small red sign at the top right of the screen in Figures 6 and 7 indicates that the car is being driven in training mode. The autonomous mode can be used to test the models to see if they can drive on the track without human intervention. Also, if you try to press the controls to get the car back on track, it will immediately notify you that it shifted to manual controls. The mode screenshot can be seen in Figure 8. Once we have mastered how the car is controlled in the simulator using keyboard keys, we get started with the record button to collect data. We will save the data from it in a specified folder as you can see

11. Protected Areas Database of the United States (PAD-US) 3.0 Vector Analysis...

    • catalog.data.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Protected Areas Database of the United States (PAD-US) 3.0 Vector Analysis and Summary Statistics [Dataset]. https://catalog.data.gov/dataset/protected-areas-database-of-the-united-states-pad-us-3-0-vector-analysis-and-summary-stati
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and to support user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements). The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") associated item of PAD-US 3.0 Spatial Analysis and Statistics ( https://doi.org/10.5066/P9KLBB5D ) was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip") and Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allow for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D ). Note, the PAD-US inventory is now considered functionally complete with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/ ). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. While PAD-US is the official aggregation of protected areas ( https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html ), agencies are the best source of their lands data.

12. Side-Channel Based Intrusion Detection for Industrial Control Systems:...

    • phys-techsciences.datastations.nl
    html, text/markdown +5
    Updated May 27, 2017
    Cite
    P.J.M. Van Aubel; P.J.M. Van Aubel; K. Papagiannopoulos; K. Papagiannopoulos (2017). Side-Channel Based Intrusion Detection for Industrial Control Systems: Python & MATLAB source code for EM side-channel analysis & graphing [Dataset]. http://doi.org/10.17026/dans-x7m-6222
    Explore at:
Available download formats: text/plain; charset=utf-8(908), text/x-python(4664), text/markdown(1928), text/x-python(5904), html(2226), text/markdown(34893), text/x-python(10199), text/x-python(10850), text/x-python(27516), text/x-python(4551), text/x-python(2830), text/markdown(4166), text/plain; charset=us-ascii(266), html(37019), text/x-matlab(7011), zip(33017), text/x-python(6618), text/x-python(4435), text/markdown(1063), text/x-matlab(421), text/x-matlab(303), html(1507), html(5149), text/x-matlab(54)
    Dataset updated
    May 27, 2017
    Dataset provided by
    DANS Data Station Phys-Tech Sciences
    Authors
    P.J.M. Van Aubel; P.J.M. Van Aubel; K. Papagiannopoulos; K. Papagiannopoulos
    License

http://www.gnu.org/licenses/gpl-3.0

    Description

Python & MATLAB source code to capture traces & generate the results used in 'Side-Channel Based Intrusion Detection for Industrial Control Systems' (doi:10.1007/978-3-319-99843-5_19) and 'Security and Privacy in the Smart Grid' (PhD Thesis, ISBN 978-94-6473-209-2).

Industrial Control Systems are under increased scrutiny. Their security is historically sub-par, and although measures are being taken by the manufacturers to remedy this, the large installed base of legacy systems cannot easily be updated with state-of-the-art security measures. In these publications we use a technique from cryptographic side-channel analysis, multivariate templating, to detect anomalous behaviour in Programmable Logic Controllers. Our solution uses side-channel measurements of the electromagnetic emissions of an industrial control system to detect behavioural changes of the software running on them. To demonstrate the feasibility of this method, we show it is possible to profile and distinguish between even small changes in programs on Siemens S7-317 PLCs, using methods from cryptographic side-channel analysis.

The code consists of Python source files for capturing electromagnetic traces, and Python & MATLAB source files for analysing the resulting dataset. The raw EM traces used to obtain the results in the aforementioned publications are currently available as a separate dataset at doi:10.17026/dans-ztf-vrz9.

Date Submitted: 2023-08-23. Modified: 2023-08-22. Modified: 2017-06-23.
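
A conceptual sketch of the multivariate-templating idea described above (per-program template mean plus pooled covariance, classification by Gaussian log-likelihood); the trace shapes and program labels are illustrative assumptions, not the released code:

```python
# Conceptual sketch of multivariate template matching for EM traces (not the released code).
import numpy as np
from scipy.stats import multivariate_normal

def build_templates(traces_by_program):
    """traces_by_program: dict mapping program label -> array of shape (n_traces, n_samples)."""
    means = {p: t.mean(axis=0) for p, t in traces_by_program.items()}
    pooled = np.cov(
        np.vstack([t - t.mean(axis=0) for t in traces_by_program.values()]), rowvar=False
    )
    return means, pooled

def classify(trace, means, pooled):
    """Return the program whose template gives the highest Gaussian log-likelihood."""
    scores = {p: multivariate_normal.logpdf(trace, mean=m, cov=pooled, allow_singular=True)
              for p, m in means.items()}
    return max(scores, key=scores.get)

# Illustrative usage with synthetic 'EM traces' for two PLC programs.
rng = np.random.default_rng(1)
profiling = {
    "program_A": rng.normal(0.0, 1.0, size=(200, 50)),
    "program_B": rng.normal(0.3, 1.0, size=(200, 50)),
}
means, pooled = build_templates(profiling)
print(classify(rng.normal(0.3, 1.0, size=50), means, pooled))   # likely "program_B"
```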

13. Time series

    • data.open-power-system-data.org
    csv, sqlite, xlsx
    Updated Oct 6, 2020
    + more versions
    Cite
    Jonathan Muehlenpfordt (2020). Time series [Dataset]. http://doi.org/10.25832/time_series/2020-10-06
    Explore at:
Available download formats: csv, sqlite, xlsx
    Dataset updated
    Oct 6, 2020
    Dataset provided by
    Open Power System Data
    Authors
    Jonathan Muehlenpfordt
    Time period covered
    Jan 1, 2015 - Oct 1, 2020
    Variables measured
    utc_timestamp, DE_wind_profile, DE_solar_profile, DE_wind_capacity, DK_wind_capacity, SE_wind_capacity, CH_solar_capacity, DE_solar_capacity, DK_solar_capacity, AT_price_day_ahead, and 290 more
    Description

    Load, wind and solar, prices in hourly resolution. This data package contains different kinds of timeseries data relevant for power system modelling, namely electricity prices, electricity consumption (load) as well as wind and solar power generation and capacities. The data is aggregated either by country, control area or bidding zone. Geographical coverage includes the EU and some neighbouring countries. All variables are provided in hourly resolution. Where original data is available in higher resolution (half-hourly or quarter-hourly), it is provided in separate files. This package version only contains data provided by TSOs and power exchanges via ENTSO-E Transparency, covering the period 2015-mid 2020. See previous versions for historical data from a broader range of sources. All data processing is conducted in Python/pandas and has been documented in the Jupyter notebooks linked below.
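
A minimal pandas sketch for working with the package's CSV output; the file name and load column below follow the package's usual naming conventions but are assumptions here:

```python
# Minimal sketch: load the hourly time series CSV and plot German load for one week.
# File and column names are assumptions based on the package's usual conventions.
import matplotlib.pyplot as plt
import pandas as pd

ts = pd.read_csv(
    "time_series_60min_singleindex.csv",   # assumed file name
    index_col="utc_timestamp",
    parse_dates=True,
)
print(ts.shape)

load_col = "DE_load_actual_entsoe_transparency"   # assumed column name
if load_col in ts.columns:
    ts.loc["2019-01-07":"2019-01-13", load_col].plot(title="DE load, one week (MW)")
    plt.show()
```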

14. Programming Software Market Size, Share & Industry Analysis 2033

    • marketresearchintellect.com
    + more versions
    Cite
    Market Research Intellect, Programming Software Market Size, Share & Industry Analysis 2033 [Dataset]. https://www.marketresearchintellect.com/product/programming-software-market/
    Explore at:
    Dataset authored and provided by
    Market Research Intellect
    License

https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

The size and share of this market are categorized based on:

• Application Development (Web Development, Mobile Application Development, Desktop Application Development, Game Development, Embedded Systems Development)
• Software Development Tools (Integrated Development Environment (IDE), Version Control Systems, Build Automation Tools, Debugging Tools, Code Analysis Tools)
• Programming Languages (Java, Python, C#, JavaScript, Ruby)
• Database Management (Relational Database Management Systems (RDBMS), NoSQL Databases, Cloud Databases, Data Warehousing Solutions, Database Development Tools)
• Cloud Computing (Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), Function as a Service (FaaS), Cloud Development Frameworks)
• Geographical regions (North America, Europe, Asia-Pacific, South America, Middle East and Africa)

  15. Resources of IncRML: Incremental Knowledge Graph Construction from...

    • zenodo.org
    bin, text/x-python +1
    Updated Mar 18, 2024
    Cite
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert (2024). Resources of IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources [Dataset]. http://doi.org/10.5281/zenodo.10171157
    Explore at:
Available download formats: xz, text/x-python, bin
    Dataset updated
    Mar 18, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Dylan Van Assche; Dylan Van Assche; Julian Andres Rojas Melendez; Julian Andres Rojas Melendez; Ben De Meester; Ben De Meester; Pieter Colpaert; Pieter Colpaert
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 8, 2023
    Description

    IncRML resources

This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources', submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool written in Python, which was previously used in the Knowledge Graph Construction Challenge of the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) that was used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times, and the median values are reported together with the standard deviation of the measurements.

    Datasets

    We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium.
    GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets since the GTFS-Madrid-Benchmark is a single type of dataset which does not advertise changes at all.

    Benchmarks

    • GTFS-Madrid-Benchmark: change types with fixed data size and amount of changes: additions-only, modifications-only, deletions-only (11 versions)
    • GTFS-Madrid-Benchmark: amount of changes with fixed data size: 0%, 25%, 50%, 75%, and 100% changes (11 versions)
    • GTFS-Madrid-Benchmark: data size with fixed amount of changes: scales 1, 10, 100 (11 versions)

    Real-life use cases

    • Traffic control center Vlaams Verkeerscentrum (Belgium): traffic board messages data (1 day, 28760 versions)
    • Meteorological institute KMI (Belgium): weather sensor data (1 day, 144 versions)
    • Public transport agency NMBS (Belgium): train schedule data (1 week, 7 versions)
    • Public transport agency De Lijn (Belgium): busses schedule data (1 week, 7 versions)
    • Bike-sharing company BlueBike (Belgium): bike-sharing availability data (1 day, 1440 versions)
    • Bike-sharing company JCDecaux (EU): bike-sharing availability data (1 day, 1440 versions)
    • OpenStreetMap (World): geographical map data (1 day, 1440 versions)

    Remarks

    1. The first version of each dataset is always used as a baseline. All next versions are applied as an update on the existing version. The reported results are only focusing on the updates since these are the actual incremental generation.
    2. GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz datasets are not uploaded as GTFS-Madrid-Benchmark scale 100 because both share the same parameters (50% changes, scale 100). Please use GTFS-Scale-100-{ALL, CHANGE}.tar.xz for GTFS-Change-50_percent-{ALL, CHANGE}.tar.xz
    3. All datasets are compressed with XZ and provided as a TAR archive, be aware that you need sufficient space to decompress these archives! 2 TB of free space is advised to decompress all benchmarks and use cases. The expected output is provided as a ZIP file in each TAR archive, decompressing these requires even more space (4 TB).

    Reproducing

By using our experiment tool, you can easily reproduce the experiments as follows:

    1. Download one of the TAR.XZ archives and unpack them.
    2. Clone the GitHub repository of our experiment tool and install the Python dependencies with 'pip install -r requirements.txt'.
    3. Download the rmlmapper.jar JAR file from this Zenodo dataset and place it inside the experiment tool root folder.
    4. Execute the tool by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive --runs=5 run'. The argument '--runs=5' is used to perform the experiment 5 times.
    5. Once executed, you can generate the statistics by running: './exectool --root=/path/to/the/root/of/the/tarxz/archive stats'.

    Testcases

    Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394

16. Code and Data for "Availability and utilisation of airspace structure in a...

    • data.4tu.nl
    zip
    Updated Jun 8, 2023
    Cite
    Andres Morfin Veytia (2023). Code and Data for "Availability and utilisation of airspace structure in a U-space traffic management system" [Dataset]. http://doi.org/10.4121/980484ec-5187-4f73-9e57-2e0dcd1330cc.v1
    Explore at:
Available download formats: zip
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    4TU.ResearchData
    Authors
    Andres Morfin Veytia
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    European Commission
    Description

    This contains the code and output logs to run the BlueSky simulator for "Availability and utilisation of airspace structure in a U-space traffic management system". The code provided here is a modified version of the main fork of BlueSky (https://github.com/TUDelft-CNS-ATM/bluesky).

    The first step is to install the correct environment. Refer to `condaenv.txt` for the list of packages needed to run the simulator.

    After setting up the environment, we then need to save all of the potential paths of drones in `bluesky/plugins/streets/path_plan_dills`. Note that this takes about 180GB of storage so make sure to have enough available. The paths can be downloaded from https://surfdrive.surf.nl/files/index.php/s/makXrEfPtrtdzaO. There are some example paths saved in this dataset but it will not be possible to run all of the experiment without downloading the paths.

    The scenarios for sub-experiment 1 are saved in `bluesky/scenario/subexperiment1`.

    The scenarios for sub-experiment 2 are saved in `bluesky/scenario/subexperiment2`.

    To run the scenarios, we first need to start a BlueSky server by running the following command inside `bluesky`:

    `python BlueSky.py --headless`

    In another terminal we can start a bluesky client by running:

    `python BlueSky.py --client`

    In the bluesky console we can now run each batch scenario by typing and entering:

    `batch batch_subexperiment_1.scn` or

    `batch batch_subexperiment_2.scn`

    The logs of the scenarios are saved in `bluesky/output`.
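
    The server and the client above must run in separate terminals. As a convenience only, the sketch below launches both from one Python script with `subprocess.Popen`; the two BlueSky invocations are taken verbatim from the instructions, while the orchestration (and the fixed start-up delay) is an assumption.

    ```python
    """Convenience sketch: start a headless BlueSky server and a client together.

    Assumes it is run from inside the `bluesky` directory; the 10 s delay is an
    arbitrary grace period for the server to come up.
    """
    import subprocess
    import time

    server = subprocess.Popen(["python", "BlueSky.py", "--headless"])
    time.sleep(10)
    client = subprocess.Popen(["python", "BlueSky.py", "--client"])

    try:
        # In the BlueSky console, enter: batch batch_subexperiment_1.scn (or ..._2.scn)
        client.wait()
    finally:
        server.terminate()
    ```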

    Without the paths, it will not be possible to run the simulations. However, this dataset currently includes some paths, so it is possible to run a few example scenarios. The zeroth repetition for the low imposed traffic demand case can be run without all of the paths. For example, `bluesky/scenario/subexperiment1/Flight_intention_low_40_0_1to1.scn` and `bluesky/scenario/subexperiment2/Flight_intention_low_40_0_baseline.scn` can be run directly with this dataset.

    First start bluesky by running:

    `python BlueSky.py`

    In the console, type:

    `ic subexperiment2/Flight_intention_low_40_0_baseline.scn`

    Please do not hesitate to contact me with any questions.

    -Andres

  17. Data from: MetSim: Integrated Programmatic Access and Pathway Management for...

    • catalog.data.gov
    • datasets.ai
    Updated Jun 22, 2024
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2024). MetSim: Integrated Programmatic Access and Pathway Management for Xenobiotic Metabolism Simulators [Dataset]. https://catalog.data.gov/dataset/metsim-integrated-programmatic-access-and-pathway-management-for-xenobiotic-metabolism-sim
    Explore at:
    Dataset updated
    Jun 22, 2024
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    With the exception of metabolic simulations performed using TIMES (version 2.31.2.82), all work was performed using Python (version 3.10.4) run with IPython (version 8.4.0) in JupyterLab (version 3.3.2). The Toolbox API (OECD Toolbox version 4.5 with Service Pack 1 update, API version 6) and BioTransformer (Wishart Lab, version 3.0, executable Java Archive, June 15, 2022 release) were used for automated metabolic simulations. Efficient batch execution of metabolism simulations was handled by parallel processing of multiple individual calls to either BioTransformer or the Toolbox API via the “multiprocess” package. The command line interface (CLI) calls needed to interact with BioTransformer were executed via the “subprocess” package, and the Toolbox API was queried via its Swagger user interface hosted on a locally running Windows Desktop instance of the Toolbox Server. The data generated from the MetSim hierarchical schema were translated into JavaScript Object Notation (JSON) format using Python. The resulting data were inserted into a Mongo Database (MongoDB) using the “pymongo” package for efficient storage and retrieval. The code repository, including all Jupyter Notebooks documenting the analysis performed and the MetSim framework, is available at https://github.com/patlewig/metsim. Data files needed to reproduce the analysis are provided at https://doi.org/10.23645/epacomptox.25463926 and as Supporting Information. This dataset is associated with the following publication: Groff, L., A. Williams, I. Shah, and G. Patlewicz. MetSim: Integrated Programmatic Access and Pathway Management for Xenobiotic Metabolism Simulators. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, USA, 37(5): 685-697, (2024).
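
    As an illustration of the batch pattern described above (parallel CLI calls whose JSON-formatted results are stored in MongoDB), a minimal sketch follows. This is not the authors' code: it uses the standard-library `multiprocessing` module in place of the “multiprocess” package (the `Pool` API is the same), and the BioTransformer jar name and command-line flags are placeholders rather than the exact invocation used in the study.

    ```python
    """Illustrative sketch only: fan out BioTransformer CLI calls across processes
    and store the results in MongoDB. Jar name and flags are placeholders."""
    import subprocess
    from multiprocessing import Pool

    from pymongo import MongoClient

    SMILES_BATCH = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy inputs

    def run_biotransformer(smiles: str) -> dict:
        # Placeholder invocation; consult the BioTransformer documentation for real flags.
        cmd = ["java", "-jar", "BioTransformer3.0.jar", "-ismi", smiles, "-k", "pred"]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return {"smiles": smiles, "returncode": proc.returncode, "stdout": proc.stdout}

    if __name__ == "__main__":
        with Pool(processes=4) as pool:
            records = pool.map(run_biotransformer, SMILES_BATCH)
        # Insert the JSON-compatible documents into a local MongoDB instance.
        MongoClient("mongodb://localhost:27017")["metsim"]["predictions"].insert_many(records)
    ```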

  18. Data for the Lakes and Impoundments Drought Index in the Massachusetts...

    • catalog.data.gov
    • data.usgs.gov
    Updated Sep 15, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data for the Lakes and Impoundments Drought Index in the Massachusetts Drought Management Plan [Dataset]. https://catalog.data.gov/dataset/data-for-the-lakes-and-impoundments-drought-index-in-the-massachusetts-drought-management-
    Explore at:
    Dataset updated
    Sep 15, 2024
    Dataset provided by
    United States Geological Surveyhttp://www.usgs.gov/
    Area covered
    Massachusetts
    Description

    The Massachusetts Drought Management Plan (DMP, 2023) uses data from select lake and impoundment systems as an index for drought in six of seven regions in the state. The contents of these lakes and impoundments are reported to the Massachusetts Department of Conservation and Recreation (DCR) and classified as one of five levels for drought severity, ranging from level 0 (Normal; percentile greater than 30) to level 4 (Emergency; percentile less than 2). Lake and impoundment system data are provided at the end of each month to DCR through multiple agencies as lake levels, volumes, or percent-full (reservoir capacity). USGS reviewed data from 14 of the lake or impoundment systems, including 28 waterbodies. Diagrams for each system show the capacity of each waterbody and how water is transported through the systems. This data release provides historical monthly data in volume for each system and, when recorded values were available, historical monthly data in feet for systems that consist of only one waterbody. From these historical monthly data, the 50th, 30th, 20th, 10th, and 2nd percentiles were computed. Stage volume rating data for each waterbody at each system are provided in two formats to convert gage height (feet) to volume (million gallons). One rating file is formatted as a text (.txt) table for easy manual reading, and the other is a comma-separated value (.csv) column format that is easily loaded into a spreadsheet. Stage volume rating data were provided by the municipalities and agencies that manage the systems or were developed for this study. At one system (Hudson, Gates Pond), no stage volume rating data or bathymetry data were available. For that system, a stage volume rating was developed with a Python script using the maximum depth and a shapefile of the pond shoreline. The Python script used to develop the stage volume rating data and the R script used to compute the quantiles are published as a part of this data release. Files for each system include supplied historical volume, computed volume percentiles, stage volume rating(s), and a system diagram. Historical elevation data and computed elevation percentiles are included when applicable.
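
    The data release itself computes the quantiles with an R script and supplies its own stage volume rating files; purely as an illustration of the two operations described above, the pandas/numpy sketch below computes the five percentiles from a monthly volume series and interpolates a rating table to convert gage height to volume. The file names and column names are assumptions, not the release's actual layout.

    ```python
    """Illustration only: percentile computation and stage-to-volume conversion.
    File and column names are hypothetical, not those of the data release."""
    import numpy as np
    import pandas as pd

    # Historical monthly volumes (million gallons) for one system.
    volumes = pd.read_csv("system_monthly_volume.csv")  # hypothetical file
    print(volumes["volume_mgal"].quantile([0.50, 0.30, 0.20, 0.10, 0.02]))

    # Convert a gage height (feet) to volume (million gallons) via a rating table.
    rating = pd.read_csv("stage_volume_rating.csv")  # hypothetical columns: stage_ft, volume_mgal
    print(np.interp(4.2, rating["stage_ft"], rating["volume_mgal"]))
    ```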

Replication package for: Altered Histories in Version Control System Repositories: Evidence from the Trenches

## 📊 Reproducing the Analysis

### Option 1: Using Pre-computed Data (Recommended)

The data/ directory contains pre-computed datasets that allow you to reproduce all analyses without running the computationally intensive data collection process.

1. Open the main analysis notebook:
```bash
cd altered-history/notebooks
jupyter notebook analysis.ipynb
```

2. Run all cells to reproduce the complete analysis.

3. Explore additional analyses:

Modify the notebooks at will to explore the dataframes.
```bash
# Build analysis dataset (shows data preparation)
jupyter notebook build_analysis_dataset.ipynb

# License-related analysis
cd ../../modified-files/notebooks
jupyter notebook license_analysis.ipynb

# Security and secrets analysis
jupyter notebook secret-analysis.ipynb
```

### Option 2: Regenerating the Dataset

To reproduce the complete data collection and analysis pipeline:

1. Download Software Heritage datasets (see individual tool READMEs)
2. Configure database connections in each tool
3. Run the analysis pipeline following the step-by-step instructions in each tool's README
4. Process results using the provided notebooks

Note: Complete dataset regeneration requires significant computational resources and time (potentially weeks for large datasets).

## 📋 Data

The data/ directory contains several key datasets, including the following (a minimal loading sketch is shown after the list):

- res.pkl: Main analysis results containing categorized alterations
- stars_without_dup.pkl: Repository popularity metrics (GitHub stars)
- visit_type.pkl: Classification of repository visit patterns
- altered_histories_2024_08_23.dump: PostgreSQL database dump for git-historian tool
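
If you only want to peek at these files outside the notebooks, the sketch below is one way to do it. It assumes the .pkl files are ordinary pickled pandas objects (the analysis notebooks remain the authoritative way to load them) and notes the usual PostgreSQL restore command for the dump.

```python
"""Minimal sketch for inspecting the pre-computed data files.
Assumes the .pkl files are standard pickled pandas objects."""
import pandas as pd

res = pd.read_pickle("data/res.pkl")                  # categorized alterations
stars = pd.read_pickle("data/stars_without_dup.pkl")  # GitHub star counts
visits = pd.read_pickle("data/visit_type.pkl")        # visit-type classification

for name, obj in {"res": res, "stars": stars, "visits": visits}.items():
    print(name, type(obj), getattr(obj, "shape", None))

# The PostgreSQL dump is restored separately, e.g.:
#   pg_restore -d <database> data/altered_histories_2024_08_23.dump
# (or `psql -f` if it turns out to be a plain-text dump); see git-historian/README.md.
```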

## 🛠️ Tools Description

### 1. altered-history

Purpose: Detects and categorizes Git history alterations in Software Heritage archives.

Key Features:

- Three-step analysis pipeline (detection → root cause → categorization)
- Parallel processing for large datasets
- Comprehensive alteration taxonomy

Usage: See altered-history/README.md for detailed instructions.

### 2. git-historian

Purpose: Checks individual repositories against the database of known alterations.

Key Features:

- PostgreSQL integration
- Git hook integration for automated checking
- Caching system for performance

Usage: See git-historian/README.md for detailed instructions.

### 3. modified-files

Purpose: Analyzes file-level modifications and their patterns.

Key Features:

- File modification tracking
- License and security analysis
- Integration with Software Heritage graph

Usage: See modified-files/README.md for detailed instructions.

## 📋 Requirements

### System Requirements

- Memory: Minimum 16GB RAM (1.5TB+ recommended for full dataset processing)
- Storage: 600GB+ free space for complete datasets
- CPU: Multi-core processor recommended for parallel processing

## 🔄 Reproducibility Notes

1. Deterministic Results: The analysis notebooks will produce identical results when run with the provided datasets.

2. Versioning: All tools are pinned to specific versions to ensure reproducibility.

3. Random Seeds: Where applicable, random seeds are fixed in the analysis code.
