Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
```
├── README.md                      # This file
├── data/                          # Pre-computed datasets
│   ├── ...
├── altered-history/               # Main analysis tool
│   ├── src/                       # Rust source code
│   ├── notebooks/                 # Analysis notebooks
│   │   ├── analysis.ipynb         # Main analysis notebook
│   │   ├── build_analysis_dataset.ipynb
│   │   └── utils_analysis.py      # Analysis utilities
│   └── README.md
├── git-historian/                 # History checking tool
│   ├── src/                       # Rust source code
│   └── README.md
├── modified-files/                # File modification analysis tool
│   ├── src/                       # Rust source code
│   ├── notebooks/                 # Additional analysis notebooks
│   │   ├── license_analysis.ipynb
│   │   ├── license_categorization.py
│   │   ├── secret-analysis.ipynb
│   │   └── swh_license_files.py
│   └── README.md
```
Clone the repository:

```bash
git clone <repository-url>
cd altered-histories-tool-replication-pkg
```

Install the Python dependencies:

```bash
pip install pandas matplotlib seaborn jupyter plotly numpy
```

Build the Rust tools:

```bash
cd altered-history && cargo build --release && cd ..
cd git-historian && cargo build --release && cd ..
cd modified-files && cargo build --release && cd ..
```
The data/ directory contains pre-computed datasets that allow you to reproduce all analyses without running the computationally intensive data collection process.

Run the main analysis notebook:

```bash
cd altered-history/notebooks
jupyter notebook analysis.ipynb
```
```bash
# Build analysis dataset (shows data preparation)
jupyter notebook build_analysis_dataset.ipynb

# License-related analysis
cd ../../modified-files/notebooks
jupyter notebook license_analysis.ipynb

# Security and secrets analysis
jupyter notebook secret-analysis.ipynb
```
The data/ directory contains several key datasets, including:

res.pkl: Main analysis results containing categorized alterations
stars_without_dup.pkl: Repository popularity metrics (GitHub stars)
visit_type.pkl: Classification of repository visit patterns
altered_histories_2024_08_23.dump: PostgreSQL database dump for the git-historian tool

See altered-history/README.md for detailed instructions.
See git-historian/README.md for detailed instructions.
See modified-files/README.md for detailed instructions.
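For a quick sanity check, the pre-computed pickles in data/ can be opened with pandas. This is a minimal sketch, assuming the .pkl files are pandas-compatible pickles (their exact structure is documented in the analysis notebooks):

```python
import pandas as pd

# Load the pre-computed datasets shipped in data/
# (assumed to be pandas-compatible pickles; see analysis.ipynb for their actual structure).
res = pd.read_pickle("data/res.pkl")                  # categorized history alterations
stars = pd.read_pickle("data/stars_without_dup.pkl")  # GitHub stars per repository
visits = pd.read_pickle("data/visit_type.pkl")        # repository visit-type classification

for name, obj in [("res", res), ("stars", stars), ("visits", visits)]:
    print(name, type(obj).__name__, getattr(obj, "shape", None))
```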
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geographic Diversity in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility, and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2
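For example, the environment can be prepared with the standard venv module and the pinned package list above (a minimal sketch; adapt the interpreter and paths to your setup):
sh> python3 -m venv venv
sh> source venv/bin/activate
sh> pip install click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2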
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTDL entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Zone detection by email
Run the email detection script to create author-country-by-email.tab.zst
sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
Database creation and initial data ingestion
Create the PostgreSQL DB
sh> createdb zones-commit
Notice that, from now on, commands shown with the psql> prompt are assumed to be executed with psql on the zones-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection by name
Extract commits data from the DB and create commits.tab, that is used as input for the zone detection script
sh> psql -f extract_commits.sql zones-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Ingest zones assignment data into the DB
psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries that extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles.
sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
The Python module 'optbayesexpt' uses optimal Bayesian experimental design methods to control measurement settings in order to efficiently determine model parameters. Given a parametric model - analogous to a fitting function - Bayesian inference uses each measurement 'data point' to refine the model parameters. Using this information, the software suggests measurement settings that are likely to efficiently reduce uncertainties. A TCP socket interface allows the software to be used from experimental control software written in other programming languages. Code is developed in Python, and shared via GitHub's USNISTGOV organization.
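The control loop this describes can be illustrated with a small self-contained sketch. This is a conceptual illustration of sequential Bayesian experimental design, not the optbayesexpt API; the linear model, settings grid, and noise level are made-up examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear model y = a*x + b; the "true" parameters are unknown to the experimenter.
true_a, true_b, noise = 1.3, -0.4, 0.2
settings = np.linspace(0.0, 1.0, 41)            # candidate measurement settings x

# Discrete grid posterior over the parameters (a, b), initially uniform.
a_grid, b_grid = np.meshgrid(np.linspace(0, 3, 61), np.linspace(-2, 2, 61))
weights = np.ones_like(a_grid) / a_grid.size

def measure(x):
    return true_a * x + true_b + rng.normal(0, noise)

for _ in range(20):
    # Pick the setting where the posterior predictions disagree most (largest predictive variance),
    # a simple proxy for "most informative next measurement".
    preds = a_grid[..., None] * settings + b_grid[..., None]     # shape (61, 61, 41)
    mean = np.sum(weights[..., None] * preds, axis=(0, 1))
    var = np.sum(weights[..., None] * (preds - mean) ** 2, axis=(0, 1))
    x = settings[np.argmax(var)]

    # Measure there and perform a Bayesian update of the parameter weights.
    y = measure(x)
    likelihood = np.exp(-0.5 * ((y - (a_grid * x + b_grid)) / noise) ** 2)
    weights *= likelihood
    weights /= weights.sum()

print("posterior mean a, b:", np.sum(weights * a_grid), np.sum(weights * b_grid))
```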
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The repository contains an extensive dataset of PV power measurements and a python package (qcpv) for quality controlling PV power measurements. The dataset features four years (2014-2017) of power measurements of 175 rooftop mounted residential PV systems located in Utrecht, the Netherlands. The power measurements have a 1-min resolution.
PV power measurements
Three different versions of the power measurements are included in three data subsets in the repository. Unfiltered power measurements are enclosed in unfiltered_pv_power_measurements.csv. Filtered power measurements are included as filtered_pv_power_measurements_sc.csv and filtered_pv_power_measurements_ac.csv. The former dataset contains the quality-controlled power measurements after running single-system filters only, while the latter dataset considers the output after running both single- and across-system filters. The metadata of the PV systems is added in metadata.csv. This file holds, for each PV system, a unique ID, start and end time of registered power measurements, estimated DC and AC capacity, tilt and azimuth angle, annual yield, and mapped grids of the system location (north, south, west and east boundary).
Quality control routine
An open-source quality control routine that can be applied to filter erroneous PV power measurements is added to the repository in the form of the Python package qcpv (qcpv.py). Sample code to call and run the functions in the qcpv package is available as example.py.
Objective
By publishing the dataset we provide access to high quality PV power measurements that can be used for research experiments on several topics related to PV power and the integration of PV in the electricity grid.
By publishing the qcpv package we strive to take a next step toward developing a standardized routine for quality control of PV power measurements. We hope to stimulate others to adopt and improve the quality control routine and to work towards a widely adopted, standardized routine.
Data usage
If you use the data and/or python package in a published work please cite: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Units
Timestamps are in UTC (YYYY-MM-DD HH:MM:SS+00:00).
Power measurements are in Watt.
Installed capacities (DC and AC) are in Watt-peak.
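As an illustration, the measurement files can be read with pandas. This is a minimal sketch that assumes the first column holds the UTC timestamps and the remaining columns are the per-system power series; the actual column layout is described in metadata.csv and the accompanying paper:

```python
import pandas as pd

# Read the filtered (single + across system filters) 1-min power measurements (Watt).
# Assumes a timestamp index column; column names are assumed to be the PV system IDs.
power = pd.read_csv("filtered_pv_power_measurements_ac.csv",
                    index_col=0, parse_dates=True)

meta = pd.read_csv("metadata.csv")

# Example: aggregate the first system's 1-min power to hourly mean values.
hourly = power.iloc[:, 0].resample("1h").mean()
print(hourly.head())
```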
Additional information
A detailed discussion of the data and qcpv package is presented in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2022. Open-source quality control routine and multi-year power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy. Corrections are discussed in: Visser, L., Elsinga, B., AlSkaif, T., van Sark, W., 2024. Erratum: Open-source quality control routine and multiyear power generation data of 175 PV systems. Journal of Renewable and Sustainable Energy.
Acknowledgements
This work is part of the Energy Intranets (NEAT: ESI-BiDa 647.003.002) project, which is funded by the Dutch Research Council NWO in the framework of the Energy Systems Integration & Big Data programme. The authors would especially like to thank the PV owners who volunteered to take part in the measurement campaign.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the extension of a publicly available dataset that was published initially by Bagheri et al. in their paper:
A. Bagheri and P. Hegedűs, "A comparison of different source code representation methods for vulnerability prediction in python", Quality of Information and Communications Technology, 2021.
This dataset is an extension of the dataset presented by Bagheri et al., who used a version control system as a data source for collecting source code components. Specifically, they used GitHub since it hosts a high number of software projects. To create a labeled dataset, i.e., a dataset of files assigned a label that declares whether they are vulnerable or not, they scanned the commit messages in Python GitHub projects. In particular, they searched for commits that contain vulnerability-fixing keywords in the commit message. They gathered a large number of Python source files included in such commits. The version of each file before the vulnerability-fixing commit (i.e., the parent version) is considered vulnerable, since it contains the vulnerability that required a patch, whereas the version of the file in the vulnerability-fixing commit is considered non-vulnerable. However, in their study, Bagheri et al. utilized only the fragment of the diff file that contains the difference between the vulnerable and the fixed version, and they proposed models to separate the “bad” and the “good” parts of a file. In the current study, we extend their dataset by collecting clean (i.e., non-vulnerable) versions from GitHub. For this purpose, we retrieved files from the latest version of the dataset’s GitHub repositories, since the latest versions are the safest versions that can be considered non-vulnerable because no vulnerabilities have yet been reported for them. Hence, we can construct models to perform vulnerability prediction at the file level of granularity. Overall, the extended dataset contains 4,184 Python files, 3,186 of which are considered vulnerable and 998 are considered neutral (i.e., non-vulnerable).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present the data collected as part of the Open-source Complex Ecosystem And Networks (OCEAN) partnership between Google Open Source and the University of Vermont. This includes mailing list emails with standardized format spanning the past three decades from fourteen mailing lists across four different open source communities: Python, Angular, Node.js, and the Go language. This data is presented in the following publication: Warrick, M., Rosenblatt, S. F., Young, J. G., Casari, A., Hébert-Dufresne, L., & Bagrow, J. P. (2022). The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). IEEE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recommended Citation
Citing this version
NewClimate Institute, Wageningen University and Research & PBL Netherlands Environmental Assessment Agency. (2023). Climate Policy Database. DOI: 10.5281/zenodo.10869734
Citing all CPDB versions
NewClimate Institute, Wageningen University and Research & PBL Netherlands Environmental Assessment Agency. (2016). Climate Policy Database. DOI: 10.5281/zenodo.7774109
Peer reviewed publication
Description
The Climate Policy Database (CPDB) is an open, collaborative tool to advance the data collection of the implementation status of climate policies. This project is funded by the European Union H2020 ELEVATE and ENGAGE projects and was, in its previous phase, funded under CD-Links. The database is maintained by NewClimate Institute with support from PBL Netherlands Environmental Assessment Agency and Wageningen University and Research.
Although the CPDB has existed since 2016, annual versions of the database have only been stored since 2019.
The Climate Policy Database is updated periodically. The latest version of the database can be downloaded on the CPDB website or accessed through a Python API. Each year, we also create a static database, which is included here for version control.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are simulated data based on employee turnover data in a real technology company in India (we refer to this company by a pseudonym, 'TECHCO'). These data can be used to analyze drivers of turnover at TECHCO. The original dataset was analyzed in the paper Machine Learning for Pattern Discovery in Management Research (SSRN version here). This publicly offered dataset is simulated based on the original data for privacy considerations. Along with the accompanying Python Kaggle code and R Kaggle code, this dataset will help readers learn how to implement the ML techniques in the paper. The data and code demonstrate how ML can be useful for discovering nonlinear and interactive patterns between variables that may otherwise have gone unnoticed.
This dataset includes 1,191 entry-level employees who were quasi-randomly deployed to any of TECHCO’s nine geographically dispersed production centers in 2007. The data are structured as a panel with one observation for each month that an individual is employed at the company, for up to 40 months. The data include 34,453 observations from 1,191 employees in total; the dependent variable, Turnover, indicates whether the employee left or stayed during that time period.
The objective in the original paper was to explore patterns in the data that would help us learn more about the drivers of employee turnover. Another objective could be to find the best predictive model to estimate when a specific employee will leave.
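A minimal sketch of how the panel could be used for the kind of nonlinear pattern discovery the paper describes, assuming the data are exported to a hypothetical turnover.csv with the Turnover outcome and numeric predictor columns (the real column names come from the accompanying Kaggle dataset):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical export of the simulated panel: one row per employee-month,
# a binary Turnover outcome and numeric predictor columns.
df = pd.read_csv("turnover.csv")

y = df["Turnover"]
X = df.drop(columns=["Turnover"]).select_dtypes("number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Gradient boosting can pick up nonlinear and interactive effects
# that a linear specification would miss.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```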
This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2018)
Abstract: Instrumentino is an open-source modular graphical user interface framework for controlling Arduino-based experimental instruments. It expands the control capability of Arduino by allowing instrument builders to easily create a custom user interface program running on an attached personal computer. It enables the definition of operation sequences and their automated running without user intervention. Acquired experimental data and a usage log are automatically saved on the computer for furthe...
Title of program: Instrumentino, Controlino
Catalogue Id: AETJ_v1_0
Nature of problem: Control and monitor purpose-made experimental instruments
Versions of this program held in the CPC repository in Mendeley Data: AETJ_v1_0; Instrumentino, Controlino; 10.1016/j.cpc.2014.06.007
Self-driving cars have become a trending subject, with significant improvements in the underlying technologies over the last decade. The purpose of this project is to train a neural network to drive an autonomous car agent on the tracks of Udacity’s Car Simulator environment. Udacity has released the simulator as open-source software, and enthusiasts have hosted a competition (challenge) to teach a car how to drive using only camera images and deep learning. Driving a car in an autonomous manner requires learning to control steering angle, throttle and brakes. The behavioral cloning technique is used to mimic human driving behavior in the training mode on the track. That means a dataset is generated in the simulator by a user-driven car in training mode, and the deep neural network model then drives the car in autonomous mode. Ultimately, the car was able to run on Track 1, generalizing well. The project aims at reaching the same accuracy on real-time data in the future.
Udacity released an open source simulator for self-driving cars to depict a real-time environment. The challenge is to mimic the driving behavior of a human on the simulator with the help of a model trained by deep neural networks. The concept is called Behavioral Cloning, to mimic how a human drives. The simulator contains two tracks and two modes, namely, training mode and autonomous mode. The dataset is generated from the simulator by the user, driving the car in training mode. This dataset is also known as the “good” driving data. This is followed by testing on the track, seeing how the deep learning model performs after being trained by that user data.
The problem is solved in the following steps:
Technologies that are used in the implementation of this project and the motivation behind using these are described in this section.
TensorFlow: This is an open-source library for dataflow programming. It is widely used for machine learning applications. It is also used as both a math library and for large-scale computation. For this project, Keras, a high-level API that uses TensorFlow as the backend, is used. Keras facilitates building models easily, as it is more user friendly.
Different libraries are available in Python that help in machine learning projects. Several of those libraries have improved the performance of this project. A few of them are mentioned in this section. First, “NumPy” provides a collection of high-level math functions to support multi-dimensional matrices and arrays. This is used for faster computations over the weights (gradients) in neural networks. Second, “scikit-learn” is a machine learning library for Python which features different algorithms and machine learning function packages. Another one is OpenCV (Open Source Computer Vision Library), which is designed for computational efficiency with a focus on real-time applications. In this project, OpenCV is used for image preprocessing and augmentation techniques.
The project makes use of a Conda environment; Conda is an open-source package and environment management system for Python that simplifies package management and deployment. It is well suited to large-scale data processing. The machine on which this project was built is a personal computer.
A CNN is a type of feed-forward neural network computing system that can be used to learn from input data. Learning is accomplished by determining a set of weights or filter values that allow the network to model the behavior according to the training data. The desired output and the output generated by a CNN initialized with random weights will be different. This difference (the generated error) is backpropagated through the layers of the CNN to adjust the weights of the neurons, which in turn reduces the error and allows us to produce output closer to the desired one.
CNNs are good at capturing hierarchical and spatial structure in images. They utilize filters that look at regions of an input image with a defined window size and map them to some output. The filter then slides the window by some defined stride to other regions, covering the whole image. Each convolutional filter layer thus captures the properties of the input image hierarchically in a series of subsequent layers, capturing details like lines in the image, then shapes, then whole objects in later layers. A CNN is therefore a good fit for taking the images of a dataset as input and classifying them into their respective classes.
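As a concrete illustration of the kind of network used for behavioral cloning, here is a minimal Keras sketch that maps a camera frame to a steering angle. The input shape and layer sizes are illustrative assumptions, not the exact architecture used in the project:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal CNN for behavioral cloning: camera image in, steering angle out.
# The 66x200x3 input shape and layer sizes are illustrative choices.
model = models.Sequential([
    layers.Input(shape=(66, 200, 3)),
    layers.Conv2D(24, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(36, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(48, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),
    layers.Dense(50, activation="relu"),
    layers.Dense(1),                      # predicted steering angle (regression)
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```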
Another type of layer sometimes used in deep learning networks is the TimeDistributed layer. TimeDistributed layers are provided in Keras as wrapper layers: the wrapped layer is applied to every temporal slice of an input. The input is required to be at least three-dimensional, with the first index considered the temporal dimension. A TimeDistributed wrapper can be applied to a dense layer so that it is applied to each timestep independently, or it can even be used with convolutional layers. The way they are written in Keras is also simple, as shown in Figure 1 and Figure 2; an equivalent code sketch is given after the figure captions below.
Fig. 1: TimeDistributed Dense layer
Fig. 2: TimeDistributed Convolution layer
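Since the original figures were code screenshots, here is a minimal sketch of both usages in Keras; the timestep counts and tensor shapes are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# TimeDistributed Dense: the same Dense layer is applied to each of the 10 timesteps.
dense_model = models.Sequential([
    layers.Input(shape=(10, 16)),                 # (timesteps, features)
    layers.TimeDistributed(layers.Dense(8)),
])

# TimeDistributed Conv2D: the same convolution is applied to every frame of a sequence.
conv_model = models.Sequential([
    layers.Input(shape=(10, 64, 64, 3)),          # (timesteps, height, width, channels)
    layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu")),
])

print(dense_model(np.zeros((2, 10, 16), dtype="float32")).shape)        # (2, 10, 8)
print(conv_model(np.zeros((2, 10, 64, 64, 3), dtype="float32")).shape)  # (2, 10, 62, 62, 16)
```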
We will first download the simulator to start our behavioural training process. Udacity has built a simulator for self-driving cars and made it open source for enthusiasts, so they can work on something close to a real-time environment. It is built on Unity, the video game development platform. The simulator has configurable resolution and control settings and is very user friendly. The graphics and input configurations can be changed according to user preference and machine configuration, as shown in Figure 3. The user pushes the “Play!” button to enter the simulator user interface. You can enter the Controls tab to explore the keyboard controls, quite similar to a racing game, as can be seen in Figure 4.
Fig. 3: Configuration screen
Fig. 4: Controls Configuration
The first actual screen of the simulator can be seen in Figure 5 and its components are discussed below. The simulator involves two tracks. One of them can be considered simple and the other complex, as is evident in the screenshots attached in Figure 6 and Figure 7. The word “simple” here just means that it has fewer curvy sections and is easier to drive on; refer to Figure 6. The “complex” track has steep elevations, sharp turns, a shadowed environment, and is tough to drive on, even by a user doing it manually; refer to Figure 7. There are two modes for driving the car in the simulator: (1) training mode and (2) autonomous mode. The training mode gives you the option of recording your run and capturing the training dataset. The small red sign at the top right of the screen in Figures 6 and 7 indicates that the car is being driven in training mode. The autonomous mode can be used to test the models to see if they can drive on the track without human intervention. Also, if you try to press the controls to get the car back on track, it will immediately notify you that it has shifted to manual controls. The mode screenshot can be seen in Figure 8. Once we have mastered the car's driving controls in the simulator using the keyboard keys, we can get started with the record button to collect data. We will save the data from it in a specified folder as you can see
Spatial analysis and statistical summaries of the Protected Areas Database of the United States (PAD-US) provide land managers and decision makers with a general assessment of management intent for biodiversity protection, natural resource management, and recreation access across the nation. The PAD-US 3.0 Combined Fee, Designation, Easement feature class (with Military Lands and Tribal Areas from the Proclamation and Other Planning Boundaries feature class) was modified to remove overlaps, avoiding overestimation in protected area statistics and to support user needs. A Python scripted process ("PADUS3_0_CreateVectorAnalysisFileScript.zip") associated with this data release prioritized overlapping designations (e.g. Wilderness within a National Forest) based upon their relative biodiversity conservation status (e.g. GAP Status Code 1 over 2), public access values (in the order of Closed, Restricted, Open, Unknown), and geodatabase load order (records are deliberately organized in the PAD-US full inventory with fee owned lands loaded before overlapping management designations, and easements). The Vector Analysis File ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") associated item of PAD-US 3.0 Spatial Analysis and Statistics ( https://doi.org/10.5066/P9KLBB5D ) was clipped to the Census state boundary file to define the extent and serve as a common denominator for statistical summaries. Boundaries of interest to stakeholders (State, Department of the Interior Region, Congressional District, County, EcoRegions I-IV, Urban Areas, Landscape Conservation Cooperative) were incorporated into separate geodatabase feature classes to support various data summaries ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip") and Comma-separated Value (CSV) tables ("PADUS3_0SummaryStatistics_TabularData_CSV.zip") summarizing "PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.zip" are provided as an alternative format and enable users to explore and download summary statistics of interest (Comma-separated Table [CSV], Microsoft Excel Workbook [.XLSX], Portable Document Format [.PDF] Report) from the PAD-US Lands and Inland Water Statistics Dashboard ( https://www.usgs.gov/programs/gap-analysis-project/science/pad-us-statistics ). In addition, a "flattened" version of the PAD-US 3.0 combined file without other extent boundaries ("PADUS3_0VectorAnalysisFile_ClipCensus.zip") allow for other applications that require a representation of overall protection status without overlapping designation boundaries. The "PADUS3_0VectorAnalysis_State_Clip_CENSUS2020" feature class ("PADUS3_0VectorAnalysisFileOtherExtents_Clip_Census.gdb") is the source of the PAD-US 3.0 raster files (associated item of PAD-US 3.0 Spatial Analysis and Statistics, https://doi.org/10.5066/P9KLBB5D ). Note, the PAD-US inventory is now considered functionally complete with the vast majority of land protection types represented in some manner, while work continues to maintain updates and improve data quality (see inventory completeness estimates at: http://www.protectedlands.net/data-stewards/ ). In addition, changes in protected area status between versions of the PAD-US may be attributed to improving the completeness and accuracy of the spatial data more than actual management actions or new acquisitions. USGS provides no legal warranty for the use of this data. While PAD-US is the official aggregation of protected areas ( https://www.fgdc.gov/ngda-reports/NGDA_Datasets.html ), agencies are the best source of their lands data.
http://www.gnu.org/licenses/gpl-3.0
Python & MATLAB source code to capture traces & generate the results used in 'Side-Channel Based Intrusion Detection for Industrial Control Systems' (doi:10.1007/978-3-319-99843-5_19) and 'Security and Privacy in the Smart Grid' (PhD Thesis, ISBN 978-94-6473-209-2).
Industrial Control Systems are under increased scrutiny. Their security is historically sub-par, and although measures are being taken by the manufacturers to remedy this, the large installed base of legacy systems cannot easily be updated with state-of-the-art security measures. In these publications we use a technique from cryptographic side-channel analysis, multivariate templating, to detect anomalous behaviour in Programmable Logic Controllers. Our solution uses side-channel measurements of the electromagnetic emissions of an industrial control system to detect behavioural changes of the software running on them. To demonstrate the feasibility of this method, we show it is possible to profile and distinguish between even small changes in programs on Siemens S7-317 PLCs, using methods from cryptographic side-channel analysis.
The code consists of Python source files for capturing electromagnetic traces, and Python & MATLAB source files for analysing the resulting dataset. The raw EM traces used to obtain the results in the aforementioned publications are currently available as a separate dataset at doi:10.17026/dans-ztf-vrz9.
Date Submitted: 2023-08-23 Modified: 2023-08-22 Modified: 2017-06-23
Load, wind and solar, prices in hourly resolution. This data package contains different kinds of timeseries data relevant for power system modelling, namely electricity prices, electricity consumption (load) as well as wind and solar power generation and capacities. The data is aggregated either by country, control area or bidding zone. Geographical coverage includes the EU and some neighbouring countries. All variables are provided in hourly resolution. Where original data is available in higher resolution (half-hourly or quarter-hourly), it is provided in separate files. This package version only contains data provided by TSOs and power exchanges via ENTSO-E Transparency, covering the period 2015-mid 2020. See previous versions for historical data from a broader range of sources. All data processing is conducted in Python/pandas and has been documented in the Jupyter notebooks linked below.
https://www.marketresearchintellect.com/privacy-policy
The size and share of this market are categorized based on Application Development (Web Development, Mobile Application Development, Desktop Application Development, Game Development, Embedded Systems Development) and Software Development Tools (Integrated Development Environment (IDE), Version Control Systems, Build Automation Tools, Debugging Tools, Code Analysis Tools) and Programming Languages (Java, Python, C#, JavaScript, Ruby) and Database Management (Relational Database Management Systems (RDBMS), NoSQL Databases, Cloud Databases, Data Warehousing Solutions, Database Development Tools) and Cloud Computing (Infrastructure as a Service (IaaS), Platform as a Service (PaaS), Software as a Service (SaaS), Function as a Service (FaaS), Cloud Development Frameworks) and geographical regions (North America, Europe, Asia-Pacific, South America, Middle-East and Africa).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo dataset contains all the resources of the paper 'IncRML: Incremental Knowledge Graph Construction from Heterogeneous Data Sources', submitted to the Semantic Web Journal's Special Issue on Knowledge Graph Construction. This resource aims to make the paper's experiments fully reproducible through our experiment tool written in Python, which was already used in the Knowledge Graph Construction Challenge by the ESWC 2023 Workshop on Knowledge Graph Construction. The exact Java JAR file of the RMLMapper (rmlmapper.jar) that was used to execute the experiments is also provided in this dataset. This JAR file was executed with Java OpenJDK 11.0.20.1 on Ubuntu 22.04.1 LTS (Linux 5.15.0-53-generic). Each experiment was executed 5 times and the median values are reported together with the standard deviation of the measurements.
We provide both dataset dumps of the GTFS-Madrid-Benchmark and of real-life use cases from Open Data in Belgium.
GTFS-Madrid-Benchmark dumps are used to analyze the impact on execution time and resources, while the real-life use cases aim to verify the approach on different types of datasets since the GTFS-Madrid-Benchmark is a single type of dataset which does not advertise changes at all.
By using our experiment tool, you can easily reproduce the experiments as follows:
Testcases to verify the integration of RML and LDES with IncRML, see https://doi.org/10.5281/zenodo.10171394
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This contains the code and output logs to run the BlueSky simulator for "Availability and utilisation of airspace structure in a U-space traffic management system". The code provided here is a modified version of the main fork of BlueSky (https://github.com/TUDelft-CNS-ATM/bluesky).
The first step is to install the correct environment. Refer to `condaenv.txt` for the list of packages needed to run the simulator.
After setting up the environment, we then need to save all of the potential paths of drones in `bluesky/plugins/streets/path_plan_dills`. Note that this takes about 180 GB of storage, so make sure to have enough available. The paths can be downloaded from https://surfdrive.surf.nl/files/index.php/s/makXrEfPtrtdzaO. There are some example paths saved in this dataset, but it will not be possible to run all of the experiments without downloading the paths.
The scenarios for sub-experiment 1 are saved in `bluesky/scenario/subexperiment1`.
The scenarios for sub-experiment 2 are saved in `bluesky/scenario/subexperiment2`.
To run the scenarios we first need to start a bluesky server by running the following code inside `bluesky`:
`python BlueSky.py --headless`
In another terminal we can start a bluesky client by running:
`python BlueSky.py --client`
In the bluesky console we can now run each batch scenario by typing and entering:
`batch batch_subexperiment_1.scn` or
`batch batch_subexperiment_2.scn`
The logs of the scenarios are saved in `bluesky/output`.
Without the paths, it will not be possible to run the simulations. However, this code currently includes some paths so that it is possible to run some example scenarios. The zeroth repetition for the low imposed traffic demand case can be run without all of the paths. For example, `bluesky/scenario/subexperiment1/Flight_intention_low_40_0_1to1.scn` and `bluesky/scenario/subexperiment2/Flight_intention_low_40_0_baseline.scn` can be run directly with this dataset.
First start bluesky by running:
`python BlueSky.py`
In the console, type:
`ic subexperiment2/Flight_intention_low_40_0_baseline.scn`
Please do not hesitate to contact me with any questions.
-Andres
With exception of metabolic simulations performed using TIMES (version 2.31.2.82), all work was performed using Python (version 3.10.4) run with IPython (version 8.4.0) in JupyterLab (version 3.3.2). The Toolbox API (OECD Toolbox version 4.5 with Service Pack 1 update, API version 6), and BioTransformer (Wishart Lab, version 3.0, executable Java Archive, June 15, 2022 release) were used for automated metabolic simulations. Efficient batch execution of metabolism simulations was handled via parallel processing multiple individual calls to either BioTransformer or the Toolbox API via the “multiprocess” package. The command line interface (CLI) calls needed to interact with BioTransformer were executed via the “subprocess” package, and the Toolbox API was queried via its Swagger user interface hosted on a locally running Windows Desktop instance of the Toolbox Server. The data generated from the MetSim hierarchical schema were translated into JavaScript Object Notation (JSON) format using Python. The resulting data were inserted into a Mongo Database (MongoDB) using the “pymongo” package for efficient storage and retrieval. The code repository including all Jupyter Notebooks documenting the analysis performed and the MetSim framework are available at https://github.com/patlewig/metsim. Data files needed to reproduce the analysis are provided at https://doi.org/10.23645/epacomptox.25463926 and as Supporting Information. This dataset is associated with the following publication: Groff, L., A. Williams, I. Shah, and G. Patlewicz. MetSim: Integrated Programmatic Access and Pathway Management for Xenobiotic Metabolism Simulators. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, USA, 37(5): 685-697, (2024).
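As an illustration of the batch-execution pattern described above (parallel worker processes, each invoking a simulator through a CLI call), here is a minimal sketch using Python's standard multiprocessing and subprocess modules. Note that the study used the "multiprocess" package rather than the standard library, and the BioTransformer jar name and flags shown here are placeholders, not the tool's verified interface:

```python
import subprocess
from multiprocessing import Pool

SMILES = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]   # example input structures

def run_biotransformer(smiles: str) -> str:
    # Placeholder CLI call: the actual jar name and arguments must be taken
    # from the BioTransformer documentation.
    cmd = ["java", "-jar", "biotransformer.jar", "--smiles", smiles]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

if __name__ == "__main__":
    # Parallelize the individual CLI calls across worker processes.
    with Pool(processes=4) as pool:
        outputs = pool.map(run_biotransformer, SMILES)
    print(f"collected {len(outputs)} simulation outputs")
```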
The Massachusetts Drought Management Plan (DMP, 2023) uses data from select lake and impoundment systems as an index for drought in six of seven regions in the state. The contents of these lakes and impoundments are reported to the Massachusetts Department of Conservation and Recreation (DCR) and classified as one of five levels for drought severity ranging from level 0 (Normal; percentile greater than 30) to level 4 (Emergency; percentile less than 2). Lake and impoundment system data are provided at the end of each month to DCR through multiple agencies as lake levels, volumes, or percent-full (reservoir capacity). USGS reviewed data from 14 of the lake or impoundment systems including 28 waterbodies. Diagrams for each system show the capacity of each waterbody and how water is transported through the systems. This data release provides historical monthly data in volume for each system and historical monthly data in feet for systems that consist of only one waterbody, when recorded values were available. From these historical monthly data, the 50th-, 30th-, 20th-, 10th-, and 2nd- percentiles were computed. Stage volume rating data for each waterbody at each system are provided in two formats to convert gage height (feet) to volume (million gallons). One stage volume rating file is formatted as a text (.txt) table for easy manual reading, and the other is in a comma-separated value (.csv) column format that is easily loaded into a spreadsheet. Stage volume rating data were provided by the municipalities and agencies that manage the systems or were developed for this study. At one system (Hudson, Gates Pond), no stage volume rating data or bathymetry data were available; a stage volume rating was developed with a Python script using the maximum depth and a shapefile of the pond shoreline. The Python script used to develop the stage volume rating data and the R script used to compute the quantiles are published as a part of this data release. Files for each system include supplied historical volume, computed volume percentiles, stage volume rating(s), and a system diagram. Historical elevation data and computed elevation percentiles are included when applicable.
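For example, converting a gage height reading to volume from one of the .csv rating tables could look like the minimal sketch below; the file name and column names are hypothetical, since the actual rating files define their own headers:

```python
import numpy as np
import pandas as pd

# Hypothetical stage-volume rating table: gage height (feet) -> volume (million gallons).
rating = pd.read_csv("stage_volume_rating.csv")   # columns assumed: gage_height_ft, volume_mg

def height_to_volume(height_ft: float) -> float:
    # Linear interpolation between rating points (np.interp expects increasing x values).
    return float(np.interp(height_ft,
                           rating["gage_height_ft"],
                           rating["volume_mg"]))

print(height_to_volume(12.3))
```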