The nbedit extension for CKAN allows users to create, edit, and run Jupyter Notebooks directly within the CKAN environment. It enables users to integrate data exploration and analysis workflows alongside their data management activities and streamlines work with datasets by providing an edit view dedicated to notebook editing.
Key Features:
Create Notebook Edit View: Lets users construct edit views in CKAN configured specifically for editing Jupyter Notebooks, laying the foundation for fully integrating analytical workflows with data management.
Jupyter Notebook Server Integration: The extension manages starting and stopping a Jupyter Notebook server, potentially simplifying the technical challenges of deploying such tools within the CKAN ecosystem.
User Authentication and Authorization: The extension creates a corresponding JupyterHub user for each CKAN user and dynamically requests API tokens to manage access, so users can open notebooks with tokens tied to their current CKAN sessions.
Project-Based Notebook Management: The extension maps CKAN projects to corresponding JupyterHub groups, enabling better administration and reporting across datasets and notebook-related activities.
Fullscreen Editing: Provides an option to open a notebook in fullscreen mode, giving complete focus to data analysis and code development within the CKAN environment.
Technical Integration: The nbedit extension uses an API token set up as a service account for JupyterHub. It automatically creates a corresponding JupyterHub user for each CKAN user and automatically creates groups with administrative reporting capabilities in JupyterHub. The extension likely interfaces with the CKAN resource view system and triggers backend processes to start/stop notebook servers and manage user authentication. It interacts with the JupyterHub API, requesting tokens for users on behalf of the current CKAN session. The extension requires configuration settings in the CKAN configuration file (e.g., /etc/ckan/default/development.ini).
Benefits & Impact: By integrating Jupyter Notebook functionality, this extension allows users to explore, analyze, and edit data in CKAN. The Jupyter Notebook environment can easily be set up to edit and view datasets, and the integration increases productivity by removing the need to switch between CKAN and external Jupyter Notebook servers.
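The user-and-token flow described under Technical Integration maps onto the JupyterHub REST API. The sketch below is only an illustration of that flow and is not taken from the nbedit source: the hub URL, service token, token note, and lifetime are placeholders, and exact endpoint behaviour can vary between JupyterHub versions.

import requests

HUB_API = "http://jupyterhub:8000/hub/api"            # placeholder hub URL
SERVICE_TOKEN = "replace-with-service-account-token"  # token configured for the CKAN service account
HEADERS = {"Authorization": f"token {SERVICE_TOKEN}"}

def ensure_user_and_token(ckan_username):
    # Create the JupyterHub user for this CKAN user; 409 is assumed to mean "already exists".
    resp = requests.post(f"{HUB_API}/users/{ckan_username}", headers=HEADERS)
    if resp.status_code not in (201, 409):
        resp.raise_for_status()
    # Request a short-lived API token on behalf of the user for the current CKAN session.
    resp = requests.post(
        f"{HUB_API}/users/{ckan_username}/tokens",
        headers=HEADERS,
        json={"note": "ckan nbedit session", "expires_in": 3600},
    )
    resp.raise_for_status()
    return resp.json()["token"]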
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource includes two Jupyter Notebooks as a quick start tutorial for the ERA5 Data Component of the PyMT modeling framework (https://pymt.readthedocs.io/) developed by Community Surface Dynamics Modeling System (CSDMS https://csdms.colorado.edu/).
The bmi_era5 package is an implementation of the Basic Model Interface (BMI https://bmi.readthedocs.io/en/latest/) for the ERA5 dataset (https://confluence.ecmwf.int/display/CKB/ERA5). This package uses the cdsapi (https://cds.climate.copernicus.eu/api-how-to) to download the ERA5 dataset and wraps it with BMI for data control and query (currently, 3-dimensional ERA5 datasets are supported). This package is not implemented for people to use directly but is the key element that helps convert the ERA5 dataset into a data component for the PyMT modeling framework.
The pymt_era5 package is implemented for people to use as a reusable, plug-and-play ERA5 data component for the PyMT modeling framework. This package uses the BMI implementation from the bmi_era5 package and allows the ERA5 datasets to be easily coupled with other datasets or models that expose a BMI.
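A data component that exposes a BMI is normally driven through the standard BMI calls (initialize, get_value, update, finalize). The sketch below shows that generic pattern only; the BmiEra5 class name, the configuration file name, and the variable name are assumptions and should be checked against the bmi_era5 documentation and the notebooks in this resource.

import numpy as np
from bmi_era5 import BmiEra5   # assumed import path; see the bmi_era5 README

data_comp = BmiEra5()
data_comp.initialize("config_file.yaml")   # YAML file describing the cdsapi request (assumed name)

# Inspect the exposed variables and pull one of them into a NumPy buffer.
print(data_comp.get_output_var_names())
var_name = "2 metre temperature"           # placeholder variable name
grid_id = data_comp.get_var_grid(var_name)
dest = np.empty(data_comp.get_grid_size(grid_id), dtype=data_comp.get_var_type(var_name))
data_comp.get_value(var_name, dest)

data_comp.finalize()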
HydroShare users can test and run the Jupyter Notebooks (bmi_era5.ipynb, pymt_era5.ipynb) directly through the "CUAHSI JupyterHub" web app with the following steps:
- New users of the CUAHSI JupyterHub should first request to join the "CUAHSI Cloud Computing Group" (https://www.hydroshare.org/group/156). After approval, the user will gain access to launch the CUAHSI JupyterHub.
- Click on the "Open with" button (at the top right corner of the page).
- Select "CUAHSI JupyterHub".
- Select the "CSDMS Workbench" server option. (Make sure to select the right server option; otherwise, the notebook won't run correctly.)
If there is any question or suggestion about the ERA5 data component, please create a GitHub issue at https://github.com/gantian127/bmi_era5/issues
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource includes two Jupyter Notebooks as a quick start tutorial for the ROMS data component of the PyMT modeling framework (https://pymt.readthedocs.io/) developed by Community Surface Dynamics Modeling System (CSDMS https://csdms.colorado.edu/).
The bmi_roms package is an implementation of the Basic Model Interface (BMI https://bmi.readthedocs.io/en/latest/) for the ROMS model (https://www.myroms.org/) datasets. This package downloads the datasets and wraps them with BMI for data control and query. This package is not implemented for people to use directly but is the key element to convert the ROMS model output dataset into a data component for the PyMT modeling framework.
The pymt_roms package is implemented for people to use as a reusable, plug-and-play ROMS data component for the PyMT modeling framework. This package uses the BMI implementation from the bmi_roms package and allows the ROMS datasets to be easily coupled with other datasets or models that expose a BMI.
If there is any question or suggestion about the ROMS data component, please create a GitHub issue at https://github.com/gantian127/bmi_roms/issues
Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, is heterogeneous and metrics to evaluate its quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.
Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.
Results: Our final analysis included 222 countries and regions...
Data collection: COVID-19 data was downloaded from WHO. Using a public repository, we added the countries' full names to the WHO data set, using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data covers January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.
Data processing: We processed the data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
Any text editor, including Microsoft Excel and its free alternatives, can open the uploaded CSV file. Any web browser and some code editors (such as the freely available Visual Studio Code) can display the uploaded Jupyter Notebook if the required Python environment is set up correctly.
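The binary reporting rate mentioned above can be read as the share of days on which a country actually filed a report; the authoritative definition is in reporting_behavior.ipynb. The pandas sketch below only illustrates that reading and assumes a long-format WHO export with Country, Date_reported, and New_cases columns (the real file and column names may differ).

import pandas as pd

# Assumed long-format WHO export: one row per country and day (file name is a placeholder).
df = pd.read_csv("WHO-COVID-19-global-data.csv", parse_dates=["Date_reported"])

# Binary reporting indicator: did the country report a case count on that day?
df["reported"] = df["New_cases"].notna().astype(int)

# Binary reporting rate per country: fraction of days in the study period with a report.
binary_reporting_rate = df.groupby("Country")["reported"].mean()
print(binary_reporting_rate.sort_values().head())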
For the automated workflows, we created Jupyter notebooks for each state. In these workflows, the GIS processing to merge, extract, and project GeoTIFF data was the most important step. For this process, we used ArcPy, a Python package for geographic data analysis, data conversion, and data management in ArcGIS (Toms, 2015). After creating state-scale LSS datasets in GeoTIFF format, we converted GeoTIFF to NetCDF using the xarray and rioxarray Python packages. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio xarray extension; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. We used xarray to adjust data types and add metadata to the NetCDF file, and rioxarray to save GeoTIFF data in NetCDF format. Through these procedures, we created three composite HydroShare resources to share the state-scale LSS datasets. Due to the licensing limitations of ArcGIS Pro, which is commercial GIS software, we developed this Jupyter notebook on Windows OS.
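A minimal sketch of the GeoTIFF-to-NetCDF step described above, using rioxarray; the file names and metadata attributes are placeholders, and the ArcPy merge/extract/project steps are assumed to have already produced the input GeoTIFF.

import rioxarray

# Open the state-scale GeoTIFF produced by the ArcPy workflow (placeholder file name).
da = rioxarray.open_rasterio("state_lss.tif", masked=True)

# Adjust the data type and attach metadata with xarray before writing NetCDF.
da = da.astype("float32")
da.attrs.update({
    "long_name": "state-scale LSS dataset",  # placeholder metadata
    "units": "unknown",                      # placeholder metadata
})

# The CRS travels along as a spatial_ref coordinate, so the NetCDF stays georeferenced.
da.to_netcdf("state_lss.nc")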
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset compares fixed-line broadband internet speeds across five cities: - Melbourne, AU - Bangkok, TH - Shanghai, CN - Los Angeles, US - Alice Springs, AU
ERRATA: 1. Data is for Q3 2020, but some files are incorrectly labelled as 02-20 or June 20. They should all read Sept 20, or 09-20, i.e. Q3 20 rather than Q2. Will rename and reload. Amended in v7.
*Lines of data for each geojson file; a line equates to a 600m^2 location, inc total tests, devices used, and average upload and download speed:
- MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
- SHG 31745 lines => 0.65M speedtests (2.5/100pp)
- BKK 29296 lines => 1.5M speedtests (14.3/100pp)
- LAX 15899 lines => 1.3M speedtests (10.4/100pp)
- ALC 76 lines => 500 speedtests (2/100pp)
Geojsons of these 2° by 2° extracts for MEL, BKK, SHG now added, LAX added in v6, and Alice Springs added in v15.
This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.
** To Do: Will add Google Map versions so everyone can see without installing Jupyter. - Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast), Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook. - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).
** Python
melb = au_tiles.cx[144:146, -39:-37]   # Lat/Lon extract
shg = tiles.cx[120:122, 30:32]         # Lat/Lon extract
bkk = tiles.cx[100:102, 13:15]         # Lat/Lon extract
lax = tiles.cx[-118:-120, 33:35]       # Lat/Lon extract
ALC = tiles.cx[132:134, -22:-24]       # Lat/Lon extract
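For context, the tile variables used in the extracts above would typically come from loading the GeoJSON files with geopandas; the sketch below is illustrative only, the file name is a placeholder, and the column names follow the Ookla open-data schema and may differ in these extracts.

import geopandas as gpd

# Load a city extract (placeholder file name); .cx is geopandas' coordinate-based indexer.
au_tiles = gpd.read_file("melbourne_tiles_q3_2020.geojson")
melb = au_tiles.cx[144:146, -39:-37]   # lon/lat bounding box, as in the snippet above

print(melb[["avg_d_kbps", "avg_u_kbps", "tests", "devices"]].head())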
Histograms (v9) and data visualisations (v3, 5, 9, 11) will be provided. Data sourced from: this is an extract of Speedtest Open Data available at Amazon WS (link below - opendata.aws).
**VERSIONS
v24. Add tweet and Google Map of Top 20% (over 100Mbps locations) in Mel Q3 22. Add v1.5 MEL-Superfast notebook and CSV of results (now on Google Map; link below).
v23. Add graph of 2022 broadband distribution and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
v22. Add Import ipynb; workflow-import-4cities.
v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)
v20. Speedtest - Five Cities inc ALC.
v19. Add ALC2.ipynb.
v18. Add ALC line graph.
v17. Added ipynb for ALC. Added ALC to title.
v16. Load Alice Springs data Q2 21 - csv. Added Google Map link of ALC.
v15. Load Melb Q1 2021 data - csv.
v14. Added Melb Q1 2021 data - geojson.
v13. Added Twitter link to pics.
v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
v11. Add Line-Compare pic, plotting four cities on a graph.
v10. Add four histograms in one pic.
v9. Add histogram for four cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
v8. Renamed LAX file to Q3, rather than 03.
v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
v6. Added LAX file.
v5. Add screenshot of BKK Google Map.
v4. Add BKK Google Map (link below), and BKK csv mapping files.
v3. Replaced MEL map with big-key version. Previous key was very tiny in top right corner.
v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
v1. Metadata record.
** LICENCE: The AWS data licence on the Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be: - non-commercial (NC) - share-alike (SA) on reuse (apply the same licence). This restricts the standard CC-BY Figshare licence.
** Other uses of Speedtest Open Data; - see link at Speedtest below.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all comments (comments and replies) of the YouTube vision video "Tunnels" by "The Boring Company", fetched on 2020-10-13 using the YouTube API. The comments were classified manually by three persons. We performed a single-class labeling of the video comments regarding their relevance for requirements engineering (RE) (ham/spam) and their polarity (positive/neutral/negative). Furthermore, we performed a multi-class labeling of the comments regarding their intention (feature request and problem report) and their topic (efficiency and safety). While a comment can only be relevant or not relevant and have only one polarity, a comment can have one or more intentions and also one or more topics.
For the replies, one person also classified them regarding their relevance for RE. However, the investigation of the replies is ongoing and future work.
Remark: For 126 comments and 26 replies, we could not determine the date and time since they were no longer accessible on YouTube at the time this data set was created. In the case of a missing date and time, we inserted "NULL" in the corresponding cell.
This data set includes the following files:
Dataset.xlsx contains the raw and labeled video comments and replies:
For each comment, the data set contains:
ID: An identification number generated by YouTube for the comment
Date: The date and time of the creation of the comment
Author: The username of the author of the comment
Likes: The number of likes of the comment
Replies: The number of replies to the comment
Comment: The written comment
Relevance: Label indicating the relevance of the comment for RE (ham = relevant, spam = irrelevant)
Polarity: Label indicating the polarity of the comment
Feature request: Label indicating that the comment requests a feature
Problem report: Label indicating that the comment reports a problem
Efficiency: Label indicating that the comment deals with the topic efficiency
Safety: Label indicating that the comment deals with the topic safety
For each reply, the data set contains:
ID: The identification number of the comment to which the reply belongs
Date: The date and time of the creation of the reply
Author: The username of the author of the reply
Likes: The number of likes of the reply
Comment: The written reply
Relevance: Label indicating the relevance of the reply for RE (ham = relevant, spam = irrelevant)
Detailed analysis results.xlsx contains the detailed results of the ten-times-repeated 10-fold cross-validation analyses for each of the considered combinations of machine learning algorithms and features
Guide Sheet - Multi-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual multi-class labeling
Guide Sheet - Single-class labeling.pdf describes the coding task, defines the categories, and lists examples to reduce inconsistencies and increase the quality of manual single-class labeling
Python scripts for analysis.zip contains the scripts (as jupyter notebooks) and prepared data (as csv-files) for the analyses
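A small pandas sketch for getting started with Dataset.xlsx, assuming the comments live on their own worksheet with the column names listed above and that the label columns are binary; the sheet name is a guess and should be checked in the workbook.

import pandas as pd

# Sheet name is an assumption; adjust after inspecting the workbook.
comments = pd.read_excel("Dataset.xlsx", sheet_name="Comments")

# Relevance uses ham = relevant, spam = irrelevant (see the column description above).
relevant = comments[comments["Relevance"] == "ham"]
print(f"{len(relevant)} of {len(comments)} comments are relevant for RE")

# Multi-class labels: a comment may be both a feature request and a problem report
# (assumes 0/1 label columns).
print(relevant[["Feature request", "Problem report", "Efficiency", "Safety"]].sum())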
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Git archive containing Python modules and resources used to generate machine-learning models used in the "Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada" project. This software is licensed as free to use, modify, and distribute with attribution. Full license details are included within the archive. See "documentation.zip" for setup instructions and file trees annotated with module descriptions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource includes two Jupyter Notebooks as a quick start tutorial for the NWIS Data Component of the PyMT modeling framework (https://pymt.readthedocs.io/) developed by Community Surface Dynamics Modeling System (CSDMS https://csdms.colorado.edu/).
The bmi_nwis package is an implementation of the Basic Model Interface (BMI https://bmi.readthedocs.io/en/latest/) for the USGS NWIS dataset (https://waterdata.usgs.gov/nwis). This package uses the dataretrieval package (https://github.com/USGS-python/dataretrieval) to download the NWIS dataset and wraps the dataset with BMI for data control and query. This package is not implemented for people to use but is the key element to convert the NWIS dataset into a data component for the PyMT modeling framework.
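The dataretrieval download step that bmi_nwis builds on can also be exercised on its own. The sketch below follows the USGS dataretrieval package's nwis.get_record interface; the site number, service, and date range are placeholders.

from dataretrieval import nwis

# Download daily-value records for a placeholder USGS site and date range.
df = nwis.get_record(
    sites="03339000",   # placeholder site number
    service="dv",       # daily values
    start="2021-01-01",
    end="2021-01-31",
)
print(df.head())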
The pymt_nwis package is implemented for people to use as a reusable, plug-and-play NWIS data component for the PyMT modeling framework. This package uses the BMI implementation from the bmi_nwis package and allows the NWIS datasets to be easily coupled with other datasets or models that expose a BMI.
HydroShare users can test and run the Jupyter Notebooks (bmi_nwis.ipynb, pymt_nwis.ipynb) directly through the "CUAHSI JupyterHub" web app with the following steps:
- New users of the CUAHSI JupyterHub should first request to join the "CUAHSI Cloud Computing Group" (https://www.hydroshare.org/group/156). After approval, the user will gain access to launch the CUAHSI JupyterHub.
- Click on the "Open with" button (at the top right corner of the page).
- Select "CUAHSI JupyterHub".
- Select the "CSDMS Workbench" server option. (Make sure to select the right server option; otherwise, the notebook won't run correctly.)
If there is any question or suggestion about the NWIS data component, please create a GitHub issue at https://github.com/gantian127/bmi_nwis/issues
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
data-analysis, an R-based container we used to run our data analysis.
data-collection, a Python container we used to collect Scikit's default arguments and detect them in client applications.
database, a Postgres container we used to store clients' data, obtained from Grotov et al.
storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
docker-compose.yml, the Docker file that configures all containers used in the package.
In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Setup the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers. This way, you can directly open and work inside each container without any specific configuration.
You first need to set up the containers:
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait for Docker to create and run all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter inside the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the table list as above, your database is properly set up.
It is important to mention that this database is extended from the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
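A hedged sketch of how the added columns could be inspected from Python with psycopg2, reusing the connection parameters from the psql example above; the password is a placeholder, and the quoting assumes the mixed-case table and column names shown in the \dt output (if they were created unquoted, use lowercase names without quotes).

import psycopg2

conn = psycopg2.connect(
    dbname="jupyter-notebooks", user="postgres", host="localhost",
    password="postgres",  # placeholder
)
with conn, conn.cursor() as cur:
    cur.execute(
        'SELECT "API_functions_calls", "defined_functions_calls", "other_functions_calls" '
        'FROM "Notebook_features" LIMIT 5'
    )
    for row in cur.fetchall():
        print(row)
conn.close()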
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
dabcs.py, extracts DABCs from Scikit Learn source code and exports them to a CSV file.
dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to leverage the function calls. You can find the tool's source code in the matroskin directory.
Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
matroskin, the directory containing the modified version of the matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
storage, a docker volume where data-collection should save the exported data. This data will be used later in Data Analysis.
requirements.txt, Python dependencies adopted in this module.
Note that the container will automatically configure this module for you, e.g., install dependencies, configure matroskin, download the Scikit Learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see the project files, it means the container is configured correctly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
dependencies.R, an R script containing the dependencies used in our data analysis.
data-analysis.Rmd, the R notebook we used to perform our data analysis.
datasets, a docker volume pointing to the storage directory.
Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see the project files, it means the container is configured correctly.
A note on the shared storage folder
As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose: These are a collection of supplementary files that are to be included in my dissertation. They include, but are not limited to, small IPython notebooks, extra figures, and data sets that are too large to publish in the main document, such as full ortholog lists and other primary data.
Viewing IPython notebooks (ipynb files): To view an IPython notebook, right-click its download link and select "Copy link address". Then navigate to the free notebook viewer at http://nbviewer.ipython.org/. Finally, paste the copied link to the ipynb file into the URL form on the nbviewer page and click "Go".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This data was collected and processed as part of the CRC 985 INF project. It was used to create an overview of the data-producing methods available and employed throughout the project and their associated file types. This information was used as a basis for the associated manuscript (see related identifiers). The Jupyter Notebook used to create the figures in the manuscript is included within this dataset. Furthermore, the surveys give insight into research data management practices within this project and large, interdisciplinary projects in general. Method: Survey
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data was collected and processed as part of the CRC 985 INF project. It was used to create an overview of the data-producing methods available and employed throughout the project and their associated file types. This information was used as a basis for the associated manuscript (see related identifiers). The Jupyter Notebook used to create the figures in the manuscript is included within this dataset. Furthermore, the surveys give insight into research data management practices within this project and large, interdisciplinary projects in general.
This data repository contains the data sets and python scripts associated with the manuscript 'Machine learning isotropic g values of radical polymers'. Electron paramagnetic resonance measurements allow for obtaining experimental g values of radical polymers. Analogous to chemical shifts, g values give insight into the identity and environment of the paramagnetic center. In this work, machine-learning-based prediction of g values is explored as a viable alternative to computationally expensive density functional theory (DFT) methods.
Description of folder contents (switch to tree view):
Datasets: Contains PTMA polymer structures from the TR, TE-1, and TE-2 data sets transformed using a molecular descriptor (SOAP, MBTR or DAD) and the corresponding DFT-calculated g values. Filenames contain 'PTMA_X', where X denotes the number of monomers which are radicals. Structure data sets have 'structure_data' in the title; DFT-calculated g values have 'giso_DFT_data' in the title. The files are in .npy (NumPy) format.
Models: ERT models trained on SOAP, MBTR and DAD feature vectors.
Scripts: Contains scripts which can be used to predict g values from XYZ files of PTMA structures with 6 monomer units and varying radical density. The script 'prediction_functions.py' contains the functions which transform the XYZ coordinates into the appropriate feature vector which the trained model uses to predict. Descriptions of the individual functions are also given as docstrings (Python documentation strings) in the code. The folder also contains additional files needed for the ERT-DAD model in .pkl format.
XYZ_files: Contains atomic coordinates of PTMA structures in XYZ format. Two subfolders, WSD and TE-2, correspond to structures present in the whole structure data set and the TE-2 test data set (see the main text in the manuscript for details). Filenames in the folder 'XYZ_files/TE-2/PTMA-X/' are of the type 'chainlength_6ptma_Y'_Y''.xyz', where 'chainlength_6ptma' denotes the length of the polymer chain (6 monomers), Y' denotes the proportion of monomers which are radicals (for instance, Y' = 50 means 3 out of 6 monomers are radicals), and Y'' denotes the order of the MD time frame. Actual time frame values of Y'' in ps are given in the manuscript.
PTMA-ML.ipynb: Jupyter notebook detailing the workflow of generating the trained model. The file includes steps to load data sets, transform XYZ files using molecular descriptors, optimise hyperparameters, train the model, cross-validate using the training data set, and evaluate the model.
PTMA-ML.pdf: PTMA-ML.ipynb in PDF format.
List of abbreviations:
PTMA: poly(2,2,6,6-tetramethyl-1-piperidinyloxy-4-yl methacrylate)
TR: Training data set
TE-1: Test data set 1
TE-2: Test data set 2
ERT: Extremely randomized trees
WSD: Whole structure data set
SOAP: Smooth overlap of atomic positions
MBTR: Many-body tensor representation
DAD: Distances-Angles-Dihedrals
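A minimal sketch of how the transformed data sets and trained ERT models described above could be loaded and used for prediction; the file names are illustrative, and the intended entry point for end-to-end prediction from XYZ files is prediction_functions.py in the Scripts folder.

import pickle
import numpy as np

# Load a descriptor-transformed structure data set and its DFT g values (illustrative names).
X = np.load("Datasets/PTMA_3_structure_data.npy")
y_dft = np.load("Datasets/PTMA_3_giso_DFT_data.npy")

# Load a trained ERT model stored as a pickle (illustrative name).
with open("Models/ERT_SOAP.pkl", "rb") as fh:
    model = pickle.load(fh)

# Predict isotropic g values and compare against the DFT reference.
y_pred = model.predict(X)
print("mean absolute error:", np.mean(np.abs(y_pred - y_dft)))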
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
Two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv .
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, in each case comparing waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type
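A hedged sketch of how the pickled dataframes and the ASDF waveform files could be opened in Python; the misfit file name follows the pattern given above, the waveform file name is a placeholder, and reading the .h5 files assumes the pyasdf package, which is not part of this dataset.

import pandas as pd
import pyasdf

# Pickled Pandas dataframe with misfits derived from the Salvus project.
df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(df_misfits.head())

# ASDF (adaptable seismic data format) waveform file; the name is a placeholder.
with pyasdf.ASDFDataSet("synthetics_reference.h5", mode="r") as ds:
    print(ds.waveforms.list())  # stations available in the file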
How to use this dataset:
To set up the conda environment:
make sure you have anaconda/miniconda
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter Notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
This Dataverse entry contains replication data for our journal article "Static polarizabilities at the basis set limit: A benchmark of 124 species", published in the Journal of Chemical Theory and Computation. It contains highly precise static polarizabilities computed in a multiwavelet basis in combination with density functional theory (DFT, PBE functional). In addition, the data set contains analysis tools (Jupyter Notebooks with Python3 code) for generating the figures in the journal article.
How to use
Because our multiwavelet data is guaranteed to be at the complete basis set limit (to within the specified limit), it is suitable as a benchmark reference in studies of static polarizabilities where basis set convergence is important. With multiwavelets we don't have to assume that the computed property is at the basis set limit, as is the case with Gaussian type orbital (GTO) basis sets, and it is therefore possible to confirm whether the property of interest computed in a given basis is sufficiently converged with respect to the complete basis set limit. Our benchmark reference can also be used in the development of new methodology that requires accurate training data.
Running the Jupyter Notebooks
The Anaconda Python distribution is usually recommended for obtaining Jupyter Notebook. It can be downloaded from here: https://www.anaconda.com/distribution/ The simplest way to run the notebooks is to download all files in this DataverseNO dataset. That will preserve the directory structure, which is absolutely necessary to avoid errors. Then start your Jupyter Notebook session, navigate to the data set directory, and open the desired notebook.
Journal article
Brakestad et al. "Static polarizabilities at the basis set limit: A benchmark of 124 species". J. Chem. Theory Comput. (2020)
Abstract from journal article
Benchmarking molecular properties with Gaussian-type orbital (GTO) basis sets can be challenging, because one has to assume that the computed property is at the complete basis set (CBS) limit, without a robust measure of the error. Multiwavelet (MW) bases can be systematically improved with a controllable error, which eliminates the need for such assumptions. In this work, we have used MWs within Kohn–Sham density functional theory to compute static polarizabilities for a set of 92 closed-shell and 32 open-shell species. The results are compared to recent benchmark calculations employing the GTO-type aug-pc4 basis set. We observe discrepancies between GTO and MW results for several species, with open-shell systems showing the largest deviations. Based on linear response calculations, we show that these discrepancies originate from artefacts caused by the field strength, and that several polarizabilities from a previous study were contaminated by higher order responses (hyperpolarizabilities). Based on our MW benchmark results, we can affirm that aug-pc4 is able to provide results close to the CBS limit, as long as finite-difference effects can be controlled. However, we suggest that a better approach is to use MWs, which are able to yield precise finite-difference polarizabilities even with small field strengths.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contents
Set of simulation data, supplementary to a paper submitted to the Fire Safety Journal (published 15 June 2019), with the title "Application Cases of Inverse Modelling with the PROPTI Framework". See also our project at ResearchGate.
This repository contains the complete input data for each IMP run of the mass loss calorimeter shown in this paper. This comprises the experimental data files, the templates for the simulation models, and the input file for PROPTI.
The database files are provided. These include the original ones created by PROPTI during the runs, as well as the cleaned database files used to create the plots, and the extracted best parameter sets per generation. Plots created during the IMP runs as a means of monitoring progress are also included.
Furthermore, the repository contains a small collection of Jupyter notebooks which have been used to process the data base files and create the plots presented in this paper.
The full factorial simulations were set up from within a Jupyter notebook. This notebook and the conducted simulations are also part of this repository.
Data of the various TGA simulations are provided within a very similar repository, linked to a conference paper (ESFSS 2018, Nancy, France).
Finally, the simulation input files, PROPTI input, as well as the custom script for file handling in concert with OpenFOAM, are provided.
Technical Information
Each ZIP archive represents a sub-directory of the original directory. For the analysis scripts (the Jupyter notebooks) to work properly out of the box, it is necessary to keep this structure. Thus, simply extract all archives into the same directory.
Note: Size on disc, after extraction, is about 4.1 GB. Version 2 adds about 5.1 GB.
Version 2:
Version 2 contains new IMP runs that address an error in determining the normalised residual mass (see the Jupyter Notebook "RevisedTargetAssessment.ipynb"), as well as input from the reviewers. The new IMP runs are denoted by "08" after the optimisation algorithm label, e.g. "MLC_FSCABC_08_new_75kw_Ins".
This dataset contains simulation input files in GROMACS format accompanying the mentioned publication. Structure, topology, and simulation parameter files (directory mdp) are provided for bulk simulations of pure dioxane and formic acid, as well as for a mixture of both in pore and bulk simulations. The pore simulation is divided into three steps: an energy minimization, an NVT equilibration, and an NVT production run. The bulk simulations introduce an NpT step after the first equilibration step and an NpT production run instead of an NVT production run. The provided structure files are of an already equilibrated system. Object files are supplied which can be used to load the generated pores into PoreMS for later alteration and analysis. Results for the density of the pore systems are provided in HDF5 format to be processed with the PoreAna Python package, and Jupyter notebooks to load and display the data with PoreAna are provided. YAML files containing the densities of the pure and mixture bulk simulations are also added to the data set, together with an accompanying Jupyter notebook to read them.
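A minimal sketch, independent of the PoreAna-specific notebooks, for peeking at the YAML density files and the HDF5 results mentioned above; the file names are placeholders, and the actual keys and group names should be taken from the provided notebooks.

import yaml
import h5py

# Bulk densities stored as YAML (placeholder file name).
with open("density_bulk_mixture.yml") as fh:
    print(yaml.safe_load(fh))

# Pore-system density results stored as HDF5 (placeholder file name); list the top-level groups.
with h5py.File("density_pore.h5", "r") as h5:
    print(list(h5.keys()))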
In addition, the data set contains data from IR and NMR experiments.
We recommend viewing the data by choosing the option "Tree".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive reproduces a table titled "Table 3.1 Boone county population size, 1990 and 2000" from Wang and vom Hofe (2007, p.58). The archive provides a Jupyter Notebook that uses Python and can be run in Google Colaboratory. The workflow uses the Census API to retrieve data, reproduce the table, and ensure reproducibility for anyone accessing this archive. The Python code was developed in Google Colaboratory (Google Colab for short), an Integrated Development Environment (IDE) of JupyterLab that streamlines package installation, code collaboration, and management. The Census API is used to obtain population counts from the 1990 and 2000 Decennial Census (Summary File 1, 100% data). All downloaded data are maintained in the notebook's temporary working directory while in use. The data are also stored separately with this archive. The notebook features extensive explanations, comments, code snippets, and code output. The notebook can be viewed in PDF format or downloaded and opened in Google Colab. References to external resources are also provided for the various functional components. The notebook features code to perform the following functions:
install/import necessary Python packages
introduce a Census API query
download Census data via the Census API
manipulate Census tabular data
calculate absolute change and percent change
format numbers
export the table to csv
The notebook can be modified to perform the same operations for any county in the United States by changing the State and County FIPS code parameters for the Census API downloads. The notebook could be adapted for use in other environments (e.g., Jupyter Notebook), as well as reading and writing files to a local or shared drive or a cloud drive (e.g., Google Drive).
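A hedged sketch of the change calculations and CSV export listed above; the population counts used here are placeholders rather than the actual Boone County values, and the Census API download itself is left to the archived notebook.

import pandas as pd

# Placeholder counts only; the real values come from the Census API calls in the notebook.
table = pd.DataFrame(
    {"population_1990": [50000], "population_2000": [60000]},
    index=["Boone County"],
)

# Absolute change and percent change between the two Decennial Censuses.
table["absolute_change"] = table["population_2000"] - table["population_1990"]
table["percent_change"] = 100 * table["absolute_change"] / table["population_1990"]

# Format numbers for display and export the table to CSV.
print(table.round(1).to_string())
table.to_csv("table_3_1_boone_county.csv")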
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hydrologic models are growing in complexity: spatial representations, model coupling, process representations, software structure, etc. New and emerging datasets are growing, supporting even more detailed modeling use cases. This complexity is leading to the reproducibility crisis in hydrologic modeling and analysis. We argue that moving hydrologic modeling to the cloud can help to address this reproducibility crisis. We create two notebooks:
1. The first notebook demonstrates the process of collecting and manipulating GIS and time-series data using GRASS GIS, Python and R to create RHESsys model input.
2. The second notebook demonstrates the process of model compilation, parallel simulation, and visualization.
The first notebook includes:
The second notebook includes: