Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. For the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
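As an illustration (not part of the provided scripts), the connection string can be picked up with sqlalchemy along these lines; the table name queried here is a placeholder:

import os

from sqlalchemy import create_engine, text

# Read the connection string configured above.
engine = create_engine(os.environ["JUP_DB_CONNECTION"])

# Quick sanity check that the restored dump is reachable
# ("notebooks" is a placeholder table name).
with engine.connect() as conn:
    count = conn.execute(text("SELECT count(*) FROM notebooks")).scalar()
    print(f"{count} notebooks in the database")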
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
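The analysis notebooks are listed in the repository; purely as an illustration (the file names below are placeholders), they can be run non-interactively in a fixed order with nbconvert:

import subprocess

# Placeholder file names; substitute the analysis notebooks in the order given in the repository.
notebooks = ["n1.analysis.ipynb", "n2.analysis.ipynb"]

for nb in notebooks:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )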
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave blank
export JUP_WITH_EXECUTION="1"; # whether to execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # whether to run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
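For illustration only, a script could read this configuration from the environment along these lines (the helper function is hypothetical, not part of the provided code):

import os

def jup_config(name, default=None, cast=str):
    # Hypothetical helper: read one of the JUP_* variables set above and cast it.
    value = os.environ.get(name, default)
    return cast(value) if value is not None else None

base_dir = jup_config("JUP_BASE_DIR")
max_size_gb = jup_config("JUP_MAX_SIZE", "8000.0", float)
with_execution = jup_config("JUP_WITH_EXECUTION", "1", int) == 1
print(base_dir, max_size_gb, with_execution)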
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
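For example, a notification could be sent through yagmail roughly as follows (a sketch based on the variables above; subject and message are placeholders):

import os
import yagmail

# Connect using the Gmail address and OAuth2 credentials configured above.
yag = yagmail.SMTP(os.environ["JUP_EMAIL_LOGIN"], oauth2_file=os.environ["JUP_OAUTH_FILE"])
yag.send(
    to=os.environ["JUP_EMAIL_TO"],
    subject="jupyter crawler: status",  # placeholder subject
    contents="Crawling finished on machine " + os.environ["JUP_MACHINE"],
)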
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the repositories (the base directory); the second one should unmount it. You can leave the scripts blank, but it is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.7
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for the study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
Data Collection and Analysis
We used the code for analyzing the reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019 and adapted code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
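A minimal sketch of such a PMC query with Biopython's Entrez module (the contact e-mail and retmax are placeholders):

from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact address

# Search PubMed Central for articles mentioning Jupyter notebooks and GitHub.
handle = Entrez.esearch(db="pmc", term="(ipynb OR jupyter OR ipython) AND github", retmax=100)
record = Entrez.read(handle)
handle.close()
print(record["Count"], record["IdList"][:5])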
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## "Spatial scales of kinetic energy in the Arctic Ocean"
Datasets are available for each figure (1-9) and Figure 10 in the main text, including Jupyter notebook scripts (Figs 1, 2, 5, 10) and MATLAB scripts (Figs 3, 4, 6, 7, 8, 9).
## Description
This dataset is the supplement to the manuscript "Spatial scales of kinetic energy in the Arctic Ocean", including the Jupyter notebook and MATLAB scripts used directly for the visualization of Figures 1-9.
1) Jupyter notebook scripts for visualization
The MESH and BG files are used for visualization, and the *.mat files are the datasets for Figs 1/2/5/10 (see the loading sketch after this list). The load paths in the scripts should be changed to your own file locations accordingly.
2) Matlab scripts for plots
All figures/panels are produced directly, except that Figs 7/8/9 are additionally composed from individual panels.
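A minimal sketch for loading one of the *.mat files in Python (the file name is a placeholder; adjust the path as noted above):

from scipy.io import loadmat

# Placeholder file name; point this at one of the provided *.mat files.
data = loadmat("Fig1_data.mat")

# List the variables stored in the file (keys starting with "__" are MATLAB metadata).
print([key for key in data.keys() if not key.startswith("__")])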
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Maniple This repository contains code, scripts and data necessary to reproduce the paper "The Fact Selection Problem in LLM-Based Program Repair".
Installation Before installing the project, ensure you have the following prerequisites installed on your system:
Follow these steps to install and set up the project on your local machine:
cd maniple
python3 -m pip install .
Structure of Directories The project is organized into several directories, each serving a specific purpose:
data/                        # Training and testing datasets
  BGP32.zip/                 # Sampled 32 bugs from the BugsInPy dataset
    black/                   # The bug project folder
      10/                    # The bug ID folder
        100000001/           # The bitvector used for prompting
          prompt.md          # The prompt used for this bitvector
          response_1.md      # The response from the model
          response_1.json    # The response in JSON format
          response_1.patch   # The response in patch format
          result_1.json      # Testing result
          ...
  BGP32-without-cot.zip      # GPT responses for 32 bugs without CoT prompting
  BGP314.zip                 # 314 bugs from the BugsInPy dataset
  BGP157Ply1-llama3-70b.zip  # Experiment with the llama3 model on the BGP157Ply1 dataset
  BGP32-permutation.zip      # Permutation experiment on the BGP32 dataset

maniple/          # Scripts for getting facts and generating prompts
  strata_based/   # Scripts for generating prompts
  utils/          # Utility functions
  metrics/        # Scripts for calculating metrics for the dataset

patch_correctness_labelling.xlsx  # The labelling of patch correctness
experiment.ipynb                  # Jupyter notebook for training models

experiment-initialization-resources/  # Contains raw facts for each bug
  bug-data/                           # Raw facts for each bug
    ansible/                          # Bug project folder
      5/                              # Bug ID folder
        bug-info.json                 # Metadata for the bug
        facts_in_prompt.json          # Facts used in the prompt
        processed_facts.json          # Processed facts
        external_facts.json           # GitHub issues for this bug
        static-dynamic-facts.json     # Static and dynamic facts
        ...
  datasets-list/                      # Subsets of the BugsInPy dataset
  strata-bitvector/                   # Debugging information for bitvectors
Steps to Reproduce the Experiments Please follow the steps below sequentially to reproduce the experiments on 314 bugs in BugsInPy with our bitvector based prompt
Prepare the Dataset
The CLI scripts under the maniple directory provide useful commands to download and prepare environments for each bug. To download and prepare the environments, you can use the prep command.
maniple prep --dataset 314-dataset
This script will automatically download all 314 bugs from GitHub, create a virtual environment for the bug and install the necessary dependencies.
Fact Extraction
Then you can extract facts from the bug data using the extract command as follows:
maniple extract --dataset 314-dataset --output-dir data/BGP314
This script will extract facts from the bug data and save them in the specified output directory.
You can find all extracted facts under the experiment-initialization-resources/bug-data directory.
Generate Bitvector-Specific Prompts and Responses. First, you need to generate bitvectors for the facts. The 128 bitvectors for our paper can be generated with the following command.
python3 -m maniple.strata_based.fact_bitvector_generator
You can customize your bitvectors; they should be put under the experiment-initialization-resources/strata-bitvectors directory. You can refer to the example bitvector format used for our paper.
To reproduce our experiment prompts and responses, please use the commands below, replacing the value with your own OpenAI API key.
export OPENAI_API_KEY=
setx OPENAI_API_KEY
python3 -m maniple.strata_based.prompt_generator --database BGP314 --partition 10 --start_index 1 --trial 15
Again, you can build your own customized prompts with customized bitvectors using our extracted facts. The command above is only for reproducing our prompts and responses.
This script will generate prompts and responses for all 314 bugs in the dataset by enumerating all possible bitvectors according to the current strata design specified in maniple/strata_based/fact_strata_table.json. By specifying --trial 15, the script will generate 15 responses for each prompt, and by specifying --partition 10, it will start 10 threads to speed up the process.
Testing Generated Patches. Please use the following command:
maniple validate --output-dir data/BGP314
This script will validate the generated patches for the specified bug and save the results in the specified output directory. The test comes from the developer's fix commit.
Contributing Contributions to this project are welcome! Please submit a PR if you find any bugs or have any suggestions.
License. This project is licensed under the MIT License - see the LICENSE file for details.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I've been recently exploring Microsoft Azure and have been playing this game for the past 4 or so years. I am also a software developer by profession. I did a simple pipeline that gets data from the official Clash Royale API using (Python) Jupyter Notebooks and Azure VMs. I tried searching for public Clash Royale datasets, but the ones I saw don't quite have that much data from my perspective, so I decided to create one for the whole community.
I started pulling in the data at the beginning of the month of December until season 18 ended. This covers the season reset last December 07, and the latest balance changes last December 09. This dataset also contains ladder data for the new Legendary card Mother Witch.
The amount of data I have, with the latest dataset, has ballooned to around 37.9 M distinct/unique ladder matches that were (pseudo)randomly pulled from a pool of 300k+ clans. If you think that this is A LOT, it could still be only a percent of a percent (or even less) of the real amount of ladder battle data. It may not reflect the whole population; also, the majority of my data are matches between players with 4000 trophies or more.
I don't see any reason not to share this with the public, as the data is now large enough that working on it and producing insights takes more than just a few hours of "hobby" time.
Feel free to use it on your own research and analysis, but don't forget to credit me.
Also, please don't monetize this dataset.
Stay safe. Stay healthy.
Happy holidays!
Card IDs Master List is in the discussion. I also created a simple notebook to load the data and made a sample of n=20 rows, so you can get an idea of what the fields are.
With this data, the following can possibly be answered:
1. Which cards are the strongest? The weakest?
2. Which win-con is the most winning?
3. Which cards are always with a specific win-con?
4. When 2 opposing players are using maxed decks, which win-con is the most winning?
5. Most widely used cards? Win-cons?
6. What are the different metas in different arenas and trophy ranges?
7. Is the ladder matchmaking algorithm rigged? (MOST CONTROVERSIAL)
(and many more)
I have 2 VMs running a total of 14 processes, and for each of these processes I've divided a pool of 300k+ clans into the same number of groups. This ran 24/7, non-stop, for the whole season. Each process randomizes the list of clans it is assigned to, iterates through each clan, and gets that clan's members' ladder data. It is important to note that I also have a pool of 470 hand-picked clans that I always get data from, as these clans were the starting point that eventually enabled me to get the 300k+ clans. Some clans have minimal ladder data; some have A LOT.
To prevent out-of-memory exceptions, as my VMs are not really that powerful (I'm using Azure free credits), I've put a time limit and a cap on the number of battles extracted per member.
My account: https://royaleapi.com/player/89L2CLRP My clan: https://royaleapi.com/clan/J898GQ
Thank you to SUPERCELL for creating this FREEMIUM game that has tested countless people's patience, as well as the durability of countless mobile devices after being smashed against a wall, and thrown on the floor.
Thank you to Microsoft for Azure and free monthly credits
Thank you to Python and Jupyter notebooks.
Thank you Kaggle for hosting this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tar file contains the docker image for building the ARcode model and baseline models for application recognition for the SC22 paper with the same title. The files/folders in this image contain:
notebooks: The notebooks for models and experiment results.
-- ARcode.ipynb: The interactive Jupyter Notebook for the ARcode model.
-- ARcode_unknown.ipynb: The interactive Jupyter Notebook for the ARcode model for detecting unknown applications.
-- ARcode_partial.ipynb: The interactive Jupyter Notebook for the ARcode model on partial job signatures.
-- ARcode_channel.ipynb: The interactive Jupyter Notebook for the ARcode model on one channel of job signatures.
-- baselines.ipynb: The interactive Jupyter Notebook for the baseline models. These models are Random Forest, LinearSVC and SVC; all of them are implemented through Taxonomist (https://doi.org/10.6084/m9.figshare.6384248.v1).
-- baselines_unknown.ipynb: The interactive Jupyter Notebook for the baseline models for detecting unknown applications.
dataset: The dataset for training the models mentioned above.
-- ARcode_labels.npy: A numpy array of the signatures' labels.
-- ARcode_signatures.npy: A numpy array of the generated signatures.
-- baseline_labels.npy: A numpy array of the labels for the baseline dataset.
-- baseline_features.npy: A numpy array of the statistic features generated from the raw monitoring data.
-- knl_app_code.json: Mapping of IDs to application names. This mapping is used when creating the dataset.
models: The saved models.
-- arcode.h5: An HDF5 file containing the serialized weights for the ARcode model.
-- arcode.json: A JSON file describing the ARcode model.
results: The saved experiment results.
Follow these steps to start Jupyter Notebook in the image:
1. Load the image into Docker on your local machine: docker load < archive-arcode.tar
2. Start the Jupyter notebook in the docker image: docker run --init --user root -p 8888:8888 artlands/arcode
3. Copy the URL shown in your terminal and paste it in a browser: http://127.0.0.1:8888/?token=your_token
Acknowledgement: This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231
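Outside the docker image, the dataset and saved model can presumably be inspected along these lines, assuming the arcode.json/arcode.h5 pair follows the usual Keras JSON-plus-HDF5 convention suggested by the file descriptions above:

import numpy as np
from tensorflow.keras.models import model_from_json

# Load the generated signatures and their labels.
signatures = np.load("dataset/ARcode_signatures.npy")
labels = np.load("dataset/ARcode_labels.npy")
print(signatures.shape, labels.shape)

# Rebuild the ARcode model from its JSON description and saved weights
# (assuming the standard Keras serialization format).
with open("models/arcode.json") as f:
    model = model_from_json(f.read())
model.load_weights("models/arcode.h5")
model.summary()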
The BBC Hindi News Articles Dataset offers a comprehensive collection of news articles gathered through Python web scraping. This dataset features articles from various categories, providing a broad spectrum of content for analysis. Each entry in the dataset includes three key data points:
Headline: The title of the news article.
Content: The full text of the article.
Category: The category to which the article belongs.
Ideal for natural language processing (NLP) tasks, sentiment analysis, and language modeling, this dataset provides a rich resource for understanding and exploring Hindi news media.
I could not find datasets under a Creative Commons license, so I thought of scraping it myself and making it available on Kaggle!
Please use it freely and just put up credit for the dataset. Upvote would be really appreciated :)
I have also uploaded my jupyter notebook for web scraping on GitHub if you want to check that out: https://github.com/AadiSrivastava05/BBC-Hindi-News-Dataset-with-web-scraping-script
Original Data Source: BBC Hindi News Articles Dataset - Detailed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🔹 Release v1.0 - Duffing Oscillator Response Analysis (DORA)
This release provides a collection of benchmark tasks and datasets, accompanied by minimal code to generate, import, and plot the data. The primary focus is on the Duffing Oscillator Response Analysis (DORA) prediction task, which evaluates machine learning models' ability to generalize system responses in unseen parameter regimes.
🚀 Key Features:
Duffing Oscillator Response Analysis (DORA) Prediction Task:
Objective: Predict the response of a forced Duffing oscillator using a minimal training dataset. This task assesses a model's capability to extrapolate system behavior in unseen parameter regimes, specifically varying amplitudes of external periodic forcing.
Expectation: A proficient model should qualitatively capture the system's response, such as identifying the exact number of cycles in a limit-cycle regime or chaotic trajectories when the system transitions to a chaotic regime, all trained on limited datasets.
Comprehensive Dataset:
Training Data (DORA_Train.csv): Contains data for two external forcing amplitudes, f ∈ [0.46, 0.49].
Testing Data (DORA_Test.csv): Includes data for five forcing amplitudes, f ∈ [0.2, 0.35, 0.48, 0.58, 0.75].
📊 Data Description:
Each dataset comprises five columns:
t: Time variable
q1(t): Time evolution of the Duffing oscillator's position
q2(t): Time evolution of the Duffing oscillator's velocity
f(t): Time evolution of external periodic forcing
f_amplitude: Constant amplitude during system evaluation (default: 250)
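A minimal sketch for loading and plotting the training data with pandas and matplotlib, assuming the CSV headers match the column names above:

import matplotlib.pyplot as plt
import pandas as pd

# Load the training data with the five columns described above.
train = pd.read_csv("DORA_Train.csv")
print(train.columns.tolist())

# Plot position and velocity of the Duffing oscillator over time.
fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(train["t"], train["q1(t)"])
axes[0].set_ylabel("q1(t)")
axes[1].plot(train["t"], train["q2(t)"])
axes[1].set_ylabel("q2(t)")
axes[1].set_xlabel("t")
plt.show()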
🛠 Utility Scripts and Notebooks:
Data Generation and Visualization:
DORA_generator.py: Generates, plots, and saves training and testing data. Usage:
python DORA_generator.py -time 250 -plots 1
DORA.ipynb: A Jupyter Notebook for dataset generation, loading, and plotting.
Data Loading and Plotting:
ReadData.py: Loads and plots the provided datasets (DORA_Train.csv and DORA_Test.csv).
📈 Model Evaluation:
The prediction model's success is determined by its ability to extrapolate system behavior outside the training data. System response characteristics for external forcing are quantified in terms of the amplitude and mean of q1²(t). These can be obtained using the provided Signal_Characteristic function.
🔹 Performance Metrics:
Response Amplitude Error: MSE[max(q1_prediction²(t > t*)), max(q1_original²(t > t*))]
Response Mean Error: MSE[Mean(q1_prediction²(t > t*)), Mean(q1_original²(t > t*))]
Note: t* = 20 s denotes the steady-state time.
📌 Reference Implementation:
An exemplar solution using reservoir computing is detailed in the following:📖 Yadav et al., 2025 – Springer Nonlinear Dynamics
📄 Citation:
If you utilize this dataset or code in your research, please cite:
@article{Yadav2024,
  author  = {Manish Yadav and Swati Chauhan and Manish Dev Shrimali and Merten Stender},
  title   = {Predicting multi-parametric dynamics of an externally forced oscillator using reservoir computing and minimal data},
  journal = {Nonlinear Dynamics},
  year    = {2024},
  doi     = {10.1007/s11071-024-10720-w}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.
The repository contains the following files:
main.ipynb: Jupyter notebook with the main data analysis workflow.
energy.py: Methods for the calculation of energy release rates.
regression.py: Methods for the regression analyses.
visualization.py: Methods for generating visualizations.
df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
df_legacy.pkl: Pickled DataFrame with literature data.
Dependencies: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac. Install them with pip install -r requirements.txt and open the main.ipynb notebook in Jupyter Notebook or JupyterLab.
The experimental measurements and corresponding parameters are provided in df_mmft.pkl and df_legacy.pkl. Below are the descriptions for each column in these DataFrames:
df_mmft.pkl
exp_id: Unique identifier for each experiment.
datestring: Date of the experiment as a string.
datetime: Timestamp of the experiment.
bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
slope_incl: Inclination of the slope in degrees.
h_sledge_top: Distance from the sample top surface to the sled in mm.
h_wl_top: Distance from the sample top surface to the weak layer in mm.
h_wl_notch: Distance from the notch root to the weak layer in mm.
rc_right: Critical cut length in mm, measured on the front side of the sample.
rc_left: Critical cut length in mm, measured on the back side of the sample.
rc: Mean of rc_right and rc_left.
densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
densities_mean: Daily mean of densities.
layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
layers_mean: Daily mean of layers.
surface_lineload: Surface line load of added surface weights in N/mm.
wl_thickness: Weak-layer thickness in mm.
notes: Additional notes regarding the experiment or observations.
L: Length of the slab–weak-layer assembly in mm.
df_legacy.pkl
#: Record number.
rc: Critical cut length in mm.
slope_incl: Inclination of the slope in degrees.
h: Slab height in mm.
density: Mean slab density in kg/m^3.
L: Length of the slab–weak-layer assembly in mm.
collapse_height: Weak-layer height reduction through collapse.
layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
wl_thickness: Weak-layer thickness in mm.
surface_lineload: Surface line load from added weights in N/mm.
For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
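For example, the two DataFrames can be loaded and inspected as follows (a sketch using the column names documented above):

import pandas as pd

# Load the experimental data of the present work and the literature data.
df_mmft = pd.read_pickle("df_mmft.pkl")
df_legacy = pd.read_pickle("df_legacy.pkl")

# Inspect critical cut lengths and slope inclinations.
print(df_mmft[["exp_id", "slope_incl", "rc"]].head())
print(df_legacy[["rc", "slope_incl", "density"]].describe())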
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, Deep Learning, OpenCV, Machine Learning, and Artificial Intelligence. It can detect an accident from a live camera feed, image, or video. This model is trained on a dataset of 3200+ images; these images were annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains 4976 planetary images of boulder fields located on Earth, Mars and the Moon. The data were collected during the BOULDERING Marie Skłodowska-Curie Global Fellowship between October 2021 and 2024. The data are already split into train, validation and test datasets, but feel free to re-organize the labels at your convenience.
For each image, all of the boulder outlines within the image were carefully mapped in QGIS. More information about the labelling procedure can be found in the following manuscript (https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2023JE008013). This dataset differs from the previous dataset included along with the manuscript (https://zenodo.org/records/8171052), as it contains more mapped images, especially of boulder populations around young impact structures on the Moon (cold spots). In addition, the boulder outlines were pre-processed so that they can be ingested directly into YOLOv8.
A description of what is what is given in the README.txt file (in addition to how to load the custom datasets in Detectron2 and YOLO). Most of the other files are self-explanatory. Please see the previous dataset or the manuscript for more information. If you want more information about specific lunar and martian planetary images, the IDs of the images are still available in the file names. Use this ID to find more information (e.g., for M121118602_00875_image.png, the ID M121118602 can be used on https://pilot.wr.usgs.gov/). I will also upload the raw data from which this pre-processed dataset was generated (see https://zenodo.org/records/14250970).
Thanks to this database, you can easily train Detectron2 Mask R-CNN or YOLO instance segmentation models to automatically detect boulders.
How to cite:
Please refer to the "how to cite" section of the readme file of https://github.com/astroNils/YOLOv8-BeyondEarth.
Structure:
.
└── boulder2024/
    ├── jupyter-notebooks/
    │   └── REGISTERING_BOULDER_DATASET_IN_DETECTRON2.ipynb
    ├── test/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── train/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── validation/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── detectron2_inst_seg_boulder_dataset.json
    ├── README.txt
    └── yolo_inst_seg_boulder_dataset.yaml
detectron2_inst_seg_boulder_dataset.json is a JSON file containing the masks as expected by Detectron2 (see https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html for more information on the format). In order to use this custom dataset, you need to register it before using it in training. There is an example of how to do that in the jupyter-notebooks folder. You need to have detectron2 and all of its dependencies installed.
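The provided notebook shows the exact procedure; purely as an illustrative sketch (the dataset name and the assumption that the JSON already holds a list of Detectron2 standard dataset dicts are mine), registration boils down to something like:

import json
from detectron2.data import DatasetCatalog, MetadataCatalog

def load_boulder_dicts():
    # Assumes the JSON already holds a list of Detectron2 "standard dataset dicts".
    with open("detectron2_inst_seg_boulder_dataset.json") as f:
        return json.load(f)

# "boulder_train" is an arbitrary name chosen for this sketch.
DatasetCatalog.register("boulder_train", load_boulder_dicts)
MetadataCatalog.get("boulder_train").set(thing_classes=["boulder"])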
yolo_inst_seg_boulder_dataset.yaml can be used as it is; however, you need to update the paths in the .yaml file to the test, train and validation folders. More information about the YOLO format can be found here (https://docs.ultralytics.com/datasets/segment/).
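For example, a YOLOv8 instance segmentation model could then be trained roughly like this (the checkpoint name and hyperparameters are placeholders):

from ultralytics import YOLO

# Start from a pretrained YOLOv8 segmentation checkpoint (placeholder choice).
model = YOLO("yolov8n-seg.pt")
model.train(data="yolo_inst_seg_boulder_dataset.yaml", epochs=100, imgsz=640)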
https://choosealicense.com/licenses/cc0-1.0/
🚀 Hugging Face Uploader: Streamline Your Model Sharing! 🚀
This tool provides a user-friendly way to upload files directly to your Hugging Face repositories. Whether you prefer the interactive environment of a Jupyter Notebook or the command-line efficiency of a Python script, we've got you covered. We've designed it to streamline your workflow and make sharing your models, datasets, and spaces easier than ever before! Will be more consistently updated here:… See the full description on the dataset page: https://huggingface.co/datasets/EarthnDusk/Huggingface_Uploader.
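Under the hood, uploads like this rely on the huggingface_hub client; a minimal stand-alone sketch (repository id and file names are placeholders, and this is not the tool's own code):

from huggingface_hub import HfApi

api = HfApi()  # expects a token from `huggingface-cli login` or the HF_TOKEN environment variable

# Placeholder repository and file names.
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="your-username/your-model",
    repo_type="model",
)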
https://creativecommons.org/publicdomain/zero/1.0/
TRANSJAKARTA - Public Transportation - Transaction Data
When data analysts want to build a framework for analysis, they should not have to wait for real transactions to accumulate over time. They could create dummy data to test whether the framework or the data structure already meets the requirements for deep analytics. Here I tried to simulate transaction data for Transjakarta, as I found none publicly shared on the Internet. I hope you can practice with this simulated data and make it more meaningful, as the master data behind it are real (but the transactions are dummy).
The master data are sourced from: https://ppid.transjakarta.co.id/pusat-data/data-terbuka/transjakarta-gtfs-feed. The data were generated in Python using Faker and Random, based on the master data. The source might be updated from time to time, and this dataset might not represent the latest version of the source.
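A tiny sketch of how such dummy tap-in records can be generated with Faker and random (the field names are illustrative only, not the dataset's actual schema):

import random
from faker import Faker

fake = Faker("id_ID")  # Indonesian locale for plausible names

# Illustrative fields only; the real dataset's schema may differ.
routes = ["1", "2", "3A", "9", "13"]
transactions = []
for _ in range(5):
    transactions.append({
        "card_id": fake.uuid4(),
        "customer_name": fake.name(),
        "route": random.choice(routes),
        "tap_in_time": fake.date_time_between(start_date="-30d", end_date="now"),
        "fare": random.choice([0, 3500]),
    })
print(transactions[0])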
Context: Transjakarta is a public transportation company from Indonesia, based in Jakarta. The transportation modes are big buses (BRT), medium and big buses (non-BRT), and mini buses (Mikrotrans). The mechanism in Transjakarta is to tap in and tap out using a payment card as your ticket.
Content: Basically, this data is a simulation of transaction data in Transjakarta. It does not represent the real data or structure used by Transjakarta.
Inspiration: Transjakarta is growing as a public transportation company, but no one has shared data for transaction analysis. With this data we can analyze which routes are busy and which are not, which routes are heavy with traffic jams, and other dimensions provided in the data.
If you'd like to see how I created this dataset, you can peek at the process in my GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset merges two public datasets:
- Speedtest network performance data for Australia (Q3 2020), as loaded to the AURIN geo-analysis platform (approx. 88,000 locations).
- NBN mapping of technology data (including FTTN, FTTP, FTTC, FTTB, HFC, Wireless, Satellite); a complete map of Australia, colour-coded by technology, in WMS or KML format. The KML format is used in this instance.
The result is an intersection dataset (319337 rows × 26 columns), including LocID, download speed, NBN technology, lat, lon, SA2, SA3, SA4 (see ABS link below). Note that many technologies can fall in one Speedtest block (600 m^2), so these have to be untangled. A ten-line sample (CSV) is included; see the image QGIS.png as provided by AURIN.
Versions:
v2. Load of the updated Jupyter Notebook v1.1 and a locations CSV, which shows the location breakdown by NBN technology (pivot table export).
v1. Initial load, including the Jupyter Notebook and a human-readable geojson.
METHOD (advice from AURIN): In order to join these two maps, you will need to perform a spatial join based on the two layers. It is possible to do this with geopandas.sjoin(), which by default performs an intersection join - that is, any portion of a matching polygon from the second layer is considered a match to join on. More information about spatial predicates is available in case you're looking for a different spatial relationship. The supplied notebook (OoklaNBN-AURIN.ipynb) collects the datasets from the AURIN API and data.gov.au, combines the several KML NBN layers into one, and joins them with the Ookla 2020 Q3 dataset. In order to use it, you will first need to input your AURIN API credentials into the first cell. The spatial join occurs in the final notebook cell and writes its output to a geopackage (OoklaNBN.gpkg), which is also included, as the script can take some time to run. You will notice that each Ookla cell may now be represented by many records; this is due to there being more than one overlapping NBN technology polygon. As one Ookla grid can cover many technology zones, aggregating these may be useful depending on how you approach your analysis.
The authors acknowledge the facilities and scientific and technical assistance of the NCRIS-enabled Australian Urban Research Infrastructure Network (AURIN). Thanks to Evan Thomas, AURIN (ORCiD: https://orcid.org/0000-0001-7564-4116).
Preliminary analysis:
1. Count by technology type:
Fibre to the Basement (vectored or non-vectored): 14147
Satellite: 15946
Fixed Wireless: 23663
Fibre to the Curb: 36843
Hybrid Fibre Coaxial (HFC): 41175
Fibre to the Premises: 93732
Fibre to the Node: 93831
2. Mean download speed by technology type (Mbps, Australia-wide):
Satellite: 48.1674
Fixed Wireless: 49.1477
Fibre to the Node: 75.1639
Fibre to the Premises: 83.71
Fibre to the Curb: 86.8433
Hybrid Fibre Coaxial (HFC): 87.1248
Fibre to the Basement (vectored or non-vectored): 116.324
Licence: The Speedtest licence (as per the AWS data licence) is CC BY-NC-SA 4.0, so use of this data must be non-commercial (NC) and reuse must be share-alike (SA, i.e. under the same licence). This restricts the standard CC BY Figshare licence.
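The core of the method quoted above is the spatial join; a minimal sketch (the input file names are placeholders for the Ookla tiles and the merged NBN technology layer):

import geopandas as gpd

# Placeholder inputs: Ookla speed-test tiles and the merged NBN technology polygons.
ookla = gpd.read_file("ookla_q3_2020.geojson")
nbn = gpd.read_file("nbn_technology.geojson").to_crs(ookla.crs)

# Intersection join: every NBN polygon overlapping an Ookla tile produces a record.
joined = gpd.sjoin(ookla, nbn, how="inner", predicate="intersects")
print(joined.shape)
joined.to_file("OoklaNBN_sketch.gpkg", driver="GPKG")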
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of geomechanical simulations conducted on a faulted aquifer under conditions of CO2 injection. The primary focus of the simulations is the pressure evolution within the rock matrix and along the fault, as well as the associated changes in the mechanical state, including rock deformation and fault slip. Additionally, the simulations explore the sensitivity of fault stability under varying orientations of far-field stress.
The dataset includes raw data in VTK format, as well as an illustrative Jupyter notebook that provides a comprehensive explanation of the problem's geometry, boundary and initial conditions, and an interpretation of the observed physical phenomena. The Jupyter notebook is designed to be run both online and locally.
These simulations were performed using an open-source FEM-based geomechanical simulator. Detailed instructions for running the notebook, along with a link to the geomechanical simulator, are provided in the description below.
An interactive notebook showcasing visualisations of the dataset is available on RenkuLab.
Alternatively, you can launch the notebook on your computer. Download the dataset, install dependencies, and launch Jupyter notebook:
pip install -r requirements_freeze.txt
jupyter notebook
Then, open notebooks/DataVisualisation.ipynb.
To recreate the results found in this dataset, install the solver and go through the example at examples/injection_fault.
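To inspect the raw VTK files outside the notebook, something along these lines could be used (pyvista is not required by the dataset and the file name is a placeholder):

import pyvista as pv

# Placeholder file name: point this at one of the VTK files in the dataset.
mesh = pv.read("output/solution_0001.vtu")
print(mesh.array_names)  # available fields, e.g. pressure or displacement
mesh.plot(scalars=mesh.array_names[0], show_edges=True)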
The repository has the following structure:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose These are a collection of supplementary files that are to be included in my dissertation. They include but are not limited to small IPython notebooks, extra figures, data-sets that are too large to publish in the main document such as full ortholog lists and other primary data.
Viewing IPython notebooks (ipynb files) To view an IPython notebook, "right-click" its download link and select "Copy link address". Then navigate to the free notebook viewer by following this link: http://nbviewer.ipython.org/. Finally, paste the link to the ipynb file that you copied into the URL form on the nbviewer page and click "Go".
This is the source code package for the labbench python module, version 0.20, which is its first public release. The purpose of labbench is to streamline and organize complicated laboratory automation tasks that involve large-scale benchtop automation, concurrency, and/or data management. It is built around a system of wrappers that facilitate robust, concise exception handling, type checking, API conventions, and synchronized device connection through python context blocks. The wrappers also provide convenient new functionality, such as support for automated status displays in jupyter notebooks, simplified threaded concurrency, and automated, type-safe logging to relational databases. Together, these features help to minimize the amount of "copy-and-paste" code that can make your lab automation scripts error-prone and difficult to maintain. The python code that results can be clear, concise, reusable and maintainable, and provide consistent formatting for stored data. The result helps researchers to meet NIST's open data obligations, even for complicated, large, and heterogeneous datasets.
Several past and ongoing projects in the NIST Communication Technology Laboratory (CTL) published data that were acquired by automation in labbench. We release it here both for transparency and to invite public use and feedback. Ongoing updates to this source code will be maintained on the NIST github page at https://github.com/usnistgov/labbench. The code was developed in python, documented with the python sphinx package and markdown, and shared through the USNISTGOV organization on GitHub.
INSTALLATION
labbench can run on any computer that supports python 3.6. The hardware requirements are discussed here: https://docs.anaconda.com/anaconda/install/#requirements
1. Install your favorite distribution of a python version 3.6 or greater
2. In a command prompt: pip install git+https://gitlab.nist.gov/gitlab/ssm/labbench
3. (Optional) Install an NI VISA [1] runtime, for example this one for windows.
USAGE
The source distribution contains detailed information, including:
* README.md - documentation to get started using labbench
* LICENSE.md - license and redistribution information
* doc/labbench-api.pdf - complete listing of the module and documentation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics, for both P-wave-centered and S-wave-centered windows, for all components on all stations, comparing in each case waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type; a loading sketch follows this list.
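A minimal loading sketch for these files (the ASDF file name is a placeholder; pyasdf is one possible reader for the .h5 files):

import pandas as pd
import pyasdf  # only needed for the ASDF waveform files

# Load one of the pickled dataframes described above.
misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(misfits.head())

# Open one of the ASDF waveform files (placeholder file name).
with pyasdf.ASDFDataSet("waveforms_reference.h5", mode="r") as ds:
    print(ds.waveforms.list())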
How to use this dataset:
To set up the conda environment:
make sure you have anaconda/miniconda
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
results
directory, following the structureresults/{DATE}_{TIME}-{INSTANCE}-{REGION}/results/exp0_250000_9_generic_throughput_{IDX}.csv
{DATE}
is the date of the execution in the format YYYY-MM-DD
,{TIME}
is the time of the execution in the format HH-MM-SS
,{INSTANCE}
is the instance type used for the execution (m6i
or m6g
),{REGION}
is the AWS region used for the execution (useast1
or eucentral1
),{IDX}
is the number of the repetition of an execution (1
-3
).timestamp
in epoch seconds,value
the measured throughput in records per second as obtained with the ad-hoc throughput metric of ShuffleBench.results/{DATE}_{TIME}-{INSTANCE}-{REGION}
also contains a theodolite.log
file that contains the logs of the Theodolite benchmarking tool and the logged configuration of each execution in results
. Although we do not expect them to provide additional insights (since the purpose of our study was to repeatedly execute the same benchmark), we refer to the documentation of Theodolite for further details.results-analysis.ipynb
following these steps:python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook
The periodic-executor directory contains scripts and configuration files used to automatically execute ShuffleBench. As ShuffleBench relies on the Theodolite benchmarking framework for executing benchmarks within Kubernetes, the code here is mostly for setting up a Kubernetes cluster, installing Theodolite, configuring the benchmark executions, and collecting the benchmark results. The container image can be built and pushed with:
docker build -t $ECR_REPOSITORY/$IMAGE_NAME .
docker push $ECR_REPOSITORY/$IMAGE_NAME
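Pushing to ECR usually requires authenticating Docker against the registry first; a sketch, assuming the AWS CLI is configured (the region value is an example; adapt it and $ECR_REPOSITORY to your setup):
# Log Docker in to the ECR registry before pushing (region value is an example)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $ECR_REPOSITORY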
shufflebench-periodic-schedule-results has to be created.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks on this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute the python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the notebook extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the repositories; the second one should unmount it. You can leave the scripts blank, but this is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data. A minimal sketch of both scripts is given below.
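A possible minimal version of the two scripts, assuming the repositories live on a dedicated device mounted at the JUP_BASE_DIR location (the device name and mount point below are placeholders for your own setup):
# mount_ghstudy.sh (sketch): mount the storage that holds the cloned repositories
sudo mount /dev/sdb1 /mnt/jupyter
# umount_ghstudy.sh (sketch): unmount it again
sudo umount /mnt/jupyter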
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option). A loop-based sketch follows right below; the detailed per-version instructions come after it:
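As a compact alternative for the five plain conda environments, a loop along these lines can be used (a sketch only: it assumes a bash shell and a conda version that provides conda run, and the Anaconda variants as well as the Python 3.4 and 3.5 quirks described below still require the manual steps):
# Create the five "raw" conda environments and install the local archaeology package in each
for v in 2.7 3.4 3.5 3.6 3.7; do
  name="raw${v//./}"                      # e.g. 2.7 -> raw27
  conda create -n "$name" python="$v" -y
  conda run -n "$name" pip install --upgrade pip pipenv
  conda run -n "$name" pip install -e jupyter_reproducibility/archaeology
done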
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda activate py35
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7