Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub. Based on the results, we proposed and evaluated Julynter, a linting tool for Jupyter Notebooks.
Papers:
This repository contains three files: db2020-09-22.dump.gz, sample.tar.gz, and julynter_reproducility.tar.gz.
Reproducing the Notebook Study
The db2020-09-22.dump.gz file contains a PostgreSQL dump of the database, with all the data we extracted from notebooks. For loading it, run:
gunzip -c db2020-09-22.dump.gz | psql jupyter
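If the jupyter database does not exist yet, create it before loading the dump. A minimal sketch, assuming a local PostgreSQL installation where your user may create databases:
createdb jupyter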
Note that this file contains only the database with the extracted data. The actual repositories are available in a Google Drive folder, which also contains the Docker images we used in the reproducibility study. The repositories are stored as content/{hash_dir1}/{hash_dir2}.tar.bz2, where hash_dir1 and hash_dir2 are columns of the repositories table in the database.
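As an illustration of that layout, here is a hedged sketch of locating and extracting the archive of one repository, assuming the Google Drive content folder was downloaded to the current directory and the database dump above has been restored:
psql jupyter -A -t -c "SELECT hash_dir1, hash_dir2 FROM repositories LIMIT 1;"
tar -xjf content/<hash_dir1>/<hash_dir2>.tar.bz2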
For scripts, notebooks, and detailed instructions on how to analyze or reproduce the data collection, please check the instructions in the Jupyter Archaeology repository (tag 1.0.0).
The sample.tar.gz file contains the repositories obtained during the manual sampling.
Reproducing the Julynter Experiment
The julynter_reproducility.tar.gz file contains all the data collected in the Julynter experiment and the analysis notebooks. Reproducing the analysis is straightforward; a sketch of the likely steps is given below.
The collected data is stored in the julynter/data folder.
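A minimal sketch of the extraction and analysis steps, assuming the archive unpacks to a julynter folder that contains the analysis notebooks next to the julynter/data folder (the exact layout inside the archive is an assumption, not stated here):
tar -xzf julynter_reproducility.tar.gz
cd julynter
jupyter notebook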
Changelog
2019/01/14 - Version 1 - Initial version
2019/01/22 - Version 2 - Update N8.Execution.ipynb to calculate the rate of failure for each reason
2019/03/13 - Version 3 - Update package for camera ready. Add columns to db to detect duplicates, change notebooks to consider them, and add N1.Skip.Notebook.ipynb and N11.Repository.With.Notebook.Restriction.ipynb.
2021/03/15 - Version 4 - Add Julynter experiment; update database dump to include new data collected for the second paper; remove scripts and analysis notebooks from this package (moved to GitHub); add a link to Google Drive with collected repository files
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks
This repository contains two files:
dump.tar.bz2
jupyter_reproducibility.tar.bz2
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.
In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Index.ipynb
N0.Repository.ipynb
N1.Skip.Notebook.ipynb
N2.Notebook.ipynb
N3.Cell.ipynb
N4.Features.ipynb
N5.Modules.ipynb
N6.AST.ipynb
N7.Name.ipynb
N8.Execution.ipynb
N9.Cell.Execution.Order.ipynb
N10.Markdown.ipynb
N11.Repository.With.Notebook.Restriction.ipynb
N12.To.Paper.ipynb
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeou...
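Before starting a long collection run, it can be useful to check that the connection string actually reaches the restored database. A minimal sketch, assuming psql is installed and the repositories table from the dump is present:
psql "$JUP_DB_CONNECTION" -c "SELECT count(*) FROM repositories;"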