Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
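Before launching the notebooks, you can sanity-check the connection string with a short SQLAlchemy snippet (this is not part of the original scripts; it assumes a PostgreSQL driver such as psycopg2 is installed in the environment):
import os
from sqlalchemy import create_engine, inspect

engine = create_engine(os.environ["JUP_DB_CONNECTION"])  # same string the notebooks use
tables = inspect(engine).get_table_names()               # tables restored from the dump
print(len(tables), "tables found in the restored database")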
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection string
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine; leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine; leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI, so make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
This environment requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
redspot replay
from redspot import database
from redspot.notebook import Notebook
nbk = Notebook()
for signal in database.get("path-to-db"):
    time, panel, kind, args = signal
    nbk.apply(kind, args)  # apply change
print(nbk)  # print notebook
redspot record
docker run --rm -it -p8888:8888
This resource includes two Jupyter Notebooks as a quick start tutorial for the ERA5 Data Component of the PyMT modeling framework (https://pymt.readthedocs.io/) developed by Community Surface Dynamics Modeling System (CSDMS https://csdms.colorado.edu/).
The bmi_era5 package is an implementation of the Basic Model Interface (BMI https://bmi.readthedocs.io/en/latest/) for the ERA5 dataset (https://confluence.ecmwf.int/display/CKB/ERA5). This package uses the cdsapi (https://cds.climate.copernicus.eu/api-how-to) to download the ERA5 dataset and wraps the dataset with BMI for data control and query (currently supports 3-dimensional ERA5 datasets). This package is not implemented for people to use directly but is the key element that helps convert the ERA5 dataset into a data component for the PyMT modeling framework.
The pymt_era5 package is implemented for people to use as a reusable, plug-and-play ERA5 data component for the PyMT modeling framework. This package uses the BMI implementation from the bmi_era5 package and allows the ERA5 datasets to be easily coupled with other datasets or models that expose a BMI.
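As a rough orientation, a BMI component like this is usually driven through the standard BMI calls. The sketch below is only an illustration based on the BMI specification linked above; the class name BmiEra5 and the file config_file.yaml are assumptions rather than details taken from this resource, and the bundled bmi_era5.ipynb notebook remains the authoritative example.
import numpy as np
from bmi_era5 import BmiEra5   # assumed class name

data = BmiEra5()
data.initialize("config_file.yaml")        # assumed config file; triggers the data request

var = data.get_output_var_names()[0]       # first variable exposed by the component
values = np.empty(data.get_grid_size(data.get_var_grid(var)),
                  dtype=data.get_var_type(var))
data.get_value(var, values)                # copy the current field into `values`
data.finalize()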
HydroShare users can test and run the Jupyter Notebooks (bmi_era5.ipynb, pymt_era5.ipynb) directly through the "CUAHSI JupyterHub" web app with the following steps:
- New users of the CUAHSI JupyterHub should first request to join the "CUAHSI Cloud Computing Group" (https://www.hydroshare.org/group/156). After approval, the user will gain access to launch the CUAHSI JupyterHub.
- Click on the "Open with" button (on the top right corner of the page).
- Select "CUAHSI JupyterHub".
- Select the "CSDMS Workbench" server option. (Make sure to select the right server option; otherwise, the notebook won't run correctly.)
If you have any questions or suggestions about the ERA5 data component, please create a GitHub issue at https://github.com/gantian127/bmi_era5/issues
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-01-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.6:
conda create -n py36 python=3.6
conda activate py36
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection string
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine; leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine; leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI, so make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
This environment requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Description: Ahoy, welcome to Kaggle! You're in the right place. This is the legendary Titanic ML competition: the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you're ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook's Titanic Tutorial that walks you through step by step how to make your first submission!
The Challenge: The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered "unsinkable" RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren't enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Recommended Tutorial: We highly recommend Alexis Cook's Titanic Tutorial, which walks you through making your very first submission step by step, and this starter notebook to get started.
How Kaggle's Competitions Work:
- Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
- Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs) and generate a prediction file.
- Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
- Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
- Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.
Kaggle Lingo Video: You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman's video on Kaggle Lingo to get up to speed!
What Data Will I Use in This Competition? In this competition, you'll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and, importantly, will reveal whether they survived or not, also known as the "ground truth".
The test.csv dataset contains similar information but does not disclose the "ground truth" for each passenger. It's your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the "Data" tab to explore the datasets even further. Once you feel you've created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.
How to Submit your Prediction to Kaggle: Once you're ready to make a submission and get on the leaderboard:
Click on the "Submit Predictions" button
Upload a CSV file in the submission file format. You're able to submit 10 submissions a day.
Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
PassengerId (sorted in any order)
Survived (contains your binary predictions: 1 for survived, 0 for deceased)
Got it! I'm ready to get started. Where do I get help if I need it? For Competition Help: Titanic Discussion Forum. Kaggle doesn't have a dedicated team to help troubleshoot your code, so you'll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
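As a quick illustration of the submission format described above (not an official starter), a gender-only baseline can be written with pandas; it assumes test.csv from the competition's Data tab is in the working directory:
import pandas as pd

test = pd.read_csv("test.csv")                           # 418 passengers, no Survived column

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),   # toy baseline: predict women survive
})
submission.to_csv("submission.csv", index=False)         # exactly 2 columns plus a header row
print(submission.shape)                                  # expect (418, 2)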
A Last Word on Kaggle Notebooks As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This HydroShare resource contains Jupyter Notebooks with instructions and code for accessing and subsetting the NOAA Analysis of Record for Calibration (AORC) dataset. The resource includes four Jupyter Notebooks: 1. AORC_LL_PointRetrieval.ipynb: Retrieves and aggregates data from the latitude-longitude gridded dataset for a specific point specified using geographic coordinates. 2. AORC_LL_ZoneRetrieval.ipynb: Retrieves and aggregates data from the latitude-longitude gridded dataset for an area defined by a polygon shapefile. 3. AORC_NWMProj_PointRetrieval.ipynb: Retrieves and aggregates data from the NWM projected dataset for a specific point specified using geographic coordinates. 4. AORC_NWMProj_ZoneRetrieval.ipynb: Retrieves and aggregates data from the NWM projected dataset for an area defined by a polygon shapefile.
These notebooks programmatically retrieve the data from Amazon Web Services, aggregate it at user-defined time scales (which may differ from NOAA's original time steps), and, in the case of shapefile-based data retrieval, compute the average over the shapes in the given shapefile.
The provided notebooks are coded to retrieve data from AORC version 1.1 in Zarr format, either from the latitude-longitude gridded dataset (https://registry.opendata.aws/noaa-nws-aorc/) or the NWM projected dataset (https://registry.opendata.aws/nwm-archive/).
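For orientation only, the Zarr stores can also be opened directly with xarray. The bucket path, variable name, and coordinate names below are assumptions (the notebooks in this resource contain the authoritative paths), and anonymous S3 access via s3fs is assumed:
import xarray as xr

store = "s3://noaa-nws-aorc-v1-1-1km/2020.zarr"   # assumed path for one year of the lat-lon data
ds = xr.open_zarr(store, storage_options={"anon": True}, consolidated=True)

# pull a single-point precipitation series and aggregate to daily totals
point = ds["APCP_surface"].sel(latitude=40.0, longitude=-105.0, method="nearest")
daily = point.resample(time="1D").sum()
print(daily)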
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification
For more datasets, click here.
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:
- text: the story text itself (string)
- validation.csv: contains a set of short stories for validation (dataframe)
- train.csv: contains the text of short stories used for narrative text classification (dataframe)
The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.
To get started, download the validation and train CSV files from the Kaggle datasets page and save them locally. Once downloaded, you may need to preprocess both files by cleaning up badly formatted values or duplicate entries before proceeding, as these can compromise the accuracy of your results.
Next, load the two files into pandas DataFrames so they can be manipulated and analyzed with common Natural Language Processing (NLP) tools. This requires only a few lines of code using pandas functions such as read_csv() and concat(), whether you work in a Jupyter Notebook or combine pandas with machine learning frameworks such as scikit-learn for more complex tasks.
With the data loaded, you can explore connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and draw useful insights from the patterns underlying this richly annotated TinyStories narrative dataset.
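A minimal sketch of the loading step described above, assuming train.csv and validation.csv have been downloaded to the working directory:
import pandas as pd

train = pd.read_csv("train.csv")             # column: text
validation = pd.read_csv("validation.csv")   # column: text

corpus = pd.concat([train, validation], ignore_index=True).drop_duplicates(subset="text")
print(len(corpus), "stories loaded")
print(corpus["text"].iloc[0][:200])          # peek at the first story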
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:------------|:--------------------------------|
| text        | The text of the story. (String) |
File: train.csv | Column name | Description | |:--------------|:----------------------------...
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset compares fixed-line broadband internet speeds in five cities:
- Melbourne, AU
- Bangkok, TH
- Shanghai, CN
- Los Angeles, US
- Alice Springs, AU
ERRATA: 1. Data is for Q3 2020, but some files are incorrectly labelled as 02-20 or June 20. They should all read Sept 20, or 09-20, as Q3 20 rather than Q2. Will rename and reload. Amended in v7.
*Lines of data for each geojson file; a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed:
- MEL: 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
- SHG: 31745 lines => 0.65M speedtests (2.5/100pp)
- BKK: 29296 lines => 1.5M speedtests (14.3/100pp)
- LAX: 15899 lines => 1.3M speedtests (10.4/100pp)
- ALC: 76 lines => 500 speedtests (2/100pp)
Geojsons of these 2° by 2° extracts for MEL, BKK, SHG are now added; LAX was added in v6 and Alice Springs in v15.
This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.
** To Do: Will add Google Map versions so everyone can see without installing Jupyter. - Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast), Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook. - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link - now added (and tweet).
** Python
melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
shg = tiles.cx[120:122, 30:32]  # Lat/Lon extract
bkk = tiles.cx[100:102, 13:15]  # Lat/Lon extract
lax = tiles.cx[-118:-120, 33:35]  # Lat/Lon extract
ALC = tiles.cx[132:134, -22:-24]  # Lat/Lon extract
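The extracts above assume the tile geojsons have already been read into GeoDataFrames; a minimal sketch with geopandas follows, where the file names are placeholders rather than the names used in this dataset:
import geopandas as gpd

au_tiles = gpd.read_file("mel_tiles.geojson")   # placeholder file name
tiles = gpd.read_file("city_tiles.geojson")     # placeholder file name

melb = au_tiles.cx[144:146, -39:-37]            # Lat/Lon extract as above
print(len(melb), "tiles;", melb.columns.tolist())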
Histograms (v9), and data visualisations (v3,5,9,11) will be provided. Data Sourced from - This is an extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).
**VERSIONS v.24 Add tweet and google map of Top 20% (over 100Mbps locations) in Mel Q322. Add v.1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below). v23. Add graph of 2022 Broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook. v22. Add Import ipynb; workflow-import-4cities. v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020; 4.3M tests 2022; 2.9M tests)
v20. Speedtest - Five Cities inc ALC. v19. Add ALC2.ipynb. v18. Add ALC line graph. v17. Added ipynb for ALC. Added ALC to title.v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC. v15. Load Melb Q1 2021 data - csv. V14. Added Melb Q1 2021 data - geojson. v13. Added Twitter link to pics. v12 Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb). v11 Add Line-Compare pic, plotting Four Cities on a graph. v10 Add Four Histograms in one pic. v9 Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook). v8 Renamed LAX file to Q3, rather than 03. v7 Amended file names of BKK files to correctly label as Q3, not Q2 or 06. v6 Added LAX file. v5 Add screenshot of BKK Google Map. v4 Add BKK Google map(link below), and BKK csv mapping files. v3 replaced MEL map with big key version. Prev key was very tiny in top right corner. v2 Uploaded MEL, SHG, BKK data and Jupyter Notebook v1 Metadata record
** LICENCE AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be: - non-commercial (NC) - reuse must be share-alike (SA)(add same licence). This restricts the standard CC-BY Figshare licence.
** Other uses of Speedtest Open Data; - see link at Speedtest below.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Credit Card Default Dataset
High-fidelity synthetic dataset for financial AI research, created with Syncora.ai
What's in This Repo?
This repository includes:
- Synthetic Credit Card Default Dataset (CSV): Download Here
- Jupyter Notebook for Analysis & Modeling: Open Notebook
- Instructions for generating your own synthetic data using the Syncora API
About This Dataset
This dataset contains realistic, fully synthetic credit card... See the full description on the dataset page: https://huggingface.co/datasets/syncora/synthetic_credit_card_default.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the inputs of the notebook "Met Office UKV high-resolution atmosphere model data" published in The Environmental Data Science Book.
The input data are a subset of a single sample data file for 1.5 m temperature, part of the Met Office contribution to the COVID-19 modelling effort.
The full dataset was available for download from the Met Office Azure storage (https://metdatasa.blob.core.windows.net/covid19-response-non-commercial/) under terms restricting its use to non-commercial purposes.
Contributions
Notebook
Samantha V. Adams (author), Met Office Informatics Lab, @svadams
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Dataset originator/creator
Met Office Informatics Lab (creator)
Microsoft (support)
European Regional Development Fund (support)
Dataset authors
Met Office
Dataset documentation
Theo McCaie. Met office and partners offer data and compute platform for covid-19 researchers. URL: https://medium.com/informatics-lab/met-office-and-partners-offer-data-and-compute-platform-for-covid-19-researchers-83848ac55f5f.
Note this data should be used only for non-commercial purposes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes links to the PoroTomo DAS data in both SEG-Y and hdf5 (via h5py and HSDS with h5pyd) formats with tutorial notebooks for use. Data are hosted on Amazon Web Services (AWS) Simple Storage Service (S3) through the Open Energy Data Initiative (OEDI). Also included are links to the documentation for the dataset, Jupyter Notebook tutorials for working with the data as it is stored in AWS S3, and links to data viewers in OEDI for the horizontal (DASH) and vertical (DASV) DAS datasets.
Horizontal DAS (DASH) data collection began 3/8/16, paused, and then started again on 3/11/2016 and ended 3/26/2016 using zigzag trenched fiber optic cables. Vertical DAS (DASV) data collection began 3/17/2016 and ended 3/28/16 using a fiber optic cable through the first 363 m of a vertical well. These are raw data files from the DAS deployment at (DASH) and below (DASV) the surface during testing at the PoroTomo Natural Laboratory at Brady Hot Spring in Nevada.
SEG-Y and hdf5 data are stored in 30-second files organized into directories by day. The hdf5 files available via HSDS are stored in daily files. Metadata includes information on the timing of recording gaps, and a file count is included that lists the number of files from each day of recording.
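Once a 30-second hdf5 file has been downloaded from OEDI, its contents can be inspected with h5py. This is only a sketch; the file name and dataset path are placeholders, and the tutorial notebooks describe the actual layout:
import h5py

with h5py.File("porotomo_dash_example.h5", "r") as f:   # placeholder file name
    f.visit(print)            # list every group and dataset path in the file
    # once the dataset path is known, read it into memory, e.g.:
    # data = f["das"][:]      # placeholder dataset path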
These data are available for download without login credentials through the free and publicly accessible Open Energy Data Initiative (OEDI) data viewer which allows users to browse and download individual or groups of files.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
Data Collection and Analysis
We used the code for analyzing the reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019, and adapted code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC using the esearch function for Jupyter notebooks using the query: ``(ipynb OR jupyter OR ipython) AND github''. We meticulously retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, encompassing the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and pipfile. Leveraging the GitHub API, we enrich our data by incorporating repository creation dates, update histories, pushes, and programming languages.
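A minimal sketch of that PMC search step using Biopython's Entrez utilities; the contact email is a required placeholder and retmax is an arbitrary example:
from Bio import Entrez

Entrez.email = "you@example.org"   # NCBI requires a contact address

handle = Entrez.esearch(db="pmc",
                        term="(ipynb OR jupyter OR ipython) AND github",
                        retmax=100)
record = Entrez.read(handle)
handle.close()
print(record["Count"], "matching articles; first IDs:", record["IdList"][:5])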
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
This dataset supports the research article "The Spread of Synthetic Media on X". In compliance with X's terms of service, we are only able to provide the tweet IDs used in our analysis rather than the full annotated tweet data. However, to enable reproducibility and further research, we are also making available the Python Jupyter notebook used for data collection and analysis, as well as the validation dataset used to assess the performance of our classification models. Additionally, the Community Notes datasets are public and accessible at: https://communitynotes.x.com/guide/en/under-the-hood/download-data
http://researchdatafinder.qut.edu.au/display/n19425
Jupyter Notebook file with a Python script for machine learning analysis of high-throughput polymer infectivity data. QUT Research Data Repository dataset resource, available for download.
T1DiabetesGranada
A longitudinal multi-modal dataset of type 1 diabetes mellitus
Documented by:
Rodriguez-Leon, C., Aviles-Perez, M. D., Banos, O., Quesada-Charneco, M., Lopez-Ibarra, P. J., Villalonga, C., & Munoz-Torres, M. (2023). T1DiabetesGranada: a longitudinal multi-modal dataset of type 1 diabetes mellitus. Scientific Data, 10(1), 916. https://doi.org/10.1038/s41597-023-02737-4
Background
Type 1 diabetes mellitus (T1D) patients face daily difficulties in keeping their blood glucose levels within appropriate ranges. Several techniques and devices, such as flash glucose meters, have been developed to help T1D patients improve their quality of life. Most recently, the data collected via these devices is being used to train advanced artificial intelligence models to characterize the evolution of the disease and support its management. The main problem for the generation of these models is the scarcity of data, as most published works use private or artificially generated datasets. For this reason, this work presents T1DiabetesGranada, a longitudinal dataset, open under specific permission, that not only provides continuous glucose levels, but also patient demographic and clinical information. The dataset includes 257780 days of measurements over four years from 736 T1D patients from the province of Granada, Spain. This dataset progresses significantly beyond the state of the art as one of the longest and largest open datasets of continuous glucose measurements, thus boosting the development of new artificial intelligence models for glucose level characterization and prediction.
Data Records
The data are stored in four comma-separated values (CSV) files which are available in T1DiabetesGranada.zip. These files are described in detail below.
Patient_info.csv
Patient_info.csv is the file containing information about the patients, such as demographic data, start and end dates of blood glucose level measurements and biochemical parameters, number of biochemical parameters or number of diagnostics. This file is composed of 736 records, one for each patient in the dataset, and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Sex – Sex of the patient. Values: F (for female), M (for male).
Birth_year – Year of birth of the patient. Format: YYYY.
Initial_measurement_date – Date of the first blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Final_measurement_date – Date of the last blood glucose level measurement of the patient in the Glucose_measurements.csv file. Format: YYYY-MM-DD.
Number_of_days_with_measures – Number of days with blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 8 to 1463.
Number_of_measurements – Number of blood glucose level measurements of the patient, extracted from the Glucose_measurements.csv file. Values: ranging from 400 to 137292.
Initial_biochemical_parameters_date – Date of the first biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Final_biochemical_parameters_date – Date of the last biochemical test to measure some biochemical parameter of the patient, extracted from the Biochemical_parameters.csv file. Format: YYYY-MM-DD.
Number_of_biochemical_parameters – Number of biochemical parameters measured on the patient, extracted from the Biochemical_parameters.csv file. Values: ranging from 4 to 846.
Number_of_diagnostics – Number of diagnoses given to the patient, extracted from the Diagnostics.csv file. Values: ranging from 1 to 24.
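As an illustration of the fields above, Patient_info.csv can be loaded with pandas and an approximate age derived from Birth_year (the reference year used here is an arbitrary example):
import pandas as pd

patients = pd.read_csv("Patient_info.csv")
patients["Age"] = 2023 - patients["Birth_year"]   # example reference year

print(patients[["Patient_ID", "Sex", "Age", "Number_of_measurements"]].head())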
Glucose_measurements.csv
Glucose_measurements.csv is the file containing the continuous blood glucose level measurements of the patients. The file is composed of more than 22.6 million records that constitute the time series of continuous blood glucose level measurements. It includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Measurement_date – Date of the blood glucose level measurement. Format: YYYY-MM-DD.
Measurement_time – Time of the blood glucose level measurement. Format: HH:MM:SS.
Measurement – Value of the blood glucose level measurement in mg/dL. Values: ranging from 40 to 500.
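The two date and time fields can be combined into a single timestamp for resampling; a minimal sketch for one patient follows (note that the full file holds more than 22.6 million rows, so reading it whole needs ample memory):
import pandas as pd

glucose = pd.read_csv("Glucose_measurements.csv")
glucose["Timestamp"] = pd.to_datetime(
    glucose["Measurement_date"] + " " + glucose["Measurement_time"])

first_id = glucose["Patient_ID"].iloc[0]
series = (glucose[glucose["Patient_ID"] == first_id]
          .set_index("Timestamp")
          .sort_index()["Measurement"])
print(series.resample("1H").mean().head())   # hourly mean glucose for one patient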
Biochemical_parameters.csv
Biochemical_parameters.csv is the file containing data of the biochemical tests performed on patients to measure their biochemical parameters. This file is composed of 87482 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Reception_date – Date of receipt in the laboratory of the sample to measure the biochemical parameter. Format: YYYY-MM-DD.
Name – Name of the measured biochemical parameter. Values: 'Potassium', 'HDL cholesterol', 'Gammaglutamyl Transferase (GGT)', 'Creatinine', 'Glucose', 'Uric acid', 'Triglycerides', 'Alanine transaminase (GPT)', 'Chlorine', 'Thyrotropin (TSH)', 'Sodium', 'Glycated hemoglobin (Ac)', 'Total cholesterol', 'Albumin (urine)', 'Creatinine (urine)', 'Insulin', 'IA ANTIBODIES'.
Value – Value of the biochemical parameter. Values: ranging from -4.0 to 6446.74.
Diagnostics.csv
Diagnostics.csv is the file containing diagnoses of diabetes mellitus complications or other diseases that patients have in addition to type 1 diabetes mellitus. This file is composed of 1757 records and includes the following variables:
Patient_ID – Unique identifier of the patient. Format: LIB19XXXX.
Code – ICD-9-CM diagnosis code. Values: subset of 594 of the ICD-9-CM codes (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Description – ICD-9-CM long description. Values: subset of 594 of the ICD-9-CM long descriptions (https://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes).
Technical Validation
Blood glucose level measurements are collected using FreeStyle Libre devices, which are widely used for healthcare in patients with T1D. Abbott Diabetes Care, Inc., Alameda, CA, USA, the manufacturer, has conducted validation studies of these devices, concluding that the measurements made by their sensors compare well with YSI analyzer devices (Xylem Inc.), the gold standard, with results falling within zones A and B of the consensus error grid 99.9% of the time. In addition, studies external to the company concluded that the accuracy of the measurements is adequate.
Moreover, it was also checked that, in most cases, the blood glucose level measurements per patient in the Glucose_measurements.csv file were continuous (i.e., a sample at least every 15 minutes), as they should be.
Usage Notes
For data downloading, it is necessary to be authenticated on the Zenodo platform, accept the Data Usage Agreement and send a request specifying full name, email, and the justification of the data use. This request will be processed by the Secretary of the Department of Computer Engineering, Automatics, and Robotics of the University of Granada and access to the dataset will be granted.
The files that compose the dataset are CSV type files delimited by commas and are available in T1DiabetesGranada.zip. A Jupyter Notebook (Python v. 3.8) with code that may help to a better understanding of the dataset, with graphics and statistics, is available in UsageNotes.zip.
Graphs_and_stats.ipynb
The Jupyter Notebook generates tables, graphs and statistics for a better understanding of the dataset. It has four main sections, one dedicated to each file in the dataset. In addition, it has useful functions such as calculating patient age, removing a list of patients from a dataset file, and keeping only a given list of patients in a dataset file.
Code Availability
The dataset was generated using some custom code located in CodeAvailability.zip. The code is provided as Jupyter Notebooks created with Python v. 3.8. The code was used to conduct tasks such as data curation and transformation, and variables extraction.
Original_patient_info_curation.ipynb
This Jupyter Notebook preprocesses the original file with patient data. Mainly, irrelevant rows and columns are removed, and the sex variable is recoded.
Glucose_measurements_curation.ipynb
This Jupyter Notebook preprocesses the original file with the continuous glucose level measurements of the patients. Principally, rows without information or duplicated rows are removed, and the timestamp variable is split into two new variables: measurement date and measurement time.
Biochemical_parameters_curation.ipynb
This Jupyter Notebook preprocesses the original file with the data of the biochemical tests performed on patients to measure their biochemical parameters. Mainly, irrelevant rows and columns are removed, and the variable with the name of the measured biochemical parameter is translated.
Diagnostic_curation.ipynb
This Jupyter Notebook preprocesses the original file with the diagnoses of diabetes mellitus complications or other diseases that patients have in addition to T1D.
Get_patient_info_variables.ipynb
This Jupyter Notebook codes the feature extraction process from the files Glucose_measurements.csv, Biochemical_parameters.csv and Diagnostics.csv to complete the file Patient_info.csv. It is divided into six sections: the first three extract the features from each of the mentioned files, and the next three add the extracted features to the resulting new file.
Data Usage Agreement
The conditions for use are as follows:
You confirm that you will not attempt to re-identify research participants for any reason, including for re-identification theory research.
You commit to keeping the T1DiabetesGranada dataset confidential and secure and will not redistribute data or Zenodo account credentials.
You will require
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Healthcare Admissions Dataset
A fully synthetic healthcare dataset for building AI solutions in healthcare, developed using Syncora.ai.
What's in This Repo?
- Healthcare Dataset (CSV): Download Here
- Example Jupyter Notebook: Open Notebook
- Use cases
About This Dataset
This synthetic healthcare dataset simulates hospital admission records including demographics, billing, medications, and lab results. It is 100% synthetic... See the full description on the dataset page: https://huggingface.co/datasets/syncora/synthetic-healthcare-admissions.
About this data An ultrasound dataset to use in the discovery of ultrasound features associated with pain and radiographic change in KOA is highly innovative and will be a major step forward for the field. These ultrasound images originate from the diverse and inclusive population-based Johnston County Health Study (JoCoHS). This dataset is designed to adhere to FAIR principles and was funded in part by an Administrative Supplement to Improve the AI/ML-Readiness of NIH-Supported Data (3R01AR077060-03S1).
Working with this dataset: If you are familiar with Jupyter notebooks, we recommend using the WorkingWithTheDataset.ipynb Jupyter notebook to retrieve, validate, and learn more about the dataset. You should download the latest WorkingWithTheDataset.ipynb file and upload it to an online Jupyter environment such as https://colab.research.google.com, or use the notebook in your Jupyter environment of choice. You will also need to download the CONFIGURATION_SETTINGS.template.md file from this dataset, since its contents are used to configure the Jupyter notebook. Note: at the time of this writing, we do not recommend using Binder (mybinder.org) if you are interested in only reviewing the WorkingWithTheDataset.ipynb notebook. When Binder loads the dataset, it will download all files from this dataset, resulting in a long build time. However, if you plan to work with all files in the dataset, then Binder might work for you. We do not offer support for this service or other Jupyter Lab environments.
Metadata: The DatasetMetadata.json file contains general information about the files and variables within this dataset. We use it as our validation metadata to verify the data we are importing into this Dataverse dataset. This file is also the most comprehensive with regard to the dataset metadata.
Data collection in progress This dataset is not complete and will be updated regularly as additional data is collected.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource includes two Jupyter Notebooks as a quick start tutorial for the NWIS Data Component of the PyMT modeling framework (https://pymt.readthedocs.io/) developed by Community Surface Dynamics Modeling System (CSDMS https://csdms.colorado.edu/).
The bmi_nwis package is an implementation of the Basic Model Interface (BMI https://bmi.readthedocs.io/en/latest/) for the USGS NWIS dataset (https://waterdata.usgs.gov/nwis). This package uses the dataretrieval package (https://github.com/USGS-python/dataretrieval) to download the NWIS dataset and wraps the dataset with BMI for data control and query. This package is not implemented for people to use but is the key element to convert the NWIS dataset into a data component for the PyMT modeling framework.
The pymt_nwis package is implemented for people to use as a reusable, plug-and-play NWIS data component for the PyMT modeling framework. This package uses the BMI implementation from the bmi_nwis package and allows the NWIS datasets to be easily coupled with other datasets or models that expose a BMI.
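For orientation, the underlying NWIS retrieval can also be done directly with the dataretrieval package mentioned above; the site number and dates below are arbitrary examples, not values taken from this resource:
from dataretrieval import nwis

# instantaneous values for an example USGS gauge over two days
df = nwis.get_record(sites="03339000", service="iv",
                     start="2022-01-01", end="2022-01-02")
print(df.head())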
HydroShare users can test and run the Jupyter Notebooks (bmi_nwis.ipynb, pymt_nwis.ipynb) directly through the "CUAHSI JupyterHub" web app with the following steps:
- New users of the CUAHSI JupyterHub should first request to join the "CUAHSI Cloud Computing Group" (https://www.hydroshare.org/group/156). After approval, the user will gain access to launch the CUAHSI JupyterHub.
- Click on the "Open with" button (on the top right corner of the page).
- Select "CUAHSI JupyterHub".
- Select the "CSDMS Workbench" server option. (Make sure to select the right server option; otherwise, the notebook won't run correctly.)
If you have any questions or suggestions about the NWIS data component, please create a GitHub issue at https://github.com/gantian127/bmi_nwis/issues
Collected in this dataset are the slideset and abstract for a presentation on Toward a Reproducible Research Data Repository by the depositar team at International Symposium on Data Science 2023 (DSWS 2023), hosted by the Science Council of Japan in Tokyo on December 13-15, 2023. The conference was organized by the Joint Support-Center for Data Science Research (DS), Research Organization of Information and Systems (ROIS) and the Committee of International Collaborations on Data Science, Science Council of Japan. The conference programme is also included as a reference.
Toward a Reproducible Research Data Repository
Cheng-Jen Lee, Chia-Hsun Ally Wang, Ming-Syuan Ho, and Tyng-Ruey Chuang
Institute of Information Science, Academia Sinica, Taiwan
The depositar (https://data.depositar.io/) is a research data repository at Academia Sinica (Taiwan) open to researchers worldwide for the deposit, discovery, and reuse of datasets. The depositar software itself is open source and builds on top of CKAN. CKAN, an open source project initiated by the Open Knowledge Foundation and sustained by an active user community, is a leading data management system for building data hubs and portals. In addition to CKAN's out-of-the-box features such as JSON data API and in-browser preview of uploaded data, we have added several features to the depositar, including sourcing from Wikidata as dataset keywords, a citation snippet for datasets, in-browser Shapefile preview, and a persistent identifier system based on ARK (Archival Resource Keys). At the same time, the depositar team faces an increasing demand for interactive computing (e.g. Jupyter Notebook), which facilitates not just data analysis, but also the replication and demonstration of scientific studies. Recently, we have provided a JupyterHub service (a multi-tenancy JupyterLab) to some of the depositar's users. However, it still requires users to first download the data files (or copy the URLs of the files) from the depositar, then upload the data files (or paste the URLs) to the Jupyter notebooks for analysis. Furthermore, a JupyterHub deployed on a single server is limited by its processing power, which may lower the service level to the users. To address the above issues, we are integrating BinderHub into the depositar. BinderHub (https://binderhub.readthedocs.io/) is a Kubernetes-based service that allows users to create interactive computing environments from code repositories. Once the integration is completed, users will be able to launch Jupyter Notebooks to perform data analysis and visualization without leaving the depositar by clicking the BinderHub buttons on the datasets. In this presentation, we will first make a brief introduction to the depositar and BinderHub along with their relationship, then we will share our experiences in incorporating interactive computation in a data repository. We shall also evaluate the possibility of integrating the depositar with other automation frameworks (e.g. the Snakemake workflow management system) in order to enable users to reproduce data analysis.
BinderHub, CKAN, Data Repositories, Interactive Computing, Reproducible Research