Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. For the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
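As an illustration (not part of the provided scripts), the connection string can be picked up with sqlalchemy along these lines; the table name queried here is a placeholder:

import os

from sqlalchemy import create_engine, text

# Read the connection string configured above.
engine = create_engine(os.environ["JUP_DB_CONNECTION"])

# Quick sanity check that the restored dump is reachable
# ("notebooks" is a placeholder table name).
with engine.connect() as conn:
    count = conn.execute(text("SELECT count(*) FROM notebooks")).scalar()
    print(f"{count} notebooks in the database")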
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies from requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
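The analysis notebooks are listed in the repository; purely as an illustration (the file names below are placeholders), they can be run non-interactively in a fixed order with nbconvert:

import subprocess

# Placeholder file names; substitute the analysis notebooks in the order given in the repository.
notebooks = ["n1.analysis.ipynb", "n2.analysis.ipynb"]

for nb in notebooks:
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb],
        check=True,
    )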
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave blank
export JUP_WITH_EXECUTION="1"; # whether to execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # whether to run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
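For illustration only, a script could read this configuration from the environment along these lines (the helper function is hypothetical, not part of the provided code):

import os

def jup_config(name, default=None, cast=str):
    # Hypothetical helper: read one of the JUP_* variables set above and cast it.
    value = os.environ.get(name, default)
    return cast(value) if value is not None else None

base_dir = jup_config("JUP_BASE_DIR")
max_size_gb = jup_config("JUP_MAX_SIZE", "8000.0", float)
with_execution = jup_config("JUP_WITH_EXECUTION", "1", int) == 1
print(base_dir, max_size_gb, with_execution)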
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
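For example, a notification could be sent through yagmail roughly as follows (a sketch based on the variables above; subject and message are placeholders):

import os
import yagmail

# Connect using the Gmail address and OAuth2 credentials configured above.
yag = yagmail.SMTP(os.environ["JUP_EMAIL_LOGIN"], oauth2_file=os.environ["JUP_OAUTH_FILE"])
yag.send(
    to=os.environ["JUP_EMAIL_TO"],
    subject="jupyter crawler: status",  # placeholder subject
    contents="Crawling finished on machine " + os.environ["JUP_MACHINE"],
)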
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the repositories (the base directory); the second one should unmount it. You can leave the scripts blank, but it is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.7
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This repository contains the dataset for the study of the computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes metadata on the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.
Data Collection and Analysis
We used the code for analyzing the reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019 and adapted code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.
Our approach involves searching PMC with the esearch function for Jupyter notebooks using the query "(ipynb OR jupyter OR ipython) AND github". We retrieve data in XML format, capturing essential details about journals and articles. By systematically scanning the entire article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine the repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Leveraging the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
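A minimal sketch of such a PMC query with Biopython's Entrez module (the contact e-mail and retmax are placeholders):

from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact address

# Search PubMed Central for articles mentioning Jupyter notebooks and GitHub.
handle = Entrez.esearch(db="pmc", term="(ipynb OR jupyter OR ipython) AND github", retmax=100)
record = Entrez.read(handle)
handle.close()
print(record["Count"], record["IdList"][:5])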
All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.
Our reproducibility pipeline was started on 27 March 2023.
Repository Structure
Our repository is organized into two main folders:
Accessing Data and Resources:
System Requirements:
Running the pipeline:
Running the analysis:
References:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## "Spatial scales of kinetic energy in the Arctic Ocean"
Datasets are available for each figure (1-9) and Figure 10 in the main text, including Jupyter notebook scripts (Figs 1, 2, 5, 10) and MATLAB scripts (Figs 3, 4, 6, 7, 8, 9).
## Description
This dataset is the supplement to the manuscript "Spatial scales of kinetic energy in the Arctic Ocean", including the Jupyter notebook and MATLAB scripts used directly for the visualization of Figures 1-9.
1) Jupyter notebook scripts for visualization
The MESH and BG files are used for visualization, and the *.mat files are the datasets for Figs 1/2/5/10 (see the loading sketch after this list). The load paths in the scripts should be changed to your own file locations accordingly.
2) Matlab scripts for plots
All figures/panels are produced directly, except that Figs 7/8/9 are additionally composed from individual panels.
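A minimal sketch for loading one of the *.mat files in Python (the file name is a placeholder; adjust the path as noted above):

from scipy.io import loadmat

# Placeholder file name; point this at one of the provided *.mat files.
data = loadmat("Fig1_data.mat")

# List the variables stored in the file (keys starting with "__" are MATLAB metadata).
print([key for key in data.keys() if not key.startswith("__")])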
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Maniple This repository contains code, scripts and data necessary to reproduce the paper "The Fact Selection Problem in LLM-Based Program Repair".
Installation Before installing the project, ensure you have the following prerequisites installed on your system:
Follow these steps to install and set up the project on your local machine:
cd maniple
python3 -m pip install .
Structure of Directories The project is organized into several directories, each serving a specific purpose:
data/                        # Training and testing datasets
  BGP32.zip/                 # Sampled 32 bugs from the BugsInPy dataset
    black/                   # The bug project folder
      10/                    # The bug ID folder
        100000001/           # The bitvector used for prompting
          prompt.md          # The prompt used for this bitvector
          response_1.md      # The response from the model
          response_1.json    # The response in JSON format
          response_1.patch   # The response in patch format
          result_1.json      # Testing result
          ...
  BGP32-without-cot.zip      # GPT responses for 32 bugs without CoT prompting
  BGP314.zip                 # 314 bugs from the BugsInPy dataset
  BGP157Ply1-llama3-70b.zip  # Experiment with the llama3 model on the BGP157Ply1 dataset
  BGP32-permutation.zip      # Permutation experiment on the BGP32 dataset

maniple/          # Scripts for getting facts and generating prompts
  strata_based/   # Scripts for generating prompts
  utils/          # Utility functions
  metrics/        # Scripts for calculating metrics for the dataset

patch_correctness_labelling.xlsx  # The labelling of patch correctness
experiment.ipynb                  # Jupyter notebook for training models

experiment-initialization-resources/  # Contains raw facts for each bug
  bug-data/                           # Raw facts for each bug
    ansible/                          # Bug project folder
      5/                              # Bug ID folder
        bug-info.json                 # Metadata for the bug
        facts_in_prompt.json          # Facts used in the prompt
        processed_facts.json          # Processed facts
        external_facts.json           # GitHub issues for this bug
        static-dynamic-facts.json     # Static and dynamic facts
        ...
  datasets-list/                      # Subsets of the BugsInPy dataset
  strata-bitvector/                   # Debugging information for bitvectors
Steps to Reproduce the Experiments Please follow the steps below sequentially to reproduce the experiments on 314 bugs in BugsInPy with our bitvector based prompt
Prepare the Dataset
The CLI scripts under the maniple directory provide useful commands to download and prepare environments for each bug. To download and prepare the environments, you can use the prep command.
maniple prep --dataset 314-dataset
This script will automatically download all 314 bugs from GitHub, create a virtual environment for the bug and install the necessary dependencies.
Fact Extraction
Then you can extract facts from the bug data using the extract command as follows:
maniple extract --dataset 314-dataset --output-dir data/BGP314
This script will extract facts from the bug data and save them in the specified output directory.
You can find all extracted facts under the experiment-initialization-resources/bug-data directory.
Generate Bitvector-Specific Prompts and Responses. First, you need to generate bitvectors for the facts. The 128 bitvectors for our paper can be generated with the following command.
python3 -m maniple.strata_based.fact_bitvector_generator
You can customize your bitvectors; they should be put under the experiment-initialization-resources/strata-bitvectors directory. You can refer to the example bitvector format used for our paper.
To reproduce our experiment prompts and responses, please use the commands below, replacing the value with your own OpenAI API key.
export OPENAI_API_KEY=
setx OPENAI_API_KEY
python3 -m maniple.strata_based.prompt_generator --database BGP314 --partition 10 --start_index 1 --trial 15
Again, you can build your own customized prompts with customized bitvectors using our extracted facts. The command above is only for reproducing our prompts and responses.
This script will generate prompts and responses for all 314 bugs in the dataset by enumerating all possible bitvectors according to the current strata design specified in maniple/strata_based/fact_strata_table.json. By specifying --trial 15, the script will generate 15 responses for each prompt, and by specifying --partition 10, it will start 10 threads to speed up the process.
Testing Generated Patches. Please use the following command:
maniple validate --output-dir data/BGP314
This script will validate the generated patches for the specified bug and save the results in the specified output directory. The test comes from the developer's fix commit.
Contributing Contributions to this project are welcome! Please submit a PR if you find any bugs or have any suggestions.
License. This project is licensed under the MIT License - see the LICENSE file for details.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
I've been recently exploring Microsoft Azure and have been playing this game for the past 4 or so years. I am also a software developer by profession. I did a simple pipeline that gets data from the official Clash Royale API using (Python) Jupyter Notebooks and Azure VMs. I tried searching for public Clash Royale datasets, but the ones I saw don't quite have that much data from my perspective, so I decided to create one for the whole community.
I started pulling in the data at the beginning of the month of December until season 18 ended. This covers the season reset last December 07, and the latest balance changes last December 09. This dataset also contains ladder data for the new Legendary card Mother Witch.
The amount of data I have, with the latest dataset, has ballooned to around 37.9 M distinct/unique ladder matches that were (pseudo)randomly pulled from a pool of 300k+ clans. If you think that this is A LOT, it could still be only a percent of a percent (or even less) of the real amount of ladder battle data. It may not reflect the whole population; also, the majority of my data are matches between players with 4000 trophies or more.
I don't see any reason not to share this with the public, as the data is now large enough that working on it and producing insights takes more than just a few hours of "hobby" time.
Feel free to use it on your own research and analysis, but don't forget to credit me.
Also, please don't monetize this dataset.
Stay safe. Stay healthy.
Happy holidays!
Card IDs Master List is in the discussion. I also created a simple notebook to load the data and made a sample of n=20 rows, so you can get an idea of what the fields are.
With this data, the following can possibly be answered:
1. Which cards are the strongest? The weakest?
2. Which win-con is the most winning?
3. Which cards are always with a specific win-con?
4. When 2 opposing players are using maxed decks, which win-con is the most winning?
5. Most widely used cards? Win-cons?
6. What are the different metas in different arenas and trophy ranges?
7. Is the ladder matchmaking algorithm rigged? (MOST CONTROVERSIAL)
(and many more)
I have 2 VMs running a total of 14 processes, and for each of these processes I've divided a pool of 300k+ clans into the same number of groups. This ran 24/7, non-stop, for the whole season. Each process randomizes the list of clans it is assigned to, iterates through each clan, and gets that clan's members' ladder data. It is important to note that I also have a pool of 470 hand-picked clans that I always get data from, as these clans were the starting point that eventually enabled me to get the 300k+ clans. Some clans have minimal ladder data; some have A LOT.
To prevent out-of-memory exceptions, as my VMs are not really that powerful (I'm using Azure free credits), I've put a time limit and a cap on the number of battles extracted per member.
My account: https://royaleapi.com/player/89L2CLRP My clan: https://royaleapi.com/clan/J898GQ
Thank you to SUPERCELL for creating this FREEMIUM game that has tested countless people's patience, as well as the durability of countless mobile devices after being smashed against a wall, and thrown on the floor.
Thank you to Microsoft for Azure and free monthly credits
Thank you to Python and Jupyter notebooks.
Thank you Kaggle for hosting this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This tar file contains the docker image for building the ARcode model and baseline models for application recognition for the SC22 paper with the same title. The files/folders in this image contain:
notebooks: The notebooks for models and experiment results.
-- ARcode.ipynb: The interactive Jupyter Notebook for the ARcode model.
-- ARcode_unknown.ipynb: The interactive Jupyter Notebook for the ARcode model for detecting unknown applications.
-- ARcode_partial.ipynb: The interactive Jupyter Notebook for the ARcode model on partial job signatures.
-- ARcode_channel.ipynb: The interactive Jupyter Notebook for the ARcode model on one channel of job signatures.
-- baselines.ipynb: The interactive Jupyter Notebook for the baseline models. These models are Random Forest, LinearSVC and SVC; all of them are implemented through Taxonomist (https://doi.org/10.6084/m9.figshare.6384248.v1).
-- baselines_unknown.ipynb: The interactive Jupyter Notebook for the baseline models for detecting unknown applications.
dataset: The dataset for training the models mentioned above.
-- ARcode_labels.npy: A numpy array of the signatures' labels.
-- ARcode_signatures.npy: A numpy array of the generated signatures.
-- baseline_labels.npy: A numpy array of the labels for the baseline dataset.
-- baseline_features.npy: A numpy array of the statistic features generated from the raw monitoring data.
-- knl_app_code.json: Mapping of IDs to application names. This mapping is used when creating the dataset.
models: The saved models.
-- arcode.h5: An HDF5 file containing the serialized weights for the ARcode model.
-- arcode.json: A JSON file describing the ARcode model.
results: The saved experiment results.
Follow these steps to start Jupyter Notebook in the image:
1. Load the image into Docker on your local machine: docker load < archive-arcode.tar
2. Start the Jupyter notebook in the docker image: docker run --init --user root -p 8888:8888 artlands/arcode
3. Copy the URL shown in your terminal and paste it in a browser: http://127.0.0.1:8888/?token=your_token
Acknowledgement: This research used resources of the National Energy Research Scientific Computing Center (NERSC), a U.S. Department of Energy Office of Science User Facility located at Lawrence Berkeley National Laboratory, operated under Contract No. DE-AC02-05CH11231
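Outside the docker image, the dataset and saved model can presumably be inspected along these lines, assuming the arcode.json/arcode.h5 pair follows the usual Keras JSON-plus-HDF5 convention suggested by the file descriptions above:

import numpy as np
from tensorflow.keras.models import model_from_json

# Load the generated signatures and their labels.
signatures = np.load("dataset/ARcode_signatures.npy")
labels = np.load("dataset/ARcode_labels.npy")
print(signatures.shape, labels.shape)

# Rebuild the ARcode model from its JSON description and saved weights
# (assuming the standard Keras serialization format).
with open("models/arcode.json") as f:
    model = model_from_json(f.read())
model.load_weights("models/arcode.h5")
model.summary()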
The BBC Hindi News Articles Dataset offers a comprehensive collection of news articles gathered through Python web scraping. This dataset features articles from various categories, providing a broad spectrum of content for analysis. Each entry in the dataset includes three key data points:
Headline: The title of the news article.
Content: The full text of the article.
Category: The category to which the article belongs.
Ideal for natural language processing (NLP) tasks, sentiment analysis, and language modeling, this dataset provides a rich resource for understanding and exploring Hindi news media.
I could not find datasets under a Creative Commons license, so I thought of scraping it myself and making it available on Kaggle!
Please use it freely and just put up credit for the dataset. Upvote would be really appreciated :)
I have also uploaded my jupyter notebook for web scraping on GitHub if you want to check that out: https://github.com/AadiSrivastava05/BBC-Hindi-News-Dataset-with-web-scraping-script
Original Data Source: BBC Hindi News Articles Dataset - Detailed
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🔹 Release v1.0 - Duffing Oscillator Response Analysis (DORA)
This release provides a collection of benchmark tasks and datasets, accompanied by minimal code to generate, import, and plot the data. The primary focus is on the Duffing Oscillator Response Analysis (DORA) prediction task, which evaluates machine learning models' ability to generalize system responses in unseen parameter regimes.
🚀 Key Features:
Duffing Oscillator Response Analysis (DORA) Prediction Task:
Objective: Predict the response of a forced Duffing oscillator using a minimal training dataset. This task assesses a model's capability to extrapolate system behavior in unseen parameter regimes, specifically varying amplitudes of external periodic forcing.
Expectation: A proficient model should qualitatively capture the system's response, such as identifying the exact number of cycles in a limit-cycle regime or chaotic trajectories when the system transitions to a chaotic regime, all trained on limited datasets.
Comprehensive Dataset:
Training Data (DORA_Train.csv): Contains data for two external forcing amplitudes, f ∈ [0.46, 0.49].
Testing Data (DORA_Test.csv): Includes data for five forcing amplitudes, f ∈ [0.2, 0.35, 0.48, 0.58, 0.75].
📊 Data Description:
Each dataset comprises five columns:
t: Time variable
q1(t): Time evolution of the Duffing oscillator's position
q2(t): Time evolution of the Duffing oscillator's velocity
f(t): Time evolution of external periodic forcing
f_amplitude: Constant amplitude during system evaluation (default: 250)
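A minimal sketch for loading and plotting the training data with pandas and matplotlib, assuming the CSV headers match the column names above:

import matplotlib.pyplot as plt
import pandas as pd

# Load the training data with the five columns described above.
train = pd.read_csv("DORA_Train.csv")
print(train.columns.tolist())

# Plot position and velocity of the Duffing oscillator over time.
fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(train["t"], train["q1(t)"])
axes[0].set_ylabel("q1(t)")
axes[1].plot(train["t"], train["q2(t)"])
axes[1].set_ylabel("q2(t)")
axes[1].set_xlabel("t")
plt.show()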
🛠 Utility Scripts and Notebooks:
Data Generation and Visualization:
DORA_generator.py: Generates, plots, and saves training and testing data. Usage:
python DORA_generator.py -time 250 -plots 1
DORA.ipynb: A Jupyter Notebook for dataset generation, loading, and plotting.
Data Loading and Plotting:
ReadData.py: Loads and plots the provided datasets (DORA_Train.csv and DORA_Test.csv).
📈 Model Evaluation:
The prediction model's success is determined by its ability to extrapolate system behavior outside the training data. System response characteristics for external forcing are quantified in terms of the amplitude and mean of q1²(t). These can be obtained using the provided Signal_Characteristic function.
🔹 Performance Metrics:
Response Amplitude Error: MSE[max(q1_prediction²(t > t*)), max(q1_original²(t > t*))]
Response Mean Error: MSE[Mean(q1_prediction²(t > t*)), Mean(q1_original²(t > t*))]
Note: t* = 20 s denotes the steady-state time.
📌 Reference Implementation:
An exemplar solution using reservoir computing is detailed in the following:📖 Yadav et al., 2025 – Springer Nonlinear Dynamics
📄 Citation:
If you utilize this dataset or code in your research, please cite:
@article{Yadav2024,
  author  = {Manish Yadav and Swati Chauhan and Manish Dev Shrimali and Merten Stender},
  title   = {Predicting multi-parametric dynamics of an externally forced oscillator using reservoir computing and minimal data},
  journal = {Nonlinear Dynamics},
  year    = {2024},
  doi     = {10.1007/s11071-024-10720-w}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the code and datasets used in the data analysis for "Fracture toughness of mixed-mode anticracks in highly porous materials". The analysis is implemented in Python, using Jupyter Notebooks.
The repository contains the following files:
main.ipynb: Jupyter notebook with the main data analysis workflow.
energy.py: Methods for the calculation of energy release rates.
regression.py: Methods for the regression analyses.
visualization.py: Methods for generating visualizations.
df_mmft.pkl: Pickled DataFrame with experimental data gathered in the present work.
df_legacy.pkl: Pickled DataFrame with literature data.
Dependencies: pandas, matplotlib, numpy, scipy, tqdm, uncertainties, weac. Install them with pip install -r requirements.txt and open the main.ipynb notebook in Jupyter Notebook or JupyterLab.
The experimental measurements and corresponding parameters are provided in df_mmft.pkl and df_legacy.pkl. Below are the descriptions for each column in these DataFrames:
df_mmft.pkl
exp_id: Unique identifier for each experiment.
datestring: Date of the experiment as a string.
datetime: Timestamp of the experiment.
bunker: Field site of the experiment. Bunker IDs 1 and 2 correspond to field sites A and B, respectively.
slope_incl: Inclination of the slope in degrees.
h_sledge_top: Distance from the sample top surface to the sled in mm.
h_wl_top: Distance from the sample top surface to the weak layer in mm.
h_wl_notch: Distance from the notch root to the weak layer in mm.
rc_right: Critical cut length in mm, measured on the front side of the sample.
rc_left: Critical cut length in mm, measured on the back side of the sample.
rc: Mean of rc_right and rc_left.
densities: List of density measurements in kg/m^3 for each distinct slab layer of each sample.
densities_mean: Daily mean of densities.
layers: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
layers_mean: Daily mean of layers.
surface_lineload: Surface line load of added surface weights in N/mm.
wl_thickness: Weak-layer thickness in mm.
notes: Additional notes regarding the experiment or observations.
L: Length of the slab–weak-layer assembly in mm.
df_legacy.pkl
#: Record number.
rc: Critical cut length in mm.
slope_incl: Inclination of the slope in degrees.
h: Slab height in mm.
density: Mean slab density in kg/m^3.
L: Length of the slab–weak-layer assembly in mm.
collapse_height: Weak-layer height reduction through collapse.
layers_mean: 2D array with layer density (kg/m^3) and layer thickness (mm) pairs for each distinct slab layer.
wl_thickness: Weak-layer thickness in mm.
surface_lineload: Surface line load from added weights in N/mm.
For more detailed information on the datasets, refer to the paper or the documentation provided within the Jupyter notebook.
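For example, the two DataFrames can be loaded and inspected as follows (a sketch using the column names documented above):

import pandas as pd

# Load the experimental data of the present work and the literature data.
df_mmft = pd.read_pickle("df_mmft.pkl")
df_legacy = pd.read_pickle("df_legacy.pkl")

# Inspect critical cut lengths and slope inclinations.
print(df_mmft[["exp_id", "slope_incl", "rc"]].head())
print(df_legacy[["rc", "slope_incl", "density"]].describe())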
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accident Detection Model is made using YOLOv8, Google Colab, Python, Roboflow, Deep Learning, OpenCV, Machine Learning, and Artificial Intelligence. It can detect an accident from a live camera feed, image, or video. This model is trained on a dataset of 3200+ images; these images were annotated on Roboflow.
Survey image: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database contains 4976 planetary images of boulder fields located on Earth, Mars and the Moon. The data were collected during the BOULDERING Marie Skłodowska-Curie Global Fellowship between October 2021 and 2024. The data are already split into train, validation and test datasets, but feel free to re-organize the labels at your convenience.
For each image, all of the boulder outlines within the image were carefully mapped in QGIS. More information about the labelling procedure can be found in the following manuscript (https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2023JE008013). This dataset differs from the previous dataset included along with the manuscript (https://zenodo.org/records/8171052), as it contains more mapped images, especially of boulder populations around young impact structures on the Moon (cold spots). In addition, the boulder outlines were pre-processed so that they can be ingested directly into YOLOv8.
A description of what is what is given in the README.txt file (in addition to how to load the custom datasets in Detectron2 and YOLO). Most of the other files are self-explanatory. Please see the previous dataset or the manuscript for more information. If you want more information about specific lunar and martian planetary images, the IDs of the images are still available in the file names. Use this ID to find more information (e.g., for M121118602_00875_image.png, the ID M121118602 can be used on https://pilot.wr.usgs.gov/). I will also upload the raw data from which this pre-processed dataset was generated (see https://zenodo.org/records/14250970).
Thanks to this database, you can easily train Detectron2 Mask R-CNN or YOLO instance segmentation models to automatically detect boulders.
How to cite:
Please refer to the "how to cite" section of the readme file of https://github.com/astroNils/YOLOv8-BeyondEarth.
Structure:
.
└── boulder2024/
    ├── jupyter-notebooks/
    │   └── REGISTERING_BOULDER_DATASET_IN_DETECTRON2.ipynb
    ├── test/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── train/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── validation/
    │   ├── images/
    │   │   ├── _image.png
    │   │   └── ...
    │   └── labels/
    │       ├── _image.txt
    │       └── ...
    ├── detectron2_inst_seg_boulder_dataset.json
    ├── README.txt
    └── yolo_inst_seg_boulder_dataset.yaml
detectron2_inst_seg_boulder_dataset.json is a JSON file containing the masks as expected by Detectron2 (see https://detectron2.readthedocs.io/en/latest/tutorials/datasets.html for more information on the format). In order to use this custom dataset, you need to register it before using it in training. There is an example of how to do that in the jupyter-notebooks folder. You need to have detectron2 and all of its dependencies installed.
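The provided notebook shows the exact procedure; purely as an illustrative sketch (the dataset name and the assumption that the JSON already holds a list of Detectron2 standard dataset dicts are mine), registration boils down to something like:

import json
from detectron2.data import DatasetCatalog, MetadataCatalog

def load_boulder_dicts():
    # Assumes the JSON already holds a list of Detectron2 "standard dataset dicts".
    with open("detectron2_inst_seg_boulder_dataset.json") as f:
        return json.load(f)

# "boulder_train" is an arbitrary name chosen for this sketch.
DatasetCatalog.register("boulder_train", load_boulder_dicts)
MetadataCatalog.get("boulder_train").set(thing_classes=["boulder"])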
yolo_inst_seg_boulder_dataset.yaml can be used as it is; however, you need to update the paths in the .yaml file to the test, train and validation folders. More information about the YOLO format can be found here (https://docs.ultralytics.com/datasets/segment/).
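For example, a YOLOv8 instance segmentation model could then be trained roughly like this (the checkpoint name and hyperparameters are placeholders):

from ultralytics import YOLO

# Start from a pretrained YOLOv8 segmentation checkpoint (placeholder choice).
model = YOLO("yolov8n-seg.pt")
model.train(data="yolo_inst_seg_boulder_dataset.yaml", epochs=100, imgsz=640)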
https://choosealicense.com/licenses/cc0-1.0/
🚀 Hugging Face Uploader: Streamline Your Model Sharing! 🚀
This tool provides a user-friendly way to upload files directly to your Hugging Face repositories. Whether you prefer the interactive environment of a Jupyter Notebook or the command-line efficiency of a Python script, we've got you covered. We've designed it to streamline your workflow and make sharing your models, datasets, and spaces easier than ever before! Will be more consistently updated here:… See the full description on the dataset page: https://huggingface.co/datasets/EarthnDusk/Huggingface_Uploader.
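Under the hood, uploads like this rely on the huggingface_hub client; a minimal stand-alone sketch (repository id and file names are placeholders, and this is not the tool's own code):

from huggingface_hub import HfApi

api = HfApi()  # expects a token from `huggingface-cli login` or the HF_TOKEN environment variable

# Placeholder repository and file names.
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="your-username/your-model",
    repo_type="model",
)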
https://creativecommons.org/publicdomain/zero/1.0/
TRANSJAKARTA - Public Transportation - Transaction Data
When data analysts want to build a framework for analysis, they should not have to wait for real transactions to accumulate over time. They could create dummy data to test whether the framework or the data structure already meets the requirements for deep analytics. Here I tried to simulate transaction data for Transjakarta, as I found none publicly shared on the Internet. I hope you can practice with this simulated data and make it more meaningful, as the master data behind it are real (but the transactions are dummy).
The master data are sourced from: https://ppid.transjakarta.co.id/pusat-data/data-terbuka/transjakarta-gtfs-feed. The data were generated in Python using Faker and Random, based on the master data. The source might be updated from time to time, and this dataset might not represent the latest version of the source.
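A tiny sketch of how such dummy tap-in records can be generated with Faker and random (the field names are illustrative only, not the dataset's actual schema):

import random
from faker import Faker

fake = Faker("id_ID")  # Indonesian locale for plausible names

# Illustrative fields only; the real dataset's schema may differ.
routes = ["1", "2", "3A", "9", "13"]
transactions = []
for _ in range(5):
    transactions.append({
        "card_id": fake.uuid4(),
        "customer_name": fake.name(),
        "route": random.choice(routes),
        "tap_in_time": fake.date_time_between(start_date="-30d", end_date="now"),
        "fare": random.choice([0, 3500]),
    })
print(transactions[0])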
Context: Transjakarta is a public transportation company from Indonesia, based in Jakarta. The transportation modes are big buses (BRT), medium and big buses (non-BRT), and mini buses (Mikrotrans). The mechanism in Transjakarta is to tap in and tap out using a payment card as your ticket.
Content: Basically, this data is a simulation of transaction data in Transjakarta. It does not represent the real data or structure used by Transjakarta.
Inspiration: Transjakarta is growing as a public transportation company, but no one has shared data for transaction analysis. With this data we can analyze which routes are busy and which are not, which routes are heavy with traffic jams, and other dimensions provided in the data.
If you'd like to see how I created this dataset, you can peek at the process in my GitHub.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset merges two public datasets:
- Speedtest network performance data for Australia (Q3 2020), as loaded to the AURIN geo-analysis platform (approx. 88,000 locations).
- NBN mapping of technology data (including FTTN, FTTP, FTTC, FTTB, HFC, Wireless, Satellite); a complete map of Australia, colour-coded by technology, in WMS or KML format. The KML format is used in this instance.
The result is an intersection dataset (319337 rows × 26 columns), including LocID, download speed, NBN technology, lat, lon, SA2, SA3, SA4 (see ABS link below). Note that many technologies can fall in one Speedtest block (600 m^2), so these have to be untangled. A ten-line sample (CSV) is included; see the image QGIS.png as provided by AURIN.
Versions:
v2. Load of the updated Jupyter Notebook v1.1 and a locations CSV, which shows the location breakdown by NBN technology (pivot table export).
v1. Initial load, including the Jupyter Notebook and a human-readable geojson.
METHOD (advice from AURIN): In order to join these two maps, you will need to perform a spatial join based on the two layers. It is possible to do this with geopandas.sjoin(), which by default performs an intersection join - that is, any portion of a matching polygon from the second layer is considered a match to join on. More information about spatial predicates is available in case you're looking for a different spatial relationship. The supplied notebook (OoklaNBN-AURIN.ipynb) collects the datasets from the AURIN API and data.gov.au, combines the several KML NBN layers into one, and joins them with the Ookla 2020 Q3 dataset. In order to use it, you will first need to input your AURIN API credentials into the first cell. The spatial join occurs in the final notebook cell and writes its output to a geopackage (OoklaNBN.gpkg), which is also included, as the script can take some time to run. You will notice that each Ookla cell may now be represented by many records; this is due to there being more than one overlapping NBN technology polygon. As one Ookla grid can cover many technology zones, aggregating these may be useful depending on how you approach your analysis.
The authors acknowledge the facilities and scientific and technical assistance of the NCRIS-enabled Australian Urban Research Infrastructure Network (AURIN). Thanks to Evan Thomas, AURIN (ORCiD: https://orcid.org/0000-0001-7564-4116).
Preliminary analysis:
1. Count by technology type:
Fibre to the Basement (vectored or non-vectored): 14147
Satellite: 15946
Fixed Wireless: 23663
Fibre to the Curb: 36843
Hybrid Fibre Coaxial (HFC): 41175
Fibre to the Premises: 93732
Fibre to the Node: 93831
2. Mean download speed by technology type (Mbps, Australia-wide):
Satellite: 48.1674
Fixed Wireless: 49.1477
Fibre to the Node: 75.1639
Fibre to the Premises: 83.71
Fibre to the Curb: 86.8433
Hybrid Fibre Coaxial (HFC): 87.1248
Fibre to the Basement (vectored or non-vectored): 116.324
Licence: The Speedtest licence (as per the AWS data licence) is CC BY-NC-SA 4.0, so use of this data must be non-commercial (NC) and reuse must be share-alike (SA, i.e. under the same licence). This restricts the standard CC BY Figshare licence.
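The core of the method quoted above is the spatial join; a minimal sketch (the input file names are placeholders for the Ookla tiles and the merged NBN technology layer):

import geopandas as gpd

# Placeholder inputs: Ookla speed-test tiles and the merged NBN technology polygons.
ookla = gpd.read_file("ookla_q3_2020.geojson")
nbn = gpd.read_file("nbn_technology.geojson").to_crs(ookla.crs)

# Intersection join: every NBN polygon overlapping an Ookla tile produces a record.
joined = gpd.sjoin(ookla, nbn, how="inner", predicate="intersects")
print(joined.shape)
joined.to_file("OoklaNBN_sketch.gpkg", driver="GPKG")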
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of geomechanical simulations conducted on a faulted aquifer under conditions of CO2 injection. The primary focus of the simulations is the pressure evolution within the rock matrix and along the fault, as well as the associated changes in the mechanical state, including rock deformation and fault slip. Additionally, the simulations explore the sensitivity of fault stability under varying orientations of far-field stress.
The dataset includes raw data in VTK format, as well as an illustrative Jupyter notebook that provides a comprehensive explanation of the problem's geometry, boundary and initial conditions, and an interpretation of the observed physical phenomena. The Jupyter notebook is designed to be run both online and locally.
These simulations were performed using an open-source FEM-based geomechanical simulator. Detailed instructions for running the notebook, along with a link to the geomechanical simulator, are provided in the description below.
An interactive notebook showcasing visualisations of the dataset is available on RenkuLab.
Alternatively, you can launch the notebook on your computer. Download the dataset, install dependencies, and launch Jupyter notebook:
pip install -r requirements_freeze.txt
jupyter notebook
Then, open notebooks/DataVisualisation.ipynb.
To recreate the results found in this dataset, install the solver and go through the example at examples/injection_fault.
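To inspect the raw VTK files outside the notebook, something along these lines could be used (pyvista is not required by the dataset and the file name is a placeholder):

import pyvista as pv

# Placeholder file name: point this at one of the VTK files in the dataset.
mesh = pv.read("output/solution_0001.vtu")
print(mesh.array_names)  # available fields, e.g. pressure or displacement
mesh.plot(scalars=mesh.array_names[0], show_edges=True)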
The repository has the following structure:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose These are a collection of supplementary files that are to be included in my dissertation. They include but are not limited to small IPython notebooks, extra figures, data-sets that are too large to publish in the main document such as full ortholog lists and other primary data.
Viewing IPython notebooks (ipynb files) To view an IPython notebook, "right-click" its download link and select "Copy link address". Then navigate to the free notebook viewer by following this link: http://nbviewer.ipython.org/. Finally, paste the link to the ipynb file that you copied into the URL form on the nbviewer page and click "Go".
This is the source code package for the labbench python module, version 0.20, which is its first public release. The purpose of labbench is to streamline and organize complicated laboratory automation tasks that involve large-scale benchtop automation, concurrency, and/or data management. It is built around a system of wrappers that facilitate robust, concise exception handling, type checking, API conventions, and synchronized device connection through python context blocks. The wrappers also provide convenient new functionality, such as support for automated status displays in jupyter notebooks, simplified threaded concurrency, and automated, type-safe logging to relational databases. Together, these features help to minimize the amount of "copy-and-paste" code that can make your lab automation scripts error-prone and difficult to maintain. The python code that results can be clear, concise, reusable and maintainable, and provide consistent formatting for stored data. The result helps researchers to meet NIST's open data obligations, even for complicated, large, and heterogeneous datasets.
Several past and ongoing projects in the NIST Communication Technology Laboratory (CTL) published data that were acquired by automation in labbench. We release it here both for transparency and to invite public use and feedback. Ongoing updates to this source code will be maintained on the NIST github page at https://github.com/usnistgov/labbench. The code was developed in python, documented with the python sphinx package and markdown, and shared through the USNISTGOV organization on GitHub.
INSTALLATION
labbench can run on any computer that supports python 3.6. The hardware requirements are discussed here: https://docs.anaconda.com/anaconda/install/#requirements
1. Install your favorite distribution of a python version 3.6 or greater
2. In a command prompt: pip install git+https://gitlab.nist.gov/gitlab/ssm/labbench
3. (Optional) Install an NI VISA [1] runtime, for example this one for windows.
USAGE
The source distribution contains detailed information, including:
* README.md - documentation to get started using labbench
* LICENSE.md - license and redistribution information
* doc/labbench-api.pdf - complete listing of the module and documentation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv.
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics, for both P-wave-centered and S-wave-centered windows, for all components on all stations, comparing in each case waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type; a loading sketch follows this list.
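A minimal loading sketch for these files (the ASDF file name is a placeholder; pyasdf is one possible reader for the .h5 files):

import pandas as pd
import pyasdf  # only needed for the ASDF waveform files

# Load one of the pickled dataframes described above.
misfits = pd.read_pickle("df_misfits_cc.28s.pkl")
print(misfits.head())

# Open one of the ASDF waveform files (placeholder file name).
with pyasdf.ASDFDataSet("waveforms_reference.h5", mode="r") as ds:
    print(ds.waveforms.list())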
How to use this dataset:
To set up the conda environment:
make sure you have anaconda/miniconda
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
results
directory, following the structureresults/{DATE}_{TIME}-{INSTANCE}-{REGION}/results/exp0_250000_9_generic_throughput_{IDX}.csv
{DATE}
is the date of the execution in the format YYYY-MM-DD
,{TIME}
is the time of the execution in the format HH-MM-SS
,{INSTANCE}
is the instance type used for the execution (m6i
or m6g
),{REGION}
is the AWS region used for the execution (useast1
or eucentral1
),{IDX}
is the number of the repetition of an execution (1
-3
).timestamp
in epoch seconds,value
the measured throughput in records per second as obtained with the ad-hoc throughput metric of ShuffleBench.results/{DATE}_{TIME}-{INSTANCE}-{REGION}
also contains a theodolite.log
file that contains the logs of the Theodolite benchmarking tool and the logged configuration of each execution in results
. Although we do not expect them to provide additional insights (since the purpose of our study was to repeatedly execute the same benchmark), we refer to the documentation of Theodolite for further details.results-analysis.ipynb
following these steps:python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
jupyter notebook
The periodic-executor directory contains scripts and configuration files used to automatically execute ShuffleBench. As ShuffleBench relies on the Theodolite benchmarking framework for executing benchmarks within Kubernetes, the code here is mostly for setting up a Kubernetes cluster, installing Theodolite, configuring the benchmark executions, and collecting the benchmark results. The container image can be built and pushed with:
docker build -t $ECR_REPOSITORY/$IMAGE_NAME .
docker push $ECR_REPOSITORY/$IMAGE_NAME
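Pushing to ECR usually requires authenticating Docker against the registry first; a sketch, assuming the AWS CLI is configured (the region value is an example; adapt it and $ECR_REPOSITORY to your setup):
# Log Docker in to the ECR registry before pushing (region value is an example)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $ECR_REPOSITORY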
shufflebench-periodic-schedule-results has to be created.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks on this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute the python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the notebook extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the repositories; the second one should unmount it. You can leave the scripts blank, but this is not advisable: the reproducibility study runs arbitrary code on your machine and you may lose your data. A minimal sketch of both scripts is given below.
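A possible minimal version of the two scripts, assuming the repositories live on a dedicated device mounted at the JUP_BASE_DIR location (the device name and mount point below are placeholders for your own setup):
# mount_ghstudy.sh (sketch): mount the storage that holds the cloned repositories
sudo mount /dev/sdb1 /mnt/jupyter
# umount_ghstudy.sh (sketch): unmount it again
sudo umount /mnt/jupyter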
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option). A loop-based sketch follows right below; the detailed per-version instructions come after it:
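As a compact alternative for the five plain conda environments, a loop along these lines can be used (a sketch only: it assumes a bash shell and a conda version that provides conda run, and the Anaconda variants as well as the Python 3.4 and 3.5 quirks described below still require the manual steps):
# Create the five "raw" conda environments and install the local archaeology package in each
for v in 2.7 3.4 3.5 3.6 3.7; do
  name="raw${v//./}"                      # e.g. 2.7 -> raw27
  conda create -n "$name" python="$v" -y
  conda run -n "$name" pip install --upgrade pip pipenv
  conda run -n "$name" pip install -e jupyter_reproducibility/archaeology
done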
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda activate py35
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7