57 datasets found

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...
zenodo.org
bz2
Updated Mar 15, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
Explore at:
bz2Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.2592524
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

dump.tar.bz2

jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.

archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.

paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7 conda activate analyses

Go to the analyses folder and install all the dependencies of the requirements.txt

cd jupyter_reproducibility/analyses pip install -r requirements.txt

For reproducing the analyses, run jupyter on this folder:

jupyter notebook

Execute the notebooks on this order:

Index.ipynb

N0.Repository.ipynb

N1.Skip.Notebook.ipynb

N2.Notebook.ipynb

N3.Cell.ipynb

N4.Features.ipynb

N5.Modules.ipynb

N6.AST.ipynb

N7.Name.ipynb

N8.Execution.ipynb

N9.Cell.Execution.Order.ipynb

N10.Markdown.ipynb

N11.Repository.With.Notebook.Restriction.ipynb

N12.To.Paper.ipynb

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories export JUP_LOGS_DIR="/home/jupyter/logs"; # log files export JUP_COMPRESSION="lbzip2"; # compression program export JUP_VERBOSE="5"; # verbose level export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection export JUP_GITHUB_USERNAME="github_username"; # your github username export JUP_GITHUB_PASSWORD="github_password"; # your github password export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB) export JUP_FIRST_DATE="2013-01-01"; # initial date to query github export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address export JUP_EMAIL_TO="target@email.com"; # email that receives notifications export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank export JUP_WITH_EXECUTION="1"; # run execute python notebooks export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies export JUP_EXECUTION_MODE="-1"; # run following the execution order export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction # Frequenci of log report export JUP_ASTROID_FREQUENCY="5"; export JUP_IPYTHON_FREQUENCY="5"; export JUP_NOTEBOOKS_FREQUENCY="5"; export JUP_REQUIREMENT_FREQUENCY="5"; export JUP_CRAWLER_FREQUENCY="1"; export JUP_CLONE_FREQUENCY="1"; export JUP_COMPRESS_FREQUENCY="5"; export JUP_DB_IP="localhost"; # postgres database IP

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y conda activate raw27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y conda activate py27 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y conda activate raw34 conda install jupyter -c conda-forge -y conda uninstall jupyter -y pip install --upgrade pip pip install jupyter pip install pipenv pip install -e jupyter_reproducibility/archaeology pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y conda activate py34 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y conda activate raw35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator conda activate py35 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y conda activate raw36 pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y conda activate py36 conda install -y anaconda-navigator jupyterlab_server navigator-updater pip install --upgrade pip pip install pipenv pip install -e jupyter_reproducibility/archaeology

Conda 3.7

<code
Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.13621169.v24
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Richard Ferrers; Speedtest Global Index
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset compares four cities FIXED-line broadband internet speeds: - Melbourne, AU - Bangkok, TH - Shanghai, CN - Los Angeles, US - Alice Springs, AU

ERRATA: 1.Data is for Q3 2020, but some files are labelled incorrectly as 02-20 of June 20. They all should read Sept 20, or 09-20 as Q3 20, rather than Q2. Will rename and reload. Amended in v7.

LAX file named 0320, when should be Q320. Amended in v8.

*lines of data for each geojson file; a line equates to a 600m^2 location, inc total tests, devices used, and average upload and download speed - MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100people) - SHG 31745 lines => 0.65M speedtests (2.5/100pp) - BKK 29296 lines => 1.5M speedtests (14.3/100pp) - LAX 15899 lines => 1.3M speedtests (10.4/100pp) - ALC 76 lines => 500 speedtests (2/100pp)

Geojsons of these 2* by 2* extracts for MEL, BKK, SHG now added, and LAX added v6. Alice Springs added v15.

This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

** To Do Will add Google Map versions so everyone can see without installing Jupyter. - Link to Google Map (BKK) added below. Key:Green > 100Mbps(Superfast). Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook. - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plot Top 20% on map for community. Google Map link - now added (and tweet).

** Python melb = au_tiles.cx[144:146 , -39:-37] #Lat/Lon extract shg = tiles.cx[120:122 , 30:32] #Lat/Lon extract bkk = tiles.cx[100:102 , 13:15] #Lat/Lon extract lax = tiles.cx[-118:-120, 33:35] #lat/Lon extract ALC=tiles.cx[132:134, -22:-24] #Lat/Lon extract

Histograms (v9), and data visualisations (v3,5,9,11) will be provided. Data Sourced from - This is an extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).

**VERSIONS v.24 Add tweet and google map of Top 20% (over 100Mbps locations) in Mel Q322. Add v.1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below). v23. Add graph of 2022 Broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook. v22. Add Import ipynb; workflow-import-4cities. v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020; 4.3M tests 2022; 2.9M tests)

Melb 14784 lines Avg download speed 69.4M Tests 0.39M

SHG 31207 lines Avg 233.7M Tests 0.56M

ALC 113 lines Avg 51.5M Test 1092

BKK 29684 lines Avg 215.9M Tests 1.2M

LAX 15505 lines Avg 218.5M Tests 0.74M

v20. Speedtest - Five Cities inc ALC. v19. Add ALC2.ipynb. v18. Add ALC line graph. v17. Added ipynb for ALC. Added ALC to title.v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC. v15. Load Melb Q1 2021 data - csv. V14. Added Melb Q1 2021 data - geojson. v13. Added Twitter link to pics. v12 Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb). v11 Add Line-Compare pic, plotting Four Cities on a graph. v10 Add Four Histograms in one pic. v9 Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook). v8 Renamed LAX file to Q3, rather than 03. v7 Amended file names of BKK files to correctly label as Q3, not Q2 or 06. v6 Added LAX file. v5 Add screenshot of BKK Google Map. v4 Add BKK Google map(link below), and BKK csv mapping files. v3 replaced MEL map with big key version. Prev key was very tiny in top right corner. v2 Uploaded MEL, SHG, BKK data and Jupyter Notebook v1 Metadata record

** LICENCE AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be: - non-commercial (NC) - reuse must be share-alike (SA)(add same licence). This restricts the standard CC-BY Figshare licence.

** Other uses of Speedtest Open Data; - see link at Speedtest below.
B
Python Code for Visualizing COVID-19 data
borealisdata.ca
datasetcatalog.nlm.nih.gov
+1more
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ryan Chartier; Geoffrey Rockwell (2023). Python Code for Visualizing COVID-19 data [Dataset]. http://doi.org/10.5683/SP3/PYEQL0
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/PYEQL0
Dataset updated
Dec 16, 2023
Dataset provided by
Borealis
Authors
Ryan Chartier; Geoffrey Rockwell
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The purpose of this code is to produce a line graph visualization of COVID-19 data. This Jupyter notebook was built and run on Google Colab. This code will serve mostly as a guide and will need to be adapted where necessary to be run locally. The separate COVID-19 datasets uploaded to this Dataverse can be used with this code. This upload is made up of the IPYNB and PDF files of the code.
d
Toward a Reproducible Research Data Repository
data.depositar.io
mp4, pdf
Updated Jan 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
depositar (2024). Toward a Reproducible Research Data Repository [Dataset]. https://data.depositar.io/dataset/reproducible-research-data-repository
Explore at:
pdf(627064), mp4(22141307), pdf(2586248), pdf(212638)Available download formats
Dataset updated
Jan 26, 2024
Dataset provided by
depositar
Description
Collected in this dataset are the slideset and abstract for a presentation on Toward a Reproducible Research Data Repository by the depositar team at International Symposium on Data Science 2023 (DSWS 2023), hosted by the Science Council of Japan in Tokyo on December 13-15, 2023. The conference was organized by the Joint Support-Center for Data Science Research (DS), Research Organization of Information and Systems (ROIS) and the Committee of International Collaborations on Data Science, Science Council of Japan. The conference programme is also included as a reference.

Title

Toward a Reproducible Research Data Repository

Author(s)

Cheng-Jen Lee, Chia-Hsun Ally Wang, Ming-Syuan Ho, and Tyng-Ruey Chuang

Affiliation of presenter

Institute of Information Science, Academia Sinica, Taiwan

Summary of Abstract

The depositar (https://data.depositar.io/) is a research data repository at Academia Sinica (Taiwan) open to researhers worldwide for the deposit, discovery, and reuse of datasets. The depositar software itself is open source and builds on top of CKAN. CKAN, an open source project initiated by the Open Knowledge Foundation and sustained by an active user community, is a leading data management system for building data hubs and portals. In addition to CKAN's out-of-the-box features such as JSON data API and in-browser preview of uploaded data, we have added several features to the depositar, including sourcing from Wikidata as dataset keywords, a citation snippet for datasets, in-browser Shapefile preview, and a persistent identifier system based on ARK (Archival Resource Keys). At the same time, the depositar team faces an increasing demand for interactive computing (e.g. Jupyter Notebook) which facilitates not just data analysis, but also for the replication and demonstration of scientific studies. Recently, we have provided a JupyterHub service (a multi-tenancy JupyterLab) to some of the depositar's users. However, it still requires users to first download the data files (or copy the URLs of the files) from the depositar, then upload the data files (or paste the URLs) to the Jupyter notebooks for analysis. Furthermore, a JupyterHub deployed on a single server is limited by its processing power which may lower the service level to the users. To address the above issues, we are integrating the BinderHub into the depositar. BinderHub (https://binderhub.readthedocs.io/) is a kubernetes-based service that allows users to create interactive computing environments from code repositories. Once the integration is completed, users will be able to launch Jupyter Notebooks to perform data analysis and vsualization without leaving the depositar by clicking the BinderHub buttons on the datasets. In this presentation, we will first make a brief introduction to the depositar and BinderHub along with their relationship, then we will share our experiences in incorporating interactive computation in a data repository. We shall also evaluate the possibility of integrating the depositar with other automation frameworks (e.g. the Snakemake workflow management system) in order to enable users to reproduce data analysis.

Keywords

BinderHub, CKAN, Data Repositories, Interactive Computing, Reproducible Research
d
Reporting behavior from WHO COVID-19 public data
search.dataone.org
data.niaid.nih.gov
+1more
Updated Jul 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Auss Abbood (2025). Reporting behavior from WHO COVID-19 public data [Dataset]. http://doi.org/10.5061/dryad.9s4mw6mmb
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.9s4mw6mmb
Dataset updated
Jul 14, 2025
Dataset provided by
Dryad Digital Repository
Authors
Auss Abbood
Time period covered
Dec 16, 2022
Description
Objective Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, is heterogeneous and metrics to evaluate its quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors. Methods In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics. Results Our final analysis included 222 countries and regions...., Data collection COVID-19 data was downloaded from WHO. Using a public repository, we have added the countries' full names to the WHO data set using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data covers January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper. Data processing We processed data using a Jupyter Notebook with a Python kernel and publically available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml)., Any text editor including Microsoft Excel and their free alternatives can open the uploaded CSV file. Any web browser and some code editors (like the freely available Visual Studio Code) can show the uploaded Jupyter Notebook if the required Python environment is set up correctly.
Amazon Web Scrapping Dataset
kaggle.com
zip
Updated Jun 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mohammad Hurairah (2023). Amazon Web Scrapping Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/amazon-web-scrapper-dataset
Explore at:
zip(2220 bytes)Available download formats
Dataset updated
Jun 17, 2023
Authors
Mohammad Hurairah
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Amazon Scrapping Dataset; 1. Import libraries 2. Connect to the website 3. Import CSV and datetime 4. Import pandas 5. Appending dataset to CSV 6. Automation Dataset updated 7. Timers setup 8. Email notification
H
Jupyter Notebooks to demonstrate WRF-Hydro at Cronton New York
beta.hydroshare.org
hydroshare.org
zip
Updated Apr 17, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katelyn FitzGerald (2019). Jupyter Notebooks to demonstrate WRF-Hydro at Cronton New York [Dataset]. https://beta.hydroshare.org/resource/0dd2b44ad47e428c83187ad0cef8cc08/
Explore at:
zip(1.7 MB)Available download formats
Dataset updated
Apr 17, 2019
Dataset provided by
HydroShare
Authors
Katelyn FitzGerald
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are Jupyter Notebooks for the WRF-Hydro training. You can follow this procedure 1. Download "Download_WRF_Hydro_data_from_HydroShare_Resources.ipynb" on your local computer. - Move into “Notebook_for_CyberGIS” folder and download Jupyter Notebooks - Notebook name: Download_WRF_Hydro_data_from_HydroShare_Resources.ipynb 2. Start CyberGIS WebApp(Discover tab - search "CyberGIS HPC") and upload previous Jupyter Notebook - Create “wrfhydro” directory in you personal directory in CyberGIS and upload previous Jupyer Notebook into “wrfhydro” directory

mkdir wrfhydro 3. Open and run Jupyter Notebook - Download WRF-Hydro Jupyter Notebooks from HydroShare (https://www.hydroshare.org/resource/0dd2b44ad47e428c83187ad0cef8cc08/) - Download WRF-Hydro Test Case at Cronton New York (https://www.hydroshare.org/resource/0ef1e94ac2794ea587c1cb9006399626/) - Download WRF-Hydro v5.0.3 Singularity from HydroShare (https://www.hydroshare.org/resource/81bffca13aa34594aa49e6b79d1026b7/) - Create Kernel for WRF-Hydro to use WRF-Hydro v5.0.3 Singularity container mkdir /data/hsjupyter/a/davidchoi76/.local/share/jupyter/kernels/wrfhydro/ cp ~/wrfhydro/kernel.json /data/hsjupyter/a/davidchoi76/.local/share/jupyter/kernels/wrfhydro/ 4. Open and run each Jupyter Notebooks - Lesson 1- Getting started, Lesson 2- Running WRF-Hydro, Lesson 3- Working with WRF-Hydro inputs and outputs - Lesson 4- Run-time options for Gridded configuration, Lesson 5- Exploring other configurations, Lesson 6- Bringing it All Together
Social Media Customer Analysis
kaggle.com
zip
Updated Apr 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nafe Muhtasim (2021). Social Media Customer Analysis [Dataset]. https://www.kaggle.com/nafemuhtasim/social-media-customer-analysis
Explore at:
zip(108529 bytes)Available download formats
Dataset updated
Apr 16, 2021
Authors
Nafe Muhtasim
Description
This is the data of a social media platform of an organization. You have been hired by the organization & given their social media data to analyze, visualize and prepare a report on it.

You are required to prepare a neat notebook on it using Jupyter Notebook/Jupyter Lab or Google Colab. Then, zip everything including the notebook file (.ipynb file) and the dataset. Finally, upload through the google forms link stated below. The notebook should be neat, containing codes with details regarding your code, visualizations, and description of your purpose of doing each task.

You are suggested but not limited to go through the general steps like -> Data Cleaning, Data preparation, Exploratory Data Analysis(EDA), Correlations finding, Feature extraction, and more. (There is no limit to your skills and ideas)

After doing what needs to be done, you are to give your organization insights and facts. For example, are they reaching more audiences on weekends? Is posting content on the weekdays turn out to be more effective? Is posting many contents on the same day make more sense? Or, should they post content regularly and keep day-to-day consistency? Did you find any trend patterns in the data? What are your advice after completing the analysis? Mention them clearly at the end of the Notebook. (These are just a few examples, your findings may be entirely different and that is totally acceptable. )

Note that, we will value clear documentation which states clear insights from analysis of data & visualizations, more than anything else. It will not matter how complex methods are you applying if it eventually does not find anything useful.
R
Farm Harmful Animals 2 Dataset
universe.roboflow.com
zip
Updated Nov 11, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SeniorProject (2024). Farm Harmful Animals 2 Dataset [Dataset]. https://universe.roboflow.com/seniorproject-nz8ra/farm-harmful-animals-dataset-2
Explore at:
zipAvailable download formats
Dataset updated
Nov 11, 2024
Dataset authored and provided by
SeniorProject
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Wild Boars WeRt Bounding Boxes
Description
https://www.kaggle.com/datasets/muzammilaliveltech/farm-harmful-animals-dataset

this dataset is not mine, it was uploaded to Kaggle by MUZAMMIL ALI VELTECH under CC0: Public Domain. This Roboflow project was made as an attempt to use the dataset after having issue trying to import in Jupyter Notebook from Kaggle
Virtual Machine with Case Study workflow in Jupyter Notebook
figshare.com
tar
Updated Nov 6, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samuel Lampa; Jonathan Alvarsson; Ola Spjuth (2016). Virtual Machine with Case Study workflow in Jupyter Notebook [Dataset]. http://doi.org/10.6084/m9.figshare.4038048.v4
Explore at:
tarAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.4038048.v4
Dataset updated
Nov 6, 2016
Dataset provided by
Figsharehttp://figshare.com/
Authors
Samuel Lampa; Jonathan Alvarsson; Ola Spjuth
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
A virtual machine with the case study workflow for SciLuigi, runnable from within a Jupyter Notebook. Usage: 1. Import the .ova image into a Virtual Machine software such as Virtual box. 2. Start the virtual machine. 3. Log in with ubuntu and changethis...4. Open a terminal and execute the passwd command, to immediately set a new password.5. Click the "Open Jupyter Notebook" icon on the desktop.6. Inside Jupyter, click: Cell > Run all cells7. The workflow will now start.
Data from: Open Science Support at the Swiss Federal Research Institute WSL....
envidat.ch
observatorio-cientifico.ua.es
not available, pdf
Updated May 27, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ionut Iosifescu Enescu; Gian-Kasper Plattner; Dominik Haas-Artho; Thomas Kramer; Matthias Häni; Lucia Espona Pernas; Konrad Steffen (2025). Open Science Support at the Swiss Federal Research Institute WSL. The EnviDat Concept [Dataset]. http://doi.org/10.16904/envidat.160
Explore at:
pdf, not availableAvailable download formats
Unique identifier
https://doi.org/10.16904/envidat.160
Dataset updated
May 27, 2025
Dataset provided by
Swiss Federal Institute for Forest, Snow and Landscape Research
Authors
Ionut Iosifescu Enescu; Gian-Kasper Plattner; Dominik Haas-Artho; Thomas Kramer; Matthias Häni; Lucia Espona Pernas; Konrad Steffen
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
WSL Birmensdorf
Dataset funded by
WSL
Description
This poster was originally created for the swissuniversities Open Science Action Plan: Kick-Off Forum, and showed to the audience on 17.10.2019. It illustrates how the environmental data portal EnviDat provides the tools for fostering Open Science and Reproducibility of scientific research at WSL. Supporting open science is a highly relevant user requirement for EnviDat and for implementing FAIR (Findability, Accessibility, Interoperability and Reusability) principles at dataset level. EnviDat encourages WSL scientists to complement data publication with a complete description of research methods and the inclusion of the open source software, code or scripts used for processing the dataset or for obtaining the published results. By openly publishing open software (e.g. as Jupyter notebooks) alongside research data sets, researchers can contribute to mitigate reproducibility issues. EnviDat also promotes and supports, where possible and practical, the publication of software as Jupyter notebooks. Jupyter notebooks provide a solution for improved documentation and interactive execution of open code in a wide range of programming languages (Python, R, Octave/Matlab, Java or Scala). These programming languages are widely used in environmental research at WSL and well supported by the Jupyter-compatible kernels. We have sucessfully interfaced EnviDat-hosted notebooks with the WSL High-Performance Computing (HPC) Linux Cluster through a JupyterHub/JuypterLab beta installation on the HPC cluster implemented in close collaboration with the WSL IT-Services. For existing software that cannot be easily migrated to Jupyter Notebooks, the Open Science and Reproducibility is assisted by containerisation. We have proven that several Singularity containers can successfully run on WSL's HPC cluster. Finally, the researchers can upload the data/results complemented by code (e.g. as Jupyter Notebooks, or Singularity containers) and any additional documentation in EnviDat. Consequently, they will receive a DOI for the entire dataset, which they can reference in their science paper in order to publish a more reproducible research. License: This poster is released by WSL and the EnviDat team to the public domain under a Creative Commons 4.0 CC0 "No Rights Reserved" international license. You can reuse this poster in any way you want, for any purposes and without restrictions.
Dataset for "Beyond Self-Promotion: How Software Engineering Research Is...
zenodo.org
csv, zip
Updated Jan 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marvin Wyrich; Marvin Wyrich; Justus Bogner; Justus Bogner (2024). Dataset for "Beyond Self-Promotion: How Software Engineering Research Is Discussed on LinkedIn" [Dataset]. http://doi.org/10.5281/zenodo.10453832
Explore at:
csv, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10453832
Dataset updated
Jan 3, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marvin Wyrich; Marvin Wyrich; Justus Bogner; Justus Bogner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the artifacts of our study on how software engineering research papers are shared and interacted with on LinkedIn, a professional social network. This includes:

included-papers.csv: the list of the 79 ICSE and FSE papers we found on LinkedIn

linkedin-post-data.csv: the final data of the 98 LinkedIn posts we collected and synthesized

linkedin-post-scraping.zip: the scripts used to automatically collect several attributes of the LinkedIn posts

analysis.zip: the Jupyter notebook used to analyze and visualize linkedin-post-data.csv
Sample Park Analysis
figshare.com
zip
Updated Nov 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.30509021.v1
Dataset updated
Nov 2, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Eric Delmelle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
README – Sample Park Analysis## OverviewThis repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.## Contents- sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)- Hope_Park_original.csv — Source dataset containing park information- README.md — Documentation for the contents and usage## Usage1. Open the notebook in Google Colab or Jupyter.2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).3. Run each cell sequentially to reproduce the analysis.## RequirementsThe notebook uses standard Python data science libraries:```pythonpandasnumpymatplotlibseaborn
Z
Exploratory Topic Modelling in Python Dataset - EHRI-3
data.niaid.nih.gov
Updated Jun 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dermentzi, Maria (2022). Exploratory Topic Modelling in Python Dataset - EHRI-3 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6670103
Explore at:
Dataset updated
Jun 20, 2022
Dataset provided by
King's College London
Authors
Dermentzi, Maria
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.

Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.

The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.

In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:

"unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use) "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)

Instructions on their intended use can be found in the accompanying Jupyter Notebook.

Credits:

The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).
Data, Pre- and Post- processing scripts for shallow-water photogrammetry...
zenodo.org
bin, tar, zip
Updated Nov 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elisa Casella; Elisa Casella; Alessio Rovere; Alessio Rovere (2024). Data, Pre- and Post- processing scripts for shallow-water photogrammetry applications [Dataset]. http://doi.org/10.5281/zenodo.14014224
Explore at:
tar, bin, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14014224
Dataset updated
Nov 2, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Elisa Casella; Elisa Casella; Alessio Rovere; Alessio Rovere
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Feb 11, 2024
Description
Description

This repository contains the data and scripts to reproduce the results of Casella et al. (Remote Sensing, 2024). The scripts are in python, two of them are wrapped in Jupyter Notebooks with explanatory notes.

Please read the paper for further information on the platform used to collect the data shared in this repository.

Folder structure

The main folder contains two subfolders:

Data: this folder includes all the original data, and the results of the preprocessing and post-processing notebooks. Note that the original photos, included in the "Camera/all_photos" folder must be unzipped before running the preprocessing Jupyter Notebooks.

Precision_Analysis: this folder includes the digital bathymetric models that were co-registered as described in the paper. The co-registration was done offline with Quantum GIS. In this folder are also stored the results of the "Compare_DBMs.py" script, that makes the precision analysis (differences between co-registered DBMs).

Installation

Refer to the README.md file for a quick installation guide using Anaconda.

Credits

The code included in this work has been improved with the assistance of ChatGPT, which provided guidance on optimization, debugging, and documentation to enhance clarity and functionality. All the code has been reviewed and supervised by humans to ensure consistency and correctness.
Python code for jupyter notebook
figshare.com
json
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Narongrit Kasemsap (2025). Python code for jupyter notebook [Dataset]. http://doi.org/10.6084/m9.figshare.28883624.v3
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28883624.v3
Dataset updated
May 7, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Narongrit Kasemsap
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Computer vision for detect bradykinesia on finger tapping task using mediapipe, use finger tapping taskStep by Step1.MediaPipe Hand detection and feature extraction2.Load data for MLtraining (excel file)3.Prepare data for ML4.Train ML with cross-validation5.Plot Multiple ROC curve6.Plot SHAP summary plot7.Load model (file svc1.pkl) to detect bradykinesia
h
iris
huggingface.co
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Energy Research Scientific Computing Center (2025). iris [Dataset]. https://huggingface.co/datasets/NERSC/iris
Explore at:
Dataset updated
Jul 1, 2025
Dataset authored and provided by
National Energy Research Scientific Computing Center
Description
Iris

The following code can be used to load the dataset from its stored location at NERSC. You may also access this code via a NERSC-hosted Jupyter notebook here.

Iris data loader

import pandas as pd iris_dat = pd.read_csv('/global/cfs/cdirs/dasrepo/www/ai_ready_datasets/iris/data/iris.csv')

If you would like to download the data, visit the following link: https://portal.nersc.gov/cfs/dasrepo/ai_ready_datasets/iris/data
Z
Cookiescanner Dataset
data.niaid.nih.gov
Updated Sep 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gundelach, Ralf; Herrmann, Dominik (2023). Cookiescanner Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7884467
Explore at:
Dataset updated
Sep 14, 2023
Dataset provided by
University of Bamberg
Authors
Gundelach, Ralf; Herrmann, Dominik
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This upload contains the dataset from the paper "Cookiescanner: An Automated Tool for Detecting and Evaluating GDPR Consent Notices on Websites" presented at ARES 2023.

Folder Structure - 01_bert_classifier: The final BERT model as well as the datasets and Jupyter notebook to train/evaluate it. Additionally, the folder also contains a small flask application and a Docker file to deploy it as a web service. - 02_raw_dataset: The 1.000 sampled scans. The results.json contains the raw scan data, while the rest of the subfolders contain the screenshots of the detection methods. - 03_banner_detection: Contains the analysis CSV file, as well as a folder with the banner screenshots. - 04_dark_patterns: Analysis files as well as the screenshots of banners with the dark patterns from the paper.

Source Code For the source code of the scanner, please refer to https://github.com/UBA-PSI/cookiescanner.
Flipkart Laptops Data
kaggle.com
zip
Updated Jun 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdul Hannan Ansari (2022). Flipkart Laptops Data [Dataset]. https://www.kaggle.com/datasets/ansariabdulhannan/flipkart-laptops-data
Explore at:
zip(18601 bytes)Available download formats
Dataset updated
Jun 6, 2022
Authors
Abdul Hannan Ansari
Description
These dataset contains 40 Laptops which has scrapped from the Flipkart website using python code. I want you to clean this dataset analyze it like a Data Analyst do. I know this dataset is too short I wish to make it messier and bigger later on as the need increases. If you have analyzed the data the make sure to upload the jupyter notebook of it. All the Best!

ABOUT UPDATED DATA: I have updated the data by adding more stuffs to make it interesting and messier. You can clean the dataset and do more stuffs over this dataset. Note that the prices are in INR Rupees
Legality Without Justice: Symbolic Governance, Institutional Denial, and the...
zenodo.org
bin, csv
Updated Nov 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Scott Brown; Scott Brown (2025). Legality Without Justice: Symbolic Governance, Institutional Denial, and the Ethical Foundations of Law [Dataset]. http://doi.org/10.5281/zenodo.16361108
Explore at:
csv, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.16361108
Dataset updated
Nov 6, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Scott Brown; Scott Brown
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Description:
This dataset accompanies the empirical analysis in Legality Without Justice, a study examining the relationship between public trust in institutions and perceived governance legitimacy using data from the World Values Survey Wave 7 (2017–2022). It includes:

WVS_Cross-National_Wave_7_csv_v6_0.csv — World Values Survey Wave 7 core data.

GDP.csv — World Bank GDP per capita (current US$) for 2022 by country.

denial.ipynb — Fully documented Jupyter notebook with code for data merging, exploratory statistics, and ordinal logistic regression using OrderedModel. Includes GDP as a control for institutional trust and perceived governance.

All data processing and analysis were conducted in Python using FAIR reproducibility principles and can be replicated or extended on Google Colab.

DOI: 10.5281/zenodo.16361108
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Authors: Anon Annotator
Publication date: 2025-07-23
Language: English
Version: 1.0.0
Publisher: Zenodo
Programming language: Python

🔽 How to Download and Run on Google Colab

Step 1: Open Google Colab

Go to https://colab.research.google.com

Step 2: Upload Files

Click File > Upload notebook, and upload the denial.ipynb file.
Also upload the CSVs (WVS_Cross-National_Wave_7_csv_v6_0.csv and GDP.csv) using the file browser on the left sidebar.

Step 3: Adjust File Paths (if needed)

In denial.ipynb, ensure file paths match:

python

CopiarEditar

wvs = pd.read_csv('/content/WVS_Cross-National_Wave_7_csv_v6_0.csv') gdp = pd.read_csv('/content/GDP.csv')

Step 4: Run the Code

Execute the notebook cells from top to bottom. You may need to install required libraries:

python

CopiarEditar

!pip install statsmodels pandas numpy

The notebook performs:

Data cleaning

Merging WVS and GDP datasets

Summary statistics

Ordered logistic regression to test if confidence in courts/police (Q57, Q58) predicts belief that the country is governed in the interest of the people (Q183), controlling for GDP.

Facebook

Twitter

Click to copy link

Link copied

Cite

João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

bz2Available download formats

Unique identifier

https://doi.org/10.5281/zenodo.2592524

Dataset updated

Mar 15, 2021

Dataset provided by

Zenodohttp://zenodo.org/

Authors

João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

dump.tar.bz2
jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7
conda activate analyses

Go to the analyses folder and install all the dependencies of the requirements.txt

cd jupyter_reproducibility/analyses
pip install -r requirements.txt

For reproducing the analyses, run jupyter on this folder:

jupyter notebook

Execute the notebooks on this order:

Index.ipynb
N0.Repository.ipynb
N1.Skip.Notebook.ipynb
N2.Notebook.ipynb
N3.Cell.ipynb
N4.Features.ipynb
N5.Modules.ipynb
N6.AST.ipynb
N7.Name.ipynb
N8.Execution.ipynb
N9.Cell.Execution.Order.ipynb
N10.Markdown.ipynb
N11.Repository.With.Notebook.Restriction.ipynb
N12.To.Paper.ipynb

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
export JUP_WITH_EXECUTION="1"; # run execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction


# Frequenci of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";

export JUP_DB_IP="localhost"; # postgres database IP

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.7

<code

Clear search

Close search

Google apps

Main menu

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

Melb 14784 lines Avg download speed 69.4M Tests 0.39M

SHG 31207 lines Avg 233.7M Tests 0.56M

ALC 113 lines Avg 51.5M Test 1092

BKK 29684 lines Avg 215.9M Tests 1.2M

LAX 15505 lines Avg 218.5M Tests 0.74M

Python Code for Visualizing COVID-19 data

Toward a Reproducible Research Data Repository

Title

Author(s)

Affiliation of presenter

Summary of Abstract

Keywords

Reporting behavior from WHO COVID-19 public data

Amazon Web Scrapping Dataset

Jupyter Notebooks to demonstrate WRF-Hydro at Cronton New York

Social Media Customer Analysis

Farm Harmful Animals 2 Dataset

Virtual Machine with Case Study workflow in Jupyter Notebook

Data from: Open Science Support at the Swiss Federal Research Institute WSL....

Dataset for "Beyond Self-Promotion: How Software Engineering Research Is...

Sample Park Analysis

Exploratory Topic Modelling in Python Dataset - EHRI-3

Data, Pre- and Post- processing scripts for shallow-water photogrammetry...

Description

Folder structure

Installation

Credits

Python code for jupyter notebook

iris

Iris data loader

Cookiescanner Dataset

Flipkart Laptops Data

Legality Without Justice: Symbolic Governance, Institutional Denial, and the...

🔽 How to Download and Run on Google Colab

Step 1: Open Google Colab

Step 2: Upload Files

Step 3: Adjust File Paths (if needed)

Step 4: Run the Code

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter NotebooksSee More Versions

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks