100+ datasets found
  1. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Available download formats: bz2
    Dataset updated: Mar 15, 2021
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: João Felipe; Leonardo; Vanessa; Juliana
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses, using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    This populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
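
    As a quick sanity check before launching the notebooks, the connection string can be inspected with Python's standard library. This is a minimal sketch and not part of the original scripts: the helper name describe_connection is hypothetical, and it only parses the URL without contacting the database.

```python
import os
from urllib.parse import urlsplit


def describe_connection(url):
    """Split a database URL into its parts without opening a connection."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,  # None when the URL omits the port
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Fall back to the placeholder value from the text when the variable is unset.
    url = os.environ.get(
        "JUP_DB_CONNECTION", "postgresql://user:password@hostname/jupyter"
    )
    # The password is deliberately left out of the printed summary.
    print(describe_connection(url))
```

    If the printed database name or host is not what you expect, the export above is probably missing or misspelled in the current shell.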

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb
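
    The numeric prefixes matter when listing the folder, because a plain alphabetical sort places N10 before N2. The following is a small sketch of a sort key that reproduces the order above. It assumes only the file-name pattern shown in the list; the function name execution_key is hypothetical and the sample list is a subset of the notebooks.

```python
import re


def execution_key(filename):
    """Sort key: Index.ipynb first, then N<number>.* by numeric value."""
    match = re.match(r"N(\d+)\.", filename)
    if match is None:
        # Files without an N<number> prefix (here only Index.ipynb) run first.
        return (0, 0, filename)
    return (1, int(match.group(1)), filename)


# A shuffled subset of the notebook names listed above.
notebooks = [
    "N10.Markdown.ipynb",
    "N2.Notebook.ipynb",
    "Index.ipynb",
    "N0.Repository.ipynb",
    "N12.To.Paper.ipynb",
    "N1.Skip.Notebook.ipynb",
]

for name in sorted(notebooks, key=execution_key):
    print(name)
```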

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    GitHub account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP
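
    Because the collection runs for a long time, it can help to confirm that the variables are visible before starting. The sketch below is an illustration only: the choice of which names to treat as required is an assumption made here (the list skips variables that the text says to leave blank), and missing_variables is a hypothetical helper, not part of the archaeology package.

```python
import os

# Names copied from the export list above. Treating exactly this subset as
# required is an assumption of this sketch.
REQUIRED_VARIABLES = [
    "JUP_MACHINE",
    "JUP_BASE_DIR",
    "JUP_LOGS_DIR",
    "JUP_COMPRESSION",
    "JUP_DB_CONNECTION",
    "JUP_GITHUB_USERNAME",
    "JUP_GITHUB_PASSWORD",
    "JUP_EXECUTION_DIR",
    "JUP_ANACONDA_PATH",
]


def missing_variables(environ, required=REQUIRED_VARIABLES):
    """Return the required names that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```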

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install five conda environments and five Anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI, so make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda activate py35
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
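
    The environments follow a regular naming scheme: raw<version> for plain conda and py<version> for Anaconda, each followed by the same three pip steps. The sketch below generates those shared commands so the pattern is easier to see. It is an illustration, not a script from the archive: setup_commands is a hypothetical helper, and it does not cover the version-specific extras described for 3.4, Anaconda 3.5, and Anaconda 3.6.

```python
# Steps that every environment above has in common.
COMMON_STEPS = [
    "pip install --upgrade pip",
    "pip install pipenv",
    "pip install -e jupyter_reproducibility/archaeology",
]


def setup_commands(version, anaconda):
    """Return the shared commands for one environment, e.g. raw27 or py27."""
    prefix = "py" if anaconda else "raw"
    name = prefix + version.replace(".", "")
    create = "conda create -n {} python={}".format(name, version)
    if anaconda:
        create += " anaconda"
    return [create + " -y", "conda activate " + name] + COMMON_STEPS


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for version in ["2.7", "3.5", "3.6"]:
        for anaconda in (False, True):
            print("\n".join(setup_commands(version, anaconda)))
            print()
```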

    Conda 3.7


  2. Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    Anonymous (2021). Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2546834
    Available download formats: bz2
    Dataset updated: Mar 15, 2021
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Anonymous
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N11.To.Paper.ipynb moves data to it.

    In the remainder of this text, we give instructions for reproducing the analyses, using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.1
    Python 3.6.8
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-01-13.dump

    This populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.6:

    conda create -n py36 python=3.6
    conda activate py36

    Go to the analyses folder and install the dependencies listed in requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • N0.Index.ipynb
    • N1.Repository.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    GitHub account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install five conda environments and five Anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI, so make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda activate py35
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.7

    When we

  3. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
    Available download formats: txt
    Dataset updated: May 30, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Richard Ferrers; Speedtest Global Index
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compares fixed-line broadband internet speeds across four international cities plus Alice Springs:
    • Melbourne, AU
    • Bangkok, TH
    • Shanghai, CN
    • Los Angeles, US
    • Alice Springs, AU

    ERRATA:
    1. Data is for Q3 2020, but some files are labelled incorrectly as 02-20 or June 20. They should all read Sept 20, or 09-20 as Q3 20, rather than Q2. Will rename and reload. Amended in v7.
    2. LAX file named 0320, when it should be Q320. Amended in v8.

    *Lines of data for each geojson file; a line equates to a 600m^2 location, inc total tests, devices used, and average upload and download speed:
    • MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    • SHG 31745 lines => 0.65M speedtests (2.5/100pp)
    • BKK 29296 lines => 1.5M speedtests (14.3/100pp)
    • LAX 15899 lines => 1.3M speedtests (10.4/100pp)
    • ALC 76 lines => 500 speedtests (2/100pp)

    Geojsons of these 2* by 2* extracts for MEL, BKK, SHG now added, and LAX added v6. Alice Springs added v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

    ** To Do: Will add Google Map versions so everyone can see the data without installing Jupyter.
    - Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast). Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
    - Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plotting the Top 20% on a map for the community. Google Map link now added (and tweet).

    ** Python
    melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
    shg = tiles.cx[120:122, 30:32]  # Lat/Lon extract
    bkk = tiles.cx[100:102, 13:15]  # Lat/Lon extract
    lax = tiles.cx[-118:-120, 33:35]  # Lat/Lon extract
    ALC = tiles.cx[132:134, -22:-24]  # Lat/Lon extract
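
    Each extract is a bounding-box selection with the longitude range first and the latitude range second. To make the selection rule concrete without requiring GeoPandas, here is a simplified stand-in in plain Python. It is not the dataset's code: it tests points rather than tile polygons, the in_box helper is hypothetical, and the three sample coordinates are approximate values chosen for illustration.

```python
def in_box(lon, lat, lon_range, lat_range):
    """True when (lon, lat) lies inside the box; bounds may be in either order."""
    lon_min, lon_max = sorted(lon_range)
    lat_min, lat_max = sorted(lat_range)
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Hypothetical tile centroids as (name, longitude, latitude).
tiles = [
    ("melbourne_cbd", 144.96, -37.81),
    ("sydney_cbd", 151.21, -33.87),
    ("los_angeles", -118.24, 34.05),
]

# Same numeric ranges as the extracts above, including the reversed LAX bounds.
melb = [t for t in tiles if in_box(t[1], t[2], (144, 146), (-39, -37))]
lax = [t for t in tiles if in_box(t[1], t[2], (-118, -120), (33, 35))]
print(melb)
print(lax)
```

    Sorting each range before comparing means a reversed pair such as -118:-120 still describes the same box.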

    Histograms (v9), and data visualisations (v3,5,9,11) will be provided. Data Sourced from - This is an extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).

    ** VERSIONS
    v24. Add tweet and Google Map of Top 20% (over 100Mbps locations) in Mel Q322. Add v.1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 Broadband distribution, and compare 2020 - 2022. Updated v1.4 Jupyter notebook.
    v22. Add Import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)

    Melb 14784 lines Avg download speed 69.4M Tests 0.39M

    SHG 31207 lines Avg 233.7M Tests 0.56M

    ALC 113 lines Avg 51.5M Test 1092

    BKK 29684 lines Avg 215.9M Tests 1.2M

    LAX 15505 lines Avg 218.5M Tests 0.74M

    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting Four Cities on a graph.
    v10. Add Four Histograms in one pic.
    v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big key version. Prev key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

    ** LICENCE AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be: - non-commercial (NC) - reuse must be share-alike (SA)(add same licence). This restricts the standard CC-BY Figshare licence.

    ** Other uses of Speedtest Open Data; - see link at Speedtest below.

  4. Analysis of Air Temperature using CUAHSI HIS Web Services

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Liza Brazil (2021). Analysis of Air Temperature using CUAHSI HIS Web Services [Dataset]. https://search.dataone.org/view/sha256%3Af0e49064a8c110ddfd3c3169685aa1e08fdb113b15f8e7f50c5ab62dcdadc3f4
    Dataset updated: Dec 5, 2021
    Dataset provided by: Hydroshare
    Authors: Liza Brazil
    Description

    This resource contains a Jupyter notebook that demonstrates how the CUAHSI JupyterHub platform can be used to perform basic hydrologic data analysis. Temperature data are collected via the CUAHSI Hydrologic Information System (HIS) using web services. These data are interrogated, organized using Python classes, and plotted in various ways to demonstrate common data analysis steps. To get started, click the Open with dropdown on the top right of the resource and select CUAHSI JupyterHub. To use CUAHSI JupyterHub, you will need a HydroShare account.

  5. Python Code for Visualizing COVID-19 data

    • borealisdata.ca
    • datasetcatalog.nlm.nih.gov
    Updated Dec 16, 2023
    Cite
    Ryan Chartier; Geoffrey Rockwell (2023). Python Code for Visualizing COVID-19 data [Dataset]. http://doi.org/10.5683/SP3/PYEQL0
    Dataset updated: Dec 16, 2023
    Dataset provided by: Borealis
    Authors: Ryan Chartier; Geoffrey Rockwell
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The purpose of this code is to produce a line graph visualization of COVID-19 data. This Jupyter notebook was built and run on Google Colab. This code will serve mostly as a guide and will need to be adapted where necessary to be run locally. The separate COVID-19 datasets uploaded to this Dataverse can be used with this code. This upload is made up of the IPYNB and PDF files of the code.

  6. Dataset of a Study of Computational reproducibility of Jupyter notebooks...

    • zenodo.org
    pdf, zip
    Updated Jul 11, 2024
    Cite
    Sheeba Samuel; Daniel Mietchen (2024). Dataset of a Study of Computational reproducibility of Jupyter notebooks from biomedical publications [Dataset]. http://doi.org/10.5281/zenodo.8226725
    Available download formats: zip, pdf
    Dataset updated: Jul 11, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Sheeba Samuel; Daniel Mietchen
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This repository contains the dataset for the study of computational reproducibility of Jupyter notebooks from biomedical publications. We analyzed the reproducibility of Jupyter notebooks from GitHub repositories associated with publications indexed in the biomedical literature repository PubMed Central. The dataset includes the metadata of the journals, the publications, the GitHub repositories mentioned in the publications, and the notebooks present in those repositories.

    Data Collection and Analysis

    We used the code for reproducibility of Jupyter notebooks from the study by Pimentel et al., 2019 and adapted the code from ReproduceMeGit. We provide code for collecting the publication metadata from PubMed Central using NCBI Entrez utilities via Biopython.

    Our approach involves searching PMC using the esearch function for Jupyter notebooks with the query "(ipynb OR jupyter OR ipython) AND github". We retrieve the data in XML format, capturing essential details about journals and articles. By scanning the entire article, including the abstract, body, data availability statement, and supplementary materials, we extract GitHub links. Additionally, we mine repositories for key information such as dependency declarations found in files like requirements.txt, setup.py, and Pipfile. Using the GitHub API, we enrich our data with repository creation dates, update histories, pushes, and programming languages.
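
    To make the two steps above more concrete, here is a minimal sketch using only Python's standard library. It is not the study's code, which uses Biopython: esearch_url only builds the request URL for the public E-utilities esearch endpoint without sending it, and the regular expression in github_links is a simplified assumption about what a repository link looks like. Both function names and the retmax default are hypothetical.

```python
import re
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
QUERY = "(ipynb OR jupyter OR ipython) AND github"

# Simplified pattern: github.com/<owner>/<repository>
GITHUB_LINK = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def esearch_url(term, retmax=100):
    """Build an esearch request URL for the PMC database (no request is sent)."""
    return ESEARCH_URL + "?" + urlencode({"db": "pmc", "term": term, "retmax": retmax})


def github_links(text):
    """Return the distinct GitHub repository links in `text`, in order of appearance."""
    seen = []
    for link in GITHUB_LINK.findall(text):
        link = link.rstrip(".")  # drop a sentence-ending period
        if link not in seen:
            seen.append(link)
    return seen


if __name__ == "__main__":
    print(esearch_url(QUERY))
    sample = (
        "Code is available at "
        "https://github.com/fusion-jena/computational-reproducibility-pmc."
    )
    print(github_links(sample))
```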

    All the extracted information is stored in a SQLite database. After collecting and creating the database tables, we ran a pipeline to collect the Jupyter notebooks contained in the GitHub repositories based on the code from Pimentel et al., 2019.

    Our reproducibility pipeline was started on 27 March 2023.

    Repository Structure

    Our repository is organized into two main folders, plus a workflow overview file:

    • archaeology: This directory hosts scripts designed to download, parse, and extract metadata from PubMed Central publications and associated repositories. There are 24 database tables created which store the information on articles, journals, authors, repositories, notebooks, cells, modules, executions, etc. in the db.sqlite database file.
    • analyses: Here, you will find the notebooks used for the in-depth analysis of the data in our study. The db.sqlite file generated by running the scripts in the archaeology folder is stored in the analyses folder for further analysis. The path can, however, be configured in the config.py file. There are two sets of notebooks: one set (naming pattern N[0-9]*.ipynb) examines data pertaining to repositories and notebooks, while the other set (PMC[0-9]*.ipynb) analyzes data associated with publications in PubMed Central, i.e., for plots involving data about articles, journals, publication dates, or research fields. The resulting figures from these notebooks are stored in the 'outputs' folder.
    • MethodsWorkflow: The MethodsWorkflow file provides a conceptual overview of the workflow used in this study.
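
    A quick way to get oriented in a SQLite file like db.sqlite is to list its tables. The sketch below uses Python's built-in sqlite3 module. The helper name list_tables is hypothetical, and the demo runs against an in-memory stand-in with two made-up table names rather than the real database.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    # An in-memory stand-in with two made-up tables; for the real file,
    # connect with sqlite3.connect("db.sqlite") instead.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE repositories (id INTEGER PRIMARY KEY)")
    demo.execute("CREATE TABLE notebooks (id INTEGER PRIMARY KEY)")
    print(list_tables(demo))
    demo.close()
```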

    Accessing Data and Resources:

    • All the data generated during the initial study can be accessed at https://doi.org/10.5281/zenodo.6802158
    • For the latest results and re-run data, refer to this link.
    • The comprehensive SQLite database that encapsulates all the study's extracted data is stored in the db.sqlite file.
    • The metadata in XML format extracted from PubMed Central, which contains the information about the articles and journals, can be accessed in the pmc.xml file.

    System Requirements:

    Running the pipeline:

    • Clone the computational-reproducibility-pmc repository using Git:
      git clone https://github.com/fusion-jena/computational-reproducibility-pmc.git
    • Navigate to the computational-reproducibility-pmc directory:
      cd computational-reproducibility-pmc/computational-reproducibility-pmc
    • Configure environment variables in the config.py file:
      GITHUB_USERNAME = os.environ.get("JUP_GITHUB_USERNAME", "add your github username here")
      GITHUB_TOKEN = os.environ.get("JUP_GITHUB_PASSWORD", "add your github token here")
    • Other environment variables can also be set in the config.py file.
      BASE_DIR = Path(os.environ.get("JUP_BASE_DIR", "./")).expanduser() # Add the path of directory where the GitHub repositories will be saved
      DB_CONNECTION = os.environ.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite") # Add the path where the database is stored.
    • To set up conda environments for each Python version, upgrade pip, install pipenv, and install the archaeology package in each environment, execute:
      source conda-setup.sh
    • Change to the archaeology directory
      cd archaeology
    • Activate conda environment. We used py36 to run the pipeline.
      conda activate py36
    • Execute the main pipeline script (r0_main.py):
      python r0_main.py
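
    The config.py lines quoted above read each setting from an environment variable and fall back to a default. The sketch below mirrors that lookup for two of the settings so the precedence is easy to see. The function name load_settings and its environ parameter are hypothetical; only the variable names and defaults come from the excerpt above.

```python
import os
from pathlib import Path


def load_settings(environ=None):
    """Resolve settings as in the config.py excerpt:
    use the environment variable when set, otherwise the default."""
    env = os.environ if environ is None else environ
    return {
        # Directory where the GitHub repositories will be saved.
        "BASE_DIR": Path(env.get("JUP_BASE_DIR", "./")).expanduser(),
        # Connection string for the database.
        "DB_CONNECTION": env.get("JUP_DB_CONNECTION", "sqlite:///db.sqlite"),
    }
```

    Under this pattern, exporting JUP_DB_CONNECTION or JUP_BASE_DIR in the shell before starting the pipeline should take precedence over the defaults without editing config.py.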

    Running the analysis:

    • Navigate to the analyses directory.
      cd analyses
    • Activate conda environment. We use raw38 for the analysis of the metadata collected in the study.
      conda activate raw38
    • Install the required packages using the requirements.txt file.
      pip install -r requirements.txt
    • Launch JupyterLab
      jupyter lab
    • Refer to the Index.ipynb notebook for the execution order and guidance.

    References:

  7. Amazon Web Scrapping Dataset

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Mohammad Hurairah (2023). Amazon Web Scrapping Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadhurairah/amazon-web-scrapper-dataset
    Explore at:
    Available download formats: zip (2220 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Mohammad Hurairah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon Scrapping Dataset:

    1. Import libraries
    2. Connect to the website
    3. Import CSV and datetime
    4. Import pandas
    5. Appending dataset to CSV
    6. Automation Dataset updated
    7. Timers setup
    8. Email notification
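
    The "Import CSV and datetime" and "Appending dataset to CSV" steps can be sketched with the standard library alone. This is only an illustration of the append pattern, not the dataset author's code: the function name append_record and the Title, Price and Date columns are assumptions.

```python
import csv
import datetime
from pathlib import Path


def append_record(csv_path, title, price, today=None):
    """Append one (title, price, date) row, writing a header if the file is new."""
    path = Path(csv_path)
    today = today or datetime.date.today()
    is_new = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings.
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            # Hypothetical column names, chosen for illustration only.
            writer.writerow(["Title", "Price", "Date"])
        writer.writerow([title, price, today.isoformat()])
```

    Calling the function once per scheduled run grows the file by one dated row each time, which is the usual basis for tracking a value over time.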

  8. Python code for jupyter notebook

    • figshare.com
    json
    Updated May 7, 2025
    Cite
    Narongrit Kasemsap (2025). Python code for jupyter notebook [Dataset]. http://doi.org/10.6084/m9.figshare.28883624.v3
    Explore at:
    Available download formats: json
    Dataset updated
    May 7, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Narongrit Kasemsap
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Computer vision to detect bradykinesia in a finger-tapping task using MediaPipe.

    Step by step:
    1. MediaPipe hand detection and feature extraction
    2. Load data for ML training (Excel file)
    3. Prepare data for ML
    4. Train ML with cross-validation
    5. Plot multiple ROC curves
    6. Plot SHAP summary plot
    7. Load model (file svc1.pkl) to detect bradykinesia

  9. iris

    • huggingface.co
    Updated Jul 1, 2025
    + more versions
    Cite
    National Energy Research Scientific Computing Center (2025). iris [Dataset]. https://huggingface.co/datasets/NERSC/iris
    Explore at:
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    National Energy Research Scientific Computing Center
    Description

    Iris

    The following code can be used to load the dataset from its stored location at NERSC. You may also access this code via a NERSC-hosted Jupyter notebook here.

    Iris data loader

      import pandas as pd
      iris_dat = pd.read_csv('/global/cfs/cdirs/dasrepo/www/ai_ready_datasets/iris/data/iris.csv')

    If you would like to download the data, visit the following link: https://portal.nersc.gov/cfs/dasrepo/ai_ready_datasets/iris/data

  10. Demographic Analysis Workflow using Census API in Jupyter Notebook:...

    • openicpsr.org
    delimited
    Updated Jul 23, 2020
    + more versions
    Cite
    Donghwan Gu; Nathanael Rosenheim (2020). Demographic Analysis Workflow using Census API in Jupyter Notebook: 1990-2000 Population Size and Change [Dataset]. http://doi.org/10.3886/E120381V1
    Explore at:
    Available download formats: delimited
    Dataset updated
    Jul 23, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Donghwan Gu; Nathanael Rosenheim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Boone County, Kentucky, US Counties
    Description

    This archive reproduces a table titled "Table 3.1 Boone county population size, 1990 and 2000" from Wang and vom Hofe (2007, p.58). The archive provides a Jupyter Notebook that uses Python and can be run in Google Colaboratory. The workflow uses the Census API to retrieve data, reproduce the table, and ensure reproducibility for anyone accessing this archive.

    The Python code was developed in Google Colaboratory, or Google Colab for short, which is an Integrated Development Environment (IDE) of JupyterLab and streamlines package installation, code collaboration and management. The Census API is used to obtain population counts from the 1990 and 2000 Decennial Census (Summary File 1, 100% data). All downloaded data are maintained in the notebook's temporary working directory while in use. The data are also stored separately with this archive.

    The notebook features extensive explanations, comments, code snippets, and code output. The notebook can be viewed in a PDF format or downloaded and opened in Google Colab. References to external resources are also provided for the various functional components. The notebook features code to perform the following functions:

    • install/import necessary Python packages
    • introduce a Census API query
    • download Census data via the Census API
    • manipulate Census tabular data
    • calculate absolute change and percent change
    • format numbers
    • export the table to csv

    The notebook can be modified to perform the same operations for any county in the United States by changing the State and County FIPS code parameters for the Census API downloads. The notebook could be adapted for use in other environments (e.g., Jupyter Notebook) as well as reading and writing files to a local or shared drive, or cloud drive (e.g., Google Drive).
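
    The change calculation, number formatting and CSV export listed above can be illustrated with a short standard-library sketch. It is not taken from the archived notebook: the function names, the column headers and the population counts used in the usage note are placeholders, not Census figures.

```python
import csv
import io


def population_change(pop_start, pop_end):
    """Return (absolute change, percent change) between two counts."""
    absolute = pop_end - pop_start
    percent = absolute / pop_start * 100
    return absolute, percent


def format_row(label, pop_start, pop_end):
    """Build one table row with thousands separators and a signed change."""
    absolute, percent = population_change(pop_start, pop_end)
    return [label, f"{pop_start:,}", f"{pop_end:,}", f"{absolute:+,}", f"{percent:.1f}%"]


def table_to_csv(rows):
    """Serialise the rows, with a header line, to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Area", "Start", "End", "Change", "Percent change"])
    writer.writerows(rows)
    return buffer.getvalue()
```

    For example, format_row("Example County", 1000, 1250) uses made-up counts; a real run would substitute the 1990 and 2000 values returned by the Census API.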

  11. Exploratory Topic Modelling in Python Dataset - EHRI-3

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    Cite
    Dermentzi, Maria (2022). Exploratory Topic Modelling in Python Dataset - EHRI-3 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6670103
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    King's College London
    Authors
    Dermentzi, Maria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the EHRI-3 project, we are investigating tools and methods that historical researchers and scholars can use to better understand, visualise, and interpret the material held by our partner archives. This dataset accompanies a tutorial exploring a technique called topic modelling in the context of a Holocaust-related historical collection.

    We were on the lookout for datasets that would be easily accessible and, for convenience, predominantly in English. One such dataset was the United States Holocaust Memorial Museum’s (USHMM) extensive collection of oral history testimonies, for which there are a considerable number of textual transcripts. The museum’s total collection consists of over 80,703 testimonies, 41,695 of which are available in English, with 2,894 of them listing a transcript.

    Since there is not yet a ready-to-download dataset that includes these transcripts, we had to construct our own. Using a web scraping tool, we managed to create a list of the links pointing to the metadata (including transcripts) of the testimonies that were of interest to us. After obtaining the transcript and other metadata of each of these testimonies, we were able to create our dataset and curate it to remove any unwanted entries. For example, we made sure to remove entries with restrictions on access or use. We also removed entries with transcripts that consisted only of some automatically generated headers and entries which turned out to be in languages other than English. The remaining 1,873 transcripts form the corpus of this tutorial — a small, but still decently sized dataset.

    The process that we followed to put together this dataset is detailed in the Jupyter Notebook accompanying this post, which can be found in this Github repository.

    In this Zenodo upload, the user can find two files, each of them containing a pickled pandas DataFrame that was obtained at a different stage of the tutorial:

    • "unrestricted_df.pkl" contains 1,946 entries of Oral Testimony transcripts and has five fields (RG_number, text, display_date, conditions_access, conditions_use)
    • "unrestricted_lemmatized_df.pkl" contains 1,873 entries of Oral Testimony transcripts and has six fields (RG_number, text, display_date, conditions_access, conditions_use, lemmas)

    Instructions on their intended use can be found in the accompanying Jupyter Notebook.

    Credits:

    The transcripts that form the corpus in this tutorial were obtained through the United States Holocaust Memorial Museum (USHMM).

  12. Using ODM2API and SQLAlchemy to load time series data from a file.

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Jun 14, 2017
    Cite
    Miguel Leon (2017). Using ODM2API and SQLAlchemy to load time series data from a file. [Dataset]. https://www.hydroshare.org/resource/a1e35535a8894d6f840e23640ba05143
    Explore at:
    Available download formats: zip (15.4 KB)
    Dataset updated
    Jun 14, 2017
    Dataset provided by
    HydroShare
    Authors
    Miguel Leon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an IPython notebook script that uses ODM2API and SQLAlchemy to load time series data from a file. It can also be viewed on GitHub through nbviewer: https://nbviewer.jupyter.org/github/miguelcleon/ODM2API-ipython-notebooks/blob/master/Using%20ODM2API%20and%20SQLAlchemy%20to%20load%20time%20series%20data%20from%20a%20file.ipynb

  13. Financial ratios 4 Nasdaq 100 membrs + 12m returns

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    SheepBoss (2023). Financial ratios 4 Nasdaq 100 membrs + 12m returns [Dataset]. https://www.kaggle.com/datasets/mlcapital/financial-ratios-4-nasdaq-100-membrs-12m-returns/code
    Explore at:
    Available download formats: zip (212319 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    SheepBoss
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A Python module using Jupyter Notebooks to take an existing dataset available at Kaggle and undertake some data cleansing, data hard coding and data science management so it can be more useful for Machine Learning models. Source of original dataset: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data

    Introduction:

    The problem we are trying to solve is that there are very limited datasets on Kaggle if you wish to apply ML models to the problem of individual stock Share Price prediction using financial statement ratios as your input data. This is a problem that needs addressing as there is a multi-billion global fundamental financial ratio investment analysis industry that is ripe for performance enhancement by Machine Learning. We believe that the best dataset for such a purpose on Kaggle was the dataset linked above. The problems with this dataset for ML model use, and how we addressed them, were as follows:

    • There were a number of data attributes that were not shown across each annual period. We removed data attributes that were not populated across all the annual periods.
    • We filled in data that was missing and we replaced NANs and INFs with logical and reasonable fill values.
    • We attached label data, being 12-month-ahead Share Price returns for each stock and each annual period, providing this data both as discrete percentage returns and as binary labels for outperforming or underperforming the Nasdaq 100 index.

    Resulting Datasets:

    The resulting datasets cover 102 stocks using 39 financial ratios across both 4 and 5 year periods using two different types of labels.

    In summary, this repository provides a Jupyter Notebook that shows the steps undertaken to generate:

    Two datasets for 2017 to 2021 with the Y labels attached at the end column:
    • labels 1 or 0: for binary outperformance against index.
    • perfs labels: for actual performance for the stock for that calendar year.

    And two more datasets for 2017 to 2020 with the same Y label data as above:
    • labels 1 or 0: for binary outperformance against index.
    • perfs labels: for actual performance for the stock for that calendar year.
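
    The two cleaning ideas described for this dataset, replacing NaN and infinite ratios and deriving a binary outperformance label, can be sketched as follows. This is an illustration under assumptions, not the notebook's code: the function names are hypothetical, the description does not state which fill values were used (0.0 is only a placeholder default), and treating a tie with the index as 0 is also an assumption.

```python
import math


def clean_ratio(value, fill=0.0):
    """Replace NaN and +/-inf with a fill value; pass finite numbers through."""
    return value if math.isfinite(value) else fill


def outperform_label(stock_return, index_return):
    """Return 1 if the stock's return beat the index's return, else 0."""
    return 1 if stock_return > index_return else 0
```

    Applied row by row, clean_ratio would be used on each financial ratio and outperform_label on each stock's 12-month return paired with the index return for the same period.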

    Usage & Contributing:

    At the moment the project is in development. You can use the repository and play with the Jupyter Notebook to generate your own datasets with differing assumptions to ours. We will then load up some ML models that we think can be the most effective at predicting 12 month forward Share Price outcomes based on the 39 financial ratios provided. We would welcome your thoughts on our models. Even better, we would welcome YOUR ideas on the best models to use to solve such a prediction problem using these datasets. You can always help to get this problem solved. It's an open-source project after all!

    Resources:

    • Kaggle: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data
    • Jupyter Notebooks: https://jupyter.org/
    • Yfinance: https://pypi.org/project/yfinance/

  14. Sample Park Analysis

    • figshare.com
    zip
    Updated Nov 2, 2025
    Cite
    Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eric Delmelle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README – Sample Park Analysis

    Overview:

    This repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.

    Contents:

    • sample park analysis.ipynb: The main analysis notebook (Colab/Jupyter format)
    • Hope_Park_original.csv: Source dataset containing park information
    • README.md: Documentation for the contents and usage

    Usage:

    1. Open the notebook in Google Colab or Jupyter.
    2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
    3. Run each cell sequentially to reproduce the analysis.

    Requirements:

    The notebook uses standard Python data science libraries:

    • pandas
    • numpy
    • matplotlib
    • seaborn

  15. Virtual Machine with Case Study workflow in Jupyter Notebook

    • figshare.com
    tar
    Updated Nov 6, 2016
    Cite
    Samuel Lampa; Jonathan Alvarsson; Ola Spjuth (2016). Virtual Machine with Case Study workflow in Jupyter Notebook [Dataset]. http://doi.org/10.6084/m9.figshare.4038048.v4
    Explore at:
    Available download formats: tar
    Dataset updated
    Nov 6, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Samuel Lampa; Jonathan Alvarsson; Ola Spjuth
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A virtual machine with the case study workflow for SciLuigi, runnable from within a Jupyter Notebook.

    Usage:
    1. Import the .ova image into virtual machine software such as VirtualBox.
    2. Start the virtual machine.
    3. Log in with ubuntu and changethis...
    4. Open a terminal and execute the passwd command, to immediately set a new password.
    5. Click the "Open Jupyter Notebook" icon on the desktop.
    6. Inside Jupyter, click: Cell > Run all cells
    7. The workflow will now start.

  16. Speedtest Open Data - Australia(NZ) 2020-2025; Q220 - Q325 extract by Qtr

    • figshare.com
    txt
    Updated Oct 24, 2025
    + more versions
    Cite
    Richard Ferrers; Speedtest Global Index (2025). Speedtest Open Data - Australia(NZ) 2020-2025; Q220 - Q325 extract by Qtr [Dataset]. http://doi.org/10.6084/m9.figshare.13370504.v43
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Australia, New Zealand
    Description

    This is an Australian extract of Speedtest Open data available at Amazon WS (link below - opendata.aws).

    AWS data licence is "CC BY-NC-SA 4.0", so use of this data must be:
    • non-commercial (NC)
    • reuse must be share-alike (SA) (add same licence).
    This restricts the standard CC-BY Figshare licence.

    A world speedtest open data file was downloaded (>400Mb, 7M lines of data). An extract of Australia's location (lat, long) revealed 88,000 lines of data (attached as csv). A Jupyter notebook of the extract process is attached.

    See Binder version at GitHub - https://github.com/areff2000/speedtestAU
    +> Install: 173 packages | Downgrade: 1 packages | Total download: 432MB
    Build container time: approx - load time 25secs.
    => Error: Timesout - BUT UNABLE TO LOAD GLOBAL DATA FILE (6.6M lines).
    => Error: Overflows 8GB RAM container provided with global data file (3GB)
    => On local JupyterLab M2 MBP; loads in 6 mins.

    Added Binder from ARDC service: https://binderhub.rc.nectar.org.au
    Docs: https://ardc.edu.au/resource/fair-for-jupyter-notebooks-a-practical-guide/

    A link to Twitter thread of outputs provided.
    A link to Data tutorial provided (GitHub), including Jupyter Notebook to analyse World Speedtest data, selecting one US State.

    Data Shows: (Q220)
    • 3.1M speedtests | 762,000 devices |
    • 88,000 grid locations (600m * 600m), summarised as a point
    • average speed 33.7Mbps (down), 12.4M (up) | Max speed 724Mbps
    • data is for 600m * 600m grids, showing average speed up/down, number of tests, and number of users (IP). Added centroid, and now lat/long.
    See tweet of image of centroids also attached.

    NB: Discrepancy Q2-21, Speedtest Global shows June AU average speedtest at 80Mbps, whereas Q2 mean is 52Mbps (v17; Q1 45Mbps; v14). Dec 20 Speedtest Global has AU at 59Mbps. Could be possible timing difference. Or spatial anonymising masking shaping highest speeds. Else potentially data inconsistent between national average and geospatial detail. Check in upcoming quarters.

    NextSteps:
    Histogram - compare Q220, Q121, Q122. per v1.4.ipynb.

    Versions:
    v43: Added revised NZ vs AUS graph for Q325 (NZ; Q2 25) since had NZ available from Github (link below). Calc using PlayNZ.ipynb notebook. See images in Twitter - https://x.com/ValueMgmt/status/1981607615496122814
    v42: Added AUS Q325 (97.6k lines avg d/l 165.5 Mbps (median d/l 150.8 Mbps) u/l 28.08 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 24.5. Mean devices: 6.02. Download, extract and publish: UNK - not measured mins. Download avg is double Q423. Noting, NBN increased D/L speeds from Sept '25; 100 -> 500, 250 -> 750. For 1Gbps, upload speed only increased from 50Mbps to 100Mbps. New 2Gbps services introduced on FTTP and HFC networks.
    v41: Added AUS Q225 (96k lines avg d/l 130.5 Mbps (median d/l 108.4 Mbps) u/l 22.45 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 17.2. Mean devices: 5.11. Download, extract and publish: 20 mins. Download avg is double Q422.
    v40: Added AUS Q125 (93k lines avg d/l 116.6 Mbps u/l 21.35 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 16.9. Mean devices: 5.13. Download, extract and publish: 14 mins.
    v39: Added AUS Q424 (95k lines avg d/l 110.9 Mbps u/l 21.02 Mbps). Imported using v2 Jupyter notebook (MBP 16Gb). Mean tests: 17.2. Mean devices: 5.24. Download, extract and publish: 14 mins.
    v38: Added AUS Q324 (92k lines avg d/l 107.0 Mbps u/l 20.79 Mbps). Imported using v2 Jupyter notebook (iMac 32Gb). Mean tests: 17.7. Mean devices: 5.33. Added github speedtest-workflow-importv2vis.ipynb Jupyter added datavis code to colour code national map. (per Binder on Github; link below).
    v37: Added AUS Q224 (91k lines avg d/l 97.40 Mbps u/l 19.88 Mbps). Imported using speedtest-workflow-importv2 jupyter notebook. Mean tests: 18.1. Mean devices: 5.4.
    v36: Load UK data, Q1-23 and compare to AUS and NZ Q123 data. Add compare image (au-nz-ukQ123.png), calc PlayNZUK.ipynb, data load import-UK.ipynb. UK data bit rough and ready as uses rectangle to mark out UK, but includes some EIRE and FR. Indicative only and to be definitive needs geo-clean to exclude neighbouring countries.
    v35: Load Melb geo-maps of speed quartiles (0-25, 25-50, 50-75, 75-100, 100-). Avg in 2020; 41Mbps. Avg in 2023; 86Mbps. MelbQ323.png, MelbQ320.png. Calc with Speedtest-incHist.ipynb code. Needed to install conda mapclassify. ax=melb.plot(column=...dict(bins[25,50,75,100]))
    v34: Added AUS Q124 (93k lines avg d/l 87.00 Mbps u/l 18.86 Mbps). Imported using speedtest-workflow-importv2 jupyter notebook. Mean tests: 18.3. Mean devices: 5.5.
    v33: Added AUS Q423 (92k lines avg d/l 82.62 Mbps). Imported using speedtest-workflow-importv2 jupyter notebook. Mean tests: 18.0. Mean devices: 5.6. Added link to Github.
    v32: Recalc Au vs NZ for upload performance; added image. using PlayNZ Jupyter. NZ approx 40% locations at or above 100Mbps. Aus
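
    The v36 note describes marking out the UK with a rectangle, and the Australian extract is described as selecting rows by location (lat, long). A minimal sketch of such a bounding-box filter is shown below. It is not the attached notebook's code: the function names, the "lat" and "lon" keys, and the rough Australian rectangle are all assumptions made for illustration.

```python
# Rough rectangle around Australia as (south, west, north, east) in degrees.
# These bounds are an illustrative assumption, not values from the dataset.
AUSTRALIA_BOX = (-44.0, 112.0, -10.0, 154.0)


def in_bounding_box(lat, lon, box):
    """Return True if the point lies inside the (south, west, north, east) box."""
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def filter_rows(rows, box):
    """Keep only the rows whose 'lat'/'lon' fall inside the box."""
    return [row for row in rows if in_bounding_box(row["lat"], row["lon"], box)]
```

    As the v36 note observes for the UK, a rectangle can also capture parts of neighbouring countries, so results from a filter like this are indicative unless followed by a proper geographic clean-up.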

  17. Data from: Coherent X-ray Scattering Reveals Nanoscale Fluctuations in...

    • data.europa.eu
    • demo.researchdata.se
    • +2more
    unknown
    Updated Jan 29, 2023
    Cite
    Stockholms universitet (2023). Coherent X-ray Scattering Reveals Nanoscale Fluctuations in Hydrated Proteins [Dataset]. https://data.europa.eu/data/datasets/https-doi-org-10-17045-sthlmuni-22756400?locale=sl
    Explore at:
    Available download formats: unknown
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Stockholms universitet
    Description

    Datasets:

    • "Figure1a.csv": scattering intensity of hydrated proteins in Wide-Angle X-ray Scattering for different fluences (in units of photons/second/area).

    • "Figure1a_inset.csv": scattering intensity of hydrated proteins in Small-Angle X-ray Scattering for different fluences (in units of photons/second/area).

    • "Figure1b.csv": Intensity autocorrelation functions g2 at momentum transfer Q = 0.08 1/nm for different fluences (in units of photons/second/area).

    • "Figure1b_inset.csv": decay rate (in second) as a function of the momentum transfer Q (in 1/nm) for different fluences (in units of photons/second/area).

    • "Figure1c.csv": decay rate (in second) for variable fluence (in photons/second/um^2) at the momentum transfer Q = 0.08 1/nm.

    • "Figure1d.csv": renormalised intensity autocorrelation functions g2 at momentum transfer Q = 0.08 1/nm for variable fluence (in photons/second/um^2), where the time axis is normalised to the corresponding fluence F by calculating t/(1 + a·F·τ0), where τ0 is the equilibrium time constant extracted by extrapolation to F=0 (from data in "Figure1c.csv").

    • "Figure2a.csv": The Wide-Angle X-ray Scattering scattering intensity at different temperatures T=180-290 K

    • "Figure2b.csv": The Small-Angle X-ray Scattering scattering intensity at different temperatures T=180-290 K

    • "Figure2c.csv": Intensity autocorrelation functions g2 for different temperatures (T=180-290 K) at momentum transfer Q = 0.1 1/nm.

    • "Figure2d-2e.csv": time constants (in second) and the Kohlrausch-Williams-Watts (KWW) exponent extracted from the fits of data in "Figure2c.csv" as a function of temperature (in K)

    • "Figure3b.csv": The normalised variance Chi_T at different temperatures (T=180-290 K) extracted from the two-time correlation functions.

    • "Figure3c.csv": The maximum of the normalised variance Chi_0 as a function of temperature (in K).

    Additionally, a Jupyter notebook, "open-data.ipynb", shows how to load and plot the data from the csv files in Python.
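
    The time-axis rescaling described for "Figure1d.csv", t/(1 + a·F·τ0), can be written as a one-line helper. This is a sketch of the stated expression only: the function and parameter names are hypothetical, and the value of the coefficient a is not given in this description, so it is left as an argument.

```python
def renormalise_time(t, fluence, tau0, a):
    """Rescale a delay time t by 1 + a * F * tau0, as stated for Figure1d.

    t       : delay time
    fluence : fluence F
    tau0    : equilibrium time constant (extrapolated to F = 0)
    a       : coefficient from the expression; its value is not given here
    """
    return t / (1.0 + a * fluence * tau0)
```

    At zero fluence the denominator is 1, so the time axis is left unchanged, consistent with τ0 being defined by extrapolation to F=0.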

  18. Reporting behavior from WHO COVID-19 public data

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 14, 2025
    Cite
    Auss Abbood (2025). Reporting behavior from WHO COVID-19 public data [Dataset]. http://doi.org/10.5061/dryad.9s4mw6mmb
    Explore at:
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Auss Abbood
    Time period covered
    Dec 16, 2022
    Description

    Objective:
    Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, is heterogeneous and metrics to evaluate its quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.

    Methods:
    In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.

    Results:
    Our final analysis included 222 countries and regions...

    Data collection:
    COVID-19 data was downloaded from WHO. Using a public repository, we have added the countries' full names to the WHO data set using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data covers January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.

    Data processing:
    We processed data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).

    Any text editor, as well as Microsoft Excel and its free alternatives, can open the uploaded CSV file. Any web browser and some code editors (like the freely available Visual Studio Code) can show the uploaded Jupyter Notebook if the required Python environment is set up correctly.

  19. Pulsar Voices

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; CHRISTOPHER RUSSELL; Sarath Tomy; Michael Walker (2023). Pulsar Voices [Dataset]. http://doi.org/10.6084/m9.figshare.3084748.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Anderson Murray; Ben Raymond; Gary Ruben; CHRISTOPHER RUSSELL; Sarath Tomy; Michael Walker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data is sourced from CSIRO Parkes ATNF, e.g. http://www.atnf.csiro.au/research/pulsar/psrcat/

    Feel the pulse of the universe

    We're taking signal data from astronomical "pulsar" sources and creating a way to listen to their signals audibly. Pulsar data is available from ATNF at CSIRO.au. Our team at #SciHackMelb has been working on a #datavis to give researchers and others a novel way to explore the Pulsar corpus, especially through the sound of the frequencies at which the Pulsars emit pulses.

    Link to project page at #SciHackMelb - http://www.the-hackfest.com/events/melbourne-science-hackfest/projects/pulsar-voices/

    The files attached here include: source data, project presentation, data as used in website (final_pulsar.sql), and other methodology documentation. Importantly, see the Github link, which contains data manipulation code, html code to present the data and render it audibly, and an iPython Notebook to process single pulsar data into an audible waveform file. Together all these resources are the Pulsar Voices activity and resulting data.

    Source Data;
    * RA - east/west coordinates (0 - 24 hrs, roughly equates to longitude) [theta; transforms RA to 0 - 360*]
    * Dec - north/south coordinates (-90, +90 roughly equates to latitude i.e. 90 is above north pole, and -90 south pole)
    * P0 - the time in seconds that a pulsar repeats its signal
    * f - 1/P0, which ranges from 700 cycles per sec to some pulsars which pulse only every few seconds
    * kps - distance from Earth in kilo-parsecs. 1 kps = 3,000 light years. The furthest data is 30 kps. The galactic centre is about 25,000 light years away i.e. about 8 kps.

    psrcatShort.csv = 2,295 Pulsars, all known pulsars with above fields; RA, Dec, Theta
    psrcatMedium.csv - add P0 and kps, only 1428 lines - i.e. not available for all 2,295 datapoints
    psrcatSparse.csv - add P0 and kps, blanks if n/a, 2,295 lines
    short.txt - important pulsars with high levels of observation (** even more closely examined)
    pulsar.R - code contributed by Ben Raymond to visualise Pulsar frequency, period in histogram
    pulsarVoices_authors.JPG - added photo of authors from SciHackMelb

    Added to the raw data:
    - Coordinates to map RA, Dec to screen width(y)/height(x): y = RA[Theta]*width/360; x = (Dec + 90)*height/180
    - audible frequency converted from Pulsar frequency (1/P0). Formula for 1/P0(x) -> Hz(y) => y = 10 ^ (0.5 log(x) + 2.8). Explanation in text file; Convert1/P0toHz.txt. Tone generator from: http://www.softsynth.com/webaudio/tone.php
    - detailed waveform file audible converted from Pulsar signal data, and waveform image (and python notebook to generate; available):

    The project source is hosted on github at: https://github.com/gazzar/pulsarvoices

    An IPython/Jupyter notebook contains code and a rough description of the method used to process a psrfits .sf file downloaded via the CSIRO Data Access Portal at http://doi.org/10.4225/08/55940087706E1

    The notebook contains experimental code to read one of these .sf files and access the contained spectrogram data, processing it to generate an audible signal. It also reads the .txt files containing columnar pulse phase data (which is also contained in the .sf files) and processes these by frequency modulating the signal with an audible carrier. This is the method used to generate the .wav and .png files used in the web interface.
    https://github.com/gazzar/pulsarvoices/blob/master/ipynb/hackfest1.ipynb

    A standalone python script that does the .txt to .png and .wav signal processing was used to process 15 more pulsar data examples. These can be reproduced by running the script.
    https://github.com/gazzar/pulsarvoices/blob/master/data/pulsarvoices.py

    Processed file at: https://github.com/gazzar/pulsarvoices/tree/master/web
    https://github.com/gazzar/pulsarvoices/blob/master/web/J0437-4715.png
    J0437-4715.wav | J0437-4715.png

    #Datavis online at: http://checkonline.com.au/tooltip.php. Code at Github linked above. See especially:
    https://github.com/gazzar/pulsarvoices/blob/master/web/index.php
    particularly, lines 314 - 328 (or search: "SELECT * FROM final_pulsar";) which load pulsar data from the DB and push it to screen with Hz on mouseover.

    Pulsar Voices webpage Functions:
    1. There is sound when you run the mouse across the Pulsars. We plot all known pulsars (N=2,295), and play a tone for pulsars we had data on frequency i.e. about 75%.
    2. In the bottom left corner a more detailed Pulsar sound, and wave image pops up when you click the star icon. Two of the team worked exclusively on turning a single pulsar's waveform into an audible wav file. They created 16 of these files, and a workflow, but the team only had time to load one waveform. With more time, it would be great to load these files.
    3. If you leave the mouse over a Pulsar, a little data description pops up, with location (RA, Dec), distance (kilo parsecs; 1 = 3,000 light years), and frequency of rotation (and Hz converted to human hearing).
    4. If you click on a Pulsar, other pulsars with similar frequency are highlighted in white. With more time I was interested to see if there are harmonics between pulsars, i.e. related frequencies.

    The Team
    Michael Walker is: orcid.org/0000-0003-3086-6094 ; Biosciences PhD student, Unimelb, Melbourne.
    Richard Ferrers is: orcid.org/0000-0002-2923-9889 ; ANDS Research Data Analyst, Innovation/Value Researcher, Melbourne.
    Sarath Tomy is: http://orcid.org/0000-0003-4301-0690 ; La Trobe PhD Comp Sci, Melbourne.
    Gary Ruben is: http://orcid.org/0000-0002-6591-1820 ; CSIRO Postdoc at Australian Synchrotron, Melbourne.
    Christopher Russell is: Data Manager, CSIRO, Sydney. https://wiki.csiro.au/display/ASC/Chris+Russell
    Anderson Murray is: orcid.org/0000-0001-6986-9140; Physics Honours, Monash, Melbourne.
    Contact: richard.ferrers@ands.org.au for more information.

    What is still left to do?
    * load data, description, images fileset to figshare :: DOI ; DONE except DOI
    * add overview images as option eg frequency bi-modal histogram
    * colour code pulsars by distance; DONE
    * add pulsar detail sound to Top three Observants; 16 pulsars processed but not loaded
    * add tones to pulsars to indicate f; DONE
    * add tooltips to show location, distance, frequency, name; DONE
    * add title and description; DONE
    * project data onto a planetarium dome with interaction to play pulsar frequencies. DONE see youtube video at https://youtu.be/F119gqOKJ1U
    * zoom into parts of sky to get separation between close data points - see youtube; function in Google Earth #datavis of dataset. Link at youtube.
    * set upper and lower tone boundaries, so tones aren't annoying
    * colour code pulsars by frequency bins e.g. >100 Hz, 10 - 100, 1 - 10,
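The two conversions stated in the Pulsar Voices description (RA/Dec to screen position, and pulsar frequency to an audible tone) can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and it assumes that "log" in the stated formula means the base-10 logarithm and that Theta is RA in hours multiplied by 15.

```python
import math


def ra_hours_to_theta(ra_hours):
    """Convert RA in hours (0-24) to Theta in degrees (0-360). Assumed mapping."""
    return ra_hours * 15.0


def screen_position(theta_deg, dec_deg, width, height):
    """Apply the stated mapping: y = Theta*width/360, x = (Dec + 90)*height/180."""
    y = theta_deg * width / 360.0
    x = (dec_deg + 90.0) * height / 180.0
    return x, y


def tone_hz(pulsar_freq):
    """Apply y = 10 ^ (0.5 log(x) + 2.8), assuming log is base 10.

    pulsar_freq is 1/P0 in cycles per second and must be positive.
    """
    return 10 ** (0.5 * math.log10(pulsar_freq) + 2.8)


if __name__ == "__main__":
    print(screen_position(ra_hours_to_theta(12), 0, 800, 600))
    print(round(tone_hz(1.0), 1), round(tone_hz(700.0), 1))
```

Under the base-10 assumption the formula is equivalent to multiplying the square root of the pulsar frequency by 10^2.8, which compresses the wide range of pulsar frequencies into a narrower band of tones.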

  20. r

    MCCN Case Study 3 - Select optimal survey locality

    • researchdata.edu.au
    Updated Nov 13, 2025
    Rakesh David; Lili Andres Hernandez; Hoang Son Le; Donald Hobern; Alisha Aneja (2025). MCCN Case Study 3 - Select optimal survey locality [Dataset]. http://doi.org/10.25909/29176451.V1
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    The University of Adelaide
    Authors
    Rakesh David; Lili Andres Hernandez; Hoang Son Le; Donald Hobern; Alisha Aneja
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MCCN project is to deliver tools to assist the agricultural sector to understand crop-environment relationships, specifically by facilitating generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks to demonstrate the functionality of the MCCN data cube components.

    The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and Jupyter Notebook (MCCN-CASE 3.ipynb)

    Research Activity Identifier (RAiD)

    RAiD: https://doi.org/10.26292/8679d473

    Case Studies

    This repository contains code and sample data for the following case studies. Note that the analyses here are intended to demonstrate the software, and the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with generation of an RO-Crate data package including the source data, the notebook and generated outputs, including netcdf exports of the datacubes themselves.

    Case Study 3 - Select optimal survey locality

    Given a set of existing survey locations across a variable landscape, determine the optimal site to add to increase the range of surveyed environments. This study demonstrates: 1) Loading heterogeneous data sources into a cube, and 2) Analysis and visualisation using numpy and matplotlib.

    Data Sources

    The primary goal for this case study is to demonstrate being able to import a set of environmental values for different sites and then use these to identify a subset that maximises spread across the various environmental dimensions.

    This is a simple implementation that uses four environmental attributes imported for all Australia (or a subset like NSW) at a moderate grid scale:

    1. Digital soil maps for key soil properties over New South Wales, version 2.0 - SEED - see https://esoil.io/TERNLandscapes/Public/Pages/SLGA/ProductDetails-SoilAttributes.html
    2. ANUCLIM Annual Mean Rainfall raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-rainfall-raster-layer
    3. ANUCLIM Annual Mean Temperature raster layer - SEED - see https://datasets.seed.nsw.gov.au/dataset/anuclim-annual-mean-temperature-raster-layer

    Dependencies

    • This notebook requires Python 3.10 or higher
    • Install relevant Python libraries with: pip install mccn-engine rocrate
    • Installing mccn-engine will install other dependencies

    Overview

    1. Generate STAC metadata for layers from predefined configuration
    2. Load data cube and exclude nodata values
    3. Scale all variables to a 0.0-1.0 range
    4. Select four layers for comparison (soil organic carbon 0-30 cm, soil pH 0-30 cm, mean annual rainfall, mean annual temperature)
    5. Select 10 random points within NSW
    6. Generate 10 new layers representing standardised environmental distance between one of the selected points and all other points in NSW
    7. For every point in NSW, find the lowest environmental distance to any of the selected points
    8. Select the point in NSW that has the highest value for the lowest environmental distance to any selected point - this is the most different point
    9. Clean up and save results to RO-Crate
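Steps 3 and 6 to 8 of the overview can be sketched with plain numpy as below. This is an illustration under stated assumptions, not the notebook's implementation: the function names are hypothetical, the layers are assumed to be already stacked into one array of shape (layers, rows, cols), and Euclidean distance in the scaled environmental space stands in for the "standardised environmental distance", whose exact definition is not given here.

```python
import numpy as np


def scale_01(layer):
    """Step 3: rescale one layer to the 0.0-1.0 range, ignoring NaN (nodata)."""
    lo, hi = np.nanmin(layer), np.nanmax(layer)
    return (layer - lo) / (hi - lo)


def most_different_cell(layers, selected):
    """Steps 6-8: return the grid cell farthest from its nearest selected point.

    layers:   array of shape (n_layers, rows, cols), each layer scaled to 0-1
    selected: list of (row, col) positions of the existing survey points
    """
    # Step 6: one distance layer per selected point (Euclidean, an assumption).
    distance_layers = np.stack([
        np.sqrt(((layers - layers[:, r, c][:, None, None]) ** 2).sum(axis=0))
        for r, c in selected
    ])
    # Step 7: lowest environmental distance to any selected point, per cell.
    nearest = np.nanmin(distance_layers, axis=0)
    # Step 8: the cell where that lowest distance is highest.
    row, col = np.unravel_index(np.nanargmax(nearest), nearest.shape)
    return (int(row), int(col)), nearest


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = np.stack([scale_01(rng.random((5, 5))) for _ in range(4)])
    print(most_different_cell(cube, [(0, 0), (4, 4)])[0])
```

Taking the maximum of the per-cell minimum distance is a maximin selection: it picks the location whose environment is least represented by the points already surveyed.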


João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524

Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
bz2Available download formats
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
João Felipe; João Felipe; Leonardo; Leonardo; Vanessa; Vanessa; Juliana; Juliana
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and can make their results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

  • dump.tar.bz2
  • jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

  • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
  • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
  • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remainder of this text, we give instructions for reproducing the analyses, by using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
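
Before starting the notebooks, it can help to confirm that the variable is set and well formed. The following is a minimal sketch using only the Python standard library; the helper name is hypothetical and the check only inspects the string, it does not connect to the database. The variable name is the one used in the export line above.

```python
import os
from urllib.parse import urlsplit


def check_connection_string(value):
    """Return a list of problems found in a PostgreSQL connection URL."""
    parts = urlsplit(value)
    problems = []
    if not parts.scheme.startswith("postgresql"):
        problems.append("scheme should be postgresql")
    if not parts.hostname:
        problems.append("missing hostname")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    return problems


if __name__ == "__main__":
    # An unset or misspelled variable shows up here as an empty string.
    value = os.environ.get("JUP_DB_CONNECTION", "")
    print(check_connection_string(value) or "connection string looks well formed")
```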

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7
conda activate analyses

Go to the analyses folder and install all the dependencies listed in requirements.txt:

cd jupyter_reproducibility/analyses
pip install -r requirements.txt

To reproduce the analyses, run jupyter in this folder:

jupyter notebook

Execute the notebooks in this order:

  • Index.ipynb
  • N0.Repository.ipynb
  • N1.Skip.Notebook.ipynb
  • N2.Notebook.ipynb
  • N3.Cell.ipynb
  • N4.Features.ipynb
  • N5.Modules.ipynb
  • N6.AST.ipynb
  • N7.Name.ipynb
  • N8.Execution.ipynb
  • N9.Cell.Execution.Order.ipynb
  • N10.Markdown.ipynb
  • N11.Repository.With.Notebook.Restriction.ipynb
  • N12.To.Paper.ipynb
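
The order above is numeric rather than alphabetical: a plain sort would place N10 before N2. If the notebooks are to be run non-interactively, the ordering can be sketched as below. The helper names are hypothetical, and running the notebooks through jupyter nbconvert is an assumption; the instructions above only describe opening them in the notebook interface.

```python
import re


def notebook_order_key(name):
    """Sort Index.ipynb first, then N<k>.*.ipynb by the number k."""
    match = re.match(r"N(\d+)\.", name)
    return (1, int(match.group(1))) if match else (0, 0)


def nbconvert_command(name):
    """Build a command that executes one notebook in place (assumed workflow)."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", name]


if __name__ == "__main__":
    names = ["N10.Markdown.ipynb", "N2.Notebook.ipynb", "Index.ipynb", "N0.Repository.ipynb"]
    for name in sorted(names, key=notebook_order_key):
        print(" ".join(nbconvert_command(name)))
```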

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction


# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";

export JUP_DB_IP="localhost"; # postgres database IP
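
With this many variables, a quick check that none was missed can save a failed run. The sketch below is an illustration only: the helper names are hypothetical and the list of required names is a subset chosen for the example. It also expands a leading ~ explicitly, because the shell does not expand a tilde inside double quotes, so values such as JUP_OAUTH_FILE and JUP_ANACONDA_PATH reach the program as literal "~/..." strings; whether the archaeology scripts expand them is not stated here.

```python
import os

# Illustrative subset of the variables listed above.
REQUIRED = [
    "JUP_MACHINE",
    "JUP_BASE_DIR",
    "JUP_LOGS_DIR",
    "JUP_DB_CONNECTION",
    "JUP_OAUTH_FILE",
    "JUP_ANACONDA_PATH",
]


def missing_variables(env, required=REQUIRED):
    """Return the required names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


def expanded_path(env, name):
    """Return the value of a path variable with a leading ~ expanded."""
    return os.path.expanduser(env[name])


if __name__ == "__main__":
    print("missing:", missing_variables(os.environ))
```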

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y
conda activate py35
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
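
The environment blocks above share a common baseline: create the environment, activate it, upgrade pip, install pipenv, and install the archaeology package in editable mode. That shared part can be generated as sketched below. The function name is hypothetical, the naming scheme (raw or py followed by the version digits) is read off the blocks above, and the version-specific extra steps for Conda 3.4, Anaconda 3.5 and Anaconda 3.6 are deliberately not covered.

```python
def baseline_commands(version, anaconda):
    """Return the shared setup commands for one environment.

    version:  Python version string such as "2.7" or "3.6"
    anaconda: True for the full anaconda environment, False for plain conda
    """
    env = ("py" if anaconda else "raw") + version.replace(".", "")
    create = "conda create -n {} python={}".format(env, version)
    if anaconda:
        create += " anaconda"
    return [
        create + " -y",
        "conda activate {}".format(env),
        "pip install --upgrade pip",
        "pip install pipenv",
        "pip install -e jupyter_reproducibility/archaeology",
    ]


if __name__ == "__main__":
    for version in ["2.7", "3.4", "3.5", "3.6"]:
        for anaconda in (False, True):
            print("\n".join(baseline_commands(version, anaconda)), end="\n\n")
```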

Conda 3.7

