Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. For the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
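If the "jupyter" database does not exist yet, it can be created first; a minimal sketch, assuming a local PostgreSQL server and a role that is allowed to create databases:
createdb jupyter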
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
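As a quick sanity check, the same connection string can be tested with psql before running the notebooks (user, password, and hostname are the placeholders from the example above):
psql "postgresql://user:password@hostname/jupyter" -c "SELECT 1;"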
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
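If you prefer to execute a notebook non-interactively, nbconvert can run it from the command line; a sketch (the notebook name is a placeholder):
jupyter nbconvert --to notebook --execute --inplace some_analysis.ipynb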
Execute the notebooks in this order:
Reproducing or Expanding the Collection
Reproducing the collection requires more steps and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection string
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json"; # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
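Since there are many variables, it may be convenient to keep the export lines above in a shell script and source it before each session; a minimal sketch (the file name is illustrative):
source ~/jup_env.sh   # file containing the export lines above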
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the base directory that stores the repositories; the second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
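A minimal sketch of the two scripts, assuming the repositories live on a dedicated volume (the device and mount point are placeholders; adapt them to your storage setup):
# mount_ghstudy.sh: mount the volume that holds the base directory
sudo mount /dev/sdb1 /mnt/jupyter/github
# umount_ghstudy.sh: release it again
sudo umount /mnt/jupyter/github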
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create 5 plain conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
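The plain environments all follow the same pattern, so they can optionally be created in a loop; a sketch, assuming the default environment location under ~/anaconda3/envs (the Anaconda variants and the extra per-version steps below still need to be done individually):
for v in 2.7 3.4 3.5 3.6 3.7; do
    name="raw${v/./}"   # raw27, raw34, ...
    conda create -n "$name" python="$v" -y
    ~/anaconda3/envs/"$name"/bin/pip install --upgrade pip pipenv
    ~/anaconda3/envs/"$name"/bin/pip install -e jupyter_reproducibility/archaeology
done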
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
This raster data layer represents sediment plumes originating from stream mouths and coastal pour points. The Integrated Valuation of Ecosystem Services and Tradeoffs (InVEST) model for sediment retention was modified for Hawaii, parameterized, and run for each of the Main Hawaiian Islands to determine sediment export from subwatershed hydrologic units (Falinski, 2016). Results from this model were aggregated into larger drainage areas that flow to single coastal pour points. From these points sediment was dispersed offshore using the Kernel Density tool in ArcGIS with a 1.5-km search radius. The resulting raster depicts simplistic sediment plumes with units in tons of sediment per year per hectare.
The InVEST model predicts the average annual amount of sediment (tons/yr) retained in and exported from each map pixel as a function of many landscape variables. Data inputs to InVEST included: 1) USGS 10-m Digital Elevation Model (DEM); 2) NOAA Coastal Change Analysis Program (C-CAP) land use/land cover data; 3) R factor (old USGS maps and interpolation); 4) K factor (USDA Natural Resources Conservation Service (NRCS) Soil Survey Geographic database (SSURGO)); 5) University of Hawaii at Manoa (UH) rainfall atlas; 6) ArcHydro-derived subwatersheds such that flow lines approximately match the State of Hawaii streams layer; and 7) derived products from the above and more. See Falinski (2016) for detailed methodology.
Coastal pour points were created by intersecting streams and coastline features from the National Hydrography Dataset (NHD), resulting in points where streams flow to the shoreline. The NHD was used rather than flow lines generated from the DEM because there are many instances in Hawaii where streams flow into man-made ditch systems and never reach the coast, or simply dry up and go underground before reaching the coast. To determine the amount of sediment load at the coastline, resulting coastal points were given a unique drainage identifier. Next, the stream segment features were buffered by 1 m and dissolved so that connecting stream networks became single features. These polygon stream features were then assigned the drainage ID from the coastal points using a spatial join and subsequently used to assign that drainage ID to the subwatershed polygons. Finally, subwatersheds were dissolved by drainage ID and sediment export from each subwatershed was summed up to yield the total sediment export for each larger drainage basin, which was then joined back to the corresponding coastal drainage points. Each step in the process required quality control to ensure that: no pour points are left out, subwatersheds are not erroneously connected to the wrong drainage or left out, each drainage has only 1 pour point, and drainages do not erroneously span a ridgeline that should divide basins.
The region is the top tier of local government in New Zealand. There are 16 regions of New Zealand (Part 1 of Schedule 2 of the Local Government Act 2002). Eleven are governed by an elected regional council, while five are governed by territorial authorities (the second tier of local government) who also perform the functions of a regional council and thus are known as unitary authorities. These unitary authorities are Auckland Council, Nelson City Council, and Gisborne, Tasman, and Marlborough District Councils. The Chatham Islands Council also performs some of the functions of a regional council, but is not strictly a unitary authority. Unitary authorities act as regional councils for the purposes of a wide range of Acts and regulations. Regional council areas are based on water catchment areas. Regional councils are responsible for the administration of many environmental and public transport matters.
Regional councils were established in 1989 after the abolition of the 22 local government regions. The Local Government Act 2002 requires the boundaries of regions to conform as far as possible to one or more water catchments. When determining regional boundaries, the Local Government Commission gave consideration to regional communities of interest when selecting the water catchments to include in a region. It also considered factors such as natural resource management, land use planning and environmental matters. Some regional boundaries are conterminous with territorial authority boundaries, but there are many exceptions. An example is Taupo District, which is split between four regions, although most of its area falls within the Waikato Region. Where territorial local authorities straddle regional council boundaries, the affected areas have been statistically defined in complete area units. Generally, regional councils contain complete territorial authorities. The unitary authority of the Auckland Council was formed in 2010, under the Local Government (Tamaki Makaurau Reorganisation) Act 2009, replacing the Auckland Regional Council and seven territorial authorities. The seaward boundary of any coastal regional council is the twelve-mile New Zealand territorial limit. Regional councils are defined at meshblock and area unit level.
Regional councils included in the 2013 digital pattern are (code, name):
01 Northland Region
02 Auckland Region
03 Waikato Region
04 Bay of Plenty Region
05 Gisborne Region
06 Hawke's Bay Region
07 Taranaki Region
08 Manawatu-Wanganui Region
09 Wellington Region
12 West Coast Region
13 Canterbury Region
14 Otago Region
15 Southland Region
16 Tasman Region
17 Nelson Region
18 Marlborough Region
99 Area Outside Region
As at 1 July 2007, digital boundary data became freely available.
Deriving of Output Files
The original vertices delineating the meshblock boundary pattern were digitised in 1991 from 1:5,000 scale urban maps and 1:50,000 scale rural maps. The magnitude of error of the original digital points would have been in the range of +/- 10 metres in urban areas and +/- 25 metres in rural areas. Where meshblock boundaries coincide with cadastral boundaries, the magnitude of error will be within the range of 1-5 metres in urban areas and 5-20 metres in rural areas, this being the estimated magnitude of error of Landonline.
The creation of high definition and generalised meshblock boundaries for the 2013 digital pattern, and the dissolving of these meshblocks into other geographies/boundaries, were completed within Statistics New Zealand using ESRI's ArcGIS desktop suite and the Data Interoperability extension, with the following process:
1. Import data and all attribute fields into an ESRI File Geodatabase from LINZ as a shapefile.
2. Run geometry checks and repairs.
3. Run topology checks on all data (Must Not Have Gaps, Must Not Overlap), detailed below.
4. Generalise the meshblock layers to a 1 m tolerance to create the generalised dataset.
5. Clip the high definition and generalised meshblock layers to the coastline using land water codes.
6. Dissolve all four meshblock datasets (clipped and unclipped, for both generalised and high definition versions) to higher geographies to create the following output data layers: Area Unit, Territorial Authorities, Regional Council, Urban Areas, Community Boards, Territorial Authority Subdivisions, Wards, Constituencies and Maori Constituencies for the four datasets.
7. Complete a frequency analysis to determine that each code only has a single record.
8. Re-run topology checks for overlaps and gaps.
9. Export all created datasets into MapInfo and Shapefile format using the Data Interoperability extension to create 3 output formats for each file.
10. Quality assurance and rechecking of delivery files.
The high definition version is similar to how the layer exists in Landonline, with a couple of changes to fix topology errors identified in topology checking. The following quality checks and steps were applied to the meshblock pattern:
Translation of ESRI Shapefiles to ESRI geodatabase dataset
The meshblock dataset was imported into the ESRI File Geodatabase format, required to run the ESRI topology checks. Topology rules were set for each of the layers.
Topology Checks
A tolerance of 0.1 cm was applied to the data, which meant that the topology engine validating the data saw any vertex closer than this distance as the same location. A default topology rule of "Must Be Larger Than Cluster Tolerance" is applied to all data; this would highlight any features with a width less than 0.1 cm. No errors were found for this rule. Three additional topology rules were applied specifically within each of the layers in the ESRI geodatabase, namely "Must Not Overlap", "Must Not Have Gaps" and "Area Boundary Must Be Covered By Boundary Of (Meshblock)". These check that a layer forms a continuous coverage over a surface, that any given point on that surface is only assigned to a single category, and that the dissolved boundaries are identical to the parent meshblock boundaries. Topology check results: there were no errors in either the gap or overlap checks.
Generalising
To create the generalised meshblock layer, the "Simplify Polygon" geoprocessing tool was used in ArcGIS, with the following parameters:
Simplification Algorithm: POINT_REMOVE
Maximum Allowable Offset: 1 metre
Minimum Area: 1 square metre
Handling Topological Errors: RESOLVE_ERRORS
Clipping of Layers to Coastline
The processed feature class was then clipped to the coastline. The coastline was defined as features within the supplied Land2013 with codes and descriptions as follows:
11 - Island - Included
12 - Mainland - Included
21 - Inland Water - Included
22 - Inlet - Excluded
23 - Oceanic - Excluded
33 - Other - Included
Features were clipped using the Data Interoperability extension's attribute filter tool. The attribute filter was used on both the generalised and high definition meshblock datasets, creating four meshblock layers. Each meshblock dataset also contained all higher geographies and land-water data as attributes. Note: meshblock 0017001, which is classified as island, was excluded from the clipped meshblock layers, as most of this meshblock is oceanic.
Dissolve meshblocks to higher geographies
Statistics New Zealand then dissolved the ESRI meshblock feature classes to the higher geographies, for both the full and clipped datasets, generalised and high definition. To dissolve the higher geographies, a model was built using the dissolver, aggregator and sorter tools within the Data Interoperability extension, with each output set to include geography code and names.
Export to MapInfo Format and Shapefiles
The data was exported to MapInfo and Shapefile format using ESRI's Data Interoperability extension Translation tool.
Quality Assurance and rechecking of delivery files
The feature counts of all files were checked to ensure all layers had the correct number of features. This included checking that all multipart features had translated correctly in the new file.
Notice: this is not the latest Heat Island Severity image service. This layer contains the relative heat severity for every pixel for every city in the United States, including Alaska, Hawaii, and Puerto Rico. Heat Severity is a reclassified version of the Heat Anomalies raster, which is also published on this site. This data is generated from 30-meter Landsat 8 imagery band 10 (ground-level thermal sensor) from the summer of 2023.
To explore previous versions of the data, visit the links below:
Heat Severity - USA 2022
Heat Severity - USA 2021
Heat Severity - USA 2020
Heat Severity - USA 2019
Federal statistics over a 30-year period show extreme heat is the leading cause of weather-related deaths in the United States. Extreme heat exacerbated by urban heat islands can lead to increased respiratory difficulties, heat exhaustion, and heat stroke. These heat impacts significantly affect the most vulnerable: children, the elderly, and those with preexisting conditions.
The purpose of this layer is to show where certain areas of cities are hotter than the average temperature for that same city as a whole. Severity is measured on a scale of 1 to 5, with 1 being a relatively mild heat area (slightly above the mean for the city), and 5 being a severe heat area (significantly above the mean for the city). The absolute heat above mean values are classified into these 5 classes using the Jenks Natural Breaks classification method, which seeks to reduce the variance within classes and maximize the variance between classes. Knowing where areas of high heat are located can help a city government plan for mitigation strategies.
This dataset represents a snapshot in time. It will be updated yearly, but is static between updates. It does not take into account changes in heat during a single day, for example, from building shadows moving. The thermal readings detected by the Landsat 8 sensor are surface-level, whether that surface is the ground or the top of a building. Although there is strong correlation between surface temperature and air temperature, they are not the same. We believe that this is useful at the national level, and for cities that don't have the ability to conduct their own hyper-local temperature survey. Where local data is available, it may be more accurate than this dataset.
Dataset Summary
This dataset was developed using proprietary Python code developed at Trust for Public Land, running on the Descartes Labs platform through the Descartes Labs API for Python. The Descartes Labs platform allows for extremely fast retrieval and processing of imagery, which makes it possible to produce heat island data for all cities in the United States in a relatively short amount of time.
What can you do with this layer?
This layer has query, identify, and export image services available. Since it is served as an image service, it is not necessary to download the data; the service itself is data that can be used directly in any Esri geoprocessing tool that accepts raster data as input. In order to click on the image service and see the raw pixel values in a map viewer, you must be signed in to ArcGIS Online, then Enable Pop-Ups and Configure Pop-Ups.
Using the Urban Heat Island (UHI) Image Services
The data is made available as an image service. There is a processing template applied that supplies the yellow-to-red or blue-to-red color ramp, but once this processing template is removed (you can do this in ArcGIS Pro or ArcGIS Desktop, or in QGIS), the actual data values come through the service and can be used directly in a geoprocessing tool (for example, to extract an area of interest). Following are instructions for doing this in Pro.
In ArcGIS Pro, in a Map view, in the Catalog window, click on Portal. In the Portal window, click on the far-right icon representing Living Atlas. Search on the acronyms "tpl" and "uhi". The results returned will be the UHI image services. Right-click on a result and select "Add to current map" from the context menu. When the image service is added to the map, right-click on it in the map view, and select Properties. In the Properties window, select Processing Templates. On the drop-down menu at the top of the window, the default Processing Template is either a yellow-to-red ramp or a blue-to-red ramp. Click the drop-down, and select "None", then "OK". Now you will have the actual pixel values displayed in the map, and available to any geoprocessing tool that takes a raster as input. Below is a screenshot of ArcGIS Pro with a UHI image service loaded, color ramp removed, and symbology changed back to a yellow-to-red ramp (a classified renderer can also be used).
A typical operation at this point is to clip out your area of interest. To do this, add your polygon shapefile or feature class to the map view, and use the Clip Raster tool to export your area of interest as a GeoTIFF raster (file extension ".tif"). In the Environments tab for the Clip Raster tool, click the dropdown for "Extent", select "Same as Layer:", and select the name of your polygon. If you then need to convert the output raster to a polygon shapefile or feature class, run the Raster to Polygon tool, and select "Value" as the field.
Other Sources of Heat Island Information
Please see these websites for valuable information on heat islands and to learn about exciting new heat island research being led by scientists across the country:
EPA's Heat Island Resource Center
Dr. Ladd Keith, University of Arizona
Dr. Ben McMahan, University of Arizona
Dr. Jeremy Hoffman, Science Museum of Virginia
Dr. Hunter Jones, NOAA
Daphne Lundi, Senior Policy Advisor, NYC Mayor's Office of Recovery and Resiliency
Disclaimer/Feedback
With nearly 14,000 cities represented, checking each city's heat island raster for quality assurance would be prohibitively time-consuming, so Trust for Public Land checked a statistically significant sample size for data quality. The sample passed all quality checks, with about 98.5% of the output cities error-free, but there could be instances where the user finds errors in the data. These errors will most likely take the form of a line of discontinuity where there is no city boundary; this type of error is caused by large temperature differences in two adjacent Landsat scenes, so the discontinuity occurs along scene boundaries (see figure below). Trust for Public Land would appreciate feedback on these errors so that version 2 of the national UHI dataset can be improved. Contact Dale.Watt@tpl.org with feedback.