This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.
Read the book in its entirety online at https://jakevdp.github.io/PythonDataScienceHandbook/
Run the code using the Jupyter notebooks available in this repository's notebooks directory.
Launch executable versions of these notebooks using Google Colab:
Launch a live notebook server with these notebooks using binder:
Buy the printed book through O'Reilly Media
The book was written and tested with Python 3.5, though other Python versions (including Python 2.7) should work in nearly all cases.
The book introduces the core libraries essential for working with data in Python: particularly IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related packages. Familiarity with Python as a language is assumed; if you need a quick introduction to the language itself, see the free companion project, A Whirlwind Tour of Python: it's a fast-paced introduction to the Python language aimed at researchers and scientists.
See Index.ipynb for an index of the notebooks available to accompany the text.
The code in the book was tested with Python 3.5, though most (but not all) will also work correctly with Python 2.7 and other older Python versions.
The packages I used to run the code in the book are listed in requirements.txt (Note that some of these exact version numbers may not be available on your platform: you may have to tweak them for your own use). To install the requirements using conda, run the following at the command-line:
$ conda install --file requirements.txt
To create a stand-alone environment named PDSH with Python 3.5 and all the required package versions, run the following:
$ conda create -n PDSH python=3.5 --file requirements.txt
You can read more about using conda environments in the Managing Environments section of the conda documentation.
The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.
The text content of the book is released under the CC-BY-NC-ND license. Read more at Creative Commons.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes their results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
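As a quick check that the connection string works, here is a minimal sketch using SQLAlchemy from Python; the SELECT 1 query is only a connectivity test, not part of the analysis scripts:
import os
from sqlalchemy import create_engine, text

# Open a connection to the restored "jupyter" database using the variable set above.
engine = create_engine(os.environ["JUP_DB_CONNECTION"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 if the connection works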
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
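For illustration only, a hypothetical sketch of how configuration like this is typically read from the environment in Python; the variable names match the exports above, but this is not the actual archaeology package code:
import os

verbose = int(os.environ.get("JUP_VERBOSE", "0"))        # verbosity level
max_size_gb = float(os.environ.get("JUP_MAX_SIZE", "0"))  # repository storage budget in GB
base_dir = os.environ.get("JUP_BASE_DIR", ".")            # where repositories are stored
print(f"verbose={verbose}, max_size_gb={max_size_gb}, base_dir={base_dir}")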
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create 5 conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
https://github.com/microsoft/DataScienceProblems/blob/main/LICENSE.txt
Evaluate a natural language code generation model on real data science pedagogical notebooks! Data Science Problems (DSP) includes well-posed data science problems in Markdown along with unit tests to verify correctness and a Docker environment for reproducible execution. About 1/3 of the notebooks in this benchmark also include data dependencies, so the benchmark can not only test a model's ability to chain together complex tasks, but also evaluate the solutions on real data! See our paper Training and Evaluating a Jupyter Notebook Data Science Assistant (https://arxiv.org/abs/2201.12901) for more details about state-of-the-art results and other properties of the dataset.
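As an illustration of the evaluation idea only (not the actual DSP harness; the solution and test below are hypothetical), a generated solution can be executed together with a problem's unit test in a shared namespace:
import pandas as pd

# Hypothetical generated solution and unit test, both given as source strings.
solution_src = "def add_total(df):\n    return df.assign(total=df['a'] + df['b'])\n"
test_src = (
    "def test_add_total():\n"
    "    df = pd.DataFrame({'a': [1], 'b': [2]})\n"
    "    assert add_total(df)['total'].iloc[0] == 3\n"
)

# Execute the solution and its test in one namespace, then run the test.
namespace = {"pd": pd}
exec(solution_src, namespace)
exec(test_src, namespace)
namespace["test_add_total"]()
print("unit test passed")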
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes their results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data into the database and run the analysis notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-01-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.6:
conda create -n py36 python=3.6
conda activate py36
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create 5 conda environments and 5 Anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.7
When we
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We implemented automated workflows using Jupyter notebooks for each state. The GIS processing, crucial for merging, extracting, and projecting GeoTIFF data, was performed using ArcPy, a Python package for geographic data analysis, conversion, and management within ArcGIS (Toms, 2015). After generating state-scale LES (large extent spatial) datasets in GeoTIFF format, we utilized the xarray and rioxarray Python packages to convert GeoTIFF to NetCDF. Xarray is a Python package for working with multi-dimensional arrays, and rioxarray is the rasterio extension for xarray; rasterio is a Python library for reading and writing GeoTIFF and other raster formats. Xarray facilitated data manipulation and metadata addition in the NetCDF file, while rioxarray was used to read the GeoTIFF data and save it as NetCDF. These procedures resulted in the creation of three HydroShare resources (HS 3, HS 4, and HS 5) for sharing state-scale LES datasets. Notably, due to licensing constraints with ArcGIS Pro, a commercial GIS software, the Jupyter notebook development was undertaken on a Windows OS.
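A minimal sketch of the GeoTIFF-to-NetCDF step described above, assuming a hypothetical state-scale GeoTIFF file name (the actual workflow notebooks are not reproduced here):
import rioxarray

da = rioxarray.open_rasterio("state_les.tif")    # read the GeoTIFF as an xarray DataArray
ds = da.to_dataset(name="les")                   # wrap it in a Dataset
ds.attrs["title"] = "State-scale LES dataset"    # add metadata before writing
ds.to_netcdf("state_les.nc")                     # save as NetCDF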
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? We compare a corpus of 1048 open-source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.
results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.
The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the outputs of the notebook "Detecting floating objects using Deep Learning and Sentinel-2 imagery" published in the ocean modelling section of The Environmental Data Science Book.
Contributions
Notebook
Jamila Mifdal (author), European Space Agency Φ-lab, @jmifdal
Raquel Carmo (author), European Space Agency Φ-lab, @raquelcarmo
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Modelling codebase
Jamila Mifdal (author), European Space Agency Φ-lab, @jmifdal
Raquel Carmo (author), European Space Agency Φ-lab, @raquelcarmo
Marc Rußwurm (author), EPFL-ECEO, @marccoru
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset consists of the Jupyter notebooks from https://github.com/jakevdp/PythonDataScienceHandbook converted into Markdown for better RAG. Of course, you can use it for purposes other than RAG as long as they don't violate the LICENSE terms.
The repository at https://github.com/jakevdp/PythonDataScienceHandbook contains the entire Python Data Science Handbook, in the form of Jupyter notebooks.
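For illustration, a minimal sketch of converting a notebook to Markdown with nbconvert; the exact pipeline used for this dataset is not specified, and the file name below is a placeholder:
import nbformat
from nbconvert import MarkdownExporter

nb = nbformat.read("notebook.ipynb", as_version=4)            # any handbook notebook
markdown, resources = MarkdownExporter().from_notebook_node(nb)

with open("notebook.md", "w", encoding="utf-8") as f:
    f.write(markdown)                                          # Markdown version for RAG indexing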
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Data Science Notebook Platform market size reached USD 820 million in 2024, driven by the accelerating adoption of advanced analytics and machine learning tools across industries. The market is projected to expand at a robust CAGR of 21.6% from 2025 to 2033, reaching a forecasted value of USD 6.06 billion by 2033. This remarkable growth is underpinned by the increasing demand for collaborative, scalable, and cloud-based data science solutions as organizations prioritize data-driven decision-making and digital transformation initiatives.
One of the primary growth factors propelling the Data Science Notebook Platform market is the rapid digitalization of enterprises and the proliferation of big data. As organizations generate and collect massive volumes of structured and unstructured data, there is a pressing need for platforms that enable seamless data exploration, analysis, and visualization. Data science notebook platforms, with their interactive and user-friendly interfaces, empower data scientists, analysts, and business users to collaborate in real-time, streamline workflows, and accelerate the development of machine learning models. The increasing integration of these platforms with cloud-based data storage and processing solutions further enhances their scalability, flexibility, and accessibility, making them indispensable tools for modern data-driven enterprises.
Another significant driver is the growing adoption of artificial intelligence (AI) and machine learning (ML) across various sectors such as BFSI, healthcare, retail, and manufacturing. These industries are leveraging data science notebook platforms to develop, test, and deploy sophisticated ML algorithms that can deliver actionable insights, optimize operations, and personalize customer experiences. The ability of these platforms to support a wide range of programming languages, libraries, and frameworks—such as Python, R, TensorFlow, and PyTorch—enables organizations to innovate rapidly and stay ahead of the competition. Moreover, the rising emphasis on open-source technologies and community-driven development is fostering a vibrant ecosystem around data science notebooks, driving further innovation and adoption.
Furthermore, the shift towards remote and hybrid work models has amplified the need for collaborative data science tools that can bridge geographical and functional silos. Data science notebook platforms offer integrated collaboration features, version control, and secure sharing capabilities, enabling distributed teams to work together efficiently on complex data projects. The growing focus on democratizing data science and empowering business users with self-service analytics tools is also expanding the user base of these platforms beyond traditional data scientists to include business analysts, domain experts, and citizen data scientists. This trend is expected to continue, fueling sustained demand for versatile and user-friendly data science notebook solutions.
From a regional perspective, North America currently dominates the Data Science Notebook Platform market, accounting for the largest revenue share in 2024, thanks to its mature technology infrastructure, high concentration of data-driven enterprises, and strong presence of leading platform vendors. However, the Asia Pacific region is poised for the fastest growth over the forecast period, driven by rapid digital transformation, increasing investments in AI and analytics, and expanding talent pools in countries such as China, India, and Japan. Europe also represents a significant market, characterized by stringent data privacy regulations and a growing focus on responsible AI and ethical data practices. Meanwhile, Latin America and the Middle East & Africa are witnessing gradual adoption, supported by government initiatives and the rising penetration of cloud-based solutions.
The Data Science Notebook Platform market is segmented by component into Software and Services, each playing a critical role in the overall ecosystem. The software segment encompasses a broad range of notebook solutions, including open-source platforms like Jupyter and proprietary offerings from major technology vendors. These platforms are designed to provide interactive development environments where users can write, execute, and visualize code, making it easier
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the inputs of the notebook "Met Office UKV high-resolution atmosphere model data" published in The Environmental Data Science Book.
The input data refer to a subset: a single sample data file for 1.5 m temperature, part of the Met Office contribution to the COVID-19 modelling effort.
The full dataset was available for download from Met Office Azure (https://metdatasa.blob.core.windows.net/covid19-response-non-commercial/) under terms restricting use to non-commercial purposes.
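A minimal sketch for inspecting a sample NetCDF file with xarray; the file name is a placeholder, not the actual name used in the dataset:
import xarray as xr

ds = xr.open_dataset("ukv_1p5m_temperature_sample.nc")  # placeholder file name
print(ds)            # dimensions, coordinates, and variables
print(ds.data_vars)  # should include the 1.5 m air temperature variable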
Contributions
Notebook
Samantha V. Adams (author), Met Office Informatics Lab, @svadams
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Dataset originator/creator
Met Office Informatics Lab (creator)
Microsoft (support)
European Regional Development Fund (support)
Dataset authors
Met Office
Dataset documentation
Theo McCaie. Met office and partners offer data and compute platform for covid-19 researchers. URL: https://medium.com/informatics-lab/met-office-and-partners-offer-data-and-compute-platform-for-covid-19-researchers-83848ac55f5f.
Note this data should be used only for non-commercial purposes.
This resource contains Jupyter Notebooks with examples that illustrate how to work with SQLite databases in Python, including database creation and viewing and querying with SQL. The resource is part of a set of materials for hydroinformatics and water data science instruction. Complete learning module materials are found in HydroLearn: Jones, A.S., Horsburgh, J.S., Bastidas Pacheco, C.J. (2022). Hydroinformatics and Water Data Science. HydroLearn. https://edx.hydrolearn.org/courses/course-v1:USU+CEE6110+2022/about.
This resource consists of 3 example notebooks and a SQLite database.
Notebooks:
1. Example 1: Querying databases using SQL in Python
2. Example 2: Python functions to query SQLite databases
3. Example 3: SQL join, aggregate, and subquery functions
Data files: These examples use a SQLite database that uses the Observations Data Model structure and is pre-populated with Logan River temperature data.
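In the spirit of the examples described above, a minimal sketch of querying a SQLite database from Python; the database file, table, and column names are assumptions for illustration and the actual ODM schema names may differ:
import sqlite3
import pandas as pd

conn = sqlite3.connect("ODM.sqlite")  # placeholder name for the bundled database file
df = pd.read_sql_query(
    "SELECT LocalDateTime, DataValue FROM DataValues LIMIT 10;", conn
)
print(df)
conn.close()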
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the outputs of the notebook "Tree crown detection using DeepForest" published in The Environmental Data Science Book.
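A minimal usage sketch of the DeepForest package for crown detection on an RGB image; the image path is a placeholder and API details may differ across DeepForest versions:
from deepforest import main

model = main.deepforest()
model.use_release()                                # load the prebuilt release model
boxes = model.predict_image(path="rgb_tile.png")   # predicted crown bounding boxes as a DataFrame
print(boxes.head())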
Contributions
Notebook
Sebastian H. M. Hickman (author), University of Cambridge, @shmh40
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Modelling codebase
Sebastian H. M. Hickman (author), University of Cambridge @shmh40
James G. C. Ball (contributor), University of Cambridge @PatBall1
David A. Coomes (contributor), University of Cambridge
Toby Jackson (contributor), University of Cambridge
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the outputs of the notebook "Met Office UKV high-resolution atmosphere model data" published in the urban sensors section of The Environmental Data Science Book.
Contributions
Notebook
Samantha V. Adams (author), Met Office Informatics Lab, @svadams
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Dataset originator/creator
Met Office Informatics Lab (creator)
Microsoft (support)
European Regional Development Fund (support)
Dataset authors
Met Office
Dataset documentation
Theo McCaie. Met office and partners offer data and compute platform for covid-19 researchers. URL: https://medium.com/informatics-lab/met-office-and-partners-offer-data-and-compute-platform-for-covid-19-researchers-83848ac55f5f.
Dataset containing the raw data and results from analyses, along with a supporting Jupyter notebook that shows the processing of data in the following paper: Big Data Analytics for Scanning Transmission Electron Microscopy Ptychography, S. Jesse, M. Chi, A. Belianinov, C. Beekman, S. V. Kalinin, A. Y. Borisevich & A. R. Lupini, Scientific Reports, volume 6, Article number: 26348 (2016), https://www.nature.com/articles/srep26348
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the outputs of the notebook "MODIS MOD021KM and FIRMS" published in The Environmental Data Science Book.
Contributions
Notebook
Samuel Jackson (author), Science & Technology Facilities Council, @samueljackson92
Alejandro Coca-Castro (reviewer), The Alan Turing Institute, @acocac
Dataset originator/creator
MOD021KM
MODIS Characterization Support Team (MCST)
MODIS Adaptive Processing System (MODAPS)
FIRMS
University of Maryland
Dataset authors
MOD021KM
MODIS Science Data Support Team (SDST)
FIRMS
NASA’s Applied Sciences Program
Dataset documentation
Louis Giglio, Wilfrid Schroeder, Joanne V. Hall, and Christopher O. Justice. MODIS Collection 6 Active Fire Product User’s Guide Revision B. Technical Report, NASA, 2018. URL: https://modis-fire.umd.edu/files/MODIS_C6_Fire_User_Guide_B.pdf.
Our GeoThermalCloud framework is designed to process geothermal datasets using a novel toolbox for unsupervised and physics-informed machine learning called SmartTensors. More information about GeoThermalCloud can be found at the GeoThermalCloud GitHub Repository. More information about SmartTensors can be found at the SmartTensors GitHub Repository and the SmartTensors page at LANL.gov. Links to these pages are included in this submission. GeoThermalCloud.jl is a repository containing all the data and codes required to demonstrate applications of machine learning methods for geothermal exploration. GeoThermalCloud.jl includes:
- site data
- simulation scripts
- jupyter notebooks
- intermediate results
- code outputs
- summary figures
- readme markdown files
GeoThermalCloud.jl showcases the machine learning analyses performed for the following geothermal sites:
- Brady: geothermal exploration of the Brady geothermal site, Nevada
- SWNM: geothermal exploration of the Southwest New Mexico (SWNM) region
- GreatBasin: geothermal exploration of the Great Basin region, Nevada
Reports, research papers, and presentations summarizing these machine learning analyses are also available and will be posted soon.
Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of large language models for code in generating data science programs given natural language instructions. Please read our paper for more details.
Note👉 This Kaggle dataset only contains the dataset files of Arcade. Refer to our main Github repository for detailed instructions to use this dataset.
Below is the structure of its content:
└── ./
├── existing_tasks/ # Problems derived from existing data science notebooks on Github
│ ├── metadata.json # Metadata by `build_existing_tasks_split.py` to reproduce this split.
│ ├── artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
├── new_tasks/
│ ├── dataset.json # Original, unprocessed dataset
│ ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
│ ├── artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
└── checksums.txt # Table of MD5 checksums of dataset files.
All the dataset '*.json' files follow the same structure. Each dataset file is a JSON-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:
{
"notebook_name": "Name of the notebook.",
"work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
"annotator": "Anonymized annotator Id."
"turns": [
# A list of natural language to code examples using the current notebook context.
{
"input": "Prompt to a code generation model.",
"turn": {
"intent": {
"value": "Annotated NL intent for the current turn.",
"is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
"cell_idx": "Index of the intent Markdown cell.",
"line_span": "Line span of the intent.",
"not_sure": "Annotation confidence.",
"output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
},
"code": {
"value": "Reference code solution.",
"cell_idx": "Cell index of the code cell containing the solution.",
"num_lines": "Number of lines in the reference solution.",
"line_span": "Line span.",
},
"code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
"delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
"metadata": {
"annotator_id": "Annotator Id",
"num_code_lines": "Metadata, please ignore.",
"utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
},
},
"notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
"metadata": {
# A dict of metadata of this turn.
"context_cells": [ # A list of context cells before the problem.
{
"cell_type": "code|markdown",
"source": "Cell content."
},
],
"delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
# The following fields only occur in datasets inlined with schema descriptions.
"context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
"inten...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the inputs of the notebook "Cosmos-UK soil moisture" published in The Environmental Data Science Book.
The input data refer to a subset of the public 2013-2019 COSMOS-UK dataset: daily and subhourly observations and metadata for four stations, WYTH1, WADDN, SHEEP and CHIMN. These stations were the first sites to prototype COSMOS sensors in the UK (see further details in Evans et al., 2016). They are situated in human-intervened areas (grassland and cropland), except for one at a woodland land cover site.
Data from COSMOS-UK up to the end of 2019 are available for download from the UKCEH Environmental Information Data Centre (EIDC). The data are accompanied by documentation that describes the site-specific instrumentation, data and processing including quality control. The full dataset is available for download under the terms of the Open Government License.
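A minimal sketch for loading a daily observations file with pandas; the file name is a placeholder and the actual EIDC file names and columns differ:
import pandas as pd

daily = pd.read_csv("COSMOS-UK_WYTH1_daily_2013-2019.csv")  # placeholder file name
print(daily.columns.tolist())  # inspect available variables
print(daily.head())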
Contributions
Notebook
Alejandro Coca-Castro (author), The Alan Turing Institute, @acocac
Doran Khamis (reviewer), UK Centre for Ecology & Hydrology, @dorankhamis
Matt Fry (reviewer), UK Centre for Ecology & Hydrology, @mattfry-ceh
Dataset originator/creator
UK Centre for Ecology & Hydrology (creator)
Natural Environment Research Council (support)
Dataset reference and documentation
S. Stanley, V. Antoniou, A. Askquith-Ellis, L.A. Ball, E.S. Bennett, J.R. Blake, D.B. Boorman, M. Brooks, M. Clarke, H.M. Cooper, N. Cowan, A. Cumming, J.G. Evans, P. Farrand, M. Fry, O.E. Hitt, W.D. Lord, R. Morrison, G.V. Nash, D. Rylett, P.M. Scarlett, O.D. Swain, M. Szczykulska, J.L. Thornton, E.J. Trill, A.C. Warwick, and B. Winterbourn. Daily and sub-daily hydrometeorological and soil data (2013-2019) [cosmos-uk]. 2021. URL: https://doi.org/10.5285/b5c190e4-e35d-40ea-8fbe-598da03a1185, doi:10.5285/b5c190e4-e35d-40ea-8fbe-598da03a1185.
Further references
Jonathan G. Evans, H. C. Ward, J. R. Blake, E. J. Hewitt, R. Morrison, M. Fry, L. A. Ball, L. C. Doughty, J. W. Libre, O. E. Hitt, D. Rylett, R. J. Ellis, A. C. Warwick, M. Brooks, M. A. Parkes, G. M.H. Wright, A. C. Singer, D. B. Boorman, and A. Jenkins. Soil water content in southern england derived from a cosmic-ray soil moisture observing system – cosmos-uk. Hydrological Processes, 30:4987–4999, 12 2016. doi:10.1002/hyp.10929.
M. Zreda, W. J. Shuttleworth, X. Zeng, C. Zweck, D. Desilets, T. Franz, and R. Rosolem. Cosmos: the cosmic-ray soil moisture observing system. Hydrology and Earth System Sciences, 16(11):4079–4099, 2012. URL: https://hess.copernicus.org/articles/16/4079/2012/, doi:10.5194/hess-16-4079-2012.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the outputs of the notebook "Deep learning and variational inversion to quantify and attribute climate change (CIRC23)" published in The Environmental Data Science Book.
https://www.datainsightsmarket.com/privacy-policy
The Python IDE market is booming, projected to reach $1.08B by 2033 at an 8.1% CAGR. Explore key trends, leading companies (PyCharm, Eclipse, AWS Cloud9), and regional market analysis in this comprehensive report. Discover the impact of cloud-based IDEs and the future of Python development tools.