100+ datasets found

Python Import Data India – Buyers & Importers List
seair.co.in
Cite
Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
India
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Code to import PSCAD data into Python (Spyder)
ieee-dataport.org
Updated Nov 20, 2025
Cite
Franz Guzman Llanos (2025). Code to import PSCAD data into Python (Spyder) [Dataset]. https://ieee-dataport.org/documents/code-import-pscad-data-python-spyder
Dataset updated
Nov 20, 2025
Authors
Franz Guzman Llanos
Description
minimizes errors

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

zenodo.org

application/gzip, bin +2

Updated Aug 2, 2024

+ more versions

Cite

Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Available download formats: bin, application/gzip, zip, text/x-python

Unique identifier

https://doi.org/10.5281/zenodo.1419788

Dataset updated

Aug 2, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Marat Valiev; Bogdan Vasilescu; James Herbsleb

License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description

Replication pack, FSE2018 submission #164:
------------------------------------------

**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. 
Link to the code will be included in the Camera Ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
 themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):
- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

   git clone https://gitlab.com/user2589/ghd.git
   git -C ghd checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it as a current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt` . 
- disable all APIs except GitHub (Bitbucket and Gitlab support were
 not yet implemented when this study was in progress): edit
 `scraper/__init__.py`, comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step
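
Option 2.a can also be done from Python. The following is a minimal sketch, not part of the replication pack: it assumes only that the archive contains a member whose name ends in `survival_data.csv`. The helper name `read_csv_from_zip` is hypothetical, and the column layout of the file is not assumed.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_path, member_suffix="survival_data.csv"):
    """Return the rows of the first archive member whose name ends with member_suffix."""
    with zipfile.ZipFile(zip_path) as archive:
        # The folder layout inside the archive is an unknown here, so match by suffix.
        matches = [name for name in archive.namelist() if name.endswith(member_suffix)]
        if not matches:
            raise FileNotFoundError(f"{member_suffix} not found in {zip_path}")
        with archive.open(matches[0]) as raw:
            # ZipFile.open yields bytes; wrap it so the csv module receives text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Calling `read_csv_from_zip("dataset_minimal_Jan_2018.zip")` would return one dictionary per row, keyed by whatever header the file carries. Reading the member in place avoids unpacking the whole archive to disk.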

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table.
The whole process will take 15 to 30 minutes.

- create a folder `

Python code used to download U.S. Census Bureau data for public-supply water...
catalog.data.gov
data.usgs.gov
Updated Nov 19, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Python code used to download U.S. Census Bureau data for public-supply water service areas [Dataset]. https://catalog.data.gov/dataset/python-code-used-to-download-u-s-census-bureau-data-for-public-supply-water-service-areas
Dataset updated
Nov 19, 2025
Dataset provided by
U.S. Geological Survey
Description
This child item describes Python code used to query census data from the TigerWeb Representational State Transfer (REST) services and the U.S. Census Bureau Application Programming Interface (API). These data were needed as input feature variables for a machine learning model to predict public supply water use for the conterminous United States. Census data were retrieved for public-supply water service areas, but the census data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the census data collector code were used as input features in the public supply delivery and water use machine learning models. This page includes the following file: census_data_collector.zip - a zip file containing the census data collector Python code used to retrieve data from the U.S. Census Bureau and a README file.
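
To illustrate what a query against the Census Bureau API looks like, here is a minimal sketch. It is not the USGS census data collector, whose code is only available in the zip file. It assumes the publicly documented URL pattern `https://api.census.gov/data/<year>/<dataset>` with `get`, `for`, and `in` parameters. The function name `build_census_query` and the example variable and geography values are illustrative assumptions, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

CENSUS_API_ROOT = "https://api.census.gov/data"


def build_census_query(year, dataset, variables, geography, within=None, key=None):
    """Build a Census Data API URL; nothing is sent over the network."""
    params = [("get", ",".join(variables)), ("for", geography)]
    if within:
        params.append(("in", within))
    if key:
        # An API key is optional for light use; pass one if you have it.
        params.append(("key", key))
    # Leave ':', '*' and ',' unescaped so the geography syntax stays readable.
    query = urlencode(params, safe=":*,")
    return f"{CENSUS_API_ROOT}/{year}/{dataset}?{query}"
```

For example, `build_census_query(2020, "dec/pl", ["NAME", "P1_001N"], "county:*", within="state:30")` produces a URL requesting one population count per county in a single state. The resulting URL could then be fetched with any HTTP client.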
Into to Python data
kaggle.com
zip
Updated Oct 31, 2024
Cite
Alex Bushnell (2024). Into to Python data [Dataset]. https://www.kaggle.com/datasets/alexbushnell/into-to-python-data/code
Available download formats: zip (186718 bytes)
Dataset updated
Oct 31, 2024
Authors
Alex Bushnell
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Alex Bushnell

Released under MIT

Contents
Storage and Transit Time Data and Code
data.niaid.nih.gov
zenodo.org
Updated Jun 12, 2024
+ more versions
Cite
Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136816
Dataset updated
Jun 12, 2024
Dataset provided by
Montana State University
Authors
Andrew Felton
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Author: Andrew J. Felton
Date: 5/5/2024

This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis and figure production for the study entitled:

"Global estimates of the storage and transit time of water through vegetation"

Please note that 'turnover' and 'transit' are used interchangeably in this project.

Data information:

The data folder contains key data sets used for analysis. In particular:

"data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five-year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available in separate .csv files for each land cover type. Other data can be found for the minimum, monthly, and seasonal transit time in their respective folders. These data were produced using the Python code found in the "supporting_code" folder, given the ease of working with .nc files and the EASE grid in the xarray Python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data" folders primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.

Code information

Python scripts can be found in the "supporting_code" folder.

Each R script in this project has a particular function:

01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.

02_functions.R: This script contains the custom functions for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.

03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.

04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each landcover type. You load in these data from the 01_start.R script using the source() function.

05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each landcover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.

06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study, which then get saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.
Open Context Database SQL Dump
zenodo.org
data-staging.niaid.nih.gov
+2 more
zip
Updated Jan 23, 2025
+ more versions
Cite
Eric Kansa; Sarah Whitcher Kansa (2025). Open Context Database SQL Dump [Dataset]. http://doi.org/10.5281/zenodo.14728229
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14728229
Dataset updated
Jan 23, 2025
Dataset provided by
Open Context
Authors
Eric Kansa; Sarah Whitcher Kansa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Open Context (https://opencontext.org) publishes free and open access research data for archaeology and related disciplines. An open source (but bespoke) Django (Python) application supports these data publishing services. The software repository is here: https://github.com/ekansa/open-context-py

The Open Context team runs ETL (extract, transform, load) workflows to import data contributed by researchers from various source relational databases and spreadsheets. Open Context uses a PostgreSQL (https://www.postgresql.org) relational database to manage these imported data in a graph-style schema. The Open Context Python application interacts with the PostgreSQL database via the Django Object-Relational Mapper (ORM).

This database dump includes all published structured data organized and used by Open Context (table names that start with 'oc_all_'). The binary media files referenced by these structured data records are stored elsewhere. Binary media files for some projects, still in preparation, are not yet archived with long-term digital repositories.

These data comprehensively reflect the structured data currently published and publicly available on Open Context. Other data (such as user and group information) used to run the Website are not included.

IMPORTANT

This database dump contains data from roughly 190+ different projects. Each project dataset has its own metadata and citation expectations. If you use these data, you must cite each data contributor appropriately, not just this Zenodo archived database dump.
Python Import Data in February - Seair.co.in
seair.co.in
Updated Feb 18, 2016
Cite
Seair Exim (2016). Python Import Data in February - Seair.co.in [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Feb 18, 2016
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
Malaysia, Gibraltar, Austria, Nauru, Argentina, Slovakia, Tokelau, Timor-Leste, French Guiana, Korea (Democratic People's Republic of)
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Data from: Get Started With GIS in Open Source Python Workshop
qubeshub.org
Updated Oct 15, 2019
Cite
Leah Wasser; Jenny Palomino; Joe McGlinchy (2019). Get Started With GIS in Open Source Python Workshop [Dataset]. https://qubeshub.org/publications/1441
Dataset updated
Oct 15, 2019
Dataset provided by
QUBES
Authors
Leah Wasser; Jenny Palomino; Joe McGlinchy
Description
There is a suite of powerful open source Python libraries that can be used to work with spatial data. Learn how to use geopandas, rasterio and matplotlib to plot and manipulate spatial data in Python.
Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...
zenodo.org
bz2
Updated Mar 15, 2021
+ more versions
Cite
João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
Available download formats: bz2
Unique identifier
https://doi.org/10.5281/zenodo.2592524
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
João Felipe; Leonardo; Vanessa; Juliana
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

This repository contains two files:

dump.tar.bz2

jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.

archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.

paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

In the remainder of this text, we give instructions for reproducing the analyses, using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-03-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
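
Before starting the notebooks, it can help to confirm that the variable is set and well formed. The sketch below is an assumption-labelled illustration, not code from the archive: it reads `JUP_DB_CONNECTION` and splits it with the standard library instead of sqlalchemy, so it never opens a database connection. The function names `describe_connection` and `connection_from_env` are hypothetical.

```python
import os
from urllib.parse import urlsplit


def describe_connection(url):
    """Split a database URL into its parts without opening a connection."""
    parts = urlsplit(url)
    return {
        "dialect": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        # Report only whether a password is present, never its value.
        "has_password": parts.password is not None,
    }


def connection_from_env(env=None, name="JUP_DB_CONNECTION"):
    """Read the connection string from the environment and fail early if it is unset."""
    env = os.environ if env is None else env
    url = env.get(name)
    if not url:
        raise RuntimeError(f"{name} is not set")
    return describe_connection(url)
```

With the example value above, `connection_from_env()` reports the dialect `postgresql`, the host `hostname`, and the database `jupyter`. A misspelled variable name surfaces immediately as a `RuntimeError`.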

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.7:

conda create -n analyses python=3.7
conda activate analyses

Go to the analyses folder and install all the dependencies of the requirements.txt

cd jupyter_reproducibility/analyses
pip install -r requirements.txt

For reproducing the analyses, run jupyter on this folder:

jupyter notebook

Execute the notebooks in this order:

Index.ipynb

N0.Repository.ipynb

N1.Skip.Notebook.ipynb

N2.Notebook.ipynb

N3.Cell.ipynb

N4.Features.ipynb

N5.Modules.ipynb

N6.AST.ipynb

N7.Name.ipynb

N8.Execution.ipynb

N9.Cell.Execution.Order.ipynb

N10.Markdown.ipynb

N11.Repository.With.Notebook.Restriction.ipynb

N12.To.Paper.ipynb
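
If the notebooks are run from a script, note that a plain alphabetical sort places N10 before N2. The sketch below shows one way to recover the order listed above from the file names. It assumes the naming scheme shown in the list (an optional `N<number>.` prefix, with `Index.ipynb` first); the helper names are hypothetical.

```python
import re

_PREFIX = re.compile(r"^N(\d+)\.")


def execution_key(name):
    """Sort key: un-numbered notebooks (such as Index.ipynb) first, then N0, N1, ..."""
    match = _PREFIX.match(name)
    if match is None:
        return (0, 0, name)
    # Compare the prefix as an integer so that N2 sorts before N10.
    return (1, int(match.group(1)), name)


def in_execution_order(names):
    """Return the notebook file names sorted by their numeric prefix."""
    return sorted(names, key=execution_key)
```

The ordered list can then be handed to whichever notebook runner is in use.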

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction

# Frequency of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
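
Environment variables always arrive as strings, so a script consuming them has to convert the numeric ones. The following sketch shows one way to do that for a few of the variables above. It is not taken from the archaeology package: the class `CollectionSettings`, the function `load_settings`, and the rule that `"1"` means enabled are assumptions made for illustration.

```python
import os
from dataclasses import dataclass

REQUIRED = ("JUP_BASE_DIR", "JUP_MAX_SIZE", "JUP_NOTEBOOK_TIMEOUT", "JUP_WITH_EXECUTION")


@dataclass(frozen=True)
class CollectionSettings:
    base_dir: str
    max_size_gb: float
    notebook_timeout: int
    with_execution: bool


def load_settings(env=None):
    """Convert the string-valued JUP_* variables into typed settings."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError("unset variables: " + ", ".join(missing))
    return CollectionSettings(
        base_dir=env["JUP_BASE_DIR"],
        max_size_gb=float(env["JUP_MAX_SIZE"]),
        notebook_timeout=int(env["JUP_NOTEBOOK_TIMEOUT"]),
        # Assumption: "1" enables execution and any other value disables it.
        with_execution=env["JUP_WITH_EXECUTION"] == "1",
    )
```

Failing on the first call with a list of unset names is usually easier to diagnose than a crash hours into a long collection run.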

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 Anaconda environments, one of each per Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI, so make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.7

Use python program language to extract key phrases from patent text
scidb.cn
Updated Oct 19, 2022
Cite
Yu Yan (2022). Use python program language to extract key phrases from patent text [Dataset]. http://doi.org/10.57760/sciencedb.03770
Unique identifier
https://doi.org/10.57760/sciencedb.03770
Dataset updated
Oct 19, 2022
Dataset provided by
Science Data Bank
Authors
Yu Yan
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Use python program language to extract key phrases from patent text
Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s...
dataverse.harvard.edu
search.dataone.org
Updated Mar 24, 2023
Cite
Woodger Faugas (2023). Python Web Scraping and Data Analysis: Gorilla Specimens from Chicago’s Field Museum [Dataset]. http://doi.org/10.7910/DVN/ELAZCU
Unique identifier
https://doi.org/10.7910/DVN/ELAZCU
Dataset updated
Mar 24, 2023
Dataset provided by
Harvard Dataverse
Authors
Woodger Faugas
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Chicago
Description
An in-depth analysis of millions of data entries from Chicago’s Field Museum yielded insights into 25 Gorilla specimens, spanning biogeography, zoology, primatology, and biological anthropology. Taxonomically, all specimens examined belong to the kingdom Animalia, phylum Chordata, class Mammalia, order Primates, and family Hominidae. They fall under the genus Gorilla and species gorilla, with most belonging to the subspecies Gorilla gorilla gorilla and some categorized only as Gorilla gorilla. By sex, 16 specimens (64% of the total) are identified as male and 5 (20%) as female, while 4 (16%) have no sex assigned. The collectors, all of whom are identified by name, obtained most of these specimens from unidentified zoos; a few specimens were sourced from Ward’s Natural Science Establishment, a well-known supplier of natural science materials to North American museums. The specimens were collected between 1975 and 1993, although some entries lack this information. Multiple preparations have been made from the specimens, encompassing skulls, skeletons, skins, and endocrine organs, mounted and alcohol-preserved. Despite these preparations, tissue samples and coordinates are largely unavailable for the 25 specimens on record, which limits further research or analysis. Tissue sampling is available for a single specimen, IRN 2661980. Only one specimen, IRN 2514759, has a specified geographical location, given as “Africa, West Africa, West Indies,” while the rest are listed as “Unknown/None, Zoo,” signaling that no entry is available.
Python code to extract data from the Field Museum’s zoological collections records and online database include the contents of the .py file herewith attached. This code constitutes a web scraping algorithm, retrieving data from the above-mentioned website, processing it, and storing it in a structured format. To achieve these tasks, it first imports necessary libraries by drawing on requests for making HTTP requests, Pandas for handling data, time for introducing delays, lxml for parsing HTML, and BeautifulSoup for web scraping. Furthermore, this algorithm defines the main URL for searching for Gorilla gorilla specimens before setting up headers for making HTTP requests, e.g., User-Agent and other headers to mimic a browser request. Next, an HTTP GET request to the main URL is made, and the response text is obtained. The next step consists of parsing the response text using BeautifulSoup and lxml. Extracting information from the search results page (e.g., Internal Record Number, Catalog Subset, Higher Classification, Catalog Number, Taxonomic Name, DwC Locality, Collector/field, Collection No., Coordinates Available, Tissue Available, and Sex) comes next. This information is then stored in a list called basic_data. The algorithm subsequently iterates through each record in basic_data, and accesses its detailed information page by making another HTTP GET request with the extracted URL. For each detailed information page, the code thereafter extracts additional data (e.g., FM Catalog, Scientific Name, Phylum, Class, Order, Family, Genus, Species, Field Number, Collector, Collection No., Geography, Date Collected, Preparations, Tissue Available, Co-ordinates Available, and Sex). Correspondingly, this information is stored in a list called main_data. The above algorithm processes the final main_data list and converts it into a structured format, i.e., a CSV file.
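
The first stage of that pipeline (parse a results table into `basic_data`, then write a CSV) can be sketched as follows. This is not the attached .py file. To stay self-contained it swaps in the standard library's `html.parser` for BeautifulSoup and lxml, and it parses an inline sample instead of issuing HTTP requests. The table layout, the field names, and the record numbers in `SAMPLE_HTML` are made-up placeholders, and the second stage that visits each detail page is omitted.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th> cells and stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_records(html, fields):
    """Turn each table row into a dict keyed by the given field names."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [dict(zip(fields, row)) for row in parser.rows]


def to_csv(records, fields):
    """Serialize the records to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder page content; real search results would come from an HTTP GET.
SAMPLE_HTML = """
<table>
  <tr><th>IRN</th><th>Taxonomic Name</th><th>Sex</th></tr>
  <tr><td>0000001</td><td>Gorilla gorilla</td><td>Male</td></tr>
  <tr><td>0000002</td><td>Gorilla gorilla gorilla</td><td>Female</td></tr>
</table>
"""
FIELDS = ["irn", "taxonomic_name", "sex"]
basic_data = parse_records(SAMPLE_HTML, FIELDS)
csv_text = to_csv(basic_data, FIELDS)
```

In a live scraper, the HTML passed to `parse_records` would be the response text of each request, with a delay between requests as the description mentions.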
Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter...
zenodo.org
bz2
Updated Mar 15, 2021
Cite
Anonymous (2021). Reproducibility in Practice: Dataset of a Large-Scale Study of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2546834
Available download formats: bz2
Unique identifier
https://doi.org/10.5281/zenodo.2546834
Dataset updated
Mar 15, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anonymous
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

This repository contains two files:

dump.tar.bz2

jupyter_reproducibility.tar.bz2

The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.

archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.

paper: empty. The notebook analyses/N11.To.Paper.ipynb moves data to it

In the remainder of this text, we give instructions for reproducing the analyses, using the data provided in the dump, and for reproducing the collection, by collecting data from GitHub again.

Reproducing the Analysis

This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38

First, download dump.tar.bz2 and extract it:

tar -xjf dump.tar.bz2

It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

psql jupyter < db2019-01-13.dump

It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Create a conda environment with Python 3.6:

conda create -n py36 python=3.6

Go to the analyses folder and install all the dependencies listed in requirements.txt:

cd jupyter_reproducibility/analyses
pip install -r requirements.txt

To reproduce the analyses, run jupyter in this folder:

jupyter notebook

Execute the notebooks in this order:

N0.Index.ipynb

N1.Repository.ipynb

N2.Notebook.ipynb

N3.Cell.ipynb

N4.Features.ipynb

N5.Modules.ipynb

N6.AST.ipynb

N7.Name.ipynb

N8.Execution.ipynb

N9.Cell.Execution.Order.ipynb

N10.Markdown.ipynb

N11.To.Paper.ipynb
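The order above is numeric, not alphabetical: a plain string sort would place N10 and N11 before N2. As a small sketch (the function name `notebook_index` is an assumption, not something from the replication package), the intended order can be recovered from the file names like this:

```python
import re


def notebook_index(filename):
    """Return the integer after the leading 'N' (e.g. 10 for N10.Markdown.ipynb)."""
    match = re.match(r"N(\d+)\.", filename)
    if match is None:
        raise ValueError(f"not an analysis notebook name: {filename!r}")
    return int(match.group(1))


# A shuffled subset of the notebook names listed above.
names = [
    "N10.Markdown.ipynb",
    "N2.Notebook.ipynb",
    "N0.Index.ipynb",
    "N11.To.Paper.ipynb",
    "N1.Repository.ipynb",
]
ordered = sorted(names, key=notebook_index)
print(ordered)
```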

Reproducing or Expanding the Collection

The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

Requirements

This time, we have extra requirements:

All the analysis requirements
lbzip2 2.5
gcc 7.3.0
GitHub account
Gmail account

Environment

First, set the following environment variables:

export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # run execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequency of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
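All of these variables reach Python as strings. The sketch below shows one way a script might convert a few of them to typed values. It is an illustration only: the function `load_settings`, the dictionary keys, and the use of the values above as defaults are assumptions, and the archaeology package may read its configuration differently.

```python
import os


def load_settings(env=os.environ):
    """Convert a few JUP_* strings to typed values (illustrative subset)."""
    return {
        "max_size_gb": float(env.get("JUP_MAX_SIZE", "8000.0")),
        "notebook_timeout": int(env.get("JUP_NOTEBOOK_TIMEOUT", "300")),
        "with_execution": env.get("JUP_WITH_EXECUTION", "1") == "1",
        "with_dependency": env.get("JUP_WITH_DEPENDENCY", "0") == "1",
        # The shell does not expand "~" inside double quotes, so expand it here.
        "oauth_file": os.path.expanduser(
            env.get("JUP_OAUTH_FILE", "~/oauth2_creds.json")
        ),
    }


# Example with explicit values instead of the real environment.
demo = load_settings(
    {
        "JUP_MAX_SIZE": "8000.0",
        "JUP_NOTEBOOK_TIMEOUT": "300",
        "JUP_WITH_EXECUTION": "1",
        "JUP_WITH_DEPENDENCY": "0",
    }
)
print(demo)
```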

Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but this is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

Scripts

Download and extract jupyter_reproducibility.tar.bz2:

tar -xjf jupyter_reproducibility.tar.bz2

Install 5 conda environments and 5 anaconda environments, one of each for every Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

Conda 2.7

conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 2.7

conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.4

It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2

Anaconda 3.4

conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.5

conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.5

It requires the manual installation of other anaconda packages.

conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.6

conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.6

conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Conda 3.7

conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology

Anaconda 3.7

When we
Analysis of flow at Lees Ferry on the Colorado River
hydroshare.org
search.dataone.org
zip
Updated Nov 29, 2018
Jaxon White (2018). Analysis of flow at Lees Ferry on the Colorado River [Dataset]. https://www.hydroshare.org/resource/ed3a40af3a2947a59c441f818adc6ea2
Available download formats: zip (5.0 KB)
Dataset updated
Nov 29, 2018
Dataset provided by
HydroShare
Authors
Jaxon White
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This resource connects to CUAHSI's web service called WaterOneFlow and performs a query to retrieve the data for the gage station on the Colorado River at Lees Ferry. The code in the Notebook does the following:

Collects the flow data for the gage based on the inputs

Pulls the 15-minute resolution data into one Python object

Resamples the object to a resolution of one day and finds the average, minimum, and maximum flows for that day

Plots the average, minimum, and maximum flows for comparison
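The resampling step can be sketched as follows. The notebook itself is not reproduced here, so this is an illustration under stated assumptions: it uses only the Python standard library, the function name `daily_summary` is hypothetical, and the readings are synthetic rather than retrieved from WaterOneFlow.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def daily_summary(readings):
    """Group (timestamp, flow) pairs by calendar day and summarize each day."""
    by_day = defaultdict(list)
    for timestamp, flow in readings:
        by_day[timestamp.date()].append(flow)
    return {
        day: {"mean": mean(flows), "min": min(flows), "max": max(flows)}
        for day, flows in sorted(by_day.items())
    }


# Two days of synthetic 15-minute readings (96 per day), not real gage data.
start = datetime(2018, 1, 1)
readings = [
    (start + timedelta(minutes=15 * i), 100.0 + (i % 96)) for i in range(192)
]
summary = daily_summary(readings)
for day, stats in summary.items():
    print(day, stats)
```

The three values per day are the series that the final plotting step would compare.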
Data from: ViTexOCR; a script to extract text overlays from digital video
catalog.data.gov
data.usgs.gov
+4 more
Updated Nov 19, 2025
+ more versions
U.S. Geological Survey (2025). ViTexOCR; a script to extract text overlays from digital video [Dataset]. https://catalog.data.gov/dataset/vitexocr-a-script-to-extract-text-overlays-from-digital-video
Dataset updated
Nov 19, 2025
Dataset provided by
U.S. Geological Survey
Description
The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date and time, as well as other ancillary data (such as speed, heading, or user input identifying fields). Embedding these data into videos provides them with utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible when only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, specify the text data types (e.g. latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script’s output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.
Python code used to download gridMET climate data for public-supply water...
data.usgs.gov
s.cnmilf.com
+1 more
Updated Aug 27, 2024
+ more versions
Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter (2024). Python code used to download gridMET climate data for public-supply water service areas [Dataset]. http://doi.org/10.5066/P9FUL880
Unique identifier
https://doi.org/10.5066/P9FUL880
Dataset updated
Aug 27, 2024
Dataset provided by
United States Geological Survey, http://www.usgs.gov/
Authors
Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Time period covered
Jan 1, 2000 - Dec 31, 2020
Description
This child item describes Python code used to retrieve gridMET climate data for a specific area and time period. Climate data were retrieved for public-supply water service areas, but the climate data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the climate data collector code were used as input feature variables in the public supply delivery and water use machine learning models. This page includes the following file: climate_data_collector.zip - a zip file containing the climate data collector Python code used to retrieve climate data and a README file.
Using Python to Access and Plot Streamflow data from NWIS
hydroshare.org
hydroshare.cuahsi.org
zip
Updated Apr 20, 2022
+ more versions
Aaron Sigman (2022). Using Python to Access and Plot Streamflow data from NWIS [Dataset]. https://www.hydroshare.org/resource/8553e1b0b1cd44b6885a5c6033b41038
Available download formats: zip (154.5 KB)
Dataset updated
Apr 20, 2022
Dataset provided by
HydroShare
Authors
Aaron Sigman
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1970 - Dec 31, 2021
Description
This code retrieves stream information from NWIS using the dataretrieval tool in Python. You can input the site, parameters, and dates at the top. The code pulls daily measurements, annual_stats, and daily_stats. We calculate 30-year normals and plot annual average flows, annual minimum, maximum, and mean flows, and percentile flows. This resource pulls only from the USGS ftp site and neither has nor requires any local storage.
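The annual statistics and the 30-year normal can be illustrated with a short sketch. It does not call NWIS or the dataretrieval package; the data are synthetic, the function names are hypothetical, and the normal is taken here as the mean of the annual means over a 30-year window, which is an assumption about the definition used in the resource.

```python
from statistics import mean


def annual_stats(daily_flows):
    """Summarize {(year, day_of_year): flow} into per-year mean, min, and max."""
    by_year = {}
    for (year, _day), flow in daily_flows.items():
        by_year.setdefault(year, []).append(flow)
    return {
        year: {"mean": mean(flows), "min": min(flows), "max": max(flows)}
        for year, flows in sorted(by_year.items())
    }


def normal(stats, first_year, last_year):
    """Mean of the annual means over an inclusive range of years."""
    return mean(stats[y]["mean"] for y in range(first_year, last_year + 1))


# Synthetic data: three flows per year, rising by 1 each year (not NWIS output).
daily_flows = {
    (year, day): float(year - 1990 + day)
    for year in range(1991, 2021)
    for day in (1, 2, 3)
}
stats = annual_stats(daily_flows)
normal_1991_2020 = normal(stats, 1991, 2020)
print(stats[1991], normal_1991_2020)
```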
Classicmodels
kaggle.com
zip
Updated Dec 15, 2024
Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
Available download formats: zip (65751 bytes)
Dataset updated
Dec 15, 2024
Authors
Javier Landaeta
License
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Abstract

This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

Methodology

1. Data Extraction:

A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.

A reusable function is created to read each table and load it into a Pandas DataFrame.

2. Data Cleansing and Transformation:

An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.

New variables are calculated, such as the total value of each sale, cost, and profit.

Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

3. Exploratory Data Analysis (EDA):

Key metrics such as total sales, number of unique customers, and average order value are calculated.

Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.

Results are visualized using relevant graphics (histograms, bar charts, etc.).

4. Modeling and Prediction:

Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

5. Report Generation:

Detailed reports are created in Pandas DataFrames format that answer specific business questions.

These reports are stored in new PostgreSQL tables for further analysis and visualization.
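The transformation step in the methodology (sale value, cost, and profit, joined across tables on a shared key) can be sketched as follows. This is not the project's code: plain dictionaries stand in for Pandas DataFrames, the rows are made up, and the column names `productCode`, `quantityOrdered`, `priceEach`, and `buyPrice` are assumed to match the usual classicmodels schema.

```python
def sales_with_profit(order_details, products):
    """Join order lines to products on productCode and add sale, cost, profit."""
    buy_price = {p["productCode"]: p["buyPrice"] for p in products}
    rows = []
    for line in order_details:
        sale = line["quantityOrdered"] * line["priceEach"]
        cost = line["quantityOrdered"] * buy_price[line["productCode"]]
        rows.append({**line, "sale": sale, "cost": cost, "profit": sale - cost})
    return rows


def profit_by_product(rows):
    """Total profit per productCode, highest first."""
    totals = {}
    for row in rows:
        code = row["productCode"]
        totals[code] = totals.get(code, 0) + row["profit"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration; not taken from the dataset.
products = [
    {"productCode": "A", "buyPrice": 50},
    {"productCode": "B", "buyPrice": 20},
]
order_details = [
    {"orderNumber": 1, "productCode": "A", "quantityOrdered": 2, "priceEach": 80},
    {"orderNumber": 1, "productCode": "B", "quantityOrdered": 5, "priceEach": 30},
    {"orderNumber": 2, "productCode": "A", "quantityOrdered": 1, "priceEach": 90},
]
rows = sales_with_profit(order_details, products)
ranking = profit_by_product(rows)
print(ranking)
```

Grouping the joined rows by customer or by date instead of by product follows the same pattern with a different key.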

Results

- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

Conclusions

This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

Technologies Used

- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
Python Import Data in August - Seair.co.in
seair.co.in
Updated Aug 20, 2016
Seair Exim (2016). Python Import Data in August - Seair.co.in [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Aug 20, 2016
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
Christmas Island, Nepal, Belgium, South Africa, Virgin Islands (U.S.), Lebanon, Gambia, Ecuador, Saint Pierre and Miquelon, Falkland Islands (Malvinas)
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Data from dynamic wind profile long-term operation of alkaline and PEM water...
data-legacy.fz-juelich.de
resodate.org
bin, csv +3
Updated May 21, 2025
Jülich DATA (2025). Data from dynamic wind profile long-term operation of alkaline and PEM water electrolysis with extraction of performance data in Python [Dataset]. http://doi.org/10.26165/JUELICH-DATA/PYGQTO
Available download formats: csv(754156), csv(136688970), text/x-python(2537), csv(2771450), text/x-python(72207), csv(55902591), zip(7575081), txt(255), csv(20952), csv(68679746), bin(69319)
Unique identifier
https://doi.org/10.26165/JUELICH-DATA/PYGQTO
Dataset updated
May 21, 2025
Dataset provided by
Jülich DATA
Description
We created a semi-synthetic wind profile from wind turbine data and converted it to current and potential profiles for PEM and alkaline water electrolysis cells with a maximum power output of 40 and 4 W respectively. Then we conducted dynamic electrolysis with these profiles for up to 961 h with PEMWE and AWE single cells. The data obtained from the dynamic operation are included in the dataset. We applied two analysis methods to our datasets in Python to extract performance data from the electrolysis cells, such as I-V curves, current-density-dependent cell voltage changes, and resistances. The Python code is also part of the dataset.
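The dataset's own analysis code is not reproduced here. As a hedged illustration of one common way to estimate a resistance from an I-V curve, the sketch below fits a straight line to voltage against current over a linear region and reads the slope as the resistance. The choice of method, the function name `fit_line`, and the data points are all assumptions.

```python
def fit_line(currents, voltages):
    """Ordinary least-squares fit of voltage = intercept + slope * current."""
    n = len(currents)
    mean_i = sum(currents) / n
    mean_u = sum(voltages) / n
    sxx = sum((i - mean_i) ** 2 for i in currents)
    sxy = sum((i - mean_i) * (u - mean_u) for i, u in zip(currents, voltages))
    slope = sxy / sxx
    return mean_u - slope * mean_i, slope


# Synthetic points on U = 1.5 V + 0.2 ohm * I (not measured data).
currents = [1.0, 2.0, 3.0, 4.0]
voltages = [1.7, 1.9, 2.1, 2.3]
intercept, resistance = fit_line(currents, voltages)
print(round(intercept, 6), round(resistance, 6))
```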

Python Import Data India – Buyers & Importers List
Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
27 scholarly articles cite this dataset
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
India
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
