Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Python Llc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
Python Logistics Llc Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.1
Python 3.6.8
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-01-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-01-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
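As a quick sanity check that the restore worked and the connection string is picked up, a minimal sketch (assuming SQLAlchemy and a PostgreSQL driver such as psycopg2 are installed in the environment):
import os
from sqlalchemy import create_engine, text

# Open a connection with the JUP_DB_CONNECTION string exported above.
engine = create_engine(os.environ["JUP_DB_CONNECTION"])
with engine.connect() as connection:
    print(connection.execute(text("SELECT 1")).scalar())  # prints 1 if the restored database is reachable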
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.6:
conda create -n py36 python=3.6
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
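To verify the notification setup, a minimal sketch using yagmail with the variables exported above (this is only an illustrative test, not part of the collection scripts):
import os
import yagmail

# Send a test notification using the Gmail address and OAuth2 file configured above.
yag = yagmail.SMTP(
    os.environ["JUP_EMAIL_LOGIN"],
    oauth2_file=os.path.expanduser(os.environ["JUP_OAUTH_FILE"]),
)
yag.send(
    to=os.environ["JUP_EMAIL_TO"],
    subject="ghstudy e-mail test",
    contents="If you can read this, the notification setup works.",
)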
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Ballroom Python South Company Export Import Records. Follow the Eximpedia platform for HS code, importer-exporter records, and customs shipment details.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Build a trapezoidal PCHE two-channel model using FLUENT software as the processing unit, run simulations under different input conditions, obtain the corresponding results, and export them as CSV files. Use Python for data processing: remove unnecessary information columns and combine selected information from each file to form a snapshot matrix CSV file. After processing the snapshot matrix CSV file in Python, import it into MATLAB for prediction, and finally export the MATLAB results as a result CSV file.
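A minimal sketch of the Python processing step described above; the file pattern and column names are illustrative assumptions, not the actual export layout:
import glob
import pandas as pd

# Read each exported FLUENT CSV, keep only the columns of interest, and stack one
# result column per simulation case into a snapshot matrix (names are illustrative).
snapshots = []
for path in sorted(glob.glob("exports/case_*.csv")):
    case = pd.read_csv(path)
    case = case[["position", "temperature"]]          # drop unnecessary columns
    snapshots.append(case.set_index("position")["temperature"].rename(path))

snapshot_matrix = pd.concat(snapshots, axis=1)
snapshot_matrix.to_csv("snapshot_matrix.csv")          # handed over to MATLAB afterwards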
http://researchdatafinder.qut.edu.au/display/n4066
Exports required model results as text files, for reading into MATLAB. Must be run from the command line (after loading the Abaqus module) thus: abaqus python ExportModelEndState02a.py Output databases... QUT Research Data Repository Dataset Resource available for download
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies listed in requirements.txt:
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
To reproduce the analyses, run Jupyter in this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to the yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should unmount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
This project developed a comprehensive data management system designed to support collaborative groundwater research across institutions by establishing a centralized, structured database for hydrologic time series data. Built on the Observations Data Model (ODM), the system stores time series data and metadata in a relational SQLite database. Key project components included database construction, automation of data formatting and importation, development of analytical and visualization tools, and integration with ArcGIS for geospatial representation. The data import workflow standardizes and validates diverse .csv datasets by aligning them with ODM formatting. A Python-based module was created to facilitate data retrieval, analysis, visualization, and export, while an interactive map feature enables users to explore site-specific data availability. Additionally, a custom ArcGIS script was implemented to generate maps that incorporate stream networks, site locations, and watershed boundaries using DEMs from USGS sources. The system was tested using real-world datasets from groundwater wells and surface water gages across Utah, demonstrating its flexibility in handling diverse formats and parameters. The relational structure enabled efficient querying and visualization, and the developed tools promoted accessibility and alignment with FAIR principles.
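As an illustration of the kind of retrieval-and-export operation the Python module supports, a minimal sketch against an ODM-style SQLite database (the file name and site code are hypothetical; table and column names follow the ODM convention and should be checked against the actual schema):
import sqlite3
import pandas as pd

# Open the ODM SQLite database (file name is hypothetical) and pull one site's series.
connection = sqlite3.connect("odm_timeseries.sqlite")
query = """
    SELECT dv.LocalDateTime, dv.DataValue
    FROM DataValues AS dv
    JOIN Sites AS s ON s.SiteID = dv.SiteID
    WHERE s.SiteCode = ?
    ORDER BY dv.LocalDateTime
"""
series = pd.read_sql_query(query, connection, params=("EXAMPLE-WELL-01",))

# Export the retrieved time series to CSV for analysis or plotting.
series.to_csv("example_well_timeseries.csv", index=False)
connection.close()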
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geographic Diversity in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Geographic Diversity in Public Code Contributions - An Exploratory Large-Scale Study Over 50 Years. In 19th International Conference on Mining Software Repositories (MSR ’22), May 23-24, Pittsburgh, PA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3524842.3528471
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, …), all of which are available for multiple architectures and OSs. It is advisable to create a Python virtual environment and install the following PyPI packages:
click==8.0.4 cycler==0.11.0 fonttools==4.31.2 kiwisolver==1.4.0 matplotlib==3.5.1 numpy==1.22.3 packaging==21.3 pandas==1.4.1 patsy==0.5.2 Pillow==9.0.1 pyparsing==3.0.7 python-dateutil==2.8.2 pytz==2022.1 scipy==1.8.0 six==1.16.0 statsmodels==0.13.2
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/. We retrieved these data from Software Heritage, in collaboration with the archive operators, taking an archive snapshot as of 2021-07-07. We cannot make these data available in full as part of the replication package due to both its volume and the presence in it of personal information such as user email addresses. However, equivalent data (stripped of email addresses) can be obtained from the Software Heritage archive dataset, as documented in the article: Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli, The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Pages 138-142, IEEE 2019. http://dx.doi.org/10.1109/MSR.2019.00030. Once retrieved, the data can be loaded in PostgreSQL to populate swh-replica.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Zone detection by email
Run the email detection script to create author-country-by-email.tab.zst
sh> pv authors--plausible.csv.zst | zstdcat | ./guess_country_by_email.py -f 3 2> author-country-by-email.csv.log | zstdmt > author-country-by-email.tab.zst
Database creation and initial data ingestion
Create the PostgreSQL DB
sh> createdb zones-commit
Note that, from now on, lines prefixed with the psql> prompt assume that psql is being executed on the zones-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection by name
Extract commits data from the DB and create commits.tab, which is used as input for the zone detection script
sh> psql -f extract_commits.sql zones-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Ingest zones assignment data into the DB
psql> \copy commit_zone from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries to extract the data to plot from the DB. This creates commit_zones_7120.tab, author_zones_7120_t5.tab, commit_zones_7120.grid and author_zones_7120_t5.grid. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, …).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles.
sh> ./create_stackedbar_chart.py -w 20 -s 1971 -f commit_zones_7120.grid -f author_zones_7120_t5.grid -o chart.pdf
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive reproduces a figure titled "Figure 3.2 Boone County population distribution" from Wang and vom Hofe (2007, p. 60). The archive provides a Jupyter Notebook that uses Python and can be run in Google Colaboratory. The workflow uses the Census API to retrieve data, reproduce the figure, and ensure reproducibility for anyone accessing this archive.
The Python code was developed in Google Colaboratory, or Google Colab for short, which is an Integrated Development Environment (IDE) of JupyterLab and streamlines package installation, code collaboration, and management. The Census API is used to obtain population counts from the 2000 Decennial Census (Summary File 1, 100% data). Shapefiles are downloaded from the TIGER/Line FTP Server. All downloaded data are maintained in the notebook's temporary working directory while in use. The data and shapefiles are stored separately with this archive. The final map is also stored as an HTML file.
The notebook features extensive explanations, comments, code snippets, and code output. The notebook can be viewed in a PDF format or downloaded and opened in Google Colab. References to external resources are also provided for the various functional components. The notebook features code that performs the following functions:
install/import necessary Python packages
download the Census Tract shapefile from the TIGER/Line FTP Server
download Census data via the Census API
manipulate Census tabular data
merge Census data with the TIGER/Line shapefile
apply a coordinate reference system
calculate land area and population density
map and export the map to HTML
export the map to ESRI shapefile
export the table to CSV
The notebook can be modified to perform the same operations for any county in the United States by changing the State and County FIPS code parameters for the TIGER/Line shapefile and Census API downloads. The notebook can be adapted for use in other environments (i.e., Jupyter Notebook) as well as reading and writing files to a local or shared drive, or cloud drive (i.e., Google Drive).
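A condensed sketch of that workflow is shown below; the Census API endpoint and variable, the FIPS codes, the shapefile path, and the field names are illustrative assumptions, and the notebook itself remains the authoritative code:
import requests
import pandas as pd
import geopandas as gpd

# Query the 2000 Decennial Census SF1 API for total population (P001001) by tract.
# Endpoint, variable, and FIPS codes (state 21, county 015) are assumptions; check
# the Census API documentation before relying on them.
url = ("https://api.census.gov/data/2000/dec/sf1"
       "?get=P001001&for=tract:*&in=state:21%20county:015")
rows = requests.get(url).json()
pop = pd.DataFrame(rows[1:], columns=rows[0])
pop["P001001"] = pop["P001001"].astype(int)
pop["GEOID"] = pop["state"] + pop["county"] + pop["tract"]

# Merge with a TIGER/Line tract shapefile, compute density, and export.
tracts = gpd.read_file("tl_2010_21015_tract00.shp")       # hypothetical local path
tracts["GEOID"] = tracts["CTIDFP00"]                      # 2000-tract GEOID field (assumption)
merged = tracts.merge(pop, on="GEOID").to_crs(epsg=26916) # projected CRS so areas are in meters
merged["sq_miles"] = merged.geometry.area / 2.59e6        # m^2 to square miles
merged["density"] = merged["P001001"] / merged["sq_miles"]
merged.to_file("tract_density.shp")                       # ESRI shapefile export
merged.drop(columns="geometry").to_csv("tract_density.csv", index=False)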
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This study provides data from the World Bank's PovcalNet on the distribution of household income and consumption across populations for 942 country-years, organized in dta and csv files by region. Each distribution contains 10,000 data points, one for each 0.01 incremental increase in percent of people living in households at or below a given income or consumption level. In addition, a data set containing the estimated parameters of the Beta and General Quadratic Lorenz curves is provided. For reference, we also provide the Python scripts used to query the PovcalNet online tool and export data from the Mongo database used to store results of these queries, along with all do files used to clean and construct the final data sets and summary statistics.
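For reference, the Beta Lorenz curve whose parameters are provided has a simple closed form, L(p) = p - θ·p^γ·(1 - p)^δ; a minimal sketch of evaluating it (the parameter values below are illustrative, not estimates from the data set):
import numpy as np

def beta_lorenz(p, theta, gamma, delta):
    """Beta Lorenz curve: L(p) = p - theta * p**gamma * (1 - p)**delta."""
    p = np.asarray(p, dtype=float)
    return p - theta * p ** gamma * (1.0 - p) ** delta

# Cumulative income share at selected population shares, for illustrative parameters.
print(beta_lorenz([0.1, 0.5, 0.9], theta=0.5, gamma=0.9, delta=0.6))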
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Worldwide Gender Differences in Public Code Contributions - Replication Package
This document describes how to replicate the findings of the paper: Davide Rossi and Stefano Zacchiroli, 2022, Worldwide Gender Differences in Public Code Contributions. In Software Engineering in Society (ICSE-SEIS'22), May 21-29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3510458.3513011
This document comes with the software needed to mine and analyze the data presented in the paper.
Prerequisites
These instructions assume the use of the bash shell, the Python programming language, the PostgreSQL DBMS (version 11 or later), the zstd compression utility and various usual *nix shell utilities (cat, pv, ...), all of which are available for multiple architectures and OSs.
It is advisable to create a Python virtual environment and install the following PyPI packages: click==8.0.3 cycler==0.10.0 gender-guesser==0.4.0 kiwisolver==1.3.2 matplotlib==3.4.3 numpy==1.21.3 pandas==1.3.4 patsy==0.5.2 Pillow==8.4.0 pyparsing==2.4.7 python-dateutil==2.8.2 pytz==2021.3 scipy==1.7.1 six==1.16.0 statsmodels==0.13.0
Initial data
swh-replica, a PostgreSQL database containing a copy of Software Heritage data. The schema for the database is available at https://forge.softwareheritage.org/source/swh-storage/browse/master/swh/storage/sql/.
names.tab - forenames and surnames per country with their frequency
zones.acc.tab - countries/territories, timezones, population and world zones
c_c.tab - ccTLD entities - world zones matches
Data preparation
Export data from the swh-replica database to create commits.csv.zst and authors.csv.zst
sh> ./export.sh
Run the authors cleanup script to create authors--clean.csv.zst
sh> ./cleanup.sh authors.csv.zst
Filter out implausible names and create authors--plausible.csv.zst
sh> pv authors--clean.csv.zst | unzstd | ./filter_names.py 2> authors--plausible.csv.log | zstdmt > authors--plausible.csv.zst
Gender detection
Run the gender detection script to create author-fullnames-gender.csv.zst
sh> pv authors--plausible.csv.zst | unzstd | ./guess_gender.py --fullname --field 2 | zstdmt > author-fullnames-gender.csv.zst
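guess_gender.py is part of the package; as an illustration of the library it relies on (gender-guesser, listed in the prerequisites), a minimal sketch of classifying forenames:
import gender_guesser.detector as gender

# gender-guesser classifies forenames into 'male', 'female', 'mostly_male',
# 'mostly_female', 'andy' (ambiguous), or 'unknown'.
detector = gender.Detector(case_sensitive=False)
for forename in ["Ada", "Linus", "Sasha"]:
    print(forename, detector.get_gender(forename))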
Database creation and data ingestion
Create the PostgreSQL DB
sh> createdb gender-commit
Note that, from now on, lines prefixed with the psql> prompt assume that psql is being executed on the gender-commit database.
Import data into PostgreSQL DB
sh> ./import_data.sh
Zone detection
Extract commits data from the DB and create commits.tab, which is used as input for the gender detection script
sh> psql -f extract_commits.sql gender-commit
Run the world zone detection script to create commit_zones.tab.zst
sh> pv commits.tab | ./assign_world_zone.py -a -n names.tab -p zones.acc.tab -x -w 8 | zstdmt > commit_zones.tab.zst
Use ./assign_world_zone.py --help if you are interested in changing the script parameters.
Ingest zones assignment data into the DB
psql> \copy commit_culture from program 'zstdcat commit_zones.tab.zst | cut -f1,6 | grep -Ev ''\s$'''
Extraction and graphs
Run the script to execute the queries to extract the data to plot from the DB. This creates commits_tz.tab, authors_tz.tab, commits_zones.tab, authors_zones.tab, and authors_zones_1620.tab. Edit extract_data.sql if you wish to modify extraction parameters (start/end year, sampling, ...).
sh> ./extract_data.sh
Run the script to create the graphs from all the previously extracted tabfiles. This creates commits_tzs.pdf, authors_tzs.pdf, commits_zones.pdf, authors_zones.pdf, and authors_zones_1620.pdf.
sh> ./create_charts.sh
Additional graphs
This package also includes some already-made graphs:
authors_zones_1.pdf: stacked graphs showing the ratio of female authors per world zone through the years, considering all authors with at least one commit per period
authors_zones_2.pdf: ditto, with at least two commits per period
authors_zones_10.pdf: ditto, with at least ten commits per period
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Flow data and model parameters to configure and run the pyDODOC model for estimating dissolved organic carbon (DOC) export from the Barmah floodplain in the Murray-Darling Basin (MDB) for different temperature, litter load concentration and flooding scenarios. Model outputs including DOC leaching, DOC-consumed and net DOC export from the floodplain are included in the output file for each scenario. The data were produced as a part of the Murray-Darling Basin Ecosystem Function (MDB-EF) project. Lineage: Measured flow and water quality data (e.g. DO, DOC) were obtained from the water monitoring agencies of New South Wales (NSW) and Victoria (Vic). The outputs were produced by executing the pyDODOC model for different temperatures and leaf litter concentrations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction
This publication contains the research data and example scripts for the paper “Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction” [1]. The research data is found in the directory research_data; the example scripts are found in the directory example_scripts.
The research data contains all necessary information to be able to reconstruct the figures and values given in the paper, as well as all result figures shown. Where possible, the directories contain the necessary scripts to recreate the results themselves, up to stochastic variations.
The example scripts are intended to show how one can (i), perform a least-square type optimization of a model function (here we focus our efforts on the analytical model functions MGH17 and Gauss3, as described in the paper) using various methods (BTVO, LM, BO, L-BFGS-B, NM, including using derivative information when applicable), and (ii), perform Markov chain Monte Carlo (MCMC) sampling around the found maximum likelihood estimate (MLE) to estimate the uncertainties of the MLE parameter (both using a surrogate model of the actual model function, as well as using the actual model function directly).
Research data
Contained are directories for the experimental problem GIXRF, and the two analytical model functions MGH17 and Gauss3. What follows is a listing of directories and the contents:
gauss3_optimization: Optimization logs for the Gauss3 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.
mgh17_optimization: Optimization logs for the MGH17 model function for BTVO, LM, BO, L-BFGS-B, NM (with derivatives when applicable), .npy files used for generating the plots, a benchmark.py file used for the generation of the data, as well as the plots shown in the paper.
mgh17_mcmc_analytical: Scripts for the creation of the plots (does not use an optimization log), as well as plots shown in the paper. This uses the model function directly to perform the MCMC sampling.
mgh17_mcmc_surrogate: Optimization log of the MGH17 function used for the creation of the MCMC plots, scripts for the creation of the plots (use the optimization log), as well as plots shown in the paper. This uses a surrogate model to perform the MCMC sampling.
gixrf_optimization: benchmark.py file to perform the optimization, the optimization logs for the various methods (BTVO, LM, BO, L-BFGS-B, NM), .npy files and scripts used for the creation of the plots, and the plots shown in the paper.
gixrf_mcmc_supplement: optimization log used for the creation of the plot, pickle file used for the creation of the plot, script to create the MCMC plot.
gixrf_optimum_difference_supplement: optimization logs of BTVO optimization of the GIXRF problem, scripts to create the difference/error plots shown for the GIXRF problem in the supplement, and the plots themselves.
Employed software for creating the research data
The software used in the creation is:
JCMsuite Analysis and Optimization toolkit, development version, commit d55e99b (the closest commercial release is found in JCMsuite version 5.0.2)
A list of Python packages installed (excerpt from conda list, name and version)
corner 2.1.0
emcee 3.0.2
jax 0.2.22
jaxlib 0.1.72
matplotlib 3.2.1
numba 0.40.1
numpy 1.18.1
pandas 0.24.1
python 3.7.11
scikit-optimize 0.7.4
scipy 1.7.1
tikzplotlib 0.9.9
JCMsuite 4.6.3 for the evaluation of the experimental model
Example scripts
This directory contains a few sample files that show how parameter reconstructions can be performed using the JCMsuite analysis and optimization toolbox, with a particular focus on the Bayesian target-vector optimization method shown in the paper.
It also contains example files that show how an uncertainty quantification can be performed using MCMC, both directly using a model function, as well as using a surrogate model of the model function.
What follows is a listing of the contents of the directory:
mcmc_mgh17_analytical.py: performs an MCMC analysis of the MGH17 model function directly, without constructing a surrogate model. Uses emcee.
mcmc_mgh17_surrogate.py: performs an MCMC analysis of the MGH17 model function by constructing a surrogate model of the model function. Uses the JCMsuite analysis and optimization toolbox.
opt_gauss3.py: performs a parameter reconstruction of the Gauss3 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).
opt_mgh17.py: performs a parameter reconstruction of the MGH17 model function using various methods (BTVO, LM, BO, L-BFGS-B, NM, with derivatives when applicable).
util/model_functions.py: contains the MGH17 and Gauss3 model functions, their (automatic) derivatives, and objective functions used in the optimizations.
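util/model_functions.py holds the authoritative definitions; as a rough illustration of a least-squares reconstruction of the MGH17 model function with standard SciPy tooling (not the paper's BTVO method, and with illustrative rather than certified parameter values):
import numpy as np
from scipy.optimize import least_squares

def mgh17(b, x):
    # MGH17 (Osborne) model: y = b1 + b2*exp(-b4*x) + b3*exp(-b5*x)
    return b[0] + b[1] * np.exp(-b[3] * x) + b[2] * np.exp(-b[4] * x)

# Synthetic data from illustrative parameter values, plus a little noise.
rng = np.random.default_rng(0)
x = np.arange(33) * 10.0
b_true = np.array([0.4, 1.9, -1.5, 0.013, 0.022])
y = mgh17(b_true, x) + rng.normal(scale=1e-3, size=x.size)

# Least-squares reconstruction from a nearby starting point.
fit = least_squares(lambda b: mgh17(b, x) - y, x0=[0.5, 1.5, -1.0, 0.01, 0.02])
print(fit.x)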
Requirements to execute the example scripts
These scripts have been developed and tested under Linux, Debian 10. We have tried to make sure that they would also work in a Windows environment, but can unfortunately give no guarantees for that.
We mainly use Python to run the reconstructions. To execute the files, a few Python packages have to be installed. In addition to the usual scientific Python stack (NumPy, SciPy, matplotlib, pandas, etc.), the packages jax and jaxlib (for automatic differentiation of Python/NumPy functions), emcee and corner (for MCMC sampling and subsequent plotting of the results) have to be installed.
This can be achieved for example using pip, e.g.
pip install -r requirements.txt
Additionally, JCMsuite has to be installed. For this you can visit [2] and download a free trial version.
On Linux, the installation has to be added to the PATH, e.g. by adding the following to your .bashrc file:
export JCMROOT=/FULL/PATH/TO/BASE/DIRECTORY
export PATH=$JCMROOT/bin:$PATH
export PYTHONPATH=$JCMROOT/ThirdPartySupport/Python:$PYTHONPATH
Bibliography
[1] M. Plock, K. Andrle, S. Burger, P.-I. Schneider, Bayesian Target-Vector Optimization for Efficient Parameter Reconstruction. Adv. Theory Simul. 5, 2200112 (2022).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the research article titled Democracy and the Market for Culture, which investigates how different dimensions of democratic governance influence international trade in cultural goods. Drawing from UNESCO cultural exports data, World Bank GDP per capita, and the V-Dem Core v15 democracy indicators, the study uses fixed-effects OLS regression to analyze the institutional determinants of cultural export performance across countries from 2004 to 2019.
The dataset includes:
culture_exports.csv: UNESCO cultural goods export data (USD, by country-year).
GDPPC.csv: World Bank GDP per capita data (log-transformed).
V-Dem-CY-Core-v15.csv: Country-year democracy indicators from V-Dem v15.
Three Colab-ready Python notebooks (v2x_libdem.ipynb, v2x_partipdem.ipynb, v2x_polyarchy.ipynb) with full replication code for each regression model.
Open Google Colab:
Go to https://colab.research.google.com/
Upload Files to Colab:
You can drag the files into the file panel (left sidebar) or use this code to upload:
from google.colab import files
uploaded = files.upload()
Install Required Libraries (if needed):
Most notebooks use pandas, statsmodels, and matplotlib:
!pip install pandas statsmodels matplotlib
Run a Notebook:
Open v2x_libdem.ipynb, v2x_partipdem.ipynb, or v2x_polyarchy.ipynb in Colab.
Click "Runtime" > "Run all" or run each cell individually to replicate the regression results.
Customize or Extend:
You can plug in other democracy or economic indicators from the V-Dem dataset to test additional hypotheses. Modify the regression specification or add visualization outputs using seaborn or plotly.
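A minimal sketch of the kind of fixed-effects OLS specification described above; the column names and merge keys are assumptions for illustration, and the notebooks contain the authoritative replication code:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Merge exports, GDP per capita, and one V-Dem indicator on country-year.
exports = pd.read_csv("culture_exports.csv")    # assumed columns: country, year, exports_usd
gdppc = pd.read_csv("GDPPC.csv")                # assumed columns: country, year, log_gdppc
vdem = pd.read_csv("V-Dem-CY-Core-v15.csv", usecols=["country_name", "year", "v2x_libdem"])

panel = (exports.merge(gdppc, on=["country", "year"])
                .merge(vdem, left_on=["country", "year"], right_on=["country_name", "year"]))
panel["log_exports"] = np.log(panel["exports_usd"])

# Fixed-effects OLS: country and year dummies absorb unit and time effects.
result = smf.ols("log_exports ~ v2x_libdem + log_gdppc + C(country) + C(year)", data=panel).fit()
print(result.params[["v2x_libdem", "log_gdppc"]])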
License: Creative Commons Attribution 4.0 (CC BY 4.0)
Citation:
Anon Annotator. (2025). Democracy and the Market for Culture: Institutional Determinants of Cultural Goods Exports [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.16417376
A first trial for merging the Google Earth Engine API into HydroShare.
Google Earth Engine API: https://developers.google.com/earth-engine/api_docs
The code here only provides a connection to the Earth Engine API; it does not yet interact with HS data or functions.
It uses code published by Erik Tyler (https://github.com/tylere/eeus2017-python) to demonstrate adequate installation and setup.
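A minimal connectivity sketch with the earthengine-api package, assuming credentials have already been stored (for example via the earthengine authenticate flow):
import ee

# Initialize an Earth Engine session with previously stored credentials.
ee.Initialize()

# Fetch metadata for a public image to confirm the connection works.
image = ee.Image("USGS/SRTMGL1_003")
print(image.getInfo()["bands"][0]["id"])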
To Do:
Import/export data from HS to Earth Engine
Permanent storing of the Google EE key
Fix weird behavior of the leaflet Python module
Separate code into folders
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Replication Package
This repository contains data and source files needed to replicate our work described in the paper "Unboxing Default Argument Breaking Changes in Scikit Learn".
Requirements
We recommend the following requirements to replicate our study:
Package Structure
We relied on Docker containers to provide a working environment that is easier to replicate. Specifically, we configure the following containers:
data-analysis, an R-based container we used to run our data analysis.
data-collection, a Python container we used to collect Scikit's default arguments and detect them in client applications.
database, a Postgres container we used to store clients' data, obtained from Grotov et al.
storage, a directory used to store the data processed in data-analysis and data-collection. This directory is shared by both containers.
docker-compose.yml, the Docker file that configures all containers used in the package.
In the remainder of this document, we describe how to set up each container properly.
Using VSCode to Setup the Package
We selected VSCode as the IDE of choice because its extensions allow us to implement our scripts directly inside the containers. In this package, we provide configuration parameters for both the data-analysis and data-collection containers. This way you can directly access and run each container inside VSCode without any specific configuration.
You first need to set up the containers:
$ cd /replication/package/folder
$ docker-compose build
$ docker-compose up
# Wait docker creating and running all containers
Then, you can open them in Visual Studio Code:
If you want/need a more customized organization, the remainder of this file describes it in detail.
Longest Road: Manual Package Setup
Database Setup
The database container will automatically restore the dump in dump_matroskin.tar on its first launch. To set up and run the container, you should:
Build an image:
$ cd ./database
$ docker build --tag 'dabc-database' .
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
dabc-database latest b6f8af99c90d 50 minutes ago 18.5GB
Create and enter inside the container:
$ docker run -it --name dabc-database-1 dabc-database
$ docker exec -it dabc-database-1 /bin/bash
root# psql -U postgres -h localhost -d jupyter-notebooks
jupyter-notebooks=# \dt
List of relations
Schema | Name | Type | Owner
--------+-------------------+-------+-------
public | Cell | table | root
public | Code_cell | table | root
public | Md_cell | table | root
public | Notebook | table | root
public | Notebook_features | table | root
public | Notebook_metadata | table | root
public | repository | table | root
If you got the tables list as above, your database is properly set up.
It is important to mention that this database extends the one provided by Grotov et al. Basically, we added three columns to the table Notebook_features (API_functions_calls, defined_functions_calls, and other_functions_calls) containing the function calls performed by each client in the database.
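As an illustration, the added columns can be inspected from Python as well; a minimal sketch (the connection string mirrors the psql command above and is otherwise an assumption, and psycopg2 must be installed):
import pandas as pd
from sqlalchemy import create_engine

# Connection mirrors the psql command above (user postgres, database jupyter-notebooks);
# adjust host and credentials to your container network.
engine = create_engine("postgresql://postgres@localhost/jupyter-notebooks")

# Identifiers are quoted because the table and column names are capitalized.
df = pd.read_sql(
    'SELECT "API_functions_calls", "defined_functions_calls", "other_functions_calls" '
    'FROM "Notebook_features" LIMIT 5',
    engine,
)
print(df)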
Data Collection Setup
This container is responsible for collecting the data to answer our research questions. It has the following structure:
dabcs.py, extracts DABCs from the Scikit Learn source code and exports them to a CSV file.
dabcs-clients.py, extracts function calls from clients and exports them to a CSV file. We rely on a modified version of Matroskin to extract the function calls. You can find the tool's source code in the matroskin directory.
Makefile, commands to set up and run both dabcs.py and dabcs-clients.py.
matroskin, the directory containing the modified version of the Matroskin tool. We extended the library to collect the function calls performed in the client notebooks of Grotov's dataset.
storage, a docker volume where data-collection should save the exported data. This data will be used later in Data Analysis.
requirements.txt, Python dependencies adopted in this module.
Note that the container will automatically configure this module for you, e.g., install dependencies, configure Matroskin, download the Scikit Learn source code, etc. For this, you must run the following commands:
$ cd ./data-collection
$ docker build --tag "data-collection" .
$ docker run -it -d --name data-collection-1 -v $(pwd)/:/data-collection -v $(pwd)/../storage/:/data-collection/storage/ data-collection
$ docker exec -it data-collection-1 /bin/bash
$ ls
Dockerfile Makefile config.yml dabcs-clients.py dabcs.py matroskin storage requirements.txt utils.py
If you see project files, it means the container is configured accordingly.
Data Analysis Setup
We use this container to conduct the analysis over the data produced by the Data Collection container. It has the following structure:
dependencies.R, an R script containing the dependencies used in our data analysis.
data-analysis.Rmd, the R notebook we used to perform our data analysis.
datasets, a docker volume pointing to the storage directory.
Execute the following commands to run this container:
$ cd ./data-analysis
$ docker build --tag "data-analysis" .
$ docker run -it -d --name data-analysis-1 -v $(pwd)/:/data-analysis -v $(pwd)/../storage/:/data-collection/datasets/ data-analysis
$ docker exec -it data-analysis-1 /bin/bash
$ ls
data-analysis.Rmd datasets dependencies.R Dockerfile figures Makefile
If you see project files, it means the container is configured accordingly.
A note on the storage shared folder
As mentioned, the storage folder is mounted as a volume and shared between the data-collection and data-analysis containers. We compressed the content of this folder due to space constraints. Therefore, before starting to work on Data Collection or Data Analysis, make sure you have extracted the compressed files. You can do this by running the Makefile inside the storage folder.
$ make unzip # extract files
$ ls
clients-dabcs.csv clients-validation.csv dabcs.csv Makefile scikit-learn-versions.csv versions.csv
$ make zip # compress files
$ ls
csv-files.tar.gz Makefile
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Kubernetes Security Landscape: AI-Driven Insights from a Comprehensive Analysis of Developer Discussions
This zip archive is organized into four main directories, plus a handful of supporting files. It contains everything you need to recreate the results of our research under the Open Research initiative.
assets/: Images and compiled assets used in the research paper
data/: Exported raw data sources, including the data dump, CSV export, and Pickle files
research/: Jupyter Notebooks demonstrating how the data was processed within the paper
scripts/: Python scripts for scraping - exposed by cli.py
cli.py: Python script which exposes a command-line interface (CLI) that allows for easier interaction with the scraping scripts in the scripts/ directory
poetry.lock: Dependency package manifest for use with the poetry install command and the Poetry package manager.
requirements.txt: Python/Pip project configuration file if you prefer to use Pip instead of Poetry.
README.md:
These data are single-beam bathymetry points compiled in comma separated values (CSV) file format, generated from a hydrographic survey of the northern portion of Lake Calumet in Cook County, Illinois. Hydrographic data were collected July 18-19, 2023, using a single-beam echosounder (SBES) integrated with a Global Navigation Satellite System (GNSS) mounted on a marine survey vessel. Surface water elevation data were collected July 18 utilizing a single-base real-time kinematic (RTK)/GNSS unit. Bathymetric data points were collected as the vessel traversed the northern portions of the lake along overlapping survey lines. The SBES internally collected and stored the depth data from the echosounder and the horizontal and vertical position data of the vessel from the GNSS in real time. Data processing required specialized computer software to export bathymetry data from the raw data files. A Python script was written to calculate the lakebed elevations and identify outliers in the dataset. These data are provided in comma separated values (CSV) format as LakeCalumet_SBES_20230718.csv. Data points are stored as a series of x (longitude), y (latitude), and z (elevation or depth) points along with variable length records specific to the data transects.
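The elevation and outlier-screening step described above can be sketched in a few lines of Python; the column names and the water-surface elevation value are illustrative assumptions, and the actual processing script is not included here:
import pandas as pd

# Load the single-beam points and derive lakebed elevation from a surveyed
# water-surface elevation; column names and the elevation value are illustrative.
points = pd.read_csv("LakeCalumet_SBES_20230718.csv")
water_surface_elevation = 176.5  # illustrative value, meters above datum
points["lakebed_elevation"] = water_surface_elevation - points["depth"]

# Flag elevations more than three standard deviations from the mean as outliers.
z_scores = (points["lakebed_elevation"] - points["lakebed_elevation"].mean()) / points["lakebed_elevation"].std()
points["outlier"] = z_scores.abs() > 3
print(points["outlier"].sum(), "suspect points flagged")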