100+ datasets found
  1. Python Import Data India – Buyers & Importers List

    • seair.co.in
    Cite
    Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    India
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  2. Python-DPO-Large

    • huggingface.co
    Updated Mar 15, 2023
    + more versions
    Cite
    NextWealth Entrepreneurs Private Limited (2023). Python-DPO-Large [Dataset]. https://huggingface.co/datasets/NextWealth/Python-DPO-Large
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2023
    Dataset authored and provided by
    NextWealth Entrepreneurs Private Limited
    Description

    Dataset Card for Python-DPO

    This dataset is the larger version of the Python-DPO dataset and has been created using Argilla.

      Load with datasets
    

    To load this dataset with the datasets library, install it with pip install datasets --upgrade and then use the following code:

    from datasets import load_dataset

    ds = load_dataset("NextWealth/Python-DPO")

      Data Fields
    

    Each data instance contains:

    instruction: The problem description/requirements
    chosen_code: …

    See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO-Large.
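
    As a minimal sketch of inspecting these fields once the dataset is loaded (assuming a standard "train" split; only the instruction and chosen_code fields are named above):

    from datasets import load_dataset

    ds = load_dataset("NextWealth/Python-DPO-Large")
    # "train" split assumed, as is typical for Hugging Face datasets
    example = ds["train"][0]
    print(example["instruction"])
    print(example["chosen_code"])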

  3. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
    Explore at:
    Available download formats: zip (12200606 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  4. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  5. Python frameworks used in data science 2021

    • statista.com
    Updated Jun 15, 2022
    Cite
    Statista (2022). Python frameworks used in data science 2021 [Dataset]. https://www.statista.com/statistics/1338424/python-use-frameworks-data-science/
    Explore at:
    Dataset updated
    Jun 15, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2021 - Dec 2021
    Area covered
    Worldwide
    Description

    Python is one of the most popular programming languages among data scientists, partly due to its varied packages and capabilities. In 2021, Numpy and Pandas were the most used Python frameworks for data science, with a ** percent and ** percent share respectively.

  6. Python code used to download U.S. Census Bureau data for public-supply water service areas

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Python code used to download U.S. Census Bureau data for public-supply water service areas [Dataset]. https://catalog.data.gov/dataset/python-code-used-to-download-u-s-census-bureau-data-for-public-supply-water-service-areas
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This child item describes Python code used to query census data from the TigerWeb Representational State Transfer (REST) services and the U.S. Census Bureau Application Programming Interface (API). These data were needed as input feature variables for a machine learning model to predict public supply water use for the conterminous United States. Census data were retrieved for public-supply water service areas, but the census data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the census data collector code were used as input features in the public supply delivery and water use machine learning models. This page includes the following file: census_data_collector.zip - a zip file containing the census data collector Python code used to retrieve data from the U.S. Census Bureau and a README file.
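
    As a minimal sketch of the kind of query such a collector might issue (this is not the released code; the ACS endpoint, year, variable code, and geography below are illustrative assumptions), using the requests library:

    import requests

    # Illustrative query: total population (B01001_001E) from the ACS 5-year
    # estimates for every county. The released census_data_collector.zip code
    # instead retrieves data for public-supply water service areas.
    url = "https://api.census.gov/data/2020/acs/acs5"
    params = {"get": "NAME,B01001_001E", "for": "county:*"}
    resp = requests.get(url, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()  # first row is the header
    print(rows[0])
    print(rows[1])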

  7. Code Similarity Dataset – Python Variants

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Hem Ajit Patel (2025). Code Similarity Dataset – Python Variants [Dataset]. https://www.kaggle.com/datasets/hemajitpatel/code-similarity-dataset-python-variants
    Explore at:
    Available download formats: zip (39806 bytes)
    Dataset updated
    Jul 6, 2025
    Authors
    Hem Ajit Patel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Code Similarity Dataset – Python Variants

    A collection of code snippets solving common programming problems in multiple variations.

    Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying:

    • Code similarity
    • Plagiarism detection
    • AI-based code search
    • Code classification
    • Semantic code retrieval

    📚 What's Inside?

    The dataset includes the following tasks:

    • Reverse a String
    • Find Max in List
    • Check if a Number is Prime
    • Check if a String is a Palindrome
    • Generate Fibonacci Sequence

    Each task contains:

    • 20 variations of code
    • Metadata file describing method and notes
    • README with usage instructions

    Column Descriptions

    The full_metadata.csv file contains the following fields:

    • problem_type: The programming task solved (e.g., reverse_string, max_in_list)
    • id: Unique ID of the snippet within that problem group
    • filename: Filename of the code snippet (e.g., snip_01.py)
    • language: Programming language used (Python)
    • method: Type of approach used (e.g., Slicing, Recursive, While loop)
    • notes: Additional details about the logic or style used in the snippet

    🗂 Folder Structure

    CodeSimilarityDataset/
    ├── reverse_string/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── max_in_list/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_prime/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_palindrome/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── fibonacci/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    └── full_metadata.csv  ← Combined metadata across all problems

    🔍 Use Cases

    • Train models to detect similar code logic
    • Build plagiarism detection systems
    • Improve code recommendation engines
    • Teach students about code variation
    • Benchmark code search algorithms

    🧪 Sample Applications

    Visualize logic type distribution

    Compare structural similarity (AST/difflib/token matching)

    Cluster similar snippets using embeddings

    Train code-style-aware LLMs

    📦 File Formats

    All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.

    🛠 How to Use

    You can load metadata easily with Python:

    import pandas as pd

    df = pd.read_csv('full_metadata.csv')
    print(df.sample(5))

    Then read any snippet:

    with open("reverse_string/snippets/snip_01.py") as f:
        code = f.read()
    print(code)
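
    To illustrate the kind of similarity comparison listed above (difflib/token matching), here is a minimal sketch using only the standard library; it assumes a second variant snip_02.py exists alongside snip_01.py:

    import difflib

    # Compare two variants of the same task from the reverse_string problem folder
    with open("reverse_string/snippets/snip_01.py") as f1, \
         open("reverse_string/snippets/snip_02.py") as f2:
        a, b = f1.read(), f2.read()

    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"Similarity ratio: {ratio:.2f}")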

    📄 License

    This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.

  8. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detail levels)
    - at least 16Gb of RAM (64 preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  9. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    Available download formats: zip (93408 bytes)
    Dataset updated
    Oct 5, 2022
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
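
    Since the dataset is intended for the GroupBy tutorial above, here is a minimal sketch of a grouped aggregation on the filtered file (column names are taken from the commands above; the local file path is an assumption):

    import pandas as pd

    df = pd.read_csv('NYC_Jobs_filtered_cols.csv')

    # Average posted salary range per agency
    salary_by_agency = df.groupby('Agency')[['Salary Range From', 'Salary Range To']].mean()
    print(salary_by_agency.sort_values('Salary Range To', ascending=False).head())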

  10. Auditory cortex single unit population activity during natural sound presentation -- dataset

    • data.niaid.nih.gov
    Updated Jun 15, 2023
    Cite
    Pennington, Jacob; David, Stephen (2023). Auditory cortex single unit population activity during natural sound presentation -- dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7796573
    Explore at:
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Oregon Health & Science University
    Washington State University, Vancouver
    Authors
    Pennington, Jacob; David, Stephen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    High-density multi-channel neurophysiology data were collected from primary (A1) and secondary (PEG) fields of auditory cortex of passively listening ferrets during presentation of a large natural sound library. Single unit spikes were sorted using Kilosort. This dataset includes spike times for 849 A1 units and 398 PEG units. Stimulus waveforms were transformed to log-spaced spectrograms for analysis (18 channels, 10 ms time bins). Data set includes raw sound waveforms as well.

    The authors request that any publication using this data cite the following work: https://www.biorxiv.org/content/10.1101/2022.06.10.495698v2

    Data format/description

    Neural data are stored in two files. All recordings were performed during presentation of the same natural sound library.

    recordings/A1_NAT4_ozgf.fs100.ch18.tgz - data from 849 A1 single units and log spectrogram of stimuli aligned with spike times.

    recordings/PEG_NAT4_ozgf.fs100.ch18.tgz - data from 398 PEG single units and log spectrogram of stimuli aligned with spike times.

    wav.zip - raw wav files. Note: only the first 1 s of each wav file was presented during experiments; the recordings themselves are longer.

    Example scripts

    Python scripts included with this dataset demonstrate how to load the neural data and perform a CNN model fit. Running the scripts requires the NEMS0 python library, which is available open source at https://github.com/lbhb/NEMS0.

    Quick install

    Create and activate a new conda environment:

    conda create -n NEMS0 python=3.7
    conda activate NEMS0

    Download NEMS0:

    git clone https://github.com/lbhb/NEMS0

    Install NEMS0:

    pip install -e NEMS0

    Detailed instructions for installing NEMS0 are available in the Github repository (https://github.com/lbhb/NEMS0).

    Demo scripts

    Once NEMS0 is installed and the data are downloaded, move to the directory where the data and demo scripts are stored and run them in a NEMS0 environment.

    pop_cnn_load.py - Load the A1 data and compare predictions for two neurons (Fig 3) by two population models (stage 1 fit complete). Illustrates how to load the data using Python.

    pop_cnn_fit.py - Load a pre-fit A1 population model (stage 1) and complete stage 2 fit (refinement) for a single neuron. Illustrates use of NEMS0 for CNN model fitting.

    Funding

    Data collection, software development and processing were supported by funding from the NIH (R01DC014950, R01EB028155).

  11. booksum

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). booksum [Dataset]. https://www.tensorflow.org/datasets/catalog/booksum
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    BookSum: A Collection of Datasets for Long-form Narrative Summarization

    This implementation currently only supports book and chapter summaries.

    GitHub: https://github.com/salesforce/booksum

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('booksum', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  12. tidybot

    • tensorflow.org
    Updated Dec 11, 2024
    + more versions
    Cite
    (2024). tidybot [Dataset]. https://www.tensorflow.org/datasets/catalog/tidybot
    Explore at:
    Dataset updated
    Dec 11, 2024
    Description

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('tidybot', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. Code to import PSCAD data into Python (Spyder)

    • ieee-dataport.org
    Updated Nov 20, 2025
    Cite
    Franz Guzman Llanos (2025). Code to import PSCAD data into Python (Spyder) [Dataset]. https://ieee-dataport.org/documents/code-import-pscad-data-python-spyder
    Explore at:
    Dataset updated
    Nov 20, 2025
    Authors
    Franz Guzman Llanos
    Description

    minimizes errors

  14. Python code used to download gridMET climate data for public-supply water service areas

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    Updated Aug 27, 2024
    + more versions
    Cite
    Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter (2024). Python code used to download gridMET climate data for public-supply water service areas [Dataset]. http://doi.org/10.5066/P9FUL880
    Explore at:
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 31, 2020
    Description

    This child item describes Python code used to retrieve gridMET climate data for a specific area and time period. Climate data were retrieved for public-supply water service areas, but the climate data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the climate data collector code were used as input feature variables in the public supply delivery and water use machine learning models. This page includes the following file: climate_data_collector.zip - a zip file containing the climate data collector Python code used to retrieve climate data and a README file.
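
    As a minimal sketch of working with gridMET data in Python once a NetCDF file has been downloaded (this is not the released collector code; the file name, variable, and coordinate names below are illustrative assumptions), using xarray:

    import xarray as xr

    # Hypothetical local copy of a gridMET NetCDF file (e.g., daily precipitation).
    # The released climate_data_collector.zip code retrieves and aggregates such
    # data for public-supply water service areas.
    ds = xr.open_dataset("pr_2020.nc")
    print(ds)  # inspect variables, dimensions, and coordinate names

    # Illustrative subset to a bounding box and time window; adjust the
    # coordinate names (lat/lon/day) to match the file's actual metadata.
    subset = ds.sel(lat=slice(45.0, 40.0), lon=slice(-105.0, -100.0),
                    day=slice("2020-06-01", "2020-08-31"))
    print(subset)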

  15. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7


  16. Open data: Frequency mismatch negativity and visual load

    • su.figshare.com
    • researchdata.se
    • +1more
    pdf
    Updated Feb 23, 2021
    Cite
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund (2021). Open data: Frequency mismatch negativity and visual load [Dataset]. http://doi.org/10.17045/sthlmuni.7016369.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 23, 2021
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiens, S., van Berlekom, E., Szychowska, M., & Eklund, R. (2019). Visual Perceptual Load Does Not Affect the Frequency Mismatch Negativity. Frontiers in Psychology, 10(1970). doi:10.3389/fpsyg.2019.01970

    We manipulated visual perceptual load (high and low load) while we recorded electroencephalography. Event-related potentials (ERPs) were computed from these data.

    • OSF_*.pdf contains the preregistration at the Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/EWG9X
    • ERP_2019_rawdata_bdf.zip contains the raw EEG data files that were recorded with a BioSemi system (www.biosemi.com). The files can be opened in MATLAB with the FieldTrip toolbox: https://www.mathworks.com/products/matlab.html, http://www.fieldtriptoolbox.org/
    • ERP_2019_visual_load_fieldtrip_scripts.zip contains all the MATLAB scripts that were used to process the ERP data with the FieldTrip toolbox: http://www.fieldtriptoolbox.org/
    • ERP_2019_fieldtrip_mat_*.zip contain the final, preprocessed individual data files. They can be opened with MATLAB.
    • ERP_2019_visual_load_python_scripts.zip contains the Python scripts for the main task. They require Python (https://www.python.org/) and PsychoPy (http://www.psychopy.org/).
    • ERP_2019_visual_load_wmc_R_scripts.zip contains the R scripts to process the working memory capacity (wmc) data: https://www.r-project.org/
    • ERP_2019_visual_load_R_scripts.zip contains the R scripts to analyze the data and the output files with figures (e.g., scatterplots): https://www.r-project.org/

  17. Caltech-101

    • datasets.activeloop.ai
    • huggingface.co
    deeplake
    Updated Feb 3, 2022
    Cite
    Caltech (2022). Caltech-101 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/caltech-101-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Feb 3, 2022
    Dataset authored and provided by
    Caltech
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Dataset funded by
    National Science Foundation
    Description

    The Caltech-101 dataset contains images of objects from 101 categories, with roughly 40 to 800 images per category. The images vary in size (roughly 300 x 200 pixels) and are mostly in color. The dataset is used to train and evaluate machine learning models for the task of object recognition.

  18. Data from: Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States (ver. 2.0, August 2024)

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 19, 2025
    Cite
    U.S. Geological Survey (2025). Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States (ver. 2.0, August 2024) [Dataset]. https://catalog.data.gov/dataset/public-supply-water-use-reanalysis-for-the-2000-2020-period-by-huc12-month-and-year-for-th
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    The U.S. Geological Survey is developing national water-use models to support water resources management in the United States. Model benefits include a nationally consistent estimation approach, greater temporal and spatial resolution of estimates, efficient and automated updates of results, and capabilities to forecast water use into the future and assess model uncertainty. The term "reanalysis" refers to the process of reevaluating and recalculating water-use data using updated or refined methods, data sources, models, or assumptions. In this data release, water use refers to water that is withdrawn by public and private water suppliers and includes water provided for domestic, commercial, industrial, thermoelectric power, and public water uses, as well as water that is consumed or lost within the public supply system. Consumptive use refers to water withdrawn by the public supply system that is evaporated, transpired, incorporated into products or crops, or consumed by humans or livestock.

    This data release contains data used in a machine learning model (child item 2) to estimate monthly water use for communities that are supplied by public-supply water systems in the conterminous United States for 2000-2020. It also contains associated scripts used to produce input features (child items 4-8) as well as model water use estimates by 12-digit hydrologic unit code (HUC12) and public supply water service area (WSA). HUC12 boundaries are in child item 3. Public supply delivery and consumptive use estimates are in child items 1 and 9, respectively.

    First posted: November 1, 2023. Revised: August 8, 2024. This version replaces the previous version of the data release: Luukkonen, C.L., Alzraiee, A.H., Larsen, J.D., Martin, D.J., Herbert, D.M., Buchwald, C.A., Houston, N.A., Valseth, K.J., Paulinski, S., Miller, L.D., Niswonger, R.G., Stewart, J.S., and Dieter, C.A., 2023, Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States: U.S. Geological Survey data release, https://doi.org/10.5066/P9FUL880

    Version 2.0: This data release has been updated as of 8/8/2024. The previous version was replaced because some fractions used for downscaling WSA estimates to HUC12 did not sum to one for some WSAs in Virginia. Updated model water use estimates by HUC12 are included in this version, and a change was made in two scripts to check for this condition. Output files have also been updated to preserve the leading zero in the HUC12 codes. Additional files are included to provide information about mapping the WSAs and groundwater and surface water fractions to HUC12 and to provide public supply water-use estimates by WSA. The 'Machine learning model that estimates total monthly and annual per capita public supply water use' child item has been updated with these corrections and additional files. A new child item, 'R code used to estimate public supply consumptive water use', has been added to provide estimates of public supply consumptive use.

    This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day
    • PS_WSA_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by WSA, in million gallons per day
    • PS_WSA_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by WSA, in million gallons per day
    • PS_WSA_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by WSA, in million gallons per day
    • change_files_format.py - a Python script used to change the water use estimates by WSA and HUC12 files from wide format to the thin and long format
    • version_history.txt - a txt file describing changes in this version

    Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.

    The data release is organized into these items:

    1. Machine learning model that estimates public supply deliveries for domestic and other use types - The public supply delivery model estimates total delivery of domestic, commercial, industrial, institutional, and irrigation (CII) water use for public supply water service areas within the conterminous United States. This item contains model input datasets, code used to build the delivery machine learning model, and output predictions.
    2. Machine learning model that estimates total monthly and annual per capita public supply water use - The public supply water use model estimates total monthly water use for 12-digit hydrologic units within the conterminous United States. This item contains model input datasets, code used to build the water use machine learning model, and output predictions.
    3. National watershed boundary (HUC12) dataset for the conterminous United States, retrieved 10/26/2020 - Spatial data consisting of a shapefile with 12-digit hydrologic units for the conterminous United States retrieved 10/26/2020.
    4. Python code used to determine average yearly and monthly tourism per 1000 residents for public-supply water service areas - This code was used to create a feature for the public supply model that provides information for areas affected by population increases due to tourism.
    5. Python code used to download gridMET climate data for public-supply water service areas - The climate data collector is a tool used to query climate data which are used as input features in the public supply models.
    6. Python code used to download U.S. Census Bureau data for public-supply water service areas - The census data collector is a geographic based tool to query census data which are used as input features in the public supply models.
    7. R code that determines buying and selling of water by public-supply water service areas - This code was used to create a feature for the public supply model that indicates whether public-supply systems buy water, sell water, or neither buy nor sell water.
    8. R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units - This code was used to determine source water fractions (groundwater and/or surface water) for public supply systems and HUC12s.
    9. R code used to estimate public supply consumptive water use - This code was used to estimate public supply consumptive water use using an assumed fraction of deliveries for outdoor irrigation and estimates of evaporative demand. This item contains estimated monthly public supply consumptive use datasets by HUC12 and WSA.
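
    As a minimal sketch of the wide-to-long conversion that change_files_format.py performs (this is not the released script; the column layout below is an illustrative assumption), using pandas:

    import pandas as pd

    # Assumed wide layout: one row per HUC12 and one column per month/year of
    # estimates. Read HUC12 as a string to preserve the leading zero.
    wide = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"HUC12": str})

    # Melt the month/year columns into a long format with one value per row
    long = wide.melt(id_vars=["HUC12"], var_name="month_year",
                     value_name="withdrawal_mgd")
    print(long.head())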

  19. Data from: dolma

    • tensorflow.org
    Updated Mar 14, 2025
    Cite
    (2025). dolma [Dataset]. https://www.tensorflow.org/datasets/catalog/dolma
    Explore at:
    Dataset updated
    Mar 14, 2025
    Description

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dolma', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  20. IoT Sports Training Load Dataset

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Python Developer (2025). IoT Sports Training Load Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/iot-sports-training-load-dataset
    Explore at:
    Available download formats: zip (226978 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 2300 multimodal IoT sensor recordings collected from athletes during traditional sports training sessions, including basketball, soccer, running, and other athletic activities. The dataset includes heart rate, acceleration (X, Y, Z), gyroscope readings (X, Y, Z), speed, step count, jump height, and training load. It is designed to facilitate analysis of athlete performance, training load monitoring, and predictive modeling for sports science applications.
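
    As a minimal sketch of loading and summarizing the recordings with pandas (the file name inside the download is an assumption; adjust the path and column names to the actual CSV):

    import pandas as pd

    # Hypothetical file name inside the Kaggle download
    df = pd.read_csv("iot_sports_training_load.csv")

    print(df.shape)       # expected: roughly 2300 rows of sensor recordings
    print(df.describe())  # summary statistics for heart rate, acceleration, etc.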
