100+ datasets found
  1. Python Import Data India – Buyers & Importers List

    • seair.co.in
    Cite
    Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Info Solutions PVT LTD
    Authors
    Seair Exim
    Area covered
    India
    Description

    Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.

  2. Python-DPO-Large

    • huggingface.co
    Updated Mar 15, 2023
    + more versions
    Cite
    NextWealth Entrepreneurs Private Limited (2023). Python-DPO-Large [Dataset]. https://huggingface.co/datasets/NextWealth/Python-DPO-Large
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 15, 2023
    Dataset authored and provided by
    NextWealth Entrepreneurs Private Limited
    Description

    Dataset Card for Python-DPO

    This dataset is the larger version of the Python-DPO dataset and has been created using Argilla.

      Load with datasets
    

    To load this dataset with the datasets library, install it with pip install datasets --upgrade and then use the following code:

    from datasets import load_dataset

    ds = load_dataset("NextWealth/Python-DPO")

      Data Fields
    

    Each data instance contains:

    instruction: The problem description/requirements
    chosen_code: …

    See the full description on the dataset page: https://huggingface.co/datasets/NextWealth/Python-DPO-Large.
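
    As a minimal sketch of inspecting these fields once the dataset is loaded (assuming a standard "train" split; only the instruction and chosen_code fields are named above):

    from datasets import load_dataset

    ds = load_dataset("NextWealth/Python-DPO-Large")
    # "train" split assumed, as is typical for Hugging Face datasets
    example = ds["train"][0]
    print(example["instruction"])
    print(example["chosen_code"])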

  3. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth/discussion
    Explore at:
    Available download formats: zip (12200606 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  4. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been updated.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  5. Python frameworks used in data science 2021

    • statista.com
    Updated Jun 15, 2022
    Cite
    Statista (2022). Python frameworks used in data science 2021 [Dataset]. https://www.statista.com/statistics/1338424/python-use-frameworks-data-science/
    Explore at:
    Dataset updated
    Jun 15, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Oct 2021 - Dec 2021
    Area covered
    Worldwide
    Description

    Python is one of the most popular programming languages among data scientists, partly due to its varied packages and capabilities. In 2021, Numpy and Pandas were the most used Python frameworks for data science, with a ** percent and ** percent share respectively.

  6. Python code used to download U.S. Census Bureau data for public-supply water service areas

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 19, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Python code used to download U.S. Census Bureau data for public-supply water service areas [Dataset]. https://catalog.data.gov/dataset/python-code-used-to-download-u-s-census-bureau-data-for-public-supply-water-service-areas
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    This child item describes Python code used to query census data from the TigerWeb Representational State Transfer (REST) services and the U.S. Census Bureau Application Programming Interface (API). These data were needed as input feature variables for a machine learning model to predict public supply water use for the conterminous United States. Census data were retrieved for public-supply water service areas, but the census data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the census data collector code were used as input features in the public supply delivery and water use machine learning models. This page includes the following file: census_data_collector.zip - a zip file containing the census data collector Python code used to retrieve data from the U.S. Census Bureau and a README file.
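
    As a minimal sketch of the kind of query such a collector might issue (this is not the released code; the ACS endpoint, year, variable code, and geography below are illustrative assumptions), using the requests library:

    import requests

    # Illustrative query: total population (B01001_001E) from the ACS 5-year
    # estimates for every county. The released census_data_collector.zip code
    # instead retrieves data for public-supply water service areas.
    url = "https://api.census.gov/data/2020/acs/acs5"
    params = {"get": "NAME,B01001_001E", "for": "county:*"}
    resp = requests.get(url, params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()  # first row is the header
    print(rows[0])
    print(rows[1])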

  7. Code Similarity Dataset – Python Variants

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Hem Ajit Patel (2025). Code Similarity Dataset – Python Variants [Dataset]. https://www.kaggle.com/datasets/hemajitpatel/code-similarity-dataset-python-variants
    Explore at:
    Available download formats: zip (39806 bytes)
    Dataset updated
    Jul 6, 2025
    Authors
    Hem Ajit Patel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Code Similarity Dataset – Python Variants

    A collection of code snippets solving common programming problems in multiple variations.

    Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying:

    • Code similarity
    • Plagiarism detection
    • AI-based code search
    • Code classification
    • Semantic code retrieval

    📚 What's Inside?

    The dataset includes the following tasks:

    • Reverse a String
    • Find Max in List
    • Check if a Number is Prime
    • Check if a String is a Palindrome
    • Generate Fibonacci Sequence

    Each task contains:

    • 20 variations of code
    • Metadata file describing method and notes
    • README with usage instructions

    Column Descriptions

    The full_metadata.csv file contains the following fields:

    • problem_type: The programming task solved (e.g., reverse_string, max_in_list)
    • id: Unique ID of the snippet within that problem group
    • filename: Filename of the code snippet (e.g., snip_01.py)
    • language: Programming language used (Python)
    • method: Type of approach used (e.g., Slicing, Recursive, While loop)
    • notes: Additional details about the logic or style used in the snippet

    🗂 Folder Structure

    CodeSimilarityDataset/
    ├── reverse_string/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── max_in_list/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_prime/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_palindrome/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── fibonacci/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    └── full_metadata.csv  ← Combined metadata across all problems

    🔍 Use Cases

    • Train models to detect similar code logic
    • Build plagiarism detection systems
    • Improve code recommendation engines
    • Teach students about code variation
    • Benchmark code search algorithms

    🧪 Sample Applications

    Visualize logic type distribution

    Compare structural similarity (AST/difflib/token matching)

    Cluster similar snippets using embeddings

    Train code-style-aware LLMs

    📦 File Formats

    All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.

    🛠 How to Use

    You can load metadata easily with Python:

    import pandas as pd

    df = pd.read_csv('full_metadata.csv')
    print(df.sample(5))

    Then read any snippet:

    with open("reverse_string/snippets/snip_01.py") as f:
        code = f.read()
    print(code)
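
    To illustrate the kind of similarity comparison listed above (difflib/token matching), here is a minimal sketch using only the standard library; it assumes a second variant snip_02.py exists alongside snip_01.py:

    import difflib

    # Compare two variants of the same task from the reverse_string problem folder
    with open("reverse_string/snippets/snip_01.py") as f1, \
         open("reverse_string/snippets/snip_02.py") as f2:
        a, b = f1.read(), f2.read()

    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"Similarity ratio: {ratio:.2f}")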

    📄 License

    This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.

  8. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detail levels)
    - at least 16Gb of RAM (64 preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  9. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    Available download formats: zip (93408 bytes)
    Dataset updated
    Oct 5, 2022
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
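
    Since the dataset is intended for the GroupBy tutorial above, here is a minimal sketch of a grouped aggregation on the filtered file (column names are taken from the commands above; the local file path is an assumption):

    import pandas as pd

    df = pd.read_csv('NYC_Jobs_filtered_cols.csv')

    # Average posted salary range per agency
    salary_by_agency = df.groupby('Agency')[['Salary Range From', 'Salary Range To']].mean()
    print(salary_by_agency.sort_values('Salary Range To', ascending=False).head())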

  10. Auditory cortex single unit population activity during natural sound presentation -- dataset

    • data.niaid.nih.gov
    Updated Jun 15, 2023
    Cite
    Pennington, Jacob; David, Stephen (2023). Auditory cortex single unit population activity during natural sound presentation -- dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7796573
    Explore at:
    Dataset updated
    Jun 15, 2023
    Dataset provided by
    Oregon Health & Science University
    Washington State University, Vancouver
    Authors
    Pennington, Jacob; David, Stephen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    High-density multi-channel neurophysiology data were collected from primary (A1) and secondary (PEG) fields of auditory cortex of passively listening ferrets during presentation of a large natural sound library. Single unit spikes were sorted using Kilosort. This dataset includes spike times for 849 A1 units and 398 PEG units. Stimulus waveforms were transformed to log-spaced spectrograms for analysis (18 channels, 10 ms time bins). Data set includes raw sound waveforms as well.

    The authors request that any publication using this data cite the following work: https://www.biorxiv.org/content/10.1101/2022.06.10.495698v2

    Data format/description

    Neural data are stored in two files. All recordings were performed during presentation of the same natural sound library.

    recordings/A1_NAT4_ozgf.fs100.ch18.tgz - data from 849 A1 single units and log spectrogram of stimuli aligned with spike times.

    recordings/PEG_NAT4_ozgf.fs100.ch18.tgz - data from 398 PEG single units and log spectrogram of stimuli aligned with spike times.

    wav.zip - raw wav files. Note: only the first 1 s of each wav file was presented during experiments; the recordings themselves are longer.

    Example scripts

    Python scripts included with this dataset demonstrate how to load the neural data and perform a CNN model fit. Running the scripts requires the NEMS0 python library, which is available open source at https://github.com/lbhb/NEMS0.

    Quick install

    Create and activate a new conda environment:

    conda create -n NEMS0 python=3.7
    conda activate NEMS0

    Download NEMS0:

    git clone https://github.com/lbhb/NEMS0

    Install NEMS0:

    pip install -e NEMS0

    Detailed instructions for installing NEMS0 are available in the Github repository (https://github.com/lbhb/NEMS0).

    Demo scripts

    Once NEMS0 is installed and the data are downloaded, move to the directory where the data and demo scripts are stored and run them in a NEMS0 environment.

    pop_cnn_load.py - Load the A1 data and compare predictions for two neurons (Fig 3) by two population models (stage 1 fit complete). Illustrates how to load the data using Python.

    pop_cnn_fit.py - Load a pre-fit A1 population model (stage 1) and complete stage 2 fit (refinement) for a single neuron. Illustrates use of NEMS0 for CNN model fitting.

    Funding

    Data collection, software development and processing were supported by funding from the NIH (R01DC014950, R01EB028155).

  11. booksum

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). booksum [Dataset]. https://www.tensorflow.org/datasets/catalog/booksum
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    BookSum: A Collection of Datasets for Long-form Narrative Summarization

    This implementation currently only supports book and chapter summaries.

    GitHub: https://github.com/salesforce/booksum

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('booksum', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  12. tidybot

    • tensorflow.org
    Updated Dec 11, 2024
    + more versions
    Cite
    (2024). tidybot [Dataset]. https://www.tensorflow.org/datasets/catalog/tidybot
    Explore at:
    Dataset updated
    Dec 11, 2024
    Description

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('tidybot', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. Code to import PSCAD data into Python (Spyder)

    • ieee-dataport.org
    Updated Nov 20, 2025
    Cite
    Franz Guzman Llanos (2025). Code to import PSCAD data into Python (Spyder) [Dataset]. https://ieee-dataport.org/documents/code-import-pscad-data-python-spyder
    Explore at:
    Dataset updated
    Nov 20, 2025
    Authors
    Franz Guzman Llanos
    Description

    minimizes errors

  14. Python code used to download gridMET climate data for public-supply water service areas

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    Updated Aug 27, 2024
    + more versions
    Cite
    Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter (2024). Python code used to download gridMET climate data for public-supply water service areas [Dataset]. http://doi.org/10.5066/P9FUL880
    Explore at:
    Dataset updated
    Aug 27, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Carol Luukkonen; Ayman Alzraiee; Joshua Larsen; Donald Martin; Deidre Herbert; Cheryl Buchwald; Natalie Houston; Kristen Valseth; Scott Paulinski; Lisa Miller; Richard Niswonger; Jana Stewart; Cheryl Dieter
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Jan 1, 2000 - Dec 31, 2020
    Description

    This child item describes Python code used to retrieve gridMET climate data for a specific area and time period. Climate data were retrieved for public-supply water service areas, but the climate data collector could be used to retrieve data for other areas of interest. This dataset is part of a larger data release using machine learning to predict public supply water use for 12-digit hydrologic units from 2000-2020. Data retrieved by the climate data collector code were used as input feature variables in the public supply delivery and water use machine learning models. This page includes the following file: climate_data_collector.zip - a zip file containing the climate data collector Python code used to retrieve climate data and a README file.
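
    As a minimal sketch of working with gridMET data in Python once a NetCDF file has been downloaded (this is not the released collector code; the file name, variable, and coordinate names below are illustrative assumptions), using xarray:

    import xarray as xr

    # Hypothetical local copy of a gridMET NetCDF file (e.g., daily precipitation).
    # The released climate_data_collector.zip code retrieves and aggregates such
    # data for public-supply water service areas.
    ds = xr.open_dataset("pr_2020.nc")
    print(ds)  # inspect variables, dimensions, and coordinate names

    # Illustrative subset to a bounding box and time window; adjust the
    # coordinate names (lat/lon/day) to match the file's actual metadata.
    subset = ds.sel(lat=slice(45.0, 40.0), lon=slice(-105.0, -100.0),
                    day=slice("2020-06-01", "2020-08-31"))
    print(subset)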

  15. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    + more versions
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Explore at:
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: empty. The notebook analyses/N12.To.Paper.ipynb moves data to it

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    For reproducing the analyses, run jupyter on this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories; the second one should umount it. You can leave the scripts blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one pair for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7


  16. Open data: Frequency mismatch negativity and visual load

    • su.figshare.com
    • researchdata.se
    • +1more
    pdf
    Updated Feb 23, 2021
    Cite
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund (2021). Open data: Frequency mismatch negativity and visual load [Dataset]. http://doi.org/10.17045/sthlmuni.7016369.v2
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 23, 2021
    Dataset provided by
    Stockholm University
    Authors
    Stefan Wiens; Erik van Berlekom; Malina Szychowska; Rasmus Eklund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiens, S., van Berlekom, E., Szychowska, M., & Eklund, R. (2019). Visual Perceptual Load Does Not Affect the Frequency Mismatch Negativity. Frontiers in Psychology, 10(1970). doi:10.3389/fpsyg.2019.01970

    We manipulated visual perceptual load (high and low load) while we recorded electroencephalography. Event-related potentials (ERPs) were computed from these data.

    • OSF_*.pdf contains the preregistration at the Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/EWG9X
    • ERP_2019_rawdata_bdf.zip contains the raw EEG data files that were recorded with a BioSemi system (www.biosemi.com). The files can be opened in MATLAB with the FieldTrip toolbox: https://www.mathworks.com/products/matlab.html, http://www.fieldtriptoolbox.org/
    • ERP_2019_visual_load_fieldtrip_scripts.zip contains all the MATLAB scripts that were used to process the ERP data with the FieldTrip toolbox: http://www.fieldtriptoolbox.org/
    • ERP_2019_fieldtrip_mat_*.zip contain the final, preprocessed individual data files. They can be opened with MATLAB.
    • ERP_2019_visual_load_python_scripts.zip contains the Python scripts for the main task. They require Python (https://www.python.org/) and PsychoPy (http://www.psychopy.org/).
    • ERP_2019_visual_load_wmc_R_scripts.zip contains the R scripts to process the working memory capacity (wmc) data: https://www.r-project.org/
    • ERP_2019_visual_load_R_scripts.zip contains the R scripts to analyze the data and the output files with figures (e.g., scatterplots): https://www.r-project.org/

  17. Caltech-101

    • datasets.activeloop.ai
    • huggingface.co
    deeplake
    Updated Feb 3, 2022
    Cite
    Caltech (2022). Caltech-101 [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/caltech-101-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Feb 3, 2022
    Dataset authored and provided by
    Caltech
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Dataset funded by
    National Science Foundation
    Description

    The Caltech-101 dataset contains images of objects from 101 categories, with roughly 40 to 800 images per category. The images vary in size (roughly 300 x 200 pixels) and are mostly in color. The dataset is used to train and evaluate machine learning models for the task of object recognition.

  18. Data from: Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States (ver. 2.0, August 2024)

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 19, 2025
    Cite
    U.S. Geological Survey (2025). Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States (ver. 2.0, August 2024) [Dataset]. https://catalog.data.gov/dataset/public-supply-water-use-reanalysis-for-the-2000-2020-period-by-huc12-month-and-year-for-th
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

    The U.S. Geological Survey is developing national water-use models to support water resources management in the United States. Model benefits include a nationally consistent estimation approach, greater temporal and spatial resolution of estimates, efficient and automated updates of results, and capabilities to forecast water use into the future and assess model uncertainty. The term "reanalysis" refers to the process of reevaluating and recalculating water-use data using updated or refined methods, data sources, models, or assumptions. In this data release, water use refers to water that is withdrawn by public and private water suppliers and includes water provided for domestic, commercial, industrial, thermoelectric power, and public water uses, as well as water that is consumed or lost within the public supply system. Consumptive use refers to water withdrawn by the public supply system that is evaporated, transpired, incorporated into products or crops, or consumed by humans or livestock.

    This data release contains data used in a machine learning model (child item 2) to estimate monthly water use for communities that are supplied by public-supply water systems in the conterminous United States for 2000-2020. It also contains associated scripts used to produce input features (child items 4-8) as well as model water use estimates by 12-digit hydrologic unit code (HUC12) and public supply water service area (WSA). HUC12 boundaries are in child item 3. Public supply delivery and consumptive use estimates are in child items 1 and 9, respectively.

    First posted: November 1, 2023. Revised: August 8, 2024. This version replaces the previous version of the data release: Luukkonen, C.L., Alzraiee, A.H., Larsen, J.D., Martin, D.J., Herbert, D.M., Buchwald, C.A., Houston, N.A., Valseth, K.J., Paulinski, S., Miller, L.D., Niswonger, R.G., Stewart, J.S., and Dieter, C.A., 2023, Public supply water use reanalysis for the 2000-2020 period by HUC12, month, and year for the conterminous United States: U.S. Geological Survey data release, https://doi.org/10.5066/P9FUL880

    Version 2.0: This data release has been updated as of 8/8/2024. The previous version was replaced because some fractions used for downscaling WSA estimates to HUC12 did not sum to one for some WSAs in Virginia. Updated model water use estimates by HUC12 are included in this version, and a change was made in two scripts to check for this condition. Output files have also been updated to preserve the leading zero in the HUC12 codes. Additional files are included to provide information about mapping the WSAs and groundwater and surface water fractions to HUC12 and to provide public supply water-use estimates by WSA. The 'Machine learning model that estimates total monthly and annual per capita public supply water use' child item has been updated with these corrections and additional files. A new child item, 'R code used to estimate public supply consumptive water use', has been added to provide estimates of public supply consumptive use.

    This page includes the following files:

    • PS_HUC12_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by HUC12, in million gallons per day
    • PS_HUC12_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by HUC12, in million gallons per day
    • PS_WSA_Tot_2000_2020.csv - a csv file with estimated monthly public supply total water use from 2000-2020 by WSA, in million gallons per day
    • PS_WSA_GW_2000_2020.csv - a csv file with estimated monthly public supply groundwater use for 2000-2020 by WSA, in million gallons per day
    • PS_WSA_SW_2000_2020.csv - a csv file with estimated monthly public supply surface water use for 2000-2020 by WSA, in million gallons per day
    • change_files_format.py - a Python script used to change the water use estimates by WSA and HUC12 files from wide format to the thin and long format
    • version_history.txt - a txt file describing changes in this version

    Notes: 1) Groundwater and surface water fractions were determined using source counts as described in the 'R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units' child item. 2) Some HUC12s have estimated water use of zero because no public-supply water service areas were modeled within the HUC.

    The data release is organized into these items:

    1. Machine learning model that estimates public supply deliveries for domestic and other use types - The public supply delivery model estimates total delivery of domestic, commercial, industrial, institutional, and irrigation (CII) water use for public supply water service areas within the conterminous United States. This item contains model input datasets, code used to build the delivery machine learning model, and output predictions.
    2. Machine learning model that estimates total monthly and annual per capita public supply water use - The public supply water use model estimates total monthly water use for 12-digit hydrologic units within the conterminous United States. This item contains model input datasets, code used to build the water use machine learning model, and output predictions.
    3. National watershed boundary (HUC12) dataset for the conterminous United States, retrieved 10/26/2020 - Spatial data consisting of a shapefile with 12-digit hydrologic units for the conterminous United States retrieved 10/26/2020.
    4. Python code used to determine average yearly and monthly tourism per 1000 residents for public-supply water service areas - This code was used to create a feature for the public supply model that provides information for areas affected by population increases due to tourism.
    5. Python code used to download gridMET climate data for public-supply water service areas - The climate data collector is a tool used to query climate data which are used as input features in the public supply models.
    6. Python code used to download U.S. Census Bureau data for public-supply water service areas - The census data collector is a geographic based tool to query census data which are used as input features in the public supply models.
    7. R code that determines buying and selling of water by public-supply water service areas - This code was used to create a feature for the public supply model that indicates whether public-supply systems buy water, sell water, or neither buy nor sell water.
    8. R code that determines groundwater and surface water source fractions for public-supply water service areas, counties, and 12-digit hydrologic units - This code was used to determine source water fractions (groundwater and/or surface water) for public supply systems and HUC12s.
    9. R code used to estimate public supply consumptive water use - This code was used to estimate public supply consumptive water use using an assumed fraction of deliveries for outdoor irrigation and estimates of evaporative demand. This item contains estimated monthly public supply consumptive use datasets by HUC12 and WSA.
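
    As a minimal sketch of the wide-to-long conversion that change_files_format.py performs (this is not the released script; the column layout below is an illustrative assumption), using pandas:

    import pandas as pd

    # Assumed wide layout: one row per HUC12 and one column per month/year of
    # estimates. Read HUC12 as a string to preserve the leading zero.
    wide = pd.read_csv("PS_HUC12_Tot_2000_2020.csv", dtype={"HUC12": str})

    # Melt the month/year columns into a long format with one value per row
    long = wide.melt(id_vars=["HUC12"], var_name="month_year",
                     value_name="withdrawal_mgd")
    print(long.head())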

  19. Data from: dolma

    • tensorflow.org
    Updated Mar 14, 2025
    Cite
    (2025). dolma [Dataset]. https://www.tensorflow.org/datasets/catalog/dolma
    Explore at:
    Dataset updated
    Mar 14, 2025
    Description

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dolma', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  20. IoT Sports Training Load Dataset

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    Python Developer (2025). IoT Sports Training Load Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/iot-sports-training-load-dataset
    Explore at:
    Available download formats: zip (226978 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    Python Developer
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 2300 multimodal IoT sensor recordings collected from athletes during traditional sports training sessions, including basketball, soccer, running, and other athletic activities. The dataset includes heart rate, acceleration (X, Y, Z), gyroscope readings (X, Y, Z), speed, step count, jump height, and training load. It is designed to facilitate analysis of athlete performance, training load monitoring, and predictive modeling for sports science applications.
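
    As a minimal sketch of loading and summarizing the recordings with pandas (the file name inside the download is an assumption; adjust the path and column names to the actual CSV):

    import pandas as pd

    # Hypothetical file name inside the Kaggle download
    df = pd.read_csv("iot_sports_training_load.csv")

    print(df.shape)       # expected: roughly 2300 rows of sensor recordings
    print(df.describe())  # summary statistics for heart rate, acceleration, etc.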
