100+ datasets found
  1. Data from: Multidimensional Data Exploration with Glue

    • figshare.com
    pdf
    Updated Jan 18, 2016
    Cite
    Openproceedings Bot (2016). Multidimensional Data Exploration with Glue [Dataset]. http://doi.org/10.6084/m9.figshare.935503.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 18, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Openproceedings Bot
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue is an interactive environment built on top of the standard Python science stack to visualize relationships within and between datasets. With Glue, users can load and visualize multiple related datasets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images. The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate this and simplify the "data munging" process, so that researchers can more naturally explore what their data have to say. The result is a cleaner scientific workflow, faster interaction with data, and an easier avenue to insight.
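
    To give a flavour of the linking workflow described above, here is a minimal sketch using the glue Python API; the dataset contents and component names are hypothetical, and the Qt application import may differ between glue versions:

    from glue.core import Data, DataCollection
    from glue.core.link_helpers import LinkSame

    # Two related tables that share an identifier column (hypothetical data)
    catalog = Data(label='catalog', source_id=[1, 2, 3], flux=[0.5, 1.2, 0.8])
    spectra = Data(label='spectra', source_id=[2, 3, 4], line_width=[1.1, 0.9, 1.4])

    dc = DataCollection([catalog, spectra])

    # Declare that the two 'source_id' components refer to the same quantity,
    # so selections made in one dataset propagate to the other
    dc.add_link(LinkSame(catalog.id['source_id'], spectra.id['source_id']))

    # Launch the interactive application on this collection
    from glue.app.qt import GlueApplication
    app = GlueApplication(dc)
    app.start()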

  2. DataCite public data exploration

    • redivis.com
    Updated Apr 29, 2025
    Cite
    Ian Mathews (2025). DataCite public data exploration [Dataset]. https://redivis.com/workflows/hx1e-a6w8vmwsx
    Explore at:
    Dataset updated
    Apr 29, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Ian Mathews
    Description

    This is a sample project highlighting some basic methodologies in working with the DataCite public data file and Data Citation Corpus on Redivis.

    Using the transform interface, we extract all records associated with DOIs for Stanford datasets on Redivis. We then make a simple plot using a python notebook to see DOI issuance over time. The nested nature of some of the public data file fields makes exploration a bit challenging; future work could break this dataset into multiple related tables for easier analysis.

    We can also join with the Data Citation Corpus to find all citations referencing Stanford-on-Redivis DOIs (the citation corpus is a work in progress, and doesn't currently capture many of the citations in the literature).
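
    The plotting step described above can be reproduced in any Python notebook once the extracted records are exported to a flat file. A minimal sketch follows; the file name and the 'created' column are assumptions, not the actual Redivis schema:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical export of the extracted DOI records
    dois = pd.read_csv('stanford_redivis_dois.csv', parse_dates=['created'])

    # Count DOIs issued per year and plot the trend over time
    per_year = dois['created'].dt.year.value_counts().sort_index()
    per_year.plot(kind='bar')
    plt.xlabel('Year')
    plt.ylabel('DOIs issued')
    plt.title('DOI issuance over time for Stanford datasets on Redivis')
    plt.tight_layout()
    plt.show()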

  3. Zegami user manual for data exploration: "Systematic analysis of YFP gene traps reveals common discordance between mRNA and protein across the nervous system"

    • zenodo.org
    • explore.openaire.eu
    pdf, zip
    Updated Jul 17, 2024
    Cite
    Maria Kiourlappou; Stephen Taylor; Ilan Davis (2024). Zegami user manual for data exploration: "Systematic analysis of YFP gene traps reveals common discordance between mRNA and protein across the nervous system" [Dataset]. http://doi.org/10.5281/zenodo.6374012
    Explore at:
    Available download formats: pdf, zip
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maria Kiourlappou; Stephen Taylor; Ilan Davis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets together with diverse linked data are difficult to explore when provided in flat files. Here we provide a way to systematically filter and analyse a dataset of more than 18 thousand data points using Zegami, a solution for interactive data visualisation and exploration. The primary data we use are derived from a systematic analysis of 200 YFP gene traps, which reveals common discordance between mRNA and protein across the nervous system and is submitted elsewhere. This manual provides the raw image data together with annotations and associated data, and explains how to use Zegami to explore all of these data types together, with specific examples. We also provide the open-source Python code used to annotate the figures.

  4. Scientific Data Analysis and Visualization with Python

    • explore.openaire.eu
    Updated Feb 2, 2022
    Cite
    Md. Jalal Uddin; Nishat Rayhana Eshita; Md. Asif Newaz; Naiem Sheikh; Afifa Talukder; Aysha Akter; Md. Habibur Rahman; Md. Babul Miah (2022). Scientific Data Analysis and Visualization with Python [Dataset]. http://doi.org/10.5281/zenodo.5944707
    Explore at:
    Dataset updated
    Feb 2, 2022
    Authors
    Md. Jalal Uddin; Nishat Rayhana Eshita; Md. Asif Newaz; Naiem Sheikh; Afifa Talukder; Aysha Akter; Md. Habibur Rahman; Md. Babul Miah
    Description

    The publication "Scientific Data Analysis and Visualisation with Python" delves into various facets of Python programming, with a special focus on data analysis and visualisation. Its main sections are:

    • Operators and expressions: arithmetic, comparison, logical, bitwise, assignment and membership operators, the fundamental components of any Python script. Illustrative real-world scenarios show their practical applications; for example, arithmetic operators are essential for mathematical calculations, while comparison operators facilitate decision-making.
    • Data structures and control flow: input handling, strings, lists, dictionaries, loops, and conditional expressions. Scientists and software developers learn how to manipulate data structures efficiently; lists and dictionaries in particular play a crucial role in organising and retrieving data.
    • Functions and modularisation: functions are central to Python programming, and the publication offers valuable perspectives on creating and using them. Modularisation increases the reusability and maintainability of code: by breaking complex tasks into smaller functions, developers make their code easier to understand.
    • Exploring data with Pandas: a detailed examination of Pandas, a robust library. Readers gain skills in loading, manipulating, and analysing data frames.
    • Data presentation and visualisation: effective visualisation is critical to understanding data. The publication introduces matplotlib and other plotting libraries, which researchers and analysts can use to create powerful visual representations that communicate insights effectively.

    In summary, this publication serves as a valuable resource for individuals at various levels of Python proficiency, from beginners to experienced users. Whether you are a scientist navigating data or a developer honing your skills, the comprehensive content will guide you towards mastering Python data analysis and visualisation.

    The training materials are provided for international learners. The following lectures on Python are also available on YouTube for both international and Bangladeshi learners.

    For international learners: https://youtube.com/playlist?list=PL4T8G4Q9_JQ9ci8DAhpizHGQ7IsCZFsKu

    For Bangladeshi learners: https://youtube.com/playlist?list=PL4T8G4Q9_JQ_byYGwq3FyGhDOFRNdHRL8

    My profile: https://researchsociety20.org/founder-and-director/
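
    As a taste of the workflow the book walks through, here is a minimal pandas/matplotlib sketch; the file and column names are hypothetical:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical measurement file; column names are placeholders for illustration
    df = pd.read_csv('measurements.csv', parse_dates=['date'])

    # Operators and expressions: derive a new column from an arithmetic expression
    df['anomaly'] = df['temperature'] - df['temperature'].mean()

    # Data exploration with pandas
    print(df.describe())

    # Data presentation and visualisation with matplotlib
    df.plot(x='date', y='anomaly', title='Temperature anomaly')
    plt.show()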

  5. S&P 500 Companies Analysis Project

    • kaggle.com
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    anshadkaggle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
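
    A minimal sketch of the kind of sector and location breakdown described above, using pandas on the included sp500_cleaned.csv; the column names ('Sector', 'Headquarters Location') are assumptions rather than the file's actual schema:

    import pandas as pd

    # Load the cleaned dataset shipped with the project
    sp500 = pd.read_csv('sp500_cleaned.csv')

    # Companies per sector
    print(sp500['Sector'].value_counts().head(10))

    # Most common headquarters locations, one possible input for a Power BI visual
    by_location = (sp500.groupby('Headquarters Location')
                        .size()
                        .sort_values(ascending=False))
    print(by_location.head(10))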

  6. Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    • kaggle.com
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  7. Data from: OpenColab project: OpenSim in Google colaboratory to explore biomechanics on the web

    • tandf.figshare.com
    docx
    Updated Jul 6, 2023
    Cite
    Hossein Mokhtarzadeh; Fangwei Jiang; Shengzhe Zhao; Fatemeh Malekipour (2023). OpenColab project: OpenSim in Google colaboratory to explore biomechanics on the web [Dataset]. http://doi.org/10.6084/m9.figshare.20440340.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Jul 6, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Hossein Mokhtarzadeh; Fangwei Jiang; Shengzhe Zhao; Fatemeh Malekipour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenSim is an open-source biomechanical package with a variety of applications. It is available to many users through bindings in MATLAB, Python, and Java via its application programming interfaces (APIs). Although the developers have documented OpenSim installation well for different operating systems (Windows, Mac, and Linux), installation is time-consuming and complex, since each operating system requires a different configuration. This project aims to demystify the development of neuro-musculoskeletal modeling in OpenSim with zero installation configuration on any operating system (thus cross-platform), making models easy to share while accessing free graphical processing units (GPUs) on the web-based Google Colab platform. To achieve this, we developed OpenColab, in which the OpenSim source code was used to build a Conda package that can be installed on Google Colab with a single block of code in less than 7 min. To use OpenColab, one only needs an internet connection and a Gmail account. Moreover, OpenColab can access the vast libraries of machine learning methods available within free Google products, e.g. TensorFlow. Next, we performed an inverse problem in biomechanics and compared OpenColab results with the OpenSim graphical user interface (GUI) for validation. The outcomes of OpenColab and the GUI matched well (r≥0.82). OpenColab takes advantage of the zero-configuration of cloud-based platforms, accesses GPUs, and enables users to share and reproduce modeling approaches for further validation, innovative online training, and research applications. Step-by-step installation processes and examples are available at: https://simtk.org/projects/opencolab.
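
    The zero-configuration install described above boils down to running a conda-based setup inside a Colab notebook. A rough sketch using the condacolab helper is shown below; the conda channel and package names are assumptions, so defer to the step-by-step OpenColab instructions at the SimTK link:

    # Run in a Google Colab notebook cell
    !pip install -q condacolab
    import condacolab
    condacolab.install()          # note: the Colab kernel restarts after this call

    # --- run in a new cell after the restart ---
    !conda install -y -c opensim-org opensim   # channel/package names are assumptions
    import opensim as osim
    model = osim.Model()          # empty musculoskeletal model, ready for further setup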

  8. Python and R Basics for Environmental Data Sciences

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Tao Wen (2021). Python and R Basics for Environmental Data Sciences [Dataset]. https://search.dataone.org/view/sha256%3Aa4a66e6665773400ae76151d376607edf33cfead15ffad958fe5795436ff48ff
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Tao Wen
    Description

    This resource collects teaching materials that are originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor) and then refined/revised by Tao Wen to be used in the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.

    This resource includes both R Notebooks and Python Jupyter Notebooks to teach the basics of R and Python coding, data analysis and data visualization, as well as building machine learning models in both programming languages by using authentic research data and questions. All of these R/Python scripts can be executed either on the CUAHSI JupyterHub or on your local machine.

    This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.

  9. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 26, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (146671216621 bytes)
    Dataset updated
    Jun 26, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
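
    A small helper sketch of that directory scheme, assuming the dataset root is mounted at a hypothetical meta-kaggle-code directory and that folder names are not zero-padded:

    from pathlib import Path

    def kernel_version_path(kernel_version_id: int, root: str = 'meta-kaggle-code') -> Path:
        """Map a KernelVersions id from Meta Kaggle to its folder in this dataset.

        Top-level folder: id // 1_000_000   (blocks of one million ids)
        Sub-folder:       (id // 1_000) % 1_000   (blocks of one thousand ids)
        """
        top = kernel_version_id // 1_000_000
        sub = (kernel_version_id // 1_000) % 1_000
        return Path(root) / str(top) / str(sub)

    # Example: version 123,456,789 lives under 123/456/
    print(kernel_version_path(123456789))   # meta-kaggle-code/123/456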

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  10. Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations"

    • catalog.data.gov
    • gimi9.com
    Updated Nov 7, 2021
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "A data engineering framework for chemical flow analysis of industrial pollution abatement operations" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-a-data-engineering-framework-for-chemical-flow-analysis-of-industr
    Explore at:
    Dataset updated
    Nov 7, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The EPA GitHub repository PAU4Chem, as described in its README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flow transfers, estimating releases, and identifying potential occupational exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step, and the file Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study, as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and for tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases: the properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset was built using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the Institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). All GitHub repositories also describe the Python libraries required to run their code, how to use them, the output files obtained after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).

  11. Exploring Residential Water Use Data In Python

    • beta.hydroshare.org
    • hydroshare.org
    zip
    Updated Apr 18, 2024
    Cite
    Robert Gordon (2024). Exploring Residential Water Use Data In Python [Dataset]. https://beta.hydroshare.org/resource/3dc0b432d1d746c18661a9274fa9a709/
    Explore at:
    Available download formats: zip (144.6 MB)
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    HydroShare
    Authors
    Robert Gordon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 3, 2017 - Mar 27, 2017
    Description

    An analysis of water use data collected at Richards Hall on the Utah State University campus during March of 2017. Water use is examined to answer three questions: how water use differs during and after Spring Break, how water use differs on weekends vs. weekdays, and how the sampling interval affects the total volume recorded.
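
    A minimal pandas sketch of the latter two questions; the file name and columns (an instantaneous flow-rate column 'flow_gpm' indexed by timestamp) are assumptions for illustration:

    import pandas as pd

    # Hypothetical export of the raw observations for Richards Hall
    flow = pd.read_csv('richards_hall_flow.csv', parse_dates=['datetime'], index_col='datetime')

    # Weekend vs. weekday totals
    flow['weekend'] = flow.index.dayofweek >= 5      # Saturday/Sunday
    print(flow.groupby('weekend')['flow_gpm'].sum())

    # Effect of sampling interval on the estimated total volume: keep one instantaneous
    # reading per interval and integrate rate (gal/min) over the interval length
    for interval in ['1min', '15min', '60min']:
        sampled = flow['flow_gpm'].resample(interval).first()
        minutes = pd.to_timedelta(interval).total_seconds() / 60
        print(interval, round((sampled * minutes).sum(), 1), 'gallons (estimated)')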

  12. Data from: Python Case Study

    • kaggle.com
    Updated Feb 17, 2025
    Cite
    Ishrat Amin (2025). Python Case Study [Dataset]. https://www.kaggle.com/datasets/ishratamin/python-case-study/versions/1
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ishrat Amin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ishrat Amin

    Released under CC0: Public Domain


  13. Data from: An Empirical Study on the Usage and Availability of Machine Learning Libraries in Open-Source Python Projects

    • zenodo.org
    • data.niaid.nih.gov
    txt, zip
    Updated Dec 20, 2021
    Cite
    Giuliano Antoniol; Massimiliano Di Penta; Cyrine Zid; Vittoria Nardone (2021). An Empirical Study on the Usage and Availability of Machine Learning Libraries in Open-Source Python Projects - Dataset [Dataset]. http://doi.org/10.5281/zenodo.5788525
    Explore at:
    Available download formats: txt, zip
    Dataset updated
    Dec 20, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giuliano Antoniol; Massimiliano Di Penta; Cyrine Zid; Vittoria Nardone
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset of the manuscript:

    "An Empirical Study on the Usage and Availability of Machine Learning Libraries in Open-Source Python Projects"

  14. Python Code Instruction

    • kaggle.com
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Python Code Instruction [Dataset]. https://www.kaggle.com/datasets/thedevastator/python-code-instruction-dataset/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Python Code Instruction

    Training Data with Instruction, Input, Output, and Prompt Columns

    By Tarun Bisht (From Huggingface) [source]

    About this dataset

    The python_code_instructions_18k_alpaca dataset is a comprehensive training dataset specifically curated for researchers and developers involved in the analysis and comprehension of Python code instructions. It contains a vast collection of Python code snippets along with their corresponding instruction, input, output, and prompt information. By utilizing this dataset, users can gain valuable insights into various Python programming concepts and techniques.

    The dataset is organized into columns to facilitate easy access to the required information. The instruction column holds the specific task or instruction that the Python code snippet is designed to perform. This allows users to understand the purpose or requirement of each code snippet at a glance.

    The input column contains all necessary input data or parameters that are required for executing the Python code snippet accurately. These inputs provide context and enable users to comprehend how different variables or values impact the overall functioning of each code snippet.

    Likewise, the output column presents expected results or outcomes that should be produced when executing each Python code snippet with its specified input values. This allows for validation and verification purposes, ensuring that each code snippet performs as intended.

    In addition to instruction, input, and output details, this dataset also includes prompts. The prompt column provides additional context or information intended to assist users in better understanding the purpose or requirements of each particular Python code snippet.

    By leveraging this comprehensive python_code_instructions_18k_alpaca training dataset, researchers and developers can delve into numerous real-world examples of Python programming challenges - helping them enhance their coding skills while gaining invaluable knowledge about effective implementation techniques across various domains

    Research Ideas

    • Code Instruction Analysis: This dataset can be used to analyze different types of Python code instructions and identify patterns or common practices. Researchers or developers can use this dataset to gain insights into effective ways of writing code instructions.
    • Code Output Prediction: With the given input and instruction, this dataset can be used to train models for predicting the expected output of a Python code snippet. This can be useful in automating the testing process or verifying the correctness of the code.
    • Prompt Generation: Developers often struggle with providing clear and concise prompts for their code snippets. This dataset can serve as a resource for generating prompts by analyzing existing examples and extracting key information or requirements from them

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | instruction | Specific tasks or instructions assigned to each Python code snippet. (Text) |
    | input | The input data or parameters required for executing the code instruction. (Text) |
    | output | The expected result or output that should be produced when executing the code instruction. (Text) |
    | prompt | Additional information or context to help understand the purpose or requirements of each code instruction. (Text) |
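
    A small sketch of loading train.csv and assembling a training prompt from the columns above; the prompt template is an assumption, and the dataset's own prompt column may use a different format:

    import pandas as pd

    df = pd.read_csv('train.csv')

    def build_prompt(row):
        # Combine instruction and optional input into a single prompt string
        text = f"### Instruction:\n{row['instruction']}\n"
        if isinstance(row['input'], str) and row['input'].strip():
            text += f"### Input:\n{row['input']}\n"
        return text + "### Response:"

    df['full_prompt'] = df.apply(build_prompt, axis=1)
    print(df[['full_prompt', 'output']].head())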

    Acknowledgements

    If you use this dataset in your research, please credit the original author, Tarun Bisht (From Huggingface).

  15. Assessing the impact of hints in learning formal specification: Research artifact

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Jan 29, 2024
    Cite
    Margolis, Iara (2024). Assessing the impact of hints in learning formal specification: Research artifact [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10450608
    Explore at:
    Dataset updated
    Jan 29, 2024
    Dataset provided by
    Campos, José Creissac
    Margolis, Iara
    Macedo, Nuno
    Sousa, Emanuel
    Cunha, Alcino
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.

    Dataset

    The artifact contains the resources described below.

    Experiment resources

    The resources needed for replicating the experiment, namely in directory experiment:

    alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.

    alloy_sheet_en.pdf: a version of the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment, translated into English.

    docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.

    api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.

    Experiment data

    The task database used in our application of the experiment, namely in directory data/experiment:

    Model.json, Instance.json, and Link.json: JSON files used to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.

    identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.

    Collected data

    Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared in the shape of JSON and CSV files with a header row, namely in directory data/results:

    data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).

    data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclosure, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference to not disclosure).

    data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);

    detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.

    data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:

    participant identification: participant's unique identifier (ID);

    user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).

    participants.txt: the list of participant identifiers that have registered for the experiment.

    Analysis scripts

    The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:

    analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.

    requirements.r: An R script to install the required libraries for the analysis script.

    normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.

    normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.

    Dockerfile: Docker script to automate the analysis script from the collected data.

    Setup

    To replicate the experiment and the analysis of the results, only Docker is required.

    If you wish to manually replicate the experiment and collect your own data, you'll need to install:

    A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.

    If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:

    Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.

    R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.

    Usage

    Experiment replication

    This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.

    To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.

    cd experiment
    docker-compose up

    This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.

    In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, one for each hint group:

    Group N (no hints): http://localhost:3000/0CAN

    Group L (error locations): http://localhost:3000/CA0L

    Group E (counter-example): http://localhost:3000/350E

    Group D (error description): http://localhost:3000/27AD

    In the 2nd session, as in the 1st session, each permalink gave access to 12 sequential tasks, and the next task becomes available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-, so participant 0CAN would access http://localhost:3000/P-0CAN. In the 2nd session all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
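
    As an illustrative summary of that permalink scheme (a helper sketch only, not part of the artifact):

    # Build the two session permalinks for a participant identifier
    BASE = "http://localhost:3000"

    def session_links(identifier: str) -> dict:
        return {
            "group": identifier[-1],                 # last character encodes the hint group (N, L, E or D)
            "session_1": f"{BASE}/{identifier}",
            "session_2": f"{BASE}/P-{identifier}",
        }

    print(session_links("0CAN"))
    # {'group': 'N', 'session_1': 'http://localhost:3000/0CAN', 'session_2': 'http://localhost:3000/P-0CAN'}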

    Before the 1st session the participants should answer the socio-demographic questionnaire, that should ask the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.

    Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the different emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment to each of the 14 depicted emotions, expressed on a 5-point Likert scale.

    After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:

    Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.

    Analysis of other applications of the experiment

    This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.

    The analysis script expects data in 4 CSV files,

  16. Python for ArcGIS - Working with ArcGIS Notebooks

    • edu.hub.arcgis.com
    Updated Oct 8, 2024
    Cite
    Education and Research (2024). Python for ArcGIS - Working with ArcGIS Notebooks [Dataset]. https://edu.hub.arcgis.com/documents/16fbaf21dc7b41c187ebcfd9f6ea1d58
    Explore at:
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    Education and Research
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This resource was created by Esri Canada Education and Research. To browse our full collection of higher-education learning resources, please visit https://hed.esri.ca/resourcefinder/.

    This tutorial introduces you to using Python code in a Jupyter Notebook, an open source web application that enables you to create and share documents that contain rich text, equations and multimedia, alongside executable code and visualization of analysis outputs. The tutorial begins by stepping through the basics of setting up and being productive with Python notebooks. You will be introduced to ArcGIS Notebooks, which are Python notebooks that are well-integrated within the ArcGIS platform. Finally, you will be guided through a series of ArcGIS Notebooks that illustrate how to create compelling notebooks for data science that integrate your own Python scripts using the ArcGIS API for Python and ArcPy, in combination with thousands of open source Python libraries, to enhance your analysis and visualization.

    To download the dataset Labs, click the Open button to the top right. This will automatically download a ZIP file containing all files and data required. You can also clone the tutorial documents and datasets from the GitHub repo: https://github.com/highered-esricanada/arcgis-notebooks-tutorial.git.

    Software & Solutions Used:
    • Required: This tutorial was last tested on August 27th, 2024, using ArcGIS Pro 3.3. If you're using a different version of ArcGIS Pro, you may encounter different functionality and results.
    • Recommended: ArcGIS Online subscription account with permissions to use advanced Notebooks and GeoEnrichment
    • Optional: Notebook Server for ArcGIS Enterprise 11.3+

    Time to Complete: 2 h (excludes processing time)
    File Size: 196 MB
    Date Created: January 2022
    Last Updated: August 27, 2024
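
    A minimal sketch of the kind of ArcGIS API for Python call the notebooks build on, assuming it runs inside an ArcGIS Notebook where GIS("home") can pick up the active session; the search query is a placeholder:

    from arcgis.gis import GIS

    # Connect using the credentials of the active ArcGIS Notebook session
    gis = GIS("home")

    # Search the organisation's content for feature layers (query string is a placeholder)
    items = gis.content.search("owner:me", item_type="Feature Layer", max_items=5)
    for item in items:
        print(item.title, item.id)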

  17. Tutorial for NetCDF climate data retrieval and model integration

    • search.dataone.org
    • hydroshare.org
    • +2more
    Updated Dec 5, 2021
    Cite
    Christina Bandaragoda; Jimmy Phuong (2021). Tutorial for NetCDF climate data retrieval and model integration [Dataset]. https://search.dataone.org/view/sha256%3A01e446404092bdcebd82469ba4ad3653a87530cde60581284d1eb36d28dd42b2
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Christina Bandaragoda; Jimmy Phuong
    Description

    Hydrological and meteorological information can help characterise the conditions and risk factors affecting the environment and its inhabitants. Because observation sampling is limited, gridded data sets provide modeled information for areas where data collection is infeasible, built from collected observations and known process relations. Although such datasets are available, users face barriers to using them: how to access, acquire, and then analyze data for small watershed areas, when these datasets were produced for large, continental-scale processes. In this tutorial, we introduce the Observatory for Gridded Hydrometeorology (OGH) to resolve such hurdles in a use-case that incorporates NetCDF gridded data sets and processes developed to interpret the findings and apply secondary modeling frameworks (landlab).

    LEARNING OBJECTIVES
    • Familiarize yourself with data management, metadata management, and analyses with gridded data
    • Inspect and problem-solve with Python libraries
    • Explore data architecture and processes
    • Learn about the OGH Python library
    • Discuss conceptual data engineering and science operations

    Use-case operations:
    1. Prepare computing environment
    2. Get list of grid cells
    3. NetCDF retrieval and clipping to a spatial extent
    4. Extract NetCDF metadata and convert NetCDFs to 1D ASCII time-series files
    5. Visualize the average monthly total precipitations
    6. Apply summary values as modeling inputs
    7. Visualize modeling outputs
    8. Save results in a new HydroShare resource
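
    A minimal xarray sketch of steps 3-5 above; the file, variable, and coordinate names are assumptions for illustration:

    import xarray as xr

    ds = xr.open_dataset('gridded_precipitation.nc')

    # Step 3: clip to the watershed's bounding box (assumes ascending lat/lon coordinates)
    subset = ds.sel(lat=slice(47.0, 48.0), lon=slice(-122.5, -121.0))

    # Step 5: average monthly total precipitation over the clipped domain
    monthly = subset['precip'].resample(time='1MS').sum().mean(dim=['lat', 'lon'])
    print(monthly)

    # Step 4: export a single grid cell as a 1-D ASCII time series
    cell = subset['precip'].sel(lat=47.5, lon=-121.8, method='nearest')
    cell.to_dataframe().to_csv('cell_timeseries.txt', sep='\t')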

    For inquiries, issues, or contribute to the developments, please refer to https://github.com/freshwater-initiative/Observatory

  18. Napari Tutorial

    • explore.openaire.eu
    Updated Nov 26, 2023
    Cite
    Robert Haase (2023). Napari Tutorial [Dataset]. http://doi.org/10.5281/zenodo.10207321
    Explore at:
    Dataset updated
    Nov 26, 2023
    Authors
    Robert Haase
    Description

    In this Napari tutorial we learn about the Napari viewer, an interactive tool for browsing imaging data with Python. We will learn how to set up image analysis workflows interactively, how to train pixel and object classifiers, and how to use dimensionality reduction and clustering to explore data of segmented biological objects such as cells and nuclei.
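
    A minimal sketch of opening the viewer programmatically, using a scikit-image sample volume as a stand-in for the tutorial's imaging data:

    import napari
    from skimage import data

    # Open the viewer with a sample nuclei image and an empty annotation layer
    viewer = napari.Viewer()
    viewer.add_image(data.cells3d()[:, 1], name='nuclei', colormap='green')
    viewer.add_points(name='annotations')

    # Start the Qt event loop when running as a plain script (not needed inside Jupyter)
    napari.run()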

  19. Storage and Transit Time Data and Code

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 12, 2024
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8136816
    Explore at:
    Dataset updated
    Jun 12, 2024
    Dataset authored and provided by
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 5/5/2024

    This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably in this project.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/annual/multi_year_average/average_annual_turnover.nc" contains a global array summarizing five year (2016-2020) averages of annual transit, storage, canopy transpiration, and number of months of data. This is the core dataset for the analysis; however, each folder has much more data, including a dataset for each year of the analysis. Data are also available is separate .csv files for each land cover type. Oterh data can be found for the minimum, monthly, and seasonal transit time found in their respective folders. These data were produced using the python code found in the "supporting_code" folder given the ease of working with .nc and EASE grid in the xarray python module. R was used primarily for data visualization purposes. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here.

    Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a particular function:

    01_start.R: This script loads the R packages used in the analysis, sets the directory, and imports custom functions for the project. You can also load in the main transit time (turnover) datasets here using the source() function.

    02_functions.R: This script contains the custom functions for this analysis, primarily to work with importing the seasonal transit data. Load this using the source() function in the 01_start.R script.

    03_generate_data.R: This script is not necessary to run and is primarily for documentation. The main role of this code was to import and wrangle the data needed to calculate ground-based estimates of aboveground water storage.

    04_annual_turnover_storage_import.R: This script imports the annual turnover and storage data for each landcover type. You load in these data from the 01_start.R script using the source() function.

    05_minimum_turnover_storage_import.R: This script imports the minimum turnover and storage data for each landcover type. Minimum is defined as the lowest monthly estimate. You load in these data from the 01_start.R script using the source() function.

    06_figures_tables.R: This is the main workhorse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study, which are then saved in the manuscript_figures folder. Note that all maps were produced using Python code found in the "supporting_code" folder.

  20. Data from: Data and Python scripts supporting Python package FAPS

    • research-explorer.ista.ac.at
    Updated Apr 15, 2025
    Cite
    Ellis, Thomas (2025). Data and Python scripts supporting Python package FAPS [Dataset]. https://research-explorer.ista.ac.at/record/5583
    Explore at:
    Dataset updated
    Apr 15, 2025
    Authors
    Ellis, Thomas
    Description

    Data and scripts are provided in support of the manuscript "Efficient inference of paternity and sibship inference given known maternity via hierarchical clustering", and the associated Python package FAPS, available from www.github.com/ellisztamas/faps.

    Simulation scripts cover:
    1. Performance under different mating scenarios.
    2. Comparison with Colony2.
    3. Effect of changing the number of Monte Carlo draws.

    The final script covers the analysis of half-sib arrays from wild-pollinated seed in an Antirrhinum majus hybrid zone.
