Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from github code.
You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:
out of 500k python related items, most of them are python-ish, not pythonic the majority of the items here contain excessive licensing inclusion of original code the items here are sometimes not even python but have references There's a whole lot of gpl summaries… See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "code-search-net-python"
Dataset Description
Homepage: None Repository: https://huggingface.co/datasets/Nan-Do/code-search-net-python Paper: None Leaderboard: None Point of Contact: @Nan-Do
Dataset Summary
This dataset is the Python portion of the CodeSarchNet annotated with a summary column.The code-search-net dataset includes open source functions that include comments found at GitHub.The summary is a short description of what the… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/code-search-net-python.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description: - This dataset includes all 22 built-in datasets from the Seaborn library, a widely used Python data visualization tool. Seaborn's built-in datasets are essential resources for anyone interested in practicing data analysis, visualization, and machine learning. They span a wide range of topics, from classic datasets like the Iris flower classification to real-world data such as Titanic survival records and diamond characteristics.
This complete collection serves as an excellent starting point for anyone looking to improve their data science skills, offering a wide array of datasets suitable for both beginners and advanced users.
Facebook
TwitterAttribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This dataset is about Python programming. Question and answers are generated using Gemma. There are more than four hundred questions and their corresponding answers about Python programming.
Questions are ranging from concepts like data-types, variables and keywords to regular-expression and threading.
I have used this dataset here
The code used for dataset generated is available here
Facebook
Twitteriamketan25/python-qa-instructions-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains datasets to be downloaded from students for their practices with R and Python
Facebook
TwitterThis dataset was created by Nitish S.
Facebook
TwitterCodeParrot 🦜 Dataset
What is it?
This is the full CodeParrot dataset. It contains Python files used to train the code generation model in Chapter 10: Training Transformers from Scratch in the NLP with Transformers book. You can find the full code in the accompanying Github repository.
Creation
It was created with the GitHub dataset available via Google's BigQuery. It contains approximately 22 million Python files and is 180 GB (50 GB compressed) big. The… See the full description on the dataset page: https://huggingface.co/datasets/transformersbook/codeparrot.
Facebook
TwitterThe EPA GitHub repository PAU4ChemAs as described in the README.md file, contains Python scripts written to build the PAU dataset modules (technologies, capital and operating costs, and chemical prices) for tracking chemical flows transfers, releases estimation, and identification of potential occupation exposure scenarios in pollution abatement units (PAUs). These PAUs are employed for on-site chemical end-of-life management. The folder datasets contains the outputs for each framework step. The Chemicals_in_categories.csv contains the chemicals for the TRI chemical categories. The EPA GitHub repository PAU_case_study as described in its readme.md entry, contains the Python scripts to run the manuscript case study for designing the PAUs, the data-driven models, and the decision-making module for chemicals of concern and tracking flow transfers at the end-of-life stage. The data was obtained by means of data engineering using different publicly-available databases. The properties of chemicals were obtained using the GitHub repository Properties_Scraper, while the PAU dataset using the repository PAU4Chem. Finally, the EPA GitHub repository Properties_Scraper contains a Python script to massively gather information about exposure limits and physical properties from different publicly-available sources: EPA, NOAA, OSHA, and the institute for Occupational Safety and Health of the German Social Accident Insurance (IFA). Also, all GitHub repositories describe the Python libraries required for running their code, how to use them, the obtained outputs files after running the Python script modules, and the corresponding EPA Disclaimer. This dataset is associated with the following publication: Hernandez-Betancur, J.D., M. Martin, and G.J. Ruiz-Mercado. A data engineering framework for on-site end-of-life industrial operations. JOURNAL OF CLEANER PRODUCTION. Elsevier Science Ltd, New York, NY, USA, 327: 129514, (2021).
Facebook
TwitterAutomatically describing images using natural sentences is an essential task to visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
Dataset Structure
containing the images. The file dataset.json comprehends a list of json objects with the attributes:
user: anonymized user that made the post;
filename: image file name;
raw_caption: raw caption;
caption: clean caption;
date: post date.
Each instance in dataset.json is associated with exactly one image in the images directory whose filename is pointed by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
Download Instructions
If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:
cat images.tar.gz.part* > images.tar.gz tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets for practising in class
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cadaster data from PDOK used to illustrate the use of geopandas and shapely, geospatial python packages for manipulating vector data. The brpgewaspercelen_definitief_2020.gpkg file has been subsetted in order to make the download manageable for workshops. Other datasets are copies of those available from PDOK.
Facebook
Twittergreatdarklord/python-raw-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Muhammet Ikbal Elek
Released under CC0: Public Domain
Facebook
TwitterThe MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png" alt="Visualization" width="500px">
Facebook
TwitterProgramming-Language/codeagent-python dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data used for publication in "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Five surrogate models for flood inundation is to emulate the results of high-resolution hydrodynamic models. The surrogate models are compared based on accuracy and computational speed for three distinct case studies namely Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia).The dataset is structured in 5 files - "Carlisle", "Chowilla", "BurnettRV", "Comparison_results", and "Python_data". As a minimum to run the models the "Python_data" file and one of "Carlisle", "Chowilla", or "BurnettRV" are needed. We suggest to use the "Carlisle" case study for initial testing given its small size and small data requirement."Carlisle", "Chowilla", and "BurnettRV" files These files contain hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the surrogate models in each case study. There are only small differences between each folder, depending on the hydrodynamic model trying to emulate and input boundary conditions (input features).Each case study file has the following folders:Geometry_data: DEM files, .npz files containing of the high-fidelity models grid (XYZ-coordinates) and areas (Same data is available for the low-fidelity model used in the LSG model), .shp files indicating location of boundaries and main flow paths (mainly used in the LSTM-SRR model). XXX_modeldata: Folder to storage trained model data for each XXX surrogate model. For example, GP_EOF_modeldata contains files used to store the trainined GP-EOF model.HD_model_data: High-fidelity (And low-fidelity) simulation results for all flood events of that case study. This folder also contains all boundary input conditions.HF_EOF_analysis: Storing of data used in the EOF analysis. EOF analysis is applied for the LSG, GP-EOF, and LSTM-EOF surrogate models. Results_data: Storing results of running the evaluation of the surrogate models.Train_test_split_data: The train-test-validation data split is the same for all surrogate models. The specific split for each cross-validation fold is stored in this folder.And Python files:YYY_event_summary, YYY_Extrap_event_summary: Files containing overview of all events, and which events are connected between the low- and high-fidelity models for each YYY case study.EOF_analysis_HFdata_preprocessing, EOF_analysis_HFdata: Preprocessing before EOF analysis and the EOF analysis of the high-fidelity data. This is used for the LSG, GP-EOF, and LSTM-EOF surrogate models.Evaluation, Evaluation_extrap: Scripts for evaluating the surrogate model for that case study and saving the results for each cross-validation fold.train_test_split: Script for splitting the flood datasets for each cross-validation fold, so all surrogate models train on the same data.XXX_training: Script for training each XXX surrogate model.XXX_preprocessing: Some surrogate models might rely on some information that needs to be generated before training. This is performed using these scripts."Comparison_results" fileFiles used for comparing surrogate models and generate the figures in the paper "Assessment of surrogate models for flood inundation: The physics-guided LSG model vs. state-of-the-art machine learning models". Figures are also included. "Python_data" fileFolder containing Python script with utility functions for setting up, training, and running the surrogate models, as well as for evaluating the surrogate models. This folder also contains a python_environment.yml file with all Python package versions and dependencies.This folder also contains two sub-folders:LSG_mods_and_func: Python scripts for using the LSG model. Some of these scripts are also utilized when working with the other surrogate models. SRR_method_master_Zhou2021: Scripts obtained from https://github.com/yuerongz/SRR-method. Small edits have for speed and use in this study.
Facebook
TwitterDataset Card for my-distiset-3c1699f5
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/sdiazlor/my-distiset-3c1699f5/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/sdiazlor/math-python-reasoning-dataset.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Anaj krishna
Released under MIT
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Attention: This dataset is a summary and reformat pulled from github code.
You should make your own assumptions based on this. In fact, there is another dataset I formed through parsing that addresses several points:
out of 500k python related items, most of them are python-ish, not pythonic the majority of the items here contain excessive licensing inclusion of original code the items here are sometimes not even python but have references There's a whole lot of gpl summaries… See the full description on the dataset page: https://huggingface.co/datasets/jtatman/python-code-dataset-500k.