82 datasets found
  1. Python Code Instruction

    • kaggle.com
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Python Code Instruction [Dataset]. https://www.kaggle.com/datasets/thedevastator/python-code-instruction-dataset/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Python Code Instruction

    Training Data with Instruction, Input, Output, and Prompt Columns

    By Tarun Bisht (From Huggingface) [source]

    About this dataset

    The python_code_instructions_18k_alpaca dataset is a comprehensive training dataset specifically curated for researchers and developers involved in the analysis and comprehension of Python code instructions. It contains a vast collection of Python code snippets along with their corresponding instruction, input, output, and prompt information. By utilizing this dataset, users can gain valuable insights into various Python programming concepts and techniques.

    The dataset is organized into columns to facilitate easy access to the required information. The instruction column holds the specific task or instruction that the Python code snippet is designed to perform. This allows users to understand the purpose or requirement of each code snippet at a glance.

    The input column contains all necessary input data or parameters that are required for executing the Python code snippet accurately. These inputs provide context and enable users to comprehend how different variables or values impact the overall functioning of each code snippet.

    Likewise, the output column presents expected results or outcomes that should be produced when executing each Python code snippet with its specified input values. This allows for validation and verification purposes, ensuring that each code snippet performs as intended.

    In addition to instruction, input, and output details, this dataset also includes prompts. The prompt column provides additional context or information intended to assist users in better understanding the purpose or requirements of each particular Python code snippet.

    By leveraging this comprehensive python_code_instructions_18k_alpaca training dataset, researchers and developers can delve into numerous real-world examples of Python programming challenges - helping them enhance their coding skills while gaining invaluable knowledge about effective implementation techniques across various domains.

    Research Ideas

    • Code Instruction Analysis: This dataset can be used to analyze different types of Python code instructions and identify patterns or common practices. Researchers or developers can use this dataset to gain insights into effective ways of writing code instructions.
    • Code Output Prediction: With the given input and instruction, this dataset can be used to train models for predicting the expected output of a Python code snippet. This can be useful in automating the testing process or verifying the correctness of the code.
    • Prompt Generation: Developers often struggle with providing clear and concise prompts for their code snippets. This dataset can serve as a resource for generating prompts by analyzing existing examples and extracting key information or requirements from them.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | instruction | Specific tasks or instructions assigned to each Python code snippet. (Text) |
    | input | The input data or parameters required for executing the code instruction. (Text) |
    | output | The expected result or output that should be produced when executing the code instruction. (Text) |
    | prompt | Additional information or context to help understand the purpose or requirements of each code instruction. (Text) |
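
    As a quick orientation (a minimal sketch, assuming train.csv has been downloaded locally from the Kaggle page), the file can be loaded and its four documented columns inspected with pandas:

    import pandas as pd

    # Load the training file and preview the documented columns
    df = pd.read_csv("train.csv")
    print(df[["instruction", "input", "output", "prompt"]].head())
    print(f"{len(df)} rows")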

    Acknowledgements

    If you use this dataset in your research, please credit the original author, Tarun Bisht (From Huggingface).

  2. Python for Machine Learning - Crash Course

    • kaggle.com
    Updated Jun 24, 2023
    Cite
    Kunaal Naik (2023). Python for Machine Learning - Crash Course [Dataset]. https://www.kaggle.com/datasets/funxexcel/python-for-machine-learning-crash-course/data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kunaal Naik
    Description

    Dataset

    This dataset was created by Kunaal Naik

    Contents

  3. truncated slab data for python course

    • kaggle.com
    zip
    Updated Nov 4, 2018
    Cite
    Katja Rodionova (2018). truncated slab data for python course [Dataset]. https://www.kaggle.com/kpkatja/truncated-slab-data-for-python-course
    Explore at:
    zip(3292 bytes)Available download formats
    Dataset updated
    Nov 4, 2018
    Authors
    Katja Rodionova
    Description

    Dataset

    This dataset was created by Katja Rodionova

    Contents

  4. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    Updated May 18, 2024
    + more versions
    Cite
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy (2024). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.11213783
    Explore at:
    Dataset updated
    May 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ekaterina Trofimova; Emil Sataev; Anastasia Drozdova; Polina Guseva; Anna Scherbakova; Andrey Ustyuzhanin; Anastasia Gorodilova; Valeriy Berezovskiy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle. The initial corpus consists of ≈ 2.5 million snippets of ML code collected from ≈ 100 thousand Jupyter notebooks. A representative fraction of the snippets is annotated by human assessors through a user-friendly interface specially designed for that purpose.

    The data is organized as a set of tables in CSV format. It includes several central entities: raw code blocks collected from Kaggle (code_blocks.csv), kernels (kernels_meta.csv), and competitions meta-information (competitions_meta.csv). Manually annotated code blocks are presented as a separate table (markup_data.csv). As this table contains the numeric id of the code block's semantic type, we also provide a mapping from the id to the semantic class and subclass (vertices.csv).

    Snippet information (code_blocks.csv) can be mapped to kernel metadata via kernel_id. Kernel metadata is linked to Kaggle competition information through comp_name. To ensure the quality of the data, kernels_meta.csv includes only notebooks with an available Kaggle score.

    Automatic classifications of code blocks are stored in data_with_preds.csv. This table can be mapped to code_blocks.csv through the code_blocks_index column, which corresponds to code_blocks indices.
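
    For illustration, a minimal pandas sketch of these mappings (local file paths are assumptions; interpreting code_blocks_index as the row index of code_blocks.csv follows the description above):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")
    preds = pd.read_csv("data_with_preds.csv")

    # Snippets -> notebooks via kernel_id, notebooks -> competitions via comp_name
    snippets = code_blocks.merge(kernels, on="kernel_id").merge(competitions, on="comp_name")

    # Automatic classifications map back to code_blocks via code_blocks_index
    labeled = preds.merge(code_blocks, left_on="code_blocks_index", right_index=True)
    print(snippets.shape, labeled.shape)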

    The updated Code4ML 2.0 corpus includes kernels retrieved from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    kernels_meta2.csv may contain kernels without a Kaggle score but with a leaderboard position (rank).

    The Code4ML 2.0 dataset can be used for various purposes, including training and evaluating models for code generation, code understanding, and natural language processing tasks.

  5. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display

    # Render the Roboflow README shipped with the dataset; passing the path via
    # `filename=` displays the file's contents rather than the literal path string
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have preprocessed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not split into separate sets. To train a custom YOLOv7 model, we need to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG]

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch.__version__} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    


    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      wandb.login(anonymous='must')
      print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token. '
            'Use the label name WANDB. Get your W&B access token from https://wandb.ai/authorize')

    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    [Image: the computer vision dataset cycle, https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png]

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG]

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine the batch size
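
    The remaining arguments are cut off in this preview. As a rough sketch (not the notebook's exact command), a training invocation with these flags could look like the following; the epoch count, batch size, and file paths are assumptions, and {dataset.location} comes from the Roboflow download above:

    # Hypothetical values; adjust image size, batch size and epochs to your hardware
    !python train.py --img 640 --batch 16 --epochs 55 --data {dataset.location}/data.yaml --weights 'yolov7.pt' --device 0 --name yolov7-car-person-custom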

  6. Python-course-gemet-english-2023

    • kaggle.com
    Updated May 18, 2024
    Cite
    Maria Skeppstedt (2024). Python-course-gemet-english-2023 [Dataset]. https://www.kaggle.com/datasets/mariaskeppstedt/python-course-gemet-english-2023/data
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 18, 2024
    Dataset provided by
    Kaggle
    Authors
    Maria Skeppstedt
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Maria Skeppstedt

    Released under CC BY-SA 4.0

    Contents

  7. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txtAvailable download formats
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    | Column | Description |
    | ------ | ----------- |
    | code_blocks_index | Global index linking code blocks to markup_data.csv. |
    | kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
    | code_block_id | Position of the code block within the notebook. |
    | code_block | The actual machine learning code snippet. |

    Table 2. kernels_meta.csv structure

    | Column | Description |
    | ------ | ----------- |
    | kernel_id | Identifier for the Kaggle Jupyter notebook. |
    | kaggle_score | Performance metric of the notebook. |
    | kaggle_comments | Number of comments on the notebook. |
    | kaggle_upvotes | Number of upvotes the notebook received. |
    | kernel_link | URL to the notebook. |
    | comp_name | Name of the associated Kaggle competition. |

    Table 3. competitions_meta.csv structure

    | Column | Description |
    | ------ | ----------- |
    | comp_name | Name of the Kaggle competition. |
    | description | Overview of the competition task. |
    | data_type | Type of data used in the competition. |
    | comp_type | Classification of the competition. |
    | subtitle | Short description of the task. |
    | EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
    | data_sources | Links to datasets used. |
    | metric type | Class label for the assessment metric. |

    Table 4. markup_data.csv structure

    | Column | Description |
    | ------ | ----------- |
    | code_block | Machine learning code block. |
    | too_long | Flag indicating whether the block spans multiple semantic types. |
    | marks | Confidence level of the annotation. |
    | graph_vertex_id | ID of the semantic type. |

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
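
    As a small illustration of the annotation mapping described above (a sketch; treating the vertices.csv row index as the ID referenced by graph_vertex_id is an assumption):

    import pandas as pd

    markup = pd.read_csv("markup_data.csv")
    vertices = pd.read_csv("vertices.csv")

    # Attach the human-readable semantic class/subclass to each annotated block
    annotated = markup.merge(vertices, left_on="graph_vertex_id", right_index=True)
    print(annotated.head())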

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks

  8. BoolArt-Python-Course

    • kaggle.com
    Updated Sep 11, 2022
    Cite
    KKY (2022). BoolArt-Python-Course [Dataset]. https://www.kaggle.com/datasets/evilpsycho42/boolartpythonplaygroud
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 11, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KKY
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Dataset

    This dataset was created by KKY

    Released under GNU Lesser General Public License 3.0

    Contents

  9. DECIMER Image classifier dataset

    • data.niaid.nih.gov
    Updated Jul 9, 2022
    Cite
    M. Isabel agea (2022). DECIMER Image classifier dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6670745
    Explore at:
    Dataset updated
    Jul 9, 2022
    Dataset authored and provided by
    M. Isabel agea
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Image dataset divided into train (10,905,114 images), validation (2,115,528 images), and test (544,946 images) folders, containing a balanced number of images for two classes (chemical structures and non-chemical structures).

    The chemical structures were generated using RanDepict from randomly picked compounds of the ChEMBL30 database and the COCONUT database.

    The non-chemical structures were generated using Python or they were retrieved from several public datasets:

    COCO dataset, MIT Places-205 dataset, Visual Genome dataset, Google Open labeled Images, MMU-OCR-21 (kaggle), HandWritten_Character (kaggle), CoronaHack -Chest X-Ray-dataset (kaggle), PANDAS Augmented Images (kaggle), Bacterial_Colony (kaggle), Ceylon Epigraphy Periods (kaggle), Chinese Calligraphy Styles by Calligraphers (kaggle), Graphs Dataset (kaggle), Function_Graphs Polynomial (kaggle), sketches (kaggle), Person Face Sketches (kaggle), Art Pictograms (kaggle), Russian handwritten letters (kaggle), Handwritten Russian Letters (kaggle), Covid-19 Misinformation Tweets Labeled Dataset (kaggle) and grapheme-imgs-224x224 (kaggle).

    This data was used to build a CNN classification model, using EfficientNetB0 as a base model and fine-tuning it. The model is available on GitHub.
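
    The authors' model is the one published on GitHub; purely as an illustration of the approach described (an EfficientNetB0 base fine-tuned for a binary chemical vs. non-chemical classifier), a minimal Keras sketch might look like this, with image size and head layers as assumptions:

    import tensorflow as tf

    # EfficientNetB0 backbone pretrained on ImageNet, without its classification head
    base = tf.keras.applications.EfficientNetB0(include_top=False, weights="imagenet",
                                                input_shape=(224, 224, 3))
    base.trainable = False  # freeze the backbone for the first training phase

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # chemical vs. non-chemical
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    # After the head converges, set base.trainable = True and recompile with a low
    # learning rate to fine-tune the backbone.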

  10. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset has been collected in the frame of the Prac1 of the subject Tipology and Data Life Cycle of the Master's Degree in Data Science of the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52,478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets, the first 30,000 books and then the remaining 22,478. Dates were not parsed and reformatted in the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30,000 records and as Month Day Year for the rest.

    Book cover images can be optionally downloaded from the url in the 'coverImg' field. Python code for doing so and an example can be found on the github repo.
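
    The repository linked above contains the authors' own code for this; as an independent minimal sketch (the CSV file name and output path are assumptions), one cover image can be fetched like so:

    import pandas as pd
    import requests

    books = pd.read_csv("books_1.Best_Books_Ever.csv")  # assumed file name
    url = books.loc[0, "coverImg"]

    # Download the first book's cover image to the working directory
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open("cover_0.jpg", "wb") as f:
        f.write(resp.content)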

    The 25 fields of the dataset are:

    | Attributes | Definition | Completeness (%) |
    | ------------- | ------------- | ---------------- |
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (ex. Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Editorial | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |

  11. ‘COVID-19 dataset in Japan’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19 dataset in Japan’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-dataset-in-japan-2665/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Japan
    Description

    Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    1. Context

    This is a COVID-19 dataset in Japan. This does not include the cases in the Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) and the Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).

    • Total number of cases in Japan
    • The number of vaccinated people (New/experimental)
    • The number of cases at prefecture level
    • Metadata of each prefecture

    Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data

    This dataset can be retrieved with CovsirPhy (Python library).

    pip install covsirphy --upgrade
    
    import covsirphy as cs
    data_loader = cs.DataLoader()
    japan_data = data_loader.japan()
    # The number of cases (Total/each province)
    clean_df = japan_data.cleaned()
    # Metadata
    meta_df = japan_data.meta()
    

    Please refer to CovsirPhy Documentation: Japan-specific dataset.

    Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.

    1.1 Total number of cases in Japan

    covid_jpn_total.csv

    Cumulative number of cases:
    • PCR-tested / PCR-tested and positive
    • with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
    • discharged
    • fatal

    The number of cases:
    • requiring hospitalization (from 09May2020)
    • hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
    • requiring hospitalization, but waiting in hotels or at home (to 08May2020)

    In the primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.

    The data was manually collected from the Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    The number of vaccinated people:
    • Vaccinated_1st: the number of persons vaccinated for the first time on the date
    • Vaccinated_2nd: the number of persons vaccinated with the second dose on the date
    • Vaccinated_3rd: the number of persons vaccinated with the third dose on the date

    Data sources for vaccination:
    • To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績 (in Japanese) / 首相官邸 新型コロナワクチンについて
    • From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報)

    1.2 The number of cases at prefecture level

    covid_jpn_prefecture.csv

    Cumulative number of cases:
    • PCR-tested / PCR-tested and positive
    • discharged
    • fatal

    The number of cases:
    • requiring hospitalization (from 09May2020)
    • hospitalized with severe symptoms (from 09May2020)

    Using a PDF-to-Excel converter, the data was manually collected from the Ministry of Health, Labour and Welfare HP:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English)

    Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total. When you analyse total data in Japan, please use covid_jpn_total data.

    1.3 Metadata of each prefecture

    covid_jpn_metadata.csv
    • Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表
    • Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)

    2. Acknowledgements

    To create this dataset, edited and transformed data of the following sites was used.

    厚生労働省 Ministry of Health, Labour and Welfare, Japan:
    厚生労働省 HP (in Japanese)
    Ministry of Health, Labour and Welfare HP (in English) 厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan: 国土交通省 HP (in Japanese) 国土交通省 HP (in English) 国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)

    Code for Japan / COVID-19 Japan: Code for Japan COVID-19 Japan Dashboard (CC BY 4.0) COVID-19 Japan 都道府県別 感染症病床数 (CC BY)

    Wikipedia: Wikipedia

    LinkData: LinkData (Public Domain)

    Inspiration

    1. Changes in number of cases over time
    2. Percentage of patients without symptoms / mild or severe symptoms
    3. What to do next to prevent outbreak

    License and how to cite

    Kindly cite this dataset under the CC BY-4.0 license as follows:
    • Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
    • Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan

    --- Original source retains full ownership of the source dataset ---

  12. Vezora/Tested-188k-Python-Alpaca: Functional

    • kaggle.com
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Vezora/Tested-188k-Python-Alpaca: Functional [Dataset]. https://www.kaggle.com/datasets/thedevastator/vezora-tested-188k-python-alpaca-functional-pyth
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Vezora/Tested-188k-Python-Alpaca: Functional Python Code Dataset

    188k Functional Python Code Samples

    By Vezora (From Huggingface) [source]

    About this dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, specifically designed for training and analysis purposes. With 188,000 samples, this dataset offers an extensive range of examples that cater to the research needs of Python programming enthusiasts.

    This valuable resource consists of various columns, including input, which represents the input or parameters required for executing the Python code sample. The instruction column describes the task or objective that the Python code sample aims to solve. Additionally, there is an output column that showcases the resulting output generated by running the respective Python code.

    By utilizing this dataset, researchers can effectively study and analyze real-world scenarios and applications of Python programming. Whether for educational purposes or development projects, this dataset serves as a reliable reference for individuals seeking practical examples and solutions using Python.

    How to use the dataset

    The Vezora/Tested-188k-Python-Alpaca dataset is a comprehensive collection of functional Python code samples, containing 188,000 samples in total. This dataset can be a valuable resource for researchers and programmers interested in exploring various aspects of Python programming.

    Contents of the Dataset

    The dataset consists of several columns:

    • output: This column represents the expected output or result that is obtained when executing the corresponding Python code sample.
    • instruction: It provides information about the task or instruction that each Python code sample is intended to solve.
    • input: The input parameters or values required to execute each Python code sample.

    Exploring the Dataset

    To make effective use of this dataset, it is essential to understand its structure and content properly. Here are some steps you can follow:

    • Importing Data: Load the dataset into your preferred environment for data analysis using appropriate tools like pandas in Python.
    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('train.csv')
    
    • Understanding Column Names: Familiarize yourself with the column names and their meanings by referring to the provided description.
    # Display column names
    print(df.columns)
    
    • Sample Exploration: Get an initial understanding of the data structure by examining a few random samples from different columns.
    # Display random samples from 'output' column
    print(df['output'].sample(5))
    
    • Analyzing Instructions: Analyze different instructions or tasks present in the 'instruction' column to identify specific areas you are interested in studying or learning about.
    # Count unique instructions and display top ones with highest occurrences
    instruction_counts = df['instruction'].value_counts()
    print(instruction_counts.head(10))
    

    Potential Use Cases

    The Vezora/Tested-188k-Python-Alpaca dataset can be utilized in various ways:

    • Code Analysis: Analyze the code samples to understand common programming patterns and best practices.
    • Code Debugging: Use code samples with known outputs to test and debug your own Python programs.
    • Educational Purposes: Utilize the dataset as a teaching tool for Python programming classes or tutorials.
    • Machine Learning Applications: Train machine learning models to predict outputs based on given inputs.

    Remember that this dataset provides a plethora of diverse Python coding examples, allowing you to explore different

    Research Ideas

    • Code analysis: Researchers and developers can use this dataset to analyze various Python code samples and identify patterns, best practices, and common mistakes. This can help in improving code quality and optimizing performance.
    • Language understanding: Natural language processing techniques can be applied to the instruction column of this dataset to develop models that can understand and interpret natural language instructions for programming tasks.
    • Code generation: The input column of this dataset contains the required inputs for executing each Python code sample. Researchers can build models that generate Python code based on specific inputs or task requirements using the examples provided in this dataset. This can be useful in automating repetitive programming tasks o...
  13. ‘Waiter's Tips Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    + more versions
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Waiter's Tips Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-waiter-s-tips-dataset-b284/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Waiter's Tips Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aminizahra/tips-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.

    Acknowledgements

    The data was reported in a collection of case studies for business statistics.

    Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing

    The dataset is also available through the Python package Seaborn.

    Hint

    Of course, this database has additional columns compared to other tips datasets.
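
    Since the data also ships with Seaborn, the standard 244-row version (without the extra columns mentioned above) can be loaded directly; a minimal example:

    import seaborn as sns

    # Loads the classic tips dataset bundled with seaborn
    tips = sns.load_dataset("tips")
    tips.info()
    print(tips.head())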

    Dataset info

    RangeIndex: 244 entries, 0 to 243

    Data columns (total 11 columns):

    | # | Column | Non-Null Count | Dtype |
    | - | ------ | -------------- | ----- |
    | 0 | total_bill | 244 non-null | float64 |
    | 1 | tip | 244 non-null | float64 |
    | 2 | sex | 244 non-null | object |
    | 3 | smoker | 244 non-null | object |
    | 4 | day | 244 non-null | object |
    | 5 | time | 244 non-null | object |
    | 6 | size | 244 non-null | int64 |
    | 7 | price_per_person | 244 non-null | float64 |
    | 8 | Payer Name | 244 non-null | object |
    | 9 | CC Number | 244 non-null | int64 |
    | 10 | Payment ID | 244 non-null | object |

    dtypes: float64(3), int64(2), object(6)

    Some details

    total_bill: a numeric vector, the bill amount (dollars)

    tip: a numeric vector, the tip amount (dollars)

    sex: a factor with levels Female/Male, gender of the payer of the bill

    smoker: a factor with levels No/Yes, whether the party included smokers

    day: a factor with levels Friday/Saturday/Sunday/Thursday, day of the week

    time: a factor with levels Day/Night, rough time of day

    size: a numeric vector, number of people in the party

    --- Original source retains full ownership of the source dataset ---

  14. fashion_mnist

    • tensorflow.org
    • opendatalab.com
    • +3more
    Updated Jun 1, 2024
    Cite
    (2024). fashion_mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/fashion_mnist
    Explore at:
    Dataset updated
    Jun 1, 2024
    Description

    Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('fashion_mnist', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    [Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png]

  15. ‘Secondary Mushroom (Date Donated 2021)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Secondary Mushroom (Date Donated 2021)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-secondary-mushroom-date-donated-2021-057c/12557338/?iid=017-114&v=presentation
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Secondary Mushroom (Date Donated 2021)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yasserhessein/secondary-mushroom-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The given information is about the Secondary Mushroom Dataset; the Primary Mushroom Dataset used for the simulation and the respective metadata can be found in the zip.

    Content

    This dataset includes 61069 hypothetical mushrooms with caps based on 173 species (353 mushrooms per species). Each mushroom is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (the latter class was combined with the poisonous class).

    Acknowledgements

    Donor: D. Wagner, dwagner93 '@' gmx.de. Product of a bachelor thesis at Philipps-Universität Marburg, Bioinformatics Division, supervised by Dr. G. Hattab.
    Repository containing the related Python scripts and all the data sets: https://mushroom.mathematik.uni-marburg.de/files/
    Inspired by the Mushroom Data Set of J. Schlimmer: https://archive.ics.uci.edu/ml/datasets/Mushroom

    Inspiration

    The related Python project contains a Python module, secondary_data_generation.py, used to generate this data based on primary_data_edited.csv, also found in the repository. Both nominal and metrical variables are a result of randomization. The simulated version, ordered by species, is found in secondary_data_generated.csv. The randomly shuffled version is found in secondary_data_shuffled.csv.
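
    As a quick sketch of loading the generated data (the ';' separator and the 'class' column with 'e'/'p' labels are assumptions based on the related UCI release):

    import pandas as pd

    # Load the shuffled version and check the edible ('e') / poisonous ('p') balance
    df = pd.read_csv("secondary_data_shuffled.csv", sep=";")
    print(df.shape)
    print(df["class"].value_counts())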


    --- Original source retains full ownership of the source dataset ---

  16. Answers of the Questions of the DataCamp DS Course

    • kaggle.com
    Updated Mar 12, 2023
    Cite
    Paul Driessens (2023). Answers of the Questions of the DataCamp DS Course [Dataset]. https://www.kaggle.com/datasets/pauleyd/answers-of-the-questions-of-the-datacamp-ds-course/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Paul Driessens
    Description

    Dataset

    This dataset was created by Paul Driessens

    Contents

  17. ‘COVID-19: Holidays of countries’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19: Holidays of countries’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-holidays-of-countries-d8bd/e5a9e831/
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19: Holidays of countries’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vbmokin/covid19-holidays-of-countries on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This research is devoted to the analysis of the impact of holidays on the statistics of confirmed coronavirus diseases. Prophet uses the holidays library with holidays of countries and their regions. As of 30 June 2020, only 62 countries (some with regions) are available in the holidays library:

    ['AR', 'AT', 'AU', 'BD', 'BE', 'BG', 'BR', 'BY', 'CA', 'CH', 'CL', 'CN', 'CO', 'CZ', 'DE', 'DK', 'DO', 'EE', 'EG', 'ES', 'FI', 'FR', 'GB', 'GR', 'HN', 'HR', 'HU', 'ID', 'IE', 'IL', 'IN', 'IS', 'IT', 'JP', 'KE', 'KR', 'LT', 'LU', 'MX', 'MY', 'NG', 'NI', 'NL', 'NO', 'NZ', 'PE', 'PH', 'PK', 'PL', 'PT', 'PY', 'RS', 'RU', 'SE', 'SG', 'SI', 'SK', 'TH', 'TR', 'UA', 'US', 'ZA'] or ['Argentina', 'Australia', 'Austria', 'Bangladesh', 'Belarus', 'Belgium', 'Brazil', 'Bulgaria', 'Canada', 'Chile', 'China', 'Colombia', 'Croatia', 'Czechia', 'Denmark', 'Dominican Republic', 'Egypt', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Honduras', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Japan', 'Kenya', 'Korea, Republic of', 'Lithuania', 'Luxembourg', 'Malaysia', 'Mexico', 'Netherlands', 'New Zealand', 'Nicaragua', 'Nigeria', 'Norway', 'Pakistan', 'Paraguay', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Russian Federation', 'Serbia', 'Singapore', 'Slovakia', 'Slovenia', 'South Africa', 'Spain', 'Sweden', 'Switzerland', 'Thailand', 'Turkey', 'Ukraine', 'United Kingdom', 'United States']

    I will note at once that the list of available countries in the description of the holidays library contains a lot of mistakes, which I have reported to the authors.

    When I asked if this list would expand, the Prophet team made it clear that they were waiting for help from the community with expanding the holidays library.

    As of Jan 2021 (version 8.4.1), 67 countries (some with regions) are available in the holidays library: a number of data have been refined, and the countries ['BI', 'LV', 'MA', 'RO', 'VN'] (two-letter alpha_2 country codes, ISO 3166) have been added.

    Unfortunately, the format of the holidays library is not very suitable for coronavirus problems, as it has a number of disadvantages. First, the names of the countries are given in one word, which makes it difficult to identify many of them according to their common names (ISO 3166). It is best that the dataset contains the common name and the two-letter abbreviation in English according to ISO 3166 (see pycountry). Second, the dates are not adapted to the potential impact of the holidays on coronavirus statistics. It is known that after the moment of infection, the active manifestation of symptoms occurs with a delay of 4-10 days, that is, a person is likely to enter the disease statistics only after 4-7 days. Therefore, it is advisable to use a window of impact dates:

    Lower_window = [4, 7]
    Upper_window = [7, 10]

    However, Prophet requires Lower_window <= 0. My request to allow positive numbers in this parameter (https://github.com/facebook/prophet/issues/1588#issue-661098613) was refused by the Prophet team (https://github.com/facebook/prophet/issues/1588#issuecomment-661984730), who advised simply moving the dates themselves. Therefore, it is advisable to shift the holiday dates by 7 days. If the researcher thinks that 7 days is too much and 4 days is enough, they simply set the lower bound of the window to -3. By default, it makes sense to specify the parameters:

    Lower_window = -3
    Upper_window = 3

    If necessary, these settings are easy to change.
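
    As a sketch of how such a holidays table plugs into Prophet (the column layout follows Prophet's holidays interface; the concrete holiday, date, and values below are assumptions):

    import pandas as pd
    from prophet import Prophet

    # One row per (holiday, date); dates already shifted +7 days as described above
    holidays = pd.DataFrame({
        "holiday": ["new_year_shifted"],
        "ds": pd.to_datetime(["2021-01-08"]),
        "lower_window": [-3],
        "upper_window": [3],
        "prior_scale": [10.0],
    })

    model = Prophet(holidays=holidays)
    # model.fit(df)  # df must have columns 'ds' (date) and 'y' (confirmed cases)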
    
    Content

    This dataset:
    1. Contains ISO codes and ISO names (common and official) (ISO 3166) of 70 countries (3 European countries ['Albania' - 'AL', 'Georgia' - 'GE', 'Moldova' - 'MD'] have been added).
    2. Contains imported dates from the holidays library for 2020-01-20 to 2021-12-31 (all countries from the holidays library as of Jan 2021), and the same dates moved 7 days forward.
    3. Contains holidays of countries that are not in the holidays library, but which are in the data of the World Health Organization and for which considerable coronavirus statistics have already been collected.
    4. Contains parameters for the Prophet model: lower_window, upper_window, prior_scale.

    If you find errors, please write to the Discussion: https://www.kaggle.com/vbmokin/covid19-holidays-of-countries/discussion

    It is planned to periodically update (and, if necessary, correct) this dataset.

    Acknowledgements

    Thanks to the authors of the information resources about the dates and names of holidays in different countries, which I used:
    • https://github.com/dr-prodigy/python-holidays
    • https://en.wikipedia.org/wiki/List_of_holidays_by_country

    Thanks for the image to iXimus from Pixabay.

    Inspiration

    The main task for which this dataset was created is to study the impact of holidays on the accuracy of predicting coronavirus diseases, identifying new patterns, and forming optimal solutions to counteract or minimize its spread.

    Tasks that need to be solved to improve this dataset in order to increase the accuracy of modeling the impact of holidays on the number of coronavirus patients:
    1. Expanding the list of countries
    2. Clarifying holiday dates
    3. Clarifying the parameters lower_window, upper_window, prior_scale; they must be unique for each country and each holiday.

    Also, it is advisable to carry out similar work for each region of countries, but this will not be done in this dataset.

    --- Original source retains full ownership of the source dataset ---
    
  18. Blog-1K

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Dec 21, 2022
    Cite
    Haining Wang (2022). Blog-1K [Dataset]. http://doi.org/10.5281/zenodo.7455623
    Explore at:
    application/gzipAvailable download formats
    Dataset updated
    Dec 21, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Haining Wang
    License

    https://www.isc.org/downloads/software-support-policy/isc-license/

    Description

    The Blog-1K corpus is a redistributable authorship identification testbed for contemporary English prose. It has 1,000 candidate authors, 16K+ posts, and a pre-defined data split (train/dev/test proportional to ca. 8:1:1). It is a subset of the Blog Authorship Corpus from Kaggle. The MD5 for Blog-1K is '0a9e38740af9f921b6316b7f400acf06'.

    1. Preprocessing

    We first filter out texts shorter than 1,000 characters. Then we select one thousand authors whose writings meet the following criteria:
    - accumulatively at least 10,000 characters,
    - accumulatively at most 49,410 characters,
    - accumulatively at least 16 posts,
    - accumulatively at most 40 posts, and
    - each text has at least 50 function words found in the Koppel512 list (to filter out non-English prose).

    Blog-1K has three columns: 'id', 'text', and 'split', where 'id' corresponds to its parent corpus.

    2. Statistics

    Its creation and statistics can be found in the Jupyter Notebook.

    | Split | # Authors | # Posts | # Characters | Avg. Characters Per Author (Std.) | Avg. Characters Per Post (Std.) |
    | ----- | --------- | ------- | ------------ | --------------------------------- | ------------------------------- |
    | Train | 1,000 | 16,132 | 30,092,057 | 30,092 (5,884) | 1,865 (1,007) |
    | Validation | 935 | 2,017 | 3,755,362 | 4,016 (2,269) | 1,862 (999) |
    | Test | 924 | 2,017 | 3,732,448 | 4,039 (2,188) | 1,850 (936) |


    3. Usage

    import pandas as pd
    
    df = pd.read_csv('blog1000.csv.gz', compression='infer')
    
    # read in training data
    train_text, train_label = zip(*df.loc[df.split=='train'][['text', 'id']].itertuples(index=False))

    4. License
    All the materials are licensed under the ISC License.


    5. Contact
    Please contact its maintainer for questions.

  19. Car Prices Poland

    • kaggle.com
    Updated Jan 20, 2022
    Cite
    Aleksandr Glotov (2022). Car Prices Poland [Dataset]. https://www.kaggle.com/datasets/aleksandrglotov/car-prices-poland
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 20, 2022
    Dataset provided by
    Kaggle
    Authors
    Aleksandr Glotov
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Hello! I am a Python developer, and I am training my ability to parse and analyze data and to create neural networks. I decided to share the dataset I am currently working with; maybe it will be useful to someone. Why cars? I love cars! As well as programming. If you are interested in me as a specialist, please contact me at glotoffalexandr@gmail.com (I'm looking for a new job).

    The dataset was assembled in January 2022. The data comes from a well-known (and public) car sale site in Poland. Selenium and requests were used for parsing (Python, of course).

    The dataset contains information about the make, model, generation, year of production, mileage, engine type and volume, location, and price.

    I have ideas for expanding the model: add body type, configuration, color, power, etc.

    I see many ways to use models created on the basis of this dataset; I will describe them in notebooks.

  20. Python Statistics Essential Training

    • kaggle.com
    Updated Nov 17, 2023
    Cite
    MD.Tanvir Rahman (2023). Python Statistics Essential Training [Dataset]. https://www.kaggle.com/datasets/tanvirrahmanornob/python-statistics-essential-training
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    MD.Tanvir Rahman
    Description

    Dataset

    This dataset was created by MD.Tanvir Rahman

    Contents
