100+ datasets found
  1. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Feb 28, 2021
    Cite
    Zenodo (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/88u/dataset/oai-zenodo-org-4571228
    Explore at:
    unknown (395470535). Available download formats
    Dataset updated
    Feb 28, 2021
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17, 2020. It contains more than 5.4K Python repositories hosted on GitHub; see the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs. The dataset is de-duplicated using the CD4Py tool, and the list of duplicate files is provided in the duplicate_files.txt file. All Python projects are processed into JSON-formatted files, which contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file. The dataset is split into train, validation, and test sets by source code file; the list of files and their corresponding set is provided in the dataset_split.csv file. Notable changes to each version of the dataset are documented in CHANGELOG.md.
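    A minimal sketch of inspecting the split file and one processed project (file names come from the description above; the JSON path and the CSV column layout are assumptions, not part of the dataset spec):

    import json
    import pandas as pd

    # dataset_split.csv lists each source file and its train/validation/test assignment.
    split = pd.read_csv("dataset_split.csv")
    print(split.head())

    # Hypothetical path to one of the JSON-formatted project files described above.
    with open("processed_projects/example_project.json") as f:
        project = json.load(f)
    print(type(project))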

  2. Dataset_Python_Question_Answer

    • kaggle.com
    zip
    Updated Mar 29, 2024
    Cite
    Chinmaya (2024). Dataset_Python_Question_Answer [Dataset]. https://www.kaggle.com/datasets/chinmayadatt/dataset-python-question-answer
    Explore at:
    zip (189137 bytes). Available download formats
    Dataset updated
    Mar 29, 2024
    Authors
    Chinmaya
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    This dataset is about Python programming. Questions and answers were generated using Gemma. There are more than four hundred questions and their corresponding answers about Python programming.

    Questions range from concepts like data types, variables, and keywords to regular expressions and threading.

    I have used this dataset here

    The code used to generate the dataset is available here

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • explore.openaire.eu
    • data.europa.eu
    Updated Apr 26, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4044635
    Explore at:
    Dataset updated
    Apr 26, 2021
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    Description

    The dataset was gathered on Sep. 17, 2020 from GitHub. It contains more than 5.2K Python repositories and 4.2M type annotations. The dataset is de-duplicated using the CD4Py tool. See the README.MD file for a description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.

  4. Data from: ManyTypes4Py: A Benchmark Python Dataset for Machine...

    • data.europa.eu
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). ManyTypes4Py: A Benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-5244636?locale=lv
    Explore at:
    unknown (1052407809). Available download formats
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset was gathered on Sep. 17, 2020 from GitHub. Since v0.7 it has clean and complete versions: the clean version has 5.1K type-checked Python repositories and 1.2M type annotations, while the complete version has 5.2K Python repositories and 3.3M type annotations. The source files of the clean version are type-checked using mypy. The dataset is de-duplicated using the CD4Py tool. See the README.MD file for a description of the dataset. Notable changes to each version of the dataset are documented in CHANGELOG.md. The dataset's scripts and utilities are available in its GitHub repository.

  5. CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type...

    • data.niaid.nih.gov
    Updated Jan 28, 2022
    Cite
    Gruner, Bernd; Heinze, Thomas; Brust, Clemens-Alexander (2022). CrossDomainTypes4Py: A Python Dataset for Cross-Domain Evaluation of Type Inference Systems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5747023
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset provided by
    German Aerospace Center (DLR), Cooperative University Gera-Eisenach
    Authors
    Gruner, Bernd; Heinze, Thomas; Brust, Clemens-Alexander
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains Python repositories mined from GitHub on January 20, 2021. It allows a cross-domain evaluation of type inference systems. For this purpose, it consists of two sub-datasets, each containing only projects from the web domain or the scientific calculation domain, respectively; to that end, we searched for projects with dependencies on either Flask or NumPy. Furthermore, only projects with a dependency on mypy were considered, because this should ensure that at least parts of the projects have type annotations, which can later be used as ground truth. Further details about the dataset will be described in an upcoming paper; as soon as it is published, it will be linked here. The dataset consists of two files, one per sub-dataset: the web domain dataset contains 3129 repositories and the scientific calculation domain dataset contains 4783 repositories. The files have two columns with the URL of the GitHub repository and the used commit hash. It is thus possible to download the dataset using shell or Python scripts; for example, the pipeline provided by ManyTypes4Py can be used. If repositories no longer exist or are private, you can contact us via the following email address: bernd.gruner@dlr.de. We have a backup of all repositories and will be happy to help you.
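    A minimal sketch of the download step (the file name and CSV column layout are assumptions based on the description above, not confirmed by the dataset):

    import subprocess
    import pandas as pd

    # Hypothetical file name; each row is assumed to hold a repository URL and a commit hash.
    repos = pd.read_csv("web_domain_dataset.csv", names=["url", "commit"])

    for url, commit in repos.head(3).itertuples(index=False):
        name = url.rstrip("/").split("/")[-1]
        subprocess.run(["git", "clone", url, name], check=True)                # clone the repository
        subprocess.run(["git", "-C", name, "checkout", commit], check=True)    # pin to the recorded commit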

  6. User Profiling and Segmentation Project

    • kaggle.com
    Updated Jul 9, 2024
    Cite
    Sanjana Murthy (2024). User Profiling and Segmentation Project [Dataset]. https://www.kaggle.com/datasets/sanjanamurthy392/user-profiling-and-segmentation-project
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sanjana Murthy
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    About the dataset:
    • Domain: Marketing
    • Project: User Profiling and Segmentation
    • Dataset: user_profile_for_ads.csv
    • Dataset type: Excel data
    • Dataset size: 16k+ records

    KPIs:
    1. Distribution of key demographic variables: a. Count of Age b. Count of Gender c. Count of Education Level d. Count of Income Level e. Count of Device Usage
    2. Understanding online behavior: a. Count of Time Spent Online (hrs/Weekday) b. Count of Time Spent Online (hrs/Weekend)
    3. Ad interaction metrics: a. Count of Likes and Reactions b. Count of Click-Through Rates (CTR) c. Count of Conversion Rate d. Count of Ad Interaction Time (secs) e. Count of Ad Interaction Time by Top Interests

    Process: 1. Understanding the problem 2. Data collection 3. Exploring and analyzing the data 4. Interpreting the results

    The accompanying analysis code uses pandas, matplotlib, seaborn, isnull, set_style, suptitle, countplot, palette, tight_layout, figsize, histplot, barplot, sklearn, StandardScaler, OneHotEncoder, ColumnTransformer, Pipeline, KMeans, cluster_means, groupby, numpy, radar_df.
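    A minimal sketch of the segmentation step implied by the tools listed above (the column names are assumptions based on the listed KPIs; check user_profile_for_ads.csv for the exact schema):

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("user_profile_for_ads.csv")

    # Hypothetical column names, chosen to mirror the KPIs listed above.
    numeric = ["Time Spent Online (hrs/Weekday)", "Time Spent Online (hrs/Weekend)", "Likes and Reactions"]
    categorical = ["Gender", "Education Level", "Income Level", "Device Usage"]

    prep = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    model = Pipeline([("prep", prep), ("kmeans", KMeans(n_clusters=5, random_state=0))])

    df["segment"] = model.fit_predict(df)
    cluster_means = df.groupby("segment")[numeric].mean()  # per-segment profile
    print(cluster_means)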

  7. Code Similarity Dataset – Python Variants

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Hem Ajit Patel (2025). Code Similarity Dataset – Python Variants [Dataset]. https://www.kaggle.com/datasets/hemajitpatel/code-similarity-dataset-python-variants
    Explore at:
    zip (39806 bytes). Available download formats
    Dataset updated
    Jul 6, 2025
    Authors
    Hem Ajit Patel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Code Similarity Dataset – Python Variants

    A collection of code snippets solving common programming problems in multiple variations.

    Each problem has 20+ versions, written in different styles and logic patterns, making this dataset ideal for studying:

    • Code similarity
    • Plagiarism detection
    • AI-based code search
    • Code classification
    • Semantic code retrieval

    📚 What's Inside?

    The dataset includes the following tasks:
    • Reverse a String
    • Find Max in List
    • Check if a Number is Prime
    • Check if a String is a Palindrome
    • Generate Fibonacci Sequence

    Each task contains:
    • 20 variations of code
    • Metadata file describing method and notes
    • README with usage instructions

    Column Descriptions

    The full_metadata.csv file contains the following fields:

    • problem_type: The programming task solved (e.g., reverse_string, max_in_list)
    • id: Unique ID of the snippet within that problem group
    • filename: Filename of the code snippet (e.g., snip_01.py)
    • language: Programming language used (Python)
    • method: Type of approach used (e.g., Slicing, Recursive, While loop)
    • notes: Additional details about the logic or style used in the snippet

    🗂 Folder Structure

    CodeSimilarityDataset/
    ├── reverse_string/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── max_in_list/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_prime/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── is_palindrome/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    ├── fibonacci/
    │   ├── snippets/
    │   ├── metadata.csv
    │   └── README.txt
    └── full_metadata.csv   ← Combined metadata across all problems

    🔍 Use Cases

    • Train models to detect similar code logic
    • Build plagiarism detection systems
    • Improve code recommendation engines
    • Teach students about code variation
    • Benchmark code search algorithms

    🧪 Sample Applications

    Visualize logic type distribution

    Compare structural similarity (AST/difflib/token matching)

    Cluster similar snippets using embeddings

    Train code-style-aware LLMs

    📦 File Formats

    All code snippets are .py files. Metadata is provided in CSV format for easy loading into pandas or other tools.

    🛠 How to Use

    You can load metadata easily with Python:

    import pandas as pd

    df = pd.read_csv('full_metadata.csv')
    print(df.sample(5))

    Then read any snippet:

    with open("reverse_string/snippets/snip_01.py") as f:
        code = f.read()
    print(code)
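    A minimal sketch of the similarity comparison mentioned under Sample Applications (the second snippet path is an assumption; each task is described as having 20 variations):

    import difflib

    # Load two variants of the same problem (paths follow the folder structure above).
    with open("reverse_string/snippets/snip_01.py") as f:
        a = f.read()
    with open("reverse_string/snippets/snip_02.py") as f:  # assumed to exist
        b = f.read()

    # SequenceMatcher gives a rough textual similarity ratio between 0 and 1.
    score = difflib.SequenceMatcher(None, a, b).ratio()
    print(f"similarity: {score:.2f}")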

    📄 License

    This dataset is released under the MIT License — free to use, modify, and distribute with proper attribution.

  8. Date-Dataset

    • kaggle.com
    zip
    Updated Aug 18, 2021
    Cite
    nishtha kukreti (2021). Date-Dataset [Dataset]. https://www.kaggle.com/nishthakukreti/datedataset
    Explore at:
    zip (38657 bytes). Available download formats
    Dataset updated
    Aug 18, 2021
    Authors
    nishtha kukreti
    Description

    Context

    This is a random date dataset that I generated with a Python script, intended for building a machine learning model that tags dates in any given document.

    Content

    This dataset indicates whether a given word or sequence of words is a date or not.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Implement a machine learning or deep learning model, or train a custom spaCy pipeline, to tag dates and other parts of speech.
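    A minimal sketch of the kind of tagging this dataset targets, using spaCy's pretrained English model rather than a model trained on this data (it assumes en_core_web_sm is installed):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The invoice was issued on 18 August 2021 and is due next Friday.")

    # Pretrained NER already emits DATE spans; a custom model trained on this
    # dataset could refine or replace this behavior.
    for ent in doc.ents:
        if ent.label_ == "DATE":
            print(ent.text, ent.start_char, ent.end_char)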

  9. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csv. Available download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.

    The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files: snippets from kernels up to the year 2020 (code_blocks_upto_20.csv) and those from the year 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
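    A minimal sketch of loading the labeled snippets and resolving their semantic types (the file names come from the description above; the join column names are assumptions, so check the CSV headers):

    import pandas as pd

    snippets = pd.read_csv("markup_data_20220415.csv")
    vertices = pd.read_csv("actual_graph_2022-06-01.csv")

    # Map each block's numeric semantic-type id to its semantic type/subclass.
    # "graph_vertex_id" and "id" are assumed names for the join keys.
    labeled = snippets.merge(vertices, left_on="graph_vertex_id", right_on="id", how="left")
    print(labeled.head())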

  10. reason_code-search-net-python

    • huggingface.co
    • opendatalab.com
    Updated May 15, 2023
    Cite
    Fernando Tarin Morales (2023). reason_code-search-net-python [Dataset]. https://huggingface.co/datasets/Nan-Do/reason_code-search-net-python
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 15, 2023
    Authors
    Fernando Tarin Morales
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "reason_code-search-net-python"

      Dataset Summary
    

    This dataset is an instructional dataset for Python. It contains five different kinds of tasks.
    Given a Python 3 function:

    Type 1: Generate a summary explaining what it does. (For example: This function counts the number of objects stored in the jsonl file passed as input.) Type 2: Generate a summary explaining what its input parameters represent (for example: infile: a file descriptor of a file… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/reason_code-search-net-python.

  11. Comprehensive Pokemon Dataset

    • kaggle.com
    Updated Jun 17, 2025
    Cite
    Tanishk Sharma (2025). Comprehensive Pokemon Dataset [Dataset]. https://www.kaggle.com/datasets/tanishksharma9905/pokemon-data-csv
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Tanishk Sharma
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description


    📄 Dataset Description

    This dataset provides detailed information on all available Pokémon, sourced directly from the PokeAPI. It includes key attributes such as:

    • ID and Name
    • Base experience, height, and weight
    • Primary and secondary types
    • Abilities
    • Top 5 moves
    • Base stats (e.g., HP, Attack, Defense, Speed)

    The dataset is ideal for:

    • 📊 Exploratory data analysis
    • 🧠 Machine learning projects
    • 🎮 Game design and balance modeling
    • 📚 Educational or statistical learning

    All data is extracted programmatically via the official PokeAPI using Python and stored in a structured MySQL table before export.

    This dataset contains detailed information for all Pokémon fetched from the PokeAPI, including:

    • Basic attributes (ID, Name, Height, Weight)
    • Combat stats (Attack, Defense, HP, Speed, etc.)
    • Types (e.g. Grass, Poison, Fire)
    • Abilities (e.g. Overgrow, Blaze)
    • Top 5 Moves

    Data is fetched programmatically using Python and stored in a MySQL database.

    This dataset is ideal for:

    • Data analysis
    • Machine learning projects
    • Pokémon classification models
    • Power BI/Tableau visualizations
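    A minimal sketch of the extraction step described above (the endpoint and field names follow the public PokeAPI; the MySQL loading step is omitted):

    import requests

    resp = requests.get("https://pokeapi.co/api/v2/pokemon/bulbasaur", timeout=10)
    data = resp.json()

    # Flatten one Pokémon record into the attributes listed above.
    row = {
        "id": data["id"],
        "name": data["name"],
        "base_experience": data["base_experience"],
        "height": data["height"],
        "weight": data["weight"],
        "types": [t["type"]["name"] for t in data["types"]],
        "abilities": [a["ability"]["name"] for a in data["abilities"]],
        "top_moves": [m["move"]["name"] for m in data["moves"][:5]],
        "stats": {s["stat"]["name"]: s["base_stat"] for s in data["stats"]},
    }
    print(row)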

  12. Datasets for manuscript "Data engineering for tracking chemicals and...

    • catalog.data.gov
    • gimi9.com
    Updated Feb 10, 2021
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2021). Datasets for manuscript "Data engineering for tracking chemicals and releases at industrial end-of-life activities" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-data-engineering-for-tracking-chemicals-and-releases-at-industrial
    Explore at:
    Dataset updated
    Feb 10, 2021
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The GitHub repository contains a Python script (MC_Case_Study.py) to support and replicate the case study results shown in the manuscript entitled "Data engineering for tracking chemicals and releases at industrial end-of-life activities." It also indicates the freely available Python libraries that are required for running "MC_Case_Study.py." The dataset "EoL_database_for_MC.csv" contains all data needed to execute the Python code and obtain "Figure 6: 6-level Sankey diagram for the case study", "Figure 7: Box plot for the case study", and "Figure 8: Histogram for the case study." A table describing the data name entry and data type for the dataset "EoL_database_for_MC.csv" is included. This dataset information and Python code are also provided in the manuscript Supporting Info file (see supporting documents). This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, J.P. Abraham, M. Martin, W.W. Ingwersen, and R.L. Smith. Data engineering for tracking chemicals and releases at industrial end-of-life activities. JOURNAL OF HAZARDOUS MATERIALS. Elsevier Science Ltd, New York, NY, USA, 405: 124270, (2021).

  13. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Explore at:
    csv, txt. Available download formats
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonimous authors; Anonimous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index: Global index linking code blocks to markup_data.csv.
    • kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id: Position of the code block within the notebook.
    • code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id: Identifier for the Kaggle Jupyter notebook.
    • kaggle_score: Performance metric of the notebook.
    • kaggle_comments: Number of comments on the notebook.
    • kaggle_upvotes: Number of upvotes the notebook received.
    • kernel_link: URL to the notebook.
    • comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name: Name of the Kaggle competition.
    • description: Overview of the competition task.
    • data_type: Type of data used in the competition.
    • comp_type: Classification of the competition.
    • subtitle: Short description of the task.
    • EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    • data_sources: Links to datasets used.
    • metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block: Machine learning code block.
    • too_long: Flag indicating whether the block spans multiple semantic types.
    • marks: Confidence level of the annotation.
    • graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
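    A minimal sketch of the table linkage described above (file and column names follow the dataset description; the paths assume the extracted CSVs sit in the working directory):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels = pd.read_csv("kernels_meta.csv")
    competitions = pd.read_csv("competitions_meta.csv")

    # code_blocks -> kernels via kernel_id, kernels -> competitions via comp_name.
    blocks = code_blocks.merge(kernels, on="kernel_id", how="inner")
    full = blocks.merge(competitions, on="comp_name", how="left")
    print(full[["kernel_id", "comp_name", "kaggle_score"]].head())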

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  14. chessbenchmate_1.0

    • huggingface.co
    Updated Nov 18, 2025
    + more versions
    Cite
    Joshua Gao (2025). chessbenchmate_1.0 [Dataset]. https://huggingface.co/datasets/joshuakgao/chessbenchmate_1.0
    Explore at:
    Dataset updated
    Nov 18, 2025
    Authors
    Joshua Gao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset is an extension of the Action-Value ChessBench dataset from Amortized Planning with Large-Scale Transformers: A Case Study on Chess. We add mating data to each datapoint: if a forced mate is possible, the number of moves until mate is computed with Stockfish and added.

      Usage
    

    hf download joshuakgao/chessbenchmate --repo-type dataset --local-dir . --max-workers 1
    python dataset.py

  15. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    application/x-sqlite3. Available download formats
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    • isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
    • length (int): The length of the word in letters
    • word (text): The actual word/isogram in ASCII
    • source_pos (text): The Part of Speech tag from the original corpus
    • count (int): Token count (total number of occurrences)
    • vol_count (int): Volume count (number of different sources which contain the word)
    • count_per_million (int): Token count per million words
    • vol_count_as_percent (int): Volume count as percentage of the total number of volumes
    • is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
    • is_tautonym (bool): Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label

    Data type

    Description

    !total_1grams

    int

    The total number of words in the corpus

    !total_volumes

    int

    The total number of volumes (individual sources) in the corpus

    !total_isograms

    int

    The total number of isograms found in the corpus (before compacting)

    !total_palindromes

    int

    How many of the isograms found are palindromes

    !total_tautonyms

    int

    How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files: one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".
    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
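    A minimal sketch of querying the resulting database from Python (the table name is a guess; check create-database.sql for the actual schema):

    import sqlite3

    conn = sqlite3.connect("isograms.db")
    # Column names follow the CSV description above; "ngrams_isograms" is an assumed table name.
    query = (
        "SELECT word, length, count FROM ngrams_isograms "
        "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10"
    )
    for row in conn.execute(query):
        print(row)
    conn.close()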

  16. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  17. Observational Large Ensemble

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 19, 2017
    Cite
    Karen McKinnon (2017). Observational Large Ensemble [Dataset]. http://doi.org/10.7910/DVN/7CPJPQ
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 19, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Karen McKinnon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These Python datasets contain the results presented in the above paper with regard to the variability in trends over North America during DJF due to sampling of internal variability. Two types of files are available.

    The NetCDF file contains samples from the synthetic ensemble of DJF temperatures over North America from 1966-2015. The synthetic ensemble is centered on the observed trend. Recentering the ensemble on the ensemble-mean trend from the NCAR CESM1 LENS will create the Observational Large Ensemble, in which each sample can be viewed as a temperature history that could have occurred given various samplings of internal variability. The synthetic ensemble can also be recentered on any other estimate of the forced response to climate change. While the dataset covers both land and ocean, it has only been validated over land.

    The second type of file, provided as Python datasets (.npz), contains the results presented in the McKinnon et al (2017) reference. In particular, it contains the 50-year trends for both the observations and the NCAR CESM1 Large Ensemble that actually occurred, and that could have occurred given a different sampling of internal variability. The bootstrap results can be compared to the true spread across the NCAR CESM1 Large Ensemble for validation, as was done in the manuscript. Each of these files is named based on the observational dataset, variable, time span, and spatial domain. They contain:

    • BETA: the empirical OLS trend
    • BOOTSAMPLES: the OLS trends estimated after bootstrapping
    • INTERANNUALVAR: the interannual variance in the data after modeling and removing the forced trend
    • empiricalAR1: the empirical AR(1) coefficient estimated from the residuals around the forced trend

    The first dimension of all variables is 42, which is a stack of the ensemble mean behavior (index 0), the forty members of the NCAR Large Ensemble (indices 1:40), and the observations (last index, -1). The second dimension is spatial; see latlon.npz for the latitude and longitude vectors. The third dimension, when present, is the bootstrap samples. We have saved 1000 bootstrap samples.
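    A minimal sketch of reading one of the .npz files (the file name below is hypothetical, following the naming scheme of observational dataset, variable, time span, and spatial domain described above):

    import numpy as np

    data = np.load("BEST_tas_1966-2015_NorthAmerica.npz")  # hypothetical file name
    beta = data["BETA"]         # empirical OLS trends; first axis stacks ensemble mean, 40 LENS members, observations
    boot = data["BOOTSAMPLES"]  # bootstrapped OLS trends
    latlon = np.load("latlon.npz")  # latitude / longitude vectors for the spatial dimension
    print(beta.shape, boot.shape, latlon.files)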

  18. VSAT-3D Example Dataset

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1more
    Updated Apr 9, 2021
    Cite
    Méndez-Hernández, Hugo (2021). VSAT-3D Example Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4671100
    Explore at:
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Instituto de Física y Astronomía - Universidad de Valparaíso
    Authors
    Méndez-Hernández, Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset to run the Example.py script of the Valparaíso Stacking Analysis Tool (VSAT-3D). VSAT-3D provides a series of tools for selecting, stacking, and analyzing 3D spectra. It is intended for stacking samples of datacubes extracted from interferometric datasets belonging to large extragalactic catalogs: subsamples of galaxies can be selected by their available properties (e.g. redshift, stellar mass, star formation rate), and diverse composite spectra (e.g. median, average, weighted average, histogram) can be generated. It is also possible to use VSAT-3D on smaller datasets containing any type of astronomical object.

    VSAT-3D can be downloaded from the GitHub repository link.

  19. Dataset and Code for: Code problem similarity detection using code clones...

    • researchdata.ntu.edu.sg
    Updated May 8, 2023
    Cite
    Geremie Yun Siang Yeo; Geremie Yun Siang Yeo (2023). Dataset and Code for: Code problem similarity detection using code clones and pretrained models [Dataset]. http://doi.org/10.21979/N9/VPCR7H
    Explore at:
    text/x-python(17209), text/x-python(1832), application/x-ipynb+json(6814), zip(53320679), zip(57359817), application/x-ipynb+json(9307), text/x-python(13174), application/x-ipynb+json(65497). Available download formats
    Dataset updated
    May 8, 2023
    Dataset provided by
    DR-NTU (Data)
    Authors
    Geremie Yun Siang Yeo; Geremie Yun Siang Yeo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset complements the following study: Code problem similarity detection using code clones and pretrained models (SCSE22-0384). The study explores a new approach to detecting similar algorithmic-style code problems from websites such as LeetCode and Codeforces by comparing the similarity of the solution source codes, an application of type IV code clone detection. It is based on 107,000 submissions in 3 different languages (Python, C++ and Java) from 3,000 problems on Codeforces between 2020 and 2023. Experiments were carried out using 3 different pre-trained models on this dataset (C4-CodeBERT, GraphCodeBERT, UniXcoder). UniXcoder performed the best with an F1 score of 0.905. As such, UniXcoder was used as the backbone of the code problem similarity checker (CPSC), which identifies the problems in the dataset most similar to an input source code. Based on the tests conducted in this project, this approach achieves state-of-the-art results when it comes to detecting similarity between various code problems. More research can be done in domains where type IV code clone detection can be useful.
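    A minimal sketch of comparing two solutions with UniXcoder embeddings; this mirrors the general approach described above, not the authors' exact pipeline, and assumes the transformers and torch packages plus the public microsoft/unixcoder-base checkpoint:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
    model = AutoModel.from_pretrained("microsoft/unixcoder-base")

    def embed(code: str) -> torch.Tensor:
        inputs = tok(code, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)  # mean-pool token embeddings

    a = embed("def reverse(s):\n    return s[::-1]")
    b = embed("def rev(text):\n    return ''.join(reversed(text))")
    print(torch.cosine_similarity(a, b, dim=0).item())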

  20. Type Of Airplane Dataset

    • universe.roboflow.com
    zip
    Updated Sep 11, 2023
    + more versions
    Cite
    object detection yolov7 (2023). Type Of Airplane Dataset [Dataset]. https://universe.roboflow.com/object-detection-yolov7-0jmxy/type-of-airplane/model/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Sep 11, 2023
    Dataset provided by
    Object detection
    Authors
    object detection yolov7
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Airplane Bounding Boxes
    Description

    Type Of Airplane

    ## Overview
    
    Type Of Airplane is a dataset for object detection tasks - it contains Airplane annotations for 580 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    