84 datasets found
  1. Coding Questions Dataset

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    Kartikeya Pandey (2025). Coding Questions Dataset [Dataset]. https://www.kaggle.com/datasets/guitaristboy/coding-questions-dataset
    Explore at:
Available download formats: zip (135582 bytes)
    Dataset updated
    Oct 24, 2025
    Authors
    Kartikeya Pandey
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains a curated collection of programming questions, each paired with example inputs/outputs, constraints, and test cases.

    It is designed for use in machine learning research, code generation models, natural language processing (NLP) tasks, or simply as a question bank for learners and educators.

    Dataset Highlights:

    📘 616 questions with titles, descriptions, and difficulty levels (Easy, Medium, Hard)

    💡 Each question includes examples, constraints, and test cases stored as structured JSON

    🧠 Useful for LLM fine-tuning, question answering, and automated code evaluation tasks

    🧩 Ideal for creating or benchmarking AI coding assistants and educational apps

    Source: Collected from a structured internal question database built for educational and evaluation purposes.

    Format: CSV file with the following columns: id, title, description, difficulty_level, created_at, updated_at, examples, constraints, test_cases
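Because the examples, constraints, and test_cases columns are stored as JSON strings inside the CSV cells, each row needs a second parsing pass. A minimal sketch, using a hypothetical one-row excerpt with a subset of the documented columns (the sample values are invented for illustration; the real file will differ):

```python
import csv
import io
import json

# Hypothetical CSV excerpt mirroring a subset of the documented columns.
sample_csv = """id,title,difficulty_level,examples,constraints,test_cases
1,Two Sum,Easy,"[{""input"": ""[2,7,11,15], 9"", ""output"": ""[0,1]""}]","[""2 <= n <= 10^4""]","[{""input"": ""[3,3], 6"", ""expected"": ""[0,1]""}]"
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # The examples, constraints, and test_cases cells hold JSON strings.
    examples = json.loads(row["examples"])
    constraints = json.loads(row["constraints"])
    test_cases = json.loads(row["test_cases"])
    print(row["title"], row["difficulty_level"], len(test_cases))
```

The same two-step read (CSV first, then `json.loads` per structured cell) applies to the real file once downloaded.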

  2. medreport_text_1000

    • huggingface.co
    Updated Aug 5, 2025
    Cite
    Young-Wouk Kim (2025). medreport_text_1000 [Dataset]. https://huggingface.co/datasets/wouk1805/medreport_text_1000
    Explore at:
    Dataset updated
    Aug 5, 2025
    Authors
    Young-Wouk Kim
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MedReport - Reports Dataset

    Dataset Description

    This dataset contains medical audio transcriptions and the corresponding structured reports.

    Columns

    input: audio transcription
    output: structured medical report
    sample_id: example identifier

    Statistics

    Total examples: 1000
    License: Apache License 2.0
    Created: 2025-08-05

    Usage

    Loading the dataset:

    from datasets import load_dataset

    # Load the dataset
    full_dataset = …

    See the full description on the dataset page: https://huggingface.co/datasets/wouk1805/medreport_text_1000.

  3. Sample questions for the semi-structured interviews.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 29, 2019
    + more versions
    Cite
    Amati, Mirjam; Rubinelli, Sara; Zanini, Claudia; Grignoli, Nicola; Amann, Julia (2019). Sample questions for the semi-structured interviews. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000158998
    Explore at:
    Dataset updated
    Oct 29, 2019
    Authors
    Amati, Mirjam; Rubinelli, Sara; Zanini, Claudia; Grignoli, Nicola; Amann, Julia
    Description

    Sample questions for the semi-structured interviews.

  4. sample-structure-dataset

    • huggingface.co
    Updated Apr 10, 2025
    Cite
    GenBio AI (2025). sample-structure-dataset [Dataset]. https://huggingface.co/datasets/genbio-ai/sample-structure-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    GenBio AI
    Description

    Sample dataset for PETAL model

    This dataset is a sample dataset to test the functionalities of the PETAL model (encoder and decoder). It is based on the CASP15 dataset; see:

    https://predictioncenter.org/casp15/ https://github.com/Bhattacharya-Lab/CASP15

    The registries folder contains the registry of the CASP15 dataset (a CSV file with filename, pdb_id, etc.).

  5. Catastrophic Collapse Can Occur without Early Warning: Examples of Silent...

    • plos.figshare.com
    qt
    Updated Jun 1, 2023
    Cite
    Maarten C. Boerlijst; Thomas Oudman; André M. de Roos (2023). Catastrophic Collapse Can Occur without Early Warning: Examples of Silent Catastrophes in Structured Ecological Models [Dataset]. http://doi.org/10.1371/journal.pone.0062033
    Explore at:
    Available download formats: qt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maarten C. Boerlijst; Thomas Oudman; André M. de Roos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Catastrophic and sudden collapses of ecosystems are sometimes preceded by early warning signals that potentially could be used to predict and prevent a forthcoming catastrophe. Universality of these early warning signals has been proposed, but no formal proof has been provided. Here, we show that in relatively simple ecological models the most commonly used early warning signals for a catastrophic collapse can be silent. We underpin the mathematical reason for this phenomenon, which involves the direction of the eigenvectors of the system. Our results demonstrate that claims on the universality of early warning signals are not correct, and that catastrophic collapses can occur without prior warning. In order to correctly predict a collapse and determine whether early warning signals precede the collapse, detailed knowledge of the mathematical structure of the approaching bifurcation is necessary. Unfortunately, such knowledge is often only obtained after the collapse has already occurred.

  6. Sample semi-structured interview questions.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 22, 2025
    Cite
    Ward, Brooklyn; Marshall, Carrie Anne; Allen, Jessica; Javadizadeh, Elham; Easton, Corinna; Perez, Shauna; Goldszmidt, Rebecca; Plett, Patti (2025). Sample semi-structured interview questions. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002034325
    Explore at:
    Dataset updated
    May 22, 2025
    Authors
    Ward, Brooklyn; Marshall, Carrie Anne; Allen, Jessica; Javadizadeh, Elham; Easton, Corinna; Perez, Shauna; Goldszmidt, Rebecca; Plett, Patti
    Description

    Having access to good quality housing is a key determinant of well-being. Little is known about experiences of housing quality following homelessness from the perspectives of persons with lived experience. To build on existing literature, we conducted a secondary analysis of qualitative interviews with 19 individuals who had experiences of transitioning to housing following homelessness. Interview transcripts were drawn from a community-based participatory research study exploring the conditions needed for thriving following homelessness in Ontario, Canada. We analyzed these transcripts using reflexive thematic analysis. We coded transcripts abductively, informed by theories of social justice and health equity. Consistent with reflexive thematic analysis, we identified a central essence to elucidate experiences of housing quality following homelessness: “negotiating control within oppressive structural contexts.” This was expressed through four distinct themes: 1) being forced to live in undesirable living conditions; 2) stuck in an unsafe environment; 3) negotiating power dynamics to attain comfort and safety in one’s housing; and 4) having access to people and resources that create home. Overall, our findings indicate that attaining good quality housing following homelessness is elusive for many and influenced by a range of structural factors including ongoing poverty following homelessness, a lack of deeply affordable housing stock, and a lack of available social support networks. To prevent homelessness, it is essential to improve access to good quality housing that can support tenancy sustainment and well-being following homelessness. Policymakers need to review existing housing policies and reflect on how over-reliance on market housing has imposed negative impacts on the lives of persons who are leaving homelessness. 
Given the current economic context, it is imperative that policymakers devise policies that mitigate the financialization of housing, and result in the restoration of the social housing system in Canada and beyond.

  7. Patent PDF Samples with Extracted Structured Data

    • console.cloud.google.com
    Updated Jul 20, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Subsets%20of%20Patent%20Data&hl=de (2023). Patent PDF Samples with Extracted Structured Data [Dataset]. https://console.cloud.google.com/marketplace/product/global-patents/labeled-patents?hl=de
    Explore at:
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of PDFs in Google Cloud Storage from the first page of select US and EU patents, and BigQuery tables with extracted entities, labels, and other properties, including a link to each file in GCS. The structured data contains labels for eleven patent entities (patent inventor, publication date, classification number, patent title, etc.), global properties (US/EU issued, language, invention type), and the location of any figures or schematics on the patent's first page.

    The structured data is the result of a data entry operation collecting information from PDF documents, making the dataset a useful testing ground for benchmarking and developing AI/ML systems intended to perform broad document-understanding tasks such as extraction of structured data from unstructured documents. This dataset can be used to develop and benchmark natural language tasks such as named entity recognition and text classification, AI/ML vision tasks such as image classification and object detection, as well as more general AI/ML tasks such as automated data entry and document understanding. Google is sharing this dataset to support the AI/ML community because there is a shortage of document extraction/understanding datasets shared under an open license.

    This public dataset is hosted in Google Cloud Storage and Google BigQuery. It is included in BigQuery's 1 TB/mo of free-tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch the linked short video to learn how to get started quickly using BigQuery, or see the Cloud Storage quick start guide, to begin.
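The BigQuery tables can be queried from Python via the google-cloud-bigquery client. The sketch below only composes a query string; the table ID and column names are hypothetical placeholders to be replaced with the identifiers listed in the Cloud Console marketplace entry:

```python
# Sketch only: composing a query against the labeled-patents BigQuery tables.
# The table ID and column names below are hypothetical placeholders; look up
# the real identifiers in the Cloud Console marketplace entry.
table = "my-project.labeled_patents.extracted_data"  # hypothetical table ID

query = f"""
SELECT inventor, publication_date, title
FROM `{table}`
LIMIT 10
"""

# Executing it requires the google-cloud-bigquery package and credentials, e.g.:
# from google.cloud import bigquery
# rows = bigquery.Client().query(query).result()
print(query.strip())
```

Queries like this count against the monthly free-tier processing described above.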

  8. Data structure information for each sample and outcome.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 9, 2020
    Cite
    Jones, Kelvyn; Prior, Lucy; Manley, David (2020). Data structure information for each sample and outcome. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000527554
    Explore at:
    Dataset updated
    Jul 9, 2020
    Authors
    Jones, Kelvyn; Prior, Lucy; Manley, David
    Description

    Data structure information for each sample and outcome.

  9. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
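As a concrete illustration of consuming these files, a minimal sketch assuming the standard Spider example fields (db_id, question, query); the record below is invented, and the authoritative schema should be checked on the Spider GitHub page:

```python
import json

# Hypothetical record in the style of the Spider JSON files; real entries
# contain more fields (tokenized forms, parsed SQL, etc.).
sample = json.loads("""
[
  {
    "db_id": "concert_singer",
    "question": "How many singers are there?",
    "query": "SELECT count(*) FROM singer"
  }
]
""")

# Pair each natural-language question with its gold SQL query.
pairs = [(ex["question"], ex["query"]) for ex in sample]
print(pairs[0])
```

The same loop works unchanged over `spider-realistic.json` or `dev.json` once loaded with `json.load`.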

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  10. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Explore at:
    Available download formats: zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes, and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same among all examples. Noteworthy included fields:

    name - title of the article.
    identifier - ID of the article.
    url - URL of the article.
    version - metadata related to the latest specific revision of the article.
    version.editor - editor-specific signals that can help contextualize the revision.
    version.scores - assessments by ML models on the likelihood of a revision being reverted.
    main entity - Wikidata QID the article is related to.
    abstract - lead section, summarizing what the article is about.
    description - one-sentence description of the article for quick reference.
    image - main image representing the article's subject.
    infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    sections - parsed sections of the article, including links.

    Note: excludes other media/images, lists, tables and references or similar non-prose sections. The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
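Since each line of the dump is one JSON article, iterating it needs nothing beyond the standard library. A minimal sketch with two invented records carrying a small subset of the fields above (real articles have many more keys and far longer sections):

```python
import io
import json

# Two hypothetical JSON lines with a subset of the documented fields;
# the real dump holds one full article per line.
dump = io.StringIO(
    '{"name": "Ada Lovelace", "identifier": 1, "abstract": "English mathematician."}\n'
    '{"name": "Alan Turing", "identifier": 2, "abstract": "English computer scientist."}\n'
)

# Parse the file line by line, one article per line.
articles = [json.loads(line) for line in dump]
names = [a["name"] for a in articles]
print(names)
```

For the real multi-gigabyte files, the same line-by-line loop avoids loading the whole dump into memory.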

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are focused both on pre-parsing Wikipedia snippets and on connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  11. Example DICOM RT Structure

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Simon Biggs (2020). Example DICOM RT Structure [Dataset]. http://doi.org/10.5281/zenodo.3576026
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simon Biggs
    License

    Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

    Description

    Example DICOM RT Structure

  12. Data from: Car Evaluation Data Set

    • hypi.ai
    • kaggle.com
    zip
    Updated Sep 1, 2017
    Cite
    Ahiale Darlington (2017). Car Evaluation Data Set [Dataset]. https://hypi.ai/wp/wp-content/uploads/2019/10/car-evaluation-data-set/
    Explore at:
    Available download formats: zip (4775 bytes)
    Dataset updated
    Sep 1, 2017
    Authors
    Ahiale Darlington
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    from: https://archive.ics.uci.edu/ml/datasets/car+evaluation

    1. Title: Car Evaluation Database

    2. Sources:
       (a) Creator: Marko Bohanec
       (b) Donors: Marko Bohanec (marko.bohanec@ijs.si), Blaz Zupan (blaz.zupan@ijs.si)
       (c) Date: June, 1997

    3. Past Usage:

      The hierarchical decision model, from which this dataset is derived, was first presented in

      M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988.

      Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in

      B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

    4. Relevant Information Paragraph:

      Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:

      CAR                car acceptability
      . PRICE            overall price
      . . buying         buying price
      . . maint          price of the maintenance
      . TECH             technical characteristics
      . . COMFORT        comfort
      . . . doors        number of doors
      . . . persons      capacity in terms of persons to carry
      . . . lug_boot     the size of luggage boot
      . . safety         estimated safety of the car

      Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).

      The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

      Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

    5. Number of Instances: 1728 (instances completely cover the attribute space)

    6. Number of Attributes: 6

    7. Attribute Values:

      buying     v-high, high, med, low
      maint      v-high, high, med, low
      doors      2, 3, 4, 5-more
      persons    2, 4, more
      lug_boot   small, med, big
      safety     low, med, high

    8. Missing Attribute Values: none

    9. Class Distribution (number of instances per class)

      class      N      N[%]

      unacc      1210   (70.023 %)
      acc        384    (22.222 %)
      good       69     ( 3.993 %)
      v-good     65     ( 3.762 %)
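A minimal loading sketch following the documented attribute order (the UCI data file has no header row). The two rows below are invented, and the exact value spellings in the real file should be checked against the UCI distribution:

```python
import csv
import io

# Column names follow the documented attribute order, plus the target class.
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Hypothetical excerpt in the same headerless CSV format as the real file.
sample = io.StringIO("vhigh,vhigh,2,2,small,low,unacc\nlow,low,4,4,big,high,vgood\n")

# Zip each row with the column names to get labeled records.
rows = [dict(zip(columns, r)) for r in csv.reader(sample)]
print(rows[0]["class"], rows[1]["safety"])
```

Swapping `sample` for an open file handle on the downloaded data file gives the same labeled records for all 1728 instances.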

  13. Coronavirus Panoply.io for Database Warehousing and Post Analysis using...

    • data.mendeley.com
    Updated Feb 4, 2020
    + more versions
    Cite
    Pranav Pandya (2020). Coronavirus Panoply.io for Database Warehousing and Post Analysis using Sequal Language (SQL) [Dataset]. http://doi.org/10.17632/4gphfg5tgs.2
    Explore at:
    Dataset updated
    Feb 4, 2020
    Authors
    Pranav Pandya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It has never been easier to solve database-related problems using SQL, and the following gives you an opportunity to see how I worked out some of the relationships between the tables using the Panoply.io tool.

    I was able to insert the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a data warehouse environment.

    The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

    Query 1
    SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths/2) AND Deaths > 0

    Description: How do we find the places where coronavirus has infiltrated, but there is effective recovery amongst patients? We can view the places where recoveries exceed half the death toll.

    Query 2
    SELECT country, sum(Confirmed) AS "Confirmed Count", sum(Recovered) AS "Recovered Count", sum(Deaths) AS "Death Toll"
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths/2) AND Confirmed > 0
    GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed coronavirus cases. So here is a list of those countries, with their per-country totals.

    Query 3
    SELECT country AS "Countries where Coronavirus has reached"
    FROM "public"."coronavirus_updated"
    WHERE Confirmed > 0
    GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed coronavirus cases. So here is a list of those countries.

    Query 4
    SELECT country, sum(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
    GROUP BY country
    ORDER BY sum(suspected) DESC

    Description: Coronavirus is spreading at an alarming rate. Knowing which countries are newly getting the virus is important, because if timely measures are taken in those countries, casualties could be prevented. Here is a list of suspected cases with no virus-related deaths.

    Query 5
    SELECT country, sum(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
           100*sum(suspected)/(SELECT sum(suspected) FROM "public"."coronavirus_updated") AS "Global suspected Exposure of Coronavirus in percentage"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0
    GROUP BY country
    ORDER BY sum(suspected) DESC

    Description: Coronavirus is getting stronger in particular countries, but how do we measure that? By the percentage of suspected patients in countries which still don't have any coronavirus-related deaths. The following is a list.

    Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India
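Query 1 can be reproduced locally with Python's built-in sqlite3 module standing in for the Panoply warehouse; the rows below are invented to show which rows the WHERE clause keeps:

```python
import sqlite3

# In-memory table mirroring the columns used in Query 1; rows are
# hypothetical, not real case counts.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE coronavirus_updated ("Province/State" TEXT, Deaths INT, Recovered INT, Confirmed INT)')
con.executemany("INSERT INTO coronavirus_updated VALUES (?, ?, ?, ?)", [
    ("Hubei", 100, 80, 1000),  # recovered > deaths/2 and deaths > 0 -> kept
    ("Other", 10, 2, 50),      # recovered <= deaths/2 -> filtered out
    ("Zero", 0, 5, 20),        # deaths = 0 -> filtered out
])

rows = con.execute(
    'SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed '
    "FROM coronavirus_updated WHERE Recovered > (Deaths/2) AND Deaths > 0"
).fetchall()
print(rows)
```

Only the first row survives the filter, matching the "recoveries exceed half the death toll" reading of the query.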

  14. Example structure of data sent from a collection management system to a...

    • zenodo.org
    • data.europa.eu
    bin
    Updated Jan 24, 2020
    + more versions
    Cite
    Gwenaël Le Bras (2020). Example structure of data sent from a collection management system to a citizen science platform, multi-imaged case [Dataset]. http://doi.org/10.5281/zenodo.2579738
    Explore at:
    binAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gwenaël Le Bras; Gwenaël Le Bras
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Illustrative example of data format following Darwin Core to send from a collection management system to a citizen science platform. Multi-imaged vertebrate specimen case : http://coldb.mnhn.fr/catalognumber/mnhn/zo/2013-152

    Illustration of the milestone 28 document, work package 5.2 of the ICEDIG project.

  15. Towards a Structured Evaluation Methodology for Artificial Intelligence...

    • catalog.data.gov
    • datasets.ai
    Updated May 9, 2023
    + more versions
    National Institute of Standards and Technology (2023). Towards a Structured Evaluation Methodology for Artificial Intelligence Technology (SEMAIT) MIg analyZeR (mizr) Package [Dataset]. https://catalog.data.gov/dataset/towards-a-structured-evaluation-methodology-for-artificial-intelligence-technology-semait-
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    Description

    Our work towards a Structured Evaluation Methodology for Artificial Intelligence Technology (SEMAIT) aims to provide plots, tools, methods, and strategies to extract insights out of various machine learning (ML) and Artificial Intelligence (AI) data.Included in this software is the MIg analyZeR (mizr) R software package that produces various plots. It was initially developed within the Multimodal Information Group (MIG) at the National Institute of Standards and Technology (NIST).This software is documented, configured to be installed as an R package, and comes with an example SEMAIT script with an example (system, dataset, metrics, score) ML tuple set that we constructed ourselves.

  16. kali_linux_toolkit_dataset

    • kaggle.com
    zip
    Updated May 18, 2025
    + more versions
    SUNNY THAKUR (2025). kali_linux_toolkit_dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/kali-linux-toolkit-dataset
    Explore at:
    zip(27628 bytes)Available download formats
    Dataset updated
    May 18, 2025
    Authors
    SUNNY THAKUR
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Kali Linux Tools Dataset

    A comprehensive and structured dataset of common offensive security tools available in Kali Linux, including usage commands, flags, descriptions, categories, and official documentation links.

    This dataset is designed to support cybersecurity training, red team automation, LLM fine-tuning, and terminal assistants for penetration testers.

    📁 Dataset Format

    Each entry is a JSON object and stored in .jsonl (JSON Lines) format. This structure is ideal for machine learning pipelines and programmatic use.

    Fields:

    Field            Description
    tool             Name of the Linux tool (e.g., nmap, sqlmap)
    command          A real-world example command
    description      Human-readable explanation of what the command does
    category         Type of tool or use case (e.g., Networking, Exploitation, Web)
    use_case         Specific purpose of the command (e.g., port scanning, password cracking)
    flags            Important flags used in the command
    os               Operating system (Linux)
    reference_link   URL to official documentation or man page

    🧪 Example Entry

    {
     "tool": "sqlmap",
     "command": "sqlmap -u http://example.com --dbs",
     "description": "Enumerate databases on a vulnerable web application.",
     "category": "Web Application",
     "use_case": "SQL injection testing",
     "flags": ["-u", "--dbs"],
     "os": "Linux",
     "reference_link": "http://sqlmap.org/"
    }
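    Since each line of the dataset is an independent JSON object, loading and filtering it needs only the standard library. A minimal sketch (the filename and helper names here are illustrative, not part of the dataset):

```python
import json

def load_entries(path):
    """Parse one JSON object per line, skipping blank lines."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

def by_category(entries, category):
    """Filter entries by their 'category' field."""
    return [e for e in entries if e.get("category") == category]
```

    For example, `by_category(load_entries("kali_tools.jsonl"), "Networking")` would return every networking tool entry.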
    ✅ Key Features
    
      ✅ Covers widely-used tools: nmap, hydra, sqlmap, burpsuite, aircrack-ng, wireshark, etc.
    
      ✅ Multiple real-world command examples per tool
    
      ✅ Cross-categorized where tools serve multiple purposes
    
      ✅ Ready for use in LLM training, cybersecurity education, and CLI helpers
    
    🔍 Use Cases
    
      Fine-tuning AI models (LLMs) for cybersecurity and terminal tools
    
      Building red team knowledge bases or documentation bots
    
      Creating terminal assistant tools and cheat sheets
    
      Teaching ethical hacking through command-line exercises
    
    📚 Categories Covered
    
      Networking
    
      Web Application Testing
    
      Exploitation
    
      Password Cracking
    
      Wireless Attacks
    
      System Forensics
    
      Sniffing & Spoofing
    
    ⚠️ Legal Notice
    
    This dataset is provided for educational, research, and ethical security testing purposes only. Use of these tools and commands in unauthorized environments may be illegal.
    📜 License
    
    This dataset is released under the MIT License.
    🙌 Contributions
    
    Contributions are welcome! Feel free to submit PRs to add tools, improve descriptions, or fix errors.
    📫 Maintainer
    
    Created by: sunnythakur
    GitHub: github.com/sunnythakur25
    Contact: sunny48445@gmail.com
    
  17. Resume_Dataset

    • kaggle.com
    zip
    Updated Jul 26, 2025
    RayyanKauchali0 (2025). Resume_Dataset [Dataset]. https://www.kaggle.com/datasets/rayyankauchali0/resume-dataset
    Explore at:
    zip(3616108 bytes)Available download formats
    Dataset updated
    Jul 26, 2025
    Authors
    RayyanKauchali0
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tech Resume Dataset (3,500+ Samples):

    This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).

    Dataset Composition and Sourcing

    • Total Resumes: 3,500+
    • Sources:
      • Real Data: 2,047 resumes (58.5%) from ResumeAtlas and reputable open repositories; all records strictly anonymized.
      • Template-Based Synthetic: 573 resumes featuring varied narratives and realistic achievements for classic, modern, and professional styles.
      • LLM-Generated Variations: 460 unique samples using structured prompts to diversify skills, summaries, and career tracks, focusing on AI, ML, and data.
      • Faker-Seeded Synthetic: 420 resumes, especially for junior/support/cloud/network tracks, populated with robust Faker-generated work and education fields.
    • Role Coverage:
      • 15 major technology clusters (Software Engineering, DevOps, Cloud, AI/ML, Security, Data Engineering, QA, UI/UX, and more)
      • At least 200 samples per primary role group for label balance
      • 60+ subcategories reflecting granular tech job roles

    Key Dataset Fields (JSONL Schema)

    Field        Description                                            Example/Data Type
    ResumeID     Unique, anonymized string                              "DIS4JE91Z..." (string)
    Category     Tech job category/label                                "DevOps Engineer"
    Name         Anonymized (Faker-generated) name                      "Jordan Patel"
    Email        Anonymized email address                               "jpatel@example.com"
    Phone        Anonymized phone number                                "+1-555-343-2123"
    Location     City, country or region (anonymized)                   "Austin, TX, USA"
    Summary      Professional summary/intro                             String (3-6 sentences)
    Skills       List or comma-separated tech/soft skills               "Python, Kubernetes..."
    Experience   Work chronology, organizations, bullet-point details   String (multiline)
    Education    Universities, degrees, certs                           String (multiline)
    Source       "real", "template", "llm", "faker"                     String


    Dataset Schema Overview with Field Descriptions and Data Types

    Technical Validation & Quality Assurance

    • Formatting:
      • Uniform schema, right-tab alignment for dates (MMM-YYYY)
      • Standard ATS/NLP-friendly section headers
    • De-duplication:
      • All records checked with BERT/MinHash for uniqueness (cosine similarity >0.9 removed)
    • PII Scrubbing:
      • Names, contacts, locations anonymized with Python Faker
    • Role/Skill Taxonomy:
      • Job titles & skills mapped to ESCO, O*NET, NIST NICE, CNCF lexicons for research alignment
    • Quality Checks:
      • Automatic and manual validation for section presence, data type conformity, and format alignment
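    The de-duplication step can be illustrated with a much lighter stand-in than the BERT/MinHash pipeline the authors describe: character shingling plus Jaccard similarity, dropping any record whose similarity to an already-kept record exceeds a threshold. This sketch is illustrative only; the threshold and helper names are assumptions, and the real pipeline compares embedding cosine similarity rather than shingle overlap.

```python
def shingles(text, k=5):
    """Set of overlapping k-character shingles of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(records, threshold=0.9):
    """Keep a record only if it is not too similar to any record already kept."""
    kept, kept_shingles = [], []
    for rec in records:
        s = shingles(rec)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(rec)
            kept_shingles.append(s)
    return kept
```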

    Role & Source Coverage Visualizations

    Composition by Data Source:


    Composition of Tech Resume Dataset by Data Source

    Role Cluster Diversity:


    Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset

    Alternative: Dataset by Source Type (Pie Chart):


    Resume Dataset Composition by Source Type

    Typical Use Cases

    • Resume parsing & sectioning (training for models like BERT, RoBERTa, spaCy)
    • Fine-tuning for NER, job classification (60+ labels), skill extraction, and ATS research
    • Development or benchmarking of AI-powered job matching, candidate ranking, and automated tracking tools
    • ML/data science education and demo pipelines

    How to Use the JSONL File

    Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:

    import json
    
    with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
      resumes = [json.loads(line) for line in f]
    # Each record is now a Python dictionary
    

    Citing and Sharing

    If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).

  18. Steam Reviews English - Dead by Daylight

    • kaggle.com
    zip
    Updated Nov 20, 2025
    Nicola Mustone (2025). Steam Reviews English - Dead by Daylight [Dataset]. https://www.kaggle.com/datasets/nicolamustone/steam-reviews-english-dead-by-daylight
    Explore at:
    zip(22155467 bytes)Available download formats
    Dataset updated
    Nov 20, 2025
    Authors
    Nicola Mustone
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Steam Reviews — Dead by Daylight (App 381210)

    A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API. Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
    This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.

    A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.

    Dataset Summary

    Field                           Description
    review                          Full review text
    sentiment                       1 = positive review, 0 = negative
    purchased                       1 if purchased on Steam
    received_for_free               1 if the game was received for free
    votes_up                        Number of helpful votes
    votes_funny                     Number of "funny" votes
    date_created                    Review creation date (YYYY-MM-DD, UTC)
    date_updated                    Last update date (YYYY-MM-DD, UTC)
    author_num_games_owned          Total games owned by reviewer
    author_num_reviews              Total reviews written by reviewer
    author_playtime_forever_min     Total playtime in minutes
    author_playtime_at_review_min   Playtime when the review was written (minutes)

    Example Use Cases

    • Sentiment Analysis: Train classifiers using user tone and voting patterns.
    • Text Embeddings: Extract embeddings for clustering or topic modeling.
    • Behavioral Correlation: Relate sentiment to playtime or review length.
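    As a starting point for the sentiment-analysis use case, a minimal standard-library sketch that computes the positive-review share from the sentiment column (the inline sample rows below are made up for illustration; in practice, read the downloaded CSV instead):

```python
import csv
import io

def positive_share(csv_text):
    """Fraction of reviews labeled positive (sentiment == 1)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    return sum(int(r["sentiment"]) for r in rows) / len(rows)

# Made-up sample rows using a subset of the dataset's columns.
sample = """review,sentiment,votes_up
"Great game with friends",1,12
"Matchmaking is frustrating",0,3
"Still fun after 500 hours",1,7
"""
print(positive_share(sample))  # 2 of 3 sample reviews are positive
```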

    Data Source

    Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.

    This dataset includes only publicly accessible user content and metadata.
    Each record is factual and unaltered beyond format normalization.

    Licensing

    • Dataset: MIT License
      Free for commercial and non-commercial use with attribution.
    • Collection Script: GPLv3 License
      Ensures derivative software remains open-source.

    Update Schedule

    Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.

    Credits

    Created by Nicola Mustone.

    Disclaimer

    This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.

    All product names, logos, brands, and trademarks are the property of their respective owners.

    The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.

  19. Sample provider semi-structured interview guide.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 24, 2025
    McBain, Ryan K.; Namisango, Eve; Green, Harold D.; Gwokyalya, Violet; Bouskill, Kathryn; Ober, Allison; Matovu, Joseph K. B.; Beyeza-Kashesya, Jolly; Nakami, Sylvia; Juncker, Margrethe; Luyirika, Emmanuel; Wagner, Glenn J.; Bogart, Laura M.; Wanyenze, Rhoda K. (2025). Sample provider semi-structured interview guide. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001388938
    Explore at:
    Dataset updated
    Jan 24, 2025
    Authors
    McBain, Ryan K.; Namisango, Eve; Green, Harold D.; Gwokyalya, Violet; Bouskill, Kathryn; Ober, Allison; Matovu, Joseph K. B.; Beyeza-Kashesya, Jolly; Nakami, Sylvia; Juncker, Margrethe; Luyirika, Emmanuel; Wagner, Glenn J.; Bogart, Laura M.; Wanyenze, Rhoda K.
    Description

    Introduction: Cervical cancer (CC) is the leading cause of cancer-related deaths among Ugandan women, yet rates of CC screening are very low. Training women who have recently screened to engage in advocacy for screening among women in their social network is a network-based strategy for promoting information dissemination and CC screening uptake.

    Methods: Drawing on the Exploration, Preparation, Implementation and Sustainment (EPIS) framework for implementation science, this hybrid type 1 randomized controlled trial (RCT) of a peer-led, group advocacy training intervention, Game Changers for Cervical Cancer Prevention (GC-CCP), will examine efficacy for increasing CC screening uptake as well as how it can be implemented and sustained in diverse clinic settings. In the Preparation phase, we will prepare the four study clinics for implementation of GC-CCP and the expected increase in demand for CC screening, using qualitative methods (stakeholder interviews and client focus groups) to identify and address structural barriers to easy access to CC screening. In the Implementation phase, GC-CCP will be implemented over 36 months at each clinic, with screened women (index participants) enrolled as research participants receiving the intervention in the first 6 months as part of a parallel group RCT overseen by the research study team to evaluate efficacy for CC screening uptake among their enrolled social network members. All research participants will be assessed at baseline and months 6 and 12. Intervention implementation and supervision will then be transitioned to clinic staff and offered as part of usual care in the subsequent 30 months as part of the Sustainability phase. Using the RE-AIM framework, we will evaluate engagement in GC-CCP and CC advocacy (reach), change in CC screening (effectiveness), adoption into clinic operations, implementation outcomes (acceptability, feasibility, fidelity, cost-effectiveness) and maintenance.

    Discussion: This is one of the first studies to use a network-driven approach and empowerment of CC-screened peers as change agents to increase CC screening. If shown to be an effective and sustainable implementation strategy for promoting CC screening, this peer advocacy model could be applied to other preventative health behaviors and disease contexts.

    Trial registration: NIH Clinical Trial Registry NCT06010160 (clinicaltrials.gov; date: 8/17/2023).

  20. Example Structure-From-Motion Data

    • figshare.com
    bin
    Updated Apr 1, 2020
    Atticus Stovall (2020). Example Structure-From-Motion Data [Dataset]. http://doi.org/10.6084/m9.figshare.12061197.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Apr 1, 2020
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Atticus Stovall
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for exploring structure-from-motion data from a 100 m x 100 m subset of temperate forest in central Virginia.

    Data collection and post-processing by: Atticus Stovall, Bailey Costello, and Xi Yang

    Drone: DJI Mavic Pro with onboard RGB camera
