84 datasets found
  1. Coding Questions Dataset

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    Kartikeya Pandey (2025). Coding Questions Dataset [Dataset]. https://www.kaggle.com/datasets/guitaristboy/coding-questions-dataset
    Explore at:
Available download formats: zip (135582 bytes)
    Dataset updated
    Oct 24, 2025
    Authors
    Kartikeya Pandey
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains a curated collection of programming questions, each paired with example inputs/outputs, constraints, and test cases.

    It is designed for use in machine learning research, code generation models, natural language processing (NLP) tasks, or simply as a question bank for learners and educators.

    Dataset Highlights:

    📘 616 questions with titles, descriptions, and difficulty levels (Easy, Medium, Hard)

    💡 Each question includes examples, constraints, and test cases stored as structured JSON

    🧠 Useful for LLM fine-tuning, question answering, and automated code evaluation tasks

    🧩 Ideal for creating or benchmarking AI coding assistants and educational apps

    Source: Collected from a structured internal question database built for educational and evaluation purposes.

    Format: CSV file with the following columns: id, title, description, difficulty_level, created_at, updated_at, examples, constraints, test_cases
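Because the examples, constraints, and test_cases columns are stored as JSON strings inside the CSV cells, each row needs a second parsing pass. A minimal sketch, using a hypothetical one-row excerpt with a subset of the documented columns (the sample values are invented for illustration; the real file will differ):

```python
import csv
import io
import json

# Hypothetical CSV excerpt mirroring a subset of the documented columns.
sample_csv = """id,title,difficulty_level,examples,constraints,test_cases
1,Two Sum,Easy,"[{""input"": ""[2,7,11,15], 9"", ""output"": ""[0,1]""}]","[""2 <= n <= 10^4""]","[{""input"": ""[3,3], 6"", ""expected"": ""[0,1]""}]"
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # The examples, constraints, and test_cases cells hold JSON strings.
    examples = json.loads(row["examples"])
    constraints = json.loads(row["constraints"])
    test_cases = json.loads(row["test_cases"])
    print(row["title"], row["difficulty_level"], len(test_cases))
```

The same two-step read (CSV first, then `json.loads` per structured cell) applies to the real file once downloaded.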

  2. medreport_text_1000

    • huggingface.co
    Updated Aug 5, 2025
    Cite
    Young-Wouk Kim (2025). medreport_text_1000 [Dataset]. https://huggingface.co/datasets/wouk1805/medreport_text_1000
    Explore at:
    Dataset updated
    Aug 5, 2025
    Authors
    Young-Wouk Kim
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    MedReport - Reports Dataset

    Dataset Description

    This dataset contains medical audio transcriptions and the corresponding structured reports.

    Columns

    input: audio transcription
    output: structured medical report
    sample_id: example identifier

    Statistics

    Total examples: 1000
    License: Apache License 2.0
    Created: 2025-08-05

    Usage

    Loading the dataset:

    from datasets import load_dataset

    # Load the dataset
    full_dataset = …

    See the full description on the dataset page: https://huggingface.co/datasets/wouk1805/medreport_text_1000.

  3. Sample questions for the semi-structured interviews.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 29, 2019
    + more versions
    Cite
    Amati, Mirjam; Rubinelli, Sara; Zanini, Claudia; Grignoli, Nicola; Amann, Julia (2019). Sample questions for the semi-structured interviews. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000158998
    Explore at:
    Dataset updated
    Oct 29, 2019
    Authors
    Amati, Mirjam; Rubinelli, Sara; Zanini, Claudia; Grignoli, Nicola; Amann, Julia
    Description

    Sample questions for the semi-structured interviews.

  4. sample-structure-dataset

    • huggingface.co
    Updated Apr 10, 2025
    Cite
    GenBio AI (2025). sample-structure-dataset [Dataset]. https://huggingface.co/datasets/genbio-ai/sample-structure-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 10, 2025
    Dataset authored and provided by
    GenBio AI
    Description

    Sample dataset for PETAL model

    This dataset is a sample dataset to test the functionalities of the PETAL model (encoder and decoder). It is based on the CASP15 dataset; see:

    https://predictioncenter.org/casp15/ https://github.com/Bhattacharya-Lab/CASP15

    The registries folder contains the registry of the CASP15 dataset (a CSV file with filename, pdb_id, etc.).

  5. Catastrophic Collapse Can Occur without Early Warning: Examples of Silent...

    • plos.figshare.com
    qt
    Updated Jun 1, 2023
    Cite
    Maarten C. Boerlijst; Thomas Oudman; André M. de Roos (2023). Catastrophic Collapse Can Occur without Early Warning: Examples of Silent Catastrophes in Structured Ecological Models [Dataset]. http://doi.org/10.1371/journal.pone.0062033
    Explore at:
    Available download formats: qt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Maarten C. Boerlijst; Thomas Oudman; André M. de Roos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Catastrophic and sudden collapses of ecosystems are sometimes preceded by early warning signals that potentially could be used to predict and prevent a forthcoming catastrophe. Universality of these early warning signals has been proposed, but no formal proof has been provided. Here, we show that in relatively simple ecological models the most commonly used early warning signals for a catastrophic collapse can be silent. We underpin the mathematical reason for this phenomenon, which involves the direction of the eigenvectors of the system. Our results demonstrate that claims on the universality of early warning signals are not correct, and that catastrophic collapses can occur without prior warning. In order to correctly predict a collapse and determine whether early warning signals precede the collapse, detailed knowledge of the mathematical structure of the approaching bifurcation is necessary. Unfortunately, such knowledge is often only obtained after the collapse has already occurred.

  6. Sample semi-structured interview questions.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 22, 2025
    Cite
    Ward, Brooklyn; Marshall, Carrie Anne; Allen, Jessica; Javadizadeh, Elham; Easton, Corinna; Perez, Shauna; Goldszmidt, Rebecca; Plett, Patti (2025). Sample semi-structured interview questions. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002034325
    Explore at:
    Dataset updated
    May 22, 2025
    Authors
    Ward, Brooklyn; Marshall, Carrie Anne; Allen, Jessica; Javadizadeh, Elham; Easton, Corinna; Perez, Shauna; Goldszmidt, Rebecca; Plett, Patti
    Description

    Having access to good quality housing is a key determinant of well-being. Little is known about experiences of housing quality following homelessness from the perspectives of persons with lived experience. To build on existing literature, we conducted a secondary analysis of qualitative interviews with 19 individuals who had experiences of transitioning to housing following homelessness. Interview transcripts were drawn from a community-based participatory research study exploring the conditions needed for thriving following homelessness in Ontario, Canada. We analyzed these transcripts using reflexive thematic analysis. We coded transcripts abductively, informed by theories of social justice and health equity. Consistent with reflexive thematic analysis, we identified a central essence to elucidate experiences of housing quality following homelessness: “negotiating control within oppressive structural contexts.” This was expressed through four distinct themes: 1) being forced to live in undesirable living conditions; 2) stuck in an unsafe environment; 3) negotiating power dynamics to attain comfort and safety in one’s housing; and 4) having access to people and resources that create home. Overall, our findings indicate that attaining good quality housing following homelessness is elusive for many and influenced by a range of structural factors including ongoing poverty following homelessness, a lack of deeply affordable housing stock, and a lack of available social support networks. To prevent homelessness, it is essential to improve access to good quality housing that can support tenancy sustainment and well-being following homelessness. Policymakers need to review existing housing policies and reflect on how over-reliance on market housing has imposed negative impacts on the lives of persons who are leaving homelessness. 
Given the current economic context, it is imperative that policymakers devise policies that mitigate the financialization of housing, and result in the restoration of the social housing system in Canada and beyond.

  7. Patent PDF Samples with Extracted Structured Data

    • console.cloud.google.com
    Updated Jul 20, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Subsets%20of%20Patent%20Data&hl=de (2023). Patent PDF Samples with Extracted Structured Data [Dataset]. https://console.cloud.google.com/marketplace/product/global-patents/labeled-patents?hl=de
    Explore at:
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of PDFs in Google Cloud Storage from the first page of select US and EU patents, and BigQuery tables with extracted entities, labels, and other properties, including a link to each file in GCS. The structured data contains labels for eleven patent entities (patent inventor, publication date, classification number, patent title, etc.), global properties (US/EU issued, language, invention type), and the location of any figures or schematics on the patent's first page.

    The structured data is the result of a data entry operation collecting information from PDF documents, making the dataset a useful testing ground for benchmarking and developing AI/ML systems intended to perform broad document-understanding tasks such as extraction of structured data from unstructured documents. This dataset can be used to develop and benchmark natural language tasks such as named entity recognition and text classification, AI/ML vision tasks such as image classification and object detection, as well as more general AI/ML tasks such as automated data entry and document understanding. Google is sharing this dataset to support the AI/ML community because there is a shortage of document extraction/understanding datasets shared under an open license.

    This public dataset is hosted in Google Cloud Storage and Google BigQuery. It is included in BigQuery's 1 TB/mo of free-tier processing: each user receives 1 TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch the linked short video to learn how to get started quickly using BigQuery, or see the Cloud Storage quick start guide, to begin.
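The BigQuery tables can be queried from Python via the google-cloud-bigquery client. The sketch below only composes a query string; the table ID and column names are hypothetical placeholders to be replaced with the identifiers listed in the Cloud Console marketplace entry:

```python
# Sketch only: composing a query against the labeled-patents BigQuery tables.
# The table ID and column names below are hypothetical placeholders; look up
# the real identifiers in the Cloud Console marketplace entry.
table = "my-project.labeled_patents.extracted_data"  # hypothetical table ID

query = f"""
SELECT inventor, publication_date, title
FROM `{table}`
LIMIT 10
"""

# Executing it requires the google-cloud-bigquery package and credentials, e.g.:
# from google.cloud import bigquery
# rows = bigquery.Client().query(query).result()
print(query.strip())
```

Queries like this count against the monthly free-tier processing described above.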

  8. Data structure information for each sample and outcome.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jul 9, 2020
    Cite
    Jones, Kelvyn; Prior, Lucy; Manley, David (2020). Data structure information for each sample and outcome. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000527554
    Explore at:
    Dataset updated
    Jul 9, 2020
    Authors
    Jones, Kelvyn; Prior, Lucy; Manley, David
    Description

    Data structure information for each sample and outcome.

  9. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
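As a concrete illustration of consuming these files, a minimal sketch assuming the standard Spider example fields (db_id, question, query); the record below is invented, and the authoritative schema should be checked on the Spider GitHub page:

```python
import json

# Hypothetical record in the style of the Spider JSON files; real entries
# contain more fields (tokenized forms, parsed SQL, etc.).
sample = json.loads("""
[
  {
    "db_id": "concert_singer",
    "question": "How many singers are there?",
    "query": "SELECT count(*) FROM singer"
  }
]
""")

# Pair each natural-language question with its gold SQL query.
pairs = [(ex["question"], ex["query"]) for ex in sample]
print(pairs[0])
```

The same loop works unchanged over `spider-realistic.json` or `dev.json` once loaded with `json.load`.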

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  10. Wikipedia Structured Contents

    • kaggle.com
    zip
    Updated Apr 11, 2025
    Cite
    Wikimedia (2025). Wikipedia Structured Contents [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents
    Explore at:
    Available download formats: zip (25121685657 bytes)
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Summary

    Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.

    This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).

    Invitation for Feedback

    The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release to improve transparency in the development process and request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes, and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.

    The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.

    The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.

    Data Fields

    The data fields are the same among all examples. Noteworthy included fields:

    name - title of the article.
    identifier - ID of the article.
    url - URL of the article.
    version - metadata related to the latest specific revision of the article.
    version.editor - editor-specific signals that can help contextualize the revision.
    version.scores - assessments by ML models on the likelihood of a revision being reverted.
    main entity - Wikidata QID the article is related to.
    abstract - lead section, summarizing what the article is about.
    description - one-sentence description of the article for quick reference.
    image - main image representing the article's subject.
    infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
    sections - parsed sections of the article, including links.

    Note: excludes other media/images, lists, tables and references or similar non-prose sections. The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
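Since each line of the dump is one JSON article, iterating it needs nothing beyond the standard library. A minimal sketch with two invented records carrying a small subset of the fields above (real articles have many more keys and far longer sections):

```python
import io
import json

# Two hypothetical JSON lines with a subset of the documented fields;
# the real dump holds one full article per line.
dump = io.StringIO(
    '{"name": "Ada Lovelace", "identifier": 1, "abstract": "English mathematician."}\n'
    '{"name": "Alan Turing", "identifier": 2, "abstract": "English computer scientist."}\n'
)

# Parse the file line by line, one article per line.
articles = [json.loads(line) for line in dump]
names = [a["name"] for a in articles]
print(names)
```

For the real multi-gigabyte files, the same line-by-line loop avoids loading the whole dump into memory.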

    Curation Rationale

    This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are focused both on pre-parsing Wikipedia snippets and on connecting the different projects closer together. Even if Wikipedia is very structured to the human eye, it is a non-triv...

  11. Example DICOM RT Structure

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Simon Biggs (2020). Example DICOM RT Structure [Dataset]. http://doi.org/10.5281/zenodo.3576026
    Explore at:
    Available download formats: bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Simon Biggs
    License

    Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0)

    Description

    Example DICOM RT Structure

  12. Data from: Car Evaluation Data Set

    • hypi.ai
    • kaggle.com
    zip
    Updated Sep 1, 2017
    Cite
    Ahiale Darlington (2017). Car Evaluation Data Set [Dataset]. https://hypi.ai/wp/wp-content/uploads/2019/10/car-evaluation-data-set/
    Explore at:
    Available download formats: zip (4775 bytes)
    Dataset updated
    Sep 1, 2017
    Authors
    Ahiale Darlington
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    from: https://archive.ics.uci.edu/ml/datasets/car+evaluation

    1. Title: Car Evaluation Database

    2. Sources:
       (a) Creator: Marko Bohanec
       (b) Donors: Marko Bohanec (marko.bohanec@ijs.si), Blaz Zupan (blaz.zupan@ijs.si)
       (c) Date: June, 1997

    3. Past Usage:

      The hierarchical decision model, from which this dataset is derived, was first presented in

      M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988.

      Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in

      B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)

    4. Relevant Information Paragraph:

      Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:

      CAR                car acceptability
      . PRICE            overall price
      . . buying         buying price
      . . maint          price of the maintenance
      . TECH             technical characteristics
      . . COMFORT        comfort
      . . . doors        number of doors
      . . . persons      capacity in terms of persons to carry
      . . . lug_boot     the size of luggage boot
      . . safety         estimated safety of the car

      Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).

      The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.

      Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.

    5. Number of Instances: 1728 (instances completely cover the attribute space)

    6. Number of Attributes: 6

    7. Attribute Values:

      buying     v-high, high, med, low
      maint      v-high, high, med, low
      doors      2, 3, 4, 5-more
      persons    2, 4, more
      lug_boot   small, med, big
      safety     low, med, high

    8. Missing Attribute Values: none

    9. Class Distribution (number of instances per class)

      class      N      N[%]

      unacc      1210   (70.023 %)
      acc        384    (22.222 %)
      good       69     ( 3.993 %)
      v-good     65     ( 3.762 %)
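A minimal loading sketch following the documented attribute order (the UCI data file has no header row). The two rows below are invented, and the exact value spellings in the real file should be checked against the UCI distribution:

```python
import csv
import io

# Column names follow the documented attribute order, plus the target class.
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Hypothetical excerpt in the same headerless CSV format as the real file.
sample = io.StringIO("vhigh,vhigh,2,2,small,low,unacc\nlow,low,4,4,big,high,vgood\n")

# Zip each row with the column names to get labeled records.
rows = [dict(zip(columns, r)) for r in csv.reader(sample)]
print(rows[0]["class"], rows[1]["safety"])
```

Swapping `sample` for an open file handle on the downloaded data file gives the same labeled records for all 1728 instances.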

  13. Coronavirus Panoply.io for Database Warehousing and Post Analysis using...

    • data.mendeley.com
    Updated Feb 4, 2020
    + more versions
    Cite
    Pranav Pandya (2020). Coronavirus Panoply.io for Database Warehousing and Post Analysis using Sequal Language (SQL) [Dataset]. http://doi.org/10.17632/4gphfg5tgs.2
    Explore at:
    Dataset updated
    Feb 4, 2020
    Authors
    Pranav Pandya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    It has never been easier to solve database-related problems using SQL, and the following gives you an opportunity to see how I worked out some of the relationships between the tables using the Panoply.io tool.

    I was able to insert the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a data warehouse environment.

    The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

    Query 1
    SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths/2) AND Deaths > 0

    Description: How do we find the places where coronavirus has infiltrated, but there is effective recovery amongst patients? We can view the places where recoveries exceed half the death toll.

    Query 2
    SELECT country, sum(Confirmed) AS "Confirmed Count", sum(Recovered) AS "Recovered Count", sum(Deaths) AS "Death Toll"
    FROM "public"."coronavirus_updated"
    WHERE Recovered > (Deaths/2) AND Confirmed > 0
    GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed coronavirus cases. So here is a list of those countries, with their per-country totals.

    Query 3
    SELECT country AS "Countries where Coronavirus has reached"
    FROM "public"."coronavirus_updated"
    WHERE Confirmed > 0
    GROUP BY country

    Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to be safe is by knowing the countries which have confirmed coronavirus cases. So here is a list of those countries.

    Query 4
    SELECT country, sum(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
    GROUP BY country
    ORDER BY sum(suspected) DESC

    Description: Coronavirus is spreading at an alarming rate. Knowing which countries are newly getting the virus is important, because if timely measures are taken in those countries, casualties could be prevented. Here is a list of suspected cases with no virus-related deaths.

    Query 5
    SELECT country, sum(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
           100*sum(suspected)/(SELECT sum(suspected) FROM "public"."coronavirus_updated") AS "Global suspected Exposure of Coronavirus in percentage"
    FROM "public"."coronavirus_updated"
    WHERE suspected > 0 AND deaths = 0
    GROUP BY country
    ORDER BY sum(suspected) DESC

    Description: Coronavirus is getting stronger in particular countries, but how do we measure that? By the percentage of suspected patients in countries which still don't have any coronavirus-related deaths. The following is a list.

    Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India
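Query 1 can be reproduced locally with Python's built-in sqlite3 module standing in for the Panoply warehouse; the rows below are invented to show which rows the WHERE clause keeps:

```python
import sqlite3

# In-memory table mirroring the columns used in Query 1; rows are
# hypothetical, not real case counts.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE coronavirus_updated ("Province/State" TEXT, Deaths INT, Recovered INT, Confirmed INT)')
con.executemany("INSERT INTO coronavirus_updated VALUES (?, ?, ?, ?)", [
    ("Hubei", 100, 80, 1000),  # recovered > deaths/2 and deaths > 0 -> kept
    ("Other", 10, 2, 50),      # recovered <= deaths/2 -> filtered out
    ("Zero", 0, 5, 20),        # deaths = 0 -> filtered out
])

rows = con.execute(
    'SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed '
    "FROM coronavirus_updated WHERE Recovered > (Deaths/2) AND Deaths > 0"
).fetchall()
print(rows)
```

Only the first row survives the filter, matching the "recoveries exceed half the death toll" reading of the query.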

  14. Example structure of data sent from a collection management system to a...

    • zenodo.org
    • data.europa.eu
    bin
    Updated Jan 24, 2020
    + more versions
    Cite
    Gwenaël Le Bras (2020). Example structure of data sent from a collection management system to a citizen science platform, multi-imaged case [Dataset]. http://doi.org/10.5281/zenodo.2579738
    Explore at:
    binAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Gwenaël Le Bras; Gwenaël Le Bras
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Illustrative example of data format following Darwin Core to send from a collection management system to a citizen science platform. Multi-imaged vertebrate specimen case : http://coldb.mnhn.fr/catalognumber/mnhn/zo/2013-152

    Illustration of the milestone 28 document, work package 5.2 of the ICEDIG project.

  15. Towards a Structured Evaluation Methodology for Artificial Intelligence...

    • catalog.data.gov
    • datasets.ai
    Updated May 9, 2023
    + more versions
    National Institute of Standards and Technology (2023). Towards a Structured Evaluation Methodology for Artificial Intelligence Technology (SEMAIT) MIg analyZeR (mizr) Package [Dataset]. https://catalog.data.gov/dataset/towards-a-structured-evaluation-methodology-for-artificial-intelligence-technology-semait-
    Explore at:
    Dataset updated
    May 9, 2023
    Dataset provided by
    National Institute of Standards and Technologyhttp://www.nist.gov/
    Description

    Our work towards a Structured Evaluation Methodology for Artificial Intelligence Technology (SEMAIT) aims to provide plots, tools, methods, and strategies to extract insights out of various machine learning (ML) and Artificial Intelligence (AI) data.Included in this software is the MIg analyZeR (mizr) R software package that produces various plots. It was initially developed within the Multimodal Information Group (MIG) at the National Institute of Standards and Technology (NIST).This software is documented, configured to be installed as an R package, and comes with an example SEMAIT script with an example (system, dataset, metrics, score) ML tuple set that we constructed ourselves.

  16. kali_linux_toolkit_dataset

    • kaggle.com
    zip
    Updated May 18, 2025
    + more versions
    SUNNY THAKUR (2025). kali_linux_toolkit_dataset [Dataset]. https://www.kaggle.com/datasets/cyberprince/kali-linux-toolkit-dataset
    Explore at:
    zip(27628 bytes)Available download formats
    Dataset updated
    May 18, 2025
    Authors
    SUNNY THAKUR
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Kali Linux Tools Dataset

    A comprehensive and structured dataset of common offensive security tools available in Kali Linux, including usage commands, flags, descriptions, categories, and official documentation links.

    This dataset is designed to support cybersecurity training, red team automation, LLM fine-tuning, and terminal assistants for penetration testers.

    📁 Dataset Format

    Each entry is a JSON object and stored in .jsonl (JSON Lines) format. This structure is ideal for machine learning pipelines and programmatic use.

    Fields:

    Field            Description
    tool             Name of the Linux tool (e.g., nmap, sqlmap)
    command          A real-world example command
    description      Human-readable explanation of what the command does
    category         Type of tool or use case (e.g., Networking, Exploitation, Web)
    use_case         Specific purpose of the command (e.g., port scanning, password cracking)
    flags            Important flags used in the command
    os               Operating system (Linux)
    reference_link   URL to official documentation or man page

    🧪 Example Entry

    {
     "tool": "sqlmap",
     "command": "sqlmap -u http://example.com --dbs",
     "description": "Enumerate databases on a vulnerable web application.",
     "category": "Web Application",
     "use_case": "SQL injection testing",
     "flags": ["-u", "--dbs"],
     "os": "Linux",
     "reference_link": "http://sqlmap.org/"
    }
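    Since each line of the dataset is an independent JSON object, loading and filtering it needs only the standard library. A minimal sketch (the filename and helper names here are illustrative, not part of the dataset):

```python
import json

def load_entries(path):
    """Parse one JSON object per line, skipping blank lines."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries

def by_category(entries, category):
    """Filter entries by their 'category' field."""
    return [e for e in entries if e.get("category") == category]
```

    For example, `by_category(load_entries("kali_tools.jsonl"), "Networking")` would return every networking tool entry.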
    ✅ Key Features
    
      ✅ Covers widely-used tools: nmap, hydra, sqlmap, burpsuite, aircrack-ng, wireshark, etc.
    
      ✅ Multiple real-world command examples per tool
    
      ✅ Cross-categorized where tools serve multiple purposes
    
      ✅ Ready for use in LLM training, cybersecurity education, and CLI helpers
    
    🔍 Use Cases
    
      Fine-tuning AI models (LLMs) for cybersecurity and terminal tools
    
      Building red team knowledge bases or documentation bots
    
      Creating terminal assistant tools and cheat sheets
    
      Teaching ethical hacking through command-line exercises
    
    📚 Categories Covered
    
      Networking
    
      Web Application Testing
    
      Exploitation
    
      Password Cracking
    
      Wireless Attacks
    
      System Forensics
    
      Sniffing & Spoofing
    
    ⚠️ Legal Notice
    
    This dataset is provided for educational, research, and ethical security testing purposes only. Use of these tools and commands in unauthorized environments may be illegal.
    📜 License
    
    This dataset is released under the MIT License.
    🙌 Contributions
    
    Contributions are welcome! Feel free to submit PRs to add tools, improve descriptions, or fix errors.
    📫 Maintainer
    
    Created by: sunnythakur
    GitHub: github.com/sunnythakur25
    Contact: sunny48445@gmail.com
    
  17. Resume_Dataset

    • kaggle.com
    zip
    Updated Jul 26, 2025
    RayyanKauchali0 (2025). Resume_Dataset [Dataset]. https://www.kaggle.com/datasets/rayyankauchali0/resume-dataset
    Explore at:
    zip(3616108 bytes)Available download formats
    Dataset updated
    Jul 26, 2025
    Authors
    RayyanKauchali0
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Tech Resume Dataset (3,500+ Samples):

    This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).

    Dataset Composition and Sourcing

    • Total Resumes: 3,500+
    • Sources:
      • Real Data: 2,047 resumes (58.5%) from ResumeAtlas and reputable open repositories; all records strictly anonymized.
      • Template-Based Synthetic: 573 resumes featuring varied narratives and realistic achievements for classic, modern, and professional styles.
      • LLM-Generated Variations: 460 unique samples using structured prompts to diversify skills, summaries, and career tracks, focusing on AI, ML, and data.
      • Faker-Seeded Synthetic: 420 resumes, especially for junior/support/cloud/network tracks, populated with robust Faker-generated work and education fields.
    • Role Coverage:
      • 15 major technology clusters (Software Engineering, DevOps, Cloud, AI/ML, Security, Data Engineering, QA, UI/UX, and more)
      • At least 200 samples per primary role group for label balance
      • 60+ subcategories reflecting granular tech job roles

    Key Dataset Fields (JSONL Schema)

    Field        Description                                            Example/Data Type
    ResumeID     Unique, anonymized string                              "DIS4JE91Z..." (string)
    Category     Tech job category/label                                "DevOps Engineer"
    Name         Anonymized (Faker-generated) name                      "Jordan Patel"
    Email        Anonymized email address                               "jpatel@example.com"
    Phone        Anonymized phone number                                "+1-555-343-2123"
    Location     City, country or region (anonymized)                   "Austin, TX, USA"
    Summary      Professional summary/intro                             String (3-6 sentences)
    Skills       List or comma-separated tech/soft skills               "Python, Kubernetes..."
    Experience   Work chronology, organizations, bullet-point details   String (multiline)
    Education    Universities, degrees, certs                           String (multiline)
    Source       "real", "template", "llm", "faker"                     String


    Dataset Schema Overview with Field Descriptions and Data Types

    Technical Validation & Quality Assurance

    • Formatting:
      • Uniform schema, right-tab alignment for dates (MMM-YYYY)
      • Standard ATS/NLP-friendly section headers
    • De-duplication:
      • All records checked with BERT/MinHash for uniqueness (cosine similarity >0.9 removed)
    • PII Scrubbing:
      • Names, contacts, locations anonymized with Python Faker
    • Role/Skill Taxonomy:
      • Job titles & skills mapped to ESCO, O*NET, NIST NICE, CNCF lexicons for research alignment
    • Quality Checks:
      • Automatic and manual validation for section presence, data type conformity, and format alignment
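    The de-duplication step can be illustrated with a much lighter stand-in than the BERT/MinHash pipeline the authors describe: character shingling plus Jaccard similarity, dropping any record whose similarity to an already-kept record exceeds a threshold. This sketch is illustrative only; the threshold and helper names are assumptions, and the real pipeline compares embedding cosine similarity rather than shingle overlap.

```python
def shingles(text, k=5):
    """Set of overlapping k-character shingles of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(records, threshold=0.9):
    """Keep a record only if it is not too similar to any record already kept."""
    kept, kept_shingles = [], []
    for rec in records:
        s = shingles(rec)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(rec)
            kept_shingles.append(s)
    return kept
```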

    Role & Source Coverage Visualizations

    Composition by Data Source:


    Composition of Tech Resume Dataset by Data Source

    Role Cluster Diversity:


    Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset

    Alternative: Dataset by Source Type (Pie Chart):


    Resume Dataset Composition by Source Type

    Typical Use Cases

    • Resume parsing & sectioning (training for models like BERT, RoBERTa, spaCy)
    • Fine-tuning for NER, job classification (60+ labels), skill extraction, and ATS research
    • Development or benchmarking of AI-powered job matching, candidate ranking, and automated tracking tools
    • ML/data science education and demo pipelines

    How to Use the JSONL File

    Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:

    import json
    
    with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
      resumes = [json.loads(line) for line in f]
    # Each record is now a Python dictionary
    

    Citing and Sharing

    If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).

  18. Steam Reviews English - Dead by Daylight

    • kaggle.com
    zip
    Updated Nov 20, 2025
    Nicola Mustone (2025). Steam Reviews English - Dead by Daylight [Dataset]. https://www.kaggle.com/datasets/nicolamustone/steam-reviews-english-dead-by-daylight
    Explore at:
    zip(22155467 bytes)Available download formats
    Dataset updated
    Nov 20, 2025
    Authors
    Nicola Mustone
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Steam Reviews — Dead by Daylight (App 381210)

    A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API. Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
    This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.

    A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.

    Dataset Summary

    Field                           Description
    review                          Full review text
    sentiment                       1 = positive review, 0 = negative
    purchased                       1 if purchased on Steam
    received_for_free               1 if the game was received for free
    votes_up                        Number of helpful votes
    votes_funny                     Number of "funny" votes
    date_created                    Review creation date (YYYY-MM-DD, UTC)
    date_updated                    Last update date (YYYY-MM-DD, UTC)
    author_num_games_owned          Total games owned by reviewer
    author_num_reviews              Total reviews written by reviewer
    author_playtime_forever_min     Total playtime in minutes
    author_playtime_at_review_min   Playtime when the review was written (minutes)

    Example Use Cases

    • Sentiment Analysis: Train classifiers using user tone and voting patterns.
    • Text Embeddings: Extract embeddings for clustering or topic modeling.
    • Behavioral Correlation: Relate sentiment to playtime or review length.
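    As a starting point for the sentiment-analysis use case, a minimal standard-library sketch that computes the positive-review share from the sentiment column (the inline sample rows below are made up for illustration; in practice, read the downloaded CSV instead):

```python
import csv
import io

def positive_share(csv_text):
    """Fraction of reviews labeled positive (sentiment == 1)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    return sum(int(r["sentiment"]) for r in rows) / len(rows)

# Made-up sample rows using a subset of the dataset's columns.
sample = """review,sentiment,votes_up
"Great game with friends",1,12
"Matchmaking is frustrating",0,3
"Still fun after 500 hours",1,7
"""
print(positive_share(sample))  # 2 of 3 sample reviews are positive
```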

    Data Source

    Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.

    This dataset includes only publicly accessible user content and metadata.
    Each record is factual and unaltered beyond format normalization.

    Licensing

    • Dataset: MIT License
      Free for commercial and non-commercial use with attribution.
    • Collection Script: GPLv3 License
      Ensures derivative software remains open-source.

    Update Schedule

    Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.

    Credits

    Created by Nicola Mustone.

    Disclaimer

    This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.

    All product names, logos, brands, and trademarks are the property of their respective owners.

    The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.

  19. Sample provider semi-structured interview guide.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jan 24, 2025
    McBain, Ryan K.; Namisango, Eve; Green, Harold D.; Gwokyalya, Violet; Bouskill, Kathryn; Ober, Allison; Matovu, Joseph K. B.; Beyeza-Kashesya, Jolly; Nakami, Sylvia; Juncker, Margrethe; Luyirika, Emmanuel; Wagner, Glenn J.; Bogart, Laura M.; Wanyenze, Rhoda K. (2025). Sample provider semi-structured interview guide. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001388938
    Explore at:
    Dataset updated
    Jan 24, 2025
    Authors
    McBain, Ryan K.; Namisango, Eve; Green, Harold D.; Gwokyalya, Violet; Bouskill, Kathryn; Ober, Allison; Matovu, Joseph K. B.; Beyeza-Kashesya, Jolly; Nakami, Sylvia; Juncker, Margrethe; Luyirika, Emmanuel; Wagner, Glenn J.; Bogart, Laura M.; Wanyenze, Rhoda K.
    Description

    Introduction: Cervical cancer (CC) is the leading cause of cancer-related deaths among Ugandan women, yet rates of CC screening are very low. Training women who have recently screened to engage in advocacy for screening among women in their social network is a network-based strategy for promoting information dissemination and CC screening uptake.

    Methods: Drawing on the Exploration, Preparation, Implementation and Sustainment (EPIS) framework for implementation science, this hybrid type 1 randomized controlled trial (RCT) of a peer-led, group advocacy training intervention, Game Changers for Cervical Cancer Prevention (GC-CCP), will examine efficacy for increasing CC screening uptake as well as how it can be implemented and sustained in diverse clinic settings. In the Preparation phase, we will prepare the four study clinics for implementation of GC-CCP and the expected increase in demand for CC screening, using qualitative methods (stakeholder interviews and client focus groups) to identify and address structural barriers to easy access to CC screening. In the Implementation phase, GC-CCP will be implemented over 36 months at each clinic, with screened women (index participants) enrolled as research participants receiving the intervention in the first 6 months as part of a parallel group RCT overseen by the research study team to evaluate efficacy for CC screening uptake among their enrolled social network members. All research participants will be assessed at baseline and months 6 and 12. Intervention implementation and supervision will then be transitioned to clinic staff and offered as part of usual care in the subsequent 30 months as part of the Sustainability phase. Using the RE-AIM framework, we will evaluate engagement in GC-CCP and CC advocacy (reach), change in CC screening (effectiveness), adoption into clinic operations, implementation outcomes (acceptability, feasibility, fidelity, cost-effectiveness) and maintenance.

    Discussion: This is one of the first studies to use a network-driven approach and empowerment of CC-screened peers as change agents to increase CC screening. If shown to be an effective and sustainable implementation strategy for promoting CC screening, this peer advocacy model could be applied to other preventative health behaviors and disease contexts.

    Trial registration: NIH Clinical Trial Registry NCT06010160 (clinicaltrials.gov; date: 8/17/2023).

  20. Example Structure-From-Motion Data

    • figshare.com
    bin
    Updated Apr 1, 2020
    Atticus Stovall (2020). Example Structure-From-Motion Data [Dataset]. http://doi.org/10.6084/m9.figshare.12061197.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Apr 1, 2020
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Atticus Stovall
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data for exploring structure-from-motion data from a 100 m x 100 m subset of temperate forest in central Virginia.

    Data collection and post-processing by: Atticus Stovall, Bailey Costello, and Xi Yang

    Drone: DJI Mavic Pro with onboard RGB camera
