MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a curated collection of programming questions, each paired with example inputs/outputs, constraints, and test cases.
It is designed for use in machine learning research, code generation models, natural language processing (NLP) tasks, or simply as a question bank for learners and educators.
Dataset Highlights:
📘 616 questions with titles, descriptions, and difficulty levels (Easy, Medium, Hard)
💡 Each question includes examples, constraints, and test cases stored as structured JSON
🧠 Useful for LLM fine-tuning, question answering, and automated code evaluation tasks
🧩 Ideal for creating or benchmarking AI coding assistants and educational apps
Source: Collected from a structured internal question database built for educational and evaluation purposes.
Format: CSV file with the following columns: id, title, description, difficulty_level, created_at, updated_at, examples, constraints, test_cases
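The JSON-encoded columns (examples, constraints, test_cases) can be parsed after loading the CSV. A minimal sketch with the standard library; the sample row and its JSON shape are illustrative assumptions, not taken from the dataset:

```python
import csv
import io
import json

# One illustrative row in the documented column layout (abbreviated).
csv_text = (
    'id,title,difficulty_level,test_cases\n'
    '1,Two Sum,Easy,"[{""input"": ""[2,7,11,15], 9"", ""output"": ""[0,1]""}]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    # examples, constraints, and test_cases are stored as JSON strings
    row["test_cases"] = json.loads(row["test_cases"])

print(rows[0]["test_cases"][0]["output"])  # → [0,1]
```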
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MedReport - Reports Dataset
Dataset Description
This dataset contains medical audio transcriptions and the corresponding structured reports.
Columns
input: Audio transcription
output: Structured medical report
sample_id: Example identifier
Statistics
Total examples: 1000
License: Apache License 2.0
Created: 2025-08-05
Usage
Loading the dataset
from datasets import load_dataset
full_dataset = load_dataset("wouk1805/medreport_text_1000")
See the full description on the dataset page: https://huggingface.co/datasets/wouk1805/medreport_text_1000.
Sample questions for the semi-structured interviews.
Sample dataset for PETAL model
This dataset is a sample for testing the functionalities of the PETAL model (encoder and decoder). It is based on the CASP15 dataset; see:
https://predictioncenter.org/casp15/ https://github.com/Bhattacharya-Lab/CASP15
The registries folder contains the registry of the CASP15 dataset (a CSV file with filename, pdb_id, etc.).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Catastrophic and sudden collapses of ecosystems are sometimes preceded by early warning signals that potentially could be used to predict and prevent a forthcoming catastrophe. Universality of these early warning signals has been proposed, but no formal proof has been provided. Here, we show that in relatively simple ecological models the most commonly used early warning signals for a catastrophic collapse can be silent. We underpin the mathematical reason for this phenomenon, which involves the direction of the eigenvectors of the system. Our results demonstrate that claims on the universality of early warning signals are not correct, and that catastrophic collapses can occur without prior warning. In order to correctly predict a collapse and determine whether early warning signals precede the collapse, detailed knowledge of the mathematical structure of the approaching bifurcation is necessary. Unfortunately, such knowledge is often only obtained after the collapse has already occurred.
Having access to good quality housing is a key determinant of well-being. Little is known about experiences of housing quality following homelessness from the perspectives of persons with lived experience. To build on existing literature, we conducted a secondary analysis of qualitative interviews with 19 individuals who had experiences of transitioning to housing following homelessness. Interview transcripts were drawn from a community-based participatory research study exploring the conditions needed for thriving following homelessness in Ontario, Canada. We analyzed these transcripts using reflexive thematic analysis. We coded transcripts abductively, informed by theories of social justice and health equity. Consistent with reflexive thematic analysis, we identified a central essence to elucidate experiences of housing quality following homelessness: “negotiating control within oppressive structural contexts.” This was expressed through four distinct themes: 1) being forced to live in undesirable living conditions; 2) stuck in an unsafe environment; 3) negotiating power dynamics to attain comfort and safety in one’s housing; and 4) having access to people and resources that create home. Overall, our findings indicate that attaining good quality housing following homelessness is elusive for many and influenced by a range of structural factors including ongoing poverty following homelessness, a lack of deeply affordable housing stock, and a lack of available social support networks. To prevent homelessness, it is essential to improve access to good quality housing that can support tenancy sustainment and well-being following homelessness. Policymakers need to review existing housing policies and reflect on how over-reliance on market housing has imposed negative impacts on the lives of persons who are leaving homelessness.
Given the current economic context, it is imperative that policymakers devise policies that mitigate the financialization of housing, and result in the restoration of the social housing system in Canada and beyond.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of PDFs in Google Cloud Storage from the first page of select US and EU patents, and BigQuery tables with extracted entities, labels, and other properties, including a link to each file in GCS. The structured data contains labels for eleven patent entities (patent inventor, publication date, classification number, patent title, etc.), global properties (US/EU issued, language, invention type), and the location of any figures or schematics on the patent's first page. The structured data is the result of a data entry operation collecting information from PDF documents, making the dataset a useful testing ground for benchmarking and developing AI/ML systems intended to perform broad document understanding tasks like extraction of structured data from unstructured documents. This dataset can be used to develop and benchmark natural language tasks such as named entity recognition and text classification, AI/ML vision tasks such as image classification and object detection, as well as more general AI/ML tasks such as automated data entry and document understanding. Google is sharing this dataset to support the AI/ML community because there is a shortage of document extraction/understanding datasets shared under an open license. This public dataset is hosted in Google Cloud Storage and Google BigQuery. It is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery, or this Cloud Storage quick start guide to begin.
Data structure information for each sample and outcome.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mention of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
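A minimal sketch of reading a Spider-format JSON file and pairing each natural-language question with its SQL query. The field names ("db_id", "question", "query") follow the Spider format documented at the repo linked above; the inline record is a stand-in for the real file:

```python
import io
import json

def load_pairs(fp):
    """Return (db_id, question, query) triples from a Spider-format JSON file."""
    return [(ex["db_id"], ex["question"], ex["query"]) for ex in json.load(fp)]

# Inline stand-in for open("spider-realistic.json", encoding="utf-8"):
demo = io.StringIO(
    '[{"db_id": "concert_singer",'
    ' "question": "How many singers are there?",'
    ' "query": "SELECT count(*) FROM singer"}]'
)
for db_id, question, query in load_pairs(demo):
    print(f"[{db_id}] {question} -> {query}")
```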
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh and Yuepeng Wang and Isil Dillig and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer and Ioannis Konstas and Alvin Cheung and Jayant Krishnamurthy and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu and Oren Etzioni and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback.
This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema. Each JSON line holds the content of one full Wikipedia article stripped of extra markdown and non-prose sections (references, etc.).
Invitation for Feedback
The dataset is built as part of the Structured Contents initiative and based on the Wikimedia Enterprise HTML snapshots. It is an early beta release intended to improve transparency in the development process and to request feedback. This first version includes pre-parsed Wikipedia abstracts, short descriptions, main image links, infoboxes and article sections, excluding non-prose sections (e.g. references). More elements (such as lists and tables) may be added over time. For updates follow the project’s blog and our MediaWiki Quarterly software updates on MediaWiki. As this is an early beta release, we highly value your feedback to help us refine and improve this dataset. Please share your thoughts, suggestions, and any issues you encounter either on the discussion page of Wikimedia Enterprise’s homepage on Meta wiki, or on the discussion page for this dataset here on Kaggle.
The contents of this dataset of Wikipedia articles is collectively written and curated by a global volunteer community. All original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 4.0 License. Some text may be available only under the Creative Commons license; see the Wikimedia Terms of Use for details. Text written by some authors may be released under additional licenses or into the public domain.
The dataset in its structured form is generally helpful for a wide variety of tasks, including all phases of model development, from pre-training to alignment, fine-tuning, updating/RAG as well as testing/benchmarking. We would love to hear more about your use cases.
Data Fields
The data fields are the same across all records. Noteworthy fields include:
- name - title of the article.
- identifier - ID of the article.
- url - URL of the article.
- version - metadata related to the latest specific revision of the article.
- version.editor - editor-specific signals that can help contextualize the revision.
- version.scores - assessments by ML models of the likelihood of a revision being reverted.
- main entity - Wikidata QID the article is related to.
- abstract - lead section, summarizing what the article is about.
- description - one-sentence description of the article for quick reference.
- image - main image representing the article's subject.
- infoboxes - parsed information from the side panel (infobox) on the Wikipedia article.
- sections - parsed sections of the article, including links. Note: excludes other media/images, lists, tables, and references or similar non-prose sections.
The full data dictionary is available here: https://enterprise.wikimedia.com/docs/data-dictionary/
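Since each JSON line holds one full article, the fields above can be accessed after parsing a line. A minimal sketch; the record below is a hand-made stand-in, not real dump output, and only the nesting of version.editor is assumed from the field list:

```python
import json

# Stand-in for one line of the JSONL dump, abbreviated to a few fields.
line = json.dumps({
    "name": "Example article",
    "identifier": 12345,
    "url": "https://en.wikipedia.org/wiki/Example",
    "version": {"editor": {"name": "SomeEditor"}, "scores": {}},
    "abstract": "Lead section summarizing what the article is about.",
    "infoboxes": [],
    "sections": [],
})

article = json.loads(line)  # each line of the dump parses to one article dict
print(article["name"], "edited by", article["version"]["editor"]["name"])
```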
Curation Rationale
This dataset has been created as part of the larger Structured Contents initiative at Wikimedia Enterprise with the aim of making Wikimedia data more machine readable. These efforts are focused both on pre-parsing Wikipedia snippets and on connecting the different projects more closely together. Even if Wikipedia is very structured to the human eye, it is a non-triv...
Apache License 2.0: http://www.apache.org/licenses/LICENSE-2.0
Example DICOM RT Structure
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
from: https://archive.ics.uci.edu/ml/datasets/car+evaluation
Title: Car Evaluation Database
Sources:
(a) Creator: Marko Bohanec
(b) Donors: Marko Bohanec (marko.bohanec@ijs.si), Blaz Zupan (blaz.zupan@ijs.si)
(c) Date: June, 1997
Past Usage:
The hierarchical decision model, from which this dataset is derived, was first presented in
M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. In 8th Intl Workshop on Expert Systems and their Applications, Avignon, France. pages 59-78, 1988.
Within machine-learning, this dataset was used for the evaluation of HINT (Hierarchy INduction Tool), which was proved to be able to completely reconstruct the original hierarchical model. This, together with a comparison with C4.5, is presented in
B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition. ICML-97, Nashville, TN. 1997 (to appear)
Relevant Information Paragraph:
Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.). The model evaluates cars according to the following concept structure:
CAR               car acceptability
. PRICE           overall price
. . buying        buying price
. . maint         price of the maintenance
. TECH            technical characteristics
. . COMFORT       comfort
. . . doors       number of doors
. . . persons     capacity in terms of persons to carry
. . . lug_boot    the size of luggage boot
. . safety        estimated safety of the car
Input attributes are printed in lowercase. Besides the target concept (CAR), the model includes three intermediate concepts: PRICE, TECH, COMFORT. Every concept is in the original model related to its lower level descendants by a set of examples (for these examples sets see http://www-ai.ijs.si/BlazZupan/car.html).
The Car Evaluation Database contains examples with the structural information removed, i.e., directly relates CAR to the six input attributes: buying, maint, doors, persons, lug_boot, safety.
Because of known underlying concept structure, this database may be particularly useful for testing constructive induction and structure discovery methods.
Number of Instances: 1728 (instances completely cover the attribute space)
Number of Attributes: 6
Attribute Values:
buying: v-high, high, med, low
maint: v-high, high, med, low
doors: 2, 3, 4, 5-more
persons: 2, 4, more
lug_boot: small, med, big
safety: low, med, high
Missing Attribute Values: none
Class Distribution (number of instances per class)
unacc: 1210 (70.023 %)
acc: 384 (22.222 %)
good: 69 (3.993 %)
v-good: 65 (3.762 %)
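The class distribution can be recomputed from the raw data with a short script. A minimal sketch: the UCI distribution stores the 1728 instances as comma-separated rows whose last column is the class label; the three rows here are illustrative, not real counts:

```python
import csv
import io
from collections import Counter

# Stand-in for open("car.data"): three illustrative instances.
data = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,4,4,big,high,vgood\n"
    "low,med,4,more,med,high,unacc\n"
)

# Tally the class label (last column) across all rows.
counts = Counter(row[-1] for row in csv.reader(data))
print(dict(counts))  # → {'unacc': 2, 'vgood': 1}
```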
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It has never been easier to solve database-related problems using SQL. The following gives you an opportunity to see how I worked out some of the inter-table relationships in this dataset using the Panoply.io tool.
I was able to load the coronavirus dataset and create a submittable, reusable result. I hope it helps you work in a Data Warehouse environment.
The following is a list of SQL commands performed on the dataset attached below, with the final output stored in the Exports folder.

Query 1
SELECT "Province/State" AS "Region", Deaths, Recovered, Confirmed
FROM "public"."coronavirus_updated"
WHERE Recovered > (Deaths / 2) AND Deaths > 0

Description: How do we estimate where coronavirus has infiltrated but there is effective recovery among patients? We can view those places by requiring recoveries to exceed half the death toll.
Query 2
SELECT country, SUM(confirmed) AS "Confirmed Count", SUM(Recovered) AS "Recovered Count", SUM(Deaths) AS "Death Toll"
FROM "public"."coronavirus_updated"
WHERE Recovered > (Deaths / 2) AND Confirmed > 0
GROUP BY country

Description: The coronavirus epidemic has infiltrated multiple countries, and the only way to stay safe is to know which countries have confirmed cases. Here is a list of those countries with their aggregate confirmed, recovered, and death counts.
Query 3
SELECT country AS "Countries where Coronavirus has reached"
FROM "public"."coronavirus_updated"
WHERE confirmed > 0
GROUP BY country

Description: Coronavirus has infiltrated multiple countries, and the only way to stay safe is to know which countries have confirmed cases. Here is a list of those countries.
Query 4
SELECT country, SUM(suspected) AS "Suspected Cases under potential CoronaVirus outbreak"
FROM "public"."coronavirus_updated"
WHERE suspected > 0 AND deaths = 0 AND confirmed = 0
GROUP BY country
ORDER BY SUM(suspected) DESC

Description: Coronavirus is spreading at an alarming rate. Knowing which countries are newly getting the virus is important, because if timely measures are taken in these countries, casualties can be prevented. Here is a list of suspected cases with no virus-related deaths.
Query 5
SELECT country, SUM(suspected) AS "Coronavirus uncontrolled spread count and human life loss",
       100 * SUM(suspected) / (SELECT SUM(suspected) FROM "public"."coronavirus_updated") AS "Global suspected Exposure of Coronavirus in percentage"
FROM "public"."coronavirus_updated"
WHERE suspected > 0 AND deaths = 0
GROUP BY country
ORDER BY SUM(suspected) DESC

Description: Coronavirus is getting stronger in particular countries, but how do we measure that? We can measure it by the percentage of suspected patients in countries which still have no coronavirus-related deaths. The following is a list.
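The filter logic of Query 1 can be tried locally with SQLite. A minimal sketch; the schema and the two rows are invented to mirror the coronavirus_updated columns used in the queries above, not real data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE coronavirus_updated
               (region TEXT, country TEXT, confirmed INT,
                recovered INT, deaths INT, suspected INT)""")
con.executemany(
    "INSERT INTO coronavirus_updated VALUES (?, ?, ?, ?, ?, ?)",
    [("Hubei", "China", 1000, 300, 100, 50),
     ("Lombardy", "Italy", 500, 10, 40, 20)],
)

# Query 1: regions with deaths where recoveries exceed half the death toll.
rows = con.execute("""SELECT region, deaths, recovered, confirmed
                      FROM coronavirus_updated
                      WHERE recovered > deaths / 2 AND deaths > 0""").fetchall()
print(rows)  # → [('Hubei', 100, 300, 1000)]
```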
Data Provided by: SRK, Data Scientist at H2O.ai, Chennai, India
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Illustrative example of a data format following Darwin Core for sending data from a collection management system to a citizen science platform. Multi-imaged vertebrate specimen case: http://coldb.mnhn.fr/catalognumber/mnhn/zo/2013-152
Illustration of the milestone 28 document, work package 5.2 of the ICEDIG project.
Our work towards a Structured Evaluation Methodology for Artificial Intelligence Technology (SEMAIT) aims to provide plots, tools, methods, and strategies to extract insights out of various machine learning (ML) and artificial intelligence (AI) data. Included in this software is the MIg analyZeR (mizr) R software package, which produces various plots. It was initially developed within the Multimodal Information Group (MIG) at the National Institute of Standards and Technology (NIST). This software is documented, configured to be installed as an R package, and comes with an example SEMAIT script using an example (system, dataset, metrics, score) ML tuple set that we constructed ourselves.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A comprehensive and structured dataset of common offensive security tools available in Kali Linux, including usage commands, flags, descriptions, categories, and official documentation links.
This dataset is designed to support cybersecurity training, red team automation, LLM fine-tuning, and terminal assistants for penetration testers.
Each entry is a JSON object stored in .jsonl (JSON Lines) format. This structure is ideal for machine learning pipelines and programmatic use.
| Field | Description |
|---|---|
| tool | Name of the Linux tool (e.g., nmap, sqlmap) |
| command | A real-world example command |
| description | Human-readable explanation of what the command does |
| category | Type of tool or use case (e.g., Networking, Exploitation, Web) |
| use_case | Specific purpose of the command (e.g., port scanning, password cracking) |
| flags | Important flags used in the command |
| os | Operating system (Linux) |
| reference_link | URL to official documentation or man page |
{
"tool": "sqlmap",
"command": "sqlmap -u http://example.com --dbs",
"description": "Enumerate databases on a vulnerable web application.",
"category": "Web Application",
"use_case": "SQL injection testing",
"flags": ["-u", "--dbs"],
"os": "Linux",
"reference_link": "http://sqlmap.org/"
}
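Records in this shape can be filtered by category with a few lines. A minimal sketch; the two abbreviated records below are illustrative stand-ins for lines of the real .jsonl file:

```python
import json

# Stand-ins for two lines of the dataset's JSONL file (abbreviated fields).
lines = [
    '{"tool": "sqlmap", "command": "sqlmap -u http://example.com --dbs", '
    '"category": "Web Application", "flags": ["-u", "--dbs"]}',
    '{"tool": "nmap", "command": "nmap -sV 10.0.0.1", '
    '"category": "Networking", "flags": ["-sV"]}',
]

entries = [json.loads(line) for line in lines]
web_tools = [e["tool"] for e in entries if e["category"] == "Web Application"]
print(web_tools)  # → ['sqlmap']
```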
✅ Key Features
✅ Covers widely-used tools: nmap, hydra, sqlmap, burpsuite, aircrack-ng, wireshark, etc.
✅ Multiple real-world command examples per tool
✅ Cross-categorized where tools serve multiple purposes
✅ Ready for use in LLM training, cybersecurity education, and CLI helpers
🔍 Use Cases
Fine-tuning AI models (LLMs) for cybersecurity and terminal tools
Building red team knowledge bases or documentation bots
Creating terminal assistant tools and cheat sheets
Teaching ethical hacking through command-line exercises
📚 Categories Covered
Networking
Web Application Testing
Exploitation
Password Cracking
Wireless Attacks
System Forensics
Sniffing & Spoofing
⚠️ Legal Notice
This dataset is provided for educational, research, and ethical security testing purposes only. Use of these tools and commands in unauthorized environments may be illegal.
📜 License
This dataset is released under the MIT License.
🙌 Contributions
Contributions are welcome! Feel free to submit PRs to add tools, improve descriptions, or fix errors.
📫 Maintainer
Created by: sunnythakur
GitHub: github.com/sunnythakur25
Contact: sunny48445@gmail.com
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed for cutting-edge NLP research in resume parsing, job classification, and ATS system development. Below are extensive details and several ready-made diagrams you can include in your Kaggle upload (just save and upload as “Additional Files” or use them in your dataset description).
| Field | Description | Example/Data Type |
|---|---|---|
| ResumeID | Unique, anonymized string | "DIS4JE91Z..." (string) |
| Category | Tech job category/label | "DevOps Engineer" |
| Name | Anonymized (Faker-generated) name | "Jordan Patel" |
| Email | Anonymized email address | "jpatel@example.com" |
| Phone | Anonymized phone number | "+1-555-343-2123" |
| Location | City, country or region (anonymized) | "Austin, TX, USA" |
| Summary | Professional summary/intro | String (3-6 sentences) |
| Skills | List or comma-separated tech/soft skills | "Python, Kubernetes..." |
| Experience | Work chronology, organizations, bullet-point details | String (multiline) |
| Education | Universities, degrees, certs | String (multiline) |
| Source | "real", "template", "llm", "faker" | String |
![Dataset Schema Overview with Field Descriptions and Data Types](https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/a5b5a057-7265-4428-9827-0a4c92f88d19/0e26c38c.png)
Dataset Schema Overview with Field Descriptions and Data Types
Composition by Data Source:
![Composition of Tech Resume Dataset by Data Source](https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/a5aafe90-c5b6-4d07-ad9c-cf5244266561/5723c094.png)
Composition of Tech Resume Dataset by Data Source
Role Cluster Diversity:
![Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset](https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/8c6ba5d6-f676-4213-b4f7-16a133081e00/e9cc61b6.png)
Distribution of Major Tech Role Clusters in the 3,500 Resumes Dataset
Alternative: Dataset by Source Type (Pie Chart):
![Resume Dataset Composition by Source Type](https://ppl-ai-code-interpreter-files.s3.amazonaws.com/web/direct-files/626086319755b5c5810ff838ca0c0c3b/2325f133-7fe5-4294-9a9d-4db19be3584f/b85a47bd.png)
Resume Dataset Composition by Source Type
Each line in tech_resumes_dataset.jsonl is a single, fully structured resume object:
import json
with open('tech_resumes_dataset.jsonl', 'r', encoding='utf-8') as f:
resumes = [json.loads(line) for line in f]
# Each record is now a Python dictionary
If you use this dataset, credit it as “[your Kaggle dataset URL]” and mention original sources (ResumeAtlas, Resume_Classification, Kaggle Resume Dataset, and synthetic methodology as described).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset of 277,439 English-only Steam user reviews for Dead by Daylight from 2019 to November 2025, collected through the official Steam API.
Each row represents a single review, including sentiment labels, playtime, and engagement metrics.
This dataset is ideal for natural language processing, sentiment analysis, and behavioral data studies.
A separate CSV with all the patches released for Dead by Daylight is included in the download for your convenience.
| Field | Description |
|---|---|
| review | Full review text |
| sentiment | 1 = positive review, 0 = negative |
| purchased | 1 if purchased on Steam |
| received_for_free | 1 if the game was received for free |
| votes_up | Number of helpful votes |
| votes_funny | Number of “funny” votes |
| date_created | Review creation date (YYYY-MM-DD, UTC) |
| date_updated | Last update date (YYYY-MM-DD, UTC) |
| author_num_games_owned | Total games owned by reviewer |
| author_num_reviews | Total reviews written by reviewer |
| author_playtime_forever_min | Total playtime in minutes |
| author_playtime_at_review_min | Playtime when the review was written (minutes) |
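Simple aggregates over these columns need only the standard library. A minimal sketch; the three rows are invented for illustration (the real CSV has 277,439 reviews), and the column subset is an assumption of this example:

```python
import csv
import io

# Stand-in for the real CSV, restricted to three of the documented columns.
csv_text = (
    "review,sentiment,author_playtime_forever_min\n"
    "Great game,1,1200\n"
    "Too grindy,0,300\n"
    "Love the killers,1,4500\n"
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
positive_share = sum(int(r["sentiment"]) for r in rows) / len(rows)
mean_playtime = sum(int(r["author_playtime_forever_min"]) for r in rows) / len(rows)
print(f"{positive_share:.2f} positive, {mean_playtime:.0f} min average playtime")
```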
Reviews were collected using the SirDarcanos/Steam-Reviews-Scraper script.
This dataset includes only publicly accessible user content and metadata.
Each record is factual and unaltered beyond format normalization.
Updates will be performed irregularly and only when new data is collected. Users are welcome to suggest improvements or request updates via the discussion section.
Created by Nicola Mustone.
This dataset and its author are not affiliated with, endorsed by, or sponsored by Valve Corporation or Behaviour Interactive Inc.
All product names, logos, brands, and trademarks are the property of their respective owners.
The data included in this dataset was collected from publicly available user reviews through the official Steam Web API, and is provided solely for educational and research purposes.
Introduction
Cervical cancer (CC) is the leading cause of cancer-related deaths among Ugandan women, yet rates of CC screening are very low. Training women who have recently screened to engage in advocacy for screening among women in their social network is a network-based strategy for promoting information dissemination and CC screening uptake.
Methods
Drawing on the Exploration, Preparation, Implementation and Sustainment (EPIS) framework for implementation science, this hybrid type 1 randomized controlled trial (RCT) of a peer-led, group advocacy training intervention, Game Changers for Cervical Cancer Prevention (GC-CCP), will examine efficacy for increasing CC screening uptake as well as how it can be implemented and sustained in diverse clinic settings. In the Preparation phase, we will prepare the four study clinics for implementation of GC-CCP and the expected increase in demand for CC screening by using qualitative methods (stakeholder interviews and client focus groups) to identify and address structural barriers to easy access to CC screening. In the Implementation phase, GC-CCP will be implemented over 36 months at each clinic, with screened women (index participants) enrolled as research participants receiving the intervention in the first 6 months as part of a parallel-group RCT overseen by the research study team to evaluate efficacy for CC screening uptake among their enrolled social network members. All research participants will be assessed at baseline and months 6 and 12. Intervention implementation and supervision will then be transitioned to clinic staff and offered as part of usual care in the subsequent 30 months as part of the Sustainment phase. Using the RE-AIM framework, we will evaluate engagement in GC-CCP and CC advocacy (reach), change in CC screening (effectiveness), adoption into clinic operations, implementation outcomes (acceptability, feasibility, fidelity, cost-effectiveness), and maintenance.
Discussion
This is one of the first studies to use a network-driven approach and empowerment of CC-screened peers as change agents to increase CC screening. If shown to be an effective and sustainable implementation strategy for promoting CC screening, this peer advocacy model could be applied to other preventative health behaviors and disease contexts.
Trial registration
NIH Clinical Trial Registry NCT06010160 (clinicaltrials.gov; date: 8/17/2023).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for exploring structure-from-motion data from a 100 m x 100 m subset of temperate forest in central Virginia.
Data collection and post-processing by: Atticus Stovall, Bailey Costello, and Xi Yang
Drone: DJI Mavic Pro with onboard RGB camera