According to our latest research, the JSON Editor market size reached USD 345 million in 2024, exhibiting robust momentum driven by the escalating need for efficient data management and seamless integration across digital platforms. The market is projected to expand at a CAGR of 13.2% from 2025 to 2033, reaching a forecasted USD 1,012 million by 2033. This growth is fueled by the proliferation of cloud-based applications, the increasing adoption of API-driven architectures, and the rising complexity of data structures in modern enterprises. Demand for advanced JSON editors is further propelled by the need for real-time data manipulation, enhanced collaboration, and automation across diverse industries.
One of the primary growth factors for the JSON Editor market is the exponential rise in the volume and complexity of data generated by organizations across various sectors. As businesses embrace digital transformation, the need to manage, process, and exchange structured and semi-structured data has become paramount. JSON (JavaScript Object Notation) has emerged as the preferred data-interchange format due to its lightweight, human-readable, and machine-friendly nature. Consequently, organizations are increasingly investing in sophisticated JSON editors to streamline data handling, ensure data integrity, and enable seamless integration with web and mobile applications. The surge in API-centric development further amplifies the demand for JSON editors, as APIs predominantly leverage JSON for data exchange, making robust editing and validation tools indispensable for developers and data engineers.
Another significant factor contributing to the market's expansion is the growing adoption of cloud computing and SaaS-based solutions. Cloud-based JSON editors are gaining traction due to their scalability, accessibility, and collaborative capabilities, allowing distributed teams to work on data structures in real-time. Enterprises are increasingly migrating their data management workflows to cloud platforms to reduce infrastructure costs, enhance operational efficiency, and facilitate remote work. This shift is driving vendors to innovate and offer cloud-native JSON editor solutions with advanced features such as version control, automated validation, and integration with popular DevOps tools. As organizations prioritize agility and flexibility, the demand for cloud-based JSON editors is expected to witness sustained growth throughout the forecast period.
The emergence of advanced technologies such as artificial intelligence (AI), machine learning (ML), and the Internet of Things (IoT) is also shaping the growth trajectory of the JSON Editor market. These technologies generate vast amounts of structured and semi-structured data, necessitating efficient tools for data parsing, validation, and transformation. JSON editors equipped with AI-driven functionalities, such as smart schema detection, auto-completion, and error correction, are becoming increasingly popular among developers and data scientists. Furthermore, the integration of JSON editors with CI/CD pipelines and automated testing frameworks enhances productivity and accelerates application development cycles. The continuous evolution of digital ecosystems and the need for interoperability across heterogeneous systems are expected to further bolster the adoption of advanced JSON editor solutions.
From a regional perspective, North America remains the dominant market for JSON editors, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology companies, a mature IT infrastructure, and a strong emphasis on innovation and digitalization. Europe follows closely, driven by stringent data compliance regulations and widespread adoption of cloud technologies. The Asia Pacific region is poised for the highest growth rate during the forecast period, fueled by rapid digital transformation, expanding IT investments, and the proliferation of startups and SMEs. Latin America and the Middle East & Africa are also witnessing steady adoption, supported by increasing awareness of data management best practices and the growing penetration of digital technologies.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ShareGPT-Formatted Dataset for Structured JSON Output
Dataset Description
This dataset is formatted in the ShareGPT style and is designed for fine-tuning large language models (LLMs) to generate structured JSON outputs. It consists of multi-turn conversations where each response follows a predefined JSON schema, making it ideal for training models that need to produce structured data in natural language scenarios.
Usage
This dataset can be used to train LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json.
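For orientation, a single training record in the common ShareGPT layout might look like the sketch below; the field names and the toy schema are illustrative assumptions, so consult the dataset page for the exact format.
import json

# A minimal, hypothetical ShareGPT-style record for structured-output fine-tuning
# (field names follow the common ShareGPT convention; verify against the dataset card).
record = {
    "conversations": [
        {"from": "human", "value": "Extract the person's name and age from: 'Alice is 30 years old.' Reply as JSON."},
        {"from": "gpt", "value": json.dumps({"name": "Alice", "age": 30})},
    ]
}
print(json.dumps(record, indent=2))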
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Important Notice: Ethical Use Only
This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.
Prohibited Use
Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.
Disclaimer
The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.
Project Structure
The project is organized into three main directories, each corresponding to a major section of the paper's experiments:
main_data_and_code/
├── rumor_generation/
├── rumor_detection/
└── rumor_debunking/
How to Get Started
Prerequisites
To successfully run the code and reproduce the results, you will need to:
- Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
- For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.
Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.
Detailed Directory Breakdown
1. rumor_generation/
This directory contains all the code and data related to the rumor generation experiments.
- rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
- rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
- rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
- token_distribution.py: Script to analyze token distribution in the generated text.
- label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
- extract_reasons.py: Script to extract reasons for rumor generation and rejection.
- visualization.py: Utility script for generating figures.
- LDA.py: Code for performing LDA topic modeling on the generated data.
- rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
- generation_reasons_extracted.json: The extracted reasons for generated rumors.
- rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.
2. rumor_detection/
This directory contains the code and data used for the rumor detection experiments.
- nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
- nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
- nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
- reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
- reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
- reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
- traditional_model.py: Code for the traditional models used as baselines.
- preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
- preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
- generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
- select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
- twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
- twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
- fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
- twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
- twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
- fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
- visualization.py: Utility script for generating figures.
3. rumor_debunking/
This directory contains all the code and data for the rumor debunking experiments.
- analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
- calculate_readability.py: Script for calculating the readability score of the debunking texts.
- plot_readability.py: Utility script for generating figures related to readability.
- fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
- debunking_results.json: The dataset containing the debunking results for this experimental section.
- debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
- sentiment_analysis/: Directory containing the result file from the sentiment analysis.
- debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.
Please contact the repository owner if you encounter any problems or have questions about the code or data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Ensuring the quality of software systems through testing is a critical aspect of software development. However, the maintenance of test cases presents significant challenges, both in terms of complexity and cost. The constant need for updates to align with evolving systems under test can result in broken test cases, leading to a deterioration in test suite quality and disruptions in the software development process. To address these challenges, we introduce TaRGet (Test Repair GEneraTor), an approach that leverages pre-trained code language models for automated test case repair. TaRGet treats test repair as a language translation task and employs a two-step process to fine-tune a language model using essential context data that characterizes test breakages.
Publication
This repository is a supplement to our paper (accepted at IEEE TSE), available at DOI 10.1109/TSE.2025.3541166. For detailed definitions, experiments, and results, please refer to the paper. If you find this repository useful, kindly cite our work:
@ARTICLE{saboor2025target, author={Saboor Yaraghi, Ahmadreza and Holden, Darren and Kahani, Nafiseh and Briand, Lionel}, journal={IEEE Transactions on Software Engineering}, title={Automated Test Case Repair Using Language Models}, year={2025}, pages={1-31}, doi={10.1109/TSE.2025.3541166}}
TaRBench
TaRBench is a comprehensive benchmark that we developed to evaluate the effectiveness of TaRGet in automated test case repair. The benchmark encompasses 45,373 broken test repairs across 59 open-source projects, providing a diverse and extensive dataset for assessing the capabilities of TaRGet. In the following sections, we present detailed information about TaRBench (available in the "TaRBench.zip" file).
The Data
Each project's data is stored in the "projects" directory, following the format "projects/GitHub_Username/GitHub_Repo_Name" (e.g., "projects/apache/druid"). TaRBench structures the data for each project into four JSON files:
- dataset.json: The main file containing all the test repair instances of the project.
- codeMining/call_graphs.json: Call graphs of the test cases found in dataset.json across different commits.
- codeMining/sut_method_changes.json: Code changes in methods of the System Under Test (SUT) commits.
- codeMining/sut_class_changes.json: Code changes in classes of the SUT commits.
Furthermore, the splits.csv file specifies the data split (train, valid, or test) for each test repair sample, identified by its unique ID. In the following, we provide details on the attributes within each file.
dataset.json
The dataset.json file comprises an array of JSON objects, each representing a test repair instance. These instances have the following attributes:
- ID (String): A unique identifier for the repair instance.
- name (String): The qualified name of the test case method, including the package, class, and method names.
- bCommit, aCommit (String): The commit hash (version) of the project before (bCommit) and after (aCommit) the repair.
- aCommitTime (Integer): Timestamp of aCommit.
- bPath, aPath (String): The relative path of the test case source file in the project before (bPath) and after (aPath) the repair.
- bSource, aSource (Object): The source code of the test case method before (bSource) and after (aSource) the repair. Each includes the start line of the test method in its source file as well as the test method source code.
- hunk (Object): The Git hunk of the test repair, representing the changes made to the test case method for repair. It includes the lines changed before (sourceChanges) and after (targetChanges) the repair, along with the corresponding code elements.
- astActions (Array): The edit actions applied to the abstract syntax tree (AST) of the test code for repair.
- trivial (Array): Indicates whether the repair is trivial (class or method renaming) or not. A "null" value signifies a non-trivial repair, while an array indicates a trivial repair and includes the associated trivial repair types.
call_graphs.json
The call_graphs.json file contains call graphs, generated through static code analysis, for test cases identified in the corresponding test repair commits. The file has a nested structure: at the first layer, each key represents a commit hash; at the second layer, each key is a qualified test case name; and at the third layer, it includes the call graph for the respective commit and test case. Attributes of each call graph are as follows:
- root (Object): Represents the root node of the call graph, always the test case.
- nodes (Array): An array containing metadata for all nodes in the call graph.
- graph (Object): Represents the edges of the graph, where each key corresponds to a node ID, and the associated value is an array of node IDs representing the nodes to which the key has outgoing edges.
sut_method_changes.json and sut_class_changes.json
Both files share a common structure, detailing code changes in the SUT for test repair commits. The sut_method_changes.json file exclusively captures changes within methods, while sut_class_changes.json covers all changes within classes. Both files present an array of objects, each object denoting changes in a commit. These objects have the following attributes:
- bCommit, aCommit (String): Same as dataset.json.
- changes (Array): An array of objects, each representing a change in a method or a class. These objects have the following attributes:
  - bPath, aPath (String): Same as dataset.json.
  - name (String): The qualified name of the changed method or class.
  - hunks (Array): An array of all hunks, representing changes in the method or class. The hunk data structure here follows that of the hunk field in dataset.json.
  - is_test_source (Boolean): Indicates whether the method or class is part of the project's test source code.
TaRGet Results
In addition to TaRBench, we provide the results of our approach, TaRGet. Below are the details of the results and associated files.
TaRGet_Results.zip
This ZIP archive contains two main folders:
Best_on_TaRBench
This folder includes the best results obtained using the best configuration of TaRGet on TaRBench during our experiments, i.e., IO2 with the CodeT5+ model. Detailed information about the experimental setup and the optimal configuration can be found in our paper. The folder contains the following files:
- train.json, valid.json, test.json: These files represent the dataset splits for TaRBench. Each file includes the input and expected output of the model, formatted using the best configuration, along with the original TaRBench fields (described in detail above).
- test_predictions.json: This file contains the predictions generated by the optimal TaRGet configuration using beam search. Each entry includes:
  - ID: The TaRBench ID.
  - target: The expected output.
  - preds: TaRGet's generations.
- test_verdicts.json: This file presents the results of executing the repairs generated by TaRGet (listed in test_predictions.json). Each entry includes:
  - ID: The TaRBench ID.
  - rank: The rank of the executed generation within the preds field of the predictions file.
  - verdict: The execution result.
  - success: A Boolean indicating whether the result was successful.
  - exec_time: The execution time in seconds.
  Note: A zero execution time indicates that the generated repair was identical to the correct repair and was skipped to save time.
VS_CEPROT
This folder contains the results of the comparative analysis between TaRGet and CEPROT, a baseline test repair method used in our study. For further details about the comparison, refer to our paper. The folder includes:
- test.json: The test dataset in the same format as the dataset splits mentioned above. This dataset was used as the evaluation benchmark in CEPROT's study. Note that the CID field in this file, as well as in the subsequent files, refers to CEPROT's ID. This ID consists of two parts: the focal_db ID and the test_db ID, separated by a dash. The ID field, on the other hand, corresponds to a TaRBench-like ID that was created during the collection of CEPROT's data.
- TaRGet_predictions.json: The repairs generated by TaRGet on the test dataset.
- CEPROT_predictions.json: The repairs generated by CEPROT on the same test dataset.
TaRGet_Best_FineTuned_Model.zip
This ZIP archive contains the fine-tuned model of the optimal TaRGet configuration, which was used to generate the results in the "Best_on_TaRBench" folder. The archive includes:
- checkpoint-best folder: Contains the model files, including the fine-tuned model weights (pytorch_model.bin).
- tokenizer folder: Includes the tokenizer files used for the model, along with all additional special tokens.
Conclusion
TaRBench serves as a valuable resource for researchers, developers, and practitioners interested in automated test case repair. Through an evaluation on a diverse dataset, it provides insights into the capabilities and limitations of TaRGet, paving the way for advancements in the field of automated software testing and maintenance.
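As a quick orientation, here is a minimal Python sketch (not part of the artifact) that loads one project's dataset.json and separates trivial from non-trivial repairs, assuming the directory layout described above:
import json

# Assumes the file layout described above: projects/GitHub_Username/GitHub_Repo_Name/dataset.json
with open("projects/apache/druid/dataset.json", encoding="utf-8") as f:
    repairs = json.load(f)  # array of test repair instances

non_trivial = [r for r in repairs if r["trivial"] is None]
print(f"{len(repairs)} repairs total, {len(non_trivial)} non-trivial")

# Each instance also carries the broken and repaired commits and sources.
first = repairs[0]
print(first["name"], first["bCommit"], "->", first["aCommit"])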
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Dhrumil Patel
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"English Short Stories" is a collection of short stories designed for English language learners at the A1 level, indicating beginners. The stories are written in simple language with basic vocabulary and sentence structures to aid comprehension and language acquisition.
The "English Short Stories JSON file for mobile applications" refers to a JSON (JavaScript Object Notation) file containing the same short stories formatted in a way that is suitable for integration into mobile applications. JSON is a lightweight data interchange format commonly used for storing and transmitting data between a server and a web application, including mobile apps. In this context, the JSON file likely contains the text of the short stories along with any associated metadata or formatting information needed for display within a mobile app.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
Folder configuration
The dataset consists of 3 folders:
- JSON Data: Contains all the generated JSON files.
- RGB Images: Contains the generated RGB images.
- Semantic Segmentation Images: Contains the generated semantic segmentation images.
Essential Terminology
- Annotation: Recorded data describing a single capture.
- Capture: One completed rendering process of a Unity sensor, which stores the rendered result to data files (e.g., PNG, JPG).
- Ego: Object or person on which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor).
- Ego coordinate system: Coordinates with respect to the ego.
- Global coordinate system: Coordinates with respect to the global origin in Unity.
- Sensor: Device that captures the dataset (in this instance the sensor is a camera).
- Sensor coordinate system: Coordinates with respect to the sensor.
- Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
- UUID: Universally Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data
The dataset includes 4 types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
- id: Integer identifier of the annotation's definition.
- name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
- description: Description of the annotation's specifications.
- format: Format of the file containing the annotation specifications (e.g., json, PNG).
- spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
KeypointLabeler:
- template_id: Keypoint template UUID.
- template_name: Name of the keypoint template.
- key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  - label: Joint label.
  - index: Joint index.
  - color: RGBA values of the keypoint.
  - color_code: Hex color code of the keypoint.
- skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  - label1: Label of the first joint.
  - label2: Label of the second joint.
  - joint1: Index of the first joint.
  - joint2: Index of the second joint.
  - color: RGBA values of the connection.
  - color_code: Hex color code of the connection.
SemanticSegmentationLabeler:
- label_name: String identifier of a label.
- pixel_value: RGBA values of the label.
- color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describes the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:
- id: UUID of the capture.
- sequence_id: UUID of the sequence.
- step: Index of the capture within a sequence.
- timestamp: Timestamp (in ms) since the beginning of a sequence.
- sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  - sensor_id: Sensor UUID.
  - ego_id: Ego UUID.
  - modality: Modality of the sensor (e.g., camera, radar).
  - translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  - camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  - projection: Projection type used by the camera (e.g., orthographic, perspective).
- ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  - ego_id: Ego UUID.
  - translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable containing the ego's orientation.
  - velocity: 3D vector containing the ego's velocity (in meters per second).
  - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
- format: Format of the file captured by the sensor (e.g., PNG, JPG).
- annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  - id: Annotation UUID.
  - annotation_definition: Integer identifier of the annotation's definition.
  - filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  - values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
- x: Position of the 2D bounding box on the X axis.
- y: Position of the 2D bounding box on the Y axis.
- width: Width of the 2D bounding box.
- height: Height of the 2D bounding box.
BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
- translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
- size: 3D vector containing the size of the 3D bounding box (in meters).
- rotation: Quaternion variable containing the orientation of the 3D bounding box.
- velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
- acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).
KeypointLabeler:
- label_id: Integer identifier of a label.
- instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible on the same capture have different instance_id values.
- template_id: UUID of the keypoint template.
- pose: Pose label for that particular capture.
- keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry's contents are as follows:
  - index: Index of the keypoint in the keypoint template file.
  - x: Pixel coordinate of the keypoint on the X axis.
  - y: Pixel coordinate of the keypoint on the Y axis.
  - state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
egos.json: Contains collections of key-value pairs for each ego. These include:
- id: UUID of the ego.
- description: Description of the ego.
sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:
- id: UUID of the sensor.
- ego_id: UUID of the ego on which the sensor is attached.
- modality: Modality of the sensor (e.g., camera, radar, sonar).
- description: Description of the sensor (e.g., camera, radar).
Image names
The RGB and semantic segmentation images share the same naming convention; however, the semantic segmentation images additionally have the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
- e denotes the id of the environment.
- h denotes the id of the person.
- l denotes the id of the lighting condition.
- d denotes the camera distance at which the image was captured.
- r denotes the camera angle at which the image was captured.
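For convenience, a small Python helper (not part of the dataset) that decodes this naming convention; the example filename is made up:
from pathlib import Path

def parse_image_name(path):
    # Semantic segmentation images share the convention with a "Semantic_" prefix.
    stem = Path(path).stem
    if stem.startswith("Semantic_"):
        stem = stem[len("Semantic_"):]
    e, h, l, d, r = stem.split("_")
    return {"environment": e, "human": h, "lighting": l, "distance": d, "angle": r}

print(parse_image_name("3_12_2_2_140.jpg"))  # hypothetical example filename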
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 1/2 of the ActiveHuman dataset! Part 2 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains course information and metadata scraped from two popular MOOC platforms: EdX and Coursera (Udemy will be added soon, although the script to scrape it is available on my GitHub: https://github.com/karar-git/Quamus). It includes various course attributes such as descriptions, program details, and other relevant data, making it useful for building recommendation systems, educational tools, or performing analysis on online learning content.
The dataset includes the following files:
combine_preprocessing.py: A Python script that preprocesses and combines the raw data into a unified format. You must run this script to generate the processed dataset.
combined_dataset.json: The final, preprocessed, and combined dataset of course metadata from both EdX and Coursera.
edx_courses.json: Raw course data scraped from the EdX platform.
edx_degree_programs.json: Data on degree programs available on EdX.
edx_executive_education_paidstuff.json: Paid course and executive education data from EdX.
edx_programs.json: Data on various programs available on EdX.
processed_coursera_data.json: Processed course data scraped from Coursera.
How to Use:
To generate the final combined_dataset.json, you need to run the combine_preprocessing.py script. This script will process the raw data files, then clean, normalize, and combine them into one unified dataset (a quick inspection sketch follows the disclaimer below).
Disclaimer:
Data Source: The data in this dataset was scraped from publicly available information on EdX and Coursera. The scraping was done solely for educational and research purposes. The scraping process adheres to the terms of use of the respective platforms.
Usage: This dataset is intended for non-commercial use only. Please use responsibly and adhere to the terms and conditions of the platforms from which the data was collected.
No Warranty: This data is provided "as-is" without any warranty. Users are responsible for ensuring that their use of the data complies with the relevant platform policies.
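As mentioned above, a minimal sketch for inspecting the generated file, assuming combined_dataset.json is a JSON array of course records:
import json

# Run combine_preprocessing.py first to generate combined_dataset.json.
with open("combined_dataset.json", encoding="utf-8") as f:
    courses = json.load(f)

print(f"{len(courses)} courses loaded")
print("Fields of the first record:", sorted(courses[0].keys()))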
Models:
In addition to the data, there are machine learning models related to this dataset available on GitHub. These models can help with content-based course recommendations and are built using the data you will find here. Specifically, the models include:
A cosine similarity-based model for course recommendations.
A two-tower model for personalized recommendations, trained using pseudo-labels.
A transformer-based course predictor (work in progress) designed to suggest the next course based on a user's learning progression.
Note:
The dataset currently contains data only from EdX and Coursera. The script to scrape Udemy data can be found in the related GitHub repository.
You will need to access the GitHub repository to view and experiment with the models and the Udemy scraping script. The models require the data files from this dataset to work properly.
With this dataset available on Kaggle, you can explore these educational resources and leverage them for building custom educational tools or analyzing online course trends.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a single JSON file containing label-value form fields generated using GPT-2. This data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt
The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.
For example:
[ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Study information
Design ideation study (N = 24) using eye tracking technology. Participants solved a total of twelve design problems while receiving inspirational stimuli on a monitor. Their task was to generate as many solutions to each problem as possible and to explain each solution briefly by thinking aloud. The study allows for further insight into how inspirational stimuli improve idea fluency during design ideation. This dataset features processed data from the experiment. Eye tracking data includes gaze data, fixation data, blink data, and pupillometry data for all participants. The study is based on the following research paper and follows the same experimental setup: Goucher-Lambert, K., Moss, J., & Cagan, J. (2019). A neuroimaging investigation of design ideation with and without inspirational stimuli—understanding the meaning of near and far stimuli. Design Studies, 60, 1-38.
Dataset
Most files in the dataset are saved as CSV files or other human-readable file formats. Large files are saved in Hierarchical Data Format (HDF5/H5) to allow for smaller file sizes and higher compression. All data is described thoroughly in 00_ReadMe.txt. The following processed data is included in the dataset:
- Concatenated annotations file of the experimental flow for all participants (CSV).
- All eye tracking raw data in concatenated files, annotated with only the participant ID (CSV/HDF5).
- Annotated eye tracking data for ideation routines only, a subset of the files above (CSV/HDF5).
- Audio transcriptions from the Google Cloud Speech-to-Text API of each recording, with annotations (CSV).
- Raw API responses for each transcription; these files include the time offset for each word in a recording (JSON).
- Data for questionnaire feedback and ideas generated during the experiment (CSV).
- Data for the post-experiment survey, including demographic information (TSV).
Python code used for the open-source experimental setup and dataset construction is hosted on GitHub. The repository also includes code showing how the dataset has been further processed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact provides a comprehensive dataset and analysis tools for evaluating the quality of Non-Functional Requirements (NFRs) generated by Large Language Models (LLMs) based solely on Functional Requirements (FRs). The dataset includes human evaluations of NFR quality according to ISO/IEC 25010:2023 standard quality attributes.
This research artifact contains:
The study evaluates two key aspects:
# Linux/macOS
curl -fsSL https://deno.land/install.sh | sh
# Windows (PowerShell)
irm https://deno.land/install.ps1 | iex
deno --version
Input: Professional evaluation data collected via Turso database service
data/dump.sql
Expected Output: Understanding of data collection methodology and raw response structure
Purpose: Convert SQL dump to SQLite database for analysis
cd analysis
deno run --allow-read --allow-write generateData.ts
Process:
- The generateData.ts script reads data/dump.sql
- It creates the data/dump.db SQLite database
Expected Output: data/dump.db file created (approximately 2-5 MB)
Purpose: Combine human evaluations with LLM assignments and generate final dataset
The generateData.ts script performs:
Expected Output: analysis/Human_Evaluation_Data.tsv (final dataset used in paper)
Files: LLMOutputs/*.json (8 files, one per LLM)
- claude-3-5-haiku.json
- claude-3-7-sonnet.json
- deepSeek-V3.json
- gemini-1.5-pro.json
- gpt-4o-mini.json
- grok-2.json
- lama-3.3-70B.json
- Qwen2.5-72B.json
Expected Format for each FR:
{
"functionalRequirement": "System shall allow users to log in with username and password",
"identifiedNFRs": [
{
"attribute": "Security",
"requirement": "The system must encrypt passwords using AES-256 encryption",
"justification": "Login functionality requires secure credential handling"
}
]
}
Analysis: Each JSON contains 34 FR entries with generated NFRs following ISO/IEC 25010:2023 categories
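As a rough illustration, the following sketch tallies generated NFRs per quality attribute across the eight output files; it is written in Python rather than the artifact's Deno/TypeScript tooling and assumes each LLMOutputs/*.json file is a JSON array of FR objects in the format shown above.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("LLMOutputs").glob("*.json"):
    # Assumption: each file is a JSON array of FR objects as shown above.
    for fr in json.loads(path.read_text(encoding="utf-8")):
        for nfr in fr["identifiedNFRs"]:
            counts[nfr["attribute"]] += 1

for attribute, n in counts.most_common():
    print(f"{attribute}: {n}")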
File: data/AdvancedPrompt.txt
Content: Complete prompt used for NFR generation, including:
- analysis/Human_Evaluation_Data.tsv: Main evaluation dataset (2,240 evaluated NFRs)
- data/FR_34.tsv: 34 functional requirements subset used for evaluation
- data/dump.sql: Raw SQL dump from the Turso database service containing professional evaluations
- LLMOutputs/[model].json: Structured NFR generations for each of the 8 LLMs
- data/AdvancedPrompt.txt: Complete prompt template with ISO/IEC 25010:2023 integration
- analysis/generateData.ts: Data processing script for database creation and CSV generation
- LICENSE.md: Distribution rights and usage terms
- analysis/visualization.ipynb: Jupyter notebook for data visualization and statistical analysis
Human_Evaluation_Data.tsv grouped by LLM model
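A quick way to sanity-check the main Human_Evaluation_Data.tsv file, assuming only that the TSV has a header row:
import csv

with open("analysis/Human_Evaluation_Data.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(len(rows), "evaluated NFR rows")
print("Columns:", list(rows[0].keys()))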
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Maniple
This repository contains code, scripts and data necessary to reproduce the paper "The Fact Selection Problem in LLM-Based Program Repair".
Installation
Before installing the project, ensure you have the following prerequisites installed on your system:
Follow these steps to install and set up the project on your local machine:
cd maniple
python3 -m pip install .
Structure of Directories
The project is organized into several directories, each serving a specific purpose:
data/                                  # Training and testing datasets
    BGP32.zip/                         # Sampled 32 bugs from the BugsInPy dataset
        black/                         # The bug project folder
            10/                        # The bug ID folder
                100000001/             # The bitvector used for prompting
                    prompt.md          # The prompt used for this bitvector
                    response_1.md      # The response from the model
                    response_1.json    # The response in JSON format
                    response_1.patch   # The response in patch format
                    result_1.json      # Testing result
                    ...
    BGP32-without-cot.zip              # GPT responses for the 32 bugs without CoT prompting
    BGP314.zip                         # 314 bugs from the BugsInPy dataset
    BGP157Ply1-llama3-70b.zip          # Experiment with the llama3 model on the BGP157Ply1 dataset
    BGP32-permutation.zip              # Permutation experiment on the BGP32 dataset

maniple/                               # Scripts for extracting facts and generating prompts
    strata_based/                      # Scripts for generating prompts
    utils/                             # Utility functions
    metrics/                           # Scripts for calculating metrics for the dataset

patch_correctness_labelling.xlsx       # The labelling of patch correctness
experiment.ipynb                       # Jupyter notebook for training models

experiment-initialization-resources/   # Contains raw facts for each bug
    bug-data/                          # Raw facts for each bug
        ansible/                       # Bug project folder
            5/                         # Bug ID folder
                bug-info.json          # Metadata for the bug
                facts_in_prompt.json   # Facts used in the prompt
                processed_facts.json   # Processed facts
                external_facts.json    # GitHub issues for this bug
                static-dynamic-facts.json  # Static and dynamic facts
                ...
    datasets-list/                     # Subsets from the BugsInPy dataset
    strata-bitvector/                  # Debugging information for bitvectors
Steps to Reproduce the Experiments
Please follow the steps below sequentially to reproduce the experiments on the 314 bugs in BugsInPy with our bitvector-based prompts.
Prepare the Dataset
The CLI scripts under the maniple directory provide useful commands to download and prepare environments for each bug.
To download and prepare environments for each bug, you can use the prep command.
maniple prep --dataset 314-dataset
This script will automatically download all 314 bugs from GitHub, create a virtual environment for the bug and install the necessary dependencies.
Fact Extraction
Then you can extract facts from the bug data using the extract command as follows:
maniple extract --dataset 314-dataset --output-dir data/BGP314
This script will extract facts from the bug data and save them in the specified output directory.
You can find all extracted facts under the experiment-initialization-resources/bug-data directory.
Generate Bitvector-Specific Prompts and Responses
First, you need to generate bitvectors for the facts. The 128 bitvectors used in our paper can be generated with the following command.
python3 -m maniple.strata_based.fact_bitvector_generator
You can customize your bitvectors; they should be placed under the experiment-initialization-resources/strata-bitvectors directory. You can refer to the example bitvector format used for our paper.
To reproduce our experiment prompts and responses, please use the commands below and replace the value with your own key.
export OPENAI_API_KEY=
setx OPENAI_API_KEY
python3 -m maniple.strata_based.prompt_generator --database BGP314 --partition 10 --start_index 1 --trial 15
Again, you can build your own customized prompts with customized bitvectors using our extracted facts; the command above is only for reproducing our prompts and responses.
This script will generate prompts and responses for all 314 bugs in the dataset by enumerating all possible bitvectors according to the current strata design specified in maniple/strata_based/fact_strata_table.json. By specifying --trial 15, the script generates 15 responses for each prompt, and by specifying --partition 10 it starts 10 threads to speed up the process.
Testing Generated Patches
Please use the following command:
maniple validate --output-dir data/BGP314
This script will validate the generated patches for the specified bug and save the results in the specified output directory. The test comes from the developer's fix commit.
Contributing
Contributions to this project are welcome! Please submit a PR if you find any bugs or have any suggestions.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
EACL Hackashop Keyword Challenge Datasets
In this repository you can find ids of articles used for the keyword extraction challenge at
EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to generate the train-test split used in the paper:
Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
Train and test splits are provided for Latvian, Estonian, Russian and Croatian.
The articles with the corresponding ID-s can be extracted from the following datasets:
- Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408
- Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409
- Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410
The dataset_ids folder is organized in the following way:
- latvian – contains latvian_train.json (a JSON file with ids of the train articles used to replicate the data in Koloski et al. (2021)) and latvian_test.json (a JSON file with ids of the test articles).
- estonian – contains estonian_train.json (a JSON file with ids of the train articles used to replicate the data in Koloski et al. (2021)) and estonian_test.json (a JSON file with ids of the test articles).
- russian – contains russian_train.json (a JSON file with ids of the train articles used to replicate the train data in Koloski et al. (2021)) and russian_test.json (a JSON file with ids of the test articles).
- croatian - containing croatian_id_train.tsv file with sites and ids (note that just ids are not unique across dataset, therefore site information also needs to be included to obtain a unique article identifier) of articles in the train set, and the croatian_id_test.tsv file with sites and ids of articles in the test set.
In addition, scripts are provided for extracting the articles (see the parse folder containing the scripts parse.py and build_croatian_dataset.py; the scripts require the pandas and bs4 Python libraries):
parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:
Instructions:
ESTONIAN-RUSSIAN
1) Retrieve the data ee_articles_2015_2019.zip
2) Create a folder 'data' and subfolder 'ee'
3) Unzip them in the 'data/ee' folder
To extract train/test Estonian articles:
run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script
To extract train/test Russian articles:
run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script
LATVIAN:
1) Retrieve the latvian data
2) Unzip it in 'data/lv' folder
3) To extract train/test Latvian articles:
run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script
build_croatian_dataset.py is used for extraction of Croatian train and test datasets:
Instructions:
CROATIAN:
1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')
2) put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).
For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains inventory data for a pharmacy e-commerce website in JSON format, designed for easy integration into MongoDB databases, making it ideal for MERN stack projects. It includes 10 fields:
This dataset is useful for developing pharmacy-related web applications, inventory management systems, or online medical stores using the MERN stack.
Do not use for production-level purposes; use for project development only. Feel free to contribute if you find any mistakes or have suggestions.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This JSON file contains a collection of conversational AI intents designed to motivate and interact with users. The intents cover various topics, including greetings, weather inquiries, hobbies, music, movies, farewells, informal and formal questions, math operations and formulas, prime numbers, geometry concepts, math puzzles, and even a Shakespearean poem.
The additional intents related to consolidating people and motivating them have been included to provide users with uplifting and encouraging responses. These intents aim to offer support during challenging times, foster teamwork, and provide words of motivation and inspiration to users seeking guidance and encouragement.
The JSON structure is organized into individual intent objects, each containing a tag to identify the intent, a set of patterns representing user inputs, and corresponding responses provided by the AI model. This dataset can be used to train a conversational AI system to engage in positive interactions with users and offer motivational messages.
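As a minimal, hypothetical illustration of the layout described above (the actual tags and wording in the file will differ):
import json

# Hypothetical intent object in the tag/patterns/responses layout described above.
intent = {
    "tag": "motivation",
    "patterns": ["I feel stuck", "Can you motivate me?"],
    "responses": ["Every expert was once a beginner. Keep going!"],
}
print(json.dumps(intent, indent=2))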
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes scripts, text files, and JSON files used to generate all analysis and results from the paper. A README.md file is included with details on using the scripts, though all of the data the scripts generate should already be cached as JSON or txt files, so none of the scripts actually need to be run.
It also includes a spreadsheet containing the paper survey results and manual judgements.
The scripts are also on GitHub: https://github.com/psybers/msr21-timestudy