According to our latest research, the JSON Editor market size reached USD 345 million in 2024, exhibiting robust momentum driven by the escalating need for efficient data management and seamless integration across digital platforms. The market is projected to expand at a CAGR of 13.2% from 2025 to 2033, reaching a forecasted USD 1,012 million by 2033. This growth is fueled by the proliferation of cloud-based applications, the increasing adoption of API-driven architectures, and the rising complexity of data structures in modern enterprises. Demand for advanced JSON editors is further propelled by the need for real-time data manipulation, enhanced collaboration, and automation across diverse industries.
One of the primary growth factors for the JSON Editor market is the exponential rise in the volume and complexity of data generated by organizations across various sectors. As businesses embrace digital transformation, the need to manage, process, and exchange structured and semi-structured data has become paramount. JSON (JavaScript Object Notation) has emerged as the preferred data-interchange format due to its lightweight, human-readable, and machine-friendly nature. Consequently, organizations are increasingly investing in sophisticated JSON editors to streamline data handling, ensure data integrity, and enable seamless integration with web and mobile applications. The surge in API-centric development further amplifies the demand for JSON editors, as APIs predominantly leverage JSON for data exchange, making robust editing and validation tools indispensable for developers and data engineers.
Another significant factor contributing to the market's expansion is the growing adoption of cloud computing and SaaS-based solutions. Cloud-based JSON editors are gaining traction due to their scalability, accessibility, and collaborative capabilities, allowing distributed teams to work on data structures in real-time. Enterprises are increasingly migrating their data management workflows to cloud platforms to reduce infrastructure costs, enhance operational efficiency, and facilitate remote work. This shift is driving vendors to innovate and offer cloud-native JSON editor solutions with advanced features such as version control, automated validation, and integration with popular DevOps tools. As organizations prioritize agility and flexibility, the demand for cloud-based JSON editors is expected to witness sustained growth throughout the forecast period.
The emergence of advanced technologies such as artificial intelligence (AI), machine learning (ML), and the Internet of Things (IoT) is also shaping the growth trajectory of the JSON Editor market. These technologies generate vast amounts of structured and semi-structured data, necessitating efficient tools for data parsing, validation, and transformation. JSON editors equipped with AI-driven functionalities, such as smart schema detection, auto-completion, and error correction, are becoming increasingly popular among developers and data scientists. Furthermore, the integration of JSON editors with CI/CD pipelines and automated testing frameworks enhances productivity and accelerates application development cycles. The continuous evolution of digital ecosystems and the need for interoperability across heterogeneous systems are expected to further bolster the adoption of advanced JSON editor solutions.
From a regional perspective, North America remains the dominant market for JSON editors, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology companies, a mature IT infrastructure, and a strong emphasis on innovation and digitalization. Europe follows closely, driven by stringent data compliance regulations and widespread adoption of cloud technologies. The Asia Pacific region is poised for the highest growth rate during the forecast period, fueled by rapid digital transformation, expanding IT investments, and the proliferation of startups and SMEs. Latin America and the Middle East & Africa are also witnessing steady adoption, supported by increasing awareness of data management best practices and the growing penetration of digital technologies.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ShareGPT-Formatted Dataset for Structured JSON Output
Dataset Description
This dataset is formatted in the ShareGPT style and is designed for fine-tuning large language models (LLMs) to generate structured JSON outputs. It consists of multi-turn conversations where each response follows a predefined JSON schema, making it ideal for training models that need to produce structured data in natural language scenarios.
Usage
This dataset can be used to train LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json.
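For orientation, a single training record in the common ShareGPT layout might look like the sketch below; the field names and the toy schema are illustrative assumptions, so consult the dataset page for the exact format.
import json

# A minimal, hypothetical ShareGPT-style record for structured-output fine-tuning
# (field names follow the common ShareGPT convention; verify against the dataset card).
record = {
    "conversations": [
        {"from": "human", "value": "Extract the person's name and age from: 'Alice is 30 years old.' Reply as JSON."},
        {"from": "gpt", "value": json.dumps({"name": "Alice", "age": 30})},
    ]
}
print(json.dumps(record, indent=2))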
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Important Notice: Ethical Use Only
This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.
Prohibited Use
Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.
Disclaimer
The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.
Project Structure
The project is organized into three main directories, each corresponding to a major section of the paper's experiments:
main_data_and_code/
├── rumor_generation/
├── rumor_detection/
└── rumor_debunking/
How to Get Started
Prerequisites
To successfully run the code and reproduce the results, you will need to:
- Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
- For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.
Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.
Detailed Directory Breakdown
1. rumor_generation/
This directory contains all the code and data related to the rumor generation experiments.
- rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
- rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
- rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
- token_distribution.py: Script to analyze token distribution in the generated text.
- label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
- extract_reasons.py: Script to extract reasons for rumor generation and rejection.
- visualization.py: Utility script for generating figures.
- LDA.py: Code for performing LDA topic modeling on the generated data.
- rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
- generation_reasons_extracted.json: The extracted reasons for generated rumors.
- rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.
2. rumor_detection/
This directory contains the code and data used for the rumor detection experiments.
- nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
- nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
- nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
- reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
- reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
- reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
- traditional_model.py: Code for the traditional models used as baselines.
- preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
- preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
- generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
- select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
- twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
- twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
- fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
- twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
- twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
- fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
- visualization.py: Utility script for generating figures.
3. rumor_debunking/
This directory contains all the code and data for the rumor debunking experiments.
- analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
- calculate_readability.py: Script for calculating the readability score of the debunking texts.
- plot_readability.py: Utility script for generating figures related to readability.
- fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
- debunking_results.json: The dataset containing the debunking results for this experimental section.
- debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
- sentiment_analysis/: Directory containing the result file from the sentiment analysis.
- debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.
Please contact the repository owner if you encounter any problems or have questions about the code or data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
Ensuring the quality of software systems through testing is a critical aspect of software development. However, the maintenance of test cases presents significant challenges, both in terms of complexity and cost. The constant need for updates to align with evolving systems under test can result in broken test cases, leading to a deterioration in test suite quality and disruptions in the software development process. To address these challenges, we introduce TaRGet (Test Repair GEneraTor), an approach that leverages pre-trained code language models for automated test case repair. TaRGet treats test repair as a language translation task and employs a two-step process to fine-tune a language model using essential context data that characterizes test breakages.
Publication
This repository is a supplement to our paper (accepted at IEEE TSE), available at DOI 10.1109/TSE.2025.3541166. For detailed definitions, experiments, and results, please refer to the paper. If you find this repository useful, kindly cite our work:
@ARTICLE{saboor2025target, author={Saboor Yaraghi, Ahmadreza and Holden, Darren and Kahani, Nafiseh and Briand, Lionel}, journal={IEEE Transactions on Software Engineering}, title={Automated Test Case Repair Using Language Models}, year={2025}, pages={1-31}, doi={10.1109/TSE.2025.3541166}}
TaRBench
TaRBench is a comprehensive benchmark that we developed to evaluate the effectiveness of TaRGet in automated test case repair. The benchmark encompasses 45,373 broken test repairs across 59 open-source projects, providing a diverse and extensive dataset for assessing the capabilities of TaRGet. In the following sections, we present detailed information about TaRBench (available in the "TaRBench.zip" file).
The Data
Each project's data is stored in the "projects" directory, following the format "projects/GitHub_Username/GitHub_Repo_Name" (e.g., "projects/apache/druid"). TaRBench structures the data for each project into four JSON files:
- dataset.json: The main file containing all the test repair instances of the project.
- codeMining/call_graphs.json: Call graphs of the test cases found in dataset.json across different commits.
- codeMining/sut_method_changes.json: Code changes in methods of the System Under Test (SUT) commits.
- codeMining/sut_class_changes.json: Code changes in classes of the SUT commits.
Furthermore, the splits.csv file specifies the data split (train, valid, or test) for each test repair sample, identified by its unique ID. In the following, we provide details on the attributes within each file.
dataset.json
The dataset.json file comprises an array of JSON objects, each representing a test repair instance. These instances have the following attributes:
- ID (String): A unique identifier for the repair instance.
- name (String): The qualified name of the test case method, including the package, class, and method names.
- bCommit, aCommit (String): The commit hash (version) of the project before (bCommit) and after (aCommit) the repair.
- aCommitTime (Integer): Timestamp of aCommit.
- bPath, aPath (String): The relative path of the test case source file in the project before (bPath) and after (aPath) the repair.
- bSource, aSource (Object): The source code of the test case method before (bSource) and after (aSource) the repair. Each includes the start line of the test method in its source file as well as the test method source code.
- hunk (Object): The Git hunk of the test repair, representing the changes made to the test case method for repair. It includes the lines changed before (sourceChanges) and after (targetChanges) the repair, along with the corresponding code elements.
- astActions (Array): The edit actions applied to the abstract syntax tree (AST) of the test code for repair.
- trivial (Array): Indicates whether the repair is trivial (class or method renaming) or not. A "null" value signifies a non-trivial repair, while an array indicates a trivial repair and includes the associated trivial repair types.
call_graphs.json
The call_graphs.json file contains call graphs, generated through static code analysis, for test cases identified in the corresponding test repair commits. The file has a nested structure: at the first layer, each key represents a commit hash; at the second layer, each key is a qualified test case name; and at the third layer, it includes the call graph for the respective commit and test case. Attributes of each call graph are as follows:
- root (Object): Represents the root node of the call graph, always the test case.
- nodes (Array): An array containing metadata for all nodes in the call graph.
- graph (Object): Represents the edges of the graph, where each key corresponds to a node ID, and the associated value is an array of node IDs representing the nodes to which the key has outgoing edges.
sut_method_changes.json and sut_class_changes.json
Both files share a common structure, detailing code changes in the SUT for test repair commits. The sut_method_changes.json file exclusively captures changes within methods, while sut_class_changes.json covers all changes within classes. Both files present an array of objects, each object denoting changes in a commit. These objects have the following attributes:
- bCommit, aCommit (String): Same as dataset.json.
- changes (Array): An array of objects, each representing a change in a method or a class. These objects have the following attributes:
  - bPath, aPath (String): Same as dataset.json.
  - name (String): The qualified name of the changed method or class.
  - hunks (Array): An array of all hunks, representing changes in the method or class. The hunk data structure here follows that of the hunk field in dataset.json.
  - is_test_source (Boolean): Indicates whether the method or class is part of the project's test source code.
TaRGet Results
In addition to TaRBench, we provide the results of our approach, TaRGet. Below are the details of the results and associated files.
TaRGet_Results.zip
This ZIP archive contains two main folders:
Best_on_TaRBench
This folder includes the best results obtained using the best configuration of TaRGet on TaRBench during our experiments, i.e., IO2 with the CodeT5+ model. Detailed information about the experimental setup and the optimal configuration can be found in our paper. The folder contains the following files:
- train.json, valid.json, test.json: These files represent the dataset splits for TaRBench. Each file includes the input and expected output of the model, formatted using the best configuration, along with the original TaRBench fields (described in detail above).
- test_predictions.json: This file contains the predictions generated by the optimal TaRGet configuration using beam search. Each entry includes:
  - ID: The TaRBench ID.
  - target: The expected output.
  - preds: TaRGet's generations.
- test_verdicts.json: This file presents the results of executing the repairs generated by TaRGet (listed in test_predictions.json). Each entry includes:
  - ID: The TaRBench ID.
  - rank: The rank of the executed generation within the preds field of the predictions file.
  - verdict: The execution result.
  - success: A Boolean indicating whether the result was successful.
  - exec_time: The execution time in seconds.
  Note: A zero execution time indicates that the generated repair was identical to the correct repair and was skipped to save time.
VS_CEPROT
This folder contains the results of the comparative analysis between TaRGet and CEPROT, a baseline test repair method used in our study. For further details about the comparison, refer to our paper. The folder includes:
- test.json: The test dataset in the same format as the dataset splits mentioned above. This dataset was used as the evaluation benchmark in CEPROT's study. Note that the CID field in this file, as well as in the subsequent files, refers to CEPROT's ID. This ID consists of two parts: the focal_db ID and the test_db ID, separated by a dash. The ID field, on the other hand, corresponds to a TaRBench-like ID that was created during the collection of CEPROT's data.
- TaRGet_predictions.json: The repairs generated by TaRGet on the test dataset.
- CEPROT_predictions.json: The repairs generated by CEPROT on the same test dataset.
TaRGet_Best_FineTuned_Model.zip
This ZIP archive contains the fine-tuned model of the optimal TaRGet configuration, which was used to generate the results in the "Best_on_TaRBench" folder. The archive includes:
- checkpoint-best folder: Contains the model files, including the fine-tuned model weights (pytorch_model.bin).
- tokenizer folder: Includes the tokenizer files used for the model, along with all additional special tokens.
Conclusion
TaRBench serves as a valuable resource for researchers, developers, and practitioners interested in automated test case repair. Through an evaluation on a diverse dataset, it provides insights into the capabilities and limitations of TaRGet, paving the way for advancements in the field of automated software testing and maintenance.
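As a quick orientation, here is a minimal Python sketch (not part of the artifact) that loads one project's dataset.json and separates trivial from non-trivial repairs, assuming the directory layout described above:
import json

# Assumes the file layout described above: projects/GitHub_Username/GitHub_Repo_Name/dataset.json
with open("projects/apache/druid/dataset.json", encoding="utf-8") as f:
    repairs = json.load(f)  # array of test repair instances

non_trivial = [r for r in repairs if r["trivial"] is None]
print(f"{len(repairs)} repairs total, {len(non_trivial)} non-trivial")

# Each instance also carries the broken and repaired commits and sources.
first = repairs[0]
print(first["name"], first["bCommit"], "->", first["aCommit"])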
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Dhrumil Patel
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
"English Short Stories" is a collection of short stories designed for English language learners at the A1 level, indicating beginners. The stories are written in simple language with basic vocabulary and sentence structures to aid comprehension and language acquisition.
The "English Short Stories JSON file for mobile applications" refers to a JSON (JavaScript Object Notation) file containing the same short stories formatted in a way that is suitable for integration into mobile applications. JSON is a lightweight data interchange format commonly used for storing and transmitting data between a server and a web application, including mobile apps. In this context, the JSON file likely contains the text of the short stories along with any associated metadata or formatting information needed for display within a mobile app.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 2/2 of the ActiveHuman dataset! Part 1 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
Folder configuration
The dataset consists of 3 folders:
- JSON Data: Contains all the generated JSON files.
- RGB Images: Contains the generated RGB images.
- Semantic Segmentation Images: Contains the generated semantic segmentation images.
Essential Terminology
- Annotation: Recorded data describing a single capture.
- Capture: One completed rendering process of a Unity sensor, which stores the rendered result to data files (e.g., PNG, JPG).
- Ego: Object or person on which a collection of sensors is attached (e.g., if a drone has a camera attached to it, the drone would be the ego and the camera would be the sensor).
- Ego coordinate system: Coordinates with respect to the ego.
- Global coordinate system: Coordinates with respect to the global origin in Unity.
- Sensor: Device that captures the dataset (in this instance the sensor is a camera).
- Sensor coordinate system: Coordinates with respect to the sensor.
- Sequence: Time-ordered series of captures. This is very useful for video capture, where the time-order relationship of two captures is vital.
- UUID: Universally Unique Identifier. A unique hexadecimal identifier that can represent an individual instance of a capture, ego, sensor, annotation, labeled object or keypoint, or keypoint template.
Dataset Data
The dataset includes 4 types of JSON annotation files:
annotation_definitions.json: Contains annotation definitions for all of the active Labelers of the simulation stored in an array. Each entry consists of a collection of key-value pairs which describe a particular type of annotation and contain information about that specific annotation describing how its data should be mapped back to labels or objects in the scene. Each entry contains the following key-value pairs:
- id: Integer identifier of the annotation's definition.
- name: Annotation name (e.g., keypoints, bounding box, bounding box 3D, semantic segmentation).
- description: Description of the annotation's specifications.
- format: Format of the file containing the annotation specifications (e.g., json, PNG).
- spec: Format-specific specifications for the annotation values generated by each Labeler.
Most Labelers generate different annotation specifications in the spec key-value pair:
BoundingBox2DLabeler/BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
KeypointLabeler:
- template_id: Keypoint template UUID.
- template_name: Name of the keypoint template.
- key_points: Array containing all the joints defined by the keypoint template. This array includes the key-value pairs:
  - label: Joint label.
  - index: Joint index.
  - color: RGBA values of the keypoint.
  - color_code: Hex color code of the keypoint.
- skeleton: Array containing all the skeleton connections defined by the keypoint template. Each skeleton connection defines a connection between two different joints. This array includes the key-value pairs:
  - label1: Label of the first joint.
  - label2: Label of the second joint.
  - joint1: Index of the first joint.
  - joint2: Index of the second joint.
  - color: RGBA values of the connection.
  - color_code: Hex color code of the connection.
SemanticSegmentationLabeler:
- label_name: String identifier of a label.
- pixel_value: RGBA values of the label.
- color_code: Hex color code of the label.
captures_xyz.json: Each of these files contains an array of ground truth annotations generated by each active Labeler for each capture separately, as well as extra metadata that describes the state of each active sensor present in the scene. Each array entry contains the following key-value pairs:
- id: UUID of the capture.
- sequence_id: UUID of the sequence.
- step: Index of the capture within a sequence.
- timestamp: Timestamp (in ms) since the beginning of a sequence.
- sensor: Properties of the sensor. This entry contains a collection with the following key-value pairs:
  - sensor_id: Sensor UUID.
  - ego_id: Ego UUID.
  - modality: Modality of the sensor (e.g., camera, radar).
  - translation: 3D vector that describes the sensor's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable that describes the sensor's orientation with respect to the ego coordinate system.
  - camera_intrinsic: Matrix containing (if it exists) the camera's intrinsic calibration.
  - projection: Projection type used by the camera (e.g., orthographic, perspective).
- ego: Attributes of the ego. This entry contains a collection with the following key-value pairs:
  - ego_id: Ego UUID.
  - translation: 3D vector that describes the ego's position (in meters) with respect to the global coordinate system.
  - rotation: Quaternion variable containing the ego's orientation.
  - velocity: 3D vector containing the ego's velocity (in meters per second).
  - acceleration: 3D vector containing the ego's acceleration (in meters per second squared).
- format: Format of the file captured by the sensor (e.g., PNG, JPG).
- annotations: Key-value pair collections, one for each active Labeler. These key-value pairs are as follows:
  - id: Annotation UUID.
  - annotation_definition: Integer identifier of the annotation's definition.
  - filename: Name of the file generated by the Labeler. This entry is only present for Labelers that generate an image.
  - values: List of key-value pairs containing annotation data for the current Labeler.
Each Labeler generates different annotation specifications in the values key-value pair:
BoundingBox2DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
- x: Position of the 2D bounding box on the X axis.
- y: Position of the 2D bounding box on the Y axis.
- width: Width of the 2D bounding box.
- height: Height of the 2D bounding box.
BoundingBox3DLabeler:
- label_id: Integer identifier of a label.
- label_name: String identifier of a label.
- instance_id: UUID of one instance of an object. Each object with the same label that is visible on the same capture has different instance_id values.
- translation: 3D vector containing the location of the center of the 3D bounding box with respect to the sensor coordinate system (in meters).
- size: 3D vector containing the size of the 3D bounding box (in meters).
- rotation: Quaternion variable containing the orientation of the 3D bounding box.
- velocity: 3D vector containing the velocity of the 3D bounding box (in meters per second).
- acceleration: 3D vector containing the acceleration of the 3D bounding box (in meters per second squared).
KeypointLabeler:
- label_id: Integer identifier of a label.
- instance_id: UUID of one instance of a joint. Keypoints with the same joint label that are visible on the same capture have different instance_id values.
- template_id: UUID of the keypoint template.
- pose: Pose label for that particular capture.
- keypoints: Array containing the properties of each keypoint. Each keypoint that exists in the keypoint template file is one element of the array. Each entry's contents are as follows:
  - index: Index of the keypoint in the keypoint template file.
  - x: Pixel coordinate of the keypoint on the X axis.
  - y: Pixel coordinate of the keypoint on the Y axis.
  - state: State of the keypoint.
The SemanticSegmentationLabeler does not contain a values list.
egos.json: Contains collections of key-value pairs for each ego. These include:
- id: UUID of the ego.
- description: Description of the ego.
sensors.json: Contains collections of key-value pairs for all sensors of the simulation. These include:
- id: UUID of the sensor.
- ego_id: UUID of the ego on which the sensor is attached.
- modality: Modality of the sensor (e.g., camera, radar, sonar).
- description: Description of the sensor (e.g., camera, radar).
Image names
The RGB and semantic segmentation images share the same naming convention; however, the semantic segmentation images additionally have the string Semantic_ at the beginning of their filenames. Each RGB image is named "e_h_l_d_r.jpg", where:
- e denotes the id of the environment.
- h denotes the id of the person.
- l denotes the id of the lighting condition.
- d denotes the camera distance at which the image was captured.
- r denotes the camera angle at which the image was captured.
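For convenience, a small Python helper (not part of the dataset) that decodes this naming convention; the example filename is made up:
from pathlib import Path

def parse_image_name(path):
    # Semantic segmentation images share the convention with a "Semantic_" prefix.
    stem = Path(path).stem
    if stem.startswith("Semantic_"):
        stem = stem[len("Semantic_"):]
    e, h, l, d, r = stem.split("_")
    return {"environment": e, "human": h, "lighting": l, "distance": d, "angle": r}

print(parse_image_name("3_12_2_2_140.jpg"))  # hypothetical example filename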
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is Part 1/2 of the ActiveHuman dataset! Part 2 can be found here. Dataset Description ActiveHuman was generated using Unity's Perception package. It consists of 175428 RGB images and their semantic segmentation counterparts taken at different environments, lighting conditions, camera distances and angles. In total, the dataset contains images for 8 environments, 33 humans, 4 lighting conditions, 7 camera distances (1m-4m) and 36 camera angles (0-360 at 10-degree intervals). The dataset does not include images at every single combination of available camera distances and angles, since for some values the camera would collide with another object or go outside the confines of an environment. As a result, some combinations of camera distances and angles do not exist in the dataset. Alongside each image, 2D Bounding Box, 3D Bounding Box and Keypoint ground truth annotations are also generated via the use of Labelers and are stored as a JSON-based dataset. These Labelers are scripts that are responsible for capturing ground truth annotations for each captured image or frame. Keypoint annotations follow the COCO format defined by the COCO keypoint annotation template offered in the perception package.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains course information and metadata scraped from two popular MOOC platforms: EdX and Coursera (Udemy will be added soon, although the script to scrape it is available on my GitHub: https://github.com/karar-git/Quamus). It includes various course attributes such as descriptions, program details, and other relevant data, making it useful for building recommendation systems, educational tools, or performing analysis on online learning content.
The dataset includes the following files:
combine_preprocessing.py: A Python script that preprocesses and combines the raw data into a unified format. You must run this script to generate the processed dataset.
combined_dataset.json: The final, preprocessed, and combined dataset of course metadata from both EdX and Coursera.
edx_courses.json: Raw course data scraped from the EdX platform.
edx_degree_programs.json: Data on degree programs available on EdX.
edx_executive_education_paidstuff.json: Paid course and executive education data from EdX.
edx_programs.json: Data on various programs available on EdX.
processed_coursera_data.json: Processed course data scraped from Coursera.
How to Use:
To generate the final combined_dataset.json, you need to run the combine_preprocessing.py script. This script will process the raw data files, then clean, normalize, and combine them into one unified dataset (a quick inspection sketch follows the disclaimer below).
Disclaimer:
Data Source: The data in this dataset was scraped from publicly available information on EdX and Coursera. The scraping was done solely for educational and research purposes. The scraping process adheres to the terms of use of the respective platforms.
Usage: This dataset is intended for non-commercial use only. Please use responsibly and adhere to the terms and conditions of the platforms from which the data was collected.
No Warranty: This data is provided "as-is" without any warranty. Users are responsible for ensuring that their use of the data complies with the relevant platform policies.
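As mentioned above, a minimal sketch for inspecting the generated file, assuming combined_dataset.json is a JSON array of course records:
import json

# Run combine_preprocessing.py first to generate combined_dataset.json.
with open("combined_dataset.json", encoding="utf-8") as f:
    courses = json.load(f)

print(f"{len(courses)} courses loaded")
print("Fields of the first record:", sorted(courses[0].keys()))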
Models:
In addition to the data, there are machine learning models related to this dataset available on GitHub. These models can help with content-based course recommendations and are built using the data you will find here. Specifically, the models include:
A cosine similarity-based model for course recommendations.
A two-tower model for personalized recommendations, trained using pseudo-labels.
A transformer-based course predictor (work in progress) designed to suggest the next course based on a user's learning progression.
Note:
The dataset currently contains data only from EdX and Coursera. The script to scrape Udemy data can be found in the related GitHub repository.
You will need to access the GitHub repository to view and experiment with the models and the Udemy scraping script. The models require the data files from this dataset to work properly.
With this dataset available on Kaggle, you can explore these educational resources and leverage them for building custom educational tools or analyzing online course trends.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a single JSON file containing label-value form fields generated using GPT-2. This data was used to train Dessurt (https://arxiv.org/abs/2203.16618). Details of the generation process can be found in Dessurt's Supplementary Materials, and the script used to generate it is gpt_forms.py in https://github.com/herobd/dessurt
The data has groups of label-value pairs each with a "title" or topic (or null). Each label-value pair group was generated in a single GPT-2 generation and thus the pairs "belong to the same form." The json structure is a list of tuples, where each tuple has the title or null as the first element and the list of label-value pairs of the group as the second element. Each label-value pair is another tuple with the first element being the label and the second being the value or a list of values.
For example:
[ ["title",[ ["first label", "first value"], ["second label", ["a label", "another label"] ] ] ], [null, [ ["again label", "again value"] ] ] ]
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Study information
Design ideation study (N = 24) using eye tracking technology. Participants solved a total of twelve design problems while receiving inspirational stimuli on a monitor. Their task was to generate as many solutions to each problem as possible and to explain each solution briefly by thinking aloud. The study allows for further insight into how inspirational stimuli improve idea fluency during design ideation. This dataset features processed data from the experiment. Eye tracking data includes gaze data, fixation data, blink data, and pupillometry data for all participants. The study is based on the following research paper and follows the same experimental setup: Goucher-Lambert, K., Moss, J., & Cagan, J. (2019). A neuroimaging investigation of design ideation with and without inspirational stimuli—understanding the meaning of near and far stimuli. Design Studies, 60, 1-38.
Dataset
Most files in the dataset are saved as CSV files or other human-readable file formats. Large files are saved in Hierarchical Data Format (HDF5/H5) to allow for smaller file sizes and higher compression. All data is described thoroughly in 00_ReadMe.txt. The following processed data is included in the dataset:
- Concatenated annotations file of the experimental flow for all participants (CSV).
- All eye tracking raw data in concatenated files, annotated with only the participant ID (CSV/HDF5).
- Annotated eye tracking data for ideation routines only, a subset of the files above (CSV/HDF5).
- Audio transcriptions from the Google Cloud Speech-to-Text API of each recording, with annotations (CSV).
- Raw API responses for each transcription; these files include the time offset for each word in a recording (JSON).
- Data for questionnaire feedback and ideas generated during the experiment (CSV).
- Data for the post-experiment survey, including demographic information (TSV).
Python code used for the open-source experimental setup and dataset construction is hosted on GitHub. The repository also includes code showing how the dataset has been further processed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact provides a comprehensive dataset and analysis tools for evaluating the quality of Non-Functional Requirements (NFRs) generated by Large Language Models (LLMs) based solely on Functional Requirements (FRs). The dataset includes human evaluations of NFR quality according to ISO/IEC 25010:2023 standard quality attributes.
This research artifact contains:
The study evaluates two key aspects:
# Linux/macOS
curl -fsSL https://deno.land/install.sh | sh
# Windows (PowerShell)
irm https://deno.land/install.ps1 | iex
deno --version
Input: Professional evaluation data collected via Turso database service
data/dump.sql
Expected Output: Understanding of data collection methodology and raw response structure
Purpose: Convert SQL dump to SQLite database for analysis
cd analysis
deno run --allow-read --allow-write generateData.ts
Process:
- The generateData.ts script reads data/dump.sql
- It creates the data/dump.db SQLite database
Expected Output: data/dump.db file created (approximately 2-5 MB)
Purpose: Combine human evaluations with LLM assignments and generate final dataset
The generateData.ts script performs:
Expected Output: analysis/Human_Evaluation_Data.tsv (final dataset used in paper)
Files: LLMOutputs/*.json (8 files, one per LLM)
- claude-3-5-haiku.json
- claude-3-7-sonnet.json
- deepSeek-V3.json
- gemini-1.5-pro.json
- gpt-4o-mini.json
- grok-2.json
- lama-3.3-70B.json
- Qwen2.5-72B.json
Expected Format for each FR:
{
"functionalRequirement": "System shall allow users to log in with username and password",
"identifiedNFRs": [
{
"attribute": "Security",
"requirement": "The system must encrypt passwords using AES-256 encryption",
"justification": "Login functionality requires secure credential handling"
}
]
}
Analysis: Each JSON contains 34 FR entries with generated NFRs following ISO/IEC 25010:2023 categories
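As a rough illustration, the following sketch tallies generated NFRs per quality attribute across the eight output files; it is written in Python rather than the artifact's Deno/TypeScript tooling and assumes each LLMOutputs/*.json file is a JSON array of FR objects in the format shown above.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("LLMOutputs").glob("*.json"):
    # Assumption: each file is a JSON array of FR objects as shown above.
    for fr in json.loads(path.read_text(encoding="utf-8")):
        for nfr in fr["identifiedNFRs"]:
            counts[nfr["attribute"]] += 1

for attribute, n in counts.most_common():
    print(f"{attribute}: {n}")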
File: data/AdvancedPrompt.txt
Content: Complete prompt used for NFR generation, including:
- analysis/Human_Evaluation_Data.tsv: Main evaluation dataset (2,240 evaluated NFRs)
- data/FR_34.tsv: 34 functional requirements subset used for evaluation
- data/dump.sql: Raw SQL dump from the Turso database service containing professional evaluations
- LLMOutputs/[model].json: Structured NFR generations for each of the 8 LLMs
- data/AdvancedPrompt.txt: Complete prompt template with ISO/IEC 25010:2023 integration
- analysis/generateData.ts: Data processing script for database creation and CSV generation
- LICENSE.md: Distribution rights and usage terms
- analysis/visualization.ipynb: Jupyter notebook for data visualization and statistical analysis
Human_Evaluation_Data.tsv grouped by LLM model
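A quick way to sanity-check the main Human_Evaluation_Data.tsv file, assuming only that the TSV has a header row:
import csv

with open("analysis/Human_Evaluation_Data.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

print(len(rows), "evaluated NFR rows")
print("Columns:", list(rows[0].keys()))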
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Maniple
This repository contains code, scripts and data necessary to reproduce the paper "The Fact Selection Problem in LLM-Based Program Repair".
Installation
Before installing the project, ensure you have the following prerequisites installed on your system:
Follow these steps to install and set up the project on your local machine:
cd maniple
python3 -m pip install .
Structure of Directories
The project is organized into several directories, each serving a specific purpose:
data/                                  # Training and testing datasets
    BGP32.zip/                         # Sampled 32 bugs from the BugsInPy dataset
        black/                         # The bug project folder
            10/                        # The bug ID folder
                100000001/             # The bitvector used for prompting
                    prompt.md          # The prompt used for this bitvector
                    response_1.md      # The response from the model
                    response_1.json    # The response in JSON format
                    response_1.patch   # The response in patch format
                    result_1.json      # Testing result
                    ...
    BGP32-without-cot.zip              # GPT responses for the 32 bugs without CoT prompting
    BGP314.zip                         # 314 bugs from the BugsInPy dataset
    BGP157Ply1-llama3-70b.zip          # Experiment with the llama3 model on the BGP157Ply1 dataset
    BGP32-permutation.zip              # Permutation experiment on the BGP32 dataset

maniple/                               # Scripts for extracting facts and generating prompts
    strata_based/                      # Scripts for generating prompts
    utils/                             # Utility functions
    metrics/                           # Scripts for calculating metrics for the dataset

patch_correctness_labelling.xlsx       # The labelling of patch correctness
experiment.ipynb                       # Jupyter notebook for training models

experiment-initialization-resources/   # Contains raw facts for each bug
    bug-data/                          # Raw facts for each bug
        ansible/                       # Bug project folder
            5/                         # Bug ID folder
                bug-info.json          # Metadata for the bug
                facts_in_prompt.json   # Facts used in the prompt
                processed_facts.json   # Processed facts
                external_facts.json    # GitHub issues for this bug
                static-dynamic-facts.json  # Static and dynamic facts
                ...
    datasets-list/                     # Subsets from the BugsInPy dataset
    strata-bitvector/                  # Debugging information for bitvectors
Steps to Reproduce the Experiments
Please follow the steps below sequentially to reproduce the experiments on the 314 bugs in BugsInPy with our bitvector-based prompts.
Prepare the Dataset
The CLI scripts under the maniple directory provide useful commands to download and prepare environments for each bug.
To download and prepare environments for each bug, you can use the prep command.
maniple prep --dataset 314-dataset
This script will automatically download all 314 bugs from GitHub, create a virtual environment for the bug and install the necessary dependencies.
Fact Extraction
Then you can extract facts from the bug data using the extract command as follows:
maniple extract --dataset 314-dataset --output-dir data/BGP314
This script will extract facts from the bug data and save them in the specified output directory.
You can find all extracted facts under the experiment-initialization-resources/bug-data directory.
Generate Bitvector-Specific Prompts and Responses
First, you need to generate bitvectors for the facts. The 128 bitvectors used in our paper can be generated with the following command.
python3 -m maniple.strata_based.fact_bitvector_generator
You can customize your bitvectors; they should be placed under the experiment-initialization-resources/strata-bitvectors directory. You can refer to the example bitvector format used for our paper.
To reproduce our experiment prompts and responses, please use the commands below and replace the value with your own key.
export OPENAI_API_KEY=
setx OPENAI_API_KEY
python3 -m maniple.strata_based.prompt_generator --database BGP314 --partition 10 --start_index 1 --trial 15
Again, you can build your own customized prompts with customized bitvectors using our extracted facts; the command above is only for reproducing our prompts and responses.
This script will generate prompts and responses for all 314 bugs in the dataset by enumerating all possible bitvectors according to the current strata design specified in maniple/strata_based/fact_strata_table.json. By specifying --trial 15, the script generates 15 responses for each prompt, and by specifying --partition 10 it starts 10 threads to speed up the process.
Testing Generated Patches
Please use the following command:
maniple validate --output-dir data/BGP314
This script will validate the generated patches for the specified bug and save the results in the specified output directory. The test comes from the developer's fix commit.
Contributing
Contributions to this project are welcome! Please submit a PR if you find any bugs or have any suggestions.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
EACL Hackashop Keyword Challenge Datasets
In this repository you can find ids of articles used for the keyword extraction challenge at
EACL Hackashop on News Media Content Analysis and Automated Report Generation (http://embeddia.eu/hackashop2021/). The article ids can be used to generate the train-test split used in the paper:
Koloski, B., Pollak, S., Škrlj, B., & Martinc, M. (2021). Extending Neural Keyword Extraction with TF-IDF tagset matching. In: Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Kiev, Ukraine, pages 22–29.
Train and test splits are provided for Latvian, Estonian, Russian and Croatian.
The articles with the corresponding ID-s can be extracted from the following datasets:
- Estonian and Russian (use the eearticles2015-2019 dataset): https://www.clarin.si/repository/xmlui/handle/11356/1408
- Latvian: https://www.clarin.si/repository/xmlui/handle/11356/1409
- Croatian: https://www.clarin.si/repository/xmlui/handle/11356/1410
The dataset_ids folder is organized in the following way:
- latvian – contains latvian_train.json (a JSON file with ids of the train articles used to replicate the data in Koloski et al. (2021)) and latvian_test.json (a JSON file with ids of the test articles).
- estonian – contains estonian_train.json (a JSON file with ids of the train articles used to replicate the data in Koloski et al. (2021)) and estonian_test.json (a JSON file with ids of the test articles).
- russian – contains russian_train.json (a JSON file with ids of the train articles used to replicate the train data in Koloski et al. (2021)) and russian_test.json (a JSON file with ids of the test articles).
- croatian - containing croatian_id_train.tsv file with sites and ids (note that just ids are not unique across dataset, therefore site information also needs to be included to obtain a unique article identifier) of articles in the train set, and the croatian_id_test.tsv file with sites and ids of articles in the test set.
In addition, scripts are provided for extracting the articles (see the parse folder containing the scripts parse.py and build_croatian_dataset.py; the scripts require the pandas and bs4 Python libraries):
parse.py is used for extraction of Estonian, Russian and Latvian train and test datasets:
Instructions:
ESTONIAN-RUSSIAN
1) Retrieve the data ee_articles_2015_2019.zip
2) Create a folder 'data' and subfolder 'ee'
3) Unzip them in the 'data/ee' folder
To extract train/test Estonian articles:
run function 'build_dataset(lang="ee", opt="nat")' in the parse.py script
To extract train/test Russian articles:
run function 'build_dataset(lang="ee", opt="rus")' in the parse.py script
LATVIAN:
1) Retrieve the latvian data
2) Unzip it in 'data/lv' folder
3) To extract train/test Latvian articles:
run function 'build_dataset(lang="lv", opt="nat")' in the parse.py script
build_croatian_dataset.py is used for extraction of Croatian train and test datasets:
Instructions:
CROATIAN:
1) Retrieve the Croatian data (file 'STY_24sata_articles_hr_PUB-01.csv')
2) put the script 'build_croatian_dataset.py' in the same folder as the extracted data and run it (e.g., python build_croatian_dataset.py).
For additional questions: {Boshko.Koloski,Matej.Martinc,Senja.Pollak}@ijs.si
Open Data Commons Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset contains inventory data for a pharmacy e-commerce website in JSON format, designed for easy integration into MongoDB databases, making it ideal for MERN stack projects. It includes 10 fields:
This dataset is useful for developing pharmacy-related web applications, inventory management systems, or online medical stores using the MERN stack.
Do not use for production-level purposes; use for project development only. Feel free to contribute if you find any mistakes or have suggestions.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This JSON file contains a collection of conversational AI intents designed to motivate and interact with users. The intents cover various topics, including greetings, weather inquiries, hobbies, music, movies, farewells, informal and formal questions, math operations and formulas, prime numbers, geometry concepts, math puzzles, and even a Shakespearean poem.
The additional intents related to consolidating people and motivating them have been included to provide users with uplifting and encouraging responses. These intents aim to offer support during challenging times, foster teamwork, and provide words of motivation and inspiration to users seeking guidance and encouragement.
The JSON structure is organized into individual intent objects, each containing a tag to identify the intent, a set of patterns representing user inputs, and corresponding responses provided by the AI model. This dataset can be used to train a conversational AI system to engage in positive interactions with users and offer motivational messages.
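As a minimal, hypothetical illustration of the layout described above (the actual tags and wording in the file will differ):
import json

# Hypothetical intent object in the tag/patterns/responses layout described above.
intent = {
    "tag": "motivation",
    "patterns": ["I feel stuck", "Can you motivate me?"],
    "responses": ["Every expert was once a beginner. Keep going!"],
}
print(json.dumps(intent, indent=2))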
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes scripts, text files, and JSON files used to generate all analysis and results from the paper. A README.md file is included with details on using the scripts, though all of the data the scripts generate should already be cached as JSON or txt files, so none of the scripts actually need to be run.
It also includes a spreadsheet containing the paper survey results and manual judgements.
The scripts are also on GitHub: https://github.com/psybers/msr21-timestudy