100+ datasets found
  1. example-space-to-dataset-json

    • huggingface.co
    + more versions
    Cite
    Lucain Pouget, example-space-to-dataset-json [Dataset]. https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json
    Explore at:
    Authors
    Lucain Pouget
    Description
  2. Store Sales json

    • kaggle.com
    zip
    Updated Jun 1, 2024
    Cite
    Indi Ella (2024). Store Sales json [Dataset]. https://www.kaggle.com/datasets/indiella/store-sales-json
    Explore at:
    Available download formats: zip (5397153 bytes)
    Dataset updated
    Jun 1, 2024
    Authors
    Indi Ella
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The dataset contains more than 50,000 records of sales and order data related to an online store.

  3. Data from: Food Recipes dataset

    • kaggle.com
    zip
    Updated Aug 31, 2021
    Cite
    samsatp (2021). Food Recipes dataset [Dataset]. https://www.kaggle.com/datasets/sathianpong/foodrecipe
    Explore at:
    Available download formats: zip (181170342 bytes)
    Dataset updated
    Aug 31, 2021
    Authors
    samsatp
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by samsatp

    Released under CC0: Public Domain

    Contents

  4. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
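
    A minimal loading sketch is shown below. The per-example field names ("db_id", "question", "query") follow the Spider format documented on the GitHub page linked above and should be treated as assumptions rather than guarantees for this particular file.

    ```python
    import json

    # Load the Spider-Realistic evaluation set released here.
    with open("spider-realistic.json", encoding="utf-8") as f:
        examples = json.load(f)

    print(len(examples), "examples")   # the file listing above states 508
    for ex in examples[:3]:
        # "db_id", "question" and "query" are assumed per the Spider format.
        print(ex.get("db_id"), "|", ex.get("question"))
        print("   SQL:", ex.get("query"))
    ```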

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  5. json-training

    • huggingface.co
    Updated Jul 26, 2024
    Cite
    Christian Zhou-Zheng (2024). json-training [Dataset]. https://huggingface.co/datasets/ChristianAzinn/json-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jul 26, 2024
    Authors
    Christian Zhou-Zheng
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    JSON Training Data

    The advent of tiny yet powerful models like Qwen2 0.5B and SmolLM 135M/360M that can feasibly be run on just about anything means there is a necessity for data to finetune these models on downstream tasks. In particular, these models fail spectacularly at structured data generation in JSON, and even frameworks that are meant to force JSON output get stuck repeating infinitely because the models just don't have a clue what they're being asked to do. I found there… See the full description on the dataset page: https://huggingface.co/datasets/ChristianAzinn/json-training.

  6. TerriaJS Map Catalog in JSON Format

    • ihp-wins.unesco.org
    • data.dev-wins.com
    json
    Updated Dec 2, 2025
    Cite
    Pablo Rojas (2025). TerriaJS Map Catalog in JSON Format [Dataset]. https://ihp-wins.unesco.org/dataset/terriajs-map-catalog-in-json-format
    Explore at:
    Available download formats: json
    Dataset updated
    Dec 2, 2025
    Dataset provided by
    Pablo Rojas
    Description

    This dataset contains a collection of JSON files used to configure map catalogs in TerriaJS, an interactive geospatial data visualization platform. The files include detailed configurations for services such as WMS, WFS, and other geospatial resources, enabling the integration and visualization of diverse datasets in a user-friendly web interface. This resource is ideal for developers, researchers, and professionals who wish to customize or implement interactive map catalogs in their own applications using TerriaJS.
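
    A hedged sketch for listing the services declared in one of these catalog files: the file name is hypothetical, and the keys ("catalog", "name", "type", "url") reflect the usual TerriaJS init-file layout rather than a documented schema for this dataset.

    ```python
    import json

    with open("catalog.json", encoding="utf-8") as f:   # hypothetical file name
        init = json.load(f)

    # Print each catalog member's service type, display name and endpoint.
    for member in init.get("catalog", []):
        print(member.get("type"), "|", member.get("name"), "|", member.get("url"))
    ```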

  7. Valencia Portcalls 07/2018 to 12/2018

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jan 24, 2020
    Cite
    Eneko Olivares Gorriti (2020). Valencia Portcalls 07/2018 to 12/2018 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3257156
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    UPV
    Authors
    Eneko Olivares Gorriti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    JSON file with a list of portcalls from vessels arriving at Valencia ports. The data was used inside the INTER-IoT project as an example dataset provided by a legacy IoT platform.

    *NOTE: Due to a bug in the system it is not possible to upload files with a .json extension. The file is uploaded with a ._json extension instead. Please rename it after download.
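
    A small helper for the renaming step described in the note above (the downloaded file name is hypothetical):

    ```python
    from pathlib import Path

    # Restore the .json extension after download, as the note advises.
    downloaded = Path("valencia_portcalls._json")   # hypothetical file name
    downloaded.rename(downloaded.with_name(downloaded.name.replace("._json", ".json")))
    ```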

  8. Cricket_International_Matches_PlayerStat_2000-2024

    • kaggle.com
    zip
    Updated Oct 29, 2025
    Cite
    Sunny Yadav (2025). Cricket_International_Matches_PlayerStat_2000-2024 [Dataset]. https://www.kaggle.com/datasets/sunnyyadav754/cricket-international-matches-playerstat-2000-2025
    Explore at:
    Available download formats: zip (60577471 bytes)
    Dataset updated
    Oct 29, 2025
    Authors
    Sunny Yadav
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains detailed cricket performance data structured in a nested JSON format. It includes information for multiple players categorized by gender, opponent teams, formats, and individual batter match-ups.

    Each player node (e.g., "J Srinath") provides detailed insight into:
    - vs_team: Player performance against each international team (e.g., England, Australia, Pakistan).
    - vs_batter: Individual head-to-head statistics against other players.
    - formats: Match format-specific data (e.g., ODI, T20I, Test).

    📘 Dataset Structure Example:

        {
          "root": {
            "Gender": {
              "male": {
                "J Srinath": {
                  "vs_team": {
                    "England": { ...14 items... },
                    "Australia": { ...14 items... },
                    "Pakistan": { ...14 items... }
                  },
                  "formats": {
                    "ODI": { ... },
                    "Test": { ... }
                  },
                  "vs_batter": {
                    "Ricky Ponting": { ... }, ...
                  },
                  "vs_bowler": {
                    "L Malinga": { ... }, ...
                  }
                }
              }
            }
          }
        }

    ✨ Key Features:
    - 🏏 Granular Player Data: Bowler vs Batter and Bowler vs Team breakdowns.
    - 📊 Format-wise Stats: Data across multiple cricket formats (Test, ODI, T20).
    - 🌍 International Coverage: Includes performance against top teams like England, Australia, Pakistan, and Sri Lanka.
    - ⚙️ Machine Learning Ready: Structured JSON ideal for training predictive models (e.g., run prediction, wicket prediction).
    - 📝 Customizable: Easily expandable to include more players, genders, and match details.

    💾 File Description:
    - File name: Cricket_Stat_All_Player_International_Matches
    - File type: .json
    - File size: ~1.26 GB (depending on dataset size)

    Columns / Structure:

    | Field | Description |
    |--------|--------------|
    | root | Root key containing gender-wise data |
    | Gender | "male" or "female" category |
    | Player Name | Name of the player (e.g., "J Srinath") |
    | vs_team | Dictionary of performance stats vs teams |
    | formats | Data grouped by cricket formats |
    | vs_batter | Player's record vs individual batters |
    | vs_bowler | Player's record vs individual bowlers |
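
    A short traversal sketch for the nested layout shown above. The key names come from the structure example and the field table, and the file name from the File Description above; treat them as illustrative rather than a guaranteed schema.

    ```python
    import json

    # Note: the file is ~1.26 GB, so json.load() needs a correspondingly large amount of RAM.
    with open("Cricket_Stat_All_Player_International_Matches.json", encoding="utf-8") as f:
        stats = json.load(f)

    # Walk root -> Gender -> player and list the opponents recorded under vs_team.
    for gender, players in stats["root"]["Gender"].items():
        for player, record in players.items():
            teams = list(record.get("vs_team", {}))
            print(gender, player, "teams faced:", teams[:5])
    ```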

    🏷️ Suggested Tags: cricket, sports-analytics, json-dataset, machine-learning, player-performance, india, odi, t20, bowler-vs-batter

    📜 License: CC BY 4.0 (Attribution License)

    🚀 Use Cases:
    - Player performance prediction using ML.
    - Team strategy analysis and matchup visualization.
    - Cricket data visualization dashboards.
    - Fantasy team data preparation.

  9. Data from: Slovenian Dataset for Vision-Language Model Instruction-Tuning...

    • live.european-language-grid.eu
    binary format
    Updated Sep 17, 2025
    Cite
    (2025). Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23887
    Explore at:
    Available download formats: binary format
    Dataset updated
    Sep 17, 2025
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.

    1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.

    2. wiki_14_march_2024_latest.json This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.

    3. rtv.json This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    4. siol.json This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    5. 24ur.json This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    The combined dataset includes a total of 1,128,228 examples, categorized as follows:

    21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.

    349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.

    81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.

    66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.

    78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.

    139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.

    100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.

    100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.

    100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.

    Accessing the Corresponding Images

    News portal Images The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.

    Wiki Images The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe

    Llava_v1_5_mix665k Images To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
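
    A hedged sketch for fetching the news-portal images via the 'image' key mentioned above. The output folder name is hypothetical, each file is assumed to be a JSON list of examples, and extension handling is deliberately simplified.

    ```python
    import json
    from pathlib import Path
    from urllib.request import urlretrieve

    out_dir = Path("rtv_images")        # hypothetical output folder
    out_dir.mkdir(exist_ok=True)

    with open("rtv.json", encoding="utf-8") as f:
        examples = json.load(f)         # assumed: a list of example records

    for i, ex in enumerate(examples):
        url = ex.get("image")           # URL of the source image, per the description above
        if url:
            urlretrieve(url, out_dir / f"{i:06d}.jpg")
    ```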

  10. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jan 6, 2023
    Cite
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
    Explore at:
    Dataset updated
    Jan 6, 2023
    Authors
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them are Probe Requests (PR), which are sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame part of PRs consists of variable-length fields, called Information Elements (IE), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.

    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:
    - MAC address
    - Supported data rates
    - Extended supported rates
    - HT capabilities
    - Extended capabilities
    - Data under extended tag and vendor specific tag
    - Interworking
    - VHT capabilities
    - RSSI
    - SSID
    - Timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

    Data preprocessing

    The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:

    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1 ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2 ...}
            ...}
    }

    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_DATA.

    When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    { 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

    This data structure makes it possible to store only 'TOA' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the data for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
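
    A hedged helper for working with the per-MAC records described above (for example from Single_PR_capture_example.json). The record keys ('MAC', 'SSIDs', 'PROBE_REQs', 'TIME', 'RSSI') follow the structures shown above; the top-level layout of the capture files is not restated here, so inspect it before applying the helper.

    ```python
    import json

    def summarise_mac(record):
        """Count probe requests and average RSSI for one detected MAC address."""
        n_pr = sum(len(pr["TIME"]) for pr in record["PROBE_REQs"])
        rssi = [r for pr in record["PROBE_REQs"] for r in pr["RSSI"]]
        avg = sum(rssi) / len(rssi) if rssi else None
        print(record["MAC"], "SSIDs:", record["SSIDs"], "PRs:", n_pr, "avg RSSI:", avg)

    with open("Single_PR_capture_example.json", encoding="utf-8") as f:
        example = json.load(f)
    print(type(example).__name__)   # inspect the container before applying summarise_mac
    ```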

    Folder structure

    For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, each containing the samples from that device.

    The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 at 00:00 local time until 24 September 2022 at 00:00 local time.

    Files map to locations as follows:
    - 1.json -> location 1
    - 2.json -> location 2
    - 3.json -> location 3
    - 4.json -> location 4

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongle) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correctness of installation and data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:
    - location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell'Amenano)
    - location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    - location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    - location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can no longer respond. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day which is shown as missing data of up to a few minutes.

     Location 1 - Piazza del Duomo - Chierici
    

    The gateway device (RPi) is located on the second floor balcony and is hardwired to the Ethernet port. This device appears to function stably throughout the data collection period. Its location is constant and undisturbed; the dataset seems to have complete coverage.

     Location 2 - Via Etnea - Piazza del Duomo
    

    The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.

     Location 3 - Via Etnea - Piazza Università
    

    Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present. This device appears to have been collecting data throughout the whole dataset period.

     Location 4 - Piazza Università
    

    This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device had lost power several times during the deployment. The internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of Resiloc project with the help of City of Catania and project partners.

  11. Wikimedia Structured Dataset Navigator (JSONL)

    • kaggle.com
    zip
    Updated Apr 23, 2025
    Cite
    Mehranism (2025). Wikimedia Structured Dataset Navigator (JSONL) [Dataset]. https://www.kaggle.com/datasets/mehranism/wikimedia-structured-dataset-navigator-jsonl
    Explore at:
    Available download formats: zip (266196504 bytes)
    Dataset updated
    Apr 23, 2025
    Authors
    Mehranism
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📚 Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.

    🔍 What's Inside: This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.

    Each line in the JSONL file is a JSON object with the following fields:
    - file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
    - file_index: the numeric row index of the file
    - name: the Wikipedia article title or identifier
    - url: a link to the full article on Wikipedia
    - description: a short description or abstract of the article (when available)

    🛠 Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.

    ⚡️ Benefits:
    - Lightweight (~MBs vs. GBs)
    - Easy to load and search
    - Great for indexing, previewing, and subsetting the Wikimedia dataset
    - Saves time, bandwidth, and compute resources

    📎 Example Usage (Python):

    ```python
    import kagglehub
    import json
    import pandas as pd
    import numpy as np
    import os
    from tqdm import tqdm
    from datetime import datetime
    import re

    def read_jsonl(file_path, max_records=None):
        data = []
        with open(file_path, 'r', encoding='utf-8') as f:
            for i, line in enumerate(tqdm(f)):
                if max_records and i >= max_records:
                    break
                data.append(json.loads(line))
        return data

    file_path = kagglehub.dataset_download(
        "mehranism/wikimedia-structured-dataset-navigator-jsonl",
        path="wiki_structured_dataset_navigator.jsonl")
    data = read_jsonl(file_path)
    print(f"Successfully loaded {len(data)} records")

    df = pd.DataFrame(data)
    print(f"Dataset shape: {df.shape}")
    print("Columns in the dataset:")
    for col in df.columns:
        print(f"- {col}")
    ```

    This dataset is perfect for developers working on:
    - Retrieval-Augmented Generation (RAG)
    - Large Language Model (LLM) fine-tuning
    - Search and filtering pipelines
    - Academic research on structured Wikipedia content

    💡 Tip:
    Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.

    📃 Format:
    - File: `wiki_structured_dataset_navigator.jsonl`
    - Format: JSON Lines (1 object per line)
    - Encoding: UTF-8

    ---

    ### **Tags**

    wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning
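
    Since the index loads cleanly into a DataFrame, a keyword lookup over the fields listed under "What's Inside" is a natural follow-up to the Example Usage block above; the keyword below is purely illustrative.

    ```python
    # Find which source files mention a topic, then fetch only those files.
    keyword = "quantum computing"
    mask = (df["name"].fillna("").str.contains(keyword, case=False)
            | df["description"].fillna("").str.contains(keyword, case=False))
    hits = df[mask]
    print(hits[["name", "file_name", "url"]].head())
    print("Files to fetch:", sorted(hits["file_name"].unique()))
    ```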

    Licensing

    CC0: Public Domain Dedication
    

    (Recommended for open indexing tools with no sensitive data.)

  12. LoGiPT-data

    • huggingface.co
    Updated Jul 2, 2024
    Cite
    Jamie Jiazhan Feng (2024). LoGiPT-data [Dataset]. https://huggingface.co/datasets/jzfeng/LoGiPT-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Jul 2, 2024
    Authors
    Jamie Jiazhan Feng
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Details

    These are the training data for LoGiPT from NAACL'24 paper: "Language Models can be Deductive Solvers".

    LoGiPT-data-ProofWriter.json: Instruction-tuning data for LoGiPT constructed from ProofWriter. LoGiPT-data-PrOntoQA.json: Instruction-tuning data for LoGiPT constructed from PrOntoQA.

    All training examples are organised in JSON format and Vicuna style.
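
    A minimal, hedged loading sketch: it only assumes the two files are standard JSON, since the exact record fields of the Vicuna-style conversations are not restated here.

    ```python
    import json

    for name in ("LoGiPT-data-ProofWriter.json", "LoGiPT-data-PrOntoQA.json"):
        with open(name, encoding="utf-8") as f:
            data = json.load(f)
        records = data if isinstance(data, list) else list(data.values())
        print(f"{name}: {len(records)} examples")
        # Peek at one record to see the Vicuna-style conversation fields.
        if records:
            print(json.dumps(records[0], ensure_ascii=False, indent=2)[:400])
    ```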

      If you find this data helpful, please cite our NAACL'24 paper: (or Arxiv version:… See the full description on the dataset page: https://huggingface.co/datasets/jzfeng/LoGiPT-data.
    
  13. Main Data and Code

    • figshare.com
    zip
    Updated Oct 5, 2025
    Cite
    Momo (2025). Main Data and Code [Dataset]. http://doi.org/10.6084/m9.figshare.29929412.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 5, 2025
    Dataset provided by
    figshare
    Authors
    Momo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Important Notice: Ethical Use Only
    This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.

    Prohibited Use
    Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.

    Disclaimer
    The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.

    Project Structure
    The project is organized into three main directories, each corresponding to a major section of the paper's experiments:

        main_data_and_code/
        ├── rumor_generation/
        ├── rumor_detection/
        └── rumor_debunking/

    How to Get Started
    Prerequisites: To successfully run the code and reproduce the results, you will need to:
    - Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
    - For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.
    Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.

    Detailed Directory Breakdown

    1. rumor_generation/
    This directory contains all the code and data related to the rumor generation experiments.
    - rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
    - rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
    - rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
    - token_distribution.py: Script to analyze token distribution in the generated text.
    - label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
    - extract_reasons.py: Script to extract reasons for rumor generation and rejection.
    - visualization.py: Utility script for generating figures.
    - LDA.py: Code for performing LDA topic modeling on the generated data.
    - rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
    - generation_reasons_extracted.json: The extracted reasons for generated rumors.
    - rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.

    2. rumor_detection/
    This directory contains the code and data used for the rumor detection experiments.
    - nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
    - nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
    - nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
    - reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
    - reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
    - reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
    - traditional_model.py: Code for the traditional models used as baselines.
    - preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
    - preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
    - generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
    - select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
    - twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
    - twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
    - fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
    - twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
    - twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
    - fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
    - visualization.py: Utility script for generating figures.

    3. rumor_debunking/
    This directory contains all the code and data for the rumor debunking experiments.
    - analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
    - calculate_readability.py: Script for calculating the readability score of the debunking texts.
    - plot_readability.py: Utility script for generating figures related to readability.
    - fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
    - debunking_results.json: The dataset containing the debunking results for this experimental section.
    - debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
    - sentiment_analysis/: This directory contains the result file from the sentiment analysis.
      - debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.

    Please contact the repository owner if you encounter any problems or have questions about the code or data.

  14. Data from: Autorickshaw Detection Challenge Dataset

    • india-data.org
    Updated Jan 2, 2025
    Cite
    IIIT Hyderabad, IHUB (2025). Autorickshaw Detection Challenge Dataset [Dataset]. https://india-data.org/googleSEO-list-dataset-search
    Explore at:
    Available download formats: images, geolocation info
    Dataset updated
    Jan 2, 2025
    Dataset authored and provided by
    IIIT Hyderabad, IHUB
    License

    https://india-data.org/terms-conditions

    Area covered
    India
    Description

    This is a sample dataset for the Autorickshaw detection challenge. The full dataset will be released shortly (see the Timeline on the website: http://cvit.iiit.ac.in/autorickshaw_detection). The images folder contains 800 images of autorickshaws. Each image file has a number in its filename. The bbs folder contains the bbs.json file, which holds the ground truth bounding boxes. It is an array of length 800. The bounding box information corresponding to the image i.jpg can be found at the ith location in the bbs array. The bounding box information is again an array; its length is the number of autorickshaws in that image. At each index the four vertices of the bounding box are provided. See the view.py script for an example. The scripts folder contains a file view.py, which opens the image and overlays the bounding boxes (closing the window will show the next image). It serves as an example of how to view as well as load the data format in bbs.json.
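
    Based on the description above, bbs.json can be inspected with a few lines; whether image numbering starts at 0 or 1 is an assumption to verify against view.py.

    ```python
    import json

    # bbs.json: array of length 800, entry i holds the boxes for image i.jpg,
    # each box given as its four vertices.
    with open("bbs/bbs.json", encoding="utf-8") as f:
        bbs = json.load(f)

    i = 0   # assumed starting index; check view.py for the actual convention
    boxes = bbs[i]
    print(f"image {i}.jpg has {len(boxes)} autorickshaw(s)")
    for vertices in boxes:
        print("  box vertices:", vertices)
    ```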

  15. Data from: BreCaHAD: A Dataset for Breast Cancer Histopathological...

    • figshare.com
    png
    Updated Jan 28, 2019
    Cite
    Alper Aksac; Douglas J. Demetrick; Tansel Γ–zyer; Reda Alhajj (2019). BreCaHAD: A Dataset for Breast Cancer Histopathological Annotation and Diagnosis [Dataset]. http://doi.org/10.6084/m9.figshare.7379186.v3
    Explore at:
    Available download formats: png
    Dataset updated
    Jan 28, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alper Aksac; Douglas J. Demetrick; Tansel Γ–zyer; Reda Alhajj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset consists of 1 .xlsx file, 2 .png files, 1 .json file and 1 .zip file:
    - annotation_details.xlsx: The distribution of annotations over the previously mentioned six classes (mitosis, apoptosis, tumor nuclei, non-tumor nuclei, tubule, and non-tubule) is presented in an Excel spreadsheet.
    - original.png: The input image.
    - annotated.png: An example from the dataset. In the annotated image, blue circles indicate the tumor nuclei; pink circles show non-tumor nuclei such as blood cells, stroma nuclei, and lymphocytes; orange and green circles are mitosis and apoptosis, respectively; light blue circles are true lumen for tubules; and yellow circles represent white regions (non-lumen) such as fat, blood vessels, and broken tissue.
    - data.json: The annotations for the BreCaHAD dataset are provided in JSON (JavaScript Object Notation) format. In the given example, the JSON file (ground truth) contains two mitosis annotations and only one tumor nuclei annotation. Here, x and y are the coordinates of the centroid of the annotated object, and the values are between 0 and 1.
    - BreCaHAD.zip: An archive file containing the dataset. Three folders are included: images (original images), groundTruth (json files), and groundTruth_display (groundTruth applied on original images).
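
    Since the centroids are stored as normalized x, y values in [0, 1], converting them to pixel coordinates only needs the image size. A hedged sketch follows: the per-class key layout of data.json is an assumption based on the six classes listed above, and the image size below is hypothetical.

    ```python
    import json

    IMG_W, IMG_H = 1360, 1024   # hypothetical; read the size from the actual image

    with open("data.json", encoding="utf-8") as f:
        ground_truth = json.load(f)

    # Assumed layout: one list of {"x": ..., "y": ...} annotations per class name.
    for cls, annotations in ground_truth.items():
        for ann in annotations:
            px, py = ann["x"] * IMG_W, ann["y"] * IMG_H
            print(f"{cls}: centroid at ({px:.1f}, {py:.1f}) px")
    ```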

  16. Data from: 3DHD CityScenes: High-Definition Maps in High-Density Point...

    • data.europa.eu
    • data.niaid.nih.gov
    • +1more
    unknown
    Updated Sep 15, 2022
    Cite
    Zenodo (2022). 3DHD CityScenes: High-Definition Maps in High-Density Point Clouds [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7085090?locale=de
    Explore at:
    Available download formats: unknown (147082)
    Dataset updated
    Sep 15, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items.

    Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here. Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:
    - Python tools to read, generate, and visualize the dataset,
    - the 3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.
    The DevKit is available here: https://github.com/volkswagen/3DHD_devkit.

    The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany. When using our dataset, you are welcome to cite:

        @INPROCEEDINGS{9921866,
          author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
          booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
          title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
          year={2022},
          pages={627-634}}

    Acknowledgements

    We thank the following interns for their exceptional contributions to our work.
    - Benjamin Sertolli: Major contributions to our DevKit during his master thesis
    - Niels Maier: Measurement campaign for data collection and data preparation
    The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

    The Dataset

    After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

    1. Dataset

    This directory contains the training, validation, and test set definitions (train.json, val.json, test.json) used in our publications. The respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map. During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet. To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

        import json
        json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
        with open(json_path) as jf:
            data = json.load(jf)
        print(data)

    2. HD_Map

    Map items are stored as lists of items in JSON format. In particular, we provide: traffic signs, traffic lights, pole-like objects, construction site locations, construction site obstacles (point-like such as cones, and line-like such as fences), line-shaped markings (solid, dashed, etc.), polygon-shaped markings (arrows, stop lines, symbols, etc.), lanes (ordinary and temporary), and relations between elements (only for construction sites, e.g., sign-to-lane association).

    3. HD_Map_MetaData

    Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation for each tile as a polygon on the map. You can view the respective tile definition using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON. Files with the endings .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as a "shape file" (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.

    4. HD_PointCloud_Tiles

    The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays: first all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
    - x-coordinates: 4 byte integer
    - y-coordinates: 4 byte integer
    - z-coordinates: 4 byte integer
    - intensity of reflected beams: 2 byte unsigned integer
    - ground classification flag: 1 byte unsigned integer
    After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

        import numpy as np
        import pptk
        file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
        pc_dict = {}
        key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
        type_list = ['<i4', '<i4', '<i4', '<u2', 'u1']
        with open(file_path, "rb") as fid:
            num_points = np.fromfile(fid, dtype='<i4', count=1)[0]
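
    The binary layout described above (point count, then one contiguous array per attribute) is enough to sketch the rest of the read. This is a hedged continuation rather than the official DevKit code, and the unnormalization factors are not restated here.

    ```python
    import numpy as np

    file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
    key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
    type_list = ['<i4', '<i4', '<i4', '<u2', 'u1']

    pc_dict = {}
    with open(file_path, "rb") as fid:
        # First 4 bytes: number of points in this tile.
        num_points = int(np.fromfile(fid, dtype='<i4', count=1)[0])
        # Then one array per attribute, stored back-to-back in the documented order.
        for key, dtype in zip(key_list, type_list):
            pc_dict[key] = np.fromfile(fid, dtype=dtype, count=num_points)

    # The integer values still have to be unnormalized (scale factors are provided
    # by the DevKit and are not repeated in this description).
    print(num_points, {k: v[:3] for k, v in pc_dict.items()})
    ```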

  17. MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, xz
    Updated Dec 20, 2024
    Cite
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut (2024). MSP430FR5969 Basic Block Worst Case Energy Consumption (WCEC) and Worst Case Execution Time (WCET) dataset [Dataset]. http://doi.org/10.5281/zenodo.11066623
    Explore at:
    Available download formats: csv, bin, xz
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Hugo Reymond; Hector Chabot; Abderaouf Nassim Amalou; Isabelle Puaut
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains around 30 000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1MHz. Basic blocks were executed in a worst case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].

    Folder structure

    This dataset is composed of the following files:

    • basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format,
    • data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block.

    We first detail how the basic_blocks.tar.xz archive is organized, and then present the CSV/XLSX spreadsheet format.

    Basic Blocks

    We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks.tar.xz archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON file (137.json) looks as follows:

    {
      "extr_pfctl_altq.c_pfctl_altq_init": [
         # Basic block 1
        [
          # Instruction 1 of BB1
          [
            "MOV.W",
            "#queue_map",
            "R13"
          ],
          # Instruction 2 of BB1
          [
            "MOV.B",
            "#0",
            "R14"
          ],
          # Instruction 3 of BB1
          [
            "CALL",
            "#hcreate_r",
            null
          ]
        ],
        # Basic block 2
        [
          ....
        ]
      ]
    }

    The JSON contains a dict with only one key pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, each represented as an array of instructions, which are themselves represented as an array [OPCODE, OPERAND1, OPERAND2].

    Then, each basic block can be identified uniquely using two ids: its file id and its offset in the file. In our example, basic block 1 can be identified by the JSON file id (137) and its offset in the file (0). Its ID is 137_0. This ID is used to make the mapping between a basic block and its energy consumption/execution time, with the data.csv/data.xlsx spreadsheet.

    Energy Consumption and Execution Time

    Energy consumption and execution time data are stored in the data.csv file. Here is the extract of the csv file corresponding to the basic block 137_0. The spreadsheet format is described below.

    bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
    137_0;3;8.77;7.08;7.04;8.21;2.92;40;50

    Spreadsheet format:

    • bb_id: the unique identifier of a basic block (cf. Basic Blocks)
    • nb_inst: the number of instructions in the basic block
    • max_energy: the maximum energy consumption (in nJ) measured during the experiment
    • max_time: the maximum execution time (in us) measured during the experiment
    • avg_time: the average execution time (in us) measured during the experiment
    • avg_energy: the average energy consumption (in nJ) measured during the experiment
    • energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
    • nb_samples: how many times the basic block's energy consumption/execution time was measured
    • unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)

    Basic Block Unrolling

    To measure the energy consumption and execution time on the MSP430, we need to be able to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.

    Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.
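
    A small, hedged sketch tying the two parts together: look up block 137_0 in data.csv (semicolon-separated, as in the extract above) and in its basic-block JSON file. It assumes the shipped JSON files are plain JSON, without the explanatory comments shown in the 137.json excerpt.

    ```python
    import json
    import pandas as pd

    # Load the measurements; the CSV uses ';' as separator, as shown above.
    df = pd.read_csv("data.csv", sep=";")
    row = df[df["bb_id"] == "137_0"].iloc[0]
    print("max energy (nJ):", row["max_energy"], "| avg time (us):", row["avg_time"])

    # Load the corresponding basic-block file and fetch block offset 0.
    with open("137.json", encoding="utf-8") as f:
        bbs_by_source = json.load(f)          # one key: the original C source name
    (source_name, blocks), = bbs_by_source.items()
    print(source_name, "- instructions in block 0:", len(blocks[0]))
    ```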

    Dataset description

    Features

    The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn

    Code

    The trained machine learning model, the tests, and the local explanation code can be found (and regenerated) here: WORTEX Machine learning code

    Acknowledgment

    This work received French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency under the "Investing for the Future" program, reference ANR-10-LABX-07-01.

    Licensing

    Copyright 2024 Hector Chabot
    Copyright 2024 Abderaouf Nassim Amalou
    Copyright 2024 Hugo Reymond
    Copyright 2024 Isabelle Puaut

    Licensed under the Creative Commons Attribution 4.0 International License

    References

    [1] Reymond, H., Amalou, A. N., Puaut, I. "WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML," in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024).

    [2] Da Silva, Anderson Faustino, et al. "Anghabench: A suite with one million compilable C benchmarks for code-size reduction." 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.

  18. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions; the JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
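
    As a quick illustration of the API-token CSV described above, here is a minimal, hypothetical sketch (the file name is made up; the real download logic lives in the get_dataset_metadata_of_all_installations.py script linked above) that reads the "hostname"/"apikey" columns and queries each installation's Search API with the standard X-Dataverse-key header:

      import csv
      import requests

      # Read the two-column CSV of installation URLs and account API tokens.
      with open("installation_api_tokens.csv", newline="") as f:  # hypothetical file name
          rows = list(csv.DictReader(f))  # expects columns "hostname" and "apikey"

      for row in rows:
          # Dataverse APIs accept the account token in the X-Dataverse-key header.
          resp = requests.get(
              f"{row['hostname']}/api/search",
              params={"q": "*", "type": "dataset"},
              headers={"X-Dataverse-key": row["apikey"]},
          )
          print(row["hostname"], resp.status_code)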

  19. Archi Dataset: a dataset of software engineering projects

    • zenodo.org
    zip
    Updated Nov 28, 2024
    Cite
    Giovanni Nicola Della Pelle; Eugenio Facciolo; Francesco Leotta; Francesco Leotta; Massimo Mecella; Massimo Mecella; Flavia Monti; Flavia Monti; Giovanni Nicola Della Pelle; Eugenio Facciolo (2024). Archi Dataset: a dataset of software engineering projects [Dataset]. http://doi.org/10.5281/zenodo.14238664
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Giovanni Nicola Della Pelle; Eugenio Facciolo; Francesco Leotta; Francesco Leotta; Massimo Mecella; Massimo Mecella; Flavia Monti; Flavia Monti; Giovanni Nicola Della Pelle; Eugenio Facciolo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset composed of 8 software engineering projects collected (and refined) from the "Software Engineering – Laboratory of Advanced Programming" course at Sapienza University of Rome for master's students in Engineering in Computer Science.

    The dataset comprises folders for each project. Each folder contains:

    • input.txt file with the description of the system and the list of user stories,
    • DataMetrics.json with the performance characteristics of the project. An example of the DataMetrics format is provided below.
      [
       {
        "set_id": 1,
        "set_name": "auth client",
        "user_stories": [1, 2, 3, 4],
        "links": [2, 3, 5],
        "db": "true"
       },
       ...
      ]

      The JSON is an array of dictionaries, one per set. Each set is characterized by a set_id and a set_name and groups user stories (identified by their numerical identifiers in user_stories). Each dictionary also contains links and db keys, indicating respectively the other sets that share a related context and whether a backend service is needed to store or retrieve data. From an architectural point of view, user stories that belong to linked sets can be fulfilled by the same container, and sets of user stories that need to store or retrieve data must be fulfilled by a container hosting a database microservice. A small parsing sketch is given after this list.

    • Student_Doc.md with the student's complete documentation (i.e., the solution) of the project,
    • source code of the developed project.
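
    As referenced above, here is a small parsing sketch for DataMetrics.json (a minimal illustration based on the example format shown above; the file path and printing are arbitrary):

      import json

      # Read a project's DataMetrics.json and report, for each set of user stories,
      # the linked sets and whether a database-backed service is needed.
      with open("DataMetrics.json") as f:
          sets = json.load(f)

      for s in sets:
          needs_db = s["db"] == "true"  # stored as the string "true" in the example
          print(f"set {s['set_id']} ({s['set_name']}): "
                f"stories {s['user_stories']}, linked sets {s['links']}, needs DB: {needs_db}")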

    The dataset is continuously updated: each academic year it will be enriched with new projects.

    If you want to contribute, please contact us.

  20. Geospatial Data Pack for Visualization

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Vega Datasets (2025). Geospatial Data Pack for Visualization [Dataset]. https://www.kaggle.com/datasets/vega-datasets/geospatial-data-pack
    Explore at:
    zip(1422109 bytes)Available download formats
    Dataset updated
    Oct 21, 2025
    Dataset authored and provided by
    Vega Datasets
    Description

    Geospatial Data Pack for Visualization 🗺️

    Learn Geographic Mapping with Altair, Vega-Lite and Vega using Curated Datasets

    Complete geographic and geophysical data collection for mapping and visualization. This consolidation includes 18 complementary datasets used by 31+ Vega, Vega-Lite, and Altair examples 📊. Perfect for learning geographic visualization techniques including projections, choropleths, point maps, vector fields, and interactive displays.

    Source data lives on GitHub and can also be accessed via CDN. The vega-datasets project serves as a common repository for example datasets used across these visualization libraries and related projects.

    Why Use This Dataset? 🤔

    • Comprehensive Geospatial Types: Explore a variety of core geospatial data models:
      • Vector Data: Includes points (like airports.csv), lines (like londonTubeLines.json), and polygons (like us-10m.json).
      • Raster-like Data: Work with gridded datasets (like windvectors.csv, annual-precip.json).
    • Diverse Formats: Gain experience with standard and efficient geospatial formats like GeoJSON (see Table 1, 2, 4), compressed TopoJSON (see Table 1), and plain CSV/TSV (see Table 2, 3, 4) for point data and attribute tables ready for joining.
    • Multi-Scale Coverage: Practice visualization across different geographic scales, from global and national (Table 1, 4) down to the city level (Table 1).
    • Rich Thematic Mapping: Includes multiple datasets (Table 3) specifically designed for joining attributes to geographic boundaries (like states or counties from Table 1) to create insightful choropleth maps.
    • Ready-to-Use & Example-Driven: Cleaned datasets tightly integrated with 31+ official examples (see Appendix) from Altair, Vega-Lite, and Vega, allowing you to immediately practice techniques like projections, point maps, network maps, and interactive displays.
    • Python Friendly: Works seamlessly with essential Python libraries like Altair (which can directly read TopoJSON/GeoJSON), Pandas, and GeoPandas, fitting perfectly into the Kaggle notebook environment.

    Dataset Inventory 🗂️

    This pack includes 18 datasets covering base maps, reference points, statistical data for choropleths, and geophysical data.

    1. BASE MAP BOUNDARIES (Topological Data)

    Dataset | File | Size | Format | License | Description | Key Fields / Join Info
    US Map (1:10m) | us-10m.json | 627 KB | TopoJSON | CC-BY-4.0 | US state and county boundaries. Contains states and counties objects. Ideal for choropleths. | id (FIPS code) property on geometries
    World Map (1:110m) | world-110m.json | 117 KB | TopoJSON | CC-BY-4.0 | World country boundaries. Contains countries object. Suitable for world-scale viz. | id property on geometries
    London Boroughs | londonBoroughs.json | 14 KB | TopoJSON | CC-BY-4.0 | London borough boundaries. | properties.BOROUGHN (name)
    London Centroids | londonCentroids.json | 2 KB | GeoJSON | CC-BY-4.0 | Center points for London boroughs. | properties.id, properties.name
    London Tube Lines | londonTubeLines.json | 78 KB | GeoJSON | CC-BY-4.0 | London Underground network lines. | properties.name, properties.color
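
    As a quick illustration of how the Table 1 base maps are used in practice, here is a minimal Altair sketch (assuming the vega_datasets Python package, which serves the same files) that draws the states object of us-10m.json with an Albers USA projection:

      import altair as alt
      from vega_datasets import data

      # TopoJSON feed for the "states" object of us-10m.json (Table 1).
      states = alt.topo_feature(data.us_10m.url, "states")

      chart = (
          alt.Chart(states)
          .mark_geoshape(fill="lightgray", stroke="white")
          .project(type="albersUsa")
          .properties(width=600, height=400)
      )
      chart.save("us_states.html")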

    2. GEOGRAPHIC REFERENCE POINTS (Point Data) 📍

    Dataset | File | Size | Format | License | Description | Key Fields / Join Info
    US Airports | airports.csv | 205 KB | CSV | Public Domain | US airports with codes and coordinates. | iata, state, `l...