Demo to save data from a Space to a Dataset. Goal is to provide reusable snippets of code.
Documentation: https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads Space: https://huggingface.co/spaces/Wauplin/space_to_dataset_saver/ JSON dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json Image dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-image Image (zipped) dataset:… See the full description on the dataset page: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json.
http://opendatacommons.org/licenses/dbcl/1.0/
The dataset contains more than 50,000 records of sales and order data related to an online store.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by samsatp
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
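As a quick way to inspect the evaluation set, the following sketch loads spider-realistic.json and prints a few examples. The field names ("db_id", "question", "query") follow the Spider JSON format referenced above; adjust them if your copy differs.

```python
import json

# Minimal loading sketch. Field names ("db_id", "question", "query") follow the
# Spider JSON format referenced above; adjust them if your copy differs.
with open("spider-realistic.json", encoding="utf-8") as f:
    examples = json.load(f)

print(len(examples), "examples")  # expected: 508
for ex in examples[:3]:
    print(ex["db_id"], "|", ex["question"], "->", ex["query"])
```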
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers including the original Spider datasets, Finegan-Dollak et al., 2018 and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
JSON Training Data
The advent of tiny yet powerful models like Qwen2 0.5B and SmolLM 135M/360M that can feasibly be run on just about anything means there is a necessity for data to finetune these models on downstream tasks. In particular, these models fail spectacularly at structured data generation in JSON, and even frameworks that are meant to force JSON output get stuck repeating infinitely because the models just don't have a clue what they're being asked to do. I found there… See the full description on the dataset page: https://huggingface.co/datasets/ChristianAzinn/json-training.
This dataset contains a collection of JSON files used to configure map catalogs in TerriaJS, an interactive geospatial data visualization platform. The files include detailed configurations for services such as WMS, WFS, and other geospatial resources, enabling the integration and visualization of diverse datasets in a user-friendly web interface. This resource is ideal for developers, researchers, and professionals who wish to customize or implement interactive map catalogs in their own applications using TerriaJS.
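As an illustration of what such a configuration can look like, here is a minimal, hypothetical catalog snippet with a single WMS item. The field names and the example URL are assumptions for illustration only; the exact schema depends on the TerriaJS version in use, so consult the TerriaJS catalog documentation and the files in this dataset for the authoritative structure.

```json
{
  "catalog": [
    {
      "name": "Example WMS layer",
      "type": "wms",
      "url": "https://example.org/geoserver/ows",
      "layers": "workspace:example_layer"
    }
  ]
}
```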
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
JSON file with a list of portcalls from vessels arriving at Valencia ports. The data was used inside the INTER-IoT project as an example dataset provided by a legacy IoT platform.
*NOTE: Due to a bug in the system, it is not possible to upload files with a .json extension, so the file is uploaded with a ._json extension instead. Please rename it after download.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains detailed cricket performance data structured in a nested JSON format. It includes information for multiple players categorized by gender, opponent teams, formats, and individual batter match-ups.
Each player node (e.g., "J Srinath") provides detailed insight into:
- vs_team: Player performance against each international team (e.g., England, Australia, Pakistan).
- vs_batter: Individual head-to-head statistics against other players.
- formats: Match format-specific data (e.g., ODI, T20I, Test).
Dataset Structure Example:
{
  "root": {
    "Gender": {
      "male": {
        "J Srinath": {
          "vs_team": {
            "England": { ...14 items... },
            "Australia": { ...14 items... },
            "Pakistan": { ...14 items... }
          },
          "formats": {
            "ODI": { ... },
            "Test": { ... }
          },
          "vs_batter": {
            "Ricky Ponting": { ... },
            ...
          },
          "vs_bowler": {
            "L Malinga": { ... },
            ...
          }
        }
      }
    }
  }
}
Key Features:
- Granular Player Data: Bowler vs Batter and Bowler vs Team breakdowns.
- Format-wise Stats: Data across multiple cricket formats (Test, ODI, T20).
- International Coverage: Includes performance against top teams like England, Australia, Pakistan, and Sri Lanka.
- Machine Learning Ready: Structured JSON ideal for training predictive models (e.g., run prediction, wicket prediction).
- Customizable: Easily expandable to include more players, genders, and match details.

File Description:
- File name: Cricket_Stat_All_Player_International_Matches
- File type: .json
- File size: ~1.26 GB (depending on dataset size)

Columns / Structure:

| Field | Description |
|---|---|
| root | Root key containing gender-wise data |
| Gender | "male" or "female" category |
| Player Name | Name of the player (e.g., "J Srinath") |
| vs_team | Dictionary of performance stats vs teams |
| formats | Data grouped by cricket formats |
| vs_batter | Player's record vs individual batters |
| vs_bowler | Player's record vs individual bowlers |

Suggested Tags: cricket, sports-analytics, json-dataset, machine-learning, player-performance, india, odi, t20, bowler-vs-batter

License: CC BY 4.0 (Attribution License)

Use Cases:
- Player performance prediction using ML.
- Team strategy analysis and matchup visualization.
- Cricket data visualization dashboards.
- Fantasy team data preparation.
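A minimal sketch for loading and traversing the nested structure shown above (root -> Gender -> male -> player -> vs_team / formats). Because the file is roughly 1.26 GB, it is assumed here to fit in memory; a streaming parser may be preferable in practice.

```python
import json

# Minimal traversal sketch using the keys shown above. The file is ~1.26 GB, so
# it is assumed to fit in memory; a streaming parser (e.g. ijson) may be
# preferable in practice.
with open("Cricket_Stat_All_Player_International_Matches.json", encoding="utf-8") as f:
    data = json.load(f)

players = data["root"]["Gender"]["male"]
for name, record in list(players.items())[:5]:
    teams = sorted(record.get("vs_team", {}))
    formats = sorted(record.get("formats", {}))
    print(f"{name}: formats={formats}, sample opponents={teams[:3]}")
```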
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.
llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.
wiki_14_march_2024_latest.json This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.
rtv.json This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
siol.json This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
24ur.json This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
The combined dataset includes a total of 1,128,228 examples, categorized as follows:
21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.
Accessing the Corresponding Images
News portal images
The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the JSON file contains an 'image' key with a URL of the corresponding image.
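A minimal download sketch, assuming each news-portal file (here rtv.json) is a JSON list of example dicts whose 'image' key holds the URL, as stated above; the exact top-level layout of the file is an assumption to verify.

```python
import json
import os
from urllib.request import urlretrieve

# Minimal download sketch, assuming rtv.json is a JSON list of example dicts,
# each with an "image" key holding the image URL (as stated above). The exact
# top-level layout of the file is an assumption to verify.
with open("rtv.json", encoding="utf-8") as f:
    examples = json.load(f)

os.makedirs("rtv_images", exist_ok=True)
for i, ex in enumerate(examples[:10]):  # first few examples only
    url = ex["image"]
    target = os.path.join("rtv_images", os.path.basename(url.split("?")[0]))
    urlretrieve(url, target)            # consider rate limiting for full downloads
    print(f"{i}: {url} -> {target}")
```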
Wiki images
The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe
Llava_v1_5_mix665k images
To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them are Probe Requests (PR), which are sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame part of PRs consists of variable-length fields, called Information Elements (IE), which represent the capabilities of a mobile device, such as supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- supported data rates
- extended supported rates
- HT capabilities
- extended capabilities
- data under extended tag and vendor specific tag
- interworking
- VHT capabilities
- RSSI
- SSID
- timestamp when the PR was received
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
PR_IE_data = {
    'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
    'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
        ...
    }
}
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_DATA.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure allows storing only 'TIME' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only data for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
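A minimal sketch for reading one of the per-location files, assuming it holds a list of the per-MAC structures shown above; the exact top-level wrapper (e.g. the scan metadata mentioned above) may differ, so Single_PR_capture_example.json remains the authoritative reference.

```python
import json

# Minimal reading sketch, assuming a location file holds a list of the per-MAC
# structures shown above ({'MAC': ..., 'SSIDs': [...], 'PROBE_REQs': [...]}).
# The exact top-level wrapper (e.g. scan metadata) may differ; see
# Single_PR_capture_example.json for the authoritative layout.
with open("1.json", encoding="utf-8") as f:
    entries = json.load(f)

macs = set()
n_prs = 0
for entry in entries:
    macs.add(entry["MAC"])
    for pr in entry["PROBE_REQs"]:
        n_prs += len(pr["TIME"])  # one timestamp per received probe request
print(f"{len(macs)} unique MAC addresses, {n_prs} probe requests")
```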
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, each containing the samples collected by that device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23rd of September 2022 at 00:00 local time until 24th of September 2022 at 00:00 local time.
Files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongle) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correctness of installation and data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell'Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can no longer respond. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day which is shown as missing data of up to a few minutes.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to function stably throughout the data collection period. Its location is constant and was not disturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device had lost power several times during the deployment. The internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
https://creativecommons.org/publicdomain/zero/1.0/
Overview: This dataset provides a compact and efficient way to explore the massive "Wikipedia Structured Contents" dataset by Wikimedia Foundation, which consists of 38 large JSONL files (each ~2.5GB). Loading these directly in Kaggle or Colab is impractical due to resource constraints. This file index solves that problem.
What's Inside:
This dataset includes a single JSONL file named wiki_structured_dataset_navigator.jsonl that contains metadata for every file in the English portion of the Wikimedia dataset.
Each line in the JSONL file is a JSON object with the following fields:
- file_name: the actual filename in the source dataset (e.g., enwiki_namespace_0_0.jsonl)
- file_index: the numeric row index of the file
- name: the Wikipedia article title or identifier
- url: a link to the full article on Wikipedia
- description: a short description or abstract of the article (when available)
Use Case: Use this dataset to search by keyword, article name, or description to find which specific files from the full Wikimedia dataset contain the topics you're interested in. You can then download only the relevant file(s) instead of the entire dataset.
Benefits:
- Lightweight (~MBs vs. GBs)
- Easy to load and search
- Great for indexing, previewing, and subsetting the Wikimedia dataset
- Saves time, bandwidth, and compute resources
Example Usage (Python):
```python
import kagglehub
import json
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
from datetime import datetime
import re

def read_jsonl(file_path, max_records=None):
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(tqdm(f)):
            if max_records and i >= max_records:
                break
            data.append(json.loads(line))
    return data

file_path = kagglehub.dataset_download(
    "mehranism/wikimedia-structured-dataset-navigator-jsonl",
    path="wiki_structured_dataset_navigator.jsonl",
)
data = read_jsonl(file_path)
print(f"Successfully loaded {len(data)} records")

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
print("Columns in the dataset:")
for col in df.columns:
    print(f"- {col}")
```
This dataset is perfect for developers working on:
- Retrieval-Augmented Generation (RAG)
- Large Language Model (LLM) fine-tuning
- Search and filtering pipelines
- Academic research on structured Wikipedia content
Tip:
Pair this index with the original [Wikipedia Structured Contents dataset](https://www.kaggle.com/datasets/wikimedia-foundation/wikipedia-structured-contents) for full article access.
Format:
- File: `wiki_structured_dataset_navigator.jsonl`
- Format: JSON Lines (1 object per line)
- Encoding: UTF-8
---
### **Tags**
wikipedia, wikimedia, jsonl, structured-data, search-index, metadata, file-catalog, dataset-index, large-language-models, machine-learning
CC0: Public Domain Dedication
(Recommended for open indexing tools with no sensitive data.)
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Details
These are the training data for LoGiPT from NAACL'24 paper: "Language Models can be Deductive Solvers".
LoGiPT-data-ProofWriter.json: Instruction-tuning data for LoGiPT constructed from ProofWriter.
LoGiPT-data-PrOntoQA.json: Instruction-tuning data for LoGiPT constructed from PrOntoQA.
All training examples are organised in JSON format and Vicuna style.
If you find this data helpful, please cite our NAACL'24 paper (or the Arxiv version)… See the full description on the dataset page: https://huggingface.co/datasets/jzfeng/LoGiPT-data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Important Notice: Ethical Use Only
This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.

Prohibited Use
Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.

Disclaimer
The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.

Project Structure
The project is organized into three main directories, each corresponding to a major section of the paper's experiments:

main_data_and_code/
├── rumor_generation/
├── rumor_detection/
└── rumor_debunking/

How to Get Started

Prerequisites
To successfully run the code and reproduce the results, you will need to:
- Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
- For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.
Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.

Detailed Directory Breakdown

1. rumor_generation/
This directory contains all the code and data related to the rumor generation experiments.
- rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
- rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
- rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
- token_distribution.py: Script to analyze token distribution in the generated text.
- label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
- extract_reasons.py: Script to extract reasons for rumor generation and rejection.
- visualization.py: Utility script for generating figures.
- LDA.py: Code for performing LDA topic modeling on the generated data.
- rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
- generation_reasons_extracted.json: The extracted reasons for generated rumors.
- rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.

2. rumor_detection/
This directory contains the code and data used for the rumor detection experiments.
- nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
- nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
- nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
- reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
- reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
- reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
- traditional_model.py: Code for the traditional models used as baselines.
- preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
- preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
- generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
- select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
- twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
- twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
- fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
- twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
- twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
- fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
- visualization.py: Utility script for generating figures.

3. rumor_debunking/
This directory contains all the code and data for the rumor debunking experiments.
- analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
- calculate_readability.py: Script for calculating the readability score of the debunking texts.
- plot_readability.py: Utility script for generating figures related to readability.
- fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
- debunking_results.json: The dataset containing the debunking results for this experimental section.
- debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
- sentiment_analysis/: This directory contains the result file from the sentiment analysis.
  - debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.

Please contact the repository owner if you encounter any problems or have questions about the code or data.
https://india-data.org/terms-conditions
This is a sample dataset for the Autorickshaw detection challenge. The full dataset will be released shortly (see the Timeline on the website: http://cvit.iiit.ac.in/autorickshaw_detection).
The images folder contains 800 images of autorickshaws. Each image file has a number in its filename.
The bbs folder contains the bbs.json file, which contains the ground-truth bounding boxes. It is an array of length 800. The bounding box information corresponding to the image i.jpg can be found at the ith location in the bbs array. The bounding box information is again an array, whose length is the number of autorickshaws in that image. At each index the four vertices of the bounding box are provided. See the view.py script for an example.
The scripts folder contains a file view.py, which opens the image and overlays the bounding boxes (closing the window will show the next image). It serves as an example of how to view as well as load the data format in bbs.json.
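In the spirit of the provided view.py, a minimal sketch that overlays the ground-truth boxes for one image might look like the following; it assumes each bounding box is a list of four (x, y) vertex pairs, which is how the description above reads.

```python
import json
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon

# Sketch in the spirit of the provided view.py: overlay the ground-truth boxes
# for image i. Assumes each bounding box is a list of four (x, y) vertex pairs.
i = 0
with open("bbs/bbs.json") as f:
    bbs = json.load(f)  # array of length 800, one entry per image

img = Image.open(f"images/{i}.jpg")
fig, ax = plt.subplots()
ax.imshow(img)
for box in bbs[i]:      # one polygon per autorickshaw in the image
    ax.add_patch(Polygon(box, closed=True, fill=False, edgecolor="red"))
plt.show()
```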
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 1 .xlsx file, 2 .png files, 1 .json file and 1 .zip file:
- annotation_details.xlsx: The distribution of annotations over the previously mentioned six classes (mitosis, apoptosis, tumor nuclei, non-tumor nuclei, tubule, and non-tubule) is presented in an Excel spreadsheet.
- original.png: The input image.
- annotated.png: An example from the dataset. In the annotated image, blue circles indicate the tumor nuclei; pink circles show non-tumor nuclei such as blood cells, stroma nuclei, and lymphocytes; orange and green circles are mitosis and apoptosis, respectively; light blue circles are true lumen for tubules; and yellow circles represent white regions (non-lumen) such as fat, blood vessels, and broken tissues.
- data.json: The annotations for the BreCaHAD dataset are provided in JSON (JavaScript Object Notation) format. In the given example, the JSON file (ground truth) contains two mitosis annotations and only one tumor nuclei annotation. Here, x and y are the coordinates of the centroid of the annotated object, and the values are between 0 and 1.
- BreCaHAD.zip: An archive file containing the dataset. Three folders are included: images (original images), groundTruth (JSON files), and groundTruth_display (groundTruth applied on original images).
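A minimal sketch for converting the normalized centroids to pixel coordinates on original.png. The per-class key layout of data.json is not spelled out above, so the generic iteration below assumes the file maps each class name to a list of {"x": ..., "y": ...} centroid objects; check the file for the exact keys.

```python
import json
from PIL import Image

# Sketch: convert normalized centroids (x, y in [0, 1]) to pixel coordinates.
# Assumes data.json maps each class name to a list of {"x": ..., "y": ...}
# centroid objects; check the file for the exact keys.
img = Image.open("original.png")
w, h = img.size

with open("data.json") as f:
    annotations = json.load(f)

for cls, points in annotations.items():
    for p in points:
        px, py = p["x"] * w, p["y"] * h
        print(f"{cls}: centroid at ({px:.1f}, {py:.1f})")
```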
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview

3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items. Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here. Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises Python tools to read, generate, and visualize the dataset, and the 3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection. The DevKit is available here: https://github.com/volkswagen/3DHD_devkit. The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.

When using our dataset, you are welcome to cite:

@INPROCEEDINGS{9921866,
author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
year={2022},
pages={627-634}}

Acknowledgements

We thank the following interns for their exceptional contributions to our work.
- Benjamin Sertolli: Major contributions to our DevKit during his master thesis
- Niels Maier: Measurement campaign for data collection and data preparation
The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

The Dataset

After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

1. Dataset

This directory contains the training, validation, and test set definitions (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map. During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet. To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

import json
json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)

2. HD_Map

Map items are stored as lists of items in JSON format. In particular, we provide:
- traffic signs,
- traffic lights,
- pole-like objects,
- construction site locations,
- construction site obstacles (point-like such as cones, and line-like such as fences),
- line-shaped markings (solid, dashed, etc.),
- polygon-shaped markings (arrows, stop lines, symbols, etc.),
- lanes (ordinary and temporary),
- relations between elements (only for construction sites, e.g., sign-to-lane association).

3. HD_Map_MetaData

Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation for each tile as a polygon on the map. You can view the respective tile definition using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON. Files with the endings .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as a "shape file" (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.

4. HD_PointCloud_Tiles

The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays: first all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
- x-coordinates: 4 byte integer
- y-coordinates: 4 byte integer
- z-coordinates: 4 byte integer
- intensity of reflected beams: 2 byte unsigned integer
- ground classification flag: 1 byte unsigned integer
After reading, respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

import numpy as np
import pptk
file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
type_list = ['<i4', '<i4', '<i4', '<u2', 'u1']
with open(file_path, "rb") as fid:
    num_points = int(np.fromfile(fid, dtype='<i4', count=1)[0])
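    # Sketch of the remaining reads (not part of the original snippet): one full
    # array per field, in key_list order, using the dtypes from type_list above.
    # Verify against the DevKit before relying on it.
    for key, dtype in zip(key_list, type_list):
        pc_dict[key] = np.fromfile(fid, dtype=dtype, count=num_points)

# After unnormalizing the values (the scale factors are documented with the
# DevKit and are not assumed here), the points can be visualized with pptk:
v = pptk.viewer(np.stack([pc_dict['x'], pc_dict['y'], pc_dict['z']], axis=1))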
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains around 30 000 basic blocks whose energy consumption and execution time have been measured in isolation on the MSP430FR5969 microcontroller, at 1MHz. Basic blocks were executed in a worst case scenario regarding the MSP430 FRAM cache and CPU pipeline. The dataset creation process is described thoroughly in [1].
This dataset is composed of the following files:
This dataset is composed of the following files:
- basic_blocks.tar.xz contains all basic blocks (BB) used in the dataset, in a custom JSON format
- data.csv/data.xlsx contains the measured energy consumption and execution time for each basic block
We first detail how the basic_blocks archive is organized, and then present the CSV/XLSX spreadsheet format.
We extracted the basic blocks from a subset of programs of the AnghaBench benchmark suite [2]. The basic_blocks archive consists of the extracted basic blocks organized as JSON files. Each JSON file corresponds to a C source file from AnghaBench and is given a unique identifier. An example JSON file (137.json) is shown here:
{
"extr_pfctl_altq.c_pfctl_altq_init": [
# Basic block 1
[
# Instruction 1 of BB1
[
"MOV.W",
"#queue_map",
"R13"
],
# Instruction 2 of BB1
[
"MOV.B",
"#0",
"R14"
],
# Instruction 3 of BB1
[
"CALL",
"#hcreate_r",
null
]
],
# Basic block 2
[
....
]
]
}
The JSON file contains a dict with only one key, pointing to an array of basic blocks. This key is the name of the original C source file in AnghaBench from which the basic blocks were extracted (here extr_pfctl_altq.c_pfctl_altq_init.c). The array contains several basic blocks, which are represented as arrays of instructions, each instruction being represented as an array [OPCODE, OPERAND1, OPERAND2].
Then, each basic block can be identified uniquely using two ids: its file id and its offset in the file. In our example, basic block 1 can be identified by the JSON file id (137) and its offset in the file (0). Its ID is 137_0. This ID is used to map a basic block to its energy consumption/execution time in the data.csv/data.xlsx spreadsheet.
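A minimal sketch that loads one of these JSON files and enumerates its basic blocks with the file-id_offset IDs described above:

```python
import json

# Minimal sketch: load one extracted file (e.g. 137.json) and enumerate its
# basic blocks with the <file id>_<offset> IDs used in data.csv.
file_id = 137
with open(f"{file_id}.json", encoding="utf-8") as f:
    content = json.load(f)

(source_name, blocks), = content.items()  # single key: the original C source file
for offset, bb in enumerate(blocks):
    bb_id = f"{file_id}_{offset}"
    opcodes = [inst[0] for inst in bb]    # each instruction is [OPCODE, OPERAND1, OPERAND2]
    print(bb_id, source_name, opcodes)
```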
Energy consumption and execution time data are stored in the data.csv file. Here is an extract of the CSV file corresponding to basic block 137_0. The spreadsheet format is described below.
bb_id;nb_inst;max_energy;max_time;avg_time;avg_energy;energy_per_inst;nb_samples;unroll_factor
137_0;3;8.77;7.08;7.04;8.21;2.92;40;50
Spreadsheet format:
- bb_id: the unique identifier of a basic block (cf. Basic Blocks)
- nb_inst: the number of instructions in the basic block
- max_energy: the maximum energy consumption (in nJ) measured during the experiment
- max_time: the maximum execution time (in us) measured during the experiment
- avg_time: the average execution time (in us) measured during the experiment
- avg_energy: the average energy consumption (in nJ) measured during the experiment
- energy_per_inst: the average energy consumption per instruction (corresponds to avg_energy/nb_inst)
- nb_samples: how many times the basic block energy consumption/execution time has been measured
- unroll_factor: how many times the basic block was unrolled (cf. Basic Block Unrolling)
To measure the energy consumption and execution time on the MSP430, we need to be able to handle the scale difference between the measurement tool and the basic block execution time. This is achieved by duplicating the basic block multiple times while making sure to keep the worst-case memory layout, as explained in the paper. The number of times the basic block has been duplicated is called the unroll_factor.
Values of energy and time are always given per basic block, so they have already been divided by the unroll factor.
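A minimal sketch for loading the measurements with pandas, using the semicolon separator shown in the extract above:

```python
import pandas as pd

# Minimal sketch: load the measurements (semicolon-separated, as in the extract
# above) and look up basic block 137_0.
df = pd.read_csv("data.csv", sep=";")
row = df.loc[df["bb_id"] == "137_0"].iloc[0]
print(f"avg energy: {row['avg_energy']} nJ over {row['nb_inst']} instructions "
      f"({row['energy_per_inst']} nJ/inst, avg time {row['avg_time']} us)")
```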
The selected features after PCA analysis for both energy and time model are listed here: MOV.W_Rn_Rn, MOV.W_X(Rn)_X(Rn), CALL, MOV.B_#N_Rn, ADD.W_Rn_Rn, MOV.W_@Rn_Rn, MOV.W_X(Rn)_Rn, ADD.W_#N_Rn, PUSHM.W_#N_Rn, MOV.W_X(Rn)_ADDR, CMP.W_#N_Rn, MOV.W_&ADDR_X(Rn), MOV.W_Rn_X(Rn), BIS.W_Rn_Rn, RLAM.W_#N_Rn, SUB.W_#N_Rn, MOV.W_&ADDR_Rn, MOV.W_#N_X(Rn), CMP.W_Rn_Rn, BIT.W_ADDR_Rn, MOV.W_@Rn_X(Rn), ADD.W_#N_X(Rn), MOV.W_#N_Rn, AND.W_Rn_Rn, MOV.W_Rn_ADDR, SUB.W_Rn_Rn, MOV.W_ADDR_Rn, MOV.W_X(Rn)_&ADDR, MOV.W_ADDR_ADDR, JMP, ADD_#N_Rn, BIS.W_Rn_X(Rn), SUB_Rn_Rn, MOV.W_ADDR_X(Rn), ADDC_#N_X(Rn), MOV.B_Rn_Rn, CMP.W_X(Rn)_X(Rn), ADD_Rn_Rn, nb_inst, INV.W_Rn_, NOP_, ADD.W_X(Rn)_X(Rn), ADD.W_Rn_X(Rn), MOV.B_@Rn_Rn, BIS.W_X(Rn)_X(Rn), MOV.B_#N_X(Rn), MOV.W_#N_ADDR, AND.W_#N_ADDR, SUBC_X(Rn)_X(Rn), BIS.W_#N_X(Rn), SUB.W_X(Rn)_X(Rn), AND.B_#N_Rn, ADD_X(Rn)_X(Rn), MOV.W_@Rn_ADDR, MOV.W_&ADDR_ADDR, ADDC_Rn_Rn, AND.W_#N_X(Rn), SUB_#N_Rn, RRUM.W_#N_Rn, AND_ADDR_Rn, CMP.W_X(Rn)_ADDR, MOV.B_#N_ADDR, ADD.W_#N_ADDR, CMP.B_#N_Rn, SXT_Rn_, XOR.W_Rn_Rn, CMP.W_@Rn_Rn, ADD.W_@Rn_Rn, ADD.W_X(Rn)_Rn, AND.W_Rn_X(Rn), CMP.B_Rn_Rn, AND.W_X(Rn)_X(Rn), BIC.W_#N_Rn, BIS.W_#N_Rn, AND.B_#N_X(Rn), MOV.B_X(Rn)_X(Rn), AND.W_@Rn_Rn, MOV.W_#N_&ADDR, BIS.W_Rn_ADDR, SUB.W_X(Rn)_Rn, SUB.W_Rn_X(Rn), SUB_X(Rn)_X(Rn), MOV.B_@Rn_X(Rn), CMP.W_@Rn_X(Rn), ADD.W_X(Rn)_ADDR, CMP.W_Rn_X(Rn), BIS.W_@Rn_X(Rn), CMP.B_X(Rn)_X(Rn), RRC.W_Rn_, MOV.W_@Rn_&ADDR, CMP.W_#N_X(Rn), ADDC_X(Rn)_Rn, CMP.W_X(Rn)_Rn, BIS.W_X(Rn)_Rn, SUB_X(Rn)_Rn, MOV.B_X(Rn)_Rn, MOV.W_ADDR_&ADDR, AND.W_#N_Rn, RLA.W_Rn_, INV.W_X(Rn)_, XOR.W_#N_Rn, SUB.W_Rn_ADDR, BIC.W_#N_X(Rn), MOV.B_X(Rn)_ADDR, ADD_#N_X(Rn), SUB_Rn_X(Rn), MOV.B_&ADDR_Rn, MOV.W_Rn_&ADDR, ADD_X(Rn)_Rn, AND.W_X(Rn)_Rn, PUSHM.A_#N_Rn, RRAM.W_#N_Rn, AND.W_@Rn_X(Rn), BIS.B_Rn_X(Rn), SUB.W_@Rn_Rn, CLRC_, CMP.W_#N_ADDR, XOR.W_Rn_X(Rn), MOV.B_Rn_ADDR, CMP.B_X(Rn)_Rn, BIS.B_Rn_Rn, BIS.W_X(Rn)_ADDR, CMP.B_#N_X(Rn), CMP.W_Rn_ADDR, XOR.W_X(Rn)_Rn, MOV.B_Rn_X(Rn), ADD.B_#N_Rn
The trained machine learning model, tests, and local explanation code can be generated and found here: WORTEX Machine learning code
This work has received French government support granted to the Labex CominLabs excellence laboratory and managed by the National Research Agency in the "Investing for the Future" program under reference ANR-10-LABX-07-01.
Copyright 2024 Hector Chabot Copyright 2024 Abderaouf Nassim Amalou Copyright 2024 Hugo Reymond Copyright 2024 Isabelle Puaut
Licensed under the Creative Commons Attribution 4.0 International License
[1] Reymond, H., Amalou, A. N., Puaut, I. "WORTEX: Worst-Case Execution Time and Energy Estimation in Low-Power Microprocessors using Explainable ML" in 22nd International Workshop on Worst-Case Execution Time Analysis (WCET 2024).
[2] Da Silva, Anderson Faustino, et al. "Anghabench: A suite with one million compilable C benchmarks for code-size reduction." 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021.
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

How the metadata was downloaded

The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

How the files are organized

├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   │   │   └── ...
│   │   └── metadatablocks_v5.6
│   │       ├── astrophysics_v5.6.json
│   │       ├── biomedical_v5.6.json
│   │       ├── citation_v5.6.json
│   │       ├── ...
│   │       └── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv

This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
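A minimal sketch for working with one installation's metadata after extracting its zip file; the directory names are taken from the tree above, and the loop simply loads each Dataverse JSON file and counts them.

```python
import json
from pathlib import Path

# Minimal sketch: after extracting one installation archive (here Abacus, using
# the directory names from the tree above), load each Dataverse JSON metadata
# file and count how many dataset versions were downloaded.
metadata_dir = (Path("Abacus_2022.10.02_17.11.19")
                / "Dataverse_JSON_metadata_2022.10.02_17.11.19")

count = 0
for json_path in metadata_dir.glob("*.json"):
    with open(json_path, encoding="utf-8") as f:
        dataset_json = json.load(f)  # metadata in the "Dataverse JSON" schema
    count += 1
print(f"{count} dataset metadata files in {metadata_dir}")
```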
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset composed of 8 software engineering projects collected (and refined) from the Software Engineering "Laboratory of Advanced Programming" course at Sapienza University of Rome for master students in Engineering in Computer Science.
The dataset comprises a folder for each project. Each folder contains:
[
{
"set_id": 1,
"set_name": "auth client",
"user_stories": [1, 2, 3, 4],
"links": [2, 3, 5],
"db": "true"
},
...
]
The json is represented by an array of dictionaries, each relative to a set, characterized by a set_id and a set_name, grouping user stories (identified by their numerical identifier in user_stories). Each dictionary also contains links and db keys to indicate other sets that have a related context and the need for a backend service to store or retrieve data, respectively. From an architectural point of view, user stories that belong to linked sets can be fulfilled by the same container and the sets of user stories that are required to store or retrieve data must be fulfilled by a container hosting a database microservice.
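A small sketch of the grouping rule just described: user stories in linked sets are merged into one candidate container, and a container is flagged as needing a database microservice if any of its sets has "db": "true". The file name sets.json is hypothetical; use the JSON file provided in each project folder.

```python
import json

# Sketch of the grouping rule described above. The file name "sets.json" is
# hypothetical; use the JSON file provided in each project folder.
with open("sets.json", encoding="utf-8") as f:
    sets = {s["set_id"]: s for s in json.load(f)}

parent = {sid: sid for sid in sets}  # union-find over the "links" relation

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for sid, s in sets.items():
    for linked in s["links"]:
        if linked in sets:
            union(sid, linked)

groups = {}
for sid in sets:
    groups.setdefault(find(sid), []).append(sid)

for members in groups.values():
    needs_db = any(sets[m]["db"] == "true" for m in members)
    stories = sorted(story for m in members for story in sets[m]["user_stories"])
    print(f"container for sets {members}: user stories {stories}, needs database: {needs_db}")
```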
The dataset is continuously updated. Each academic year it will be enriched with new projects.
If you want to contribute, please contact us.
Learn Geographic Mapping with Altair, Vega-Lite and Vega using Curated Datasets
Complete geographic and geophysical data collection for mapping and visualization. This consolidation includes 18 complementary datasets used by 31+ Vega, Vega-Lite, and Altair examples. Perfect for learning geographic visualization techniques including projections, choropleths, point maps, vector fields, and interactive displays.
Source data lives on GitHub and can also be accessed via CDN. The vega-datasets project serves as a common repository for example datasets used across these visualization libraries and related projects.
The collection spans point data (like airports.csv), lines (like londonTubeLines.json), and polygons (like us-10m.json), as well as geophysical field data (windvectors.csv, annual-precip.json). This pack includes 18 datasets covering base maps, reference points, statistical data for choropleths, and geophysical data.
| Dataset | File | Size | Format | License | Description | Key Fields / Join Info |
|---|---|---|---|---|---|---|
| US Map (1:10m) | us-10m.json | 627 KB | TopoJSON | CC-BY-4.0 | US state and county boundaries. Contains states and counties objects. Ideal for choropleths. | id (FIPS code) property on geometries |
| World Map (1:110m) | world-110m.json | 117 KB | TopoJSON | CC-BY-4.0 | World country boundaries. Contains countries object. Suitable for world-scale viz. | id property on geometries |
| London Boroughs | londonBoroughs.json | 14 KB | TopoJSON | CC-BY-4.0 | London borough boundaries. | properties.BOROUGHN (name) |
| London Centroids | londonCentroids.json | 2 KB | GeoJSON | CC-BY-4.0 | Center points for London boroughs. | properties.id, properties.name |
| London Tube Lines | londonTubeLines.json | 78 KB | GeoJSON | CC-BY-4.0 | London Underground network lines. | properties.name, properties.color |
| Dataset | File | Size | Format | License | Description | Key Fields / Join Info |
|---|---|---|---|---|---|---|
| US Airports | airports.csv | 205 KB | CSV | Public Domain | US airports with codes and coordinates. | iata, state, `l... |
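As a starting point, a minimal Altair example that combines two of the files above (US state boundaries from us-10m.json and airport locations from airports.csv, both also available through the vega_datasets package, which is assumed to be installed) might look like the following; the longitude/latitude column names follow the standard vega-datasets airports.csv schema.

```python
import altair as alt
from vega_datasets import data

# Minimal sketch: US state outlines from us-10m.json with airport points from
# airports.csv, both loaded via the vega_datasets package (assumed installed);
# the longitude/latitude column names follow the standard airports.csv schema.
states = alt.topo_feature(data.us_10m.url, feature="states")
airports = data.airports.url

background = alt.Chart(states).mark_geoshape(
    fill="lightgray", stroke="white"
).project("albersUsa").properties(width=600, height=400)

points = alt.Chart(airports).mark_circle(size=10, color="steelblue").encode(
    longitude="longitude:Q",
    latitude="latitude:Q",
    tooltip=["iata:N", "state:N"],
)

(background + points).save("us_airports.html")
```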