MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Extraction Examples Dataset
This dataset contains 17 examples for testing extraction workflows.
Dataset Structure
Each example includes:
PDF file: Original document
map_info.json: Map extraction metadata
direction.json: Direction information
GeoJSON files: Polygon geometries
Area JSON files: Area definitions
File Organization
files/
├── example1/
│   ├── document.pdf
│   ├── map_info.json
│   ├── direction.json
│   ├── polygon1.geojson
│   └── area1.json
… See the full description on the dataset page: https://huggingface.co/datasets/alexdzm/extraction-examples.
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
A geodata data package providing GeoJSON polygons for all the world's countries.
World country and state coordinates for plotting geospatial maps.
Files source:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- Supported data rates
- extended supported rates
- HT capabilities
- extended capabilities
- data under extended tag and vendor specific tag
- interworking
- VHT capabilities
- RSSI
- SSID
- timestamp when PR was received.
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
PR_IE_data = {
    'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
    'HT_CAP': DATA_htcap,
    'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
    'VHT_CAP': DATA_vhtcap,
    'INTERWORKING': DATA_inter,
    'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
    'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1 ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2 ...}
        ...
    }
}
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_DATA.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure makes it possible to store only the time of arrival ('TIME') and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the data already stored for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, the PR_data structure of the new PR is appended to the 'PROBE_REQs' key for that MAC address. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
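To make the merge step concrete, below is a minimal Python sketch of the per-interval preprocessing described above. The function and variable names are ours (not from the original collection script); the dictionaries follow the PR_IE_data and PR_data layouts shown earlier.

def merge_probe_request(interval_store, mac, ssid, pr_ie_data, toa, rssi):
    # interval_store maps MAC address -> {'MAC', 'SSIDs', 'PROBE_REQs'}
    entry = interval_store.setdefault(mac, {'MAC': mac, 'SSIDs': [], 'PROBE_REQs': []})

    # Keep every distinct SSID seen for this MAC in the current scan interval.
    if ssid and ssid not in entry['SSIDs']:
        entry['SSIDs'].append(ssid)

    # If a PR with identical IE data is already stored, only append TIME and RSSI.
    for pr in entry['PROBE_REQs']:
        if pr['DATA'] == pr_ie_data:
            pr['TIME'].append(toa)
            pr['RSSI'].append(rssi)
            return

    # Otherwise, append a new PR_data structure for this MAC address.
    entry['PROBE_REQs'].append({'TIME': [toa], 'RSSI': [rssi], 'DATA': pr_ie_data})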
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.
The files map to recording locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with a WiFi dongle) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correct installation and data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes, the service on the recording device is restarted. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data are not recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day, which shows up as missing data of up to a few minutes.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location is constant and undisturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the exact movements of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to location 2, the device was placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill while people are present and moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is connected wirelessly to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment. The internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
MSVD-CTN Dataset
This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.
Dataset Structure
Each JSON file contains a dictionary where the keys are the video IDs and the values are the corresponding Causal-Temporal Narrative (CTN) captions. The CTN captions are represented as a dictionary with two keys: "Cause" and "Effect", containing the cause and effect statements, respectively.
Example:
{
  "video_id_1": {
    "Cause": "a person performed an action",
    "Effect": "a specific outcome occurred"
  },
  "video_id_2": {
    "Cause": "another cause statement",
    "Effect": "another effect statement"
  }
}
Loading the Datasets
To load the datasets, use a JSON parsing library in your preferred programming language. For example, in Python, you can use the json module:
import json

with open("msvd_CTN_train.json", "r") as f:
    msvd_train_data = json.load(f)

# Access the CTN captions
for video_id, ctn_caption in msvd_train_data.items():
    cause = ctn_caption["Cause"]
    effect = ctn_caption["Effect"]
    # Process the cause and effect statements as needed
License
The MSVD-CTN benchmark dataset is licensed under the Creative Commons Attribution Non Commercial No Derivatives 4.0 International (CC BY-NC-ND 4.0) license.
Automatically describing images using natural sentences is an essential task for visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.
The PraCegoVer movement arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.
#PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.
New Release
We release pracegover_400k.json which contains 403,337 examples from the original dataset.json after preprocessing and duplication removal. It is split into train, validation, and test with 242036, 80628, and 80673 examples, respectively.
Dataset Structure
The #PraCegoVer dataset comprises a main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json contains a list of JSON objects with the following attributes:
Each instance in dataset.json is associated with exactly one image in the images directory, whose filename is given by the attribute filename. We also provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
Download Instructions
If you just want an overview of the dataset structure, you can download sample.tar.gz. But if you want to use the dataset, or any of its subsets (63k, 173k, and 400k), you must download all the files and run the following commands to uncompress and join them:
cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz
Alternatively, you can download the entire dataset from the terminal using the Python script download_dataset.py available in the PraCegoVer repository. In this case, you first have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:
python download_dataset.py --access_token=
This dataset contains resources transformed from other datasets on HDX. They exist here only in a format modified to support visualization on HDX and may not be as up to date as the source datasets from which they are derived.
Source datasets: https://data.hdx.rwlabs.org/dataset/idps-data-by-region-in-mali
This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.
How the metadata was downloaded
The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.
How the files are organized
├── csv_files_with_metadata_from_most_known_dataverse_installations
│   ├── author(citation).csv
│   ├── basic.csv
│   ├── contributor(citation).csv
│   ├── ...
│   └── topic_classification(citation).csv
├── dataverse_json_metadata_from_each_known_dataverse_installation
│   ├── Abacus_2022.10.02_17.11.19.zip
│   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
│   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
│   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
│   ├── ...
│   ├── metadatablocks_v5.6
│   ├── astrophysics_v5.6.json
│   ├── biomedical_v5.6.json
│   ├── citation_v5.6.json
│   ├── ...
│   ├── socialscience_v5.6.json
│   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
│   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
│   ├── Arca_Dados_2022.10.02_17.44.35.zip
│   ├── ...
│   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
├── dataset_pids_from_most_known_dataverse_installations.csv
├── licenses_used_by_dataverse_installations.csv
└── metadatablocks_from_most_known_dataverse_installations.csv
This dataset contains two directories and three CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset.
For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files. The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files. The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ... Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the fluentspeechcommands dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.
$ tar tvf fluentspeechcommands_train_0000000.tar |head
-r--r--r-- bigdata/bigdata 174 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json
-r--r--r-- bigdata/bigdata 131116 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.wav
-r--r--r-- bigdata/bigdata 136 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.json
-r--r--r-- bigdata/bigdata 71376 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.wav
-r--r--r-- bigdata/bigdata 132 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.json
-r--r--r-- bigdata/bigdata 68310 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.wav
-r--r--r-- bigdata/bigdata 143 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.json
-r--r--r-- bigdata/bigdata 89208 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.wav
-r--r--r-- bigdata/bigdata 132 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.json
-r--r--r-- bigdata/bigdata 76502 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.wav
$ cat 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json
{"speakerId": "52XVOeXMXYuaElyw", "transcription": "I need to practice my English. Switch the language", "action": "change language", "object": "English", "location": "none"}
The DataCite Public Data File contains metadata records in JSON format for all DataCite DOIs in Findable state that were registered up to the end of 2023.
This dataset represents a processed version of the Public Data File, where the data have been extracted and loaded into a Redivis dataset.
Records have descriptive metadata for research outputs and resources structured according to the DataCite Metadata Schema and include links to other persistent identifiers (PIDs) for works (DOIs), people (ORCID iDs), and organizations (ROR IDs).
Use of the DataCite Public Data File is subject to the DataCite Data File Use Policy.
This dataset is a processed version of the DataCite Public Data File, where the original file (a 23GB .tar.gz) has been extracted into 55,239 JSONL files, which were then concatenated into a single JSONL file.
This JSONL file has been imported into a Redivis table to facilitate further exploration and analysis.
A sample project demonstrating how to query the DataCite data file can be found here: https://redivis.com/projects/hx1e-a6w8vmwsx
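Outside Redivis, the concatenated JSONL file can be streamed line by line, since each line is one DataCite metadata record. A minimal sketch (the filename is a placeholder, and the assumption that the record's id field holds the DOI should be checked against the DataCite Metadata Schema):

import json

path = "datacite_public_data_file.jsonl"  # placeholder filename

with open(path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record.get("id"))  # typically the DOI of the record
        if i >= 4:
            break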
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data set containing features extracted from 211 DNS Tunneling packet captures. The packet capture samples are classified by the protocols tunneled within the DNS tunnel. The features are stored in JSON files, one for each packet capture. The features in each file include the IP Packet Length, the DNS Query Name Length and the DNS Query Name entropy. In this "slightly unclean" version of the feature set, the DNS Query Name field values are also present, but they are not actually necessary.
This feature set may be used to perform machine learning techniques on DNS Tunneling traffic to discover new insights without necessarily having to reconstruct and analyze the equivalent full packet captures.
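As an illustration of one of the listed features, the DNS Query Name entropy is commonly computed as the Shannon entropy of the characters in the query name. The sketch below is our own illustration of that computation; the exact formula and the JSON key names used in the feature files are not specified here.

import math
from collections import Counter

def shannon_entropy(name: str) -> float:
    # Shannon entropy, in bits per character, of a DNS query name.
    if not name:
        return 0.0
    counts = Counter(name)
    n = len(name)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Long, random-looking labels typical of DNS tunneling score higher than ordinary hostnames.
print(shannon_entropy("www.example.com"))
print(shannon_entropy("aGVsbG8gd29ybGQgdGhpcyBpcyB0dW5uZWxlZA.example.com"))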
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.
It contains the following files:
- spider-realistic.json
# The spider-realistic evaluation set
# Examples: 508
# Databases: 19
- dev.json
# The original dev split of Spider
# Examples: 1034
# Databases: 20
- tables.json
# The original DB schemas from Spider
# Databases: 166
- README.txt
- license
The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al. "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
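For a quick look at the evaluation set, each file can be loaded with standard JSON tooling. A minimal sketch, assuming the usual Spider fields (db_id, question, query) documented on that page:

import json

with open("spider-realistic.json", "r", encoding="utf-8") as f:
    examples = json.load(f)

print(len(examples))  # expected: 508 examples

# Field names follow the Spider format; check the Spider GitHub page for details.
for ex in examples[:3]:
    print(ex["db_id"], "|", ex["question"], "->", ex["query"])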
For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
This dataset is distributed under the CC BY-SA 4.0 license.
If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al., 2018, and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
@inproceedings{Yu&al.18c,
year = 2018,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
booktitle = {EMNLP},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
}
@InProceedings{P18-1033,
author = "Finegan-Dollak, Catherine
and Kummerfeld, Jonathan K.
and Zhang, Li
and Ramanathan, Karthik
and Sadasivam, Sesh
and Zhang, Rui
and Radev, Dragomir",
title = "Improving Text-to-SQL Evaluation Methodology",
booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "351--360",
location = "Melbourne, Australia",
url = "http://aclweb.org/anthology/P18-1033"
}
@InProceedings{data-sql-imdb-yelp,
dataset = {IMDB and Yelp},
author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
title = {SQLizer: Query Synthesis from Natural Language},
booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
month = {October},
year = {2017},
pages = {63:1--63:26},
url = {http://doi.org/10.1145/3133887},
}
@article{data-academic,
dataset = {Academic},
author = {Fei Li and H. V. Jagadish},
title = {Constructing an Interactive Natural Language Interface for Relational Databases},
journal = {Proceedings of the VLDB Endowment},
volume = {8},
number = {1},
month = {September},
year = {2014},
pages = {73--84},
url = {http://dx.doi.org/10.14778/2735461.2735468},
}
@InProceedings{data-atis-geography-scholar,
dataset = {Scholar, and Updated ATIS and Geography},
author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
title = {Learning a Neural Semantic Parser from User Feedback},
booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
year = {2017},
pages = {963--973},
location = {Vancouver, Canada},
url = {http://www.aclweb.org/anthology/P17-1089},
}
@inproceedings{data-geography-original,
dataset = {Geography, original},
author = {John M. Zelle and Raymond J. Mooney},
title = {Learning to Parse Database Queries Using Inductive Logic Programming},
booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
year = {1996},
pages = {1050--1055},
location = {Portland, Oregon},
url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
}
@inproceedings{data-restaurants-logic,
author = {Lappoon R. Tang and Raymond J. Mooney},
title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
year = {2000},
pages = {133--141},
location = {Hong Kong, China},
url = {http://www.aclweb.org/anthology/W00-1317},
}
@inproceedings{data-restaurants-original,
author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
title = {Towards a Theory of Natural Language Interfaces to Databases},
booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
year = {2003},
location = {Miami, Florida, USA},
pages = {149--157},
url = {http://doi.acm.org/10.1145/604045.604070},
}
@inproceedings{data-restaurants,
author = {Alessandra Giordani and Alessandro Moschitti},
title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
year = {2012},
location = {Montpellier, France},
pages = {59--76},
url = {https://doi.org/10.1007/978-3-642-45260-4_5},
}
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals: Journal of Chemical and Engineering Data (ISSN: 1520-5134), The Journal of Chemical Thermodynamics (ISSN: 1096-3626), Fluid Phase Equilibria (ISSN: 0378-3812), Thermochimica Acta (ISSN: 0040-6031), and International Journal of Thermophysics (ISSN: 1572-9567). Data from initial cooperation (around 2003) through the 2019 calendar year are included.
The original scope of the archive has been expanded to include JSON files. The JSON files are structured according to the ThermoML.xsd (available below) and rendered from the same experimental thermophysical and thermochemical property data reported in the corresponding articles as the ThermoML files. In fact, the ThermoML files are generated from the JSON files to keep the information in sync. The JSON files may contain additional information not supported by the ThermoML schema. For example, each JSON file contains the md5 checksum of the ThermoML file (THERMOML_MD5_CHECKSUM) that may be used to validate the ThermoML download.
This data.nist.gov resource provides a .tgz file download containing the JSON and ThermoML files for each version of the archive. Data from initial cooperation (around 2003) through the 2019 calendar year are provided below (ThermoML.v2020-09.30.tgz). The dates of the extraction from TRC databases, as specified in the dateCit field of the xml files, are 2020-09-29 and 2020-09-30. The .tgz file contains a directory tree that maps to the DOI prefix/suffix of the entries; e.g., unzipping the .tgz file creates a directory for each of the prefixes (10.1007, 10.1016, and 10.1021) that contains all the .json and .xml files.
The data and other information throughout this digital resource (including the website, API, JSON, and ThermoML files) have been carefully extracted from the original articles by NIST/TRC personnel. Neither the Journal publisher, nor its editors, nor NIST/TRC warrant or represent, expressly or implied, the correctness or accuracy of the content of information contained throughout this digital resource, nor its fitness for any use or for any purpose, nor can they, or will they, accept any liability or responsibility whatever for the consequences of its use or misuse by anyone. In any individual case of application, the respective user must check the correctness by consulting other relevant sources of information.
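As a small illustration of the checksum mentioned above, the following sketch validates a downloaded ThermoML .xml file against the THERMOML_MD5_CHECKSUM field of its companion JSON file. The filenames are placeholders, not actual entries in the archive.

import hashlib
import json

json_path = "10.1016/example_entry.json"  # placeholder path within the extracted archive
xml_path = "10.1016/example_entry.xml"    # the paired ThermoML file

with open(json_path, "r", encoding="utf-8") as f:
    record = json.load(f)

with open(xml_path, "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()

# THERMOML_MD5_CHECKSUM is stored in each JSON file, as described above.
print("checksum matches:", digest == record["THERMOML_MD5_CHECKSUM"])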
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ransomware has been considered a significant threat for most enterprises over the past few years. In scenarios wherein users can access all files on a shared server, one infected host is capable of locking access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2500 h of ‘not infected’ traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.
This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separated folder.
The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.
Each folder (for example N10S10/) contains:
- tree.py -> Python script with the Tree model.
- ensemble.json -> JSON file with the information about the Ensemble model.
- NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
- N10S10.csv -> All samples used for training each model in this folder. It is in csv format for use in the BigML application.
- zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for use in the BigML application.
- userSamples_test -> All samples used for validating each model in this folder. It is in csv format for use in the BigML application.
- userSamples_train -> User samples used for training the models.
- ransomware_train -> Ransomware samples used for training the models.
- scaler.scaler -> Standard Scaler from the Python library used to scale the samples.
- zeroDays_notFiltered -> Folder with the zero-day samples.
In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but this is because some of them are not "unseen" binaries (their families are present in the training set).
The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:
- Each line is one sample.
- Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
- The features are separated by ',' because it is a csv file.
- The last column is the label of the sample.
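A minimal sketch of loading one of these sample files into feature and label arrays, relying only on the structure described above (features followed by the label in the last column); if the CSV has a header row, add skiprows=1:

import numpy as np

# Each row: 3*T features followed by the label (1 = 'infected', 0 = 'not infected').
data = np.loadtxt("N10S10/N10S10.csv", delimiter=",")

X = data[:, :-1]  # 3*T features per sample
y = data[:, -1]   # label column

print(X.shape, y.shape, "infected fraction:", y.mean())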
Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is the ravdess dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.
$ tar tvf ravdess_fold_0_0000000.tar |head
-r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-01-01-11.json
-r--r--r-- bigdata/bigdata 341912 2025-01-10 15:44 03-01-08-01-01-01-11.wav
-r--r--r-- bigdata/bigdata 22 2025-01-10 15:44 03-01-07-02-01-02-05.json
-r--r--r-- bigdata/bigdata 424184 2025-01-10 15:44 03-01-07-02-01-02-05.wav
-r--r--r-- bigdata/bigdata 22 2025-01-10 15:44 03-01-06-01-01-02-10.json
-r--r--r-- bigdata/bigdata 377100 2025-01-10 15:44 03-01-06-01-01-02-10.wav
-r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-02-01-16.json
-r--r--r-- bigdata/bigdata 396324 2025-01-10 15:44 03-01-08-01-02-01-16.wav
-r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-02-02-22.json
-r--r--r-- bigdata/bigdata 404388 2025-01-10 15:44 03-01-08-01-02-02-22.wav
$ cat 03-01-08-01-01-01-11.json
{"emotion": "surprised"}
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
To evaluate land use and land cover (LULC) maps, an independent and representative test dataset is required. Here, a test dataset was generated via a stratified random sampling approach across all areas in Fiji not used to generate training data (i.e. all Tikinas which did not contain a training data point were valid for sampling to generate the test dataset). Following equation 13 in Olofsson et al. (2014), the sample size of the test dataset was 834. This was based on a desired standard error of the overall accuracy score of 0.01 and a user's accuracy of 0.75 for all classes. The strata for sampling test samples were the eight LULC classes: water, mangrove, bare soil, urban, agriculture, grassland, shrubland, and trees.
There are different strategies for allocating samples to strata for evaluating LULC maps, as discussed by Olofsson et al. (2014). Equal allocation of samples to strata ensures coverage of rarely occurring classes and minimise the standard error of estimators of user's accuracy. However, equal allocation does not optimise the standard error of the estimator of overall accuracy. Proportional allocation of samples to strata, based on the proportion of the strata in the overall dataset, can result in rarely occurring classes being underrepresented in the test dataset. Optimal allocation of samples to strata is challenging to implement when there are multiple evaluation objectives. Olofsson et al. (2014) recommend a "simple" allocation procedure where 50 to 100 samples are allocated to rare classes and proportional allocation is used to allocate samples to the remaining majority classes. The number of samples to allocate to rare classes can be determined by iterating over different allocations and computing estimated standard errors for performance metrics. Here, the 2021 all-Fiji LULC map, minus the Tikinas used for generating training samples, was used to estimate the proportional areal coverage of each LULC class. The LULC map from 2021 was used to permit comparison with other LULC products with a 2021 layer, notably the ESA WorldCover 10m v200 2021 product.
The 2021 LULC map was dominated by the tree class (74% of the area classified) and the remaining classes had less than 10% coverage each. Therefore, a "simple" allocation of 100 samples to the seven minority classes and an allocation of 133 samples to the tree class was used. This ensured all the minority classes had sufficient coverage in the test set while balancing the requirement to minimise standard errors for the estimate of overall accuracy. The allocated number of test dataset points were randomly sampled within each stratum and were manually labelled using 2021 annual median RGB composites from Sentinel-2 and Planet NICFI and high-resolution Google Satellite Basemaps.
The Fiji LULC test data is available in GeoJSON format in the file fiji-lulc-test-data.geojson. Each point feature has two attributes: ref_class (the LULC class manually labelled and quality checked) and strata (the strata the sampled point belongs to, derived from the 2021 all-Fiji LULC map). The following integers correspond to the ref_class and strata labels:
When evaluating LULC maps using test data derived from a stratified sample, the nature of the stratified sampling needs to be accounted for when estimating performance metrics such as overall accuracy, user's accuracy, and producer's accuracy. This is particularly so if the strata do not match the map classes (i.e. when comparing different LULC products). Stehman (2014) provides formulas for estimating performance metrics and their standard errors when using test data with a stratified sampling structure.
To support LULC accuracy assessment a Python package has been developed which provides implementations of Stehman's (2014) formulas. The package can be installed via:
pip install lulc-validation
with documentation and examples here.
In order to compute performance metrics accounting for the stratified nature of the sample the total number of points / pixels available to be sampled in each strata must be known. For this dataset that is:
This dataset was generated with support from a Climate Change AI Innovation Grant.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.
The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.
Dataset Description | Sample Count | Receiver Count |
---|---|---|
No-Tx Samples | 46 | 10 to 25 |
1-Tx Samples | 4822 | 10 to 25 |
2-Tx Samples | 346 | 11 to 12 |
The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:
\(RSS = \frac{10}{N} \log_{10}\left(\sum_i^N x_i^2 \right) \)
Measurement Parameters | Description |
---|---|
Frequency | 462.7 MHz |
Radio Gain | 35 dB |
Receiver Sample Rate | 2 MHz |
Sample Length | N=10,000 |
Band-pass Filter | 6 kHz |
Transmitters | 0 to 2 |
Transmission Power | 1 W |
Receivers consist of Ettus USRP X310 and B210 radios and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and are only relative to the device.
Usage Instructions
Data is provided in .json format, both as one file and as split files.
import json
data_file = 'powder_462.7_rss_data.json'
with open(data_file) as f:
    data = json.load(f)
The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:
- rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
- tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
- metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
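A short sketch of iterating over the loaded dictionary, using only the keys documented above:

# 'data' is the dictionary loaded in the snippet above, keyed by sample timestamp.
for timestamp, sample in list(data.items())[:3]:
    rx_data = sample["rx_data"]      # per-receiver entries (RSS, latitude, longitude, device name)
    tx_coords = sample["tx_coords"]  # latitude/longitude of each transmitter
    meta = sample["metadata"]        # per-transmitter metadata, same order as tx_coords

    print(timestamp, "receivers:", len(rx_data), "transmitters:", len(tx_coords))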
File Separations and Train/Test Splits
In the separated_data.zip folder there are several train/test separations of the data.
- all_data contains all the data in the main JSON file, separated by the number of transmitters.
- stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
- train_test_splits contains unique data splits used for training and evaluating ML models. These splits only used data from the single-tx case. In other words, the union of each split, along with unused.json, is equivalent to the file all_data/single_tx.json.
  - The random split is a random 80/20 split of the data.
  - special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
  - The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test. One such random assignment of grid squares makes up the grid split.
  - The seasonal split contains data separated by the month of collection, in April or July.
  - The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
  - campus.json contains the on-campus data, so is equivalent to the union of each split, not including unused.json.

Digital Surface Model
The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.
To read the data in python:
import rasterio as rio
import numpy as np
import utm
dsm_object = rio.open('dsm.tif')
dsm_map = dsm_object.read(1) # a np.array containing elevation values
dsm_resolution = dsm_object.res # a tuple containing x,y resolution (0.5 meters)
dsm_transform = dsm_object.transform # an Affine transform for conversion to UTM-12 coordinates
# Convert pixel corners to UTM zone 12 coordinates (the transform maps (col, row) -> (x, y))
utm_transform = np.array(dsm_transform).reshape((3, 3))[:2]
utm_top_left = utm_transform @ np.array([0, 0, 1])
utm_bottom_right = utm_transform @ np.array([dsm_object.width, dsm_object.height, 1])
latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
Dataset Acknowledgement: This DSM file is acquired by the State of Utah and its partners, and is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners makes no warranty, expressed or implied, regarding its suitability for a particular use and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.
DSM DOI: https://doi.org/10.5069/G9TH8JNQ
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a sample of 10,000 (3.5%) out of a total of 285,846 text sequences extracted from the 1891–1896 Map of London by the Ordnance Survey (OS).
The methodology used for the automated recognition, linking, and sequencing of the text is detailed in the article Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer by M. Zou et al., 2025.
The map is drawn at a scale of five feet to the mile (ca. 1:1,056). The text on the map is an invaluable source of information about Greater London in the late Victorian period. It includes the names of streets, squares, parks, watercourses and even some estates ('Poplars', 'The Grange', 'Arbutus Lodge'). In addition, the map contains many details of the function of buildings and economic activity, such as factories ('Sweet Factory', 'Crown Linoleum Works', 'Imperial Flour Mills', 'Lion Brewery'), warehouses or commercial infrastructure ('Warehouse', 'Jamaica Wharf', 'Rag Store'), offices ('Offices'), etc. The map also mentions public buildings such as schools ('School Boys, Girls & Infants', 'Sunday School'), hospitals or clinics ('St. Saviour's Union Infirmary', 'Beulah Spa Hydropathic Establishment', 'South Western Fever Hospital'), railway stations ('Clapham Station'), post offices, banks, police stations, etc. Other social venues are also mentioned, such as public houses, i.e. pubs ('P.H.'), clubs, casinos, and recreational areas (e.g. 'Cricket Ground'). Special attention is given to churches, with a regular count of the number of seats (e.g. 'Baptist Chapel Seats for 600').
In addition, the map provides details that can be of great interest in the study of everyday life in London at the end of the 19th century. For example, there are numerous mentions of 'Stables', 'Drinking Fountain'[s] (or simply 'Fn.') or 'Urinal'[s]. Fire protection infrastructure is highlighted, e.g. fire plugs ('F.P.') and fire alarms ('F.A.'). The map also includes information on elevation (e.g. '11·6') and flood levels (e.g. 'High Water Mark of Ordinary Tides').
A list of abbreviations used in the Ordnance Survey maps, created by Richard Oliver [1], is made available by the National Library of Scotland (link).
The data in 10k_text_london_OS_1890s.geojson is organized as a regular GeoJSON file.
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "MultiPolygon",
"coordinates": [[[ [x1, y1], [x2, y2], ...]]]
},
"properties": {
"label": "Oxford Circus",
}
},
... # Further text sequences
]
}
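A minimal sketch of reading the file and listing a few text labels with the standard json module (geopandas would work equally well for spatial operations):

import json

with open("10k_text_london_OS_1890s.geojson", encoding="utf-8") as f:
    collection = json.load(f)

features = collection["features"]
print(len(features))  # expected: 10,000 text sequences

for feature in features[:5]:
    label = feature["properties"]["label"]
    geom_type = feature["geometry"]["type"]  # MultiPolygon, per the structure above
    print(label, geom_type)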
The original map document consists of 729 separate sheets, digitized, georeferenced, and served as geographic tiles by the National Library of Scotland [2].
Total Number of text sequences: 285,846
Sample size: 10,000
Total Area covered: 450 square km
For any mention of this dataset, please cite :
@misc{text_london_OS_1890s,
author = {Zou, Mengjie and Petitpierre, R{\'{e}}mi and di Lenardo, Isabella},
title = {{London 1890s Ordnance Survey Text Layer}},
year = {2025},
publisher = {Zenodo},
url = {https://doi.org/10.5281/zenodo.14982946}
}

@article{recognizing_sequencing_2025,
author = {Zou, Mengjie and Dai, Tianhao and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
title = {{Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer}},
year = {2025}}
Rémi PETITPIERRE - remi.petitpierre@epfl.ch - ORCID - Github - Scholar - ResearchGate
This project is licensed under the CC BY 4.0 License.
We do not assume any liability for the use of this dataset.
This dataset contains ether as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain.
Only send-ether, contract function call, and contract deployment transactions are present in the dataset. Miner reward transactions are not currently included.
Details of the datasets are given below:
FILENAME FORMAT:
The filenames have the following format:
eth-tx-<start_block>-<end_block>.txt.bz2
where <start_block> and <end_block> are the first and last block whose transactions are contained in the file.
For example, file eth-tx-1000000-1099999.txt.bz2 contains transactions from
block 1000000 to block 1099999 inclusive.
The files are compressed with bzip2. They can be uncompressed using command bunzip2.
TRANSACTION FORMAT:
Each line in a file corresponds to a transaction. The transaction has the following format:
units. ERC20 token transfers (transfer and transferFrom function calls in the ERC20
contract) are indicated by the token symbol. For example, GUSD is the Gemini USD stable
coin. The JSON file erc20tokens.json given below contains the details of the ERC20 tokens.
decoder-error.txt FILE:
This file contains the transactions (block no, tx no, tx hash), one per line, that produced
an error while decoding calldata. These transactions are not present in the data files.
erc20tokens.json FILE:
This file contains the list of popular ERC20 token contracts whose transfer/transferFrom
transactions appear in the data files.
-------------------------------------------------------------------------------------------
[
{
"address": "0xdac17f958d2ee523a2206206994597c13d831ec7",
"decdigits": 6,
"symbol": "USDT",
"name": "Tether-USD"
},
{
"address": "0xB8c77482e45F1F44dE1745F52C74426C631bDD52",
"decdigits": 18,
"symbol": "BNB",
"name": "Binance"
},
{
"address": "0x2af5d2ad76741191d15dfe7bf6ac92d4bd912ca3",
"decdigits": 18,
"symbol": "LEO",
"name": "Bitfinex-LEO"
},
{
"address": "0x514910771af9ca656af840dff83e8264ecf986ca",
"decdigits": 18,
"symbol": "LNK",
"name": "Chainlink"
},
{
"address": "0x6f259637dcd74c767781e37bc6133cd6a68aa161",
"decdigits": 18,
"symbol": "HT",
"name": "HuobiToken"
},
{
"address": "0xf1290473e210b2108a85237fbcd7b6eb42cc654f",
"decdigits": 18,
"symbol": "HEDG",
"name": "HedgeTrade"
},
{
"address": "0x9f8f72aa9304c8b593d555f12ef6589cc3a579a2",
"decdigits": 18,
"symbol": "MKR",
"name": "Maker"
},
{
"address": "0xa0b73e1ff0b80914ab6fe0444e65848c4c34450b",
"decdigits": 8,
"symbol": "CRO",
"name": "Crypto.com"
},
{
"address": "0xd850942ef8811f2a866692a623011bde52a462c1",
"decdigits": 18,
"symbol": "VEN",
"name": "VeChain"
},
{
"address": "0x0d8775f648430679a709e98d2b0cb6250d2887ef",
"decdigits": 18,
"symbol": "BAT",
"name": "Basic-Attention"
},
{
"address": "0xc9859fccc876e6b4b3c749c5d29ea04f48acb74f",
"decdigits": 0,
"symbol": "INO",
"name": "INO-Coin"
},
{
"address": "0x8e870d67f660d95d5be530380d0ec0bd388289e1",
"decdigits": 18,
"symbol": "PAX",
"name": "Paxos-Standard"
},
{
"address": "0x17aa18a4b64a55abed7fa543f2ba4e91f2dce482",
"decdigits": 18,
"symbol": "INB",
"name": "Insight-Chain"
},
{
"address": "0xc011a72400e58ecd99ee497cf89e3775d4bd732f",
"decdigits": 18,
"symbol": "SNX",
"name": "Synthetix-Network"
},
{
"address": "0x1985365e9f78359a9B6AD760e32412f4a445E862",
"decdigits": 18,
"symbol": "REP",
"name": "Reputation"
},
{
"address": "0x653430560be843c4a3d143d0110e896c2ab8ac0d",
"decdigits": 16,
"symbol": "MOF",
"name": "Molecular-Future"
},
{
"address": "0x0000000000085d4780B73119b644AE5ecd22b376",
"decdigits": 18,
"symbol": "TUSD",
"name": "True-USD"
},
{
"address": "0xe41d2489571d322189246dafa5ebde1f4699f498",
"decdigits": 18,
"symbol": "ZRX",
"name": "ZRX"
},
{
"address": "0x8ce9137d39326ad0cd6491fb5cc0cba0e089b6a9",
"decdigits": 18,
"symbol": "SXP",
"name": "Swipe"
},
{
"address": "0x75231f58b43240c9718dd58b4967c5114342a86c",
"decdigits": 18,
"symbol": "OKB",
"name": "Okex"
},
{
"address": "0xa974c709cfb4566686553a20790685a47aceaa33",
"decdigits": 18,
"symbol": "XIN",
"name": "Mixin"
},
{
"address": "0xd26114cd6EE289AccF82350c8d8487fedB8A0C07",
"decdigits": 18,
"symbol": "OMG",
"name": "OmiseGO"
},
{
"address": "0x89d24a6b4ccb1b6faa2625fe562bdd9a23260359",
"decdigits": 18,
"symbol": "SAI",
"name": "Sai Stablecoin v1.0"
},
{
"address": "0x6c6ee5e31d828de241282b9606c8e98ea48526e2",
"decdigits": 18,
"symbol": "HOT",
"name": "HoloToken"
},
{
"address": "0x6b175474e89094c44da98b954eedeac495271d0f",
"decdigits": 18,
"symbol": "DAI",
"name": "Dai Stablecoin"
},
{
"address": "0xdb25f211ab05b1c97d595516f45794528a807ad8",
"decdigits": 2,
"symbol": "EURS",
"name": "Statis-EURS"
},
{
"address": "0xa66daa57432024023db65477ba87d4e7f5f95213",
"decdigits": 18,
"symbol": "HPT",
"name": "HuobiPoolToken"
},
{
"address": "0x4fabb145d64652a948d72533023f6e7a623c7c53",
"decdigits": 18,
"symbol": "BUSD",
"name": "Binance-USD"
},
{
"address": "0x056fd409e1d7a124bd7017459dfea2f387b6d5cd",
"decdigits": 2,
"symbol": "GUSD",
"name": "Gemini-USD"
},
{
"address": "0x2c537e5624e4af88a7ae4060c022609376c8d0eb",
"decdigits": 6,
"symbol": "TRYB",
"name": "BiLira"
},
{
"address": "0x4922a015c4407f87432b179bb209e125432e4a2a",
"decdigits": 6,
"symbol": "XAUT",
"name": "Tether-Gold"
},
{
"address": "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48",
"decdigits": 6,
"symbol": "USDC",
"name": "USD-Coin"
},
{
"address": "0xa5b55e6448197db434b92a0595389562513336ff",
"decdigits": 16,
"symbol": "SUSD",
"name": "Santender"
},
{
"address": "0xffe8196bc259e8dedc544d935786aa4709ec3e64",
"decdigits": 18,
"symbol": "HDG",
"name": "HedgeTrade"
},
{
"address": "0x4a16baf414b8e637ed12019fad5dd705735db2e0",
"decdigits": 2,
"symbol": "QCAD",
"name": "QCAD"
}
]
-------------------------------------------------------------------------------------------
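Since token transfers in the data files are identified by symbol, the decdigits field above is what converts raw integer token amounts into human-readable values. A minimal sketch using erc20tokens.json (variable and function names are ours):

import json
from decimal import Decimal

with open("erc20tokens.json", encoding="utf-8") as f:
    tokens = json.load(f)

# Index the token list by symbol for quick lookups.
by_symbol = {t["symbol"]: t for t in tokens}

def to_token_units(symbol: str, raw_amount: int) -> Decimal:
    # Convert a raw on-chain integer amount to token units using decdigits.
    decimals = by_symbol[symbol]["decdigits"]
    return Decimal(raw_amount) / (Decimal(10) ** decimals)

print(to_token_units("USDT", 2500000))  # 2.5 USDT (6 decimal digits)
print(to_token_units("GUSD", 150))      # 1.5 GUSD (2 decimal digits)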