100+ datasets found
  1. extraction-examples

    • huggingface.co
    Cite
    Alex, extraction-examples [Dataset]. https://huggingface.co/datasets/alexdzm/extraction-examples
    Explore at:
    Authors
    Alex
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Extraction Examples Dataset

    This dataset contains 17 examples for testing extraction workflows.

      Dataset Structure
    

    Each example includes:

    • PDF file: original document
    • map_info.json: map extraction metadata
    • direction.json: direction information
    • GeoJSON files: polygon geometries
    • Area JSON files: area definitions

      File Organization
    

    files/
    ├── example1/
    │   ├── document.pdf
    │   ├── map_info.json
    │   ├── direction.json
    │   ├── polygon1.geojson
    │   └── area1.json
    …

    See the full description on the dataset page: https://huggingface.co/datasets/alexdzm/extraction-examples.
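
    For a quick look at the contents, the metadata and geometry files of one example folder can be loaded with the standard json module. A minimal sketch, assuming a local copy of the files/ directory laid out as above (paths are illustrative):

    import json
    from pathlib import Path

    root = Path("files")
    for example_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Map extraction metadata and direction information
        map_info = json.loads((example_dir / "map_info.json").read_text())
        direction = json.loads((example_dir / "direction.json").read_text())
        # Polygon geometries (GeoJSON) and area definitions
        polygons = [json.loads(p.read_text()) for p in example_dir.glob("*.geojson")]
        areas = [json.loads(p.read_text()) for p in example_dir.glob("area*.json")]
        print(example_dir.name, len(polygons), "polygons,", len(areas), "areas")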

  2. Country Polygons as GeoJSON

    • datahub.io
    Updated Sep 1, 2017
    + more versions
    Cite
    (2017). Country Polygons as GeoJSON [Dataset]. https://datahub.io/core/geo-countries
    Explore at:
    Dataset updated
    Sep 1, 2017
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    A geodata data package providing GeoJSON polygons for all the world's countries.

  3. Country State GeoJSON

    • kaggle.com
    zip
    Updated Apr 27, 2020
    Cite
    Mukesh Chapagain (2020). Country State GeoJSON [Dataset]. https://www.kaggle.com/chapagain/country-state-geo-location
    Explore at:
    zip(286136 bytes)Available download formats
    Dataset updated
    Apr 27, 2020
    Authors
    Mukesh Chapagain
    Description

    About

    World country and state coordinates for plotting geospatial maps.

    Source

    Files source:

    1. Folium GitHub Repository:

  4. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • data.niaid.nih.gov
    Updated Jan 6, 2023
    Cite
    Andrej Hrovat (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
    Explore at:
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    Miha Mohorčič
    Mihael Mohorčič
    Aleš Simončič
    Andrej Hrovat
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which mobile devices in an unassociated state send to scan the nearby area for existing wireless networks. The frame part of a PR consists of variable-length fields, called Information Elements (IE), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.

    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:

    • MAC address
    • Supported data rates
    • Extended supported rates
    • HT capabilities
    • Extended capabilities
    • Data under the Extended tag and Vendor Specific tag
    • Interworking
    • VHT capabilities
    • RSSI
    • SSID
    • Timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

    Data preprocessing

    The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IE fields are saved in the following JSON structure:

    PR_IE_data = {
      'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
      'HT_CAP': DATA_htcap,
      'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
      'VHT_CAP': DATA_vhtcap,
      'INTERWORKING': DATA_inter,
      'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
      'VENDOR_SPEC': {
        VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
        VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
        ...
      }
    }

    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_data.

    When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    { 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

    This data structure allows storing only 'TIME' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the data for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
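
    As a rough illustration of working with the structures described above, the sketch below tallies captured PRs per MAC address from one daily location file. It assumes the file ultimately holds a list of the per-MAC entries shown above; the exact top-level nesting should be checked against the Single_PR_capture_example.json file before relying on it.

    import json
    from collections import Counter

    # Hypothetical path: one location file from one daily folder
    with open("2022-09-22T22-00-00_2022-09-23T22-00-00/1.json") as f:
        devices = json.load(f)   # assumed: list of {'MAC', 'SSIDs', 'PROBE_REQs'} entries

    counts = Counter()
    for device in devices:
        mac = device["MAC"]
        for pr in device["PROBE_REQs"]:          # one entry per distinct PR_IE_data
            counts[mac] += len(pr["TIME"])       # each TIME/RSSI value is one captured PR

    print(counts.most_common(5))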

    Folder structure

    For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, with the samples recorded by that device.

    The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.

    Files are mapped to locations as follows:

    • 1.json -> location 1
    • 2.json -> location 2
    • 3.json -> location 3
    • 4.json -> location 4

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correct installation and for the data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:

    • location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
    • location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    • location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    • location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day, which appears as up to a few minutes of missing data.

     Location 1 - Piazza del Duomo - Chierici
    

    The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location remained constant and undisturbed, and the dataset appears to have complete coverage.

     Location 2 - Via Etnea - Piazza del Duomo
    

    The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill; however, the exact movements of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days of the record contain no PRs from this location.

     Location 3 - Via Etnea - Piazza Università
    

    Similar to location 2, the device is placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and then moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.

     Location 4 - Piazza Università
    

    This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment. The internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of the Resiloc project with the help of the City of Catania and the project partners.

  5. MSVD-CTN Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Jun 9, 2024
    + more versions
    Cite
    Asmar Nadeem; Faegheh Sardari; Robert Dawes; Syed Sameed Husain; Adrian Hilton; Armin Mustafa (2024). MSVD-CTN Dataset [Dataset]. https://paperswithcode.com/dataset/msvd-ctn
    Explore at:
    Dataset updated
    Jun 9, 2024
    Authors
    Asmar Nadeem; Faegheh Sardari; Robert Dawes; Syed Sameed Husain; Adrian Hilton; Armin Mustafa
    Description

    MSVD-CTN Dataset

    This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.

    Dataset Structure

    Each JSON file contains a dictionary where the keys are the video IDs and the values are the corresponding Causal-Temporal Narrative (CTN) captions. The CTN captions are represented as a dictionary with two keys: "Cause" and "Effect", containing the cause and effect statements, respectively.

    Example:

    {
      "video_id_1": {
        "Cause": "a person performed an action",
        "Effect": "a specific outcome occurred"
      },
      "video_id_2": {
        "Cause": "another cause statement",
        "Effect": "another effect statement"
      }
    }

    Loading the Datasets

    To load the datasets, use a JSON parsing library in your preferred programming language. For example, in Python, you can use the json module:

    import json
    
    with open("msvd_CTN_train.json", "r") as f:
      msvd_train_data = json.load(f)
    
    # Access the CTN captions
    for video_id, ctn_caption in msvd_train_data.items():
      cause = ctn_caption["Cause"]
      effect = ctn_caption["Effect"]
      # Process the cause and effect statements as needed
    

    License

    The MSVD-CTN benchmark dataset is licensed under the Creative Commons Attribution Non Commercial No Derivatives 4.0 International (CC BY-NC-ND 4.0) license.

  6. Data from: #PraCegoVer dataset

    • zenodo.org
    Updated Jan 20, 2023
    + more versions
    Cite
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila (2023). #PraCegoVer dataset [Dataset]. http://doi.org/10.5281/zenodo.7548638
    Explore at:
    Dataset updated
    Jan 20, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Gabriel Oliveira dos Santos; Esther Luna Colombini; Sandra Avila
    Description

    Automatically describing images using natural sentences is an essential task for visually impaired people's inclusion on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    #PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we have proposed #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    #PraCegoVer has 533,523 pairs of images and captions in Portuguese, collected from more than 14 thousand different profiles. The average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    New Release

    We release pracegover_400k.json, which contains 403,337 examples from the original dataset.json after preprocessing and duplication removal. It is split into train, validation, and test with 242,036, 80,628, and 80,673 examples, respectively.

    Dataset Structure

    The #PraCegoVer dataset comprises a main file, dataset.json, and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json holds a list of JSON objects with the attributes:

    • user: anonymized user that made the post;
    • filename: image file name;
    • raw_caption: raw caption;
    • caption: clean caption;
    • date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory, whose filename is given by the attribute filename. Also, we provide a sample with five instances, so users can download the sample to get an overview of the dataset before downloading it completely.
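
    A minimal loading sketch for dataset.json, using the attributes listed above (it assumes dataset.json holds the list of objects directly and that the uncompressed images live in an images/ directory):

    import json
    import os

    with open("dataset.json") as f:
        posts = json.load(f)              # list of JSON objects with the attributes above

    for post in posts[:5]:
        image_path = os.path.join("images", post["filename"])
        print(post["user"], post["date"], image_path)
        print(post["caption"])            # clean caption; post["raw_caption"] keeps the original text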

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k, 173k, and 400k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in the PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  7. JSON Repository

    • data.amerigeoss.org
    • cloud.csiss.gmu.edu
    • +2more
    csv, geojson, json +1
    Updated Jun 4, 2025
    Cite
    UN Humanitarian Data Exchange (2025). JSON Repository [Dataset]. https://data.amerigeoss.org/dataset/json-repository
    Explore at:
    geojson(135805), geojson(886086), csv(457), csv(242), geojson(222216), csv(845984), json(3401512), csv(9901), json(2064743), geojson(709673), csv(779), geojson(9124), json(327649), json(640845), csv(462610), geojson(162605), csv(358964), csv(4907), csv(6789), geojson(219728), json(1975854), csv(177), json(632081), geojson(1324722), geojson(543777), csv(536), topojson(2728099), csv(177073), geojson(953043), json(3478518), json(3411081), json(876253), geojson(2396630), geojson(366788), geojson(545299), csv(669568), geojson(178718), json(461423), json(457832), geojson(54889), csv(85982), json(1132925), csv(9980), json(707249), geojson(74470), geojson(365288), json(520472), json(559095), geojson(164379)Available download formats
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    United Nations http://un.org/
    Description

    This dataset contains resources transformed from other datasets on HDX. They exist here only in a format modified to support visualization on HDX and may not be as up to date as the source datasets from which they are derived.

    Source datasets: https://data.hdx.rwlabs.org/dataset/idps-data-by-region-in-mali

  8. Dataset metadata of known Dataverse installations

    • search.dataone.org
    • dataverse.harvard.edu
    • +1more
    Updated Nov 22, 2023
    + more versions
    Cite
    Gautier, Julian (2023). Dataset metadata of known Dataverse installations [Dataset]. http://doi.org/10.7910/DVN/DCDKZQ
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gautier, Julian
    Description

    This dataset contains the metadata of the datasets published in 77 Dataverse installations, information about each installation's metadata blocks, and the list of standard licenses that dataset depositors can apply to the datasets they publish in the 36 installations running more recent versions of the Dataverse software. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation on October 2 and October 3, 2022 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another named "apikey" listing my accounts' API tokens. The Python script expects and uses the API tokens in this CSV file to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation).csv
    │   ├── basic.csv
    │   ├── contributor(citation).csv
    │   ├── ...
    │   └── topic_classification(citation).csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2022.10.02_17.11.19.zip
    │   │   ├── dataset_pids_Abacus_2022.10.02_17.11.19.csv
    │   │   ├── Dataverse_JSON_metadata_2022.10.02_17.11.19
    │   │   │   ├── hdl_11272.1_AB2_0AQZNT_v1.0.json
    │   │   │   └── ...
    │   │   └── metadatablocks_v5.6
    │   │       ├── astrophysics_v5.6.json
    │   │       ├── biomedical_v5.6.json
    │   │       ├── citation_v5.6.json
    │   │       ├── ...
    │   │       └── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2022.10.02_17.26.19.zip
    │   ├── ADA_Dataverse_2022.10.02_17.26.57.zip
    │   ├── Arca_Dados_2022.10.02_17.44.35.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2022.10.02_22.59.36.zip
    ├── dataset_pids_from_most_known_dataverse_installations.csv
    ├── licenses_used_by_dataverse_installations.csv
    └── metadatablocks_from_most_known_dataverse_installations.csv

    This dataset contains two directories and three CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 18 CSV files that contain the values from common metadata fields of all 77 Dataverse installations. For example, author(citation)_2022.10.02-2022.10.03.csv contains the "Author" metadata for all published, non-deaccessioned, versions of all datasets in the 77 installations, where there's a row for each author name, affiliation, identifier type and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 77 zipped files, one for each of the 77 Dataverse installations whose dataset metadata I was able to download using Dataverse APIs. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate whether or not the Python script was able to download the Dataverse JSON metadata for each dataset. For Dataverse installations using Dataverse software versions whose Search APIs include each dataset's owning Dataverse collection name and alias, the CSV files also include which Dataverse collection (within the installation) that dataset was published in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I saved them so that they can be used when extracting metadata from the Dataverse JSON files.

    The dataset_pids_from_most_known_dataverse_installations.csv file contains the dataset PIDs of all published datasets in the 77 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all of the "dataset_pids_..." files in each of the 77 zip files.

    The licenses_used_by_dataverse_installations.csv file contains information about the licenses that a number of the installations let depositors choose when creating datasets. When I collected ...

    Visit https://dataone.org/datasets/sha256%3Ad27d528dae8cf01e3ea915f450426c38fd6320e8c11d3e901c43580f997a3146 for complete metadata about this dataset.
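
    As an illustration of how the per-installation archives might be processed, the sketch below opens one installation's zip file and counts the Dataverse JSON metadata records. The zip name is taken from the listing above, and the inner directory naming is assumed to follow the layout described, so verify both against an actual download.

    import json
    import zipfile

    with zipfile.ZipFile("Abacus_2022.10.02_17.11.19.zip") as zf:
        json_names = [n for n in zf.namelist()
                      if "Dataverse_JSON_metadata" in n and n.endswith(".json")]
        print(len(json_names), "dataset versions with Dataverse JSON metadata")

        # Peek at the top-level keys of one record
        with zf.open(json_names[0]) as f:
            record = json.load(f)
        print(sorted(record.keys()))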

  9. fluentspeechcommands in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong (2025). fluentspeechcommands in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722453
    Explore at:
    tarAvailable download formats
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    xiaomi
    Authors
    Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the fluentspeechcommands dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf fluentspeechcommands_train_0000000.tar |head
    -r--r--r-- bigdata/bigdata 174 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json
    -r--r--r-- bigdata/bigdata 131116 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.wav
    -r--r--r-- bigdata/bigdata  136 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.json
    -r--r--r-- bigdata/bigdata 71376 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.wav
    -r--r--r-- bigdata/bigdata  132 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.json
    -r--r--r-- bigdata/bigdata 68310 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.wav
    -r--r--r-- bigdata/bigdata  143 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.json
    -r--r--r-- bigdata/bigdata 89208 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.wav
    -r--r--r-- bigdata/bigdata  132 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.json
    -r--r--r-- bigdata/bigdata 76502 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.wav

    $ cat 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json 
    {"speakerId": "52XVOeXMXYuaElyw", "transcription": "I need to practice my English. Switch the language", "action": "change language", "object": "English", "location": "none"}
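
    Because WebDataset shards are plain tar archives, the paired files can also be read with the standard library alone. A minimal sketch using tarfile on the shard listed above (the webdataset package can likewise stream such shards directly):

    import json
    import tarfile

    with tarfile.open("fluentspeechcommands_train_0000000.tar") as tar:
        for member in tar.getmembers():
            if not member.name.endswith(".json"):
                continue
            meta = json.load(tar.extractfile(member))            # per-sample metadata
            wav_name = member.name[:-len(".json")] + ".wav"      # matching audio file
            audio_bytes = tar.extractfile(wav_name).read()
            print(member.name, meta.get("transcription"), len(audio_bytes), "bytes")
            break
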
  10. DataCite Public Data

    • redivis.com
    application/jsonl +7
    Updated Dec 12, 2024
    + more versions
    Cite
    Redivis Demo Organization (2024). DataCite Public Data [Dataset]. https://redivis.com/datasets/7wec-6vgw8qaaq
    Explore at:
    application/jsonl, arrow, spss, csv, stata, sas, avro, parquetAvailable download formats
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Description

    Abstract

    The DataCite Public Data File contains metadata records in JSON format for all DataCite DOIs in Findable state that were registered up to the end of 2023.

    This dataset represents a processed version of the Public Data File, where the data have been extracted and loaded into a Redivis dataset.

    Methodology

    The DataCite Public Data File contains metadata records in JSON format for all DataCite DOIs in Findable state that were registered up to the end of 2023.

    Records have descriptive metadata for research outputs and resources structured according to the DataCite Metadata Schema and include links to other persistent identifiers (PIDs) for works (DOIs), people (ORCID iDs), and organizations (ROR IDs).

    Use of the DataCite Public Data File is subject to the DataCite Data File Use Policy.

    Usage

    This dataset is a processed version of the DataCite Public Data File, where the original file (a 23GB .tar.gz) has been extracted into 55,239 JSONL files, which were then concatenated into a single JSONL file.

    This JSONL file has been imported into a Redivis table to facilitate further exploration and analysis.
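
    Outside Redivis, the concatenated file can also be scanned record by record with plain Python, since JSON Lines keeps one JSON document per line. A minimal sketch; the local filename is illustrative:

    import json

    # Illustrative name for a local copy of the concatenated JSON Lines export
    with open("datacite_public_data.jsonl", encoding="utf-8") as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            if i == 0:
                print(sorted(record.keys()))   # inspect the structure of one record
            if i >= 2:
                break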

    A sample project demonstrating how to query the DataCite data file can be found here: https://redivis.com/projects/hx1e-a6w8vmwsx

  11. Dataset containing Features from DNS Tunneling Samples stored in JSON files

    • researchdata.se
    Updated May 10, 2017
    Cite
    Irvin Homem; Panagiotis Papapetrou (2017). Dataset containing Features from DNS Tunneling Samples stored in JSON files [Dataset]. http://doi.org/10.17045/STHLMUNI.4229399
    Explore at:
    Dataset updated
    May 10, 2017
    Dataset provided by
    Stockholm University
    Authors
    Irvin Homem; Panagiotis Papapetrou
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data set containing features extracted from 211 DNS tunneling packet captures. The packet capture samples are classified by the protocols tunneled within the DNS tunnel. The features are stored in JSON files, one for each packet capture. The features in each file include the IP Packet Length, the DNS Query Name Length and the DNS Query Name entropy. In this "slightly unclean" version of the feature set, the DNS Query Name field values are also present, but they are not actually necessary.

    This feature set may be used to perform machine learning techniques on DNS Tunneling traffic to discover new insights without necessarily having to reconstruct and analyze the equivalent full packet captures.

  12. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    txt, json, binAvailable download formats
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with the explicit mentions of column names removed. The SQL queries and databases are kept unchanged.
    For the format of each JSON file, please refer to the GitHub page of Spider: https://github.com/taoyds/spider.
    For the database files, please refer to the official Spider release: https://yale-lily.github.io/spider.
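
    A minimal loading sketch for the evaluation set; the field names used below (db_id, question, query) follow the standard Spider example format documented in the Spider repository, so verify them against the files themselves:

    import json

    with open("spider-realistic.json") as f:
        examples = json.load(f)

    print(len(examples), "examples")      # 508 expected, per the file list above
    ex = examples[0]
    print(ex.get("db_id"), "|", ex.get("question"), "->", ex.get("query"))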

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Intergrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  13. Hydroclimatic atlas 2022

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +1more
    csv, geojson, html +3
    Updated May 1, 2025
    Cite
    Government and Municipalities of Québec (2025). Hydroclimatic atlas 2022 [Dataset]. https://open.canada.ca/data/dataset/8bc217ff-d25d-4f55-a9a7-ada3df4b29a7
    Explore at:
    csv, geojson, pdf, zip, html, shpAvailable download formats
    Dataset updated
    May 1, 2025
    Dataset provided by
    Government and Municipalities of Québec
    License

    Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Time period covered
    Jan 1, 1970 - Dec 31, 2100
    Description

    Data of the 2022 Hydroclimatic Atlas

    Description

    The Hydroclimatic Atlas describes the current and future water regime of southern Quebec in order to support the implementation of water management practices that are resilient to climate change. These data are from the most recent version of the Hydroclimatic Atlas.

    What's new

    • Improvement of the spatial resolution of the hydrographic network;
    • Greater spatial coverage;
    • Addition of the ClimEx and CORDEX-NA sets, in addition to the scenarios in the CMIP5 set;
    • Use of six hydrological platforms;
    • Addition of indicators, especially annual ones;
    • Etc.

    List of available data

    • Link to the new Hydroclimatic Atlas website.
    • Map of the 24,604 river sections of the Hydroclimatic Atlas with their attributes, available in GeoJSON and shapefile format. To facilitate download and display, the map is divided into 11 GeoJSON files: ABIT (Abitibi and Lac Abitibi region), CND west (North Shore regions A and B), CND east (North Shore regions C, D and E), GASP (Gaspésie), MONT (Montérégie), OUTM (Outaouais upstream), OUTV (Outaouais downstream), SAGU (Saguenay), SLNO (St-Laurent Nord-Ouest), SLSO (St-Laurent Sud-Ouest), and VAUD (Vaudreuil).
    • The CSV tables (“Magnitude...”) for each of the 76 hydrological indicators describing the magnitude, direction and dispersion for RCP 4.5 and RCP 8.5, for the three future horizons (see the documentation for details).
    • The CSV tables (“Projected indicator...”) for each of the 76 hydrological indicators detailing the flow values with their uncertainty for the historical period and the three future horizons (RCP 4.5 and 8.5). See the documentation for more details.
    • A PDF with the metadata and a more detailed description of the data.

    Note

    The 2018 version data is archived on Data Quebec for reference, for example for old reports or analyses referring to this version of the data. Any new study or analysis should use the most recent data available below or on the Atlas website. This third-party metadata element was translated using an automated translation tool (Amazon Translate).

  14. Data from: ThermoML/Data Archive

    • catalog.data.gov
    • data.nist.gov
    • +1more
    Updated Jul 29, 2022
    Cite
    National Institute of Standards and Technology (2022). ThermoML/Data Archive [Dataset]. https://catalog.data.gov/dataset/thermoml-data-archive
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    National Institute of Standards and Technology http://www.nist.gov/
    Description

    ThermoML is an XML-based IUPAC standard for the storage and exchange of experimental thermophysical and thermochemical property data. The ThermoML archive is a subset of Thermodynamics Research Center (TRC) data holdings corresponding to cooperation between NIST TRC and five journals: Journal of Chemical Engineering and Data (ISSN: 1520-5134), The Journal of Chemical Thermodynamics (ISSN: 1096-3626), Fluid Phase Equilibria (ISSN: 0378-3812), Thermochimica Acta (ISSN: 0040-6031), and International Journal of Thermophysics (ISSN: 1572-9567). Data from initial cooperation (around 2003) through the 2019 calendar year are included.

    The original scope of the archive has been expanded to include JSON files. The JSON files are structured according to the ThermoML.xsd (available below) and rendered from the same experimental thermophysical and thermochemical property data reported in the corresponding articles as the ThermoML files. In fact, the ThermoML files are generated from the JSON files to keep the information in sync. The JSON files may contain additional information not supported by the ThermoML schema. For example, each JSON file contains the md5 checksum on the ThermoML file (THERMOML_MD5_CHECKSUM) that may be used to validate the ThermoML download.

    This data.nist.gov resource provides a .tgz file download containing the JSON and ThermoML files for each version of the archive. Data from initial cooperation (around 2003) through the 2019 calendar year are provided below (ThermoML.v2020-09.30.tgz). The dates of the extraction from TRC databases, as specified in the dateCit field of the xml files, are 2020-09-29 and 2020-09-30. The .tgz file contains a directory tree that maps to the DOI prefix/suffix of the entries; e.g. unzipping the .tgz file creates a directory for each of the prefixes (10.1007, 10.1016, and 10.1021) that contains all the .json and .xml files.

    The data and other information throughout this digital resource (including the website, API, JSON, and ThermoML files) have been carefully extracted from the original articles by NIST/TRC personnel. Neither the Journal publisher, nor its editors, nor NIST/TRC warrant or represent, expressly or implied, the correctness or accuracy of the content of information contained throughout this digital resource, nor its fitness for any use or for any purpose, nor can they, or will they, accept any liability or responsibility whatever for the consequences of its use or misuse by anyone. In any individual case of application, the respective user must check the correctness by consulting other relevant sources of information.
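
    As noted above, each JSON file carries an md5 checksum of its companion ThermoML file in the THERMOML_MD5_CHECKSUM field, which can be used to validate the download. A minimal verification sketch; the file names are illustrative placeholders following the DOI prefix/suffix directory layout:

    import hashlib
    import json

    # Illustrative pair of files from the extracted .tgz (DOI prefix/suffix layout)
    json_path = "10.1016/example-entry.json"
    xml_path = "10.1016/example-entry.xml"

    with open(json_path) as f:
        entry = json.load(f)

    with open(xml_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()

    print("checksum OK" if digest == entry["THERMOML_MD5_CHECKSUM"] else "checksum mismatch")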

  15. Ransomware and user samples for training and validating ML models

    • data.mendeley.com
    Updated Sep 17, 2021
    + more versions
    Cite
    Eduardo Berrueta (2021). Ransomware and user samples for training and validating ML models [Dataset]. http://doi.org/10.17632/yhg5wk39kf.2
    Explore at:
    Dataset updated
    Sep 17, 2021
    Authors
    Eduardo Berrueta
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ransomware has been considered a significant threat to most enterprises for the past few years. In scenarios where users can access all files on a shared server, one infected host is capable of locking access to all shared files. In the article related to this repository, we detect ransomware infection based on file-sharing traffic analysis, even in the case of encrypted traffic. We compare three machine learning models and choose the best for validation. We train and test the detection model using more than 70 ransomware binaries from 26 different families and more than 2,500 h of 'not infected' traffic from real users. The results reveal that the proposed tool can detect all ransomware binaries, including those not used in the training phase (zero-days). This paper provides a validation of the algorithm by studying the false positive rate and the amount of information from user files that the ransomware could encrypt before being detected.

    This dataset directory contains the 'infected' and 'not infected' samples and the models used for each T configuration, each one in a separate folder.

    The folders are named NxSy, where x is the number of 1-second intervals per sample and y is the sliding step in seconds.

    Each folder (for example N10S10/) contains:

    • tree.py -> Python script with the Tree model.
    • ensemble.json -> JSON file with the information about the Ensemble model.
    • NN_XhiddenLayer.json -> JSON file with the information about the NN model with X hidden layers (1, 2 or 3).
    • N10S10.csv -> All samples used for training each model in this folder. It is in csv format for use in the bigML application.
    • zeroDays.csv -> All zero-day samples used for testing each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_test -> All samples used for validating each model in this folder. It is in csv format for use in the bigML application.
    • userSamples_train -> User samples used for training the models.
    • ransomware_train -> Ransomware samples used for training the models.
    • scaler.scaler -> Standard Scaler from the Python library, used to scale the samples.
    • zeroDays_notFiltered -> Folder with the zero-day samples.

    In the case of the N30S30 folder, there is an additional folder (SMBv2SMBv3NFS) with the samples extracted from the SMBv2, SMBv3 and NFS traffic traces. There are more binaries than the ones presented in the article, but that is because some of them are not "unseen" binaries (their families are present in the training set).

    The files containing samples (NxSy.csv, zeroDays.csv and userSamples_test.csv) are structured as follows:

    • Each line is one sample.
    • Each sample has 3*T features and the label (1 if it is an 'infected' sample and 0 if it is not).
    • The features are separated by ',' because it is a csv file.
    • The last column is the label of the sample.
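
    A minimal sketch for loading one of these sample files and separating features from labels, following the layout just described (pandas is used for convenience; header=None assumes the file has no header row):

    import pandas as pd

    # Each row: 3*T features followed by the label (1 = infected, 0 = not infected)
    samples = pd.read_csv("N10S10/N10S10.csv", header=None)
    X = samples.iloc[:, :-1].values   # feature matrix with 3*T columns
    y = samples.iloc[:, -1].values    # labels

    print(X.shape, y.mean())          # dataset shape and fraction of infected samples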

    Additionally, we have placed two pcap files in the root directory. These are the traces used to compare both versions of SMB.

  16. ravdess in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Niu Yadong (2025). ravdess in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722524
    Explore at:
    tarAvailable download formats
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    xiaomi
    Authors
    Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the ravdess dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf ravdess_fold_0_0000000.tar |head
    -r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-01-01-11.json
    -r--r--r-- bigdata/bigdata 341912 2025-01-10 15:44 03-01-08-01-01-01-11.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-07-02-01-02-05.json
    -r--r--r-- bigdata/bigdata 424184 2025-01-10 15:44 03-01-07-02-01-02-05.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-06-01-01-02-10.json
    -r--r--r-- bigdata/bigdata 377100 2025-01-10 15:44 03-01-06-01-01-02-10.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-01-16.json
    -r--r--r-- bigdata/bigdata 396324 2025-01-10 15:44 03-01-08-01-02-01-16.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-02-22.json
    -r--r--r-- bigdata/bigdata 404388 2025-01-10 15:44 03-01-08-01-02-02-22.wav

    $ cat 03-01-08-01-01-01-11.json
    {"emotion": "surprised"}
  17. Fiji Land Use Land Cover Test Dataset

    • pacificdata.org
    • pacific-data.sprep.org
    geojson
    Updated Sep 15, 2023
    Cite
    John Duncan (2023). Fiji Land Use Land Cover Test Dataset [Dataset]. https://pacificdata.org/data/dataset/fiji-land-use-land-cover-test-dataset
    Explore at:
    geojson(136793)Available download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    John Duncan
    License

    Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2021 - Dec 31, 2021
    Area covered
    Fiji
    Description

    To evaluate land use and land cover (LULC) maps, an independent and representative test dataset is required. Here, a test dataset was generated via a stratified random sampling approach across all areas in Fiji not used to generate training data (i.e. all Tikinas which did not contain a training data point were valid for sampling to generate the test dataset). Following equation 13 in Olofsson et al. (2014), the sample size of the test dataset was 834. This was based on a desired standard error of the overall accuracy score of 0.01 and a user's accuracy of 0.75 for all classes. The strata for sampling test samples were the eight LULC classes: water, mangrove, bare soil, urban, agriculture, grassland, shrubland, and trees.

    There are different strategies for allocating samples to strata for evaluating LULC maps, as discussed by Olofsson et al. (2014). Equal allocation of samples to strata ensures coverage of rarely occurring classes and minimises the standard error of estimators of user's accuracy. However, equal allocation does not optimise the standard error of the estimator of overall accuracy. Proportional allocation of samples to strata, based on the proportion of the strata in the overall dataset, can result in rarely occurring classes being underrepresented in the test dataset. Optimal allocation of samples to strata is challenging to implement when there are multiple evaluation objectives. Olofsson et al. (2014) recommend a "simple" allocation procedure where 50 to 100 samples are allocated to rare classes and proportional allocation is used to allocate samples to the remaining majority classes. The number of samples to allocate to rare classes can be determined by iterating over different allocations and computing estimated standard errors for performance metrics. Here, the 2021 all-Fiji LULC map, minus the Tikinas used for generating training samples, was used to estimate the proportional areal coverage of each LULC class. The LULC map from 2021 was used to permit comparison with other LULC products with a 2021 layer, notably the ESA WorldCover 10m v200 2021 product.

    The 2021 LULC map was dominated by the tree class (74% of the area classified) and the remaining classes had less than 10% coverage each. Therefore, a "simple" allocation of 100 samples to the seven minority classes and an allocation of 133 samples to the tree class was used. This ensured all the minority classes had sufficient coverage in the test set while balancing the requirement to minimise standard errors for the estimate of overall accuracy. The allocated number of test dataset points were randomly sampled within each stratum and were manually labelled using 2021 annual median RGB composites from Sentinel-2 and Planet NICFI and high-resolution Google Satellite Basemaps.

    Data format

    The Fiji LULC test data is available in GeoJSON format in the file fiji-lulc-test-data.geojson. Each point feature has two attributes: ref_class (the LULC class manually labelled and quality checked) and strata (the stratum the sampled point belongs to, derived from the 2021 all-Fiji LULC map). The following integers correspond to the ref_class and strata labels:

    1. water
    2. mangrove
    3. bare earth / rock
    4. urban / impervious
    5. agriculture
    6. grassland
    7. shrubland
    8. tree
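
    A small sketch that reads the GeoJSON and tallies points per reference class and per stratum, using the two documented attributes (the integer codes map to the classes listed above):

    import json
    from collections import Counter

    with open("fiji-lulc-test-data.geojson") as f:
        fc = json.load(f)

    ref_counts = Counter(feat["properties"]["ref_class"] for feat in fc["features"])
    strata_counts = Counter(feat["properties"]["strata"] for feat in fc["features"])
    print("reference labels:", dict(sorted(ref_counts.items())))
    print("strata:", dict(sorted(strata_counts.items())))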

    Use

    When evaluating LULC maps using test data derived from a stratified sample, the nature of the stratified sampling needs to be accounted for when estimating performance metrics such as overall accuracy, user's accuracy, and producer's accuracy. This is particularly so if the strata do not match the map classes (i.e. when comparing different LULC products). Stehman (2014) provides formulas for estimating performance metrics and their standard errors when using test data with a stratified sampling structure.

    To support LULC accuracy assessment a Python package has been developed which provides implementations of Stehman's (2014) formulas. The package can be installed via:

    pip install lulc-validation
    

    with documentation and examples here.

    In order to compute performance metrics accounting for the stratified nature of the sample, the total number of points / pixels available to be sampled in each stratum must be known. For this dataset that is:

    1. 1779768,
    2. 3549325,
    3. 541204,
    4. 687659,
    5. 14279258,
    6. 15115599,
    7. 4972515,
    8. 116131948

    Acknowledgements

    This dataset was generated with support from a Climate Change AI Innovation Grant.

  18. A Dataset of Outdoor RSS Measurements for Localization

    • zenodo.org
    • data.niaid.nih.gov
    json, tiff, zip
    Updated Jul 15, 2024
    Cite
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara (2024). A Dataset of Outdoor RSS Measurements for Localization [Dataset]. http://doi.org/10.5281/zenodo.7259895
    Explore at:
    tiff, json, zipAvailable download formats
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Frost Mitchell; Aniqua Baset; Sneha Kumar Kasera; Aditya Bhaskara
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset is a large-scale set of measurements for RSS-based localization. The data consists of received signal strength (RSS) measurements taken using the POWDER Testbed at the University of Utah. Samples include either 0, 1, or 2 active transmitters.

    The dataset consists of 5,214 unique samples, with transmitters in 5,514 unique locations. The majority of the samples contain only 1 transmitter, but there are small sets of samples with 0 or 2 active transmitters, as shown below. Each sample has RSS values from between 10 and 25 receivers. The majority of the receivers are stationary endpoints fixed on the side of buildings, on rooftop towers, or on free-standing poles. A small set of receivers are located on shuttles which travel specific routes throughout campus.

    Dataset Description    Sample Count    Receiver Count
    No-Tx Samples          46              10 to 25
    1-Tx Samples           4822            10 to 25
    2-Tx Samples           346             11 to 12

    The transmitters for this dataset are handheld walkie-talkies (Baofeng BF-F8HP) transmitting in the FRS/GMRS band at 462.7 MHz. These devices have a rated transmission power of 1 W. The raw IQ samples were processed through a 6 kHz bandpass filter to remove neighboring transmissions, and the RSS value was calculated as follows:

    \(RSS = \frac{10}{N} \log_{10}\left(\sum_{i=1}^{N} x_i^2 \right)\)

    Measurement Parameter    Description
    Frequency                462.7 MHz
    Radio Gain               35 dB
    Receiver Sample Rate     2 MHz
    Sample Length            N=10,000
    Band-pass Filter         6 kHz
    Transmitters             0 to 2
    Transmission Power       1 W

Receivers consist of Ettus USRP X310 and B210 radios and a mix of wide- and narrow-band antennas, as shown in the table below. Each receiver took measurements with a receiver gain of 35 dB. However, the devices have different maximum gain settings, and no calibration data was available, so all RSS values in the dataset are uncalibrated and are only relative to the device.

    Usage Instructions

    Data is provided in .json format, both as one file and as split files.

    import json
    data_file = 'powder_462.7_rss_data.json'
    with open(data_file) as f:
      data = json.load(f)
    

    The json data is a dictionary with the sample timestamp as a key. Within each sample are the following keys:

    • rx_data: A list of data from each receiver. Each entry contains RSS value, latitude, longitude, and device name.
    • tx_coords: A list of coordinates for each transmitter. Each entry contains latitude and longitude.
• metadata: A list of dictionaries containing metadata for each transmitter, in the same order as the rows in tx_coords.
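As a short orientation sketch (not part of the dataset documentation; the exact layout of each rx_data entry, list versus dict, should be checked against the file itself), the loaded dictionary can be walked like this:

    # Sketch: count transmitters and receivers per sample in the loaded dict
    for timestamp, sample in data.items():
        n_tx = len(sample['tx_coords'])   # 0, 1, or 2 transmitters
        n_rx = len(sample['rx_data'])     # one entry per reporting receiver
        print(f"{timestamp}: {n_tx} transmitter(s), {n_rx} receiver(s)")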

    File Separations and Train/Test Splits

    In the separated_data.zip folder there are several train/test separations of the data.

    • all_data contains all the data in the main JSON file, separated by the number of transmitters.
    • stationary consists of 3 cases where a stationary receiver remained in one location for several minutes. This may be useful for evaluating localization using mobile shuttles, or measuring the variation in the channel characteristics for stationary receivers.
• train_test_splits contains unique data splits used for training and evaluating ML models. These splits use only data from the single-tx case. In other words, the union of all splits, along with unused.json, is equivalent to the file all_data/single_tx.json.
      • The random split is a random 80/20 split of the data.
      • special_test_cases contains the stationary transmitter data, indoor transmitter data (with high noise in GPS location), and transmitters off campus.
• The grid split divides the campus region into a 10 by 10 grid. Each grid square is assigned to the training or test set, with 80 squares in the training set and the remainder in the test set. If a square is assigned to the test set, none of its four neighbors are included in the test set. Transmitters occurring in each grid square are assigned to train or test accordingly. One such random assignment of grid squares makes up the grid split.
      • The seasonal split contains data separated by the month of collection, in April or July.
      • The transportation split contains data separated by the method of movement for the transmitter: walking, cycling, or driving. The non-driving.json file contains the union of the walking and cycling data.
• campus.json contains the on-campus data, so it is equivalent to the union of all splits, not including unused.json.

    Digital Surface Model

    The dataset includes a digital surface model (DSM) from a State of Utah 2013-2014 LiDAR survey. This map includes the University of Utah campus and surrounding area. The DSM includes buildings and trees, unlike some digital elevation models.

    To read the data in python:

    import rasterio as rio
    import numpy as np
    import utm
    
    dsm_object = rio.open('dsm.tif')
    dsm_map = dsm_object.read(1)   # a np.array containing elevation values
    dsm_resolution = dsm_object.res   # a tuple containing x,y resolution (0.5 meters) 
    dsm_transform = dsm_object.transform   # an Affine transform for conversion to UTM-12 coordinates
    utm_transform = np.array(dsm_transform).reshape((3,3))[:2]
    utm_top_left = utm_transform @ np.array([0,0,1])
utm_bottom_right = utm_transform @ np.array([dsm_object.shape[1], dsm_object.shape[0], 1])   # (col, row, 1): the affine transform maps (col, row) to (x, y)
    latlon_top_left = utm.to_latlon(utm_top_left[0], utm_top_left[1], 12, 'T')
    latlon_bottom_right = utm.to_latlon(utm_bottom_right[0], utm_bottom_right[1], 12, 'T')
    

Dataset Acknowledgement: This DSM file was acquired by the State of Utah and its partners; it is in the public domain and can be freely distributed with proper credit to the State of Utah and its partners. The State of Utah and its partners make no warranty, expressed or implied, regarding its suitability for a particular use, and shall not be liable under any circumstances for any direct, indirect, special, incidental, or consequential damages with respect to users of this product.

    DSM DOI: https://doi.org/10.5069/G9TH8JNQ
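A common follow-up (not part of the dataset documentation) is to look up the surface elevation at a given latitude/longitude; the sketch below assumes the raster uses UTM zone 12 coordinates, as the snippet above implies, and the point is hypothetical:

    import rasterio as rio
    import utm

    lat, lon = 40.765, -111.845   # hypothetical point on campus
    easting, northing, _, _ = utm.from_latlon(lat, lon)

    with rio.open('dsm.tif') as dsm:
        row, col = dsm.index(easting, northing)   # map coordinates -> pixel indices
        elevation = dsm.read(1)[row, col]
    print(f"Elevation at ({lat}, {lon}): {elevation:.1f} m")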

  19. London 1890s Ordnance Survey Text Layer

    • zenodo.org
    • data.niaid.nih.gov
    bin, png
    Updated Mar 20, 2025
Mengjie Zou; Remi Petitpierre; Isabella di Lenardo (2025). London 1890s Ordnance Survey Text Layer [Dataset]. http://doi.org/10.5281/zenodo.14982947
    Explore at:
png, bin (available download formats)
    Dataset updated
    Mar 20, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Mengjie Zou; Remi Petitpierre; Isabella di Lenardo
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    London
    Description

    This dataset contains a sample of 10,000 (3.5%) out of a total of 285,846 text sequences extracted from the 1891–1896 Map of London by the Ordnance Survey (OS).

    The methodology used for the automated recognition, linking, and sequencing of the text is detailed in the article Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer by M. Zou et al., 2025.

    Description of the content

The map is drawn at a scale of five feet to the mile (ca. 1:1,056). The text on the map is an invaluable source of information about Greater London in the late Victorian period. It includes the names of streets, squares, parks, watercourses, and even some estates ('Poplars', 'The Grange', 'Arbutus Lodge'). In addition, the map contains many details of the function of buildings and economic activity, such as factories ('Sweet Factory', 'Crown Linoleum Works', 'Imperial Flour Mills', 'Lion Brewery'), warehouses or commercial infrastructure ('Warehouse', 'Jamaica Wharf', 'Rag Store'), offices ('Offices'), etc. The map also mentions public buildings such as schools ('School Boys, Girls & Infants', 'Sunday School'), hospitals or clinics ('St. Saviour's Union Infirmary', 'Beulah Spa Hydropathic Establishment', 'South Western Fever Hospital'), railway stations ('Clapham Station'), post offices, banks, police stations, etc. Other social venues are also mentioned, such as public houses, i.e. pubs ('P.H.'), clubs, casinos, and recreational areas (e.g. 'Cricket Ground'). Special attention is given to churches, with a regular count of the number of seats (e.g. 'Baptist Chapel Seats for 600').

    In addition, the map provides details that can be of great interest in the study of everyday life in London at the end of the 19th century. For example, there are numerous mentions of 'Stables', 'Drinking Fountain'[s] (or simply 'Fn.') or 'Urinal'[s]. Fire protection infrastructure is highlighted, e.g. fire plugs ('F.P.') and fire alarms ('F.A.'). The map also includes information on elevation (e.g. '11·6') and flood levels (e.g. 'High Water Mark of Ordinary Tides').

    A list of abbreviations used in the Ordnance Survey maps, created by Richard Oliver [1], is made available by the National Library of Scotland (link).

    Organization of the data

    The data in 10k_text_london_OS_1890s.geojson is organized as a regular geojson file.

    Example structure

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "MultiPolygon",
        "coordinates": [[[ [x1, y1], [x2, y2], ... ]]]
      },
      "properties": {
        "label": "Oxford Circus"
      }
    },

    ... # Further text sequences

  ]
}
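A minimal loading sketch (not from the dataset documentation; assumes geopandas is installed and the filename given above):

    import geopandas as gpd

    gdf = gpd.read_file('10k_text_london_OS_1890s.geojson')
    print(len(gdf), 'text sequences')

    # e.g. find all public houses, abbreviated 'P.H.' on the map
    pubs = gdf[gdf['label'].str.contains('P.H.', regex=False, na=False)]
    print(pubs[['label']].head())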

    Image documents

    The original map document consists of 729 separate sheets, digitized, georeferenced, and served as geographic tiles by the National Library of Scotland [2].

    Descriptive statistics

    Total Number of text sequences: 285,846
    Sample size: 10,000
    Total Area covered: 450 square km

    Use and Citation

For any mention of this dataset, please cite:

    @misc{text_london_OS_1890s,
    author = {Zou, Mengjie and Petitpierre, R{\'{e}}mi and di Lenardo, Isabella},
    title = {{London 1890s Ordnance Survey Text Layer}},
    year = {2025},
    publisher = {Zenodo},
    url = {https://doi.org/10.5281/zenodo.14982946}}


    @article{recognizing_sequencing_2025,
    author = {Zou, Mengjie and Dai, Tianhao and Petitpierre, R{\'{e}}mi and Vaienti, Beatrice and di Lenardo, Isabella},
    title = {{Recognizing and Sequencing Multi-word Texts in Maps Using an Attentive Pointer}},
    year = {2025}}

    Corresponding author

    Rémi PETITPIERRE - remi.petitpierre@epfl.ch - ORCID - Github - Scholar - ResearchGate

    License

    This project is licensed under the CC BY 4.0 License.

    Liability

    We do not assume any liability for the use of this dataset.

    References

    1. Oliver R. (2013). Ordnance Survey maps: A concise guide for historians. The Charles Close Society. London, UK. 3rd Ed. 320 pages
    2. Ordnance Survey, London, five feet to the mile, 1893-1896 (1896), https://maps.nls.uk/os/townplans-england/london-1056-1890s.html, digitized by the National Library of Scotland (NLS)
  20. Transaction Graph Dataset for the Ethereum Blockchain

    • zenodo.org
    Updated Dec 19, 2022
    + more versions
Can Özturan; Alper Şen; Baran Kılıç (2022). Transaction Graph Dataset for the Ethereum Blockchain [Dataset]. http://doi.org/10.5281/zenodo.3669937
    Explore at:
    Dataset updated
    Dec 19, 2022
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Can Özturan; Alper Şen; Baran Kılıç
    Description

    This dataset contains ether as well as popular ERC20 token transfer transactions extracted from the Ethereum Mainnet blockchain.

Only ether transfer, contract function call, and contract deployment transactions are present in the dataset. Miner reward transactions are not currently included.

    Details of the datasets are given below:

FILENAME FORMAT:

The filenames have the following format:

eth-tx-<start_block>-<end_block>.txt.bz2

where <start_block> and <end_block> are the first and last block numbers covered by the file. For example, the file eth-tx-1000000-1099999.txt.bz2 contains transactions from block 1000000 to block 1099999 inclusive.

The files are compressed with bzip2. They can be uncompressed using the bunzip2 command.
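The compressed files can also be read directly from Python without unpacking them first; a minimal sketch, assuming plain text with one transaction per line:

    import bz2

    with bz2.open('eth-tx-1000000-1099999.txt.bz2', 'rt') as f:
        for i, line in enumerate(f):
            print(line.rstrip())
            if i == 4:        # show only the first few lines
                break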

TRANSACTION FORMAT:

Each line in a file corresponds to a transaction. The transaction has the following format:

units. ERC20 token transfers (transfer and transferFrom function calls in an ERC20 contract) are indicated by the token symbol. For example, GUSD is the Gemini USD stablecoin. The JSON file erc20tokens.json given below contains the details of the ERC20 tokens.

decoder-error.txt FILE:

This file contains, one per line, the transactions (block no, tx no, tx hash) that produced an error while decoding calldata. These transactions are not present in the data files.

erc20tokens.json FILE:

This file contains the list of popular ERC20 token contracts whose transfer/transferFrom transactions appear in the data files.

    -------------------------------------------------------------------------------------------

[
  {"address": "0xdac17f958d2ee523a2206206994597c13d831ec7", "decdigits": 6,  "symbol": "USDT", "name": "Tether-USD"},
  {"address": "0xB8c77482e45F1F44dE1745F52C74426C631bDD52", "decdigits": 18, "symbol": "BNB",  "name": "Binance"},
  {"address": "0x2af5d2ad76741191d15dfe7bf6ac92d4bd912ca3", "decdigits": 18, "symbol": "LEO",  "name": "Bitfinex-LEO"},
  {"address": "0x514910771af9ca656af840dff83e8264ecf986ca", "decdigits": 18, "symbol": "LNK",  "name": "Chainlink"},
  {"address": "0x6f259637dcd74c767781e37bc6133cd6a68aa161", "decdigits": 18, "symbol": "HT",   "name": "HuobiToken"},
  {"address": "0xf1290473e210b2108a85237fbcd7b6eb42cc654f", "decdigits": 18, "symbol": "HEDG", "name": "HedgeTrade"},
  {"address": "0x9f8f72aa9304c8b593d555f12ef6589cc3a579a2", "decdigits": 18, "symbol": "MKR",  "name": "Maker"},
  {"address": "0xa0b73e1ff0b80914ab6fe0444e65848c4c34450b", "decdigits": 8,  "symbol": "CRO",  "name": "Crypto.com"},
  {"address": "0xd850942ef8811f2a866692a623011bde52a462c1", "decdigits": 18, "symbol": "VEN",  "name": "VeChain"},
  {"address": "0x0d8775f648430679a709e98d2b0cb6250d2887ef", "decdigits": 18, "symbol": "BAT",  "name": "Basic-Attention"},
  {"address": "0xc9859fccc876e6b4b3c749c5d29ea04f48acb74f", "decdigits": 0,  "symbol": "INO",  "name": "INO-Coin"},
  {"address": "0x8e870d67f660d95d5be530380d0ec0bd388289e1", "decdigits": 18, "symbol": "PAX",  "name": "Paxos-Standard"},
  {"address": "0x17aa18a4b64a55abed7fa543f2ba4e91f2dce482", "decdigits": 18, "symbol": "INB",  "name": "Insight-Chain"},
  {"address": "0xc011a72400e58ecd99ee497cf89e3775d4bd732f", "decdigits": 18, "symbol": "SNX",  "name": "Synthetix-Network"},
  {"address": "0x1985365e9f78359a9B6AD760e32412f4a445E862", "decdigits": 18, "symbol": "REP",  "name": "Reputation"},
  {"address": "0x653430560be843c4a3d143d0110e896c2ab8ac0d", "decdigits": 16, "symbol": "MOF",  "name": "Molecular-Future"},
  {"address": "0x0000000000085d4780B73119b644AE5ecd22b376", "decdigits": 18, "symbol": "TUSD", "name": "True-USD"},
  {"address": "0xe41d2489571d322189246dafa5ebde1f4699f498", "decdigits": 18, "symbol": "ZRX",  "name": "ZRX"},
  {"address": "0x8ce9137d39326ad0cd6491fb5cc0cba0e089b6a9", "decdigits": 18, "symbol": "SXP",  "name": "Swipe"},
  {"address": "0x75231f58b43240c9718dd58b4967c5114342a86c", "decdigits": 18, "symbol": "OKB",  "name": "Okex"},
  {"address": "0xa974c709cfb4566686553a20790685a47aceaa33", "decdigits": 18, "symbol": "XIN",  "name": "Mixin"},
  {"address": "0xd26114cd6EE289AccF82350c8d8487fedB8A0C07", "decdigits": 18, "symbol": "OMG",  "name": "OmiseGO"},
  {"address": "0x89d24a6b4ccb1b6faa2625fe562bdd9a23260359", "decdigits": 18, "symbol": "SAI",  "name": "Sai Stablecoin v1.0"},
  {"address": "0x6c6ee5e31d828de241282b9606c8e98ea48526e2", "decdigits": 18, "symbol": "HOT",  "name": "HoloToken"},
  {"address": "0x6b175474e89094c44da98b954eedeac495271d0f", "decdigits": 18, "symbol": "DAI",  "name": "Dai Stablecoin"},
  {"address": "0xdb25f211ab05b1c97d595516f45794528a807ad8", "decdigits": 2,  "symbol": "EURS", "name": "Statis-EURS"},
  {"address": "0xa66daa57432024023db65477ba87d4e7f5f95213", "decdigits": 18, "symbol": "HPT",  "name": "HuobiPoolToken"},
  {"address": "0x4fabb145d64652a948d72533023f6e7a623c7c53", "decdigits": 18, "symbol": "BUSD", "name": "Binance-USD"},
  {"address": "0x056fd409e1d7a124bd7017459dfea2f387b6d5cd", "decdigits": 2,  "symbol": "GUSD", "name": "Gemini-USD"},
  {"address": "0x2c537e5624e4af88a7ae4060c022609376c8d0eb", "decdigits": 6,  "symbol": "TRYB", "name": "BiLira"},
  {"address": "0x4922a015c4407f87432b179bb209e125432e4a2a", "decdigits": 6,  "symbol": "XAUT", "name": "Tether-Gold"},
  {"address": "0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48", "decdigits": 6,  "symbol": "USDC", "name": "USD-Coin"},
  {"address": "0xa5b55e6448197db434b92a0595389562513336ff", "decdigits": 16, "symbol": "SUSD", "name": "Santender"},
  {"address": "0xffe8196bc259e8dedc544d935786aa4709ec3e64", "decdigits": 18, "symbol": "HDG",  "name": "HedgeTrade"},
  {"address": "0x4a16baf414b8e637ed12019fad5dd705735db2e0", "decdigits": 2,  "symbol": "QCAD", "name": "QCAD"}
]

    -------------------------------------------------------------------------------------------
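Given the erc20tokens.json contents listed above, the decdigits field converts raw on-chain integer amounts into human-readable token units; a minimal sketch (the helper below is illustrative, not part of the dataset tooling):

    import json

    with open('erc20tokens.json') as f:
        tokens = json.load(f)

    # map token symbol -> decimal digits
    decimals = {t['symbol']: t['decdigits'] for t in tokens}

    def to_units(raw_amount, symbol):
        """Convert a raw integer token amount to human-readable units."""
        return raw_amount / 10 ** decimals[symbol]

    print(to_units(2_500_000, 'USDT'))   # 2.5 USDT (USDT has 6 decimal digits)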
