100+ datasets found
  1. example-space-to-dataset-image-zip

    • huggingface.co
    Updated Jun 16, 2023
    + more versions
    Cite
    Lucain Pouget (2023). example-space-to-dataset-image-zip [Dataset]. https://huggingface.co/datasets/Wauplin/example-space-to-dataset-image-zip
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 16, 2023
    Authors
    Lucain Pouget
    Description
  2. example-space-to-dataset-json

    • huggingface.co
    Updated May 26, 2025
    + more versions
    Cite
    m (2025). example-space-to-dataset-json [Dataset]. https://huggingface.co/datasets/mmwmm/example-space-to-dataset-json
    Explore at:
    Dataset updated
    May 26, 2025
    Authors
    m
    Description

    mmwmm/example-space-to-dataset-json dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. Country Polygons as GeoJSON

    • datahub.io
    Updated Sep 1, 2017
    + more versions
    Cite
    (2017). Country Polygons as GeoJSON [Dataset]. https://datahub.io/core/geo-countries
    Explore at:
    Dataset updated
    Sep 1, 2017
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    A geodata data package providing GeoJSON polygons for all the world's countries.

  4. USA states GeoJson

    • kaggle.com
    Updated Aug 18, 2020
    Cite
    Kate Gallo (2020). USA states GeoJson [Dataset]. https://www.kaggle.com/pompelmo/usa-states-geojson/discussion
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 18, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kate Gallo
    Area covered
    United States
    Description

    Context

    I created this dataset to help people create choropleth maps of US states.

    Content

    One GeoJSON file to plot the state borders, and one CSV from the Census Bureau with the US population per state.

    Inspiration

    I think the best way to use this dataset is to join it with other data. For example, I used it to plot police killings with the data from https://www.kaggle.com/jpmiller/police-violence-in-the-us
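
    As an illustration of that join, here is a minimal choropleth sketch using folium and pandas; the file and column names are hypothetical and should be adapted to the actual files in this dataset:

    import folium
    import pandas as pd

    # Hypothetical file and column names; adapt to the dataset's actual files
    geojson_path = "us-states.geojson"
    pop = pd.read_csv("us_population_by_state.csv")  # assumed columns: state, population

    m = folium.Map(location=[39.8, -98.6], zoom_start=4)
    folium.Choropleth(
        geo_data=geojson_path,
        data=pop,
        columns=["state", "population"],   # assumed CSV columns
        key_on="feature.properties.name",  # assumed GeoJSON property holding the state name
        fill_color="YlGn",
    ).add_to(m)
    m.save("choropleth.html")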

  5. Data from: 3DHD CityScenes: High-Definition Maps in High-Density Point...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    Fricke, Jenny (2024). 3DHD CityScenes: High-Definition Maps in High-Density Point Clouds [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7085089
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Sertolli, Benjamin
    Fricke, Jenny
    Klingner, Marvin
    Fingscheidt, Tim
    Plachetka, Christopher
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections in the inner city of Hamburg, Germany, including 467 km of individual lanes. In total, our map comprises 266,762 individual items.

    Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here.

    Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:

    Python tools to read, generate, and visualize the dataset,

    3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.

    The DevKit is available here:

    https://github.com/volkswagen/3DHD_devkit.

    The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.

    When using our dataset, you are welcome to cite:

    @INPROCEEDINGS{9921866,
    author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
    booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
    title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
    year={2022},
    pages={627-634}}

    Acknowledgements

    We thank the following interns for their exceptional contributions to our work.

    Benjamin Sertolli: Major contributions to our DevKit during his master thesis

    Niels Maier: Measurement campaign for data collection and data preparation

    The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.

    The Dataset

    After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.

    1. Dataset

    This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.

    During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.

    To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.

    import json

    json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
    with open(json_path) as jf:
        data = json.load(jf)
    print(data)

    2. HD_Map

    Map items are stored as lists of items in JSON format. In particular, we provide:

    traffic signs,

    traffic lights,

    pole-like objects,

    construction site locations,

    construction site obstacles (point-like such as cones, and line-like such as fences),

    line-shaped markings (solid, dashed, etc.),

    polygon-shaped markings (arrows, stop lines, symbols, etc.),

    lanes (ordinary and temporary),

    relations between elements (only for construction sites, e.g., sign to lane association).

    3. HD_Map_MetaData

    Our high-density point cloud, used as the basis for annotating the HD map, is split into 648 tiles. This directory contains the geolocation for each tile as a polygon on the map. You can view the respective tile definitions using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.

    Files with the extensions .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as a "shapefile" (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information in a different format used by our Python API.

    4. HD_PointCloud_Tiles

    The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.

    x-coordinates: 4 byte integer

    y-coordinates: 4 byte integer

    z-coordinates: 4 byte integer

    intensity of reflected beams: 2 byte unsigned integer

    ground classification flag: 1 byte unsigned integer

    After reading, the respective values have to be unnormalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.

    import numpy as np
    import pptk

    file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
    pc_dict = {}
    key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
    type_list = ['<i4', '<i4', '<i4', '<u2', '<u1']  # truncated in the source; dtypes reconstructed from the byte layout above
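
    The snippet above is cut off in the source listing. Below is a minimal sketch of a complete reader, reconstructed from the byte layout documented above (a leading 4-byte point count, followed by one contiguous array per attribute); the unnormalization factors are not given here and should be taken from the DevKit:

    import numpy as np

    file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
    key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
    type_list = ['<i4', '<i4', '<i4', '<u2', '<u1']

    with open(file_path, 'rb') as f:
        # First 4 bytes: number of points in the tile
        num_points = int(np.fromfile(f, dtype='<i4', count=1)[0])
        pc_dict = {}
        for key, dtype in zip(key_list, type_list):
            # Attributes are stored back to back: all x-values, then all y-values, ...
            pc_dict[key] = np.fromfile(f, dtype=dtype, count=num_points)

    # The values still need to be unnormalized; see the 3DHD_devkit for the scale factors.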

  6. Country State GeoJSON

    • kaggle.com
    zip
    Updated Apr 27, 2020
    Cite
    Mukesh Chapagain (2020). Country State GeoJSON [Dataset]. https://www.kaggle.com/chapagain/country-state-geo-location
    Explore at:
    Available download formats: zip (286136 bytes)
    Dataset updated
    Apr 27, 2020
    Authors
    Mukesh Chapagain
    Description

    About

    World country and state coordinates for plotting geospatial maps.

    Source

    Files source:

    1. Folium GitHub Repository:

  7. Wireless HotSpots (GEOJSON)

    • data.gov.sg
    Updated Jun 6, 2024
    Cite
    Info-communications Media Development Authority (2024). Wireless HotSpots (GEOJSON) [Dataset]. https://data.gov.sg/datasets/d_d8644084f8b54f851a1acbb2f04d5089/view
    Explore at:
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    Infocomm Media Development Authority (http://www.imda.gov.sg/)
    Authors
    Info-communications Media Development Authority
    License

    https://data.gov.sg/open-data-licence

    Description

    Dataset from Info-communications Media Development Authority. For more information, visit https://data.gov.sg/datasets/d_d8644084f8b54f851a1acbb2f04d5089/view

  8. JSON Repository

    • data.amerigeoss.org
    csv, geojson, json +1
    Updated Jun 4, 2025
    Cite
    UN Humanitarian Data Exchange (2025). JSON Repository [Dataset]. https://data.amerigeoss.org/dataset/json-repository
    Explore at:
    Available download formats: csv(9901), csv(779), csv(462610), json(3411081), geojson(543777), geojson(545299), geojson(365288), json(1132925), geojson(366788), csv(177073), geojson(162605), json(2064743), json(520472), geojson(953043), geojson(886086), json(457832), geojson(222216), geojson(9124), csv(85982), geojson(164379), csv(457), csv(242), json(3401512), csv(669568), json(461423), json(876253), csv(6789), csv(536), json(640845), json(707249), csv(358964), geojson(135805), csv(4907), csv(177), json(327649), csv(9980), geojson(709673), geojson(54889), geojson(2396630), json(632081), topojson(2728099), csv(845984), geojson(178718), json(559095), json(1975854), geojson(74470), geojson(219728), geojson(1324722), json(3478518)
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    United Nations (http://un.org/)
    Description

    This dataset contains resources transformed from other datasets on HDX. They exist here only in a format modified to support visualization on HDX and may not be as up to date as the source datasets from which they are derived.

    Source datasets: https://data.hdx.rwlabs.org/dataset/idps-data-by-region-in-mali

  9. json_large_sample

    • kaggle.com
    Updated Dec 1, 2023
    + more versions
    Cite
    Noura Aly (2023). json_large_sample [Dataset]. https://www.kaggle.com/datasets/nouraaly/json-large-sample
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Noura Aly
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Noura Aly

    Released under Apache 2.0

    Contents

  10. example-space-to-dataset-json

    • huggingface.co
    + more versions
    Cite
    Ahmad Sohrabi, example-space-to-dataset-json [Dataset]. https://huggingface.co/datasets/CognitiveScience/example-space-to-dataset-json
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Authors
    Ahmad Sohrabi
    Description

    CognitiveScience/example-space-to-dataset-json dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • data.niaid.nih.gov
    Updated Jan 6, 2023
    Cite
    Andrej Hrovat (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7509279
    Explore at:
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    Mihael Mohorčič
    Miha Mohorčič
    Andrej Hrovat
    Aleš Simončič
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame part of a PR consists of variable-length fields, called Information Elements (IE), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.

    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:
    - MAC address
    - supported data rates
    - extended supported rates
    - HT capabilities
    - extended capabilities
    - data under extended tag and vendor specific tag
    - interworking
    - VHT capabilities
    - RSSI
    - SSID
    - timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.

    Data preprocessing

    The gateway collects PRs during each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IE fields are saved in the following JSON structure:

    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
            ...
        }
    }

    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IE data is represented in hexadecimal format. The Vendor Specific Tag is structured differently than the other IEs: this field can contain multiple vendor IDs, each with multiple data IDs and corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_data.

    When a new MAC address is detected in the current scan time interval, the data from the PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    { 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.

    This data structure makes it possible to store only 'TIME' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the already stored data for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the values for the keys 'TIME' and 'RSSI' are appended. If identical PR IE data from the same MAC address has not yet been received, the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.

    Folder structure

    For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device.

    The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected between 23 September 2022 00:00 local time and 24 September 2022 00:00 local time.

    Files represent their location via the following mapping:
    - 1.json -> location 1
    - 2.json -> location 2
    - 3.json -> location 3
    - 4.json -> location 4
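
    A minimal loading sketch, assuming each location file holds a JSON array of the per-MAC records described above (see Single_PR_capture_example.json for the exact layout):

    import json

    # Hypothetical path following the folder naming scheme above
    path = "2022-09-22T22-00-00_2022-09-23T22-00-00/1.json"
    with open(path) as f:
        records = json.load(f)

    # Count distinct (possibly randomized) MAC addresses seen at this location
    macs = {record['MAC'] for record in records}
    print(len(macs))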

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and personally checked for correct installation and for the data status of the entire data collection system. Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:
    - location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
    - location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    - location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    - location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage areas would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes, the service on the recording device is restarted. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data are not recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day, which shows up as missing data of up to a few minutes.

    Location 1 - Piazza del Duomo - Chierici

    The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location was constant and undisturbed, and the dataset appears to have complete coverage from this location.

    Location 2 - Via Etnea - Piazza del Duomo

    The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill; however, its exact movements cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days of the record contain no PRs from this location.

    Location 3 - Via Etnea - Piazza Università

    Similar to location 2, the device was placed on the windowsill and moved around by people working in the building, e.g., placed on the windowsill during the day and moved behind a thick wall when no people were present. This device appears to have been collecting data throughout the whole dataset period.

    Location 4 - Piazza Università

    This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment, and the internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.

  12. URA Parking Lot (GEOJSON)

    • data.gov.sg
    Updated Jun 6, 2024
    Cite
    Urban Redevelopment Authority (2024). URA Parking Lot (GEOJSON) [Dataset]. https://data.gov.sg/datasets/d_d959102fa76d58f2de276bfbb7e8f68e/view
    Explore at:
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    Urban Redevelopment Authority (http://ura.gov.sg/)
    License

    https://data.gov.sg/open-data-licence

    Description

    Dataset from Urban Redevelopment Authority. For more information, visit https://data.gov.sg/datasets/d_d959102fa76d58f2de276bfbb7e8f68e/view

  13. Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL

    • zenodo.org
    bin, json, txt
    Updated Aug 16, 2021
    + more versions
    Cite
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson (2021). Spider Realistic Dataset In Structure-Grounded Pretraining for Text-to-SQL [Dataset]. http://doi.org/10.5281/zenodo.5205322
    Explore at:
    Available download formats: txt, json, bin
    Dataset updated
    Aug 16, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the Spider-Realistic dataset used for evaluation in the paper "Structure-Grounded Pretraining for Text-to-SQL". The dataset is created based on the dev split of the Spider dataset (2020-06-07 version from https://yale-lily.github.io/spider). We manually modified the original questions to remove the explicit mention of column names while keeping the SQL queries unchanged to better evaluate the model's capability in aligning the NL utterance and the DB schema. For more details, please check our paper at https://arxiv.org/abs/2010.12773.

    It contains the following files:

    - spider-realistic.json
    # The spider-realistic evaluation set
    # Examples: 508
    # Databases: 19
    - dev.json
    # The original dev split of Spider
    # Examples: 1034
    # Databases: 20
    - tables.json
    # The original DB schemas from Spider
    # Databases: 166
    - README.txt
    - license

    The Spider-Realistic dataset is created based on the dev split of the Spider dataset released by Yu, Tao, et al., "Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task." It is a subset of the original dataset with explicit mentions of the column names removed. The SQL queries and databases are kept unchanged.
    For the format of each json file, please refer to the github page of Spider https://github.com/taoyds/spider.
    For the database files please refer to the official Spider release https://yale-lily.github.io/spider.
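
    As a quick sanity check, the evaluation set can be loaded like any Spider-format JSON file (a sketch assuming the file holds a JSON array of examples):

    import json

    with open("spider-realistic.json") as f:
        examples = json.load(f)
    print(len(examples))        # expected: 508 examples
    print(sorted(examples[0]))  # field names follow the Spider format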

    This dataset is distributed under the CC BY-SA 4.0 license.

    If you use the dataset, please cite the following papers, including the original Spider dataset, Finegan-Dollak et al. (2018), and the original datasets for Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp.

    @article{deng2020structure,
    title={Structure-Grounded Pretraining for Text-to-SQL},
    author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
    journal={arXiv preprint arXiv:2010.12773},
    year={2020}
    }

    @inproceedings{Yu&al.18c,
    year = 2018,
    title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
    booktitle = {EMNLP},
    author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev }
    }

    @InProceedings{P18-1033,
    author = "Finegan-Dollak, Catherine
    and Kummerfeld, Jonathan K.
    and Zhang, Li
    and Ramanathan, Karthik
    and Sadasivam, Sesh
    and Zhang, Rui
    and Radev, Dragomir",
    title = "Improving Text-to-SQL Evaluation Methodology",
    booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2018",
    publisher = "Association for Computational Linguistics",
    pages = "351--360",
    location = "Melbourne, Australia",
    url = "http://aclweb.org/anthology/P18-1033"
    }

    @InProceedings{data-sql-imdb-yelp,
    dataset = {IMDB and Yelp},
    author = {Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig},
    title = {SQLizer: Query Synthesis from Natural Language},
    booktitle = {International Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM},
    month = {October},
    year = {2017},
    pages = {63:1--63:26},
    url = {http://doi.org/10.1145/3133887},
    }

    @article{data-academic,
    dataset = {Academic},
    author = {Fei Li and H. V. Jagadish},
    title = {Constructing an Interactive Natural Language Interface for Relational Databases},
    journal = {Proceedings of the VLDB Endowment},
    volume = {8},
    number = {1},
    month = {September},
    year = {2014},
    pages = {73--84},
    url = {http://dx.doi.org/10.14778/2735461.2735468},
    }

    @InProceedings{data-atis-geography-scholar,
    dataset = {Scholar, and Updated ATIS and Geography},
    author = {Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, Jayant Krishnamurthy, and Luke Zettlemoyer},
    title = {Learning a Neural Semantic Parser from User Feedback},
    booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    year = {2017},
    pages = {963--973},
    location = {Vancouver, Canada},
    url = {http://www.aclweb.org/anthology/P17-1089},
    }

    @inproceedings{data-geography-original,
    dataset = {Geography, original},
    author = {John M. Zelle and Raymond J. Mooney},
    title = {Learning to Parse Database Queries Using Inductive Logic Programming},
    booktitle = {Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 2},
    year = {1996},
    pages = {1050--1055},
    location = {Portland, Oregon},
    url = {http://dl.acm.org/citation.cfm?id=1864519.1864543},
    }

    @inproceedings{data-restaurants-logic,
    author = {Lappoon R. Tang and Raymond J. Mooney},
    title = {Automated Construction of Database Interfaces: Integrating Statistical and Relational Learning for Semantic Parsing},
    booktitle = {2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora},
    year = {2000},
    pages = {133--141},
    location = {Hong Kong, China},
    url = {http://www.aclweb.org/anthology/W00-1317},
    }

    @inproceedings{data-restaurants-original,
    author = {Ana-Maria Popescu, Oren Etzioni, and Henry Kautz},
    title = {Towards a Theory of Natural Language Interfaces to Databases},
    booktitle = {Proceedings of the 8th International Conference on Intelligent User Interfaces},
    year = {2003},
    location = {Miami, Florida, USA},
    pages = {149--157},
    url = {http://doi.acm.org/10.1145/604045.604070},
    }

    @inproceedings{data-restaurants,
    author = {Alessandra Giordani and Alessandro Moschitti},
    title = {Automatic Generation and Reranking of SQL-derived Answers to NL Questions},
    booktitle = {Proceedings of the Second International Conference on Trustworthy Eternal Systems via Evolving Software, Data and Knowledge},
    year = {2012},
    location = {Montpellier, France},
    pages = {59--76},
    url = {https://doi.org/10.1007/978-3-642-45260-4_5},
    }

  14. User memories from Cultural Heritage Search

    • data.europa.eu
    unknown
    Cite
    User memories from Cultural Heritage Search [Dataset]. https://data.europa.eu/data/datasets/https-data-norge-no-node-2123?locale=en
    Explore at:
    Available download formats: unknown
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    User memories from Cultural Heritage Search is a dataset (in the form of a file dump) that consists of the audience's own registrations in the online solution Kulturminnesøk. Cultural Heritage Search is an intermediary service from the Directorate for Cultural Heritage and is run by an editorial board. The information about most of the cultural monuments in Cultural Heritage Search comes from the cultural heritage database Askeladden, which is managed by the Directorate for Cultural Heritage, but users can also contribute their own user memories to the solution. The dataset follows the GeoJSON-LD standard: it can be read and used as regular GeoJSON, but it also has a semantic component that allows it to be processed as JSON-LD. For more information, see http://geojson.org/geojson-ld/. Each document can be linked with Cultural Heritage Search. For example, the document with ID http://kulturminnesok.no/fm/gilahytta-1 can be retrieved from Cultural Heritage Search as follows: https://kulturminnesok.no/minne/?queryString=http://kulturminnesok.no/fm/gilahytta-1

  15. Concurrent LC MHM Polygons

    • globe-data-igestrategies.hub.arcgis.com
    • geospatial.strategies.org
    Updated Jan 7, 2023
    + more versions
    Cite
    Institute for Global Environmental Strategies (2023). Concurrent LC MHM Polygons [Dataset]. https://globe-data-igestrategies.hub.arcgis.com/datasets/concurrent-lc-mhm-polygons
    Explore at:
    Dataset updated
    Jan 7, 2023
    Dataset authored and provided by
    Institute for Global Environmental Strategies
    Area covered
    Description

    This feature layer consists of paired GLOBE Observer Mosquito Habitat Mapper (MHM) and GLOBE Observer Land Cover (LC) observation data resulting from the following processing steps.

    MHM
    - GeoJSON data was pulled from this GLOBE API URL: https://api.globe.gov/search/v1/measurement/protocol/measureddate/?protocols=mosquito_habitat_mapper&startdate=2017-05-01&enddate=2022-12-31&geojson=TRUE&sample=FALSE
    - Only device-reported measurements are kept: "DataSource" = "GLOBE Observer App"
    - As we are only interested in device measurements, latitude and longitude are determined from "MeasurementLatitude" and "MeasurementLongitude".
    - All instances of duplicate photos have been removed from the dataset.

    LC
    - GeoJSON data was pulled from this GLOBE API URL: https://api.globe.gov/search/v1/measurement/protocol/measureddate/?protocols=land_covers&startdate=2018-09-01&enddate=2022-12-31&geojson=TRUE&sample=FALSE
    - Only device-reported measurements are kept: "DataSource" = "GLOBE Observer App"
    - As we are only interested in device measurements, latitude and longitude are determined from "MeasurementLatitude" and "MeasurementLongitude".

    Concurrence
    These two layers were then combined using a spatiotemporal join with the following conditions:
    - Tool: Geoanalytics Desktop Tools -> Join Features
    - Target Layer: LC
    - Join Type: one to many
    - Join Layer: MHM
    - Coordinate fields used: MeasurementLatitude, MeasurementLongitude
    - Time fields used: MeasuredAt (UTC time)
    - Spatial Proximity: 100 meters (NEAR_GEODESIC)
    - Temporal Proximity: 60 minutes (NEAR)
    - Attribute match: UserID

    The result is a dataset consisting of all paired instances where the same observer (UserID) collected a Mosquito Habitat Mapper observation within 100 meters and 1 hour of collecting a Land Cover observation.

    Additional fields include:
    - 'lc_mhm_obsID_pair': a string identifying the two paired observations: "{lc_LandCoverId}_{mhm_MosquitoHabitatMapperId}"
    - 'lc_latlon': a string representing the coordinates of the LC observation: "({lc_MeasurementLatitude}, {lc_MeasurementLongitude})"
    - 'mhm_latlon': a string representing the coordinates of the MHM observation: "({mhm_MeasurementLatitude}, {mhm_MeasurementLongitude})"
    - 'spatialDistanceMeters': numeric value representing the distance between the two paired observations in meters
    - 'temporalDistanceMinutes': numeric value representing the time delta between the two paired observations in minutes
    - 'squareBuffer': a polygon string representing a 100 m square centered on the LC observation coordinates. This may be used in conjunction with additional map layers to evaluate the land cover types near the observation coordinates. (N.B. this is not the buffer used in calculating spatiotemporal concurrence.)

    For the purposes of this visualization, the geometry is a 100 m x 100 m square centered on the Land Cover observation coordinates.
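
    A sketch of pulling the MHM GeoJSON from the API URL above and applying the device filter; it assumes the endpoint returns a standard GeoJSON FeatureCollection and that the property is named "DataSource" as quoted in the description (the live API may use a prefixed name):

    import requests

    url = ("https://api.globe.gov/search/v1/measurement/protocol/measureddate/"
           "?protocols=mosquito_habitat_mapper&startdate=2017-05-01"
           "&enddate=2022-12-31&geojson=TRUE&sample=FALSE")
    resp = requests.get(url, timeout=300)
    resp.raise_for_status()
    collection = resp.json()

    # Keep only device-reported measurements, as in the processing steps above
    features = [f for f in collection["features"]
                if f["properties"].get("DataSource") == "GLOBE Observer App"]
    print(len(features))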

  16. Polygon Data | Marinas in US and Canada | Map & Geospatial Insights

    • datarade.ai
    Updated Mar 23, 2023
    Cite
    Xtract (2023). Polygon Data | Marinas in US and Canada | Map & Geospatial Insights [Dataset]. https://datarade.ai/data-products/xtract-io-geometry-data-marinas-in-us-and-canada-xtract
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Mar 23, 2023
    Dataset authored and provided by
    Xtract
    Area covered
    United States, Canada
    Description

    This specialized location dataset delivers detailed information about marina establishments. Maritime industry professionals, coastal planners, and tourism researchers can leverage precise location insights to understand maritime infrastructure, analyze recreational boating landscapes, and develop targeted strategies.

    How Do We Create Polygons?
    - All our polygons are manually crafted using advanced GIS tools like QGIS, ArcGIS, and similar applications. This involves leveraging aerial imagery and street-level views to ensure precision.
    - Beyond visual data, our expert GIS data engineers integrate venue layout/elevation plans sourced from official company websites to construct detailed indoor polygons. This meticulous process ensures higher accuracy and consistency.
    - We verify our polygons through multiple quality checks, focusing on accuracy, relevance, and completeness.

    What's More?
    - Custom Polygon Creation: Our team can build polygons for any location or category based on your specific requirements. Whether it’s a new retail chain, transportation hub, or niche point of interest, we’ve got you covered.
    - Enhanced Customization: In addition to polygons, we capture critical details such as entry and exit points, parking areas, and adjacent pathways, adding greater context to your geospatial data.
    - Flexible Data Delivery Formats: We provide datasets in industry-standard formats like WKT, GeoJSON, Shapefile, and GDB, making them compatible with various systems and tools.
    - Regular Data Updates: Stay ahead with our customizable refresh schedules, ensuring your polygon data is always up to date for evolving business needs.

    Unlock the Power of POI and Geospatial Data
    With our robust polygon datasets and point-of-interest data, you can:
    - Perform detailed market analyses to identify growth opportunities.
    - Pinpoint the ideal location for your next store or business expansion.
    - Decode consumer behavior patterns using geospatial insights.
    - Execute targeted, location-driven marketing campaigns for better ROI.
    - Gain an edge over competitors by leveraging geofencing and spatial intelligence.

    Why Choose LocationsXYZ? LocationsXYZ is trusted by leading brands to unlock actionable business insights with our spatial data solutions. Join our growing network of successful clients who have scaled their operations with precise polygon and POI data. Request your free sample today and explore how we can help accelerate your business growth.

  17. Atlas of the Working Group I Contribution to the IPCC Sixth Assessment...

    • catalogue.ceda.ac.uk
    Updated Jun 19, 2023
    Cite
    Maialen Iturbide; José Manuel Gutiérrez; Joaquín Bedia; Ezequiel Cimadevilla; Javier Díez-Sierra; Rodrigo Manzanas; Ana Casanueva; Jorge Baño-Medina; Josipa Milovac; Sixto Milovac; Antonio S. Cofiño; Daniel San Martín; Markel García-Díez; Mathias Hauser; David Huard; Özge Yelekci; Jesús Fernández (2023). Atlas of the Working Group I Contribution to the IPCC Sixth Assessment Report - data for Figure Atlas.2 (v20221104) [Dataset]. https://catalogue.ceda.ac.uk/uuid/789ad030299342ea99534edfb62450d9
    Explore at:
    Dataset updated
    Jun 19, 2023
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Maialen Iturbide; José Manuel Gutiérrez; Joaquín Bedia; Ezequiel Cimadevilla; Javier Díez-Sierra; Rodrigo Manzanas; Ana Casanueva; Jorge Baño-Medina; Josipa Milovac; Sixto Milovac; Antonio S. Cofiño; Daniel San Martín; Markel García-Díez; Mathias Hauser; David Huard; Özge Yelekci; Jesús Fernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1850 - Dec 31, 2099
    Area covered
    Earth
    Description

    Data for Figure Atlas.2 from Atlas of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).

    Figure Atlas.2 shows WGI reference regions used in the (a) AR5 and (b) AR6 reports.

    How to cite this dataset

    When citing this dataset, please include both the data citation below (under 'Citable as') and the following citations:

    For the report component from which the figure originates:
    Gutiérrez, J.M., R.G. Jones, G.T. Narisma, L.M. Alves, M. Amjad, I.V. Gorodetskaya, M. Grose, N.A.B. Klutse, S. Krakovska, J. Li, D. Martínez-Castro, L.O. Mearns, S.H. Mernild, T. Ngo-Duc, B. van den Hurk, and J.-H. Yoon, 2021: Atlas. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 1927–2058, doi:10.1017/9781009157896.021

    Iturbide, M. et al., 2021: Repository supporting the implementation of FAIR principles in the IPCC-WG1 Interactive Atlas. Zenodo. Retrieved from: http://doi.org/10.5281/zenodo.5171760

    Figure subpanels

    The figure has two panels, with data provided for both panels in the master GitHub repository linked in the documentation.

    Data provided in relation to figure

    This dataset contains the corner coordinates defining each reference region for the second panel of the figure, which contain coordinate information at a 0.44º resolution. The repository directory 'reference-regions' contains data provided for the reference regions as polygons in different formats (CSV with coordinates, R data, shapefile and geojson) together with R and Python notebooks illustrating the use of these regions with worked examples.
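
    For instance, the GeoJSON version of the reference regions can be inspected with geopandas (a sketch; the file name inside the 'reference-regions' directory is hypothetical):

    import geopandas as gpd

    # Hypothetical file name; check the 'reference-regions' directory for the actual one
    regions = gpd.read_file("reference-regions/IPCC-WGI-reference-regions.geojson")
    print(regions.head())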

    Data for reference regions for AR5 can be found here: https://catalogue.ceda.ac.uk/uuid/a3b6d7f93e5c4ea986f3622eeee2b96f

    CMIP5 and CMIP6 are the fifth and sixth phases of the Coupled Model Intercomparison Project. CORDEX is the Coordinated Regional Downscaling Experiment from the WCRP. AR5 and AR6 refer to the 5th and 6th Assessment Reports of the IPCC. WGI stands for Working Group I.

    Notes on reproducing the figure from the provided data

    Data and figures produced by the Jupyter Notebooks live inside the notebooks directory. The notebooks describe step by step the basic process followed to generate some key figures of the AR6 WGI Atlas and some products underpinning the Interactive Atlas, such as reference regions, global warming levels, aggregated datasets. They include comments and hints to extend the analysis, thus promoting reusability of the results. These notebooks are provided as guidance for practitioners, more user friendly than the code provided as scripts in the reproducibility folder.

    Some of the notebooks require access to large data volumes outside this repository. To speed up the execution of the notebooks, in addition to the full code to access the data, we provide a data-loading shortcut by storing intermediate results in the auxiliary-material folder of this repository. To test other parameter settings, the full data access instructions should be followed, which can involve long waiting times.

    Sources of additional information

    The following weblinks are provided in the Related Documents section of this catalogue record: - Link to the figure on the IPCC AR6 website - Link to the report component containing the figure (Atlas) - Link to the Supplementary Material for Atlas, which contains details on the input data used in Table Atlas.SM.15. - Link to the code for the figure, archived on Zenodo. - Link to the necessary notebooks for reproducing the figure from GitHub. - Link to IPCC AR5 reference regions dataset

  18. Hydroclimatic atlas 2022

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +1 more
    csv, geojson, html +3
    Updated May 1, 2025
    Cite
    Government and Municipalities of Québec (2025). Hydroclimatic atlas 2022 [Dataset]. https://open.canada.ca/data/dataset/8bc217ff-d25d-4f55-a9a7-ada3df4b29a7
    Explore at:
    Available download formats: csv, geojson, pdf, zip, html, shp
    Dataset updated
    May 1, 2025
    Dataset provided by
    Government and Municipalities of Québec
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Time period covered
    Jan 1, 1970 - Dec 31, 2100
    Description

    Data of the 2022 Hydroclimatic Atlas

    Description

    The Hydroclimatic Atlas describes the current and future water regime of southern Quebec in order to support the implementation of water management practices that are resilient to climate change. These data are from the most recent version of the Hydroclimatic Atlas.

    What's new

    - Improved spatial resolution of the hydrographic network;
    - Greater spatial coverage;
    - Addition of the ClimEx and CORDEX-NA ensembles, in addition to the scenarios of the CMIP5 ensemble;
    - Use of six hydrological platforms;
    - Addition of indicators, especially annual ones;
    - Etc.

    List of available data

    - Link to the new Hydroclimatic Atlas website.
    - Map of the 24,604 river sections of the Hydroclimatic Atlas with their attributes, available in GeoJSON and shapefile format. To facilitate download and display, the map is divided into 11 GeoJSON files: ABIT (Abitibi and Lac Abitibi region), CND west (North Shore regions A and B), CND east (North Shore regions C, D and E), GASP (Gaspésie), MONT (Montérégie), OUTM (Outaouais upstream), OUTV (Outaouais downstream), SAGU (Saguenay), SLNO (St-Laurent Nord-Ouest), SLSO (St-Laurent Sud-Ouest), and VAUD (Vaudreuil).
    - The CSV tables ("Magnitude...") for each of the 76 hydrological indicators, describing the magnitude, direction and dispersion for RCP 4.5 and RCP 8.5 for the three future horizons (see the documentation for details).
    - The CSV tables ("Projected indicator...") for each of the 76 hydrological indicators, detailing the flow values with their uncertainty for the historical period and the three future horizons (RCP 4.5 and 8.5). See the documentation for more details.
    - A PDF with the metadata and a more detailed description of the data.

    Note

    The 2018 version of the data is archived on Données Québec for reference, for example for old reports or analyses referring to that version. Any new study or analysis should use the most recent data available below or on the Atlas website.

    This third-party metadata element was translated using an automated translation tool (Amazon Translate).

  19. DataCite Public Data

    • redivis.com
    application/jsonl +7
    Updated Dec 12, 2024
    + more versions
    Cite
    Redivis Demo Organization (2024). DataCite Public Data [Dataset]. https://redivis.com/datasets/7wec-6vgw8qaaq
    Explore at:
    Available download formats: application/jsonl, arrow, spss, csv, stata, sas, avro, parquet
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Redivis Inc.
    Authors
    Redivis Demo Organization
    Description

    Abstract

    The DataCite Public Data File contains metadata records in JSON format for all DataCite DOIs in Findable state that were registered up to the end of 2023.

    This dataset represents a processed version of the Public Data File, where the data have been extracted and loaded into a Redivis dataset.

    Methodology


    Records have descriptive metadata for research outputs and resources structured according to the DataCite Metadata Schema and include links to other persistent identifiers (PIDs) for works (DOIs), people (ORCID iDs), and organizations (ROR IDs).

    Use of the DataCite Public Data File is subject to the DataCite Data File Use Policy.

    Usage

    This dataset is a processed version of the DataCite Public Data File, where the original file (a 23GB .tar.gz) has been extracted into 55,239 JSONL files, which were then concatenated into a single JSONL file.

    This JSONL file has been imported into a Redivis table to facilitate further exploration and analysis.

    A sample project demonstrating how to query the DataCite data file can be found here: https://redivis.com/projects/hx1e-a6w8vmwsx
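
    For local work with the raw Public Data File, a sketch of streaming the concatenated JSONL (one DataCite metadata record per line; the local file name is hypothetical):

    import json

    # Hypothetical local name for the concatenated file
    with open("datacite_public_data.jsonl") as f:
        for line in f:
            record = json.loads(line)  # one Findable DOI record per line
            print(record.get("id"))
            break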

  20. Data from: A Dataset of Bot and Human Activities in GitHub

    • zenodo.org
    json, txt
    Updated Jan 5, 2024
    + more versions
    Cite
    Natarajan Chidambaram; Alexandre Decan; Tom Mens (2024). A Dataset of Bot and Human Activities in GitHub [Dataset]. http://doi.org/10.5281/zenodo.8219470
    Explore at:
    Available download formats: json, txt
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Natarajan Chidambaram; Alexandre Decan; Tom Mens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Dataset of Bot and Human Activities in GitHub

    This repository provides an updated version of a dataset of GitHub contributor activities, accompanied by a paper published at MSR 2023 in the Data and Tool Showcase Track. The paper is entitled A Dataset of Bot and Human Activities in GitHub and is co-authored by Natarajan Chidambaram, Alexandre Decan and Tom Mens (Software Engineering Lab, University of Mons, Belgium). DOI: https://www.doi.org/10.1109/MSR59073.2023.00070. This work was done as part of Natarajan Chidambaram's PhD research in the context of the DigitalWallonia4.AI research project ARIAC (grant number 2010235) and TRAIL.

    The dataset contains 1,015,422 high-level activities made by 350 bots and 620 human contributors on GitHub between 25 November 2022 and 15 April 2023. The activities were generated from 1,221,907 low-level events obtained from GitHub's Events API and cover 24 distinct activity types. This dataset facilitates the characterisation of bot and human behaviour in GitHub repositories by enabling the analysis of activity sequences and activity patterns of bot and human contributors. It could lead to better bot identification tools and to empirical studies of how bots participate in collaborative software development.

    Files description

    The following files are provided as part of the archive:

    • bot_activities.json - A JSON file containing 754,165 activities made by 350 bot contributors;
    • human_activities.json - A JSON file containing 261,258 activities made by 620 human contributors (anonymized);
    • JsonSchema.json - A JSON schema that validates the above datasets;
    • bots.txt - A TEXT file containing login names of all the 350 bots

    Example

    Below is an example of a Closing pull request activity:

    {
     "date": "2022-11-25T18:49:09+00:00",
     "activity": "Closing pull request",
     "contributor": "typescript-bot",
     "repository": "DefinitelyTyped/DefinitelyTyped",
     "comment": {
       "length": 249,
       "GH_node": "IC_kwDOAFz6BM5PJG7l"
     },
     "pull_request": {
       "id": 62328,
       "title": "[qunit] Add `test.each()`",
       "created_at": "2022-09-19T17:34:28+00:00",
       "status": "closed",
       "closed_at": "2022-11-25T18:49:08+00:00",
       "merged": false,
       "GH_node": "PR_kwDOAFz6BM4_N5ib"
     },
     "conversation": {
       "comments": 19
     },
     "payload": {
       "pr_commits": 1,
       "pr_changed_files": 5
     }
    }
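
    A sketch of loading the activity files and checking them against the provided schema with the jsonschema package (assuming each JSON file holds a list of activity objects):

    import json
    from jsonschema import validate

    with open("JsonSchema.json") as f:
        schema = json.load(f)
    with open("bot_activities.json") as f:
        activities = json.load(f)

    validate(instance=activities, schema=schema)  # raises ValidationError on mismatch
    print(len(activities), "bot activities loaded")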

    List of activity types

    In total, we have identified 24 different high-level activity types from 15 different low-level event types. They are Creating repository, Creating branch, Creating tag, Deleting tag, Deleting repository, Publishing a release, Making repository public, Adding collaborator to repository, Forking repository, Starring repository, Editing wiki page, Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request, Commenting pull request, Commenting pull request changes, Reviewing code, Commenting commits, Pushing commits.

    List of fields

    Not only does the dataset contain a list of activities made by bot and human contributors, but it also contains some details about these activities. For example, commenting issue activities provide details about the author of the comment, the repository and issue in which the comment was created, and so on.

    For all activity types, we provide the date of the activity, the contributor that made the activity, and the repository in which the activity took place. Depending on the activity type, additional fields are provided. In this section, we describe, for each activity type, the different fields that are provided in the JSON file. It is worth mentioning that we also provide the corresponding JSON schema alongside the datasets.

    Properties

    • date
      • Date on which the activity is performed
      • Type: string
      • e.g., "2022-11-25T09:55:19+00:00"
      • String format must be a "date-time"

    • activity
      • The activity performed by the contributor
      • Type: string
      • e.g., "Commenting pull request"
    • contributor
      • The login name of the contributor who performed this activity
      • Type: string
      • e.g., "analysis-bot", "anonymised" in the case of a human contributor
    • repository
      • The repository in which the activity is performed
      • Type: string
      • e.g., "apache/spark", "anonymised" in the case of a human contributor
    • issue
      • Issue information - provided for Opening issue, Closing issue, Reopening issue, Transferring issue and Commenting issue
      • Type: object
      • Properties
        • id
          • Issue number
          • Type: integer
          • e.g., 35471
        • title
          • Issue title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this issue is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the issue
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this issue is closed. "null" will be provided if the issue is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • resolved
          • The issue is resolved or not_planned/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this issue
          • Type: string
          • e.g., "IC_kwDOC27xRM5PHTBU", "anonymised" in the case of a human contributor
    • pull_request
      • Pull request information - provided for Opening pull request, Closing pull request, Reopening pull request, Commenting pull request changes and Reviewing code
      • Type: object
      • Properties
        • id
          • Pull request number
          • Type: integer
          • e.g., 35471
        • title
          • Pull request title
          • Type: string
          • e.g., "error building handtracking gpu example with bazel", "anonymised" in the case of a human contributor
        • created_at
          • The date on which this pull request is created
          • Type: string
          • e.g., "2022-11-10T13:07:23+00:00"
          • String format must be a "date-time"
        • status
          • Current state of the pull request
          • Type: string
          • "open" or "closed"
        • closed_at
          • The date on which this pull request is closed. "null" will be provided if the pull request is open
          • Types: string, null
          • e.g., "2022-11-25T10:42:39+00:00"
          • String format must be a "date-time"
        • merged
          • The PR is merged or rejected/still open
          • Type: boolean
          • true or false
        • GH_node
          • The GitHub node of this pull request
          • Type: string
          • e.g., "PR_kwDOC7Q2kM5Dsu3-", "anonymised" in the case of a human contributor
    • review
      • Pull request review information - provided for Reviewing code
      • Type: object
      • Properties
        • status
          • Status of the review
          • Type: string
          • "changes_requested" or "approved" or "dismissed"
        • GH_node
          • The GitHub node of this review
          • Type: string
          • e.g., "PRR_kwDOEBHXU85HLfIn", "anonymised" in the case of a human contributor
    • conversation
      • Comments information in issue or pull request - Provided for Opening issue, Closing issue, Reopening issue, Transferring issue, Commenting issue, Opening pull request, Closing pull request, Reopening pull request and Commenting pull request
      • Type: object
      • Properties
        • comments
          • Number of comments present in the corresponding issue or pull request
          • Type: integer
          • e.g.,
