A demo for saving data from a Space to a Dataset. The goal is to provide reusable code snippets.
Documentation: https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads
Space: https://huggingface.co/spaces/Wauplin/space_to_dataset_saver/
JSON dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json
Image dataset: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-image
Image (zipped) dataset: … See the full description on the dataset page: https://huggingface.co/datasets/Wauplin/example-space-to-dataset-json.
CSV output from https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py
The taichi256/example-space-to-dataset-json dataset is hosted on Hugging Face and contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Aside from reporting quality results to the user via the console, an analysis run produces a quality report written to disk in both HTML and JSON formats. The creation of either output format can be disabled. The JSON files can be imported by MultiQC. This ZIP archive includes example results of both BAM and KMER modes, as well as the resulting MultiQC report.
This dataset contains resources transformed from other datasets on HDX. They exist here only in a format modified to support visualization on HDX and may not be as up to date as the source datasets from which they are derived.
Source datasets: https://data.hdx.rwlabs.org/dataset/idps-data-by-region-in-mali
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Diverse Restricted JSON Data Extraction
Curated by: The paraloq analytics team.
Uses
Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
Fine-tune a data extraction model (text + JSON schema -> JSON instance)
Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)
Out-of-Scope Use
Intended for research purposes only.
Dataset Structure
The data comes with the following fields:
title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
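The restricted-extraction task described above (text + JSON schema -> JSON instance) can be sketched with a minimal structural check. The schema, text, and field names below are illustrative assumptions, not the dataset's actual contents, and a real benchmark would use a full JSON Schema validator.

```python
# Illustrative sketch of the restricted-extraction task: a text snippet,
# a target JSON schema, and the instance a model should produce.
# Schema and field names here are hypothetical, not from the dataset.
schema = {
    "type": "object",
    "required": ["title", "year"],
    "properties": {"title": {"type": "string"}, "year": {"type": "integer"}},
}
text = "The report 'Q3 Outlook' was released in 2021."
extracted = {"title": "Q3 Outlook", "year": 2021}

def satisfies_required(instance: dict, schema: dict) -> bool:
    # Minimal check: every required key is present.
    return all(key in instance for key in schema.get("required", []))

print(satisfies_required(extracted, schema))  # True
```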
The DataCite Public Data File contains metadata records in JSON format for all DataCite DOIs in Findable state that were registered up to the end of 2023.
This dataset represents a processed version of the Public Data File, where the data have been extracted and loaded into a Redivis dataset.
Records have descriptive metadata for research outputs and resources structured according to the DataCite Metadata Schema and include links to other persistent identifiers (PIDs) for works (DOIs), people (ORCID iDs), and organizations (ROR IDs).
Use of the DataCite Public Data File is subject to the DataCite Data File Use Policy.
This dataset is a processed version of the DataCite Public Data File: the original file (a 23 GB .tar.gz) has been extracted into 55,239 JSONL files, which were then concatenated into a single JSONL file.
This JSONL file has been imported into a Redivis table to facilitate further exploration and analysis.
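Since the concatenated JSONL file is far too large to load at once, a streaming reader is the natural access pattern. This is a hedged sketch, not part of the dataset's tooling; the file path is a placeholder.

```python
import json

def iter_jsonl(path):
    # Stream one record at a time so the concatenated JSONL file
    # never has to fit in memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between concatenated files
                yield json.loads(line)

# Example usage (path is a placeholder):
# for record in iter_jsonl("datacite_2023.jsonl"):
#     print(record.get("doi"))
```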
A sample project demonstrating how to query the DataCite data file can be found here: https://redivis.com/projects/hx1e-a6w8vmwsx
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accessing data in structured formats such as XML, CSV and JSON in statically typed languages is difficult, because the languages do not understand the structure of the data. Dynamically typed languages make this syntactically easier, but lead to error-prone code. Despite numerous efforts, most of the data available on the web do not come with a schema. The only information available to developers is a set of examples, such as typical server responses. We describe an inference algorithm that infers types for structured formats including CSV, XML and JSON. The algorithm is based on finding a common supertype of the types representing individual samples (or values in collections). We use the algorithm as the basis for an F# type provider that integrates the inference into the F# type system. As a result, users can access CSV, XML and JSON data in a statically typed fashion just by specifying a representative sample document.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system and the messages sent within control systems-of-systems. For more information, see the attached data documentation.

The data comes in two semicolon-separated (;) CSV files, training.csv and test.csv. The train/test split is not random; the training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; validation data should instead be randomly selected from the training data.

The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make the data easier to use.

The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
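Reading the semicolon-separated files can be sketched as follows. The description above recommends pandas (`pd.read_csv(path, sep=";")`); the sketch below uses only the standard library, and the column names in the usage comment are assumptions, not the dataset's documented schema.

```python
import csv

def read_semicolon_csv(path):
    # training.csv and test.csv are semicolon-separated; read each row
    # as a dict keyed by the header line.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter=";"))

# Example usage (column names are hypothetical):
# rows = read_semicolon_csv("training.csv")
# print(rows[0]["temperature"])
```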
Privacy policy: https://crawlfeeds.com/privacy_policy
We have successfully extracted a comprehensive news dataset from CNBC, covering not only financial updates but also an extensive range of news categories relevant to diverse audiences in Europe, the US, and the UK. This dataset includes over 500,000 records, meticulously structured in JSON format for seamless integration and analysis.
This extensive extraction spans multiple segments, such as:
Each record in the dataset is enriched with metadata tags, enabling precise filtering by region, sector, topic, and publication date.
The comprehensive news dataset provides real-time insights into global developments, corporate strategies, leadership changes, and sector-specific trends. Designed for media analysts, research firms, and businesses, it empowers users to perform:
Additionally, the JSON format ensures easy integration with analytics platforms for advanced processing.
Looking for a rich repository of structured news data? Visit our news dataset collection to explore additional offerings tailored to your analysis needs.
To get a preview, check out the CSV sample of the CNBC economy articles dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
3DHD CityScenes is the most comprehensive, large-scale high-definition (HD) map dataset to date, annotated in the three spatial dimensions of globally referenced, high-density LiDAR point clouds collected in urban domains. Our HD map covers 127 km of road sections of the inner city of Hamburg, Germany including 467 km of individual lanes. In total, our map comprises 266,762 individual items.
Our corresponding paper (published at ITSC 2022) is available here. Further, we have applied 3DHD CityScenes to map deviation detection here.
Moreover, we release code to facilitate the application of our dataset and the reproducibility of our research. Specifically, our 3DHD_DevKit comprises:
Python tools to read, generate, and visualize the dataset,
3DHDNet deep learning pipeline (training, inference, evaluation) for map deviation detection and 3D object detection.
The DevKit is available here:
https://github.com/volkswagen/3DHD_devkit.
The dataset and DevKit have been created by Christopher Plachetka as project lead during his PhD period at Volkswagen Group, Germany.
When using our dataset, you are welcome to cite:
@INPROCEEDINGS{9921866,
  author={Plachetka, Christopher and Sertolli, Benjamin and Fricke, Jenny and Klingner, Marvin and Fingscheidt, Tim},
  booktitle={2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC)},
  title={3DHD CityScenes: High-Definition Maps in High-Density Point Clouds},
  year={2022},
  pages={627-634}}
Acknowledgements
We thank the following interns for their exceptional contributions to our work.
Benjamin Sertolli: Major contributions to our DevKit during his master thesis
Niels Maier: Measurement campaign for data collection and data preparation
The European large-scale project Hi-Drive (www.Hi-Drive.eu) supports the publication of 3DHD CityScenes and encourages the general publication of information and databases facilitating the development of automated driving technologies.
The Dataset
After downloading, the 3DHD_CityScenes folder provides five subdirectories, which are explained briefly in the following.
This directory contains the training, validation, and test set definition (train.json, val.json, test.json) used in our publications. Respective files contain samples that define a geolocation and the orientation of the ego vehicle in global coordinates on the map.
During dataset generation (done by our DevKit), samples are used to take crops from the larger point cloud. Also, map elements in reach of a sample are collected. Both modalities can then be used, e.g., as input to a neural network such as our 3DHDNet.
To read any JSON-encoded data provided by 3DHD CityScenes in Python, you can use the following code snippet as an example.
import json

json_path = r"E:\3DHD_CityScenes\Dataset\train.json"
with open(json_path) as jf:
    data = json.load(jf)
print(data)
Map items are stored as lists of items in JSON format. In particular, we provide:
traffic signs,
traffic lights,
pole-like objects,
construction site locations,
construction site obstacles (point-like such as cones, and line-like such as fences),
line-shaped markings (solid, dashed, etc.),
polygon-shaped markings (arrows, stop lines, symbols, etc.),
lanes (ordinary and temporary),
relations between elements (only for construction sites, e.g., sign to lane association).
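Since map items are stored as JSON lists per category, a quick inventory can be sketched as below. This assumes a single JSON file mapping category names to item lists; the actual file layout and key names in 3DHD CityScenes may differ, so treat this as an illustrative pattern only.

```python
import json

def count_map_items(path):
    # Assumed layout: {"traffic_signs": [...], "lanes": [...], ...}.
    # Category names here are hypothetical, not the dataset's exact schema.
    with open(path, encoding="utf-8") as f:
        categories = json.load(f)
    return {name: len(items) for name, items in categories.items()}
```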
Our high-density point cloud used as the basis for annotating the HD map is split into 648 tiles. This directory contains the geolocation of each tile as a polygon on the map. You can view the respective tile definition using QGIS. Alternatively, we also provide the respective polygons as lists of UTM coordinates in JSON.
Files with the extensions .dbf, .prj, .qpj, .shp, and .shx belong to the tile definition as a "shape file" (commonly used in geodesy) that can be viewed using QGIS. The JSON file contains the same information provided in a different format used in our Python API.
The high-density point cloud tiles are provided in global UTM32N coordinates and are encoded in a proprietary binary format. The first 4 bytes (integer) encode the number of points contained in that file. Subsequently, all point cloud values are provided as arrays. First all x-values, then all y-values, and so on. Specifically, the arrays are encoded as follows.
x-coordinates: 4 byte integer
y-coordinates: 4 byte integer
z-coordinates: 4 byte integer
intensity of reflected beams: 2 byte unsigned integer
ground classification flag: 1 byte unsigned integer
After reading, the respective values have to be un-normalized. As an example, you can use the following code snippet to read the point cloud data. For visualization, you can use the pptk package, for instance.
import numpy as np
import pptk

file_path = r"E:\3DHD_CityScenes\HD_PointCloud_Tiles\HH_001.bin"
pc_dict = {}
key_list = ['x', 'y', 'z', 'intensity', 'is_ground']
# dtype strings reconstructed from the format described above
# (4-byte ints for x/y/z, 2-byte unsigned for intensity, 1-byte
# unsigned for the ground flag); the original snippet is truncated here.
type_list = ['<i4', '<i4', '<i4', '<u2', '<u1']
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
SAE sample data (JSON)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example JSON research object output from www.clinicalcodes.org for clinical codes from a research article. See https://github.com/rOpenHealth/ClinicalCodes/tree/master/paper
Licence Ouverte / Open Licence: https://www.etalab.gouv.fr/licence-ouverte-open-licence
This dataset contains the articles published on the Covid-19 FAQ for companies published by the Directorate-General for Enterprises at https://info-entreprises-covid19.economie.fr
The data are presented in JSON format as follows:

[
  {
    "title": "Example article for documentation",
    "content": [
      "this is the first page of the article.",
      "here the second,",
      "<div>these articles incorporate some HTML formatting</div>"
    ],
    "path": ["Folder to visit in the FAQ", "to reach the article"]
  },
  ...
]

The update is done every day at 6:00 UTC. The data are extracted directly from the site; the source code of the script used to extract them is available here: https://github.com/chrnin/docCovidDGE
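Loading the dump and listing article titles can be sketched as follows, assuming the top-level structure shown above (a list of objects with "title", "content", and "path" keys); the file name is a placeholder.

```python
import json

def article_titles(path):
    # Assumes the documented layout: a JSON list of article objects,
    # each with "title", "content", and "path" keys.
    with open(path, encoding="utf-8") as f:
        articles = json.load(f)
    return [article["title"] for article in articles]

# Example usage (file name is a placeholder):
# print(article_titles("faq_articles.json"))
```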
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A JSON file used as an example to illustrate queries and to benchmark some tools.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by samsatp
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web", published in Nature Scientific Data. If you use the dataset, please cite the manuscript as follows:

Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y

We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:

- The set of linked data graphs from the LSLOD cloud from which schemas are extracted.
- Refined sets of extracted classes, object properties, data properties, and datatypes shared across the linked data graphs on the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, this is explicitly indicated.
- The LSLOD schema graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided, along with broad-level characteristics of the modeled content.

The LSLOD schema graph is saved as a JSON pickle file. To read the JSON object in this pickle file, use Python as follows:

import pickle

with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
    x = pickle.load(infile, encoding='iso-8859-1')

Check the referenced link for more details on this research, raw data files, and code references.