8 datasets found

Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Cite
Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
Huan Sun
Alyssa Lees
Xiang Deng
Cong Yu
You Wu
Yu Su
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

# path to the uncompressed files, should be a directory with a set of tar files
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
    wds.Dataset(url)
    .shuffle(1000)  # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)  # group every 20 examples into a batch
)

Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

You can also iterate through all examples and dump them with your preferred data format
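
As a minimal sketch of the dump step, the shards can also be read with Python's standard tarfile module instead of WebDataset. It assumes that each example is stored as a tar member whose name ends in ".json" (suggested by the .to_tuple("json") call in the snippet above, but not verified here). The function name dump_shards_to_jsonl and the output path are hypothetical.

```python
import json
import tarfile
from pathlib import Path


def dump_shards_to_jsonl(shard_dir, out_path):
    """Write every JSON example found in the .tar shards to one JSON Lines file.

    Assumption: each example is a tar member whose name ends in ".json".
    Returns the number of examples written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for shard in sorted(Path(shard_dir).glob("*.tar")):
            with tarfile.open(shard) as tar:
                for member in tar:
                    # Skip directories and any non-JSON members.
                    if not (member.isfile() and member.name.endswith(".json")):
                        continue
                    example = json.load(tar.extractfile(member))
                    out.write(json.dumps(example, ensure_ascii=False) + "\n")
                    count += 1
    return count
```

Reading the shards in sorted order keeps the output reproducible; any shuffling can then be done downstream on the JSON Lines file.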

Below we show how the data is organized with two examples.

Text-only

{
  's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
  's1_all_links': {
    'Sils,_Girona': [[0, 4]],
    'municipality': [[10, 22]],
    'Comarques_of_Catalonia': [[30, 37]],
    'Selva': [[41, 46]],
    'Catalonia': [[51, 60]]
  },  # list of entities and their mentions in the sentence (start, end location)
  'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
    {
      'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
      's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
      's2s': [  # list of other sentences that contain the common entity pair, or evidence
        {
          'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
          'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
          's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
          'pair_locs': [  # mentions of the entity pair in the evidence
            [[19, 27]],  # mentions of entity 1
            [[0, 5], [288, 293]]  # mentions of entity 2
          ],
          'all_links': {
            'Selva': [[0, 5], [288, 293]],
            'Comarques_of_Catalonia': [[19, 27]],
            'Catalonia': [[40, 49]]
          }
        },
        ...
      ]  # there are multiple evidence sentences
    },
    ...
  ]  # there are multiple entity pairs in the query
}
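
The mention locations in the text-only example read as character offsets with an exclusive end, which is consistent with the query sentence shown: slicing it at [30, 37] gives "comarca". A small sketch of recovering the mention strings follows; the helper name mention_texts is hypothetical and not part of the dataset's code.

```python
def mention_texts(text, links):
    """Map each entity to the substrings selected by its [start, end) spans."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


s1_text = "Sils is a municipality in the comarca of Selva, in Catalonia, Spain."
s1_all_links = {
    "Sils,_Girona": [[0, 4]],
    "municipality": [[10, 22]],
    "Comarques_of_Catalonia": [[30, 37]],
    "Selva": [[41, 46]],
    "Catalonia": [[51, 60]],
}

for entity, mentions in mention_texts(s1_text, s1_all_links).items():
    print(entity, "->", mentions)
# Sils,_Girona -> ['Sils']
# municipality -> ['municipality']
# Comarques_of_Catalonia -> ['comarca']
# Selva -> ['Selva']
# Catalonia -> ['Catalonia']
```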

Hybrid

{
  's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
  's1_all_links': {...},  # same as text-only
  'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
  'table_pairs': [
    {
      'tid': 'Major_League_Baseball-1',
      'text': [
        ['World Series Records', 'World Series Records', ...],
        ['Team', 'Number of Series won', ...],
        ['St. Louis Cardinals (NL)', '11', ...],
        ...
      ],  # table content, list of rows
      'index': [
        [[0, 0], [0, 1], ...],
        [[1, 0], [1, 1], ...],
        ...
      ],  # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
      'value_ranks': [
        [0, 0, ...],
        [0, 0, ...],
        [0, 10, ...],
        ...
      ],  # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS
      'value_inv_ranks': [],  # inverse rank
      'all_links': {
        'St._Louis_Cardinals': {
          '2': [
            [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
          ]  # list of mentions in the second row, the key is row_id
        },
        'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
      },
      'name': '',  # table name, if exists
      'pairs': {
        'pair': ['American_League', 'National_League'],
        's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
        'table_pair_locs': {
          '17': [  # mention of entity pair in row 17
            [
              [[17, 0], [3, 18]],
              [[17, 1], [3, 18]],
              [[17, 2], [3, 18]],
              [[17, 3], [3, 18]]
            ],  # mention of the first entity
            [
              [[17, 0], [21, 36]],
              [[17, 1], [21, 36]],
            ]  # mention of the second entity
          ]
        }
      }
    }
  ]
}
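
Because the cell indices in the hybrid example refer to the original table while only a snippet is kept, a lookup has to go through the 'index' field. The sketch below shows one way this could be done on a toy table built from the visible cells of the example above (the elided cells are simply left out). The helper name cell_mentions is hypothetical, and skipping mentions whose cell is not in the snippet is an assumption of this sketch, not documented behaviour.

```python
def cell_mentions(table, entity):
    """Return the cell substrings that mention `entity` in a table snippet.

    Each mention is [[row_id, col_id], [start, end]] with the cell index taken
    from the original table; table["index"] translates it to a snippet position.
    """
    position = {
        tuple(cell_index): (r, c)
        for r, row in enumerate(table["index"])
        for c, cell_index in enumerate(row)
    }
    found = []
    for mentions in table["all_links"].get(entity, {}).values():
        for cell_index, (start, end) in mentions:
            pos = position.get(tuple(cell_index))
            if pos is None:
                continue  # cell is not part of the kept snippet
            r, c = pos
            found.append(table["text"][r][c][start:end])
    return found


# Toy table: only the cells visible in the example, everything else omitted.
toy_table = {
    "text": [
        ["World Series Records", "World Series Records"],
        ["Team", "Number of Series won"],
        ["St. Louis Cardinals (NL)", "11"],
    ],
    "index": [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],
    ],
    "all_links": {
        "St._Louis_Cardinals": {"2": [[[2, 0], [0, 19]]]},
        "CARDINAL:11": {"2": [[[2, 1], [0, 2]]], "8": [[[8, 3], [0, 2]]]},
    },
}

print(cell_mentions(toy_table, "St._Louis_Cardinals"))  # ['St. Louis Cardinals']
print(cell_mentions(toy_table, "CARDINAL:11"))          # ['11']
```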
Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...
zenodo.org
bin, csv, pdf
Updated Jan 10, 2025
Cite
Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick (2025). Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection" [Dataset]. http://doi.org/10.5281/zenodo.14614218
Explore at:
Available download formats: bin, pdf, csv
Unique identifier
https://doi.org/10.5281/zenodo.14614218
Dataset updated
Jan 10, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

Summary

Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916

This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

spectf_cloud_labelbox.hdf5

1,841,641 Labeled spectra from 221 EMIT Scenes.

spectf_cloud_mmgis.hdf5

1,733,801 Labeled spectra from 313 EMIT Scenes.

These scenes were specifically labeled to correct false detections by an earlier version of the model.

train_fids.csv

465 EMIT scenes comprising the training set.

test_fids.csv

69 EMIT scenes comprising the held-out validation set.

v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and Falsecolor, their existing baseline cloud masks, as well as their cloud masks produced by the ANN and GBT reference models and the SpecTf model.

Data Description

221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per-class per-scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.

The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.
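
The split itself ships with the dataset as train_fids.csv and test_fids.csv, so it does not need to be recomputed. Purely to illustrate what splitting by scene means, here is a minimal sketch in which whole FIDs, never individual spectra, are assigned to one side. The function names, the seed, and the demo FIDs are made up for this example.

```python
import random


def split_by_scene(fids, test_fraction, seed=0):
    """Randomly assign whole scenes (FIDs) to a train set or a test set."""
    unique = sorted(set(fids))
    random.Random(seed).shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    return set(unique[n_test:]), set(unique[:n_test])


def spectra_indices(fids, wanted):
    """Indices of the spectra whose scene FID belongs to `wanted`."""
    return [i for i, fid in enumerate(fids) if fid in wanted]


# One FID per spectrum, as in the 'fids' array (values here are invented).
demo_fids = ["sceneA", "sceneA", "sceneB", "sceneC", "sceneC", "sceneD"]
train_scenes, test_scenes = split_by_scene(demo_fids, test_fraction=0.25)
train_idx = spectra_indices(demo_fids, train_scenes)
test_idx = spectra_indices(demo_fids, test_scenes)
```

Since membership is decided per FID, no scene can contribute spectra to both index lists.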

Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a PyTorch dataloader.

Each hdf5 file contains the following arrays:

'spectra'

Top-of-Atmosphere reflectance calculated from the EMIT L1B Radiance product

Float64 of shape (n, 268)

'fids'

The FID from which each spectrum was sampled

Binary string of shape (n,)

'indices'

The (col, row) index from which each spectrum was sampled

Int64 of shape (n, 2)

'labels'

Annotation label of each spectrum

0 - "Clear"

1 - "Cloud"

2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)

label[label==2] = 0

Int64 of shape (n,2)

Each hdf5 file contains the following attribute:

'bands'

The band center wavelengths (nm) of the spectrum

Float64 of shape (268,)

Acknowledgements

The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

© 2024 California Institute of Technology. Government sponsorship acknowledged.
Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi...
data.niaid.nih.gov
zenodo.org
Updated Apr 4, 2025
Cite
Strohmayer, Julian (2025). Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8188998
Explore at:
Dataset updated
Apr 4, 2025
Dataset provided by
Strohmayer, Julian
Kampel, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].

PyTorch Dataloader

A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k

Dataset Description

The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).

To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:

LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system

LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system

NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system

NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system

These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.
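
One way to read the amplitude step is: pair consecutive entries of H into complex numbers (imaginary part first, then real part) and take the magnitude of each. The sketch below does this with the standard library only. It assumes H has already been parsed from the "data" field into a flat list of numbers, and the function name csi_amplitudes is hypothetical.

```python
import math


def csi_amplitudes(h):
    """Amplitudes of complex CSI stored as [I, R, I, R, ...] (imaginary first)."""
    if len(h) % 2:
        raise ValueError("expected an even number of entries")
    # Entry i is the imaginary part and entry i + 1 the real part of one subcarrier.
    return [math.hypot(h[i + 1], h[i]) for i in range(0, len(h), 2)]


print(csi_amplitudes([3, 4, 0, -5, 6, 8]))  # [5.0, 5.0, 10.0]
```

With NumPy, the equivalent would be to build a complex array from the two interleaved halves and call numpy.abs on it.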

To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:

# 52 L-LTF subcarriers
csi_valid_subcarrier_index = []
csi_valid_subcarrier_index += [i for i in range(6, 32)]
csi_valid_subcarrier_index += [i for i in range(33, 59)]

Additional 56 HT-LTF subcarriers can be selected via:

# 56 HT-LTF subcarriers
csi_valid_subcarrier_index += [i for i in range(66, 94)]
csi_valid_subcarrier_index += [i for i in range(95, 123)]
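
The index ranges above can be checked and applied with a few lines. This is only a sketch: the names L_LTF, HT_LTF and select_subcarriers are hypothetical, and the 128-entry amplitude vector in the demo is an assumption chosen so that the largest index (122) is valid.

```python
# Index lists built from the same ranges as in the snippets above.
L_LTF = list(range(6, 32)) + list(range(33, 59))      # 26 + 26 = 52 indices
HT_LTF = list(range(66, 94)) + list(range(95, 123))   # 28 + 28 = 56 indices


def select_subcarriers(amplitudes, indices):
    """Pick the amplitudes at the given subcarrier indices, in order."""
    return [amplitudes[i] for i in indices]


# Demo amplitude vector (placeholder values, assumed length 128).
demo_amplitudes = [float(i) for i in range(128)]
l_ltf_only = select_subcarriers(demo_amplitudes, L_LTF)
both = select_subcarriers(demo_amplitudes, L_LTF + HT_LTF)
print(len(l_ltf_only), len(both))  # 52 108
```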

For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.

Extracted amplitude spectrograms with the corresponding label files of the train/validation/test split: "trainLabels.csv," "validationLabels.csv," and "testLabels.csv," can be found in the spectrograms/ directory.

The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]

Spectrogram index: [0, ..., n]

Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."

Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.

Dataset Overview:

Table 1: Raw WiFi packet sequences.

Scenario	System	"no presence" / label 0	"walking" / label 1	"walking + arm-waving" / label 2	Total
LoS	BQ	b1.csv	w1.csv, w2.csv, w3.csv, w4.csv and w5.csv	ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv	
LoS	PIFA	b1.csv	w1.csv, w2.csv, w3.csv, w4.csv and w5.csv	ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv	
NLoS	BQ	b1.csv	w1.csv, w2.csv, w3.csv, w4.csv and w5.csv	ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv	
NLoS	PIFA	b1.csv	w1.csv, w2.csv, w3.csv, w4.csv and w5.csv	ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv	
Total		4	20	20	44

Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.

Scenario	System	"no presence" / label 0	"walking" / label 1	"walking + arm-waving" / label 2	Total
LoS	BQ	149	154	155	
LoS	PIFA	149	160	152	
NLoS	BQ	148	150	152	
NLoS	PIFA	143	147	147	
Total		589	611	606	1,806

Download and Use

This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].

[1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.

[2] Strohmayer, Julian, and Martin Kampel. (2024). “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition”, In 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates (pp. 3594-3599), doi: https://doi.org/10.1109/ICIP51287.2024.10647666.

BibTeX citations:

@inproceedings{strohmayer2024data,
  title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition},
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations},
  pages={42--56},
  year={2024},
  organization={Springer}
}

@INPROCEEDINGS{10647666,
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={2024 IEEE International Conference on Image Processing (ICIP)},
  title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition},
  year={2024},
  volume={},
  number={},
  pages={3594-3599},
  keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32},
  doi={10.1109/ICIP51287.2024.10647666}
}
HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using...
data.niaid.nih.gov
Updated Nov 27, 2024
Cite
Strohmayer, Julian (2024). HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using Directional Antennas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10715594
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
Strohmayer, Julian
Kampel, Martin
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WiFi CSI-based Long-Range Person Localization Using Directional Antennas

This repository contains the HAllway LOCalization (HALOC) dataset and WiFi system CAD files as proposed in [1].

PyTorch Dataloader

A minimal PyTorch dataloader for the HALOC dataset is provided at: https://github.com/StrohmayerJ/HALOC

Dataset Description

The HALOC dataset comprises six sequences (in .csv format) of synchronized WiFi Channel State Information (CSI) and 3D position labels. Each row in a given .csv file represents a single WiFi packet captured via ESP-IDF, with CSI and 3D coordinates stored in the "data" and ("x", "y", "z") fields, respectively.

The sequences are divided into training, validation, and test subsets as follows:

Subset	Sequences
Training	0.csv, 1.csv, 2.csv and 3.csv
Validation	4.csv
Test	5.csv
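
A minimal sketch of reading one HALOC sequence with the standard library is given below. The field names "data", "x", "y" and "z" come from the description above. The assumption that "data" holds a bracketed, comma-separated list of integers is not confirmed by this page, and the names SPLITS and read_packets are hypothetical. The demo parses an in-memory row rather than a real file.

```python
import csv
import io
import json

# Split as listed in the table above.
SPLITS = {
    "train": ["0.csv", "1.csv", "2.csv", "3.csv"],
    "val": ["4.csv"],
    "test": ["5.csv"],
}


def read_packets(handle):
    """Yield (csi, (x, y, z)) for each packet row of an open CSV file."""
    for row in csv.DictReader(handle):
        csi = json.loads(row["data"])  # assumes a bracketed list of integers
        yield csi, (float(row["x"]), float(row["y"]), float(row["z"]))


# Invented single-row demo in the assumed layout.
demo = io.StringIO('data,x,y,z\n"[1, -2, 3, 4]",0.5,1.0,1.5\n')
for csi, position in read_packets(demo):
    print(len(csi), position)  # 4 (0.5, 1.0, 1.5)
```

For real files, the same generator could be applied to open(path, newline="") for each name in SPLITS["train"]; the repository linked above provides the authors' own dataloader.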

WiFi System CAD files

We provide CAD files for the 3D printable parts of the proposed WiFi system consisting of the main housing (housing.stl), the lid (lid.stl), and the carrier board (carrier.stl) featuring mounting points for the Nvidia Jetson Orin Nano and the ESP32-S3-DevKitC-1 module.

Download and Use

This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., and Kampel, M. (2024). “WiFi CSI-based Long-Range Person Localization Using Directional Antennas”, The Second Tiny Papers Track at ICLR 2024, May 2024, Vienna, Austria. https://openreview.net/forum?id=AOJFcEh5Eb

BibTeX citation:

@inproceedings{strohmayer2024wifi,
  title={WiFi {CSI}-based Long-Range Person Localization Using Directional Antennas},
  author={Julian Strohmayer and Martin Kampel},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
  url={https://openreview.net/forum?id=AOJFcEh5Eb}
}
Immobilized fluorescently stained zebrafish through the eXtended Field of...
data.niaid.nih.gov
explore.openaire.eu
Updated Jul 11, 2024
Cite
Symvoulidis, Panagiotis (2024). Immobilized fluorescently stained zebrafish through the eXtended Field of view Light Field Microscope 2D-3D dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8024695
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Favaro, Paolo
Page Vizcaíno, Josué
Symvoulidis, Panagiotis
Lasser, Tobias
Jelten, Jonas
Wang, Zeguan
Boyden, Edward S.
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Immobilized fluorescently stained zebrafish through the eXtended Field of view Light Field Microscope 2D-3D dataset

This dataset comprises three immobilized fluorescently stained zebrafish imaged through the eXtended Field of view Light Field Microscope (XLFM, also known as Fourier Light Field Microscope). The images were preprocessed with the SLNet, which extracts the sparse signals from the images (a.k.a. the neural activity).

If you intend to use this with PyTorch, you can find a data loader and working source code to load the data and train networks here.

This dataset is part of the publication: Fast light-field 3D microscopy with out-of-distribution detection and adaptation through Conditional Normalizing Flows.

The fish present are:

1x NLS GCaMP6s

1x Pan-neuronal nuclear localized GCaMP6s Tg(HuC:H2B:GCaMP6s)

1x Soma localized GCaMP7f Tg(HuC:somaGCaMP7f)

The dataset is structured as follows:

XLFM_dataset

Dataset/

GCaMP6s_NLS_1/

SLNet_preprocessed/

XLFM_image/

XLFM_image_stack.tif: tif stack of 600 preprocessed XLFM images.

XLFM_stack/

XLFM_stack_nnn.tif: 3D stack corresponding to frame nnn.

Neural_activity_coordinates.csv: 3D coordinates of neurons found through the suite2p framework.

Raw/

XLFM_image/

XLFM_image_stack.tif: tif stack of 600 raw XLFM images.

(other samples)

lenslet_centers_python.txt: 2D coordinates of the lenslets in the XLFM images.

PSF_241depths_16bit.tif: 3D PSF of the microscope, which can be used for 3D deconvolution. It spans 734 × 734 × 550 μm³ and was used to deconvolve these volumes.

In this dataset, we provide a subset of the images and volumes.

Due to space constraints, we provide the 3D volumes only for:

SLNet_preprocessed/XLFM_stack/

10 interleaved frames between frames 0-499 (can be used for training a network).

20 consecutive frames, 500-520 (can be used for testing).

raw/

No volumes are provided for raw data, but they can be reconstructed through 3D deconvolution.

Enjoy, and feel free to contact us for any information request, like the full PSF, 3 more samples or longer image sequences.
Dataset for "Spectroscopic Transformer for Improved EMIT Cloud Masks"
zenodo.org
bin, csv
Updated Jan 7, 2025
Cite
Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick (2025). Dataset for "Spectroscopic Transformer for Improved EMIT Cloud Masks" [Dataset]. http://doi.org/10.5281/zenodo.14607938
Explore at:
Available download formats: csv, bin
Unique identifier
https://doi.org/10.5281/zenodo.14607938
Dataset updated
Jan 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spectroscopic Transformer for Improved EMIT Cloud Masks

Summary

Manuscript in preparation/submitted.

This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

spectf_cloud_labelbox.hdf5

1,841,641 Labeled spectra from 221 EMIT Scenes.

spectf_cloud_mmgis.hdf5

1,733,801 Labeled spectra from 313 EMIT Scenes.

These scenes were specifically labeled to correct false detections by an earlier version of the model.

train_fids.csv

465 EMIT scenes comprising the training set.

test_fids.csv

69 EMIT scenes comprising the held-out validation set.

Data Description

221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per-class per-scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.

The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a PyTorch dataloader.

Each hdf5 file contains the following arrays:

'spectra'

Top-of-Atmosphere reflectance calculated from the EMIT L1B Radiance product

Float64 of shape (n, 268)

'fids'

The FID from which each spectrum was sampled

Binary string of shape (n,)

'indices'

The (col, row) index from which each spectrum was sampled

Int64 of shape (n, 2)

'labels'

Annotation label of each spectrum

0 - "Clear"

1 - "Cloud"

2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)

label[label==2] = 0

Int64 of shape (n,2)

Each hdf5 file contains the following attribute:

'bands'

The band center wavelengths (nm) of the spectrum

Float64 of shape (268,)

Acknowledgements

The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

© 2024 California Institute of Technology. Government sponsorship acknowledged.
Data from: acusim: a synthetic dataset for cervicocranial acupuncture points...
search.dataone.org
datadryad.org
Updated Apr 2, 2025
Cite
Qilei Sun; Jiatao Ma; Paul Craig; Linjun Dai; EngGee Lim (2025). acusim: a synthetic dataset for cervicocranial acupuncture points localisation [Dataset]. http://doi.org/10.5061/dryad.zs7h44jkz
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.zs7h44jkz
Dataset updated
Apr 2, 2025
Dataset provided by
Dryad Digital Repository
Authors
Qilei Sun; Jiatao Ma; Paul Craig; Linjun Dai; EngGee Lim
Description
The locations of acupuncture points (acupoints) differ among human individuals due to variations in factors such as height, weight, and fat proportions. However, acupoint annotation is expert-dependent, labour-intensive, and highly expensive, which limits the data size and detection accuracy. In this paper, we introduce the "AcuSim" dataset as a new synthetic dataset for the task of localising points on the human cervicocranial area from an input image using an automatic render and labelling pipeline during acupuncture treatment. It includes the creation of 63,936 RGB-D images and 504 synthetic anatomical models with 174 volumetric acupoints annotated, to capture the variability and diversity of human anatomies. The study validates a convolutional neural network (CNN) on the proposed dataset with an accuracy of 99.73% and shows that 92.86% of predictions in the validation set align within a 5 mm threshold of margin error when compared to expert-annotated data. This dataset addresses the ...

AcuSim: A Synthetic Dataset for Cervicocranial Acupuncture Points Localisation

Dryad DOI: https://doi.org/10.5061/dryad.zs7h44jkz

Dataset Overview

A multi-view acupuncture point dataset containing:

64x64, 128x128, 256x256, 512x512 and 1024x1024 resolution RGB images

Corresponding JSON annotations with:

2D/3D keypoint coordinates

Visibility weights (0.9-1.0 scale)

Meridian category indices

Visibility masks

174 standard acupuncture points (map.txt)

Occlusion handling implementation

Key Features

Multi-view Rendering: Generated using Blender 3.5 with realistic occlusion simulation

Structured Annotation:

Default initialization for occluded points ([0.0, 0.0, 0.5])

Meridian category preservation for occluded points

Weighted visibility scoring

ML-Ready Format: Preconfigured PyTorch DataLoader implementation

Dataset Structure

dataset_root/
├── map.txt # Complete list of 174 acupuncture points
├── train/ ...

3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

zenodo.org

zip

Updated Dec 5, 2024

Cite

Julian Strohmayer; Martin Kampel (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. http://doi.org/10.5281/zenodo.10925351

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.10925351

Dataset updated

Dec 5, 2024

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Julian Strohmayer; Martin Kampel

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Nov 20, 2024

Description

On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

This repository contains the 3DO dataset proposed in [1].

PyTorch Dataloader

A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

Dataset Description

The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)
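
The label shift mentioned in the note can be sketched as follows: background samples (label 0) are dropped and the remaining labels move from 1..3 to 0..2. The sample tuples and the function name without_background are hypothetical and only illustrate the mapping; the provided dataloader may implement it differently.

```python
# Class names for the raw labels, as listed in the description above.
RAW_CLASSES = {0: "background", 1: "walking", 2: "sitting", 3: "lying"}


def without_background(samples):
    """Drop background samples and shift the remaining labels to 0..2.

    `samples` is an iterable of (features, raw_label) pairs.
    """
    return [(x, label - 1) for x, label in samples if label != 0]


demo = [("pkt0", 0), ("pkt1", 1), ("pkt2", 3), ("pkt3", 2)]
print(without_background(demo))  # [('pkt1', 0), ('pkt2', 2), ('pkt3', 1)]
```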

The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

Dataset Structure:

/3DO
├── d1 <-- day 1 subdirectory
│   └── w1 <-- sequence subdirectory
│       ├── csiposreg.csv <-- raw WiFi packet time series
│       └── csiposreg_complex.npy <-- CSI time series cache
├── d2 <-- day 2 subdirectory
└── d3 <-- day 3 subdirectory

In [1], we use the following training, validation, and test split:

Subset	Day	Sequences
Train	1	w1, w2, w3, s1, s2, s3, l1, l2, l3
Val	1	w4, s4, l4
Test	1	w5, s5, l5
Test	2	w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
Test	3	w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

w = walking, s = sitting and l = lying
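
The split table can be turned into file paths with a small lookup, sketched below. The directory layout (3DO/d<day>/<sequence>/csiposreg.csv) follows the description above; the names SPLIT and sequence_files are hypothetical, and the root directory is an assumption about where the archive was unpacked.

```python
from pathlib import PurePosixPath

# Subset -> day -> sequences, copied from the split table above.
SPLIT = {
    "train": {1: "w1 w2 w3 s1 s2 s3 l1 l2 l3"},
    "val": {1: "w4 s4 l4"},
    "test": {
        1: "w5 s5 l5",
        2: "w1 w2 w3 w4 w5 s1 s2 s3 s4 s5 l1 l2 l3 l4 l5",
        3: "w1 w2 w4 w5 s1 s2 s3 s4 s5 l1 l2 l4",
    },
}


def sequence_files(subset, root="3DO"):
    """List the csiposreg.csv path of every sequence in a subset."""
    return [
        str(PurePosixPath(root) / f"d{day}" / seq / "csiposreg.csv")
        for day, seqs in SPLIT[subset].items()
        for seq in seqs.split()
    ]


counts = {name: len(sequence_files(name)) for name in SPLIT}
print(counts)                # {'train': 9, 'val': 3, 'test': 30}
print(sum(counts.values()))  # 42
```

The total of 42 matches the number of five-minute recordings stated in the dataset description; the background sequences are not part of this table.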

Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

BibTeX citation:

@inproceedings{strohmayerOn2025,
  author="Strohmayer, Julian and Kampel, Martin",
  title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
  booktitle="Pattern Recognition",
  year="2025",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="194--211",
  isbn="978-3-031-78354-8"
}

Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315

Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

Explore at:

Dataset updated

Oct 29, 2021

Dataset provided by

Huan Sun
Alyssa Lees
Xiang Deng
Cong Yu
You Wu
Yu Su

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data:

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    # (brace expansion takes two dots: {first..last})
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000..000763}.tar'
    dataset = (
        wds.Dataset(url)  # named wds.WebDataset in newer WebDataset releases
        .shuffle(1000)  # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)  # group every 20 examples into a batch
    )

Please see the WebDataset documentation for more details on how to use it as a data loader for PyTorch.

You can also iterate through all examples and dump them in your preferred data format.
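As a rough illustration of dumping the examples to another format, the sketch below writes an iterable of decoded records to JSON Lines using only the standard library. The helper names `dump_jsonl` and `load_jsonl`, the output file name, and the toy record are assumptions made for this example and are not part of the dataset or the ReasonBERT code. In practice `examples` would be whatever iterable yields the decoded JSON records from the WebDataset pipeline; how you unwrap them depends on the pipeline stages you keep.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, List


def dump_jsonl(examples: Iterable[Dict[str, Any]], out_path: str) -> int:
    """Write one JSON object per line; return the number of examples written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy stand-in for the decoded records (hypothetical, heavily trimmed).
toy_examples = [
    {
        "s1_text": "Sils is a municipality in the comarca of Selva, in Catalonia, Spain.",
        "s1_all_links": {"Selva": [[41, 46]]},
        "pairs": [],
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "examples.jsonl")
    n_written = dump_jsonl(toy_examples, out_path)
    reloaded = load_jsonl(out_path)

print(n_written, reloaded == toy_examples)
```

JSON Lines is only one possible target; it is used here because each record is already a JSON object and the file can be streamed line by line.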

Below we show how the data is organized with two examples.

Text-only

    {
      's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
      's1_all_links': {
        'Sils,_Girona': [[0, 4]],
        'municipality': [[10, 22]],
        'Comarques_of_Catalonia': [[30, 37]],
        'Selva': [[41, 46]],
        'Catalonia': [[51, 60]]
      },  # list of entities and their mentions in the sentence (start, end location)
      'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
        {
          'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
          's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
          's2s': [  # list of other sentences that contain the common entity pair, or evidence
            {
              'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
              'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
              's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
              'pair_locs': [  # mentions of the entity pair in the evidence
                [[19, 27]],  # mentions of entity 1
                [[0, 5], [288, 293]]  # mentions of entity 2
              ],
              'all_links': {
                'Selva': [[0, 5], [288, 293]],
                'Comarques_of_Catalonia': [[19, 27]],
                'Catalonia': [[40, 49]]
              }
            },
            ...
          ]  # there are multiple evidence sentences
        },
        ...
      ]  # there are multiple entity pairs in the query
    }
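To make the offset convention concrete, the following sketch recovers the surface string of every mention in the query sentence above. It assumes the (start, end) locations are character offsets with an exclusive end, which is consistent with the sample values; the helper name `mention_strings` is invented for this illustration.

```python
from typing import Dict, List


def mention_strings(text: str, links: Dict[str, List[List[int]]]) -> Dict[str, List[str]]:
    """Map each linked entity to the substrings its (start, end) spans cover."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


# Query sentence and links copied from the text-only sample.
s1_text = "Sils is a municipality in the comarca of Selva, in Catalonia, Spain."
s1_all_links = {
    "Sils,_Girona": [[0, 4]],
    "municipality": [[10, 22]],
    "Comarques_of_Catalonia": [[30, 37]],
    "Selva": [[41, 46]],
    "Catalonia": [[51, 60]],
}

mentions = mention_strings(s1_text, s1_all_links)
for entity, surfaces in mentions.items():
    print(f"{entity}: {surfaces}")
```

The same slicing would apply to 'pair_locs' and 'all_links' inside each evidence entry, using the evidence 'text' instead of 's1_text'.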

Hybrid

    {
      's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
      's1_all_links': {...},  # same as text-only
      'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
      'table_pairs': [
        {
          'tid': 'Major_League_Baseball-1',
          'text': [
            ['World Series Records', 'World Series Records', ...],
            ['Team', 'Number of Series won', ...],
            ['St. Louis Cardinals (NL)', '11', ...],
            ...
          ],  # table content, list of rows
          'index': [
            [[0, 0], [0, 1], ...],
            [[1, 0], [1, 1], ...],
            ...
          ],  # index of each cell [row_id, col_id]. We keep only a table snippet, but the index here is from the original table.
          'value_ranks': [
            [0, 0, ...],
            [0, 0, ...],
            [0, 10, ...],
            ...
          ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
          'value_inv_ranks': [],  # inverse rank
          'all_links': {
            'St._Louis_Cardinals': {
              '2': [
                [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
              ]  # list of mentions in the second row, the key is row_id
            },
            'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
          },
          'name': '',  # table name, if exists
          'pairs': {
            'pair': ['American_League', 'National_League'],
            's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
            'table_pair_locs': {
              '17': [  # mention of entity pair in row 17
                [
                  [[17, 0], [3, 18]],
                  [[17, 1], [3, 18]],
                  [[17, 2], [3, 18]],
                  [[17, 3], [3, 18]]
                ],  # mention of the first entity
                [
                  [[17, 0], [21, 36]],
                  [[17, 1], [21, 36]],
                ]  # mention of the second entity
              ]
            }
          }
        }
      ]
    }
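Table mentions add a cell coordinate to each span. The sketch below resolves the 'all_links' mentions of a table snippet to their surface strings by first mapping the original [row_id, col_id] of each cell (from 'index') to the cell text. It rests on several assumptions: only the cells visible in the sample above are included, the third row of 'index' is guessed to be [[2, 0], [2, 1]] because it is elided in the sample, spans are end-exclusive character offsets within a cell, and the helper names `cell_lookup` and `resolve_table_mentions` are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def cell_lookup(table: Dict[str, Any]) -> Dict[Tuple[int, int], str]:
    """Map each original (row_id, col_id) to the text of that cell in the snippet."""
    lookup: Dict[Tuple[int, int], str] = {}
    for row_text, row_index in zip(table["text"], table["index"]):
        for cell_text, (row_id, col_id) in zip(row_text, row_index):
            lookup[(row_id, col_id)] = cell_text
    return lookup


def resolve_table_mentions(table: Dict[str, Any]) -> Dict[str, List[str]]:
    """Return the surface string of every mention whose cell is in the snippet."""
    cells = cell_lookup(table)
    resolved: Dict[str, List[str]] = {}
    for entity, rows in table["all_links"].items():
        surfaces = []
        for mentions in rows.values():  # keys are row ids stored as strings
            for (row_id, col_id), (start, end) in mentions:
                cell = cells.get((row_id, col_id))
                if cell is not None:  # skip cells outside the kept snippet
                    surfaces.append(cell[start:end])
        resolved[entity] = surfaces
    return resolved


# Trimmed toy table: visible cells only; the third 'index' row is an assumption.
toy_table = {
    "text": [
        ["World Series Records", "World Series Records"],
        ["Team", "Number of Series won"],
        ["St. Louis Cardinals (NL)", "11"],
    ],
    "index": [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],
    ],
    "all_links": {
        "St._Louis_Cardinals": {"2": [[[2, 0], [0, 19]]]},
        "CARDINAL:11": {"2": [[[2, 1], [0, 2]]]},
    },
}

resolved = resolve_table_mentions(toy_table)
print(resolved)
```

Looking cells up through 'index' rather than by position in 'text' reflects the note that the stored indices refer to the original table, not to the snippet.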

