Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.
There are two files:
sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence, Text-only
table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid
The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.
For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT
Below is a sample code snippet to load the data
import webdataset as wds

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
    wds.Dataset(url)
    .shuffle(1000)  # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)  # group every 20 examples into a batch
)
Below we show how the data is organized with two examples.
Text-only
{
  's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
  's1_all_links': {
    'Sils,_Girona': [[0, 4]],
    'municipality': [[10, 22]],
    'Comarques_of_Catalonia': [[30, 37]],
    'Selva': [[41, 46]],
    'Catalonia': [[51, 60]]
  },  # list of entities and their mentions in the sentence (start, end location)
  'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
    {
      'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
      's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
      's2s': [  # list of other sentences that contain the common entity pair, or evidence
        {
          'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
          'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
          's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
          'pair_locs': [  # mentions of the entity pair in the evidence
            [[19, 27]],  # mentions of entity 1
            [[0, 5], [288, 293]]  # mentions of entity 2
          ],
          'all_links': {
            'Selva': [[0, 5], [288, 293]],
            'Comarques_of_Catalonia': [[19, 27]],
            'Catalonia': [[40, 49]]
          }
        },
        ...  # there are multiple evidence sentences
      ]
    },
    ...  # there are multiple entity pairs in the query
  ]
}
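As a quick check of how the link offsets relate to the text, here is a minimal sketch that slices the query sentence with the spans from the record above. The helper name mention_texts is hypothetical, and the code assumes the spans are half-open [start, end) character offsets, which is consistent with the sample values but is not stated by the authors.

```python
def mention_texts(text, links):
    """Map each entity to the surface strings of its mentions.

    Assumption: every span is a half-open [start, end) character
    offset into `text`, as the sample record suggests.
    """
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


# Values copied from the text-only sample record.
s1_text = "Sils is a municipality in the comarca of Selva, in Catalonia, Spain."
s1_all_links = {
    "Sils,_Girona": [[0, 4]],
    "municipality": [[10, 22]],
    "Comarques_of_Catalonia": [[30, 37]],
    "Selva": [[41, 46]],
    "Catalonia": [[51, 60]],
}

mentions = mention_texts(s1_text, s1_all_links)
print(mentions["Comarques_of_Catalonia"])  # ['comarca']
```

The same slicing applies to 'pair_locs' and 'all_links' inside each evidence entry, using that entry's own 'text'.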
Hybrid
{
  's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
  's1_all_links': {...},  # same as text-only
  'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
  'table_pairs': [
    {
      'tid': 'Major_League_Baseball-1',
      'text': [
        ['World Series Records', 'World Series Records', ...],
        ['Team', 'Number of Series won', ...],
        ['St. Louis Cardinals (NL)', '11', ...],
        ...
      ],  # table content, list of rows
      'index': [
        [[0, 0], [0, 1], ...],
        [[1, 0], [1, 1], ...],
        ...
      ],  # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
      'value_ranks': [
        [0, 0, ...],
        [0, 0, ...],
        [0, 10, ...],
        ...
      ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
      'value_inv_ranks': [],  # inverse rank
      'all_links': {
        'St._Louis_Cardinals': {
          '2': [
            [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
          ]  # list of mentions in the second row, the key is row_id
        },
        'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
      },
      'name': '',  # table name, if exists
      'pairs': {
        'pair': ['American_League', 'National_League'],
        's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
        'table_pair_locs': {
          '17': [  # mention of entity pair in row 17
            [
              [[17, 0], [3, 18]],
              [[17, 1], [3, 18]],
              [[17, 2], [3, 18]],
              [[17, 3], [3, 18]]
            ],  # mention of the first entity
            [
              [[17, 0], [21, 36]],
              [[17, 1], [21, 36]],
            ]  # mention of the second entity
          ]
        }
      }
    }
  ]
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916
This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.
v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and false color, their existing baseline cloud masks, and the cloud masks produced by the ANN and GBT reference models and the SpecTf model.
221 EMIT scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per class per scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.
The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.
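The scene-level split described above can be sketched as follows. This is an illustration only, not the authors' procedure: the function name, the test fraction, the seed, and the scene identifiers are all hypothetical, and the real split is defined by the released files.

```python
import random


def split_by_scene(scene_ids, test_fraction=0.2, seed=0):
    """Split sample indices so that no scene appears in both subsets.

    scene_ids: one scene identifier per spectrum.
    test_fraction and seed are illustrative assumptions.
    """
    scenes = sorted(set(scene_ids))
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])
    train_idx = [i for i, s in enumerate(scene_ids) if s not in test_scenes]
    test_idx = [i for i, s in enumerate(scene_ids) if s in test_scenes]
    return train_idx, test_idx


# Hypothetical identifiers: 10 scenes with 5 spectra each.
scene_ids = [f"scene_{k}" for k in range(10) for _ in range(5)]
train_idx, test_idx = split_by_scene(scene_ids)
```

Splitting on the scene identifier rather than on individual spectra is what keeps spectra from one scene out of both subsets.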
Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a PyTorch dataloader.
Each hdf5 file contains the following arrays:
Each hdf5 file contains the following attribute:
The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.
This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).
© 2024 California Institute of Technology. Government sponsorship acknowledged.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].
PyTorch Dataloader
A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k
Dataset Description
The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).
To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:
LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system
LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system
NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system
NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system
These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.
To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:
csi_valid_subcarrier_index = []
csi_valid_subcarrier_index += [i for i in range(6, 32)]
csi_valid_subcarrier_index += [i for i in range(33, 59)]
Additional 56 HT-LTF subcarriers can be selected via:
csi_valid_subcarrier_index += [i for i in range(66, 94)]
csi_valid_subcarrier_index += [i for i in range(95, 123)]
For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.
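To make the amplitude and subcarrier steps concrete, here is a minimal sketch in plain Python. It assumes the "data" field has already been parsed into a list of integers in the interleaved [I, R, I, R, ...] order described above, so each pair is first combined into one complex number before the absolute value is taken. The function names and the synthetic packet are hypothetical; the dataloader in the linked repository remains the reference.

```python
def csi_amplitudes(h):
    """Turn an interleaved [I, R, I, R, ...] vector into amplitudes.

    Each consecutive (imaginary, real) pair becomes one complex
    channel frequency response; abs() gives its amplitude.
    """
    if len(h) % 2:
        raise ValueError("expected an even number of entries")
    return [abs(complex(h[i + 1], h[i])) for i in range(0, len(h), 2)]


# The 52 L-LTF subcarrier indices listed in the text.
LLTF_INDEX = list(range(6, 32)) + list(range(33, 59))


def select_subcarriers(amplitudes, index=LLTF_INDEX):
    """Keep only the amplitudes at the given subcarrier indices."""
    return [amplitudes[i] for i in index]


# Synthetic packet: 64 subcarriers, each with imaginary 4 and real 3.
h = [4, 3] * 64
a = csi_amplitudes(h)
lltf = select_subcarriers(a)
print(len(a), len(lltf), lltf[0])  # 64 52 5.0
```

Selecting the HT-LTF subcarriers works the same way with the additional index ranges, provided the packet contains enough subcarriers.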
Extracted amplitude spectrograms with the corresponding label files of the train/validation/test split: "trainLabels.csv," "validationLabels.csv," and "testLabels.csv," can be found in the spectrograms/ directory.
The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]
Spectrogram index: [0, ..., n]
Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."
Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.
Dataset Overview:
Table 1: Raw WiFi packet sequences.
Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
LoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | BQ | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
NLoS | PIFA | b1.csv | w1.csv, w2.csv, w3.csv, w4.csv and w5.csv | ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv |
Total | | 4 | 20 | 20 | 44
Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.
Scenario | System | "no presence" / label 0 | "walking" / label 1 | "walking + arm-waving" / label 2 | Total
LoS | BQ | 149 | 154 | 155 |
LoS | PIFA | 149 | 160 | 152 |
NLoS | BQ | 148 | 150 | 152 |
NLoS | PIFA | 143 | 147 | 147 |
Total | | 589 | 611 | 606 | 1,806
Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].
[1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.
[2] Strohmayer, Julian, and Martin Kampel. (2024). “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition”, In 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates (pp. 3594-3599), doi: https://doi.org/10.1109/ICIP51287.2024.10647666.
BibTeX citations:
@inproceedings{strohmayer2024data,
  title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition},
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations},
  pages={42--56},
  year={2024},
  organization={Springer}
}

@INPROCEEDINGS{10647666,
  author={Strohmayer, Julian and Kampel, Martin},
  booktitle={2024 IEEE International Conference on Image Processing (ICIP)},
  title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition},
  year={2024},
  volume={},
  number={},
  pages={3594-3599},
  keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32},
  doi={10.1109/ICIP51287.2024.10647666}
}
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Music Grounding by Short Video E-commerce (MGSV-EC) Dataset
📝 Dataset Summary
MGSV-EC is a large-scale dataset for the new task of Music Grounding by Short Video (MGSV), which aims to localize a specific music segment that best serves as the background music (BGM) for a given query short video. Unlike traditional video-to-music retrieval (V2MR), MGSV requires both… See the full description on the dataset page: https://huggingface.co/datasets/xxayt/MGSV-EC.
The locations of acupuncture points (acupoints) differ among human individuals due to variations in factors such as height, weight, and fat proportions. However, acupoint annotation is expert-dependent, labour-intensive, and highly expensive, which limits the data size and detection accuracy. In this paper, we introduce the "AcuSim" dataset as a new synthetic dataset for the task of localising points on the human cervicocranial area from an input image using an automatic render and labelling pipeline during acupuncture treatment. It includes the creation of 63,936 RGB-D images and 504 synthetic anatomical models with 174 volumetric acupoints annotated, to capture the variability and diversity of human anatomies. The study validates a convolutional neural network (CNN) on the proposed dataset with an accuracy of 99.73% and shows that 92.86% of predictions in the validation set align within a 5mm threshold of margin error when compared to expert-annotated data. This dataset addresses the ...

AcuSim: A Synthetic Dataset for Cervicocranial Acupuncture Points Localisation
Dryad DOI: https://doi.org/10.5061/dryad.zs7h44jkz
A multi-view acupuncture point dataset containing:
dataset_root/
├── map.txt # Complete list of 174 acupuncture points
├── train/
...,
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios
This repository contains the 3DO dataset proposed in [1].
PyTorch Dataloader
A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO
Dataset Description
The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)
The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)
Dataset Structure:
/3DO
├── d1 <-- day 1 subdirectory
│   └── w1 <-- sequence subdirectory
│       ├── csiposreg.csv <-- raw WiFi packet time series
│       └── csiposreg_complex.npy <-- CSI time series cache
├── d2 <-- day 2 subdirectory
└── d3 <-- day 3 subdirectory
In [1], we use the following training, validation, and test split:
Subset Day Sequences
Train 1 w1, w2, w3, s1, s2, s3, l1, l2, l3
Val 1 w4, s4, l4
Test 1 w5, s5, l5
Test 2 w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
Test 3 w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4
w = walking, s = sitting, and l = lying
Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.
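For convenience, the split table can be written down as a small lookup that resolves each subset to its csiposreg.csv paths. This is a sketch only: the names SPLIT and sequence_files are hypothetical, the root directory is a placeholder, and the dataloader in the linked repository may organize the split differently.

```python
from pathlib import PurePosixPath

# Subset -> day -> sequence directories, transcribed from the split table.
SPLIT = {
    "train": {1: ["w1", "w2", "w3", "s1", "s2", "s3", "l1", "l2", "l3"]},
    "val": {1: ["w4", "s4", "l4"]},
    "test": {
        1: ["w5", "s5", "l5"],
        2: [f"{activity}{i}" for activity in "wsl" for i in range(1, 6)],
        3: ["w1", "w2", "w4", "w5", "s1", "s2", "s3", "s4", "s5",
            "l1", "l2", "l4"],
    },
}


def sequence_files(root, subset):
    """List the raw packet files of a subset, e.g. root/d1/w1/csiposreg.csv."""
    return [
        PurePosixPath(root) / f"d{day}" / seq / "csiposreg.csv"
        for day, seqs in SPLIT[subset].items()
        for seq in seqs
    ]


train_files = sequence_files("3DO", "train")
print(train_files[0])  # 3DO/d1/w1/csiposreg.csv
```

Counting the entries gives 9 training, 3 validation, and 30 test sequences, which adds up to the 42 recordings mentioned in the dataset description.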
Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].
[1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13
BibTeX citation:
@inproceedings{strohmayerOn2025,
  author="Strohmayer, Julian and Kampel, Martin",
  title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
  booktitle="Pattern Recognition",
  year="2025",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="194--211",
  isbn="978-3-031-78354-8"
}
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction
As mobile service robots increasingly operate in human-centered environments, they must learn to use elevators without modifying elevator hardware. This task traditionally involves processing an image of an elevator control panel using instance segmentation of the buttons and labels, reading the text on the labels, and associating buttons with their corresponding labels. In addition to the standard approach, our project also implements an additional segmentation step where missing buttons and labels are recovered after the first feature detection pass. In a robust system, the training data for both the first segmentation pass and the recovery model require pixel-level annotations of buttons and labels, while the label reading step needs annotations of the text on the labels. Current elevator panel feature datasets, however, either do not provide segmentation annotations, or do not draw distinctions between the buttons and labels. The “Living With Robots Elevator Button Dataset” was assembled for purposes of training segmentation and scene text recognition models on realistic scenarios involving varying conditions such as lighting, blur, and position of the camera relative to the elevator control panel. Buttons are labeled with the same action as their respective labels for purposes of training a button-label association model. A pipeline including all steps of the task mentioned was trained and evaluated, producing state-of-the-art accuracy and precision results using the high quality elevator button dataset.

Dataset Contents
- 400 jpeg images of elevator panels.
- 292 taken of 25 different elevators across 24 buildings on the University of Texas at Austin campus.
- 108 sourced from the internet, with varying lighting, quality, and perspective conditions.
- JSON files containing border annotations, button and label distinctions, and text on labels for the Campus and Internet Sub-Datasets.
- PyTorch files containing state dictionaries with network weights for:
- The first-pass segmentation model, a transformer-based model trained to segment buttons and labels in a full-color image: “segmentation_vit_model.pth”.
- The feature-recovery segmentation model, a transformer-based model trained to segment masks of missed buttons and labels from the class map output of the first pass: “recovery_vit_model.pth”.
- The scene text recognition model, trained from PARSeq to read the special characters present on elevator panel labels: “parseq_str.ckpt”.
- Links to the data loader, training, and evaluation scripts for the segmentation models hosted in GitHub.

The data subsets are all JPGs collected through 2 different means. The campus subset images were taken in buildings on and around the University of Texas at Austin campus. All pictures were taken facing the elevator panel’s wall roughly straight-on, while the camera itself was positioned in each of nine locations in a 3x3 grid layout relative to the panel: to the top left, top middle, top right, middle left, center, middle right, bottom left, bottom middle, and bottom right. A subset of these also includes versions of each image with the elevator door closed or open, varying the lighting and background conditions. All of these images are 3024 × 4032, and were taken with either an iPhone 12 or 12 Pro Max. The Internet subset deliberately features user-shared photos with irregular or uncommon panel characteristics. Images in this dataset vary widely in terms of resolution, clarity, button/label shape, and angle of the image to add variety to the dataset and robustness to any models trained with it.

Data Segmentation
The segmentation annotations for this dataset served two training purposes. First, they were used to identify the pixels that comprise the elevator buttons and labels in the images. A segmentation model was then trained to accurately recognize buttons and labels in an image at the pixel level.
The second use, and the one that most distinguishes our approach, was training a separate model to recover missed button and label detections. The annotations were used to generate class maps of each, before being procedurally masked to provide a data ground-truth (the remaining masks) and a target (the hidden masks) for the recovery model.

Data Annotation Method
All annotations were done with the VGG Image Annotator published by the University of Oxford. All images were given their own set of annotations, identified in their file naming convention. Regarding the segmentation annotations, any button that was largely in-view of the image was segmented as one of several shapes that most closely fit the feature: rectangle, ellipse, or polygon. In the annotation JSONs, these appeared as either the coordinates of each point of a polygon or as the dimensions of an ellipse (center coordinates, radius dimensions, and angle of rotation). Additionally, each feature was designated as a “button” or “label”. For retraining the model that reads text on labels, each label and its...
Attribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Volcanic Event Classification Dataset and Data Loader
This project contains a dataset for classifying volcanic events and a Jupyter Notebook (DataModuleTestThraws.ipynb) for loading and preprocessing this data using PyTorch and PyTorch Lightning.
Dataset Structure
The dataset is organized into two main directories:
TrainVal/: Contains data for training and validation.
Test/: Contains data for testing the trained model.
Within both TrainVal/ and Test/ directories, there… See the full description on the dataset page: https://huggingface.co/datasets/sirbastiano94/END2END.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
WiFi CSI-based Long-Range Person Localization Using Directional Antennas
This repository contains the HAllway LOCalization (HALOC) dataset and WiFi system CAD files as proposed in [1].
PyTorch Dataloader
A minimal PyTorch dataloader for the HALOC dataset is provided at: https://github.com/StrohmayerJ/HALOC
Dataset Description
The HALOC dataset comprises six sequences (in .csv format) of synchronized WiFi Channel State Information (CSI) and 3D position labels. Each row in a given .csv file represents a single WiFi packet captured via ESP-IDF, with CSI and 3D coordinates stored in the "data" and ("x", "y", "z") fields, respectively.
The sequences are divided into training, validation, and test subsets as follows:
Subset Sequences
Training 0.csv, 1.csv, 2.csv and 3.csv
Validation 4.csv
Test 5.csv
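The row layout described above (CSI in "data", position in "x", "y", "z") can be read with the standard library alone. The sketch below is hedged on one point the description does not settle: it assumes the "data" field is a bracketed, comma-separated list of integers that json.loads can parse. The function name read_packets and the in-memory sample row are hypothetical; check a real file, or use the dataloader in the linked repository, before relying on this.

```python
import csv
import io
import json


def read_packets(fileobj):
    """Yield (csi_values, (x, y, z)) for every WiFi packet row.

    Assumption: the "data" column holds a JSON-style list such as
    "[1, -2, 3, 4]"; adjust the parsing if the real files differ.
    """
    for row in csv.DictReader(fileobj):
        csi = json.loads(row["data"])
        position = (float(row["x"]), float(row["y"]), float(row["z"]))
        yield csi, position


# Synthetic one-row file standing in for, e.g., 0.csv.
sample = io.StringIO('data,x,y,z\n"[1, -2, 3, 4]",0.5,1.0,1.5\n')
packets = list(read_packets(sample))
print(packets[0])  # ([1, -2, 3, 4], (0.5, 1.0, 1.5))
```

With real data, the same function can be applied to an open file handle for each of the sequence files listed in the split table.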
WiFi System CAD files
We provide CAD files for the 3D printable parts of the proposed WiFi system consisting of the main housing (housing.stl), the lid (lid.stl), and the carrier board (carrier.stl) featuring mounting points for the Nvidia Jetson Orin Nano and the ESP32-S3-DevKitC-1 module.
Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].
[1] Strohmayer, J., and Kampel, M. (2024). “WiFi CSI-based Long-Range Person Localization Using Directional Antennas”, The Second Tiny Papers Track at ICLR 2024, May 2024, Vienna, Austria. https://openreview.net/forum?id=AOJFcEh5Eb
BibTeX citation:
@inproceedings{strohmayer2024wifi,
  title={WiFi {CSI}-based Long-Range Person Localization Using Directional Antennas},
  author={Julian Strohmayer and Martin Kampel},
  booktitle={The Second Tiny Papers Track at ICLR 2024},
  year={2024},
  url={https://openreview.net/forum?id=AOJFcEh5Eb}
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Immobilized fluorescently stained zebrafish through the eXtended Field of view Light Field Microscope 2D-3D dataset
This dataset comprises three immobilized fluorescently stained zebrafish imaged through the eXtended Field of view Light Field Microscope (XLFM, also known as Fourier Light Field Microscope). The images were preprocessed with the SLNet, which extracts the sparse signals from the images (a.k.a. the neural activity).
If you intend to use this with PyTorch, you can find a data loader and working source code to load and train networks here.
This dataset is part of the publication: Fast light-field 3D microscopy with out-of-distribution detection and adaptation through Conditional Normalizing Flows.
The fish present are:
The dataset is structured as follows:
XLFM_dataset
In this dataset, we provide a subset of the images and volumes.
Due to space constraints, we provide the 3D volumes only for:
Enjoy, and feel free to contact us for any information request, like the full PSF, 3 more samples or longer image sequences.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This directory contains the training data and code for training and testing a ResMLP with experience replay for creating a machine-learning physics parameterization for the Community Atmospheric Model.
The directory is structured as follows:
1. Download training and testing data: https://portal.nersc.gov/archive/home/z/zhangtao/www/hybird_GCM_ML
2. Unzip nncam_training.zip
nncam_training
- models
model definition of ResMLP and other models for comparison purposes
- dataloader
utility scripts to load data into pytorch dataset
- training_scripts
scripts to train ResMLP model with/without experience replay
- offline_test
scripts to perform offline test (Table 2, Figure 2)
3. Unzip nncam_coupling.zip
nncam_srcmods
- SourceMods
SourceMods to be used with CAM modules for coupling with neural network
- otherfiles
additional configuration files to setup and run SPCAM with neural network
- pythonfiles
python scripts to run neural network and couple with CAM
- ClimAnalysis
- paper_plots.ipynb
scripts to produce online evaluation figures (Figure 1, Figure 3-10)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Introduction
The advent of neural networks capable of learning salient features from variance in the radar data has expanded the breadth of radar applications, often as an alternative sensor or a complementary modality to camera vision. Gesture recognition for command control is arguably the most commonly explored application. Nevertheless, more suitable benchmarking datasets than currently available are needed to assess and compare the merits of the different proposed solutions and explore a broader range of scenarios than simple hand-gesturing a few centimeters away from a radar transmitter/receiver. Most current publicly available radar datasets used in gesture recognition provide limited diversity, do not provide access to raw ADC data, and are not significantly challenging. To address these shortcomings, we created and make available a new dataset that combines FMCW radar and dynamic vision camera recordings of 10 aircraft marshalling signals (whole body) at several distances and angles from the sensors, recorded from 13 people. The two modalities are hardware synchronized using the radar's PRI signal. Moreover, in the supporting publication we propose a sparse encoding of the time domain (ADC) signals that achieves a dramatic data rate reduction (>76%) while retaining the efficacy of the downstream FFT processing (<2% accuracy loss on recognition tasks), and can be used to create a sparse event-based representation of the radar data. In this way the dataset can be used as a two-modality neuromorphic dataset.

Synchronization of the two modalities
The PRI pulses from the radar have been hard-wired to the event stream of the DVS sensor, and timestamped using the DVS clock. Based on this signal the DVS event stream has been segmented such that groups of events (time-bins) of the DVS are mapped with individual radar pulses (chirps).

Data storage
DVS events (x,y coords and timestamps) are stored in structured arrays, and one such structured array object is associated with the data of a radar transmission (pulse/chirp). A radar transmission is a vector of 512 ADC levels that correspond to sampling points of a chirping signal (FMCW radar) that lasts about 1.3 ms. Every 192 radar transmissions are stacked in a matrix called a radar frame (each transmission is a row in that matrix). A data capture (recording) consisting of some thousands of continuous radar transmissions is therefore segmented into a number of radar frames. Finally, radar frames and the corresponding DVS structured arrays are stored in separate containers in a custom-made multi-container file format (extension .rad). We provide a (rad file) parser for extracting the data out of these files. There is one file per capture of continuous gesture recording of about 10s.
Note that the number of 192 transmissions per radar frame is an ad-hoc segmentation that suits the purpose of obtaining sufficient signal resolution in a 2D FFT typical in radar signal processing, for the range resolution of the specific radar. It also served the purpose of fast streaming storage of the data during capture. For extracting individual data points for the dataset, however, one can pool together (concat) all the radar frames from a single capture file and re-segment them according to liking. The data loader that we provide offers this, with a default of re-segmenting every 769 transmissions (about 1s of gesturing).

Data captures directory organization (radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z)
The dataset captures (recordings) are organized in a common directory structure which encompasses additional metadata information about the captures.
dataset_dir///--/ofxRadar8Ghz_yyyy-mm-dd_HH-MM-SS.rad

Identifiers
stage: [train, test].
room: [conference_room, foyer, open_space].
subject: [0-9]. Note that 0 stands for no person, and 1 for an unlabeled, random person (only present in test).
gesture: ['none', 'emergency_stop', 'move_ahead', 'move_back_v1', 'move_back_v2', 'slow_down', 'start_engines', 'stop_engines', 'straight_ahead', 'turn_left', 'turn_right'].
distance: ['xxx', '100', '150', '200', '250', '300', '350', '400', '450']. Note that 'xxx' is used for 'none' gestures, either when no person is present in front of the radar (i.e. background samples) or when a person is walking in front of the radar at varying distances without performing a gesture.
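As an illustration of how these identifiers could be recovered from a capture path, here is a minimal sketch. The directory layout it assumes (stage directory, then room directory, then a directory named subject-gesture-distance) is a guess and should be checked against the unpacked archive; the function name and the example path are hypothetical.

```python
from pathlib import PurePosixPath

def parse_capture_path(path):
    """Split a capture path into its identifiers.

    ASSUMED layout (not confirmed by the dataset description):
    .../stage/room/subject-gesture-distance/file.rad
    """
    parts = PurePosixPath(path).parts
    stage, room, capture_dir = parts[-4], parts[-3], parts[-2]
    # Gesture names use underscores, so '-' is a safe separator here.
    subject, gesture, distance = capture_dir.split("-")
    return {
        "stage": stage,
        "room": room,
        "subject": int(subject),
        "gesture": gesture,
        "distance": distance,  # kept as a string because 'xxx' is a valid value
    }

# Hypothetical path, for illustration only.
info = parse_capture_path(
    "dataset_dir/train/foyer/3-turn_left-250/ofxRadar8Ghz_2022-01-01_10-00-00.rad"
)
```

Keeping distance as a string avoids special-casing the 'xxx' value used for background and walking samples.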
The test data captures contain both subjects that appear in the train data and previously unseen subjects. Similarly, the test data contain captures from the spaces in which the train data were recorded, as well as from a new, unseen open space.
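The re-segmentation mentioned under Data storage (pooling all radar frames of a capture and cutting them into segments of 769 transmissions) can be sketched as follows. This is an illustration, not the code of loader.py: the function names are hypothetical, the frames are synthetic zeros, dropping the trailing incomplete segment is an assumption, and the range-Doppler helper applies a plain 2D FFT with no windowing.

```python
import numpy as np

def resegment(frames, chirps_per_segment=769):
    """Pool radar frames of one capture and re-cut them into fixed-size segments.

    frames: iterable of (192, 512) arrays, one row per transmission.
    Returns an array of shape (n_segments, chirps_per_segment, 512).
    Transmissions left over at the end are dropped (assumption of this sketch).
    """
    chirps = np.concatenate(list(frames), axis=0)  # (n_chirps, 512)
    n_segments = chirps.shape[0] // chirps_per_segment
    usable = n_segments * chirps_per_segment
    return chirps[:usable].reshape(n_segments, chirps_per_segment, chirps.shape[1])

def range_doppler(frame):
    """Magnitude of a plain 2D FFT over one frame (no windowing)."""
    return np.abs(np.fft.fft2(frame))

# Synthetic stand-in for a ~10 s capture: 40 frames of 192 x 512 ADC levels.
frames = [np.zeros((192, 512), dtype=np.int16) for _ in range(40)]
segments = resegment(frames)
rd_map = range_doppler(frames[0])
```

With 40 frames there are 7680 transmissions, which yields 9 full segments of 769 transmissions each.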
Files List
radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z
This is the actual archive bundle with the data captures (recordings).
rad_file_parser_2.py
Parser for individual .rad files, which contain capture data.
loader.py
A convenience PyTorch Dataset loader (partly Tonic compatible). This is practically all you need for a quick start if you don't want to read much code. When you instantiate a DvsRadarAircraftMarshallingSignals object, it automatically downloads the dataset archive and the .rad file parser, unpacks the archive, and imports the .rad parser to load the data. One can then request from it a training set, a validation set and a test set as torch Datasets to work with.
aircraft_marshalling_signals_howto.ipynb
Jupyter notebook with a basic usage example for loader.py
Contact
For further information or questions, please first contact M. Sifalakis or F. Corradi.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.
There are two files:
sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence (Text-only)
table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table (Hybrid)
The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.
For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT
Below is a sample code snippet to load the data
import webdataset as wds
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
# Note: in newer WebDataset releases this class is exposed as wds.WebDataset.
dataset = (
    wds.Dataset(url)
    .shuffle(1000)  # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)  # group every 20 examples into a batch
)
Below we show how the data is organized with two examples.
Text-only
{
    's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
    's1_all_links': {
        'Sils,_Girona': [[0, 4]],
        'municipality': [[10, 22]],
        'Comarques_of_Catalonia': [[30, 37]],
        'Selva': [[41, 46]],
        'Catalonia': [[51, 60]]
    },  # list of entities and their mentions in the sentence (start, end location)
    'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
        {
            'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
            's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
            's2s': [  # list of other sentences that contain the common entity pair, or evidence
                {
                    'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                    'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                    's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
                    'pair_locs': [  # mentions of the entity pair in the evidence
                        [[19, 27]],  # mentions of entity 1
                        [[0, 5], [288, 293]]  # mentions of entity 2
                    ],
                    'all_links': {
                        'Selva': [[0, 5], [288, 293]],
                        'Comarques_of_Catalonia': [[19, 27]],
                        'Catalonia': [[40, 49]]
                    }
                },
                ...
            ]  # there are multiple evidence sentences
        },
        ...
    ]  # there are multiple entity pairs in the query
}
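To make the (start, end) locations concrete, the following minimal sketch resolves each link to its surface text. It assumes the offsets are character positions with an exclusive end, which is consistent with the example above; the helper name mention_texts is hypothetical and not part of the released code.

```python
def mention_texts(text, links):
    """Map each linked entity to the surface strings of its mentions.

    Assumes [start, end] are character offsets with an exclusive end,
    i.e. they can be used directly as a Python slice.
    """
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }

s1_text = 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.'
s1_all_links = {
    'Sils,_Girona': [[0, 4]],
    'municipality': [[10, 22]],
    'Comarques_of_Catalonia': [[30, 37]],
    'Selva': [[41, 46]],
    'Catalonia': [[51, 60]],
}
mentions = mention_texts(s1_text, s1_all_links)
# mentions['Comarques_of_Catalonia'] is ['comarca']
```

The same helper applies to the 'text' and 'all_links' fields of an evidence entry in 's2s'.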
Hybrid
{
    's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
    's1_all_links': {...},  # same as text-only
    'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
    'table_pairs': [
        {
            'tid': 'Major_League_Baseball-1',
            'text': [
                ['World Series Records', 'World Series Records', ...],
                ['Team', 'Number of Series won', ...],
                ['St. Louis Cardinals (NL)', '11', ...],
                ...
            ],  # table content, list of rows
            'index': [
                [[0, 0], [0, 1], ...],
                [[1, 0], [1, 1], ...],
                ...
            ],  # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
            'value_ranks': [
                [0, 0, ...],
                [0, 0, ...],
                [0, 10, ...],
                ...
            ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
            'value_inv_ranks': [],  # inverse rank
            'all_links': {
                'St._Louis_Cardinals': {
                    '2': [
                        [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
                    ]  # list of mentions in the second row, the key is row_id
                },
                'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
            },
            'name': '',  # table name, if exists
            'pairs': {
                'pair': ['American_League', 'National_League'],
                's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
                'table_pair_locs': {
                    '17': [  # mention of entity pair in row 17
                        [
                            [[17, 0], [3, 18]],
                            [[17, 1], [3, 18]],
                            [[17, 2], [3, 18]],
                            [[17, 3], [3, 18]]
                        ],  # mention of the first entity
                        [
                            [[17, 0], [21, 36]],
                            [[17, 1], [21, 36]],
                        ]  # mention of the second entity
                    ]
                }
            }
        }
    ]
}
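Because 'index' stores cell coordinates from the original table while 'text' holds only a snippet, a table mention of the form [[row_id, col_id], [start, end]] has to be located through 'index' rather than by direct positional lookup. The sketch below shows one way to do that. The helper name cell_mention is hypothetical, the table is a truncated copy of the example above (the elided columns are simply left out), and the exclusive-end slicing is an assumption consistent with the example.

```python
def cell_mention(table, cell_index, span):
    """Return the mention text for an original-table cell index and a char span.

    Walks 'text' and 'index' in parallel, because 'index' records the
    original [row_id, col_id] of every cell kept in the snippet.
    Returns None if the cell is not part of the snippet.
    """
    for row_cells, row_index in zip(table['text'], table['index']):
        for cell_text, idx in zip(row_cells, row_index):
            if list(idx) == list(cell_index):
                start, end = span
                return cell_text[start:end]
    return None

# Truncated copy of the example table (only the columns shown above).
table = {
    'text': [
        ['World Series Records', 'World Series Records'],
        ['Team', 'Number of Series won'],
        ['St. Louis Cardinals (NL)', '11'],
    ],
    'index': [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],
    ],
}
team = cell_mention(table, [2, 0], [0, 19])   # 'St. Louis Cardinals'
wins = cell_mention(table, [2, 1], [0, 2])    # '11'
```

For large tables, building a dictionary from (row_id, col_id) to cell text once per table would avoid the repeated linear scan.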