10 datasets found

Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
Google Research
The Ohio State University
Authors
Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }
Z
3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...
nde-dev.biothings.io
data.niaid.nih.gov
+1more
Updated Dec 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_10925350
Explore at:
Dataset updated
Dec 5, 2024
Dataset provided by
Strohmayer, Julian
Kampel, Martin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

This repository contains the 3DO dataset proposed in [1].

PyTroch Dataloader

A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

Dataset Description

The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

Dataset Structure:

/3DO

├── d1 <-- day 1 subdirectory

└── w1 <-- sequence subdirectory └── csiposreg.csv <-- raw WiFi packet time series └── csiposreg_complex.npy <-- CSI time series cache

├── d2 <-- day 2 subdirectory

├── d3 <-- day 3 subdirectory

In [1], we use the following training, validation, and test split:

Subset Day Sequences

Train 1 w1, w2, w3, s1, s2, s3, l1, l2, l3

Val 1 w4, s4, l4

Test 1 w5 , s5, l5

Test 2 w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5

Test 3 w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

w = walking, s = sitting and l= lying

Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

Download and UseThis data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

BibTeX citation:

@inproceedings{strohmayerOn2025, author="Strohmayer, Julian and Kampel, Martin", title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios", booktitle="Pattern Recognition", year="2025", publisher="Springer Nature Switzerland", address="Cham", pages="194--211", isbn="978-3-031-78354-8" }
feral-cat-segmentation_dataset
kaggle.com
universe.roboflow.com
zip
Updated Mar 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
lu hou yang (2025). feral-cat-segmentation_dataset [Dataset]. https://www.kaggle.com/datasets/luhouyang/feral-cat-segmentation-dataset
Explore at:
zip(971125684 bytes)Available download formats
Dataset updated
Mar 18, 2025
Authors
lu hou yang
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Feral Cat Segmentation Dataset

Overview

This dataset provides image segmentation data for feral cats, designed for computer vision and machine learning tasks. It builds upon the original public domain dataset by Paul Cashman from Roboflow, with additional preprocessing and multiple data formats for easier consumption.

Dataset Source

Original Author: Paul Cashman

Original Source: Roboflow Universe

Extended by: Lu Hou Yang

GitHub: https://github.com/luhouyang/open_circles

License: Public Domain

Dataset Contents

The dataset is organized into three standard splits: - Train set - Validation set - Test set

Each split contains data in multiple formats: 1. Original JPG images 2. Segmentation mask JPG images 3. Parquet files containing flattened image and mask data 4. Pickle files containing serialized image and mask data

Data Formats

1. Image Files

Format: JPG

Resolution: 224×224 pixels

Directory Structure:

train/: Original training images

valid/: Original validation images

test/: Original test images

train_mask/: Corresponding segmentation masks for training

valid_mask/: Corresponding segmentation masks for validation

test_mask/: Corresponding segmentation masks for testing

2. Parquet Files

Files: train_dataset.parquet, valid_dataset.parquet, test_dataset.parquet

Content: Flattened image data and corresponding masks combined in a single table

Structure: Each row contains the flattened pixel values of an image followed by the flattened pixel values of its mask

Data Division: Image and mask data are split at index split_at = image_size[0] * image_size[1] * image_channels

Data before this index: image pixel values (reshaped to [-1, 224, 224, 3])

Data after this index: mask pixel values (reshaped to [-1, 224, 224, 1])

Benefits: Efficient storage and faster loading compared to individual image files

3. Pickle Files

Files: train_dataset.pkl, valid_dataset.pkl, test_dataset.pkl

Content: Serialized Python objects containing images and their corresponding masks

Structure: List of [image, mask] pairs, where each image and mask is serialized using Python's pickle

Data Access: Similar to parquet files, when loaded through the provided dataset class, data is split at the same index: split_at = image_size[0] * image_size[1] * image_channels

Benefits: Preserves original data structure and enables quick loading in Python

4. CSV Files

Files: train_dataset.csv, valid_dataset.csv, test_dataset.csv

Content: Same data as parquet files but in CSV format

Structure: No headers, raw flattened pixel values

Data Division: Same split point as parquet files

Image Preprocessing

All images were preprocessed with the following operations: - Resized to 224×224 pixels using bilinear interpolation - Segmentation masks were also resized to match the images using nearest neighbor interpolation - Original RLE (Run-Length Encoding) segmentation data converted to binary masks

Data Normalization

When used with the provided PyTorch dataset class, images are normalized with: - Mean: [0.48235, 0.45882, 0.40784] - Standard Deviation: [0.00392156862745098, 0.00392156862745098, 0.00392156862745098]

PyTorch Integration

A custom CatDataset class is included for easy integration with PyTorch:

from cat_dataset import CatDataset # Load from parquet format dataset = CatDataset( root="path/to/dataset", split="train", # Options: "train", "valid", "test" format="parquet", # Options: "parquet", "pkl" image_size=[224, 224], image_channels=3, mask_channels=1 ) # Use with PyTorch DataLoader from torch.utils.data import DataLoader dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Performance Comparison

Loading time benchmarks from the original implementation: - Parquet format: ~1.29 seconds per iteration - Pickle format: ~0.71 seconds per iteration

The pickle format provides the fastest loading times and is recommended for most use cases.

Citation

If you use this dataset in your research or projects, please cite:

@misc{feral-cat-segmentation_dataset, title = {feral-cat-segmentation Dataset}, type = {Open Source Dataset}, author = {Paul Cashman}, howpublished = {\url{https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}}, url = {https://universe.roboflow.com/paul-cashman-mxgwb/feral-cat-segmentation}, journal = {Roboflow Universe}, publisher = {Roboflow}, year = {2025}, month = {mar}, note = {visited on 2025-03-19}, }

Sample Usage Code

Basic Dataset Loading

from ca...
Data from: SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music...
data.europa.eu
unknown
Updated Sep 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2024). SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-13759492?locale=fr
Explore at:
unknown(2009641471)Available download formats
Dataset updated
Sep 12, 2024
Dataset authored and provided by
Zenodohttp://zenodo.org/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Inside the SynthSOD-data folder, there is a folder for every one of the songs of the dataset and inside them, there is a folder called Tree with the signals synthesized for the Decca Tree (which provide a reasonable stereo mix with the original reverberation of the synthesizer) and a folder called Close Mic with the signals synthesized for the close mics of the instruments (which are the driest signals generated by the synthesizer and can be used as source signals if wanting to add custom reverberation). Inside these folders are the FLAC files of the instruments present in the mix, which should be at least two of the followings: Violin_1.flac, Violin_2.flac, Viola.flac, Cello.flac, Bass.flac, Flute.flac, Piccolo.flac, Clarinet.flac, Oboe.flac, coranglais.flac, Bassoon.flac, Horn.flac, Trumpet.flac, Trombone.flac, Tuba.flac, Harp.flac, Timpani.flac, and untunedpercussion.flac. The file SynthSOD_metadata_all.json contains information about the instruments present in the dataset and the activity time of every one of them and their combinations for the whole dataset and for every one of the songs as well as the ID of every song in the SOD. The files SynthSOD_metadata_train.json, SynthSOD_metadata_evaluation.json, and SynthSOD_metadata_test.json contain the same information but only for the songs in the official train, evaluation, and test partitions of the dataset. Note that the folder SynthSOD-data contains the songs for all the partitions without any splits, so the information about the partitions is only in the JSON files. You can find an example of a PyTorch dataloader for the dataset in the repository of the baseline model. The compressed file SynthSOD-sample.zip is just a subset of the full dataset with 10 pieces that can be downloaded to take a look/listen to the data before downloading the full dataset.
MELD Preprocessed
kaggle.com
zip
Updated Mar 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Argish Abhangi (2025). MELD Preprocessed [Dataset]. https://www.kaggle.com/datasets/argish/meld-preprocessed
Explore at:
zip(3527202381 bytes)Available download formats
Dataset updated
Mar 1, 2025
Authors
Argish Abhangi
Description
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.

Data Sources

Audio: Waveforms extracted from the original video files.

Video: Video files are processed to sample frames at a target frame rate (default: 2 fps) and to detect faces using a Haar Cascade classifier.

Text: Utterances from the dialogue, which are cleaned using custom encoding functions to fix potential byte encoding issues.

Emotion Labels: Each sample is associated with an emotion label.

Preprocessing Pipeline

The preprocessing script performs several key steps:

Text Cleaning:

fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.

replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).

Audio Processing:

Extracts raw audio waveform from each sample.

Computes a Mel-spectrogram using torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).

Converts the spectrogram to a logarithmic scale for numerical stability.

Video Processing:

Reads video frames at a specified target FPS (default: 2 fps) using OpenCV.

For each video, samples frames evenly based on the original video's FPS.

Applies Haar Cascade face detection on the frames to extract the first detected face.

Resizes the detected face to 224x224 and converts it to RGB. If no face is detected, a default black image (224x224x3) is returned.

Saving Processed Samples:

Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).

The filename is derived from the original video filename (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).

Data Format

Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:

utterance (str): The cleaned textual utterance.

emotion (str/int): The corresponding emotion label.

video_path (str): Original path to the video file from which the sample was extracted.

audio (Tensor): Raw audio waveform tensor of shape [channels, time].

audio_sample_rate (int): The sampling rate of the audio waveform.

audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].

face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

Directory Structure

The preprocessed files are organized into splits: preprocessed_data/ ├── train/ │ ├── dia0_utt0.pt │ ├── dia1_utt1.pt │ └── ... ├── dev/ │ ├── dia0_utt0.pt │ ├── dia1_utt1.pt │ └── ... └── test/ │ ├── dia0_utt0.pt │ ├── dia1_utt1.pt └── ...

Loading and Using the Dataset

A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:

Dataset Class

from torch.utils.data import Dataset import os import torch class PreprocessedMELDDataset(Dataset): def _init_(self, data_dir): """ Args: data_dir (str): Directory where preprocessed .pt files are stored. """ self.data_dir = data_dir self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')] def _len_(self): return len(self.files) def _getitem_(self, idx): sample_path = self.files[idx] sample = torch.load(sample_path) return sample

Custom Collate Function

def preprocessed_collate_fn(batch): """ Collates a list of sample dictionaries into a single dictionary with keys mapping to lists. Modify this function to pad or stack tensor data if needed. """ collated = {} collated['utterance'] = [sample['utterance'] for sample in batch] collated['emotion'] = [sample['emotion'] for sample in batch] collated['video_path'] = [sample['video_path'] for sample in batch] collated['audio'] = [sample['audio'] for sample in batch] collated['audio_sample_rate'] = batch[0]['audio_sample_rate'] collated['audio_mel'] = [sample['audio_mel'] for sample in batch] collated['face'] = [sample['face'] for sample in batch] return collated

Creating DataLoaders

from torch.utils.data import DataLoader # Define paths for each split train_data_dir = "preprocessed_data/train" dev_data_dir = "preproces...
Z
Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi...
data-staging.niaid.nih.gov
nde-dev.biothings.io
+2more
Updated Apr 4, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Strohmayer, Julian; Kampel, Martin (2025). Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_8188998
Explore at:
Dataset updated
Apr 4, 2025
Dataset provided by
Computer Vision Lab, TU Wien
Authors
Strohmayer, Julian; Kampel, Martin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].

PyTroch Dataloader

A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k

Dataset Description

The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).

To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:

LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system

LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system

NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system

NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system

These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.

To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:

52 L-LTF subcarriers

csi_valid_subcarrier_index = [] csi_valid_subcarrier_index += [i for i in range(6, 32)] csi_valid_subcarrier_index += [i for i in range(33, 59)]

Additional 56 HT-LTF subcarriers can be selected via:

56 HT-LTF subcarriers

csi_valid_subcarrier_index += [i for i in range(66, 94)]
csi_valid_subcarrier_index += [i for i in range(95, 123)]

For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.

Extracted amplitude spectrograms with the corresponding label files of the train/validation/test split: "trainLabels.csv," "validationLabels.csv," and "testLabels.csv," can be found in the spectrograms/ directory.

The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]

Spectrogram index: [0, ..., n]

Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."

Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.

Dataset Overview:

Table 1: Raw WiFi packet sequences.

Scenario System "no presence" / label 0 "walking" / label 1 "walking + arm-waving" / label 2 Total

LoS BQ b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

LoS PIFA b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

NLoS BQ b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

NLoS PIFA b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

4 20 20 44

Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.

Scenario System

"no presence" / label 0

"walking" / label 1

"walking + arm-waving" / label 2 Total

LoS BQ 149 154 155

LoS PIFA 149 160 152

NLoS BQ 148 150 152

NLoS PIFA 143 147 147

589 611 606 1,806

Download and UseThis data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].

[1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.

[2] Strohmayer, Julian, and Martin Kampel., “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition,” 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024, pp. 3594-3599, doi: https://doi.org/10.1109/ICIP51287.2024.10647666.

BibTeX citations:

@inproceedings{strohmayer2024data, title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition}, author={Strohmayer, Julian and Kampel, Martin}, booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations}, pages={42--56}, year={2024}, organization={Springer}}@INPROCEEDINGS{10647666, author={Strohmayer, Julian and Kampel, Martin}, booktitle={2024 IEEE International Conference on Image Processing (ICIP)}, title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition}, year={2024}, volume={}, number={}, pages={3594-3599}, keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32}, doi={10.1109/ICIP51287.2024.10647666}}
Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...
zenodo.org
bin, csv, pdf
Updated Jan 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jake Lee; Jake Lee; Michael Kiper; Michael Kiper; David R. Thompson; David R. Thompson; Philip Brodrick; Philip Brodrick (2025). Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection" [Dataset]. http://doi.org/10.5281/zenodo.14614218
Explore at:
bin, pdf, csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14614218
Dataset updated
Jan 10, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Jake Lee; Jake Lee; Michael Kiper; Michael Kiper; David R. Thompson; David R. Thompson; Philip Brodrick; Philip Brodrick
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

Summary

Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916

This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

spectf_cloud_labelbox.hdf5

1,841,641 Labeled spectra from 221 EMIT Scenes.

spectf_cloud_mmgis.hdf5

1,733,801 Labeled spectra from 313 EMIT Scenes.

These scenes were speciffically labeled to correct false detections by an earlier version of the model.

train_fids.csv

465 EMIT scenes comprising the training set.

test_fids.csv

69 EMIT scenes comprising the held-out validation set.

v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and Falsecolor, their existing baseline cloud masks, as well as their cloud masks produced by the ANN and GBT reference models and the SpecTf model.

Data Description

221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per-class per-scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, A total of 3,575,442 spectra were labeled and sampled.

The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a Pytorch dataloader.

Each hdf5 file contains the following arrays:

'spectra'

Top-of-Atmosphere reflectance calculated from the EMIT L1B Radiance product

Float64 of shape (n, 268)

'fids'

The FID from which each spectrum was sampled

Binary string of shape (n,)

'indices'

The (col, row) index from which each spectrum was sampled

Int64 of shape (n, 2)

'labels'

Annotation label of each spectrum

0 - "Clear"

1 - "Cloud"

2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)

label[label==2] = 0

Int64 of shape (n,2)

Each hdf5 file contains the following attribute:

'bands'

The band center wavelengths (nm) of the spectrum

Float64 of shape (268,)

Acknowledgements

The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

© 2024 California Institute of Technology. Government sponsorship acknowledged.
FinSen Financial Sentiment Dataset
kaggle.com
zip
Updated Oct 29, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Eagle W H L (2024). FinSen Financial Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/eaglewhl/finsen-financial-sentiment-dataset/code
Explore at:
zip(6549212 bytes)Available download formats
Dataset updated
Oct 29, 2024
Authors
Eagle W H L
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Enhancing Financial Market Predictions: Causality-Driven Feature Selection

Note:[Please help give a Vote 👍 if you think this FinSen dataset is good for you, Thanks:)]

This paper introduces FinSen dataset that revolutionizes financial market analysis by integrating economic and financial news articles from 197 countries with stock market data. The dataset’s extensive coverage spans 15 years from 2007 to 2023 with temporal information, offering a rich, global perspective 160,000 records on financial market news. Our study leverages causally validated sentiment scores and LSTM models to enhance market forecast accuracy and reliability.

Technical Framework

https://github.com/user-attachments/assets/5df3c4a7-2403-460a-ac7f-2d69572fec2f" alt="image">

Our FinSen Dataset

This repository contains the dataset for "https://arxiv.org/abs/2408.01005">Enhancing Financial Market Predictions: Causality-Driven Feature Selection, which has been accepted in ADMA 2024.

If the dataset or the paper has been useful in your research, please add a citation to our work:

@article{liang2024enhancing, title={Enhancing Financial Market Predictions: Causality-Driven Feature Selection}, author={Liang, Wenhao and Li, Zhengyang and Chen, Weitong}, journal={arXiv e-prints}, pages={arXiv--2408}, year={2024} }

Datasets

[FinSen] can be downloaded manually from the repository as csv file. Sentiment and its score are generated by FinBert model from the Hugging Face Transformers library under the identifier "ProsusAI/finbert". (Araci, Dogu. "Finbert: Financial sentiment analysis with pre-trained language models." arXiv preprint arXiv:1908.10063 (2019).)

We only provide US for research purpose usage, please contact w.liang@adelaide.edu.au for other countries (total 197 included) if necessary.

https://github.com/user-attachments/assets/f28e670a-7329-409d-81cb-1fe47da22140" alt="image">

Finsen Data Sample:

https://github.com/user-attachments/assets/6ab08486-85b7-4cf6-b4fe-7d4294624f91">

We also provide other NLP datasets for text classification tasks here, please cite them correspondingly once you used them in your research if any.

20Newsgroups. Joachims, T., et al.: A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. In: ICML. vol. 97, pp. 143–151. Citeseer (1997)

AG News. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. Advances in neural information processing systems 28 (2015)

Financial PhraseBank. Malo, P., Sinha, A., Korhonen, P., Wallenius, J., Takala, P.: Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology 65(4), 782–796 (2014)

Dataloader for FinSen

We provide the preprocessing file finsen.py for our FinSen dataset under dataloaders directory for more convienient usage.

Models - Text Classification

DAN-3.

Gobal Pooling CNN.

Models - Regression Prediction

LSTM

Using Sentiment Score from FinSen Predict Result on S&P500

https://github.com/user-attachments/assets/2d9b4dd7-7f59-425c-b812-2cca57719243" alt="image">

:smiley: ☺ Happy Research !
Z
Data from: Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based...
data.niaid.nih.gov
Updated Dec 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Leon Müller; Manolis Sifalakis; Sherif Eissa; Amirreza Yousefzadeh; Sander Stuijk; Federico Corradi; Paul Detterer (2023). Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based Camera for Sensor Fusion [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7656910
Explore at:
Dataset updated
Dec 11, 2023
Dataset provided by
IMEC
Eindhoven University of Technology
Authors
Leon Müller; Manolis Sifalakis; Sherif Eissa; Amirreza Yousefzadeh; Sander Stuijk; Federico Corradi; Paul Detterer
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Introduction The advent of neural networks capable of learning salient features from variance in the radar data has expanded the breadth of radar applications, often as an alternative sensor or a complementary modality to camera vision. Gesture recognition for command control is arguably the most commonly explored application. Nevertheless, more suitable benchmarking datasets than currently available are needed to assess and compare the merits of the different proposed solutions and explore a broader range of scenarios than simple hand-gesturing a few centimeters away from a radar transmitter/receiver. Most current publicly available radar datasets used in gesture recognition provide limited diversity, do not provide access to raw ADC data, and are not significantly challenging. To address these shortcomings, we created and make available a new dataset that combines FMCW radar and dynamic vision camera of 10 aircraft marshalling signals (whole body) at several distances and angles from the sensors, recorded from 13 people. The two modalities are hardware synchronized using the radar's PRI signal. Moreover, in the supporting publication we propose a sparse encoding of the time domain (ADC) signals that achieve a dramatic data rate reduction (>76%) while retaining the efficacy of the downstream FFT processing (<2% accuracy loss on recognition tasks), and can be used to create an sparse event-based representation of the radar data. In this way the dataset can be used as a two-modality neuromorphic dataset. Synchronization of the two modalities The PRI pulses from the radar have been hard-wired to the event stream of the DVS sensor, and timestamped using the DVS clock. Based on this signal the DVS event stream has been segmented such that groups of events (time-bins) of the DVS are mapped with individual radar pulses (chirps). Data storage DVS events (x,y coords and timestamps) are stored in structured arrays, and one such structured array object is associated with the data of a radar transmission (pulse/chirp). A radar transmission is a vector of 512 ADC levels that correspond to sampling points of chirping signal (FMCW radar) that lasts about ~1.3ms. Every 192 radar transmissions are stacked in a matrix called a radar frame (each transmission is a row in that matrix). A data capture (recording) consisting of some thousands of continuous radar transmissions is therefore segmented in a number of radar frames. Finally radar frames and the corresponding DVS structured arrays are stored in separate containers in a custom-made multi-container file format (extension .rad). We provide a (rad file) parser for extracting the data out of these files. There is one file per capture of continuous gesture recording of about 10s. Note the number of 192 transmissions per radar frame is an ad-hoc segmentation that suits the purpose of obtaining sufficient signal resolution in a 2D FFT typical in radar signal processing, for the range resolution of the specific radar. It also served the purpose of fast streaming storing of the data during capture. For extracting individual data points for the dataset however, one can pool together (concat) all the radar frames from a single capture file and re-segment them according to liking. The data loader that we provide offers this, with a default of re-segmenting every 769 transmissions (about 1s of gesturing). Data captures directory organization (radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z) The dataset captures (recordings) are organized in a common directory structure which encompasses additional metadata information about the captures. dataset_dir///--/ofxRadar8Ghz_yyyy-mm-dd_HH-MM-SS.rad Identifiers

stage [train, test]. room: [conference_room, foyer, open_space]. subject: [0-9]. Note that 0 stands for no person, and 1 for an unlabeled, random person (only present in test). gesture: ['none', 'emergency_stop', 'move_ahead', 'move_back_v1', 'move_back_v2', 'slow_down' 'start_engines', 'stop_engines', 'straight_ahead', 'turn_left', 'turn_right']. distance: 'xxx', '100', '150', '200', '250', '300', '350', '400', '450'. Note that xxx is used for none gestures when there is no person present in front of the radar (i.e. background samples), or when a person is walking in front of the radar with varying distances but performing no gesture. The test data captures contain both subjects that appear in the train data as well as previously unseen subjects. Similarly the test data contain captures from the spaces that train data were recorded at, as well as from a new unseen open space. Files List radar8Ghz-DVS-marshaling_signals_20220901_publication_anonymized.7z This is the actual archive bundle with the data captures (recordings). rad_file_parser_2.py Parser for individual .rad files, which contain capture data. loader.py A convenience PyTorch Dataset loader (partly Tonic compatible). You practically only need this to quick-start if you don't want to delve too much into code reading. When you init a DvsRadarAircraftMarshallingSignals class object it automatically downloads the dataset archive and the .rad file parser, unpacks the archive, and imports the .rad parser to load the data. One can then request from it a training set, a validation set and a test set as torch.Datasets to work with.
aircraft_marshalling_signals_howto.ipynb Jupyter notebook for exemplary basic use of loader.py Contact For further information or questions try contacting first M. Sifalakis or F. Corradi.
Z
Immobilized fluorescently stained zebrafish through the eXtended Field of...
data.niaid.nih.gov
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Page Vizcaíno, Josué; Symvoulidis, Panagiotis; Wang, Zeguan; Jelten, Jonas; Favaro, Paolo; Boyden, Edward S.; Lasser, Tobias (2024). Immobilized fluorescently stained zebrafish through the eXtended Field of view Light Field Microscope 2D-3D dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8024695
Explore at:
Dataset updated
Jul 11, 2024
Dataset provided by
Synthetic Neurobiology Group, Massachusetts Institute of Technology, USA
Computer Vision Group, University of Bern, Switzerland
Computational Imaging and Inverse Problems, Department of Informatics, School of Computation, Information and Technology, Technical University of Munich, Germany 2Munich Institute of Biomedical Engineering, Technical University of Munich, Germany
Authors
Page Vizcaíno, Josué; Symvoulidis, Panagiotis; Wang, Zeguan; Jelten, Jonas; Favaro, Paolo; Boyden, Edward S.; Lasser, Tobias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Immobilized fluorescently stained zebrafish through the eXtended Field of view Light Field Microscope 2D-3D dataset

This dataset comprises three immobilized fluorescently stained zebrafish imaged through the eXtended Field of view Light Field Microscope (XLFM, also known as Fourier Light Field Microscope). The images were preprocessed with the SLNet, which extracts the sparse signals from the images (a.k.a. the neural activity).

If you intend to use this with Pytorch, you can find a data loader and working source code to load and train networks here.

This dataset is part of the publication: Fast light-field 3D microscopy with out-of-distribution detection and adaptation through Conditional Normalizing Flows.

The fish present are:

1x NLS GCaMP6s

1x Pan-neuronal nuclear localized GCaMP6s Tg(HuC:H2B:GCaMP6s)

1x Soma localized GCaMP7f Tg(HuC:somaGCaMP7f)

The dataset is structured as follows::

XLFM_dataset

Dataset/

GCaMP6s_NLS_1/

SLNet_preprocessed/

XLFM_image/

XLFM_image_stack.tif: tif stack of 600 preprocessed XLFM images.

XLFM_stack/

XLFM_stack_nnn.tif: 3D stack corresponding to frame nnn.

Neural_activity_coordinates.csv: 3D coordinates of neurons found through the suite2p framework.

Raw/

XLFM_image/

XLFM_image_stack.tif: tif stack of 600 raw XLFM images.

(other samples)

lenslet_centers_python.txt: 2D coordinates of the lenset in the XLFM images.

PSF_241depths_16bit.tif: 3D PSF of the microscope can be used for 3D deconvolution. Spanning 734 × 734 × 550𝜇𝑚3 used to deconvolve this volumes.

In this dataset, we provide a subset of the images and volumes.

Due to space constraints, we provide the 3D volumes only for:

SLNet_preprocessed/XLFM_stack/

10 interleaved frames between frames 0-499 (can be used for training a network).

20 consecutive frames, 500-520 (can be used for testing).

raw/

No volumes are provided for raw data, but they can be reconstructed through 3D deconvolution.

Enjoy, and feel free to contact us for any information request, like the full PSF, 3 more samples or longer image sequences.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315

Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

Explore at:

Dataset updated

Oct 29, 2021

Dataset provided by

Google Research
The Ohio State University

Authors

Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }

Clear search

Close search

Google apps

Main menu

Sentence/Table Pair Data from Wikipedia for Pre-training with...

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

feral-cat-segmentation_dataset

Feral Cat Segmentation Dataset

Overview

Dataset Source

Dataset Contents

Data Formats

1. Image Files

2. Parquet Files

3. Pickle Files

4. CSV Files

Image Preprocessing

Data Normalization

PyTorch Integration

Performance Comparison

Citation

Sample Usage Code

Basic Dataset Loading

Data from: SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music...

MELD Preprocessed

Data Sources

Preprocessing Pipeline

Data Format

Directory Structure

Loading and Using the Dataset

Dataset Class

Custom Collate Function

Creating DataLoaders

Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi...

52 L-LTF subcarriers

56 HT-LTF subcarriers

Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...

SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

Summary

Data Description

Acknowledgements

FinSen Financial Sentiment Dataset

Enhancing Financial Market Predictions: Causality-Driven Feature Selection

Technical Framework

Our FinSen Dataset

Datasets

Dataloader for FinSen

Models - Text Classification

Models - Regression Prediction

Using Sentiment Score from FinSen Predict Result on S&P500

Data from: Aircraft Marshaling Signals Dataset of FMCW Radar and Event-Based...

Immobilized fluorescently stained zebrafish through the eXtended Field of...

Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format