5 datasets found

Z
Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cong Yu (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Explore at:
Dataset updated
Oct 29, 2021
Dataset provided by
Alyssa Lees
Yu Su
You Wu
Huan Sun
Xiang Deng
Cong Yu
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }
h
MGSV-EC
huggingface.co
Updated May 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zijie Xin (2025). MGSV-EC [Dataset]. https://huggingface.co/datasets/xxayt/MGSV-EC
Explore at:
Dataset updated
May 22, 2025
Authors
Zijie Xin
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Music Grounding by Short Video E-commerce (MGSV-EC) Dataset

📄 [Paper] 📦 Feature File 🔧 [PyTorch Dataloader] 🧬 [Model Code]

📝 Dataset Summary

MGSV-EC is a large-scale dataset for the new task of Music Grounding by Short Video (MGSV), which aims to localize a specific music segment that best serves as the background music (BGM) for a given query short video.Unlike traditional video-to-music retrieval (V2MR), MGSV requires both… See the full description on the dataset page: https://huggingface.co/datasets/xxayt/MGSV-EC.
Z
HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using...
data.niaid.nih.gov
Updated Nov 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kampel, Martin (2024). HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using Directional Antennas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10715594
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
Strohmayer, Julian
Kampel, Martin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
WiFi CSI-based Long-Range Person Localization Using Directional Antennas

This repository contains the HAllway LOCalization (HALOC) dataset and WiFi system CAD files as proposed in [1].

PyTroch Dataloader

A minimal PyTorch dataloader for the HALOC dataset is provided at: https://github.com/StrohmayerJ/HALOC

Dataset Description

The HALOC dataset comprises six sequences (in .csv format) of synchronized WiFi Channel State Information (CSI) and 3D position labels. Each row in a given .csv file represents a single WiFi packet captured via ESP-IDF, with CSI and 3D coordinates stored in the "data" and ("x", "y", "z") fields, respectively.

The sequences are divided into training, validation, and test subsets as follows:

Subset Sequences

Training 0.csv, 1.csv, 2.csv and 3.csv

Validation 4.csv

Test 5.csv

WiFi System CAD files

We provide CAD files for the 3D printable parts of the proposed WiFi system consisting of the main housing (housing.stl), the lid (lid.stl), and the carrier board (carrier.stl) featuring mounting points for the Nvidia Jetson Orin Nano and the ESP32-S3-DevKitC-1 module.

Download and UseThis data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., and Kampel, M. (2024). “WiFi CSI-based Long-Range Person Localization Using Directional Antennas”, The Second Tiny Papers Track at ICLR 2024, May 2024, Vienna, Austria. https://openreview.net/forum?id=AOJFcEh5Eb

BibTeX citation:

@inproceedings{strohmayer2024wifi,title={WiFi {CSI}-based Long-Range Person Localization Using Directional Antennas},author={Julian Strohmayer and Martin Kampel},booktitle={The Second Tiny Papers Track at ICLR 2024},year={2024},url={https://openreview.net/forum?id=AOJFcEh5Eb}}

3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

zenodo.org

zip

Updated Dec 5, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Julian Strohmayer; Julian Strohmayer; Martin Kampel; Martin Kampel (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. http://doi.org/10.5281/zenodo.10925351

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.10925351

Dataset updated

Dec 5, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Julian Strohmayer; Julian Strohmayer; Martin Kampel; Martin Kampel

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Nov 20, 2024

Description

On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

This repository contains the 3DO dataset proposed in [1].

PyTroch Dataloader

A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

Dataset Description

The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)

Dataset Structure:

/3DO

├── d1 <-- day 1 subdirectory

└── w1 <-- sequence subdirectory

└── csiposreg.csv <-- raw WiFi packet time series

└── csiposreg_complex.npy <-- CSI time series cache

├── d2 <-- day 2 subdirectory

├── d3 <-- day 3 subdirectory

In [1], we use the following training, validation, and test split:

Subset	Day	Sequences
Train	1	w1, w2, w3, s1, s2, s3, l1, l2, l3
Val	1	w4, s4, l4
Test	1	w5 , s5, l5
Test	2	w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
Test	3	w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

w = walking, s = sitting and l= lying

Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

Download and Use
This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

[1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

BibTeX citation:

@inproceedings{strohmayerOn2025,
  author="Strohmayer, Julian and Kampel, Martin",
  title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios",
  booktitle="Pattern Recognition",
  year="2025",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="194--211",
  isbn="978-3-031-78354-8"
}

P
InfantMarmosetsVox Dataset
paperswithcode.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
InfantMarmosetsVox Dataset [Dataset]. https://paperswithcode.com/dataset/infantmarmosetsvox
Explore at:
Description
InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher. A PyTorch Dataloader is included in this dataset.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Cong Yu (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315

Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

Explore at:

Dataset updated

Oct 29, 2021

Dataset provided by

Alyssa Lees
Yu Su
You Wu
Huan Sun
Xiang Deng
Cong Yu

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data

import webdataset as wds

path to the uncompressed files, should be a directory with a set of tar files

url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar' dataset = ( wds.Dataset(url) .shuffle(1000) # cache 1000 samples and shuffle .decode() .to_tuple("json") .batched(20) # group every 20 examples into a batch )

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence 's1_all_links': { 'Sils,_Girona': [[0, 4]], 'municipality': [[10, 22]], 'Comarques_of_Catalonia': [[30, 37]], 'Selva': [[41, 46]], 'Catalonia': [[51, 60]] }, # list of entities and their mentions in the sentence (start, end location) 'pairs': [ # other sentences that share common entity pair with the query, group by shared entity pairs { 'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair 's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query 's2s': [ # list of other sentences that contain the common entity pair, or evidence { 'md5': '2777e32bddd6ec414f0bc7a0b7fea331', 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.', 's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence 'pair_locs': [ # mentions of the entity pair in the evidence [[19, 27]], # mentions of entity 1 [[0, 5], [288, 293]] # mentions of entity 2 ], 'all_links': { 'Selva': [[0, 5], [288, 293]], 'Comarques_of_Catalonia': [[19, 27]], 'Catalonia': [[40, 49]] } } ,...] # there are multiple evidence sentences }, ,...] # there are multiple entity pairs in the query }

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.', 's1_all_links': {...}, # same as text-only 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only 'table_pairs': [ 'tid': 'Major_League_Baseball-1', 'text':[ ['World Series Records', 'World Series Records', ...], ['Team', 'Number of Series won', ...], ['St. Louis Cardinals (NL)', '11', ...], ...] # table content, list of rows 'index':[ [[0, 0], [0, 1], ...], [[1, 0], [1, 1], ...], ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table. 'value_ranks':[ [0, 0, ...], [0, 0, ...], [0, 10, ...], ...] # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS 'value_inv_ranks': [], # inverse rank 'all_links':{ 'St._Louis_Cardinals': { '2': [ [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]] ] # list of mentions in the second row, the key is row_id }, 'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}, } 'name': '', # table name, if exists 'pairs': { 'pair': ['American_League', 'National_League'], 's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query 'table_pair_locs': { '17': [ # mention of entity pair in row 17 [ [[17, 0], [3, 18]], [[17, 1], [3, 18]], [[17, 2], [3, 18]], [[17, 3], [3, 18]] ], # mention of the first entity [ [[17, 0], [21, 36]], [[17, 1], [21, 36]], ] # mention of the second entity ] } } ] }

Clear search

Close search

Google apps

Main menu

Sentence/Table Pair Data from Wikipedia for Pre-training with...

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format

MGSV-EC

HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using...

3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in...

InfantMarmosetsVox Dataset

Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

path to the uncompressed files, should be a directory with a set of tar files

Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch

You can also iterate through all examples and dump them with your preferred data format