12 datasets found
  1. Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Oct 29, 2021
    Cite
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. http://doi.org/10.5281/zenodo.5612316
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)

    table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds
    
    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
      wds.Dataset(url)
      .shuffle(1000) # cache 1000 samples and shuffle
      .decode()
      .to_tuple("json")
      .batched(20) # group every 20 examples into a batch
    )
    
    # Please see the documentation for WebDataset for more details about how to use it as dataloader for Pytorch
    # You can also iterate through all examples and dump them with your preferred data format
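
    As a minimal sketch of our own (not part of the original release), the pipeline above can be wrapped in a standard PyTorch DataLoader; because batching is already done by .batched(20), the DataLoader's own batching is disabled:

    from torch.utils.data import DataLoader

    # The WebDataset pipeline already yields batches of 20 decoded json examples,
    # so pass batch_size=None to keep the DataLoader from re-batching them.
    loader = DataLoader(dataset, batch_size=None)

    for batch in loader:
        pass  # each item corresponds to one batch of up to 20 examples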

    Below we show how the data is organized with two examples; a short iteration sketch follows them.

    Text-only

    {'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.', # query sentence
     's1_all_links': {
      'Sils,_Girona': [[0, 4]],
      'municipality': [[10, 22]],
      'Comarques_of_Catalonia': [[30, 37]],
      'Selva': [[41, 46]],
      'Catalonia': [[51, 60]]
     }, # list of entities and their mentions in the sentence (start, end location)
     'pairs': [ # other sentences that share a common entity pair with the query, grouped by shared entity pairs
      {
        'pair': ['Comarques_of_Catalonia', 'Selva'], # the common entity pair
        's1_pair_locs': [[[30, 37]], [[41, 46]]], # mention of the entity pair in the query
        's2s': [ # list of other sentences that contain the common entity pair, or evidence
         {
           'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27], # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [ # mentions of the entity pair in the evidence
            [[19, 27]], # mentions of entity 1
            [[0, 5], [288, 293]] # mentions of entity 2
           ],
           'all_links': {
            'Selva': [[0, 5], [288, 293]],
            'Comarques_of_Catalonia': [[19, 27]],
            'Catalonia': [[40, 49]]
           }
          }
        ,...] # there are multiple evidence sentences
       },
     ,...] # there are multiple entity pairs in the query
    }

    Hybrid

    {'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
     's1_all_links': {...}, # same as text-only
     'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}], # same as text-only
     'table_pairs': [
      'tid': 'Major_League_Baseball-1',
      'text':[
        ['World Series Records', 'World Series Records', ...],
        ['Team', 'Number of Series won', ...],
        ['St. Louis Cardinals (NL)', '11', ...],
      ...] # table content, list of rows
      'index':[
        [[0, 0], [0, 1], ...],
        [[1, 0], [1, 1], ...],
      ...] # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
      'value_ranks':[
        [0, 0, ...],
        [0, 0, ...],
        [0, 10, ...],
      ...] # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
      'value_inv_ranks': [], # inverse rank
      'all_links':{
        'St._Louis_Cardinals': {
         '2': [
          [[2, 0], [0, 19]], # [[row_id, col_id], [start, end]]
         ] # list of mentions in the second row, the key is row_id
        },
        'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
      }
      'name': '', # table name, if exists
      'pairs': {
        'pair': ['American_League', 'National_League'],
        's1_pair_locs': [[[137, 152]], [[162, 177]]], # mention in the query
        'table_pair_locs': {
         '17': [ # mention of entity pair in row 17
           [
            [[17, 0], [3, 18]],
            [[17, 1], [3, 18]],
            [[17, 2], [3, 18]],
            [[17, 3], [3, 18]]
           ], # mention of the first entity
           [
            [[17, 0], [21, 36]],
            [[17, 1], [21, 36]],
           ] # mention of the second entity
         ]
        }
       }
     ]
    }
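
    As a quick illustration (ours, following the structure shown above), the Text-only examples can be iterated and their fields read like this:

    # non-batched pipeline, just for inspecting individual examples
    # (url as defined in the loading snippet above)
    inspect = wds.Dataset(url).decode().to_tuple("json")

    for (example,) in inspect:
        print(example['s1_text'])           # query sentence
        for group in example['pairs']:      # grouped by shared entity pair
            print(group['pair'])            # e.g. ['Comarques_of_Catalonia', 'Selva']
            for evidence in group['s2s']:
                start, end = evidence['s_loc']
                print(evidence['text'][start:end])  # evidence sentence without surrounding context
        break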

  2. MGSV-EC

    • huggingface.co
    Updated May 22, 2025
    Cite
    Zijie Xin (2025). MGSV-EC [Dataset]. https://huggingface.co/datasets/xxayt/MGSV-EC
    Authors
    Zijie Xin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Music Grounding by Short Video E-commerce (MGSV-EC) Dataset

    📄 [Paper] 📦 Feature File 🔧 [PyTorch Dataloader] 🧬 [Model Code]

      📝 Dataset Summary
    

    MGSV-EC is a large-scale dataset for the new task of Music Grounding by Short Video (MGSV), which aims to localize a specific music segment that best serves as the background music (BGM) for a given query short video. Unlike traditional video-to-music retrieval (V2MR), MGSV requires both… See the full description on the dataset page: https://huggingface.co/datasets/xxayt/MGSV-EC.

  3. Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy...

    • zenodo.org
    bin, csv, pdf
    Updated Jan 10, 2025
    Cite
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick (2025). Dataset for "SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection" [Dataset]. http://doi.org/10.5281/zenodo.14614218
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SpecTf: Transformers Enable Data-Driven Imaging Spectroscopy Cloud Detection

    Summary

    Manuscript in review. Preprint: https://arxiv.org/abs/2501.04916

    This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

    • spectf_cloud_labelbox.hdf5
      • 1,841,641 Labeled spectra from 221 EMIT Scenes.
    • spectf_cloud_mmgis.hdf5
      • 1,733,801 Labeled spectra from 313 EMIT Scenes.
      • These scenes were specifically labeled to correct false detections by an earlier version of the model.
    • train_fids.csv
      • 465 EMIT scenes comprising the training set.
    • test_fids.csv
      • 69 EMIT scenes comprising the held-out validation set.

    v2 adds validation_scenes.pdf, a PDF displaying the 69 validation scenes in RGB and Falsecolor, their existing baseline cloud masks, as well as their cloud masks produced by the ANN and GBT reference models and the SpecTf model.

    Data Description

    221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per class per scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.

    The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

    Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a Pytorch dataloader.

    Each hdf5 file contains the following arrays:

    • 'spectra'
    • 'fids'
      • The FID from which each spectrum was sampled
      • Binary string of shape (n,)
    • 'indices'
      • The (col, row) index from which each spectrum was sampled
      • Int64 of shape (n, 2)
    • 'labels'
      • Annotation label of each spectrum
        • 0 - "Clear"
        • 1 - "Cloud"
        • 2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)
          • label[label==2] = 0
      • Int64 of shape (n,2)

    Each hdf5 file also contains the following attribute (a minimal loading sketch follows the list):

    • 'bands'
      • The band center wavelengths (nm) of the spectrum
      • Float64 of shape (268,)
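
    A minimal loading sketch (ours, not part of the release notes), assuming h5py is installed and that 'bands' is stored as a file-level attribute:

    import h5py
    import numpy as np

    with h5py.File("spectf_cloud_labelbox.hdf5", "r") as f:
        spectra = f["spectra"][:]              # labeled spectra, one row per sample
        fids = f["fids"][:]                    # (n,) scene FID per spectrum
        indices = f["indices"][:]              # (n, 2) (col, row) per spectrum
        labels = f["labels"][:]                # annotation labels
        bands = np.asarray(f.attrs["bands"])   # (268,) band center wavelengths in nm

    # Merge "Cloud Shadow" into "Clear", as described above.
    labels[labels == 2] = 0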

    Acknowledgements

    The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

    This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    © 2024 California Institute of Technology. Government sponsorship acknowledged.

  4. 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    • data.niaid.nih.gov
    Updated Dec 5, 2024
    Cite
    Strohmayer, Julian (2024). 3DO Dataset | On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10925350
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    On the Generalization of WiFi-based Person-centric Sensing in Through-Wall Scenarios

    This repository contains the 3DO dataset proposed in [1].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the 3DO dataset is provided at: https://github.com/StrohmayerJ/3DO

    Dataset Description

    The 3DO dataset comprises 42 five-minute recordings (~1.25M WiFi packets) of three human activities performed by a single person, captured in a WiFi through-wall sensing scenario over three consecutive days. Each WiFi packet is annotated with a 3D trajectory label and a class label for the activities: no person/background (0), walking (1), sitting (2), and lying (3). (Note: The labels returned in our dataloader example are walking (0), sitting (1), and lying (2), because background sequences are not used.)

    The directories 3DO/d1/, 3DO/d2/, and 3DO/d3/ contain the sequences from days 1, 2, and 3, respectively. Furthermore, each sequence directory (e.g., 3DO/d1/w1/) contains a csiposreg.csv file storing the raw WiFi packet time series and a csiposreg_complex.npy cache file, which stores the complex Channel State Information (CSI) of the WiFi packet time series. (If missing, csiposreg_complex.npy is automatically generated by the provided dataloader.)
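
    As a minimal illustration (ours; the path is just an example), the cached complex CSI of a sequence can be loaded with NumPy:

    import numpy as np

    csi = np.load("3DO/d1/w1/csiposreg_complex.npy")  # complex-valued CSI cache
    amplitudes = np.abs(csi)                          # per-subcarrier amplitudes
    print(csi.shape, csi.dtype)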

    Dataset Structure:

    /3DO
    ├── d1 <-- day 1 subdirectory
    │   └── w1 <-- sequence subdirectory
    │       ├── csiposreg.csv <-- raw WiFi packet time series
    │       └── csiposreg_complex.npy <-- CSI time series cache
    ├── d2 <-- day 2 subdirectory
    └── d3 <-- day 3 subdirectory

    In [1], we use the following training, validation, and test split:

    Subset  Day  Sequences
    Train   1    w1, w2, w3, s1, s2, s3, l1, l2, l3
    Val     1    w4, s4, l4
    Test    1    w5, s5, l5
    Test    2    w1, w2, w3, w4, w5, s1, s2, s3, s4, s5, l1, l2, l3, l4, l5
    Test    3    w1, w2, w4, w5, s1, s2, s3, s4, s5, l1, l2, l4

    w = walking, s = sitting, and l = lying

    Note: On each day, we additionally recorded three ten-minute background sequences (b1, b2, b3), which are provided as well.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, J., Kampel, M. (2025). On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios. In: Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15315. Springer, Cham. https://doi.org/10.1007/978-3-031-78354-8_13

    BibTeX citation:

    @inproceedings{strohmayerOn2025, author="Strohmayer, Julian and Kampel, Martin", title="On the Generalization of WiFi-Based Person-Centric Sensing in Through-Wall Scenarios", booktitle="Pattern Recognition", year="2025", publisher="Springer Nature Switzerland", address="Cham", pages="194--211", isbn="978-3-031-78354-8" }

  5. Dataset for "Spectroscopic Transformer for Improved EMIT Cloud Masks"

    • zenodo.org
    bin, csv
    Updated Jan 7, 2025
    Cite
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick (2025). Dataset for "Spectroscopic Transformer for Improved EMIT Cloud Masks" [Dataset]. http://doi.org/10.5281/zenodo.14607938
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jake Lee; Michael Kiper; David R. Thompson; Philip Brodrick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Spectroscopic Transformer for Improved EMIT Cloud Masks

    Summary

    Manuscript in preparation/submitted.

    This repository contains the dataset used to train and evaluate the Spectroscopic Transformer model for EMIT cloud screening.

    • spectf_cloud_labelbox.hdf5
      • 1,841,641 Labeled spectra from 221 EMIT Scenes.
    • spectf_cloud_mmgis.hdf5
      • 1,733,801 Labeled spectra from 313 EMIT Scenes.
      • These scenes were specifically labeled to correct false detections by an earlier version of the model.
    • train_fids.csv
      • 465 EMIT scenes comprising the training set.
    • test_fids.csv
      • 69 EMIT scenes comprising the held-out validation set.

    Data Description

    221 EMIT Scenes were initially selected for labeling with diversity in mind. After sparse segmentation labeling of confident regions in Labelbox, up to 10,000 spectra were selected per class per scene to form the spectf_cloud_labelbox dataset. We deployed a preliminary model trained on these spectra on all EMIT scenes observed in March 2024, then labeled another 313 EMIT Scenes using MMGIS's polygonal labeling tool to correct false positive and false negative detections. After similarly sampling spectra from these scenes, a total of 3,575,442 spectra were labeled and sampled.

    The train/test split was randomly determined by scene FID to prevent the same EMIT scene from contributing spectra to both the training and validation datasets.

    Please refer to Section 4.2 in the paper for a complete description, and to our code repository for example usage and a Pytorch dataloader.

    Each hdf5 file contains the following arrays:

    • 'spectra'
    • 'fids'
      • The FID from which each spectrum was sampled
      • Binary string of shape (n,)
    • 'indices'
      • The (col, row) index from which each spectrum was sampled
      • Int64 of shape (n, 2)
    • 'labels'
      • Annotation label of each spectrum
        • 0 - "Clear"
        • 1 - "Cloud"
        • 2 - "Cloud Shadow" (Only for the Labelbox dataset, and this class was combined with the clear class for this work. See paper for details.)
          • label[label==2] = 0
      • Int64 of shape (n,2)

    Each hdf5 file contains the following attribute:

    • 'bands'
      • The band center wavelengths (nm) of the spectrum
      • Float64 of shape (268,)

    Acknowledgements

    The EMIT online mapping tool was developed by the JPL MMGIS team. The High Performance Computing resources used in this investigation were provided by funding from the JPL Information and Technology Solutions Directorate.

    This research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

    © 2024 California Institute of Technology. Government sponsorship acknowledged.

  6. Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 4, 2025
    Cite
    Kampel, Martin (2025). Wallhack1.8k Dataset | Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8188998
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Wallhack1.8k dataset for WiFi-based long-range activity recognition in Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS)/Through-Wall scenarios, as proposed in [1,2], as well as the CAD models (of 3D-printable parts) of the WiFi systems proposed in [2].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the Wallhack1.8k dataset is provided at: https://github.com/StrohmayerJ/wallhack1.8k

    Dataset Description

    The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).

    To assess cross-scenario and cross-system generalization, WiFi packet sequences were collected in LoS and through-wall (NLoS) scenarios, utilizing two different WiFi systems (BQ: biquad antenna and PIFA: printed inverted-F antenna). The dataset is structured accordingly:

    LOS/BQ/ <- WiFi packets collected in the LoS scenario using the BQ system

    LOS/PIFA/ <- WiFi packets collected in the LoS scenario using the PIFA system

    NLOS/BQ/ <- WiFi packets collected in the NLoS scenario using the BQ system

    NLOS/PIFA/ <- WiFi packets collected in the NLoS scenario using the PIFA system

    These directories contain the raw WiFi packet time series (see Table 1). Each row represents a single WiFi packet with the complex CSI vector H being stored in the "data" field and the class label being stored in the "class" field. H is of the form [I, R, I, R, ..., I, R], where two consecutive entries represent imaginary and real parts of complex numbers (the Channel Frequency Responses of subcarriers). Taking the absolute value of H (e.g., via numpy.abs(H)) yields the subcarrier amplitudes A.
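
    A minimal sketch (ours) of pairing the interleaved entries into complex values before taking amplitudes; the reference dataloader in the linked repository is authoritative:

    import numpy as np

    # H as parsed from the "data" field: interleaved [I, R, I, R, ...] values
    # (dummy numbers here, for illustration only).
    H = np.array([0.5, 1.0, -0.3, 0.8, 0.1, -1.2])
    csi = H[1::2] + 1j * H[0::2]  # real parts at odd indices, imaginary parts at even indices
    A = np.abs(csi)               # subcarrier amplitudes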

    To extract the 52 L-LTF subcarriers used in [1], the following indices of A are to be selected:

    52 L-LTF subcarriers

    csi_valid_subcarrier_index = []
    csi_valid_subcarrier_index += [i for i in range(6, 32)]   # 26 L-LTF subcarriers
    csi_valid_subcarrier_index += [i for i in range(33, 59)]  # 26 L-LTF subcarriers

    Additional 56 HT-LTF subcarriers can be selected via:

    56 HT-LTF subcarriers

    csi_valid_subcarrier_index += [i for i in range(66, 94)]
    csi_valid_subcarrier_index += [i for i in range(95, 123)]

    For more details on subcarrier selection, see ESP-IDF (Section Wi-Fi Channel State Information) and esp-csi.

    Extracted amplitude spectrograms, along with the corresponding label files of the train/validation/test split ("trainLabels.csv", "validationLabels.csv", and "testLabels.csv"), can be found in the spectrograms/ directory.

    The columns in the label files correspond to the following: [Spectrogram index, Class label, Room label]

    Spectrogram index: [0, ..., n]

    Class label: [0,1,2], where 0 = "no presence", 1 = "walking", and 2 = "walking + arm-waving."

    Room label: [0,1,2,3,4,5], where labels 1-5 correspond to the room number in the NLoS scenario (see Fig. 3 in [1]). The label 0 corresponds to no room and is used for the "no presence" class.
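
    A minimal sketch (ours) of reading one of the label files with pandas, assuming the column order given above:

    import pandas as pd

    labels = pd.read_csv("spectrograms/trainLabels.csv")
    class_labels = labels.iloc[:, 1]  # 0 = "no presence", 1 = "walking", 2 = "walking + arm-waving"
    room_labels = labels.iloc[:, 2]   # 0 = no room, 1-5 = room number in the NLoS scenario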

    Dataset Overview:

    Table 1: Raw WiFi packet sequences.

    Scenario System "no presence" / label 0 "walking" / label 1 "walking + arm-waving" / label 2 Total

    LoS BQ b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

    LoS PIFA b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

    NLoS BQ b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

    NLoS PIFA b1.csv w1.csv, w2.csv, w3.csv, w4.csv and w5.csv ww1.csv, ww2.csv, ww3.csv, ww4.csv and ww5.csv

    4 20 20 44

    Table 2: Sample/Spectrogram distribution across activity classes in Wallhack1.8k.

    Scenario  System  "no presence" / label 0  "walking" / label 1  "walking + arm-waving" / label 2  Total
    LoS       BQ      149                      154                  155
    LoS       PIFA    149                      160                  152
    NLoS      BQ      148                      150                  152
    NLoS      PIFA    143                      147                  147
    Total             589                      611                  606                               1,806

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to one of our papers [1,2].

    [1] Strohmayer, Julian, and Martin Kampel. (2024). “Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition”, In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 42-56). Cham: Springer Nature Switzerland, doi: https://doi.org/10.1007/978-3-031-63211-2_4.

    [2] Strohmayer, Julian, and Martin Kampel., “Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition,” 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 2024, pp. 3594-3599, doi: https://doi.org/10.1109/ICIP51287.2024.10647666.

    BibTeX citations:

    @inproceedings{strohmayer2024data, title={Data Augmentation Techniques for Cross-Domain WiFi CSI-Based Human Activity Recognition}, author={Strohmayer, Julian and Kampel, Martin}, booktitle={IFIP International Conference on Artificial Intelligence Applications and Innovations}, pages={42--56}, year={2024}, organization={Springer}}

    @INPROCEEDINGS{10647666, author={Strohmayer, Julian and Kampel, Martin}, booktitle={2024 IEEE International Conference on Image Processing (ICIP)}, title={Directional Antenna Systems for Long-Range Through-Wall Human Activity Recognition}, year={2024}, volume={}, number={}, pages={3594-3599}, keywords={Visualization;Accuracy;System performance;Directional antennas;Directive antennas;Reflector antennas;Sensors;Human Activity Recognition;WiFi;Channel State Information;Through-Wall Sensing;ESP32}, doi={10.1109/ICIP51287.2024.10647666}}

  7. Data from: acusim: a synthetic dataset for cervicocranial acupuncture points localisation

    • search.dataone.org
    • datadryad.org
    Updated Apr 2, 2025
    Cite
    Qilei Sun; Jiatao Ma; Paul Craig; Linjun Dai; EngGee Lim (2025). acusim: a synthetic dataset for cervicocranial acupuncture points localisation [Dataset]. http://doi.org/10.5061/dryad.zs7h44jkz
    Dataset provided by
    Dryad Digital Repository
    Authors
    Qilei Sun; Jiatao Ma; Paul Craig; Linjun Dai; EngGee Lim
    Description

    The locations of acupuncture points (acupoints) differ among human individuals due to variations in factors such as height, weight, and fat proportions. However, acupoint annotation is expert-dependent, labour-intensive, and highly expensive, which limits the data size and detection accuracy. In this paper, we introduce the "AcuSim" dataset as a new synthetic dataset for the task of localising points on the human cervicocranial area from an input image, using an automatic render and labelling pipeline during acupuncture treatment. It includes the creation of 63,936 RGB-D images and 504 synthetic anatomical models with 174 volumetric acupoints annotated, to capture the variability and diversity of human anatomies. The study validates a convolutional neural network (CNN) on the proposed dataset with an accuracy of 99.73% and shows that 92.86% of predictions in the validation set align within a 5 mm margin of error when compared to expert-annotated data. This dataset addresses the ...

    AcuSim: A Synthetic Dataset for Cervicocranial Acupuncture Points Localisation

    Dryad DOI: https://doi.org/10.5061/dryad.zs7h44jkz

    Dataset Overview

    A multi-view acupuncture point dataset containing:

    • 64x64, 128x128, 256x256, 512x512, and 1024x1024 resolution RGB images
    • Corresponding JSON annotations with:
      • 2D/3D keypoint coordinates
      • Visibility weights (0.9-1.0 scale)
      • Meridian category indices
      • Visibility masks
    • 174 standard acupuncture points (map.txt)
    • Occlusion handling implementation

    Key Features

    • Multi-view Rendering: Generated using Blender 3.5 with realistic occlusion simulation
    • Structured Annotation:
      • Default initialization for occluded points ([0.0, 0.0, 0.5])
      • Meridian category preservation for occluded points
      • Weighted visibility scoring
    • ML-Ready Format: Preconfigured PyTorch DataLoader implementation

    Dataset Structure

    dataset_root/
    ├── map.txt         # Complete list of 174 acupuncture points
    ├── train/
    ...,
    
  8. HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using Directional Antennas

    • data.niaid.nih.gov
    • zenodo.org
    Updated Nov 27, 2024
    Cite
    Strohmayer, Julian (2024). HALOC Dataset | WiFi CSI-based Long-Range Person Localization Using Directional Antennas [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10715594
    Dataset provided by
    Strohmayer, Julian
    Kampel, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WiFi CSI-based Long-Range Person Localization Using Directional Antennas

    This repository contains the HAllway LOCalization (HALOC) dataset and WiFi system CAD files as proposed in [1].

    PyTorch Dataloader

    A minimal PyTorch dataloader for the HALOC dataset is provided at: https://github.com/StrohmayerJ/HALOC

    Dataset Description

    The HALOC dataset comprises six sequences (in .csv format) of synchronized WiFi Channel State Information (CSI) and 3D position labels. Each row in a given .csv file represents a single WiFi packet captured via ESP-IDF, with CSI and 3D coordinates stored in the "data" and ("x", "y", "z") fields, respectively.
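
    A minimal sketch (ours; the path is just an example) of reading one sequence and separating the CSI payload from the 3D position labels:

    import pandas as pd

    seq = pd.read_csv("HALOC/0.csv")
    positions = seq[["x", "y", "z"]].to_numpy()  # 3D position labels
    csi_raw = seq["data"]                        # raw CSI string per WiFi packet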

    The sequences are divided into training, validation, and test subsets as follows:

    Subset      Sequences
    Training    0.csv, 1.csv, 2.csv and 3.csv
    Validation  4.csv
    Test        5.csv

    WiFi System CAD files

    We provide CAD files for the 3D printable parts of the proposed WiFi system consisting of the main housing (housing.stl), the lid (lid.stl), and the carrier board (carrier.stl) featuring mounting points for the Nvidia Jetson Orin Nano and the ESP32-S3-DevKitC-1 module.

    Download and Use

    This data may be used for non-commercial research purposes only. If you publish material based on this data, we request that you include a reference to our paper [1].

    [1] Strohmayer, J., and Kampel, M. (2024). “WiFi CSI-based Long-Range Person Localization Using Directional Antennas”, The Second Tiny Papers Track at ICLR 2024, May 2024, Vienna, Austria. https://openreview.net/forum?id=AOJFcEh5Eb

    BibTeX citation:

    @inproceedings{strohmayer2024wifi,title={WiFi {CSI}-based Long-Range Person Localization Using Directional Antennas},author={Julian Strohmayer and Martin Kampel},booktitle={The Second Tiny Papers Track at ICLR 2024},year={2024},url={https://openreview.net/forum?id=AOJFcEh5Eb}}

  9. InfantMarmosetsVox

    • data.niaid.nih.gov
    Updated Nov 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sarkar, Eklavya (2023). InfantMarmosetsVox [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10130104
    Dataset provided by
    Sarkar, Eklavya
    Magimai Doss, Mathew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher.

    References

    This dataset was collected and partially used for the paper "Automatic detection and classification of marmoset vocalizations using deep and recurrent neural networks" by Zhang et al. It is also used for the experiments in the paper "Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?" by E. Sarkar and M. Magimai-Doss. The source code of a PyTorch DataLoader reading this data is available at https://github.com/idiap/ssl-caller-detection.

    Citation

    Any publication (e.g., conference paper, journal article, technical report, book chapter, etc.) resulting from the usage of InfantMarmosetsVox must cite the following publication: Sarkar, E., Magimai.-Doss, M. (2023) Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers? Proc. INTERSPEECH 2023, 1189-1193, doi: 10.21437/Interspeech.2023-1968

    Bibtex:

    @inproceedings{sarkar23_interspeech, author={Eklavya Sarkar and Mathew Magimai.-Doss}, title={{Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?}}, year=2023, booktitle={Proc. INTERSPEECH 2023}, pages={1189--1193}, doi={10.21437/Interspeech.2023-1968}}

  10. Food Recognition 2022

    • kaggle.com
    Updated Feb 12, 2022
    Cite
    Sai Nikhilesh Reddy (2022). Food Recognition 2022 [Dataset]. https://www.kaggle.com/datasets/sainikhileshreddy/food-recognition-2022
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sai Nikhilesh Reddy
    Description

    Food Recognition Benchmark 2022 😋

    This dataset is Preprocessed⚙️, Compressed🗜️, and Streamable📶!

    Problem Statement

    The goal of this benchmark is to train models which can look at images of food items and detect the individual food items present in them. We use a novel dataset of food images collected through the MyFoodRepo app, where numerous volunteer Swiss users provide images of their daily food intake in the context of a digital cohort called Food & You. This growing data set has been annotated - or automatic annotations have been verified - with respect to segmentation, classification (mapping the individual food items onto an ontology of Swiss Food items), and weight/volume estimation.

    Datasets

    Finding annotated food images is difficult. There are some databases with some annotations, but they tend to be limited in important ways. To put it bluntly: most food images on the internet are a lie. Search for any dish, and you’ll find beautiful stock photography of that particular dish. Same on social media: we share photos of dishes with our friends when the image is exceptionally beautiful. But algorithms need to work on real-world images. In addition, annotations are generally missing - ideally, food images would be annotated with proper segmentation, classification, and volume/weight estimates. With this 2022 iteration of the Food Recognition Benchmark, AIcrowd released v2.0 of the MyFoodRepo dataset, containing a training set of 39,962 food images with 76,491 annotations.

    The zipped datasets are in MS-COCO format:

    raw_data/public_training_set_release_2.0.tar.gz: Training Set -> 39,962 food images (RGB) -> 76,491 annotations -> 498 food classes
    raw_data/public_validation_set_2.0.tar.gz: Validation Set -> 1,000 food images (RGB) -> 1,830 annotations -> 498 food classes
    raw_data/public_test_release_2.0.tar.gz: Public Test Set -> Food Recognition Benchmark 2022
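
    Since the archives follow the MS-COCO format, the annotations can be inspected with pycocotools; a minimal sketch (ours, with a hypothetical annotation file path inside the extracted archive):

    from pycocotools.coco import COCO

    coco = COCO("public_training_set_release_2.0/annotations.json")  # hypothetical path
    img_ids = coco.getImgIds()
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0]))
    print(len(img_ids), "images;", len(anns), "annotations for the first image")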

    Check the usage at the notebook

    Kaggle Notebook - https://www.kaggle.com/sainikhileshreddy/how-to-use-the-dataset

    Usage of the processed kaggle dataset

    import hub
    ds = hub.dataset('/kaggle/input/food-recognition-2022/hub/train/')
    

    Usage of the dataset anywhere (through streaming)

    import hub
    ds = hub.dataset('hub://sainikhileshreddy/food-recognition-2022-train/')
    

    Usage of the hub dataset using popular deep learning frameworks

    1. Food Recognition 2020 with PyTorch in Python

    dataloader = ds.pytorch(num_workers = 2, shuffle = True, transform = transform, batch_size= batch_size)
    

    2. Food Recognition 2020 with TensorFlow in Python

    ds_tensorflow = ds.tensorflow()
    

    Evaluation

    The benchmark uses the official detection evaluation metrics used by COCO. The primary evaluation metric is AP @ IoU=0.50:0.05:0.95. The secondary evaluation metric is AR @ IoU=0.50:0.05:0.95. A further discussion about the evaluation metric can be found here.

    Dataset Original Source

    The dataset has been taken from the Food Recognition Benchmark 2022. You can find more details about the challenge at the link below: https://www.aicrowd.com/challenges/food-recognition-benchmark-2022

    Resources

    1. Activeloop Hub: https://docs.activeloop.ai/
    2. Github: SaiNikhileshReddy | Food-Recognition-2022
    3. Kaggle Discussion - What is Activeloop Hub Format?
  11. Data and code for training and testing a ResMLP model with experience replay for machine-learning physics parameterization

    • zenodo.org
    zip
    Updated Feb 20, 2025
    Cite
    Jianda Chen; Minghua Zhang; Wuyin Lin; Tao Zhang; Wei Xue (2025). Data and code for training and testing a ResMLP model with experience replay for machine-learning physics parameterization [Dataset]. http://doi.org/10.5281/zenodo.13690812
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jianda Chen; Minghua Zhang; Wuyin Lin; Tao Zhang; Wei Xue
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This directory contains the training data and code for training and testing a ResMLP with experience replay for creating a machine-learning physics parameterization for the Community Atmospheric Model.

    The directory is structured as follows:

    1. Download training and testing data: https://portal.nersc.gov/archive/home/z/zhangtao/www/hybird_GCM_ML

    2. Unzip nncam_training.zip

       nncam_training
       - models: model definitions of ResMLP and other models for comparison purposes
       - dataloader: utility scripts to load data into a PyTorch dataset
       - training_scripts: scripts to train the ResMLP model with/without experience replay
       - offline_test: scripts to perform the offline test (Table 2, Figure 2)

    3. Unzip nncam_coupling.zip

       nncam_srcmods
       - SourceMods: SourceMods to be used with CAM modules for coupling with the neural network
       - otherfiles: additional configuration files to set up and run SPCAM with the neural network
       - pythonfiles: Python scripts to run the neural network and couple it with CAM
       - ClimAnalysis
         - paper_plots.ipynb: scripts to produce online evaluation figures (Figure 1, Figures 3-10)

  12. FireSR: A Dataset for Super-Resolution and Segmentation of Burned Areas

    • zenodo.org
    application/gzip
    Updated Aug 29, 2024
    Cite
    Eric Brune (2024). FireSR: A Dataset for Super-Resolution and Segmentation of Burned Areas [Dataset]. http://doi.org/10.5281/zenodo.13384289
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eric Brune
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jun 5, 2024
    Description


    # FireSR Dataset

    ## Overview

    **FireSR** is a dataset designed for the super-resolution and segmentation of wildfire-burned areas. It includes data for all wildfire events in Canada from 2017 to 2023 that exceed 2000 hectares in size, as reported by the National Burned Area Composite (NBAC). The dataset aims to support high-resolution daily monitoring and improve wildfire management using machine learning techniques.

    ## Dataset Structure

    The dataset is organized into several directories, each containing data relevant to different aspects of wildfire monitoring:

    - **S2**: Contains Sentinel-2 images.
    - **pre**: Pre-fire Sentinel-2 images (high resolution).
    - **post**: Post-fire Sentinel-2 images (high resolution).

    - **mask**: Contains NBAC polygons, which serve as ground truth masks for the burned areas.
    - **pre**: Burned area labels from the year before the fire, using the same spatial bounds as the fire events of the current year.
    - **post**: Burned area labels corresponding to post-fire conditions.

    - **MODIS**: Contains post-fire MODIS images (lower resolution).

    - **LULC**: Contains land use/land cover data from ESRI Sentinel-2 10-Meter Land Use/Land Cover (2017-2023).

    - **Daymet**: Contains weather data from Daymet V4: Daily Surface Weather and Climatological Summaries.

    ### File Naming Convention

    Each GeoTIFF (.tif) file is named according to the format: `CA_

    ### Directory Structure

    The dataset is organized as follows:

    ```
    FireSR/

    ├── dataset/
    │ ├── S2/
    │ │ ├── post/
    │ │ │ ├── CA_2017_AB_204.tif
    │ │ │ ├── CA_2017_AB_2418.tif
    │ │ │ └── ...
    │ │ ├── pre/
    │ │ │ ├── CA_2017_AB_204.tif
    │ │ │ ├── CA_2017_AB_2418.tif
    │ │ │ └── ...
    │ ├── mask/
    │ │ ├── post/
    │ │ │ ├── CA_2017_AB_204.tif
    │ │ │ ├── CA_2017_AB_2418.tif
    │ │ │ └── ...
    │ │ ├── pre/
    │ │ │ ├── CA_2017_AB_204.tif
    │ │ │ ├── CA_2017_AB_2418.tif
    │ │ │ └── ...
    │ ├── MODIS/
    │ │ ├── CA_2017_AB_204.tif
    │ │ ├── CA_2017_AB_2418.tif
    │ │ └── ...
    │ ├── LULC/
    │ │ ├── CA_2017_AB_204.tif
    │ │ ├── CA_2017_AB_2418.tif
    │ │ └── ...
    │ ├── Daymet/
    │ │ ├── CA_2017_AB_204.tif
    │ │ ├── CA_2017_AB_2418.tif
    │ │ └── ...
    ```

    ### Spatial Resolution and Channels

    - **Sentinel-2 (S2) Images**: 20 meters (Bands: B12, B8, B4)
    - **MODIS Images**: 250 meters (Bands: B7, B2, B1)
    - **NBAC Burned Area Labels**: 20 meters (1 channel, binary classification: burned/unburned)
    - **Daymet Weather Data**: 1000 meters (7 channels: dayl, prcp, srad, swe, tmax, tmin, vp)
    - **ESRI Land Use/Land Cover Data**: 10 meters (1 channel with 9 classes: water, trees, flooded vegetation, crops, built area, bare ground, snow/ice, clouds, rangeland)

    **Daymet Weather Data**: The Daymet dataset includes seven channels that provide various weather-related parameters, which are crucial for understanding and modeling wildfire conditions:

    | Name | Units | Min | Max | Description |
    |------|-------|-----|-----|-------------|
    | dayl | seconds | 0 | 86400 | Duration of the daylight period, based on the period of the day during which the sun is above a hypothetical flat horizon. |
    | prcp | mm | 0 | 544 | Daily total precipitation, sum of all forms converted to water-equivalent. |
    | srad | W/m^2 | 0 | 1051 | Incident shortwave radiation flux density, averaged over the daylight period of the day. |
    | swe | kg/m^2 | 0 | 13931 | Snow water equivalent, representing the amount of water contained within the snowpack. |
    | tmax | °C | -60 | 60 | Daily maximum 2-meter air temperature. |
    | tmin | °C | -60 | 42 | Daily minimum 2-meter air temperature. |
    | vp | Pa | 0 | 8230 | Daily average partial pressure of water vapor. |

    **ESRI Land Use/Land Cover Data**: The ESRI 10m Annual Land Cover dataset provides a time series of global maps of land use and land cover (LULC) from 2017 to 2023 at a 10-meter resolution. These maps are derived from ESA Sentinel-2 imagery and are generated by Impact Observatory using a deep learning model trained on billions of human-labeled pixels. Each map is a composite of LULC predictions for 9 classes throughout the year, offering a representative snapshot of each year.

    | Class Value | Land Cover Class |
    |-------------|------------------|
    | 1 | Water |
    | 2 | Trees |
    | 4 | Flooded Vegetation |
    | 5 | Crops |
    | 7 | Built Area |
    | 8 | Bare Ground |
    | 9 | Snow/Ice |
    | 10 | Clouds |
    | 11 | Rangeland |


    ## Usage Tutorial

    To help users get started with FireSR, we provide a comprehensive tutorial with scripts for data extraction and processing. Below is an example workflow:

    ### Step 1: Extract FireSR.tar.gz

    ```bash
    tar -xvf FireSR.tar.gz
    ```

    ### Step 2: Tiling the GeoTIFF Files

    The dataset contains high-resolution GeoTIFF files. For machine learning models, it may be useful to tile these images into smaller patches. Here's a Python script to tile the images:

    ```python
    import rasterio
    from rasterio.windows import Window
    import os

    def tile_image(image_path, output_dir, tile_size=128):
        with rasterio.open(image_path) as src:
            for i in range(0, src.height, tile_size):
                for j in range(0, src.width, tile_size):
                    window = Window(j, i, tile_size, tile_size)
                    transform = src.window_transform(window)
                    outpath = os.path.join(output_dir, f"{os.path.basename(image_path).split('.')[0]}_{i}_{j}.tif")
                    with rasterio.open(outpath, 'w', driver='GTiff', height=tile_size, width=tile_size, count=src.count, dtype=src.dtypes[0], crs=src.crs, transform=transform) as dst:
                        dst.write(src.read(window=window))

    # Example usage
    tile_image('FireSR/dataset/S2/post/CA_2017_AB_204.tif', 'tiled_images/')
    ```

    ### Step 3: Loading Data into a Machine Learning Model

    After tiling, the images can be loaded into a machine learning model using libraries like PyTorch or TensorFlow. Here's an example using PyTorch:

    ```python
    import os
    import torch
    from torch.utils.data import Dataset
    from torchvision import transforms
    import rasterio

    class FireSRDataset(Dataset):
        def __init__(self, image_dir, transform=None):
            self.image_dir = image_dir
            self.transform = transform
            self.image_paths = [os.path.join(image_dir, f) for f in os.listdir(image_dir) if f.endswith('.tif')]

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, idx):
            image_path = self.image_paths[idx]
            with rasterio.open(image_path) as src:
                image = src.read()
            if self.transform:
                image = self.transform(image)
            return image

    # Example usage
    dataset = FireSRDataset('tiled_images/', transform=transforms.ToTensor())
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
    ```

    ## License

    This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). You are free to share and adapt the material as long as appropriate credit is given.

    ## Contact

    For any questions or further information, please contact:
    - Name: Eric Brune
    - Email: ebrune@kth.se
