81 datasets found
  1. Webdataset

    • huggingface.co
    Updated Aug 30, 2023
    Cite
    wangmaohua (2023). Webdataset [Dataset]. https://huggingface.co/datasets/wangmaohua/Webdataset
    Explore at:
    Dataset updated
    Aug 30, 2023
    Authors
    wangmaohua
    Description

    wangmaohua/Webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. mnist-webdataset-png

    • huggingface.co
    Updated Mar 14, 2024
    Cite
    Hayden Donnelly (2024). mnist-webdataset-png [Dataset]. https://huggingface.co/datasets/hayden-donnelly/mnist-webdataset-png
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 14, 2024
    Authors
    Hayden Donnelly
    Description

    MNIST WebDataset PNG

    The MNIST dataset with samples stored as PNG images and compiled into the WebDataset format.

      DALI/JAX Example
    

    The following code shows how this dataset can be loaded into JAX arrays by DALI.

    from nvidia.dali import pipeline_def
    import nvidia.dali.fn as fn
    import nvidia.dali.types as types
    from nvidia.dali.plugin.jax import DALIGenericIterator
    from nvidia.dali.plugin.base_iterator import LastBatchPolicy

    def get_data_iterator(batch_size, dataset_path):…

    See the full description on the dataset page: https://huggingface.co/datasets/hayden-donnelly/mnist-webdataset-png.

  3. fluentspeechcommands in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). fluentspeechcommands in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722453
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the fluentspeechcommands dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf fluentspeechcommands_train_0000000.tar |head
    -r--r--r-- bigdata/bigdata 174 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json
    -r--r--r-- bigdata/bigdata 131116 2025-01-17 07:20 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.wav
    -r--r--r-- bigdata/bigdata  136 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.json
    -r--r--r-- bigdata/bigdata 71376 2025-01-17 07:20 3f770360-44e3-11e9-bb82-bdba769643e7.wav
    -r--r--r-- bigdata/bigdata  132 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.json
    -r--r--r-- bigdata/bigdata 68310 2025-01-17 07:20 3ea38ea0-4613-11e9-bc65-55b32b211b66.wav
    -r--r--r-- bigdata/bigdata  143 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.json
    -r--r--r-- bigdata/bigdata 89208 2025-01-17 07:20 61578420-45ea-11e9-b578-494a5b19ab8b.wav
    -r--r--r-- bigdata/bigdata  132 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.json
    -r--r--r-- bigdata/bigdata 76502 2025-01-17 07:20 c4595690-4520-11e9-a843-8db76f4b5e29.wav

    $ cat 48fac300-45c8-11e9-8ec0-7bf21d1cfe30.json 
    {"speakerId": "52XVOeXMXYuaElyw", "transcription": "I need to practice my English. Switch the language", "action": "change language", "object": "English", "location": "none"}
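    The pairing described above (one audio file plus one JSON file per example, sharing a basename) can be reproduced with the Python standard library alone. The following is a minimal sketch, not the WebDataset library's own loader: the shard is built in memory with made-up contents, the helper names are hypothetical, and the key is taken as everything before the last dot, which assumes each file name carries a single extension.

```python
import io
import json
import tarfile
from collections import defaultdict


def build_demo_shard() -> bytes:
    """Build a tiny in-memory tar with two (json, wav) pairs; contents are made up."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, label in [("sample-a", "change language"), ("sample-b", "activate")]:
            files = {
                f"{key}.json": json.dumps({"action": label}).encode("utf-8"),
                f"{key}.wav": b"RIFF-placeholder",  # placeholder bytes, not real audio
            }
            for name, payload in files.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def group_samples(shard: bytes) -> dict:
    """Group tar members by basename so each sample maps extension -> raw bytes."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumption: a single extension per file, so split on the last dot.
            key, _, ext = member.name.rpartition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = group_samples(build_demo_shard())
for key, parts in samples.items():
    meta = json.loads(parts["json"])
    print(key, sorted(parts), meta["action"])
```

    To read a real shard, the same grouping function could be pointed at the bytes of a downloaded tar file; decoding the WAV payload is left out here.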
  4. librilight-webdataset

    • huggingface.co
    Updated Sep 1, 2024
    + more versions
    Cite
    Collabora (2024). librilight-webdataset [Dataset]. https://huggingface.co/datasets/collabora/librilight-webdataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2024
    Dataset authored and provided by
    Collabora (http://collabora.com/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    collabora/librilight-webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. speechocean762 in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). speechocean762 in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14725291
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the speechocean762 dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf speechocean762_train.tar |head          
    -r--r--r-- bigdata/bigdata 607 2025-01-13 14:49 000010011.json
    -r--r--r-- bigdata/bigdata 82604 2025-01-13 14:49 000010011.wav
    -r--r--r-- bigdata/bigdata  646 2025-01-13 14:49 000010035.json
    -r--r--r-- bigdata/bigdata 109804 2025-01-13 14:49 000010035.wav
    -r--r--r-- bigdata/bigdata  630 2025-01-13 14:49 000010053.json
    -r--r--r-- bigdata/bigdata 107244 2025-01-13 14:49 000010053.wav
    -r--r--r-- bigdata/bigdata  561 2025-01-13 14:49 000010063.json
    -r--r--r-- bigdata/bigdata 106252 2025-01-13 14:49 000010063.wav
    -r--r--r-- bigdata/bigdata  671 2025-01-13 14:49 000010069.json
    -r--r--r-- bigdata/bigdata 96364 2025-01-13 14:49 000010069.wav
    $ cat 000010011.json
    {"id": 10011, "accuracy": 8, "completeness": 10.0, "fluency": 9, "prosodic": 9, "words": [{"accuracy": 10, "stress": 10, "phones": ["W", "IY0"], "total": 10, "text": "WE", "phones-accuracy": [2.0, 2.0]}, {"accuracy": 10, "stress": 10, "phones": ["K", "AO0", "L"], "total": 10, "text": "CALL", "phones-accuracy": [2.0, 1.8, 1.8]}, {"accuracy": 10, "stress": 10, "phones": ["IH0", "T"], "total": 10, "text": "IT", "phones-accuracy": [2.0, 2.0]}, {"accuracy": 6, "stress": 10, "phones": ["B", "EH0", "R"], "total": 6, "text": "BEAR", "phones-accuracy": [2.0, 1.0, 1.0]}], "total": 8, "text": "WE CALL IT BEAR"}
  6. maestro in WebDataset Format Creators

    • zenodo.org
    tar
    Updated Feb 12, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). maestro in WebDataset Format Creators [Dataset]. http://doi.org/10.5281/zenodo.14858022
    Explore at:
    tar (available download format)
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the maestro dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf maestro_train_000000.tar|head       
    -r--r--r-- bigdata/bigdata 327458 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.json
    -r--r--r-- bigdata/bigdata 120375940 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.wav
    -r--r--r-- bigdata/bigdata  625054 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.json
    -r--r--r-- bigdata/bigdata 137713368 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.wav
    -r--r--r-- bigdata/bigdata  356393 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.json
    -r--r--r-- bigdata/bigdata 132159804 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.wav
    -r--r--r-- bigdata/bigdata  255210 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    -r--r--r-- bigdata/bigdata 58523088 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.wav
    -r--r--r-- bigdata/bigdata  1190145 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.json
    -r--r--r-- bigdata/bigdata 390151460 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.wav
    

    $ cat ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    [
      ...
      {"start": 323.546875, "end": 323.5859375, "note": 51}, 
      {"start": 323.703125, "end": 323.74869791666663, "note": 51}, 
      {"start": 323.8450520833333, "end": 323.8919270833333, "note": 51}, 
      {"start": 324.00390625, "end": 324.0442708333333, "note": 51},
      ...
    ]
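    As an illustration of how the note list above might be consumed, the sketch below computes each note's duration from its "start" and "end" fields. It uses only the four events visible in the excerpt (the elided entries are not reconstructed). Treating "start"/"end" as seconds and "note" as a MIDI pitch number is an assumption, and `midi_name` is a hypothetical helper.

```python
import json

# The four note events visible in the excerpt; the elided ("...") entries are omitted.
notes_json = """
[
  {"start": 323.546875, "end": 323.5859375, "note": 51},
  {"start": 323.703125, "end": 323.74869791666663, "note": 51},
  {"start": 323.8450520833333, "end": 323.8919270833333, "note": 51},
  {"start": 324.00390625, "end": 324.0442708333333, "note": 51}
]
"""

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_name(number: int) -> str:
    """Name a pitch under the common convention that MIDI note 60 is C4 (assumed here)."""
    return f"{PITCH_CLASSES[number % 12]}{number // 12 - 1}"


notes = json.loads(notes_json)
# Duration of each event, in the same unit as the timestamps (assumed to be seconds).
durations = [event["end"] - event["start"] for event in notes]

for event, duration in zip(notes, durations):
    print(midi_name(event["note"]), f"{duration:.4f}")
```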
  7. webdataset

    • huggingface.co
    Updated Nov 14, 2024
    Cite
    Suin Hong (2024). webdataset [Dataset]. https://huggingface.co/datasets/0208suin/webdataset
    Explore at:
    Dataset updated
    Nov 14, 2024
    Authors
    Suin Hong
    Description

    0208suin/webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. BirdCLEF2024-webdataset

    • kaggle.com
    Updated Jun 9, 2024
    Cite
    T-Hasumi (2024). BirdCLEF2024-webdataset [Dataset]. https://www.kaggle.com/datasets/tkyhsm/birdclef2024-webdataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 9, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    T-Hasumi
    Description

    Dataset

    This dataset was created by T-Hasumi

    Contents

  9. Sentence/Table Pair Data from Wikipedia for Pre-training with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Explore at:
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    Xiang Deng
    Huan Sun
    Cong Yu
    You Wu
    Alyssa Lees
    Yu Su
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence, Text-only

    table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    # (brace ranges expand with two dots: {000000..000763})
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000..000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)  # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)  # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them in your preferred data format.
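    One way to iterate through all examples and dump them, sketched here with the standard library only rather than with WebDataset itself: read every .json member of every tar shard in a directory and write one JSON object per line (JSONL). The shard names, record contents, and helper names below are made up for the demonstration, and JSONL is just one possible target format.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_demo_shard(path: Path, records: list) -> None:
    """Write made-up records as NNNNNN.json members of a tar shard."""
    with tarfile.open(path, mode="w") as tar:
        for i, record in enumerate(records):
            payload = json.dumps(record).encode("utf-8")
            info = tarfile.TarInfo(name=f"{i:06d}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def dump_shards_to_jsonl(shard_dir: Path, out_path: Path) -> int:
    """Stream every .json member of every .tar in shard_dir into one JSONL file."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for shard in sorted(shard_dir.glob("*.tar")):
            with tarfile.open(shard, mode="r") as tar:
                for member in tar:
                    if member.isfile() and member.name.endswith(".json"):
                        record = json.load(tar.extractfile(member))
                        out.write(json.dumps(record, ensure_ascii=False) + "\n")
                        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    write_demo_shard(tmp_dir / "000000.tar", [{"s1_text": "first"}, {"s1_text": "second"}])
    write_demo_shard(tmp_dir / "000001.tar", [{"s1_text": "third"}])
    n_written = dump_shards_to_jsonl(tmp_dir, tmp_dir / "all.jsonl")
    lines = (tmp_dir / "all.jsonl").read_text(encoding="utf-8").splitlines()
    print(n_written, lines[0])
```

    Because members are read one at a time, memory use stays bounded by the largest single record rather than by the size of a shard.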

    Below we show how the data is organized with two examples.

    Text-only

    {'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
     's1_all_links': {
         'Sils,_Girona': [[0, 4]],
         'municipality': [[10, 22]],
         'Comarques_of_Catalonia': [[30, 37]],
         'Selva': [[41, 46]],
         'Catalonia': [[51, 60]]
     },  # list of entities and their mentions in the sentence (start, end location)
     'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
         {
             'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
             's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
             's2s': [  # list of other sentences that contain the common entity pair, or evidence
                 {
                     'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                     'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                     's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
                     'pair_locs': [  # mentions of the entity pair in the evidence
                         [[19, 27]],  # mentions of entity 1
                         [[0, 5], [288, 293]]  # mentions of entity 2
                     ],
                     'all_links': {
                         'Selva': [[0, 5], [288, 293]],
                         'Comarques_of_Catalonia': [[19, 27]],
                         'Catalonia': [[40, 49]]
                     }
                 },
                 ...]  # there are multiple evidence sentences
         },
         ...]  # there are multiple entity pairs in the query
    }
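    The (start, end) locations in the text-only record are character offsets into the sentence, with the end offset exclusive, so each mention can be recovered by slicing. The sketch below checks this on the query sentence and link table from the example record; `mention_texts` is a hypothetical helper name, not part of the dataset's tooling.

```python
# Query sentence and link table copied from the text-only example record.
sample = {
    "s1_text": "Sils is a municipality in the comarca of Selva, in Catalonia, Spain.",
    "s1_all_links": {
        "Sils,_Girona": [[0, 4]],
        "municipality": [[10, 22]],
        "Comarques_of_Catalonia": [[30, 37]],
        "Selva": [[41, 46]],
        "Catalonia": [[51, 60]],
    },
}


def mention_texts(text: str, links: dict) -> dict:
    """Map each linked entity to the surface strings found at its (start, end) spans."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


mentions = mention_texts(sample["s1_text"], sample["s1_all_links"])
for entity, surfaces in mentions.items():
    print(f"{entity!r} -> {surfaces}")
# 'Sils,_Girona' -> ['Sils']
# 'municipality' -> ['municipality']
# 'Comarques_of_Catalonia' -> ['comarca']
# 'Selva' -> ['Selva']
# 'Catalonia' -> ['Catalonia']
```

    Note that the entity key is a Wikipedia-style title while the sliced text is the surface form, which is why 'Comarques_of_Catalonia' maps to 'comarca'.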

    Hybrid

    {'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
     's1_all_links': {...},  # same as text-only
     'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
     'table_pairs': [
         'tid': 'Major_League_Baseball-1',
         'text':[
             ['World Series Records', 'World Series Records', ...],
             ['Team', 'Number of Series won', ...],
             ['St. Louis Cardinals (NL)', '11', ...],
             ...]  # table content, list of rows
         'index':[
             [[0, 0], [0, 1], ...],
             [[1, 0], [1, 1], ...],
             ...]  # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
         'value_ranks':[
             [0, 0, ...],
             [0, 0, ...],
             [0, 10, ...],
             ...]  # if the cell contain numeric value/date, this is its rank ordered from small to large, follow TAPAS
         'value_inv_ranks': [],  # inverse rank
         'all_links':{
             'St._Louis_Cardinals': {
                 '2': [
                     [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
                 ]  # list of mentions in the second row, the key is row_id
             },
             'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
         }
         'name': '',  # table name, if exists
         'pairs': {
             'pair': ['American_League', 'National_League'],
             's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
             'table_pair_locs': {
                 '17': [  # mention of entity pair in row 17
                     [
                         [[17, 0], [3, 18]],
                         [[17, 1], [3, 18]],
                         [[17, 2], [3, 18]],
                         [[17, 3], [3, 18]]
                     ],  # mention of the first entity
                     [
                         [[17, 0], [21, 36]],
                         [[17, 1], [21, 36]],
                     ]  # mention of the second entity
                 ]
             }
         }
     ]
    }

  10. SpeechCommands in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). SpeechCommands in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722647
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the speechcommands dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf wds-audio-train_0000000.tar|head        
    -r--r--r-- bigdata/bigdata 19 2025-01-10 08:58 right_7e783e3f_nohash_1.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 right_7e783e3f_nohash_1.wav
    -r--r--r-- bigdata/bigdata  16 2025-01-10 08:58 up_c79159aa_nohash_3.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 up_c79159aa_nohash_3.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_2b42e7a2_nohash_3.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_2b42e7a2_nohash_3.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_c79159aa_nohash_4.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_c79159aa_nohash_4.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_708b8d51_nohash_0.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_708b8d51_nohash_0.wav
    $ cat right_7e783e3f_nohash_1.json 
    {"labels": "right"}
  11. hi-stt-preprocessed-webdataset

    • huggingface.co
    Updated Jan 17, 2025
    Cite
    Collabora (2025). hi-stt-preprocessed-webdataset [Dataset]. https://huggingface.co/datasets/collabora/hi-stt-preprocessed-webdataset
    Explore at:
    Dataset updated
    Jan 17, 2025
    Dataset authored and provided by
    Collabora (http://collabora.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    collabora/hi-stt-preprocessed-webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. ravdess in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). ravdess in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722524
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the ravdess dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf ravdess_fold_0_0000000.tar |head
    -r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-01-01-11.json
    -r--r--r-- bigdata/bigdata 341912 2025-01-10 15:44 03-01-08-01-01-01-11.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-07-02-01-02-05.json
    -r--r--r-- bigdata/bigdata 424184 2025-01-10 15:44 03-01-07-02-01-02-05.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-06-01-01-02-10.json
    -r--r--r-- bigdata/bigdata 377100 2025-01-10 15:44 03-01-06-01-01-02-10.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-01-16.json
    -r--r--r-- bigdata/bigdata 396324 2025-01-10 15:44 03-01-08-01-02-01-16.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-02-22.json
    -r--r--r-- bigdata/bigdata 404388 2025-01-10 15:44 03-01-08-01-02-02-22.wav

    $ cat 03-01-08-01-01-01-11.json
    {"emotion": "surprised"}
  13. PhD-webdataset

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    AIMClab (2025). PhD-webdataset [Dataset]. https://huggingface.co/datasets/AIMClab-RUC/PhD-webdataset
    Explore at:
    Dataset updated
    Jun 2, 2025
    Dataset authored and provided by
    AIMClab
    Description

    PhD Webdataset

    This repository contains the packaged version of PhD. For a detailed introduction to PhD, please visit the official website.

      Overview
    

    The PhD Webdataset is designed to facilitate easy access to and use of the PhD dataset. It includes various fields under the 'json' key. The data in this repo is identical to that in PhD.

      Installation
    

    Ensure you have Hugging Face's datasets library installed. You can install it via pip: pip install datasets… See the full description on the dataset page: https://huggingface.co/datasets/AIMClab-RUC/PhD-webdataset.

  14. FSD50k in WebDataset Format

    • zenodo.org
    tar
    Updated Feb 14, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). FSD50k in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14868441
    Explore at:
    tar (available download format)
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the FSD50K dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf fsdk50_eval_0000000.tar |head
    -r--r--r-- bigdata/bigdata 40 2025-01-12 13:02 45604.json
    -r--r--r-- bigdata/bigdata 43066 2025-01-12 13:02 45604.wav
    -r--r--r-- bigdata/bigdata  46 2025-01-12 13:02 213293.json
    -r--r--r-- bigdata/bigdata 1372242 2025-01-12 13:02 213293.wav
    -r--r--r-- bigdata/bigdata   82 2025-01-12 13:02 348174.json
    -r--r--r-- bigdata/bigdata 804280 2025-01-12 13:02 348174.wav
    -r--r--r-- bigdata/bigdata   71 2025-01-12 13:02 417736.json
    -r--r--r-- bigdata/bigdata 2238542 2025-01-12 13:02 417736.wav
    -r--r--r-- bigdata/bigdata   43 2025-01-12 13:02 235555.json
    -r--r--r-- bigdata/bigdata 542508 2025-01-12 13:02 235555.wav
     $ tar -xOf fsdk50_eval_0000000.tar 45604.json
    {"soundevent": "Yell;Shout;Human_voice"}
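    In the sample above, the "soundevent" field packs several labels into one semicolon-separated string. A minimal sketch of turning such strings into label lists and multi-hot vectors follows. The second record is made up for illustration, the vocabulary is built only from the records shown (not from the real FSD50K label set), and the helper names are hypothetical.

```python
import json


def parse_labels(raw_json: str, field: str = "soundevent") -> list:
    """Split the semicolon-separated label string into a list of labels."""
    return [label for label in json.loads(raw_json)[field].split(";") if label]


def multi_hot(labels: list, vocabulary: list) -> list:
    """Encode labels as a 0/1 vector aligned with the vocabulary order."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocabulary]


records = [
    '{"soundevent": "Yell;Shout;Human_voice"}',  # copied from the listing above
    '{"soundevent": "Human_voice;Speech"}',      # made up for illustration
]

label_lists = [parse_labels(record) for record in records]
# Toy vocabulary derived from these two records only.
vocabulary = sorted({label for labels in label_lists for label in labels})
vectors = [multi_hot(labels, vocabulary) for labels in label_lists]

print(vocabulary)  # ['Human_voice', 'Shout', 'Speech', 'Yell']
print(vectors)     # [[1, 1, 0, 1], [1, 0, 1, 0]]
```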
    
    
  15. CBP eRulings Web dataset

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Oct 19, 2022
    Cite
    BEMSD (2022). CBP eRulings Web dataset [Dataset]. https://catalog.data.gov/dataset/cbp-erulings-web-dataset
    Explore at:
    Dataset updated
    Oct 19, 2022
    Dataset provided by
    BEMSD
    Description

    Contains the CBP eRulings data that is available to the public via HTML Web Page

  16. Netflix

    • ieee-dataport.org
    Updated Oct 1, 2021
    Cite
    Danil Shamsimukhametov (2021). Netflix [Dataset]. https://ieee-dataport.org/documents/youtube-netflix-web-dataset-encrypted-traffic-classification
    Explore at:
    Dataset updated
    Oct 1, 2021
    Authors
    Danil Shamsimukhametov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    YouTube flows

  17. Noise of Web Dataset

    • paperswithcode.com
    Updated Aug 8, 2024
    + more versions
    Cite
    (2024). Noise of Web Dataset [Dataset]. https://paperswithcode.com/dataset/noise-of-web-now
    Explore at:
    Dataset updated
    Aug 8, 2024
    Description

    Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark for robust image-text matching/retrieval models. It contains 100K image-text pairs consisting of website pages and multilingual website meta-descriptions (98,000 pairs for training, 1,000 for validation, and 1,000 for testing). NoW has two main characteristics: it has no human annotations, and its noisy pairs are naturally captured. The source image data of NoW is obtained by taking screenshots when accessing web pages on a mobile user interface (MUI) at 720 × 1280 resolution, and the meta-description field in the HTML source code is parsed as the caption. In NCR (the predecessor of NCL), each image in all datasets was preprocessed using the Faster-RCNN detector provided by the Bottom-up Attention Model to generate 36 region proposals, and each proposal was encoded as a 2048-dimensional feature. Thus, following NCR, we release the features instead of raw images for fair comparison. However, we cannot simply use detection methods like Faster-RCNN to extract image features, since it is trained on real-world animals and objects in MS-COCO. To tackle this, we adopt APT as the detection model, since it is trained on MUI data. Then, we capture the 768-dimensional features of the top 36 objects for each image. Because the data collection process is automated and not human-curated, the noise in NoW is highly authentic and intrinsic. The estimated noise ratio of this dataset is nearly 70%.

  18. voxceleb1 in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 24, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). voxceleb1 in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14725363
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the voxceleb1 dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf voxceleb1_train_0000000.tar|head            
    -r--r--r-- bigdata/bigdata 24 2025-01-10 12:16 ig9vI1du458_00005.json
    -r--r--r-- bigdata/bigdata 143406 2025-01-10 12:16 ig9vI1du458_00005.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 y0UFutwJ-ow_00034.json
    -r--r--r-- bigdata/bigdata 147246 2025-01-10 12:16 y0UFutwJ-ow_00034.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 FdYLFLTY-5Q_00050.json
    -r--r--r-- bigdata/bigdata 138286 2025-01-10 12:16 FdYLFLTY-5Q_00050.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 zELwAz2W6hM_00010.json
    -r--r--r-- bigdata/bigdata 165166 2025-01-10 12:16 zELwAz2W6hM_00010.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 Qz-GecPwRmI_00008.json
    -r--r--r-- bigdata/bigdata 140846 2025-01-10 12:16 Qz-GecPwRmI_00008.wav
    $ cat ig9vI1du458_00005.json 
    {"speakerid": "id10047"}

  19. openhermes-2.5-webdataset

    • huggingface.co
    Updated Feb 17, 2024
    Cite
    Marianna Nezhurina (2024). openhermes-2.5-webdataset [Dataset]. https://huggingface.co/datasets/marianna13/openhermes-2.5-webdataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 17, 2024
    Authors
    Marianna Nezhurina
    Description

    marianna13/openhermes-2.5-webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. gtzan_genre in WebDataset Format

    • zenodo.org
    tar
    Updated Jan 23, 2025
    Cite
    Niu Yadong; Niu Yadong (2025). gtzan_genre in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722472
    Explore at:
    tar (available download format)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    xiaomi
    Authors
    Niu Yadong; Niu Yadong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the gtzan_genre dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf gtzan_fold_0_0000000.tar |head 
    -r--r--r-- bigdata/bigdata 17 2025-01-10 15:06 jazz_00007.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 jazz_00007.wav
    -r--r--r-- bigdata/bigdata   17 2025-01-10 15:06 jazz_00006.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 jazz_00006.wav
    -r--r--r-- bigdata/bigdata   18 2025-01-10 15:06 blues_00006.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 blues_00006.wav
    -r--r--r-- bigdata/bigdata   19 2025-01-10 15:06 reggae_00005.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 reggae_00005.wav
    -r--r--r-- bigdata/bigdata   19 2025-01-10 15:06 reggae_00002.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 reggae_00002.wav
    

    $ cat jazz_00007.json 
    {"genre": "jazz"}
Cite
wangmaohua (2023). Webdataset [Dataset]. https://huggingface.co/datasets/wangmaohua/Webdataset

Webdataset

wangmaohua/Webdataset

Explore at:
109 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Aug 30, 2023
Authors
wangmaohua
Description

wangmaohua/Webdataset dataset hosted on Hugging Face and contributed by the HF Datasets community
