15 datasets found
  1. voxceleb1 in WebDataset Format

    • zenodo.org
    Updated Jan 23, 2025
    Cite
    Niu Yadong (2025). voxceleb1 in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722737
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is the voxceleb1 dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf voxceleb1_train_0000000.tar|head            
    -r--r--r-- bigdata/bigdata 24 2025-01-10 12:16 ig9vI1du458_00005.json
    -r--r--r-- bigdata/bigdata 143406 2025-01-10 12:16 ig9vI1du458_00005.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 y0UFutwJ-ow_00034.json
    -r--r--r-- bigdata/bigdata 147246 2025-01-10 12:16 y0UFutwJ-ow_00034.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 FdYLFLTY-5Q_00050.json
    -r--r--r-- bigdata/bigdata 138286 2025-01-10 12:16 FdYLFLTY-5Q_00050.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 zELwAz2W6hM_00010.json
    -r--r--r-- bigdata/bigdata 165166 2025-01-10 12:16 zELwAz2W6hM_00010.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 12:16 Qz-GecPwRmI_00008.json
    -r--r--r-- bigdata/bigdata 140846 2025-01-10 12:16 Qz-GecPwRmI_00008.wav
    $ cat ig9vI1du458_00005.json
    {"speakerid": "id10047"}
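    Each example is thus just two tar members sharing a basename. The webdataset library is the usual way to consume such shards, but the convention itself needs nothing beyond the standard library. The sketch below is purely illustrative: it builds a tiny in-memory shard (with placeholder bytes instead of real audio) and reads it back by grouping members on their basename.

```python
import io
import json
import tarfile

def add_member(tar, name, data):
    """Append one file to a tar archive from in-memory bytes."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Build a tiny in-memory shard with one (wav, json) pair, mirroring the
# layout shown above. The wav bytes here are placeholders, not real audio.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_member(tar, "ig9vI1du458_00005.json",
               json.dumps({"speakerid": "id10047"}).encode())
    add_member(tar, "ig9vI1du458_00005.wav", b"\x00" * 16)

# Read the shard back, grouping members that share a basename into one example.
buf.seek(0)
examples = {}
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        base, ext = member.name.rsplit(".", 1)
        examples.setdefault(base, {})[ext] = tar.extractfile(member).read()

meta = json.loads(examples["ig9vI1du458_00005"]["json"])
print(meta["speakerid"])  # id10047
```

    In practice you would point webdataset (or any tar reader) at the released shards instead of building one in memory.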
  2. SpeechCommands in WebDataset Format

    • zenodo.org
    Updated Jan 23, 2025
    Available download formats: tar
    Cite
    Niu Yadong (2025). SpeechCommands in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722647
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This dataset is the speechcommands dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf wds-audio-train_0000000.tar|head        
    -r--r--r-- bigdata/bigdata 19 2025-01-10 08:58 right_7e783e3f_nohash_1.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 right_7e783e3f_nohash_1.wav
    -r--r--r-- bigdata/bigdata  16 2025-01-10 08:58 up_c79159aa_nohash_3.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 up_c79159aa_nohash_3.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_2b42e7a2_nohash_3.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_2b42e7a2_nohash_3.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_c79159aa_nohash_4.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_c79159aa_nohash_4.wav
    -r--r--r-- bigdata/bigdata  18 2025-01-10 08:58 left_708b8d51_nohash_0.json
    -r--r--r-- bigdata/bigdata 32044 2025-01-10 08:58 left_708b8d51_nohash_0.wav
    $ cat right_7e783e3f_nohash_1.json 
    {"labels": "right"}
  3. Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Cong Yu (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Dataset provided by
    Alyssa Lees
    Yu Su
    Xiang Deng
    Huan Sun
    You Wu
    Cong Yu
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)

    table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data:

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000..000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)  # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)  # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them in your preferred data format.

    Below we show how the data is organized with two examples.

    Text-only

    {
        's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
        's1_all_links': {
            'Sils,_Girona': [[0, 4]],
            'municipality': [[10, 22]],
            'Comarques_of_Catalonia': [[30, 37]],
            'Selva': [[41, 46]],
            'Catalonia': [[51, 60]]
        },  # list of entities and their mentions in the sentence (start, end location)
        'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pair
            {
                'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
                's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
                's2s': [  # list of other sentences that contain the common entity pair, i.e. evidence
                    {
                        'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                        'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                        's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
                        'pair_locs': [  # mentions of the entity pair in the evidence
                            [[19, 27]],  # mentions of entity 1
                            [[0, 5], [288, 293]]  # mentions of entity 2
                        ],
                        'all_links': {
                            'Selva': [[0, 5], [288, 293]],
                            'Comarques_of_Catalonia': [[19, 27]],
                            'Catalonia': [[40, 49]]
                        }
                    },
                    ...  # there are multiple evidence sentences
                ]
            },
            ...  # there are multiple entity pairs in the query
        ]
    }

    Hybrid

    {
        's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
        's1_all_links': {...},  # same as text-only
        'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
        'table_pairs': [
            {
                'tid': 'Major_League_Baseball-1',
                'text': [
                    ['World Series Records', 'World Series Records', ...],
                    ['Team', 'Number of Series won', ...],
                    ['St. Louis Cardinals (NL)', '11', ...],
                    ...
                ],  # table content, list of rows
                'index': [
                    [[0, 0], [0, 1], ...],
                    [[1, 0], [1, 1], ...],
                    ...
                ],  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
                'value_ranks': [
                    [0, 0, ...],
                    [0, 0, ...],
                    [0, 10, ...],
                    ...
                ],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
                'value_inv_ranks': [],  # inverse rank
                'all_links': {
                    'St._Louis_Cardinals': {
                        '2': [
                            [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
                        ]  # list of mentions in the second row; the key is row_id
                    },
                    'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
                },
                'name': '',  # table name, if it exists
                'pairs': {
                    'pair': ['American_League', 'National_League'],
                    's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mentions in the query
                    'table_pair_locs': {
                        '17': [  # mentions of the entity pair in row 17
                            [
                                [[17, 0], [3, 18]],
                                [[17, 1], [3, 18]],
                                [[17, 2], [3, 18]],
                                [[17, 3], [3, 18]]
                            ],  # mentions of the first entity
                            [
                                [[17, 0], [21, 36]],
                                [[17, 1], [21, 36]],
                            ]  # mentions of the second entity
                        ]
                    }
                }
            }
        ]
    }

  4. ravdess in WebDataset Format

    • zenodo.org
    Updated Jan 23, 2025
    Available download formats: tar
    Cite
    Niu Yadong (2025). ravdess in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722524
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This dataset is the ravdess dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf ravdess_fold_0_0000000.tar |head
    -r--r--r-- bigdata/bigdata 24 2025-01-10 15:44 03-01-08-01-01-01-11.json
    -r--r--r-- bigdata/bigdata 341912 2025-01-10 15:44 03-01-08-01-01-01-11.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-07-02-01-02-05.json
    -r--r--r-- bigdata/bigdata 424184 2025-01-10 15:44 03-01-07-02-01-02-05.wav
    -r--r--r-- bigdata/bigdata   22 2025-01-10 15:44 03-01-06-01-01-02-10.json
    -r--r--r-- bigdata/bigdata 377100 2025-01-10 15:44 03-01-06-01-01-02-10.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-01-16.json
    -r--r--r-- bigdata/bigdata 396324 2025-01-10 15:44 03-01-08-01-02-01-16.wav
    -r--r--r-- bigdata/bigdata   24 2025-01-10 15:44 03-01-08-01-02-02-22.json
    -r--r--r-- bigdata/bigdata 404388 2025-01-10 15:44 03-01-08-01-02-02-22.wav

    $ cat 03-01-08-01-01-01-11.json
    {"emotion": "surprised"}
  5. maestro in WebDataset Format Creators

    • zenodo.org
    Updated Feb 12, 2025
    Available download formats: tar
    Cite
    Niu Yadong (2025). maestro in WebDataset Format Creators [Dataset]. http://doi.org/10.5281/zenodo.14858022
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This dataset is the maestro dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf maestro_train_000000.tar|head       
    -r--r--r-- bigdata/bigdata 327458 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.json
    -r--r--r-- bigdata/bigdata 120375940 2025-01-23 13:26 MIDI-Unprocessed_XP_15_R2_2004_01_ORIG_MID--AUDIO_15_R2_2004_02_Track02_wav.wav
    -r--r--r-- bigdata/bigdata  625054 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.json
    -r--r--r-- bigdata/bigdata 137713368 2025-01-23 13:26 MIDI-Unprocessed_13_R1_2009_01-03_ORIG_MID--AUDIO_13_R1_2009_13_R1_2009_03_WAV.wav
    -r--r--r-- bigdata/bigdata  356393 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.json
    -r--r--r-- bigdata/bigdata 132159804 2025-01-23 13:26 MIDI-Unprocessed_XP_17_R2_2004_01_ORIG_MID--AUDIO_17_R2_2004_01_Track01_wav.wav
    -r--r--r-- bigdata/bigdata  255210 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    -r--r--r-- bigdata/bigdata 58523088 2025-01-23 13:26 ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.wav
    -r--r--r-- bigdata/bigdata  1190145 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.json
    -r--r--r-- bigdata/bigdata 390151460 2025-01-23 13:26 MIDI-UNPROCESSED_04-07-08-10-12-15-17_R2_2014_MID--AUDIO_17_R2_2014_wav.wav
    

    $ cat ORIG-MIDI_01_7_6_13_Group_MID--AUDIO_01_R1_2013_wav--2.json
    [
      ...
      {"start": 323.546875, "end": 323.5859375, "note": 51}, 
      {"start": 323.703125, "end": 323.74869791666663, "note": 51}, 
      {"start": 323.8450520833333, "end": 323.8919270833333, "note": 51}, 
      {"start": 324.00390625, "end": 324.0442708333333, "note": 51},
      ...
    ]
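    Unlike the classification datasets above, the maestro JSON holds a list of note events rather than a single label. A minimal parsing sketch over two events copied from the excerpt above (times in seconds, "note" a MIDI pitch number):

```python
import json

# Two note events copied from the excerpt above.
events = json.loads("""
[
  {"start": 323.546875, "end": 323.5859375, "note": 51},
  {"start": 323.703125, "end": 323.74869791666663, "note": 51}
]
""")

# Per-note durations and the set of pitches that occur.
durations = [e["end"] - e["start"] for e in events]
pitches = sorted({e["note"] for e in events})
print(pitches)  # [51]
```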
  6. gtzan_genre in WebDataset Format

    • zenodo.org
    Updated Jan 23, 2025
    Available download formats: tar
    Cite
    Niu Yadong (2025). gtzan_genre in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14722472
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This dataset is the gtzan_genre dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf gtzan_fold_0_0000000.tar |head 
    -r--r--r-- bigdata/bigdata 17 2025-01-10 15:06 jazz_00007.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 jazz_00007.wav
    -r--r--r-- bigdata/bigdata   17 2025-01-10 15:06 jazz_00006.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 jazz_00006.wav
    -r--r--r-- bigdata/bigdata   18 2025-01-10 15:06 blues_00006.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 blues_00006.wav
    -r--r--r-- bigdata/bigdata   19 2025-01-10 15:06 reggae_00005.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 reggae_00005.wav
    -r--r--r-- bigdata/bigdata   19 2025-01-10 15:06 reggae_00002.json
    -r--r--r-- bigdata/bigdata 1323632 2025-01-10 15:06 reggae_00002.wav
    

    $ cat jazz_00007.json 
    {"genre": "jazz"}
  7. Open-Qwen2VL-Data

    • huggingface.co
    Cite
    Weizhi Wang, Open-Qwen2VL-Data [Dataset]. https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data
    Authors
    Weizhi Wang
    License
    MIT License https://opensource.org/licenses/MIT

    Description

    Introduction

    This repository contains the data for Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources. Project page: https://victorwz.github.io/Open-Qwen2VL Code: https://github.com/Victorwz/Open-Qwen2VL

      Dataset
    

    ccs_ebdataset: CC3M-CC12M-SBU filtered by CLIP; we directly download the webdataset based on the released curated subset of BLIP-1.

    datacomp_medium_dfn_webdataset: DataComp-Medium-128M filtered by DFN; we just… See the full description on the dataset page: https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data.

  8. Covid-on-the-Web dataset

    • data.niaid.nih.gov
    Updated Feb 28, 2022
    Cite
    Wimmics Research Team (2022). Covid-on-the-Web dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3833752
    Dataset authored and provided by
    Wimmics Research Team
    Description

    This RDF dataset provides two main knowledge graphs produced by processing the scholarly articles of the COVID-19 Open Research Dataset (CORD-19), a resource of articles about COVID-19 and the coronavirus family of viruses.

    The CORD-19 Named Entities Knowledge Graph describes named entities identified and disambiguated by NCBO BioPortal annotator, Entity-fishing and DBpedia Spotlight. The CORD-19 Argumentative Knowledge Graph describes argumentative components and PICO elements extracted from the articles by the Argumentative Clinical Trial Analysis platform (ACTA).

    Homepage: https://github.com/Wimmics/CovidOnTheWeb

    License: see the LICENCE file in the archive.

  9. speechocean762 in WebDataset Format

    • zenodo.org
    Updated Jan 23, 2025
    Available download formats: tar
    Cite
    Niu Yadong (2025). speechocean762 in WebDataset Format [Dataset]. http://doi.org/10.5281/zenodo.14725291
    Dataset provided by
    Xiaomi
    Authors
    Niu Yadong
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    This dataset is the speechocean762 dataset, formatted in the WebDataset format. WebDataset files are essentially tar archives, where each example in the dataset is represented by a pair of files: a WAV audio file and a corresponding JSON metadata file. The JSON file contains the class label and other relevant information for that particular audio sample.

    $ tar tvf speechocean762_train.tar |head          
    -r--r--r-- bigdata/bigdata 607 2025-01-13 14:49 000010011.json
    -r--r--r-- bigdata/bigdata 82604 2025-01-13 14:49 000010011.wav
    -r--r--r-- bigdata/bigdata  646 2025-01-13 14:49 000010035.json
    -r--r--r-- bigdata/bigdata 109804 2025-01-13 14:49 000010035.wav
    -r--r--r-- bigdata/bigdata  630 2025-01-13 14:49 000010053.json
    -r--r--r-- bigdata/bigdata 107244 2025-01-13 14:49 000010053.wav
    -r--r--r-- bigdata/bigdata  561 2025-01-13 14:49 000010063.json
    -r--r--r-- bigdata/bigdata 106252 2025-01-13 14:49 000010063.wav
    -r--r--r-- bigdata/bigdata  671 2025-01-13 14:49 000010069.json
    -r--r--r-- bigdata/bigdata 96364 2025-01-13 14:49 000010069.wav
    $ cat 000010011.json
    {"id": 10011, "accuracy": 8, "completeness": 10.0, "fluency": 9, "prosodic": 9, "words": [{"accuracy": 10, "stress": 10, "phones": ["W", "IY0"], "total": 10, "text": "WE", "phones-accuracy": [2.0, 2.0]}, {"accuracy": 10, "stress": 10, "phones": ["K", "AO0", "L"], "total": 10, "text": "CALL", "phones-accuracy": [2.0, 1.8, 1.8]}, {"accuracy": 10, "stress": 10, "phones": ["IH0", "T"], "total": 10, "text": "IT", "phones-accuracy": [2.0, 2.0]}, {"accuracy": 6, "stress": 10, "phones": ["B", "EH0", "R"], "total": 6, "text": "BEAR", "phones-accuracy": [2.0, 1.0, 1.0]}], "total": 8, "text": "WE CALL IT BEAR"}
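    The per-word scores in this JSON make simple filtering easy, for example pulling out the words judged least accurate. A minimal sketch over an abbreviated copy of the sample above (the threshold of 8 is arbitrary):

```python
import json

# The sample record shown above, abbreviated to the fields used here.
record = json.loads("""
{"accuracy": 8, "text": "WE CALL IT BEAR",
 "words": [
   {"text": "WE",   "accuracy": 10, "phones": ["W", "IY0"]},
   {"text": "CALL", "accuracy": 10, "phones": ["K", "AO0", "L"]},
   {"text": "IT",   "accuracy": 10, "phones": ["IH0", "T"]},
   {"text": "BEAR", "accuracy": 6,  "phones": ["B", "EH0", "R"]}
 ]}
""")

# Words scored below an (arbitrary) accuracy threshold.
low_scoring = [w["text"] for w in record["words"] if w["accuracy"] < 8]
print(low_scoring)  # ['BEAR']
```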
  10. GenRef-wds

    • huggingface.co
    Updated Apr 23, 2025
    Cite
    Diffusion CoT (2025). GenRef-wds [Dataset]. https://huggingface.co/datasets/diffusion-cot/GenRef-wds
    Dataset authored and provided by
    Diffusion CoT
    License
    MIT License https://opensource.org/licenses/MIT

    Description

    GenRef-1M

    We provide 1M high-quality triplets of the form (flawed image, high-quality image, reflection) collected across multiple domains using our scalable pipeline from [1]. We used this dataset to train our reflection tuning model. To know the details of the dataset creation pipeline, please refer to Section 3.2 of [1]. Project Page: https://diffusion-cot.github.io/reflection2perfection

      Dataset loading
    

    We provide the dataset in the webdataset format for fast… See the full description on the dataset page: https://huggingface.co/datasets/diffusion-cot/GenRef-wds.

  11. ffhq-1024-wds

    • huggingface.co
    Updated Jun 23, 2025
    Cite
    Thien Tran (2025). ffhq-1024-wds [Dataset]. https://huggingface.co/datasets/gaunernst/ffhq-1024-wds
    Authors
    Thien Tran
    License
    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/

    Description

    Flickr-Faces-HQ Dataset (FFHQ) - 1024x1024

    This is a reupload of FFHQ-1024. Refer to the original dataset repo for more information: https://github.com/NVlabs/ffhq-dataset. Specifically, this is the images1024x1024 set: faces are aligned and cropped to 1024x1024. The original PNG files were transcoded to WebP losslessly to save space and packed into the WebDataset format for ease of streaming. The original filenames are kept (with a different file extension) so that you can match against… See the full description on the dataset page: https://huggingface.co/datasets/gaunernst/ffhq-1024-wds.

  12. MozzaVID_Small

    • huggingface.co
    Updated Apr 11, 2025
    + more versions
    Cite
    Technical University of Denmark (2025). MozzaVID_Small [Dataset]. https://huggingface.co/datasets/dtudk/MozzaVID_Small
    Dataset authored and provided by
    Technical University of Denmark
    License
    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/

    Description

    MozzaVID dataset - Small split

    A dataset of synchrotron X-ray tomography scans of mozzarella microstructure, intended for volumetric model benchmarking and food structure analysis.

      [Paper] [Project website]
    

    This version is prepared in the WebDataset format, optimized for streaming. Check our GitHub for details on how to use it. To download raw data instead, visit: [LINK].

      Dataset splits
    

    This is the Small split of the dataset, containing 591 volumes. We also… See the full description on the dataset page: https://huggingface.co/datasets/dtudk/MozzaVID_Small.

  13. The Klarna Product-Page Dataset

    • zenodo.org
    • researchdata.se
    • +2 more
    Updated Nov 7, 2024
    Available download formats: application/gzip, xz
    Cite
    Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren (2024). The Klarna Product-Page Dataset [Dataset]. http://doi.org/10.5281/zenodo.12605480
    Dataset provided by
    Royal Institute of Technology (KTH)
    Authors
    Alexandra Hotti; Riccardo Sven Risuleo; Stefan Magureanu; Aref Moradi; Jens Lagergren
    License
    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/

    Description


    The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.
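    Since the labels live in the HTML as a plain attribute, they can be collected with any HTML parser. A minimal sketch using Python's standard-library parser (the sample markup below is invented for illustration, not taken from the dataset):

```python
from html.parser import HTMLParser

class LabelCollector(HTMLParser):
    """Collects (tag, label) pairs for elements carrying klarna-ai-label."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "klarna-ai-label":
                self.labels.append((tag, value))

# Invented snippet in the style of an annotated product page.
sample = (
    '<span klarna-ai-label="Price">$9.99</span>'
    '<img klarna-ai-label="Main picture" src="shoe.png">'
    '<button klarna-ai-label="Add to cart">Buy now</button>'
)
parser = LabelCollector()
parser.feed(sample)
print(parser.labels)
# [('span', 'Price'), ('img', 'Main picture'), ('button', 'Add to cart')]
```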

    The snapshots are available in 3 formats: as MHTML files (~24GB), as WebTraversalLibrary (WTL) snapshots (~7.4GB), and as screenshots (~8.9GB). The MHTML format is the least lossy: a browser can render these pages, though any JavaScript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a Chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MHTML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes, etc.). The folder structure of the screenshot dataset is identical to that of the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.

    Corresponding Publication

    For more information about the contents of the datasets (statistics etc.), please refer to the accompanying TMLR paper.

    GitHub Repository

    The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.

    Citing

    If you found this dataset useful in your research, please cite the paper as follows:

    @article{hotti2024the,
    title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
    author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
    journal={Transactions on Machine Learning Research},
    issn={2835-8856},
    year={2024},
    url={https://openreview.net/forum?id=zz6FesdDbB},
    note={}
    }
  14. SCube-Data

    • huggingface.co
    Updated May 29, 2025
    Cite
    Xuanchi Ren (2025). SCube-Data [Dataset]. https://huggingface.co/datasets/xrenaa/SCube-Data
    Authors
    Xuanchi Ren
    Description

    SCube: Instant Large-Scale Scene Reconstruction using VoxSplats

    Xuanchi Ren, Yifan Lu, Hanxue Liang, Zhangjie Wu, Huan Ling, Mike Chen, Sanja Fidler, Francis Williams, Jiahui Huang [Project Page]

    We provide the webdataset-format files of ground-truth voxels here. Please refer to GitHub for the usage.

      Citation
    

    @inproceedings{ ren2024scube, title={SCube: Instant Large-Scale Scene Reconstruction using VoxSplats}, author={Ren, Xuanchi and Lu, Yifan and Liang… See the full description on the dataset page: https://huggingface.co/datasets/xrenaa/SCube-Data.

  15. Mind2Web

    • huggingface.co
    Updated Jun 12, 2023
    + more versions
    Cite
    OSU NLP Group (2023). Mind2Web [Dataset]. https://huggingface.co/datasets/osunlp/Mind2Web
    Dataset authored and provided by
    OSU NLP Group
    License
    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/

    Description

    Dataset Card for Dataset Name

      Dataset Summary
    

    Mind2Web is a dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action… See the full description on the dataset page: https://huggingface.co/datasets/osunlp/Mind2Web.

