9 datasets found

Sentence/Table Pair Data from Wikipedia for Pre-training with...
data.niaid.nih.gov
zenodo.org
Updated Oct 29, 2021
Cite
Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
Dataset updated
Oct 29, 2021
Dataset provided by
Google Research
The Ohio State University
Authors
Xiang Deng; Yu Su; Alyssa Lees; You Wu; Cong Yu; Huan Sun
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

There are two files:

sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contains only sentences as evidence (text-only)

table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table (hybrid)

The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

Below is a sample code snippet to load the data:

import webdataset as wds

# path to the uncompressed files, should be a directory with a set of tar files
url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
dataset = (
    wds.Dataset(url)
    .shuffle(1000)  # cache 1000 samples and shuffle
    .decode()
    .to_tuple("json")
    .batched(20)  # group every 20 examples into a batch
)

Please see the WebDataset documentation for more details on how to use it as a data loader for PyTorch.

You can also iterate through all examples and dump them with your preferred data format
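As a rough illustration of dumping examples to another format, the sketch below writes an iterable of sample dictionaries to a JSON Lines file using only the standard library. The helper name dump_jsonl and the toy samples are hypothetical and not part of the ReasonBERT code; it assumes each example is a JSON-serializable dict like the ones shown in this description.

```python
import json
import os
import tempfile


def dump_jsonl(samples, out_path):
    """Write one JSON object per line and return how many were written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
            count += 1
    return count


# Toy stand-ins for decoded examples (hypothetical, heavily shortened).
toy_samples = [
    {"s1_text": "Sils is a municipality.", "pairs": []},
    {"s1_text": "Selva is a comarca.", "pairs": []},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "examples.jsonl")
    written = dump_jsonl(toy_samples, out_path)
    # Read the file back to confirm the round trip.
    with open(out_path, encoding="utf-8") as f:
        restored = [json.loads(line) for line in f]

print(written, "examples written")
```

With the WebDataset pipeline from the snippet above, one would presumably drop the .batched(20) step and unpack the 1-tuples produced by .to_tuple("json") before passing the examples to such a helper.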

Below we show how the data is organized with two examples.

Text-only

{'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
 's1_all_links': {
     'Sils,_Girona': [[0, 4]],
     'municipality': [[10, 22]],
     'Comarques_of_Catalonia': [[30, 37]],
     'Selva': [[41, 46]],
     'Catalonia': [[51, 60]]
 },  # list of entities and their mentions in the sentence (start, end location)
 'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
     {
         'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
         's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
         's2s': [  # list of other sentences that contain the common entity pair, or evidence
             {
                 'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                 'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                 's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context. 's_loc' is the start/end location of the actual evidence sentence
                 'pair_locs': [  # mentions of the entity pair in the evidence
                     [[19, 27]],  # mentions of entity 1
                     [[0, 5], [288, 293]]  # mentions of entity 2
                 ],
                 'all_links': {
                     'Selva': [[0, 5], [288, 293]],
                     'Comarques_of_Catalonia': [[19, 27]],
                     'Catalonia': [[40, 49]]
                 }
             },
             ...]  # there are multiple evidence sentences
     },
     ...]  # there are multiple entity pairs in the query
}
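To make the offset convention in the text-only example concrete, here is a small sketch that slices the query sentence with each (start, end) pair. The helper name mention_texts is hypothetical; the sentence and offsets are copied from the example above, and the sketch assumes the end offset is exclusive, which is what the sample values suggest.

```python
def mention_texts(text, links):
    """Map each entity to the surface strings found at its (start, end) spans."""
    return {
        entity: [text[start:end] for start, end in spans]
        for entity, spans in links.items()
    }


# Query sentence and links copied from the text-only example.
s1_text = 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.'
s1_all_links = {
    'Sils,_Girona': [[0, 4]],
    'municipality': [[10, 22]],
    'Comarques_of_Catalonia': [[30, 37]],
    'Selva': [[41, 46]],
    'Catalonia': [[51, 60]],
}

surfaces = mention_texts(s1_text, s1_all_links)
for entity, strings in surfaces.items():
    print(entity, "->", strings)
```

The same slicing applies to 'pair_locs' and 'all_links' inside each evidence entry, using that entry's 'text' field instead of the query sentence.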

Hybrid

{'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
 's1_all_links': {...},  # same as text-only
 'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
 'table_pairs': [
     {
         'tid': 'Major_League_Baseball-1',
         'text': [
             ['World Series Records', 'World Series Records', ...],
             ['Team', 'Number of Series won', ...],
             ['St. Louis Cardinals (NL)', '11', ...],
             ...],  # table content, list of rows
         'index': [
             [[0, 0], [0, 1], ...],
             [[1, 0], [1, 1], ...],
             ...],  # index of each cell [row_id, col_id]. we keep only a table snippet, but the index here is from the original table.
         'value_ranks': [
             [0, 0, ...],
             [0, 0, ...],
             [0, 10, ...],
             ...],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
         'value_inv_ranks': [],  # inverse rank
         'all_links': {
             'St._Louis_Cardinals': {
                 '2': [
                     [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
                 ]  # list of mentions in the second row, the key is row_id
             },
             'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]},
         },
         'name': '',  # table name, if exists
         'pairs': {
             'pair': ['American_League', 'National_League'],
             's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
             'table_pair_locs': {
                 '17': [  # mention of entity pair in row 17
                     [
                         [[17, 0], [3, 18]],
                         [[17, 1], [3, 18]],
                         [[17, 2], [3, 18]],
                         [[17, 3], [3, 18]]
                     ],  # mention of the first entity
                     [
                         [[17, 0], [21, 36]],
                         [[17, 1], [21, 36]],
                     ]  # mention of the second entity
                 ]
             }
         }
     }
 ]
}
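Because the cell coordinates in 'all_links' refer to the original table while 'text' holds only a snippet, resolving a table mention takes one extra lookup through 'index'. The sketch below shows one way this could work. The helper names and the toy table are hypothetical: the toy keeps only the first two columns of the three rows visible in the example, assumes the elided third 'index' row is [[2, 0], [2, 1]], assumes 'index' lines up position by position with 'text', and treats end offsets as exclusive.

```python
def build_index_map(index):
    """Map original-table (row_id, col_id) to the cell's position in the snippet."""
    return {
        (row_id, col_id): (i, j)
        for i, row in enumerate(index)
        for j, (row_id, col_id) in enumerate(row)
    }


def resolve_table_mentions(table, entity):
    """Return the surface strings of an entity's mentions that fall inside the snippet."""
    positions = build_index_map(table["index"])
    surfaces = []
    for mentions in table["all_links"].get(entity, {}).values():
        for (row_id, col_id), (start, end) in mentions:
            pos = positions.get((row_id, col_id))
            if pos is None:
                continue  # the cell is not part of the kept snippet
            i, j = pos
            surfaces.append(table["text"][i][j][start:end])
    return surfaces


# Toy table: a truncated, partly assumed version of the hybrid example.
toy_table = {
    "text": [
        ["World Series Records", "World Series Records"],
        ["Team", "Number of Series won"],
        ["St. Louis Cardinals (NL)", "11"],
    ],
    "index": [
        [[0, 0], [0, 1]],
        [[1, 0], [1, 1]],
        [[2, 0], [2, 1]],  # assumed; this row is elided in the example
    ],
    "all_links": {
        "St._Louis_Cardinals": {"2": [[[2, 0], [0, 19]]]},
        "CARDINAL:11": {"2": [[[2, 1], [0, 2]]], "8": [[[8, 3], [0, 2]]]},
    },
}

print(resolve_table_mentions(toy_table, "St._Louis_Cardinals"))
print(resolve_table_mentions(toy_table, "CARDINAL:11"))
```

The row 8 mention of 'CARDINAL:11' is skipped here only because the toy snippet does not contain that row.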
turbulence_gravity_cooling
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). turbulence_gravity_cooling [Dataset]. https://huggingface.co/datasets/polymathic-ai/turbulence_gravity_cooling
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is part of The Well Collection.

How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "turbulence_gravity_cooling",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:…

See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/turbulence_gravity_cooling.
convective_envelope_rsg
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). convective_envelope_rsg [Dataset]. https://huggingface.co/datasets/polymathic-ai/convective_envelope_rsg
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is part of The Well Collection.

How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "convective_envelope_rsg",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:…

See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/convective_envelope_rsg.
planetswe
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). planetswe [Dataset]. https://huggingface.co/datasets/polymathic-ai/planetswe
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is part of The Well Collection.

How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "planetswe",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:
    # Process…

See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/planetswe.
MELD Preprocessed
kaggle.com
zip
Updated Mar 1, 2025
Cite
Argish Abhangi (2025). MELD Preprocessed [Dataset]. https://www.kaggle.com/datasets/argish/meld-preprocessed
Available download formats: zip (3527202381 bytes)
Dataset updated
Mar 1, 2025
Authors
Argish Abhangi
Description
The MELD Preprocessed Dataset is a multi-modal dataset designed for research on emotion recognition from audio, video, and textual data. The dataset builds upon the original MELD dataset and applies extensive preprocessing steps to extract features from different modalities. Each sample is saved as a .pt file containing a dictionary of preprocessed features, making it easy for developers to load and integrate into PyTorch-based workflows.

Data Sources

Audio: Waveforms extracted from the original video files.

Video: Video files are processed to sample frames at a target frame rate (default: 2 fps) and to detect faces using a Haar Cascade classifier.

Text: Utterances from the dialogue, which are cleaned using custom encoding functions to fix potential byte encoding issues.

Emotion Labels: Each sample is associated with an emotion label.

Preprocessing Pipeline

The preprocessing script performs several key steps:

Text Cleaning:

fix_encoding_with_bytes(text): Decodes text from bytes using UTF-8, Latin-1, or cp1252, ensuring correct encoding.

replace_double_encoding(text): Fixes issues related to double-encoded characters (e.g., replacing "Â’" with the proper apostrophe).

Audio Processing:

Extracts raw audio waveform from each sample.

Computes a Mel-spectrogram using torchaudio.transforms.MelSpectrogram with 64 mel bins (VGGish format).

Converts the spectrogram to a logarithmic scale for numerical stability.

Video Processing:

Reads video frames at a specified target FPS (default: 2 fps) using OpenCV.

For each video, samples frames evenly based on the original video's FPS.

Applies Haar Cascade face detection on the frames to extract the first detected face.

Resizes the detected face to 224x224 and converts it to RGB. If no face is detected, a default black image (224x224x3) is returned.

Saving Processed Samples:

Each sample is saved as a .pt file in a directory structure split by data type (train, dev, and test).

The filename is derived from the original video filename (e.g., dia0_utt1.mp4 becomes dia0_utt1.pt).
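The dataset's own text-cleaning functions are not shown on this page. As a minimal sketch of the general idea behind repairing double-encoded text, the example below re-encodes a string as cp1252 and decodes the bytes as UTF-8, falling back to the input when that round trip fails. The function name repair_mojibake is hypothetical, and this generic round trip is an assumption, not the implementation of fix_encoding_with_bytes or replace_double_encoding; it does not cover every case those functions may handle.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as cp1252, when possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this kind of mis-decoding; keep it as is.
        return text


# A right single quotation mark (U+2019) that was mis-decoded as cp1252.
print(repair_mojibake("Iâ€™m fine"))
# Correctly encoded text is returned unchanged.
print(repair_mojibake("café"))
```

The fallback matters because correctly encoded text with accented characters usually fails the round trip and must be left untouched.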

Data Format

Each preprocessed sample is stored in a .pt file and contains a dictionary with the following keys:

utterance (str): The cleaned textual utterance.

emotion (str/int): The corresponding emotion label.

video_path (str): Original path to the video file from which the sample was extracted.

audio (Tensor): Raw audio waveform tensor of shape [channels, time].

audio_sample_rate (int): The sampling rate of the audio waveform.

audio_mel (Tensor): The computed log-scaled Mel-spectrogram with shape [channels, n_mels, time].

face (NumPy array): The extracted face image (RGB format) of shape (224, 224, 3). If no face was detected, a default black image is provided.

Directory Structure

The preprocessed files are organized into splits:

preprocessed_data/
├── train/
│   ├── dia0_utt0.pt
│   ├── dia1_utt1.pt
│   └── ...
├── dev/
│   ├── dia0_utt0.pt
│   ├── dia1_utt1.pt
│   └── ...
└── test/
    ├── dia0_utt0.pt
    ├── dia1_utt1.pt
    └── ...

Loading and Using the Dataset

A custom PyTorch dataset and DataLoader are provided to facilitate easy integration:

Dataset Class

from torch.utils.data import Dataset
import os
import torch

class PreprocessedMELDDataset(Dataset):
    def __init__(self, data_dir):
        """
        Args:
            data_dir (str): Directory where preprocessed .pt files are stored.
        """
        self.data_dir = data_dir
        self.files = [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.pt')]

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample_path = self.files[idx]
        sample = torch.load(sample_path)
        return sample

Custom Collate Function

def preprocessed_collate_fn(batch):
    """
    Collates a list of sample dictionaries into a single dictionary with keys mapping to lists.
    Modify this function to pad or stack tensor data if needed.
    """
    collated = {}
    collated['utterance'] = [sample['utterance'] for sample in batch]
    collated['emotion'] = [sample['emotion'] for sample in batch]
    collated['video_path'] = [sample['video_path'] for sample in batch]
    collated['audio'] = [sample['audio'] for sample in batch]
    collated['audio_sample_rate'] = batch[0]['audio_sample_rate']
    collated['audio_mel'] = [sample['audio_mel'] for sample in batch]
    collated['face'] = [sample['face'] for sample in batch]
    return collated

Creating DataLoaders

from torch.utils.data import DataLoader

# Define paths for each split
train_data_dir = "preprocessed_data/train"
dev_data_dir = "preproces...
shear_flow
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). shear_flow [Dataset]. https://huggingface.co/datasets/polymathic-ai/shear_flow
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "shear_flow",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:
    # Process training batch
    ...

Periodic… See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/shear_flow.
MHD_64
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). MHD_64 [Dataset]. https://huggingface.co/datasets/polymathic-ai/MHD_64
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is part of The Well Collection.

How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.benchmark.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "MHD_64",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:
    #…

See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/MHD_64.
rayleigh_benard
huggingface.co
Updated Dec 3, 2024
Cite
Polymathic AI (2024). rayleigh_benard [Dataset]. https://huggingface.co/datasets/polymathic-ai/rayleigh_benard
Dataset updated
Dec 3, 2024
Dataset authored and provided by
Polymathic AI
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Dataset is part of The Well Collection.

How To Load from HuggingFace Hub

Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

from the_well.benchmark.data import WellDataModule

# The following line may take a couple of minutes to instantiate the datamodule
datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "rayleigh_benard",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:…

See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/rayleigh_benard.
PDEBench_2D_DarcyFlow
huggingface.co
Updated Nov 12, 2025
Cite
Brian Staber (2025). PDEBench_2D_DarcyFlow [Dataset]. https://huggingface.co/datasets/Nionio/PDEBench_2D_DarcyFlow
Dataset updated
Nov 12, 2025
Authors
Brian Staber
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Example of usage:

import torch
from plaid.bridges import huggingface_bridge as hfb
from torch.utils.data import DataLoader

def reshape_all(batch: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Helper function that reshapes the flattened fields into images of sizes (128, 128)."""
    batch["diffusion_coefficient"] = batch["diffusion_coefficient"].reshape(-1, 128, 128)
    batch["flow"] = batch["flow"].reshape(-1, 128, 128)
    return batch

Load the dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nionio/PDEBench_2D_DarcyFlow.