Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
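A minimal loading sketch, assuming the standard Hugging Face `datasets` API; the repository id and dump/language configuration below are placeholders and should be replaced with the ones listed on the dataset page you are using:

```python
from datasets import load_dataset

# Repository id and config name are assumptions (a recent Wikimedia dump, English split);
# substitute the dump date and language configuration you actually need.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Each example is one cleaned article; typical fields are id, url, title, and text.
for article in wiki:
    print(article["title"])
    print(article["text"][:200])
    break
```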
not-lain/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
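As a minimal usage sketch (assuming the standard `datasets` API; the configuration name below follows the usual WikiText naming and may differ on the mirror linked above):

```python
from datasets import load_dataset

# "wikitext-2-raw-v1" is the conventional WikiText-2 config name; the
# mindchain/wikitext2 mirror may expose a different layout -- check its page.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Each row holds one line of raw article text (section headings look like "= Title =").
print(wikitext[10]["text"])
```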
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Japanese Wikipedia Human Retrieval dataset
This is a Japanese question answering dataset built with retrieval over Wikipedia articles by trained human workers.
Contributors
Yusuke Oda defined the dataset specification, data structure, and data collection scheme. Baobab, Inc. carried out data collection, checking, and formatting.
About the dataset
Each entry represents a single QA session: given a question sentence, the responsible worker tried… See the full description on the dataset page: https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja.
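Since the card above is truncated, the sketch below only loads the dataset and inspects its schema, without assuming specific field names (the split name is an assumption):

```python
from datasets import load_dataset

# Repository id from the dataset page linked above; the split name is assumed.
qa = load_dataset("baobab-trees/wikipedia-human-retrieval-ja", split="train")

# Inspect the schema and one QA session rather than guessing field names.
print(qa.column_names)
print(qa[0])
```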
mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (de) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
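A hedged semantic-search sketch: the embedding column name (`emb`) and the query-embedding call are assumptions based on the card text and the legacy Cohere Python SDK, so verify both against the dataset page and your SDK version:

```python
import numpy as np
import cohere
from datasets import load_dataset

# Stream a sample of the precomputed passage embeddings (column names assumed).
docs = load_dataset("Cohere/wikipedia-22-12-de-embeddings", split="train", streaming=True)
titles, vectors = [], []
for i, row in enumerate(docs):
    titles.append(row["title"])
    vectors.append(row["emb"])
    if i >= 9999:  # keep the sketch small
        break
doc_matrix = np.asarray(vectors, dtype=np.float32)

# Embed the query with the same multilingual-22-12 model (legacy SDK interface).
co = cohere.Client("YOUR_API_KEY")
query = "Wer erfand den Buchdruck?"
query_emb = np.asarray(
    co.embed(texts=[query], model="multilingual-22-12").embeddings[0],
    dtype=np.float32,
)

# Dot-product search over the sampled passages.
scores = doc_matrix @ query_emb
for idx in np.argsort(-scores)[:3]:
    print(float(scores[idx]), titles[idx])
```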
Dataset Card for Simple Wiki
This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "text", "simplified"
Column types: str, str
Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.
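A minimal training sketch, assuming the sentence-transformers v3 trainer API; the base model and loss are illustrative choices, not prescribed by the card:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# The "pair" config and its ("text", "simplified") columns come from the card above.
train_dataset = load_dataset("sentence-transformers/simple-wiki", "pair", split="train")

# Base model and loss are example choices; MultipleNegativesRankingLoss treats each
# (text, simplified) pair as a positive and other in-batch rows as negatives.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("simple-wiki-embedding-model")
```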
This dataset is used to train the Cephalo models. Cephalo is a series of multimodal materials science focused vision large language models (V-LLMs) designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks. A novel aspect of Cephalo's development is the innovative dataset generation method. The extraction process employs advanced algorithms to accurately detect and separate images and their corresponding textual… See the full description on the dataset page: https://huggingface.co/datasets/lamm-mit/Cephalo-Wikipedia-Materials.
charris/wikipedia-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Tevatron/wikipedia-nq-corpus dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 https://choosealicense.com/licenses/cc0-1.0/
Jigsaw Toxic Comment Challenge dataset. This dataset was the basis of a Kaggle competition run by Jigsaw.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ieeeeeH/svm-wikipedia-01 dataset hosted on Hugging Face and contributed by the HF Datasets community
Tevatron/wikipedia-curated dataset hosted on Hugging Face and contributed by the HF Datasets community
tcltcl/truncated-american-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
oriental-lab/wikipedia-english dataset hosted on Hugging Face and contributed by the HF Datasets community
shomez/blink-pretrain-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
The Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Key Advantages
A few unique advantages of WIT:
The largest multimodal dataset (at the time of writing) by number of image-text examples. Massively multilingual (a first of its kind), with coverage for over 100 languages. A diverse collection of concepts and real-world entities. Challenging real-world test sets.
burgerbee/wikipedia-en-20241020 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).