Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
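A minimal loading sketch, assuming the standard Hugging Face `datasets` API; the repository id and dump/language configuration below are placeholders and should be replaced with the ones listed on the dataset page you are using:

```python
from datasets import load_dataset

# Repository id and config name are assumptions (a recent Wikimedia dump, English split);
# substitute the dump date and language configuration you actually need.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

# Each example is one cleaned article; typical fields are id, url, title, and text.
for article in wiki:
    print(article["title"])
    print(article["text"][:200])
    break
```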
not-lain/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
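As a minimal usage sketch (assuming the standard `datasets` API; the configuration name below follows the usual WikiText naming and may differ on the mirror linked above):

```python
from datasets import load_dataset

# "wikitext-2-raw-v1" is the conventional WikiText-2 config name; the
# mindchain/wikitext2 mirror may expose a different layout -- check its page.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# Each row holds one line of raw article text (section headings look like "= Title =").
print(wikitext[10]["text"])
```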
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Japanese Wikipedia Human Retrieval dataset
This is a Japanese question answering dataset built with retrieval over Wikipedia articles by trained human workers.
Contributors
Yusuke Oda defined the dataset specification, data structure, and data collection scheme. Baobab, Inc. carried out data collection, checking, and formatting.
About the dataset
Each entry represents a single QA session: given a question sentence, the responsible worker tried… See the full description on the dataset page: https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja.
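Since the card above is truncated, the sketch below only loads the dataset and inspects its schema, without assuming specific field names (the split name is an assumption):

```python
from datasets import load_dataset

# Repository id from the dataset page linked above; the split name is assumed.
qa = load_dataset("baobab-trees/wikipedia-human-retrieval-ja", split="train")

# Inspect the schema and one QA session rather than guessing field names.
print(qa.column_names)
print(qa[0])
```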
mixedbread-ai/wikipedia-data-en-2023-11 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (de) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
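A hedged semantic-search sketch: the embedding column name (`emb`) and the query-embedding call are assumptions based on the card text and the legacy Cohere Python SDK, so verify both against the dataset page and your SDK version:

```python
import numpy as np
import cohere
from datasets import load_dataset

# Stream a sample of the precomputed passage embeddings (column names assumed).
docs = load_dataset("Cohere/wikipedia-22-12-de-embeddings", split="train", streaming=True)
titles, vectors = [], []
for i, row in enumerate(docs):
    titles.append(row["title"])
    vectors.append(row["emb"])
    if i >= 9999:  # keep the sketch small
        break
doc_matrix = np.asarray(vectors, dtype=np.float32)

# Embed the query with the same multilingual-22-12 model (legacy SDK interface).
co = cohere.Client("YOUR_API_KEY")
query = "Wer erfand den Buchdruck?"
query_emb = np.asarray(
    co.embed(texts=[query], model="multilingual-22-12").embeddings[0],
    dtype=np.float32,
)

# Dot-product search over the sampled passages.
scores = doc_matrix @ query_emb
for idx in np.argsort(-scores)[:3]:
    print(float(scores[idx]), titles[idx])
```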
Dataset Card for Simple Wiki
This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "text", "simplified"
Column types: str, str
Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.
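A minimal training sketch, assuming the sentence-transformers v3 trainer API; the base model and loss are illustrative choices, not prescribed by the card:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# The "pair" config and its ("text", "simplified") columns come from the card above.
train_dataset = load_dataset("sentence-transformers/simple-wiki", "pair", split="train")

# Base model and loss are example choices; MultipleNegativesRankingLoss treats each
# (text, simplified) pair as a positive and other in-batch rows as negatives.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save("simple-wiki-embedding-model")
```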
This dataset is used to train the Cephalo models. Cephalo is a series of multimodal materials science focused vision large language models (V-LLMs) designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks. A novel aspect of Cephalo's development is the innovative dataset generation method. The extraction process employs advanced algorithms to accurately detect and separate images and their corresponding textual… See the full description on the dataset page: https://huggingface.co/datasets/lamm-mit/Cephalo-Wikipedia-Materials.
charris/wikipedia-filtered dataset hosted on Hugging Face and contributed by the HF Datasets community
Tevatron/wikipedia-nq-corpus dataset hosted on Hugging Face and contributed by the HF Datasets community
CC0 1.0 https://choosealicense.com/licenses/cc0-1.0/
Jigsaw Toxic Comment Challenge dataset. This dataset was the basis of a Kaggle competition run by Jigsaw.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ieeeeeH/svm-wikipedia-01 dataset hosted on Hugging Face and contributed by the HF Datasets community
Tevatron/wikipedia-curated dataset hosted on Hugging Face and contributed by the HF Datasets community
tcltcl/truncated-american-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
oriental-lab/wikipedia-english dataset hosted on Hugging Face and contributed by the HF Datasets community
shomez/blink-pretrain-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
The Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Key Advantages
A few unique advantages of WIT:
The largest multimodal dataset (at the time of writing) by number of image-text examples. Massively multilingual (a first of its kind), with coverage for over 100 languages. A diverse collection of concepts and real-world entities. Challenging real-world test sets.
burgerbee/wikipedia-en-20241020 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).