Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
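As a rough illustration of how one of these per-language splits can be consumed, a minimal loading sketch follows; the dataset identifier ("wikipedia"), the config name "20220301.en", and the "title"/"text" columns are typical for this card but should be verified on the dataset page:
from datasets import load_dataset

# Load one language split of the cleaned Wikipedia dump; the dump date and
# language are encoded in the config name (an English example here).
wiki = load_dataset("wikipedia", "20220301.en", split="train")

article = wiki[0]
print(article["title"])
print(article["text"][:300])  # cleaned article body with markup stripped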
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
Wikipedia
Source: https://huggingface.co/datasets/wikipedia
Num examples: 1,281,412
Language: Vietnamese
from datasets import load_dataset

# Load the Vietnamese Wikipedia articles.
dataset = load_dataset("tdtunlp/wikipedia_vi")
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains a pair of equivalent sentences, formatted as a dictionary whose "set" key holds the list of sentences: {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
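A minimal sketch of reading these pairs, assuming a standard "train" split:
from datasets import load_dataset

pairs = load_dataset("embedding-data/simple-wiki", split="train")

# Each example is a dict; the "set" key holds a list of equivalent sentences.
example = pairs[0]["set"]
print(example[0])
print(example[1])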
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
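A minimal semantic-search sketch over these precomputed vectors; the "emb"/"title"/"text" column names, the "train" split, and the Cohere client call are assumptions to be checked against the dataset page and the current Cohere SDK, and an API key is required:
import numpy as np
import cohere
from datasets import load_dataset

# Stream a small slice of the precomputed document embeddings.
docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)
subset = [d for _, d in zip(range(1000), docs)]
doc_embs = np.array([d["emb"] for d in subset])

# Embed the query with the same multilingual-22-12 model (hypothetical key).
co = cohere.Client("YOUR_API_KEY")
query_emb = np.array(co.embed(texts=["Who founded Wikipedia?"], model="multilingual-22-12").embeddings[0])

# Rank documents by dot product with the query embedding.
best = subset[int(np.argmax(doc_embs @ query_emb))]
print(best["title"], best["text"][:200])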
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
wikipedia persons masked: a filtered version of the Wikipedia dataset containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a mask token.
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is <mask>?"
Languages
English only.
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
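A minimal fill-mask sketch in the spirit of this dataset; the example sentence and the model are illustrative, and the dataset's own mask marker may need to be mapped to the model's mask token (here "[MASK]"):
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")

# A masked-person sentence in the style of the dataset.
masked = "[MASK] was a German-born theoretical physicist who developed the theory of relativity."
for prediction in fill(masked)[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))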
GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Embedding model used to embed each article.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Stream the English Wikipedia dump from 2023-11-01.
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September 2017. The dataset differs from the regular Wikipedia dump, and from the datasets that can be created with gensim, in that it contains the extracted summaries rather than the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
Multilingual Embeddings for Wikipedia in 300+ Languages
This dataset contains the wikimedia/wikipedia dump from 2023-11-01 in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
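A hedged retrieval-augmented sketch using these precomputed chunk embeddings; the "simple" config name, the "emb"/"title"/"text" columns, and the Cohere Embed V3 call (with input_type="search_query") should be verified against the dataset page and the current Cohere SDK, and an API key is required:
import numpy as np
import cohere
from datasets import load_dataset

# Stream a small slice of one language config of the chunked, embedded corpus.
docs = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3", "simple", split="train", streaming=True)
chunks = [d for _, d in zip(range(2000), docs)]
chunk_embs = np.array([d["emb"] for d in chunks])

# Embed the question as a search query with the same Embed V3 model (hypothetical key).
co = cohere.Client("YOUR_API_KEY")
question = "When was Wikipedia launched?"
q = co.embed(texts=[question], model="embed-multilingual-v3.0", input_type="search_query")
q_emb = np.array(q.embeddings[0])

# Take the top-3 chunks by dot product and stuff them into a RAG prompt.
top = np.argsort(chunk_embs @ q_emb)[::-1][:3]
context = "\n\n".join(chunks[i]["text"] for i in top)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)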
tcltcl/small-simple-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Taiwan Traditional Chinese Wikipedia (zh-tw Wikipedia)
A nearly complete collection of the Taiwan Traditional Chinese (zh-tw) text of 2,533,212 Chinese Wikipedia articles as of May 2023, gathered between May 1 and May 7, 2023. Each article is one row and includes both the original HTML and an auto-converted Markdown version, the latter post-processed with vinta/pangu.py. The content was retrieved via Wikipedia's action=query & prop=extracts API and is identical to the Taiwan Traditional rendering of the Wikipedia site, with no mixing of Traditional and Simplified characters. For development… See the full description on the dataset page: https://huggingface.co/datasets/zetavg/zh-tw-wikipedia.
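The exact column names for the HTML and Markdown renditions are not stated above, so a minimal sketch would first inspect the schema; the "train" split is an assumption:
from datasets import load_dataset

# Stream one page and look at its fields.
pages = load_dataset("zetavg/zh-tw-wikipedia", split="train", streaming=True)
page = next(iter(pages))
print(page.keys())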
Dataset Card for Wikipedia Sentences (English)
This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.
Dataset Details
Columns: "sentence" Column types: str Examples:{ 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
upprize/wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
ROOTS Subset: roots_indic-te_wikipedia
wikipedia
Dataset uid: wikipedia
Description
Homepage
Licensing
Speaker Locations
Sizes
3.2299 % of total
4.2071 % of en
5.6773 % of ar
3.3416 % of fr
5.2815 % of es
12.4852 % of ca
0.4288 % of zh
0.4286 % of zh
5.4743 % of indic-bn
8.9062 % of indic-ta
21.3313 % of indic-te
4.4845 % of pt
4.0493 % of indic-hi
11.3163 % of indic-ml
22.5300 % of indic-ur
4.4902 % of vi
16.9916 % of… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_indic-te_wikipedia.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for llm-book/ner-wikipedia-dataset
This is the "Wikipediaを用いた日本語の固有表現抽出データセット" (a Japanese named entity extraction dataset built from Wikipedia), Version 2.0, created by Stockmark Inc. and used in the book 大規模言語モデル入門 (Introduction to Large Language Models). It uses the dataset published in the GitHub repository stockmarkteam/ner-wikipedia-dataset.
Citation
@inproceedings{omi-2021-wikipedia,
  title = "Wikipediaを用いた日本語の固有表現抽出のデータセットの構築",
  author = "近江 崇宏",
  booktitle = "言語処理学会第27回年次大会",
  year = "2021",
  url = "https://anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P2-7.pdf",
}
Licence… See the full description on the dataset page: https://huggingface.co/datasets/llm-book/ner-wikipedia-dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple disjoint text blocks of 100 words, which serve as passages.
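A hedged retrieval sketch; the dataset identifier ("wiki_dpr" with the "psgs_w100.nq.no_index" config) and its "embeddings"/"text" columns are assumptions about which Hub dataset this card refers to, faiss must be installed, and the full 21M-passage index is large, so a subset may be more practical for experimentation:
import torch
from datasets import load_dataset
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

# Passages with precomputed DPR passage embeddings; building the FAISS index
# over all 21M rows needs substantial memory and disk.
passages = load_dataset("wiki_dpr", "psgs_w100.nq.no_index", split="train")
passages.add_faiss_index(column="embeddings")

tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Encode the question with the DPR question encoder.
with torch.no_grad():
    inputs = tok("who wrote the wealth of nations", return_tensors="pt")
    q_emb = encoder(**inputs).pooler_output[0].numpy()

scores, hits = passages.get_nearest_examples("embeddings", q_emb, k=5)
for text in hits["text"]:
    print(text[:120])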
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation". GitHub repo. Original paper.
@inproceedings{zheng-etal-2020-diffks,
  title = "{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
  author = "Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
  booktitle = "Findings of EMNLP",
  year = "2020"
}
The Wikipedia Corpus in MedRAG
This HF dataset contains the chunked snippets from the Wikipedia corpus used in MedRAG. It can be used for medical Retrieval-Augmented Generation (RAG).
News
(02/26/2024) The "id" column has been reformatted. A new "wiki_id" column is added.
Dataset Details
Dataset Descriptions
As a large-scale open-source encyclopedia, Wikipedia is frequently used as a corpus in information retrieval tasks. We select Wikipedia as one… See the full description on the dataset page: https://huggingface.co/datasets/MedRAG/wikipedia.
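A minimal sketch for inspecting this corpus; the "train" split is an assumption, while the "id" and "wiki_id" columns come from the note above:
from datasets import load_dataset

# Stream the chunked snippets and look at one record's schema.
corpus = load_dataset("MedRAG/wikipedia", split="train", streaming=True)
chunk = next(iter(corpus))
print(chunk.keys())
print(chunk["id"], chunk["wiki_id"])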