Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "simple-wiki"
Dataset Summary
This dataset contains pairs of equivalent sentences obtained from Wikipedia.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains a pair of equivalent sentences and is formatted as a dictionary with the key "set" and a list of the sentences as its value. {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
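As a rough illustration of the record layout described above, the following sketch turns one {"set": [...]} record into sentence pairs. The example sentences and the helper name pairs_from_record are invented for illustration and are not taken from the dataset.

```python
from itertools import combinations


def pairs_from_record(record):
    """Return every unordered sentence pair from one {"set": [...]} record."""
    return list(combinations(record["set"], 2))


# Hypothetical record, invented for illustration only.
example = {"set": ["The cat sat on the mat.", "A cat was sitting on the mat."]}
pairs = pairs_from_record(example)
```

Using combinations means a record that happens to hold more than two sentences would still yield every pair; whether such records exist in the dataset is not stated here.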
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for Wikimedia Structured Wikipedia
Dataset Description
Dataset Summary
Early beta release of pre-parsed English and French Wikipedia articles including infoboxes. Inviting feedback. This dataset contains all articles of the English and French language editions of Wikipedia, pre-parsed and outputted as structured JSON files with a consistent schema (JSONL compressed as zip). Each JSON line holds the content of one full Wikipedia article stripped of… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/structured-wikipedia.
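A minimal sketch of reading "JSONL compressed as zip" with the Python standard library is shown below. It builds a tiny archive in memory so that it runs without downloading anything; the file name sample.jsonl and the "name" field are assumptions made for the example, not the dataset's documented schema.

```python
import io
import json
import zipfile


def iter_articles(zip_bytes):
    """Yield one dict per non-empty JSON line from every file in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as handle:
                for raw_line in handle:
                    line = raw_line.decode("utf-8").strip()
                    if line:
                        yield json.loads(line)


# Build a small in-memory archive; file name and fields are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.jsonl", '{"name": "Article A"}\n{"name": "Article B"}\n')

articles = list(iter_articles(buffer.getvalue()))
```

Reading line by line keeps memory use bounded, which matters when each line holds a full article.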
https://choosealicense.com/licenses/other/
Dataset Card for "wiki_qa"
Dataset Summary
Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
default
Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. For an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute the embeddings for title+" "+text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
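The sketch below illustrates the two ideas in this entry: building the model input as title+" "+text, and ranking stored vectors against a query vector. The two-dimensional vectors are toy values, and scoring by dot product is an assumption made for this example; the excerpt above does not say which similarity measure to use, and no real embedding model is called.

```python
def build_input(title, text):
    """Concatenate title and text with a single space, as described above."""
    return title + " " + text


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending dot-product score."""
    scores = [dot(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vec = [0.9, 0.1]
order = rank(query_vec, doc_vecs)
model_input = build_input("Cat", "The cat is a small domesticated animal.")
```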
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia persons masked: a filtered version of the Wikipedia dataset, with only pages of people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is
Languages
English only
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
from tqdm import tqdm
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
HuggingFaceFW/clean-wikipedia dataset hosted on Hugging Face and contributed by the HF Datasets community
simple-wikipedia
Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
Dataset Summary
This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to use… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.
uqa/Wiki-UQA dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Wikipedia Sentences (English)
This dataset contains 7.87 million English sentences and can be used in knowledge distillation of embedding models.
Dataset Details
Columns: "sentence"
Column types: str
Examples: { 'sentence': "After the deal was approved and NONG's stock rose to $13, Farris purchased 10,000 shares at the $2.50 price, sold 2,500 shares at the new price to reimburse the company, and gave the remaining 7,500 shares to Landreville at no cost… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/wikipedia-en-sentences.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for Wikimedia Wikisource
Dataset Summary
Wikisource dataset containing cleaned articles of all languages. The dataset is built from the Wikisource dumps (https://dumps.wikimedia.org/) with one subset per language, each containing a single train split. Each example contains the content of one full Wikisource text with cleaning to strip markdown and unwanted sections (references, etc.). All language subsets have already been processed for a recent dump, and you… See the full description on the dataset page: https://huggingface.co/datasets/wikimedia/wikisource.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. The authors first crowd-sourced a set of manual alignments between sentences in a subset of the Simple English Wikipedia and their corresponding versions in English Wikipedia (this corresponds to the manual config), then trained a neural CRF system to predict these alignments. The trained model was then applied to the other articles in Simple English Wikipedia with an English counterpart to create a larger corpus of aligned sentences (corresponding to the auto, auto_acl, auto_full_no_split, and auto_full_with_split configs here).
Multilingual Embeddings for Wikipedia in 300+ Languages
This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The WikiMovies dataset consists of roughly 100k (templated) questions over 75k entities based on questions with answers in the open movie database (OMDb).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This is the Wikipedia split used to evaluate the Dense Passage Retrieval (DPR) model. It contains 21M passages from Wikipedia along with their DPR embeddings. The Wikipedia articles were split into multiple, disjoint text blocks of 100 words as passages.
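The splitting into disjoint 100-word blocks can be sketched as follows. This is a simplified illustration that tokenizes on whitespace and keeps a shorter final block; both choices are assumptions, and the actual DPR preprocessing may differ in these details.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping blocks of whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# Synthetic 250-word "article" made of placeholder tokens.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
passage_lengths = [len(p.split()) for p in passages]
```

Because the blocks are disjoint and consecutive, joining them with spaces reproduces the whitespace-normalized article.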
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
GoodWiki Dataset
GoodWiki is a 179 million token dataset of English Wikipedia articles collected on September 4, 2023, that have been marked as Good or Featured by Wikipedia editors. The dataset provides these articles in GitHub-flavored Markdown format, preserving layout features like lists, code blocks, math, and block quotes, unlike many other public Wikipedia datasets. Articles are accompanied by a short description of the page as well as any associated categories. Thanks to a… See the full description on the dataset page: https://huggingface.co/datasets/euirim/goodwiki.