Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "wikitext"
Dataset Summary
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
Dataset Card for Simple Wiki
This dataset is a collection of pairs of English Wikipedia entries and their simplified variants. See Simple-Wiki for additional information. This dataset can be used directly with Sentence Transformers to train embedding models; a short loading sketch follows the subset listing below.
Dataset Subsets
pair subset
Columns: "text", "simplified"
Column types: str, str
Examples: { 'text': "Charles Michael `` Chuck '' Palahniuk ( ; born February 21 , 1962 ) is an American… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/simple-wiki.
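A minimal loading and training sketch with the datasets and sentence-transformers libraries. The "pair" config name comes from the subset listing above; the base model, loss choice, and trainer setup are assumptions rather than the card's prescribed recipe, and require sentence-transformers v3 or later.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Each row pairs an English Wikipedia passage ("text") with its simplified variant ("simplified").
pairs = load_dataset("sentence-transformers/simple-wiki", "pair", split="train")

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model
loss = MultipleNegativesRankingLoss(model)       # treats each pair as (anchor, positive)

trainer = SentenceTransformerTrainer(model=model, train_dataset=pairs, loss=loss)
trainer.train()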
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The VCR-Wiki Dataset for Visual Caption Restoration (VCR)
This is the official Hugging Face dataset for VCR-Wiki, a dataset for the Visual Caption Restoration (VCR) task. VCR is designed to measure vision-language models' capability to accurately restore partially obscured text using pixel-level hints within images. Text-based processing becomes ineffective in VCR, as accurate text restoration… See the full description on the dataset page: https://huggingface.co/datasets/vcr-org/VCR-wiki-en-hard-test.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
chentong00/factoid-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Wiki-Convert is a dataset of 900,000+ sentences with precise number annotations from English Wikipedia. It relies on Wikipedia contributors' annotations in the form of the {{Convert}} template.
The WikiBio GPT-3 Hallucination Dataset is a benchmark dataset used for hallucination detection. It is based on Wikipedia biographies (WikiBio) and is specifically designed to evaluate the factuality of text generated by large language models like GPT-3 [1][2]. Here are some key details about this dataset:
Dataset Source: Wikipedia biographies (WikiBio)
Task: Text classification
Language: English
Size Categories: Less than 1,000 samples
License: Creative Commons Attribution-ShareAlike 3.0 (cc-by-sa-3.0)
(1) potsawee/wiki_bio_gpt3_hallucination · Datasets at Hugging Face. https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination
(2) SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. https://arxiv.org/abs/2303.08896
(3) AI trust and Fujitsu's AI trust technologies for conversational generative AI (Fujitsu). https://www.fujitsu.com/jp/about/research/article/202312-ai-trust-technologies.html
(4) README.md · potsawee/wiki_bio_gpt3_hallucination at main · Hugging Face. https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination/blob/main/README.md
(5) SelfCheckGPT repository on GitHub. https://github.com/potsawee/selfcheckgpt
Cleaned-up text from 40+ Wikipedia language editions for pages corresponding to entities. The datasets have train/dev/test splits per language. The dataset is cleaned by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata ID of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects.
Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.
Key Advantages
A few unique advantages of WIT:
The largest multimodal dataset (at the time of writing) by number of image-text examples.
Massively multilingual (the first of its kind), with coverage for 100+ languages.
A diverse collection of concepts and real-world entities.
Challenging real-world test sets.
Tevatron/wiki-ss-corpus dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Japanese Wikipedia Human Retrieval dataset
This is a Japanese question answering dataset built with retrieval on Wikipedia articles by trained human workers.
Contributors
Yusuke Oda defined the dataset specification, data structure, and data collection scheme. Baobab, Inc. carried out data collection, checking, and formatting.
About the dataset
Each entry represents a single QA session: given a question sentence, the responsible worker tried… See the full description on the dataset page: https://huggingface.co/datasets/baobab-trees/wikipedia-human-retrieval-ja.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker on the cancer cell so that the T-cell (or, more precisely, the other immune cells it recruits) can kill the cancer cell.
import pandas as pd

# Load the pre-split DataFrames; each row is one TCR-epitope pair with its label and embeddings.
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
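As a minimal sketch (not the repository's exact recipe), assuming tcr_vector and epitope_vector each hold a fixed-length numeric array per row of the DataFrames loaded above, a logistic-regression baseline on the concatenated embeddings could look like this:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def to_features(df):
    # Stack the per-row embedding arrays and concatenate TCR and epitope vectors side by side.
    tcr = np.vstack(df["tcr_vector"].to_list())
    epitope = np.vstack(df["epitope_vector"].to_list())
    return np.hstack([tcr, epitope])

X_train, y_train = to_features(train_data), train_data["label"].astype(int)
X_val, y_val = to_features(validation_data), validation_data["label"].astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

val_scores = clf.predict_proba(X_val)[:, 1]
print("Validation ROC-AUC:", roc_auc_score(y_val, val_scores))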
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse set of data from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed representing the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. The 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
References:
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
| Checkpoint name | Number of layers | Number of parameters |
|---|---|---|
| esm2_t48_15B_UR50D | 48 | 15B |
| esm2_t36_3B_UR50D | 36 | 3B |
| esm2_t33_650M_UR50D | 33 | 650M |
| esm2_t30_150M_UR50D | 30 | 150M |
| esm2_t12_35M_UR50D | 12 | 35M |
| esm2_t6_8M_UR50D | 6 | 8M |
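As a hedged illustration only (the original project used the ESM-1b embedder, and the precomputed vectors are already in the pickles above), one way to embed a sequence with one of the ESM-2 checkpoints from the table is via the Hugging Face transformers library. The facebook/ repo prefix, the example sequence, and the mean-pooling step are assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

# Smallest ESM-2 checkpoint; the larger checkpoints in the table follow the same pattern.
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sequence = "CASSLAPGATNEKLFF"  # hypothetical CDR3 amino-acid sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-residue hidden states into one fixed-length vector per sequence.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(embedding.shape)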
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Jigsaw Toxic Comment Challenge dataset. This dataset was the basis of a Kaggle competition run by Jigsaw.
flpelerin/mlk-wiki dataset hosted on Hugging Face and contributed by the HF Datasets community
FEVER is a publicly available dataset for fact extraction and verification against textual sources.
It consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, systems and annotators need to also return the combination of sentences forming the necessary evidence supporting or refuting the claim.
The claims were generated by human annotators extracting claims from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering. The verification of each claim was conducted in a separate annotation process by annotators who were aware of the page but not the sentence from which the original claim was extracted; thus, in 31.75% of the claims more than one sentence was considered appropriate evidence. Claims require composition of evidence from multiple sentences in 16.82% of cases. Furthermore, in 12.15% of the claims, this evidence was taken from multiple pages.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wikipedia (de) embedded with cohere.ai multilingual-22-12 encoder
We encoded Wikipedia (de) using the cohere.ai multilingual-22-12 embedding model. To get an overview of how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.
Embeddings
We compute embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-de-embeddings.
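A rough sketch of nearest-neighbor search over these precomputed vectors. The field names "title", "text", and "emb", the streaming loop, and the random placeholder query vector are assumptions; a real query embedding would come from the same multilingual-22-12 model via the Cohere API.

import numpy as np
from datasets import load_dataset

# Stream a small slice of the corpus together with its precomputed embeddings.
docs = load_dataset("Cohere/wikipedia-22-12-de-embeddings", split="train", streaming=True)

texts, vectors = [], []
for doc in docs:
    texts.append(doc["title"] + " " + doc["text"])
    vectors.append(doc["emb"])  # assumed field holding the precomputed embedding
    if len(texts) >= 1000:
        break
vectors = np.asarray(vectors, dtype=np.float32)

# Placeholder query vector; replace with an embedding from the multilingual-22-12 model.
query = np.random.rand(vectors.shape[1]).astype(np.float32)
scores = vectors @ query  # dot-product similarity
top = np.argsort(-scores)[:3]
print([texts[i] for i in top])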
The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers in forms that are similar to the answerable ones.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Plain text of French Wiktionary
Contents: Dataset Description, Size, Example use (Python), Data fields, Notes on data formatting, License, Acknowledgements, Citation
Dataset Description
This dataset is a plain text version of pages from wiktionary.org in French. The text contains no HTML tags or wiki templates; it only includes markdown syntax for headers, lists and tables. See Notes on data formatting for more details. It was created by LINAGORA and OpenLLM France… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/wiktionary.
asspunchman/wiki-kaz-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community