Dataset Card for Speech Wikimedia
Dataset Summary
The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial use under CC and public-domain licenses. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file has one or more transcriptions, which may be in different languages.
Transcription languages
English, German, … See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.
Multilingual Embeddings for Wikipedia in 300+ Languages
This dataset contains the wikimedia/wikipedia dump from 2023-11-01 in all 300+ Wikipedia languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables easy semantic search across all of Wikipedia, or use of the dataset as a knowledge source for your RAG application. In total, it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
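As a minimal sketch of the semantic-search use case described above (assuming the Hugging Face datasets streaming API and the Cohere Python client; the "simple" config name and the "emb", "title", and "text" field names are assumptions about the dataset layout):

# Sketch only: rank a small streamed slice of the corpus against a query embedding.
import cohere
import numpy as np
from datasets import load_dataset

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

docs = load_dataset("Cohere/wikipedia-2023-11-embed-multilingual-v3",
                    "simple", split="train", streaming=True)
subset = [d for _, d in zip(range(1000), docs)]   # small slice for the example
doc_emb = np.array([d["emb"] for d in subset])    # assumed embedding field

query = "Who painted the Mona Lisa?"
q_emb = np.array(co.embed(texts=[query],
                          model="embed-multilingual-v3.0",
                          input_type="search_query").embeddings[0])

for i in np.argsort(-doc_emb @ q_emb)[:3]:        # dot-product ranking
    print(subset[i]["title"], subset[i]["text"][:100])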
Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
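As a rough illustration of the diff-based extraction described above (not the project's actual pipeline), Python's difflib can pull out the text added between two consecutive revisions:

# Illustrative only: extract the lines added between two revisions.
import difflib

old_revision = "Hello.\nThis article needs work."
new_revision = "Hello.\nThis article needs work.\nI added a source, see the talk page."

added = [line[2:] for line in difflib.ndiff(old_revision.splitlines(),
                                            new_revision.splitlines())
         if line.startswith("+ ")]
print(added)  # ['I added a source, see the talk page.']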
French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.
GNU Free Documentation License (GFDL): https://choosealicense.com/licenses/gfdl/
This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
dataset = load_dataset(
    "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
)
data = Dataset.from_dict({})
for i, entry in… See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia
Consists of metadata features and content text datasets, with the formats:
- {template_name}_features.csv
- {template_name}_difftxt.csv.gz
- {template_name}_fulltxt.csv.gz
For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
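A minimal loading sketch for the file formats listed above; "pov" is used as a hypothetical template name and the column contents are not guaranteed:

# Hypothetical example: load one template's files with pandas.
import pandas as pd

features = pd.read_csv("pov_features.csv")                       # per-revision metadata features
fulltxt = pd.read_csv("pov_fulltxt.csv.gz", compression="gzip")  # full revision text
print(features.shape, fulltxt.shape)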
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A subset of articles extracted from the French Wikipedia XML dump. The data published here covers 5 categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine), and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The characteristics of the dataset are:
The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Movies-related articles extracted from Wikipedia.
For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, in particular the sub-titles and paragraphs, is kept in these datasets.
Movies
The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.
- Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.
- Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
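A minimal sketch of the kind of processing mentioned above, assuming the raw text is distributed as a .bz2-compressed plain-text file (the file name tawiki_text.bz2 is hypothetical):

# Sketch: stream a .bz2-compressed text file and count word frequencies.
import bz2
from collections import Counter

counts = Counter()
with bz2.open("tawiki_text.bz2", mode="rt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        counts.update(line.split())

for word, freq in counts.most_common(10):
    print(word, freq)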
Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.
- Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
- Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
- Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
- Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications (a short sketch follows this list).
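As one possible sketch of the word-embeddings use case (gensim's Word2Vec is just one of several options; the corpus file name is hypothetical):

# Sketch: train Tamil word embeddings with gensim Word2Vec on the raw text.
from gensim.models import Word2Vec

# Hypothetical corpus file: one whitespace-tokenised article per line.
sentences = [line.split() for line in open("ta_wikipedia.txt", encoding="utf-8")]
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
model.save("ta_word2vec.model")
print(model.wv.most_similar("தமிழ்", topn=5))  # nearest neighbours of the word "Tamil"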
I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.
This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia persons masked: a filtered version of the Wikipedia dataset containing only pages about people
Dataset Summary
Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a mask token.
Supported Tasks and Leaderboards
The dataset supports the fill-mask task, but it can also be used for other tasks such as question answering (e.g., asking who the masked person is).
Languages
English only
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
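A small sketch of the fill-mask task on a person-masked sentence (the model choice and the example sentence are illustrative, not part of the dataset):

# Sketch: fill-mask over a sentence in which the person has been masked.
from transformers import pipeline

fill = pipeline("fill-mask", model="roberta-base")  # illustrative model choice
sentence = "<mask> was a German-born theoretical physicist who developed the theory of relativity."
for pred in fill(sentence, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))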
Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.
The dataset has 11,701 nodes and 216,123 edges.
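A minimal loading sketch using PyTorch Geometric's bundled WikiCS loader (assuming torch_geometric is installed; attribute names follow PyG's Data convention):

# Sketch: load Wiki-CS with PyTorch Geometric and inspect the graph.
from torch_geometric.datasets import WikiCS

dataset = WikiCS(root="data/WikiCS")
data = dataset[0]
print(data.num_nodes, data.num_edges)  # node and edge counts as reported by the loader
print(data.x.shape)                    # 300-dimensional GloVe-averaged node features
print(int(data.y.max()) + 1)           # 10 classes (branches of computer science)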
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by the union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker. If you find the downloads here useful, please feel free to leave a GitHub ⭐ at the repository and buy me a ☕: https://www.buymeacoffee.com/thalhamm
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.
More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
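A rough sketch of loading one language's edge list as a directed graph with networkx, assuming a whitespace-separated "source target timestamp" line format (the file name and exact format are assumptions):

# Sketch: build a directed user-interaction graph from an edge list.
import networkx as nx

G = nx.DiGraph()
with open("wiki-talk-en.txt", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        src, dst, ts = line.split()
        G.add_edge(int(src), int(dst), timestamp=int(ts))

print(G.number_of_nodes(), G.number_of_edges())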
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.
Additional information can be found on GitHub.
The following data is supplemental to the experiments described in our research paper. The data consists of:
Datasets (articles, class labels, cross-validation splits)
Pretrained models (Transformers, GloVe, Doc2vec)
Model output (predictions) for the best performing models.
This package consists of the Models and Code part:
Pretrained models
PyTorch: vanilla and Siamese BERT + XLNet
A pretrained model for each fold is available in the corresponding model archives:
# Vanilla
model_wiki.bert_base_joint_seq512.tar.gz
model_wiki.xlnet_base_joint_seq512.tar.gz
# Siamese
model_wiki.bert_base_siamese_seq512_4d.tar.gz
model_wiki.xlnet_base_siamese_seq512_4d.tar.gz
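The archives above are the trained checkpoints from the paper. As a generic sketch of the vanilla joint-sequence setup (not the authors' exact code; the number of relation classes is an assumption), a BERT sequence-pair classifier with Hugging Face transformers looks roughly like this:

# Sketch: vanilla (joint) BERT pair classification over a 512-token sequence.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_RELATIONS = 7  # assumption: number of semantic relation classes
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_RELATIONS
)

doc_a = "Text of the first Wikipedia article ..."
doc_b = "Text of the second Wikipedia article ..."
inputs = tokenizer(doc_a, doc_b, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(int(logits.argmax(dim=-1)))  # predicted relation class index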
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Three corpora in different domains extracted from Wikipedia.
For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.
The article structure, in particular the sub-titles and paragraphs, is kept in these datasets.
Wines
The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are:
Dom Pérignon - Moët & Chandon
Pinot Meunier - Chardonnay
Movies
The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movies' articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are:
Schindler's List - The Pianist
Lion King - The Jungle Book
Video games
The Wikipedia video games dataset consists of 21,935 articles describing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are:
Grand Theft Auto - Mafia
Burnout Paradise - Forza Horizon 3
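One common way to use the expert annotations above is hit-rate@k evaluation of a recommender's ranked list against the ground-truth set; a small illustrative sketch (not tied to the datasets' actual file format):

# Illustrative: hit-rate@k of recommended articles against expert ground truth.
def hit_rate_at_k(recommended, ground_truth, k=10):
    return len(set(recommended[:k]) & set(ground_truth)) / min(k, len(ground_truth))

recs = ["The Pianist", "Saving Private Ryan", "The Jungle Book"]
truth = ["The Pianist", "Life Is Beautiful"]
print(hit_rate_at_k(recs, truth, k=3))  # -> 0.5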
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
A Japanese named entity recognition (NER) dataset built from Wikipedia.
GitHub: https://github.com/stockmarkteam/ner-wikipedia-dataset/ LICENSE: CC-BY-SA 3.0
Developed by Stockmark Inc.