100+ datasets found
  1. speech-wikimedia

    • huggingface.co
    Updated Aug 19, 2023
    Cite
    MLCommons (2023). speech-wikimedia [Dataset]. https://huggingface.co/datasets/MLCommons/speech-wikimedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2023
    Dataset authored and provided by
    MLCommons
    Description

    Dataset Card for Speech Wikimedia

      Dataset Summary
    

    The Speech Wikimedia Dataset is a compilation of audio files with transcriptions extracted from Wikimedia Commons, licensed for academic and commercial use under CC and public-domain licenses. It includes 2,000+ hours of transcribed speech in different languages with a diverse set of speakers. Each audio file may have one or more transcriptions in different languages.

      Transcription languages
    

    English German… See the full description on the dataset page: https://huggingface.co/datasets/MLCommons/speech-wikimedia.
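
    A minimal sketch of streaming a few samples with the Hugging Face datasets library; the "train" split name and the idea that each record exposes audio plus transcription fields are assumptions based on the summary above, not a confirmed schema:

    # Sketch only: split name and field names are assumptions.
    from datasets import load_dataset

    ds = load_dataset("MLCommons/speech-wikimedia", split="train", streaming=True)
    for sample in ds.take(3):
        print(sample.keys())  # inspect the fields the dataset actually exposes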

  2. wikipedia-2023-11-embed-multilingual-v3

    • huggingface.co
    Updated Nov 1, 2023
    Cite
    Cohere (2023). wikipedia-2023-11-embed-multilingual-v3 [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 1, 2023
    Dataset authored and provided by
    Cohere
    Description

    Multilingual Embeddings for Wikipedia in 300+ Languages

    This dataset contains the wikimedia/wikipedia dataset dump from 2023-11-01 from Wikipedia in all 300+ languages. The individual articles have been chunked and embedded with the state-of-the-art multilingual Cohere Embed V3 embedding model. This enables an easy way to semantically search across all of Wikipedia or to use it as a knowledge source for your RAG application. In total, it is close to 250M paragraphs / embeddings. You… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3.
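
    A hedged sketch of brute-force semantic search over one language config; the config name "en", the field names "emb" and "title", and the 1024-dimensional embedding size are assumptions to check against the dataset card, and the random query vector stands in for a real Cohere Embed V3 query embedding:

    import numpy as np
    from datasets import load_dataset

    # assumed per-language config name "en"; verify against the dataset card
    docs = load_dataset(
        "Cohere/wikipedia-2023-11-embed-multilingual-v3", "en",
        split="train", streaming=True,
    )

    query_emb = np.random.rand(1024)  # placeholder for a Cohere Embed V3 query embedding

    best_score, best_title = -1.0, None
    for doc in docs.take(10_000):  # scan a small slice for illustration
        score = float(np.dot(query_emb, doc["emb"]))
        if score > best_score:
            best_score, best_title = score, doc["title"]
    print(best_title, best_score)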

  3. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring; Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    Markus Döring; Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  4. Wikipedia Talk Corpus

    • figshare.com
    application/x-gzip
    Updated Jan 23, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.4264973.v3
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jan 23, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We provide a corpus of discussion comments from English Wikipedia talk pages. Comments are grouped into different files by year. Comments are generated by computing diffs over the full revision history and extracting the content added for each revision. See our wiki for documentation of the schema and our research paper for documentation on the data collection and processing methodology.
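
    A rough sketch of reading one of the per-year files with pandas; the tab-separated layout, gzip compression, and file name are all assumptions, so check the project wiki mentioned above for the actual schema:

    import pandas as pd

    # hypothetical file name for one year of comments
    comments = pd.read_csv("comments_2015.tsv.gz", sep="\t", compression="gzip")
    print(comments.columns.tolist())
    print(len(comments), "comments loaded")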

  5. French Wikipedia Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Feb 18, 2021
    Cite
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot (2021). French Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/french-wikipedia
    Explore at:
    Dataset updated
    Feb 18, 2021
    Authors
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot
    Area covered
    French
    Description

    French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.

  6. wikipedia-small-3000-embedded

    • huggingface.co
    Updated Apr 6, 2024
    Cite
    Hafedh Hichri (2024). wikipedia-small-3000-embedded [Dataset]. https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 6, 2024
    Authors
    Hafedh Hichri
    License

    https://choosealicense.com/licenses/gfdl/

    Description

    This is a subset of the wikimedia/wikipedia dataset. Code for creating this dataset:

    from datasets import load_dataset, Dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

    # load dataset in streaming mode (no download and it's fast)
    dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

    # select 3000 samples
    from tqdm import tqdm
    data = Dataset.from_dict({})
    for i, entry in…

    See the full description on the dataset page: https://huggingface.co/datasets/not-lain/wikipedia-small-3000-embedded.

  7. Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

    • figshare.com
    txt
    Updated Mar 14, 2021
    Cite
    KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    KayYen Wong; Diego Saez-Trumper; Miriam Redi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, with the formats:

    - {template_name}_features.csv
    - {template_name}_difftxt.csv.gz
    - {template_name}_fulltxt.csv.gz

    For more details on the project, dataset schema, and links to data usage and benchmarking, see https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
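
    A hedged sketch of loading one template's files with pandas; "pov" is a hypothetical template name standing in for {template_name}, and no column names are assumed:

    import pandas as pd

    template = "pov"  # hypothetical template name; substitute a real one
    features = pd.read_csv(f"{template}_features.csv")
    difftxt = pd.read_csv(f"{template}_difftxt.csv.gz", compression="gzip")

    print(features.shape, features.columns.tolist()[:5])
    print(difftxt.shape)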

  8. A Wikipedia dataset of 5 categories

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Julien Maitre; Julien Maitre (2020). A Wikipedia dataset of 5 categories [Dataset]. http://doi.org/10.5281/zenodo.3260046
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julien Maitre; Julien Maitre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A subset of articles extracted from the French Wikipedia XML dump. The data published here covers 5 different categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text (a rough sketch of reading these files follows the list below). The characteristics of the dataset are:

    • Economy : 44'876 articles
    • History : 92'041 articles
    • Informatics : 25'408 articles
    • Health : 22'143 articles
    • Law : 9'964 articles
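
    A small sketch of iterating over the extracted plain-text articles; the one-directory-per-category layout and the extraction path are assumptions, since the description only states that each article is stored as a UTF-8 plain-text file:

    from pathlib import Path

    root = Path("wikipedia_5_categories")  # hypothetical extraction directory
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = list(category_dir.glob("*"))
        print(category_dir.name, len(files), "articles")
        if files:
            preview = files[0].read_text(encoding="utf-8")[:80].replace("\n", " ")
            print("  first article preview:", preview)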

  9. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Oct 4, 2021
    Cite
    (2021). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Oct 4, 2021
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. Long document similarity dataset, Wikipedia excerptions for movies...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 20, 2023
    Cite
    anonymous (2023). Long document similarity dataset, Wikipedia excerptions for movies collections [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7019172
    Explore at:
    Dataset updated
    Jan 20, 2023
    Dataset authored and provided by
    anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Movie-related articles extracted from Wikipedia.

    For all articles, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

    Movies

    The Wikipedia Movies dataset consists of 100,371 articles describing various movies. Each article may consist of text passages describing the plot, cast, production, reception, soundtrack, and more.

  11. Tamil (Tamizh) Wikipedia Text Dataset for NLP

    • kaggle.com
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. http://doi.org/10.34740/kaggle/dsv/9884525
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
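
    A hedged sketch of the kind of .bz2 processing and word counting mentioned in the list above; the file name is hypothetical and the whitespace tokenization is deliberately naive (real Tamil text needs a proper tokenizer):

    import bz2
    from collections import Counter

    counts = Counter()
    # hypothetical single .bz2 file of raw UTF-8 article text
    with bz2.open("tawiki_text.bz2", mode="rt", encoding="utf-8") as fh:
        for line in fh:
            counts.update(line.split())

    for word, freq in counts.most_common(10):
        print(word, freq)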

    Why This Dataset?

    Despite having a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.
    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.
    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.
    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  12. wikipedia-persons-masked

    • huggingface.co
    Updated May 23, 2009
    Cite
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL (2009). wikipedia-persons-masked [Dataset]. https://huggingface.co/datasets/rcds/wikipedia-persons-masked
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2009
    Dataset authored and provided by
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    wikipedia persons masked: a filtered version of the Wikipedia dataset containing only pages about people

      Dataset Summary
    

    Contains ~70k pages from wikipedia, each describing a person. For each page, the person described in the text is masked with a

      Supported Tasks and Leaderboards
    

    The dataset supports the tasks of fill-mask, but can also be used for other tasks such as question answering, e.g. "Who is

      Languages
    

    English only

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
    
  13. Wiki-CS Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 8, 2020
    Cite
    Péter Mernyei; Cătălina Cangea (2020). Wiki-CS Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-cs
    Explore at:
    Dataset updated
    Jul 8, 2020
    Authors
    Péter Mernyei; Cătălina Cangea
    Description

    Wiki-CS is a Wikipedia-based dataset for benchmarking Graph Neural Networks. The dataset is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings (Pennington et al., 2014), resulting in 300-dimensional node features.

    The dataset has 11,701 nodes and 216,123 edges.
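
    For a quick look at the graph, PyTorch Geometric ships a WikiCS loader; this sketch assumes that loader matches this dataset release:

    from torch_geometric.datasets import WikiCS

    dataset = WikiCS(root="data/WikiCS")  # downloads the published graph and splits
    data = dataset[0]
    # the paper reports 11,701 nodes and 216,123 edges; PyG stores directed edge pairs
    print(data.num_nodes, data.num_edges)
    print(data.x.shape)  # 300-dimensional GloVe-based node features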

  14. Wikidata PageRank

    • danker.s3.amazonaws.com
    Updated Jun 14, 2025
    Cite
    Andreas Thalhammer (2025). Wikidata PageRank [Dataset]. https://danker.s3.amazonaws.com/index.html
    Explore at:
    Available download formats: tsv, application/n-triples, application/vnd.hdt, ttl
    Dataset updated
    Jun 14, 2025
    Authors
    Andreas Thalhammer
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Regularly published dataset of PageRank scores for Wikidata entities. The underlying link graph is formed by a union of all links across all Wikipedia language editions. Computation is performed by Andreas Thalhammer with 'danker', available at https://github.com/athalhammer/danker. If you find the downloads here useful, please feel free to leave a GitHub ⭐ at the repository and buy me a ☕: https://www.buymeacoffee.com/thalhamm
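
    A sketch of reading the TSV download and listing the highest-ranked entities; the file name and the two-column (entity, score) tab-separated layout are assumptions to verify against the published files:

    import csv

    scores = []
    with open("allwiki.links.rank", newline="", encoding="utf-8") as fh:  # hypothetical file name
        for qid, score in csv.reader(fh, delimiter="\t"):
            scores.append((qid, float(score)))

    scores.sort(key=lambda pair: pair[1], reverse=True)
    print(scores[:5])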

  15. Wiki-talk Datasets

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jun Sun; Jérôme Kunegis; Jun Sun; Jérôme Kunegis (2020). Wiki-talk Datasets [Dataset]. http://doi.org/10.5281/zenodo.49561
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jun Sun; Jérôme Kunegis; Jun Sun; Jérôme Kunegis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.

    More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
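
    A sketch of loading one language's network as a directed multigraph with networkx; the whitespace-separated "from_user to_user timestamp" line format and the file name are assumptions about the edge-list files:

    import gzip
    import networkx as nx

    G = nx.MultiDiGraph()
    with gzip.open("wiki-talk-en.txt.gz", mode="rt") as fh:  # hypothetical file name
        for line in fh:
            u, v, ts = line.split()
            G.add_edge(u, v, timestamp=int(ts))

    print(G.number_of_nodes(), "users,", G.number_of_edges(), "messages")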

  16. Pairwise Multi-Class Document Classification for Semantic Relations between...

    • live.european-language-grid.eu
    Updated Mar 22, 2024
    + more versions
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Models & Code) [Dataset]. https://live.european-language-grid.eu/catalogue/ld/18333
    Explore at:
    Dataset updated
    Mar 22, 2024
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    Datasets (articles, class labels, cross-validation splits)

    Pretrained models (Transformers, GloVe, Doc2vec)

    Model output (prediction) for the best performing models.

    This package consists of the Models and Code part: Pretrained models

    PyTorch: vanilla and Siamese BERT + XLNet

    Pretrained model for each fold is available in the corresponding model archives:

    # Vanilla

    model_wiki.bert_base_joint_seq512.tar.gz

    model_wiki.xlnet_base_joint_seq512.tar.gz

    # Siamese

    model_wiki.bert_base_siamese_seq512_4d.tar.gz

    model_wiki.xlnet_base_siamese_seq512_4d.tar.gz
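
    A hedged sketch of the vanilla (joint) setup, encoding an article pair as a single sequence with a generic Hugging Face BERT classification head; it is illustrative only, does not load the archived fine-tuned weights, and the number of relation classes (4) is a placeholder:

    import torch
    from transformers import BertForSequenceClassification, BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

    # joint encoding: both article texts in one sequence, truncated to 512 tokens
    inputs = tokenizer(
        "Text of Wikipedia article A ...",
        "Text of Wikipedia article B ...",
        truncation=True, max_length=512, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    print(logits.softmax(dim=-1))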

  17. wikidata-20220103-all.json.gz

    • academictorrents.com
    bittorrent
    Updated Jan 24, 2022
    Cite
    wikidata.org (2022). wikidata-20220103-all.json.gz [Dataset]. https://academictorrents.com/details/229cfeb2331ad43d4706efd435f6d78f40a3c438
    Explore at:
    Available download formats: bittorrent (109042925619)
    Dataset updated
    Jan 24, 2022
    Dataset provided by
    Wikidata (https://www.wikidata.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    A BitTorrent file to download data with the title 'wikidata-20220103-all.json.gz'
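
    A sketch of streaming entities out of the downloaded dump; it assumes the standard Wikidata JSON dump layout (a single JSON array with one entity object per line), which should be double-checked for this snapshot:

    import gzip
    import json

    with gzip.open("wikidata-20220103-all.json.gz", mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            print(entity["id"], entity.get("type"))
            break  # remove to stream the full dump (roughly 100 GB compressed)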

  18. Long document similarity datasets, Wikipedia excerptions for movies, video...

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +1more
    Updated Jan 27, 2021
    + more versions
    Cite
    anonymous (2021). Long document similarity datasets, Wikipedia excerptions for movies, video games and wine collections [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4468782
    Explore at:
    Dataset updated
    Jan 27, 2021
    Dataset authored and provided by
    anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Three corpora in different domains extracted from Wikipedia.

    For all datasets, the figures and tables have been filtered out, as well as the categories and "see also" sections.

    The article structure, particularly the subtitles and paragraphs, is kept in these datasets.

    Wines

    The Wikipedia wines dataset consists of 1,635 articles from the wine domain. The extracted dataset consists of a non-trivial mixture of articles, including different wine categories, brands, wineries, grape types, and more. The ground-truth recommendations were crafted by a human sommelier, who annotated 92 source articles with ~10 ground-truth recommendations each. Examples of ground-truth expert-based recommendations are

    Dom Pérignon - Moët & Chandon

    Pinot Meunier - Chardonnay

    Movies

    The Wikipedia movies dataset consists of 100,385 articles describing different movies. The movie articles may consist of text passages describing the plot, cast, production, reception, soundtrack, and more. For this dataset, we have extracted a test set of ground-truth annotations for 50 source articles using the "BestSimilar" database. Each source article is associated with a list of ~12 most similar movies. Examples of ground-truth expert-based recommendations are

    Schindler's List - The Pianist

    Lion King - The Jungle Book

    Video games

    The Wikipedia video games dataset consists of 21,935 articles reviewing video games from all genres and consoles. Each article may consist of a different combination of sections, including summary, gameplay, plot, production, etc. Examples of ground-truth expert-based recommendations are:

    Grand Theft Auto - Mafia

    Burnout Paradise - Forza Horizon 3

  19. ner-wikipedia-dataset

    • huggingface.co
    Cite
    Stockmark Inc., ner-wikipedia-dataset [Dataset]. https://huggingface.co/datasets/stockmark/ner-wikipedia-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Stockmark Inc.
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    A Japanese named entity recognition dataset built from Wikipedia.

    GitHub: https://github.com/stockmarkteam/ner-wikipedia-dataset/ LICENSE: CC-BY-SA 3.0

    Developed by Stockmark Inc.

  20. French Wikipedia - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). French Wikipedia - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/french-wikipedia
    Explore at:
    Dataset updated
    Dec 16, 2024
    Area covered
    French
    Description

    French Wikipedia corpus
