25 datasets found
  1. Wikipedia Corpus (2023-03-01)

    • kaggle.com
    zip
    Updated Jan 24, 2024
    Cite
    Marcell Emmer (2024). Wikipedia Corpus (2023-03-01) [Dataset]. https://www.kaggle.com/datasets/emmermarcell/wikipedia-corpus-2023-03-01
    Explore at:
Available download formats: zip (7253680490 bytes)
    Dataset updated
    Jan 24, 2024
    Authors
    Marcell Emmer
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

A corpus created from the Hugging Face Wikipedia dataset (https://huggingface.co/datasets/wikipedia). Preprocessing and corpus creation were done with the text_to_sentences method of Blingfire. The details can be found in the following notebook:

    https://www.kaggle.com/code/emmermarcell/create-a-wikipedia-corpus
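
    For reference, a minimal sketch of the sentence-splitting step described above; the dataset config name "20230301.en" and the output path are assumptions, and the author's notebook may differ in details:

    ```python
    # Hedged sketch: split Wikipedia articles into one sentence per line with Blingfire.
    from blingfire import text_to_sentences
    from datasets import load_dataset

    # Config name is an assumption matching the snapshot date in the dataset title.
    wiki = load_dataset("wikipedia", "20230301.en", split="train")

    with open("wikipedia_corpus.txt", "w", encoding="utf-8") as out:
        for article in wiki:
            # text_to_sentences returns the input text with one sentence per line.
            out.write(text_to_sentences(article["text"]) + "\n")
    ```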

  2. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. Plain text Wikipedia (SimpleEnglish)

    • kaggle.com
    zip
    Updated Apr 1, 2024
    Cite
    Ffatty (2024). Plain text Wikipedia (SimpleEnglish) [Dataset]. https://www.kaggle.com/datasets/ffatty/plain-text-wikipedia-simpleenglish
    Explore at:
Available download formats: zip (133738695 bytes)
    Dataset updated
    Apr 1, 2024
    Authors
    Ffatty
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Plain text Wikipedia (SimpleEnglish)

    Unsupervised text corpus of all 249,396 articles in the Simple English Wikipedia.

    • 31M tokens
    • 196,000 words
    • <400 MB uncompressed

    Extracted from Wikipedia dumps using an open-source tool.

    Data is remarkably clean and uniform.

    There is also a dataset available for the much more massive Full English Wikipedia, generated in the same manner as this: https://www.kaggle.com/datasets/ffatty/plaintext-wikipedia-full-english

    Format:

    • Each article's title appears before the content.
    • Articles are plain text; they are stripped of all Wiki formatting syntax, including font styles, citations, links, etc.
    • Articles are concatenated into txt files of ≤ 1MB each.
    • Sometimes, related articles are found next to each other (see excerpt below). This is probably because they were created or edited around the same time.

    Random example excerpt:

    (a portion of 1 file; 4 articles, concatenated in place)

    Nicotine
     
    Nicotine is a drug in tobacco cigarettes, cigars, pipe tobacco, chewing tobacco, vaping liquids and some e-cigarettes. Nicotine is an addictive stimulant that causes the heart to beat faster and makes blood pressure rise.
     
    Addiction
     
    Addiction is when the body or mind badly wants or needs something in order to work right. When you have an addiction to something it is called being "addicted" or being an "addict". People can be addicted to drugs, cigarettes, alcohol, caffeine, and many other things. 
    
    Bishop
    
    Bishop is a type of clergy in some Christian churches. The bishop is the leader of the Christians and the Christian priests in each diocese. The diocese which a bishop governs is called a bishopric. A bishop may be given the rank of archbishop in an archdiocese. 
    
    Christian priests in some denominations must be made priests by bishops. Some Christian movements have neither bishops nor priests: Quakers are one example.
    
    In the Catholic church, the Pope is chosen by all the cardinals.
    
    Tray
    
    A tray is a shallow container designed for carrying things.
    
    Trays are flat, but with raised edges to stop things from sliding off of them. 
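
    As an illustration of the format described above, a minimal sketch for iterating over the concatenated plain-text files; the directory name simple_wiki/ and the *.txt naming are assumptions about how the archive unpacks:

    ```python
    # Iterate over the <=1 MB concatenated text files and count words.
    # Assumption: the archive was extracted into ./simple_wiki/ as UTF-8 .txt files.
    import glob

    total_words = 0
    for path in sorted(glob.glob("simple_wiki/*.txt")):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        total_words += len(text.split())

    print(f"total whitespace-separated words: {total_words}")
    ```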
    
    


  4. Wikipedia Plaintext (2023-07-01)

    • kaggle.com
    Updated Jul 17, 2023
    Cite
    JJ (2023). Wikipedia Plaintext (2023-07-01) [Dataset]. https://www.kaggle.com/datasets/jjinho/wikipedia-20230701
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JJ
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    While other great datasets containing Wikipedia exist, the most recent of them dates from 2020, so this is an updated version containing the titles, text, and categories of 6,286,775 articles from the July 1, 2023 Wikipedia dump.

    Articles are sorted in alphanumeric order and separated into parquet files corresponding to the first character of the article title. The data is partitioned into parquet files named a-z, number (titles that began with numbers), and other (titles that began with symbols).
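
    As a quick-start, a hedged sketch for loading one or more of the parquet partitions with pandas; the file names (a.parquet … z.parquet, number.parquet, other.parquet) and the column layout are assumptions based on the description above:

    ```python
    import glob
    import pandas as pd

    # Load a single partition (articles whose titles start with "a").
    df_a = pd.read_parquet("wikipedia-20230701/a.parquet")
    print(df_a.columns.tolist(), len(df_a))

    # Or concatenate every partition into one DataFrame (memory permitting).
    parts = [pd.read_parquet(p) for p in sorted(glob.glob("wikipedia-20230701/*.parquet"))]
    wiki = pd.concat(parts, ignore_index=True)
    ```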

    The best place to see it in action is: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam

    If you find this dataset helpful, please upvote!

  5. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
    Cite
    Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2023
    Authors
    Rahul Aralikatte
    Description

    simple-wikipedia

    Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.

  6. Raw Wikipedia updates 2024

    • kaggle.com
    zip
    Updated May 24, 2024
    Cite
    Sheema Zain (2024). Raw Wikipedia updates 2024 [Dataset]. https://www.kaggle.com/datasets/sheemazain/raw-wikipedia-updates-2024
    Explore at:
Available download formats: zip (8575597 bytes)
    Dataset updated
    May 24, 2024
    Authors
    Sheema Zain
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description
    1. Visit the Wikimedia Dumps page:

      • Go to the Wikimedia Dumps page. Here you can find the latest dumps for various Wikimedia projects, including Wikipedia.
    2. Select the desired Wikipedia version:

      • Choose the language version of Wikipedia you are interested in. For example, for English Wikipedia, select the enwiki directory.
    3. Download the latest dump:

      • Inside the directory for your selected language, you will find several types of dumps. The most commonly used dumps for raw Wikipedia content are:
        • enwiki-latest-pages-articles.xml.bz2: Contains the current versions of article content.
        • enwiki-latest-pages-meta-current.xml.bz2: Contains current versions of article content, including page metadata.
      • Click on the file you want to download.
    4. Handling large files:

      • These files are typically very large (several gigabytes). Ensure you have sufficient storage and bandwidth to download them.
      • You may need tools like bzip2 to decompress .bz2 files.
    5. Parsing the dump:

      • Once you have the dump file, you will need to parse it. The dumps are in XML format, which can be processed using various programming languages and tools. Python, for instance, has libraries like xml.etree.ElementTree for XML parsing.
      • Alternatively, you can use specialized tools like WikiExtractor, which is a Python script designed to extract and clean text from Wikipedia XML dumps.

      Example of Download and Parsing

    Here's an example of how you might use Python to download and parse a Wikipedia dump:

    1. Download using Python:
    import requests
    
    # Stream the multi-gigabyte dump to disk in small chunks instead of
    # loading it into memory all at once.
    url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'
    response = requests.get(url, stream=True)
    
    with open('enwiki-latest-pages-articles.xml.bz2', 'wb') as file:
      for chunk in response.iter_content(chunk_size=1024):
        if chunk:
          file.write(chunk)
    
    
    2. Decompress and parse using WikiExtractor:
    
    ```bash
    # First, ensure you have WikiExtractor installed
    pip install wikiextractor
    
    # Run WikiExtractor to process the dump
    wikiextractor enwiki-latest-pages-articles.xml.bz2
    ```

    Important Notes

    • Ensure you have appropriate storage and processing power to handle large datasets.
    • Parsing and processing Wikipedia dumps can be resource-intensive, so plan accordingly.
    • Always check the licensing and usage terms for Wikipedia content to ensure compliance.
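
    If you would rather parse the XML dump directly (step 5 above) than use WikiExtractor, here is a minimal, hedged sketch using bz2 and xml.etree.ElementTree from the standard library; the MediaWiki export namespace string is an assumption and may differ between dump versions:

    ```python
    import bz2
    import xml.etree.ElementTree as ET

    # Namespace is an assumption; check the <mediawiki> root element of your dump.
    NS = "{http://www.mediawiki.org/xml/export-0.10/}"

    with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                print(title, len(text))
                elem.clear()  # release memory for pages already processed
    ```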

  7. Plaintext Wikipedia dump 2018

    • live.european-language-grid.eu
    binary format
    Updated Feb 24, 2018
    Cite
    (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
    Explore at:
Available download formats: binary format
    Dataset updated
    Feb 24, 2018
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).

    For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script that can be used to get a new version of the data is included. Note, however, that Wikipedia limits the download speed for bulk dump downloads, so fetching all of them takes a few days (one or a few can be downloaded quickly).

    Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.

    The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plain-text output [https://github.com/ptakopysk/wikiextractor].

  8. English Wikipedia People Dataset

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Wikimedia (2025). English Wikipedia People Dataset [Dataset]. https://www.kaggle.com/datasets/wikimedia-foundation/english-wikipedia-people-dataset
    Explore at:
Available download formats: zip (4293465577 bytes)
    Dataset updated
    Jul 31, 2025
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Wikimedia
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    This dataset contains biographical information derived from articles on English Wikipedia as it stood in early June 2024. It was created as part of the Structured Contents initiative at Wikimedia Enterprise and is intended for evaluation and research use.

    The beta sample dataset is a subset of the Structured Contents Snapshot focusing on people with infoboxes in English Wikipedia, output as JSON files (compressed in a tar.gz archive).

    We warmly welcome any feedback you have. Please share your thoughts, suggestions, and any issues you encounter on the discussion page for this dataset here on Kaggle.

    Data Structure

    • File name: wme_people_infobox.tar.gz
    • Size of compressed file: 4.12 GB
    • Size of uncompressed file: 21.28 GB

    Noteworthy included fields:

    • name: title of the article.
    • identifier: ID of the article.
    • image: main image representing the article's subject.
    • description: one-sentence description of the article for quick reference.
    • abstract: lead section, summarizing what the article is about.
    • infoboxes: parsed information from the side panel (infobox) of the Wikipedia article.
    • sections: parsed sections of the article, including links. Note: excludes other media/images, lists, tables and references or similar non-prose sections.

    The Wikimedia Enterprise Data Dictionary explains all of the fields in this dataset.
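
    As an illustration, a hedged sketch for peeking into the archive without fully extracting it; it assumes the archive contains JSON Lines files and that the field names match the list above, since the exact internal layout is not specified here:

    ```python
    import json
    import tarfile

    # Stream records from the compressed archive without extracting it to disk.
    with tarfile.open("wme_people_infobox.tar.gz", "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            fileobj = tar.extractfile(member)
            for line in fileobj:                 # assumes JSON Lines inside each file
                record = json.loads(line)
                # Field names taken from the description above.
                print(record.get("name"), record.get("description"))
                break  # just peek at the first record of each file
    ```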

    Stats

    Infoboxes only: 2 GB compressed, 11 GB uncompressed.

    Infoboxes + sections + short description: 4.12 GB compressed, 21.28 GB uncompressed.

    Article analysis and filtering breakdown:

    • Total articles analyzed: 6,940,949
    • People found with QID: 1,778,226
    • People found with Category: 158,996
    • People found with Biography Project: 76,150
    • Total people articles found: 2,013,372
    • Total people articles with infoboxes: 1,559,985

    Final counts:

    • Total people articles in this dataset: 1,559,985
    • ...that have a short description: 1,416,701
    • ...that have an infobox: 1,559,985
    • ...that have article sections: 1,559,921

    This dataset includes 235,146 people articles that exist on Wikipedia but aren't yet tagged on Wikidata as instance of:human.

    Maintenance and Support

    This dataset was originally extracted from the Wikimedia Enterprise APIs on June 5, 2024, so the information in it may be out of date. The dataset isn't being actively updated or maintained, and has been shared for community use and feedback. If you'd like to retrieve up-to-date Wikipedia articles or data from other Wikiprojects, get started with Wikimedia Enterprise's APIs.

    Initial Data Collection and Normalization

    The dataset is built from the Wikimedia Enterprise HTML “snapshots”: https://enterprise.wikimedia.com/docs/snapshot/ and focuses on the Wikipedia article namespace (namespace 0 (main)).

    Who are the source language producers?

    Wikipedia is a human-generated corpus of free knowledge, written, edited, and curated by a global community of editors since 2001. It is the largest and most accessed educational resource in history, accessed over 20 billion times by half a billion people each month. Wikipedia represents almost 25 years of work by its community: the creation, curation, and maintenance of millions of articles on distinct topics. This dataset includes the biographical contents of the English Wikipedia language edition (https://en.wikipedia.org/), written by the community.

    Attribution

    Terms and conditions

    Wikimedia Enterprise provides this dataset under the assumption that downstream users will adhere to the relevant free culture licenses when the data is reused. In situations where attribution is required, reusers should identify the Wikimedia project from which the content was retrieved as the source of the content. Any attribution should adhere to Wikimedia’s trademark policy (available at https://foundation.wikimedia.org/wiki/Trademark_policy) and visual identity guidelines (ava...

  9. External References of English Wikipedia (ref-wiki-en)

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    txt
    Updated Mar 27, 2024
    Cite
    (2024). External References of English Wikipedia (ref-wiki-en) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7625
    Explore at:
Available download formats: txt
    Dataset updated
    Mar 27, 2024
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:

    • 32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.
    • These URLs were filtered to remove file extensions for unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs.
    • The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then chooses randomly amongst the URLs for each host.
    • The content of these webpages was indexed in Apache Solr by Nutch. From Solr we extracted a JSON dump of the content.
    • Many URLs redirect; unfortunately, Nutch does not index redirect information, which complicated connecting the Wikipedia article (with the pre-redirect link) to the downloaded webpage (at the post-redirect link). However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) from their original Wikipedia article(s).
    • We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.

    The ref-wiki-en corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We therefore also collected a smaller, complete corpus of the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1,000 items. We then scraped the external reference URLs for the Wikipedia articles corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.

    Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:

    • tstamp: when the webpage was accessed
    • host: the domain (FQDN, post-redirect) from which the webpage was retrieved
    • title: the title (meta) of the document
    • url: the URL (post-redirect) of the webpage
    • Q: the Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage
    • content: a plain-text encoding of the content of the webpage

    Below we provide an abbreviated example of a line from the corpus:

    {"tstamp":"2019-09-26T01:22:43.621Z","host":"geology.isu.edu","title":"Digital Geology of Idaho - Basin And Range","url":"http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm","Q":[810178],"content":"Digital Geology of Idaho - Basin And Range 1 - Idaho Basement Rock 2 - Belt Supergroup 3 - Rifting & Passive Margin 4 - Accreted Terranes 5 - Thrust Belt 6 - Idaho Batholith 7 - North Idaho & Mining 8 - Challis Volcanics 9 - Basin and Range 10 - Columbia River Basalts 11 - SRP & Yellowstone 12 - Pleistocene Glaciation 13 - Palouse & Lake Missoula 14 - Lake Bonneville Flood 15 - Snake River Plain Aquifer Basin and Range Province - Teritiary Extension General geology of the Basin and Range Province Mechanisms of Basin and Range faulting Idaho Basin and Range south of the Snake River Plain Idaho Basin and Range north of the Snake River Plain Local areas of active and recent Basin & Range faulting: Borah Peak PDF Slideshows: North of SRP , South of SRP , Borah Earthquake Flythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River Vocabulary Words thrust fault Basin and Range Snake River Plain half-graben transfer zone Fly-throughs General geology of the Basin and Range Province The Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ..."}

    A summary of the files we make available:

    • ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)
    • ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)
    • ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)
    • ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)
    • ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
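
    For reference, a minimal sketch for reading one of the JSON-lines files; it assumes one JSON object per line with the fields listed above:

    ```python
    import gzip
    import json

    with gzip.open("ref-wiki-en.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            # Fields documented above: tstamp, host, title, url, Q, content.
            print(page["host"], page["title"], page.get("Q"))
            break  # inspect only the first webpage
    ```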

    Further details can be found in the publication:

    Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.

    Further material relating to this publication (including code for a proof-of-concept interface) is also available.

  10. Kensho Derived Wikimedia Dataset

    • kaggle.com
    zip
    Updated Jan 24, 2020
    Cite
    Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
    Explore at:
Available download formats: zip (8760044227 bytes)
    Dataset updated
    Jan 24, 2020
    Authors
    Kensho R&D
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Kensho Derived Wikimedia Dataset

    Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

    These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.


    This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.

    Example Notebooks

    Dive right in by checking out some of our example notebooks:

    Updates / Changelog

    • initial release 2020-01-31

    File Summary

    • Wikipedia
      • page.csv (page metadata and Wikipedia-to-Wikidata mapping)
      • link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)
    • Wikidata
      • item.csv (item labels and descriptions in English)
      • item_aliases.csv (item aliases in English)
      • property.csv (property labels and descriptions in English)
      • property_aliases.csv (property aliases in English)
      • statements.csv (truthy qpq statements)

    Three Layers of Data

    The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.

    [Figure: the three connected layers of KDWD data: plain-text corpus, link annotations, and Wikidata items]

    Wikipedia Sample

    The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

  11. 📖 Wikipedia Articles in PlainText

    • kaggle.com
    zip
    Updated Dec 16, 2023
    Cite
    BwandoWando (2023). 📖 Wikipedia Articles in PlainText [Dataset]. https://www.kaggle.com/datasets/bwandowando/wikipedia-index-and-plaintext-20230801/data
    Explore at:
Available download formats: zip (6778044866 bytes)
    Dataset updated
    Dec 16, 2023
    Authors
    BwandoWando
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    [Update]

    [Context]

    I was inspired by Radek Osmulski's additional Kaggle LLM Science Exam datasets for the Kaggle - LLM Science Exam competition.

    I am trying to replicate his dataset creation workflow.

    His workflow consists of getting Science and Tech Wikipedia articles and submitting them to ChatGPT3.5 for the creation of additional training data, which is discussed in this Youtube Video.

    https://www.youtube.com/watch?v=w4Js5My2KXw

    [Challenges]

    The challenges that I encountered (and that he also mentioned) revolve around the use of Wikipedia's random-article API.

    Some of them, but not limited to:

    1. You can't control what article you will get, and the competition is about Science and Technology questions
    2. When you get an article that has too little text, ChatGPT will "hallucinate" and will do its best to create a question from the little text it has received

    These issues have been discussed here. A very important first step is to download the latest complete dump of Wikipedia from the Wikimedia website, which can be found at https://dumps.wikimedia.org/enwiki/yyyymm01/

    [Workflow]

    [Figure: dataset creation workflow]

    • I used WikiExtractor to extract the articles from the 20 GB compressed dump as 512 MB JSON files. Afterwards, I converted them to compressed CSV format (zip); a minimal conversion sketch is shown below.
    • The same library has a lot of issues running under the Windows environment, mostly around encoding and forking/multiprocessing. Running under Ubuntu allowed me to finish the whole task.
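
    A hedged sketch of the JSON-to-CSV conversion step; it assumes WikiExtractor was run with --json (which writes one JSON object per line with id, url, title and text fields), and the paths and output naming are illustrative:

    ```python
    import glob
    import json
    import pandas as pd

    # Convert WikiExtractor --json output into zip-compressed CSV files.
    for i, path in enumerate(sorted(glob.glob("extracted/**/wiki_*", recursive=True))):
        rows = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                rows.append(json.loads(line))  # one article per line
        pd.DataFrame(rows).to_csv(f"articles_{i:04d}.csv.zip", index=False, compression="zip")
    ```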

    [Files]

    There are 28 compressed files, a-z, numbers (0-9), and others (those that start with symbols).

    [Cover]

    Generated using https://hotpot.ai/

  12. wikipedia-22-12-en-embeddings

    • huggingface.co
    Updated Oct 16, 2006
    + more versions
    Cite
    Cohere (2006). wikipedia-22-12-en-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2006
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Wikipedia (en) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (en) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

    Embeddings

    We compute the embeddings for title + " " + text using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to learn more about this… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-en-embeddings.
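
    To peek at the dataset without downloading everything, a small sketch using the datasets library in streaming mode; the field names are not listed here, so the example simply inspects the first record:

    ```python
    from datasets import load_dataset

    docs = load_dataset(
        "Cohere/wikipedia-22-12-en-embeddings", split="train", streaming=True
    )
    first = next(iter(docs))
    print(list(first.keys()))  # inspect which fields (text, embedding, ...) are present
    ```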

  13. rag-mini-wikipedia

    • huggingface.co
    Updated May 5, 2025
    Cite
    RAG Datasets (2025). rag-mini-wikipedia [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 5, 2025
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from https://www.kaggle.com/datasets/rtatman/questionanswer-dataset?resource=download; we generated our own subset using generate.py.

  14. Microsoft Research WikiQA Corpus

    • kaggle.com
    zip
    Updated Jan 24, 2021
    Cite
    Saurabh Shahane (2021). Microsoft Research WikiQA Corpus [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/wikiqa-corpus
    Explore at:
Available download formats: zip (7080215 bytes)
    Dataset updated
    Jan 24, 2021
    Authors
    Saurabh Shahane
    Description

    Context

    The WikiQA corpus is a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering. In order to reflect the true information need of general users, dataset authors used Bing query logs as the question source. Each question is linked to a Wikipedia page that potentially has the answer. Because the summary section of a Wikipedia page provides the basic and usually most important information about the topic, authors used sentences in this section as the candidate answers. Source - https://msropendata.com/datasets/21032bb1-88bd-4656-9570-3172ae1757f0

    Content

    The dataset contains 3,047 questions and 29,258 sentences, of which 1,473 sentences were labeled as answer sentences to their corresponding questions.

    Acknowledgements

    Dataset Source - https://msropendata.com/datasets/21032bb1-88bd-4656-9570-3172ae1757f0

    @InProceedings{YangYihMeek:EMNLP2015:WikiQA,
      author    = {Yang, Yi and Yih, Wen-tau and Meek, Christopher},
      title     = {{WikiQA}: {A} Challenge Dataset for Open-Domain Question Answering},
      booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
      month     = {September},
      year      = {2015},
      address   = {Lisbon, Portugal},
      publisher = {Association for Computational Linguistics}
    }

    License - Open Use of Data Agreement v1.0

  15. PAISÀ Corpus of Italian Web Text

    • clarin.eurac.edu
    Updated Jun 5, 2013
    Cite
    Verena Lyding; Egon Stemle; Claudia Borghetti; Marco Brunello; Sara Castagnoli; Felice Dell’Orletta; Henrik Dittmann; Alessandro Lenci; Vito Pirrelli (2013). PAISÀ Corpus of Italian Web Text [Dataset]. https://clarin.eurac.edu/repository/xmlui/handle/20.500.12124/3
    Explore at:
    Dataset updated
    Jun 5, 2013
    Authors
    Verena Lyding; Egon Stemle; Claudia Borghetti; Marco Brunello; Sara Castagnoli; Felice Dell’Orletta; Henrik Dittmann; Alessandro Lenci; Vito Pirrelli
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It has been created in the context of the project PAISÀ.

    All documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, and CC-Attribution-Non-commercial. Pages that were wrongly tagged as CC-licensed were eliminated using a black-list that was populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.

    The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely: Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, Wikivoyage. The official Wikimedia Foundation dumps were used, extracting text with Wikipedia Extractor.

    Once all materials were downloaded, the collection was filtered discarding empty documents or documents containing less than 150 words.

    The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.

  16. wiki_qa

    • huggingface.co
    • opendatalab.com
    Updated Jun 3, 2024
    Cite
    Microsoft (2024). wiki_qa [Dataset]. https://huggingface.co/datasets/microsoft/wiki_qa
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    Dataset Card for "wiki_qa"

    Dataset Summary

    Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.

    Supported Tasks and Leaderboards

    More Information Needed

    Languages

    More Information Needed

    Dataset Structure

    Data Instances

    default

    Size of downloaded dataset files: 7.10 MB. Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
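
    For instance, the corpus can be loaded with the datasets library; the column names are not spelled out above, so the example simply prints the first training example:

    ```python
    from datasets import load_dataset

    wiki_qa = load_dataset("microsoft/wiki_qa")
    print(wiki_qa)              # shows the available splits and their sizes
    print(wiki_qa["train"][0])  # a question, a candidate sentence, and its label
    ```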

  17. Wikipedia Article Ratings

    • kaggle.com
    • huggingface.co
    zip
    Updated May 23, 2025
    Cite
    Christopher Akiki (2025). Wikipedia Article Ratings [Dataset]. https://www.kaggle.com/datasets/cakiki/wikipedia-article-ratings
    Explore at:
Available download formats: zip (458435086 bytes)
    Dataset updated
    May 23, 2025
    Authors
    Christopher Akiki
    License

    CC0 1.0 Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    1-year dump of English Wikipedia article ratings. The dataset includes 47,207,448 records corresponding to 11,801,862 unique ratings posted between July 22, 2011 and July 22, 2012.

    The Wikimedia Foundation has been experimenting with a feature to capture reader quality assessments of articles since September 2010. Article Feedback v4 (AFTv4) is a tool allowing readers to rate the quality of an article along 4 different dimensions. The tool has been deployed on the entire English Wikipedia (except for a small number of articles) since July 22, 2011. A new version of the tool, focused on feedback instead of ratings (AFTv5), has been tested in 2012 and deployed to a 10% random sample of articles from the English Wikipedia in July 2012.

    Since launching the tool in September 2010, we've continually analyzed the results; see the Research reports, including specific analyses of the call to action and rater expertise.

    As of AFTv5, all research reports are hosted on Meta.

    This 1-year dump of anonymized rating data was originally made available for download from the DataHub. Real-time rating data can also be accessed via the toolserver.

  18. large_spanish_corpus

    • huggingface.co
    Updated Apr 20, 2019
    Cite
    José Cañete (2019). large_spanish_corpus [Dataset]. https://huggingface.co/datasets/josecannete/large_spanish_corpus
    Explore at:
    Dataset updated
    Apr 20, 2019
    Authors
    José Cañete
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Large Spanish Corpus is a compilation of 15 unlabelled Spanish corpora spanning Wikipedia to European parliament notes. Each config contains the data corresponding to a different corpus. For example, "all_wiki" only includes examples from Spanish Wikipedia. By default, the config is set to "combined" which loads all the corpora; with this setting you can also specify the number of samples to return per corpus by configuring the "split" argument.
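
    As a usage sketch, assuming the config names "all_wiki" and "combined" from the description above and that the datasets library can load this dataset as-is:

    ```python
    from datasets import load_dataset

    # Load only the Spanish Wikipedia portion of the compilation.
    wiki_es = load_dataset("josecannete/large_spanish_corpus", "all_wiki", split="train")

    # Or load the default "combined" config, taking a slice of the train split.
    sample = load_dataset("josecannete/large_spanish_corpus", "combined", split="train[:1000]")
    print(sample[0])
    ```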

  19. hind_encorp

    • huggingface.co
    • opendatalab.com
    Updated Mar 22, 2014
    + more versions
    Cite
    Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
    Explore at:
    Dataset updated
    Mar 22, 2014
    Authors
    Pavel Rychlý
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by a journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an agriculture-domain parallel corpus. For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp's core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, only for three of them is the English translation available; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks, held in various languages (primarily English), are equipped with transcripts, and these are translated into 102 languages. There are 179 talks for which a Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, ranging from typesetting and punctuation over capitalization and spelling to word choice and sentence structure. A little bit of control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.

    Launchpad.net is a software collaboration platform that hosts many open-source projects and facilitates also collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  20. SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments

    • berd-platform.de
    csv
    Updated Jul 25, 2025
    Cite
    Kai-Robin Lange; Kai-Robin Lange; Carsten Jentsch; Carsten Jentsch (2025). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments [Dataset]. http://doi.org/10.82939/g3225-rba63
    Explore at:
Available download formats: csv
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    BERD@NFDI
    Authors
    Kai-Robin Lange; Kai-Robin Lange; Carsten Jentsch; Carsten Jentsch
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Germany
    Description

    A dataset of German parliamentary debates covering 74 years of plenary protocols across all 16 state parliaments of Germany as well as the German Bundestag. The debates are separated into individual speeches, which are enriched with metadata identifying the speaker as a member of parliament (MP).

    When using this data set, please cite the original paper "Lange, K.-R., Jentsch, C. (2023). SpeakGer: A meta-data enriched speech corpus of German state and federal parliaments. Proceedings of the 3rd Workshop on Computational Linguistics for Political Text Analysis@KONVENS 2023.".

    The metadata is separated into two types: time-specific metadata that holds information only for a legislative period and can change over time (e.g. the party or constituency of an MP), and metadata that is considered fixed, such as the birth date or the name of a speaker. The former is stored along with the speeches, as it is considered temporal information of that point in time, but is additionally stored in the file all_mps_mapping.csv if there is a need to double-check something. The rest of the metadata is stored in the file all_mps_meta.csv. The metadata from this file can be matched with a speech by comparing the speaker ID variable "MPID". The speeches of each parliament are saved in CSV format. Along with the speeches, they contain the following metadata:

    • Period: int. The period in which the speech took place
    • Session: int. The session in which the speech took place
    • Chair: boolean. The information if the speaker was the chair of the plenary session
    • Interjection: boolean. The information if the speech is a comment or an interjection from the crowd
    • Party: list (e.g. ["cdu"] or ["cdu", "fdp"] when having more than one speaker during an interjection). List of the party of the speaker or the parties whom the comment/interjection references
    • Constituency: string. The constituency of the speaker in the current legislative period
    • MPID: int. The ID of the speaker, which can be used to get more meta-data from the file all_mps_meta.csv

    The file all_mps_meta.csv contains the following meta information:

    • MPID: int. The ID of the speaker, which can be used to match the MP with his/her speeches.
    • WikipediaLink: the link to the MP's Wikipedia page
    • WikiDataLink: the link to the MP's WikiData page
    • Name: string. The full name of the MP.
    • Last Name: string. The last name of the MP, found on WikiData. If no last name is given on WikiData, the full name was heuristically cut at the last space to get the information necessary for splitting the speeches.
    • Born: string, format YYYY-MM-DD. Birth date of the MP. If an exact birth date is found on WikiData, this exact date is used. Otherwise, a day in the year of birth given on Wikipedia is used.
    • SexOrGender: string. Information on the sex or gender of the MP. Disclaimer: this information was taken from WikiData, which does not seem to differentiate between sex and gender.
    • Occupation: list. Occupation(s) of the MP.
    • Religion: string. Religious beliefs of the MP.
    • AbgeordnetenwatchID: int. ID of the MP on the website Abgeordnetenwatch
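
    A hedged sketch of the MPID join described above; the per-parliament speech file name is an assumption, while all_mps_meta.csv is named in the description:

    ```python
    import pandas as pd

    # Speech file name is illustrative; all_mps_meta.csv is from the description.
    speeches = pd.read_csv("bundestag_speeches.csv")
    mps_meta = pd.read_csv("all_mps_meta.csv")

    # Attach the fixed per-speaker metadata to every speech via the MPID key.
    enriched = speeches.merge(mps_meta, on="MPID", how="left")
    print(enriched[["MPID", "Name", "Party", "Period", "Session"]].head())
    ```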
