52 datasets found
  1. Plaintext Wikipedia dump 2018 - Dataset - B2FIND

    • b2find.eudat.eu
    Cite
    Plaintext Wikipedia dump 2018 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3074cb26-6a0d-5803-8520-d0050a22c66e
    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script that can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when many dumps are downloaded, so it takes a few days to download all of them (though one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

  2. Plaintext Wikipedia dump 2018

    • live.european-language-grid.eu
    binary format
    Updated Feb 24, 2018
    Cite
    (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1242
    Available download formats: binary
    Dataset updated
    Feb 24, 2018
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages).

    For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script that can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when many dumps are downloaded, so it takes a few days to download all of them (though one or a few can be downloaded quickly).

    Also, the format of the dumps changes from time to time, so the script will probably stop working eventually.

    The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
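
    A minimal sketch of how a single dump could be fetched and converted to plain text, assuming the modified wikiextractor fork above is checked out locally; the extractor flags are version-dependent and not confirmed by this description:

    import subprocess
    import urllib.request

    lang = "simple"  # ISO-style code of the Wikipedia to fetch
    dump_url = f"https://dumps.wikimedia.org/{lang}wiki/latest/{lang}wiki-latest-pages-articles.xml.bz2"
    dump_file = f"{lang}wiki-latest-pages-articles.xml.bz2"

    # Download one dump; fetching all 297 takes days because of rate limits.
    urllib.request.urlretrieve(dump_url, dump_file)

    # Run the (modified) WikiExtractor to produce plain text; "-o" selects the
    # output directory, other flags depend on the fork/version (assumption).
    subprocess.run(["python", "WikiExtractor.py", "-o", f"plaintext/{lang}", dump_file], check=True)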

  3. simple-wikipedia

    • huggingface.co
    Updated Aug 17, 2023
    Cite
    Rahul Aralikatte (2023). simple-wikipedia [Dataset]. https://huggingface.co/datasets/rahular/simple-wikipedia
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 17, 2023
    Authors
    Rahul Aralikatte
    Description

    simple-wikipedia

    Processed, text-only dump of the Simple Wikipedia (English). Contains 23,886,673 words.
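
    A minimal loading sketch with the Hugging Face datasets library; the repository id comes from the citation above, while the split and column name are assumptions to be checked against the dataset card:

    from datasets import load_dataset

    ds = load_dataset("rahular/simple-wikipedia", split="train")  # split name assumed
    print(ds[0]["text"][:200])                                    # column name assumed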

  4. A Wikipedia dataset of 5 categories

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Maitre, Julien (2020). A Wikipedia dataset of 5 categories [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3260045
    Dataset updated
    Jan 24, 2020
    Dataset authored and provided by
    Maitre, Julien
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A subset of articles extracted from the French Wikipedia XML dump. The data published here cover 5 categories: Economy (Economie), History (Histoire), Informatics (Informatique), Health (Medecine) and Law (Droit). The Wikipedia dump was downloaded on November 8, 2016 from https://dumps.wikimedia.org/. Each article is an XML file extracted from the dump and saved as UTF-8 plain text. The dataset has the following characteristics:

    Economy : 44'876 articles

    History : 92'041 articles

    Informatics : 25'408 articles

    Health : 22'143 articles

    Law : 9'964 articles

  5. wikipedia-22-12-simple-embeddings

    • huggingface.co
    • opendatalab.com
    Updated Mar 29, 2023
    + more versions
    Cite
    Cohere (2023). wikipedia-22-12-simple-embeddings [Dataset]. https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
    Available download formats: Croissant
    Dataset updated
    Mar 29, 2023
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wikipedia (simple English) embedded with cohere.ai multilingual-22-12 encoder

    We encoded Wikipedia (simple English) using the cohere.ai multilingual-22-12 embedding model. To get an overview how this dataset was created and pre-processed, have a look at Cohere/wikipedia-22-12.

      Embeddings
    

    We compute for title+" "+text the embeddings using our multilingual-22-12 embedding model, a state-of-the-art model that works for semantic search in 100 languages. If you want to… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings.
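
    A minimal retrieval sketch, assuming the dataset exposes "title", "text", and "emb" columns (as suggested by the description above) and that a Cohere API key is available; field names and SDK usage should be checked against the dataset page and current Cohere docs:

    import numpy as np
    import cohere
    from datasets import load_dataset

    docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train", streaming=True)
    docs = list(docs.take(1000))                  # small sample for the demo
    doc_emb = np.array([d["emb"] for d in docs])  # precomputed passage embeddings

    co = cohere.Client("YOUR_API_KEY")            # placeholder API key
    query_emb = np.array(co.embed(texts=["Who founded Wikipedia?"],
                                  model="multilingual-22-12").embeddings[0])

    best = int(np.argmax(doc_emb @ query_emb))    # dot-product search over the sample
    print(docs[best]["title"], docs[best]["text"][:200])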

  6. thai_wikipedia_clean_20230101

    • huggingface.co
    Updated Jan 1, 2023
    Cite
    PyThaiNLP (2023). thai_wikipedia_clean_20230101 [Dataset]. https://huggingface.co/datasets/pythainlp/thai_wikipedia_clean_20230101
    Available download formats: Croissant
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    PyThaiNLP
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "thai_wikipedia_clean_20230101"

    Thai Wikipedia database dumps converted to plain text for NLP work. This dataset was dumped on 1 January 2023 from Thai Wikipedia.

    GitHub: PyThaiNLP / ThaiWiki-clean Notebook for upload to HF: https://github.com/PyThaiNLP/ThaiWiki-clean/blob/main/thai_wikipedia_clean_20230101_hf.ipynb

  7. Wikipedia Plaintext (2023-07-01) cut

    • kaggle.com
    Updated Sep 3, 2023
    Cite
    Kɔuq Wang (2023). Wikipedia Plaintext (2023-07-01) cut [Dataset]. https://www.kaggle.com/datasets/gmhost/wikipedia-plaintext-2023-07-01-cut
    Available download formats: Croissant
    Dataset updated
    Sep 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kɔuq Wang
    Description

    Dataset

    This dataset was created by kwang

    Contents

  8. External References of English Wikipedia (ref-wiki-en)

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    txt
    Updated Mar 27, 2024
    Cite
    (2024). External References of English Wikipedia (ref-wiki-en) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7625
    Available download formats: txt
    Dataset updated
    Mar 27, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    External References of English Wikipedia (ref-wiki-en) is a corpus of the plain-text content of 2,475,461 external webpages linked from the reference section of articles in English Wikipedia. Specifically:

    • 32,329,989 external reference URLs were extracted from a 2018 HTML dump of English Wikipedia. Removing repeated and ill-formed URLs yielded 23,036,318 unique URLs.
    • These URLs were filtered to remove file extensions for unsupported formats (videos, audio, etc.), yielding 17,781,974 downloadable URLs.
    • The URLs were loaded into Apache Nutch and continuously downloaded from August 2019 to December 2019, resulting in 2,475,461 successfully downloaded URLs. Not all URLs could be accessed. The order in which URLs were accessed was determined by Nutch, which partitions URLs by host and then randomly chooses amongst the URLs for each host.
    • The content of these webpages was indexed in Apache Solr by Nutch. From Solr we extracted a JSON dump of the content.
    • Many URLs offer a redirect; unfortunately Nutch does not index redirect information. This means that connecting the Wikipedia article (with the pre-redirect link) to the downloaded webpage (at the post-redirect link) was complicated. However, by inspecting the order of download in the Nutch log files, we managed to recover links for 2,058,896 documents (83%) to their original Wikipedia article(s).
    • We further managed to associate 3,899,953 unique Wikidata items with at least one external reference webpage in the corpus.

    The ref-wiki-en corpus is incomplete, i.e., we did not attempt to download all reference URLs for English Wikipedia. We thus also collected a smaller, complete corpus for the external references of 5,000 Wikipedia articles (ref-wiki-en-5k). We sampled from 5 ranges of Wikidata items: Q1-10000, Q10001-100000, Q100001-1000000, Q1000001-10000000, and Q10000001-100000000. From each range we sampled 1000 items. We then scraped the external reference URLs for the Wikipedia articles corresponding to these items and downloaded them. The resulting corpus contains 37,983 webpages.

    Each line of the corpus (ref-wiki-en, ref-wiki-en-5k) encodes the webpage of an external reference in JSON format. Specifically, we provide:

    • tstamp: When the webpage was accessed
    • host: The domain (FQDN post-redirect) from which the webpage was retrieved
    • title: The title (meta) of the document
    • url: The URL (post-redirect) of the webpage
    • Q: The Q-code identifiers of the Wikidata items whose corresponding Wikipedia article is confirmed to link to this webpage
    • content: A plain-text encoding of the content of the webpage

    Below we provide an abbreviated example of a line from the corpus:

    {"tstamp": "2019-09-26T01:22:43.621Z",
     "host": "geology.isu.edu",
     "title": "Digital Geology of Idaho - Basin And Range",
     "url": "http://geology.isu.edu/Digital_Geology_Idaho/Module9/mod9.htm",
     "Q": [810178],
     "content": "Digital Geology of Idaho - Basin And Range 1 - Idaho Basement Rock 2 - Belt Supergroup 3 - Rifting & Passive Margin 4 - Accreted Terranes 5 - Thrust Belt 6 - Idaho Batholith 7 - North Idaho & Mining 8 - Challis Volcanics 9 - Basin and Range 10 - Columbia River Basalts 11 - SRP & Yellowstone 12 - Pleistocene Glaciation 13 - Palouse & Lake Missoula 14 - Lake Bonneville Flood 15 - Snake River Plain Aquifer Basin and Range Province - Teritiary Extension General geology of the Basin and Range Province Mechanisms of Basin and Range faulting Idaho Basin and Range south of the Snake River Plain Idaho Basin and Range north of the Snake River Plain Local areas of active and recent Basin & Range faulting: Borah Peak PDF Slideshows: North of SRP , South of SRP , Borah Earthquake Flythroughs: Teton Valley , Henry's Fork , Big Lost River , Blackfoot , Portneuf , Raft River Valley , Bear River , Salmon Falls Creek , Snake River , Big Wood River Vocabulary Words thrust fault Basin and Range Snake River Plain half-graben transfer zone Fly-throughs General geology of the Basin and Range Province The Basin and Range Province generally includes most of eastern California, eastern Oregon, eastern Washington, Nevada, western Utah, southern and western Arizona, and southeastern Idaho. ..."}

    A summary of the files we make available:

    • ref-wiki-en.json.gz: 2,475,461 external reference webpages (JSON format)
    • ref-wiki-en_urls.txt.gz: 23,036,318 unique raw links to external references (plain-text format)
    • ref-wiki-en-5k.json.gz: 37,983 external reference webpages (JSON format)
    • ref-wiki-en-5k_urls.json.gz: 70,375 unique raw links to external references (plain-text format)
    • ref-wiki-en-5k_Q.txt.gz: 5,000 Wikidata Q identifiers forming the 5k dataset (plain-text format)
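
    A minimal sketch for iterating over one of the JSON corpora, assuming each line of the gzipped file is a single JSON object with the fields listed above:

    import gzip
    import json

    with gzip.open("ref-wiki-en-5k.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            ref = json.loads(line)
            # Each record links one or more Wikidata items (Q) to the plain-text
            # content of an external reference webpage.
            print(ref["host"], ref["url"], ref["Q"], len(ref["content"]))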

    Further details can be found in the publication:

    Suggesting References for Wikidata Claims based on Wikipedia's External References. Paolo Curotto, Aidan Hogan. Wikidata Workshop @ISWC 2020.

    Further material relating to this publication (including code for a proof-of-concept interface) is also available.

  9. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    Updated Apr 4, 2020
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1058023
    Available download formats: Croissant
    Dataset updated
    Apr 4, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Oleh Onyshchak
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • A superset based on good articles is also hosted on Kaggle. It has six times more entries, although of somewhat lower quality.

    It contains the text of each article and all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are just a small subset of all available ones, because they are manually reviewed and protected from edits. Thus they represent the best quality that human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    • pageN: the title of the N-th Wikipedia page; the folder contains all information about the page
    • text.json: text of the page saved as JSON. Please refer to the details of the JSON schema below.
    • meta.json: a collection of all images of the page. Please refer to the details of the JSON schema below.
    • imageN: the N-th image of an article, saved in jpg format where the width of each image is set to 600px. The name of the image is the md5 hashcode of the original image title.

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "... 
    

    ...", "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ...", }

    • title: page title
    • id: unique page id
    • url: url of a page on Wikipedia
    • html: HTML content of the article
    • wikitext: wikitext content of the article

    Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": false,
       "on_commons": true,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
       "features": ["4.8618264", "0.49436468", "7.0841103", "2.7377882", "2.1305492", ...]
      },
      ...
     ]
    }
    
    • filename: unique image id, the md5 hashcode of the original image title
    • title: image title retrieved from Commons, if applicable
    • parsed_title: image title split into words, i.e. "helloWorld.jpg" -> "hello world"
    • url: url of an image on Wikipedia
    • is_icon: True if the image is an icon, e.g. a category icon. We assume that an image is an icon if you cannot load a preview on Wikipedia after clicking on it
    • on_commons: True if the image is available from the Wikimedia Commons dataset
    • description: description of an image parsed from its Wikimedia Commons page, if available
    • caption: caption of an image parsed from the Wikipedia article, if available
    • headings: list of all nested headings of the location where the image is placed in the Wikipedia article. The first element is the top-most heading
    • features: output of the 5-th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to a shape of (2048,). Features are taken from the original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048
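
    A minimal sketch for walking an extracted copy of the dataset, assuming the directory layout shown above; the root path is a placeholder:

    import json
    from pathlib import Path

    dataset_root = Path("wikipedia_featured_articles")  # hypothetical extraction directory

    for page_dir in sorted(dataset_root.iterdir()):
        if not page_dir.is_dir():
            continue
        text = json.loads((page_dir / "text.json").read_text(encoding="utf-8"))
        meta = json.loads((page_dir / "img" / "meta.json").read_text(encoding="utf-8"))
        # One record per page: title/id/url/html/wikitext plus per-image metadata.
        print(text["title"], len(meta["img_meta"]), "images")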

    Collection method

    Data was collected by fetching the text and image content of featured articles with the pywikibot library and then parsing out additional metadata from the HTML pages of Wikipedia and Commons.

  10. Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Yu Su (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    You Wu
    Alyssa Lees
    Yu Su
    Cong Yu
    Huan Sun
    Xiang Deng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> Contain only sentences as evidence, Text-only

    table_pairs_for_pretrain_no_tokenization.tar.gz -> At least one piece of evidence is a table, Hybrid

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)      # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)        # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details on how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them in your preferred data format, as in the sketch below.
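
    For example, a minimal dumping sketch that follows the snippet above (the exact API depends on the installed webdataset version):

    import json
    import webdataset as wds

    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = wds.Dataset(url).decode().to_tuple("json")

    with open("pretrain_pairs.jsonl", "w", encoding="utf-8") as out:
        for (example,) in dataset:  # each item is a one-element tuple holding the decoded JSON
            out.write(json.dumps(example) + "\n")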

    Below we show how the data is organized with two examples.

    Text-only

    {'s1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
     's1_all_links': {
       'Sils,_Girona': [[0, 4]],
       'municipality': [[10, 22]],
       'Comarques_of_Catalonia': [[30, 37]],
       'Selva': [[41, 46]],
       'Catalonia': [[51, 60]]
     },  # list of entities and their mentions in the sentence (start, end location)
     'pairs': [  # other sentences that share common entity pair with the query, group by shared entity pairs
       {'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
        's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mention of the entity pair in the query
        's2s': [  # list of other sentences that contain the common entity pair, or evidence
          {'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
           'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
           's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
           'pair_locs': [  # mentions of the entity pair in the evidence
             [[19, 27]],  # mentions of entity 1
             [[0, 5], [288, 293]]  # mentions of entity 2
           ],
           'all_links': {
             'Selva': [[0, 5], [288, 293]],
             'Comarques_of_Catalonia': [[19, 27]],
             'Catalonia': [[40, 49]]
           }
          },
          ...]  # there are multiple evidence sentences
       },
       ...]  # there are multiple entity pairs in the query
    }

    Hybrid

    {'s1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
     's1_all_links': {...},  # same as text-only
     'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as text-only
     'table_pairs': [
       {'tid': 'Major_League_Baseball-1',
        'text': [
          ['World Series Records', 'World Series Records', ...],
          ['Team', 'Number of Series won', ...],
          ['St. Louis Cardinals (NL)', '11', ...],
          ...],  # table content, list of rows
        'index': [
          [[0, 0], [0, 1], ...],
          [[1, 0], [1, 1], ...],
          ...],  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
        'value_ranks': [
          [0, 0, ...],
          [0, 0, ...],
          [0, 10, ...],
          ...],  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
        'value_inv_ranks': [],  # inverse rank
        'all_links': {
          'St._Louis_Cardinals': {
            '2': [
              [[2, 0], [0, 19]],  # [[row_id, col_id], [start, end]]
            ]  # list of mentions in the second row, the key is row_id
          },
          'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}
        },
        'name': '',  # table name, if exists
        'pairs': {
          'pair': ['American_League', 'National_League'],
          's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mention in the query
          'table_pair_locs': {
            '17': [  # mention of entity pair in row 17
              [
                [[17, 0], [3, 18]],
                [[17, 1], [3, 18]],
                [[17, 2], [3, 18]],
                [[17, 3], [3, 18]]
              ],  # mention of the first entity
              [
                [[17, 0], [21, 36]],
                [[17, 1], [21, 36]]
              ]  # mention of the second entity
            ]
          }
        }
       }
     ]
    }

  11. A word2vec model file built from the French Wikipedia XML Dump using gensim

    • zenodo.org
    • data.niaid.nih.gov
    bin, text/x-python
    Updated Jan 24, 2020
    + more versions
    Cite
    Christof Schöch; Christof Schöch (2020). A word2vec model file built from the French Wikipedia XML Dump using gensim. [Dataset]. http://doi.org/10.5281/zenodo.162792
    Available download formats: bin, text/x-python
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christof Schöch; Christof Schöch
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    A word2vec model file built from the French Wikipedia XML dump using gensim. The data published here includes three model files (you need all three of them in the same folder) as well as the Python script used to build the model (for documentation). The Wikipedia dump was downloaded on October 7, 2016 from https://dumps.wikimedia.org/. Before building the model, plain text was extracted from the dump. The size of that dataset is about 500 million words or 3.6 GB of plain text. The principal parameters for building the model were the following: no lemmatization was performed, tokenization was done using the "\W" regular expression (any non-word character splits tokens), and the model was built with 500 dimensions.
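
    A minimal sketch for querying the published model with gensim, assuming all three model files sit in the working directory; the main file name below is a placeholder, and the exact attribute names depend on the gensim version:

    from gensim.models import Word2Vec

    model = Word2Vec.load("frwiki.gensim")         # hypothetical file name
    print(model.wv.vector_size)                    # should report 500 dimensions
    print(model.wv.most_similar("paris", topn=5))  # nearest neighbours of an example token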

  12. Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

    • live.european-language-grid.eu
    csv
    Updated Apr 15, 2024
    + more versions
    Cite
    (2024). Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles (Dataset) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18317
    Available download formats: csv
    Dataset updated
    Apr 15, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Many digital libraries recommend literature to their users considering the similarity between a query document and their repository. However, they often fail to distinguish what is the relationship that makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT as the best performing system with an F1-score of 0.93, which we manually examine to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivates the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as first steps in the exploration of documents through SPARQL-like queries such that one could find documents that are similar in one aspect but dissimilar in another.

    Additional information can be found on GitHub.

    The following data is supplemental to the experiments described in our research paper. The data consists of:

    • Datasets (articles, class labels, cross-validation splits)
    • Pretrained models (Transformers, GloVe, Doc2vec)
    • Model output (prediction) for the best performing models

    This package consists of the Dataset part.

    Dataset

    The Wikipedia article corpus is available in enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2. The original data have been downloaded as XML dump, and the corresponding articles were extracted as plain-text with gensim.scripts.segment_wiki. The archive contains only articles that are available in training or test data.

    The actual dataset is provided as used in the stratified k-fold with k=4 in train_testdata_4folds.tar.gz.

    ├── 1
    │  ├── test.csv
    │  └── train.csv
    ├── 2
    │  ├── test.csv
    │  └── train.csv
    ├── 3
    │  ├── test.csv
    │  └── train.csv
    └── 4
     ├── test.csv
     └── train.csv

    4 directories, 8 files
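
    A minimal loading sketch, assuming both archives have been extracted into the working directory and that the corpus file holds one JSON object per line; the CSV column names are not listed above and should be checked against the files:

    import bz2
    import json
    import pandas as pd

    # One of the four stratified folds from train_testdata_4folds.tar.gz.
    train = pd.read_csv("1/train.csv")
    test = pd.read_csv("1/test.csv")

    # Plain-text article corpus extracted with gensim.scripts.segment_wiki.
    with bz2.open("enwiki-20191101-pages-articles.weighted.10k.jsonl.bz2", "rt", encoding="utf-8") as f:
        articles = [json.loads(line) for line in f]

    print(len(train), len(test), len(articles))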

  13. Wikipedia Text Segmentation - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 4, 2012
    Cite
    (2012). Wikipedia Text Segmentation - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/3f590262-3ea6-5cc8-8e85-94e8a2c402f3
    Dataset updated
    Apr 4, 2012
    Description

    For corpus generation, we extracted the top-level sections of featured articles and concatenated their textual contents into a pure-text corpus file. The content of a section is constituted by the concatenation of the text of its paragraph elements and the content of contained sections. In particular, other elements such as tables and image captions are ignored when generating the text for a section, because text segmentation is meant to be applied to prose and not to pieces of information such as table fields. Furthermore, sections with one of the titles "See also", "References", and "External links" are skipped, as they do not contain information where segmentation makes sense.
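
    A minimal sketch of this filtering step, assuming each article is available as a list of (section_title, paragraphs) pairs; this illustrates the described procedure and is not the original corpus-generation code:

    SKIPPED_TITLES = {"See also", "References", "External links"}

    def section_text(sections):
        """Concatenate the paragraph text of sections, skipping non-prose ones."""
        parts = []
        for title, paragraphs in sections:
            if title in SKIPPED_TITLES:
                continue
            # Tables and image captions are assumed to have been dropped upstream.
            parts.append("\n".join(paragraphs))
        return "\n\n".join(parts)

    example = [("History", ["Paragraph one.", "Paragraph two."]),
               ("See also", ["Related articles"])]
    print(section_text(example))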

  14. WikiReaD (Wikipedia Readability Dataset)

    • zenodo.org
    bz2
    Updated May 22, 2025
    Cite
    Mykola Trokhymovych; Indira Sen; Martin Gerlach; Mykola Trokhymovych; Indira Sen; Martin Gerlach (2025). WikiReaD (Wikipedia Readability Dataset) [Dataset]. http://doi.org/10.5281/zenodo.11371932
    Available download formats: bz2
    Dataset updated
    May 22, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mykola Trokhymovych; Indira Sen; Martin Gerlach; Mykola Trokhymovych; Indira Sen; Martin Gerlach
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Description:

    The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article in two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).

    Dataset Details:

    • Number of Languages: 14
    • Number of files: 19
    • Use Case: Training and evaluating readability scoring models for articles within and outside Wikipedia.
    • Processing details: Text pairs are created by matching articles from Wikipedia with the corresponding article in the simplified/children encyclopedia either via the Wikidata item ID or their page titles. The text of each article is extracted directly from their parsed HTML version.
    • Files: The dataset consists of independent files for each type of children/simplified encyclopedia and each language (e.g., `

    Attribution:

    The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids.

    Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page (

    Related paper citation:

    @inproceedings{trokhymovych-etal-2024-open,
      title = "An Open Multilingual System for Scoring Readability of {W}ikipedia",
      author = "Trokhymovych, Mykola and
       Sen, Indira and
       Gerlach, Martin",
      editor = "Ku, Lun-Wei and
       Martins, Andre and
       Srikumar, Vivek",
      booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
      month = aug,
      year = "2024",
      address = "Bangkok, Thailand",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/2024.acl-long.342/",
      doi = "10.18653/v1/2024.acl-long.342",
      pages = "6296--6311"
    }

  15. barwiki-dumps

    • huggingface.co
    Updated Aug 9, 2025
    Cite
    Bavarian NLP (2025). barwiki-dumps [Dataset]. https://huggingface.co/datasets/bavarian-nlp/barwiki-dumps
    Dataset updated
    Aug 9, 2025
    Dataset authored and provided by
    Bavarian NLP
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🥨 Bavarian Wikipedia Dumps

    This repo hosts backups of the Bavarian Wikipedia Dumps. More precisely, various *-pages-articles.xml.bz2 dumps are hosted here, which include articles, templates, media/file descriptions, and primary meta-pages. These dumps can be used, for example, to construct a plaintext Wikipedia dump using wikiextractor. Recent dumps will be added to this repo on a regular basis.

  16. Citations with contexts in Wikipedia

    • figshare.com
    html
    Updated May 30, 2023
    Cite
    Aaron Halfaker; Meen Chul Kim; Andrea Forte; Dario Taraborelli (2023). Citations with contexts in Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5588842.v1
    Available download formats: html
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Halfaker; Meen Chul Kim; Andrea Forte; Dario Taraborelli
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains structured metadata and contextual information, in JSON format, about references added to Wikipedia articles. Each record represents an individual Wikipedia article revision with all the tags parsed, as stored in Wikipedia's XML dumps, including information about:

    1. the context(s) in which the reference occurs within the article, such as the surrounding text, parent section title, and section level;
    2. structured data and bibliographic metadata included within the reference itself (such as any citation template used, external links, and any known persistent identifiers);
    3. additional data/metadata about the reference itself (the reference name, its raw content, and, if applicable, the revision ID associated with reference addition/deletion/change).

    The data is available as a set of compressed JSON files, extracted from the July 1, 2017 XML dump of English Wikipedia. Other languages may be added to this dataset in the future. The JSON schema and Python parsing libraries used to generate the data are in the references.

  17. Data from: W2C – Web to Corpus – tool

    • live.european-language-grid.eu
    Updated Dec 19, 2011
    Cite
    (2011). W2C – Web to Corpus – tool [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/18153
    Dataset updated
    Dec 19, 2011
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    A tool for building multilingual corpora from Wikipedia: it downloads the web pages, converts them to plain text, identifies the language, etc.

    A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9

  18. Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories

    • zenodo.org
    • explore.openaire.eu
    bin, zip
    Updated Aug 9, 2024
    + more versions
    Cite
    José Frederico Rodrigues; Carla Teixeira Lopes; Carla Teixeira Lopes; Henrique Lopes Cardoso; José Frederico Rodrigues; Henrique Lopes Cardoso (2024). Wikipedia and Simple Wikipedia Lead Section Pairs for Nine Categories [Dataset]. http://doi.org/10.25747/4vc9-zs43
    Available download formats: zip, bin
    Dataset updated
    Aug 9, 2024
    Dataset provided by
    INESC TEC
    Authors
    José Frederico Rodrigues; Carla Teixeira Lopes; Carla Teixeira Lopes; Henrique Lopes Cardoso; José Frederico Rodrigues; Henrique Lopes Cardoso
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset (categorized_dataset folder) contains 9 files in .csv format, each a collection of 10,000 lead section pairs sourced from Wikipedia (https://www.wikipedia.org/) and Simple Wikipedia (https://simple.wikipedia.org/) for a given category. The included categories are Culture, Education, Employment, Entertainment, Health, Leisure, Objects, Science and Time. This dataset was created to understand how effective an open-source large language model (Llama3) is at assessing the readability of texts and simplifying text across multiple domains. The dataset was collected using the Wikipedia API.
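
    A minimal loading sketch, assuming each category file is named after its category; the file and column names are placeholders to be checked against the actual files:

    import pandas as pd

    pairs = pd.read_csv("categorized_dataset/Health.csv")  # hypothetical file name
    print(len(pairs))              # expected: 10,000 lead-section pairs for the category
    print(pairs.columns.tolist())  # inspect the actual column names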

  19. Wikipedia English Official Offline Edition 2014-07-07

    • academictorrents.com
    bittorrent
    Updated Aug 5, 2014
    + more versions
    Cite
    Wikipedia (2014). Wikipedia English Official Offline Edition 2014-07-07 [Dataset]. https://academictorrents.com/details/e18b8cce7d9cb2726f5f40dcb857111ec573cad4
    Available download formats: bittorrent (11031162019)
    Dataset updated
    Aug 5, 2014
    Dataset authored and provided by
    Wikipedia (www.wikipedia.org)
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

  20. A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023)

    • zenodo.org
    zip
    Updated Jul 5, 2023
    Cite
    Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza (2023). A Comprehensive Dataset of Classified Citations with Identifiers from English Wikipedia (2023) [Dataset]. http://doi.org/10.5281/zenodo.8107239
    Available download formats: zip
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Natallia Kokash; Natallia Kokash; Giovanni Colavizza; Giovanni Colavizza
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a dataset of 40.664.485 citations extracted from the English Wikipedia February 2023 dump (https://dumps.wikimedia.org/enwiki/20230220/).

    Version 1: en_citations.zip is a dataset of extracted citations

    Version 2: en_final.zip is the same dataset with classified citations augmented with identifiers

    The fields are as follows:

    • type_of_citation - Wikipedia template type used to define the citation, e.g., 'cite journal', 'cite news', etc.
    • page_title - title of the Wikipedia article from which the citation was extracted.
    • Title - source title, e.g., title of the book, newspaper article, etc.
    • URL - link to the source, e.g., webpage where news article was published, description of the book at the publisher's website, online library webpage, etc.
    • tld - top link domain extracted from the URL, e.g., 'bbc' for https://www.bbc.co.uk/...
    • Authors - list of article or book authors, if available.
    • ID_list - list of publication identifiers mentioned in the citation, e.g., DOI, ISBN, etc.
    • citations - citation text as used in Wikipedia code
    • actual_label - 'book', 'journal', 'news', or 'other' label assigned based on the analysis of citation identifiers or top link domain.
    • acquired_ID_list - identifiers located via Google Books and Crossref APIs for citations which are likely to refer to books or journals, i.e., defined using 'cite book', 'cite journal', 'cite encyclopedia', and 'cite proceedings' templates.
    1. The total number of news: 9.926.598
    2. The total number of books: 2.994.601
    3. The total number of journals: 2.052.172
    4. Citations augmented with IDs via lookup: 929.601 (out of 2.445.913 book, journal, encyclopedia, and proceedings template citations not classified as books or journals via given identifiers).
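
    A minimal filtering sketch, assuming the archives unpack to tabular files that pandas can read (the on-disk format inside the zip is not specified above) and using the field names listed:

    import pandas as pd

    citations = pd.read_csv("en_final/citations_part0.csv")  # hypothetical file name and format
    books = citations[citations["actual_label"] == "book"]
    print(len(books))
    print(books[["page_title", "Title", "ID_list"]].head())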

    The source code to extract citations can be found here: https://github.com/albatros13/wikicite.

    The code is a fork of the earlier project on Wikipedia citation extraction: https://github.com/Harshdeep1996/cite-classifications-wiki.
