100+ datasets found
  1. WikipediaUpdated

    • huggingface.co
    Updated May 4, 2023
    + more versions
    Cite
    jojo jenkins (2023). WikipediaUpdated [Dataset]. https://huggingface.co/datasets/luciferxf/WikipediaUpdated
    Explore at:
    Dataset updated
    May 4, 2023
    Authors
    jojo jenkins
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
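
    A minimal sketch of loading this dataset with the Hugging Face datasets library (the repository id comes from the citation URL above; whether a plain 'train' split or an additional configuration name is required is an assumption to check on the dataset page):

    # Minimal sketch, not an official example.
    from datasets import load_dataset

    # Assumes the repository exposes a "train" split; a language configuration
    # name may also be required, as with the standard Wikipedia datasets.
    ds = load_dataset("luciferxf/WikipediaUpdated", split="train")
    print(ds)      # features and number of rows
    print(ds[0])   # first cleaned article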

  2. wikipedia

    • tensorflow.org
    • huggingface.co
    Updated Aug 9, 2019
    Cite
    (2019). wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Explore at:
    Dataset updated
    Aug 9, 2019
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. wikipedia-summary-dataset

    • huggingface.co
    Updated Feb 15, 2023
    Cite
    Jordan Clive (2023). wikipedia-summary-dataset [Dataset]. https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Jordan Clive
    Description

    Dataset Summary

    This is a dataset that can be used for research into machine learning and natural language processing. It contains all titles and summaries (or introductions) of English Wikipedia articles, extracted in September of 2017. The dataset is different from the regular Wikipedia dump and different from the datasets that can be created by gensim because ours contains the extracted summaries and not the entire unprocessed page body. This could be useful if one wants to… See the full description on the dataset page: https://huggingface.co/datasets/jordiclive/wikipedia-summary-dataset.

  4. French Wikipedia Dataset

    • paperswithcode.com
    • opendatalab.com
    Cite
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot, French Wikipedia Dataset [Dataset]. https://paperswithcode.com/dataset/french-wikipedia
    Explore at:
    Authors
    Louis Martin; Benjamin Muller; Pedro Javier Ortiz Suárez; Yoann Dupont; Laurent Romary; Éric Villemonte de la Clergerie; Djamé Seddah; Benoît Sagot
    Area covered
    French
    Description

    French Wikipedia is a dataset used for pretraining the CamemBERT French language model. It uses the official 2019 French Wikipedia dumps.

  5. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    zip
    Updated Apr 4, 2020
    Cite
    Oleh Onyshchak (2020). Extended Wikipedia Multimodal Dataset [Dataset]. https://www.kaggle.com/datasets/jacksoncrow/extended-wikipedia-multimodal-dataset
    Explore at:
    Available download formats: zip (977856346 bytes)
    Dataset updated
    Apr 4, 2020
    Authors
    Oleh Onyshchak
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Wikipedia Featured Articles multimodal dataset

    Overview

    • This is a multimodal dataset of featured articles containing 5,638 articles and 57,454 images.
    • A superset of this dataset, built from good articles, is also hosted on Kaggle; it has six times more entries, though of somewhat lower quality.

    It contains the text of each article and all the images from that article, along with metadata such as image titles and descriptions. From Wikipedia, we selected featured articles, which are only a small subset of all available articles, because they are manually reviewed and protected from edits. Thus they represent the best quality that human editors on Wikipedia can offer.

    You can find more details in "Image Recommendation for Wikipedia Articles" thesis.

    Dataset structure

    The high-level structure of the dataset is as follows:

    .
    +-- page1 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    +-- page2 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    : 
    +-- pageN 
    |  +-- text.json 
    |  +-- img 
    |    +-- meta.json
    
    | label | description |
    | pageN | the title of the N-th Wikipedia page; contains all information about the page |
    | text.json | text of the page saved as JSON. Please refer to the details of the JSON schema below. |
    | meta.json | a collection of all images of the page. Please refer to the details of the JSON schema below. |
    | imageN | the N-th image of an article, saved in jpg format where the width of each image is set to 600px. The name of the image is the md5 hashcode of the original image title. |

    text.JSON Schema

    Below you see an example of how data is stored:

    {
     "title": "Naval Battle of Guadalcanal",
     "id": 405411,
     "url": "https://en.wikipedia.org/wiki/Naval_Battle_of_Guadalcanal",
     "html": "...",
     "wikitext": "... The '''Naval Battle of Guadalcanal''', sometimes referred to as ..."
    }

    | key | description |
    | title | page title |
    | id | unique page id |
    | url | url of the page on Wikipedia |
    | html | HTML content of the article |
    | wikitext | wikitext content of the article |

    Please note that @html and @wikitext properties represent the same information in different formats, so just choose the one which is easier to parse in your circumstances.

    meta.JSON Schema

    {
     "img_meta": [
      {
       "filename": "702105f83a2aa0d2a89447be6b61c624.jpg",
       "title": "IronbottomSound.jpg",
       "parsed_title": "ironbottom sound",
       "url": "https://en.wikipedia.org/wiki/File%3AIronbottomSound.jpg",
       "is_icon": false,
       "on_commons": true,
       "description": "A U.S. destroyer steams up what later became known as ...",
       "caption": "Ironbottom Sound. The majority of the warship surface ...",
       "headings": ["Naval Battle of Guadalcanal", "First Naval Battle of Guadalcanal", ...],
       "features": ["4.8618264", "0.49436468", "7.0841103", "2.7377882", "2.1305492", ...]
      },
      ...
     ]
    }
    
    | key | description |
    | filename | unique image id; md5 hashcode of the original image title |
    | title | image title retrieved from Commons, if applicable |
    | parsed_title | image title split into words, i.e. "helloWorld.jpg" -> "hello world" |
    | url | url of the image on Wikipedia |
    | is_icon | True if the image is an icon, e.g. a category icon. We assume an image is an icon if you cannot load a preview on Wikipedia after clicking on it |
    | on_commons | True if the image is available from the Wikimedia Commons dataset |
    | description | description of the image parsed from its Wikimedia Commons page, if available |
    | caption | caption of the image parsed from the Wikipedia article, if available |
    | headings | list of all nested headings of the location where the image is placed in the Wikipedia article; the first element is the top-most heading |
    | features | output of the 5th convolutional layer of ResNet152 trained on the ImageNet dataset. That output of shape (19, 24, 2048) is then max-pooled to shape (2048,). Features are taken from the original images downloaded in jpeg format with a fixed width of 600px. Practically, it is a list of floats with len = 2048 |
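
    As an illustration of the layout and schemas documented above, here is a minimal sketch that walks the page directories and loads each page's text.json and img/meta.json (the dataset root path is a placeholder):

    # Minimal sketch based on the directory structure described above.
    import json
    from pathlib import Path

    DATASET_ROOT = Path("extended-wikipedia-multimodal-dataset")  # placeholder path

    for page_dir in sorted(p for p in DATASET_ROOT.iterdir() if p.is_dir()):
        text = json.loads((page_dir / "text.json").read_text(encoding="utf-8"))
        meta = json.loads((page_dir / "img" / "meta.json").read_text(encoding="utf-8"))
        print(text["title"], text["id"], "images:", len(meta["img_meta"]))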

    Collection method

    Data was collected by fetching the text and image content of featured articles with the pywikibot library and then parsing additional metadata out of the HTML pages from Wikipedia and Commons.

  6. Wiki-en Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jul 25, 2019
    Cite
    Yumo Xu; Mirella Lapata (2019). Wiki-en Dataset [Dataset]. https://paperswithcode.com/dataset/wiki-en
    Explore at:
    Dataset updated
    Jul 25, 2019
    Authors
    Yumo Xu; Mirella Lapata
    Description

    Wiki-en is an annotated English dataset for domain detection extracted from Wikipedia. It includes texts from 7 different domains: “Business and Commerce” (BUS), “Government and Politics” (GOV), “Physical and Mental Health” (HEA), “Law and Order” (LAW), “Lifestyle” (LIF), “Military” (MIL), and “General Purpose” (GEN).

  7. Data from: Wikipedia Citations: A comprehensive dataset of citations with...

    • zenodo.org
    zip
    Updated Nov 12, 2020
    Cite
    Harshdeep Singh; Robert West; Giovanni Colavizza (2020). Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.3940692
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Harshdeep Singh; Robert West; Giovanni Colavizza
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset is composed of 3 parts:

    1. A dataset of 29.276 million citations from 35 different citation templates, of which 3.92 million citations already contained identifiers and approximately 260,752 citations were equipped with identifiers from Crossref. This is under the filename citations_from_wikipedia.zip.

    2. A minimal dataset containing a few of the columns from the citations-from-Wikipedia dataset: 'type_of_citation', 'page_title', 'Title', 'ID_list', 'metadata_file', 'updated_identifier'. This is under the filename minimal_dataset.zip. The 'metadata_file' column can be used to refer to the metadata collected from CrossRef, and the page title and citation title can be used to refer back to the citations_from_wikipedia.zip dataset to get more information for a particular citation (such as author, periodical, chapter).

    3. Citations classified as journal citations and their corresponding metadata/identifiers extracted from Crossref to make the dataset more complete. This is under the filename lookup_data.zip. The zip file contains lookup_table.gzip (a Parquet file containing all citations classified as journals) and a folder metadata_extracted (containing the metadata from CrossRef for all the citations mentioned in the table).


    The data was parsed from the Wikipedia XML content dumps published in May 2020.

    The source code for extracting the data and getting started with the pipeline can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki

    The taxonomy of the dataset in (1) can be found here: https://github.com/Harshdeep1996/cite-classifications-wiki/wiki/Taxonomy-of-the-parent-dataset
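
    A minimal sketch of reading the journal-citation lookup table after extracting lookup_data.zip (the local path is an assumption; lookup_table.gzip is described above as a Parquet file, so pyarrow or fastparquet is needed):

    # Minimal sketch; the extraction path "lookup_data/" is an assumption.
    import pandas as pd

    lookup = pd.read_parquet("lookup_data/lookup_table.gzip")
    print(lookup.shape)
    print(lookup.columns.tolist())  # inspect the available columns
    print(lookup.head())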

  8. Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

    • figshare.com
    txt
    Updated Mar 14, 2021
    Cite
    KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    KayYen Wong; Diego Saez-Trumper; Miriam Redi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, with the formats:

    • {template_name}_features.csv
    • {template_name}_difftxt.csv.gz
    • {template_name}_fulltxt.csv.gz

    For more details on the project, dataset schema, and links to data usage and benchmarking: https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
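
    A minimal sketch of loading one template's files with pandas ("pov" stands in as a hypothetical template_name; see the project page above for the actual template list and schema):

    # Minimal sketch; "pov" is a hypothetical {template_name}.
    import pandas as pd

    features = pd.read_csv("pov_features.csv")
    fulltxt = pd.read_csv("pov_fulltxt.csv.gz")  # compression inferred from the extension
    print(features.shape, fulltxt.shape)
    print(features.columns.tolist())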

  9. Wikipedia Person and Animal Dataset Dataset

    • paperswithcode.com
    Updated Nov 27, 2021
    + more versions
    Cite
    Qingyun Wang; Xiaoman Pan; Lifu Huang; Boliang Zhang; Zhiying Jiang; Heng Ji; Kevin Knight (2021). Wikipedia Person and Animal Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/wikipedia-person-and-animal-dataset
    Explore at:
    Dataset updated
    Nov 27, 2021
    Authors
    Qingyun Wang; Xiaoman Pan; Lifu Huang; Boliang Zhang; Zhiying Jiang; Heng Ji; Kevin Knight
    Description

    This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on the Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

  10. Plaintext Wikipedia dump 2018

    • lindat.mff.cuni.cz
    • live.european-language-grid.eu
    Updated Feb 25, 2018
    Cite
    Rudolf Rosa (2018). Plaintext Wikipedia dump 2018 [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735
    Explore at:
    Dataset updated
    Feb 25, 2018
    Authors
    Rudolf Rosa
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018.

    The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias].

    The script which can be used to get a new version of the data is included, but note that Wikipedia limits the download speed for downloading a lot of the dumps, so it takes a few days to download all of them (but one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].

  11. Wikipedia Article Networks

    • kaggle.com
    Updated Nov 12, 2019
    Cite
    Andrea Garritano (2019). Wikipedia Article Networks [Dataset]. https://www.kaggle.com/datasets/andreagarritano/wikipedia-article-networks
    Explore at:
    Dataset updated
    Nov 12, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Andrea Garritano
    Description

    Wikipedia Article Networks

    Description

    The data was collected from the English Wikipedia (December 2018). These datasets represent page-page networks on specific topics (chameleons, crocodiles and squirrels). Nodes represent articles and edges are mutual links between them. The edges CSV files contain the edges; nodes are indexed from 0. The features JSON files contain the features of articles: each key is a page id, and node features are given as lists. The presence of a feature in the feature list means that an informative noun appeared in the text of the Wikipedia article. The target CSV contains the node identifiers and the average monthly traffic between October 2017 and November 2018 for each page. For each page-page network we list the number of nodes and edges along with some other descriptive statistics.

    Properties

    • Directed: No.
    • Node features: Yes.
    • Edge features: No.
    • Node labels: Yes. Continuous target.
    • Temporal: No.

    | Dataset | Chameleon | Crocodile | Squirrel |
    | Nodes | 2,277 | 11,631 | 5,201 |
    | Edges | 31,421 | 170,918 | 198,493 |
    | Density | 0.012 | 0.003 | 0.015 |
    | Transitivity | 0.314 | 0.026 | 0.348 |
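
    A minimal sketch of loading one of the three networks with pandas and networkx (the chameleon file names are assumptions based on the description above; adjust them to the actual download):

    # Minimal sketch; file names are assumptions.
    import json
    import pandas as pd
    import networkx as nx

    edges = pd.read_csv("chameleon_edges.csv")      # two columns of node ids
    target = pd.read_csv("chameleon_target.csv")    # node id and average monthly traffic
    with open("chameleon_features.json") as f:
        features = json.load(f)                     # page id -> list of noun features

    G = nx.from_pandas_edgelist(edges, source=edges.columns[0], target=edges.columns[1])
    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")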

    Possible Tasks

    • Regression
    • Link prediction
    • Community detection
    • Network visualization

    Paper: Multi-scale Attributed Node Embedding. Benedek Rozemberczki, Carl Allen, and Rik Sarkar. arXiv, 2019. https://arxiv.org/abs/1909.13021

  12. wiki40b

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Aug 30, 2023
    Cite
    (2023). wiki40b [Dataset]. https://www.tensorflow.org/datasets/catalog/wiki40b
    Explore at:
    Dataset updated
    Aug 30, 2023
    Description

    Cleaned-up text for 40+ Wikipedia language editions of pages corresponding to entities. The datasets have train/dev/test splits per language. The dataset is cleaned up by page filtering to remove disambiguation pages, redirect pages, deleted pages, and non-entity pages. Each example contains the Wikidata id of the entity and the full Wikipedia article after page processing that removes non-content sections and structured objects. The language models trained on this corpus - including 41 monolingual models, and 2 multilingual models - can be found at https://tfhub.dev/google/collections/wiki40b-lm/1.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wiki40b', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. Wikipedia Talk Labels: Personal Attacks

    • figshare.com
    txt
    Updated Feb 22, 2017
    + more versions
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Personal Attacks [Dataset]. http://doi.org/10.6084/m9.figshare.4054689.v6
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it contains a personal attack. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.
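
    A minimal sketch of joining comments with a majority-vote label (the file and column names are assumptions about the figshare layout; the project wiki documents the authoritative schema):

    # Minimal sketch; file and column names are assumptions.
    import pandas as pd

    comments = pd.read_csv("attack_annotated_comments.tsv", sep="\t", index_col="rev_id")
    annotations = pd.read_csv("attack_annotations.tsv", sep="\t")

    # Majority vote over the multiple crowd annotations per comment.
    majority = annotations.groupby("rev_id")["attack"].mean() > 0.5
    comments["attack"] = majority
    print(comments["attack"].value_counts())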

  14. Kensho Derived Wikimedia Dataset

    • kaggle.com
    Updated Jan 31, 2020
    Cite
    Kensho R&D (2020). Kensho Derived Wikimedia Dataset [Dataset]. https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data/activity
    Explore at:
    Dataset updated
    Jan 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kensho R&D
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Kensho Derived Wikimedia Dataset

    Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Wikipedia is nearly 20 years old and recently added its six millionth article in English. Wikidata, its younger machine-readable sister project, was created in 2012 but has been growing rapidly and currently contains more than 75 million items.

    These projects contribute to the Wikimedia Foundation's mission of empowering people to develop and disseminate educational content under a free license. They are also heavily utilized by computer science research groups, especially those interested in natural language processing (NLP). The Wikimedia Foundation periodically releases snapshots of the raw data backing these projects, but these are in a variety of formats and were not designed for use in NLP research. In the Kensho R&D group, we spend a lot of time downloading, parsing, and experimenting with this raw data. The Kensho Derived Wikimedia Dataset (KDWD) is a condensed subset of the raw Wikimedia data in a form that we find helpful for NLP work. The KDWD has a CC BY-SA 3.0 license, so feel free to use it in your work too.


    This particular release consists of two main components - a link annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base. We version the KDWD using the raw Wikimedia snapshot dates. The version string for this dataset is kdwd_enwiki_20191201_wikidata_20191202 indicating that this KDWD was built from the English Wikipedia snapshot from 2019 December 1 and the Wikidata snapshot from 2019 December 2. Below we describe these components in more detail.

    Example Notebooks

    Dive right in by checking out some of our example notebooks:

    Updates / Changelog

    • initial release 2020-01-31

    File Summary

    • Wikipedia
      • page.csv (page metadata and Wikipedia-to-Wikidata mapping)
      • link_annotated_text.jsonl (plaintext of Wikipedia pages with link offsets)
    • Wikidata
      • item.csv (item labels and descriptions in English)
      • item_aliases.csv (item aliases in English)
      • property.csv (property labels and descriptions in English)
      • property_aliases.csv (property aliases in English)
      • statements.csv (truthy qpq statements)

    Three Layers of Data

    The KDWD is three connected layers of data. The base layer is a plain text English Wikipedia corpus, the middle layer annotates the corpus by indicating which text spans are links, and the top layer connects the link text spans to items in Wikidata. Below we'll describe these layers in more detail.

    [Image: the three connected layers of the KDWD]

    Wikipedia Sample

    The first part of the KDWD is derived from Wikipedia. In order to create a corpus of mostly natural text, we restrict our English Wikipedia page sample to those that:

    From these pages we construct a corpus of link annotated text. We store this data in a single JSON Lines file with one page per line. Each page object has the following format:

    page = {
      "page_id": 12,   # wikipedia page id of annotated page
      "sections": [...]  # list of section objects
    }
    
    section = {
      "name": "Introduction",             # section header
      "text": "Anarchism is an ...",          # plaintext of section
      "link_offsets": [16, 35, 49, ...],        # list of anchor text offsets
      "link_lengths": [18, 9, 17, ...],        # list of anchor text lengths
      "target_page_ids": [867979, 23040, 586276, ...] # list of link target page ids
    }
    

    The text attribute of each section object contains our parse of the section’s wikitext markup into plaintext. Text spans that represent links are identified via the attributes link_offsets, link_lengths, and target_page_ids.
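
    A minimal sketch of recovering the anchor-text spans from link_annotated_text.jsonl using the offsets and lengths described above:

    # Minimal sketch of reading the link-annotated corpus described above.
    import json

    with open("link_annotated_text.jsonl", encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            for section in page["sections"]:
                text = section["text"]
                for offset, length, target_id in zip(
                    section["link_offsets"],
                    section["link_lengths"],
                    section["target_page_ids"],
                ):
                    anchor = text[offset:offset + length]
                    print(page["page_id"], repr(anchor), "->", target_id)
            break  # only inspect the first page in this sketch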

    Wikidata Sample

    The second part of the KDWD is derived from Wikidata. Because more people are familiar with Wikipedia than Wikidata, we provide more background here than in the previous section. Wikidata provides centralized storage of structured data for all Wikimedia projects. The core Wikidata concepts are items, properties, and statements.

    In Wikidata, items are used to represent all the things in human knowledge, including topics, concepts, and objects. For example, the "1988 Summer Olympics", "love", "Elvis Presley", and "gorilla" are all items in Wikidata.

    -- https://www.wikidata.org/wiki/Help:Items

    A property describes the data value of a statement and can be thought of as a category of data, for example "color" for the data value "blue".

    -- https://www.wikidata.org/wiki/Help:Properties

    A statement is how the information we know about an item - the data we have about it - gets recorded in Wikidata. This happens by pairing a property with at least one data value

    -- https://www.wikidata.org/wiki/Help:Statements

    [Image: example statements from the Wikidata item for Grace Hopper]

    The image above shows several statements from the Wikidata item for Grace Hopper. We can think about these statements as triples with the form (item, property, data value).

    [Image: the same statements shown as a table of (item, property, data value) triples]

    In the first statement (Grace Hopper, date of birth, 9 December 1906) the data value represents a time. However, data values can have several different types (e.g., time, string, globecoordinate, item, …). If the data value in a statement triple is a Wikidata item, we call it a qpq-statement (note that each item has a unique ID beginning with Q and each property has a unique ID beginning with P). We can think of qpq-statements as triples of the form (source item, property, target item). The qpq-statements in the image above are:

    In order to construct a compact Wikidata sample that is relevant to our Wikipedia sample, we start with all statements in Wikidata and filter down to those that:

    • have a data value that is a Wikidata item (i.e., qpq-statements)
    • have a source item associated with a Wikipedia page from our Wikipedia sample
    • are
  15. Wikipedia Article Topics for All Languages (based on article outlinks)

    • figshare.com
    bz2
    Updated Jul 20, 2021
    Cite
    Isaac Johnson (2021). Wikipedia Article Topics for All Languages (based on article outlinks) [Dataset]. http://doi.org/10.6084/m9.figshare.12619766.v3
    Explore at:
    Available download formats: bz2
    Dataset updated
    Jul 20, 2021
    Dataset provided by
    figshare
    Authors
    Isaac Johnson
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains the predicted topic(s) for (almost) every Wikipedia article across languages. It is missing articles without any valid outlinks -- i.e. links to other Wikipedia articles. This current version is based on the December 2020 Wikipedia dumps (data as of 1 January 2021), but earlier/future versions may be for other snapshots, as indicated by the filename.

    The data is bzip2-compressed and each row is tab-delimited, containing the following metadata and then the predicted probability (rounded to three decimal places to reduce filesize) that each of the topics in the taxonomy (https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy) applies to the article:

    • wiki_db: which Wikipedia language edition the article belongs to -- e.g., enwiki == English Wikipedia
    • qid: the article's Wikidata item ID, if it has one -- e.g., the article for Douglas Adams is Q42 (https://www.wikidata.org/wiki/Q42)
    • pid: the page ID of the article -- e.g., the article for Douglas Adams in English Wikipedia is 8091 (en.wikipedia.org/wiki/?curid=8091)
    • num_outlinks: the number of Wikipedia links in the article that were used by the model to make its prediction -- this is after removing links to non-article namespaces (e.g., categories, templates), articles without Wikidata IDs (very few), and interwiki links, i.e. only retaining links to namespace-0 articles in the same wiki that have associated Wikidata IDs. This is mainly provided to give a sense of how much data the prediction is based upon.

    For more information, see the model description page on Meta: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance

    Additionally, a 1% sample file is provided for easier exploration. The sampling was done by Wikidata ID, so if, e.g., Q800612 (Canfranc International railway station) was sampled in, then all 16 language versions of that article are included. The sample includes 201,196 Wikidata IDs, which led to 340,290 articles.
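
    A minimal sketch of streaming the bzip2-compressed, tab-delimited file ("topics.tsv.bz2" is a placeholder file name; whether the first row is a header of metadata column names plus topic labels is an assumption to check against the download):

    # Minimal sketch; the file name and presence of a header row are assumptions.
    import bz2
    import csv

    with bz2.open("topics.tsv.bz2", mode="rt", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)  # assumed: metadata columns followed by topic labels
        for row in reader:
            wiki_db, qid, pid, num_outlinks = row[:4]
            topic_probs = dict(zip(header[4:], map(float, row[4:])))
            print(wiki_db, qid, pid, topic_probs)
            break  # just the first article in this sketch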

  16. Arabic Wiki data Dump 2018

    • kaggle.com
    zip
    Updated Feb 6, 2018
    Cite
    Abed Khooli (2018). Arabic Wiki data Dump 2018 [Dataset]. https://www.kaggle.com/datasets/abedkhooli/arabic-wiki-data-dump-2018
    Explore at:
    Available download formats: zip (694547311 bytes)
    Dataset updated
    Feb 6, 2018
    Authors
    Abed Khooli
    Description

    Context

    Arabic is a rich and major world language. Recent advances in computational linguistics and AI can be applied to Arabic but not in the generic way most languages are treated. This dataset (Arabic articles from Wikipedia) will be used to train Word2Vec and compare performance with publicly available pre-trained model from FastText (Facebook) in a generic way. A related model is now available: https://www.kaggle.com/abedkhooli/arabic-ulmfit-model

    Content

    All Wikipedia Arabic articles from the January 20, 2018 data dump (compressed) in wikimedia format. Content is expected to be (mostly) in Modern Standard Arabic.

    Acknowledgements

    Thanks to Wikipedia for making public data dumps available and to Facebook for releasing pre-trained models.

    Inspiration

    The challenges (opportunities) here are mostly in the pre-processing of tokens and text normalization, plus hyperparameter tuning for different purposes. It is easy to isolate Arabic tokens (many articles have non-Arabic words), but tokenization is a challenge: how to treat accented (7arakaat or tashkeel) and non-accented word forms, the same word form with different meanings, and suffixes and prefixes (especially w).

  17. Bangla Wikipedia dataset

    • kaggle.com
    • data.mendeley.com
    • +1more
    zip
    Updated Jan 20, 2021
    + more versions
    Cite
    SADMAN ARAF (2021). Bangla Wikipedia dataset [Dataset]. https://www.kaggle.com/sadmanaraf/bangla-wikipedia-dataset
    Explore at:
    Available download formats: zip (74663684 bytes)
    Dataset updated
    Jan 20, 2021
    Authors
    SADMAN ARAF
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Context

    A subset of the Bangla version of Wikipedia text. To create the Wikipedia dataset, we collected the Bangla wiki dump of 10 June 2019. The files were then merged and each article was selected as a sample text. All HTML tags were removed and the title of the page was stripped from the beginning of the text. This dataset contains 70,377 samples with a total of 18,229,481 words. The entire dataset has 1,289,249 unique words, which is 7% of the total vocabulary.

    Content

    Each text is represented by an id. The data is found in wiki.csv.

    wiki.csv contains: id, text, title, url
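
    A minimal sketch of loading wiki.csv with pandas, using the columns listed above:

    # Minimal sketch; columns follow the description above.
    import pandas as pd

    wiki = pd.read_csv("wiki.csv")
    print(wiki[["id", "title", "url"]].head())
    print("articles:", len(wiki))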

    Acknowledgement

    The original dataset is found here. To acknowledge use of the dataset in publications, please cite the data: Khatun, Aisha; Rahman, Anisur; Islam, Md Saiful (2020), “Bangla Wikipedia dataset”, Mendeley Data, V4, doi: 10.17632/3ph3n78fp7.4

    http://dx.doi.org/10.17632/3ph3n78fp7.4

    Inspiration

    • Research on Bangla natural language processing can be conducted
  18. ner-wikipedia-dataset

    • hf-mirror.llyke.com
    • huggingface.co
    Updated Jul 25, 2023
    Cite
    大規模言語モデル入門 (2023). ner-wikipedia-dataset [Dataset]. https://hf-mirror.llyke.com/datasets/llm-book/ner-wikipedia-dataset
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset authored and provided by
    大規模言語モデル入門
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Dataset Card for llm-book/ner-wikipedia-dataset

    This is the "Japanese named entity recognition dataset based on Wikipedia" (Version 2.0), created by Stockmark Inc. and used in the book 大規模言語モデル入門 (Introduction to Large Language Models). It uses the dataset published in the GitHub repository stockmarkteam/ner-wikipedia-dataset.

      Citation
    

    @inproceedings{omi-2021-wikipedia, title = "Wikipediaを用いた日本語の固有表現抽出のデータセットの構築", author = "近江 崇宏", booktitle = "言語処理学会第27回年次大会", year = "2021", url = "https://anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P2-7.pdf", }… See the full description on the dataset page: https://hf-mirror.com/datasets/llm-book/ner-wikipedia-dataset.
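
    A minimal sketch of loading the dataset with the Hugging Face datasets library (the repository id comes from the citation above; the availability of a plain 'train' split is an assumption):

    # Minimal sketch; assumes a "train" split is exposed.
    from datasets import load_dataset

    ds = load_dataset("llm-book/ner-wikipedia-dataset", split="train")
    print(ds)
    print(ds[0])  # one example with its annotated named-entity spans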

  19. Wikipedia Talk Labels: Aggression

    • figshare.com
    txt
    Updated Feb 22, 2017
    Cite
    Ellery Wulczyn; Nithum Thain; Lucas Dixon (2017). Wikipedia Talk Labels: Aggression [Dataset]. http://doi.org/10.6084/m9.figshare.4267550.v5
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 22, 2017
    Dataset provided by
    figshare
    Authors
    Ellery Wulczyn; Nithum Thain; Lucas Dixon
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This data set includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it has aggressive tone. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook.

  20. wikipedia_toxicity_subtypes

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). wikipedia_toxicity_subtypes [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia_toxicity_subtypes
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The comments in this dataset come from an archive of Wikipedia talk page comments. These have been annotated by Jigsaw for toxicity, as well as (for the main config) a variety of toxicity subtypes, including severe toxicity, obscenity, threatening language, insulting language, and identity attacks. This dataset is a replica of the data released for the Jigsaw Toxic Comment Classification Challenge and Jigsaw Multilingual Toxic Comment Classification competition on Kaggle, with the test dataset merged with the test_labels released after the end of the competitions. Test data not used for scoring has been dropped. This dataset is released under CC0, as is the underlying comment text.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia_toxicity_subtypes', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.
