100+ datasets found
  1. wikipedia-persons-masked

    • huggingface.co
    Updated May 23, 2009
    Cite
    wikipedia-persons-masked [Dataset]. https://huggingface.co/datasets/rcds/wikipedia-persons-masked
    Explore at:
    2 scholarly articles cite this dataset (View in Google Scholar)
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 23, 2009
    Dataset authored and provided by
    Institute for Public Sector Transformation IPST - Digital Sustainability Lab DSL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    wikipedia persons masked: A filtered version of the wikipedia dataset, with only pages of people

      Dataset Summary
    

    Contains ~70k pages from Wikipedia, each describing a person. For each page, the person described in the text is masked with a …

      Supported Tasks and Leaderboards
    

    The dataset supports the fill-mask task, but can also be used for other tasks such as question answering, e.g. "Who is …"

      Languages
    

    English only… See the full description on the dataset page: https://huggingface.co/datasets/rcds/wikipedia-persons-masked.
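
    As a quick-start illustration (not from the dataset card), the fill-mask framing maps directly onto the Hugging Face datasets library; a minimal sketch, assuming the standard load_dataset API and a "train" split, which may differ on the actual card:

    ```python
    # Minimal sketch: load the dataset from the Hugging Face Hub.
    # The repo id comes from the dataset page above; the "train" split name
    # is an assumption and may differ on the actual card.
    from datasets import load_dataset

    ds = load_dataset("rcds/wikipedia-persons-masked", split="train")
    print(ds[0])  # inspect one masked person page
    ```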

  2. Data from: English Wikipedia - Species Pages

    • gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Explore at:
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Wikimedia Foundation (http://www.wikimedia.org/)
    Authors
    Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.

  3. Data from: Wiki-Reliability: A Large Scale Dataset for Content Reliability...

    • figshare.com
    txt
    Updated Mar 14, 2021
    Cite
    KayYen Wong; Diego Saez-Trumper; Miriam Redi (2021). Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.14113799.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    figshare
    Authors
    KayYen Wong; Diego Saez-Trumper; Miriam Redi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wiki-Reliability: Machine Learning datasets for measuring content reliability on Wikipedia. Consists of metadata features and content text datasets, in the following formats:

    • {template_name}_features.csv
    • {template_name}_difftxt.csv.gz
    • {template_name}_fulltxt.csv.gz

    For more details on the project, dataset schema, and links to data usage and benchmarking, see https://meta.wikimedia.org/wiki/Research:Wiki-Reliability:_A_Large_Scale_Dataset_for_Content_Reliability_on_Wikipedia
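
    Given the file naming scheme above, a hedged sketch of loading one template's tables with pandas; "pov" stands in for any {template_name}, and the files are assumed to have been downloaded locally from figshare:

    ```python
    # Sketch: read one template's feature and diff-text tables.
    # "pov" is a placeholder for {template_name}; adjust to the template you need.
    import pandas as pd

    features = pd.read_csv("pov_features.csv")
    diffs = pd.read_csv("pov_difftxt.csv.gz", compression="gzip")
    print(features.shape, diffs.shape)
    ```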

  4. Wikipedia People Page Views Data

    • kaggle.com
    Updated Apr 17, 2024
    Cite
    Netanel M (2024). Wikipedia People Page Views Data [Dataset]. https://www.kaggle.com/datasets/netanelmad/wikipedia-people-page-views-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 17, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Netanel M
    Description

    The data in this dataset is extracted from the BigQuery Wikipedia dataset. It includes:

    1. Monthly page views for all people on Wikipedia (P31 (instance of) = Q5 (human)) for the years 2015-2023.
    2. Wikidata Properties of these pages (NOTE: this information is found to be very messy, missing, and sometimes incorrect. If you wish to have a clean and validated dataset, I recommend checking out the verified people dataset by Laouenan et al. at this link).

    Date accessed: March 2024.

    Below are the queries used to get the dataset:

    ```sql
    -- 1. Get list of people on Wikipedia
    SELECT DISTINCT en_wiki  -- page title name in English Wikipedia
    FROM `project.wikipedia_pageviews.wikidata`, UNNEST(instance_of) AS instance_of_struct
    WHERE instance_of_struct.numeric_id = 5  -- instance_of = 5 => person

    -- 2. Get pageview data for those people
    SELECT title, DATETIME_TRUNC(datehour, MONTH) AS month, SUM(views) AS monthly_views
    FROM `project.wikipedia_pageviews.pageviews_20xx` a  -- replace xx with desired year
    JOIN `project.data_for_project.distinct_people` b ON a.title = b.en_wiki
    WHERE datehour IS NOT NULL AND wiki = "en"
    GROUP BY title, DATETIME_TRUNC(datehour, MONTH)

    -- 3. Get wikidata for those people
    SELECT *
    FROM `project.wikipedia_pageviews.wikidata`, UNNEST(instance_of) AS instance_of_struct
    WHERE instance_of_struct.numeric_id = 5
    ```
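
    To run these queries outside the BigQuery console, the standard google-cloud-bigquery client can execute them; a sketch, assuming configured credentials and keeping the placeholder table names from the queries above:

    ```python
    # Sketch: run query 1 with the BigQuery Python client and fetch a DataFrame.
    # Table names are placeholders copied from the queries above.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
    SELECT DISTINCT en_wiki
    FROM `project.wikipedia_pageviews.wikidata`, UNNEST(instance_of) AS instance_of_struct
    WHERE instance_of_struct.numeric_id = 5
    """
    people = client.query(sql).to_dataframe()  # needs the pandas/db-dtypes extras
    print(len(people), "people found")
    ```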

  5. wizard_of_wikipedia

    • huggingface.co
    Updated Jul 16, 2023
    Cite
    Chujie Zheng (2023). wizard_of_wikipedia [Dataset]. https://huggingface.co/datasets/chujiezheng/wizard_of_wikipedia
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2023
    Authors
    Chujie Zheng
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Wizard-of-Wikipedia data for the Findings of EMNLP 2020 paper "Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation" (GitHub repo; original paper).

    @inproceedings{zheng-etal-2020-diffks,
      title = "{D}ifference-aware Knowledge Selection for Knowledge-grounded Conversation Generation",
      author = "Zheng, Chujie and Cao, Yunbo and Jiang, Daxin and Huang, Minlie",
      booktitle = "Findings of EMNLP",
      year = "2020"
    }

  6. Wiki-talk Datasets

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jan 24, 2020
    Cite
    Jun Sun; Jérôme Kunegis (2020). Wiki-talk Datasets [Dataset]. http://doi.org/10.5281/zenodo.49561
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jun Sun; Jérôme Kunegis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    User interaction networks of Wikipedia in 28 different languages. Nodes (original Wikipedia user IDs) represent users of Wikipedia, and an edge from user A to user B denotes that user A wrote a message on the talk page of user B at a certain timestamp.

    More info: http://yfiua.github.io/academic/2016/02/14/wiki-talk-datasets.html
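
    Because each edge is a timestamped (A, B) message event, the networks load naturally into a directed multigraph. A sketch with networkx, assuming the common whitespace-separated "source target timestamp" line format and a hypothetical file name; check the linked page for the exact layout:

    ```python
    # Sketch: build a directed multigraph from one Wiki-talk edge file.
    # The file name and "source target timestamp" layout are assumptions.
    import gzip
    import networkx as nx

    G = nx.MultiDiGraph()
    with gzip.open("wiki-talk-en.tsv.gz", "rt") as f:
        for line in f:
            src, dst, ts = line.split()[:3]
            G.add_edge(src, dst, timestamp=int(ts))

    print(G.number_of_nodes(), "users,", G.number_of_edges(), "messages")
    ```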

  7. Wikipedia Citations Dataset

    • paperswithcode.com
    Updated Jul 16, 2020
    Cite
    (2020). Wikipedia Citations Dataset [Dataset]. https://paperswithcode.com/dataset/wikipedia-citations
    Explore at:
    Dataset updated
    Jul 16, 2020
    Description

    Wikipedia Citations is a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as referring to books, journal articles, or Web content. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and further equip an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science.

  8. English Wikipedia - Dataset - LDM

    • service.tib.eu
    Updated Nov 25, 2024
    Cite
    (2024). English Wikipedia - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/english-wikipedia
    Explore at:
    Dataset updated
    Nov 25, 2024
    Description

    The English Wikipedia is widely used as a text corpus for NLP tasks.

  9. Wikipedia-example-data

    • huggingface.co
    Updated Jul 6, 2024
    Cite
    TopicNavi (2024). Wikipedia-example-data [Dataset]. https://huggingface.co/datasets/TopicNavi/Wikipedia-example-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2024
    Dataset authored and provided by
    TopicNavi
    Description

    The TopicNavi/Wikipedia-example-data dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  10. Wikipedia Data set New

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Cite
    Threatthriver (2023). Wikipedia Data set New [Dataset]. https://www.kaggle.com/datasets/kirauser/wikipedia-data-set-new/discussion
    Explore at:
    Available download formats: zip (347495280 bytes)
    Dataset updated
    Dec 8, 2023
    Authors
    Threatthriver
    Description

    Dataset

    This dataset was created by Threatthriver


  11. Wikipedia Change Metadata

    • redivis.com
    application/jsonl +7
    Updated Sep 22, 2021
    Cite
    Stanford Graduate School of Business Library (2021). Wikipedia Change Metadata [Dataset]. https://redivis.com/datasets/1ky2-8b1pvrv76
    Explore at:
    Available download formats: avro, parquet, application/jsonl, stata, csv, sas, spss, arrow
    Dataset updated
    Sep 22, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Time period covered
    Jan 16, 2001 - Mar 1, 2019
    Description

    Abstract

    The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.

    Documentation

    Source for details below: https://zenodo.org/record/3605388#.YWitsdnML0o

    Dataset details

    Part 1: HTML revision history

    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision's wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format

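    Given this row layout, each file can be streamed without loading it whole; a sketch, assuming one JSON object per line and a hypothetical file name:

    ```python
    # Sketch: stream article revisions from one gzip-compressed JSON file.
    # Assumes one JSON object per line; the file name is hypothetical.
    import gzip
    import json

    with gzip.open("revisions-0001.json.gz", "rt") as f:
        for line in f:
            rev = json.loads(line)
            print(rev["page_id"], rev["id"], rev["timestamp"])
            break  # peek at the first revision only
    ```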

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created


    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)


  12. Article-level image suggestions evaluation (ALISE) dataset

    • figshare.com
    application/x-gzip
    Updated Jun 7, 2023
    Cite
    Article-level image suggestions evaluation (ALISE) dataset [Dataset]. https://figshare.com/articles/dataset/Article-level_image_suggestions_evaluation_strong_ALISE_strong_dataset/23301860
    Explore at:
    Available download formats: application/x-gzip
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    figshare
    Authors
    Cormac Parle; Marco Fossati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Article-level image suggestions (ALIS, pronounced "Alice") is a distributed computing system that recommends images for Wikipedia articles that don't have one [1]. This publication contains roughly 3,800 human ratings made against ALIS output in multiple Wikipedia language editions.

    Evaluation task

    Data was collected through an evaluation tool [2], with code available at [3]. Given a language, the user is shown a random Wikipedia article and an image suggested by the system; they are then asked to rate the relevance of the image by clicking the Good, Okay, Bad, or Unsure button. The user is also asked to judge whether the image is unsuitable for any reason via the It's ok, It's unsuitable, or Unsure button.

    Content

    The archive holds 2 tab-separated-values (TSV) text files:

    • evaluation_dataset.tsv contains the evaluation data;
    • unillustrated_articles.tsv keeps track of unillustrated Wikipedia articles.

    Evaluation dataset headers

    • id (integer) - identifier used for internal storage
    • unillustratedArticleId (integer) - identifier of the unillustrated Wikipedia article
    • resultFilePage (string) - Wikimedia Commons image file name. Prepend https://commons.wikimedia.org/wiki/ to form a valid Commons URL
    • resultImageUrl (string) - Wikimedia Commons thumbnail URL
    • source (string) - suggestion source. ms = MediaSearch; ima = ALIS prototype algorithm. See [4] and [5] respectively for more details
    • confidence_class (string) - shallow degree of suggestion confidence. Either low, medium, or high
    • rating (integer) - human image relevance rating. 1 = good; 0 = okay; -1 = bad
    • sensitive (integer) - human image suitability rating. 0 = it's okay; 1 = it's unsuitable; -1 = unsure
    • viewCount (integer) - number of times the suggestion was seen by evaluators

    Example: 7357 1827 File:Cuphea_cyanea_strybing.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Cuphea_cyanea_strybing.jpg/800px-Cuphea_cyanea_strybing.jpg ima high 1 0 1

    Unillustrated articles headers

    • id (integer) - identifier used for internal storage. Maps to unillustratedArticleId in the evaluation data
    • langCode (string) - Wikipedia language code
    • pageTitle (string) - Wikipedia article title
    • unsuitableArticleType (integer) - whether the Wikipedia article is suitable for receiving image suggestions. 0 = suitable; 1 = not suitable

    Example: 1827 vi Cuphea_cyanea 0

    References

    [1] https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Image_Suggestions/Data_Pipeline
    [2] https://image-recommendation-test.toolforge.org/
    [3] https://github.com/cormacparle/media-search-signal-test/tree/master/public_html
    [4] https://www.mediawiki.org/wiki/Help:MediaSearch
    [5] https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Image_Suggestions/Data_Pipeline#How_it_works
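
    Since unillustratedArticleId in the evaluation table maps to id in the articles table, the two TSVs join directly; a pandas sketch, assuming both files carry header rows matching the columns listed above:

    ```python
    # Sketch: join the two ALISE tables on the id mapping described above.
    import pandas as pd

    evals = pd.read_csv("evaluation_dataset.tsv", sep="\t")
    articles = pd.read_csv("unillustrated_articles.tsv", sep="\t")
    merged = evals.merge(
        articles, left_on="unillustratedArticleId", right_on="id",
        suffixes=("_eval", "_article"),
    )
    print(merged[["langCode", "pageTitle", "source", "rating", "sensitive"]].head())
    ```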

  13. wikipedia_vi

    • huggingface.co
    Updated Mar 31, 2023
    Cite
    VietGPT (2023). wikipedia_vi [Dataset]. https://huggingface.co/datasets/vietgpt/wikipedia_vi
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 31, 2023
    Dataset authored and provided by
    VietGPT
    Description

    Wikipedia

    Source: https://huggingface.co/datasets/wikipedia
    Num examples: 1,281,412
    Language: Vietnamese

    from datasets import load_dataset

    load_dataset("vietgpt/wikipedia_vi")

  14. Wikipedia data.tsv

    • figshare.com
    txt
    Updated Oct 10, 2023
    Cite
    Mengyi Wei (2023). Wikipedia data.tsv [Dataset]. http://doi.org/10.6084/m9.figshare.24278299.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mengyi Wei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Using Wikipedia data to study AI ethics.

  15. Extended Wikipedia Multimodal Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2019
    Cite
    Oleh Onyshchak (2019). Extended Wikipedia Multimodal Dataset [Dataset]. https://www.kaggle.com/datasets/jacksoncrow/extended-wikipedia-multimodal-dataset/versions/2
    Explore at:
    Available download formats: zip (625488367 bytes)
    Dataset updated
    Nov 28, 2019
    Authors
    Oleh Onyshchak
    Description

    Dataset

    This dataset was created by Oleh Onyshchak


  16. French Wikipedia - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    (2024). French Wikipedia - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/french-wikipedia
    Explore at:
    Dataset updated
    Dec 16, 2024
    Area covered
    French
    Description

    French Wikipedia corpus

  17. Wikimedia editor activity (monthly)

    • figshare.com
    bz2
    Updated Dec 17, 2019
    Cite
    Aaron Halfaker (2019). Wikimedia editor activity (monthly) [Dataset]. http://doi.org/10.6084/m9.figshare.1553296.v1
    Explore at:
    Available download formats: bz2
    Dataset updated
    Dec 17, 2019
    Dataset provided by
    figshare
    Authors
    Aaron Halfaker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a row for every (wiki, user, month) that contains a count of all 'revisions' saved and a count of those revisions that were 'archived' when the page was deleted. For more information, see https://meta.wikimedia.org/wiki/Research:Monthly_wikimedia_editor_activity_dataset

    Fields:

    • wiki - The dbname of the wiki in question ("enwiki" == English Wikipedia, "commonswiki" == Commons)
    • month - YYYYMM
    • user_id - The user's identifier in the local wiki
    • user_name - The user name in the local wiki (from the 'user' table)
    • user_registration - The recorded registration date for the user in the 'user' table
    • archived - The count of deleted revisions saved in this month by this user
    • revisions - The count of all revisions saved in this month by this user (archived or not)
    • attached_method - The method by which this user attached this account to their global account
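
    With one row per (wiki, user, month), monthly aggregates fall out of a groupby; a sketch, assuming a tab-separated layout inside the bz2 archive and a hypothetical file name (pandas decompresses .bz2 transparently):

    ```python
    # Sketch: monthly active-editor counts for English Wikipedia.
    # The file name and tab separator are assumptions; verify against the download.
    import pandas as pd

    df = pd.read_csv("editor_activity.tsv.bz2", sep="\t")
    enwiki = df[df["wiki"] == "enwiki"]
    print(enwiki.groupby("month")["user_id"].nunique().tail())
    ```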

  18. Wikidata Companies Graph

    • data.hellenicdataservice.gr
    • explore.openaire.eu
    • +1 more
    Updated Jun 20, 2019
    Cite
    (2019). Wikidata Companies Graph [Dataset]. https://data.hellenicdataservice.gr/dataset/f7341a62-a513-4931-99b8-0e302dc46d66
    Explore at:
    Dataset updated
    Jun 20, 2019
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains information about commercial organizations (companies) and their relations with other commercial organizations, persons, products, locations, groups and industries. The dataset has the form of a graph. It has been produced by the SmartDataLake project (https://smartdatalake.eu), using data collected from Wikidata (https://www.wikidata.org).

  19. Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Apr 1, 2021
    Cite
    Webb, Geoff (2021). Kaggle Wikipedia Web Traffic Daily Dataset (without Missing Values) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3892918
    Explore at:
    Dataset updated
    Apr 1, 2021
    Dataset provided by
    Montero-Manso, Pablo
    Bergmeir, Christoph
    Godahewa, Rakshitha
    Hyndman, Rob
    Webb, Geoff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used in the Kaggle Wikipedia Web Traffic forecasting competition. It contains 145,063 daily time series representing the number of hits (web traffic) for a set of Wikipedia pages from 2015-07-01 to 2017-09-10.

    The original dataset contains missing values; in this version they have simply been replaced by zeros.
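
    As a rough orientation sketch (assuming the original Kaggle release's wide CSV layout: a "Page" column plus one column per date; the Zenodo copy may be serialized differently, so verify the format after download):

    ```python
    # Sketch: rank pages by total traffic, assuming the Kaggle-style wide CSV.
    # "train_1.csv" is a hypothetical local file name.
    import pandas as pd

    traffic = pd.read_csv("train_1.csv")
    totals = traffic.set_index("Page").sum(axis=1, numeric_only=True)
    print(totals.sort_values(ascending=False).head(10))
    ```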

  20. Global weekly interest in "Wiki" query on Google search 2023-2024

    • statista.com
    Updated Dec 4, 2024
    Cite
    Global weekly interest in "Wiki" query on Google search 2023-2024 [Dataset]. https://www.statista.com/statistics/1428123/wiki-google-search-weekly-worldwide/
    Explore at:
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 3, 2022 - Dec 1, 2024
    Area covered
    Worldwide
    Description

    As of December 2024, global Google search interest in the query "Wiki" stood at 87 on Google's relative search-interest index, the lowest value recorded so far, though within a relatively stable range throughout the analyzed period. Meanwhile, searches for the full "Wikipedia" query were slightly more popular over the same period.
