54 datasets found
  1. wikitext2

    • huggingface.co
    • opendatalab.com
    Updated Oct 21, 2023
    + more versions
    Cite
    Jan Karsten Kuhnke (2023). wikitext2 [Dataset]. https://huggingface.co/datasets/mindchain/wikitext2
    Explore at:
    Dataset updated
    Oct 21, 2023
    Authors
    Jan Karsten Kuhnke
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/mindchain/wikitext2.
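
    As a point of reference for this and the other Hugging Face mirrors listed here, WikiText-2 can typically be loaded with the Hugging Face datasets library. The snippet below is a minimal sketch using the canonical wikitext dataset and its wikitext-2-raw-v1 config; the mindchain/wikitext2 mirror is assumed to expose equivalent splits and a text column.

        # Minimal sketch (not from the dataset card): load WikiText-2 via the
        # Hugging Face `datasets` library. The canonical repo and config names
        # are used here; the mirror above is assumed to behave the same way.
        from datasets import load_dataset

        ds = load_dataset("wikitext", "wikitext-2-raw-v1")
        print(ds)                          # expected splits: train / validation / test
        print(ds["train"][10]["text"])     # each row holds one line of raw article text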

  2. Wikitext-2

    • academictorrents.com
    bittorrent
    Updated Oct 16, 2018
    Cite
    Stephen Merity et al., 2016 (2018). Wikitext-2 [Dataset]. https://academictorrents.com/details/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6
    Explore at:
    Available download formats: bittorrent (4070055)
    Dataset updated
    Oct 16, 2018
    Dataset authored and provided by
    Stephen Merity et al., 2016
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    A subset of Wikitext-103; useful for testing language model training on smaller datasets.

  3. lilac-wikitext-2-raw-v1

    • huggingface.co
    Updated Sep 1, 2025
    Cite
    Lilac AI (2025). lilac-wikitext-2-raw-v1 [Dataset]. https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Lilac AI
    Description

    This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/wikitext

    Lilac dataset config:

        name: wikitext-2-raw-v1
        source:
          dataset_name: wikitext
          config_name: wikitext-2-raw-v1
          source_name: huggingface
        embeddings:
        - path: text
          embedding: gte-small
        signals:
        - path: text
          signal:
            signal_name: near_dup
        - path: text
          signal:
            signal_name: pii
        - path: text
          signal: …

    See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1.

  4. wikitext-2-raw-v1-shuffled

    • huggingface.co
    Cite
    Tongyao, wikitext-2-raw-v1-shuffled [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-2-raw-v1-shuffled
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-2-raw-v1-shuffled"

    More Information needed

  5. wikitext-2-raw-v1-forbidden-titles-train

    • huggingface.co
    Updated Jul 25, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-train [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 25, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-forbidden-titles-train dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. WikiText-103 Dataset

    • paperswithcode.com
    • opendatalab.com
    + more versions
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher, WikiText-103 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-103
    Explore at:
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.
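
    Because WikiText-103 is far larger than WikiText-2, it can be convenient to stream it rather than download it in full. The sketch below is an illustrative example (not part of this listing) that streams the canonical wikitext-103-raw-v1 config from the Hugging Face Hub and prints a few non-empty lines; the repo and config names are assumptions about the canonical Hub dataset, not about this Papers with Code entry.

        # Sketch: stream WikiText-103 instead of materializing the full ~100M-token
        # corpus locally; assumes the canonical "wikitext" dataset on the HF Hub.
        from datasets import load_dataset

        stream = load_dataset("wikitext", "wikitext-103-raw-v1",
                              split="train", streaming=True)
        shown = 0
        for row in stream:
            if row["text"].strip():        # skip blank separator lines
                print(row["text"][:80])
                shown += 1
            if shown >= 5:
                break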

  7. wikitext-2

    • huggingface.co
    + more versions
    Cite
    Mika Senghaas, wikitext-2 [Dataset]. https://huggingface.co/datasets/mikasenghaas/wikitext-2
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Mika Senghaas
    Description

    mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. wikitext-2-sample

    • huggingface.co
    Cite
    Clara Na, wikitext-2-sample [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-sample
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Clara Na
    Description

    claran/wikitext-2-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. wikitext-tiny

    • huggingface.co
    Updated Aug 31, 2023
    Cite
    yujiepan (2023). wikitext-tiny [Dataset]. https://huggingface.co/datasets/yujiepan/wikitext-tiny
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2023
    Authors
    yujiepan
    Description

    This dataset is sampled from wikitext/wikitext-2-v1/train. Code to generate this dataset:

        import datasets

        dataset = datasets.load_dataset('wikitext', 'wikitext-2-v1')

        # Collect the indices of the first 24 short, header-free lines
        # (between 9 and 16 space-separated tokens, no '=' section markers).
        selected = []
        i = -1
        while len(selected) < 24:
            i += 1
            text = dataset['train'][i]['text']
            if 8 < len(text.split(' ')) <= 16 and '=' not in text:
                selected.append(i)

        tiny_dataset = dataset['train'].select(selected)

  10. WikiText-103 & 2

    • live.european-language-grid.eu
    txt
    Updated Dec 30, 2016
    Cite
    (2016). WikiText-103 & 2 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5169
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 30, 2016
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset contains word and character level tokens extracted from Wikipedia

  11. wikitext-2-second-half

    • huggingface.co
    + more versions
    Cite
    Hao Li, wikitext-2-second-half [Dataset]. https://huggingface.co/datasets/tartspuppy/wikitext-2-second-half
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Hao Li
    Description

    tartspuppy/wikitext-2-second-half dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Word-level valid and test perplexity on WikiText-2.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Word-level valid and test perplexity on WikiText-2. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Word-level valid and test perplexity on WikiText-2.

  13. wikitext-2-nonulls-sample-v2

    • huggingface.co
    Updated Sep 15, 2008
    + more versions
    Cite
    Clara Na (2008). wikitext-2-nonulls-sample-v2 [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-nonulls-sample-v2
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 15, 2008
    Authors
    Clara Na
    Description

    claran/wikitext-2-nonulls-sample-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Data from: WikiHist.html: English Wikipedia's Full Revision History in HTML...

    • zenodo.org
    application/gzip, zip
    Updated Jun 8, 2020
    Cite
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski (2020). WikiHist.html: English Wikipedia's Full Revision History in HTML Format [Dataset]. http://doi.org/10.5281/zenodo.3605388
    Explore at:
    Available download formats: application/gzip, zip
    Dataset updated
    Jun 8, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Blagoj Mitrevski; Tiziano Piccardi; Tiziano Piccardi; Robert West; Robert West; Blagoj Mitrevski
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    Introduction

    Wikipedia is written in the wikitext markup language. When serving content, the MediaWiki software that powers Wikipedia parses wikitext to HTML, thereby inserting additional content by expanding macros (templates and modules). Hence, researchers who intend to analyze Wikipedia as seen by its readers should work with HTML, rather than wikitext. Since Wikipedia’s revision history is made publicly available by the Wikimedia Foundation exclusively in wikitext format, researchers have had to produce HTML themselves, typically by using Wikipedia’s REST API for ad-hoc wikitext-to-HTML parsing. This approach, however, (1) does not scale to very large amounts of data and (2) does not correctly expand macros in historical article revisions.

    We have solved these problems by developing a parallelized architecture for parsing massive amounts of wikitext using local instances of MediaWiki, enhanced with the capacity for correct historical macro expansion. By deploying our system, we produce and hereby release WikiHist.html, English Wikipedia’s full revision history in HTML format. It comprises the HTML content of 580M revisions of 5.8M articles generated from the full English Wikipedia history spanning 18 years from 1 January 2001 to 1 March 2019. Boilerplate content such as page headers, footers, and navigation sidebars is not included in the HTML.

    For more details, please refer to the description below and to the dataset paper:
    Blagoj Mitrevski, Tiziano Piccardi, and Robert West: WikiHist.html: English Wikipedia’s Full Revision History in HTML Format. In Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020.
    https://arxiv.org/abs/2001.10256

    When using the dataset, please cite the above paper.

    Dataset summary

    The dataset consists of three parts:

    1. English Wikipedia’s full revision history parsed to HTML,
    2. a table of the creation times of all Wikipedia pages (page_creation_times.json.gz),
    3. a table that allows for resolving redirects for any point in time (redirect_history.json.gz).

    Part 1 is our main contribution, while parts 2 and 3 contain complementary information that can aid researchers in their analyses.

    Getting the data

    Parts 2 and 3 are hosted in this Zenodo repository. Part 1 is 7 TB, too large for Zenodo, and is therefore hosted externally on the Internet Archive. Multiple options are available for downloading part 1.

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format
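
    To make the layout above concrete, the sketch below shows one way to peek at a single downloaded chunk of Part 1. The file name is a placeholder, and the code assumes one JSON object (one revision) per line in each gzipped file, following the "each row" wording above; if a file instead holds a single JSON array, swap the loop for json.load.

        # Hedged sketch for reading one WikiHist.html Part 1 file; the path is
        # hypothetical, and the one-object-per-line assumption should be checked
        # against the actual data.
        import gzip
        import json

        path = "enwiki-20190301-sample-chunk.json.gz"  # placeholder local file name

        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                rev = json.loads(line)
                # Fields as documented above: id, parentid, timestamp, title,
                # page_id, html, ...
                print(rev["page_id"], rev["title"], rev["timestamp"], len(rev["html"]))
                break  # just inspect the first revision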

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created
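
    For example, a quick way to use this file for the blue-or-red-link check mentioned above (a sketch, assuming one JSON object per line and ISO 8601 timestamps that compare correctly as strings):

        # Hedged sketch: did a page exist (blue link) at a given time?
        import gzip
        import json

        created = {}  # page title -> creation timestamp
        with gzip.open("page_creation_times.json.gz", "rt", encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                created[rec["title"]] = rec["timestamp"]

        def link_was_blue(title, timestamp):
            """True if `title` already existed at `timestamp`, so a link to it was blue."""
            return title in created and created[title] <= timestamp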

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)
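
    As an illustration of how Part 3 can be used, the sketch below loads the redirect history and answers "what did this title redirect to at time T?". It again assumes one JSON object per line and lexicographically comparable ISO 8601 timestamps; both assumptions should be checked against the actual files.

        # Hedged sketch: resolve redirects at a point in time using Part 3.
        import gzip
        import json

        redirects = {}  # source title -> list of (timestamp, target title)
        with gzip.open("redirect_history.json.gz", "rt", encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                target = rec["redirect"][0]  # 1st array item is the target title
                redirects.setdefault(rec["title"], []).append((rec["timestamp"], target))
        for history in redirects.values():
            history.sort()

        def resolve(title, timestamp):
            """Return what `title` redirected to as of `timestamp`, or None."""
            best = None
            for ts, target in redirects.get(title, []):
                if ts <= timestamp:
                    best = target
                else:
                    break
            return best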

    The repository also contains two additional files, metadata.zip and mysql_database.zip. These two files are not part of WikiHist.html per se, and most users will not need to download them manually. The file metadata.zip is required by the download script (and will be fetched by the script automatically), and mysql_database.zip is required by the code used to produce WikiHist.html. The code that uses these files is hosted at GitHub, but the files are too big for GitHub and are therefore hosted here.

    WikiHist.html was produced by parsing the 1 March 2019 dump (https://dumps.wikimedia.org/enwiki/20190301) from wikitext to HTML. That old dump is no longer available on Wikimedia's servers, so we make a copy available at https://archive.org/details/enwiki-20190301-original-full-history-dump_dlab.

  15. wikitext-2-noheader-sample-v2

    • huggingface.co
    + more versions
    Cite
    Clara Na, wikitext-2-noheader-sample-v2 [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-noheader-sample-v2
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Clara Na
    Description

    claran/wikitext-2-noheader-sample-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. The pseudocode of the learning rate back-tracking.

    • figshare.com
    xls
    Updated Apr 15, 2021
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2021). The pseudocode of the learning rate back-tracking. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Apr 15, 2021
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The pseudocode of the learning rate back-tracking.

  17. wikitext-2-raw-v1-preprocessed-1k

    • huggingface.co
    Cite
    DevQuasar, wikitext-2-raw-v1-preprocessed-1k [Dataset]. https://huggingface.co/datasets/DevQuasar/wikitext-2-raw-v1-preprocessed-1k
    Explore at:
    Dataset authored and provided by
    DevQuasar
    Description

    DevQuasar/wikitext-2-raw-v1-preprocessed-1k dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. Perplexity of different initializations and improvement strategies.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Perplexity of different initializations and improvement strategies. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perplexity of different initializations and improvement strategies.

  19. wikitext-2-raw-v1-forbidden-titles-1k

    • huggingface.co
    Updated Aug 7, 2024
    + more versions
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-1k [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Wikipedia Change Metadata

    • redivis.com
    application/jsonl +7
    Updated Sep 22, 2021
    Cite
    Stanford Graduate School of Business Library (2021). Wikipedia Change Metadata [Dataset]. https://redivis.com/datasets/1ky2-8b1pvrv76
    Explore at:
    Available download formats: application/jsonl, avro, parquet, arrow, spss, stata, csv, sas
    Dataset updated
    Sep 22, 2021
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Graduate School of Business Library
    Time period covered
    Jan 16, 2001 - Mar 1, 2019
    Description

    Abstract

    The Wikipedia Change Metadata is a curation of article changes, updates, and edits over time.

    Methodology

    This dataset includes the history (2001 to 2019) of Wikipedia edits and collaboration elements (e.g. administrative decisions, elections, communication between …

    Usage

    This dataset is freely available to any Stanford researcher and does not require a Data Use Agreement (DUA).

    Documentation

    Source for details below: https://zenodo.org/record/3605388#.YWitsdnML0o

    Dataset details

    Part 1: HTML revision history
    The data is split into 558 directories, named enwiki-20190301-pages-meta-history$1.xml-p$2p$3, where $1 ranges from 1 to 27, and p$2p$3 indicates that the directory contains revisions for pages with ids between $2 and $3. (This naming scheme directly mirrors that of the wikitext revision history from which WikiHist.html was derived.) Each directory contains a collection of gzip-compressed JSON files, each containing 1,000 HTML article revisions. Each row in the gzipped JSON files represents one article revision. Rows are sorted by page id, and revisions of the same page are sorted by revision id. We include all revision information from the original wikitext dump, the only difference being that we replace the revision’s wikitext content with its parsed HTML version (and that we store the data in JSON rather than XML):

    • id: id of this revision
    • parentid: id of revision modified by this revision
    • timestamp: time when revision was made
    • cont_username: username of contributor
    • cont_id: id of contributor
    • cont_ip: IP address of contributor
    • comment: comment made by contributor
    • model: content model (usually "wikitext")
    • format: content format (usually "text/x-wiki")
    • sha1: SHA-1 hash
    • title: page title
    • ns: namespace (always 0)
    • page_id: page id
    • redirect_title: if page is redirect, title of target page
    • html: revision content in HTML format

    Part 2: Page creation times (page_creation_times.json.gz)

    This JSON file specifies the creation time of each English Wikipedia page. It can, e.g., be used to determine if a wiki link was blue or red at a specific time in the past. Format:

    • page_id: page id
    • title: page title
    • ns: namespace (0 for articles)
    • timestamp: time when page was created

    Part 3: Redirect history (redirect_history.json.gz)

    This JSON file specifies all revisions corresponding to redirects, as well as the target page to which the respective page redirected at the time of the revision. This information is useful for reconstructing Wikipedia's link network at any time in the past. Format:

    • page_id: page id of redirect source
    • title: page title of redirect source
    • ns: namespace (0 for articles)
    • revision_id: revision id of redirect source
    • timestamp: time at which redirect became active
    • redirect: page title of redirect target (in 1st item of array; 2nd item can be ignored)
