36 datasets found
  1. WikiText-2 Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    Updated Dec 13, 2023
    Cite
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher (2023). WikiText-2 Dataset [Dataset]. https://paperswithcode.com/dataset/wikitext-2
    Dataset updated
    Dec 13, 2023
    Authors
    Stephen Merity; Caiming Xiong; James Bradbury; Richard Socher
    Description

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

    Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.
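
The retained-versus-removed distinction described above can be sketched in a few lines. The `ptb_style` helper below is an illustrative approximation of PTB-style preprocessing (lowercasing, number placeholders, punctuation removal), not the actual pipeline used to build either corpus:

```python
import re

def wikitext_style(text):
    # WikiText keeps original case, punctuation and numbers,
    # so tokenization is essentially a whitespace split.
    return text.split()

def ptb_style(text):
    # Rough approximation of PTB preprocessing (illustrative only):
    # lowercase, map numbers to a placeholder, drop punctuation.
    text = text.lower()
    text = re.sub(r"\d+(\.\d+)?", "N", text)
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

sample = "The S&P 500 rose 1.5% on Tuesday."
print(wikitext_style(sample))  # case, punctuation and numbers intact
print(ptb_style(sample))       # lowercased, numbers -> N, punctuation gone
```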

  2. wikitext

    • huggingface.co
    Cite
    Salesforce, wikitext [Dataset]. https://huggingface.co/datasets/Salesforce/wikitext
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "wikitext"

      Dataset Summary
    

    The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/wikitext.
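
For readers who want to pull the corpus from the Hub, a minimal sketch using the `datasets` library (the repository and config names below are the ones published for this dataset; the actual download requires network access, so it is kept inside a function):

```python
# The four published configurations of the Salesforce/wikitext repository.
WIKITEXT_CONFIGS = [
    "wikitext-2-v1",        # preprocessed WikiText-2
    "wikitext-2-raw-v1",    # raw WikiText-2 (case/punctuation preserved)
    "wikitext-103-v1",
    "wikitext-103-raw-v1",
]

def load_wikitext2(split="train"):
    # Deferred import and no module-level call: fetching the data
    # needs network access to huggingface.co.
    from datasets import load_dataset
    return load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1", split=split)
```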

  3. Wikitext-2

    • academictorrents.com
    bittorrent
    Updated Oct 16, 2018
    Cite
    Stephen Merity et al., 2016 (2018). Wikitext-2 [Dataset]. https://academictorrents.com/details/ac7ffa98b66427246a316a81b2ea31c9b58ea5b6
    Available download formats: bittorrent (4070055)
    Dataset updated
    Oct 16, 2018
    Dataset authored and provided by
    Stephen Merity et al., 2016
    License

    No license specified (https://academictorrents.com/nolicensespecified)

    Description

    A subset of Wikitext-103; useful for testing language model training on smaller datasets.

  4. lilac-wikitext-2-raw-v1

    • huggingface.co
    Updated Aug 21, 2023
    Cite
    Lilac AI (2023). lilac-wikitext-2-raw-v1 [Dataset]. https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1
    Explore at: Croissant
    Dataset updated
    Aug 21, 2023
    Dataset authored and provided by
    Lilac AI
    Description

    This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/wikitext

    Lilac dataset config:

      name: wikitext-2-raw-v1
      source:
        dataset_name: wikitext
        config_name: wikitext-2-raw-v1
        source_name: huggingface
      embeddings:
      - path: text
        embedding: gte-small
      signals:
      - path: text
        signal:
          signal_name: near_dup
      - path: text
        signal:
          signal_name: pii
      - path: text
        signal: …

    See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-wikitext-2-raw-v1.

  5. wikitext-2-raw-v1-shuffled

    • huggingface.co
    Cite
    Tongyao, wikitext-2-raw-v1-shuffled [Dataset]. https://huggingface.co/datasets/tyzhu/wikitext-2-raw-v1-shuffled
    Explore at: Croissant
    Authors
    Tongyao
    Description

    Dataset Card for "wikitext-2-raw-v1-shuffled"

    More Information needed

  6. wikitext2

    • kaggle.com
    zip
    Updated Nov 8, 2021
    Cite
    Nazhura (2021). wikitext2 [Dataset]. https://www.kaggle.com/datasets/nazhura/wikitext2
    Available download formats: zip (4542021 bytes)
    Dataset updated
    Nov 8, 2021
    Authors
    Nazhura
    Description

    Dataset

    This dataset was created by Nazhura


  7. wikitext-2

    • huggingface.co
    Cite
    Mika Senghaas, wikitext-2 [Dataset]. https://huggingface.co/datasets/mikasenghaas/wikitext-2
    Explore at: Croissant
    Authors
    Mika Senghaas
    Description

    mikasenghaas/wikitext-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. wikitext-2-nonulls-sample-v2

    • huggingface.co
    Updated Sep 15, 2008
    Cite
    Clara Na (2008). wikitext-2-nonulls-sample-v2 [Dataset]. https://huggingface.co/datasets/claran/wikitext-2-nonulls-sample-v2
    Explore at: Croissant
    Dataset updated
    Sep 15, 2008
    Authors
    Clara Na
    Description

    claran/wikitext-2-nonulls-sample-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. wikitext-2-raw-bytepair

    • huggingface.co
    Updated Dec 9, 2024
    Cite
    Alignment Lab AI (2024). wikitext-2-raw-bytepair [Dataset]. https://huggingface.co/datasets/Alignment-Lab-AI/wikitext-2-raw-bytepair
    Explore at: Croissant
    Dataset updated
    Dec 9, 2024
    Authors
    Alignment Lab AI
    Description

    Alignment-Lab-AI/wikitext-2-raw-bytepair dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. WikiText-103 & 2

    • live.european-language-grid.eu
    txt
    Updated Dec 30, 2016
    Cite
    (2016). WikiText-103 & 2 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/5169
    Available download formats: txt
    Dataset updated
    Dec 30, 2016
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    The dataset contains word- and character-level tokens extracted from Wikipedia.

  11. wikitext-2-raw-v1-forbidden-titles-1k

    • huggingface.co
    Updated Aug 7, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-1k [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k
    Explore at: Croissant
    Dataset updated
    Aug 7, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-forbidden-titles-1k dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Statistics information of the PTB and WikiText-2 datasets.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Statistics information of the PTB and WikiText-2 datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t001
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistics information of the PTB and WikiText-2 datasets.

  13. wikitext-2-raw-v1-forbidden-titles-5k

    • huggingface.co
    Updated Jul 24, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-forbidden-titles-5k [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-forbidden-titles-5k
    Explore at: Croissant
    Dataset updated
    Jul 24, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-forbidden-titles-5k dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity...

    • huggingface.co
    Updated Aug 3, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity
    Explore at: Croissant
    Dataset updated
    Aug 3, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PI_KFI_-wikipedia-dpr-k-1-OP-False-train-PI_KFI_IK-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. Word-level valid and test perplexity on PTB.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Word-level valid and test perplexity on PTB. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t009
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Word-level valid and test perplexity on PTB.
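
As a reminder of the metric behind tables like this one: word-level perplexity is the exponential of the average negative log-likelihood per word. A small self-contained sketch:

```python
import math

def perplexity(neg_log_likelihoods):
    # exp of the mean per-word negative log-likelihood (natural log).
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# A model that assigns every word probability 1/100 scores perplexity 100.
uniform_nlls = [math.log(100)] * 5
print(perplexity(uniform_nlls))  # approximately 100
```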

  16. The pseudocode of the learning rate back-tracking.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). The pseudocode of the learning rate back-tracking. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t006
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The pseudocode of the learning rate back-tracking.
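
The pseudocode itself only ships as an xls file, but the underlying idea is a common training scheme, sketched generically below (an assumed shape, not necessarily the authors' exact algorithm): train an epoch, and if validation loss worsens, restore the previous best checkpoint and shrink the learning rate.

```python
def train_with_backtracking(train_epoch, evaluate, save, restore,
                            lr=1.0, decay=0.5, epochs=10):
    # Generic learning-rate back-tracking loop (a sketch of the common
    # scheme; the callbacks train_epoch/evaluate/save/restore are
    # placeholders supplied by the caller).
    best = float("inf")
    for _ in range(epochs):
        train_epoch(lr)
        loss = evaluate()
        if loss < best:
            best = loss
            save()       # keep the improved checkpoint
        else:
            restore()    # back-track to the last good weights
            lr *= decay  # and retry with a smaller step size
    return best, lr
```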

  17. The pseudocode of the gradient batch training algorithm.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). The pseudocode of the gradient batch training algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t004
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The pseudocode of the gradient batch training algorithm.
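
The table is likewise an xls artifact; the family of algorithms it names, mini-batch gradient training, can be sketched generically (an assumed shape, not the authors' exact pseudocode): shuffle the data each epoch, split it into batches, and step along the batch-averaged gradient.

```python
import random

def minibatch_sgd(grad, params, data, lr=0.1, batch_size=2, epochs=50, seed=0):
    # Generic mini-batch gradient training sketch. `grad(params, x)` must
    # return the per-sample gradient as a list, one entry per parameter.
    rng = random.Random(seed)
    data = list(data)  # local copy so the caller's list is untouched
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            per_param = zip(*(grad(params, x) for x in batch))
            avg = [sum(g) / len(batch) for g in per_param]
            params = [p - lr * g for p, g in zip(params, avg)]
    return params

# Toy use: minimize mean squared distance to the points 1, 2, 3;
# the minimizer is their mean, 2.0.
fit = minibatch_sgd(lambda p, x: [2 * (p[0] - x)], [0.0], [1.0, 2.0, 3.0])
print(fit)
```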

  18. wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity

    • huggingface.co
    Updated Aug 4, 2024
    Cite
    ActiveRetrieval (2024). wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity
    Explore at: Croissant
    Dataset updated
    Aug 4, 2024
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP-train-perplexity dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True...

    • huggingface.co
    Cite
    ActiveRetrieval, wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True [Dataset]. https://huggingface.co/datasets/Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True
    Explore at: Croissant
    Dataset authored and provided by
    ActiveRetrieval
    Description

    Self-GRIT/wikitext-2-raw-v1-preprocessed-200-PP_RIP_PTP_claude-wikipedia-dpr-k-1-OP-True dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Perplexity of different initializations and improvement strategies.

    • figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan (2023). Perplexity of different initializations and improvement strategies. [Dataset]. http://doi.org/10.1371/journal.pone.0249820.t007
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lu Yuwen; Shuyu Chen; Xiaohan Yuan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Perplexity of different initializations and improvement strategies.
