3 datasets found
  1. MNBVC

    • huggingface.co
    Updated Jan 1, 2023
    Cite
    Language Intelligence and Word Understanding Research Group (LIWU) (2023). MNBVC [Dataset]. https://huggingface.co/datasets/liwu/MNBVC
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    Language Intelligence and Word Understanding Research Group (LIWU)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MNBVC: Massive Never-ending BT Vast Chinese corpus

  2. liwu-MNBVC

    • huggingface.co
    Updated Jan 1, 2023
    Cite
    ab10 (2023). liwu-MNBVC [Dataset]. https://huggingface.co/datasets/botp/liwu-MNBVC
    Dataset updated
    Jan 1, 2023
    Dataset authored and provided by
    ab10
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for MNBVC

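      Dataset Introduction

    On 2023.1.1, the Liwu community, the oldest and most mysterious community on the Chinese internet (bar none), solemnly announced that, under the leadership of its wise and mighty administrators, it would play to the community's strengths (which are many) and help the open-source community maintain, over the long term, the largest corpus of Chinese internet text. The MNBVC dataset on Huggingface is being updated gradually; see https://github.com/esbatmop/MNBVC for more data that has not yet been fully cleaned. It can be loaded with the following script:

        from datasets import load_dataset

        dataset = load_dataset("liwu/MNBVC", 'law_judgement', split='train', streaming=True)

        next(iter(dataset))  # get the first line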
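    With streaming=True the records arrive lazily, so looking at more than the first one usually means slicing the iterator rather than loading everything. A minimal sketch of that idea follows. It assumes nothing about the real record schema: `peek` is a hypothetical helper (not part of the `datasets` library), and `fake_stream` with its `text` field is a stand-in so the example runs without network access. A streaming dataset returned by load_dataset could presumably be passed in its place.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records, pulling at most n items from the stream."""
    return list(islice(records, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streaming dataset; the 'text' field is hypothetical."""
    i = 0
    while True:  # endless, like a corpus too large to load at once
        yield {"text": f"record {i}"}
        i += 1


first_three = peek(fake_stream(), n=3)
print([r["text"] for r in first_three])
```

    Because `islice` stops after n items, the helper is safe to use even on a stream that never ends.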

      Data Subsets

    The MNBVC dataset contains several subsets:

    law_judgement: text from legal documents.
    gov_xuexiqiangguo: text from Xuexi Qiangguo (学习强国).
    gov_report:…

    See the full description on the dataset page: https://huggingface.co/datasets/botp/liwu-MNBVC.

  3. quora_qa_raw

    • huggingface.co
    Updated Jun 18, 2024
    Cite
    Javi Lau (2024). quora_qa_raw [Dataset]. https://huggingface.co/datasets/LxYxvv/quora_qa_raw
    Dataset updated
    Jun 18, 2024
    Authors
    Javi Lau
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    QUORA_ONE_MANY_QA

    This dataset is derived from question data on quora.com. Each record is a question with multiple answers. The project supplies data ("gas") for MNBVC.

      STATISTICS
    

    Raw data size

    100w: 16G
    200w: 17G
    300w: 15G
    400w: 11G
    500w: 10G
    600w: 9G
    700w: 9G
    800w: 7.5G
    900w: 7G
    1000w: 6.5G
    Updating...
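    The raw data sizes can be totaled with a short script. This is a rough sketch under two assumptions the source does not confirm: that each label/size pair describes a separate shard (rather than a cumulative figure), and that `G` denotes gigabytes. `RAW_SIZES` and `parse_sizes` are names introduced here for illustration only.

```python
from typing import Dict

# Size listing copied from the dataset card, without the trailing "Updating...".
RAW_SIZES = (
    "100w 16G 200w 17G 300w 15G 400w 11G 500w 10G "
    "600w 9G 700w 9G 800w 7.5G 900w 7G 1000w 6.5G"
)


def parse_sizes(listing: str) -> Dict[str, float]:
    """Map each label (e.g. '100w') to its numeric size, dropping the 'G' suffix."""
    tokens = listing.split()
    labels, sizes = tokens[0::2], tokens[1::2]  # even positions: labels, odd: sizes
    return {label: float(size.rstrip("G")) for label, size in zip(labels, sizes)}


sizes = parse_sizes(RAW_SIZES)
total = sum(sizes.values())  # assumes the entries are separate shards
print(f"{len(sizes)} shards, {total:g}G in total")
# prints "10 shards, 108G in total"
```

    Since the card says the listing is still being updated, the total only reflects the ten entries shown.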

