3 datasets found

MNBVC
huggingface.co
Updated Jan 1, 2023
Cite
Language Intelligence and Word Understanding Research Group (LIWU) (2023). MNBVC [Dataset]. https://huggingface.co/datasets/liwu/MNBVC
Dataset updated
Jan 1, 2023
Dataset authored and provided by
Language Intelligence and Word Understanding Research Group (LIWU)
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
MNBVC: Massive Never-ending BT Vast Chinese corpus
liwu-MNBVC
huggingface.co
Updated Jan 1, 2023
Cite
ab10 (2023). liwu-MNBVC [Dataset]. https://huggingface.co/datasets/botp/liwu-MNBVC
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant).
Dataset updated
Jan 1, 2023
Dataset authored and provided by
ab10
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for MNBVC

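Dataset Introduction

The Liwu community, the oldest and most mysterious community on the Chinese internet (bar none), solemnly announced on 2023.1.1 that, under the leadership of the wise and mighty Liwu admins, it is determined to play to the community's strengths (it is strong everywhere) and help the open-source community maintain, over the long term, the largest corpus of Chinese internet text. The MNBVC dataset on Huggingface is being updated gradually; please go to https://github.com/esbatmop/MNBVC for more data that has not yet been cleaned. The dataset can be loaded with the following script:

    from datasets import load_dataset
    dataset = load_dataset("liwu/MNBVC", 'law_judgement', split='train', streaming=True)

    next(iter(dataset)) # get the first line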
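The loading script in the description reads one record from a streamed dataset. As a minimal sketch of looking at the first few records instead, the helper below takes any iterable and returns its first n items. The name `peek`, the stand-in stream, and the `"text"` field are hypothetical illustrations, not part of the MNBVC card; with the real dataset, the object returned by `load_dataset(..., streaming=True)` would be passed in place of the stand-in.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of an iterable as a list.

    Only n items are pulled from the iterable, so this suits a streamed
    dataset where loading everything is not practical.
    """
    return list(islice(records, n))


# Stand-in for a streamed dataset; the "text" field is a hypothetical example.
sample_stream = iter([{"text": "a"}, {"text": "b"}, {"text": "c"}, {"text": "d"}])

first_two = peek(sample_stream, 2)
print(first_two)
```

Because `islice` stops after n items, the rest of the stream stays unread and can still be iterated afterwards.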

Data Subsets

The MNBVC dataset contains several subsets:

law_judgement: text from legal documents.
gov_xuexiqiangguo: text from Xuexi Qiangguo (学习强国).
gov_report:… See the full description on the dataset page: https://huggingface.co/datasets/botp/liwu-MNBVC.
quora_qa_raw
huggingface.co
Updated Jun 18, 2024
Cite
Javi Lau (2024). quora_qa_raw [Dataset]. https://huggingface.co/datasets/LxYxvv/quora_qa_raw
Explore at: Croissant
Dataset updated
Jun 18, 2024
Authors
Javi Lau
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
QUORA_ONE_MANY_QA

This dataset is derived from question data on quora.com. Each entry is one question with multiple answers. The project supplies data for MNBVC.

STATISTICS

Raw data size

100w    16G
200w    17G
300w    15G
400w    11G
500w    10G
600w    9G
700w    9G
800w    7.5G
900w    7G
1000w   6.5G
Updating...
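As a rough check on the figures above, the listed sizes can be added up. The sketch below assumes that "G" means gigabytes and that "w" is the common shorthand for 万 (10,000), so "100w" would label the batch ending at 1 million; neither reading is confirmed by the dataset card. Since the card says it is still updating, the total is only a snapshot of the ten rows shown.

```python
# Raw data sizes in GB, copied from the ten rows listed above.
# Reading "G" as gigabytes is an assumption.
raw_sizes_gb = {
    "100w": 16,
    "200w": 17,
    "300w": 15,
    "400w": 11,
    "500w": 10,
    "600w": 9,
    "700w": 9,
    "800w": 7.5,
    "900w": 7,
    "1000w": 6.5,
}

total_gb = sum(raw_sizes_gb.values())
print(f"{len(raw_sizes_gb)} batches, {total_gb} GB in total")
```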