Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "face-celeb-vietnamese"
Dataset Summary
This dataset contains information on over 8,000 samples of well-known Vietnamese individuals, categorized into three professions: singers, actors, and beauty queens. The dataset includes data on more than 100 celebrities in each of the three job categories.
Languages
Vietnamese: The label is used to indicate the name of celebrities in Vietnamese.
Dataset Structure
The image and Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/fptudsc/face-celeb-vietnamese.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Gross Domestic Product (GDP) in Vietnam was worth 476.39 billion US dollars in 2024, according to official data from the World Bank. The GDP value of Vietnam represents 0.45 percent of the world economy. This dataset provides the latest reported value for - Vietnam GDP - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vietnam Exports to United States was US$97.07 Billion during 2023, according to the United Nations COMTRADE database on international trade. Vietnam Exports to United States - data, historical chart and statistics - was last updated on July of 2025.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vietnam Imports from United States was US$13.83 Billion during 2023, according to the United Nations COMTRADE database on international trade. Vietnam Imports from United States - data, historical chart and statistics - was last updated on August of 2025.
https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
VietVault
VietVault is a large-scale Vietnamese language corpus, carefully filtered and curated from Common Crawl dataset dumps prior to 2023. This dataset is designed to serve as a high-quality resource for Vietnamese language model pretraining and various natural language processing tasks.
Dataset Statistics
Size: 80GB of raw text Language: Vietnamese Source: Common Crawl dataset (all dumps in 2013-2023) Preprocessing: Cleaned, deduplicated, filtered for Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/vietvault.
Estimated number of Vietnamese individuals aged 50+ years who are eligible for treatment by the US National Osteoporosis Foundation Guidelines.
This dataset includes the primary language of newly Medi-Cal eligible individuals who identified their primary language as English, Spanish, Vietnamese, Mandarin, Cantonese, Arabic, Other Non-English, Armenian, Russian, Farsi, Korean, Tagalog, Other Chinese Languages, Hmong, Cambodian, Portuguese, Lao, French, Thai, Japanese, Samoan, Other Sign Language, American Sign Language (ASL), Turkish, Ilacano, Mien, Italian, Hebrew, and Polish, by reporting period. The primary language data is from the Medi-Cal Eligibility Data System (MEDS) and includes eligible individuals without prior Medi-Cal eligibility. This dataset is part of the public reporting requirements set forth in California Welfare and Institutions Code 14102.5.
Data platform
Performances
Citation
Please use the following citation if you intend to use our dataset for training or evaluation: @misc{VietnameseMedBench, title={VM14K: First Vietnamese Medical Benchmark}, author={Anonymus}, year={2025}, howpublished = {\url{https://huggingface.co/datasets/venera-ai/VietnameseMedBench}} }
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vietnam recorded a trade surplus of 2.83 USD Billion in June of 2025. This dataset provides the latest reported value for - Vietnam Balance of Trade - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
5CD-AI/Vietnamese-Intel-orca_dpo_pairs-gg-translated dataset hosted on Hugging Face and contributed by the HF Datasets community
Wikipedia
Source: https://huggingface.co/datasets/wikipedia Num examples: 1,281,412 Language: Vietnamese
from datasets import load_dataset
load_dataset("tdtunlp/wikipedia_vi")
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Preprocessed Dataset from IWSLT'15 English-Vietnamese machine translation: English-Vietnamese.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Vietnamese UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model. The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5%… See the full description on the dataset page: https://huggingface.co/datasets/nguyenphuthien/vietnamese_ultrachat_200k.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
MCP Cloudwords VIVOS Processed ASR Dataset
This dataset contains Vietnamese speech data processed and prepared by MCP Cloudwords for ASR tasks. It includes audio files and corresponding transcriptions, divided into train and test sets.
Replace YOUR_USERNAME/YOUR_DATASET_NAME with your actual Hugging Face username and dataset name
dataset = load_dataset("YOUR_USERNAME/YOUR_DATASET_NAME", trust_remote_code=True)
Display dataset information
print(dataset)… See the full description on the dataset page: https://huggingface.co/datasets/SMEW-TECH/asr-vi.
5CD-AI/Vietnamese-openbmb-RLAIF-V-Dataset-gg-translated dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset is from 5CD-AI/Vietnamese-c-s-ale-alpaca-gpt4-data-gg-translated, formatted as dialogues for speed and ease of use. Many thanks to 5CD-AI for releasing it. Importantly, this format is easy to use via the default chat template of transformers, meaning you can use huggingface/alignment-handbook immediately, unsloth.
Structure
View online through viewer.
Note
We advise you to reconsider before use, thank you. If you find it useful… See the full description on the dataset page: https://huggingface.co/datasets/lamhieu/alpaca_gpt4_dialogue_en.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset is from 5CD-AI/Vietnamese-mabryCodes-tiny-cot-alpaca-gg-translated, formatted as dialogues for speed and ease of use. Many thanks to author for releasing it. Importantly, this format is easy to use via the default chat template of transformers, meaning you can use huggingface/alignment-handbook immediately, unsloth.
Structure
View online through viewer.
Note
We advise you to reconsider before use, thank you. If you find it useful… See the full description on the dataset page: https://huggingface.co/datasets/lamhieu/mabrycodes_dialogue_en.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FLEURS
Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. Training sets have around 10 hours of supervision. Speakers of the train sets are different than speakers from the dev/test sets. Multilingual fine-tuning is used and ”unit error rate” (characters, signs) of all languages is averaged. Languages and results are also grouped into seven… See the full description on the dataset page: https://huggingface.co/datasets/google/fleurs.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Vietnamese Function Calling Benchmark
RAG applications for Vietnamese chatbot systems are becoming increasingly popular. Many LLM models already support FC for Vietnamese, but there is no common and comprehensive benchmark yet. Today, I am releasing a benchmark for the Vietnamese Function Calling task. I hope this will serve as a standard for product teams to choose models in a reasonable and appropriate way. Dataset Details:
Data size: 2899 single-turn funcation calling samples Domains:… See the full description on the dataset page: https://huggingface.co/datasets/phamhai/Vietnamese-Function-Calling-Test.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "face-celeb-vietnamese"
Dataset Summary
This dataset contains information on over 8,000 samples of well-known Vietnamese individuals, categorized into three professions: singers, actors, and beauty queens. The dataset includes data on more than 100 celebrities in each of the three job categories.
Languages
Vietnamese: The label is used to indicate the name of celebrities in Vietnamese.
Dataset Structure
The image and Vietnamese… See the full description on the dataset page: https://huggingface.co/datasets/fptudsc/face-celeb-vietnamese.