4 datasets found
  1. uit_viic

    • huggingface.co
    • paperswithcode.com
    Updated Jun 20, 2024
    Cite
    SEACrowd (2024). uit_viic [Dataset]. https://huggingface.co/datasets/SEACrowd/uit_viic
    Dataset updated
    Jun 20, 2024
    Dataset authored and provided by
    SEACrowd
    License

    https://choosealicense.com/licenses/unknown/

    Description

    UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset that depict sports played with a ball. It consists of 19,250 Vietnamese captions for 3,850 images: each image has five Vietnamese captions, written by five annotators.
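    The figures quoted above can be cross-checked with a few lines of Python. This is only a sanity-check sketch; the function name `captions_per_image` is illustrative and not part of the dataset or of any library.

```python
# Figures quoted in the UIT-ViIC description above.
TOTAL_CAPTIONS = 19_250
TOTAL_IMAGES = 3_850


def captions_per_image(total_captions: int, total_images: int) -> int:
    """Return captions per image, raising if the totals do not divide evenly."""
    per_image, remainder = divmod(total_captions, total_images)
    if remainder:
        raise ValueError("caption total is not a multiple of the image total")
    return per_image


print(captions_per_image(TOTAL_CAPTIONS, TOTAL_IMAGES))  # prints 5
```

    The result agrees with the stated five captions per image.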

  2. OpenViVQA_Image-captioning

    • huggingface.co
    Cite
    Yud G Nourt, OpenViVQA_Image-captioning [Dataset]. https://huggingface.co/datasets/YudGNourt/OpenViVQA_Image-captioning
    Authors
    Yud G Nourt
    Description

    OpenViVQA Vietnamese Captions Dataset

    Introduction

    This dataset includes 11,199 images from the OpenViVQA collection, each paired with five Vietnamese captions automatically generated by the Qwen2.5-VL-32B model:

    - layout_caption: Describes the overall layout of the image
    - overview_caption: Provides a detailed overview (long caption)
    - primary_object_caption: Describes the main object in detail (long caption)
    - secondary_object_caption: Describes secondary objects…

    See the full description on the dataset page: https://huggingface.co/datasets/YudGNourt/OpenViVQA_Image-captioning.
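    A record with the caption fields named above could be checked along these lines. This is a minimal sketch, not the dataset's own tooling: it assumes each caption is stored as a plain string under the field name shown, and it covers only the four fields the truncated description lists (the fifth is not named here, so it is left out). `missing_caption_fields` is a hypothetical helper.

```python
from typing import Mapping

# Caption fields named in the description above. The source truncates the
# list after the fourth field, so the fifth is intentionally not modeled.
KNOWN_CAPTION_FIELDS = (
    "layout_caption",
    "overview_caption",
    "primary_object_caption",
    "secondary_object_caption",
)


def missing_caption_fields(record: Mapping[str, object]) -> list[str]:
    """Return the known caption fields that are absent or not non-empty strings."""
    missing = []
    for name in KNOWN_CAPTION_FIELDS:
        value = record.get(name)
        # Assumption: a usable caption is a string with visible characters.
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


# Placeholder record for illustration only; these are not real captions.
example = {name: "placeholder caption" for name in KNOWN_CAPTION_FIELDS}
print(missing_caption_fields(example))  # prints []
```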

  3. Data from: CsEnVi Pairwise Parallel Corpora

    • explore.openaire.eu
    • lindat.cz
    • +1 more
    Updated Nov 10, 2015
    Cite
    Duc Tam Hoang; Ondřej Bojar (2015). CsEnVi Pairwise Parallel Corpora [Dataset]. https://explore.openaire.eu/search/dataset?pid=11234%2F1-1595
    Dataset updated
    Nov 10, 2015
    Authors
    Duc Tam Hoang; Ondřej Bojar
    Description

    CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:

    - OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of the Vi-En and Vi-Cs bitexts are subtitles from movies and television series. By nature, these bitexts paraphrase each other's meaning rather than translate it.
    - TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated into other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.

    The size of the original corpora collected from OPUS and TED talks is as follows:

                    CS/VI               EN/VI
    Sentence        1337199/1337199     2035624/2035624
    Word            9128897/12073975    16638364/17565580
    Unique word     224416/68237        91905/78333

    We improve the quality of the corpora in two steps: normalizing and filtering.

    In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames, and they are distributed differently on the source and the target side. Removing the sequences of dots, along with a number of other normalization rules, improves the quality of the alignment significantly.

    In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.

    The size of the cleaned corpora as published is as follows:

                    CS/VI               EN/VI
    Sentence        1091058/1091058     1113177/1091058
    Word            6718184/7646701     8518711/8140876
    Unique word     195446/59737        69513/58286

    The corpora are used as training data in [2].

    References:
    [1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
    [2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
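    The dot-removal rule from the normalizing step, and the effect of cleaning on corpus size, might be sketched as follows. This is not the authors' tool: it assumes a "sequence of dots" means two or more consecutive periods or a Unicode ellipsis character, and the helper names `strip_dot_runs` and `retention` are illustrative. The sentence counts are the source-side figures quoted in the description.

```python
import re

# Assumption: a "sequence of dots" is two or more ASCII periods or a
# Unicode ellipsis character. Single sentence-final periods are kept.
_DOT_RUN = re.compile(r"(?:\.{2,}|…)")
_SPACES = re.compile(r"\s+")


def strip_dot_runs(line: str) -> str:
    """Remove runs of dots and collapse the whitespace left behind."""
    without_dots = _DOT_RUN.sub(" ", line)
    return _SPACES.sub(" ", without_dots).strip()


def retention(original: int, cleaned: int) -> float:
    """Fraction of sentences kept after normalizing and filtering."""
    return cleaned / original


print(strip_dot_runs("I thought that... maybe we could..."))
# prints: I thought that maybe we could

# Source-side sentence counts before and after cleaning, as quoted above.
print(f"CS/VI kept: {retention(1_337_199, 1_091_058):.1%}")
print(f"EN/VI kept: {retention(2_035_624, 1_113_177):.1%}")
```

    Applying the same rule to both sides of a sentence pair is what makes the two sides comparable again, since the description notes that the dots are distributed differently in source and target.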

  4. Vista

    • huggingface.co
    Updated May 19, 2024
    Cite
    Vietnamese VLM (2024). Vista [Dataset]. https://huggingface.co/datasets/Vi-VLM/Vista
    Dataset updated
    May 19, 2024
    Dataset authored and provided by
    Vietnamese VLM
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "Vista"

    "700,000 Vietnamese vision-language samples open-source dataset"

    Dataset Overview

    This dataset contains over 700,000 Vietnamese vision-language samples, created by Gemini Pro. We employed several prompt engineering techniques: few-shot learning, caption-based prompting and image-based prompting.

    For the COCO dataset, we generated data using LLaVA-style prompts.

    For the ShareGPT4V dataset, we used translation prompts.

    Caption-based prompting:… See the full description on the dataset page: https://huggingface.co/datasets/Vi-VLM/Vista.

