8 datasets found
  1. nvidia-faq-bert-fine-tuned-llm

    • huggingface.co
    Cite
    Heykal Sayid, nvidia-faq-bert-fine-tuned-llm [Dataset]. https://huggingface.co/datasets/paacamo/nvidia-faq-bert-fine-tuned-llm
    Authors
    Heykal Sayid
    Description

    paacamo/nvidia-faq-bert-fine-tuned-llm dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. commonvoice_16_1_bert_vits2

    • huggingface.co
    Updated Jul 10, 2024
    Cite
    hon9kon9ize (2024). commonvoice_16_1_bert_vits2 [Dataset]. https://huggingface.co/datasets/hon9kon9ize/commonvoice_16_1_bert_vits2
    Dataset updated
    Jul 10, 2024
    Dataset authored and provided by
    hon9kon9ize
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Cantonese Common Voice 16.1 for Bert-VITS2 fine tuning format

    This dataset contains 14.5 hours of validated Cantonese speech data (yue and zh-hk) from the Common Voice project. Common Chinese characters were cleansed and fixed, and facebook/seamless-m4t-v2-large was used to cross-check the data. The dataset is in the format required for fine-tuning Bert-VITS2. For more detail on the cleansing, fixing, and filtering, please refer to the notebook.

      Data format… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/commonvoice_16_1_bert_vits2.
    
  3. DNA

    • huggingface.co
    Updated Jul 2, 2024
    + more versions
    Cite
    Walled AI (2024). DNA [Dataset]. https://huggingface.co/datasets/walledai/DNA
    Dataset updated
    Jul 2, 2024
    Dataset authored and provided by
    Walled AI
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

      Overview
    

    Do not answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do not answer also implements model-based evaluation, in which a 600M fine-tuned BERT-like evaluator achieves results comparable to those of human and GPT-4 evaluation.

      Results
    

    For… See the full description on the dataset page: https://huggingface.co/datasets/walledai/DNA.

  4. pii-masking-65k

    • huggingface.co
    Updated Apr 5, 2024
    Cite
    Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Ai4Privacy
    Description

    Purpose and Features

    The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
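
    A minimal sketch of how token-classification output could be turned into masked text, assuming the classifier returns character spans with a label. The span format, the label names, and the [LABEL] placeholder style are hypothetical and are not taken from the dataset page; no model is loaded here.

```python
def mask_pii(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    `spans` is a hypothetical stand-in for model predictions:
    character offsets with an exclusive end, assumed non-overlapping.
    """
    masked = text
    # Work from the rightmost span so earlier offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Hypothetical example input and predictions.
example = "Contact Jane Doe at jane@example.com."
spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
masked_example = mask_pii(example, spans)
```

    Replacing from right to left avoids recomputing offsets after each substitution, since placeholders usually differ in length from the text they replace.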

  5. do-not-answer

    • huggingface.co
    Updated Sep 13, 2023
    + more versions
    Cite
    LibrAI (2023). do-not-answer [Dataset]. https://huggingface.co/datasets/LibrAI/do-not-answer
    Dataset updated
    Sep 13, 2023
    Dataset authored and provided by
    LibrAI
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

      Overview
    

    Do not answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts to which responsible language models do not answer. Besides human annotations, Do not answer also implements model-based evaluation, in which a 600M fine-tuned BERT-like evaluator achieves results comparable to those of human and GPT-4 evaluation.… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.

  6. AmaSquad

    • huggingface.co
    Cite
    Nebiyou Daniel Hailemariam, AmaSquad [Dataset]. https://huggingface.co/datasets/nebhailema/AmaSquad
    Authors
    Nebiyou Daniel Hailemariam
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    AmaSQuAD - Amharic Question Answering Dataset

      Dataset Overview
    

    AmaSQuAD is a synthetic dataset created by translating the SQuAD 2.0 dataset into Amharic using a novel translation framework. The dataset addresses key challenges, including:

    Misalignment between translated questions and answers.
    Presence of multiple answers in the translated context.

    Techniques such as cosine similarity (using embeddings from a fine-tuned Amharic BERT model) and Longest Common… See the full description on the dataset page: https://huggingface.co/datasets/nebhailema/AmaSquad.
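
    A minimal sketch of the cosine-similarity technique named above, assuming it is used to pick the candidate answer span whose embedding lies closest to that of the original answer. The function names and the toy 3-dimensional vectors are hypothetical; in practice the embeddings would come from the fine-tuned Amharic BERT model, which is not loaded here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def best_candidate(answer_vec, candidate_vecs):
    """Return (index, score) of the candidate most similar to answer_vec."""
    scores = [cosine_similarity(answer_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for real sentence embeddings (hypothetical values).
answer = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
index, score = best_candidate(answer, candidates)
```

    Here `index` identifies the second candidate, the only one pointing in nearly the same direction as `answer`.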

  7. Vietnamese-Legal-Doc-Retrieval-Data

    • huggingface.co
    Cite
    Tai Nguyen Phu, Vietnamese-Legal-Doc-Retrieval-Data [Dataset]. https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data
    Authors
    Tai Nguyen Phu
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    bert-base-multilingual-cased-finetuned-VNLegalDocs

    Data used to deploy the Gradio application:

    corspus_data.parquet: Documents list
    legal_faiss.index: FAISS index

    Data used to fine-tune and evaluate the model YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs

    train_data.parquet: For fine-tuning the model
    test_data.parquet: For evaluation

  8. twitter-sentiment-analysis

    • huggingface.co
    Cite
    Muhammad Rizal, twitter-sentiment-analysis [Dataset]. https://huggingface.co/datasets/KidzRizal/twitter-sentiment-analysis
    Authors
    Muhammad Rizal
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Twitter Sentiment Analysis: Prabowo's First 100 Days

      Dataset Overview
    

    This dataset contains tweets related to President Prabowo Subianto's first 100 days in office in Indonesia (2024-2029). The tweets have been preprocessed and classified into three sentiment categories using a fine-tuned BERT model for Indonesian language (IndoBERT).

      Dataset Details
    

    Language: Indonesian
    Source: Twitter/X
    Time period: First 100 days of President Prabowo's administration
    Number… See the full description on the dataset page: https://huggingface.co/datasets/KidzRizal/twitter-sentiment-analysis.
