8 datasets found

nvidia-faq-bert-fine-tuned-llm
huggingface.co
Cite
Heykal Sayid, nvidia-faq-bert-fine-tuned-llm [Dataset]. https://huggingface.co/datasets/paacamo/nvidia-faq-bert-fine-tuned-llm
Authors
Heykal Sayid
Description
paacamo/nvidia-faq-bert-fine-tuned-llm dataset hosted on Hugging Face and contributed by the HF Datasets community
commonvoice_16_1_bert_vits2
huggingface.co
Updated Jul 10, 2024
Cite
hon9kon9ize (2024). commonvoice_16_1_bert_vits2 [Dataset]. https://huggingface.co/datasets/hon9kon9ize/commonvoice_16_1_bert_vits2
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Jul 10, 2024
Dataset authored and provided by
hon9kon9ize
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Cantonese Common Voice 16.1 for Bert-VITS2 fine tuning format

This dataset contains 14.5 hours of validated Cantonese speech data (yue and zh-hk) from the Common Voice project. Common Chinese character errors were cleaned and fixed, and facebook/seamless-m4t-v2-large was used to cross-check the data. The dataset is in the format required for fine-tuning Bert-VITS2. For details of the cleansing, fixing, and filtering, please refer to the notebook.

Data format… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/commonvoice_16_1_bert_vits2.
DNA
huggingface.co
Updated Jul 2, 2024
+ more versions
Cite
Walled AI (2024). DNA [Dataset]. https://huggingface.co/datasets/walledai/DNA
Explore at:
Croissant
Dataset updated
Jul 2, 2024
Dataset authored and provided by
Walled AI
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Overview

Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.

Results

For… See the full description on the dataset page: https://huggingface.co/datasets/walledai/DNA.
pii-masking-65k
huggingface.co
Updated Apr 5, 2024
Cite
Ai4Privacy (2024). pii-masking-65k [Dataset]. http://doi.org/10.57967/hf/2012
Explore at:
Croissant
Unique identifier
https://doi.org/10.57967/hf/2012
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Ai4Privacy
Description
Purpose and Features

The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model size is 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
do-not-answer
huggingface.co
Updated Sep 13, 2023
+ more versions
Cite
LibrAI (2023). do-not-answer [Dataset]. https://huggingface.co/datasets/LibrAI/do-not-answer
Explore at:
Croissant
Dataset updated
Sep 13, 2023
Dataset authored and provided by
LibrAI
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs

Overview

Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 evaluation.… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
AmaSquad
huggingface.co
Cite
Nebiyou Daniel Hailemariam, AmaSquad [Dataset]. https://huggingface.co/datasets/nebhailema/AmaSquad
Explore at:
Croissant
Authors
Nebiyou Daniel Hailemariam
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
AmaSQuAD - Amharic Question Answering Dataset

Dataset Overview

AmaSQuAD is a synthetic dataset created by translating the SQuAD 2.0 dataset into Amharic using a novel translation framework. The dataset addresses key challenges, including:

Misalignment between translated questions and answers.
Presence of multiple answers in the translated context.

Techniques such as cosine similarity (using embeddings from a fine-tuned Amharic BERT model) and Longest Common… See the full description on the dataset page: https://huggingface.co/datasets/nebhailema/AmaSquad.
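As a rough illustration of the cosine-similarity technique mentioned above, the following minimal sketch scores candidate answer embeddings against a reference embedding and picks the closest one. It is not the AmaSQuAD authors' implementation: the function names `cosine_similarity` and `best_candidate` are hypothetical, and the short toy vectors stand in for embeddings that would come from a fine-tuned Amharic BERT model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def best_candidate(reference, candidates):
    """Return (index, score) of the candidate most similar to the reference.

    Hypothetical helper: `reference` and each entry of `candidates`
    are assumed to be embedding vectors of the same length.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    scores = [cosine_similarity(reference, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for real embeddings (assumption).
reference = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1]]
index, score = best_candidate(reference, candidates)
print(index, round(score, 3))
```

With real embeddings, a similarity threshold would typically decide whether the best candidate is accepted at all; any such threshold is a tuning choice not stated in the listing.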
Vietnamese-Legal-Doc-Retrieval-Data
huggingface.co
Cite
Tai Nguyen Phu, Vietnamese-Legal-Doc-Retrieval-Data [Dataset]. https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data
Authors
Tai Nguyen Phu
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
bert-base-multilingual-cased-finetuned-VNLegalDocs

Data used to deploy the Gradio application:

corspus_data.parquet: Documents list
legal_faiss.index: FAISS index

Data used to fine-tune and evaluate the model YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs

train_data.parquet: For fine-tuning the model
test_data.parquet: For evaluation
twitter-sentiment-analysis
huggingface.co
Cite
Muhammad Rizal, twitter-sentiment-analysis [Dataset]. https://huggingface.co/datasets/KidzRizal/twitter-sentiment-analysis
Authors
Muhammad Rizal
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Twitter Sentiment Analysis: Prabowo's First 100 Days

Dataset Overview

This dataset contains tweets about the first 100 days in office of Indonesian President Prabowo Subianto (2024-2029 term). The tweets have been preprocessed and classified into three sentiment categories using a fine-tuned BERT model for the Indonesian language (IndoBERT).

Dataset Details

Language: Indonesian
Source: Twitter/X
Time period: First 100 days of President Prabowo's administration
Number… See the full description on the dataset page: https://huggingface.co/datasets/KidzRizal/twitter-sentiment-analysis.