paacamo/nvidia-faq-bert-fine-tuned-llm dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc0-1.0/
Cantonese Common Voice 16.1 for Bert-VITS2 fine tuning format
This dataset contains 14.5 hours of validated Cantonese speech data (yue and zh-hk) from the Common Voice project, with some cleaning and correction of common Chinese characters. The data was cross-checked using facebook/seamless-m4t-v2-large. The dataset is in the format required for fine-tuning Bert-VITS2. For details on the cleaning, fixing, and filtering, refer to the notebook.
Data format… See the full description on the dataset page: https://huggingface.co/datasets/hon9kon9ize/commonvoice_16_1_bert_vits2.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a fine-tuned 600M BERT-like evaluator achieves results comparable to those of human and GPT-4 evaluation.
Results
For… See the full description on the dataset page: https://huggingface.co/datasets/walledai/DNA.
Purpose and Features
The purpose of the model and dataset is to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs. The model is a fine-tuned version of "Distilled BERT", a smaller and faster version of BERT. It was adapted for token classification using what is, to our knowledge, the largest open-source PII masking dataset, which we are releasing simultaneously. The model has 62 million parameters. The original… See the full description on the dataset page: https://huggingface.co/datasets/ai4privacy/pii-masking-65k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Overview
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, in which a fine-tuned 600M BERT-like evaluator achieves results comparable to those of human and GPT-4 evaluation.… See the full description on the dataset page: https://huggingface.co/datasets/LibrAI/do-not-answer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AmaSQuAD - Amharic Question Answering Dataset
Dataset Overview
AmaSQuAD is a synthetic dataset created by translating the SQuAD 2.0 dataset into Amharic using a novel translation framework. The dataset addresses key challenges, including:
Misalignment between translated questions and answers.
Presence of multiple answers in the translated context.
Techniques such as cosine similarity (using embeddings from a fine-tuned Amharic BERT model) and Longest Common… See the full description on the dataset page: https://huggingface.co/datasets/nebhailema/AmaSquad.
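The description above is truncated and does not say exactly how cosine similarity is applied, so the following is only a minimal sketch of the general idea: pick the candidate span whose embedding is closest to a reference embedding. The vectors are toy stand-ins, not real Amharic BERT embeddings, and `best_match` is a hypothetical helper name that does not come from the AmaSQuAD project.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def best_match(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


# Toy vectors standing in for sentence embeddings (assumption: in practice
# these would come from an embedding model, which is not shown here).
answer_vec = [0.9, 0.1, 0.3]
candidate_vecs = [[0.1, 0.8, 0.2], [0.8, 0.2, 0.4], [0.0, 0.0, 1.0]]

idx, score = best_match(answer_vec, candidate_vecs)
print(f"best candidate: {idx}, similarity: {score:.3f}")
```

With real embeddings, the same selection step could help realign a translated answer with the matching span in a translated context, though the actual AmaSQuAD pipeline may differ.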
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
bert-base-multilingual-cased-finetuned-VNLegalDocs
Data used to deploy the Gradio application:
corspus_data.parquet: Documents list
legal_faiss.index: FAISS index
Data used to fine-tune and evaluate the model YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs:
train_data.parquet: For fine-tuning the model
test_data.parquet: For evaluation
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Twitter Sentiment Analysis: Prabowo's First 100 Days
Dataset Overview
This dataset contains tweets related to President Prabowo Subianto's first 100 days in office in Indonesia (2024-2029). The tweets have been preprocessed and classified into three sentiment categories using a BERT model fine-tuned for the Indonesian language (IndoBERT).
Dataset Details
Language: Indonesian
Source: Twitter/X
Time period: First 100 days of President Prabowo's administration
Number… See the full description on the dataset page: https://huggingface.co/datasets/KidzRizal/twitter-sentiment-analysis.