Dataset for fine-tuning an embedding model for AI job search. Data sourced from datastax/linkedin_job_listings. Data used to fine-tune shawhin/distilroberta-ai-job-embeddings for AI job search.
Links: GitHub Repo, Video link, Blog link
AI Wit Training Dataset
This dataset contains witty comeback and humor training data for fine-tuning language models.
Dataset Structure
Each sample contains:
- messages: List of user/assistant conversation
- source: Data source (e.g., "reddit_jokes")
- style: Response style (e.g., "humorous", "witty")
Usage
This dataset is designed for fine-tuning conversational AI models to generate witty, humorous responses to offensive or provocative inputs.
Example
{… See the full description on the dataset page: https://huggingface.co/datasets/artificialreply/ai-wit-training-data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering and Math).
To be completed
```python
from datasets import load_dataset

dataset = load_dataset("patrickfleith/AstroChat")
```
901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier to refer to this specific conversation. Useful for traceability purposes, especially for further processing tasks or merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
  - role: the role of the speaker, either user or assistant
  - content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
Important: See the full list of topics and subtopics covered below.
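The fields listed above lend themselves to simple filtering. The sketch below is a minimal illustration that works on plain Python dictionaries rather than on the downloaded dataset; the two records, their id format, and the helper names `filter_by_topic` and `user_turns` are hypothetical and only mirror the documented columns.

```python
# Hypothetical records mirroring the documented columns (all values are made up).
records = [
    {
        "id": "conv-001",
        "topic": "Propulsion",
        "subtopic": "Electric Propulsion",
        "persona": "A graduate student new to the field",
        "opening_question": "How does a Hall thruster work?",
        "messages": [
            {"role": "user", "content": "How does a Hall thruster work?"},
            {"role": "assistant", "content": "It ionizes propellant and accelerates the ions electrostatically."},
        ],
    },
    {
        "id": "conv-002",
        "topic": "Orbital Mechanics",
        "subtopic": "Transfer Orbits",
        "persona": "A mission analyst",
        "opening_question": "What is a Hohmann transfer?",
        "messages": [
            {"role": "user", "content": "What is a Hohmann transfer?"},
            {"role": "assistant", "content": "A two-burn transfer between two circular orbits."},
        ],
    },
]


def filter_by_topic(rows, topic):
    """Keep only the conversations whose topic field matches exactly."""
    return [row for row in rows if row["topic"] == topic]


def user_turns(row):
    """Return the text of every user message in one conversation."""
    return [m["content"] for m in row["messages"] if m["role"] == "user"]


propulsion = filter_by_topic(records, "Propulsion")
for row in propulsion:
    print(row["id"], user_turns(row))
```

The same predicate could probably be passed to the `filter` method of a loaded dataset, but that has not been checked against the actual files here.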
The dataset is version-controlled and its commit history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
gpt-4-turbo model) to generate the answers to the opening questions
All instances in the dataset are in English.
901 synthetically generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restriction. Please provide the correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This comprehensive mental health conversational dataset contains more than 510,000 professionally curated conversations, therapeutic dialogues, and support interactions designed for training empathetic AI systems. The dataset combines real-world counseling scenarios, community discussions, and synthetic conversations covering the full spectrum of mental health topics including anxiety, depression, crisis intervention, and wellness support. All content has been carefully anonymized, ethically reviewed, and formatted for immediate compatibility with popular machine learning frameworks including Hugging Face Transformers, OpenAI APIs, and custom language models. The dataset includes multiple file formats (CSV, JSON) and ready-to-use training splits optimized for fine-tuning conversational AI models, making it an invaluable resource for researchers, developers, and organizations building mental health support technologies while maintaining the highest standards of privacy, safety, and therapeutic appropriateness.
Whisper Fine-Tuning Evaluation: Local vs Commercial ASR
A "back of the envelope" evaluation comparing fine-tuned Whisper models running locally against commercial ASR APIs via Eden AI.
The Question
Can fine-tuning Whisper achieve measurable WER reductions, even when comparing local inference against cloud-based commercial models?
TL;DR
Yes. Fine-tuned Whisper Large Turbo running locally achieved 5.84% WER, beating the best commercial API (Assembly at… See the full description on the dataset page: https://huggingface.co/datasets/danielrosehill/Whisper-Fine-Tune-One-Shot-Eval.
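For readers unfamiliar with the metric, WER (word error rate) counts the word-level substitutions, deletions and insertions needed to turn a hypothesis transcript into the reference, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation code behind the figures above; the function name is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Evaluations typically normalize case and punctuation before scoring; this sketch compares the whitespace-separated words exactly as given.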
This install package bundles essential libraries for LLM RAG and fine-tuning (Hugging Face Hub, transformers, langchain, evaluate, sentence-transformers, etc.). It suits the offline requirement of Kaggle competitions; the packages were downloaded from the Kaggle development environment.
Supported packages:
transformers
datasets
accelerate
bitsandbytes
langchain
langchain-community
sentence-transformers
chromadb
faiss-cpu
huggingface_hub
langchain-text-splitters
peft
trl
umap-learn
evaluate
deepeval
weave
Suggested install commands in Kaggle:
!pip install transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/tranformers
!pip install -U datasets --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/datasets
!pip install -U accelerate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/accelerate
!pip install build --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/build-1.2.1-py3-none-any.whl
!pip install -U bitsandbytes --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
!pip install langchain --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain-0.2.5-py3-none-any.whl
!pip install langchain-core --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_core-0.2.9-py3-none-any.whl
!pip install langsmith --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langsmith-0.1.81-py3-none-any.whl
!pip install langchain-community --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_community-0.2.5-py3-none-any.whl
!pip install sentence-transformers --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/sentence_transformers-3.0.1-py3-none-any.whl
!pip install chromadb --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/chromadb-0.5.3-py3-none-any.whl
!pip install faiss-cpu --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/faiss_cpu-1.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!pip install -U huggingface_hub --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/huggingface_hub
!pip install -qU langchain-text-splitters --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/langchain_text_splitters-0.2.1-py3-none-any.whl
!pip install -U peft --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl
!pip install -U trl --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/trl-0.9.4-py3-none-any.whl
!pip install umap-learn --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/umap-learn
!pip install evaluate --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/evaluate-0.4.2-py3-none-any.whl
!pip install deepeval --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/deepeval-0.21.59-py3-none-any.whl
!pip install weave --no-index --no-deps --find-links=file:///kaggle/input/ai-math-llm-package/download-package/weave-0.50.2-py3-none-any.whl
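The install commands above all share one shape: `--no-index` keeps pip away from the package index, `--no-deps` skips installing dependencies, and `--find-links` points at a local wheel or directory. As a small sketch of that pattern, the hypothetical helper below (its name and parameters are assumptions, not part of the package) assembles such a command string for any local path.

```python
def offline_pip_command(package, local_path, upgrade=False):
    """Build a pip command that installs only from a local wheel or directory.

    local_path is expected to be an absolute path, so that prefixing it
    with "file://" yields a "file:///..." URL like the ones listed above.
    """
    parts = ["pip", "install"]
    if upgrade:
        parts.append("-U")
    parts += [
        package,
        "--no-index",   # never query the package index
        "--no-deps",    # do not install dependencies
        f"--find-links=file://{local_path}",
    ]
    return " ".join(parts)


print(offline_pip_command(
    "peft",
    "/kaggle/input/ai-math-llm-package/download-package/peft-0.11.1-py3-none-any.whl",
    upgrade=True,
))
```

In a notebook cell the resulting string would be run with a leading `!`, as in the commands listed above.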
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
The Phi-3-Mini-4K-Instruct is a 3.8B-parameter, lightweight, state-of-the-art open model trained with the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. The model belongs to the Phi-3 family; the Mini version comes in two variants, 4K and 128K, which is the context length (in tokens) that each can support.
The model has undergone a post-training process that incorporates both supervised fine-tuning and direct preference optimization for instruction following and safety measures. When assessed against benchmarks testing common sense, language understanding, math, code, long context and logical reasoning, Phi-3 Mini-4K-Instruct showcased robust, state-of-the-art performance among models with fewer than 13 billion parameters.
Resources and Technical Documentation:
Primary use cases
The model is intended for commercial and research use in English. The model provides uses for applications which require:
1) Memory/compute constrained environments
2) Latency bound scenarios
3) Strong reasoning (especially code, math and logic)
Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
Use case considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Phi-3 Mini-4K-Instruct has been integrated in the development version (4.41.0.dev0) of transformers. Until the official version is released through pip, ensure that you are doing one of the following:
When loading the model, ensure that trust_remote_code=True is passed as an argument of the from_pretrained() function.
Update your local transformers to the development version: pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers. The previous command is an alternative to cloning and installing from the source.
The current transformers version can be verified with: pip list | grep transformers.
Phi-3 Mini-4K-Instruct is also available in HuggingChat.
Phi-3 Mini-4K-Instruct supports a vocabulary size of up to 32064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
Given the nature of the training data, the Phi-3 Mini-4K-Instruct model is best suited for prompts using the chat format as follows.
You can provide the prompt as a question with a generic template as follows:
```markdown
<|user|>
Question <|end|>
<|assistant|>
```
For example:
```markdown
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
```
where the model generates the text after <|assistant|>. For a few-shot prompt, the prompt can be formatted as follows:
<|user|>
I am going to Paris, what should I see?<|end|>
<|assistant|>
Paris, the capital of France, is known for its stunning architecture, art museums, historical landmarks, and romantic atmosphere. Here are some of the top attractions to see in Paris:
1. The Eiffel Tower: The iconic Eiffel Tower is one of the most recognizable landmarks in the world and offers breathtaking views of the city.
2. The Louvre Museum: The Louvre is one of the world's largest and most famous museums, housing an impressive collection of art and artifacts, including the Mona Lisa.
3. Notre-Dame Cathedral: This beautiful cathedral is one of the most famous landmarks in Paris and is known for its Gothic...
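The chat format above can be assembled with plain string handling. The following is a minimal sketch, assuming that every turn, including earlier assistant turns in a few-shot prompt, is closed with `<|end|>`; the function name `build_phi3_prompt` is hypothetical and this is not the model's official templating code.

```python
def build_phi3_prompt(turns):
    """Format (role, content) turns in the chat layout shown above.

    The returned string ends with an open <|assistant|> tag so that the
    model generates the next assistant reply.
    """
    parts = []
    for role, content in turns:
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        # Assumption: each completed turn is terminated by <|end|>.
        parts.append(f"<|{role}|>\n{content}<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_phi3_prompt([
    ("user", "How to explain Internet for a medieval knight?"),
])
print(prompt)
```

When working with the transformers library, the tokenizer's `apply_chat_template` method is usually the safer route, since it applies the template shipped with the model.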
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning.
This is a small Whisper model fine-tuned for Bangla. Fine-tuning started from the small model OpenAI/whisper-small and was carried out on the Bengali.AI Speech Recognition dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction-input-output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.
This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.
This dataset contains 10,000 instruction-input-output examples spanning 10 practical domains:
Each record is structured as:
| Column | Description |
|---|---|
| id | Unique identifier |
| domain | Domain/topic of the task |
| instruction | A prompt asking the model to perform a task |
| input | Context or information needed to complete the task |
| output | Target response generated for the given instruction + input |
| source | Whether the entry is synthetic or human-curated |
| quality_score | A rating from 1-5 reflecting the response's quality |
| Instruction | Input | Output |
|---|---|---|
| "Summarize the following article" | "Photosynthesis is the process by which plants..." | "Photosynthesis converts light into chemical energy." |
| "Fix the code below" | "def greet(name): print('Hello' name)" | "def greet(name): print('Hello', name)" |
| "Plan a 5-day trip" | "Destination: Japan. Interests: culture, tech." | "Day 1: Tokyo tour... Day 2: Kyoto temples..." |
- instruction + input → output
- transformers and PEFT
- quality_score
Released under the MIT License. You may use, modify, and share with attribution.
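To make the instruction + input → output mapping concrete, the sketch below filters records by quality_score and turns each one into a prompt/completion pair. The two records reuse rows from the example table above, but their id, domain and quality_score values are invented; the "### Instruction" template and both helper names are assumptions, not something the dataset prescribes.

```python
# Records in the documented schema; id, domain and quality_score are made up.
records = [
    {
        "id": 1,
        "domain": "coding",
        "instruction": "Fix the code below",
        "input": "def greet(name): print('Hello' name)",
        "output": "def greet(name): print('Hello', name)",
        "source": "synthetic",
        "quality_score": 5,
    },
    {
        "id": 2,
        "domain": "education",
        "instruction": "Summarize the following article",
        "input": "Photosynthesis is the process by which plants...",
        "output": "Photosynthesis converts light into chemical energy.",
        "source": "synthetic",
        "quality_score": 2,
    },
]


def filter_by_quality(rows, min_score=4):
    """Keep records whose quality_score is at least min_score."""
    return [row for row in rows if row["quality_score"] >= min_score]


def to_training_pair(record):
    """Render one record as a prompt and its target completion."""
    prompt = (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        "### Response:\n"
    )
    return {"prompt": prompt, "completion": record["output"]}


pairs = [to_training_pair(row) for row in filter_by_quality(records)]
print(pairs[0]["prompt"] + pairs[0]["completion"])
```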
Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Recurv-Clinical-Dataset:
The Recurv Clinical Dataset is a comprehensive resource containing 12,631 high-quality question-answer pairs specifically designed for training and fine-tuning medical AI models. Curated from trusted medical sources, this dataset focuses on real-world scenarios, including patient history, diagnostics, and treatment recommendations. It sets a new benchmark for advancing conversational AI in the healthcare field.
Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/RecurvAI/Recurv-Clinical-Dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This synthetic dataset for LLM training captures realistic employee-assistant interactions about HR and compliance policies.
Generated using Syncora.ai's synthetic data generation engine, it provides privacy-safe, high-quality conversations for training Large Language Models (LLMs) to handle HR-related queries.
Perfect for researchers, HR tech startups, and AI developers building chatbots, compliance assistants, or policy QA systems, without exposing sensitive employee data.
HR departments handle countless queries on policies, compliance, and workplace practices.
This dataset simulates those Q&A flows, making it a powerful dataset for LLM training and research.
You can use it for:
| Column | Description |
|---|---|
| role | Role of the message author (system, user, or assistant) |
| content | Actual text of the message |
| messages | Grouped sequence of role-content exchanges (conversation turns) |
Each entry represents a self-contained dialogue snippet designed to reflect natural HR conversations, ideal for synthetic data generation research.
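One common way to prepare such role/content dialogues for supervised fine-tuning is to pair every assistant reply with the messages that precede it. The sketch below illustrates this under stated assumptions: the conversation is invented for illustration and the helper name `assistant_targets` is hypothetical.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def assistant_targets(messages):
    """Return (context, reply) pairs, one per assistant turn.

    context is every message before the assistant turn; reply is the
    assistant's text. Raises ValueError on an unexpected role.
    """
    pairs = []
    for index, message in enumerate(messages):
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message['role']!r}")
        if message["role"] == "assistant":
            pairs.append((messages[:index], message["content"]))
    return pairs


# Hypothetical conversation; the wording is invented for illustration.
conversation = [
    {"role": "system", "content": "You are an HR policy assistant."},
    {"role": "user", "content": "How many vacation days do new hires get?"},
    {"role": "assistant", "content": "Please check the leave policy for your region."},
    {"role": "user", "content": "Where can I find it?"},
    {"role": "assistant", "content": "It is usually published on the HR portal."},
]

for context, reply in assistant_targets(conversation):
    print(len(context), "context messages ->", reply)
```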
Whether you're building an HR assistant, compliance bot, or experimenting with enterprise LLMs, Syncora.ai synthetic datasets give you trustworthy, free datasets to start with, and scalable tools to grow further.
Got feedback, research use cases, or want to collaborate?
Open an issue or reach out. We're excited to work with AI researchers, HR tech builders, and compliance innovators.
This dataset is 100% synthetic and does not represent real employees or organizations.
It is intended solely for research, educational, and experimental use in HR analytics, compliance automation, and machine learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This record contains the annotated datasets and models used and produced for the work reported in the master's thesis "Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers" (link).
Please cite this report if you are using the models/datasets or find it relevant to your research:
@article{Marxen:305129,
title = {Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers},
author = {Marxen, Lea},
pages = {114p},
year = {2023},
url = {http://infoscience.epfl.ch/record/305129},
}
1. DATA
The newsagency-dataset contains historical newspaper articles with annotations of news agency mentions. The articles are divided into French (fr) and German (de) subsets, each split into train, dev and test sets. The data is annotated at token level in CoNLL format with IOB tagging.
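As an illustration of how token-level IOB annotations are read, the sketch below collects entity spans from parallel lists of tokens and tags. The `AGENCY` label and the sample sentence are hypothetical; the dataset's actual tag inventory and column layout should be taken from its files.

```python
def extract_entities(tokens, tags):
    """Collect (text, label) spans from IOB-tagged tokens.

    A B- tag opens a span, matching I- tags extend it, and anything else
    closes it. An I- tag with no matching open span is ignored.
    """
    entities = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical sentence and label name, for illustration only.
tokens = ["(", "Agence", "Havas", ")", "Paris", "."]
tags = ["O", "B-AGENCY", "I-AGENCY", "O", "O", "O"]
print(extract_entities(tokens, tags))
```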
The distribution of articles in the different sets is as follows:
| Split | Lg. | Docs | Agency Mentions |
|---|---|---|---|
| Train | de | 333 | 493 |
| | fr | 903 | 1,122 |
| Dev | de | 32 | 26 |
| | fr | 110 | 114 |
| Test | de | 32 | 58 |
| | fr | 120 | 163 |
Due to an error, there are seven duplicated articles in the French test set (article IDs: courriergdl-1847-10-02-a-i0002, courriergdl-1852-02-14-a-i0002, courriergdl-1860-10-31-a-i0016, courriergdl-1864-12-15-a-i0005, lunion-1860-11-27-a-i0004, lunion-1865-02-05-a-i0012, lunion-1866-02-16-a-i0009).
2. MODELS
The two agency detection and classification models used for the inference on the impresso Corpus are released as well:
The models perform multitask classification with two prediction heads, one for token-level agency entity classification and one for sentence-level classification (has_agency: yes/no). They can be run with TorchServe; for details, see the newsagency-classification repository.
Please refer to the report for further information or contact us.
3. CODE
https://github.com/impresso/newsagency-classification
4. CONTACT
Maud Ehrmann (EPFL-DHLAB)
Emanuela Boros (EPFL-DHLAB)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset card for dataset-summaries-llama
This dataset contains AI-generated summaries of dataset cards from the Hugging Face Hub, generated using meta-llama/Llama-3.3-70B-Instruct. It is designed to be used in combination with a similar dataset of model card summaries for initial supervised fine-tuning (SFT) of language models specialized in generating tl;dr summaries of dataset and model cards from the Hugging Face Hub. This dataset was made with Curator.
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/hub-tldr-dataset-summaries-llama.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HVULao_NLP project is dedicated to sharing datasets and tools for Lao Natural Language Processing (NLP), developed and maintained by the research team at Hung Vuong University (HVU), Phu Tho, Vietnam. This project is supported by Hung Vuong University with the aim of advancing research and applications in low-resource language processing, particularly for the Lao language.
Datasets
This release provides a semi-automatically constructed corpus consisting of Lao sentences that have been word-segmented and part-of-speech (POS) tagged. It is designed to support a wide range of NLP applications, including language modeling, sequence labeling, linguistic research, and the development of Lao language tools.
Datatest1k/: Test set (1,000 Lao sentences)
- testorgin1000.txt: Original raw sentences (UTF-8, one sentence per line).
- testsegsent_1000.txt: Word-segmented version aligned 1-to-1 with the raw file (tokens separated by spaces).
- testtag1k.json: Word-segmented and POS-tagged sentences, generated using large language models (LLMs) and manually reviewed by native linguists.
Datatrain10k/: Training set (10,000 Lao sentences)
- 10ktrainorin.txt: Original raw sentences (UTF-8, one sentence per line).
- 10ksegmented.txt: Word-segmented version aligned 1-to-1 with the raw file.
- 10ktraintag.json: Word-segmented and POS-tagged sentences, generated using the same method as the test set.
lao_finetuned_10k/: A fine-tuned transformer-based model for Lao word segmentation, compatible with Hugging Face's transformers library.
All data files are encoded in UTF-8 (NFC) and prepared for direct use in NLP pipelines.
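Because the segmented files are described as aligned 1-to-1 with the raw files, a quick sanity check is to confirm that each segmented line matches its raw line once whitespace is removed. The sketch below works on in-memory lists of lines; the helper names and the two sample sentences are assumptions, and treating whitespace as the only difference between the two files is itself an assumption about the corpus.

```python
import unicodedata


def _squash(text):
    """NFC-normalize and drop all whitespace so only the characters remain."""
    return "".join(unicodedata.normalize("NFC", text).split())


def misaligned_lines(raw_lines, segmented_lines):
    """Return 1-based line numbers where raw and segmented text disagree."""
    if len(raw_lines) != len(segmented_lines):
        raise ValueError("files differ in number of lines")
    return [
        number
        for number, (raw, seg) in enumerate(zip(raw_lines, segmented_lines), start=1)
        if _squash(raw) != _squash(seg)
    ]


# Hypothetical sample: each inner list is one sentence split into words.
words = [["ຂ້ອຍ", "ຮັກ", "ເຈົ້າ"], ["ສະບາຍ", "ດີ"]]
raw = ["".join(sentence) for sentence in words]
segmented = [" ".join(sentence) for sentence in words]
print(misaligned_lines(raw, segmented))
```

With the real files, the two lists would come from reading each file line by line with UTF-8 encoding.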
The Lao sentence segmentation tool
A command-line tool for Lao word segmentation built with a fine-tuned Hugging Face transformers model and PyTorch.
Features
- Accurate Lao word segmentation using a pre-trained model
- Simple command-line usage
- GPU support (if available)
Example usage
```bash
python3 segment_lao.py -i ./data/lao_raw.txt -o ./output/lao_segmented.txt
```
The Lao sentence POS tagging tool
A POS tagging tool for segmented Lao text, implemented with Python and CRF++.
Example usage
```bash
python3 Pos_tagging.py ./Test/lao_sentences_segmented.txt Test1
```
Usage
The HVULao_NLP dataset and tools are intended for:
- Training and evaluating sequence labeling models (e.g., CRF, BiLSTM, mBERT)
- Developing Lao NLP tools (e.g., POS taggers, tokenizers)
- Conducting linguistic and computational research on Lao
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview
This dataset is designed to train and fine-tune chatbot models by mapping user queries (patterns) to predefined intents (tags) and generating contextually accurate responses. Each tag represents a unique conversational intent or topic (e.g., "climate_change," "crypto_regulation," "quantum_computing"), accompanied by multiple paraphrased user prompts (patterns) and a detailed, informative response. It is ideal for building intent classification systems, dialogue management, or generative AI models.
{
  "intents": [
    {
      "tag": "tag_name",
      "patterns": ["user query 1", "user query 2", ...],
      "responses": ["detailed answer"]
    },
    ...
  ]
}
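For intent classification, the structure above is usually flattened into (pattern, tag) training pairs with integer labels. The sketch below shows one way to do that; the sample data reuses two tag names mentioned in the overview, while the patterns, the placeholder responses and both helper names are hypothetical.

```python
# Hypothetical data in the schema shown above.
data = {
    "intents": [
        {
            "tag": "climate_change",
            "patterns": ["What causes climate change?", "Why is the planet warming?"],
            "responses": ["placeholder answer"],
        },
        {
            "tag": "quantum_computing",
            "patterns": ["What is a qubit?"],
            "responses": ["placeholder answer"],
        },
    ]
}


def to_training_pairs(intents_doc):
    """Flatten the document into one (pattern, tag) pair per user query."""
    return [
        (pattern, intent["tag"])
        for intent in intents_doc["intents"]
        for pattern in intent["patterns"]
    ]


def build_label_map(intents_doc):
    """Assign a stable integer id to each tag, in alphabetical order."""
    tags = sorted({intent["tag"] for intent in intents_doc["intents"]})
    return {tag: index for index, tag in enumerate(tags)}


pairs = to_training_pairs(data)
label_map = build_label_map(data)
encoded = [(text, label_map[tag]) for text, tag in pairs]
print(encoded)
```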
Possible Uses
Intent Classification: Train models to categorize user inputs into predefined tags.
Response Generation: Fine-tune generative models (GPT, BERT) to produce context-aware answers.
Educational Chatbots: Power QA systems for topics like science, history, or technology.
Customer Support: Automate responses for FAQs or policy explanations.
Compatibility
Frameworks: TensorFlow, PyTorch, spaCy, Rasa, Hugging Face Transformers.
Use Cases: Virtual assistants, customer service bots, trivia apps, educational tools.
https://www.gnu.org/licenses/gpl-3.0.html
I'm an AI enthusiast, working on machine learning projects and open-source contributions.
I enjoy exploring AI pipelines, natural language processing, and building tools that make development easier.
This dataset is a preprocessed and balanced version of the MELD Dataset, designed for multimodal emotion recognition research.
It combines text, audio, and video modalities, each represented by a set of emotion probability distributions predicted by pretrained or custom-trained models.
| Feature | Description |
|---|---|
| Total Samples | 4,000 utterances |
| Modalities | Text, Audio, Video |
| Balanced Emotions | Each emotion class is approximately balanced |
| Cleaned Samples | Videos with unclear or no facial detection removed |
| Emotion Labels | ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise'] |
Each row in the dataset corresponds to a single utterance, along with emotion label, file name, and predicted emotion probabilities per modality.
| Utterance | Emotion | File_Name | MultiModel Predictions |
|---|---|---|---|
| You are going to a clinic! | disgust | dia127_utt3.mp4 | {"video": [0.7739, 0.0, 0.0, 0.0783, 0.1217, 0.0174, 0.0087], "audio": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0], "text": [0.0005, 0.0, 0.0, 0.0007, 0.998, 0.0004, 0.0004]} |
Each modality's emotion vector was generated independently using specialized models:
| Modality | Model / Method | Description |
|---|---|---|
| Video | python-fer | Facial expression recognition using CNN-based FER library. |
| Audio | Custom-trained CNN model | Trained on Mel spectrogram features for emotion classification. |
| Text | arpanghoshal/EmoRoBERTa | Transformer-based text emotion model fine-tuned on GoEmotions dataset. |
- Utterance
- Emotion
- File_Name
- Final_Emotion (JSON: { "video": [...], "audio": [...], "text": [...] })
This dataset is ideal for:
- Fusion model training
- Fine-tuning multimodal emotion models
- Benchmarking emotion fusion strategies
- Ablation studies on modality importance
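A simple baseline for the fusion use case is late fusion by weighted averaging of the per-modality probability vectors. The sketch below applies it to the example row shown earlier. It assumes the seven probabilities follow the order of the emotion label list in the feature table, which the description does not state explicitly; the function names are hypothetical.

```python
# Assumed order of the probability vectors (taken from the emotion label list).
LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def fuse(predictions, weights=None):
    """Weighted average of per-modality probability vectors."""
    modalities = list(predictions)
    if weights is None:
        weights = {name: 1.0 for name in modalities}
    total = sum(weights[name] for name in modalities)
    return [
        sum(weights[name] * predictions[name][i] for name in modalities) / total
        for i in range(len(LABELS))
    ]


def top_label(probabilities):
    """Return the label with the highest fused probability."""
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best]


# Predictions from the example row (dia127_utt3.mp4).
row = {
    "video": [0.7739, 0.0, 0.0, 0.0783, 0.1217, 0.0174, 0.0087],
    "audio": [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "text": [0.0005, 0.0, 0.0, 0.0007, 0.998, 0.0004, 0.0004],
}

print(top_label(fuse(row)))
print(top_label(fuse(row, weights={"video": 1.0, "audio": 2.0, "text": 1.0})))
```

Under that ordering assumption, equal weights do not recover this row's annotated label, while weighting audio more heavily does, which is the kind of effect a trained fusion model or a modality ablation would examine.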
References for the original MELD Dataset
- S. Poria, D. Hazarika, N. Majumder, G. Naik, R. Mihalcea, E. Cambria. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation (2018).
- Chen, S.Y., Hsu, C.C., Kuo, C.C. and Ku, L.W. EmotionLines: An Emotion Corpus of Multi-Party Conversations. arXiv preprint arXiv:1802.08379 (2018).
This dataset is a derivative work of MELD, used here for research and educational purposes.
All credit for the original dataset goes to the MELD authors and contributors.
TOOLVERIFIER: Generalization to New Tools via Self-Verification
This repository contains the ToolSelect dataset which was used to fine-tune Llama-2 70B for tool selection.
Data
ToolSelect data is synthetic training data generated for the tool selection task using Llama-2 70B and Llama-2-Chat-70B. It consists of 555 samples corresponding to 173 tools. Each training sample is composed of a user instruction, a candidate set of tools that includes the ground truth tool, and a… See the full description on the dataset page: https://huggingface.co/datasets/facebook/toolverifier.
These are LoRA fine-tuned adapter weights for Google Gemma 3n E2B IT, produced for the Google Gemma 3n Impact Challenge. Base model: https://huggingface.co/google/gemma-3n-e2b-it Usage subject to the Gemma Terms of Use: https://ai.google.dev/gemma/terms and the rules listed in the competition page: https://www.kaggle.com/competitions/google-gemma-3n-hackathon/rules
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Everyday Conversations Fine-Tuning Dataset (LLaMA 3.1 - 2K)
Overview
This repository hosts the Everyday Conversations - LLaMA 3.1 - 2K dataset, a carefully curated fine-tuning dataset designed for conversational AI models. The dataset was created using the Kokoro-82M model, featuring voice samples from the af Voicepack.
Dataset Link
Hugging Face Dataset - Everyday Conversations LLaMA 3.1 - 2K
Features
Voice Model: Kokoro-82M
Voicepack: af… See the full description on the dataset page: https://huggingface.co/datasets/rokeya71/KokoroTTS-af-Synthetic-QA.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GSCF Q&A Dataset for Fine-Tuning
Dataset Summary
This dataset contains a collection of question-and-answer pairs specifically designed for fine-tuning Large Language Models (LLMs) on the Global Supply Chain Forum (GSCF) framework, a leading process model developed at The Ohio State University. The data is structured to train a model to act as an expert supply chain consultant. The content covers two main types of interactions:
Definitional Knowledge: Questions that… See the full description on the dataset page: https://huggingface.co/datasets/Supply-Chain-AI-Research/GSCF_finetune.