Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Generating physician letters is a time-consuming task in daily clinical practice.
Methods: This study investigates local fine-tuning of large language models (LLMs), specifically LLaMA models, for physician letter generation in a privacy-preserving manner within the field of radiation oncology.
Results: Our findings demonstrate that base LLaMA models, without fine-tuning, are inadequate for effectively generating physician letters. The QLoRA algorithm provides an efficient method for local intra-institutional fine-tuning of LLMs with limited computational resources (i.e., a single 48 GB GPU workstation within the hospital). The fine-tuned LLM successfully learns radiation oncology-specific information and generates physician letters in an institution-specific style. ROUGE scores of the generated summary reports highlight the superiority of the 8B LLaMA-3 model over the 13B LLaMA-2 model. Further multidimensional physician evaluations of 10 cases reveal that, although the fine-tuned LLaMA-3 model has limited capacity to generate content beyond the provided input data, it successfully generates salutations, diagnoses and treatment histories, recommendations for further treatment, and planned schedules. Overall, clinical benefit was rated highly by the clinical experts (average score of 3.4 on a 4-point scale).
Discussion: With careful physician review and correction, automated LLM-based physician letter generation has significant practical value.
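As a hedged sketch of the kind of QLoRA setup described above (the base model, quantization settings, and adapter hyperparameters are assumptions, not the study's reported configuration), local 4-bit fine-tuning within a 48 GB GPU budget typically looks like this:

```python
# Minimal QLoRA sketch: 4-bit base weights plus trainable low-rank adapters.
# Model name and hyperparameters are assumptions, not the study's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B"  # assumed base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # NF4 quantization keeps the frozen
    bnb_4bit_quant_type="nf4",             # base weights small enough for a
    bnb_4bit_compute_dtype=torch.bfloat16, # single 48 GB workstation GPU
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter settings
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are updated
```

Only the small adapter matrices receive gradients, which is what keeps intra-institutional fine-tuning feasible on a single workstation without sending patient data off-site.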
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for radm/arenahard_gpt4vsllama3
The dataset was created for fine-tuning Llama-3-70B-Instruct as a judge on Arena Hard (https://github.com/lm-sys/arena-hard-auto)
Dataset Info
question_id: question id from Arena Hard
instruction: original instruction from Arena Hard
model: model whose responses are evaluated against the baseline model (gpt-4-0314) - gpt-4-turbo-2024-04-09 (score: 82.6) and Llama-2-70b-chat-hf (score: 11.6)
input: responses of the evaluated… See the full description on the dataset page: https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3.
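As a hedged sketch (the split name is an assumption; the field names follow the listing above), the records can be inspected with the datasets library:

```python
# Illustrative only: loading the judge fine-tuning data and checking the
# documented fields. The "train" split name is an assumption.
from datasets import load_dataset

ds = load_dataset("radm/arenahard_gpt4vsllama3", split="train")
print(ds.column_names)  # expect question_id, instruction, model, input, ...
print(ds[0]["question_id"], ds[0]["model"])
```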
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Language models (LMs) are no longer restricted to the ML community, and instruction-following LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a widespread practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 LMs that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. We experiment with three configurations: zero-shot generation, retrieval-augmented generation, and fine-tuning on our dataset. The fine-tuned Llama 3 model shows an improvement of 7 points over the retrieval-augmented generation setup. This indicates that our dataset can be used to train models to automatically generate model cards from paper text and reduce the human effort in the model card curation process.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ToolACE for LLaMA
Created by: Seungwoo Ryu
Introduction
This dataset is an adaptation of the ToolACE dataset, modified to be directly compatible with LLaMA models for tool-calling fine-tuning. The original dataset was not in a format that could be immediately used for tool-calling training, so I have transformed it accordingly. This makes it more accessible for training LLaMA-based models with function-calling capabilities. This dataset is applicable to all… See the full description on the dataset page: https://huggingface.co/datasets/tryumanshow/ToolACE-Llama-cleaned.
OpenRAIL++: https://choosealicense.com/licenses/openrail++/
Dataset Card for Dataset Name
Dataset Details
Using this data, we obtained the highlighted results with a BART sequence-to-sequence model. The configs and code for fine-tuning can be found on GitHub.
Dataset Description
This is a PseudoParaDetox dataset with real toxic source data and neutral detoxifications generated by a non-patched LLaMA 3 8B model with 10-shot prompting. It is based on the ParaDetox dataset for English text detoxification.
Curated by:… See the full description on the dataset page: https://huggingface.co/datasets/s-nlp/pseudoparadetox_llama3_8b_10shot_noabl.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Innovation in natural language processing (NLP) has led to the creation of models such as BERT, RoBERTa, GPT-4o, Llama 3 and Gemini. However, the adaptation of these models to specific dialects, especially in languages other than English, remains underexplored, particularly for slang and informal language. In response to this need, our research evaluates which Spanish monolingual models are best suited to Peruvian colloquial expressions; the strongest candidate was RoBERTuito, a model pre-trained on a large corpus of Spanish tweets with demonstrated effectiveness in text classification tasks. We refine and compare this model to reflect the characteristics of Peruvian Spanish. We implemented a Facebook data collection and preprocessing process focused on comments in Peruvian Spanish. This specialised dataset of over 11,000 labelled comments was used to train monolingual models on the sentiment analysis task and obtain more accurate polarity detection in texts that include Peruvian slang. RoBERTuito achieved a balanced F1-score of 0.750, outperforming BETO (0.661), BERTuit (0.70) and RoBERTa-BNE (0.696). We also evaluated precision, recall and accuracy for a comprehensive assessment. This study not only provides a solution for sentiment analysis in Peruvian Spanish, but also establishes a basis for adapting monolingual models to specific linguistic contexts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Fine-tuning progress validation - RedPajama 3B, StableLM Alpha 7B, Open-LLaMA
This repository contains the progress of fine-tuning models: RedPajama 3B, StableLM Alpha 7B, Open-LLaMA. These models have been fine-tuned on a specific text dataset and the results of the fine-tuning process are provided in the text file included in this repository.
Fine-Tuning Details
Model: RedPajama 3B, size: 3 billion parameters, method: adapter
Model: StableLM Alpha 7B, size: 7 billion… See the full description on the dataset page: https://huggingface.co/datasets/kstevica/llm-comparison.
princeton-nlp/prolong-data-512K
[Paper] [HF Collection] [Code] ProLong (Princeton long-context language models) is a family of long-context models produced by continued training and supervised fine-tuning from Llama-3-8B, with a maximum context window of 512K tokens. Our main ProLong model is one of the best-performing long-context models at the 10B scale (evaluated by HELMET). To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/prolong-data-512K.
Intro
This dataset (1K rows) formats an existing podcast dataset (64bits/lex_fridman_podcast_for_llm_vicuna) for Llama 3 chat model fine-tuning. It is a compilation of audio-to-text transcripts from the Lex Fridman Podcast, which is hosted by Lex Fridman, an AI researcher at MIT.
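As a hedged sketch of that formatting step (the field names "question" and "answer" are hypothetical, not the dataset's actual columns), a transcript exchange can be rendered with the Llama 3 chat template:

```python
# Illustrative only: rendering a transcript turn with the Llama 3 chat template.
# The example content and field names are hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

example = {"question": "What first drew you to AI research?",
           "answer": "Curiosity about how minds work."}

messages = [
    {"role": "user", "content": example["question"]},
    {"role": "assistant", "content": example["answer"]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # prompt string delimited by <|start_header_id|> / <|eot_id|> markers
```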
Problems
There might be some minor errors introduced during the transcription phase.
Next Step
Use Whisper to transcribe the podcast audio directly into this format.
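A minimal sketch of that next step, assuming the openai-whisper package and a hypothetical local audio file:

```python
# Illustrative only: transcribing an episode with Whisper and wrapping the text
# as a chat-style record. The file path and instruction are hypothetical.
import whisper

model = whisper.load_model("base")                    # small model for a quick pass
result = model.transcribe("lex_fridman_episode.mp3")  # hypothetical file

record = {
    "question": "Summarize this podcast segment.",    # hypothetical instruction
    "answer": result["text"],                         # raw transcript text
}
```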
Description
This dataset has been built from the MBPP dataset with LLM-generated descriptions from a Llama-3-70B-awq model, for fine-tuning dense retrieval models. The dataset was created by using the first 70% of points from the MBPP dataset. We created triplets corresponding to all negatives for a positive pair; hence there are n * (n - 1) triplets for n pairs (since we have n - 1 negative examples for every anchor-positive pair). Using a random seed of 10, we split these triplets into… See the full description on the dataset page: https://huggingface.co/datasets/Nutanix/mbpp_processed_triplet_data.
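As a hedged illustration of that triplet construction (the example pairs are invented), the n * (n - 1) count arises because each anchor-positive pair is combined with each of the other n - 1 positives as a negative:

```python
# Illustrative only: building n*(n-1) triplets from n (anchor, positive) pairs.
pairs = [
    ("Write a function to reverse a list.", "def reverse(xs): return xs[::-1]"),
    ("Check whether a number is prime.",    "def is_prime(n): ..."),
    ("Compute the factorial of n.",         "def fact(n): ..."),
]

triplets = [
    (anchor, positive, negative)
    for i, (anchor, positive) in enumerate(pairs)
    for j, (_, negative) in enumerate(pairs)
    if i != j                                # every other positive serves as a negative
]
assert len(triplets) == len(pairs) * (len(pairs) - 1)  # n * (n - 1)
```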
philschmid/finanical-rag-embedding-dataset
philschmid/finanical-rag-embedding-dataset is a modified fork of virattt/llama-3-8b-financialQA for fine-tuning embedding models using positive text pairs (question, context). The dataset includes 7,000 (question, context) pairs from NVIDIA's 2023 SEC filing report.
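A hedged sketch of how such (question, context) positive pairs are commonly used to fine-tune an embedding model with an in-batch-negatives loss (the base model, batch size, and exact column names are assumptions):

```python
# Illustrative only: fine-tuning an embedding model on positive text pairs.
# Base model, batch size, and column names are assumptions.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

ds = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
examples = [InputExample(texts=[row["question"], row["context"]]) for row in ds]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # assumed base model
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)     # other in-batch contexts act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1)
```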
Dataset Card for magpie-ultra-v1.0
This dataset has been created with distilabel.
Dataset Summary
magpie-ultra is a synthetically generated dataset for supervised fine-tuning using the Llama 3.1 405B-Instruct model, together with other Llama models like Llama-Guard-3-8B and Llama-3.1-8B-Instruct. The dataset contains challenging instructions and responses for a wide variety of tasks, such as Coding & debugging, Math, Data analysis, Creative Writing… See the full description on the dataset page: https://huggingface.co/datasets/GenRM/magpie-ultra-v1.0-argilla.
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template
Dataset Details
Dataset Description
The PoemLib Dataset consists of humorous poems created in a madlib game style. It was generated using the Meta Llama 3 8b-instruct Model with the goal of fine-tuning a Large Language Model to generate madlib-like poems based on given prompts. The dataset creation process utilized a… See the full description on the dataset page: https://huggingface.co/datasets/eddyejembi/PoemLib.
Dataset Details
This dataset was created using meta-llama/Llama-3-8b-chat-hf and contains 894 pairs of rows, each pair comprising an instruction and a sarcastic response to that instruction. The script used for creating this dataset is here - LLM/Lifecycle/CustomDataForFineTuning.ipynb. The inference script that uses this dataset for fine-tuning an LLM is in progress; a link will be added here soon. This dataset can be used to fine-tune an LLM. This will help an LLM adopt… See the full description on the dataset page: https://huggingface.co/datasets/Siddharthvij10/sarcastic-responses.
Dataset Description
Abstract:
This dataset contains processed document files from 3GPP standards (rel8 to rel19) and Q&A pairs generated using the LLaMA 3-8B-instruct model. Each Q&A pair consists of four parts: Instruction, Input, Output, and Metadata. The dataset is designed to support and promote research and applications in the field of Natural Language Processing (NLP), particularly for instruction tuning of large language models (LLMs) focused on telecom standards… See the full description on the dataset page: https://huggingface.co/datasets/jiangfb/3GPP-Finetuning.
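As a hedged illustration (only the four part names come from the description above; the content shown is invented), a single record in that layout might look like:

```python
# Hypothetical record following the Instruction/Input/Output/Metadata layout
# described above; the values are invented for illustration.
record = {
    "Instruction": "Answer the question using the provided 3GPP excerpt.",
    "Input": "Excerpt from a rel17 RRC specification on connection establishment ...",
    "Output": "The UE initiates RRC connection establishment by sending ...",
    "Metadata": {"release": "rel17", "generator": "LLaMA 3-8B-instruct"},
}
```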
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Arabic LLaMA Math Dataset
Example Entries
Dataset Overview
Dataset Name: Arabic_LLaMA_Math_Dataset.csv
Number of Records: 12,496
Number of Columns: 3
File Format: CSV
Dataset Structure
Columns:
Instruction: The problem statement or question (text, in Arabic)
Input: Additional input for model fine-tuning (empty in this dataset)
Solution: The solution or answer to the problem (text, in Arabic)
Dataset Description
The Arabic… See the full description on the dataset page: https://huggingface.co/datasets/Jr23xd23/Arabic_LLaMA_Math_Dataset.
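A hedged sketch of turning the three columns described above into prompt/completion examples (the file path and prompt handling are assumptions):

```python
# Illustrative only: mapping the Instruction/Input/Solution columns to a simple
# prompt/completion format. The file path and template are assumptions.
import pandas as pd

df = pd.read_csv("Arabic_LLaMA_Math_Dataset.csv")  # 12,496 rows, 3 columns

def to_example(row):
    prompt = row["Instruction"]
    if isinstance(row["Input"], str) and row["Input"].strip():
        prompt += "\n" + row["Input"]              # Input is empty in this dataset
    return {"prompt": prompt, "completion": row["Solution"]}

examples = [to_example(row) for _, row in df.iterrows()]
```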
Trelis Function Calling Dataset - VERSION 3
Access this dataset by purchasing a license HERE.
Allows models to be fine-tuned for function-calling. The dataset is human generated and does not make use of Llama 2 or OpenAI! The dataset includes 66 training rows, 19 validation rows and 5 test rows (for manual evaluation). Based on eight functions: search_bing, search_arxiv, save_chat, read_json_file, list_files, get_current_weather, delete_file, clear_chat
Alternatively, you can find… See the full description on the dataset page: https://huggingface.co/datasets/Trelis/function_calling_v3.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Spatial-DPO-Dataset
Overview
This dataset was created to train language models for 3D-to-Speech conversion, specifically for the EchoLLaMA project. It contains 2,000 samples of prompts derived from 3D image analyses paired with two types of responses: high-quality responses from DeepSeek-V3-0324 (chosen) and baseline responses from LLaMA-3.2-1B-Instruct (rejected). This structure enables Direct Preference Optimization (DPO) for fine-tuning language models to generate… See the full description on the dataset page: https://huggingface.co/datasets/AquaLabs/Spatial-DPO-Dataset.
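As a hedged illustration of the preference format that DPO training typically expects (the column names follow common convention and the text is invented; the dataset's actual schema may differ):

```python
# Hypothetical DPO preference record: one prompt with a preferred ("chosen")
# and a dispreferred ("rejected") response, mirroring the pairing described above.
preference_example = {
    "prompt": "Describe the scene: a red cube rests on a wooden table near a window.",
    "chosen": ("A small red cube sits on a wooden table, lit softly by daylight "
               "coming through a nearby window."),     # higher-quality response
    "rejected": "There is a cube. It is on a table.",  # weaker baseline response
}
```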
Dataset Name: Deutsche Bahn FAQ in Llama 3 Format
Dataset Description: This dataset contains 1000 question-answer pairs extracted from the official Deutsche Bahn (German Railways) FAQ section. The data has been specifically formatted to be compatible with the Llama 3 instruct models for supervised fine-tuning (SFT).
Dataset Purpose: The primary purpose of this dataset is to facilitate the fine-tuning of Llama 3 instruct models for tasks related to customer service and information retrieval in… See the full description on the dataset page: https://huggingface.co/datasets/islam-hajosman/deutsche_bahn_faq_128.
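For reference, a hedged sketch of the Llama 3 instruct token layout that such SFT data is usually rendered into (the question and answer shown are invented, not taken from the dataset):

```python
# The Llama 3 instruct special-token layout; the FAQ content here is invented.
llama3_formatted = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "How can I cancel my ticket?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "You can cancel your ticket online in your customer account ...<|eot_id|>"
)
```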