Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Generating physician letters is a time-consuming task in daily clinical practice.
Methods
This study investigates local fine-tuning of large language models (LLMs), specifically LLaMA models, for physician letter generation in a privacy-preserving manner within the field of radiation oncology.
Results
Our findings demonstrate that base LLaMA models, without fine-tuning, are inadequate for effectively generating physician letters. The QLoRA algorithm provides an efficient method for local intra-institutional fine-tuning of LLMs with limited computational resources (i.e., a single 48 GB GPU workstation within the hospital). The fine-tuned LLM successfully learns radiation oncology-specific information and generates physician letters in an institution-specific style. ROUGE scores of the generated summary reports highlight the superiority of the 8B LLaMA-3 model over the 13B LLaMA-2 model. Further multidimensional physician evaluations of 10 cases reveal that, although the fine-tuned LLaMA-3 model has limited capacity to generate content beyond the provided input data, it successfully generates salutations, diagnoses and treatment histories, recommendations for further treatment, and planned schedules. Overall, clinical benefit was rated highly by the clinical experts (average score of 3.4 on a 4-point scale).
Discussion
With careful physician review and correction, automated LLM-based physician letter generation has significant practical value.
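As a rough illustration of the QLoRA setup described above, the sketch below combines 4-bit quantization with LoRA adapters using Hugging Face transformers and peft. The model id, target modules, and hyperparameters are assumptions for illustration, not the study's actual configuration.

```python
# Minimal QLoRA setup sketch: 4-bit base model + LoRA adapters (illustrative values).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed model id; the study's checkpoint may differ

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # NF4 quantization keeps an 8B model within a single 48 GB GPU
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                   # only the small adapter matrices are trained
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # adapters are a small fraction of total weights
```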
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for radm/arenahard_gpt4vsllama3
The dataset was created for fine-tuning Llama-3-70B-Instruct as a judge on Arena Hard (https://github.com/lm-sys/arena-hard-auto)
Dataset Info
question_id: question id from Arena Hard
instruction: original instruction from Arena Hard
model: model whose responses are evaluated against the baseline model (gpt-4-0314) - gpt-4-turbo-2024-04-09 (score: 82.6) and Llama-2-70b-chat-hf (score: 11.6)
input: responses of the evaluated… See the full description on the dataset page: https://huggingface.co/datasets/radm/arenahard_gpt4vsllama3.
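A minimal sketch of loading the dataset and reading the fields listed above (the split name is an assumption):

```python
# Sketch: load the dataset and inspect the documented fields.
from datasets import load_dataset

ds = load_dataset("radm/arenahard_gpt4vsllama3", split="train")  # split name assumed
example = ds[0]
print(example["question_id"], example["model"])   # fields listed in the card
print(example["instruction"][:200])               # original Arena Hard instruction
```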
Hmoumad/Moumad-Dataset-Fine-Tune-Llama-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Language models (LMs) are no longer restricted to the ML community, and instruction-following LMs have led to a rise in autonomous AI agents. As the accessibility of LMs grows, it is imperative that an understanding of their capabilities, intended usage, and development cycle also improves. Model cards are a widespread practice for documenting detailed information about an ML model. To automate model card generation, we introduce a dataset of 500 question-answer pairs for 25 LMs that cover crucial aspects of the model, such as its training configurations, datasets, biases, architecture details, and training resources. We employ annotators to extract the answers from the original paper. Further, we explore the capabilities of LMs in generating model cards by answering questions. We experiment with three configurations: zero-shot generation, retrieval-augmented generation, and fine-tuning on our dataset. The fine-tuned Llama 3 model shows an improvement of 7 points over the retrieval-augmented generation setup. This indicates that our dataset can be used to train models to automatically generate model cards from paper text and reduce the human effort in the model card curation process.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ToolACE for LLaMA
Created by: Seungwoo Ryu
Introduction
This dataset is an adaptation of the ToolACE dataset, modified to be directly compatible with LLaMA models for tool-calling fine-tuning. The original dataset was not in a format that could be immediately used for tool-calling training, so I have transformed it accordingly. This makes it more accessible for training LLaMA-based models with function-calling capabilities. This dataset is applicable to all… See the full description on the dataset page: https://huggingface.co/datasets/tryumanshow/ToolACE-Llama-cleaned.
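As a rough sketch of the kind of reshaping involved, the example below converts a ShareGPT-style record into chat messages suitable for tool-calling fine-tuning. The field names and record structure are hypothetical illustrations, not the actual schema of this dataset.

```python
# Hypothetical conversion of a ToolACE-style record into chat messages with tool definitions.
# Field names ("system", "conversations") are assumptions for illustration only.
import json

def to_chat_messages(record: dict) -> list[dict]:
    messages = [{"role": "system", "content": record["system"]}]
    for turn in record["conversations"]:
        role = "assistant" if turn["from"] == "gpt" else "user"
        messages.append({"role": role, "content": turn["value"]})
    return messages

record = {
    "system": "You can call the following tools: " + json.dumps([{"name": "get_weather"}]),
    "conversations": [
        {"from": "human", "value": "What's the weather in Seoul?"},
        {"from": "gpt", "value": '{"name": "get_weather", "arguments": {"city": "Seoul"}}'},
    ],
}
print(to_chat_messages(record))
```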
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Innovation in natural language processing (NLP) has led to the creation of models such as BERT, RoBERTa, GPT-4o, Llama 3 and Gemini. However, the adaptation of these models to specific dialects, especially in languages other than English, remains underexplored, particularly for slang and informal language. In response to this need, our research evaluates which Spanish monolingual models are best suited to Peruvian colloquial expressions; the strongest candidate is RoBERTuito, a model pre-trained on a large corpus of Spanish tweets that has proven effective in text classification tasks. We refine and compare this model to reflect the characteristics of Peruvian Spanish. We implemented a data collection and preprocessing pipeline for Facebook comments in Peruvian Spanish. This specialised dataset of over 11,000 labelled comments was used to train monolingual models on the sentiment analysis task and obtain more accurate polarity detection in texts that include Peruvian slang. RoBERTuito achieved a balanced F1-score of 0.750, outperforming BETO (0.661), BERTuit (0.70) and RoBERTa-BNE (0.696). We also report precision, recall and accuracy for a comprehensive evaluation. This study not only provides a solution for sentiment analysis in Peruvian Spanish, but also establishes a basis for adapting monolingual models to other linguistic contexts.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Large-language models (LLMs) show promise for extracting information from clinical notes. Deploying these models at scale can be challenging due to high computational costs, regulatory constraints, and privacy concerns. To address these challenges, synthetic data distillation can be used to fine-tune smaller, open-source LLMs that achieve performance similar to the teacher model. These smaller models can be run on less expensive local hardware or at a vastly reduced cost in cloud deployments. In our recent study [1], we used Llama-3.1-70B-Instruct to generate synthetic training examples in the form of question-answer pairs along with supporting information. We manually reviewed 1000 of these examples and release them here. These examples can then be used to fine-tune smaller versions of Llama to improve their ability to extract clinical information from notes.
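As a loose illustration of how such question-answer examples might be rendered into training text for a smaller student model, the sketch below uses a simple instruction template. The field names, template, and clinical note are illustrative assumptions, not the released data format.

```python
# Sketch: turn teacher-generated (note, question, answer) examples into instruction-tuning text
# for a smaller student model. Field names and the prompt template are illustrative assumptions.
def to_training_text(example: dict) -> str:
    return (
        "### Clinical note:\n" + example["note"] + "\n\n"
        "### Question:\n" + example["question"] + "\n\n"
        "### Answer:\n" + example["answer"]
    )

example = {
    "note": "Patient admitted with community-acquired pneumonia; started on ceftriaxone.",
    "question": "Which antibiotic was started on admission?",
    "answer": "Ceftriaxone.",
}
print(to_training_text(example))
```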
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Low-rank adaptation fine-tuned weights for Llama 2 experiments for simultaneously extracting named entities and their relationships in structured format, with results shown in the paper "*Structured information extraction from scientific text with large language models*" in Nature Communications by John Dagdelen*, Alexander Dunn*, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, and Anubhav Jain.
Includes 13 independent experiments for three different tasks:
General materials information extraction (five cross-validation folds, one fine-tuned model for each)
Metal-organic frameworks information extraction (five cross-validation folds, one fine-tuned model for each)
Inorganic impurity doping information extraction (one test set for each of three schemas - JSON, English, and ExtraEnglish, each corresponding to one model)
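A minimal sketch of applying one of these LoRA adapters to a base Llama 2 checkpoint with peft. The base model id and adapter path are placeholders; the base checkpoint must match the one the adapter was trained from.

```python
# Sketch: attach a released LoRA adapter to a Llama 2 base model with peft (placeholder paths).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"                 # placeholder; use the matching base checkpoint
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

model = PeftModel.from_pretrained(base, "path/to/downloaded_lora_adapter")  # placeholder path
model = model.merge_and_unload()   # optional: fold adapter weights into the base model for inference
```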
https://choosealicense.com/licenses/openrail++/
Dataset Card for Dataset Name
Dataset Details
Using this data, we obtained the highlighted results with a BART sequence-to-sequence model. The configs and code for fine-tuning can be found on GitHub.
Dataset Description
This is a PseudoParaDetox dataset containing real toxic source texts and neutral detoxifications generated by a patched LLama 3 8B in a 10-shot setting. It is based on the ParaDetox dataset for English text detoxification.
Curated by:… See the full description on the dataset page: https://huggingface.co/datasets/s-nlp/pseudoparadetox_llama3_8b_10shot_patched.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Fine tuning progress validation - RedPajama 3B, StableLM Alpha 7B, Open-LLaMA
This repository contains the progress of fine-tuning models: RedPajama 3B, StableLM Alpha 7B, Open-LLaMA. These models have been fine-tuned on a specific text dataset and the results of the fine-tuning process are provided in the text file included in this repository.
Fine-Tuning Details
Model: RedPajama 3B, size: 3 billion parameters, method: adapter
Model: StableLM Alpha 7B, size: 7 billion… See the full description on the dataset page: https://huggingface.co/datasets/kstevica/llm-comparison.
https://choosealicense.com/licenses/llama2/
llama-instruct
This dataset was used to fine-tune Llama-2-7B-32K-Instruct. We follow the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca: producing instructions by querying a powerful LLM, which in our case is the Llama-2-70B-Chat model released by Meta. To build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs). The actual script handles multi-turn conversations… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/llama-instruct.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CoinGecko Function Calling Dataset
This dataset is designed for function calling of the CoinGecko API in Alpaca format and has been used to fine-tune Meta-LLama-3-8B-Instruct-Coingecko-Function-Calling.
CoinGecko API Functions
Function format: .json
Example:
Question: How much is one Ethereum in USD?
Answer: [{ "name": "simple_price", "arguments": { "ids": "ethereum", "vs_currencies": "usd" } }]
Total Rows: 1558
Antiprompts:
Example:
Question: What's the moving average for… See the full description on the dataset page: https://huggingface.co/datasets/SanctumAI/Coingecko-Function-Calling.
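As a rough sketch of how a function-calling record like the example above might be rendered into an Alpaca-style training prompt (the template and column handling are assumptions, not this dataset's exact format):

```python
# Sketch: render a function-calling example into an Alpaca-style prompt (illustrative template).
import json

def to_alpaca_prompt(instruction: str, answer_call: list[dict]) -> str:
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n" + json.dumps(answer_call)
    )

call = [{"name": "simple_price", "arguments": {"ids": "ethereum", "vs_currencies": "usd"}}]
print(to_alpaca_prompt("How much is one Ethereum in USD?", call))
```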
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ORPO-DPO-mix-40k v1.2
This dataset is designed for ORPO or DPO training. See Fine-tune Llama 3 with ORPO for more information about how to use it. It is a combination of the following high-quality DPO datasets:
argilla/Capybara-Preferences: highly scored chosen answers >=5 (7,424 samples)
argilla/distilabel-intel-orca-dpo-pairs: highly scored chosen answers >=9, not in GSM8K (2,299 samples)
argilla/ultrafeedback-binarized-preferences-cleaned: highly scored chosen answers >=5 (22… See the full description on the dataset page: https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k.
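As a loose illustration of how this mix might feed ORPO training with TRL, the sketch below uses ORPOTrainer. The model choice and hyperparameters are illustrative, and argument names and expected dataset format can vary slightly across trl versions.

```python
# Minimal ORPO training sketch with TRL (illustrative model and hyperparameters).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Meta-Llama-3-8B"          # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

config = ORPOConfig(
    output_dir="llama3-orpo",
    beta=0.1,                        # weight of the odds-ratio preference term
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    max_length=1024,
)
trainer = ORPOTrainer(model=model, args=config, train_dataset=dataset, tokenizer=tokenizer)
trainer.train()
```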
Intro
This dataset (1K rows) reformats an existing podcast dataset (64bits/lex_fridman_podcast_for_llm_vicuna) for Llama 3 chat model fine-tuning. It is a compilation of audio-to-text transcripts from the Lex Fridman Podcast, hosted by Lex Fridman, an AI researcher at MIT.
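A minimal sketch of applying the Llama 3 chat template to a transcript exchange (the example messages are illustrative, not the dataset's actual contents):

```python
# Sketch: format a transcript exchange with the Llama 3 chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "What did the guest say about AGI timelines?"},
    {"role": "assistant", "content": "The guest argued that timelines are highly uncertain."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)  # wrapped in <|start_header_id|>...<|eot_id|> special tokens
```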
Problems
There may be some minor errors introduced during the transcription phase.
Next Step
Use Whisper to load the podcast audio directly and transcribe it into this format.
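A minimal sketch of that next step with the openai-whisper package (the file path and model size are placeholders):

```python
# Sketch: transcribe an episode directly with openai-whisper (placeholder file path).
import whisper

model = whisper.load_model("base")                 # small model for illustration
result = model.transcribe("lex_fridman_episode.mp3")
print(result["text"][:500])                        # raw transcript to be segmented into chat turns
```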
philschmid/finanical-rag-embedding-dataset
philschmid/finanical-rag-embedding-dataset is a modified fork of virattt/llama-3-8b-financialQA for fine-tuning embedding models using positive text pairs (question, context). The dataset includes 7,000 (question, context) pairs from NVIDIA's 2023 SEC filing report.
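As a rough sketch of fine-tuning an embedding model on these positive pairs with in-batch negatives via sentence-transformers (the base encoder, column names, and hyperparameters are assumptions):

```python
# Sketch: fine-tune an embedding model on (question, context) positive pairs with in-batch negatives.
from torch.utils.data import DataLoader
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample, losses

ds = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
examples = [InputExample(texts=[row["question"], row["context"]]) for row in ds]  # column names assumed

model = SentenceTransformer("BAAI/bge-base-en-v1.5")          # assumed base encoder
loader = DataLoader(examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)             # other in-batch pairs act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```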
Description
This dataset has been built from the MBPP dataset with LLM-generated descriptions from a Llama-3-70B-awq model, for fine-tuning dense retrieval models. The dataset was created using the first 70% of points from the MBPP dataset. We created triplets corresponding to all negatives for a positive pair. Hence there are n * (n - 1) triplets for n pairs (since we have n - 1 negative examples for every anchor-positive pair). Using a random seed of 10, we split these triplets into… See the full description on the dataset page: https://huggingface.co/datasets/Nutanix/mbpp_processed_triplet_data.
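A small sketch of the triplet construction described above, where every other positive serves as a negative for a given anchor-positive pair (the example pairs are placeholders):

```python
# Sketch: for n (anchor, positive) pairs, pairing each anchor with every other positive as a
# negative yields n * (n - 1) triplets.
import random

pairs = [("task description 1", "code 1"),
         ("task description 2", "code 2"),
         ("task description 3", "code 3")]

triplets = [
    (anchor, positive, other_positive)
    for i, (anchor, positive) in enumerate(pairs)
    for j, (_, other_positive) in enumerate(pairs)
    if i != j
]
assert len(triplets) == len(pairs) * (len(pairs) - 1)

random.seed(10)          # the card mentions a random seed of 10 for the split
random.shuffle(triplets)
```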
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
5.2k rows of single- and multi-turn synthetic Forgotten Realms data! It was partially produced with Groq's Llama-3-70B and my WIP dataset-creator fine-tune.
princeton-nlp/prolong-data-512K
[Paper] [HF Collection] [Code] ProLong (Princeton long-context language models) is a family of long-context models built from Llama-3-8B via continued training and supervised fine-tuning, with a maximum context window of 512K tokens. Our main ProLong model is one of the best-performing long-context models at the 10B scale (evaluated by HELMET). To train this strong long-context model, we conduct thorough ablations on the long-context pre-training data… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/prolong-data-512K.
Dataset Card for Dataset Name
Dataset Details
Dataset Description
The PoemLib Dataset consists of humorous poems created in a madlib game style. It was generated using the Meta Llama 3 8b-instruct Model with the goal of fine-tuning a Large Language Model to generate madlib-like poems based on given prompts. The dataset creation process utilized a… See the full description on the dataset page: https://huggingface.co/datasets/eddyejembi/PoemLib.
Dataset Card for magpie-ultra-v1.0
This dataset has been created with distilabel.
Dataset Summary
magpie-ultra is a synthetically generated dataset for supervised fine-tuning using the Llama 3.1 405B-Instruct model, together with other Llama models like Llama-Guard-3-8B and Llama-3.1-8B-Instruct. The dataset contains challenging instructions and responses for a wide variety of tasks, such as Coding & debugging, Math, Data analysis, Creative Writing… See the full description on the dataset page: https://huggingface.co/datasets/GenRM/magpie-ultra-v1.0-argilla.