84 datasets found

LLM Fine Tuning Dataset of Indian Legal Texts
kaggle.com
Updated Jul 30, 2024
Akshat Gupta (2024). LLM Fine Tuning Dataset of Indian Legal Texts [Dataset]. https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts/discussion
Dataset updated
Jul 30, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Akshat Gupta
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Area covered
India
Description
This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.

Dataset Details:

Sources: The questions and answers in this dataset are extracted from the Indian Constitution, Indian Penal Code (IPC), and the Code of Criminal Procedure (CrPC), ensuring relevance and accuracy in legal contexts.

Content: Each entry in the dataset contains a clear and concise question alongside its corresponding answer. The questions are designed to cover fundamental concepts, key provisions, and significant terms found within these legal documents.

Use Cases:

Legal Research: A valuable tool for lawyers, legal researchers, and students seeking to understand legal terminology and principles as outlined in Indian law.

Natural Language Processing (NLP): This dataset is ideal for training AI models for question-answering systems that require a strong understanding of Indian legal texts.

Educational Resources: Useful for creating educational tools and materials for law students and legal practitioners.

Note on Use and Limitations:

Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.

Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.

Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.

Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.

Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
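
The preprocessing recommended above could be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the listing does not state the file schema, so the record keys "question" and "answer" and the `min_answer_chars` threshold are assumptions.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_qa_pairs(pairs, min_answer_chars=10):
    """Normalize QA pairs, drop near-empty answers, and remove repeated questions.

    `pairs` is a list of dicts with the hypothetical keys "question" and "answer".
    """
    seen = set()
    cleaned = []
    for pair in pairs:
        question = normalize(pair["question"])
        answer = normalize(pair["answer"])
        # Skip entries that are empty or too short to be a useful answer.
        if not question or len(answer) < min_answer_chars:
            continue
        # Treat questions that differ only in case or spacing as duplicates.
        key = question.casefold()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned
```

Filtering out-of-context pairs still needs manual or model-assisted review; the sketch only handles the mechanical part (normalization and redundancy).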
Alpaca Cleaned
kaggle.com
huggingface.co
Updated Nov 26, 2023
The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training
Dataset updated
Nov 26, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Alpaca Cleaned

Improving Pretrained Language Model Understanding

By Huggingface Hub [source]

About this dataset

Alpaca is a dataset for fine-tuning language models to better understand and follow instructions. This curated, cleaned dataset provides over 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). Each record has three fields: instruction, output, and input. According to the description, the data has been cleaned to remove errors and biases found in the original dataset, with the aim of improving the performance of language models trained on it.

How to use the dataset

This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

To make the most out of this dataset it is recommended to:

Familiarize yourself with the instruction column, which explains how the other two columns, input and output, are to be used.

Once you are comfortable with the instructions, explore the triplets (instruction, output and input) included in this cleaned version of Alpaca.

Read through many examples and note where more clarification could help, bearing in mind that these examples have already been cleaned of errors and biases found in the original dataset.

With more than 52k triplets, there is plenty of flexibility for varied training strategies or approaches when creating your own language model.

Finally, while not essential, familiarity with OpenAI's text-davinci engine may be helpful, as may experimenting with different parameters and options depending on the outcomes you wish to achieve.

Research Ideas

Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.

Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.

Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: train.csv

| Column name | Description |
|:------------|:------------|
| instruction | This column contains the instructions for the language model. (Text) |
| output | This column contains the expected output from the language model. (Text) |
| input | This column contains the input given to the language model. (Text) |
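
As a rough illustration of how the three columns might be combined into one training string, here is a minimal sketch. The prompt layout (the "### Instruction:" style headers) and the function names are assumptions chosen for illustration; the listing does not prescribe a template.

```python
def build_prompt(record):
    """Build a prompt from one row; the `input` column is optional context."""
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    # Rows without extra context skip the Input section entirely.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def to_training_text(record):
    """Concatenate the prompt with the expected output from the `output` column."""
    return build_prompt(record) + record["output"].strip()


row = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
print(to_training_text(row))
```

Keeping the prompt builder separate from the full training text makes it easy to reuse the same layout at inference time, when no output is available.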

Acknowledgements

If you use this dataset in your research, please credit the original authors and Huggingface Hub.
longform_article_summarization
huggingface.co
Updated Nov 26, 2023
Vincent Goldberg (2023). longform_article_summarization [Dataset]. https://huggingface.co/datasets/vgoldberg/longform_article_summarization
Dataset updated
Nov 26, 2023
Authors
Vincent Goldberg
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset Name: Long-Form Article Summarization Dataset

Description: The Long-Form Article Summarization Dataset is meticulously curated for the purpose of fine-tuning Natural Language Processing (NLP) models specifically tailored for summarization tasks. It is a rich collection of long-form articles that have been carefully condensed and summarized. The dataset provides a diverse range of topics and writing styles, making it an invaluable resource for researchers and practitioners working on… See the full description on the dataset page: https://huggingface.co/datasets/vgoldberg/longform_article_summarization.
Evaluating SQuAD-based Question Answering for the Open Research Knowledge...
service.tib.eu
Updated Aug 4, 2023
+ more versions
(2023). Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-evaluating-squad-based-question-answering-for-the-open-research-knowledge-graph-completion
Dataset updated
Aug 4, 2023
License
Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset. The dataset was created using a semi-automatic approach on the ORKG data. The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, plus additional fields (the start and end indices of the answers in the abstracts). The data in the JSON files is split into training data and evaluation data. We create 4 variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which"). For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the thesis document, which can be found at https://www.repo.uni-hannover.de/handle/123456789/12958. The script used to generate the dataset can be found in the public repositories https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models
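
The start and end indices mentioned above can be derived from the answer text and its abstract. The sketch below is a hedged illustration only: the "answers"/"answer_start" layout follows the general SQuAD convention, while "answer_end" and the function names are hypothetical, and the thesis's actual field names may differ (see its section 4.2).

```python
def answer_span(context, answer):
    """Return (start, end) character offsets of `answer` in `context`, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    # `end` is exclusive, so context[start:end] == answer.
    return start, start + len(answer)


def to_squad_record(question, context, answer):
    """Build one SQuAD-style record with the answer offsets filled in."""
    span = answer_span(context, answer)
    if span is None:
        raise ValueError("answer text not found in context")
    start, end = span
    return {
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
        "answer_end": end,  # hypothetical extra field
    }
```

Note that `str.find` returns only the first occurrence; if an answer string appears several times in an abstract, the correct occurrence has to be chosen by other means.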
wikipedia-paragraph-sft
huggingface.co
Updated Sep 16, 2024
Alan Tseng (2024). wikipedia-paragraph-sft [Dataset]. https://huggingface.co/datasets/agentlans/wikipedia-paragraph-sft
Dataset updated
Sep 16, 2024
Authors
Alan Tseng
License
https://choosealicense.com/licenses/odc-by/
Description
Wikipedia Paragraph Supervised Finetuning Dataset

Model Description

This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.

Intended Use

The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-paragraph-sft.
finewebedu-sft
huggingface.co
Updated Aug 27, 2024
Alan Tseng (2024). finewebedu-sft [Dataset]. https://huggingface.co/datasets/agentlans/finewebedu-sft
Dataset updated
Aug 27, 2024
Authors
Alan Tseng
License
https://choosealicense.com/licenses/odc-by/
Description
FineWeb-Edu Supervised Finetuning Dataset

Model Description

This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.

Intended Use

The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom datasets for… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/finewebedu-sft.
IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage)
crawlfeeds.com
csv, zip
Updated Jul 5, 2025
Crawl Feeds (2025). IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage) [Dataset]. https://crawlfeeds.com/datasets/imdb-movies-metadata-dataset-4-5m-records-global-coverage
Available download formats: csv, zip
Dataset updated
Jul 5, 2025
Dataset authored and provided by
Crawl Feeds
License
https://crawlfeeds.com/privacy_policy
Description
Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.

This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.

Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.

What’s Included:

Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more

Delivery: Direct download

Use Cases:

Train LLMs or chatbots on cinematic language and metadata

Build or enrich movie recommendation engines

Run cross-lingual or multi-region film analytics

Benchmark genre popularity across time periods

Power academic studies or entertainment dashboards

Feed into knowledge graphs, search engines, or NLP pipelines
Labelled data for fine tuning a geological Named Entity Recognition and...
metadata.bgs.ac.uk
hosted-metadata.bgs.ac.uk
+1more
html
Updated Feb 15, 2024
British Geological Survey (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://metadata.bgs.ac.uk/geonetwork/srv/api/records/15ac4ca9-3be0-119e-e063-0937940a8990
Available download formats: html
Dataset updated
Feb 15, 2024
Dataset authored and provided by
British Geological Survey (https://www.bgs.ac.uk/)
License
http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations
Time period covered
Nov 1, 2023 - Feb 15, 2024
Description
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated to enable the dataset to be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These latter documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model.
The development of this training data and the text processing scripts were supported by a grant from UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund Project 10083604
ObjectNET [7 of 10]
kaggle.com
Updated Jul 15, 2022
Darien Schettler (2022). ObjectNET [7 of 10] [Dataset]. https://www.kaggle.com/datasets/dschettler8845/objectnet-7-of-10/discussion?sort=undefined
Dataset updated
Jul 15, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Darien Schettler
Description
NOTE: BY USING THIS DATASET YOU ACKNOWLEDGE THAT YOU HAVE READ THE LICENSE AND WILL ABIDE BY THE TERMS THEREWITHIN

THE LICENSE

ObjectNet is free to use for both research and commercial applications. The authors own the source images and allow their use under a license derived from Creative Commons Attribution 4.0 with two additional clauses: 1. ObjectNet may never be used to tune the parameters of any model. This includes, but is not limited to, computing statistics on ObjectNet and including those statistics into a model, fine-tuning on ObjectNet, performing gradient updates on any parameters based on these images. 2. Any individual images from ObjectNet may only be posted to the web including their 1 pixel red border. If you post this archive in a public location, please leave the password intact as "objectnetisatestset". [Other General License Information Conforms to Attribution 4.0 International]


IMPORTANT NOTE ––– THIS DATASET IS ONLY FOR VALIDATION/TESTING * YOU CANNOT USE IT TO TRAIN MODELS IN ANY WAY * IF YOU TRAIN A MODEL WITH IT YOU ARE VIOLATING THE LICENSE AGREEMENT * IF YOU POST IMAGES FROM THIS DATASET ANYWHERE YOU MUST ADD A RED BORDER TO THE IMAGE * IF YOU POST IMAGES WITHOUT THE BORDER YOU ARE VIOLATING THE LICENSE AGREEMENT


This is Part 7 of 10 * Original Paper Link * ObjectNet Website

The links to the various parts of the dataset are:

Part 1

Part 2

Part 3

Part 4

Part 5

Part 6

Part 7

Part 8

Part 9

Part 10

Description From ObjectNET Homepage

What is ObjectNet?

A new kind of vision dataset borrowing the idea of controls from other areas of science.

No training set, only a test set! Put your vision system through its paces.

Collected to intentionally show objects from new viewpoints on new backgrounds.

50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint.

313 object classes with 113 overlapping ImageNet

Large performance drop, what you can expect from vision systems in the real world!

Robust to fine-tuning and a very difficult transfer learning problem

Controls For Biases Increase Variation

https://objectnet.dev/images/objectnet_controls_table.png

Easy For Humans, Hard For Machines

Ready to help develop the next generation of object recognition algorithms that have robustness, bias, and safety in mind.

Controls can remove bias from other machine learning datasets, not just vision.

https://objectnet.dev/images/objectnet_results.png

Full Description

ObjectNet is a large real-world test set for object recognition with control where object backgrounds, rotations, and imaging viewpoints are random.

Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases.

We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generaliz...
Data from: HL Dataset: Visually-grounded Description of Scenes, Actions and...
data.niaid.nih.gov
Updated Feb 28, 2024
Kees van Deemter (2024). HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10723070
Dataset updated
Feb 28, 2024
Dataset provided by
Albert Gatt
Michele Cafagna
Kees van Deemter
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
Apple Leaf Disease Detection Using Vision Transformer
zenodo.org
text/x-python
Updated Jun 20, 2025
Amreen Batool; Amreen Batool (2025). Apple Leaf Disease Detection Using Vision Transformer [Dataset]. http://doi.org/10.5281/zenodo.15702007
Available download formats: text/x-python
Unique identifier
https://doi.org/10.5281/zenodo.15702007
Dataset updated
Jun 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Amreen Batool; Amreen Batool
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

Table of Contents

Introduction

Code Explanation

Steps for Implementation

Example Usage

Conclusion

Introduction

The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

Code Explanation

1. Importing Libraries

The script starts by importing necessary libraries such as matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

2. Visualizing the Dataset

The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.

The dataset is divided into Train, Val, and Test directories, each containing subdirectories for the four classes.

3. Data Augmentation

The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.

Separate generators are created for training, validation, and test datasets.

4. Patch Visualization

The script defines a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.

The script visualizes these patches for different patch sizes (32x32, 16x16, 8x8) to understand how the image is divided.
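
To make the effect of the patch size concrete, the following sketch computes how many non-overlapping patches an image yields and where each patch starts. It is plain arithmetic rather than the script's Keras Patches layer, and the 224x224 RGB image size is an assumption, since the description does not state the input resolution.

```python
def patch_grid(image_size, patch_size, channels=3):
    """Return (num_patches, values_per_patch) for a square image that is
    split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


def patch_origins(image_size, patch_size):
    """Top-left (row, col) pixel of every patch, in row-major order."""
    steps = range(0, image_size, patch_size)
    return [(row, col) for row in steps for col in steps]


IMAGE_SIZE = 224  # assumed input resolution, not given in the description
for size in (32, 16, 8):
    num_patches, patch_dim = patch_grid(IMAGE_SIZE, size)
    print(f"patch {size}x{size}: {num_patches} patches, {patch_dim} values each")
```

Halving the patch size quadruples the number of patches, which lengthens the token sequence the transformer has to process.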

5. Model Training

The script defines a Vision Transformer (ViT) model using TensorFlow and Keras. The model is compiled with the Adam optimizer and categorical cross-entropy loss.

The model is trained for a specified number of epochs, and the training history is stored for later analysis.

6. Model Evaluation

After training, the model is evaluated on the test dataset. The script generates a confusion matrix and a classification report to assess the model's performance.

The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.
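
For readers unfamiliar with the evaluation step, a confusion matrix can be computed in a few lines. The description says the script relies on sklearn and seaborn; the pure-Python version below is only an illustrative stand-in, and the helper names are hypothetical.

```python
from collections import Counter

CLASSES = ["Healthy", "Apple Scab", "Black Rot", "Cedar Apple Rust"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


y_true = ["Healthy", "Healthy", "Apple Scab", "Black Rot"]
y_pred = ["Healthy", "Apple Scab", "Apple Scab", "Black Rot"]
matrix = confusion_matrix(y_true, y_pred, CLASSES)
for label, row in zip(CLASSES, matrix):
    print(f"{label:>16}: {row}")
print("accuracy:", accuracy(matrix))
```

Off-diagonal cells show which pairs of classes the model confuses, which is the same information the misclassified-image visualization explores.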

7. Visualizing Misclassified Images

The script includes functionality to visualize misclassified images, which helps in understanding where the model is making errors.

8. Fine-Tuning and Learning Rate Adjustment

The script demonstrates how to fine-tune the model by adjusting the learning rate and re-training the model.

Steps for Implementation

Dataset Preparation

Ensure that the dataset is organized into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).

Install Required Libraries

Install the necessary Python libraries using pip:

pip install tensorflow matplotlib seaborn numpy pandas scikit-learn

Run the Script

Execute the script in a Python environment. The script will automatically:

Load and preprocess the dataset.

Apply data augmentation.

Train the Vision Transformer model.

Evaluate the model and generate performance metrics.

Analyze Results

Review the confusion matrix and classification report to understand the model's performance.

Visualize misclassified images to identify potential areas for improvement.

Fine-Tuning

Experiment with different patch sizes, learning rates, and data augmentation techniques to improve the model's accuracy.
WONDERBREAD: A Benchmark + Dataset for Business Process Management (BPM)...
zenodo.org
csv, json, zip
Updated Oct 14, 2024
Michael Wornow; Michael Wornow (2024). WONDERBREAD: A Benchmark + Dataset for Business Process Management (BPM) Tasks [Dataset]. http://doi.org/10.5281/zenodo.12671568
Available download formats: csv, zip, json
Unique identifier
https://doi.org/10.5281/zenodo.12671568
Dataset updated
Oct 14, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Michael Wornow; Michael Wornow
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 6, 2024
Description
Paper: WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks

Background

The WONDERBREAD dataset contains 2,928 human demonstrations of 598 web navigation workflows across 6 types of BPM tasks. These tasks measure the ability of a model to generate accurate documentation, assist in knowledge transfer, and improve the efficiency of workflows.

Please see our website for more details: https://wonderbread.stanford.edu/

Quick Start

To start, download debug_demos.zip (~1 GB). It contains a subset of 24 demonstrations which can give you a sense of how the dataset is structured.

To reproduce the paper, download gold_demos.zip (~33 GB). It contains 724 demonstrations corresponding to the 162 "Gold" tasks which were used for all the evaluations in the original paper.

To obtain the full dataset, download demos.zip (~133 GB). This contains all 2,928 demonstrations and can be used for training, fine-tuning, and evaluating models.

Dataset Structure

The dataset contains several files, defined below.

Raw Data (useful for training/fine-tuning/evaluation)

debug_demos.zip -- a subset of only 24 demonstrations taken from the full dataset. Useful to get a sense of the dataset and for debugging.

gold_demos.zip -- a subset of only 724 demonstrations corresponding to the 162 "Gold" tasks. This is the dataset that was used for all evaluations in the original WONDERBREAD paper.

demos.zip -- all 2,928 demonstrations across 598 tasks. Useful for training your own models.

Evaluation (useful for evaluation)

qa_dataset.csv -- contains all 120 questions and ground truth answers used in the "Knowledge Transfer" evaluation.

df_rankings.csv -- contains the rankings of all "Gold" tasks used in the "SOP Ranking" evaluation.

Metadata (can be safely ignored)

Process Mining Task Demonstrations.xlsx -- maps human annotators to specific demonstrations; also contains "Gold" task rankings used in the "SOP Ranking" evaluation.

metadata.json -- maps Google Drive URLs to Google Drive Folder IDs to demonstration names

df_valid.csv -- tracks assets associated with each demonstration
Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...
data.niaid.nih.gov
zenodo.org
Updated Dec 21, 2023
Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
Explore at:
Dataset updated
Dec 21, 2023
Dataset provided by
Senghaas, Mika
Nutter, Peter
Cizinsky, Ludek
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

Key Features:

LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
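For readers unfamiliar with the macro F1 metric quoted above, here is a small sketch of how it can be computed for multi-label predictions. It is not the project's evaluation code. The function name `macro_f1`, the representation of labels as sets of integer indices, and the convention of scoring an unused label as 0 are all assumptions made for illustration.

```python
def macro_f1(y_true, y_pred, n_labels):
    """Macro-averaged F1 for multi-label data.

    y_true and y_pred are lists of sets of label indices, one set per website.
    """
    f1_scores = []
    for label in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no true or predicted instances scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every label equally, regardless of its frequency.
    return sum(f1_scores) / n_labels
```

Because every topic counts equally in the average, rare topics influence the score as much as common ones, which is why macro F1 is a natural fit for a claim about capturing a broader range of website topics.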

Dataset Composition:

curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

Intended Use:

Fine-tuning and advancing Homepage2Vec or similar website classification models

Research on LLM-generated datasets for text classification tasks

Exploration of multilingual website classification

Additional Information:

Project and report repository: https://github.com/CS-433/ml-project-2-mlp

Acknowledgments:

This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
cvedataset.jsonl
huggingface.co
Updated Mar 2, 2025
Cite
Thierno Diallo (2025). cvedataset.jsonl [Dataset]. https://huggingface.co/datasets/iamthierno/cvedataset.jsonl
Explore at:
Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 2, 2025
Authors
Thierno Diallo
Description
CVE Dataset (1999-2024) for LLM Fine-Tuning

Overview

This dataset comprises Common Vulnerabilities and Exposures (CVE) records spanning from 1999 to 2024. Each entry provides essential information on software vulnerabilities, their descriptions, affected products and versions, CVSS scores, and relevant references. The data is formatted in a JSON Lines (.jsonl) structure, making it suitable for fine-tuning Large Language Models (LLMs) for tasks such as cybersecurity… See the full description on the dataset page: https://huggingface.co/datasets/iamthierno/cvedataset.jsonl.
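Since the data is described as JSON Lines, a reader along the following lines should be enough to iterate over the records. This is a sketch under the assumption that each line holds one JSON object. The helper name `read_jsonl` is illustrative, and no field names are assumed because the record schema is not spelled out in the truncated description.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # Report the offending line so malformed records are easy to find.
                raise ValueError(f"line {line_number}: invalid JSON") from err
```

Reading lazily with a generator keeps memory use flat even when the file covers many years of records.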
Object Detection For Mstar Imagery Dataset
universe.roboflow.com
zip
Updated Nov 21, 2024
Cite
Corn (2024). Object Detection For Mstar Imagery Dataset [Dataset]. https://universe.roboflow.com/corn-y933v/object-detection-for-mstar-imagery/model/3
Explore at:
Available download formats: zip
Dataset updated
Nov 21, 2024
Dataset authored and provided by
Corn
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Armored Vehicles Bounding Boxes
Description
Exploring Object Detection Techniques for MSTAR IU Mixed Targets Dataset

Introduction: The rapid advancements in machine learning and computer vision have significantly improved object detection capabilities. In this project, we aim to explore and develop object detection techniques specifically tailored to the MSTAR IU Mixed Targets. This dataset, provided by the Sensor Data Management System, offers a valuable resource for training and evaluating object detection models for synthetic aperture radar (SAR) imagery.

Objective: Our primary objective is to develop an efficient and accurate object detection model that can identify and localize various targets within the MSTAR IU Mixed Targets dataset. By achieving this, we aim to enhance the understanding and applicability of SAR imagery in real-world scenarios, such as surveillance, reconnaissance, and military applications.

Ethics: As responsible researchers, we recognize the importance of ethics in conducting our project. We are committed to ensuring the ethical use of data and adhering to privacy guidelines. The MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System will be used solely for academic and research purposes. Any personal information or sensitive data within the dataset will be handled with utmost care and confidentiality.

Data Attribution and Giving Credit: We deeply appreciate the Sensor Data Management System for providing the MSTAR IU Mixed Targets dataset. We understand the effort and resources invested in curating and maintaining this valuable dataset, which forms the foundation of our project. To acknowledge and give credit to the Sensor Data Management System, we will prominently mention their contribution in all project publications, reports, and presentations. We will provide appropriate citations and include a statement recognizing their dataset as the source of our training and evaluation data.

Methodology:

Data Preprocessing: We will preprocess the MSTAR IU Mixed Targets dataset to enhance its compatibility with the YOLOv8 object detection algorithm. This will involve resizing, normalizing, and augmenting the images.

Training and Evaluation: The selected model will be trained on the preprocessed dataset, utilizing appropriate loss functions and optimization techniques. We will extensively evaluate the model's performance using standard evaluation metrics such as precision, recall, and mean average precision (mAP).

Fine-tuning and Optimization: We will fine-tune the model on the MSTAR IU Mixed Targets dataset to enhance its accuracy and adaptability to SAR-specific features. Additionally, we will explore techniques such as transfer learning and data augmentation to further improve the model's performance.

Results and Analysis: The final model's performance will be analyzed in terms of detection accuracy, computational efficiency, and generalization capability. We will conduct comprehensive experiments and provide visualizations to showcase the model's object detection capabilities on the MSTAR IU Mixed Targets dataset.

Model Selection and Revaluation: We will evaluate and compare state-of-the-art object detection models to identify the most suitable architecture for SAR imagery. This will involve researching and implementing models such as Faster R-CNN, other YOLO versions or SSD, considering their performance, speed, and adaptability to the MSTAR dataset.
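The precision and recall metrics named in the methodology rest on matching predicted boxes to ground-truth boxes by intersection over union (IoU). The sketch below shows that idea for a single class. It is not the project's evaluation code: the function names, the (x_min, y_min, x_max, y_max) box layout, the 0.5 threshold, and the greedy matching without confidence ordering are assumptions chosen to keep the example short.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth box can be matched once
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Mean average precision (mAP) builds on the same matching step by sweeping over confidence thresholds and averaging across classes, which this sketch leaves out.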

Conclusion: This project aims to contribute to the field of object detection in SAR imagery by leveraging the valuable MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System. We will ensure ethical use of the data and give proper credit to the dataset's source. By developing an accurate and efficient object detection model, we hope to advance the understanding and application of SAR imagery in various domains.

Note: This project description serves as an overview and can be expanded upon in terms of specific methodologies, experiments, and evaluation techniques as the project progresses.
Danish Conversation Chat Dataset for Telecom Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Danish Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/danish-telecom-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
• Participant Details: 150+ native Danish participants from the FutureBeeAI community.
• Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

Topic Diversity
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
• Inbound Chats:
  • Phone Number Porting
  • Network Connectivity Issues
  • Billing and Payments
  • Technical Support
  • Service Activation
  • International Roaming Enquiry
  • Refunds and Billing Adjustments
  • Emergency Service Access, and many more
• Outbound Chats:
  • Welcome Calls / Onboarding Process
  • Payment Reminders
  • Customer Surveys
  • Technical Updates
  • Service Usage Reviews
  • Network Complaint Update, and many more
Language Variety & Nuances
The conversations in this dataset capture the diverse language styles and expressions prevalent in Danish Telecom interactions. This diversity ensures the dataset accurately represents the language used by Danish speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
• Naming Conventions: Chats include a variety of Danish personal and business names.
• Localized Details: Real-world addresses, emails, phone numbers, and other contact information in the formats used across different Danish-speaking regions.
• Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Danish forms, adhering to local conventions.
• Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Danish Telecom conversations.

This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Danish Telecom interactions.
Conversational Flow and Interaction Types
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
• Simple Inquiries
• Detailed Discussions
• Transactional Interactions
• Problem-Solving Dialogues
• Advisory Sessions
• Routine Checks and Follow-Ups
Each of these conversations contains various aspects of conversation flow, such as:
• Greetings
• Authentication
• Information gathering
• Resolution identification
Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...
data.openei.org
osti.gov
code, data, website
Updated Dec 31, 2018
Cite
Patrick Emami; Peter Graf (2018). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. http://doi.org/10.25984/1986147
Explore at:
Available download formats: code, website, data
Unique identifier
https://doi.org/10.25984/1986147
Dataset updated
Dec 31, 2018
Dataset provided by
USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
National Renewable Energy Laboratory
Open Energy Data Initiative (OEDI)
Authors
Patrick Emami; Peter Graf
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The BuildingsBench datasets consist of:

Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.

7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
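To make the day-ahead STLF setup concrete, the sketch below slices a single building's consumption series into (context, target) pairs. It is an illustration, not the benchmark's own data loader. The function name, the hourly resolution, the one-week (168-hour) context, and the 24-hour stride are assumptions; only the day-ahead horizon follows from the description above.

```python
def day_ahead_windows(series, context_hours=168, horizon_hours=24, stride=24):
    """Split an hourly load series into (context, target) pairs.

    Each context holds the past `context_hours` values and each target holds
    the following `horizon_hours` values the model is asked to forecast.
    """
    windows = []
    last_start = len(series) - context_hours - horizon_hours
    for start in range(0, last_start + 1, stride):
        split = start + context_hours
        windows.append((series[start:split], series[split:split + horizon_hours]))
    return windows
```

A series shorter than one context plus one horizon yields no windows, so very short time series are skipped rather than padded.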

BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:

ElectricityLoadDiagrams20112014

Building Data Genome Project-2

Individual household electric power consumption (Sceaux)

Borealis

SMART

IDEAL

Low Carbon London

A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
Vietnamese Agent-Customer Chat Dataset for Healthcare Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Vietnamese Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/vietnamese-healthcare-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.
Participant & Chat Overview
• Participants: 150+ native Vietnamese speakers from the FutureBeeAI Crowd Community
• Conversation Length: 300–700 words per chat
• Turns per Chat: 50–150 dialogue turns across both participants
• Chat Types: Inbound and outbound
• Sentiment Coverage: Positive, neutral, and negative outcomes included

Topic Diversity
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
• Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
• Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
Language Diversity & Realism
This dataset reflects the natural flow of Vietnamese healthcare communication and includes:
• Authentic Naming Patterns: Vietnamese personal names, clinic names, and brands
• Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Vietnamese formats
• Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Vietnamese-speaking regions
• Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversational Flow & Structure
Conversations range from simple inquiries to complex advisory sessions, including:
• General inquiries
• Detailed problem-solving
• Routine status updates
• Treatment recommendations
• Support and feedback interactions
Each conversation typically includes these structural components:
• Greetings and verification
• Information gathering
• Problem definition
• Solution delivery
• Closing messages
• Follow-up and feedback (where applicable)
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Data Format & Structure
Available in JSON, CSV, and TXT formats, each conversation includes:
• Full message history with clear speaker labels
• Participant identifiers
• Metadata (e.g., topic tags, region, sentiment)
• Compatibility with common NLP and ML pipelines
Experiment environment.
plos.figshare.com
xls
Updated Jan 19, 2024
Cite
Le Bu; Caiping Hu; Xiuliang Zhang (2024). Experiment environment. [Dataset]. http://doi.org/10.1371/journal.pone.0296789.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0296789.t002
Dataset updated
Jan 19, 2024
Dataset provided by
PLOS ONE
Authors
Le Bu; Caiping Hu; Xiuliang Zhang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The recognition of food images is of great significance for nutrition monitoring, food retrieval and food recommendation. However, the accuracy of recognition had not been high enough due to the complex background of food images and the characteristics of small inter-class differences and large intra-class differences. To solve these problems, this paper proposed a food image recognition method based on transfer learning and ensemble learning. Firstly, generic image features were extracted by using the convolutional neural network models (VGG19, ResNet50, MobileNet V2, AlexNet) pre-trained on the ImageNet dataset. Secondly, the 4 pre-trained models were transferred to the food image dataset for model fine-tuning. Finally, different basic learner combination strategies were adopted to establish the ensemble model and classify feature information. In this paper, several kinds of experiments were performed to compare the results of food image recognition between single models and ensemble models on food-11 dataset. The experimental results demonstrated that the accuracy of the ensemble model was the highest, reaching 96.88%, which was superior to any base learner. Therefore, the convolutional neural network model based on transfer learning and ensemble learning has strong learning ability and generalization ability, and it is feasible and practical to apply the method to food image recognition.
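The abstract does not say which combination strategies were used, so the following is only one common option, soft voting, shown as an illustration of how base-learner outputs can be combined. The function name `soft_vote`, the optional weights, and the input layout (one probability list per model) are assumptions, not the paper's method.

```python
def soft_vote(probabilities, weights=None):
    """Combine per-model class probabilities by (weighted) averaging.

    probabilities: one list of class probabilities per base learner.
    Returns the index of the winning class and the averaged probabilities.
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    if weights is None:
        weights = [1.0] * n_models  # plain average when no weights are given
    total = sum(weights)
    averaged = [
        sum(w * model_probs[c] for w, model_probs in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged
```

With four base learners such as those named above, `probabilities` would hold four lists, each with one entry per food class. Hard (majority) voting and stacking are other strategies that fit the same interface.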
Quilt-1M: One Million Image-Text Pairs for Histopathology
data.niaid.nih.gov
zenodo.org
Updated Aug 16, 2023
Cite
Linda G. Shapiro (2023). Quilt-1M: One Million Image-Text Pairs for Histopathology [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8239941
Explore at:
Dataset updated
Aug 16, 2023
Dataset provided by
Linda G. Shapiro
Fatemeh Ghezloo
Ranjay Krishna
Mehmet S. Seyfioglu
Wisdom Oluchi Ikezogwo
Pavan K. Anand
Fatwir S. Mohammed
Dylan Geva
Description
Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has slowed similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 802,148 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new pathology images across 13 diverse patch-level datasets of 8 different sub-pathologies and cross-modal retrieval tasks.

Cite

Akshat Gupta (2024). LLM Fine Tuning Dataset of Indian Legal Texts [Dataset]. https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts/discussion

LLM Fine Tuning Dataset of Indian Legal Texts

QA Dataset for fine tuning LLMs on IPC, CRPC, and Indian Constitution

Explore at:

Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 30, 2024

Dataset provided by

Kaggle http://kaggle.com/

Authors

Akshat Gupta

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Area covered

India

Description

This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), Criminal Procedure Code (CRPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.

Dataset Details:

Sources: The questions and answers in this dataset are extracted from the Indian Constitution, Indian Penal Code (IPC), and the Code of Criminal Procedure (CrPC), ensuring relevance and accuracy in legal contexts.
Content: Each entry in the dataset contains a clear and concise question alongside its corresponding answer. The questions are designed to cover fundamental concepts, key provisions, and significant terms found within these legal documents.

Use Cases:

Legal Research: A valuable tool for lawyers, legal researchers, and students seeking to understand legal terminology and principles as outlined in Indian law.
Natural Language Processing (NLP): This dataset is ideal for training AI models for question-answering systems that require a strong understanding of Indian legal texts.
Educational Resources: Useful for creating educational tools and materials for law students and legal practitioners.

Note on Use and Limitations:

Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.
Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.
Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.
Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.
Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
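As a starting point for the preprocessing recommended above, the sketch below normalizes questions and drops empty or duplicate entries. It assumes the QA pairs have already been loaded as (question, answer) tuples; the file layout and column names are not given here, and the helper names are illustrative. It does not judge legal relevance or context, which still needs human review.

```python
import re


def normalize(text):
    """Lowercase, trim, and collapse whitespace so near-identical strings compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_qa_pairs(pairs, min_answer_chars=1):
    """Drop empty entries and duplicate questions, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        key = normalize(question)
        if not key or len(answer.strip()) < min_answer_chars:
            continue  # skip pairs with a missing question or answer
        if key in seen:
            continue  # skip repeated questions that differ only in case or spacing
        seen.add(key)
        cleaned.append((question.strip(), answer.strip()))
    return cleaned
```

Raising `min_answer_chars` is a simple way to filter out fragmentary answers before fine-tuning.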
