84 datasets found
  1. LLM Fine Tuning Dataset of Indian Legal Texts

    • kaggle.com
    Updated Jul 30, 2024
    Cite
    Akshat Gupta (2024). LLM Fine Tuning Dataset of Indian Legal Texts [Dataset]. https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts/discussion
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshat Gupta
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.

    Dataset Details:

    • Sources: The questions and answers in this dataset are extracted from the Indian Constitution, Indian Penal Code (IPC), and the Code of Criminal Procedure (CrPC), ensuring relevance and accuracy in legal contexts.
    • Content: Each entry in the dataset contains a clear and concise question alongside its corresponding answer. The questions are designed to cover fundamental concepts, key provisions, and significant terms found within these legal documents.

    Use Cases:

    • Legal Research: A valuable tool for lawyers, legal researchers, and students seeking to understand legal terminology and principles as outlined in Indian law.
    • Natural Language Processing (NLP): This dataset is ideal for training AI models for question-answering systems that require a strong understanding of Indian legal texts.
    • Educational Resources: Useful for creating educational tools and materials for law students and legal practitioners.

    Note on Use and Limitations:

    • Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.

    • Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.

    • Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.

    • Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.

    • Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
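
    The preprocessing recommended above (normalizing phrasing, dropping fragmentary entries, removing redundant pairs) could be sketched roughly as follows. This is a minimal illustration only: the function names, the `min_answer_chars` threshold, and the sample rows are assumptions made for the example and are not part of the dataset.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Unicode-normalize, collapse runs of whitespace, and strip the ends."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_qa_pairs(pairs, min_answer_chars=3):
    """Normalize QA pairs, drop near-empty answers, de-duplicate questions."""
    seen = set()
    cleaned = []
    for question, answer in pairs:
        q, a = normalize(question), normalize(answer)
        if not q or len(a) < min_answer_chars:
            continue  # skip empty or fragmentary entries
        key = q.casefold()
        if key in seen:
            continue  # keep only the first occurrence of a repeated question
        seen.add(key)
        cleaned.append((q, a))
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
sample = [
    ("What is  Article 21?", "Protection of life and personal liberty."),
    ("what is article 21?", "Protection of life and personal liberty."),
    ("What does IPC stand for?", ""),
]
cleaned = clean_qa_pairs(sample)
```

    Exact-match de-duplication on the case-folded question is the simplest possible filter; paraphrased duplicates and out-of-context pairs would still need manual or model-assisted review.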

  2. Alpaca Cleaned

    • kaggle.com
    • huggingface.co
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Alpaca Cleaned [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-language-instruction-training
    Dataset updated
    Nov 26, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca Cleaned

    Improving Pretrained Language Model Understanding

    By Huggingface Hub [source]

    About this dataset

    Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, capable of taking you beyond standard Natural Language Processing (NLP) abilities! This curated, cleaned dataset provides you with over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). Improve the quality of your language models with fields such as instruction, output, and input, which have been designed to enhance every aspect of their comprehension. The data here has gone through rigorous cleaning to ensure there are no errors or biases present, allowing you to trust that this data will result in improved performance for any language model that uses it! Get ready to see what Alpaca can do for your NLP needs.

    How to use the dataset

    This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.

    The data included in this dataset is formatted into 3 columns: “instruction”, “output” and “input.” All the data is written in English (BCP-47 en).

    To make the most out of this dataset it is recommended to:

    • Familiarize yourself with the instructions in the instruction column, as these provide guidance on how to use the other two columns, input and output.

    • Once you are comfortable with the instruction column, move on to exploring the triplets (instruction, output and input) included in this clean version of Alpaca.

    • Read through many examples, paying attention to any areas where more clarification could be added or the examples could be improved for better understanding by language models. Bear in mind, however, that these examples have already been cleaned of errors and biases found in the original dataset.

    • Get inspired! As mentioned earlier, there are more than 52k sets provided, which leaves much flexibility for varying training strategies or unique approaches when creating your own language model.

    • Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters and options depending on the outcomes you wish to achieve.

    Research Ideas

    • Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
    • Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
    • Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: train.csv

    | Column name | Description                                                              |
    |:------------|:-------------------------------------------------------------------------|
    | instruction | This column contains the instructions for the language model. (Text)     |
    | output      | This column contains the expected output from the language model. (Text) |
    | input       | This column contains the input given to the language model. (Text)       |
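
    As an illustration of how the three columns are typically combined for instruction tuning, the sketch below joins one row into a single training string. The `### Instruction` / `### Input` / `### Response` headings are one commonly used template and are an assumption here, as are the function name and the sample row; the dataset itself does not prescribe a prompt format.

```python
def build_prompt(row: dict) -> str:
    """Format one row into a prompt; the template is an assumed convention."""
    instruction = row["instruction"].strip()
    context = (row.get("input") or "").strip()
    if context:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            "### Response:\n"
        )
    # Rows with an empty input column omit the Input section entirely.
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


# Hypothetical row, not taken from train.csv.
example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt = build_prompt(example)
training_text = prompt + example["output"]
```

    Keeping the prompt and the expected output separate until the last step makes it easy to mask the prompt tokens out of the loss, if the training setup supports that.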

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  3. longform_article_summarization

    • huggingface.co
    Updated Nov 26, 2023
    Cite
    Vincent Goldberg (2023). longform_article_summarization [Dataset]. https://huggingface.co/datasets/vgoldberg/longform_article_summarization
    Dataset updated
    Nov 26, 2023
    Authors
    Vincent Goldberg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Name: Long-Form Article Summarization Dataset Description: The Long-Form Article Summarization Dataset is meticulously curated for the purpose of fine-tuning Natural Language Processing (NLP) models specifically tailored for summarization tasks. It is a rich collection of long-form articles that have been carefully condensed and summarized. The dataset provides a diverse range of topics and writing styles, making it an invaluable resource for researchers and practitioners working on… See the full description on the dataset page: https://huggingface.co/datasets/vgoldberg/longform_article_summarization.

  4. Evaluating SQuAD-based Question Answering for the Open Research Knowledge...

    • service.tib.eu
    Updated Aug 4, 2023
    + more versions
    Cite
    (2023). Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/luh-evaluating-squad-based-question-answering-for-the-open-research-knowledge-graph-completion
    Dataset updated
    Aug 4, 2023
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset. The dataset was created using a semi-automatic approach on the ORKG data. The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The json files contain only the fields necessary for training and evaluation, plus additional fields (the start and end indices of the answers in the abstracts). The data in the json files is split into training data and evaluation data. We create 4 variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which"). For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the thesis document, which can be found at https://www.repo.uni-hannover.de/handle/123456789/12958. The script used to generate the dataset can be found in the public repositories https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models
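
    To illustrate what start and end indices of an answer within an abstract look like in extractive question answering, here is a minimal sketch. The record layout and field names (`context`, `answer_start`, `answer_end`) are assumptions chosen for the example; the actual fields of the json files are documented in section 4.2 of the thesis and may differ.

```python
def make_span_record(question: str, context: str, answer_text: str):
    """Locate the answer in the context and record its character offsets.

    Returns None when the answer string does not occur in the context,
    since such a pair cannot be used for span-based training.
    """
    start = context.find(answer_text)
    if start == -1:
        return None
    end = start + len(answer_text)  # exclusive end offset
    return {
        "question": question,
        "context": context,
        "answer_text": answer_text,
        "answer_start": start,
        "answer_end": end,
    }


# Hypothetical abstract and question, not taken from the dataset.
context = "We fine-tune BERT on abstracts collected from the ORKG."
record = make_span_record("Which model is fine-tuned?", context, "BERT")
missing = make_span_record("Which model is fine-tuned?", context, "GPT")
```

    Slicing the context with the stored offsets should always reproduce the answer text, which is a cheap sanity check to run over any span-annotated file before training.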

  5. wikipedia-paragraph-sft

    • huggingface.co
    Updated Sep 16, 2024
    Cite
    Alan Tseng (2024). wikipedia-paragraph-sft [Dataset]. https://huggingface.co/datasets/agentlans/wikipedia-paragraph-sft
    Dataset updated
    Sep 16, 2024
    Authors
    Alan Tseng
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Wikipedia Paragraph Supervised Finetuning Dataset

      Model Description
    

    This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.

      Intended Use
    

    The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-paragraph-sft.
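
    Since the description mentions text passages with question-answer pairs in JSONLines format, a reader for such a file might look like the sketch below. The field names `text`, `qa`, `question` and `answer`, and the sample records, are hypothetical; the real schema is given on the dataset page.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed object per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Hypothetical records; the real field names may differ.
sample = io.StringIO(
    '{"text": "Water boils at 100 degrees Celsius at sea level.", '
    '"qa": [{"question": "At what temperature does water boil at sea level?", '
    '"answer": "100 degrees Celsius"}]}\n'
    "\n"
    '{"text": "Paris is the capital of France.", '
    '"qa": [{"question": "What is the capital of France?", "answer": "Paris"}]}\n'
)
records = list(read_jsonl(sample))
pair_count = sum(len(r["qa"]) for r in records)
```

    Reading line by line keeps memory use flat for large files; passing an open file handle instead of `io.StringIO` works the same way.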

  6. finewebedu-sft

    • huggingface.co
    Updated Aug 27, 2024
    Cite
    Alan Tseng (2024). finewebedu-sft [Dataset]. https://huggingface.co/datasets/agentlans/finewebedu-sft
    Dataset updated
    Aug 27, 2024
    Authors
    Alan Tseng
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    FineWeb-Edu Supervised Finetuning Dataset

      Model Description
    

    This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.

      Intended Use
    

    The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom datasets for… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/finewebedu-sft.

  7. IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage)

    • crawlfeeds.com
    csv, zip
    Updated Jul 5, 2025
    Cite
    Crawl Feeds (2025). IMDb Movies Metadata Dataset – 4.5M Records (Global Coverage) [Dataset]. https://crawlfeeds.com/datasets/imdb-movies-metadata-dataset-4-5m-records-global-coverage
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.

    This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.

    Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.

    What’s Included:

    • Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more

    • Delivery: Direct download

    Use Cases:

    • Train LLMs or chatbots on cinematic language and metadata

    • Build or enrich movie recommendation engines

    • Run cross-lingual or multi-region film analytics

    • Benchmark genre popularity across time periods

    • Power academic studies or entertainment dashboards

    • Feed into knowledge graphs, search engines, or NLP pipelines

  8. Labelled data for fine tuning a geological Named Entity Recognition and...

    • metadata.bgs.ac.uk
    • hosted-metadata.bgs.ac.uk
    • +1more
    html
    Updated Feb 15, 2024
    Cite
    British Geological Survey (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model [Dataset]. https://metadata.bgs.ac.uk/geonetwork/srv/api/records/15ac4ca9-3be0-119e-e063-0937940a8990
    Explore at:
    Available download formats: html
    Dataset updated
    Feb 15, 2024
    Dataset authored and provided by
    British Geological Survey (https://www.bgs.ac.uk/)
    License

    http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations

    Time period covered
    Nov 1, 2023 - Feb 15, 2024
    Description

    This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated to enable the dataset to be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to automatically create the labels, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format from the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in pdf image form. These latter documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model.
    The development of this training data and the text processing scripts were supported by a grant from UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund Project 10083604.
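
    For readers unfamiliar with relation-annotated JSONL, the sketch below resolves the relations of one record into (head, relation, tail) text triples. The key names (`text`, `entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) follow the layout doccano is understood to use for relation exports, but they are an assumption here and should be checked against the actual files. The sentence, labels and offsets are invented for the example.

```python
import json


def relation_triples(record: dict):
    """Resolve relation ids in one record to (head text, type, tail text)."""
    text = record["text"]
    # Map each entity id to the substring it annotates.
    spans = {
        e["id"]: text[e["start_offset"]:e["end_offset"]]
        for e in record["entities"]
    }
    return [
        (spans[r["from_id"]], r["type"], spans[r["to_id"]])
        for r in record["relations"]
    ]


# One invented JSONL line; not taken from the BGS dataset.
line = json.dumps({
    "id": 1,
    "text": "Formation A overlies Formation B.",
    "entities": [
        {"id": 10, "label": "FORMATION", "start_offset": 0, "end_offset": 11},
        {"id": 11, "label": "FORMATION", "start_offset": 21, "end_offset": 32},
    ],
    "relations": [
        {"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"},
    ],
})
triples = relation_triples(json.loads(line))
```

    Printing the resolved triples for a sample of records is a quick way to spot offsets that were shifted by OCR clean-up before using the data for training.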

  9. ObjectNET [7 of 10]

    • kaggle.com
    Updated Jul 15, 2022
    Cite
    Darien Schettler (2022). ObjectNET [7 of 10] [Dataset]. https://www.kaggle.com/datasets/dschettler8845/objectnet-7-of-10/discussion?sort=undefined
    Dataset updated
    Jul 15, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Darien Schettler
    Description

    NOTE: BY USING THIS DATASET YOU ACKNOWLEDGE THAT YOU HAVE READ THE LICENSE AND WILL ABIDE BY THE TERMS THEREWITHIN

    THE LICENSE

    ObjectNet is free to use for both research and commercial
    applications. The authors own the source images and allow their use
    under a license derived from Creative Commons Attribution 4.0 with
    two additional clauses:
    
    1. ObjectNet may never be used to tune the parameters of any
      model. This includes, but is not limited to, computing statistics
      on ObjectNet and including those statistics into a model,
      fine-tuning on ObjectNet, performing gradient updates on any
      parameters based on these images.
    
    2. Any individual images from ObjectNet may only be posted to the web
      including their 1 pixel red border.
    
    If you post this archive in a public location, please leave the password
    intact as "objectnetisatestset".
    
    [Other General License Information Conforms to Attribution 4.0 International]
    


    ⚠️🛑⚠️ ⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️

    IMPORTANT NOTE ––– THIS DATASET IS ONLY FOR VALIDATION/TESTING * YOU CANNOT USE IT TO TRAIN MODELS IN ANY WAY * IF YOU TRAIN A MODEL WITH IT YOU ARE VIOLATING THE LICENSE AGREEMENT * IF YOU POST IMAGES FROM THIS DATASET ANYWHERE YOU MUST ADD A RED BORDER TO THE IMAGE * IF YOU POST IMAGES WITHOUT THE BORDER YOU ARE VIOLATING THE LICENSE AGREEMENT

    ⚠️🛑⚠️ ⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️⚠️🛑⚠️



    This is Part 7 of 10 * Original Paper Link * ObjectNet Website


    The links to the various parts of the dataset are:



    Description From ObjectNET Homepage



    What is ObjectNet?

    • A new kind of vision dataset borrowing the idea of controls from other areas of science.
    • No training set, only a test set! Put your vision system through its paces.
    • Collected to intentionally show objects from new viewpoints on new backgrounds.
    • 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint.
    • 313 object classes with 113 overlapping ImageNet
    • Large performance drop, what you can expect from vision systems in the real world!
    • Robust to fine-tuning and a very difficult transfer learning problem


    Controls For Biases Increase Variation


    https://objectnet.dev/images/objectnet_controls_table.png



    Easy For Humans, Hard For Machines

    • Ready to help develop the next generation of object recognition algorithms that have robustness, bias, and safety in mind.
    • Controls can remove bias from other datasets machine learning, not just vision.


    https://objectnet.dev/images/objectnet_results.png



    Full Description

    ObjectNet is a large real-world test set for object recognition with control where object backgrounds, rotations, and imaging viewpoints are random.

    Most scientific experiments have controls, confounds which are removed from the data, to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for new datasets and perform better on datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance, with respect to their performance on other benchmarks, due to the controls for biases. Controls make ObjectNet robust to fine-tuning showing only small performance increases.

    We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generaliz...

  10. Data from: HL Dataset: Visually-grounded Description of Scenes, Actions and...

    • data.niaid.nih.gov
    Updated Feb 28, 2024
    Cite
    Kees van Deemter (2024). HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10723070
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Albert Gatt
    Michele Cafagna
    Kees van Deemter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.

  11. Apple Leaf Disease Detection Using Vision Transformer

    • zenodo.org
    text/x-python
    Updated Jun 20, 2025
    Cite
    Amreen Batool; Amreen Batool (2025). Apple Leaf Disease Detection Using Vision Transformer [Dataset]. http://doi.org/10.5281/zenodo.15702007
    Explore at:
    Available download formats: text/x-python
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amreen Batool; Amreen Batool
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Introduction

    The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.

    Code Explanation

    1. Importing Libraries

    • The script starts by importing necessary libraries such as matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

    2. Visualizing the Dataset

    • The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class.
    • The dataset is divided into Train, Val, and Test directories, each containing subdirectories for the four classes.

    3. Data Augmentation

    • The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
    • Separate generators are created for training, validation, and test datasets.

    4. Patch Visualization

    • The script defines a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
    • The script visualizes these patches for different patch sizes (32x32, 16x16, 8x8) to understand how the image is divided.

    5. Model Training

    • The script defines a Vision Transformer (ViT) model using TensorFlow and Keras. The model is compiled with the Adam optimizer and categorical cross-entropy loss.
    • The model is trained for a specified number of epochs, and the training history is stored for later analysis.

    6. Model Evaluation

    • After training, the model is evaluated on the test dataset. The script generates a confusion matrix and a classification report to assess the model's performance.
    • The confusion matrix is visualized using seaborn to provide a clear understanding of the model's predictions.

    7. Visualizing Misclassified Images

    • The script includes functionality to visualize misclassified images, which helps in understanding where the model is making errors.

    8. Fine-Tuning and Learning Rate Adjustment

    • The script demonstrates how to fine-tune the model by adjusting the learning rate and re-training the model.
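
    To make the patch step in item 4 concrete, the sketch below splits an image into non-overlapping square patches and counts them for the three patch sizes mentioned. It is a plain-Python illustration, not the Keras Patches layer used by the script; the 64x64 image size and the function name are assumptions for the example.

```python
def extract_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Rows and columns that do not fill a complete patch are dropped.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [
                row[left:left + patch_size]
                for row in image[top:top + patch_size]
            ]
            patches.append(patch)
    return patches


# Assumed 64x64 single-channel image filled with distinct pixel ids.
image = [[r * 64 + c for c in range(64)] for r in range(64)]
counts = {size: len(extract_patches(image, size)) for size in (32, 16, 8)}
```

    Halving the patch size quadruples the number of patches, so the sequence the transformer has to attend over grows quickly as patches get smaller.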

    Steps for Implementation

    1. Dataset Preparation

      • Ensure that the dataset is organized into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
    2. Install Required Libraries

      • Install the necessary Python libraries using pip:
        pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
    3. Run the Script

      • Execute the script in a Python environment. The script will automatically:
        • Load and preprocess the dataset.
        • Apply data augmentation.
        • Train the Vision Transformer model.
        • Evaluate the model and generate performance metrics.
    4. Analyze Results

      • Review the confusion matrix and classification report to understand the model's performance.
      • Visualize misclassified images to identify potential areas for improvement.
    5. Fine-Tuning

      • Experiment with different patch sizes, learning rates, and data augmentation techniques to improve the model's accuracy.
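
    The confusion matrix reviewed in step 4 can be computed by hand to see what it contains. The sketch below is a plain-Python stand-in for the scikit-learn utilities the script is described as using; the label lists are invented for illustration and are not results from the model.

```python
CLASSES = ["Healthy", "Apple Scab", "Black Rot", "Cedar Apple Rust"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, predicted in zip(y_true, y_pred):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]


# Invented labels for illustration only.
y_true = ["Healthy", "Healthy", "Apple Scab", "Black Rot",
          "Cedar Apple Rust", "Apple Scab"]
y_pred = ["Healthy", "Apple Scab", "Apple Scab", "Black Rot",
          "Cedar Apple Rust", "Apple Scab"]
matrix = confusion_matrix(y_true, y_pred, CLASSES)
recall = per_class_recall(matrix)
```

    Off-diagonal cells show which pairs of classes are confused with each other, which is the starting point for the misclassified-image review described above.
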
  12. WONDERBREAD: A Benchmark + Dataset for Business Process Management (BPM)...

    • zenodo.org
    csv, json, zip
    Updated Oct 14, 2024
    Cite
    Michael Wornow; Michael Wornow (2024). WONDERBREAD: A Benchmark + Dataset for Business Process Management (BPM) Tasks [Dataset]. http://doi.org/10.5281/zenodo.12671568
    Explore at:
    Available download formats: csv, zip, json
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Wornow; Michael Wornow
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 6, 2024
    Description

    Paper: WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks

    Background

    The WONDERBREAD dataset contains 2,928 human demonstrations of 598 web navigation workflows across 6 types of BPM tasks. These tasks measure the ability of a model to generate accurate documentation, assist in knowledge transfer, and improve the efficiency of workflows.

    Please see our website for more details: https://wonderbread.stanford.edu/

    Quick Start

    To start, download debug_demos.zip (~1 GB). It contains a subset of 24 demonstrations which can give you a sense of how the dataset is structured.

    To reproduce the paper, download gold_demos.zip (~33 GB). It contains 724 demonstrations corresponding to the 162 "Gold" tasks which were used for all the evaluations in the original paper.

    To obtain the full dataset, download demos.zip (~133 GB). This contains all 2,928 demonstrations and can be used for training, fine-tuning, and evaluating models.

    Dataset Structure

    The dataset contains several files, defined below.

    1. Raw Data (useful for training/fine-tuning/evaluation)
      1. debug_demos.zip -- a subset of only 24 demonstrations taken from the full dataset. Useful to get a sense of the dataset and for debugging.
      2. gold_demos.zip -- a subset of only 724 demonstrations corresponding to the 162 "Gold" tasks. This is the dataset that was used for all evaluations in the original WONDERBREAD paper.
      3. demos.zip -- all 2,928 demonstrations across 598 tasks. Useful for training your own models.
    2. Evaluation (useful for evaluation)
      1. qa_dataset.csv -- contains all 120 questions and ground truth answers used in the "Knowledge Transfer" evaluation.
      2. df_rankings.csv -- contains the rankings of all "Gold" tasks used in the "SOP Ranking" evaluation.
    3. Metadata (can be safely ignored)
      1. Process Mining Task Demonstrations.xlsx -- maps human annotators to specific demonstrations; also contains "Gold" task rankings used in the "SOP Ranking" evaluation.
      2. metadata.json -- maps Google Drive URLs to Google Drive Folder IDs to demonstration names
      3. df_valid.csv -- tracks assets associated with each demonstration
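
    Before unpacking one of the archives listed above, it may be useful to check how its contents are organized. The following is a minimal sketch using only the Python standard library; the helper name `summarize_zip` is hypothetical, and the internal folder layout of the archives is not described here, so the grouping by top-level folder is an assumption.

```python
import os
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(path: str, depth: int = 1) -> Counter:
    """Count files in an archive, grouped by their first `depth` path components."""
    counts: Counter = Counter()
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            parts = PurePosixPath(info.filename).parts
            counts["/".join(parts[:depth])] += 1
    return counts


if __name__ == "__main__":
    # Assumes debug_demos.zip has been downloaded to the working directory.
    if os.path.exists("debug_demos.zip"):
        for folder, n in summarize_zip("debug_demos.zip").most_common(10):
            print(f"{n:6d}  {folder}")
```

    Listing members this way reads only the archive's index, so it works on the larger archives without extracting them.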
  13. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 21, 2023
    Cite
    Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
    Explore at:
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Senghaas, Mika
    Nutter, Peter
    Cizinsky, Ludek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
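
    For readers unfamiliar with the metric quoted above, macro F1 in a multi-label setting is the unweighted mean of the per-label F1 scores. The sketch below is a plain-Python illustration, not the project's evaluation code; the label names are placeholders, and scoring a label with no true or predicted positives as 0 is an assumed convention.

```python
from typing import Sequence, Set


def macro_f1(
    y_true: Sequence[Set[str]],
    y_pred: Sequence[Set[str]],
    labels: Sequence[str],
) -> float:
    """Unweighted mean of per-label F1 over a multi-label dataset."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label that never occurs scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder labels and predictions for three websites.
    truth = [{"News", "Sports"}, {"Arts"}, {"Sports"}]
    pred = [{"News"}, {"Arts", "Sports"}, {"Sports"}]
    print(round(macro_f1(truth, pred, ["Arts", "News", "Sports"]), 4))
```

    Because every label counts equally, rare topics influence macro F1 as much as common ones.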

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  14. cvedataset.jsonl

    • huggingface.co
    Updated Mar 2, 2025
    Cite
    Thierno Diallo (2025). cvedataset.jsonl [Dataset]. https://huggingface.co/datasets/iamthierno/cvedataset.jsonl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 2, 2025
    Authors
    Thierno Diallo
    Description

    CVE Dataset (1999-2024) for LLM Fine-Tuning

    Overview

    This dataset comprises Common Vulnerabilities and Exposures (CVE) records spanning from 1999 to 2024. Each entry provides essential information on software vulnerabilities, their descriptions, affected products and versions, CVSS scores, and relevant references. The data is formatted in a JSON Lines (.jsonl) structure, making it suitable for fine-tuning Large Language Models (LLMs) for tasks such as cybersecurity… See the full description on the dataset page: https://huggingface.co/datasets/iamthierno/cvedataset.jsonl.
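
    Since the data is stored as JSON Lines (one JSON object per line), it can be streamed record by record without loading the whole file. Below is a minimal sketch using the standard library; the field names `id` and `description` and the sample values are placeholders, as the dataset's actual schema is not shown in this listing.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON ({exc.msg})") from exc


if __name__ == "__main__":
    # Placeholder records; real field names may differ.
    sample = io.StringIO(
        '{"id": "EXAMPLE-0001", "description": "placeholder"}\n'
        "\n"
        '{"id": "EXAMPLE-0002", "description": "placeholder"}\n'
    )
    for record in iter_jsonl(sample):
        print(record["id"])
```

    For a file on disk, the same function can be passed an open file handle, for example `open("cvedataset.jsonl", encoding="utf-8")`, assuming the file has been downloaded under that name.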

  15. Object Detection For Mstar Imagery Dataset

    • universe.roboflow.com
    zip
    Updated Nov 21, 2024
    Cite
    Corn (2024). Object Detection For Mstar Imagery Dataset [Dataset]. https://universe.roboflow.com/corn-y933v/object-detection-for-mstar-imagery/model/3
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 21, 2024
    Dataset authored and provided by
    Corn
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Armored Vehicles Bounding Boxes
    Description

    Exploring Object Detection Techniques for MSTAR IU Mixed Targets Dataset

    Introduction: The rapid advancements in machine learning and computer vision have significantly improved object detection capabilities. In this project, we aim to explore and develop object detection techniques specifically tailored to the MSTAR IU Mixed Targets. This dataset, provided by the Sensor Data Management System, offers a valuable resource for training and evaluating object detection models for synthetic aperture radar (SAR) imagery.

    Objective: Our primary objective is to develop an efficient and accurate object detection model that can identify and localize various targets within the MSTAR IU Mixed Targets dataset. By achieving this, we aim to enhance the understanding and applicability of SAR imagery in real-world scenarios, such as surveillance, reconnaissance, and military applications.

    Ethics: As responsible researchers, we recognize the importance of ethics in conducting our project. We are committed to ensuring the ethical use of data and adhering to privacy guidelines. The MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System will be used solely for academic and research purposes. Any personal information or sensitive data within the dataset will be handled with utmost care and confidentiality.

    Data Attribution and Giving Credit: We deeply appreciate the Sensor Data Management System for providing the MSTAR IU Mixed Targets dataset. We understand the effort and resources invested in curating and maintaining this valuable dataset, which forms the foundation of our project. To acknowledge and give credit to the Sensor Data Management System, we will prominently mention their contribution in all project publications, reports, and presentations. We will provide appropriate citations and include a statement recognizing their dataset as the source of our training and evaluation data.

    Methodology:

    1. Data Preprocessing: We will preprocess the MSTAR IU Mixed Targets dataset to make it compatible with the YOLOv8 object detection algorithm. This will involve resizing, normalizing, and augmenting the images.

    2. Training and Evaluation: The selected model will be trained on the preprocessed dataset, utilizing appropriate loss functions and optimization techniques. We will extensively evaluate the model's performance using standard evaluation metrics such as precision, recall, and mean average precision (mAP).

    3. Fine-tuning and Optimization: We will fine-tune the model on the MSTAR IU Mixed Targets dataset to enhance its accuracy and adaptability to SAR-specific features. Additionally, we will explore techniques such as transfer learning and data augmentation to further improve the model's performance.

    4. Results and Analysis: The final model's performance will be analyzed in terms of detection accuracy, computational efficiency, and generalization capability. We will conduct comprehensive experiments and provide visualizations to showcase the model's object detection capabilities on the MSTAR IU Mixed Targets dataset.

    5. Model Selection and Re-evaluation: We will evaluate and compare state-of-the-art object detection models to identify the most suitable architecture for SAR imagery. This will involve researching and implementing models such as Faster R-CNN, SSD, or other YOLO versions, considering their performance, speed, and adaptability to the MSTAR dataset.
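
    The precision and recall metrics named in the methodology rest on matching predicted boxes to ground-truth boxes by intersection over union (IoU). The sketch below illustrates that idea in plain Python. It is not the project's evaluation code: the greedy matching, the 0.5 threshold, and the `(x_min, y_min, x_max, y_max)` box format are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedy one-to-one matching; preds are assumed sorted by confidence."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= threshold:
            unmatched.remove(best)  # each ground-truth box matches at most once
            tp += 1
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall
```

    Mean average precision (mAP) builds on the same matching by sweeping the confidence threshold per class and averaging the area under each precision-recall curve.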

    Conclusion: This project aims to contribute to the field of object detection in SAR imagery by leveraging the valuable MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System. We will ensure ethical use of the data and give proper credit to the dataset's source. By developing an accurate and efficient object detection model, we hope to advance the understanding and application of SAR imagery in various domains.

    Note: This project description serves as an overview and can be expanded upon in terms of specific methodologies, experiments, and evaluation techniques as the project progresses.

  16. Danish Conversation Chat Dataset for Telecom Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Danish Conversation Chat Dataset for Telecom Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/danish-telecom-domain-conversation-text-dataset
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.

    Participants Details: 150+ native Danish participants from the FutureBeeAI community.
    Word Count & Length: Chats are diverse, averaging 300 to 700 words and 50 to 150 turns across both speakers.

    Topic Diversity

    The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.

    Inbound Chats:
    Phone Number Porting
    Network Connectivity Issues
    Billing and Payments
    Technical Support
    Service Activation
    International Roaming Enquiry
    Refunds and Billing Adjustments
    Emergency Service Access, and many more
    Outbound Chats:
    Welcome Calls / Onboarding Process
    Payment Reminders
    Customer Surveys
    Technical Updates
    Service Usage Reviews
    Network Complaint Update, and many more

    Language Variety & Nuances

    The conversations in this dataset capture the diverse language styles and expressions prevalent in Danish Telecom interactions. This diversity ensures the dataset accurately represents the language used by Danish speakers in Telecom contexts.

    The dataset encompasses a wide array of language elements, including:

    Naming Conventions: Chats include a variety of Danish personal and business names.
    Localized Details: Real-world addresses, emails, phone numbers, and other contact information following the conventions of different Danish-speaking regions.
    Temporal and Numeric Expressions: Dates, times, currencies, and numbers in Danish forms, adhering to local conventions.
    Idiomatic Expressions and Slang: Local slang, idioms, and informal phrases present in Danish Telecom conversations.

    This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Danish Telecom interactions.

    Conversational Flow and Interaction Types

    The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.

    Simple Inquiries
    Detailed Discussions
    Transactional Interactions
    Problem-Solving Dialogues
    Advisory Sessions
    Routine Checks and Follow-Ups

    Each of these conversations contains various aspects of conversation flow like:

    Greetings
    Authentication
    Information gathering
    Resolution identification

  17. Data from: BuildingsBench: A Large-Scale Dataset of 900K Buildings and...

    • data.openei.org
    • osti.gov
    code, data, website
    Updated Dec 31, 2018
    + more versions
    Cite
    Patrick Emami; Peter Graf (2018). BuildingsBench: A Large-Scale Dataset of 900K Buildings and Benchmark for Short-Term Load Forecasting [Dataset]. http://doi.org/10.25984/1986147
    Explore at:
    code, website, data (available download formats)
    Dataset updated
    Dec 31, 2018
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    National Renewable Energy Laboratory
    Open Energy Data Initiative (OEDI)
    Authors
    Patrick Emami; Peter Graf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The BuildingsBench datasets consist of:

    • Buildings-900K: A large-scale dataset of 900K buildings for pretraining models on the task of short-term load forecasting (STLF). Buildings-900K is statistically representative of the entire U.S. building stock.
    • 7 real residential and commercial building datasets for benchmarking two downstream tasks evaluating generalization: zero-shot STLF and transfer learning for STLF.

    Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).

    BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:

    1. ElectricityLoadDiagrams20112014
    2. Building Data Genome Project-2
    3. Individual household electric power consumption (Sceaux)
    4. Borealis
    5. SMART
    6. IDEAL
    7. Low Carbon London

    A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.

  18. Vietnamese Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Vietnamese Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/vietnamese-healthcare-domain-conversation-text-dataset
    Explore at:
    wav (available download formats)
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.

    Participant & Chat Overview

    Participants: 150+ native Vietnamese speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Vietnamese healthcare communication and includes:

    Authentic Naming Patterns: Vietnamese personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Vietnamese formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Vietnamese-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines

  19. Experiment environment.

    • plos.figshare.com
    xls
    Updated Jan 19, 2024
    Cite
    Le Bu; Caiping Hu; Xiuliang Zhang (2024). Experiment environment. [Dataset]. http://doi.org/10.1371/journal.pone.0296789.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Le Bu; Caiping Hu; Xiuliang Zhang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The recognition of food images is of great significance for nutrition monitoring, food retrieval and food recommendation. However, the accuracy of recognition had not been high enough due to the complex background of food images and the characteristics of small inter-class differences and large intra-class differences. To solve these problems, this paper proposed a food image recognition method based on transfer learning and ensemble learning. Firstly, generic image features were extracted by using the convolutional neural network models (VGG19, ResNet50, MobileNet V2, AlexNet) pre-trained on the ImageNet dataset. Secondly, the 4 pre-trained models were transferred to the food image dataset for model fine-tuning. Finally, different basic learner combination strategies were adopted to establish the ensemble model and classify feature information. In this paper, several kinds of experiments were performed to compare the results of food image recognition between single models and ensemble models on food-11 dataset. The experimental results demonstrated that the accuracy of the ensemble model was the highest, reaching 96.88%, which was superior to any base learner. Therefore, the convolutional neural network model based on transfer learning and ensemble learning has strong learning ability and generalization ability, and it is feasible and practical to apply the method to food image recognition.
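
    The abstract does not say which combination strategies were used, so the following is only a generic illustration of two common ways to combine base learners: soft voting (averaging class probabilities) and hard voting (majority of predicted classes). The function names and the probability values are hypothetical.

```python
from collections import Counter
from typing import Sequence


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Average class probabilities across models and return the best class index."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(m[c] for m in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def hard_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Let each model vote for its top class and return the most common vote."""
    votes = [max(range(len(m)), key=m.__getitem__) for m in probabilities]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical outputs of three base learners over three classes.
    outputs = [
        [0.90, 0.10, 0.00],
        [0.40, 0.60, 0.00],
        [0.45, 0.55, 0.00],
    ]
    print("soft vote:", soft_vote(outputs))
    print("hard vote:", hard_vote(outputs))
```

    The two strategies can disagree: one very confident model can outweigh two weakly confident ones under soft voting, but not under hard voting.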

  20. Quilt-1M: One Million Image-Text Pairs for Histopathology

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 16, 2023
    Cite
    Linda G. Shapiro (2023). Quilt-1M: One Million Image-Text Pairs for Histopathology [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8239941
    Explore at:
    Dataset updated
    Aug 16, 2023
    Dataset provided by
    Linda G. Shapiro
    Fatemeh Ghezloo
    Ranjay Krishna
    Mehmet S. Seyfioglu
    Wisdom Oluchi Ikezogwo
    Pavan K. Anand
    Fatwir S. Mohammed
    Dylan Geva
    Description

    Recent accelerations in multi-modal applications have been made possible with the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has slowed similar progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource of videos, offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 802,148 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, marking it as the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear probing tasks for classifying new pathology images across 13 diverse patch-level datasets of 8 different sub-pathologies and cross-modal retrieval tasks.

Cite
Akshat Gupta (2024). LLM Fine Tuning Dataset of Indian Legal Texts [Dataset]. https://www.kaggle.com/datasets/akshatgupta7/llm-fine-tuning-dataset-of-indian-legal-texts/discussion

LLM Fine Tuning Dataset of Indian Legal Texts

QA Dataset for fine tuning LLMs on IPC, CRPC, and Indian Constitution

Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 30, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Akshat Gupta
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Area covered
India
Description

This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), Criminal Procedure Code (CRPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.

Dataset Details:

  • Sources: The questions and answers in this dataset are extracted from the Indian Constitution, Indian Penal Code (IPC), and the Code of Criminal Procedure (CrPC), ensuring relevance and accuracy in legal contexts.
  • Content: Each entry in the dataset contains a clear and concise question alongside its corresponding answer. The questions are designed to cover fundamental concepts, key provisions, and significant terms found within these legal documents.

Use Cases:

  • Legal Research: A valuable tool for lawyers, legal researchers, and students seeking to understand legal terminology and principles as outlined in Indian law.
  • Natural Language Processing (NLP): This dataset is ideal for training AI models for question-answering systems that require a strong understanding of Indian legal texts.
  • Educational Resources: Useful for creating educational tools and materials for law students and legal practitioners.

Note on Use and Limitations:

  • Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.

  • Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.

  • Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.

  • Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.

  • Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
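
The preprocessing recommended above could start with something as simple as whitespace normalization, dropping empty entries, and removing duplicate questions. The sketch below shows one possible approach using only the standard library; the `question` and `answer` field names are assumptions, since the dataset's column names are not given here, and the sample strings are placeholders.

```python
import re
import unicodedata
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop empty entries and keep the first occurrence of each question."""
    seen = set()
    cleaned = []
    for pair in pairs:
        question = normalize(pair.get("question", ""))
        answer = normalize(pair.get("answer", ""))
        if not question or not answer:
            continue  # skip incomplete pairs
        key = question.casefold()  # case-insensitive duplicate check
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"question": question, "answer": answer})
    return cleaned


if __name__ == "__main__":
    # Placeholder rows, not taken from the dataset.
    rows = [
        {"question": "  What is   X? ", "answer": "Y"},
        {"question": "what is x?", "answer": "Z"},
        {"question": "Q2", "answer": "   "},
    ]
    print(clean_pairs(rows))
```

Filtering out-of-context pairs generally needs human or model-assisted review; exact-match deduplication like this only removes the most obvious redundancies.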
