Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset comprises curated question-answer pairs derived from key legal texts pertinent to Indian law, specifically the Indian Penal Code (IPC), Criminal Procedure Code (CRPC), and the Indian Constitution. The goal of this dataset is to facilitate the development and fine-tuning of language models and AI applications that assist legal professionals in India.
Misuse of Dataset: This dataset is intended for educational, research, and development purposes only. Users should exercise caution to ensure that any AI applications developed using this dataset do not misrepresent or distort legal information. The dataset should not be used for legal advice or to influence legal decisions without proper context and verification.
Relevance and Context: While every effort has been made to ensure the accuracy and relevance of the question-answer pairs, some entries may be out of context or may not fully represent the legal concepts they aim to explain. Users are strongly encouraged to conduct thorough reviews of the entries, particularly when using them in formal applications or legal research.
Data Preprocessing Recommended: Due to the nature of natural language, the QA pairs may include variations in phrasing, potential redundancies, or entries that may not align perfectly with the intended legal context. Therefore, it is highly recommended that users perform data preprocessing to cleanse, normalize, or filter out any irrelevant or out-of-context pairs before integrating the dataset into machine learning models or systems.
Dynamic Nature of Law: The legal landscape is subject to change over time. As laws and interpretations evolve, some answers may become outdated or less applicable. Users should verify the current applicability of legal concepts and check sources for updates when necessary.
Credits and Citations: If you use this dataset in your research or projects, appropriate credits should be provided. Users are also encouraged to share any improvements, corrections, or updates they make to the dataset for the benefit of the community.
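The preprocessing recommended above (cleansing, normalizing, filtering redundant or incomplete entries) can be sketched with a small filter. The `question`/`answer` field names here are illustrative assumptions; adapt them to the dataset's actual schema.

```python
def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-duplicate phrasings compare equal."""
    return " ".join(text.split()).lower()

def clean_qa_pairs(pairs):
    """Drop empty entries and duplicates-after-normalization."""
    seen, cleaned = set(), []
    for pair in pairs:
        q, a = pair.get("question", ""), pair.get("answer", "")
        if not q.strip() or not a.strip():
            continue  # filter incomplete entries
        key = (normalize(q), normalize(a))
        if key in seen:
            continue  # filter redundant rephrasings
        seen.add(key)
        cleaned.append({"question": q, "answer": a})
    return cleaned

pairs = [
    {"question": "What is Section 302 IPC?", "answer": "It deals with punishment for murder."},
    {"question": "What is  Section 302 IPC? ", "answer": "It deals with punishment for murder."},
    {"question": "", "answer": "Orphaned answer."},
]
print(len(clean_qa_pairs(pairs)))  # 1
```

Exact-duplicate removal like this is only a first pass; semantically redundant pairs still need manual or embedding-based review.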
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
Alpaca is the perfect dataset for fine-tuning your language models to better understand and follow instructions, taking you beyond standard Natural Language Processing (NLP) abilities! This curated, cleaned dataset provides over 52,000 expertly crafted instructions and demonstrations generated by OpenAI's text-davinci-003 engine, all in English (BCP-47 en). Improve the quality of your language models with the instruction, output, and input fields, which have been designed to enhance every aspect of their comprehension. The data has gone through rigorous cleaning to remove errors and biases, so you can trust it to improve the performance of any language model that uses it. Get ready to see what Alpaca can do for your NLP needs!
This dataset provides a unique and valuable resource for anyone who wishes to create, develop and train language models. Alpaca provides users with 52,000 instruction-demonstration pairs generated by OpenAI's text-davinci-003 engine.
The data included in this dataset is formatted into 3 columns: "instruction", "output", and "input". All the data is written in English (BCP-47 en).
To make the most out of this dataset, it is recommended to:
1. Familiarize yourself with the entries in the instruction column, as these provide guidance on how to use the other two columns, input and output.
2. Once comfortable with the instruction column, explore the instruction-output-input triplets provided in this cleaned version of Alpaca.
3. Read through many examples, paying attention to any areas where clarification could be added for a better understanding of language models; bear in mind that these examples have already been cleaned of the errors and biases found in the original dataset.
4. Get inspired! As mentioned earlier, there are more than 52k sets provided, giving you plenty of flexibility for varying training strategies or unique approaches when creating your own language model.
5. Finally, while not essential, it may be helpful to be familiar with OpenAI's text-davinci engine and to experiment with different parameters/options depending on the outcomes you wish to achieve.
- Developing natural language processing (NLP) tasks that aim to better automate and interpret instructions given by humans.
- Training machine learning models of robotic agents to be able to understand natural language commands, as well as understand the correct action that needs to be taken in response.
- Creating a system that can generate personalized instructions and feedback in real time based on language models, catering specifically to each individual user's preferences or needs
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| instruction | This column contains the instructions for the language model. (Text) |
| output      | This column contains the expected output from the language model. (Text) |
| input       | This column contains the input given to the language model. (Text) |
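As a sketch, the three columns can be assembled into a single training prompt. The template below is the layout commonly associated with Alpaca, but verify it against whatever format your fine-tuning framework expects.

```python
def format_example(instruction: str, output: str, input_text: str = "") -> str:
    """Assemble one training prompt from the instruction/input/output columns."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n{output}"
        )
    # Many rows have an empty input column; those use a shorter preamble.
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )

prompt = format_example("Name three primary colors.", "Red, blue, and yellow.")
print(prompt.splitlines()[0])
```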
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Name: Long-Form Article Summarization Dataset Description: The Long-Form Article Summarization Dataset is meticulously curated for the purpose of fine-tuning Natural Language Processing (NLP) models specifically tailored for summarization tasks. It is a rich collection of long-form articles that have been carefully condensed and summarized. The dataset provides a diverse range of topics and writing styles, making it an invaluable resource for researchers and practitioners working on… See the full description on the dataset page: https://huggingface.co/datasets/vgoldberg/longform_article_summarization.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset. The dataset was created using a semi-automatic approach on the ORKG data. The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, with additional fields (the indices of the start and end of the answers in the abstracts). The data in the JSON files is split into training and evaluation sets. We create 4 variants of the training and evaluation sets, one for each of the question labels ("no label", "how", "what", "which"). For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the thesis document, which can be found at https://www.repo.uni-hannover.de/handle/123456789/12958. The scripts used to generate the dataset can be found in the public repositories https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models
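A minimal sketch of how such answer start/end indices can be derived, assuming the answer occurs verbatim in the abstract (the thesis's actual generation script may handle matching differently):

```python
def add_answer_span(abstract: str, answer: str):
    """Locate the answer inside the abstract and return SQuAD-style
    character-level start/end indices, or None if it does not occur verbatim."""
    start = abstract.find(answer)
    if start == -1:
        return None
    return {"answer": answer, "answer_start": start, "answer_end": start + len(answer)}

abstract = "We evaluate BERT-based models on the ORKG completion task."
span = add_answer_span(abstract, "BERT-based models")
print(span)  # {'answer': 'BERT-based models', 'answer_start': 12, 'answer_end': 29}
```

`str.find` returns only the first occurrence; abstracts with repeated answer strings need disambiguation (e.g. by context window) before the span is trusted.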
https://choosealicense.com/licenses/odc-by/
Wikipedia Paragraph Supervised Finetuning Dataset
Model Description
This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.
Intended Use
The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-paragraph-sft.
https://choosealicense.com/licenses/odc-by/
FineWeb-Edu Supervised Finetuning Dataset
Model Description
This dataset is designed for training language models to generate supervised finetuning data from raw text. It consists of text passages and corresponding question-answer pairs in JSONLines format.
Intended Use
The primary purpose of this dataset is to enable large language models (LLMs) to generate high-quality supervised finetuning data from raw text inputs, useful for creating custom datasets for… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/finewebedu-sft.
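Both of these SFT datasets are distributed in JSONLines format, which can be streamed one record at a time. The field names in the sample below are illustrative assumptions, so check them against the actual files.

```python
import io
import json

def read_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulate a one-line .jsonl file (hypothetical field names).
sample = io.StringIO(
    '{"text": "Paris is the capital of France.", '
    '"question": "What is the capital of France?", "answer": "Paris"}\n'
)
records = list(read_jsonl(sample))
print(records[0]["answer"])  # Paris
```

Streaming line by line avoids loading the whole file into memory, which matters for large SFT corpora.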
https://crawlfeeds.com/privacy_policy
Unlock one of the most comprehensive movie datasets available—4.5 million structured IMDb movie records, extracted and enriched for data science, machine learning, and entertainment research.
This dataset includes a vast collection of global movie metadata, including details on title, release year, genre, country, language, runtime, cast, directors, IMDb ratings, reviews, and synopsis. Whether you're building a recommendation engine, benchmarking trends, or training AI models, this dataset is designed to give you deep and wide access to cinematic data across decades and continents.
Perfect for use in film analytics, OTT platforms, review sentiment analysis, knowledge graphs, and LLM fine-tuning, the dataset is cleaned, normalized, and exportable in multiple formats.
Genres: Drama, Comedy, Horror, Action, Sci-Fi, Documentary, and more
Train LLMs or chatbots on cinematic language and metadata
Build or enrich movie recommendation engines
Run cross-lingual or multi-region film analytics
Benchmark genre popularity across time periods
Power academic studies or entertainment dashboards
Feed into knowledge graphs, search engines, or NLP pipelines
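Since the dataset is exportable in multiple formats including CSV, records like these can be filtered with the standard library. The column names below are hypothetical and should be matched to the actual export.

```python
import csv
import io

# Toy stand-in for the CSV export (hypothetical header).
raw = io.StringIO(
    "title,year,genre,language,rating\n"
    "Alien,1979,Sci-Fi,English,8.5\n"
    "Amelie,2001,Comedy,French,8.3\n"
)

def top_rated(rows, genre, min_rating):
    """Titles in the given genre at or above the rating threshold."""
    return [r["title"] for r in rows
            if r["genre"] == genre and float(r["rating"]) >= min_rating]

rows = list(csv.DictReader(raw))
print(top_rated(rows, "Sci-Fi", 8.0))  # ['Alien']
```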
http://inspire.ec.europa.eu/metadata-codelist/LimitationsOnPublicAccess/noLimitations
This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were an extract from earlier published work: https://github.com/BritishGeologicalSurvey/geo-ner-model and https://zenodo.org/records/4181488. The data can be used to fine-tune a pre-trained large language model using transfer learning, to create a model that can be used in inference mode to generate the labels automatically, thereby creating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL (Relation) format, the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published in PDF image form. These latter documents had to undergo OCR, which resulted in lower-quality text and lower-quality training data. The majority of the labelled data is from the higher-quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so this should not be treated as a gold-standard labelled dataset, and it is of insufficient volume to build a performant model.
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
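A doccano JSONL (Relation) line can be turned back into (entity, relation, entity) triples roughly as follows. The field names reflect recent doccano relation exports and may differ in other versions; the geological example line is invented for illustration.

```python
import json

# One hypothetical line of a doccano JSONL (Relation) export.
line = json.dumps({
    "text": "The Lias Group overlies the Penarth Group.",
    "entities": [
        {"id": 1, "label": "Formation", "start_offset": 4, "end_offset": 14},
        {"id": 2, "label": "Formation", "start_offset": 28, "end_offset": 41},
    ],
    "relations": [{"from_id": 1, "to_id": 2, "type": "overlies"}],
})

record = json.loads(line)
# Resolve each entity id to the text span its offsets point at.
spans = {e["id"]: record["text"][e["start_offset"]:e["end_offset"]]
         for e in record["entities"]}
for rel in record["relations"]:
    print(spans[rel["from_id"]], rel["type"], spans[rel["to_id"]])
```

Round-tripping offsets back to text like this is also a quick sanity check that the annotations survived OCR and export intact.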
ObjectNet is free to use for both research and commercial
applications. The authors own the source images and allow their use
under a license derived from Creative Commons Attribution 4.0 with
two additional clauses:
1. ObjectNet may never be used to tune the parameters of any
model. This includes, but is not limited to, computing statistics
on ObjectNet and including those statistics into a model,
fine-tuning on ObjectNet, performing gradient updates on any
parameters based on these images.
2. Any individual images from ObjectNet may only be posted to the web
with their 1-pixel red border included.
If you post this archive in a public location, please leave the password
intact as "objectnetisatestset".
[Other General License Information Conforms to Attribution 4.0 International]
⚠️🛑⚠️
IMPORTANT NOTE: THIS DATASET IS ONLY FOR VALIDATION/TESTING
- YOU CANNOT USE IT TO TRAIN MODELS IN ANY WAY
- IF YOU TRAIN A MODEL WITH IT YOU ARE VIOLATING THE LICENSE AGREEMENT
- IF YOU POST IMAGES FROM THIS DATASET ANYWHERE YOU MUST ADD A RED BORDER TO THE IMAGE
- IF YOU POST IMAGES WITHOUT THE BORDER YOU ARE VIOLATING THE LICENSE AGREEMENT
⚠️🛑⚠️
This is Part 7 of 10 * Original Paper Link * ObjectNet Website
The links to the various parts of the dataset are:
https://objectnet.dev/images/objectnet_controls_table.png
https://objectnet.dev/images/objectnet_results.png
ObjectNet is a large real-world test set for object recognition with control where object backgrounds, rotations, and imaging viewpoints are random.
Most scientific experiments have controls: confounds that are removed from the data to ensure that subjects cannot perform a task by exploiting trivial correlations in the data. Historically, large machine learning and computer vision datasets have lacked such controls. This has resulted in models that must be fine-tuned for each new dataset and that perform better on benchmark datasets than in real-world applications. When tested on ObjectNet, object detectors show a 40-45% drop in performance relative to their performance on other benchmarks, due to the controls for biases. These controls also make ObjectNet robust to fine-tuning, which yields only small performance increases.
We develop a highly automated platform that enables gathering datasets with controls by crowdsourcing image capturing and annotation. ObjectNet is the same size as the ImageNet test set (50,000 images), and by design does not come paired with a training set in order to encourage generalization.
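Per clause 2, the red border stays on any reposted images, but evaluation pipelines commonly crop the 1-pixel border away before running inference. A minimal helper for computing the crop box (the result can be passed to an image library's crop call, e.g. PIL's Image.crop):

```python
def inner_crop_box(width: int, height: int, border: int = 1):
    """Crop box (left, upper, right, lower) that removes a uniform border
    of the given width from an image of the given dimensions."""
    assert width > 2 * border and height > 2 * border, "image too small to crop"
    return (border, border, width - border, height - border)

print(inner_crop_box(224, 224))  # (1, 1, 223, 223)
```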
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14,997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves in four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script covers data preprocessing, model training, and evaluation.
The script uses matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model.

- The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes.
- The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability.
- The model includes a Patches layer that extracts patches from the images. This is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer.
- Results are visualized with seaborn to provide a clear understanding of the model's predictions.

To use the repository:
1. Dataset Preparation: organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
2. Install Required Libraries: pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
3. Run the Script
4. Analyze Results
5. Fine-Tuning
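Conceptually, the Patches layer performs the splitting sketched below in plain Python on a toy 2-D image (the actual script presumably uses TensorFlow ops such as tf.image.extract_patches, and real inputs have a channel dimension):

```python
def extract_patches(image, patch_size):
    """Split an H x W image (nested lists) into non-overlapping
    patch_size x patch_size patches, row-major, as a ViT's Patches
    layer does before the patches are embedded as tokens."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
patches = extract_patches(image, 2)
print(len(patches))  # 4
print(patches[0])    # [[0, 1], [4, 5]]
```

For a 224x224 input and 16x16 patches, this yields the 14x14 = 196 tokens typical of ViT-style models.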
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The WONDERBREAD dataset contains 2,928 human demonstrations of 598 web navigation workflows across 6 types of BPM tasks. These tasks measure the ability of a model to generate accurate documentation, assist in knowledge transfer, and improve the efficiency of workflows.
Please see our website for more details: https://wonderbread.stanford.edu/
To start, download debug_demos.zip (~1 GB). It contains a subset of 24 demonstrations which can give you a sense of how the dataset is structured.
To reproduce the paper, download gold_demos.zip (~33 GB). It contains 724 demonstrations corresponding to the 162 "Gold" tasks which were used for all the evaluations in the original paper.
To obtain the full dataset, download demos.zip (~133 GB). This contains all 2,928 demonstrations and can be used for training, fine-tuning, and evaluating models.
The dataset contains several files, defined below.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, as evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
Dataset Composition:
curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
Intended Use:
Fine-tuning and advancing Homepage2Vec or similar website classification models
Research on LLM-generated datasets for text classification tasks
Exploration of multilingual website classification
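Macro F1, the metric quoted above, averages per-class F1 scores with equal weight, so rare topics count as much as common ones. A single-label sketch (Homepage2Vec itself is multi-label, where the same per-class counts are computed over label indicators instead):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)

y_true = ["news", "shop", "news", "blog"]
y_pred = ["news", "shop", "shop", "blog"]
print(round(macro_f1(y_true, y_pred, ["news", "shop", "blog"]), 3))  # 0.778
```

In practice scikit-learn's f1_score with average="macro" computes the same quantity.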
Additional Information:
Project and report repository: https://github.com/CS-433/ml-project-2-mlp
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
CVE Dataset (1999-2024) for LLM Fine-Tuning
Overview
This dataset comprises Common Vulnerabilities and Exposures (CVE) records spanning from 1999 to 2024. Each entry provides essential information on software vulnerabilities, their descriptions, affected products and versions, CVSS scores, and relevant references. The data is formatted in a JSON Lines (.jsonl) structure, making it suitable for fine-tuning Large Language Models (LLMs) for tasks such as cybersecurity… See the full description on the dataset page: https://huggingface.co/datasets/iamthierno/cvedataset.jsonl.
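One way to prepare such records for LLM fine-tuning is to map each one to an instruction/response pair. The field names below are hypothetical, so align them with the actual .jsonl schema.

```python
import json

# Hypothetical CVE record shape -- match the real dataset's field names.
record = {
    "cve_id": "CVE-2021-44228",
    "description": "Apache Log4j2 JNDI features allow remote code execution.",
    "cvss_score": 10.0,
}

def to_instruction_pair(rec):
    """Turn one CVE record into an instruction/response pair for SFT."""
    return {
        "instruction": f"Summarize the vulnerability {rec['cve_id']}.",
        "response": f"{rec['description']} (CVSS score: {rec['cvss_score']})",
    }

print(json.dumps(to_instruction_pair(record)))
```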
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring Object Detection Techniques for MSTAR IU Mixed Targets Dataset
Introduction: The rapid advancements in machine learning and computer vision have significantly improved object detection capabilities. In this project, we aim to explore and develop object detection techniques specifically tailored to the MSTAR IU Mixed Targets. This dataset, provided by the Sensor Data Management System, offers a valuable resource for training and evaluating object detection models for synthetic aperture radar (SAR) imagery.
Objective: Our primary objective is to develop an efficient and accurate object detection model that can identify and localize various targets within the MSTAR IU Mixed Targets dataset. By achieving this, we aim to enhance the understanding and applicability of SAR imagery in real-world scenarios, such as surveillance, reconnaissance, and military applications.
Ethics: As responsible researchers, we recognize the importance of ethics in conducting our project. We are committed to ensuring the ethical use of data and adhering to privacy guidelines. The MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System will be used solely for academic and research purposes. Any personal information or sensitive data within the dataset will be handled with utmost care and confidentiality.
Data Attribution and Giving Credit: We deeply appreciate the Sensor Data Management System for providing the MSTAR IU Mixed Targets dataset. We understand the effort and resources invested in curating and maintaining this valuable dataset, which forms the foundation of our project. To acknowledge and give credit to the Sensor Data Management System, we will prominently mention their contribution in all project publications, reports, and presentations. We will provide appropriate citations and include a statement recognizing their dataset as the source of our training and evaluation data.
Methodology:
Data Preprocessing: We will preprocess the MSTAR IU Mixed Targets dataset to enhance its compatibility with the YOLOv8 object detection algorithm. This involves resizing, normalizing, and augmenting the images.
Training and Evaluation: The selected model will be trained on the preprocessed dataset, utilizing appropriate loss functions and optimization techniques. We will extensively evaluate the model's performance using standard evaluation metrics such as precision, recall, and mean average precision (mAP).
Fine-tuning and Optimization: We will fine-tune the model on the MSTAR IU Mixed Targets dataset to enhance its accuracy and adaptability to SAR-specific features. Additionally, we will explore techniques such as transfer learning and data augmentation to further improve the model's performance.
Results and Analysis: The final model's performance will be analyzed in terms of detection accuracy, computational efficiency, and generalization capability. We will conduct comprehensive experiments and provide visualizations to showcase the model's object detection capabilities on the MSTAR IU Mixed Targets dataset.
Model Selection and Evaluation: We will evaluate and compare state-of-the-art object detection models to identify the most suitable architecture for SAR imagery. This will involve researching and implementing models such as Faster R-CNN, other YOLO versions, or SSD, considering their performance, speed, and adaptability to the MSTAR dataset.
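The evaluation metrics named above (precision, recall, mAP) all rest on intersection-over-union to decide whether a predicted box matches a ground-truth box; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ~= 0.143
```

A detection is typically counted as a true positive when its IoU with a ground-truth box exceeds a threshold (commonly 0.5), which is how the precision/recall curves behind mAP are built.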
Conclusion: This project aims to contribute to the field of object detection in SAR imagery by leveraging the valuable MSTAR IU Mixed Targets dataset provided by the Sensor Data Management System. We will ensure ethical use of the data and give proper credit to the dataset's source. By developing an accurate and efficient object detection model, we hope to advance the understanding and application of SAR imagery in various domains.
Note: This project description serves as an overview and can be expanded upon in terms of specific methodologies, experiments, and evaluation techniques as the project progresses.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The dataset comprises over 10,000 chat conversations, each focusing on specific Telecom related topics. Each conversation provides a detailed interaction between a call center agent and a customer, capturing real-life scenarios and language nuances.
The chat dataset covers a wide range of conversations on Telecom topics, ensuring that the dataset is comprehensive and relevant for training and fine-tuning models for various Telecom use cases. It offers diversity in terms of conversation topics, chat types, and outcomes, including both inbound and outbound chats with positive, neutral, and negative outcomes.
The conversations in this dataset capture the diverse language styles and expressions prevalent in Danish Telecom interactions. This diversity ensures the dataset accurately represents the language used by Danish speakers in Telecom contexts.
The dataset encompasses a wide array of language elements, including:
This linguistic authenticity ensures that the dataset equips researchers and developers with a comprehensive understanding of the intricate language patterns, cultural references, and communication styles inherent to Danish Telecom interactions.
The dataset includes a broad range of conversations, from simple inquiries to detailed discussions, capturing the dynamic nature of Telecom customer-agent interactions.
Each of these conversations contains various aspects of conversation flow like:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The BuildingsBench datasets consist of:
Buildings-900K can be used for pretraining models on day-ahead STLF for residential and commercial buildings. The specific gap it fills is the lack of large-scale and diverse time series datasets of sufficient size for studying pretraining and finetuning with scalable machine learning models. Buildings-900K consists of synthetically generated energy consumption time series. It is derived from the NREL End-Use Load Profiles (EULP) dataset (see link to this database in the links further below). However, the EULP was not originally developed for the purpose of STLF. Rather, it was developed to "...help electric utilities, grid operators, manufacturers, government entities, and research organizations make critical decisions about prioritizing research and development, utility resource and distribution system planning, and state and local energy planning and regulation." Similar to the EULP, Buildings-900K is a collection of Parquet files and it follows nearly the same Parquet dataset organization as the EULP. As it only contains a single energy consumption time series per building, it is much smaller (~110 GB).
BuildingsBench also provides an evaluation benchmark that is a collection of various open source residential and commercial real building energy consumption datasets. The evaluation datasets, which are provided alongside Buildings-900K below, are collections of CSV files which contain annual energy consumption. The size of the evaluation datasets altogether is less than 1GB, and they are listed out below:
A README file providing details about how the data is stored and describing the organization of the datasets can be found within each data lake version under BuildingsBench.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.
The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:
This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.
This dataset reflects the natural flow of Vietnamese healthcare communication and includes:
These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.
Conversations range from simple inquiries to complex advisory sessions, including:
Each conversation typically includes these structural components:
This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.
Available in JSON, CSV, and TXT formats, each conversation includes:
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The recognition of food images is of great significance for nutrition monitoring, food retrieval and food recommendation. However, recognition accuracy has not been high enough, due to the complex backgrounds of food images and the characteristics of small inter-class differences and large intra-class differences. To solve these problems, this paper proposed a food image recognition method based on transfer learning and ensemble learning. Firstly, generic image features were extracted using convolutional neural network models (VGG19, ResNet50, MobileNet V2, AlexNet) pre-trained on the ImageNet dataset. Secondly, the 4 pre-trained models were transferred to the food image dataset for fine-tuning. Finally, different base learner combination strategies were adopted to establish the ensemble model and classify feature information. Several experiments were performed to compare the results of food image recognition between single models and ensemble models on the Food-11 dataset. The experimental results demonstrated that the accuracy of the ensemble model was the highest, reaching 96.88%, which was superior to any base learner. Therefore, a convolutional neural network model based on transfer learning and ensemble learning has strong learning and generalization ability, and the method is feasible and practical for food image recognition.
Recent accelerations in multi-modal applications have been made possible by the plethora of image and text data available online. However, the scarcity of similar data in the medical field, specifically in histopathology, has slowed comparable progress. To enable similar representation learning for histopathology, we turn to YouTube, an untapped resource offering 1,087 hours of valuable educational histopathology videos from expert clinicians. From YouTube, we curate Quilt: a large-scale vision-language dataset consisting of 802,148 image and text pairs. Quilt was automatically curated using a mixture of models, including large language models, handcrafted algorithms, human knowledge databases, and automatic speech recognition. In comparison, the most comprehensive datasets curated for histopathology amass only around 200K samples. We combine Quilt with datasets from other sources, including Twitter, research papers, and the internet in general, to create an even larger dataset: Quilt-1M, with 1M paired image-text samples, making it the largest vision-language histopathology dataset to date. We demonstrate the value of Quilt-1M by fine-tuning a pre-trained CLIP model. Our model outperforms state-of-the-art models on both zero-shot and linear-probing tasks for classifying new pathology images across 13 diverse patch-level datasets of 8 different sub-pathologies, as well as on cross-modal retrieval tasks.
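The zero-shot evaluation mentioned above relies on the standard CLIP-style mechanism: embed the image and a text prompt per class into a shared space, then pick the class whose prompt is most similar to the image. The following is a minimal numpy sketch of that mechanism only; the embeddings are mock stand-ins, since real values would come from the fine-tuned model's image and text encoders with prompts such as "a histopathology image of <sub-pathology>".

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Pick the class whose text-prompt embedding has the highest cosine
    similarity to the image embedding (CLIP-style zero-shot classification)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                      # cosine similarity per class prompt
    e = np.exp(sims - sims.max())         # softmax over similarities
    probs = e / e.sum()
    return int(probs.argmax()), probs

# Mock embeddings: 3 class prompts in an 8-dim space, and an image
# embedding constructed to lie close to the class-1 prompt.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
image_emb = text_embs[1] + 0.05 * rng.normal(size=8)

label, probs = zero_shot_classify(image_emb, text_embs)
```

Because no per-class training is involved, the same function works for any new label set by swapping in new prompt embeddings, which is what makes zero-shot evaluation across 13 patch-level datasets possible.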