Dataset Card for The Cauldron
Dataset description
The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.
Load the dataset
To load the dataset, install the datasets library with pip install datasets. Then run from datasets import load_dataset; ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")
to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.
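A slightly fuller version of that snippet, as a minimal sketch (the inspection lines and the assumption of a "train" split are ours, not part of the card):

from datasets import load_dataset

# Each of the 50 sub-datasets is addressed by its config name, e.g. "ai2d"
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")

# Inspect the splits and the first example (the card ships training sets only,
# so a "train" split is assumed here)
print(ds)
print(ds["train"][0])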
https://choosealicense.com/licenses/wtfpl/
My personal dataset collection: https://huggingface.co/datasets/FireIceDancer2/mouthmask/tree/main This is the (un)official dataset collection of the AI Waifu DID discord server. We are a group of enthusiasts sharing the same love for generative AI stuff, specifically AI-generated images and text. Despite the name, our interests are not limited to damsel-in-distress (DID) stuff, but also encompass many different things, such as anime and the like. This repo was created as an effort to create a… See the full description on the dataset page: https://huggingface.co/datasets/FireIceDancer2/AI-Waifu-DIDcord-Datasets-Collection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.
Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
repository_mining/: Contains scripts for mining the initial set of repositories.
repository_mining/doc/: Includes documentation with the necessary information for repository mining.
dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
multilabel_class/model_output/: Trained models, organized first by dataset and then by model variant.
data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.
To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in repository_mining/doc/.
After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/
folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training. Each notebook contains documentation explaining every step.
Once the dataset is prepared, convert it into a Hugging Face dataset using:
python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
After processing the dataset, train the DRAGON model with the following command:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
}
To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train DRAGON on a benchmark dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Ensure the use_sentence_pairs parameter is set to True in config.py.
To train LEGION on the DRAGON dataset, use:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
Ensure the use_sentence_pairs parameter is set to False in config.py:
DEFAULT_PREPROCESSING_PARAMS = {
'use_sentence_pairs': False
}
To train LEGION on a baseline dataset, run:
python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
Once thresholds are tuned, you can evaluate the model using:
python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
This evaluation script computes standard multi-label classification metrics such as precision, recall, and F1-score.
Ensure that the model variant and dataset path correspond to the previously trained model.
For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:
DRAGON_replication/multilabel_class/notebooks/
These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.
Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.
Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in place, that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).
For example:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
should be extracted to:
DRAGON_replication\data\02_processed_dataset\2024-05-22\
.zip files to extract:
DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
DRAGON_replication\dataset_creation\data.zip
DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
DRAGON_replication\multilabel_class\model_output\LEGION.zip
Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
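For convenience, the extraction can also be scripted. The following is a minimal sketch using Python's standard zipfile module, not part of the original package; the archive list mirrors the one above, and if an archive already contains a top-level folder with its own name, extract it into the parent directory instead.

import zipfile
from pathlib import Path

# Archives listed above (forward slashes also work on Windows)
ARCHIVES = [
    "DRAGON_replication/data/02_processed_dataset/2024-05-22.zip",
    "DRAGON_replication/data/03_huggingaceV_datasets/2024-05-22.zip",
    "DRAGON_replication/data/03_huggingaceV_datasets/LEGION.zip",
    "DRAGON_replication/dataset_creation/data.zip",
    "DRAGON_replication/multilabel_class/model_output/2024-05-22.zip",
    "DRAGON_replication/multilabel_class/model_output/LEGION.zip",
]

for archive in ARCHIVES:
    archive = Path(archive)
    # Extract in place: same directory, same name without the .zip extension
    target = archive.with_suffix("")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    print(f"Extracted {archive} -> {target}")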
This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.
Merged-LID-20
This dataset provides a curated collection of language-specific datasets from Hugging Face, optimized for building and training language identification models. Each dataset includes text samples in a single language, making this an ideal resource for projects involving multilingual natural language processing tasks such as language identification.
Overview
The dataset collection includes 20 languages, covering a range of language families, scripts, and… See the full description on the dataset page: https://huggingface.co/datasets/Michielo/Merged-LID-20.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains a collection of questions and answers that have been contextualized to reveal subtle implications and insights. It is focused on helping researchers gain a deeper understanding of how semantics, context, and other factors affect how people interpret and respond to various conversations about different topics. By exploring this dataset, researchers will be able to uncover the underlying principles governing conversation styles, which can then be applied to better understand attitudes among different groups. With its comprehensive coverage of questions from a variety of sources around the web, this dataset offers an invaluable resource for those looking to analyze discourse in terms of sentiment analysis or opinion mining.
How to Use This Dataset
This dataset contains a collection of contextualized questions and answers extracted from various sources around the web, which can be useful for exploring implications and insights. To get started with the dataset:
- Read through the headings on each column in order to understand the data that has been collected - this will help you identify which pieces of information are relevant for your research project.
- Explore each column and view what types of responses have been given in response to particular questions or topics - this will give you an idea as to how people interpret specific topics differently when presented with different contexts or circumstances.
- Next, analyze the responses looking for patterns or correlations between responses on different topics or contexts - this can help reveal implications and insights about a particular subject matter that were previously unknown to you. You can also use data visualization tools such as Tableau or Power BI to gain a deeper understanding of the results and trends within your dataset!
- Finally, use these findings to better inform your project by tailoring future questions around any patterns discovered within your analysis!
- To understand the nature of public debates and how people express their opinions in different contexts.
- To better comprehend the implicit attitudes and assumptions inherent in language use, providing insight into discourse norms on a range of issues.
- To gain insight into the use of rhetorical devices, such as exaggeration and deceptive tactics, used to influence public opinion on important topics
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv
| Column name | Description |
|:------------|:-----------------------------------------------------------------------------|
| context | The context in which the question was asked and the answer was given. (Text) |
If you use this dataset in your research, please credit the original authors and the Huggingface Hub.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SEA-VL: A Multicultural Vision-Language Dataset for Southeast Asia
Paper: Crowdsource, Crawl, or Generate? Creating SEA-VL, A Multicultural Vision-Language Dataset for Southeast Asia Dataset: SEA-VL Collection on HuggingFace Code: SEA-VL Experiment | SEA-VL Image Collection
What is SEA-VL?
Following the success of our SEACrowd project, we’re excited to announce SEA-VL, a new open-source initiative to create high-quality vision-language datasets specifically for… See the full description on the dataset page: https://huggingface.co/datasets/SEACrowd/sea-vl_crowdsourcing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RealHarm
RealHarm is a collection of harmful real-world interactions with AI agents.
Dataset Details
Dataset Description
RealHarm contains harmful samples, categorized among 10 harm categories. A complete taxonomy has been proposed along with the dataset and is described in the RealHarm paper. Each sample has an associated safe version, for which we rewrote the agent answer to make it harmless. This dataset provides researchers and developers with authentic… See the full description on the dataset page: https://huggingface.co/datasets/giskardai/realharm.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).
Data Instances
{
"id": 14963,
"instruction": "Wat zijn de duurste steden ter wereld?",
"context": "",
"response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
"category": "brainstorming"
}
Data Fields
- id: the identifier of the example (integer)
- instruction: the translated task instruction (string)
- context: optional context for the instruction (string, may be empty)
- response: the translated response (string)
- category: the task category, such as brainstorming (string)
Dataset Creation
Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbo, with max_tokens=1024 and temperature=0 as parameters.
The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):
CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
Here are the requirements that you should adhere to:
1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
4. translate the instruction and context text using informal, but standard, language;
5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
"""
The system message was:
You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
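For illustration only, a translation call combining the system message, the template above, and the parameters mentioned earlier might look like the sketch below. This is not the authors' actual code: it assumes the legacy openai Python SDK (pre-1.0), and the way the task text is appended after the formatted template is an assumption.

import openai  # legacy (pre-1.0) SDK assumed

openai.api_key = "..."  # your API key

SYSTEM_MESSAGE = (
    "You are a helpful assistant that translates English to Dutch "
    "according to the requirements that are given to you."
)

def translate_task(task_text: str) -> str:
    # Fill in the language pair and append the task to translate (assumption:
    # the template shown above has no explicit placeholder for the task text)
    prompt = CONVERSATION_TRANSLATION_PROMPT.format(src_lang="English", tgt_lang="Dutch") + task_text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": prompt},
        ],
        max_tokens=1024,
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]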
Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
Initial Data Collection and Normalization
Initial data collection by databricks. See their repository for more information about this dataset.
Considerations for Using the Data
Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.
Discussion of Biases
As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such a command is of course not known. It is likely that biases remain in the dataset, so use it with caution.
Other Known Limitations
The translation quality has not been verified. Use at your own risk!
Licensing Information
This repository follows the original databricks license, which is CC BY-SA 3.0, but see below for a specific restriction.
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI's large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.
This dataset is also available on the Hugging Face hub, its canonical repository.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. The corpus is enriched by intricate annotations across each narrative, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story, which can be used to identify plots, characters, and other features associated with storytelling techniques. Through this collection of stories, users will gain extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for narrative text classification.
In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following columns:
- text: The story text itself (string)
- validation.csv: Contains a set of short stories for validation (dataframe)
- train.csv: Contains the text of short stories used for narrative text classification (dataframe)
The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.
To get started with this dataset, download the validation and train CSV files from the Kaggle datasets page and save them in your local environment. Once downloaded, you may need to preprocess both datasets by cleaning up badly formatted values or duplicate entries before proceeding with your research or machine learning experiments, since such issues can hurt the accuracy of your results.
The next step is to load the two datasets into pandas DataFrames so that they can easily be manipulated and analyzed with common natural language processing (NLP) tools. This only takes a few lines of pandas code, using functions such as read_csv() and concat(), whether you work in a Jupyter notebook or move on to machine learning frameworks such as scikit-learn for more complex tasks.
With everything loaded correctly, you are ready to explore potential connections between different narratives or character traits using supervised machine learning models such as a Naive Bayes classifier, and to uncover the patterns underlying the texts in this richly annotated TinyStories narrative dataset.
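A minimal sketch of that loading and cleaning step, assuming train.csv and validation.csv sit in the working directory and contain the text column described above:

import pandas as pd

# Load the two CSV files described above
train_df = pd.read_csv("train.csv")
val_df = pd.read_csv("validation.csv")

# Basic cleaning: drop missing and duplicate stories
train_df = train_df.dropna(subset=["text"]).drop_duplicates(subset=["text"])
val_df = val_df.dropna(subset=["text"]).drop_duplicates(subset=["text"])

print(train_df.shape, val_df.shape)
print(train_df["text"].iloc[0][:200])  # preview the first story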
- Creating a text classification algorithm to automatically categorize short stories by genre.
- Developing an AI-based summarization tool to quickly summarize the main points in a story.
- Developing an AI-based story generator that can generate new stories based on existing ones in the dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv
| Column name | Description |
|:------------|:---------------------------------|
| text | The text of the story. (String) |

File: train.csv
| Column name | Description |
|:--------------|:----------------------------...
SQL Text Collection
This is a collection of publicly available text-to-SQL datasets.
Dataset Structure
Each row contains the columns:
context: The schema for the database (e.g., CREATE TABLE statements).
query: A natural language query or action to perform, expressed in English.
source: The original dataset from which the row was sourced.
dialect: One or more SQL dialects identified based on dialect-specific keywords found in the context and query. If there are multiple… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/sql-text-collection.
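For example, a minimal sketch of loading the collection and filtering by dialect; the "train" split name and the exact dialect values are assumptions, so adjust them to what the dataset actually provides:

from datasets import load_dataset

# Split name assumed to be "train"
ds = load_dataset("agentlans/sql-text-collection", split="train")

# Keep only rows whose dialect field mentions SQLite (case-insensitive match)
sqlite_rows = ds.filter(lambda row: "sqlite" in str(row["dialect"]).lower())
print(len(sqlite_rows))
print(sqlite_rows[0])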
https://choosealicense.com/licenses/gpl-3.0/
brics-edtech-patent-analysis
Paper's repository
Dataset: Data Collection, Processing, and Annotation
In this section, we describe the methodology used to create the research dataset, including data sources, processing steps, and annotation by a large language model.
2.1.1. Source and Data Collection
The primary data source for this study was the patents.google.com database. This platform was chosen for its extensive collection of full-text national and international… See the full description on the dataset page: https://huggingface.co/datasets/brics-edtech/brics-edtech-data-collection.
Combined Louvre and Art Institute of Chicago (AIC) Collection Dataset
Dataset Summary
This dataset merges artwork information from two prominent museum collections: the Musée du Louvre and The Art Institute of Chicago (AIC). It combines data from the Louvre Paper and Canvas Collection and the AIC Dataset 0.2 datasets. Due to differences in the original datasets' schemas, a decision was made to focus on common fields and create a non-atomic full_info field containing… See the full description on the dataset page: https://huggingface.co/datasets/anna-bozhenko/artworks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SPML Chatbot Prompt Injection Dataset
Arxiv Paper Introducing the SPML Chatbot Prompt Injection Dataset: a robust collection of system prompts designed to create realistic chatbot interactions, coupled with a diverse array of annotated user prompts that attempt to carry out prompt injection attacks. While other datasets in this domain have centered on less practical chatbot scenarios or have limited themselves to "jailbreaking" – just one aspect of prompt injection – our dataset… See the full description on the dataset page: https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
📺 YouTube-Commons 📺
YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.
Content
The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for UltraChat 200k
Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Children Stories Collection: a synthetic dataset of around 0.9 million stories especially meant for young children. You can use these datasets directly for training large models. A total of 10 datasets are available for download; you can use any one or all of the JSON files for training. The datasets are in "prompt" and "text" format, and the total token length is also provided. Thank you for your love & support.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🤖 We curated this dataset for NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day. 🚀 Our Birbal-7B-V1 fine-tuned on this dataset achieved 🏆 first rank 🏆 in the competition.
Here is a high-level diagram of our data preparation strategy:
Natural Instructions Dataset Preparation
The Natural Instructions dataset is a community effort to create a large collection of tasks and their natural language definitions/instructions. As shown in the above diagram, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Mario Maker 2 users
Part of the Mario Maker 2 Dataset Collection
Dataset Description
The Mario Maker 2 users dataset consists of 6 million users from Nintendo's online service, totaling around 1.2 GB of data. The dataset was created using the self-hosted Mario Maker 2 API over the course of one month in February 2022.
How to use it
The Mario Maker 2 users dataset is a very large dataset, so for most use cases it is recommended to make use of the streaming API of… See the full description on the dataset page: https://huggingface.co/datasets/TheGreatRambler/mm2_user.
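For example, a minimal streaming sketch; the "train" split name is an assumption, and the record fields are dataset-specific:

from datasets import load_dataset

# Stream records instead of downloading the full ~1.2 GB dataset up front
ds = load_dataset("TheGreatRambler/mm2_user", streaming=True, split="train")

# Look at the first few user records without materializing the full dataset
for i, user in enumerate(ds):
    print(user)
    if i >= 4:
        break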
Dataset Card for AGNews
This dataset is a collection of title-description pairs collected from AGNews. See the AG News corpus for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "title", "description" Column types: str, str Examples:{ 'title': 'Helicopter Crashes in Colombian Drug War, Kills 20', 'description': 'BOGOTA, Colombia - A U.S.-made helicopter on… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/agnews.
https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright-free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.