License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
This resource also includes a Dictionary from the ELMN containing a set of terms translated from English into all the EU languages. The list of languages indicated with this resource tells which languages the remaining lists and dictionaries collectively cover.
ouro_dataset README
Overview
The ouro_dataset is a JSON file containing a list of dictionaries, where each dictionary represents a data entry. Each entry corresponds to a question-answer pair associated with an image. This dataset is intended for use in tasks such as Optical Character Recognition (OCR) and Visual Question Answering (VQA). Each dictionary contains an image path, a question, and its corresponding answer.
Dataset Structure
The dataset is stored in… See the full description on the dataset page: https://huggingface.co/datasets/tinnel123/OURO_dataset.
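Since the file is a plain JSON list of dictionaries, loading it needs only the standard library. The sketch below is illustrative: the key names (image, question, answer) are assumptions based on the description above, and the actual file may use different keys.

```python
import json

# Hypothetical entry; real key names and values may differ.
sample = [
    {"image": "images/0001.png",
     "question": "What text appears on the sign?",
     "answer": "OPEN"},
]

# Round-trip through JSON to mimic reading the dataset file from disk.
entries = json.loads(json.dumps(sample))

for entry in entries:
    print(entry["image"], "->", entry["question"], "/", entry["answer"])
```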
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Dataset Card for GRPO Oumi ANLI Subset
Dataset
This dataset is a reformatted version of the oumi-ai/oumi-c2d-d2c-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-c2d-d2c-subset
Dataset Structure
The dataset consists of a list of dictionaries, where each dictionary represents a single data instance with… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-c2d-d2c-subset.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Dataset Card for GRPO Oumi ANLI Subset
Dataset
This dataset is a reformatted version of the TEEN-D/grpo-oumi-anli-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims
Dataset Structure
The dataset consists of a list of dictionaries, where each dictionary represents a single data instance… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-claims.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Galician version of alpaca_data.json
This is a Galician translation of the Stanford alpaca_data.json dataset, produced with the Python package googletranslatepy. Our working notes are available here.
Dataset Structure
The dataset contains 52K instruction-following elements in a JSON file with a list of dictionaries. Each dictionary contains the following fields:
instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input: str… See the full description on the dataset page: https://huggingface.co/datasets/irlab-udc/alpaca_data_galician.
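As a sketch of the record layout: the field list above is truncated, so the "output" field below is an assumption carried over from the standard Alpaca format, and the values are illustrative.

```python
import json

# One illustrative record; "output" is assumed from the standard Alpaca
# layout, since the field list in the card is truncated.
record = {
    "instruction": "Traduce a frase ao inglés.",  # "Translate the sentence into English."
    "input": "Bos días",
    "output": "Good morning",
}

# The dataset file is a JSON list of such dictionaries.
data = json.loads(json.dumps([record]))
print(data[0]["instruction"])
```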
This is the hh-rlhf dataset, with only the helpful split merged. The format is parsed so that chosen and rejected are not strings but lists of dictionaries, where each dictionary is a turn in the conversation (following the more standard chat format).
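A rough sketch of that parsing step: original hh-rlhf transcripts mark turns with "Human:" and "Assistant:" prefixes, which can be split into role/content dictionaries as below. This is illustrative, not the exact code used for the conversion.

```python
import re

def parse_hh(transcript):
    """Split an hh-rlhf style transcript string into a list of
    role/content dictionaries (the standard chat format)."""
    turns = []
    # Transcripts alternate "\n\nHuman: ..." and "\n\nAssistant: ..." markers.
    for speaker, text in re.findall(
            r"\n\n(Human|Assistant): (.*?)(?=\n\n(?:Human|Assistant):|\Z)",
            transcript, flags=re.S):
        role = "user" if speaker == "Human" else "assistant"
        turns.append({"role": role, "content": text.strip()})
    return turns

chosen = parse_hh("\n\nHuman: Hi!\n\nAssistant: Hello, how can I help?")
print(chosen)
```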
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The filename of each JSON file is represented like this:
Each json file contains a list of dictionaries, each dictionary representing a conversation turn with the following keys:
caller: The speaker of the turn (e.g., "Speaker 1", "Speaker 2").
next_caller: The next speaker in the conversation (e.g., "Speaker 2", "Speaker 1").
act_tad: The DAMSL act tag for the turn (e.g., "Statement-opinion", "Question-yesno").
text: The text of the turn.
context: A… See the full description on the dataset page: https://huggingface.co/datasets/OpenLiliO/mili-o.
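A minimal sketch of iterating these turn dictionaries, using the keys listed above (the key spelling "act_tad" is kept as written in the card) with illustrative values:

```python
from collections import Counter

# Two illustrative turns; real files contain full conversations.
turns = [
    {"caller": "Speaker 1", "next_caller": "Speaker 2",
     "act_tad": "Question-yesno", "text": "Are you coming?", "context": ""},
    {"caller": "Speaker 2", "next_caller": "Speaker 1",
     "act_tad": "Statement-opinion", "text": "I think so.", "context": ""},
]

# Tally DAMSL act tags across the conversation.
tag_counts = Counter(turn["act_tad"] for turn in turns)
print(tag_counts)
```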
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each containing a tokenized text with named entity recognition (NER) information. Each item consists of two main components:
'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.
'ner': A list of lists containing named entity recognition information. Each inner list has three elements:
Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.
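A sketch of recovering entity surface forms from a record. The 'ner' element list above is truncated, so this assumes the three elements are start index, end index (inclusive), and entity label; the record values are illustrative.

```python
# Illustrative record; the third 'ner' element is assumed to be the
# entity label, and end indices are assumed inclusive.
record = {
    "tokenized_text": ["Marie", "Curie", "worked", "in", "Paris", "."],
    "ner": [[0, 1, "person"], [4, 4, "location"]],
}

def entity_strings(rec):
    # Join each token span [start, end] back into its surface form.
    return [(" ".join(rec["tokenized_text"][s:e + 1]), label)
            for s, e, label in rec["ner"]]

print(entity_strings(record))
```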
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset
Materials Project (2019 dump). This dataset contains 133,420 materials with formation energy per atom, processed from mp.2019.04.01.json.
Download
Download link: materials-project.tar.gz
MD5 checksum: c132f3781f32cd17f3a92aa6501b9531
Content
Bundled in materials-project.tar.gz.
Index (index.json)
list of dict:
index (int) => index of the structure in data file.
id (str) => id of Materials Project.
formula (str) => formula.
natoms (int) => number… See the full description on the dataset page: https://huggingface.co/datasets/materials-toolkits/materials-project.
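Since index.json is a plain list of dicts, a common use is building an id lookup to find a structure's position in the data file. A minimal sketch with illustrative values (the entries below are not taken from the actual file):

```python
# Sketch of the index.json layout described above; values are illustrative.
index = [
    {"index": 0, "id": "mp-149", "formula": "Si", "natoms": 2},
    {"index": 1, "id": "mp-390", "formula": "TiO2", "natoms": 6},
]

# Look up entries by Materials Project id, e.g. to locate a structure
# inside the bundled data file.
by_id = {entry["id"]: entry for entry in index}
print(by_id["mp-149"]["index"])
```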
License: GPL-2.0 (https://choosealicense.com/licenses/gpl-2.0/)
Dataset Card for ca-text-corpus
Dataset Summary
Catalan word lists with part of speech labeling curated by humans. Contains 1 180 773 forms including verbs, nouns, adjectives, names or toponyms. These word lists are used to build applications like Catalan spellcheckers or verb querying applications.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Catalan (ca).
Dataset Structure
The dataset contains 3 columns:
Form… See the full description on the dataset page: https://huggingface.co/datasets/softcatala/catalan-dictionary.
African Cultural Reasoning Dataset with SmolAgents
import os
import json
import asyncio
from datetime import datetime
from typing import Dict, List, Any

from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel

class AfricanCultureDataGenerator:
    def __init__(self, api_key: str):
        # Initialize with explicit API key
        os.environ["OPENAI_API_KEY"] = api_key
        self.model = LiteLLMModel(
            model_id="gpt-4o-mini",
        )
… See the full description on the dataset page: https://huggingface.co/datasets/Svngoku/african_cultural_reasoning.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
from datasets import load_dataset, Dataset

def formatting_prompts_func(examples):
    """
    This function takes a dataset of examples and formats them into a list of text prompts.

    Args:
        examples (pandas.DataFrame): A DataFrame containing the dataset of examples.

    Returns:
        dict: A dictionary with a 'text' key containing the list of formatted text prompts.
    """
    texts = [
        # Create a formatted text prompt for…
See the full description on the dataset page: https://huggingface.co/datasets/ebowwa/human-biases.
License: unknown (https://choosealicense.com/licenses/unknown/)
List of all finalist songs at Sanremo Music Festival from the first edition in 1951 to 2025. The file is structured as a single dictionary where each key is a string representing the year (e.g., "1951"). The corresponding value is a list of dictionaries, with each dictionary detailing a single finalist song for that year. Each song object contains the following fields:
'titolo': The title of the song.
'autori': The song's authors and composers.
'interpreti': The artist(s) who performed the… See the full description on the dataset page: https://huggingface.co/datasets/raicrits/Sanremo_finalist_songs.
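A minimal sketch of the file layout described above: a single dictionary keyed by year strings, each mapping to a list of song dictionaries. The entry values are illustrative, not copied from the actual file.

```python
import json

# One year key mapping to a list of finalist-song dictionaries
# (illustrative values).
songs = json.loads(json.dumps({
    "1951": [
        {"titolo": "Grazie dei fiori",
         "autori": "Testoni, Panzeri, Seracini",
         "interpreti": "Nilla Pizzi"},
    ],
}))

# Titles of all finalists for a given year.
titles_1951 = [song["titolo"] for song in songs["1951"]]
print(titles_1951)
```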
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Structure
This dataset contains conversational data in the HuggingFace format with a messages field. This data is a transformation of HuggingFaceH4/Multilingual-Thinking.
Data Fields
messages: A list of message dictionaries, each containing:
role: The role of the message sender (system, user, or assistant)
content: The message content
thinking: (optional) Extended thinking content for assistant messages
Usage
from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/karashiiro/multilingual-uwu.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Segmentation Dataset for Judgments of the Supreme Court of Justice of Portugal
The goal of this dataset is to train a segmentation model that, given a judgment from the Supreme Court of Justice of Portugal (STJ), can divide its paragraphs into sections of the judgment itself.
Dataset Contents
JSON Files:
Judgment Text: Contains the judgment text divided into paragraphs, with each paragraph associated with a unique ID.
Denotations: A list of dictionaries where each… See the full description on the dataset page: https://huggingface.co/datasets/MartimZanatti/Segmentation_judgments_STJ.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
LLMSQL Benchmark (Finetune-Ready)
This benchmark is designed to evaluate text-to-SQL models. For usage of this benchmark see https://github.com/LLMSQL/llmsql-benchmark.
This repository contains a finetune-ready version of the LLMSQL benchmark: LLMSQL on Hugging Face.
The dataset is structured in a messages format suitable for instruction-tuned models, where each example has a messages field. This field is a list of dictionaries with:
"role": "user" — the input question or prompt… See the full description on the dataset page: https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready.
Dataset: bouygues-deu-train
Dataset Info
This dataset contains conversational data structured into a single column, "messages", which represents interactions between a user and an assistant.
Features
messages: A list of dictionaries representing each conversational message:
content: (string) The text of the message.
role: (string) The role of the sender, either "user" or "assistant".
Splits
The dataset is divided into training and testing splits.… See the full description on the dataset page: https://huggingface.co/datasets/JonOlds64/accor-deu-train.
In the filename, 'all' means MMLU-Pro-CoT-Eval and 'math500' means Math500. Conf/PRM/aggregated scores are min/max normalized. Each file is a dict with keys "0.0~0.1", ..., "0.9~1.0", denoting bins of Delta = abs(PRM - Conf). Dataset["0.0~0.1"] is a list of dicts; each dict contains:
"problem": question sample
"score": aggregated score
"prm": PRM score
"conf": confidence score
"delta_prm_conf": abs(prm - conf)
"step_completion": list of str, reasoning path by steps
"correctness": correctness of the final answer
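A sketch of how a sample would be mapped to one of those ten delta bins, assuming prm and conf are already min/max normalized to [0, 1]; the bin-labeling scheme is inferred from the key format above.

```python
# Assign a sample to a 0.1-wide delta bin keyed like "0.2~0.3".
def delta_bin(prm, conf):
    delta = abs(prm - conf)
    # Clamp delta == 1.0 into the top bin "0.9~1.0".
    lo = min(int(delta * 10), 9) / 10
    return f"{lo:.1f}~{lo + 0.1:.1f}"

print(delta_bin(prm=0.82, conf=0.60))  # falls in the "0.2~0.3" bin
```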
License: CC0-1.0 (https://choosealicense.com/licenses/cc0-1.0/)
RetroInstruct Mix v0.2
This is the first release of the RetroInstruct synthetic instruction dataset. It is a mixture of 7 synthetic subsets:
RetroInstruct Weave Evaluator Questions: JDP - Answer questions about synthetic short form writing in the style of John David Pressman.
RetroInstruct Analogical Translations - Infer the generative process of bad faith reasoning by executing a bad faith process to generate arguments and reversing it.
RetroInstruct Part Lists For Dictionary… See the full description on the dataset page: https://huggingface.co/datasets/jdpressman/retroinstruct-mix-v0.2.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
""" _HOMEPAGE = "" _LICENSE = "Creative Commons Attribution-NonCommercial 4.0 International Public License"
_split_generators method)_URLs = {"default": "https://www.dropbox.com/s/041prrjylv0tf0h/ethics.zip?dl=1"}
class Imppres(datasets.GeneratorBasedBuilder):
VERSION = datasets.Version("1.1.0")
def _info(self):
features = datasets.Features(
{
"better_choice": datasets.Value("string"),
"worst_choice": datasets.Value("string"),
"comparison": datasets.Value("string"),
"label": datasets.Value("int32"),
})
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
description=_DESCRIPTION,
# This defines the different columns of the dataset and their types
features=features, # Here we define them above because they are different between the two configurations
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=None,
# Homepage of the dataset for documentation
homepage=_HOMEPAGE,
# License for the dataset if available
license=_LICENSE,
# Citation for the dataset
citation=_CITATION,
)
def _split_generators(self, dl_manager):