License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
This resource also includes a Dictionary from the ELMN containing a set of terms translated from English into all the EU languages. The list of languages indicated with this resource tells which languages the remaining lists and dictionaries collectively cover.
ouro_dataset README
Overview
The ouro_dataset is a JSON file containing a list of dictionaries, where each dictionary represents a data entry. Each entry corresponds to a question-answer pair associated with an image. This dataset is intended for use in tasks such as Optical Character Recognition (OCR) and Visual Question Answering (VQA). Each dictionary contains an image path, a question, and its corresponding answer.
Dataset Structure
The dataset is stored in… See the full description on the dataset page: https://huggingface.co/datasets/tinnel123/OURO_dataset.
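Since the file is a plain JSON list of dictionaries, loading it needs only the standard library. The sketch below is illustrative: the key names (image, question, answer) are assumptions based on the description above, and the actual file may use different keys.

```python
import json

# Hypothetical entry; real key names and values may differ.
sample = [
    {"image": "images/0001.png",
     "question": "What text appears on the sign?",
     "answer": "OPEN"},
]

# Round-trip through JSON to mimic reading the dataset file from disk.
entries = json.loads(json.dumps(sample))

for entry in entries:
    print(entry["image"], "->", entry["question"], "/", entry["answer"])
```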
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Dataset Card for GRPO Oumi ANLI Subset
Dataset
This dataset is a reformatted version of the oumi-ai/oumi-c2d-d2c-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-c2d-d2c-subset
Dataset Structure
The dataset consists of a list of dictionaries, where each dictionary represents a single data instance with… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-c2d-d2c-subset.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Dataset Card for GRPO Oumi ANLI Subset
Dataset
This dataset is a reformatted version of the TEEN-D/grpo-oumi-anli-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims
Dataset Structure
The dataset consists of a list of dictionaries, where each dictionary represents a single data instance… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-claims.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Galician version of alpaca_data.json
This is a Galician translation of the Stanford alpaca_data.json dataset, produced with the Python package googletranslatepy. Our working notes are available here.
Dataset Structure
The dataset contains 52K instruction-following elements in a JSON file with a list of dictionaries. Each dictionary contains the following fields:
instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
input: str… See the full description on the dataset page: https://huggingface.co/datasets/irlab-udc/alpaca_data_galician.
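As a sketch of the record layout: the field list above is truncated, so the "output" field below is an assumption carried over from the standard Alpaca format, and the values are illustrative.

```python
import json

# One illustrative record; "output" is assumed from the standard Alpaca
# layout, since the field list in the card is truncated.
record = {
    "instruction": "Traduce a frase ao inglés.",  # "Translate the sentence into English."
    "input": "Bos días",
    "output": "Good morning",
}

# The dataset file is a JSON list of such dictionaries.
data = json.loads(json.dumps([record]))
print(data[0]["instruction"])
```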
This is the hh-rlhf dataset, with only the helpful split merged. The format is parsed so that chosen and rejected are not strings but lists of dictionaries, where each dictionary is a turn in the conversation (following the more standard chat format).
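A rough sketch of that parsing step: original hh-rlhf transcripts mark turns with "Human:" and "Assistant:" prefixes, which can be split into role/content dictionaries as below. This is illustrative, not the exact code used for the conversion.

```python
import re

def parse_hh(transcript):
    """Split an hh-rlhf style transcript string into a list of
    role/content dictionaries (the standard chat format)."""
    turns = []
    # Transcripts alternate "\n\nHuman: ..." and "\n\nAssistant: ..." markers.
    for speaker, text in re.findall(
            r"\n\n(Human|Assistant): (.*?)(?=\n\n(?:Human|Assistant):|\Z)",
            transcript, flags=re.S):
        role = "user" if speaker == "Human" else "assistant"
        turns.append({"role": role, "content": text.strip()})
    return turns

chosen = parse_hh("\n\nHuman: Hi!\n\nAssistant: Hello, how can I help?")
print(chosen)
```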
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The filename of each JSON file is represented like this:
Each json file contains a list of dictionaries, each dictionary representing a conversation turn with the following keys:
caller: The speaker of the turn (e.g., "Speaker 1", "Speaker 2").
next_caller: The next speaker in the conversation (e.g., "Speaker 2", "Speaker 1").
act_tad: The DAMSL act tag for the turn (e.g., "Statement-opinion", "Question-yesno").
text: The text of the turn.
context: A… See the full description on the dataset page: https://huggingface.co/datasets/OpenLiliO/mili-o.
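A minimal sketch of iterating these turn dictionaries, using the keys listed above (the key spelling "act_tad" is kept as written in the card) with illustrative values:

```python
from collections import Counter

# Two illustrative turns; real files contain full conversations.
turns = [
    {"caller": "Speaker 1", "next_caller": "Speaker 2",
     "act_tad": "Question-yesno", "text": "Are you coming?", "context": ""},
    {"caller": "Speaker 2", "next_caller": "Speaker 1",
     "act_tad": "Statement-opinion", "text": "I think so.", "context": ""},
]

# Tally DAMSL act tags across the conversation.
tag_counts = Counter(turn["act_tad"] for turn in turns)
print(tag_counts)
```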
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each containing a tokenized text with named entity recognition (NER) information. Each item consists of two main components:
'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.
'ner': A list of lists containing named entity recognition information. Each inner list has three elements:
Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.
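A sketch of recovering entity surface forms from a record. The 'ner' element list above is truncated, so this assumes the three elements are start index, end index (inclusive), and entity label; the record values are illustrative.

```python
# Illustrative record; the third 'ner' element is assumed to be the
# entity label, and end indices are assumed inclusive.
record = {
    "tokenized_text": ["Marie", "Curie", "worked", "in", "Paris", "."],
    "ner": [[0, 1, "person"], [4, 4, "location"]],
}

def entity_strings(rec):
    # Join each token span [start, end] back into its surface form.
    return [(" ".join(rec["tokenized_text"][s:e + 1]), label)
            for s, e, label in rec["ner"]]

print(entity_strings(record))
```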
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset
Materials Project (2019 dump). This dataset contains 133,420 materials with formation energy per atom, processed from mp.2019.04.01.json.
Download
Download link: materials-project.tar.gz
MD5 checksum: c132f3781f32cd17f3a92aa6501b9531
Content
Bundled in materials-project.tar.gz.
Index (index.json)
list of dict:
index (int) => index of the structure in data file.
id (str) => id of Materials Project.
formula (str) => formula.
natoms (int) => number… See the full description on the dataset page: https://huggingface.co/datasets/materials-toolkits/materials-project.
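Since index.json is a plain list of dicts, a common use is building an id lookup to find a structure's position in the data file. A minimal sketch with illustrative values (the entries below are not taken from the actual file):

```python
# Sketch of the index.json layout described above; values are illustrative.
index = [
    {"index": 0, "id": "mp-149", "formula": "Si", "natoms": 2},
    {"index": 1, "id": "mp-390", "formula": "TiO2", "natoms": 6},
]

# Look up entries by Materials Project id, e.g. to locate a structure
# inside the bundled data file.
by_id = {entry["id"]: entry for entry in index}
print(by_id["mp-149"]["index"])
```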
License: GPL-2.0 (https://choosealicense.com/licenses/gpl-2.0/)
Dataset Card for ca-text-corpus
Dataset Summary
Catalan word lists with part of speech labeling curated by humans. Contains 1 180 773 forms including verbs, nouns, adjectives, names or toponyms. These word lists are used to build applications like Catalan spellcheckers or verb querying applications.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
Catalan (ca).
Dataset Structure
The dataset contains 3 columns:
Form… See the full description on the dataset page: https://huggingface.co/datasets/softcatala/catalan-dictionary.
African Cultural Reasoning Dataset with SmolAgents
import os
import json
import asyncio
from datetime import datetime
from typing import Dict, List, Any

from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel

class AfricanCultureDataGenerator:
    def __init__(self, api_key: str):
        # Initialize with explicit API key
        os.environ["OPENAI_API_KEY"] = api_key
        self.model = LiteLLMModel(
            model_id="gpt-4o-mini",
        )
… See the full description on the dataset page: https://huggingface.co/datasets/Svngoku/african_cultural_reasoning.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
from datasets import load_dataset, Dataset

def formatting_prompts_func(examples):
    """
    This function takes a dataset of examples and formats them into a list of text prompts.

    Args:
        examples (pandas.DataFrame): A DataFrame containing the dataset of examples.

    Returns:
        dict: A dictionary with a 'text' key containing the list of formatted text prompts.
    """
    texts = [
        # Create a formatted text prompt for…
See the full description on the dataset page: https://huggingface.co/datasets/ebowwa/human-biases.
License: unknown (https://choosealicense.com/licenses/unknown/)
List of all finalist songs at Sanremo Music Festival from the first edition in 1951 to 2025. The file is structured as a single dictionary where each key is a string representing the year (e.g., "1951"). The corresponding value is a list of dictionaries, with each dictionary detailing a single finalist song for that year. Each song object contains the following fields:
'titolo': The title of the song.
'autori': The song's authors and composers.
'interpreti': The artist(s) who performed the… See the full description on the dataset page: https://huggingface.co/datasets/raicrits/Sanremo_finalist_songs.
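A minimal sketch of the file layout described above: a single dictionary keyed by year strings, each mapping to a list of song dictionaries. The entry values are illustrative, not copied from the actual file.

```python
import json

# One year key mapping to a list of finalist-song dictionaries
# (illustrative values).
songs = json.loads(json.dumps({
    "1951": [
        {"titolo": "Grazie dei fiori",
         "autori": "Testoni, Panzeri, Seracini",
         "interpreti": "Nilla Pizzi"},
    ],
}))

# Titles of all finalists for a given year.
titles_1951 = [song["titolo"] for song in songs["1951"]]
print(titles_1951)
```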
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Dataset Structure
This dataset contains conversational data in the HuggingFace format with a messages field. This data is a transformation of HuggingFaceH4/Multilingual-Thinking.
Data Fields
messages: A list of message dictionaries, each containing:
role: The role of the message sender (system, user, or assistant)
content: The message content
thinking: (optional) Extended thinking content for assistant messages
Usage
from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/karashiiro/multilingual-uwu.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Segmentation Dataset for Judgments of the Supreme Court of Justice of Portugal
The goal of this dataset is to train a segmentation model that, given a judgment from the Supreme Court of Justice of Portugal (STJ), can divide its paragraphs into sections of the judgment itself.
Dataset Contents
JSON Files:
Judgment Text: Contains the judgment text divided into paragraphs, with each paragraph associated with a unique ID.
Denotations: A list of dictionaries where each… See the full description on the dataset page: https://huggingface.co/datasets/MartimZanatti/Segmentation_judgments_STJ.
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
LLMSQL Benchmark (Finetune-Ready)
This benchmark is designed to evaluate text-to-SQL models. For usage of this benchmark see https://github.com/LLMSQL/llmsql-benchmark.
This repository contains a finetune-ready version of the LLMSQL benchmark: LLMSQL on Hugging Face.
The dataset is structured in a messages format suitable for instruction-tuned models, where each example has a messages field. This field is a list of dictionaries with:
"role": "user" — the input question or prompt… See the full description on the dataset page: https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready.
Dataset: bouygues-deu-train
Dataset Info
This dataset contains conversational data structured into a single column, "messages", which represents interactions between a user and an assistant.
Features
messages: A list of dictionaries representing each conversational message:
content: (string) The text of the message.
role: (string) The role of the sender, either "user" or "assistant".
Splits
The dataset is divided into training and testing splits.… See the full description on the dataset page: https://huggingface.co/datasets/JonOlds64/accor-deu-train.
In the filename, 'all' means MMLU-Pro-CoT-Eval and 'math500' means Math500. Conf/PRM/aggregated scores are min/max normalized. Each file is a dict with keys "0.0~0.1", ..., "0.9~1.0", denoting bins of Delta = abs(PRM - Conf). Dataset["0.0~0.1"] is a list of dicts; each dict contains:
"problem": question sample
"score": aggregated score
"prm": PRM score
"conf": confidence score
"delta_prm_conf": abs(prm - conf)
"step_completion": list of str, reasoning path by steps
"correctness": correctness of the final answer
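A sketch of how a sample would be mapped to one of those ten delta bins, assuming prm and conf are already min/max normalized to [0, 1]; the bin-labeling scheme is inferred from the key format above.

```python
# Assign a sample to a 0.1-wide delta bin keyed like "0.2~0.3".
def delta_bin(prm, conf):
    delta = abs(prm - conf)
    # Clamp delta == 1.0 into the top bin "0.9~1.0".
    lo = min(int(delta * 10), 9) / 10
    return f"{lo:.1f}~{lo + 0.1:.1f}"

print(delta_bin(prm=0.82, conf=0.60))  # falls in the "0.2~0.3" bin
```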
License: CC0-1.0 (https://choosealicense.com/licenses/cc0-1.0/)
RetroInstruct Mix v0.2
This is the first release of the RetroInstruct synthetic instruction dataset. It is a mixture of 7 synthetic subsets:
RetroInstruct Weave Evaluator Questions: JDP - Answer questions about synthetic short form writing in the style of John David Pressman.
RetroInstruct Analogical Translations - Infer the generative process of bad faith reasoning by executing a bad faith process to generate arguments and reversing it.
RetroInstruct Part Lists For Dictionary… See the full description on the dataset page: https://huggingface.co/datasets/jdpressman/retroinstruct-mix-v0.2.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
""" _HOMEPAGE = "" _LICENSE = "Creative Commons Attribution-NonCommercial 4.0 International Public License"
_split_generators method)_URLs = {"default": "https://www.dropbox.com/s/041prrjylv0tf0h/ethics.zip?dl=1"}
class Imppres(datasets.GeneratorBasedBuilder):
VERSION = datasets.Version("1.1.0")
def _info(self):
features = datasets.Features(
{
"better_choice": datasets.Value("string"),
"worst_choice": datasets.Value("string"),
"comparison": datasets.Value("string"),
"label": datasets.Value("int32"),
})
return datasets.DatasetInfo(
# This is the description that will appear on the datasets page.
description=_DESCRIPTION,
# This defines the different columns of the dataset and their types
features=features, # Here we define them above because they are different between the two configurations
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=None,
# Homepage of the dataset for documentation
homepage=_HOMEPAGE,
# License for the dataset if available
license=_LICENSE,
# Citation for the dataset
citation=_CITATION,
)
def _split_generators(self, dl_manager):