47 datasets found
  1. Term lists and Dictionaries from Swedish Authorities

    • live.european-language-grid.eu
    • huggingface.co
    • +1 more
    pdf
    Updated Aug 30, 2022
    Cite
    (2022). Term lists and Dictionaries from Swedish Authorities [Dataset]. https://live.european-language-grid.eu/catalogue/lcr/18919
    Explore at:
    pdf (available download formats)
    Dataset updated
    Aug 30, 2022
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This resource also includes a dictionary from the ELMN containing a set of terms translated from English into all the EU languages. The languages listed for this resource indicate which languages the remaining lists and dictionaries cover together.

  2. OURO_dataset

    • huggingface.co
    Updated Sep 20, 2025
    Cite
    Xu (2025). OURO_dataset [Dataset]. https://huggingface.co/datasets/tinnel123/OURO_dataset
    Dataset updated
    Sep 20, 2025
    Authors
    Xu
    Description

    ouro_dataset README

      Overview
    

    The ouro_dataset is a JSON file containing a list of dictionaries, where each dictionary represents a data entry. Each entry corresponds to a question-answer pair associated with an image. This dataset is intended for use in tasks such as Optical Character Recognition (OCR) and Visual Question Answering (VQA). Each dictionary contains an image path, a question, and its corresponding answer.

      Dataset Structure
    

    The dataset is stored in… See the full description on the dataset page: https://huggingface.co/datasets/tinnel123/OURO_dataset.
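As a rough illustration of the structure described in the overview, the sketch below reads a JSON file holding a list of question-answer entries. The key names (`image`, `question`, `answer`) are assumptions made for illustration, since the full field list is truncated in this excerpt; the dataset page has the actual names.

```python
import json
from pathlib import Path


def load_entries(path):
    """Read a JSON file holding a list of dictionaries and return that list."""
    with Path(path).open(encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON list")
    return entries


def summarize(entry, keys=("image", "question", "answer")):
    """Pick the (hypothetical) keys of interest; missing keys map to None."""
    return {key: entry.get(key) for key in keys}
```

Using `entry.get` rather than indexing keeps the sketch tolerant of entries that lack one of the assumed keys.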

  3. grpo-oumi-c2d-d2c-subset

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    Teen Different (2025). grpo-oumi-c2d-d2c-subset [Dataset]. https://huggingface.co/datasets/TEEN-D/grpo-oumi-c2d-d2c-subset
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Teen Different
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for GRPO Oumi ANLI Subset

      Dataset
    

    This dataset is a reformatted version of the oumi-ai/oumi-c2d-d2c-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-c2d-d2c-subset

      Dataset Structure
    

    The dataset consists of a list of dictionaries, where each dictionary represents a single data instance with… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-c2d-d2c-subset.

  4. grpo-oumi-synthetic-claims

    • huggingface.co
    Updated Apr 24, 2025
    Cite
    Teen Different (2025). grpo-oumi-synthetic-claims [Dataset]. https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-claims
    Dataset updated
    Apr 24, 2025
    Dataset authored and provided by
    Teen Different
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for GRPO Oumi ANLI Subset

      Dataset
    

    This dataset is a reformatted version of the TEEN-D/grpo-oumi-anli-subset dataset, specifically structured for use with the GRPO trainer. You can find more detailed information about the original dataset at the provided link. Link: https://huggingface.co/datasets/oumi-ai/oumi-synthetic-claims

      Dataset Structure
    

    The dataset consists of a list of dictionaries, where each dictionary represents a single data instance… See the full description on the dataset page: https://huggingface.co/datasets/TEEN-D/grpo-oumi-synthetic-claims.

  5. alpaca_data_galician

    • huggingface.co
    Updated Nov 8, 2023
    Cite
    Information Retrieval Lab @ University of A Coruña (2023). alpaca_data_galician [Dataset]. https://huggingface.co/datasets/irlab-udc/alpaca_data_galician
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    University of A Coruña: http://udc.es/
    Authors
    Information Retrieval Lab @ University of A Coruña
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Galician version of alpaca_data.json

    This is a Galician translation of the Stanford alpaca_data.json dataset, produced with the Python package googletranslatepy. Our working notes are available here.

      Dataset Structure
    

    The dataset contains 52K instruction-following elements in a JSON file with a list of dictionaries. Each dictionary contains the following fields:

    instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
    input: str… See the full description on the dataset page: https://huggingface.co/datasets/irlab-udc/alpaca_data_galician.

  6. HH_full_parsed

    • huggingface.co
    Updated Feb 9, 2025
    Cite
    Taywon Min (2025). HH_full_parsed [Dataset]. https://huggingface.co/datasets/Taywon/HH_full_parsed
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2025
    Authors
    Taywon Min
    Description

    This is the hh-rlhf dataset with only the helpful splits merged. The format is parsed so that chosen and rejected are lists of dictionaries rather than strings, with the dictionaries representing the conversation (following the more standard format).
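To make the string-to-list conversion concrete, here is a minimal sketch of how such a transcript could be parsed. It relies on the `\n\nHuman:` and `\n\nAssistant:` markers used in hh-rlhf transcripts; the output keys `role` and `content` and the role names `user` and `assistant` are assumptions about what "the more standard format" means, not taken from the dataset itself.

```python
import re

# Speaker markers that separate turns in an hh-rlhf style transcript.
_TURN = re.compile(r"\n\n(Human|Assistant):")

# Assumed mapping from transcript speakers to chat-style role names.
_ROLES = {"Human": "user", "Assistant": "assistant"}


def parse_transcript(text):
    """Split a transcript string into a list of message dictionaries."""
    parts = _TURN.split(text)
    # parts[0] is whatever precedes the first marker (normally empty);
    # after that, speaker names and utterances alternate.
    return [
        {"role": _ROLES[speaker], "content": utterance.strip()}
        for speaker, utterance in zip(parts[1::2], parts[2::2])
    ]
```

Applying the same function to both the chosen and the rejected string would yield two lists of dictionaries per example.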

  7. mili-o

    • huggingface.co
    Updated Aug 8, 2025
    Cite
    OpenLiliO (2025). mili-o [Dataset]. https://huggingface.co/datasets/OpenLiliO/mili-o
    Dataset updated
    Aug 8, 2025
    Dataset authored and provided by
    OpenLiliO
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The filename of each JSON file is represented like this:

    Each json file contains a list of dictionaries, each dictionary representing a conversation turn with the following keys:

    caller: The speaker of the turn (e.g., "Speaker 1", "Speaker 2").
    next_caller: The next speaker in the conversation (e.g., "Speaker 2", "Speaker 1").
    act_tad: The DAMSL act tag for the turn (e.g., "Statement-opinion", "Question-yesno").
    text: The text of the turn.
    context: A… See the full description on the dataset page: https://huggingface.co/datasets/OpenLiliO/mili-o.

  8. GLINER-multi-task-synthetic-data

    • huggingface.co
    Updated Jul 15, 2024
    Cite
    Knowledgator Engineering (2024). GLINER-multi-task-synthetic-data [Dataset]. https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 15, 2024
    Dataset authored and provided by
    Knowledgator Engineering
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the official synthetic dataset used to train the GLiNER multi-task model. The dataset is a list of dictionaries, each consisting of a tokenized text with named entity recognition (NER) information. Each item consists of two main components:

    'tokenized_text': A list of individual words and punctuation marks from the original text, split into tokens.

    'ner': A list of lists containing named entity recognition information. Each inner list has three elements:

    Start index of the named entity in the… See the full description on the dataset page: https://huggingface.co/datasets/knowledgator/GLINER-multi-task-synthetic-data.

  9. materials-project

    • huggingface.co
    Cite
    Materials toolkits, materials-project [Dataset]. https://huggingface.co/datasets/materials-toolkits/materials-project
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Materials toolkits
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    Materials Project (2019 dump). This dataset contains 133420 materials with formation energy per atom. Processed from mp.2019.04.01.json.

      Download
    

    Download link: materials-project.tar.gz
    MD5 checksum: c132f3781f32cd17f3a92aa6501b9531

      Content
    

    Bundled in materials-project.tar.gz.

      Index (index.json)
    

    list of dict:

    index (int) => index of the structure in data file.
    id (str) => id of Materials Project.
    formula (str) => formula.
    natoms (int) => number… See the full description on the dataset page: https://huggingface.co/datasets/materials-toolkits/materials-project.

  10. catalan-dictionary

    • huggingface.co
    Updated Jun 30, 2022
    Cite
    Softcatalà (2022). catalan-dictionary [Dataset]. https://huggingface.co/datasets/softcatala/catalan-dictionary
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 30, 2022
    Dataset authored and provided by
    Softcatalà
    License

    https://choosealicense.com/licenses/gpl-2.0/

    Description

    Dataset Card for ca-text-corpus

      Dataset Summary
    

    Catalan word lists with part of speech labeling curated by humans. Contains 1 180 773 forms including verbs, nouns, adjectives, names or toponyms. These word lists are used to build applications like Catalan spellcheckers or verb querying applications.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    Catalan (ca).

      Dataset Structure
    

    The dataset contains 3 columns:

    Form… See the full description on the dataset page: https://huggingface.co/datasets/softcatala/catalan-dictionary.

  11. african_cultural_reasoning

    • huggingface.co
    Updated Mar 11, 2025
    Cite
    NIONGOLO Chrys Fé-Marty (2025). african_cultural_reasoning [Dataset]. https://huggingface.co/datasets/Svngoku/african_cultural_reasoning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 11, 2025
    Authors
    NIONGOLO Chrys Fé-Marty
    Area covered
    Africa
    Description

    African Cultural Reasoning Dataset with SmolAgents

    import os
    from typing import Dict, List, Any
    import json
    from datetime import datetime
    import asyncio
    from smolagents import CodeAgent, DuckDuckGoSearchTool, LiteLLMModel

    class AfricanCultureDataGenerator:
        def __init__(self, api_key: str):
            # Initialize with explicit API key
            os.environ["OPENAI_API_KEY"] = api_key

            self.model = LiteLLMModel(
                model_id="gpt-4o-mini",
            )

    … See the full description on the dataset page: https://huggingface.co/datasets/Svngoku/african_cultural_reasoning.
    
  12. human-biases

    • huggingface.co
    Updated Jun 22, 2024
    Cite
    Ebowwa (2024). human-biases [Dataset]. https://huggingface.co/datasets/ebowwa/human-biases
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2024
    Authors
    Ebowwa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    from datasets import load_dataset, Dataset

    # Load and preprocess the dataset

    def formatting_prompts_func(examples):
        """
        This function takes a dataset of examples and formats them into a list of text prompts.

        Args:
            examples (pandas.DataFrame): A DataFrame containing the dataset of examples.

        Returns:
            dict: A dictionary with a 'text' key containing the list of formatted text prompts.
        """
        texts = [
            # Create a formatted text prompt for…

    See the full description on the dataset page: https://huggingface.co/datasets/ebowwa/human-biases.
    
  13. Sanremo_finalist_songs

    • huggingface.co
    Updated Sep 6, 2025
    Cite
    RAI - Centre for Research, Technological Innovation and Experimentation (2025). Sanremo_finalist_songs [Dataset]. https://huggingface.co/datasets/raicrits/Sanremo_finalist_songs
    Dataset updated
    Sep 6, 2025
    Dataset authored and provided by
    RAI - Centre for Research, Technological Innovation and Experimentation
    License

    https://choosealicense.com/licenses/unknown/

    Description

    List of all finalist songs at Sanremo Music Festival from the first edition in 1951 to 2025. The file is structured as a single dictionary where each key is a string representing the year (e.g., "1951"). The corresponding value is a list of dictionaries, with each dictionary detailing a single finalist song for that year. Each song object contains the following fields:

    'titolo': The title of the song.
    'autori': The song's authors and composers.
    'interpreti': The artist(s) who performed the… See the full description on the dataset page: https://huggingface.co/datasets/raicrits/Sanremo_finalist_songs.
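The year-keyed layout described above can be traversed with two nested loops. The sketch below is only illustrative: the sample records are made up (not real dataset content), and it assumes `interpreti` can be treated as text, which this excerpt does not confirm.

```python
def songs_by_artist(festival, name):
    """Return (year, title) pairs for songs whose 'interpreti' mentions `name`."""
    hits = []
    for year, songs in festival.items():
        for song in songs:
            # str() lets the check work whether the field is a string or a list.
            if name.lower() in str(song.get("interpreti", "")).lower():
                hits.append((year, song.get("titolo")))
    return sorted(hits, key=lambda hit: hit[0])


# Made-up sample in the described shape: year string -> list of song dicts.
sample = {
    "1951": [{"titolo": "Song A", "autori": "Author X", "interpreti": "Singer One"}],
    "1952": [{"titolo": "Song B", "autori": "Author Y", "interpreti": "Singer Two"}],
}
```

Calling `songs_by_artist(sample, "singer one")` on the sample would match only the 1951 entry, since the comparison is case-insensitive.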

  14. multilingual-uwu

    • huggingface.co
    Cite
    Kara Aki, multilingual-uwu [Dataset]. https://huggingface.co/datasets/karashiiro/multilingual-uwu
    Authors
    Kara Aki
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Structure

    This dataset contains conversational data in the HuggingFace format with a messages field. This data is a transformation of HuggingFaceH4/Multilingual-Thinking.

      Data Fields
    

    messages: A list of message dictionaries, each containing:
      role: The role of the message sender (system, user, or assistant)
      content: The message content
      thinking: (optional) Extended thinking content for assistant messages

      Usage
    

    from datasets import load_dataset… See the full description on the dataset page: https://huggingface.co/datasets/karashiiro/multilingual-uwu.
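A small check of the message fields listed under Data Fields might look like the following. This is a sketch rather than part of the dataset's tooling: the function name is hypothetical, and treating `thinking` on a non-assistant message as a problem is an assumption drawn from the field description.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of human-readable problems found in a messages list."""
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")
        # 'thinking' is optional and described only for assistant messages.
        if msg.get("thinking") is not None and role != "assistant":
            problems.append(f"message {i}: thinking on a non-assistant message")
    return problems
```

An empty return value means every message has an allowed role and string content.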

  15. Segmentation_judgments_STJ

    • huggingface.co
    Updated Jul 13, 2024
    Cite
    Martim Zanatti dos Santos Gomes da Silva (2024). Segmentation_judgments_STJ [Dataset]. https://huggingface.co/datasets/MartimZanatti/Segmentation_judgments_STJ
    Dataset updated
    Jul 13, 2024
    Authors
    Martim Zanatti dos Santos Gomes da Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Segmentation Dataset for Judgments of the Supreme Court of Justice of Portugal

    The goal of this dataset is to train a segmentation model that, given a judgment from the Supreme Court of Justice of Portugal (STJ), can divide its paragraphs into sections of the judgment itself.

      Dataset Contents
    

    JSON Files:

    Judgment Text: Contains the judgment text divided into paragraphs, with each paragraph associated with a unique ID.

    Denotations: A list of dictionaries where each… See the full description on the dataset page: https://huggingface.co/datasets/MartimZanatti/Segmentation_judgments_STJ.

  16. llmsql-benchmark-finetune-ready

    • huggingface.co
    Updated Oct 22, 2025
    Cite
    LLMSQL (2025). llmsql-benchmark-finetune-ready [Dataset]. https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready
    Dataset updated
    Oct 22, 2025
    Dataset authored and provided by
    LLMSQL
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    LLMSQL Benchmark (Finetune-Ready)

    This benchmark is designed to evaluate text-to-SQL models. For usage of this benchmark see https://github.com/LLMSQL/llmsql-benchmark. This repository contains a finetune-ready version of the LLMSQL benchmark: LLMSQL on Hugging Face.
    The dataset is structured in a messages format suitable for instruction-tuned models, where each example has a messages field. This field is a list of dictionaries with:

    "role": "user" — the input question or prompt… See the full description on the dataset page: https://huggingface.co/datasets/llmsql-bench/llmsql-benchmark-finetune-ready.

  17. accor-deu-train

    • huggingface.co
    Cite
    Jon Olds, accor-deu-train [Dataset]. https://huggingface.co/datasets/JonOlds64/accor-deu-train
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Jon Olds
    Description

    Dataset: bouygues-deu-train

      Dataset Info
    

    This dataset contains conversational data structured into a single column, "messages", which represents interactions between a user and an assistant.

      Features
    

    messages: A list of dictionaries representing each conversational message
      content: (string) The text of the message.
      role: (string) The role of the sender, either "user" or "assistant".

      Splits
    

    The dataset is divided into training and testing splits.… See the full description on the dataset page: https://huggingface.co/datasets/JonOlds64/accor-deu-train.

  18. delta_data

    • huggingface.co
    Updated Oct 21, 2025
    Cite
    Jiongdao Jin (2025). delta_data [Dataset]. https://huggingface.co/datasets/jiongdao/delta_data
    Dataset updated
    Oct 21, 2025
    Authors
    Jiongdao Jin
    Description

    In the filenames, 'all' means MMLU-Pro-CoT-Eval and 'math500' means Math500. Conf/PRM/Aggregated scores are min/max normalized. Each file is a dict with keys "0.0~0.1", ..., "0.9~1.0", each denoting a range of Delta = abs(PRM - Conf). Dataset["0.0~0.1"] is a list of dicts, each containing:

    "problem": question sample
    "score": aggregated score
    "prm": PRM score
    "conf": confidence score
    "delta_prm_conf": abs(prm - conf)
    "step_completion": list of str, reasoning path by steps
    "correctness": correctness of the final answer
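The bucket keys can be reproduced from a PRM score and a confidence score as sketched below. The function name is hypothetical, and the handling of values that fall exactly on a bucket edge (lower edge inclusive, with 1.0 placed in the last bucket) is an assumption, since the description does not say how edges are assigned.

```python
def delta_bin_key(prm, conf):
    """Return the bucket key, such as "0.2~0.3", for Delta = abs(prm - conf)."""
    delta = abs(prm - conf)
    if not 0.0 <= delta <= 1.0:
        raise ValueError("scores are expected to be normalized to [0, 1]")
    # Ten equal-width buckets; a delta of exactly 1.0 goes into the last one.
    index = min(int(delta * 10), 9)
    return f"{index / 10:.1f}~{(index + 1) / 10:.1f}"
```

Because the scores are floats, deltas that land very close to a bucket edge may fall on either side of it; the sketch makes no attempt to correct for that.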

  19. retroinstruct-mix-v0.2

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    John David Pressman (2024). retroinstruct-mix-v0.2 [Dataset]. https://huggingface.co/datasets/jdpressman/retroinstruct-mix-v0.2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2024
    Authors
    John David Pressman
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    RetroInstruct Mix v0.2

    This is the first release of the RetroInstruct synthetic instruction dataset. It is a mixture of 7 synthetic subsets:

    RetroInstruct Weave Evaluator Questions: JDP - Answer questions about synthetic short form writing in the style of John David Pressman.

    RetroInstruct Analogical Translations - Infer the generative process of bad faith reasoning by executing a bad faith process to generate arguments and reversing it.

    RetroInstruct Part Lists For Dictionary… See the full description on the dataset page: https://huggingface.co/datasets/jdpressman/retroinstruct-mix-v0.2.

  20. utilitarianism

    • huggingface.co
    Updated Apr 14, 2023
    Cite
    metaeval (2023). utilitarianism [Dataset]. https://huggingface.co/datasets/metaeval/utilitarianism
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 14, 2023
    Dataset authored and provided by
    metaeval
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    """

    _HOMEPAGE = ""

    _LICENSE = "Creative Commons Attribution-NonCommercial 4.0 International Public License"

    # The HuggingFace dataset library don't host the datasets but only point to the original files
    # This can be an arbitrary nested dict/list of URLs (see below in _split_generators method)
    _URLs = {"default": "https://www.dropbox.com/s/041prrjylv0tf0h/ethics.zip?dl=1"}

    class Imppres(datasets.GeneratorBasedBuilder):

        VERSION = datasets.Version("1.1.0")

        def _info(self):
            features = datasets.Features(
                {
                    "better_choice": datasets.Value("string"),
                    "worst_choice": datasets.Value("string"),
                    "comparison": datasets.Value("string"),
                    "label": datasets.Value("int32"),
                })
            return datasets.DatasetInfo(
                # This is the description that will appear on the datasets page.
                description=_DESCRIPTION,
                # This defines the different columns of the dataset and their types
                features=features,  # Here we define them above because they are different between the two configurations
                # If there's a common (input, target) tuple from the features,
                # specify them here. They'll be used if as_supervised=True in
                # builder.as_dataset.
                supervised_keys=None,
                # Homepage of the dataset for documentation
                homepage=_HOMEPAGE,
                # License for the dataset if available
                license=_LICENSE,
                # Citation for the dataset
                citation=_CITATION,
            )

        def _split_generators(self, dl_manager):
    