100+ datasets found
  1. json_data_extraction

    • huggingface.co
    Updated Feb 1, 2024
    Cite
    paraloq analytics (2024). json_data_extraction [Dataset]. https://huggingface.co/datasets/paraloq/json_data_extraction
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 1, 2024
    Dataset authored and provided by
    paraloq analytics
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Diverse Restricted JSON Data Extraction

    Curated by: The paraloq analytics team.

      Uses

    • Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
    • Fine-tune a data extraction model (text + JSON schema -> JSON instance)
    • Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)

      Out-of-Scope Use
    

    Intended for research purposes only.

      Dataset Structure
    

    The data comes with the following fields:

    title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
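    To make the benchmark task concrete, here is a minimal sketch of the text + JSON schema -> JSON instance loop, assuming the standard datasets API and a "schema" column stored as a JSON string; the field list above is truncated, so check the dataset card for the actual column names before running.

      import json
      from datasets import load_dataset
      from jsonschema import ValidationError, validate

      ds = load_dataset("paraloq/json_data_extraction", split="train")

      def satisfies_schema(predicted_json: str, schema_json: str) -> bool:
          """Return True if a model's output parses and validates against the schema."""
          try:
              validate(instance=json.loads(predicted_json), schema=json.loads(schema_json))
              return True
          except (json.JSONDecodeError, ValidationError):
              return False

      example = ds[0]
      # A real benchmark would call a model here; "{}" stands in for its output.
      print(satisfies_schema("{}", example["schema"]))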

  2. extract-0

    • huggingface.co
    Updated Sep 8, 2025
    Cite
    Henrique Godoy (2025). extract-0 [Dataset]. https://huggingface.co/datasets/HenriqueGodoy/extract-0
    Explore at:
    Dataset updated
    Sep 8, 2025
    Authors
    Henrique Godoy
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Extract-0 Document Information Extraction Dataset

    This dataset contains 280,128 synthetic training examples for document information extraction, used to train Extract-0, a specialized 7B parameter language model that outperforms GPT-4 and other larger models on extraction tasks.

      Dataset Description
    

    The Extract-0 dataset represents a comprehensive collection of document extraction examples generated from diverse sources including arXiv papers, PubMed Central articles… See the full description on the dataset page: https://huggingface.co/datasets/HenriqueGodoy/extract-0.

  3. parsed-model-cards

    • huggingface.co
    Updated Feb 28, 2025
    Cite
    Daniel van Strien (2025). parsed-model-cards [Dataset]. https://huggingface.co/datasets/davanstrien/parsed-model-cards
    Explore at:
    Croissant
    Dataset updated
    Feb 28, 2025
    Authors
    Daniel van Strien
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for parsed-model-cards

    This dataset contains structured information extracted from model cards on the Hugging Face Hub. It was created using Curator, the Qwen reasoning model QwQ-32B, and vLLM.

      Dataset Overview
    

    The dataset consists of model card texts paired with structured information extracted from those texts using a reasoning-based approach. Each entry contains:

    • The original model card content
    • Structured JSON with standardized fields extracted from… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/parsed-model-cards.

  4. CoAID dataset with multiple extracted features (both sparse and dense)

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jun 10, 2022
    Cite
    Guillaume Bernard (2022). CoAID dataset with multiple extracted features (both sparse and dense) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6630404
    Explore at:
    Dataset updated
    Jun 10, 2022
    Dataset provided by
    Laboratoire L3i, Université de La Rochelle
    Authors
    Guillaume Bernard
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a publication of the CoAID dataset, originally dedicated to fake news detection. We repurpose it here for event tracking in press documents.

    Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.

    In this dataset, we provide multiple features extracted from the text itself. Please note the text is missing from the dataset published in the CSV format for copyright reasons. You can download the original datasets and manually add the missing texts from the original publications.

    Features are extracted using:

    • A corpus of reference articles in multiple languages for TF-IDF weighting. (features_news) [1]

    • A corpus of tweets reporting news for TF-IDF weighting. (features_tweets) [1]

    • An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use)

    • An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet)

    References:

    [1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406

    [2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–92. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
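    For context, the dense features described above could be reproduced along these lines, assuming the sentence-transformers package; the model names are the ones cited in the description, while the input text is illustrative.

      from sentence_transformers import SentenceTransformer

      texts = ["Example press article reporting a COVID-19 claim."]

      # features_use: distiluse-base-multilingual-cased-v1 (512-dimensional)
      use_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
      features_use = use_model.encode(texts)

      # features_mpnet: paraphrase-multilingual-mpnet-base-v2 (768-dimensional)
      mpnet_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
      features_mpnet = mpnet_model.encode(texts)

      print(features_use.shape, features_mpnet.shape)  # (1, 512) (1, 768)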

  5. Replication Package for 'How do Machine Learning Models Change?'

    • zenodo.org
    Updated Nov 13, 2024
    Cite
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández (2024). Replication Package for 'How do Machine Learning Models Change?' [Dataset]. http://doi.org/10.5281/zenodo.14128997
    Explore at:
    Dataset updated
    Nov 13, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Joel Castaño Fernández; Rafael Cabañas; Antonio Salmerón; Lo David; Silverio Martínez-Fernández
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.

    Our research addresses three main aspects:

    1. Categorization of Commit Changes: We classified over 200,000 commits on HF using an extended ML change taxonomy, providing a detailed breakdown of change types and their distribution across models.
    2. Analysis of Commit Sequences: We examined the sequence and dependencies of commit types using Bayesian networks to identify temporal patterns and common progression paths in model changes.
    3. Release Analysis: We investigated the distribution and evolution of release types, analyzing how model attributes and metadata change across successive releases.

    This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.

    Data Collection and Preprocessing

    Data Collection

    We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:

    • Model Information: Details of over 380,000 models, including dataset sizes, training hardware, evaluation metrics, model file sizes, number of downloads and likes, tags, and the raw text of model cards.
    • Commit Histories: Comprehensive commit details, including commit messages, dates, authors, and the list of files edited in each commit.
    • Release Information: Information on model releases marked by tags in their repositories.

    To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
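    As a rough sketch, this collection step can be reproduced with the Hugging Face Hub client along these lines; the exact arguments used in HFTotalExtraction.ipynb may differ, so treat this as illustrative only.

      from huggingface_hub import HfApi

      api = HfApi()

      # Model information: downloads, likes, tags, and card metadata.
      for model in api.list_models(full=True, limit=5):
          print(model.id, model.downloads, model.likes)

      # Commit history for a single repository.
      for commit in api.list_repo_commits("bert-base-uncased")[:3]:
          print(commit.commit_id[:8], commit.created_at, commit.title)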

    Data Preprocessing

    Commit Diffs

    We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
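    Conceptually, this per-file diff reduces to comparing two parsed JSON objects. A minimal sketch over top-level keys (the notebooks may additionally recurse into nested objects):

      def diff_config(old: dict, new: dict) -> dict:
          """Report added, deleted, and updated top-level keys between two configs."""
          return {
              "added": sorted(new.keys() - old.keys()),
              "deleted": sorted(old.keys() - new.keys()),
              "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
          }

      before = {"hidden_size": 768, "num_layers": 12, "dropout": 0.1}
      after = {"hidden_size": 1024, "num_layers": 12, "vocab_size": 30522}
      print(diff_config(before, after))
      # {'added': ['vocab_size'], 'deleted': ['dropout'], 'updated': ['hidden_size']}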

    Commit Classification

    We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
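    The agreement check described above can be expressed with scikit-learn's cohen_kappa_score; the labels below are invented placeholders, not actual categories from Bhatia et al.'s taxonomy.

      from sklearn.metrics import cohen_kappa_score

      manual = ["data_change", "model_change", "doc_change", "model_change"]
      llm = ["data_change", "model_change", "doc_change", "model_change"]

      kappa = cohen_kappa_score(manual, llm)
      assert kappa >= 0.9, f"Agreement too low ({kappa:.2f}); refine the prompt and re-validate."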

    Model Metadata

    We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.

    Folder Structure

    The replication package is organized as follows:

    - code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.

    • Collection/: Contains two Jupyter notebooks for data collection:
      • HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
      • HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
    • Preprocessing/: Contains preprocessing scripts:
      • HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
      • HFCommitsPreprocessing.ipynb: Processes commit data, including:
        • Retrieval of diff information between commits.
        • Classification of commits following Bhatia et al.'s taxonomy using LLMs.
        • Extension and adaptation of the final commits dataset, including additional variables for Bayesian network analysis.
      • HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
    • Analysis/: Contains three Jupyter notebooks with the analysis for each research question:
      • RQ1_Analysis.ipynb: Analysis for RQ1.
      • RQ2_Analysis.ipynb: Analysis for RQ2.
      • RQ3_Analysis.ipynb: Analysis for RQ3.

    - datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.

    • Main Datasets:
      • HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
      • HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
      • HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
      • model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
      • These datasets correspond to the following dataset splits:
        • +200,000 commits from 50,000 models: Used for RQ1. Provides a broad overview of commit types and patterns across diverse models.
        • +200,000 commits from 10,000 models: Used for RQ2. Focuses on models with at least 10 commits for detailed evolutionary study.
        • +1,200 releases from 127 models: Used for RQ3.1, RQ3.2, and RQ3.3. Facilitates the investigation of release patterns and their evolution.
        • Metadata of 173 releases from 27 models: Used for RQ3.4. Analyzes the evolution of model parameters and configurations.
    • Additional Datasets:
      • HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
      • HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
      • Auxiliary datasets generated during processing are also included to facilitate reproduction of specific parts of the code without time-consuming steps.

    - metadata/: Contains the tags_metadata.yaml file used during preprocessing.

    - models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.

    - requirements.txt: Lists the required Python packages to set up the environment and run the code.

    Setup and Execution

    Prerequisites

    • Python 3.10.11 or later.
    • Jupyter Notebook or JupyterLab.

    Installation

    1. Download and extract the replication package.
    2. Create a virtual environment (recommended):
       python -m venv venv
       source venv/bin/activate  # On Windows, use venv\Scripts\activate
    3. Install the required packages:
       pip install -r requirements.txt

    Notes

    - LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.

    - Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.

    - Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.

    Additional Information

    Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.

    This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.

  6. drug-combo-extraction

    • huggingface.co
    Updated Feb 3, 2015
    Cite
    Ai2 (2015). drug-combo-extraction [Dataset]. https://huggingface.co/datasets/allenai/drug-combo-extraction
    Explore at:
    Croissant
    Dataset updated
    Feb 3, 2015
    Dataset authored and provided by
    Ai2
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The allenai/drug-combo-extraction dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  7. lm-extraction-benchmark

    • huggingface.co
    Updated Mar 14, 2024
    Cite
    Guy Dar (2024). lm-extraction-benchmark [Dataset]. https://huggingface.co/datasets/dar-tau/lm-extraction-benchmark
    Explore at:
    Croissant
    Dataset updated
    Mar 14, 2024
    Authors
    Guy Dar
    Description

    The dar-tau/lm-extraction-benchmark dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  8. document-extraction-dataset-with-ocr

    • huggingface.co
    Cite
    Patel, document-extraction-dataset-with-ocr [Dataset]. https://huggingface.co/datasets/Jwalit/document-extraction-dataset-with-ocr
    Explore at:
    Authors
    Patel
    Description

    The Jwalit/document-extraction-dataset-with-ocr dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  9. ingredient-detection

    • huggingface.co
    Updated Sep 19, 2024
    Cite
    Open Food Facts (2024). ingredient-detection [Dataset]. https://huggingface.co/datasets/openfoodfacts/ingredient-detection
    Explore at:
    Croissant
    Dataset updated
    Sep 19, 2024
    Dataset authored and provided by
    Open Food Facts: https://openfoodfacts.org/
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset is used to train a multilingual ingredient list detection model. The goal is to automate the extraction of ingredient lists from food packaging images. See this issue for a broader context about ingredient list extraction.

      Dataset generation
    

    Raw unannotated texts are OCR results obtained with Google Cloud Vision, covering only images marked as ingredient images on Open Food Facts. The dataset was generated using ChatGPT-3.5: we asked ChatGPT to extract ingredient… See the full description on the dataset page: https://huggingface.co/datasets/openfoodfacts/ingredient-detection.

  10. Invoice-Data-Extraction

    • huggingface.co
    Updated Aug 11, 2025
    + more versions
    Cite
    Daniel Lopez (2025). Invoice-Data-Extraction [Dataset]. https://huggingface.co/datasets/nassimN/Invoice-Data-Extraction
    Explore at:
    Dataset updated
    Aug 11, 2025
    Authors
    Daniel Lopez
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The nassimN/Invoice-Data-Extraction dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  11. contracts-extraction-instruction-llm-experiments

    • huggingface.co
    Updated Feb 3, 2015
    Cite
    Yogendra Sisodia (2015). contracts-extraction-instruction-llm-experiments [Dataset]. https://huggingface.co/datasets/scholarly360/contracts-extraction-instruction-llm-experiments
    Explore at:
    Croissant
    Dataset updated
    Feb 3, 2015
    Authors
    Yogendra Sisodia
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "contracts-extraction-instruction-llm-experiments"

    More Information needed

  12. clinical-ie

    • huggingface.co
    Updated Dec 7, 2022
    Cite
    MIT Clinical Machine Learning Group (2022). clinical-ie [Dataset]. https://huggingface.co/datasets/mitclinicalml/clinical-ie
    Explore at:
    Dataset updated
    Dec 7, 2022
    Dataset authored and provided by
    MIT Clinical Machine Learning Group
    Description

    Below, we provide access to the datasets used in and created for the EMNLP 2022 paper Large Language Models are Few-Shot Clinical Information Extractors.

      Task #1: Clinical Sense Disambiguation
    

    For Task #1, we use the original annotations from the Clinical Acronym Sense Inventory (CASI) dataset, described in their paper. As is common, due to noisiness in the label set, we do not evaluate on the entire dataset, but only on a cleaner subset. For consistency, we use the subset defined… See the full description on the dataset page: https://huggingface.co/datasets/mitclinicalml/clinical-ie.

  13. jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model

    • huggingface.co
    Updated Jun 17, 2024
    Cite
    Math extraction comparisson (2024). jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model [Dataset]. https://huggingface.co/datasets/math-extraction-comp/jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model
    Explore at:
    Croissant
    Dataset updated
    Jun 17, 2024
    Dataset authored and provided by
    Math extraction comparisson
    Description

    The math-extraction-comp/jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  14. Table-Extraction

    • huggingface.co
    Updated Apr 8, 2024
    Cite
    Group (2024). Table-Extraction [Dataset]. https://huggingface.co/datasets/Effyis/Table-Extraction
    Explore at:
    Croissant
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Group
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Table Extract Dataset

    This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides a collection of text snippets containing tables and their corresponding structured representations in JSON format.

      Source
    

    The dataset is based on the Table Fact Dataset, also known as TabFact, which contains 16,573 tables extracted from Wikipedia.

      Schema:
    

    Each data point in the dataset consists of two elements:… See the full description on the dataset page: https://huggingface.co/datasets/Effyis/Table-Extraction.
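    A minimal sketch for inspecting the dataset, assuming it loads with the standard datasets API; the schema description above is truncated and does not name the columns, so print them before relying on any field.

      import json
      from datasets import load_dataset

      ds = load_dataset("Effyis/Table-Extraction", split="train")
      example = ds[0]
      print(example.keys())  # discover the actual field names first

      # If the structured element is stored as a JSON string, parse it for
      # comparison against a model's output, e.g.:
      # table = json.loads(example["json"])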

  15. structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset

    • huggingface.co
    Updated Jan 29, 2025
    + more versions
    Cite
    David Berenstein (2025). structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset [Dataset]. https://huggingface.co/datasets/davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset
    Explore at:
    Croissant
    Dataset updated
    Jan 29, 2025
    Authors
    David Berenstein
    Description

    The davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  16. furniture-model-extraction

    • huggingface.co
    Updated Sep 24, 2025
    Cite
    wynn watson (2025). furniture-model-extraction [Dataset]. https://huggingface.co/datasets/wynnwatson/furniture-model-extraction
    Explore at:
    Dataset updated
    Sep 24, 2025
    Authors
    wynn watson
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Furniture Model Number Extraction Dataset

    This dataset contains furniture inventory images with corresponding model numbers for training vision-language models to extract product model numbers from furniture store photos.

      Dataset Description
    

    Created: 2025-09-24T12:26:54.209816
    Task: Vision-Language Model Training for Model Number Extraction
    Base Model: IBM Granite Vision 3.2 2B
    Domain: Furniture Inventory Management

      Dataset Statistics
    

    Training Samples: 219… See the full description on the dataset page: https://huggingface.co/datasets/wynnwatson/furniture-model-extraction.

  17. Data from: arabic-books

    • huggingface.co
    Updated Nov 28, 2024
    Cite
    Mohamed Rashad (2024). arabic-books [Dataset]. https://huggingface.co/datasets/MohamedRashad/arabic-books
    Explore at:
    Croissant
    Dataset updated
    Nov 28, 2024
    Authors
    Mohamed Rashad
    License

    GPL-3.0: https://choosealicense.com/licenses/gpl-3.0/

    Description

    Arabic Books

      Dataset Summary
    

    The arabic-books dataset contains 8,500 rows of text, each representing the full text of a single Arabic book. These texts were extracted using the arabic-large-nougat model, showcasing the model's capabilities in Arabic OCR and text extraction. The dataset spans a total of 1.1 billion tokens, counted with the GPT-4 tokenizer. This dataset is a testament to the quality of the Arabic Nougat models and their effectiveness in extracting… See the full description on the dataset page: https://huggingface.co/datasets/MohamedRashad/arabic-books.

  18. Financial-NER-NLP

    • huggingface.co
    Cite
    Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
    Explore at:
    Croissant
    Authors
    Joseph G Flowers
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Financial-NER-NLP

      Dataset Summary

    The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models' abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.

  19. tuple_ie

    • huggingface.co
    • opendatalab.com
    Updated Jun 8, 2024
    Cite
    Ai2 (2024). tuple_ie [Dataset]. https://huggingface.co/datasets/allenai/tuple_ie
    Explore at:
    Dataset updated
    Jun 8, 2024
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    Unknown license: https://choosealicense.com/licenses/unknown/

    Description

    The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using 4th- and 8th-grade training questions as queries. The dataset contains 156K sentences collected for 4th-grade questions and 107K sentences for 8th-grade questions. Each sentence is followed by its Open IE v4 tuples in their simple format.

  20. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

      Dataset Summary

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

      Supported Tasks and Leaderboards

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
