Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Diverse Restricted JSON Data Extraction
Curated by: The paraloq analytics team.
Uses
- Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
- Fine-tune a data extraction model (text + JSON schema -> JSON instance)
- Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)
Out-of-Scope Use
Intended for research purposes only.
Dataset Structure
The data comes with the following fields:
title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
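Given the text + JSON schema -> JSON instance setup, a natural sanity check is validating each extracted instance against its schema. A minimal sketch, assuming the dataset exposes the schema and the gold instance as JSON strings; the column names "schema" and "item" are guesses, not taken from the card:

```python
# Illustrative only: load the dataset and validate an extracted JSON instance
# against its schema. Column names "schema" and "item" are assumptions.
import json

from datasets import load_dataset
from jsonschema import validate

ds = load_dataset("paraloq/json_data_extraction", split="train")
example = ds[0]

schema = json.loads(example["schema"])    # assumed column: target JSON schema
instance = json.loads(example["item"])    # assumed column: gold JSON instance

validate(instance=instance, schema=schema)  # raises ValidationError on mismatch
```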
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Extract-0 Document Information Extraction Dataset
This dataset contains 280,128 synthetic training examples for document information extraction, used to train Extract-0, a specialized 7B parameter language model that outperforms GPT-4 and other larger models on extraction tasks.
Dataset Description
The Extract-0 dataset represents a comprehensive collection of document extraction examples generated from diverse sources including arXiv papers, PubMed Central articles… See the full description on the dataset page: https://huggingface.co/datasets/HenriqueGodoy/extract-0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for parsed-model-cards
This dataset contains structured information extracted from model cards on the Hugging Face Hub. It was created using Curator, the Qwen reasoning model QwQ-32B, and vLLM.
Dataset Overview
The dataset consists of model card texts paired with structured information extracted from those texts using a reasoning-based approach. Each entry contains:
- The original model card content
- Structured JSON with standardized fields extracted from… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/parsed-model-cards.
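A minimal usage sketch, assuming the card text and the extracted JSON are exposed as string columns; the column names "card" and "extraction" below are guesses rather than the actual schema:

```python
# Illustrative only: inspect the structured extraction for one model card.
import json

from datasets import load_dataset

ds = load_dataset("davanstrien/parsed-model-cards", split="train")
row = ds[0]

print(row["card"][:200])                # assumed column: original model card text
parsed = json.loads(row["extraction"])  # assumed column: structured JSON string
print(sorted(parsed.keys()))            # the standardized fields
```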
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a publication of the CoAID dataset, originally dedicated to fake news detection. We have repurposed it here for event tracking in press documents.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.
In this dataset, we provide multiple features extracted from the text itself. Please note that the text is missing from the dataset published in CSV format for copyright reasons; you can download the original datasets and manually add the missing texts from the original publications.
Features are extracted using:
A corpus of reference articles in multiple languages for TF-IDF weighting (features_news) [1]
A corpus of tweets reporting news for TF-IDF weighting (features_tweets) [1]
An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use)
An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet)
References:
[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406
[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
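Because the text column is withheld from the published CSVs, the two sentence-embedding feature sets can be recomputed once the original texts have been re-attached. A minimal sketch using the public Sentence-Transformers checkpoints named above; how the embeddings are joined back onto the CSV rows is an assumption, not the authors' code:

```python
# Recompute features_use and features_mpnet for re-attached texts.
from sentence_transformers import SentenceTransformer

texts = ["Example news text restored from the original CoAID release."]

use_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
mpnet_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

features_use = use_model.encode(texts)      # 512-dimensional embeddings
features_mpnet = mpnet_model.encode(texts)  # 768-dimensional embeddings
```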
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main research questions (RQ1-RQ3); the corresponding analyses are provided in the notebooks listed below.
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes the platform-wide model snapshot and the per-model commit and release histories listed under datasets/ below.
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
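A minimal sketch of that collection step using only the public huggingface_hub and PyDriller APIs; the exact arguments, filters, and scale used in the notebooks differ:

```python
# Illustrative only: list models, pull commit metadata from the Hub API,
# and enrich one repository with file-level changes via PyDriller.
from huggingface_hub import HfApi
from pydriller import Repository

api = HfApi()

for model in api.list_models(limit=5):  # the study crawls the full platform
    commits = api.list_repo_commits(model.id)
    print(model.id, len(commits), "commits")

    # PyDriller clones the git repository to extract per-file modifications.
    repo_url = f"https://huggingface.co/{model.id}"
    for commit in Repository(repo_url).traverse_commits():
        changed = [m.filename for m in commit.modified_files]
        print(" ", commit.hash[:8], changed)
```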
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
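The key-level diff itself is simple; a sketch of the idea (not the authors' exact implementation), restricted to top-level keys of a config.json:

```python
import json

def diff_json_keys(old_text: str, new_text: str) -> dict:
    """Report added, deleted, and updated top-level keys between two
    versions of a JSON configuration file (e.g. config.json)."""
    old, new = json.loads(old_text), json.loads(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

before = '{"hidden_size": 768, "num_layers": 12, "unused_flag": true}'
after = '{"hidden_size": 768, "num_layers": 24, "vocab_size": 32000}'
print(diff_json_keys(before, after))
# {'added': ['vocab_size'], 'deleted': ['unused_flag'], 'updated': ['num_layers']}
```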
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
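The agreement check behind the kappa threshold can be reproduced with scikit-learn; the labels below are invented for illustration, whereas the real validation compares LLM output against manual annotations of sampled commits:

```python
# Illustrative only: Cohen's kappa between LLM-assigned and manual labels.
from sklearn.metrics import cohen_kappa_score

llm_labels = ["data", "model structure", "parameters", "data", "documentation"]
manual_labels = ["data", "model structure", "parameters", "training", "documentation"]

kappa = cohen_kappa_score(llm_labels, manual_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # validation iterates until kappa >= 0.9
```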
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
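A sketch of how such metadata can be read from a safetensors weight file; the path is hypothetical and the notebooks may handle other formats as well:

```python
# Illustrative only: tensor shapes and parameter count from a safetensors file.
from math import prod

from safetensors import safe_open

path = "model.safetensors"  # hypothetical local file from a downloaded release
with safe_open(path, framework="pt") as f:
    shapes = {name: tuple(f.get_slice(name).get_shape()) for name in f.keys()}

num_parameters = sum(prod(shape) for shape in shapes.values())
print(num_parameters, "parameters across", len(shapes), "tensors")
```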
The replication package is organized as follows:
- code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
  - HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
  - HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
  - HFCommitsPreprocessing.ipynb: Processes commit data, including the commit diffs and commit classification described below.
  - HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
  - RQ1_Analysis.ipynb: Analysis for RQ1.
  - RQ2_Analysis.ipynb: Analysis for RQ2.
  - RQ3_Analysis.ipynb: Analysis for RQ3.
- datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.
  - HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF, with the classification based on Bhatia et al.'s taxonomy.
  - HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
  - HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
  - model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
  - HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
  - HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
- metadata/: Contains the tags_metadata.yaml file used during preprocessing.
- models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- requirements.txt: Lists the required Python packages to set up the environment and run the code.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
pip install -r requirements.txt
```
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
allenai/drug-combo-extraction dataset hosted on Hugging Face and contributed by the HF Datasets community
dar-tau/lm-extraction-benchmark dataset hosted on Hugging Face and contributed by the HF Datasets community
Jwalit/document-extraction-dataset-with-ocr dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is used to train a multilingual ingredient list detection model. The goal is to automate the extraction of ingredient lists from food packaging images. See this issue for a broader context about ingredient list extraction.
Dataset generation
Raw unannotated texts are OCR results obtained with Google Cloud Vision. It only contains images marked as ingredient image on Open Food Facts. The dataset was generated using ChatGPT-3.5: we asked ChatGPT to extract ingredient… See the full description on the dataset page: https://huggingface.co/datasets/openfoodfacts/ingredient-detection.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
nassimN/Invoice-Data-Extraction dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "contracts-extraction-instruction-llm-experiments"
More Information needed
Below, we provide access to the datasets used in and created for the EMNLP 2022 paper "Large Language Models are Few-Shot Clinical Information Extractors".
Task #1: Clinical Sense Disambiguation
For Task #1, we use the original annotations from the Clinical Acronym Sense Inventory (CASI) dataset, described in their paper. As is common, due to noisiness in the label set, we do not evaluate on the entire dataset, but only on a cleaner subset. For consistency, we use the subset defined… See the full description on the dataset page: https://huggingface.co/datasets/mitclinicalml/clinical-ie.
math-extraction-comp/jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Table Extract Dataset
This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides a collection of text snippets containing tables and their corresponding structured representations in JSON format.
Source
The dataset is based on the Table Fact Dataset, also known as TabFact, which contains 16,573 tables extracted from Wikipedia.
Schema:
Each data point in the dataset consists of two elements:… See the full description on the dataset page: https://huggingface.co/datasets/Effyis/Table-Extraction.
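A minimal usage sketch; the column names "text" and "table" are assumptions, so check the dataset viewer for the actual schema:

```python
# Illustrative only: load one example and parse the structured table target.
import json

from datasets import load_dataset

ds = load_dataset("Effyis/Table-Extraction", split="train")
row = ds[0]

print(row["text"][:200])          # assumed column: snippet containing the table
table = json.loads(row["table"])  # assumed column: JSON string (may already be parsed)
print(table)
```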
davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Furniture Model Number Extraction Dataset
This dataset contains furniture inventory images with corresponding model numbers for training vision-language models to extract product model numbers from furniture store photos.
Dataset Description
Created: 2025-09-24T12:26:54.209816
Task: Vision-Language Model Training for Model Number Extraction
Base Model: IBM Granite Vision 3.2 2B
Domain: Furniture Inventory Management
Dataset Statistics
Training Samples: 219… See the full description on the dataset page: https://huggingface.co/datasets/wynnwatson/furniture-model-extraction.
GNU General Public License v3.0: https://choosealicense.com/licenses/gpl-3.0/
Arabic Books
Dataset Summary
The arabic-books dataset contains 8,500 rows of text, each representing the full text of a single Arabic book. These texts were extracted using the arabic-large-nougat model, showcasing the model's capabilities in Arabic OCR and text extraction. The dataset spans a total of 1.1 billion tokens, calculated using the GPT-4 tokenizer. This dataset is a testament to the quality of the Arabic Nougat models and their effectiveness in extracting… See the full description on the dataset page: https://huggingface.co/datasets/MohamedRashad/arabic-books.
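The 1.1 billion figure refers to GPT-4 tokens, which can be approximated with tiktoken. A small sketch that streams a few books rather than downloading the whole dataset; the column name "text" is an assumption:

```python
# Illustrative only: count GPT-4 tokens for the first few books.
import tiktoken
from datasets import load_dataset

enc = tiktoken.encoding_for_model("gpt-4")
ds = load_dataset("MohamedRashad/arabic-books", split="train", streaming=True)

total = 0
for i, book in enumerate(ds):
    total += len(enc.encode(book["text"]))  # assumed column name
    if i == 4:
        break
print(f"{total:,} GPT-4 tokens in the first five books")
```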
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Financial-NER-NLP
Dataset Summary
The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models' abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.
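The transformation from token-level XBRL annotations into prompts can be pictured roughly as below; this is a schematic reconstruction with an invented template and invented tags, not the dataset's actual wording:

```python
# Schematic only: turn a FiNER-139-style tagged sentence into a prompt/response pair.
def to_prompt(tokens: list[str], tags: list[str]) -> tuple[str, str]:
    sentence = " ".join(tokens)
    entities = [(tok, tag) for tok, tag in zip(tokens, tags) if tag != "O"]
    prompt = f'Extract the XBRL-tagged financial entities from: "{sentence}"'
    response = "; ".join(f"{tok} -> {tag}" for tok, tag in entities) or "None"
    return prompt, response

tokens = ["Revenue", "was", "$", "12.5", "million", "in", "2021", "."]
tags = ["O", "O", "O", "B-Revenues", "O", "O", "O", "O"]  # invented example tags
print(to_prompt(tokens, tags))
```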
Unknown license: https://choosealicense.com/licenses/unknown/
The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.