Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Diverse Restricted JSON Data Extraction
Curated by: The paraloq analytics team.
Uses
- Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
- Fine-tune a data extraction model (text + JSON schema -> JSON instance)
- Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)
Out-of-Scope Use
Intended for research purposes only.
Dataset Structure
The data comes with the following fields:
title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
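Given the text + JSON schema -> JSON instance setup, a natural sanity check is validating each extracted instance against its schema. A minimal sketch, assuming the dataset exposes the schema and the gold instance as JSON strings; the column names "schema" and "item" are guesses, not taken from the card:

```python
# Illustrative only: load the dataset and validate an extracted JSON instance
# against its schema. Column names "schema" and "item" are assumptions.
import json

from datasets import load_dataset
from jsonschema import validate

ds = load_dataset("paraloq/json_data_extraction", split="train")
example = ds[0]

schema = json.loads(example["schema"])    # assumed column: target JSON schema
instance = json.loads(example["item"])    # assumed column: gold JSON instance

validate(instance=instance, schema=schema)  # raises ValidationError on mismatch
```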
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Extract-0 Document Information Extraction Dataset
This dataset contains 280,128 synthetic training examples for document information extraction, used to train Extract-0, a specialized 7B parameter language model that outperforms GPT-4 and other larger models on extraction tasks.
Dataset Description
The Extract-0 dataset represents a comprehensive collection of document extraction examples generated from diverse sources including arXiv papers, PubMed Central articles… See the full description on the dataset page: https://huggingface.co/datasets/HenriqueGodoy/extract-0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for parsed-model-cards
This dataset contains structured information extracted from model cards on the Hugging Face Hub. It was created using Curator, the Qwen reasoning model QwQ-32B, and vLLM.
Dataset Overview
The dataset consists of model card texts paired with structured information extracted from those texts using a reasoning-based approach. Each entry contains:
- The original model card content
- Structured JSON with standardized fields extracted from… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/parsed-model-cards.
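A minimal usage sketch, assuming the card text and the extracted JSON are exposed as string columns; the column names "card" and "extraction" below are guesses rather than the actual schema:

```python
# Illustrative only: inspect the structured extraction for one model card.
import json

from datasets import load_dataset

ds = load_dataset("davanstrien/parsed-model-cards", split="train")
row = ds[0]

print(row["card"][:200])                # assumed column: original model card text
parsed = json.loads(row["extraction"])  # assumed column: structured JSON string
print(sorted(parsed.keys()))            # the standardized fields
```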
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a publication of the CoAID dataset, originally dedicated to fake news detection. We have repurposed it here for event tracking in press documents.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.
In this dataset, we provide multiple features extracted from the text itself. Please note that the text is missing from the dataset published in CSV format for copyright reasons; you can download the original datasets and manually add the missing texts from the original publications.
Features are extracted using:
A corpus of reference articles in multiple languages for TF-IDF weighting (features_news) [1]
A corpus of tweets reporting news for TF-IDF weighting (features_tweets) [1]
An S-BERT model [2] that uses distiluse-base-multilingual-cased-v1 (called features_use)
An S-BERT model [2] that uses paraphrase-multilingual-mpnet-base-v2 (called features_mpnet)
References:
[1]: Guillaume Bernard. (2022). Resources to compute TF-IDF weightings on press articles and tweets (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6610406
[2]: Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1410.
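Because the text column is withheld from the published CSVs, the two sentence-embedding feature sets can be recomputed once the original texts have been re-attached. A minimal sketch using the public Sentence-Transformers checkpoints named above; how the embeddings are joined back onto the CSV rows is an assumption, not the authors' code:

```python
# Recompute features_use and features_mpnet for re-attached texts.
from sentence_transformers import SentenceTransformer

texts = ["Example news text restored from the original CoAID release."]

use_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
mpnet_model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

features_use = use_model.encode(texts)      # 512-dimensional embeddings
features_mpnet = mpnet_model.encode(texts)  # 768-dimensional embeddings
```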
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main research questions (RQ1-RQ3); the corresponding analyses are provided in the notebooks listed below.
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes the platform-wide model snapshot and the per-model commit and release histories listed under datasets/ below.
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
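A minimal sketch of that collection step using only the public huggingface_hub and PyDriller APIs; the exact arguments, filters, and scale used in the notebooks differ:

```python
# Illustrative only: list models, pull commit metadata from the Hub API,
# and enrich one repository with file-level changes via PyDriller.
from huggingface_hub import HfApi
from pydriller import Repository

api = HfApi()

for model in api.list_models(limit=5):  # the study crawls the full platform
    commits = api.list_repo_commits(model.id)
    print(model.id, len(commits), "commits")

    # PyDriller clones the git repository to extract per-file modifications.
    repo_url = f"https://huggingface.co/{model.id}"
    for commit in Repository(repo_url).traverse_commits():
        changed = [m.filename for m in commit.modified_files]
        print(" ", commit.hash[:8], changed)
```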
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
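The key-level diff itself is simple; a sketch of the idea (not the authors' exact implementation), restricted to top-level keys of a config.json:

```python
import json

def diff_json_keys(old_text: str, new_text: str) -> dict:
    """Report added, deleted, and updated top-level keys between two
    versions of a JSON configuration file (e.g. config.json)."""
    old, new = json.loads(old_text), json.loads(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "updated": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

before = '{"hidden_size": 768, "num_layers": 12, "unused_flag": true}'
after = '{"hidden_size": 768, "num_layers": 24, "vocab_size": 32000}'
print(diff_json_keys(before, after))
# {'added': ['vocab_size'], 'deleted': ['unused_flag'], 'updated': ['num_layers']}
```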
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
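The agreement check behind the kappa threshold can be reproduced with scikit-learn; the labels below are invented for illustration, whereas the real validation compares LLM output against manual annotations of sampled commits:

```python
# Illustrative only: Cohen's kappa between LLM-assigned and manual labels.
from sklearn.metrics import cohen_kappa_score

llm_labels = ["data", "model structure", "parameters", "data", "documentation"]
manual_labels = ["data", "model structure", "parameters", "training", "documentation"]

kappa = cohen_kappa_score(llm_labels, manual_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # validation iterates until kappa >= 0.9
```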
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
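A sketch of how such metadata can be read from a safetensors weight file; the path is hypothetical and the notebooks may handle other formats as well:

```python
# Illustrative only: tensor shapes and parameter count from a safetensors file.
from math import prod

from safetensors import safe_open

path = "model.safetensors"  # hypothetical local file from a downloaded release
with safe_open(path, framework="pt") as f:
    shapes = {name: tuple(f.get_slice(name).get_shape()) for name in f.keys()}

num_parameters = sum(prod(shape) for shape in shapes.values())
print(num_parameters, "parameters across", len(shapes), "tensors")
```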
The replication package is organized as follows:
- code/: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - HFTotalExtraction.ipynb: Script for collecting data on the entire Hugging Face platform.
  - HFReleasesExtraction.ipynb: Script for collecting data on models that contain releases.
  - HFTotalPreprocessing.ipynb: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
  - HFCommitsPreprocessing.ipynb: Processes commit data, including the commit diffs and commit classification described below.
  - HFReleasesPreprocessing.ipynb: Processes release data, including classification and preparation for analysis.
  - RQ1_Analysis.ipynb: Analysis for RQ1.
  - RQ2_Analysis.ipynb: Analysis for RQ2.
  - RQ3_Analysis.ipynb: Analysis for RQ3.
- datasets/: Contains the raw, processed, and manually curated datasets used for the analysis.
  - HFCommits_50K_RANDOM.csv: Contains the commits of 50,000 randomly sampled models from HF, with the classification based on Bhatia et al.'s taxonomy.
  - HFCommits_MultipleCommits.csv: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
  - HFReleases.csv: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
  - model_metadata_with_diff.csv: Contains the metadata of releases from 27 models, including differences between successive releases.
  - HF_Total_Raw.csv: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from HFTotalExtraction.ipynb.
  - HF_Total_Preprocessed.csv: Contains the preprocessed version of the entire HF dataset, as obtained from HFTotalPreprocessing.ipynb. This dataset is needed for the commits preprocessing.
- metadata/: Contains the tags_metadata.yaml file used during preprocessing.
- models/: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- requirements.txt: Lists the required Python packages to set up the environment and run the code.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
pip install -r requirements.txt
```
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
allenai/drug-combo-extraction dataset hosted on Hugging Face and contributed by the HF Datasets community
dar-tau/lm-extraction-benchmark dataset hosted on Hugging Face and contributed by the HF Datasets community
Jwalit/document-extraction-dataset-with-ocr dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset is used to train a multilingual ingredient list detection model. The goal is to automate the extraction of ingredient lists from food packaging images. See this issue for a broader context about ingredient list extraction.
Dataset generation
Raw unannotated texts are OCR results obtained with Google Cloud Vision. It only contains images marked as ingredient image on Open Food Facts. The dataset was generated using ChatGPT-3.5: we asked ChatGPT to extract ingredient… See the full description on the dataset page: https://huggingface.co/datasets/openfoodfacts/ingredient-detection.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
nassimN/Invoice-Data-Extraction dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "contracts-extraction-instruction-llm-experiments"
More Information needed
Below, we provide access to the datasets used in and created for the EMNLP 2022 paper "Large Language Models are Few-Shot Clinical Information Extractors".
Task #1: Clinical Sense Disambiguation
For Task #1, we use the original annotations from the Clinical Acronym Sense Inventory (CASI) dataset, described in their paper. As is common, due to noisiness in the label set, we do not evaluate on the entire dataset, but only on a cleaner subset. For consistency, we use the subset defined… See the full description on the dataset page: https://huggingface.co/datasets/mitclinicalml/clinical-ie.
math-extraction-comp/jaredjoss_pythia-410m-roberta-lr_8e7-kl_01-steps_12000-rlhf-model dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Table Extract Dataset
This dataset is designed to evaluate the ability of large language models (LLMs) to extract tables from text. It provides a collection of text snippets containing tables and their corresponding structured representations in JSON format.
Source
The dataset is based on the Table Fact Dataset, also known as TabFact, which contains 16,573 tables extracted from Wikipedia.
Schema:
Each data point in the dataset consists of two elements:… See the full description on the dataset page: https://huggingface.co/datasets/Effyis/Table-Extraction.
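A minimal usage sketch; the column names "text" and "table" are assumptions, so check the dataset viewer for the actual schema:

```python
# Illustrative only: load one example and parse the structured table target.
import json

from datasets import load_dataset

ds = load_dataset("Effyis/Table-Extraction", split="train")
row = ds[0]

print(row["text"][:200])          # assumed column: snippet containing the table
table = json.loads(row["table"])  # assumed column: JSON string (may already be parsed)
print(table)
```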
davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Furniture Model Number Extraction Dataset
This dataset contains furniture inventory images with corresponding model numbers for training vision-language models to extract product model numbers from furniture store photos.
Dataset Description
Created: 2025-09-24T12:26:54.209816
Task: Vision-Language Model Training for Model Number Extraction
Base Model: IBM Granite Vision 3.2 2B
Domain: Furniture Inventory Management
Dataset Statistics
Training Samples: 219… See the full description on the dataset page: https://huggingface.co/datasets/wynnwatson/furniture-model-extraction.
GNU General Public License v3.0: https://choosealicense.com/licenses/gpl-3.0/
Arabic Books
Dataset Summary
The arabic-books dataset contains 8,500 rows of text, each representing the full text of a single Arabic book. These texts were extracted using the arabic-large-nougat model, showcasing the model's capabilities in Arabic OCR and text extraction. The dataset spans a total of 1.1 billion tokens, calculated using the GPT-4 tokenizer. This dataset is a testament to the quality of the Arabic Nougat models and their effectiveness in extracting… See the full description on the dataset page: https://huggingface.co/datasets/MohamedRashad/arabic-books.
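The 1.1 billion figure refers to GPT-4 tokens, which can be approximated with tiktoken. A small sketch that streams a few books rather than downloading the whole dataset; the column name "text" is an assumption:

```python
# Illustrative only: count GPT-4 tokens for the first few books.
import tiktoken
from datasets import load_dataset

enc = tiktoken.encoding_for_model("gpt-4")
ds = load_dataset("MohamedRashad/arabic-books", split="train", streaming=True)

total = 0
for i, book in enumerate(ds):
    total += len(enc.encode(book["text"]))  # assumed column name
    if i == 4:
        break
print(f"{total:,} GPT-4 tokens in the first five books")
```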
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Financial-NER-NLP
Dataset Summary
The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models' abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.
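The transformation from token-level XBRL annotations into prompts can be pictured roughly as below; this is a schematic reconstruction with an invented template and invented tags, not the dataset's actual wording:

```python
# Schematic only: turn a FiNER-139-style tagged sentence into a prompt/response pair.
def to_prompt(tokens: list[str], tags: list[str]) -> tuple[str, str]:
    sentence = " ".join(tokens)
    entities = [(tok, tag) for tok, tag in zip(tokens, tags) if tag != "O"]
    prompt = f'Extract the XBRL-tagged financial entities from: "{sentence}"'
    response = "; ".join(f"{tok} -> {tag}" for tok, tag in entities) or "None"
    return prompt, response

tokens = ["Revenue", "was", "$", "12.5", "million", "in", "2021", "."]
tags = ["O", "O", "O", "B-Revenues", "O", "O", "O", "O"]  # invented example tags
print(to_prompt(tokens, tags))
```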
Unknown license: https://choosealicense.com/licenses/unknown/
The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in "Answering Complex Questions Using Open Information Extraction" (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. This dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by the Open IE v4 tuples using their simple format.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.