Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, which encourages research on segmentation-free systems for information extraction.
Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
As research communities expand, the number of scientific articles continues to grow rapidly, with no signs of slowing. This information overload drives the need for automated tools to identify relevant materials and extract key ideas. Information extraction (IE) focuses on converting unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel in tasks like document organization, scientific literature retrieval and recommendation, claim verification, and even novel idea or hypothesis generation. To pinpoint the scope of this thesis, I focus on taxonomic structures as the representation of knowledge in the scientific domain.
To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods encounter several challenges: (1) the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability.
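As a toy illustration of the three-stage pipeline described above (entirely schematic, not the thesis's method), the sketch below extracts concepts, matches a single hypernym pattern to find hierarchical relations, and assembles parent-to-children edges; every name and sentence in it is made up for illustration:

```python
# Toy pipeline: concept extraction -> relation identification -> taxonomy assembly.
# Errors at any stage propagate downstream, which is the weakness noted above.
from collections import defaultdict

sentences = [
    "Named entity recognition is a subtask of information extraction.",
    "Relation extraction is a subtask of information extraction.",
]

# Stage 1: concept extraction (hardcoded stand-in for a real extractor).
concepts = {"named entity recognition", "relation extraction", "information extraction"}

# Stage 2: hierarchical relation identification via a naive "X is a subtask of Y" pattern.
relations = []
for s in sentences:
    left, _, right = s.lower().rstrip(".").partition(" is a subtask of ")
    if right and left in concepts and right in concepts:
        relations.append((left, right))  # (child, parent)

# Stage 3: organize relations into a taxonomy (parent -> children).
taxonomy = defaultdict(list)
for child, parent in relations:
    taxonomy[parent].append(child)

print(dict(taxonomy))
# {'information extraction': ['named entity recognition', 'relation extraction']}
```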
This thesis explores innovative methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction. The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.
A dataset of 1,000 whole scanned receipt images and annotations for the competition on Scanned Receipts OCR and Information Extraction (SROIE).
https://www.verifiedmarketresearch.com/privacy-policy/
The Information Extraction (IE) Technology Market size was valued at USD 8.3 Billion in 2023 and is projected to reach USD 23.4 Billion by 2030, growing at a CAGR of 11.1% during the forecast period 2024-2030.
Global Information Extraction (IE) Technology Market Drivers
The market drivers for the Information Extraction (IE) Technology Market can be influenced by various factors. These may include:
- Growing Interest in Data Insights: Businesses across sectors increasingly recognize the importance of gleaning insights from vast amounts of unstructured data. Information extraction technology is essential for transforming unstructured data into structured data that can be examined for insightful analysis.
- Growing Requirement for Automation: As data volumes grow, so does the need for automation in information processing and analysis. Information extraction technology lets businesses save time and costs by automating the extraction of pertinent data from a variety of sources.
- Advances in Natural Language Processing (NLP): Ongoing developments in NLP continue to improve the capabilities of information extraction systems, enabling extraction from textual material that is more precise and contextually aware.
- Growing Use of Machine Learning and AI: Widespread adoption of artificial intelligence (AI) and machine learning (ML) across industries is fueling the development and application of sophisticated information extraction solutions, yielding systems that are more precise and flexible.
- Regulation and Compliance Needs: Sectors like banking, healthcare, and law require accurate and effective information extraction to meet compliance and regulatory standards. Automated methods can help ensure privacy and data protection laws are followed.
- Growing Amount of Unstructured Information: The exponential increase in unstructured data (text, photos, and multimedia) requires cutting-edge technology to extract meaningful information. Information extraction technologies meet this demand by converting unstructured data into structured formats.
- Improved Client Experience: Retail and e-commerce use information extraction to enhance the customer experience, for example by gathering pertinent product details, client testimonials, and sentiment from multiple sources.
- Fraud Detection and Risk Management: In industries such as finance and insurance, information extraction is essential for fraud detection and risk management. Automated systems can collect pertinent data and spot irregularities immediately, reducing risk.
- Multilingual Extraction and Globalization: As firms expand internationally, they need information extraction systems that can process and comprehend text in different languages. This globalization trend fuels the development of more adaptable and language-neutral extraction techniques.
- Combining with Other Technologies: Information extraction is frequently integrated into larger business intelligence and data analytics suites, producing a synergistic effect that improves overall data-driven decision-making.
https://www.koncile.ai/en/termsandconditions
Automatically extract critical data from Key Information Documents (DIC) with Koncile's intelligent OCR: fast structuring into usable formats (Excel, JSON).
Description

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, which encourages research on segmentation-free systems for information extraction.
The dataset is available at https://zenodo.org/record/7868059
Details for each series and entity type

| Series | Train | Validation | Test | Total (%) |
| ------------------- | ----- | ---------- | ---- | --------: |
| E series | 322 | 64 | 79 | 8.6 |
| L series | 38 | 8 | 4 | 0.9 |
| M series | 128 | 21 | 27 | 3.3 |
| X1a series | 2209 | 491 | 469 | 58.8 |
| Y series | 940 | 205 | 196 | 24.9 |
| Douët d'Arcq series | 141 | 22 | 29 | 3.5 |
| Total | 3778 | 811 | 804 | 100 |
| Entities | Train | Validation | Test | Total (%) |
| -------------- | ----- | ---------- | ----- | --------: |
| date | 8406 | 1814 | 1799 | 10.4 |
| title | 35531 | 7495 | 8173 | 44.5 |
| serie | 3168 | 664 | 676 | 3.9 |
| analysis | 25988 | 5130 | 5602 | 31.9 |
| volume_number | 3913 | 808 | 813 | 4.8 |
| article_number | 3181 | 665 | 678 | 3.9 |
| arrangement | 644 | 122 | 153 | 0.8 |
| Total | 80831 | 16698 | 17894 | 100 |
Data encoding

Transcriptions with entities are encoded in the labels.json JSON file. Special tokens are used to represent named entities. Please note that there are only opening NER tokens: each entity spans all words until the next entity starts (see the decoding sketch after the table below).
| Entities | Special token | Unicode code point |
| -------------- | ------------- | ------------------ |
| date | ⓓ | \u24d3 |
| title | ⓘ | \u24d8 |
| serie | ⓢ | \u24e2 |
| analysis | ⓒ | \u24d2 |
| volume_number | ⓟ | \u24df |
| article_number | ⓐ | \u24d0 |
| arrangement | ⓥ | \u24e5 |
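For illustration, a minimal decoding sketch (not part of the dataset's tooling) that applies this rule, splitting a transcription into (entity, text) pairs on the opening tokens above; the example transcription is made up:

```python
# Decode a SIMARA-style transcription: each special token opens an entity
# that spans all following characters until the next token.
TOKENS = {
    "\u24d3": "date",
    "\u24d8": "title",
    "\u24e2": "serie",
    "\u24d2": "analysis",
    "\u24df": "volume_number",
    "\u24d0": "article_number",
    "\u24e5": "arrangement",
}

def decode(transcription: str) -> list[tuple[str, str]]:
    """Split a transcription into (entity_type, text) pairs."""
    entities, current_type, buffer = [], None, []
    for char in transcription:
        if char in TOKENS:
            if current_type is not None:
                entities.append((current_type, "".join(buffer).strip()))
            current_type, buffer = TOKENS[char], []
        elif current_type is not None:
            buffer.append(char)
    if current_type is not None:
        entities.append((current_type, "".join(buffer).strip()))
    return entities

# Example with made-up content:
print(decode("\u24d31779 \u24d8Mémoire sur les archives"))
# [('date', '1779'), ('title', 'Mémoire sur les archives')]
```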
Cite us!

The dataset is presented in detail in the following article:
```bib
@article{simara2023,
  author  = {Solène Tarride and Mélodie Boillet and Jean-François Moufflet and Christopher Kermorvant},
  title   = {SIMARA: a database for key-value information extraction from full-page handwritten documents},
  year    = {2023},
  journal = {Proceedings of the 17th International Conference on Document Analysis and Recognition},
}
```
This dataset was created by DucNguyen168
https://www.verifiedmarketresearch.com/privacy-policy/
The Natural Language Processing Market size was valued at USD 31.76 Billion in 2023 and is projected to reach USD 92.99 Billion by 2031, growing at a CAGR of 23.97% from 2024 to 2031. The growing use of NLP technology in a variety of industries, including healthcare, banking, retail, and customer service, has contributed significantly to market growth. NLP's ability to evaluate and extract insights from massive volumes of unstructured data has become critical for businesses looking to improve decision-making processes and gain a competitive advantage. The rise of voice-activated virtual assistants and chatbots has increased demand for NLP applications in the consumer market, accelerating growth.
Natural Language Processing Market: Definition/Overview
Natural language processing (NLP) is a computer application that uses artificial intelligence to interpret human language. This computerized technique allows the computer to examine and interpret human communication using a collection of technologies and theories. The purpose of natural language processing is to let users interact with systems in natural language rather than in programming languages like Ruby, C, C++, and Java. NLP is used in big data analysis because huge amounts of data are generated in today's business scenarios from sources such as audio, emails, web blogs, documents, social networking sites, and forums.
Optical character recognition (OCR), auto coding, text analytics, interactive voice response (IVR), pattern and image recognition, classification and categorization, and speech analytics are all examples of natural language processing technology. Natural language processing (NLP) can be cloud-based or on-premise, and it is used for applications such as information extraction, question answering, machine translation, and report generation in a variety of industries, including automotive, retail, and consumer goods, high-tech and electronics, government, banking, financial services, and insurance (BFSI), health care and life sciences, research and education, and media and entertainment.
https://www.marketresearchintellect.com/privacy-policy
Get key insights from Market Research Intellect's Information Extraction (IE) Technology Market Report, valued at USD 5.2 billion in 2024 and forecast to grow to USD 12.1 billion by 2033, with a CAGR of 10.3% (2026-2033).
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:
i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin
ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table
iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set
https://www.datainsightsmarket.com/privacy-policy
The Information Extraction (IE) technology market is experiencing robust growth, driven by the increasing need for automated data processing and analysis across diverse sectors. The market's expansion is fueled by the exponential growth of unstructured data, demanding efficient tools for converting this raw information into structured, actionable insights. Government and military applications are key drivers, utilizing IE for intelligence gathering, threat analysis, and resource optimization. The internet service provider (ISP) segment leverages IE for network monitoring, customer behavior analysis, and fraud detection. The education sector benefits from IE for automating administrative tasks, personalized learning, and research analysis. Standalone systems currently dominate the market, offering ease of integration and tailored solutions. However, integrated systems are gaining traction, driven by the demand for comprehensive data management and interoperability across various platforms. While the initial investment in IE systems can be significant, the long-term return on investment (ROI) through improved efficiency and decision-making justifies the cost. Furthermore, the ongoing advancements in natural language processing (NLP) and machine learning (ML) are continuously enhancing the accuracy and capabilities of IE technologies, further stimulating market growth.

Despite the promising growth trajectory, the market faces certain challenges. The complexity of implementing and managing IE systems, along with the need for skilled professionals, pose significant hurdles. Data security and privacy concerns, particularly crucial in sectors like government and finance, also present a restraint. However, the increasing adoption of cloud-based solutions is mitigating some of these challenges by offering scalability, cost-effectiveness, and enhanced security features.

Looking ahead, the convergence of IE with other technologies like big data analytics and business intelligence will further broaden its applications and accelerate market growth. We project a continued strong CAGR for the foreseeable future, with substantial growth across all segments and regions. The market is poised for significant expansion, driven by the unrelenting need to harness the value hidden within vast repositories of unstructured data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relation Extraction (RE) is a central task in information extraction. Entity-mapping methods that address complex scenarios with overlapping triples, such as CasRel, are gaining traction, yet face challenges including inadequate consideration of sentence continuity, sample imbalance, and data noise. This research introduces CasRelBLCF, an entity-mapping method building on CasRel. The main contributions include: a joint decoder for the head entity using Bi-LSTM and CRF, integration of the focal loss function to tackle sample imbalance, and a reinforcement-learning-based noise-reduction method for handling dataset noise. Experiments on relation extraction datasets indicate the superiority of the CasRelBLCF model and show that the noise-reduction method further improves model performance.
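For context on the sample-imbalance contribution, a generic binary focal loss (Lin et al., 2017) can be sketched in PyTorch as follows; the gamma and alpha defaults are common illustrative choices, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: down-weights well-classified examples by (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability the model assigns to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# Toy usage with imbalanced binary labels:
logits = torch.tensor([2.0, -1.5, 0.3, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(focal_loss(logits, labels))
```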
The Magi Entity Description Extraction Dataset (MEDED) contains 500,000 (entity, description, source URL) tuples extracted by the Magi system from Chinese webpages that were accessible from Mainland China as of May 2019.
These data were extracted automatically and should not be taken to represent the opinions of any individual, including the authors.
Main contents of the source URLs can be found in the Magi Practical Web Article Corpus.
https://dataintelo.com/privacy-and-policy
The Information Extraction (IE) Technology market size is anticipated to witness significant expansion, with an estimated valuation reaching USD 15.2 billion by 2032, growing from USD 5.3 billion in 2023. This growth is driven by a robust compound annual growth rate (CAGR) of 12.5% during the forecast period. Key factors propelling this market forward include the increasing need for automated data processing solutions, advancements in artificial intelligence (AI) and machine learning (ML), and the growing volume of unstructured data across enterprises globally. The ability of IE technology to transform raw data into actionable insights is paramount, making it indispensable for industries seeking competitive advantages.
One of the primary growth drivers of the IE Technology market is the exponential increase in data generation and the subsequent demand for efficient data processing tools. Organizations across various sectors are inundated with vast amounts of unstructured data, including text, videos, and images, which hold valuable insights critical for strategic decision-making. Traditional data analysis methods are often insufficient to handle this deluge of information. As a solution, IE technology automates the extraction of relevant data points, simplifying the analysis process and facilitating enhanced decision-making capabilities. This demand is particularly acute in industries such as finance and healthcare, where timely and accurate information extraction can yield substantial benefits.
Moreover, the advancement of AI and ML has significantly fortified the capabilities of IE technologies, enhancing their accuracy and efficiency. Machine learning algorithms, when integrated with IE systems, allow for more sophisticated data analysis, including sentiment analysis and natural language processing (NLP). These advancements enable the technology to understand context, tone, and sentiment, providing deeper insights that were previously unattainable. This evolution of capabilities has expanded the application scope of IE technologies beyond traditional text analysis to more dynamic areas such as social media monitoring and online reputation management, where understanding public sentiment and behavior patterns is crucial.
The growing reliance on digital and online platforms has also fueled the demand for IE technology, especially for applications like web mining and social media monitoring. In an era where consumer interactions are predominantly digital, businesses are keen to harness insights from online activities to tailor their strategies effectively. IE technology plays a pivotal role in deciphering these digital footprints, allowing businesses to craft personalized marketing campaigns, improve customer service, and enhance product development. This trend is particularly evident in the retail and media industries, where consumer engagement and satisfaction are paramount.
Regionally, the Information Extraction Technology market exhibits varied dynamics, with North America leading the charge in terms of adoption and innovation. The region's technological infrastructure, coupled with a high concentration of leading tech firms, has created a fertile ground for the development and deployment of advanced IE solutions. In contrast, the Asia Pacific region is expected to witness the fastest growth, driven by rapid digitalization and an increasing number of small and medium enterprises (SMEs) adopting digital tools. Europe, with its stringent data privacy regulations, presents unique challenges but also opportunities for IE technology tailored to comply with such frameworks.
The component segmentation of the Information Extraction Technology market encompasses software and services, each playing a pivotal role in the ecosystem. The software segment, being the backbone of IE technology, is projected to dominate the market. This segment includes various tools and platforms specifically designed for data extraction, text analysis, sentiment analysis, and more. The continuous evolution of these software solutions, incorporating advanced AI and ML algorithms, is enhancing their accuracy, efficiency, and functionality, thereby driving their adoption across industries. Furthermore, the flexibility and scalability offered by modern software solutions make them indispensable for enterprises ranging from SMEs to large corporations, seeking to streamline their data processing capabilities.
In addition to proprietary software solutions, open-source IE software is gaining traction among enterprises…
https://choosealicense.com/licenses/other/
Dataset Card for DWIE
Dataset Summary
DWIE (Deutsche Welle corpus for Information Extraction) is a new dataset for document-level multi-task Information Extraction (IE). It combines four main IE sub-tasks:

1. Named Entity Recognition: 23,130 entities classified in 311 multi-label entity types (tags).
2. Coreference Resolution: 43,373 entity mentions clustered in 23,130 entities.
3. Relation Extraction: 21,749 annotated relations between entities classified in 65… See the full description on the dataset page: https://huggingface.co/datasets/DFKI-SLT/DWIE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a manually labeled dataset for the task of Event Detection (ED). The task of ED consists of identifying event triggers, the word that most clearly indicates the occurrence of an event. The dataset consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

Labels description: We consider as event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In this dataset, only the first type is labeled as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers when they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where their context of use is different, but not in the given example.

Further information: For a more detailed description of the dataset and the data collection process, please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

Data format: The dataset is split into two folders: training (2,000 XML files) and testing (200 XML files). Each XML file has the following format:

<pubdate>YYYYMMDDTHHMMSS</pubdate>
<file-id>...</file-id>
<sent-idx>...</sent-idx>
<sentence>...</sentence>

The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contained the text extract. The next two form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index of the extract inside the full article. The last tag (sentence) delimits the beginning and end of the text extract. Inside that text are the event-trigger tags; each surrounds one word that was manually labeled as an event trigger.
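A minimal reading sketch with Python's standard library (not official tooling): the file name is illustrative, the four tags are assumed to be direct children of the root element, and "event" is a placeholder for the trigger tag, whose actual name is not preserved in this extract:

```python
import xml.etree.ElementTree as ET

# Illustrative file name; assumes the tags above sit under a single root.
root = ET.parse("training/example.xml").getroot()

pubdate = root.findtext("pubdate")    # publication date, YYYYMMDDTHHMMSS
file_id = root.findtext("file-id")    # identifies the source news article
sent_idx = root.findtext("sent-idx")  # index of the extract within the article

sentence = root.find("sentence")
# "event" is a placeholder tag name for the manually labeled triggers.
triggers = [elem.text for elem in sentence.iter("event")]

print(pubdate, file_id, sent_idx, triggers)
```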
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example

Test sentence:

```json
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
```

Ontology: Music Ontology

Expected output:

```json
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
```
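To make ontology compliance concrete, here is a small, hypothetical filtering sketch; the relation set is an illustrative subset, not the benchmark's actual Music Ontology or its evaluation code:

```python
# Hypothetical subset of relations an ontology might allow.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

def conforming(triples: list[dict]) -> list[dict]:
    """Keep only triples whose relation is defined in the ontology."""
    return [t for t in triples if t["rel"] in ALLOWED_RELATIONS]

predicted = [
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "genre", "obj": "pop"},  # not allowed above
]
print(conforming(predicted))  # the non-conforming "genre" triple is filtered out
```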
The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows:

- benchmark: the code used to generate the benchmark
- evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).
The dataset is organized into three directories:
scans/
Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.
ocr_target/
Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.
segment_target/
Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data DO NOT include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well...)
Each .json file in segment_target/ follows this schema:

```json
{
  "company": "tesco",
  "date": "26.07.2024",
  "currency": "czk",
  "products": {
    "madeta cottage 150 g": 29.9,
    "raj.cel.lou400g/240g": 39.9,
    "cc raj.cel.lou400g/2": -20,
    "cc madeta cottage 15": -40
  }
}
```
- company: Name of the store or seller (e.g., "tesco"), in lowercase.
- date: Date of purchase in DD.MM.YYYY format.
- currency: Transaction currency (e.g., "czk"), in lowercase.
- products: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.
Warning: Some fields may contain null if the data could not be extracted reliably.
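A minimal usage sketch (not official tooling; the file name is assumed to pair with the scans/ naming) that loads one annotation, tolerates null fields, and sums the itemized prices, where discounts enter as negative values:

```python
import json
from pathlib import Path

path = Path("segment_target/001.json")  # assumed to pair with scans/001.jpg
record = json.loads(path.read_text(encoding="utf-8"))

# Fields may be null when extraction was unreliable (see warning above).
products = record.get("products") or {}
total = sum(products.values())

print(record.get("company"), record.get("date"),
      f"{total:.2f}", record.get("currency"))
```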