Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, which encourages research on segmentation-free systems for information extraction.
Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract.
Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
As research communities expand, the number of scientific articles continues to grow rapidly, with no signs of slowing. This information overload drives the need for automated tools to identify relevant materials and extract key ideas. Information extraction (IE) focuses on converting unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel in tasks like document organization, scientific literature retrieval and recommendation, claim verification, and even novel idea or hypothesis generation. To pinpoint the scope of this thesis, I focus on taxonomic structures as the representation of knowledge in the scientific domain.
To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods encounter several challenges: (1) the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability.
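As a toy illustration of the three-stage pipeline described above (entirely schematic, not the thesis's method), the sketch below extracts concepts, matches a single hypernym pattern to find hierarchical relations, and assembles parent-to-children edges; every name and sentence in it is made up for illustration:

```python
# Toy pipeline: concept extraction -> relation identification -> taxonomy assembly.
# Errors at any stage propagate downstream, which is the weakness noted above.
from collections import defaultdict

sentences = [
    "Named entity recognition is a subtask of information extraction.",
    "Relation extraction is a subtask of information extraction.",
]

# Stage 1: concept extraction (hardcoded stand-in for a real extractor).
concepts = {"named entity recognition", "relation extraction", "information extraction"}

# Stage 2: hierarchical relation identification via a naive "X is a subtask of Y" pattern.
relations = []
for s in sentences:
    left, _, right = s.lower().rstrip(".").partition(" is a subtask of ")
    if right and left in concepts and right in concepts:
        relations.append((left, right))  # (child, parent)

# Stage 3: organize relations into a taxonomy (parent -> children).
taxonomy = defaultdict(list)
for child, parent in relations:
    taxonomy[parent].append(child)

print(dict(taxonomy))
# {'information extraction': ['named entity recognition', 'relation extraction']}
```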
This thesis explores innovative methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction. The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.
A dataset of 1,000 whole scanned receipt images and annotations for the competition on Scanned Receipts OCR and Information Extraction (SROIE).
https://www.verifiedmarketresearch.com/privacy-policy/
The Information Extraction (IE) Technology Market size was valued at USD 8.3 Billion in 2023 and is projected to reach USD 23.4 Billion by 2030, growing at a CAGR of 11.1% during the forecast period 2024-2030.
Global Information Extraction (IE) Technology Market Drivers
The market drivers for the Information Extraction (IE) Technology Market can be influenced by various factors. These may include:
- Growing Interest in Data Insights: Businesses across sectors increasingly recognize the importance of gleaning insights from vast amounts of unstructured data. Information extraction technology is essential for transforming unstructured data into structured data that can be examined for insightful analysis.
- Growing Requirement for Automation: As data volumes grow, so does the need for automation in information processing and analysis. Information extraction technology lets businesses save time and costs by automating the extraction of pertinent data from a variety of sources.
- Advances in Natural Language Processing (NLP): Ongoing developments in NLP continue to improve the capabilities of information extraction systems, enabling extraction from textual material that is more precise and contextually aware.
- Growing Use of Machine Learning and AI: Widespread adoption of artificial intelligence (AI) and machine learning (ML) across industries is fueling the development and application of sophisticated information extraction solutions, yielding systems that are more precise and flexible.
- Regulation and Compliance Needs: Sectors like banking, healthcare, and law require accurate and effective information extraction to meet compliance and regulatory standards. Automated methods can help ensure privacy and data protection laws are followed.
- Growing Amount of Unstructured Information: The exponential increase in unstructured data (text, photos, and multimedia) requires cutting-edge technology to extract meaningful information. Information extraction technologies meet this demand by converting unstructured data into structured formats.
- Improved Client Experience: Retail and e-commerce use information extraction to enhance the customer experience, for example by gathering pertinent product details, client testimonials, and sentiment from multiple sources.
- Fraud Detection and Risk Management: In industries such as finance and insurance, information extraction is essential for fraud detection and risk management. Automated systems can collect pertinent data and spot irregularities immediately, reducing risk.
- Multilingual Extraction and Globalization: As firms expand internationally, they need information extraction systems that can process and comprehend text in different languages. This globalization trend fuels the development of more adaptable and language-neutral extraction techniques.
- Combining with Other Technologies: Information extraction is frequently integrated into larger business intelligence and data analytics suites, producing a synergistic effect that improves overall data-driven decision-making.
https://www.koncile.ai/en/termsandconditions
Automatically extract critical data from Key Information Documents (DIC) with Koncile's intelligent OCR: fast structuring into usable formats (Excel, JSON).
Description

We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.
Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, which encourages research on segmentation-free systems for information extraction.
The dataset is available at https://zenodo.org/record/7868059
Details for each series and entity type

| Series | Train | Validation | Test | Total (%) |
| ------------------- | ----- | ---------- | ---- | --------: |
| E series | 322 | 64 | 79 | 8.6 |
| L series | 38 | 8 | 4 | 0.9 |
| M series | 128 | 21 | 27 | 3.3 |
| X1a series | 2209 | 491 | 469 | 58.8 |
| Y series | 940 | 205 | 196 | 24.9 |
| Douët d'Arcq series | 141 | 22 | 29 | 3.5 |
| Total | 3778 | 811 | 804 | 100 |
| Entities | Train | Validation | Test | Total (%) |
| -------------- | ----- | ---------- | ----- | --------: |
| date | 8406 | 1814 | 1799 | 10.4 |
| title | 35531 | 7495 | 8173 | 44.5 |
| serie | 3168 | 664 | 676 | 3.9 |
| analysis | 25988 | 5130 | 5602 | 31.9 |
| volume_number | 3913 | 808 | 813 | 4.8 |
| article_number | 3181 | 665 | 678 | 3.9 |
| arrangement | 644 | 122 | 153 | 0.8 |
| Total | 80831 | 16698 | 17894 | 100 |
Data encoding

Transcriptions with entities are encoded in the labels.json JSON file. Special tokens are used to represent named entities. Please note that there are only opening NER tokens: each entity spans all words until the next entity starts (see the decoding sketch after the table below).
| Entities | Special token | Unicode code point |
| -------------- | ------------- | ------------------ |
| date | ⓓ | \u24d3 |
| title | ⓘ | \u24d8 |
| serie | ⓢ | \u24e2 |
| analysis | ⓒ | \u24d2 |
| volume_number | ⓟ | \u24df |
| article_number | ⓐ | \u24d0 |
| arrangement | ⓥ | \u24e5 |
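For illustration, a minimal decoding sketch (not part of the dataset's tooling) that applies this rule, splitting a transcription into (entity, text) pairs on the opening tokens above; the example transcription is made up:

```python
# Decode a SIMARA-style transcription: each special token opens an entity
# that spans all following characters until the next token.
TOKENS = {
    "\u24d3": "date",
    "\u24d8": "title",
    "\u24e2": "serie",
    "\u24d2": "analysis",
    "\u24df": "volume_number",
    "\u24d0": "article_number",
    "\u24e5": "arrangement",
}

def decode(transcription: str) -> list[tuple[str, str]]:
    """Split a transcription into (entity_type, text) pairs."""
    entities, current_type, buffer = [], None, []
    for char in transcription:
        if char in TOKENS:
            if current_type is not None:
                entities.append((current_type, "".join(buffer).strip()))
            current_type, buffer = TOKENS[char], []
        elif current_type is not None:
            buffer.append(char)
    if current_type is not None:
        entities.append((current_type, "".join(buffer).strip()))
    return entities

# Example with made-up content:
print(decode("\u24d31779 \u24d8Mémoire sur les archives"))
# [('date', '1779'), ('title', 'Mémoire sur les archives')]
```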
Cite us!

The dataset is presented in detail in the following article:
```bib
@article{simara2023,
  author  = {Solène Tarride and Mélodie Boillet and Jean-François Moufflet and Christopher Kermorvant},
  title   = {SIMARA: a database for key-value information extraction from full-page handwritten documents},
  year    = {2023},
  journal = {Proceedings of the 17th International Conference on Document Analysis and Recognition},
}
```
This dataset was created by DucNguyen168
https://www.verifiedmarketresearch.com/privacy-policy/
The Natural Language Processing Market size was valued at USD 31.76 Billion in 2023 and is projected to reach USD 92.99 Billion by 2031, growing at a CAGR of 23.97% from 2024 to 2031. The growing use of NLP technology in a variety of industries, including healthcare, banking, retail, and customer service, has contributed significantly to market growth. NLP's ability to evaluate and extract insights from massive volumes of unstructured data has become critical for businesses looking to improve decision-making processes and gain a competitive advantage. The rise of voice-activated virtual assistants and chatbots has increased demand for NLP applications in the consumer market, accelerating growth.
Natural Language Processing Market: Definition/Overview
Natural language processing (NLP) is a computer application that uses artificial intelligence to interpret human language. This computerized technique allows the computer to examine and interpret human communication using a collection of technologies and theories. The purpose of natural language processing is to let users interact with systems in natural language rather than in programming languages like Ruby, C, C++, and Java. NLP is used in big data analysis because huge amounts of data are generated in today's business scenarios from sources such as audio, emails, web blogs, documents, social networking sites, and forums.
Optical character recognition (OCR), auto coding, text analytics, interactive voice response (IVR), pattern and image recognition, classification and categorization, and speech analytics are all examples of natural language processing technology. Natural language processing (NLP) can be cloud-based or on-premise, and it is used for applications such as information extraction, question answering, machine translation, and report generation in a variety of industries, including automotive, retail, and consumer goods, high-tech and electronics, government, banking, financial services, and insurance (BFSI), health care and life sciences, research and education, and media and entertainment.
https://www.marketresearchintellect.com/privacy-policy
Get key insights from Market Research Intellect's Information Extraction (IE) Technology Market Report, valued at USD 5.2 billion in 2024 and forecast to grow to USD 12.1 billion by 2033, with a CAGR of 10.3% (2026-2033).
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:
i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin
ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table
iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set
https://www.datainsightsmarket.com/privacy-policy
The Information Extraction (IE) technology market is experiencing robust growth, driven by the increasing need for automated data processing and analysis across diverse sectors. The market's expansion is fueled by the exponential growth of unstructured data, demanding efficient tools for converting this raw information into structured, actionable insights. Government and military applications are key drivers, utilizing IE for intelligence gathering, threat analysis, and resource optimization. The internet service provider (ISP) segment leverages IE for network monitoring, customer behavior analysis, and fraud detection. The education sector benefits from IE for automating administrative tasks, personalized learning, and research analysis. Standalone systems currently dominate the market, offering ease of integration and tailored solutions. However, integrated systems are gaining traction, driven by the demand for comprehensive data management and interoperability across various platforms. While the initial investment in IE systems can be significant, the long-term return on investment (ROI) through improved efficiency and decision-making justifies the cost. Furthermore, the ongoing advancements in natural language processing (NLP) and machine learning (ML) are continuously enhancing the accuracy and capabilities of IE technologies, further stimulating market growth.

Despite the promising growth trajectory, the market faces certain challenges. The complexity of implementing and managing IE systems, along with the need for skilled professionals, pose significant hurdles. Data security and privacy concerns, particularly crucial in sectors like government and finance, also present a restraint. However, the increasing adoption of cloud-based solutions is mitigating some of these challenges by offering scalability, cost-effectiveness, and enhanced security features.

Looking ahead, the convergence of IE with other technologies like big data analytics and business intelligence will further broaden its applications and accelerate market growth. We project a continued strong CAGR for the foreseeable future, with substantial growth across all segments and regions. The market is poised for significant expansion, driven by the unrelenting need to harness the value hidden within vast repositories of unstructured data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relation Extraction (RE) is a central task in information extraction. Entity-mapping methods that address complex scenarios with overlapping triples, such as CasRel, are gaining traction, yet face challenges including inadequate consideration of sentence continuity, sample imbalance, and data noise. This research introduces CasRelBLCF, an entity-mapping method building on CasRel. The main contributions include: a joint decoder for the head entity using Bi-LSTM and CRF, integration of the focal loss function to tackle sample imbalance, and a reinforcement-learning-based noise-reduction method for handling dataset noise. Experiments on relation extraction datasets indicate the superiority of the CasRelBLCF model and show that the noise-reduction method further improves model performance.
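For context on the sample-imbalance contribution, a generic binary focal loss (Lin et al., 2017) can be sketched in PyTorch as follows; the gamma and alpha defaults are common illustrative choices, not the paper's reported settings:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss: down-weights well-classified examples by (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)  # probability the model assigns to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# Toy usage with imbalanced binary labels:
logits = torch.tensor([2.0, -1.5, 0.3, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(focal_loss(logits, labels))
```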
The Magi Entity Description Extraction Dataset (MEDED) contains 500,000 (entity, description, source URL) tuples extracted by the Magi system from Chinese webpages that were accessible from Mainland China as of May 2019.
These data were extracted automatically and should not be taken to represent the opinions of any individual, including the authors.
Main contents of the source URLs can be found in the Magi Practical Web Article Corpus.
https://dataintelo.com/privacy-and-policy
The Information Extraction (IE) Technology market size is anticipated to witness significant expansion, with an estimated valuation reaching USD 15.2 billion by 2032, growing from USD 5.3 billion in 2023. This growth is driven by a robust compound annual growth rate (CAGR) of 12.5% during the forecast period. Key factors propelling this market forward include the increasing need for automated data processing solutions, advancements in artificial intelligence (AI) and machine learning (ML), and the growing volume of unstructured data across enterprises globally. The ability of IE technology to transform raw data into actionable insights is paramount, making it indispensable for industries seeking competitive advantages.
One of the primary growth drivers of the IE Technology market is the exponential increase in data generation and the subsequent demand for efficient data processing tools. Organizations across various sectors are inundated with vast amounts of unstructured data, including text, videos, and images, which hold valuable insights critical for strategic decision-making. Traditional data analysis methods are often insufficient to handle this deluge of information. As a solution, IE technology automates the extraction of relevant data points, simplifying the analysis process and facilitating enhanced decision-making capabilities. This demand is particularly acute in industries such as finance and healthcare, where timely and accurate information extraction can yield substantial benefits.
Moreover, the advancement of AI and ML has significantly fortified the capabilities of IE technologies, enhancing their accuracy and efficiency. Machine learning algorithms, when integrated with IE systems, allow for more sophisticated data analysis, including sentiment analysis and natural language processing (NLP). These advancements enable the technology to understand context, tone, and sentiment, providing deeper insights that were previously unattainable. This evolution of capabilities has expanded the application scope of IE technologies beyond traditional text analysis to more dynamic areas such as social media monitoring and online reputation management, where understanding public sentiment and behavior patterns is crucial.
The growing reliance on digital and online platforms has also fueled the demand for IE technology, especially for applications like web mining and social media monitoring. In an era where consumer interactions are predominantly digital, businesses are keen to harness insights from online activities to tailor their strategies effectively. IE technology plays a pivotal role in deciphering these digital footprints, allowing businesses to craft personalized marketing campaigns, improve customer service, and enhance product development. This trend is particularly evident in the retail and media industries, where consumer engagement and satisfaction are paramount.
Regionally, the Information Extraction Technology market exhibits varied dynamics, with North America leading the charge in terms of adoption and innovation. The region's technological infrastructure, coupled with a high concentration of leading tech firms, has created a fertile ground for the development and deployment of advanced IE solutions. In contrast, the Asia Pacific region is expected to witness the fastest growth, driven by rapid digitalization and an increasing number of small and medium enterprises (SMEs) adopting digital tools. Europe, with its stringent data privacy regulations, presents unique challenges but also opportunities for IE technology tailored to comply with such frameworks.
The component segmentation of the Information Extraction Technology market encompasses software and services, each playing a pivotal role in the ecosystem. The software segment, being the backbone of IE technology, is projected to dominate the market. This segment includes various tools and platforms specifically designed for data extraction, text analysis, sentiment analysis, and more. The continuous evolution of these software solutions, incorporating advanced AI and ML algorithms, is enhancing their accuracy, efficiency, and functionality, thereby driving their adoption across industries. Furthermore, the flexibility and scalability offered by modern software solutions make them indispensable for enterprises ranging from SMEs to large corporations, seeking to streamline their data processing capabilities.
In addition to proprietary software solutions, open-source IE software is gaining traction among enterprises…
https://choosealicense.com/licenses/other/
Dataset Card for DWIE
Dataset Summary
DWIE (Deutsche Welle corpus for Information Extraction) is a new dataset for document-level multi-task Information Extraction (IE). It combines four main IE sub-tasks:

1. Named Entity Recognition: 23,130 entities classified in 311 multi-label entity types (tags).
2. Coreference Resolution: 43,373 entity mentions clustered in 23,130 entities.
3. Relation Extraction: 21,749 annotated relations between entities classified in 65… See the full description on the dataset page: https://huggingface.co/datasets/DFKI-SLT/DWIE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a manually labeled dataset for the task of Event Detection (ED). The task of ED consists of identifying event triggers, the word that most clearly indicates the occurrence of an event. The dataset consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

Labels description: We consider as event any ongoing real-world event or situation reported in the news articles. It is important to distinguish events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In this dataset, only the first type is labeled as an event. Based on this criterion, some words that are typically considered events are labeled as non-event triggers when they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where their context of use is different, but not in the given example.

Further information: For a more detailed description of the dataset and the data collection process, please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

Data format: The dataset is split into two folders: training (2,000 XML files) and testing (200 XML files). Each XML file has the following format:

<pubdate>YYYYMMDDTHHMMSS</pubdate>
<file-id>...</file-id>
<sent-idx>...</sent-idx>
<sentence>...</sentence>

The first three tags (pubdate, file-id and sent-idx) contain metadata. The first is the publication date of the news article that contained the text extract. The next two form a unique identifier for the text extract: file-id uniquely identifies a news article, which can hold several text extracts, and sent-idx is the index of the extract inside the full article. The last tag (sentence) delimits the beginning and end of the text extract. Inside that text are the event-trigger tags; each surrounds one word that was manually labeled as an event trigger.
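A minimal reading sketch with Python's standard library (not official tooling): the file name is illustrative, the four tags are assumed to be direct children of the root element, and "event" is a placeholder for the trigger tag, whose actual name is not preserved in this extract:

```python
import xml.etree.ElementTree as ET

# Illustrative file name; assumes the tags above sit under a single root.
root = ET.parse("training/example.xml").getroot()

pubdate = root.findtext("pubdate")    # publication date, YYYYMMDDTHHMMSS
file_id = root.findtext("file-id")    # identifies the source news article
sent_idx = root.findtext("sent-idx")  # index of the extract within the article

sentence = root.find("sentence")
# "event" is a placeholder tag name for the manually labeled triggers.
triggers = [elem.text for elem in sentence.iter("event")]

print(pubdate, file_id, sent_idx, triggers)
```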
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example

Test sentence:

```json
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
```

Ontology: Music Ontology

Expected output:

```json
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
```
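To make ontology compliance concrete, here is a small, hypothetical filtering sketch; the relation set is an illustrative subset, not the benchmark's actual Music Ontology or its evaluation code:

```python
# Hypothetical subset of relations an ontology might allow.
ALLOWED_RELATIONS = {"publication date", "lyrics by", "performer"}

def conforming(triples: list[dict]) -> list[dict]:
    """Keep only triples whose relation is defined in the ontology."""
    return [t for t in triples if t["rel"] in ALLOWED_RELATIONS]

predicted = [
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "genre", "obj": "pop"},  # not allowed above
]
print(conforming(predicted))  # the non-conforming "genre" triple is filtered out
```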
The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.
The structure of the repo is as follows:

- benchmark: the code used to generate the benchmark
- evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).
The dataset is organized into three directories:
scans/
Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.
ocr_target/
Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.
segment_target/
Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data DO NOT include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well...)
Each .json file in segment_target/ follows this schema:

```json
{
  "company": "tesco",
  "date": "26.07.2024",
  "currency": "czk",
  "products": {
    "madeta cottage 150 g": 29.9,
    "raj.cel.lou400g/240g": 39.9,
    "cc raj.cel.lou400g/2": -20,
    "cc madeta cottage 15": -40
  }
}
```
- company: Name of the store or seller (e.g., "tesco"), in lowercase.
- date: Date of purchase in DD.MM.YYYY format.
- currency: Transaction currency (e.g., "czk"), in lowercase.
- products: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.
Warning: Some fields may contain null if the data could not be extracted reliably.
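A minimal usage sketch (not official tooling; the file name is assumed to pair with the scans/ naming) that loads one annotation, tolerates null fields, and sums the itemized prices, where discounts enter as negative values:

```python
import json
from pathlib import Path

path = Path("segment_target/001.json")  # assumed to pair with scans/001.jpg
record = json.loads(path.read_text(encoding="utf-8"))

# Fields may be null when extraction was unreliable (see warning above).
products = record.get("products") or {}
total = sum(products.values())

print(record.get("company"), record.get("date"),
      f"{total:.2f}", record.get("currency"))
```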