100+ datasets found
  1. Data from: SIMARA: a database for key-value information extraction from...

    • data.niaid.nih.gov
    Updated Apr 27, 2023
    Cite
    Solène Tarride (2023). SIMARA: a database for key-value information extraction from full-page handwritten documents [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7866826
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Mélodie Boillet
    Jean-François Moufflet
    Solène Tarride
    Christopher Kermorvant
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.

    Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, so the dataset encourages research on segmentation-free systems for information extraction.

  2. Kleister NDA

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Mar 24, 2023
    Cite
    Adam Mickiewicz University (2023). Kleister NDA [Dataset]. https://opendatalab.com/OpenDataLab/Kleister_NDA
    Available download formats: zip
    Dataset updated
    Mar 24, 2023
    Dataset provided by
    Warsaw University of Technology
    Adam Mickiewicz University
    Applica.ai
    Samsung R&D Institute Poland
    Description

    Kleister NDA is a dataset for Key Information Extraction (KIE). The dataset contains a mix of scanned and born-digital long formal English-language documents. For this dataset, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister NDA dataset has 540 non-disclosure agreements, with 3,229 unique pages and 2,160 entities to extract.

  3. Data from: Improving Scientific Information Extraction with Text Generation

    • curate.nd.edu
    pdf
    Updated Apr 9, 2025
    Cite
    Qingkai Zeng (2025). Improving Scientific Information Extraction with Text Generation [Dataset]. http://doi.org/10.7274/28571045.v1
    Available download formats: pdf
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    University of Notre Dame
    Authors
    Qingkai Zeng
    License

    Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Description

    As research communities expand, the number of scientific articles continues to grow rapidly, with no signs of slowing. This information overload drives the need for automated tools to identify relevant materials and extract key ideas. Information extraction (IE) focuses on converting unstructured scientific text into structured knowledge (e.g., ontologies, taxonomies, and knowledge graphs), enabling intelligent systems to excel in tasks like document organization, scientific literature retrieval and recommendation, claim verification, and even novel idea or hypothesis generation. To narrow the scope of this thesis, I focus on the taxonomic structure as the representation of knowledge in the scientific domain.

    To construct a taxonomy from scientific corpora, traditional methods often rely on pipeline frameworks. These frameworks typically follow a sequence: first, extracting scientific concepts or entities from the corpus; second, identifying hierarchical relationships between the concepts; and finally, organizing these relationships into a cohesive taxonomy. However, such methods encounter several challenges: (1) the quality of the corpus or annotation data, (2) error propagation within the pipeline framework, and (3) limited generalization and transferability to other specific domains. The development of large language models (LLMs) offers promising advancements, as these models have demonstrated remarkable abilities to internalize knowledge and respond effectively to a wide range of inquiries. Unlike traditional pipeline-based approaches, generative methods harness LLMs to achieve (1) better utilization of their internalized knowledge, (2) direct text-to-knowledge conversion, and (3) flexible, schema-free adaptability.

    This thesis explores innovative methods for integrating text generation technologies to improve IE in the scientific domain, with a focus on taxonomy construction. The approach begins with generating entity names and evolves to create or enrich taxonomies directly via text generation. I will explore combining neighborhood structural context, descriptive textual information, and LLMs' internal knowledge to improve output quality. Finally, this thesis will outline future research directions.

  4. SROIE Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Mar 27, 2024
    Cite
    Zheng Huang; Kai Chen; Jianhua He; Xiang Bai; Dimosthenis Karatzas; Shijian Lu; C. V. Jawahar (2024). SROIE Dataset [Dataset]. https://paperswithcode.com/dataset/sroie
    Dataset updated
    Mar 27, 2024
    Authors
    Zheng Huang; Kai Chen; Jianhua He; Xiang Bai; Dimosthenis Karatzas; Shijian Lu; C. V. Jawahar
    Description

    Consists of a dataset with 1000 whole scanned receipt images and annotations for the competition on scanned receipts OCR and key information extraction (SROIE).

  5. Global Information Extraction IE Technology Market Size By Technology Type,...

    • verifiedmarketresearch.com
    Updated Jan 8, 2024
    Cite
    VERIFIED MARKET RESEARCH (2024). Global Information Extraction IE Technology Market Size By Technology Type, By Deployment Model, By Application, By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/information-extraction-ie-technology-market/
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Verified Market Research (https://www.verifiedmarketresearch.com/)
    Authors
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2030
    Area covered
    Global
    Description

    Information Extraction IE Technology Market size was valued at USD 8.3 Billion in 2023 and is projected to reach USD 23.4 Billion by 2030, growing at a CAGR of 11.1% during the forecast period 2024-2030.
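
    Growth rates in forecasts like this one follow the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. A minimal Python sketch of that arithmetic follows; the function names and the sample figures are illustrative assumptions and do not come from the report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value reached after compounding `rate` for `years` periods."""
    return start_value * (1.0 + rate) ** years


# Illustrative figures only: 100 growing to 121 over 2 years is 10% per year.
rate = cagr(100.0, 121.0, 2)
print(round(rate, 4))
print(round(project(100.0, rate, 2), 2))
```

    Applying the same function to a report's start value, end value, and forecast length is a quick way to check whether the quoted figures agree with one another.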

    Global Information Extraction IE Technology Market Drivers

    The market drivers for the Information Extraction IE Technology Market can be influenced by various factors. These may include:

    Growing Interest in Data Insights: Businesses in a variety of sectors were beginning to understand how important it was to glean insights from vast amounts of unstructured data. In order to transform unstructured data into structured data that can be examined for insightful analysis, information extraction technology is essential.

    Growing Requirement for Automation: The necessity for automation in information processing and analysis grew along with the volume of data. Businesses can save time and costs by using information extraction technology to automate the extraction of pertinent data from a variety of sources.

    Natural language processing (NLP) advances: Information extraction systems' capabilities were being improved by the ongoing developments in Natural Language Processing technology. These advancements made it possible to extract information from textual material that was more precise and contextually aware.

    Growing Use of Machine Learning and AI: The creation and application of complex information extraction solutions were being fueled by the widespread acceptance of artificial intelligence (AI) and machine learning (ML) technology across a range of industries. These technological advancements lead to systems that are more precise and flexible.

    Regulation and Compliance Needs: Accurate and effective information extraction was required to meet compliance and regulatory standards in sectors like banking, healthcare, and law. Automated methods can help make sure privacy and data protection laws are followed.

    Growing Amount of Unstructured Information: The exponential increase in unstructured data (text, photos, and multimedia) required the use of cutting-edge technology in order to extract meaningful information. This demand was being met by information extraction technologies, which converted unstructured data into formats that were structured.

    Improved Client Experience: Information extraction technologies have been used in retail and e-commerce to enhance the customer experience. This entails gathering pertinent product details, client testimonials, and sentiment analysis from multiple sources.

    Fraud detection and risk management: In industries such as finance and insurance, information extraction was essential for fraud detection and risk management. Automated systems might collect pertinent data and spot irregularities immediately, reducing risks.

    Multilingual Extraction and Globalization: The requirement for information extraction systems that could process and comprehend text in different languages grew as firms expanded internationally. The development of more adaptable and language-neutral extraction techniques was fueled by this trend of globalization.

    Combining Other Technologies: Technologies for information extraction were frequently integrated into larger business intelligence and data analytics packages, producing a synergistic effect that improved overall data-driven decision-making procedures.

  6. Key Information Document (Template)

    • koncile.ai
    Updated Mar 28, 2024
    Cite
    Koncile (2024). Key Information Document (Template) [Dataset]. https://www.koncile.ai/en/extraction-ocr/key-information-document
    Dataset updated
    Mar 28, 2024
    Dataset authored and provided by
    Koncile
    License

    https://www.koncile.ai/en/termsandconditions

    Variables measured
    Main risks, Risk scale, Entry costs, Manufacturer, Product name, Product type, Recurring costs, Date of publication, Regulatory authority, Recommended investment period
    Description

    Automatically extract critical data from Key Information Documents (DIC) with Koncile's intelligent OCR. Fast structuring, usable formats (Excel, JSON).

  7. SIMARA Dataset

    • paperswithcode.com
    Updated Apr 25, 2023
    + more versions
    Cite
    Solène Tarride; Mélodie Boillet; Jean-François Moufflet; Christopher Kermorvant (2023). SIMARA Dataset [Dataset]. https://paperswithcode.com/dataset/simara
    Dataset updated
    Apr 25, 2023
    Authors
    Solène Tarride; Mélodie Boillet; Jean-François Moufflet; Christopher Kermorvant
    Description

    We propose a new database for information extraction from historical handwritten documents. The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries. Finding aids are handwritten documents that contain metadata describing older archives. They are stored in the National Archives of France and are used by archivists to identify and find archival documents.

    Each document is annotated at page level and contains seven fields to retrieve. The location of each field is not provided, so the dataset encourages research on segmentation-free systems for information extraction.

    The dataset is available at https://zenodo.org/record/7868059

    Details for each series and entity type

    | Series              | Train | Validation | Test | Total (%) |
    | ------------------- | ----- | ---------- | ---- | --------: |
    | E series            | 322   | 64         | 79   | 8.6       |
    | L series            | 38    | 8          | 4    | 0.9       |
    | M series            | 128   | 21         | 27   | 3.3       |
    | X1a series          | 2209  | 491        | 469  | 58.8      |
    | Y series            | 940   | 205        | 196  | 24.9      |
    | Douët s'Arcq series | 141   | 22         | 29   | 3.5       |
    | Total               | 3778  | 811        | 804  | 100       |

    | Entities       | Train | Validation | Test  | Total (%) |
    | -------------- | ----- | ---------- | ----- | --------: |
    | date           | 8406  | 1814       | 1799  | 10.4      |
    | title          | 35531 | 7495       | 8173  | 44.5      |
    | serie          | 3168  | 664        | 676   | 3.9       |
    | analysis       | 25988 | 5130       | 5602  | 31.9      |
    | volume_number  | 3913  | 808        | 813   | 4.8       |
    | article_number | 3181  | 665        | 678   | 3.9       |
    | arrangement    | 644   | 122        | 153   | 0.8       |
    | Total          | 80831 | 16698      | 17894 | 100       |

    Data encoding

    Transcriptions with entities are encoded in the labels.json JSON file. Special tokens are used to represent named entities. Please note that there are only opening NER tokens: each entity spans all words until the next entity starts.

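    | Entities       | Special token | Symbol unicode |
    | -------------- | ------------- | -------------- |
    | date           | ⓓ             | \u24d3         |
    | title          | ⓘ             | \u24d8         |
    | serie          | ⓢ             | \u24e2         |
    | analysis       | ⓒ             | \u24d2         |
    | volume_number  | ⓟ             | \u24df         |
    | article_number | ⓐ             | \u24d0         |
    | arrangement    | ⓥ             | \u24e5         |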
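
    Because only opening tokens are used, a transcription can be split into entities by scanning for the special symbols and treating everything up to the next symbol as the entity text. The following Python sketch illustrates the idea. The layout of labels.json is not described here, so the sample transcription is a hypothetical string and the helper name `split_entities` is an assumption made for illustration.

```python
import re

# Opening tokens from the table above, mapped to entity names.
TOKEN_TO_ENTITY = {
    "\u24d3": "date",
    "\u24d8": "title",
    "\u24e2": "serie",
    "\u24d2": "analysis",
    "\u24df": "volume_number",
    "\u24d0": "article_number",
    "\u24e5": "arrangement",
}

# A capturing group makes re.split keep the matched tokens in its output.
_TOKEN_RE = re.compile("([" + "".join(TOKEN_TO_ENTITY) + "])")


def split_entities(transcription):
    """Return (entity, text) pairs; each entity runs until the next token."""
    parts = _TOKEN_RE.split(transcription)
    # parts = [text_before_first_token, token, text, token, text, ...]
    pairs = []
    for token, text in zip(parts[1::2], parts[2::2]):
        pairs.append((TOKEN_TO_ENTITY[token], text.strip()))
    return pairs


# Hypothetical transcription, invented for illustration only.
sample = "\u24e2E \u24d01234 \u24d31789 \u24d8Arrêt du conseil"
for entity, text in split_entities(sample):
    print(entity, "->", text)
```

    Text that appears before the first token is ignored in this sketch; whether such text occurs in the real transcriptions is not stated in the description.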

    Cite us! The dataset is presented in detail in the following article:

    @article{simara2023,
      author = {Solène Tarride and Mélodie Boillet and Jean-François Moufflet and Christopher Kermorvant},
      title = {SIMARA: a database for key-value information extraction from full-page handwritten documents},
      year = {2023},
      journal = {Proceedings of the 17th International Conference on Document Analysis and Recognition},
    }

  8. Key_Information-Extraction_FUNSD

    • kaggle.com
    Updated May 27, 2024
    Cite
    DucNguyen168 (2024). Key_Information-Extraction_FUNSD [Dataset]. https://www.kaggle.com/datasets/ducnguyen168/key-information-extraction-funsd
    Available in Croissant, a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated
    May 27, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DucNguyen168
    Description

    Dataset

    This dataset was created by DucNguyen168


  9. Natural Language Processing Market By Type (Statistical NLP, Rule Based NLP,...

    • verifiedmarketresearch.com
    Updated Mar 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    VERIFIED MARKET RESEARCH (2024). Natural Language Processing Market By Type (Statistical NLP, Rule Based NLP, Hybrid NLP), By Deployment Mode (Private Cloud, Public Cloud, And Hybrid Cloud), By Application (Information Extraction, Machine Translation, Language Translation, Question Answering, Speech Recognition, Text Summarization, Report Generation), By End Users (Healthcare, Banking, Financial Services, And Insurance (BFSI), Media And Entertainment, E-commerce), And Region For 2024-2031 [Dataset]. https://www.verifiedmarketresearch.com/product/natural-language-processing-market/
    Dataset updated
    Mar 15, 2024
    Dataset authored and provided by
    VERIFIED MARKET RESEARCH
    License

    https://www.verifiedmarketresearch.com/privacy-policy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    The Natural Language Processing Market size was valued at USD 31.76 Billion in 2023 and is projected to reach USD 92.99 Billion by 2031, with a compound annual growth rate (CAGR) of 23.97% from 2024 to 2031. The growing use of NLP technology in a variety of industries, including healthcare, banking, retail, and customer service, has contributed significantly to market growth. NLP's ability to evaluate and extract insights from massive volumes of unstructured data has become critical for businesses looking to improve decision-making processes and gain a competitive advantage. The rise of voice-activated virtual assistants and chatbots has increased demand for NLP applications in the consumer market, accelerating growth.

    Natural Language Processing Market: Definition/Overview

    Natural language processing (NLP) is a computer application that uses artificial intelligence to interpret human language. This computerized technique allows the computer to examine and interpret human communication using a collection of technologies and theories. The purpose of natural language processing is to reduce the time required to grasp computer languages like Ruby, C, C++, and Java. NLP is used in big data analysis because huge amounts of data are generated in today's business scenarios from sources such as audio, emails, web blogs, documents, social networking sites, and forums.

    Optical character recognition (OCR), auto coding, text analytics, interactive voice response (IVR), pattern and image recognition, classification and categorization, and speech analytics are all examples of natural language processing technology. Natural language processing (NLP) can be cloud-based or on-premise, and it is used for applications such as information extraction, question answering, machine translation, and report generation in a variety of industries, including automotive, retail, and consumer goods, high-tech and electronics, government, banking, financial services, and insurance (BFSI), health care and life sciences, research and education, and media and entertainment.

  10. Global Information Extraction IE Technology Market Share, Size & Industry...

    • marketresearchintellect.com
    Updated Jul 6, 2025
    Cite
    Market Research Intellect (2025). Global Information Extraction IE Technology Market Share, Size & Industry Analysis 2033 [Dataset]. https://www.marketresearchintellect.com/product/information-extraction-ie-technology-market/
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    Get key insights from Market Research Intellect's Information Extraction IE Technology Market Report, valued at USD 5.2 billion in 2024, and forecast to grow to USD 12.1 billion by 2033, with a CAGR of 10.3% (2026-2033).

  11. DocILE Dataset

    • paperswithcode.com
    Updated Feb 14, 2023
    Cite
    Štěpán Šimsa; Milan Šulc; Michal Uřičář; Yash Patel; Ahmed Hamdi; Matěj Kocián; Matyáš Skalický; Jiří Matas; Antoine Doucet; Mickaël Coustaty; Dimosthenis Karatzas (2023). DocILE Dataset [Dataset]. https://paperswithcode.com/dataset/docile
    Dataset updated
    Feb 14, 2023
    Authors
    Štěpán Šimsa; Milan Šulc; Michal Uřičář; Yash Patel; Ahmed Hamdi; Matěj Kocián; Matyáš Skalický; Jiří Matas; Antoine Doucet; Mickaël Coustaty; Dimosthenis Karatzas
    Description

    DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:

    i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin

    ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table

    iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set

  12. Information Extraction (IE) Technology Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 17, 2025
    + more versions
    Cite
    Data Insights Market (2025). Information Extraction (IE) Technology Report [Dataset]. https://www.datainsightsmarket.com/reports/information-extraction-ie-technology-1974185
    Available download formats: ppt, doc, pdf
    Dataset updated
    May 17, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Information Extraction (IE) technology market is experiencing robust growth, driven by the increasing need for automated data processing and analysis across diverse sectors. The market's expansion is fueled by the exponential growth of unstructured data, demanding efficient tools for converting this raw information into structured, actionable insights. Government and military applications are key drivers, utilizing IE for intelligence gathering, threat analysis, and resource optimization. The internet service provider (ISP) segment leverages IE for network monitoring, customer behavior analysis, and fraud detection. The education sector benefits from IE for automating administrative tasks, personalized learning, and research analysis. Standalone systems currently dominate the market, offering ease of integration and tailored solutions. However, integrated systems are gaining traction, driven by the demand for comprehensive data management and interoperability across various platforms. While the initial investment in IE systems can be significant, the long-term return on investment (ROI) through improved efficiency and decision-making justifies the cost. Furthermore, the ongoing advancements in natural language processing (NLP) and machine learning (ML) are continuously enhancing the accuracy and capabilities of IE technologies, further stimulating market growth. Despite the promising growth trajectory, the market faces certain challenges. The complexity of implementing and managing IE systems, along with the need for skilled professionals, pose significant hurdles. Data security and privacy concerns, particularly crucial in sectors like government and finance, also present a restraint. However, the increasing adoption of cloud-based solutions is mitigating some of these challenges by offering scalability, cost-effectiveness, and enhanced security features. 

    Looking ahead, the convergence of IE with other technologies like big data analytics and business intelligence will further broaden its applications and accelerate market growth. We project a continued strong CAGR for the foreseeable future, with substantial growth across all segments and regions. The market is poised for significant expansion, driven by the unrelenting need to harness the value hidden within vast repositories of unstructured data.

  13. Statistics of WebNLG, NYT and NYT11-HRL.

    • plos.figshare.com
    xls
    Updated Feb 23, 2024
    Cite
    Hongmei Tang; Dixiongxiao Zhu; Wenzhong Tang; Shuai Wang; Yanyang Wang; Lihong Wang (2024). Statistics of WebNLG, NYT and NYT11-HRL. [Dataset]. http://doi.org/10.1371/journal.pone.0298974.t002
    Available download formats: xls
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Hongmei Tang; Dixiongxiao Zhu; Wenzhong Tang; Shuai Wang; Yanyang Wang; Lihong Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Relationship Extraction (RE) is a central task in information extraction. Entity mapping approaches such as CasRel, which address complex scenarios with overlapping triples, are gaining traction, yet they face challenges such as inadequate consideration of sentence continuity, sample imbalance, and data noise. This research introduces CasRelBLCF, an entity mapping-based method that builds on CasRel. The main contributions are: a joint decoder for the head entity that uses Bi-LSTM and CRF; integration of the Focal Loss function to tackle sample imbalance; and a reinforcement learning-based noise reduction method for handling dataset noise. Experiments on relation extraction datasets indicate the superiority of the CasRelBLCF model and show that the noise reduction method improves model performance.

  14. MEDED: Magi Entity Description Extraction Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Mar 5, 2020
    Cite
    Yichao Ji; Xinyang Liu; Kui Ma; Xuezhi Zhao; Qiao Sun (2020). MEDED: Magi Entity Description Extraction Dataset [Dataset]. http://doi.org/10.5281/zenodo.3242514
    Dataset updated
    Mar 5, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yichao Ji; Xinyang Liu; Kui Ma; Xuezhi Zhao; Qiao Sun
    Description

    Magi Entity Description Extraction Dataset (MEDED) contains 500,000 (entity, description, source URL) tuples extracted by the Magi system from Chinese webpages on the Internet that are accessible from Mainland China in May 2019.

    These data are learned automatically and should not be considered to contain any opinion of any human individual including the authors.

    Main contents of the source URLs can be found in the Magi Practical Web Article Corpus.

  15. Information Extraction IE Technology Market Report | Global Forecast From...

    • dataintelo.com
    csv, pdf, pptx
    Updated Dec 4, 2024
    Cite
    Dataintelo (2024). Information Extraction IE Technology Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-information-extraction-ie-technology-market
    Available download formats: csv, pptx, pdf
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Information Extraction (IE) Technology Market Outlook



    The Information Extraction (IE) Technology market size is anticipated to witness significant expansion, with an estimated valuation reaching USD 15.2 billion by 2032, growing from USD 5.3 billion in 2023. This growth is driven by a robust compound annual growth rate (CAGR) of 12.5% during the forecast period. Key factors propelling this market forward include the increasing need for automated data processing solutions, advancements in artificial intelligence (AI) and machine learning (ML), and the growing volume of unstructured data across enterprises globally. The ability of IE technology to transform raw data into actionable insights is paramount, making it indispensable for industries seeking competitive advantages.



    One of the primary growth drivers of the IE Technology market is the exponential increase in data generation and the subsequent demand for efficient data processing tools. Organizations across various sectors are inundated with vast amounts of unstructured data, including text, videos, and images, which hold valuable insights critical for strategic decision-making. Traditional data analysis methods are often insufficient to handle this deluge of information. As a solution, IE technology automates the extraction of relevant data points, simplifying the analysis process and facilitating enhanced decision-making capabilities. This demand is particularly acute in industries such as finance and healthcare, where timely and accurate information extraction can yield substantial benefits.



    Moreover, the advancement of AI and ML has significantly fortified the capabilities of IE technologies, enhancing their accuracy and efficiency. Machine learning algorithms, when integrated with IE systems, allow for more sophisticated data analysis, including sentiment analysis and natural language processing (NLP). These advancements enable the technology to understand context, tone, and sentiment, providing deeper insights that were previously unattainable. This evolution of capabilities has expanded the application scope of IE technologies beyond traditional text analysis to more dynamic areas such as social media monitoring and online reputation management, where understanding public sentiment and behavior patterns is crucial.



    The growing reliance on digital and online platforms has also fueled the demand for IE technology, especially for applications like web mining and social media monitoring. In an era where consumer interactions are predominantly digital, businesses are keen to harness insights from online activities to tailor their strategies effectively. IE technology plays a pivotal role in deciphering these digital footprints, allowing businesses to craft personalized marketing campaigns, improve customer service, and enhance product development. This trend is particularly evident in the retail and media industries, where consumer engagement and satisfaction are paramount.



    Regionally, the Information Extraction Technology market exhibits varied dynamics, with North America leading the charge in terms of adoption and innovation. The region's technological infrastructure, coupled with a high concentration of leading tech firms, has created a fertile ground for the development and deployment of advanced IE solutions. In contrast, the Asia Pacific region is expected to witness the fastest growth, driven by rapid digitalization and an increasing number of small and medium enterprises (SMEs) adopting digital tools. Europe, with its stringent data privacy regulations, presents unique challenges but also opportunities for IE technology tailored to comply with such frameworks.



    Component Analysis



    The component segmentation of the Information Extraction Technology market encompasses software and services, each playing a pivotal role in the ecosystem. The software segment, being the backbone of IE technology, is projected to dominate the market. This segment includes various tools and platforms specifically designed for data extraction, text analysis, sentiment analysis, and more. The continuous evolution of these software solutions, incorporating advanced AI and ML algorithms, is enhancing their accuracy, efficiency, and functionality, thereby driving their adoption across industries. Furthermore, the flexibility and scalability offered by modern software solutions make them indispensable for enterprises ranging from SMEs to large corporations, seeking to streamline their data processing capabilities.



    In addition to proprietary software solutions, open-source IE software is gaining traction among enterprises look

  16. Data from: DWIE

    • huggingface.co
    Updated May 18, 2024
    Cite
    Speech and Language Technology, DFKI (2024). DWIE [Dataset]. https://huggingface.co/datasets/DFKI-SLT/DWIE
    Dataset updated
    May 18, 2024
    Dataset authored and provided by
    Speech and Language Technology, DFKI
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for DWIE

      Dataset Summary
    

    DWIE (Deutsche Welle corpus for Information Extraction) is a new dataset for document-level multi-task Information Extraction (IE). It combines four main IE sub-tasks:

    1. Named Entity Recognition: 23,130 entities classified in 311 multi-label entity types (tags).
    2. Coreference Resolution: 43,373 entity mentions clustered in 23,130 entities.
    3. Relation Extraction: 21,749 annotated relations between entities classified in 65…

    See the full description on the dataset page: https://huggingface.co/datasets/DFKI-SLT/DWIE.

  17. Statistics of DuIE2.0 dataset.

    • plos.figshare.com
    xls
    Updated Feb 23, 2024
    Cite
    Hongmei Tang; Dixiongxiao Zhu; Wenzhong Tang; Shuai Wang; Yanyang Wang; Lihong Wang (2024). Statistics of DuIE2.0 dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0298974.t003
    Dataset updated
    Feb 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Hongmei Tang; Dixiongxiao Zhu; Wenzhong Tang; Shuai Wang; Yanyang Wang; Lihong Wang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Relationship Extraction (RE) is a central task in information extraction. The use of entity mapping to address complex scenarios with overlapping triples, as in CasRel, is gaining traction, yet it faces challenges such as inadequate consideration of sentence continuity, sample imbalance and data noise. This research introduces an entity mapping-based method, CasRelBLCF, that builds on CasRel. The main contributions are: a joint decoder for the head entity that uses Bi-LSTM and CRF; integration of the Focal Loss function to tackle sample imbalance; and a reinforcement learning-based noise reduction method for handling dataset noise. Experiments on relation extraction datasets indicate the superiority of the CasRelBLCF model and show that the noise reduction method improves the model's performance.

  18. Event Detection Dataset

    • search.datacite.org
    • datosdeinvestigacion.conicet.gov.ar
    Updated Jul 11, 2020
    Cite
    Mariano Maisonnave (2020). Event Detection Dataset [Dataset]. http://doi.org/10.17632/7d54rvzxkr
    Dataset updated
    Jul 11, 2020
    Dataset provided by
    DataCite https://www.datacite.org/
    Mendeley
    Authors
    Mariano Maisonnave
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a manually labeled data set for the task of Event Detection (ED). ED consists of identifying event triggers: the words that most clearly indicate the occurrence of an event. The data set consists of 2,200 news extracts from The New York Times (NYT) Annotated Corpus, separated into training (2,000) and testing (200) sets. Each news extract contains the plain text with the labels (event mentions), along with two metadata fields (publication date and an identifier).

    Labels description: We consider as an event any ongoing real-world event or situation reported in the news articles. It is important to distinguish those events and situations that are in progress (or are reported as fresh events) at the moment the news is delivered from past events that are simply brought back, future events, hypothetical events, or events that will not take place. In our data set we only labeled the first type as events. Based on this criterion, some words that are typically considered events are labeled as non-event triggers if they do not refer to ongoing events at the time the analyzed news is released. Take for instance the following news extract: "devaluation is not a realistic option to the current account deficit since it would only contribute to weakening the credibility of economic policies as it did during the last crisis." The only word labeled as an event trigger in this example is "deficit", because it is the only ongoing event referred to in the news. Note that the words "devaluation", "weakening" and "crisis" could be labeled as event triggers in other news extracts, where the context of use of these words is different, but not in the given example.

    Further information: For a more detailed description of the data set and the data collection process please visit: https://cs.uns.edu.ar/~mmaisonnave/resources/ED_data.

    Data format: The dataset is split into two folders: training and testing. The first folder contains 2,000 XML files. The second folder contains 200 XML files. Each XML file has the following format. YYYYMMDDTHHMMSS ... ... ... The first three tags (pubdate, file-id and sent-idx) contain metadata information. The first one is the publication date of the news article that contained that text extract. The next two tags together form a unique identifier for the text extract: the file-id uniquely identifies a news article, which can hold several text extracts, and the sent-idx is the index that identifies that text extract inside the full article. The last tag (sentence) defines the beginning and end of the text extract. Inside that text are the tags. Each of these tags surrounds one word that was manually labeled as an event trigger.
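
    The file layout described above could be read along the following lines. This is only a sketch under stated assumptions: the tag names pubdate, file-id, sent-idx and sentence come from the description, but the root element name (extract), the trigger tag name (event) and the metadata values in the sample are hypothetical, because the listing does not show them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file. Only pubdate, file-id, sent-idx and sentence are
# named in the description; the <extract> root, the <event> trigger tag and
# the metadata values are assumptions made for this sketch.
SAMPLE = (
    "<extract>"
    "<pubdate>20070101T000000</pubdate>"
    "<file-id>1234</file-id>"
    "<sent-idx>3</sent-idx>"
    "<sentence>devaluation is not a realistic option to the current account "
    "<event>deficit</event> since it would only contribute to weakening the "
    "credibility of economic policies as it did during the last crisis."
    "</sentence>"
    "</extract>"
)


def parse_extract(xml_text, trigger_tag="event"):
    """Return metadata, plain text and trigger words for one text extract."""
    root = ET.fromstring(xml_text)
    sentence = root.find("sentence")
    return {
        "pubdate": root.findtext("pubdate"),
        "file_id": root.findtext("file-id"),
        "sent_idx": int(root.findtext("sent-idx")),
        # itertext() joins the text inside and around the nested trigger tags.
        "text": "".join(sentence.itertext()),
        # Every nested trigger tag wraps one manually labeled word.
        "triggers": [el.text for el in sentence.iter(trigger_tag)],
    }


record = parse_extract(SAMPLE)
print(record["triggers"], record["sent_idx"])
```

    Passing a different trigger_tag is enough to adapt the sketch once the real tag name is known from the data set's own documentation.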

  19. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
    

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
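
    One simple way to check output like this against an ontology is sketched below. It is an illustration only: the ALLOWED_RELATIONS set and the check_triples helper are hypothetical names, and the relation list holds just the two relations that appear in the example rather than the actual Music Ontology. The sketch checks relation names only, not the domain/range constraints or faithfulness to the sentence that the benchmark also covers. The sample is written without a trailing comma after the last triple because Python's json module rejects one.

```python
import json

# Hypothetical relation subset, limited to the relations in the example.
ALLOWED_RELATIONS = {"publication date", "lyrics by"}

# Raw string so that the escaped quotes reach the JSON parser unchanged.
OUTPUT = r"""{
 "id": "ont_k_music_test_n",
 "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
 "triples": [
  {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
  {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
  {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}
 ]
}"""


def check_triples(record, allowed_relations):
    """Split triples into those whose relation is allowed and the rest."""
    valid, invalid = [], []
    for triple in record["triples"]:
        target = valid if triple["rel"] in allowed_relations else invalid
        target.append((triple["sub"], triple["rel"], triple["obj"]))
    return valid, invalid


record = json.loads(OUTPUT)
valid, invalid = check_triples(record, ALLOWED_RELATIONS)
print(len(valid), "valid,", len(invalid), "invalid")
```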
    

    The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

    The structure of the repo is as the following.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages

  20. Scanned Czech Receipts Dataset

    • kaggle.com
    Updated Jul 2, 2025
    Cite
    David Jansa (2025). Scanned Czech Receipts Dataset [Dataset]. https://www.kaggle.com/datasets/davidjansa/scanned-czech-receipts-dataset
    Dataset updated
    Jul 2, 2025
    Dataset provided by
    Kaggle http://kaggle.com/
    Authors
    David Jansa
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains 175 flatbed-scanned Czech receipts, each labeled from 001 to 175. The dataset includes real-world variability, such as faded or dark receipts (marked with a "b" in the filename, e.g. 014b.jpg).

    File Descriptions

    The dataset is organized into three directories:

    scans/ Contains JPEG images of scanned receipts. Some images are dark or have lower contrast, simulating real-world scanning scenarios.

    ocr_target/ Contains .txt files with a line-by-line literal transcription of each receipt, suitable for OCR model evaluation.

    segment_target/ Contains .json files with structured information extracted from each receipt. Each JSON file captures key details, such as store name, purchase date, currency, and itemized product data (including discounts). Product data DO NOT include duplicates. (Maybe I will update the segment_target dataset in the future to include duplicated product names as well...)

    Each .json file in segment_target/ follows this schema:

    {
     "company": "tesco",
     "date": "26.07.2024",
     "currency": "czk",
     "products": {
      "madeta cottage 150 g": 29.9,
      "raj.cel.lou400g/240g": 39.9,
      "cc raj.cel.lou400g/2": -20,
      "cc madeta cottage 15": -40
     }
    }

    company: Name of the store or seller (e.g., "tesco") in lowercase.

    date: Date of purchase in DD.MM.YYYY format.

    currency: Transaction currency (e.g., "czk") in lowercase.

    products: Key-value pairs of product names (lowercase) and their prices. Discounts are represented as negative values.

    Warning: Some fields may contain null if the data could not be extracted reliably.
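
    A record that follows this schema could be loaded and summarized as sketched below. The receipt_total and purchase_date helpers are hypothetical names, and treating a null products field or a null price as contributing nothing is an assumption prompted by the warning above, not documented behavior of the dataset.

```python
import json
from datetime import datetime

# Sample record copied from the schema description.
SAMPLE = """{
 "company": "tesco",
 "date": "26.07.2024",
 "currency": "czk",
 "products": {
  "madeta cottage 150 g": 29.9,
  "raj.cel.lou400g/240g": 39.9,
  "cc raj.cel.lou400g/2": -20,
  "cc madeta cottage 15": -40
 }
}"""


def receipt_total(record):
    """Sum product prices; discounts are negative, nulls are skipped (assumption)."""
    products = record.get("products") or {}
    return round(sum(p for p in products.values() if p is not None), 2)


def purchase_date(record):
    """Parse the DD.MM.YYYY date, or return None when the field is null."""
    raw = record.get("date")
    return datetime.strptime(raw, "%d.%m.%Y").date() if raw else None


record = json.loads(SAMPLE)
print(record["company"], receipt_total(record), purchase_date(record))
```

    Rounding to two decimals keeps floating-point noise out of the total; a stricter pipeline might use decimal.Decimal for prices instead.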
