https://www.nist.gov/open/license
This dataset is called the "IIT CDIP collection". "CDIP" stands for "Complex Document Information Processing" and "IIT" stands for "Illinois Institute of Technology", which originally built the dataset. The dataset consists of documents from the states' lawsuit against the tobacco industry in the 1990s. As a result of the settlement of that lawsuit (the "Master Settlement Agreement"), the companies had to make all the documents public in an archive, which currently resides at UCSF, the University of California, San Francisco. IIT used this data to build a dataset of "messy" documents that were challenging for existing systems to process: the documents carry handwriting, stains, and similar defects. TREC used an automatic text conversion of this dataset in the TREC Legal Track. The dataset consists of around 7 million documents, preprocessed with 90s-era OCR, together with the original page scans in TIFF format. See contact information in this record for access to this dataset.
The ILDC dataset (Indian Legal Documents Corpus) is a large corpus of 35k Indian Supreme Court cases annotated with original court decisions. A portion of the corpus (a separate test set) is annotated with gold standard explanations by legal experts. The dataset is used for Court Judgment Prediction and Explanation (CJPE). The task requires an automated system to predict an explainable outcome of a case.
An electronic repository created to streamline the storing and recording of various security requests, including SSA-120s/1121s, ATSAFE-613, e-mails, etc.
To use this dataset and respect copyright, please cite the following paper: https://ieeexplore.ieee.org/abstract/document/9116896/ We present a new dataset that covers almost all the scenarios that may occur in document images taken with a smartphone. The collection includes 1111 images. We tested two state-of-the-art algorithms for finding the corners of the document on our dataset, and the results are also provided. The results indicate that there are still situations in which these algorithms fail and that more research is needed.
https://dataintelo.com/privacy-and-policy
The global off-site document storage market size is projected to grow from USD 7.5 billion in 2023 to USD 12.3 billion by 2032, reflecting a robust CAGR of 5.7% during the forecast period. This growth is driven by increasing regulatory compliance requirements, data security concerns, and the expanding scope of digitization across various industries.
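The relationship between the start value, end value, and CAGR quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names `cagr` and `project` are hypothetical, and it assumes the forecast period spans the nine years from 2023 to 2032.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.05 means 5%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Grow start_value at a constant annual rate for the given number of years."""
    return start_value * (1 + rate) ** years


# Figures taken from the paragraph above (USD billion); the 9-year span is an assumption.
implied_rate = cagr(7.5, 12.3, 2032 - 2023)
projected_2032 = project(7.5, 0.057, 2032 - 2023)
print(f"implied CAGR: {implied_rate:.2%}")
print(f"2032 value at 5.7% per year: {projected_2032:.2f}")
```

Under these assumptions the implied rate lands close to the quoted 5.7%, and compounding 7.5 at 5.7% for nine years lands close to 12.3, so the three figures are mutually consistent up to rounding.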
One of the key growth factors for the off-site document storage market is the escalating need for secure and reliable document storage solutions. Organizations, irrespective of their size, generate a multitude of documents daily. The necessity to preserve these documents for legal, regulatory, and operational reasons has led to a surge in demand for off-site document storage services. This trend is particularly pronounced in sectors such as BFSI, healthcare, and legal, where the integrity and confidentiality of records are paramount.
Moreover, the growing emphasis on disaster recovery planning has further accentuated the need for off-site document storage solutions. Companies are increasingly aware of the potential risks associated with storing critical documents on-site, such as natural disasters, theft, or technical failures. Off-site storage facilities offer a secure alternative, ensuring that important records are protected from unforeseen events and thus contributing to business continuity and resilience.
The advancements in information technology and the increasing adoption of digital transformation initiatives are also significant growth drivers. Many organizations are transitioning from traditional paper-based systems to digital records, necessitating advanced document storage solutions that can handle both physical and electronic documents. This shift not only enhances operational efficiency but also ensures compliance with stringent regulatory frameworks governing data management and privacy.
From a regional perspective, North America currently dominates the off-site document storage market, largely due to stringent regulatory requirements and the presence of numerous large enterprises. However, emerging markets in the Asia Pacific and Latin America are expected to witness substantial growth, driven by rapid industrialization, urbanization, and increasing awareness about the benefits of off-site storage solutions. Europe and the Middle East & Africa also present promising opportunities, albeit at a more moderate growth rate.
As organizations increasingly rely on off-site document storage solutions, the role of physical document destruction service providers becomes crucial. These providers offer specialized services to ensure that sensitive documents are not only stored securely but also destroyed when no longer needed. This is particularly important in industries such as healthcare and finance, where data protection is paramount. By partnering with a reliable service provider, companies can ensure compliance with data protection regulations while mitigating the risks associated with unauthorized access to confidential information. The integration of these services into the document management lifecycle enhances overall security and operational efficiency.
The off-site document storage market can be segmented by service type into document storage, document shredding, document scanning, and others. Document storage is the most prevalent service offered, catering to organizations' need to store physical documents in a secure environment. This service provides businesses with temperature-controlled, well-secured, and monitored facilities, ensuring that sensitive and critical documents are preserved and accessible when needed. The demand for document storage services is particularly high in sectors like BFSI and healthcare, where large volumes of documents must be retained for extended periods.
Document shredding services are increasingly in demand due to stringent data protection laws and the rising emphasis on confidential information destruction. Organizations are becoming more aware of the risks associated with improper disposal of sensitive documents, leading to the adoption of professional shredding services. This segment is expected to grow steadily as more companies prioritize data security and compliance with regulations such as GDPR and HIPAA, which mandate the secure disposal of sensitive information.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Coordinator Step-by-Step Guide includes:
Step 1. Data spreadsheet
Step 2. Complete a Dataset Inventory for each new dataset
Step 3. Evaluate and Prioritize data for publication
Step 4. Review security and privacy criteria
Step 5. Prepare Metadata
Step 6. Prepare Data Dictionary
Step 7. Data Upload
Step 8. Service Ticket update
The Flood Insurance Rate Map (FIRM) depicts flood risk information and supporting data used to develop the risk data. The primary risk classifications used are the 1-percent-annual-chance flood event (A or AE) and the 0.2-percent-annual-chance flood event (X). The FIRM data can be derived from Flood Insurance Studies (FISs) and previously published Flood Insurance Rate Maps (FIRMs). The FISs and FIRMs are published by the Federal Emergency Management Agency (FEMA). This database has been created by digitizing data from georeferenced paper FIRM maps and adding information from FISs where available. All FIRMs were georeferenced at a 1:4000 scale or finer. This data should be used as a reference layer, not as an authoritative source.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity sufficient for use: typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material, usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.
This dataset consists of points that represent recorded documents in the Delaware County Recorder's Plat Books, Cabinet/Slides and Instruments Records which are not represented by subdivision plats that are active. They are documents such as vacations, subdivisions, centerline surveys, surveys, annexations, and miscellaneous documents within Delaware County, Ohio.
According to our latest research, the global Document AI Platform market size in 2024 is valued at USD 3.9 billion, reflecting the rapid adoption of artificial intelligence for document processing across industries. The market is experiencing robust expansion, boasting a CAGR of 29.7% from 2025 to 2033. By the end of 2033, the market is forecasted to reach an impressive USD 34.6 billion. This growth is driven by the increasing demand for automation in document-intensive workflows, the proliferation of digital transformation initiatives, and the necessity for enhanced compliance and data accuracy in regulated industries. As organizations worldwide seek to streamline operations and leverage unstructured data, Document AI platforms are becoming indispensable tools for modern enterprises.
One of the primary growth factors propelling the Document AI Platform market is the accelerating pace of digital transformation across sectors such as BFSI, healthcare, and retail. Organizations are increasingly burdened by vast volumes of unstructured and semi-structured documents, from contracts and invoices to patient records and regulatory filings. Document AI platforms, leveraging advanced machine learning, natural language processing, and optical character recognition, enable automated extraction, classification, and validation of data, significantly reducing manual labor and operational costs. This automation not only enhances productivity but also minimizes human error, ensuring data integrity and compliance with stringent regulatory requirements. As a result, enterprises are prioritizing investments in Document AI solutions to gain a competitive edge and drive business agility.
Another significant growth driver is the rising need for enhanced compliance management and fraud detection capabilities. With the ever-evolving regulatory landscape, especially in sectors such as finance and healthcare, organizations must ensure that their document processing aligns with legal and industry standards. Document AI platforms provide robust compliance tools, enabling real-time monitoring, audit trails, and automated flagging of anomalies or potential fraudulent activities. The ability to rapidly detect inconsistencies and ensure adherence to regulations not only mitigates risks but also fosters trust among stakeholders. Moreover, the integration of AI-driven analytics empowers organizations to derive actionable insights from their document repositories, facilitating informed decision-making and strategic planning.
The proliferation of cloud-based solutions and the increasing accessibility of AI technologies are further catalyzing market growth. Cloud deployment models offer scalability, flexibility, and cost-efficiency, making advanced Document AI capabilities accessible to organizations of all sizes, including small and medium enterprises. The shift to remote and hybrid work models post-pandemic has accelerated the adoption of cloud-based Document AI platforms, enabling seamless collaboration, secure access, and real-time processing of documents from any location. Additionally, advancements in AI algorithms and interoperability with existing enterprise systems have reduced the barriers to adoption, encouraging a broader spectrum of industries to embrace Document AI for both core and ancillary business functions.
Regionally, North America continues to dominate the Document AI Platform market, driven by the presence of major technology providers, a mature digital infrastructure, and early adoption of AI-powered solutions. The United States, in particular, leads in terms of market share and innovation, with significant investments from both public and private sectors in AI research and development. Europe follows closely, supported by stringent data privacy regulations and a growing emphasis on digital sovereignty. Asia Pacific is emerging as a high-growth region, fueled by rapid digitization, expanding enterprise IT budgets, and government initiatives promoting AI adoption. Meanwhile, Latin America and the Middle East & Africa are witnessing steady growth, albeit from a smaller base, as organizations in these regions begin to recognize the transformative potential of Document AI platforms.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Documentation and Metadata session from the 2015 Virginia Data Management Bootcamp. Introduces unstructured (data dictionaries, README files, codebooks) and structured (XML schemas) ways to document research data.
Documents issued by the Protected Documents Office to civil status and passport offices, including passports, cards, certificates, and family books.
https://www.statsndata.org/how-to-order
The Outsource Legal Document Review Service market has emerged as a critical component in the legal sector, catering to the burgeoning need for efficient and accurate document examination. This service, which involves hiring third-party professionals to review legal documents for compliance, relevance, and accuracy,
https://dataintelo.com/privacy-and-policy
The global Document Tracking System (DTS) Software market size is projected to grow from USD 5.2 billion in 2023 to USD 13.8 billion by 2032, registering a compound annual growth rate (CAGR) of 11.2% during the forecast period. This substantial expansion is driven by the increasing necessity for efficient document management and the rise of remote work, which demands secure and accessible document tracking solutions.
One of the primary growth factors for the DTS software market is the growing emphasis on data security and compliance. As businesses increasingly operate in a digital environment, the need to track and secure sensitive documents has become paramount. Regulatory requirements such as GDPR and HIPAA are pushing organizations to adopt advanced DTS solutions to ensure compliance and avoid hefty penalties. Furthermore, the rise in cyber threats and data breaches has heightened the focus on secure document management, further fueling market growth.
Another significant driver of the DTS software market is the increased adoption of cloud-based solutions. The flexibility, scalability, and cost-effectiveness of cloud-based document tracking systems make them an attractive option for organizations of all sizes. Additionally, the ease of integration with existing enterprise systems and the ability to access documents from any location contribute to the growing popularity of cloud-based DTS solutions. This trend is particularly noticeable in small and medium enterprises (SMEs) that seek cost-efficient and flexible document management solutions without the need for substantial upfront investments.
The market is also benefitting from advancements in artificial intelligence (AI) and machine learning (ML) technologies. AI-powered DTS software can automate various document management tasks, such as categorizing documents, detecting anomalies, and predicting compliance issues. These capabilities not only enhance operational efficiency but also reduce the risk of human error, making them highly valuable for organizations. The ability to leverage AI and ML for intelligent document tracking and management is expected to drive significant growth in the market over the forecast period.
As the demand for efficient document management solutions grows, Document Databases Software is becoming increasingly vital. This type of software allows organizations to store and manage documents in a structured format, enabling easy retrieval and manipulation of data. Document Databases Software supports the scalability and flexibility needed in today's fast-paced business environment, particularly for companies handling large volumes of data. By offering robust search capabilities and seamless integration with other enterprise systems, it enhances productivity and ensures data consistency. The adoption of Document Databases Software is expected to rise as businesses seek to streamline their document management processes and improve data accessibility.
Regionally, North America is expected to hold the largest market share in the DTS software market, driven by the presence of key market players, technological advancements, and high adoption rates of digital solutions. The Asia Pacific region is anticipated to witness the highest growth rate, attributed to the rapid digital transformation in emerging economies such as China and India, increasing investments in IT infrastructure, and the growing awareness of document management solutions. Europe is also expected to contribute significantly to the market growth, driven by stringent regulatory requirements and the increasing adoption of cloud-based DTS solutions across various industries.
The DTS software market is segmented by component into software and services. The software segment is expected to dominate the market, driven by the increasing demand for advanced document tracking solutions that offer features such as real-time tracking, automated workflows, and enhanced security. These software solutions enable organizations to efficiently manage and monitor document movement, ensuring compliance with regulatory standards and minimizing the risk of data breaches. Additionally, the integration of AI and ML technologies in DTS software is further enhancing its capabilities, making it an essential tool for modern enterprises.
The services segment, although smaller compared to the software segment
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
City of Tempe Security and Privacy Worksheet includes:
Section 1. DATASET NAME
Section 2. PERSONALLY IDENTIFIABLE INFORMATION QUESTIONS
Section 3. SECURITY: PROTECTED DATA
Section 4. SECURITY: SENSITIVE DATA
The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents which include undergraduate and postgraduate theses, research and professional articles, along with other academic document types. The data within the dataset was collected as a part of the establishment of the Slovenian Open-Access Infrastructure which defined a unified document collection process and cataloguing for universities in Slovenia within the infrastructure repositories. The data was collected from several already established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields, representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years and other identifiers such as URL and UDC. The potential of this dataset lies especially in text mining and text classification tasks and can also be used in development or benchmarking of content-based recommender systems on real-world data.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
US Fish and Wildlife Service (FWS) Servcat Documents: Topic: Notes
This deposit contains an archive of documents from the US Fish and Wildlife Service (FWS) Servcat system. The documents were obtained by scraping the FWS Servcat system, which is a database of documents related to the management of fish and wildlife resources in the United States. The documents include reports, memos, and other materials related to the management of fish and wildlife resources.
The documents are organized here by general topic, and are contained in a zip file. If the original general topic contained more than 50 Gb of data, the documents are split into multiple zip files. The zip files are named according to the original general topic, and are numbered sequentially when more than one zip file is created. For example, if the original general topic was Geospatial_Dataset, and there were three zip files created, the zip files would be named Geospatial_Dataset_part1.zip, Geospatial_Dataset_part2.zip, and Geospatial_Dataset_part3.zip. If only one zip file is created, it will be named by that general topic, e.g. Geospatial_Dataset.zip.
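The naming rule described above can be sketched as a small helper. This is an illustrative reconstruction of the convention, not the script used to build the deposit; the function name `archive_names` and its parameters are hypothetical.

```python
def archive_names(topic: str, part_count: int) -> list[str]:
    """Return the zip file names for a topic split into part_count archives.

    A single archive is named after the topic alone; multiple archives get a
    sequential _partN suffix starting at 1.
    """
    if part_count < 1:
        raise ValueError("part_count must be at least 1")
    if part_count == 1:
        return [f"{topic}.zip"]
    return [f"{topic}_part{i}.zip" for i in range(1, part_count + 1)]


print(archive_names("Geospatial_Dataset", 3))
print(archive_names("Geospatial_Dataset", 1))
```

A helper like this can be useful when checking that every expected part of a topic has been downloaded before extracting.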
Document Outsourcing Market Size 2025-2029
The document outsourcing market size is forecast to increase by USD 19.5 billion at a CAGR of 5.7% between 2024 and 2029.
The market is experiencing significant growth due to the increasing need for cost reduction and enhanced efficiency in business operations. Companies are turning to document outsourcing services to streamline their processes and focus on core competencies. Additionally, regulatory compliance requirements are driving the adoption of document outsourcing solutions to ensure data security and adherence to industry standards. However, the market faces challenges, primarily in the areas of data security and regulatory compliance. With the shift towards cloud sourcing, ensuring data security becomes paramount. Companies must implement robust security measures to protect sensitive information from cyber threats. Regulatory hurdles also impact adoption, as organizations grapple with complex compliance requirements across various industries and jurisdictions. Supply chain inconsistencies can temper growth potential, as businesses seek reliable and consistent service delivery from their outsourcing partners. To capitalize on market opportunities and navigate challenges effectively, companies must prioritize data security, regulatory compliance, and supply chain management in their outsourcing strategies.
What will be the Size of the Document Outsourcing Market during the forecast period?
The market is experiencing significant transformation as businesses increasingly leverage technology to streamline operations and enhance productivity. Big data is playing a pivotal role in this evolution, enabling organizations to derive valuable insights from their unstructured data through intelligent document processing and data analytics. Service level agreements (SLAs) are a critical aspect of document outsourcing, ensuring quality and performance in supply chain management. Key performance indicators (KPIs) are used to measure success, with return on investment (ROI) being a key metric. Edge computing and hybrid cloud solutions are gaining traction, allowing for real-time data processing and analysis, while paperless offices and digital transformation initiatives continue to drive the demand for document outsourcing services. Process mining and business intelligence are essential tools for optimizing operations and improving business continuity. Compliance management and risk management are also top priorities, with predictive analytics and robotic process automation helping to mitigate risks and ensure regulatory compliance. Data governance and quality assurance are crucial components of document outsourcing, with data visualization and performance metrics used to monitor and improve processes. Customer relationship management and knowledge discovery are also important areas of focus, as organizations seek to gain a competitive edge through data-driven insights. Cloud migration and business intelligence are key trends, with organizations looking to leverage the power of the cloud to improve their document outsourcing capabilities and enhance their overall digital strategy. Overall, the market is dynamic and evolving, with a focus on innovation, efficiency, and data-driven insights.
How is this Document Outsourcing Industry segmented?
The document outsourcing industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Service: Onsite contracted, Statement printing, DPO
End-user: Large companies, Small and medium companies
Application: Healthcare, IT, Retail, Media, Others
Geography: North America (US, Canada), Europe (France, Germany, Italy, The Netherlands, UK), APAC (China, India, Japan), Rest of World (ROW)
By Service Insights
The onsite contracted segment is estimated to witness significant growth during the forecast period. In the market, onsite contracted services have emerged as a popular solution for businesses seeking advanced document management systems. Service providers offer onsite technology implementation and services for document conversion, assessment, and consulting, tailored to meet specific client requirements. The evaluation of a company's IT architecture leads to the implementation of document management solutions suitable for their industry vertical, business size, and competitive landscape. To cater to the growing demand for business process automation and data-driven decision-making, document outsourcing providers expand their service offerings. These on-site document management systems enable companies to efficiently process financial documents, extract data for sales and marketing purposes, and ensure data security through compliance regulations. Additionally, these solutions offer mobility, enabling remote work, and facilitate
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This online appendix contains the coding guide and the data used in the paper "Information Correspondence between Types of Documentation for APIs", accepted for publication in the Empirical Software Engineering (EMSE) journal. The tutorial data was retrieved in October 2018.
It contains the following files:
CodingGuide.pdf: the coding guide to classify a sentence as API Information or Supporting Text.
annotated_sampled_sentences.csv: the set of 332 sampled sentences and two columns of corresponding annotations – one by the first author of this work and the second by an external annotator. This data was used to calculate the agreement score reported in the paper.
<language>-<topic>.csv: the data set of annotated sentences in the tutorial on <topic> in <language>. For example, Python-REGEX.csv is the file containing sentences from the Python tutorial on regular expressions. This file contains the preprocessed sentences from the tutorial, their source files, and their annotation of sentence correspondence with reference documentation.
For licensing reasons, we are unable to upload the original API reference documentation and tutorials, however these are available on request.
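The two annotation columns in annotated_sampled_sentences.csv lend themselves to an inter-annotator agreement check. The record does not state which agreement statistic the paper reports, so the sketch below uses Cohen's kappa purely as an assumption; the function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the data files.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / n**2
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only (not from the published data).
first_author = ["API Information", "API Information", "Supporting Text", "Supporting Text"]
external = ["API Information", "Supporting Text", "Supporting Text", "Supporting Text"]
print(cohen_kappa(first_author, external))
```

To apply this to the real file, the two annotation columns would first need to be read into lists; their column names are not given in this record.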
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Multiple entities in a document generally exhibit complex inter-sentence relations, and cannot be well handled by existing relation extraction (RE) methods that typically focus on extracting intra-sentence relations for single entity pairs. In order to accelerate the research on document-level RE, we introduce DocRED, a new dataset constructed from Wikipedia and Wikidata with three features: - DocRED annotates both named entities and relations, and is the largest human-annotated dataset for document-level RE from plain text. - DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document. - Along with the human-annotated data, we also offer large-scale distantly supervised data, which enables DocRED to be adopted for both supervised and weakly supervised scenarios.