100+ datasets found

my-pdf-dataset
huggingface.co
Updated Mar 11, 2025
Cite
Benito Silva (2025). my-pdf-dataset [Dataset]. https://huggingface.co/datasets/benitoals/my-pdf-dataset
Explore at: Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
Dataset updated
Mar 11, 2025
Authors
Benito Silva
Description
benitoals/my-pdf-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
my-pdf-data
huggingface.co
Updated Sep 1, 2025
Cite
saivivek Adigarla (2025). my-pdf-data [Dataset]. https://huggingface.co/datasets/Saivivek25/my-pdf-data
Dataset updated
Sep 1, 2025
Authors
saivivek Adigarla
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Saivivek25/my-pdf-data dataset hosted on Hugging Face and contributed by the HF Datasets community
IUST-PDFCorpus
zenodo.org
live.european-language-grid.eu
zip
Updated Apr 24, 2025
Cite
Morteza Zakeri-Nasrabadi (2025). IUST-PDFCorpus [Dataset]. http://doi.org/10.5281/zenodo.3484013
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3484013
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Morteza Zakeri-Nasrabadi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About

IUST-PDFCorpus is a large collection of PDF files intended for building and manipulating new PDF files in order to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. The corpus contains 6,141 complete PDF files of various sizes and contents. It also includes 507,299 PDF data objects and 151,132 PDF streams extracted from those files. Data objects are textual while streams are binary; together they make up PDF files. In addition, we attached the code coverage of each PDF file when it is used as test data for MuPDF. The coverage information is available in both binary and XML formats.

PDF data objects are organized into three categories. The first category contains all objects in the corpus: each file in this category holds all PDF objects extracted from one PDF file, without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing; it is split into train, test, and validation sets for use in machine learning tasks. The third category is the same as the second but smaller, for use during algorithm development.

IUST-PDFCorpus was collected from various sources, including the Mozilla PDF.js open test corpus, PDFs used as initial seeds in AFL, and PDFs gathered from e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
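To make the distinction between textual data objects and binary streams concrete, here is a minimal Python sketch that splits raw PDF bytes into indirect objects and separates each object's textual part from its stream, if any. It is not the corpus authors' extraction code: the regular expressions, the function name `split_objects`, and the tiny `sample` input are assumptions for illustration, and the naive pattern would break on real files whose binary streams happen to contain the bytes `endobj`.

```python
import re

# Naive patterns (an assumption, not the corpus tooling):
# "<number> <generation> obj ... endobj" and "stream ... endstream".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def split_objects(pdf_bytes):
    """Return (number, generation, textual_part, stream_or_None) tuples."""
    results = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number = int(match.group(1))
        generation = int(match.group(2))
        body = match.group(3)
        stream_match = STREAM_RE.search(body)
        if stream_match:
            # Everything before the stream keyword is the textual dictionary.
            text_part = body[:stream_match.start()].strip()
            stream = stream_match.group(1)
        else:
            text_part, stream = body.strip(), None
        results.append((number, generation, text_part, stream))
    return results


# Hypothetical two-object input, far smaller than any real PDF.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
)
objects = split_objects(sample)
for number, generation, text_part, stream in objects:
    print(number, generation, text_part, stream)
```

A production extractor would rely on the cross-reference table and the `/Length` entry rather than on pattern matching.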

Citing IUST-PDFCorpus

If IUST-PDFCorpus is used in your work in any form, please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
Pdf Figure Detection Dataset
universe.roboflow.com
zip
Updated Oct 31, 2021
Cite
APS360 Project (2021). Pdf Figure Detection Dataset [Dataset]. https://universe.roboflow.com/aps360-project/pdf-figure-detection
Available download formats: zip
Dataset updated
Oct 31, 2021
Dataset authored and provided by
APS360 Project
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Figures Bounding Boxes
Description
PDF Figure Detection

## Overview

PDF Figure Detection is a dataset for object detection tasks. It contains Figures annotations for 264 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
atlas-pdf-img-cluster
huggingface.co
Updated Feb 8, 2024
Cite
Atlas Unified (2024). atlas-pdf-img-cluster [Dataset]. https://huggingface.co/datasets/AtlasUnified/atlas-pdf-img-cluster
Explore at: Croissant
Dataset updated
Feb 8, 2024
Authors
Atlas Unified
License
https://choosealicense.com/licenses/osl-3.0/
Description
Atlas PDF Image Cluster Dataset

Derives from the following Python Pipeline code: https://github.com/atlasunified/PDF-to-Image-Cluster

Dataset Description

This dataset is a collection of text extracted from PDF files, originating from various online resources. The dataset was generated using a series of Python scripts forming a robust pipeline that automated the tasks of downloading, converting, and managing the data.

Dataset Summary

Sample JPG

Corresponding… See the full description on the dataset page: https://huggingface.co/datasets/AtlasUnified/atlas-pdf-img-cluster.
PDF format log books of data collection in Lake Mead in 2000
catalog.data.gov
search.dataone.org
Updated Sep 16, 2025
Cite
U.S. Geological Survey (2025). PDF format log books of data collection in Lake Mead in 2000 [Dataset]. https://catalog.data.gov/dataset/pdf-format-log-books-of-data-collection-in-lake-mead-in-2000-f01e7
Dataset updated
Sep 16, 2025
Dataset provided by
United States Geological Survey: http://www.usgs.gov/
Area covered
Lake Mead
Description
Lake Mead is a large interstate reservoir located in the Mojave Desert of southeastern Nevada and northwestern Arizona. It was impounded in 1935 by the construction of Hoover Dam and is one of a series of multi-purpose reservoirs on the Colorado River. The lake extends 183 km from the mouth of the Grand Canyon to Black Canyon, the site of Hoover Dam, and provides water for residential, commercial, industrial, recreational, and other non-agricultural users in communities across the southwestern United States. Extensive research has been conducted on Lake Mead, but a majority of the studies have involved determining levels of anthropogenic contaminants such as synthetic organic compounds, heavy metals and dissolved ions, furans/dioxins, and nutrient loading in lake water, sediment, and biota (Preissler et al., 1998; Bevans et al., 1996; Bevans et al., 1998; Covay and Leiker, 1998; LaBounty and Horn, 1997; Paulson, 1981). By contrast, little work has focused on the sediments in the lake and the processes of deposition (Gould, 1951). To address these questions, sidescan-sonar imagery and high-resolution seismic-reflection profiles were collected throughout Lake Mead by the USGS in cooperation with researchers from the University of Nevada Las Vegas (UNLV). These data allow a detailed mapping of the surficial geology and the distribution and thickness of sediment that has accumulated in the lake since the completion of Hoover Dam. Results indicate that the accumulation of post-impoundment sediment is primarily restricted to former river and stream beds that are now submerged below the lake, while the margins of the lake appear to be devoid of post-impoundment sediment. The sediment cover along the original Colorado River bed is continuous and is typically greater than 10 m thick through much of its length. Sediment thickness in some areas exceeds 35 m, while the smaller tributary valleys typically are filled with less than 4 m of sediment.
Away from the river beds that are now covered with post-impoundment sediment, pre-impoundment alluvial deposits and rock outcrops are still exposed on the lake floor.
Pdf Batch_0 Dataset
universe.roboflow.com
zip
Updated Feb 7, 2023
Cite
chunghoon (2023). Pdf Batch_0 Dataset [Dataset]. https://universe.roboflow.com/chunghoon/pdf-batch_0
Available download formats: zip
Dataset updated
Feb 7, 2023
Dataset authored and provided by
chunghoon
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Exercise Problems Bounding Boxes
Description
Pdf Batch_0

## Overview

Pdf Batch_0 is a dataset for object detection tasks. It contains Exercise Problems annotations for 1,171 images.

## Getting Started

You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.

## License

This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
pdf-dataset
huggingface.co
Updated Jun 26, 2023
Cite
Andrea Soria (2023). pdf-dataset [Dataset]. https://huggingface.co/datasets/asoria/pdf-dataset
Explore at: Croissant
Dataset updated
Jun 26, 2023
Authors
Andrea Soria
Description
asoria/pdf-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
PDF Solutions | PDFS - EPS Earnings Per Share
tradingeconomics.com
csv, excel, json, xml
Updated Jun 15, 2025
Cite
TRADING ECONOMICS (2025). PDF Solutions | PDFS - EPS Earnings Per Share [Dataset]. https://tradingeconomics.com/pdfs:us:eps
Available download formats: csv, json, xml, excel
Dataset updated
Jun 15, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2000 - Oct 12, 2025
Area covered
United States
Description
PDF Solutions reported $0.19 in earnings per share (EPS) for its fiscal quarter ending in June 2025. Data for PDF Solutions | PDFS - EPS Earnings Per Share, including historical data, tables, and charts, was last updated by Trading Economics in October 2025.
Human-readable PDF metadata from 69 PDFs
figshare.com
datasetcatalog.nlm.nih.gov
txt
Updated Jun 1, 2023
Cite
Ross Mounce (2023). Human-readable PDF metadata from 69 PDFs [Dataset]. http://doi.org/10.6084/m9.figshare.106195.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.106195.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare: http://figshare.com/
Authors
Ross Mounce
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
See the corresponding blogpost

identifying bibliographic data and links to source PDFs here: http://dx.doi.org/10.6084/m9.figshare.105633
PDF Solutions | PDFS - Current Assets
tradingeconomics.com
csv, excel, json, xml
Updated Jun 15, 2025
Cite
TRADING ECONOMICS (2025). PDF Solutions | PDFS - Current Assets [Dataset]. https://tradingeconomics.com/pdfs:us:current-assets
Available download formats: csv, json, xml, excel
Dataset updated
Jun 15, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2000 - Oct 14, 2025
Area covered
United States
Description
PDF Solutions reported $134.79M in current assets for its fiscal quarter ending in June 2025. Data for PDF Solutions | PDFS - Current Assets, including historical data, tables, and charts, was last updated by Trading Economics in October 2025.
Global PDF reader software market size is USD 1958.2 million in 2024.
cognitivemarketresearch.com
pdf,excel,csv,ppt
Cite
Cognitive Market Research, Global PDF reader software market size is USD 1958.2 million in 2024. [Dataset]. https://www.cognitivemarketresearch.com/pdf-reader-software-market-report
Available download formats: pdf, excel, csv, ppt
Dataset authored and provided by
Cognitive Market Research
License
https://www.cognitivemarketresearch.com/privacy-policy
Time period covered
2021 - 2033
Area covered
Global
Description
According to Cognitive Market Research, the global PDF reader software market size is USD 1958.2 million in 2024. It will expand at a compound annual growth rate (CAGR) of 13.30% from 2024 to 2031.

North America held the largest market share, more than 40% of global revenue, with a market size of USD 783.28 million in 2024, and will grow at a compound annual growth rate (CAGR) of 11.5% from 2024 to 2031. Europe accounted for over 30% of global revenue, with a market size of USD 587.46 million. Asia Pacific held around 23% of global revenue, with a market size of USD 450.39 million in 2024, and will grow at a CAGR of 15.3% from 2024 to 2031. Latin America held more than 5% of global revenue, with a market size of USD 97.91 million in 2024, and will grow at a CAGR of 12.7% from 2024 to 2031. The Middle East and Africa held around 2% of global revenue, with an estimated market size of USD 39.16 million in 2024, and will grow at a CAGR of 13.0% from 2024 to 2031. The "without editional function" segment held the highest PDF reader software market revenue share in 2024.
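The regional figures can be cross-checked against the global total with a few lines of arithmetic. The Python sketch below uses only the numbers quoted in this description; the variable names are illustrative. It computes each region's share of the USD 1958.2 million global figure and confirms that the five regional values add up to that total.

```python
total_2024 = 1958.2  # USD million, global market size quoted in the description

# Regional market sizes for 2024 as quoted in the description (USD million).
regional_2024 = {
    "North America": 783.28,
    "Europe": 587.46,
    "Asia Pacific": 450.39,
    "Latin America": 97.91,
    "Middle East and Africa": 39.16,
}

# Share of the global total held by each region.
shares = {region: value / total_2024 for region, value in regional_2024.items()}
regional_sum = sum(regional_2024.values())

for region, share in shares.items():
    print(f"{region}: {share:.1%}")
print(f"Sum of regions: {regional_sum:.2f} (global: {total_2024:.2f})")
```

Under this check the regional values sum to the global figure, and the shares come out very close to 40%, 30%, 23%, 5%, and 2%, which suggests the regional sizes were derived from rounded percentages.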

Market Dynamics of PDF reader software Market

Key Drivers for PDF reader software Market

Growing adoption of digital documents to increase the demand globally

The growing adoption of digital documents is significantly increasing demand globally for PDF reader software. As businesses and individuals transition towards digital workflows, the need for efficient document management tools becomes paramount. Digital documents offer advantages such as easier storage, faster retrieval, and reduced environmental impact compared to traditional paper-based systems. This shift is particularly evident in sectors like finance, healthcare, education, and legal services, where paper-intensive processes are being replaced by digital solutions. Furthermore, the rise in remote work and virtual collaboration due to global events has accelerated this trend, driving up the demand for versatile PDF readers capable of supporting seamless document sharing, annotation, and editing across different devices and platforms. As a result, PDF reader software providers are poised to capitalize on these trends by continually innovating and enhancing their offerings to meet the evolving needs of digital document users worldwide.

Rising mobile device usage to propel market growth

The increasing prevalence of mobile devices is a significant catalyst for market growth in PDF reader software. With more people relying on smartphones and tablets as primary computing devices, the demand for mobile-friendly PDF readers is on the rise. Mobile devices enable users to access and interact with documents on the go, enhancing productivity and convenience. This trend is particularly pronounced in sectors such as sales, field service, and education, where mobile devices facilitate real-time access to critical documents and information. PDF reader software that optimizes for mobile platforms by offering intuitive interfaces, responsive design, and features like annotation and cloud integration stands to capitalize on this trend. As mobile device usage continues to grow globally, PDF reader providers have a strategic opportunity to innovate and expand their market presence by catering to the evolving needs of mobile-centric users.

Restraint Factor for the PDF reader software Market

Competition from free alternatives to limit sales

Competition from free alternatives poses a significant challenge to the sales potential of PDF reader software. Many users opt for freely available PDF readers like Adobe Acrobat Reader DC, Foxit Reader, or built-in PDF viewers in operating systems, which offer basic functionalities without requiring payment. These free alternatives often satisfy the needs of casual users who only require simple document viewing and basic interaction features. To counter this competition, paid PDF reader software must differentiate itself by offering compelling value propositions such as advanced editing capabilities, enhanced security features, seamless integration with other software ecosystems, and superior customer support. Furthermore, emphasizing additional benefits such as improved user experience, regular updates, and specialized featur...
pdf-ocr-dataset
huggingface.co
Cite
broadfield, pdf-ocr-dataset [Dataset]. https://huggingface.co/datasets/broadfield-dev/pdf-ocr-dataset
Authors
broadfield
Description
broadfield-dev/pdf-ocr-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
PDF Editor Software Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 22, 2024
Cite
Dataintelo (2024). PDF Editor Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-pdf-editor-software-market
Available download formats: pdf, pptx, csv
Dataset updated
Sep 22, 2024
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
PDF Editor Software Market Outlook

The PDF Editor Software market size is poised to witness significant growth from 2024 to 2032, with a projected CAGR of 11.5% during this period. In 2023, the global market size was valued at approximately USD 1.5 billion and is expected to reach USD 4.1 billion by 2032. This rapid expansion is driven by increasing digitalization, the rising need for efficient document management, and the growing adoption of electronic signatures in various sectors.
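As a rough consistency check of the figures in the paragraph above, the compound-growth formula can be applied directly. The Python sketch below assumes growth compounds annually from the 2023 base value over nine years (2023 to 2032). That is only one possible reading, since the report states the CAGR for 2024 to 2032. The function names are illustrative.

```python
def project(start_value, cagr, years):
    """Value after compounding `cagr` annually for `years` years."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that turns start_value into end_value over `years`."""
    return (end_value / start_value) ** (1 / years) - 1


start_2023 = 1.5     # USD billion, quoted 2023 market size
end_2032 = 4.1       # USD billion, quoted 2032 projection
years = 2032 - 2023  # assumed compounding period (9 years)

projected = project(start_2023, 0.115, years)
rate = implied_cagr(start_2023, end_2032, years)

print(f"Projected 2032 value at 11.5% CAGR: USD {projected:.2f} billion")
print(f"CAGR implied by 1.5 -> 4.1 over {years} years: {rate:.1%}")
```

Under that assumption the 11.5% rate gives roughly USD 4.0 billion by 2032, and the quoted endpoints imply a rate slightly under 12%, so the stated figures are approximately, though not exactly, consistent with one another.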

One of the primary growth factors contributing to this market surge is the ubiquitous adoption of digital documentation across industries. The shift from paper-based processes to digital solutions has been accelerated by the global move towards sustainability and efficiency. Enterprises and government bodies are increasingly deploying PDF editor software to streamline their document management processes, which significantly reduces operational costs and enhances productivity. Moreover, the integration of advanced features such as Optical Character Recognition (OCR) and AI-based editing tools in PDF editors has further fueled their adoption.

Another critical driver for the PDF Editor Software market is the rise in remote working and the demand for collaborative tools. The COVID-19 pandemic has prompted a permanent shift towards remote and hybrid work environments, necessitating efficient digital tools to manage and edit documents. PDF editor software has become indispensable for professionals working remotely, enabling seamless collaboration, editing, and sharing of documents in real-time. This trend is expected to continue, further propelling the demand for PDF editor software in the coming years.

The increasing demand for enhanced security features in document management systems is also a significant growth factor. With the rise in cyber threats and data breaches, organizations are prioritizing the security of their digital documents. PDF editor software that offers robust security features such as encryption, password protection, and secure sharing capabilities is witnessing higher adoption rates. This focus on security is particularly pronounced in sectors such as finance, healthcare, and government, where the confidentiality of documents is paramount.

Regionally, North America currently holds the largest market share and is expected to maintain its dominance throughout the forecast period. The region's advanced IT infrastructure, coupled with the high adoption rate of digital technologies among enterprises, drives this dominance. Furthermore, the presence of major PDF editor software providers in the region contributes to the sustained market growth. However, the Asia Pacific region is anticipated to register the highest CAGR due to the rapid digital transformation in emerging economies and increasing investments in IT infrastructure.

Component Analysis

The PDF Editor Software market is segmented by components into software and services. The software segment dominates the market and is expected to maintain its lead throughout the forecast period. This segment includes standalone PDF editor applications as well as integrated solutions within larger document management systems. The continuous advancements in software features, such as enhanced user interfaces, cloud integration, and AI capabilities, are driving the adoption of PDF editor software. Additionally, the increasing availability of subscription-based pricing models has made these software solutions more accessible to a broader range of users.

On the other hand, the services segment, though smaller, plays a crucial role in the overall market. This includes various support services, such as implementation, training, and maintenance, which are essential for the effective utilization of PDF editor software. Managed services are also gaining traction, offering enterprises the convenience of outsourcing their document management needs. The rising complexity of digital document workflows and the need for customized solutions are further fueling the demand for professional services in this segment.

The integration of cloud services with PDF editor software is another noteworthy trend within the component segment. Cloud-based PDF editors offer several advantages, including easier accessibility, real-time collaboration, and automatic updates. These benefits are particularly appealing to small and medium enterprises (SMEs) that may lack the resources to maintain extensive IT infrastructure. As a result, the services segment is witnessing a growing demand for cloud management and support
Uslu.pdf - Dataset - B2FIND
b2find.eudat.eu
Updated Dec 28, 2024
Cite
(2024). Uslu.pdf - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/81ef2e6c-8c17-5b7c-ac32-6d7f3ebac6b8
Dataset updated
Dec 28, 2024
Description
This study was conducted with a group of parents to define the relationship between obesity and the family environment related to nutrition and physical activity in school-aged children (5–14 years), and to determine how school level, gender, and parental education level affect this environment. The study was conducted online during the fall semester of 2024 with 531 parents (289 fathers and 242 mothers) who have children in preschool, primary, and secondary school. Data were collected with questions designed to determine sociodemographic characteristics and with the Family Nutrition and Physical Activity Screening Scale adapted into Turkish (FNPA-TR). The relationships between FNPA scale scores and children's body mass index (BMI), as well as some sociodemographic variables, were examined using the appropriate variance model and correlation analysis according to the structure and distribution of the data. The results showed that a higher parental education level contributes to lower BMI values in children. In addition, family and child activities play an important role in children's BMI, and children with lower BMI were more active. A healthy environment and family sleep patterns were also found to positively affect BMI. The gender of the children did not make a significant difference in BMI. Family dietary habits and physical activity levels are important factors influencing childhood obesity risk, but family eating patterns and dietary habits do not directly influence BMI in interaction with environmental factors.
Keyphrase Metrics for Pdf
newsletterscan.com
Updated Mar 31, 2025
Cite
(2025). Keyphrase Metrics for Pdf [Dataset]. http://newsletterscan.com/topic/pdf
Dataset updated
Mar 31, 2025
Variables measured
Mentions, Growth Rate, Growth Category
Description
A dataset of mentions, growth rate, and total volume of the keyphrase 'Pdf' over time.
Form PDF Generator
odgavaprod.ogopendata.com
s.cnmilf.com
Updated Jul 31, 2025
Cite
U.S Department of Transportation (2025). Form PDF Generator [Dataset]. https://odgavaprod.ogopendata.com/dataset/form-pdf-generator
Dataset updated
Jul 31, 2025
Dataset provided by
Federal Railroad Administration: http://www.fra.dot.gov/
Authors
U.S Department of Transportation
Description
This is the landing page for generating PDF reports for Form 6180.54 Rail Equipment Accident/Incident, Form 6180.55a Injury/Illness [Casualty], Form 6180.57 Highway-Rail Grade Crossing Accident/Incident and Form 6180.71 Crossing Inventory.
PDF Example
hub.arcgis.com
Updated Jul 9, 2014
Cite
Esri Canada - Atlantic Region (2014). PDF Example [Dataset]. https://hub.arcgis.com/documents/esrica-atlantic::pdf-example/about?path=
Dataset updated
Jul 9, 2014
Dataset provided by
Esri: http://esri.com/
Authors
Esri Canada - Atlantic Region
Description
Demo of a PDF document
JuVer Contract Corpus
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Dec 12, 2022
Cite
Frieda Josi; Jean Charbonnier; Christian Wartena (2022). JuVer Contract Corpus [Dataset]. http://doi.org/10.5281/zenodo.7425489
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.7425489
Dataset updated
Dec 12, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Frieda Josi; Jean Charbonnier; Christian Wartena
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This corpus consists of 2,110 PDF files and 2,110 XML files with the text extracted from the PDF files. All PDF files are contracts in German that are publicly available on the internet. Most of these contracts are from the city governments of Hamburg and Bremen and were collected from the websites http://suche.transparenz.hamburg.de/dataset?q=vertrag&esq_title=&check_all_ and https://www.transparenz.bremen.de.

In the XML files, the texts are segmented into sentences. Each sentence also has some additional information on its frequency of use in the corpus.

The root of each XML file is the element document, which has a reference to the original PDF in an attribute. A document is divided into pages. Pages, in turn, consist of the elements heading and sentence. Each sentence has two identifiers: sid for the sentence and cid for the cluster it belongs to. Sentences with the same sentence identifier are identical. Sentences with the same cluster identifier are very similar but not necessarily identical. Sentences were clustered with single-link clustering based on character trigram overlap.
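The clustering idea described above can be illustrated with a small Python sketch. It is not the corpus authors' implementation: the Jaccard measure used for "overlap", the 0.6 threshold, and the function names are assumptions chosen for illustration. Single-link clustering places two sentences in the same cluster whenever a chain of sufficiently similar pairs connects them, which a union-find structure captures directly.

```python
def char_trigrams(text):
    """Set of character trigrams of a sentence."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def overlap(a, b):
    """Jaccard overlap of two trigram sets (an assumed similarity measure)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def single_link_clusters(sentences, threshold=0.6):
    """Return one cluster label per sentence using single-link merging."""
    parent = list(range(len(sentences)))

    def find(i):
        # Follow parent pointers to the cluster root, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    grams = [char_trigrams(s) for s in sentences]
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            # One similar pair is enough to merge two clusters (single link).
            if overlap(grams[i], grams[j]) >= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive ids in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(sentences))]


# Hypothetical sentences, not taken from the corpus.
sentences = [
    "Der Vertrag tritt am 1. Januar in Kraft.",
    "Der Vertrag tritt am 1. Februar in Kraft.",
    "Die Laufzeit beträgt zwei Jahre.",
]
labels = single_link_clusters(sentences)
print(labels)
```

The pairwise loop is quadratic in the number of sentences, so a corpus of this size would need an index or blocking step in practice.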

The corpus consists of 106,539 (non-unique) sentences and 3,635,371 tokens, including punctuation.
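For readers who want to work with the XML layout described in this record, the following Python sketch reads a document and groups sentence identifiers by cluster. The element names document, page, heading, and sentence and the attributes sid and cid come from the description. The sample content, the identifier values, and the attribute name `source` for the PDF reference are hypothetical, since the description does not name that attribute.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the "source" attribute name and all values are assumptions.
sample_xml = """
<document source="example.pdf">
  <page>
    <heading>Vertrag</heading>
    <sentence sid="s1" cid="c1">Der Vertrag tritt am 1. Januar in Kraft.</sentence>
    <sentence sid="s2" cid="c1">Der Vertrag tritt am 1. Februar in Kraft.</sentence>
  </page>
  <page>
    <sentence sid="s1" cid="c1">Der Vertrag tritt am 1. Januar in Kraft.</sentence>
    <sentence sid="s3" cid="c2">Die Laufzeit beträgt zwei Jahre.</sentence>
  </page>
</document>
"""

root = ET.fromstring(sample_xml.strip())

# Map each cluster identifier to the set of sentence identifiers it contains.
clusters = defaultdict(set)
for page in root.iter("page"):
    for sentence in page.iter("sentence"):
        clusters[sentence.get("cid")].add(sentence.get("sid"))

for cid, sids in sorted(clusters.items()):
    print(cid, sorted(sids))
```

Grouping by cid rather than sid collects near-duplicate clauses, which is useful for spotting boilerplate wording that recurs with small variations across contracts.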
PDF Merge Software Report
archivemarketresearch.com
doc, pdf, ppt
Updated Feb 14, 2025
Cite
Archive Market Research (2025). PDF Merge Software Report [Dataset]. https://www.archivemarketresearch.com/reports/pdf-merge-software-25717
Available download formats: doc, pdf, ppt
Dataset updated
Feb 14, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Market Overview

The global PDF Merge Software market is projected to reach USD XXX million by 2033, growing at a CAGR of XX% during the forecast period (2025-2033). The rising demand for efficient document management solutions, coupled with the increasing adoption of digital workflows, is driving the market growth. The ability of PDF Merge Software to combine multiple PDF files into a single cohesive document, simplifying editing, sharing, and storage, has made it indispensable for individuals and businesses alike.

Key Drivers and Trends

Key drivers propelling the market include the increasing popularity of cloud-based PDF Merge Software, offering greater accessibility and collaboration options. The adoption of mobile devices and the proliferation of remote work models further fuel demand for solutions that enable seamless document merging on any platform. Additionally, the growing awareness of data security and compliance regulations is driving the adoption of secure and compliant PDF Merge Software solutions.

Trends shaping the market include the integration of artificial intelligence (AI) and machine learning (ML) technologies to automate document merging tasks, enhancing accuracy and efficiency. The emergence of advanced features, such as drag-and-drop functionality and real-time collaboration tools, is also contributing to the market's growth prospects.