asoria/pdf-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
PDF is a dataset for object detection tasks - it contains Text annotations for 1,009 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Sagarjha0316
Released under MIT
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
🍃 MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
🍃 MINT-1T is an open-source Multimodal INTerleaved dataset with 1 trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. 🍃 MINT-1T is designed to facilitate research in multimodal pretraining. 🍃 MINT-1T is created by a team from the University of Washington in… See the full description on the dataset page: https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2024-18.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
PDF Figure Detection is a dataset for object detection tasks - it contains Figures annotations for 264 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About
IUST-PDFCorpus is a large set of various PDF files, aimed at building and manipulating new PDF files to test, debug, and improve the quality of real-world PDF readers such as Adobe Acrobat Reader, Foxit Reader, Nitro Reader, and MuPDF. IUST-PDFCorpus contains 6,141 complete PDF files of various sizes and contents. The corpus includes 507,299 PDF data objects and 151,132 PDF streams extracted from the set of complete files. Data objects are in a textual format while streams have a binary format, and together they make up PDF files. In addition, we attached the code coverage of each PDF file when it was used as test data in testing MuPDF. The coverage info is available in both binary and XML formats.

PDF data objects are organized into three categories. The first category contains all objects in the corpus. Each file in this category holds all PDF objects extracted from one PDF file without any preprocessing. The second category is a dataset made by merging all files in the first category with some preprocessing. The dataset is split into train, test, and validation sets, which is useful for machine learning tasks. The third category is the same as the second but smaller, for use in the development stage of different algorithms.

IUST-PDFCorpus is collected from various sources, including the Mozilla PDF.js open test corpus, some PDFs that are used in AFL as initial seeds, and PDFs gathered from existing e-books, software documents, and the public web in different languages. We first introduced IUST-PDFCorpus in our paper “Format-aware learn&fuzz: deep test data generation for efficient fuzzing”, where we used it to build an intelligent file format fuzzer called IUST-DeepFuzz. For the time being, we are gathering other file formats to automate testing of related applications.
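To make the distinction between textual data objects and binary streams concrete, here is a minimal sketch of how the two could be separated from raw PDF bytes. It is not the corpus authors' extraction code: the function name `split_objects_and_streams` is hypothetical, and the regular expressions are a deliberately naive assumption that relies only on the standard `obj`/`endobj` and `stream`/`endstream` keywords of the PDF format. A robust extractor would honor the `/Length` entry instead, since binary stream data may itself contain these keywords.

```python
import re
from typing import List, Tuple

# Indirect objects look like "12 0 obj ... endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Stream data sits between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def split_objects_and_streams(pdf_bytes: bytes) -> Tuple[List[bytes], List[bytes]]:
    """Return (textual object bodies, binary stream payloads) found in pdf_bytes."""
    objects: List[bytes] = []
    streams: List[bytes] = []
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        stream = STREAM_RE.search(body)
        if stream:
            # Keep the binary payload separately and drop it from the textual part.
            streams.append(stream.group(1))
            body = body[: stream.start()]
        objects.append(body.strip())
    return objects, streams


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
        b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
        b"%%EOF"
    )
    objs, strms = split_objects_and_streams(sample)
    print(len(objs), "objects,", len(strms), "streams")
```

On the tiny hand-written sample, the object dictionaries end up in the first list and the stream payload in the second, mirroring the object/stream split the corpus description mentions.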
Citing IUST-PDFCorpus
If IUST-PDFCorpus is used in your work in any form please cite the relevant paper: https://arxiv.org/abs/1812.09961v2
https://creativecommons.org/publicdomain/zero/1.0/
DocuMente is an innovative capstone project designed to harness the power of Google’s LLM for intelligent document comprehension and interactive Q&A. The system processes large volumes of documents, extracts contextual insights, and enables users to query a vast repository of content in real time.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PDF Solutions reported 539 employees for its fiscal year ending in December 2024. Data for PDF Solutions | PDFS - Employees Total Number, including historical values, tables and charts, was last updated by Trading Economics in June 2025.
https://www.etalab.gouv.fr/licence-ouverte-open-licence
Text extracted from the PDFs found on data.gouv.fr
Description
This dataset contains the text extracted from 6602 files with the pdf extension in the data.gouv.fr resource catalog. The dataset contains only PDFs of 20 Mb or less that are still available at the indicated URL. The extraction was performed with PDFBox via its Python wrapper python-pdfbox. PDFs that are images (scans, maps, etc.) are detected with a simple heuristic: if, after conversion to text with pdfbox, the size of the resulting file is less than 20 bytes, the PDF is considered to be an image. In that case, OCR is applied, using Tesseract via its Python wrapper pyocr. The result is a set of txt files derived from the PDFs, sorted by organization (the organization that published the resource). There are 175 organizations in this dataset, hence 175 folders. The name of each file corresponds to the string {id-du-dataset}--{id-de-la-ressource}.txt.
Input
The data.gouv.fr resource catalog.
Output
Text files for each pdf-type resource found in the catalog that was converted successfully and satisfied the constraints above. The directory tree is as follows:

    .
    ├── ACTION_Nogent-sur-Marne
    │   ├── 53ba55c4a3a729219b7beae2--0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a.txt
    │   ├── 53ba55c4a3a729219b7beae2--1ffcb2cb-2355-4426-b74a-946dadeba7f1.txt
    │   ├── 53ba55c4a3a729219b7beae2--297a0466-daaa-47f4-972a-0d5bea2ab180.txt
    │   ├── 53ba55c4a3a729219b7beae2--3ac0a881-181f-499e-8b3f-c2b0ddd528f7.txt
    │   ├── 53ba55c4a3a729219b7beae2--3ca6bd8f-05a6-469a-a36b-afda5a7444a4.txt
    │   └── ...
    ├── Aeroport_La_Rochelle-Ile_de_Re
    ├── Agence_de_services_et_de_paiement_ASP
    ├── Agence_du_Numerique
    ├── ...
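The image-detection heuristic and the file naming scheme described above can be sketched as follows. This is an illustrative sketch, not the published extraction scripts: all function names are hypothetical, and the PDFBox and Tesseract steps are passed in as plain callables (`extract_text`, `run_ocr`) rather than calling python-pdfbox or pyocr directly, so no claim is made about those libraries' APIs. Only the 20-byte threshold and the `{id-du-dataset}--{id-de-la-ressource}.txt` pattern come from the description.

```python
from typing import Callable

# Threshold taken from the dataset description: a text output smaller than
# 20 bytes is treated as a sign that the PDF is really an image.
IMAGE_PDF_THRESHOLD_BYTES = 20


def output_name(dataset_id: str, resource_id: str) -> str:
    """Build the output file name: {id-du-dataset}--{id-de-la-ressource}.txt."""
    return f"{dataset_id}--{resource_id}.txt"


def looks_like_image_pdf(text: str, threshold: int = IMAGE_PDF_THRESHOLD_BYTES) -> bool:
    """Return True when the extracted text is too small to be a real text layer."""
    return len(text.encode("utf-8")) < threshold


def extract_with_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],  # hypothetical stand-in for the PDFBox step
    run_ocr: Callable[[str], str],       # hypothetical stand-in for the Tesseract step
) -> str:
    """Try direct text extraction first and fall back to OCR for image-like PDFs."""
    text = extract_text(pdf_path)
    if looks_like_image_pdf(text):
        text = run_ocr(pdf_path)
    return text


if __name__ == "__main__":
    # Fake extractors, used only to show the control flow.
    result = extract_with_fallback("scan.pdf", lambda p: "", lambda p: "text from OCR")
    print(result)
    print(output_name("53ba55c4a3a729219b7beae2", "0cf9f9cd-e398-4512-80de-5fd0e2d1cb0a"))
```

Passing the two extraction steps as callables keeps the heuristic testable in isolation. In a real pipeline they would wrap whichever PDF-to-text and OCR tools are in use.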
Distribution of texts [as of 20 May 2020]
The top 10 organizations with the largest number of documents are: [('Les_Lilas', 1294), ('Ville_de_Pirae', 1099), ('Region_Hauts-de-France', 592), ('Ressourcerie_datalocale', 297), ('NA', 268), ('CORBION', 244), ('Education_Nationale', 189), ('Incubateur_de_Services_Numeriques', 157), ('Ministere_des_Solidarites_et_de_la_Sante', 148), ('Communaute_dAgglomeration_Plaine_Vallee', 142)]
Their 2D overview is (HashFeatures+TruncatedSVD+t-SNE):
Code
The Python scripts used for this extraction are here.
Remarks
Due to the quality of the original PDFs (low-resolution scans, misaligned PDFs, ...) and the performance of the pdf-->txt conversion methods, the results can be very noisy.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Peter Staar
Released under CC0: Public Domain
https://dataintelo.com/privacy-and-policy
The PDF Editor Software market size is poised to witness significant growth from 2024 to 2032, with a projected CAGR of 11.5% during this period. In 2023, the global market size was valued at approximately USD 1.5 billion and is expected to reach USD 4.1 billion by 2032. This rapid expansion is driven by increasing digitalization, the rising need for efficient document management, and the growing adoption of electronic signatures in various sectors.
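The figures quoted above can be cross-checked with the standard compound annual growth rate formula. The sketch below is illustrative only: the function names are hypothetical, and it assumes nine compounding years between the 2023 base value and 2032, which the report does not state explicitly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value reached after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


if __name__ == "__main__":
    # Assumption: 2023 -> 2032 is treated as 9 compounding periods.
    implied = cagr(1.5, 4.1, 9)
    projected = project(1.5, 0.115, 9)
    print(f"implied CAGR: {implied:.1%}")
    print(f"projected 2032 size at 11.5%: USD {projected:.2f} billion")
```

Small gaps between the implied and the quoted rate are to be expected, since the report rounds its figures and does not give the 2024 value from which the forecast period starts.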
One of the primary growth factors contributing to this market surge is the ubiquitous adoption of digital documentation across industries. The shift from paper-based processes to digital solutions has been accelerated by the global move towards sustainability and efficiency. Enterprises and government bodies are increasingly deploying PDF editor software to streamline their document management processes, which significantly reduces operational costs and enhances productivity. Moreover, the integration of advanced features such as Optical Character Recognition (OCR) and AI-based editing tools in PDF editors has further fueled their adoption.
Another critical driver for the PDF Editor Software market is the rise in remote working and the demand for collaborative tools. The COVID-19 pandemic has prompted a permanent shift towards remote and hybrid work environments, necessitating efficient digital tools to manage and edit documents. PDF editor software has become indispensable for professionals working remotely, enabling seamless collaboration, editing, and sharing of documents in real-time. This trend is expected to continue, further propelling the demand for PDF editor software in the coming years.
The increasing demand for enhanced security features in document management systems is also a significant growth factor. With the rise in cyber threats and data breaches, organizations are prioritizing the security of their digital documents. PDF editor software that offers robust security features such as encryption, password protection, and secure sharing capabilities is witnessing higher adoption rates. This focus on security is particularly pronounced in sectors such as finance, healthcare, and government, where the confidentiality of documents is paramount.
Regionally, North America currently holds the largest market share and is expected to maintain its dominance throughout the forecast period. The region's advanced IT infrastructure, coupled with the high adoption rate of digital technologies among enterprises, drives this dominance. Furthermore, the presence of major PDF editor software providers in the region contributes to the sustained market growth. However, the Asia Pacific region is anticipated to register the highest CAGR due to the rapid digital transformation in emerging economies and increasing investments in IT infrastructure.
The PDF Editor Software market is segmented by components into software and services. The software segment dominates the market and is expected to maintain its lead throughout the forecast period. This segment includes standalone PDF editor applications as well as integrated solutions within larger document management systems. The continuous advancements in software features, such as enhanced user interfaces, cloud integration, and AI capabilities, are driving the adoption of PDF editor software. Additionally, the increasing availability of subscription-based pricing models has made these software solutions more accessible to a broader range of users.
On the other hand, the services segment, though smaller, plays a crucial role in the overall market. This includes various support services, such as implementation, training, and maintenance, which are essential for the effective utilization of PDF editor software. Managed services are also gaining traction, offering enterprises the convenience of outsourcing their document management needs. The rising complexity of digital document workflows and the need for customized solutions are further fueling the demand for professional services in this segment.
The integration of cloud services with PDF editor software is another noteworthy trend within the component segment. Cloud-based PDF editors offer several advantages, including easier accessibility, real-time collaboration, and automatic updates. These benefits are particularly appealing to small and medium enterprises (SMEs) that may lack the resources to maintain extensive IT infrastructure. As a result, the services segment is witnessing a growing demand for cloud management and support
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PDF Solutions reported earnings per share (EPS) of $0.21 for its fiscal quarter ending in March 2025. Data for PDF Solutions | PDFS - EPS Earnings Per Share, including historical values, tables and charts, was last updated by Trading Economics in June 2025.
https://www.marketresearchintellect.com/privacy-policy
The size and share of this market are categorized by Desktop PDF Readers (Free PDF Readers, Paid PDF Readers, Open Source PDF Readers, Enterprise PDF Readers, Portable PDF Readers), Mobile PDF Readers (iOS PDF Readers, Android PDF Readers, Windows Mobile PDF Readers, Cross-Platform PDF Readers, Feature-Rich Mobile PDF Readers), Web-Based PDF Readers (Browser Extensions, Cloud-Based PDF Readers, Collaborative PDF Readers, PDF Editors with Viewing Capabilities, API-Integrated PDF Readers), and geographical regions (North America, Europe, Asia-Pacific, South America, Middle-East and Africa).
cfahlgren1/test-pdf dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by PrabhatAgnihotri
Released under CC0: Public Domain
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Nguyễn Gia Bảo
Released under Apache 2.0
https://www.marketresearchforecast.com/privacy-policy
The global Enterprise PDF Document Solutions market is projected to reach USD 14.24 billion by 2030, exhibiting a CAGR of 8.5% during the forecast period. The expanding need for efficient document management solutions and the increasing adoption of cloud-based enterprise services are driving the market growth. PDF document solutions offer enhanced security features, improved collaboration capabilities, and real-time editing functionalities, making them a preferred choice for organizations seeking to streamline their document workflow. Key market trends include the rising adoption of mobile PDF solutions for remote work and the growing demand for automated PDF processing. Additionally, the integration of artificial intelligence (AI) and machine learning (ML) capabilities within PDF document solutions is expected to further enhance their efficiency and functionality. The market is characterized by the presence of established players such as Adobe and Nitro, as well as emerging vendors offering innovative solutions. The competitive landscape is expected to remain dynamic as companies invest in research and development to stay ahead of the curve. Regional growth drivers, key market segments, and company profiles are provided in the market research report.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PDF Solutions Sales - current values, historical data, forecasts, statistics, charts and economic calendar - May 2025. Data for PDF Solutions | Sales, including historical values, tables and charts, was last updated by Trading Economics in May 2025.
https://www.cognitivemarketresearch.com/privacy-policy
The Asia Pacific PDF reader software market was valued at USD 450.39 million in 2024 and will grow at a compound annual growth rate (CAGR) of 15.3% from 2024 to 2031. Increased remote work and virtual collaboration are expected to lift sales to USD 1201.5 million by 2031.
The following articles dealing with GOP data have been written and published so far. Please note: these articles have been published with free public access, while the full copyright policies of each journal apply.
Papers with limited access will be listed in the experiment 'gop_papers_lim_access'.