100+ datasets found

d
Facilities Listing and Related Cost Documentation Example Template
catalog.data.gov
data.virginia.gov
Updated Sep 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Administration for Children and Families (2025). Facilities Listing and Related Cost Documentation Example Template [Dataset]. https://catalog.data.gov/dataset/facilities-listing-and-related-cost-documentation-example-template
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
Administration for Children and Families
Description
ACF Agency Wide resource Metadata-only record linking to the original dataset. Open original dataset below.
Company Documents Dataset
kaggle.com
zip
Updated May 23, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
Explore at:
zip(9789538 bytes)Available download formats
Dataset updated
May 23, 2024
Authors
Ayoub Cherguelaine
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview

This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

Dataset Content

PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

The document types are:

Invoices: Detailed records of transactions between a buyer and a seller.

Inventory Reports: Records of inventory levels, including items in stock and units sold.

Purchase Orders: Requests made by a buyer to a seller to purchase products or services.

Shipping Orders: Instructions for the delivery of goods to specified recipients.

Example Entries

Here are a few example entries from the CSV file:

Shipping Order:

Order ID: 10718

Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."

Word Count: 120

Invoice:

Order ID: 10707

Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."

Word Count: 66

Purchase Order:

Order ID: 10892

Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."

Word Count: 26

Applications

This dataset can be used for:

Text Classification: Train models to classify documents into their respective categories.

Information Extraction: Extract specific fields and details from the documents.

Document Clustering: Group similar documents together based on their content.

OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
Meta data and supporting documentation
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
Explore at:
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agencyhttp://www.epa.gov/
Description
We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available on line here: https://github.com/warrenjl/SpGPCW. Format: Abstract The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. File format: R workspace file. Metadata (including data dictionary) • y: Vector of binary responses (1: preterm birth, 0: control) • x: Matrix of covariates; one row for each simulated individual • z: Matrix of standardized pollution exposures • n: Number of simulated individuals • m: Number of exposure time periods (e.g., weeks of pregnancy) • p: Number of columns in the covariate design matrix • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate). This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
Radio Science Documentation Bundle - Dataset - NASA Open Data Portal
data.nasa.gov
Updated Mar 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). Radio Science Documentation Bundle - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/radio-science-documentation-bundle
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity suffient for use -- typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material -- usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.
V
Real Property Listing and Related Cost Documentation Example
data.virginia.gov
catalog.data.gov
html
Updated Sep 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Administration for Children and Families (2025). Real Property Listing and Related Cost Documentation Example [Dataset]. https://data.virginia.gov/dataset/real-property-listing-and-related-cost-documentation-example
Explore at:
htmlAvailable download formats
Dataset updated
Sep 5, 2025
Dataset provided by
Administration for Children and Families
Description
ACF Agency Wide resource

Metadata-only record linking to the original dataset. Open original dataset below.
Z
The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation...
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Jan 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shichao Wu; Connor Weaving (2023). The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation example [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7078807
Explore at:
Dataset updated
Jan 2, 2023
Dataset provided by
University of Portsmouth
AEI Hannover
Authors
Shichao Wu; Connor Weaving
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation example, most of them are generated from LDC-Sangria dataset.
Data Policy Templates
fsm-data.sprep.org
pacific-data.sprep.org
+13more
docx
Updated Feb 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Secretariat of the Pacific Regional Environment Programme (2025). Data Policy Templates [Dataset]. https://fsm-data.sprep.org/dataset/data-policy-templates
Explore at:
docx(39231), docx(68313), docx(28279)Available download formats
Dataset updated
Feb 20, 2025
Dataset provided by
Pacific Regional Environment Programmehttps://www.sprep.org/
License
Public Domain Mark 1.0https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
Pacific Region
Description
This dataset contains templates of policies and MoU's on data sharing. You can download the Word-templates and adapt the documents to your national context.
Data from: Radio Science Documentation Bundle
s.cnmilf.com
catalog.data.gov
Updated Aug 22, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Aeronautics and Space Administration (2025). Radio Science Documentation Bundle [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/radio-science-documentation-bundle-dec9d
Explore at:
Dataset updated
Aug 22, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity suffient for use -- typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material -- usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.
OCR image data for Thai documents
kaggle.com
zip
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Appen Limited (2025). OCR image data for Thai documents [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-for-thai-documents
Explore at:
zip(26285828 bytes)Available download formats
Dataset updated
Jun 25, 2025
Authors
Appen Limited
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
如需完整数据集或了解更多，请发邮件至commercialproduct@appen.com For the complete dataset or more, please email commercialproduct@appen.com

The dataset product can be used in many AI pilot projects and supplement production models with other data. It can improve the model performance and be cost-effectiveness. Dataset is an excellent solution when time and budget is limited. Appen database team can provide a large number of database products, such as ASR, TTS, video, text, image. At the same time, we are also constantly building new datasets to expand resources. Database team always strive to deliver as soon as possible to meet the needs of the global customers. This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy rate requirement is over 99% (both position and content are correct). The images include the following categories: - RECEIPT - IDCARD - TRADE - TABLE - WHITEBOARD - NEWSPAPER - THESIS - CARD - NOTE - CONTRACT - BOOKCONTENT - HANDWRITING

Data Specification Usage Cases Image label recognition training Collecting device Mobile phone / Camera Collecting environment Multiple lights environments

Database Name Category Quantity

Korean Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1012 TABLE 512 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 499 CONTRACT 501 BOOKCONTENT 500 TOTAL 7,024

Vietnamese Document OCR Images

RECEIPT 337 IDCARD 100 TRADE 227 TABLE 100 WHITEBOARD 111 NEWSPAPER 100 THESIS 100 CARD 100 NOTE 100 CONTRACT 105 BOOKCONTENT 700 TOTAL 2,080

Spanish Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1000 TABLE 500 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7000

French Document OCR Images

RECEIPT 300 IDCARD 100 TRADE 200 TABLE 100 WHITEBOARD 100 NEWSPAPER 100 THESIS 103 CARD 100 NOTE 100 CONTRACT 100 BOOKCONTENT 700 TOTAL 2003

Thai Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1000 TABLE 537 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7037

Japanese Document OCR Images

RECEIPT 1586 IDCARD 500 TRADE 1000 TABLE 552 WHITEBOARD 500 NEWSPAPER 500 THESIS 509 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7147

Indonesian Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1003 TABLE 500 WHITEBOARD 501 NEWSPAPER 502 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7006

Tamil Document OCR Images

RECEIPT 356 IDCARD 98 TRADE 475 TABLE 532 WHITEBOARD 501 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 501 CONTRACT 500 BOOKCONTENT 500 TOTAL 4963

Burmese Document OCR Images

RECEIPT 300 IDCARD 100 TRADE 200 TABLE 117 WHITEBOARD 110 NEWSPAPER 108 THESIS 102 CARD 100 NOTE 120 CONTRACT 100 BOOKCONTENT 761 TOTAL 2118

English Handwritten Datasets HANDWRITING 2278 Chinese Handwritten Datasets HANDWRITING 11118

Information provided by database

Data Format：. JPG
t
Data from: Data Dictionary Template
data.tempe.gov
data-academy.tempe.gov
+8more
Updated Jun 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Tempe (2020). Data Dictionary Template [Dataset]. https://data.tempe.gov/documents/f97e93ac8d324c71a35caf5a295c4c1e
Explore at:
Dataset updated
Jun 5, 2020
Dataset authored and provided by
City of Tempe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Dictionary template for Tempe Open Data.
OCR image data of French document type
kaggle.com
zip
Updated Jun 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Appen Limited (2025). OCR image data of French document type [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-french-document-type
Explore at:
zip(22416674 bytes)Available download formats
Dataset updated
Jun 25, 2025
Authors
Appen Limited
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Area covered
French
Description
如需完整数据集或了解更多，请发邮件至commercialproduct@appen.com For the complete dataset or more, please email commercialproduct@appen.com

The dataset product can be used in many AI pilot projects and supplement production models with other data. It can improve the model performance and be cost-effectiveness. Dataset is an excellent solution when time and budget is limited. Appen database team can provide a large number of database products, such as ASR, TTS, video, text, image. At the same time, we are also constantly building new datasets to expand resources. Database team always strive to deliver as soon as possible to meet the needs of the global customers. This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy rate requirement is over 99% (both position and content are correct). The images include the following categories: - RECEIPT - IDCARD - TRADE - TABLE - WHITEBOARD - NEWSPAPER - THESIS - CARD - NOTE - CONTRACT - BOOKCONTENT - HANDWRITING

Data Specification Usage Cases Image label recognition training Collecting device Mobile phone / Camera Collecting environment Multiple lights environments

Database Name Category Quantity

Korean Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1012 TABLE 512 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 499 CONTRACT 501 BOOKCONTENT 500 TOTAL 7,024

Vietnamese Document OCR Images

RECEIPT 337 IDCARD 100 TRADE 227 TABLE 100 WHITEBOARD 111 NEWSPAPER 100 THESIS 100 CARD 100 NOTE 100 CONTRACT 105 BOOKCONTENT 700 TOTAL 2,080

Spanish Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1000 TABLE 500 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7000

French Document OCR Images

RECEIPT 300 IDCARD 100 TRADE 200 TABLE 100 WHITEBOARD 100 NEWSPAPER 100 THESIS 103 CARD 100 NOTE 100 CONTRACT 100 BOOKCONTENT 700 TOTAL 2003

Thai Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1000 TABLE 537 WHITEBOARD 500 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7037

Japanese Document OCR Images

RECEIPT 1586 IDCARD 500 TRADE 1000 TABLE 552 WHITEBOARD 500 NEWSPAPER 500 THESIS 509 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7147

Indonesian Document OCR Images

RECEIPT 1500 IDCARD 500 TRADE 1003 TABLE 500 WHITEBOARD 501 NEWSPAPER 502 THESIS 500 CARD 500 NOTE 500 CONTRACT 500 BOOKCONTENT 500 TOTAL 7006

Tamil Document OCR Images

RECEIPT 356 IDCARD 98 TRADE 475 TABLE 532 WHITEBOARD 501 NEWSPAPER 500 THESIS 500 CARD 500 NOTE 501 CONTRACT 500 BOOKCONTENT 500 TOTAL 4963

Burmese Document OCR Images

RECEIPT 300 IDCARD 100 TRADE 200 TABLE 117 WHITEBOARD 110 NEWSPAPER 108 THESIS 102 CARD 100 NOTE 120 CONTRACT 100 BOOKCONTENT 761 TOTAL 2118

English Handwritten Datasets HANDWRITING 2278 Chinese Handwritten Datasets HANDWRITING 11118

Information provided by database

Data Format：. JPG
Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample...
healthdata.gov
csv, xlsx, xml
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample Data Available to the Public - efue-2hf6 - Archive Repository [Dataset]. https://healthdata.gov/dataset/Temporary-Assistance-for-Needy-Families-TANF-Data-/k3se-sbh9
Explore at:
xml, xlsx, csvAvailable download formats
Dataset updated
Jul 25, 2023
Description
This dataset tracks the updates made on the dataset "Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample Data Available to the Public" as a repository for previous versions of the data and metadata.
g
Real Property Listing and Related Cost Documentation Example | gimi9.com
gimi9.com
Updated Sep 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Real Property Listing and Related Cost Documentation Example | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_real-property-listing-and-related-cost-documentation-example/
Explore at:
Dataset updated
Sep 10, 2025
Description
🇺🇸 미국
t
Metadata Form Template
data-academy.tempe.gov
data.tempe.gov
+8more
Updated Jun 5, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
City of Tempe (2020). Metadata Form Template [Dataset]. https://data-academy.tempe.gov/documents/c450d13c28ed4b1888ed6ab9d0363473
Explore at:
Dataset updated
Jun 5, 2020
Dataset authored and provided by
City of Tempe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Metadata form template for Tempe Open Data.
The TDI data and corresponding PSD files for PyCBC LISA documentation...
zenodo.org
bin, txt
Updated Jan 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shichao Wu; Shichao Wu; Connor Weaving; Connor Weaving (2023). The TDI data and corresponding PSD files for PyCBC LISA documentation example [Dataset]. http://doi.org/10.5281/zenodo.7433487
Explore at:
txt, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7433487
Dataset updated
Jan 1, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Shichao Wu; Shichao Wu; Connor Weaving; Connor Weaving
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The TDI data and corresponding PSD files for PyCBC LISA documentation example, generated from LDC-Sangria dataset.
R
Document Element Detection Dataset
universe.roboflow.com
zip
Updated Nov 17, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cheng Fei (2021). Document Element Detection Dataset [Dataset]. https://universe.roboflow.com/cheng-fei/document-element-detection
Explore at:
zipAvailable download formats
Dataset updated
Nov 17, 2021
Dataset authored and provided by
Cheng Fei
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Document Elements Bounding Boxes
Description
Here are a few use cases for this project:

Automated Document Classification: The 'document element detection' model can be used by businesses in automating their document management systems. By identifying various elements, the model could classify documents into categories (e.g., invoices, reports, forms) for easier retrieval and storage.

Accessibility Technology: This model could be incorporated into software that aids visually impaired or dyslexic individuals. By identifying and classifying different elements in a document, the software could use text-to-speech functionality for reading documents aloud.

Data Extraction and Analysis: Organizations often need to extract specific data elements from documents, such as tables or graph information, for analysis. The model could be trained to isolate these areas for easier extraction and analysis, thus improving data-driven decision-making.

Quality Assurance: For publishers or printers, the model can be used to identify unwanted elements or inconsistencies (like misplaced graphs, irregular tables) in a document before it goes to print, helping in maintaining the quality of publication.

Content Creation Software: In applications like automated resume or report building, the 'document element detection' model can be employed to identify where certain elements (image, table, text) are commonly placed, which can then be used to create professional, standardized templates.

Note: The given example of a 'man in a suit and tie' appears to be unrelated to the use of a document element detection model, as it seems more applicable to a model designed to identify or classify elements within portrait photographs or fashion-related applications.
o
Templates for developing and versioning data standards and reporting formats...
osti.gov
search.dataone.org
+1more
Updated Dec 31, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Templates for developing and versioning data standards and reporting formats using GitHub [Dataset]. http://doi.org/10.15485/1780564
Explore at:
Unique identifier
https://doi.org/10.15485/1780564
Dataset updated
Dec 31, 2020
Dataset provided by
Environmental Systems Science Data Infrastructure for a Virtual Ecosystem
Environmental System Science Data Infrastructure for a Virtual Ecosystem
U.S. DOE > Office of Science > Biological and Environmental Research (BER)
Description
This data package contains three templates that can be used for creating README files and Issue Templates, written in the markdown language, that support community-led data reporting formats. We created these templates based on the results of a systematic review (see related references) that explored how groups developing data standard documentation use the Version Control platform GitHub, to collaborate on supporting documents. Based on our review of 32 GitHub repositories, we make recommendations for the content of README Files (e.g., provide a user license, indicate how users can contribute) and so 'README_template.md' includes headings for each section. The two issue templates we include ('issue_template_for_all_other_changes.md' and 'issue_template_for_documentation_change.md') can be used in a GitHub repository to help structure user-submitted issues, or can be modified to suit the needs of data standard developers. We used these templates when establishing ESS-DIVE's community space on GitHub (https://github.com/ess-dive-community) that includes documentation for community-led data reporting formats. We also include file-level metadata 'flmd.csv' that describes the contents of each file within this data package. Lastly, the temporal range that we indicate in our metadata is the time range during which we searched for data standards documented on GitHub.
OCR Document Text Recognition Dataset
kaggle.com
zip
Updated Sep 7, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Unique Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/discussion
Explore at:
zip(32330434 bytes)Available download formats
Dataset updated
Sep 7, 2023
Authors
Unique Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
OCR Text Detection in the Documents Object Detection dataset

The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F6986071a88d8a9829fee98d5b49d9ff8%2FMacBook%20Air%20-%201%20(1).png?generation=1691059158337136&alt=media" alt="">

Dataset structure

images - contains of original images of documents

boxes - includes bounding box labeling for the original images

annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

Data Format

Each image from images folder is accompanied by an XML-annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

Labels for the text:

"Text Title" - corresponds to titles, the box is red

"Text Paragraph" - corresponds to paragraphs of text, the box is blue

"Table" - corresponds to the table, the box is green

"Handwritten" - corresponds to handwritten text, the box is purple

Example of XML file structure

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12421376%2F38e02db515561a30e29faca9f5b176b0%2Fcarbon.png?generation=1691058761924879&alt=media" alt="">

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text
FDA Drug Label Data
kaggle.com
zip
Updated Jun 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeff Lin (2025). FDA Drug Label Data [Dataset]. https://www.kaggle.com/datasets/jefflin97/fda-guidelines-data
Explore at:
zip(239522541 bytes)Available download formats
Dataset updated
Jun 17, 2025
Authors
Jeff Lin
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
FDA Monoclonal Antibody Regulatory Dataset

About the Dataset

This dataset aggregates comprehensive regulatory documentation and resources from the U.S. Food and Drug Administration (FDA), specifically related to monoclonal antibodies (mAbs). It provides structured access to critical FDA filings, clinical trial documentation, and drug labels, serving as an essential resource for regulatory analysis, clinical research, and AI-driven applications.

Contents

The dataset comprises:

FDA Documentation

New Drug Applications (NDA) submissions and approval summaries.

Investigational New Drug (IND) filings, including clinical and preclinical data.

International Council for Harmonisation (ICH) guidance documents relevant to monoclonal antibody regulation.

Clinical Trial Documentation

Protocols, study designs, and outcome reports from clinical trials.

Regulatory correspondence and approval notices.

Drug Labels

Structured drug labeling information for 180 approved monoclonal antibodies, detailing indications, dosages, adverse reactions, warnings, and clinical pharmacology.

Potential Use Cases

This dataset supports various research and analytical tasks, including:

Regulatory compliance analysis: Identify key elements and benchmarks for successful FDA approvals.

Clinical trial design optimization: Inform trial protocols using historical approval data.

Natural Language Processing (NLP) applications: Enable text classification, information extraction, summarization, and entity recognition tasks.

Safety and efficacy research: Facilitate comparative analysis of drug labels and clinical outcomes.

Intended Audience

Regulatory professionals and pharmaceutical industry researchers.

Biomedical data scientists and informaticians.

NLP and machine learning practitioners focused on biomedical applications.

Data Format

All documents and labels are provided in machine-readable PDF format that can be parsed using PyPDF, but some drug labels may be a faxed document in a PDF, which may require OCR to parse via Tesseract.

Acknowledgments

This dataset utilizes publicly available information provided by the FDA and other regulatory bodies.

Citation

If you use this dataset in your research or applications, please provide an appropriate citation referencing this dataset.
Text Document Classification Dataset
kaggle.com
zip
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
Explore at:
zip(1941393 bytes)Available download formats
Dataset updated
Dec 4, 2023
Authors
sunil thite
Description
This is text document classification dataset which contains 2225 text data and five categories of documents. Five categories are politics, sport, tech, entertainment and business. We can use this dataset for documents classification and document clustering.

About Dataset - Dataset contains two features text and label. - No. of Rows : 2225 - No. of Columns : 2

Text: It contains different categories of text data Label: It contains labels for five different categories : 0,1,2,3,4

Politics = 0

Sport = 1

Technology = 2

Entertainment =3

Business = 4

Facebook

Twitter

Click to copy link

Link copied

Cite

Administration for Children and Families (2025). Facilities Listing and Related Cost Documentation Example Template [Dataset]. https://catalog.data.gov/dataset/facilities-listing-and-related-cost-documentation-example-template

Facilities Listing and Related Cost Documentation Example Template

Explore at:

Dataset updated

Sep 7, 2025

Dataset provided by

Administration for Children and Families

Description

ACF Agency Wide resource Metadata-only record linking to the original dataset. Open original dataset below.

Clear search

Close search

Google apps

Main menu

Facilities Listing and Related Cost Documentation Example Template

Company Documents Dataset

Overview

Dataset Content

Example Entries

Shipping Order:

Invoice:

Purchase Order:

Applications

Meta data and supporting documentation

Radio Science Documentation Bundle - Dataset - NASA Open Data Portal

Real Property Listing and Related Cost Documentation Example

The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation...

Data Policy Templates

Data from: Radio Science Documentation Bundle

OCR image data for Thai documents

Korean Document OCR Images

Vietnamese Document OCR Images

Spanish Document OCR Images

French Document OCR Images

Thai Document OCR Images

Japanese Document OCR Images

Indonesian Document OCR Images

Tamil Document OCR Images

Burmese Document OCR Images

Data from: Data Dictionary Template

OCR image data of French document type

Korean Document OCR Images

Vietnamese Document OCR Images

Spanish Document OCR Images

French Document OCR Images

Thai Document OCR Images

Japanese Document OCR Images

Indonesian Document OCR Images

Tamil Document OCR Images

Burmese Document OCR Images

Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample...

Real Property Listing and Related Cost Documentation Example | gimi9.com

Metadata Form Template

The TDI data and corresponding PSD files for PyCBC LISA documentation...

Document Element Detection Dataset

Templates for developing and versioning data standards and reporting formats...

OCR Document Text Recognition Dataset

OCR Text Detection in the Documents Object Detection dataset

💴 For Commercial Usage: To discuss your requirements, learn about the price and buy the dataset, leave a request on TrainingData to buy the dataset

Dataset structure

Data Format

Labels for the text:

Example of XML file structure

Text Detection in the Documents might be made in accordance with your requirements.

💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

TrainingData provides high-quality data annotation tailored to your needs

FDA Drug Label Data

FDA Monoclonal Antibody Regulatory Dataset

About the Dataset

Contents

Potential Use Cases

Intended Audience

Data Format

Acknowledgments

Citation

Text Document Classification Dataset

Facilities Listing and Related Cost Documentation Example Template