100+ datasets found
  1. Facilities Listing and Related Cost Documentation Example Template

    • catalog.data.gov
    • data.virginia.gov
    Updated Sep 7, 2025
    Cite
    Administration for Children and Families (2025). Facilities Listing and Related Cost Documentation Example Template [Dataset]. https://catalog.data.gov/dataset/facilities-listing-and-related-cost-documentation-example-template
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    Administration for Children and Families
    Description

    ACF Agency Wide resource

    Metadata-only record linking to the original dataset. Open original dataset below.

  2. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    + more versions
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Explore at:
    Available download formats: zip (9789538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
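
    As a minimal sketch of how the CSV described above might be inspected before training a classifier, the snippet below parses rows, recomputes a whitespace-based word count, and tallies documents per label. The column names `text` and `label`, the label strings, and the sample rows are assumptions for illustration; they are not taken from the actual file.

```python
import csv
import io
from collections import Counter

# Toy rows for illustration only; they are NOT taken from the dataset.
# The column names "text" and "label" are assumptions about the CSV layout.
SAMPLE_CSV = """text,label
"Order ID: 1 Ship Name: Example Co Ship City: Springfield",shipping_order
"Order ID: 2 Customer ID: ABCDE Order Date: 2020-01-01",invoice
"Order ID: 3 Order Date: 2020-01-02 Customer Name: Jane Doe",purchase_order
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts and add a whitespace-based word count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["word_count"] = len(row["text"].split())
    return rows


def label_distribution(rows):
    """Count how many documents carry each label."""
    return Counter(row["label"] for row in rows)


rows = load_rows(SAMPLE_CSV)
print(label_distribution(rows))
print([row["word_count"] for row in rows])
```

    Checking the label distribution first is a cheap way to spot class imbalance before choosing a classification model.
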
  3. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

    It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

    Format:

    Abstract

    The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability

    Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary)

    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
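
    The weekly standardization described above (subtract the week's median, divide by the week's IQR) can be sketched as follows. This is an illustration in Python, not the authors' R code; the function name, the toy values, and the quantile definition (the default method of `statistics.quantiles`) are assumptions, since the description does not say which IQR convention was used.

```python
import statistics


def standardize_by_week(exposures):
    """Standardize each column (week) as (value - median) / IQR.

    `exposures` is a list of rows, one per individual, each holding one
    exposure value per week. The quartiles come from statistics.quantiles
    with its default method; the quantile convention is an assumption.
    """
    n_weeks = len(exposures[0])
    result = [[0.0] * n_weeks for _ in exposures]
    for week in range(n_weeks):
        column = [row[week] for row in exposures]
        median = statistics.median(column)
        q1, _, q3 = statistics.quantiles(column, n=4)
        iqr = q3 - q1  # a constant column would give iqr == 0 and fail below
        for i, value in enumerate(column):
            result[i][week] = (value - median) / iqr
    return result


# Toy values for illustration only (4 individuals, 2 weeks).
z = standardize_by_week([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
print(z)
```

    After this transformation every week has median 0 and IQR 1, so the original exposure scale cannot be read back without the withheld medians and IQRs.
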

  4. Radio Science Documentation Bundle - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Radio Science Documentation Bundle - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/radio-science-documentation-bundle
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity sufficient for use -- typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material -- usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.

  5. Real Property Listing and Related Cost Documentation Example

    • data.virginia.gov
    • catalog.data.gov
    html
    Updated Sep 5, 2025
    Cite
    Administration for Children and Families (2025). Real Property Listing and Related Cost Documentation Example [Dataset]. https://data.virginia.gov/dataset/real-property-listing-and-related-cost-documentation-example
    Explore at:
    Available download formats: html
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    Administration for Children and Families
    Description

    ACF Agency Wide resource

    Metadata-only record linking to the original dataset. Open original dataset below.

  6. The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Jan 2, 2023
    Cite
    Shichao Wu; Connor Weaving (2023). The TDI data and PSD/sensitivity-related files for PyCBC LISA documentation example [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7078807
    Explore at:
    Dataset updated
    Jan 2, 2023
    Dataset provided by
    University of Portsmouth
    AEI Hannover
    Authors
    Shichao Wu; Connor Weaving
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The TDI data and PSD/sensitivity-related files for the PyCBC LISA documentation example. Most of them are generated from the LDC-Sangria dataset.

  7. Data Policy Templates

    • fsm-data.sprep.org
    • pacific-data.sprep.org
    • +13 more
    docx
    Updated Feb 20, 2025
    Cite
    Secretariat of the Pacific Regional Environment Programme (2025). Data Policy Templates [Dataset]. https://fsm-data.sprep.org/dataset/data-policy-templates
    Explore at:
    Available download formats: docx (39231), docx (68313), docx (28279)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Pacific Regional Environment Programme (https://www.sprep.org/)
    License

    Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Area covered
    Pacific Region
    Description

    This dataset contains templates of policies and MoUs on data sharing. You can download the Word templates and adapt the documents to your national context.

  8. Data from: Radio Science Documentation Bundle

    • s.cnmilf.com
    • catalog.data.gov
    Updated Aug 22, 2025
    Cite
    National Aeronautics and Space Administration (2025). Radio Science Documentation Bundle [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/radio-science-documentation-bundle-dec9d
    Explore at:
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity sufficient for use -- typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material -- usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.

  9. OCR image data for Thai documents

    • kaggle.com
    zip
    Updated Jun 25, 2025
    Cite
    Appen Limited (2025). OCR image data for Thai documents [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-for-thai-documents
    Explore at:
    Available download formats: zip (26285828 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Appen Limited
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    For the complete dataset or to learn more, please email commercialproduct@appen.com

    The dataset product can be used in many AI pilot projects and can supplement production models with other data. It can improve model performance and be cost-effective. A dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data. At the same time, we are constantly building new datasets to expand resources. The database team always strives to deliver as soon as possible to meet the needs of global customers.

    This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy rate requirement is over 99% (both position and content are correct). The images include the following categories: RECEIPT, IDCARD, TRADE, TABLE, WHITEBOARD, NEWSPAPER, THESIS, CARD, NOTE, CONTRACT, BOOKCONTENT, HANDWRITING.

    1. Data Specification

    • Usage cases: image label recognition training
    • Collecting device: mobile phone / camera
    • Collecting environment: multiple lighting environments

    Database Name                   RECEIPT  IDCARD  TRADE  TABLE  WHITEBOARD  NEWSPAPER  THESIS  CARD  NOTE  CONTRACT  BOOKCONTENT  TOTAL
    Korean Document OCR Images         1500     500   1012    512         500        500     500   500   499       501          500   7024
    Vietnamese Document OCR Images      337     100    227    100         111        100     100   100   100       105          700   2080
    Spanish Document OCR Images        1500     500   1000    500         500        500     500   500   500       500          500   7000
    French Document OCR Images          300     100    200    100         100        100     103   100   100       100          700   2003
    Thai Document OCR Images           1500     500   1000    537         500        500     500   500   500       500          500   7037
    Japanese Document OCR Images       1586     500   1000    552         500        500     509   500   500       500          500   7147
    Indonesian Document OCR Images     1500     500   1003    500         501        502     500   500   500       500          500   7006
    Tamil Document OCR Images           356      98    475    532         501        500     500   500   501       500          500   4963
    Burmese Document OCR Images         300     100    200    117         110        108     102   100   120       100          761   2118

    English Handwritten Datasets: HANDWRITING 2278
    Chinese Handwritten Datasets: HANDWRITING 11118

    1. Information provided by database
    2. Data format: JPG
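
    As a quick sanity check on the quantities listed above, the sketch below sums the per-category counts for each language and compares the result with the stated TOTAL. The numbers are copied from the listing; the variable and function names are illustrative.

```python
# Category order used for every list of counts below.
CATEGORIES = ["RECEIPT", "IDCARD", "TRADE", "TABLE", "WHITEBOARD", "NEWSPAPER",
              "THESIS", "CARD", "NOTE", "CONTRACT", "BOOKCONTENT"]

# Per-category image counts as listed in the description.
COUNTS = {
    "Korean":     [1500, 500, 1012, 512, 500, 500, 500, 500, 499, 501, 500],
    "Vietnamese": [337, 100, 227, 100, 111, 100, 100, 100, 100, 105, 700],
    "Spanish":    [1500, 500, 1000, 500, 500, 500, 500, 500, 500, 500, 500],
    "French":     [300, 100, 200, 100, 100, 100, 103, 100, 100, 100, 700],
    "Thai":       [1500, 500, 1000, 537, 500, 500, 500, 500, 500, 500, 500],
    "Japanese":   [1586, 500, 1000, 552, 500, 500, 509, 500, 500, 500, 500],
    "Indonesian": [1500, 500, 1003, 500, 501, 502, 500, 500, 500, 500, 500],
    "Tamil":      [356, 98, 475, 532, 501, 500, 500, 500, 501, 500, 500],
    "Burmese":    [300, 100, 200, 117, 110, 108, 102, 100, 120, 100, 761],
}

# TOTAL values as stated in the description.
STATED_TOTALS = {
    "Korean": 7024, "Vietnamese": 2080, "Spanish": 7000, "French": 2003,
    "Thai": 7037, "Japanese": 7147, "Indonesian": 7006, "Tamil": 4963,
    "Burmese": 2118,
}


def mismatched_totals(counts, stated):
    """Return {language: (computed_sum, stated_total)} wherever the two differ."""
    return {
        lang: (sum(values), stated[lang])
        for lang, values in counts.items()
        if sum(values) != stated[lang]
    }


print(mismatched_totals(COUNTS, STATED_TOTALS))  # prints {} (all totals agree)
```
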
  10. Data from: Data Dictionary Template

    • data.tempe.gov
    • data-academy.tempe.gov
    • +8 more
    Updated Jun 5, 2020
    Cite
    City of Tempe (2020). Data Dictionary Template [Dataset]. https://data.tempe.gov/documents/f97e93ac8d324c71a35caf5a295c4c1e
    Explore at:
    Dataset updated
    Jun 5, 2020
    Dataset authored and provided by
    City of Tempe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Dictionary template for Tempe Open Data.

  11. OCR image data of French document type

    • kaggle.com
    zip
    Updated Jun 25, 2025
    Cite
    Appen Limited (2025). OCR image data of French document type [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-french-document-type
    Explore at:
    Available download formats: zip (22416674 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Appen Limited
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    For the complete dataset or to learn more, please email commercialproduct@appen.com

    The dataset product can be used in many AI pilot projects and can supplement production models with other data. It can improve model performance and be cost-effective. A dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image data. At the same time, we are constantly building new datasets to expand resources. The database team always strives to deliver as soon as possible to meet the needs of global customers.

    This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including texts in various languages, special characters, and numbers. The accuracy rate requirement is over 99% (both position and content are correct). The images include the following categories: RECEIPT, IDCARD, TRADE, TABLE, WHITEBOARD, NEWSPAPER, THESIS, CARD, NOTE, CONTRACT, BOOKCONTENT, HANDWRITING.

    1. Data Specification

    • Usage cases: image label recognition training
    • Collecting device: mobile phone / camera
    • Collecting environment: multiple lighting environments

    Database Name                   RECEIPT  IDCARD  TRADE  TABLE  WHITEBOARD  NEWSPAPER  THESIS  CARD  NOTE  CONTRACT  BOOKCONTENT  TOTAL
    Korean Document OCR Images         1500     500   1012    512         500        500     500   500   499       501          500   7024
    Vietnamese Document OCR Images      337     100    227    100         111        100     100   100   100       105          700   2080
    Spanish Document OCR Images        1500     500   1000    500         500        500     500   500   500       500          500   7000
    French Document OCR Images          300     100    200    100         100        100     103   100   100       100          700   2003
    Thai Document OCR Images           1500     500   1000    537         500        500     500   500   500       500          500   7037
    Japanese Document OCR Images       1586     500   1000    552         500        500     509   500   500       500          500   7147
    Indonesian Document OCR Images     1500     500   1003    500         501        502     500   500   500       500          500   7006
    Tamil Document OCR Images           356      98    475    532         501        500     500   500   501       500          500   4963
    Burmese Document OCR Images         300     100    200    117         110        108     102   100   120       100          761   2118

    English Handwritten Datasets: HANDWRITING 2278
    Chinese Handwritten Datasets: HANDWRITING 11118

    1. Information provided by database
    2. Data format: JPG
  12. Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample...

    • healthdata.gov
    csv, xlsx, xml
    Updated Jul 25, 2023
    Cite
    (2023). Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample Data Available to the Public - efue-2hf6 - Archive Repository [Dataset]. https://healthdata.gov/dataset/Temporary-Assistance-for-Needy-Families-TANF-Data-/k3se-sbh9
    Explore at:
    Available download formats: xml, xlsx, csv
    Dataset updated
    Jul 25, 2023
    Description

    This dataset tracks the updates made on the dataset "Temporary Assistance for Needy Families (TANF):Data and Documentation:Sample Data Available to the Public" as a repository for previous versions of the data and metadata.

  13. Real Property Listing and Related Cost Documentation Example | gimi9.com

    • gimi9.com
    Updated Sep 10, 2025
    Cite
    (2025). Real Property Listing and Related Cost Documentation Example | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_real-property-listing-and-related-cost-documentation-example/
    Explore at:
    Dataset updated
    Sep 10, 2025
    Description

    🇺🇸 United States

  14. Metadata Form Template

    • data-academy.tempe.gov
    • data.tempe.gov
    • +8 more
    Updated Jun 5, 2020
    Cite
    City of Tempe (2020). Metadata Form Template [Dataset]. https://data-academy.tempe.gov/documents/c450d13c28ed4b1888ed6ab9d0363473
    Explore at:
    Dataset updated
    Jun 5, 2020
    Dataset authored and provided by
    City of Tempe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata form template for Tempe Open Data.

  15. The TDI data and corresponding PSD files for PyCBC LISA documentation...

    • zenodo.org
    bin, txt
    Updated Jan 1, 2023
    + more versions
    Cite
    Shichao Wu; Connor Weaving (2023). The TDI data and corresponding PSD files for PyCBC LISA documentation example [Dataset]. http://doi.org/10.5281/zenodo.7433487
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    Jan 1, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shichao Wu; Connor Weaving
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The TDI data and corresponding PSD files for the PyCBC LISA documentation example, generated from the LDC-Sangria dataset.

  16. Document Element Detection Dataset

    • universe.roboflow.com
    zip
    Updated Nov 17, 2021
    Cite
    Cheng Fei (2021). Document Element Detection Dataset [Dataset]. https://universe.roboflow.com/cheng-fei/document-element-detection
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 17, 2021
    Dataset authored and provided by
    Cheng Fei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Document Elements Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Automated Document Classification: The 'document element detection' model can be used by businesses in automating their document management systems. By identifying various elements, the model could classify documents into categories (e.g., invoices, reports, forms) for easier retrieval and storage.

    2. Accessibility Technology: This model could be incorporated into software that aids visually impaired or dyslexic individuals. By identifying and classifying different elements in a document, the software could use text-to-speech functionality for reading documents aloud.

    3. Data Extraction and Analysis: Organizations often need to extract specific data elements from documents, such as tables or graph information, for analysis. The model could be trained to isolate these areas for easier extraction and analysis, thus improving data-driven decision-making.

    4. Quality Assurance: For publishers or printers, the model can be used to identify unwanted elements or inconsistencies (like misplaced graphs, irregular tables) in a document before it goes to print, helping in maintaining the quality of publication.

    5. Content Creation Software: In applications like automated resume or report building, the 'document element detection' model can be employed to identify where certain elements (image, table, text) are commonly placed, which can then be used to create professional, standardized templates.

    Note: The given example of a 'man in a suit and tie' appears to be unrelated to the use of a document element detection model, as it seems more applicable to a model designed to identify or classify elements within portrait photographs or fashion-related applications.

  17. Templates for developing and versioning data standards and reporting formats...

    • osti.gov
    • search.dataone.org
    • +1 more
    Updated Dec 31, 2020
    Cite
    Environmental System Science Data Infrastructure for a Virtual Ecosystem (2020). Templates for developing and versioning data standards and reporting formats using GitHub [Dataset]. http://doi.org/10.15485/1780564
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    Environmental System Science Data Infrastructure for a Virtual Ecosystem
    U.S. DOE > Office of Science > Biological and Environmental Research (BER)
    Description

    This data package contains three templates that can be used for creating README files and issue templates, written in Markdown, that support community-led data reporting formats. We created these templates based on the results of a systematic review (see related references) that explored how groups developing data standard documentation use the version control platform GitHub to collaborate on supporting documents. Based on our review of 32 GitHub repositories, we make recommendations for the content of README files (e.g., provide a user license, indicate how users can contribute), and 'README_template.md' includes headings for each section. The two issue templates we include ('issue_template_for_all_other_changes.md' and 'issue_template_for_documentation_change.md') can be used in a GitHub repository to help structure user-submitted issues, or can be modified to suit the needs of data standard developers. We used these templates when establishing ESS-DIVE's community space on GitHub (https://github.com/ess-dive-community), which includes documentation for community-led data reporting formats. We also include file-level metadata 'flmd.csv' that describes the contents of each file within this data package. Lastly, the temporal range that we indicate in our metadata is the time range during which we searched for data standards documented on GitHub.

  18. OCR Document Text Recognition Dataset

    • kaggle.com
    zip
    Updated Sep 7, 2023
    + more versions
    Cite
    Unique Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/discussion
    Explore at:
    Available download formats: zip (32330434 bytes)
    Dataset updated
    Sep 7, 2023
    Authors
    Unique Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For commercial usage: to discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.

    Dataset structure

    • images - contains the original images of documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photo

    Data Format

    Each image in the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

    [Image: example of the annotations.xml structure]
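
    The schema of annotations.xml is not given in text form, so the following is only a sketch of how such a file might be read. It assumes a CVAT-style layout in which each image element holds shapes with a label attribute and a points attribute of the form "x1,y1;x2,y2;...". The element names, attribute names, and sample values are assumptions and should be checked against the real file. Only the four labels and their box colours come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotations.xml content. The element and attribute names
# below are assumptions, not the dataset's confirmed schema.
SAMPLE_XML = """<annotations>
  <image id="0" name="images/0.png" width="1000" height="1400">
    <polygon label="Text Title" points="100.0,50.0;900.0,50.0;900.0,120.0;100.0,120.0"/>
    <polygon label="Handwritten" points="120.0,1200.0;480.0,1200.0;480.0,1300.0;120.0,1300.0"/>
  </image>
</annotations>"""

# Box colours as listed in the dataset description.
LABEL_COLORS = {
    "Text Title": "red",
    "Text Paragraph": "blue",
    "Table": "green",
    "Handwritten": "purple",
}


def parse_points(points_attr):
    """Turn 'x1,y1;x2,y2;...' into a list of (x, y) float tuples."""
    return [tuple(float(v) for v in pair.split(",")) for pair in points_attr.split(";")]


def load_annotations(xml_text):
    """Return {image name: [(label, [(x, y), ...]), ...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for image in root.iter("image"):
        shapes = []
        for shape in image:
            # Skip child elements that carry no point list.
            if "points" in shape.attrib:
                shapes.append((shape.get("label"), parse_points(shape.get("points"))))
        result[image.get("name")] = shapes
    return result


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


annotations = load_annotations(SAMPLE_XML)
for name, shapes in annotations.items():
    for label, points in shapes:
        print(name, label, LABEL_COLORS.get(label), bounding_box(points))
```

    If the real file stores boxes as corner attributes rather than point lists, only parse_points and the attribute lookup would need to change.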

    The Text Detection in the Documents dataset can be prepared in accordance with your requirements.

    💴 Buy the Dataset: This is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price and buy the dataset

    TrainingData provides high-quality data annotation tailored to your needs

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  19. FDA Drug Label Data

    • kaggle.com
    zip
    Updated Jun 17, 2025
    Cite
    Jeff Lin (2025). FDA Drug Label Data [Dataset]. https://www.kaggle.com/datasets/jefflin97/fda-guidelines-data
    Explore at:
    zip(239522541 bytes)Available download formats
    Dataset updated
    Jun 17, 2025
    Authors
    Jeff Lin
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FDA Monoclonal Antibody Regulatory Dataset

    About the Dataset

    This dataset aggregates comprehensive regulatory documentation and resources from the U.S. Food and Drug Administration (FDA), specifically related to monoclonal antibodies (mAbs). It provides structured access to critical FDA filings, clinical trial documentation, and drug labels, serving as an essential resource for regulatory analysis, clinical research, and AI-driven applications.

    Contents

    The dataset comprises:

    • FDA Documentation

      • New Drug Applications (NDA) submissions and approval summaries.
      • Investigational New Drug (IND) filings, including clinical and preclinical data.
      • International Council for Harmonisation (ICH) guidance documents relevant to monoclonal antibody regulation.
    • Clinical Trial Documentation

      • Protocols, study designs, and outcome reports from clinical trials.
      • Regulatory correspondence and approval notices.
    • Drug Labels

      • Structured drug labeling information for 180 approved monoclonal antibodies, detailing indications, dosages, adverse reactions, warnings, and clinical pharmacology.

    Potential Use Cases

    This dataset supports various research and analytical tasks, including:

    • Regulatory compliance analysis: Identify key elements and benchmarks for successful FDA approvals.
    • Clinical trial design optimization: Inform trial protocols using historical approval data.
    • Natural Language Processing (NLP) applications: Enable text classification, information extraction, summarization, and entity recognition tasks.
    • Safety and efficacy research: Facilitate comparative analysis of drug labels and clinical outcomes.

    Intended Audience

    • Regulatory professionals and pharmaceutical industry researchers.
    • Biomedical data scientists and informaticians.
    • NLP and machine learning practitioners focused on biomedical applications.

    Data Format

    • All documents and labels are provided as PDFs. Most are machine-readable and can be parsed with PyPDF, but some drug labels may be faxed documents embedded in a PDF, which may require OCR (for example, via Tesseract) to parse.
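
    One way to handle the mix of machine-readable and faxed PDFs is to try the text layer first and fall back to OCR only for pages that yield almost no text. The sketch below assumes the third-party packages pypdf, pdf2image, and pytesseract (plus the Tesseract and Poppler binaries) are installed. The looks_scanned threshold and all function names are illustrative assumptions, not part of the dataset.

```python
def looks_scanned(text, min_chars=20):
    """Heuristic (an assumption): a page with almost no extractable text
    is probably a scanned or faxed image."""
    return len(text.strip()) < min_chars


def extract_page_texts(pdf_path):
    """Return the text layer of each page using pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def ocr_page(pdf_path, page_number):
    """OCR one page (1-based) by rendering it to an image first."""
    from pdf2image import convert_from_path  # third-party, needs Poppler
    import pytesseract  # third-party, needs the Tesseract binary

    images = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)
    return pytesseract.image_to_string(images[0])


def extract_with_fallback(pdf_path, extract=extract_page_texts, ocr=ocr_page):
    """Use the text layer where present, OCR where the page looks scanned."""
    texts = extract(pdf_path)
    return [
        ocr(pdf_path, number) if looks_scanned(text) else text
        for number, text in enumerate(texts, start=1)
    ]


# Example call (hypothetical file name):
# pages = extract_with_fallback("some_drug_label.pdf")
```

    The extract and ocr parameters are injectable so the fallback logic can be exercised without the PDF libraries installed.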

    Acknowledgments

    This dataset utilizes publicly available information provided by the FDA and other regulatory bodies.

    Citation

    If you use this dataset in your research or applications, please provide an appropriate citation referencing this dataset.

  20. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Explore at:
    zip(1941393 bytes)Available download formats
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This is a text document classification dataset containing 2225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.

    About Dataset

    • Dataset contains two features: text and label.
    • No. of Rows: 2225
    • No. of Columns: 2

    Text: It contains different categories of text data
    Label: It contains labels for five different categories: 0, 1, 2, 3, 4

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment = 3
    5. Business = 4
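
    The label encoding above can be applied when loading the data, as in the minimal sketch below. The column headers "Text" and "Label" and the sample rows are assumptions made for illustration. Only the integer-to-category mapping comes from the description.

```python
import csv
import io
from collections import Counter

# Mapping taken from the list above.
LABEL_NAMES = {
    0: "Politics",
    1: "Sport",
    2: "Technology",
    3: "Entertainment",
    4: "Business",
}

# Made-up rows standing in for the real file; header names are assumed.
SAMPLE_CSV = """Text,Label
"Parliament votes on the new budget",0
"Striker scores twice in cup final",1
"Chipmaker unveils faster processor",2
"""


def load_rows(handle, text_col="Text", label_col="Label"):
    """Read (text, category name) pairs from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [(row[text_col], LABEL_NAMES[int(row[label_col])]) for row in reader]


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = Counter(name for _, name in rows)
print(counts)
```

    For the real file, pass an open file handle instead of the in-memory sample and adjust the column names if they differ.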