100+ datasets found
  1. Data from: Data Documentation Initiative (DDI) Workshop

    • search.dataone.org
    Updated Dec 28, 2023
    Cite
    Carol Perry; Data Liberation Initiative (DLI) (2023). Data Documentation Initiative (DDI) Workshop [Dataset]. http://doi.org/10.5683/SP3/1AURMB
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Carol Perry; Data Liberation Initiative (DLI)
    Description

    This workshop is a continuation of the DDI PowerPoint presentation given at the previous year's DLI Training in Kingston. It is intended as a primer for those interested in understanding the basic concepts of the Data Documentation Initiative (DDI) and Document Type Definition (DTD) statements. This time, participants will have the opportunity to take a closer look: to examine the tags, determine criteria for selection, and create an XML template.

  2. Company Documents Dataset

    • kaggle.com
    zip
    Updated May 23, 2024
    + more versions
    Cite
    Ayoub Cherguelaine (2024). Company Documents Dataset [Dataset]. https://www.kaggle.com/datasets/ayoubcherguelaine/company-documents-dataset
    Available download formats: zip (9789538 bytes)
    Dataset updated
    May 23, 2024
    Authors
    Ayoub Cherguelaine
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Overview

    This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.

    Dataset Content

    PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.

    The document types are:

    • Invoices: Detailed records of transactions between a buyer and a seller.
    • Inventory Reports: Records of inventory levels, including items in stock and units sold.
    • Purchase Orders: Requests made by a buyer to a seller to purchase products or services.
    • Shipping Orders: Instructions for the delivery of goods to specified recipients.

    Example Entries

    Here are a few example entries from the CSV file:

    Shipping Order:

    • Order ID: 10718
    • Shipping Details: "Ship Name: Königlich Essen, Ship Address: Maubelstr. 90, Ship City: ..."
    • Word Count: 120

    Invoice:

    • Order ID: 10707
    • Customer Details: "Customer ID: Arout, Order Date: 2017-10-16, Contact Name: Th..."
    • Word Count: 66

    Purchase Order:

    • Order ID: 10892
    • Order Details: "Order Date: 2018-02-17, Customer Name: Catherine Dewey, Products: Product ..."
    • Word Count: 26

    Applications

    This dataset can be used for:

    • Text Classification: Train models to classify documents into their respective categories.
    • Information Extraction: Extract specific fields and details from the documents.
    • Document Clustering: Group similar documents together based on their content.
    • OCR and Text Mining: Improve OCR (Optical Character Recognition) models and text mining techniques using real-world data.
  3. Invasive Plant Inventory at San Diego National Wildlife Refuge- Data...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 25, 2025
    + more versions
    Cite
    U.S. Fish and Wildlife Service (2025). Invasive Plant Inventory at San Diego National Wildlife Refuge- Data Documentation [Dataset]. https://catalog.data.gov/dataset/invasive-plant-inventory-at-san-diego-national-wildlife-refuge-data-documentation
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    U.S. Fish and Wildlife Service (http://www.fws.gov/)
    Description

    In 2012, an inventory of priority invasive plant species in priority areas was conducted at San Diego National Wildlife Refuge. Results from this effort will inform the development of invasive plant management objectives and strategies, and will serve as a baseline for assessing change in the distribution or abundance of invasive plants over time.

  4. OCR image data of French document type

    • kaggle.com
    zip
    Updated Jun 25, 2025
    Cite
    Appen Limited (2025). OCR image data of French document type [Dataset]. https://www.kaggle.com/datasets/appenlimited/ocr-image-data-of-french-document-type
    Available download formats: zip (22416674 bytes)
    Dataset updated
    Jun 25, 2025
    Authors
    Appen Limited
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Area covered
    French
    Description

    For the complete dataset or more information, please email commercialproduct@appen.com.

    This dataset product can be used in many AI pilot projects and can supplement production models with additional data. It can improve model performance cost-effectively, and it is a good option when time and budget are limited. The Appen database team offers a large number of database products, including ASR, TTS, video, text, and image data, and is continually building new datasets to expand its resources. The team strives to deliver as quickly as possible to meet the needs of customers worldwide.

    This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The required accuracy rate is over 99% (both position and content must be correct).

    The images include the following categories: RECEIPT, IDCARD, TRADE, TABLE, WHITEBOARD, NEWSPAPER, THESIS, CARD, NOTE, CONTRACT, BOOKCONTENT, HANDWRITING.

    1. Data Specification

    Usage cases: Image label recognition training
    Collecting device: Mobile phone / Camera
    Collecting environment: Multiple lighting environments

    Quantity by database and category (Document OCR Images)

    Language    RECEIPT  IDCARD  TRADE  TABLE  WHITEBOARD  NEWSPAPER  THESIS  CARD  NOTE  CONTRACT  BOOKCONTENT  TOTAL
    Korean         1500     500   1012    512         500        500     500   500   499       501          500   7024
    Vietnamese      337     100    227    100         111        100     100   100   100       105          700   2080
    Spanish        1500     500   1000    500         500        500     500   500   500       500          500   7000
    French          300     100    200    100         100        100     103   100   100       100          700   2003
    Thai           1500     500   1000    537         500        500     500   500   500       500          500   7037
    Japanese       1586     500   1000    552         500        500     509   500   500       500          500   7147
    Indonesian     1500     500   1003    500         501        502     500   500   500       500          500   7006
    Tamil           356      98    475    532         501        500     500   500   501       500          500   4963
    Burmese         300     100    200    117         110        108     102   100   120       100          761   2118

    Handwritten datasets (category HANDWRITING)

    English     2278
    Chinese    11118

    1. Information provided by database
    2. Data format: JPG
  5. Statistical compilation of various types of official documents and general...

    • data.gov.tw
    csv
    Updated Jul 1, 2024
    Cite
    Ministry of Digital Affairs (2024). Statistical compilation of various types of official documents and general document drafting and signing statistics in the Digital Development Department. [Dataset]. https://data.gov.tw/en/datasets/162455
    Available download formats: csv
    Dataset updated
    Jul 1, 2024
    Dataset authored and provided by
    Ministry of Digital Affairs
    License

    https://data.gov.tw/license

    Description

    This dataset is updated annually by the Digital Development Department as part of open government information disclosure. It includes the number of documents received by agencies each month, including general documents, legislator inquiries, people's requests, appeal cases, people's petitions, special control cases, and supervisory cases. The data are intended for users analyzing how government agencies handle official documents.

  6. Data from: Classifying document types to enhance search and recommendations...

    • figshare.com
    txt
    Updated May 25, 2017
    Cite
    Aristotelis Charalampous; Petr Knoth (2017). Classifying document types to enhance search and recommendations in digital libraries - Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.4834229.v2
    Available download formats: txt
    Dataset updated
    May 25, 2017
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aristotelis Charalampous; Petr Knoth
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Taken from "Classifying document types to enhance search and recommendations in digital libraries" (https://www.overleaf.com/read/zzzrvmzmwdck).

    Abstract: In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, thesis and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text specific features. We achieve 0.96 F1-score using the random forest and Adaboost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and thesis than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.

    The descriptors, as featured in the study, are encoded in the dataset as follows:

    • authors_len: Number of authors associated with the document entry.
    • num_of_pages: Number of pages the document has in total.
    • avg_word_per_page: Average words per page in the document.
    • total_words: Total words in the document.
    • source: The online service from which the document originated (can be either "CORE" or "SlideShare").
    • id: Identifier with which the source's API can be queried to retrieve the corresponding document.
    • label: The document's type, from "research", "thesis" or "slides".
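
    As a rough illustration of the descriptors listed above, the sketch below models one record in Python. The column types, and the idea that rows arrive as dictionaries of strings, are assumptions, since the description does not specify the file layout. `DocumentRecord` and `parse_record` are hypothetical names, and the example row is invented.

```python
from dataclasses import dataclass

# Allowed values as listed in the dataset description.
SOURCES = {"CORE", "SlideShare"}
LABELS = {"research", "thesis", "slides"}


@dataclass(frozen=True)
class DocumentRecord:
    authors_len: int          # number of authors
    num_of_pages: int         # total pages
    avg_word_per_page: float  # average words per page
    total_words: int          # total words
    source: str               # "CORE" or "SlideShare"
    id: str                   # identifier for the source's API
    label: str                # "research", "thesis" or "slides"


def parse_record(row: dict) -> DocumentRecord:
    """Convert one row of string values into a typed, validated record."""
    record = DocumentRecord(
        authors_len=int(row["authors_len"]),
        num_of_pages=int(row["num_of_pages"]),
        avg_word_per_page=float(row["avg_word_per_page"]),
        total_words=int(row["total_words"]),
        source=row["source"],
        id=str(row["id"]),
        label=row["label"],
    )
    if record.source not in SOURCES:
        raise ValueError(f"unexpected source: {record.source!r}")
    if record.label not in LABELS:
        raise ValueError(f"unexpected label: {record.label!r}")
    return record


# Invented example row, not taken from the dataset.
example = parse_record({
    "authors_len": "2",
    "num_of_pages": "24",
    "avg_word_per_page": "35.5",
    "total_words": "852",
    "source": "SlideShare",
    "id": "12345",
    "label": "slides",
})
print(example.label, example.total_words)
```

    Rejecting unknown `source` and `label` values early is a design choice for this sketch; a looser parser could simply pass them through.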

  7. The NIST Extensible Resource Data Model (NERDm): JSON schemas for rich...

    • nist.gov
    • gimi9.com
    • +2more
    Updated Sep 2, 2017
    Cite
    National Institute of Standards and Technology (2017). The NIST Extensible Resource Data Model (NERDm): JSON schemas for rich description of data resources [Dataset]. http://doi.org/10.18434/mds2-1870
    Dataset updated
    Sep 2, 2017
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding in JSON format metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is also central to its design: the schemas are composed of a central core schema and various extension schemas. New extensions to support richer metadata concepts can be added over time without breaking existing applications.

    Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare what extensions are being used and where. We have developed an optional extension to the standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized.

    The NERDm data model is based around the concept of resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates what types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas.

    As of this writing, the Core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.

  8. OpenScience Slovenia document metadata dataset

    • narcis.nl
    • data.mendeley.com
    Updated Mar 9, 2021
    Cite
    Borovič, M (via Mendeley Data) (2021). OpenScience Slovenia document metadata dataset [Dataset]. http://doi.org/10.17632/7wh9xvvmgk.3
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Borovič, M (via Mendeley Data)
    Area covered
    Slovenia
    Description

    The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents which include undergraduate and postgraduate theses, research and professional articles, along with other academic document types. The data within the dataset was collected as a part of the establishment of the Slovenian Open-Access Infrastructure which defined a unified document collection process and cataloguing for universities in Slovenia within the infrastructure repositories. The data was collected from several already established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields, representing attributes that describe documents. These attributes include document titles, keywords, abstracts, typologies, authors, issue years and other identifiers such as URL and UDC. The potential of this dataset lies especially in text mining and text classification tasks and can also be used in development or benchmarking of content-based recommender systems on real-world data.

  9. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Available download formats: zip (1941393 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This is a text document classification dataset containing 2,225 text documents in five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.

    About Dataset

    • The dataset contains two features: text and label.
    • No. of rows: 2225
    • No. of columns: 2

    Text: Contains text data from the different categories.
    Label: Contains the labels for the five categories: 0, 1, 2, 3, 4

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment = 3
    5. Business = 4
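
    The label encoding above can be turned into a small lookup for reading model output or summarizing class balance. This is only a sketch: `LABEL_NAMES`, `decode_labels`, and `category_counts` are hypothetical helper names, and the sample labels are invented rather than drawn from the dataset.

```python
from collections import Counter

# Mapping taken from the numbered list in the dataset description.
LABEL_NAMES = {
    0: "Politics",
    1: "Sport",
    2: "Technology",
    3: "Entertainment",
    4: "Business",
}


def decode_labels(labels):
    """Translate integer labels into category names."""
    return [LABEL_NAMES[int(label)] for label in labels]


def category_counts(labels):
    """Count how many documents fall into each named category."""
    return Counter(decode_labels(labels))


# Invented label values, for illustration only.
sample = [0, 4, 4, 2, 1, 4]
print(category_counts(sample).most_common(1))  # [('Business', 3)]
```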
  10. Grant Related Forms and Documents

    • data.virginia.gov
    • catalog.data.gov
    html
    Updated Sep 5, 2025
    Cite
    Administration for Children and Families (2025). Grant Related Forms and Documents [Dataset]. https://data.virginia.gov/dataset/grant-related-forms-and-documents
    Available download formats: html
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    Administration for Children and Families
    Description

    The Grant Related Forms and Documents includes forms, certifications, and assurances that are commonly used in applying for Administration for Children and Families grants and reporting on the status of grant projects.

    Metadata-only record linking to the original dataset. Open original dataset below.

  11. Data articles in journals

    • zenodo.org
    bin, csv, txt
    Updated Sep 21, 2023
    + more versions
    Cite
    Carlota Balsa-Sanchez; Vanesa Loureiro (2023). Data articles in journals [Dataset]. http://doi.org/10.5281/zenodo.7419132
    Available download formats: csv, txt, bin
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carlota Balsa-Sanchez; Vanesa Loureiro
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Last Version: 3

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/10/28

    General description: Datasets can be published in accordance with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers and/or software papers could be published
    - data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files contain the same information. Two different formats are offered to improve reuse.

    Type of version of the dataset: final processed version

    Versions of the files: 3rd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization, and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS), and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR)

    Erratum - Data articles in journals Version 3:

    Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
    Data -- ISSN 2306-5729 -- JCR (JIF) n/a
    Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

    Version: 2

    Author: Francisco Rubio, Universitat Politècnica de València.

    Date of data collection: 2020/06/23

    General description: Datasets can be published in accordance with the FAIR principles by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers and/or software papers could be published
    - data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files contain the same information. Two different formats are offered to improve reuse.

    Type of version of the dataset: final processed version

    Versions of the files: 2nd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization, and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS), and quartile in Scimago Journal and Country Rank (SJR)

    Total size: 32 KB

    Version 1: Description

    This dataset contains a list of journals that publish data articles, code, software articles and database articles.

    The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in journal titles.
    Acknowledgements:
    Xaquín Lores Torres for his invaluable help in preparing this dataset.

  12. Replication data for: SIDTD. Synthetic dataset of ID and Travel Document

    • dataverse.csuc.cat
    txt, zip
    Updated Dec 18, 2024
    + more versions
    Cite
    Carlos Boned Riera; Maxime Talarmain; Oriol Ramos Terrades (2024). Replication data for: SIDTD. Synthetic dataset of ID and Travel Document [Dataset]. http://doi.org/10.34810/data1815
    Available download formats: zip(54511584181), zip(460633), zip(23564788403), zip(307269), zip(723095), zip(534849), zip(748455), zip(21236), zip(10171), txt(0), zip(7432), zip(1273966468)
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Carlos Boned Riera; Maxime Talarmain; Oriol Ramos Terrades
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    The SIDTD dataset is an extension of the MIDV2020 dataset. Initially, the MIDV2020 dataset is composed of forged ID documents, as all documents are generated by means of AI techniques. These generated documents are considered in the SIDTD dataset as representative of bona fide documents. On the other hand, the documents generated are considered as being forged versions of them. The corpus of the dataset is composed of ten European nationalities that are equally represented: Albanian, Azerbaijani, Estonian, Finnish, Greek, Lithuanian, Russian, Serbian, Slovakian, and Spanish. We employ two techniques for generating composite PAIs: Crop & Replace and inpainting. The dataset contains videos and clips of captured ID documents with different backgrounds; we add the same type of data for the forged ID document images generated using the techniques described. The protocol employed to generate the dataset is as follows: we printed 191 counterfeit ID documents on paper using an HP Color LaserJet E65050 printer. Then, the documents were laminated with 100-micron-thick laminating pouches to enhance realism and manually cropped. CVC's employees were requested to use their smartphones to record videos of forged ID documents from SIDTD. This approach aimed to capture a diverse range of video qualities, backgrounds, durations, and light intensities.

  13. Documentation forms for current velocity and direction data collection...

    • catalog.data.gov
    Updated Nov 1, 2025
    + more versions
    Cite
    (Point of Contact) (2025). Documentation forms for current velocity and direction data collection surveys by buoys and NOAA Ship Ferrel in the North Atlantic from 1971-05-10 to 1973-06-07 (NCEI Accession 7301004) [Dataset]. https://catalog.data.gov/dataset/documentation-forms-for-current-velocity-and-direction-data-collection-surveys-by-buoys-and-noa1
    Dataset updated
    Nov 1, 2025
    Dataset provided by
    (Point of Contact)
    Description

    Current and velocity data were collected in Boston Harbor and the North Atlantic in support of the Boston Harbor Current Survey, OPR-501-FE-71 and the South Coastal Plains Expedition, OPR-500-FE-73. Data were collected from NOAA Ship Ferrel and survey buoys from 1971-05-10 to 1973-07-07. This archival package contains only documentation forms for these data, not the data files themselves.

  14. ID's photo Dataset | 67 countries | 11 types of documents | Document...

    • data.filemarket.ai
    Updated Jul 26, 2025
    Cite
    FileMarket (2025). ID's photo Dataset | 67 countries | 11 types of documents | Document Recognition | OCR Training | Computer Vision [Dataset]. https://data.filemarket.ai/products/id-s-photo-dataset-67-countries-11-types-of-documents-d-filemarket
    Dataset updated
    Jul 26, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    United States, Brazil, France
    Description

    Dataset of 3623 images from 1661 users (~2.18/user), mainly front/back ID documents, ideal for OCR training, document recognition, and automated identity verification tasks.

  15. State Strategy Documents Data

    • gimi9.com
    Updated May 6, 2025
    Cite
    (2025). State Strategy Documents Data [Dataset]. https://gimi9.com/dataset/eu_https-data-gov-lt-datasets-2862-
    Dataset updated
    May 6, 2025
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The data collection of the Monitoring Information System (SIS) of the Ministry of Finance of the Republic of Lithuania has compiled data on documents, the hierarchy of elements, finances, and indicators since 2011.

    State Strategy Documents Data

    The data table for State Strategy and Application Documents consists of: document code, title, document type, last version number, process stage, period, institution and its code.

    Document Item Hierarchy

    The data table of the hierarchy of state strategies and programme elements consists of: document ID, which refers to "State Strategy Documents Data", item code, name, class, type, description (for indicators), beginning and end, periodicity, element units of measure, as well as the extent to which the element (country, organization or document owner) is implemented.

    Data on State Strategy Indicators

    The data table for indicators of state strategies and programmes consists of: document and object IDs (referenced to "Document Item Hierarchy" and "State Strategy Documents Data"), indicator period, and target and actual values (quantitative and qualitative).

    Financial Data of State Strategies

    The financial data table of state strategies and programmes consists of: document and object IDs (referenced to "Document Item Hierarchy" and "State Strategy Documents Data"), item start and end dates, value in euro, cost type, source of funding, type of funds, state function, as well as an indication of whether this is a plan, revised plan, requirement or fact.

    The data provider is the Ministry of Finance of the Republic of Lithuania. Contact atverimas@stat.gov.lt with technical questions or to report possible errors.

  16. Global Document Processing Service Market Research Report: By Service Type...

    • wiseguyreports.com
    Updated Sep 15, 2025
    + more versions
    Cite
    (2025). Global Document Processing Service Market Research Report: By Service Type (Data Extraction, Document Classification, Data Entry, Data Capture, Invoice Processing), By Deployment Model (Cloud-Based, On-Premises, Hybrid), By Industry Vertical (Banking and Financial Services, Healthcare, Retail, Telecommunications, Government), By Organization Size (Small Enterprises, Medium Enterprises, Large Enterprises) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/document-processing-service-market
    Dataset updated
    Sep 15, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    North America, Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 7.87 (USD Billion)
    MARKET SIZE 2025: 8.37 (USD Billion)
    MARKET SIZE 2035: 15.4 (USD Billion)
    SEGMENTS COVERED: Service Type, Deployment Model, Industry Vertical, Organization Size, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: technological advancements, increasing automation, rising data volume, regulatory compliance demands, cost efficiency pressures
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Hyland Software, IBM, Nuance Communications, CaptureFast, DocuWare, Xerox, ABBYY, Micro Focus, SAP, Laserfiche, FileBound, MFiles, Adobe, OpenText, Kofax
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: AI-driven automation solutions, Cloud-based document management, Enhanced data security measures, Integration with emerging technologies, Market expansion in developing regions
    COMPOUND ANNUAL GROWTH RATE (CAGR): 6.3% (2025 - 2035)
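
The growth rate listed above can be cross-checked against the 2025 and 2035 market sizes with the standard compound-annual-growth-rate formula. The sketch below is only an illustrative check, not part of the report; the function name `cagr` is hypothetical, and it assumes the rate is measured over the ten years from 2025 to 2035.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures taken from the listing above (USD Billion).
size_2025 = 8.37
size_2035 = 15.4

# Assumption: the forecast period 2025 - 2035 spans 10 compounding years.
rate = cagr(size_2025, size_2035, 10)
print(f"{rate:.1%}")  # rounds to the 6.3% quoted in the listing
```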
  17. Digital Development Department, Digital Industry Agency, various types of...

    • data.gov.tw
    csv
    Updated May 19, 2025
    Cite
    (2025). Digital Development Department, Digital Industry Agency, various types of official document receiving and general official document issuing and signing statistics. [Dataset]. https://data.gov.tw/en/datasets/173524
    Available download formats: csv
    Dataset updated
    May 19, 2025
    License

    https://data.gov.tw/license

    Description

    This dataset is compiled by the Secretariat for statistical purposes. It records statistics on the various types of official documents received by the Digital Industry Department of the Digital Development Bureau, together with statistics on general official documents issued for signature and drafting, and is provided for reference and use by data users.

  18. ACRIS - Document Control Codes

    • data.cityofnewyork.us
    • nycopendata.socrata.com
    • +3more
    csv, xlsx, xml
    Updated Nov 17, 2025
    Cite
    Department of Finance (DOF) (2025). ACRIS - Document Control Codes [Dataset]. https://data.cityofnewyork.us/City-Government/ACRIS-Document-Control-Codes/7isb-wh4c
    Available download formats: csv, xml, xlsx
    Dataset updated
    Nov 17, 2025
    Dataset authored and provided by
    Department of Finance (DOF)
    Description

    ACRIS Document Type and Class Code mappings for Codes in the ACRIS Real and Personal Property Master Datasets

  19. Global Off Site Document Storage Market Research Report: By Document Type...

    • wiseguyreports.com
    Updated Oct 14, 2025
    Cite
    (2025). Global Off Site Document Storage Market Research Report: By Document Type (Paper Documents, Digital Documents, Microfilm, Blueprints, Photographs), By Storage Type (Physical Storage, Cloud Storage, Hybrid Storage), By End Use Industry (Healthcare, Legal, Financial Services, Education, Government), By Service Type (Document Scanning, Document Shredding, Record Management, Data Backup) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/off-site-document-storage-market
    Dataset updated
    Oct 14, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Oct 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 5.46 (USD Billion)
    MARKET SIZE 2025: 5.86 (USD Billion)
    MARKET SIZE 2035: 12.0 (USD Billion)
    SEGMENTS COVERED: Document Type, Storage Type, End Use Industry, Service Type, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: growing digitalization, increasing data security concerns, rising regulatory compliance, cost-effective storage solutions, enhanced access and collaboration
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Kofile Technologies, Access Information Management, Space Saver, DocuVault, Iron Mountain, Xerox, DataSafe, Record Nations, Outsourcing Data Solutions, Citysweeper, Cintas, Metrofile, ECS Ltd, DataBank, Shredit, Recall
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: Increased data privacy regulations, Rising demand for cost-effective storage, Growth in remote work culture, Advancements in cloud storage technologies, Increase in digital transformation initiatives
    COMPOUND ANNUAL GROWTH RATE (CAGR): 7.5% (2025 - 2035)
  20. Global Soil Types, 0.5-Degree Grid (Modified Zobler) - Dataset - NASA Open...

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). Global Soil Types, 0.5-Degree Grid (Modified Zobler) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/global-soil-types-0-5-degree-grid-modified-zobler-f09ea
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    A global data set of soil types is available at 0.5-degree latitude by 0.5-degree longitude resolution. There are 106 soil units, based on Zobler's (1986) assessment of the FAO/UNESCO Soil Map of the World. This data set is a conversion of the Zobler 1-degree resolution version to a 0.5-degree resolution. The resolution of the data set was not actually increased. Rather, the 1-degree squares were divided into four 0.5-degree squares with the necessary adjustment of continental boundaries and islands. The computer code used to convert the original 1-degree data to 0.5-degree is provided as a companion file. A JPG image of the data is provided in this document.

    The Zobler data (1-degree resolution) as distributed by Webb et al. (1993) [http://www.ngdc.noaa.gov/seg/eco/cdroms/gedii_a/datasets/a12/wr.htm#top] contains two columns, one column for continent and one column for soil type. The Soil Map of the World consists of 9 maps that represent parts of the world. The texture data that Webb et al. (1993) provided allowed for the fact that a soil type in one part of the world may have different properties than the same soil in a different part of the world. This continent-specific information is retained in this 0.5-degree resolution data set, as well as the soil type information which is the second column.

    A code was written (one2half.c) to take the file CONTIZOB.LER distributed by Webb et al. (1993) [http://www.ngdc.noaa.gov/seg/eco/cdroms/gedii_a/datasets/a12/wr.htm#top] and simply divide the 1-degree cells into quarters. This code also reads in a land/water file (land.wave) that specifies the cells that are land at 0.5 degrees. The code checks for consistency between the newly quartered map and the land/water map to which the quartered map is to be registered. If there is a discrepancy between the two, an attempt was made to make the two consistent using the following logic. If the cell is supposed to be water, it is forced to be water. If it is supposed to be land but was resolved to water at 1 degree, the code looks at the surrounding 8 cells and picks the most frequent soil type and assigns it to the cell. If there are no surrounding land cells then it is kept as water in the hopes that on the next pass one or more of the surrounding cells might be converted from water to a soil type. The whole map is iterated 5 times. The remaining cells that should be land but couldn't be determined from surrounding cells (mostly islands that are resolved at 0.5 degree but not at 1 degree) are printed out with coordinate information. A temporary map is output with -9 indicating where data is required. This is repeated for the continent code in CONTIZOB.LER as well. A separate map of the temporary continent codes is produced with -9 indicating required data. A nearly identical code (one2half.c) does the same for the continent codes.

    The printout allows one to consult the printed versions of the soil map and look up the soil type with the largest coverage in the 0.5-degree cell. The program manfix.c then will go through the temporary map and prompt for input to correct both the soil codes and the continent codes for the map. This can be done manually or by preparing a file of changes (new_fix.dat) and redirecting stdin. A new complete version of the map is outputted. This is in the form of the original CONTIZOB.LER file (contizob.half) but four times larger.

    Original documentation and computer codes prepared by Post et al. (1996) are provided as companion files with this data set. Image of 106 global soil types available at 0.5-degree by 0.5-degree resolution. Additional documentation from Zobler's assessment of FAO soil units is available from the NASA Center for Scientific Information.
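
The quartering and neighbour-fill logic described above can be sketched in a few lines. This is not the original one2half.c, only a minimal illustration of the described steps under stated assumptions: water is coded as 0 (hypothetical; the real files may use another code), each pass reads from a snapshot of the previous pass, the grid does not wrap around in longitude, and the function names `quarter` and `fill_from_neighbours` are invented for this example. The -9 marker for unresolved cells follows the description.

```python
from collections import Counter

WATER = 0        # assumed water code (hypothetical)
UNRESOLVED = -9  # marker for cells that still need manual input


def quarter(grid):
    """Split every 1-degree cell into four 0.5-degree cells with the same value."""
    out = []
    for row in grid:
        doubled = [value for value in row for _ in range(2)]
        out.append(doubled)
        out.append(list(doubled))
    return out


def fill_from_neighbours(soil, is_land, passes=5):
    """Reconcile a quartered soil map with a 0.5-degree land/water mask."""
    rows, cols = len(soil), len(soil[0])
    # Cells that are supposed to be water are forced to water.
    soil = [[soil[r][c] if is_land[r][c] else WATER for c in range(cols)]
            for r in range(rows)]
    for _ in range(passes):
        updated = [row[:] for row in soil]
        for r in range(rows):
            for c in range(cols):
                if not is_land[r][c] or soil[r][c] != WATER:
                    continue
                # Soil types of the (up to) 8 surrounding cells, ignoring water.
                neighbours = [
                    soil[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, rows))
                    for cc in range(max(c - 1, 0), min(c + 2, cols))
                    if (rr, cc) != (r, c) and soil[rr][cc] != WATER
                ]
                if neighbours:
                    updated[r][c] = Counter(neighbours).most_common(1)[0][0]
        soil = updated
    # Land cells still unresolved after all passes are flagged for manual fixing.
    unresolved = [(r, c) for r in range(rows) for c in range(cols)
                  if is_land[r][c] and soil[r][c] == WATER]
    for r, c in unresolved:
        soil[r][c] = UNRESOLVED
    return soil, unresolved


# Tiny made-up example: one soil cell (type 3) and an isolated island.
coarse = [[3, 0],
          [0, 0]]
land_mask = [[1, 1, 1, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 1]]
fixed, unresolved = fill_from_neighbours(quarter(coarse), land_mask)
```

In this toy case the land cell next to the soil block inherits type 3, while the isolated island in the corner has no land neighbours and is left flagged, which corresponds to the cells the description says are printed out for manual correction.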

