This workshop is a continuation of the DDI PowerPoint presentation given at the previous year's DLI Training in Kingston. It is intended as a primer for those interested in understanding the basic concepts of the Data Documentation Initiative (DDI) and its Document Type Definition (DTD). This time participants will have the opportunity to take a closer look, examine the tags, determine criteria for selection, and create an XML template.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains a collection of over 2,000 company documents, categorized into four main types: invoices, inventory reports, purchase orders, and shipping orders. Each document is provided in PDF format, accompanied by a CSV file that includes the text extracted from these documents, their respective labels, and the word count of each document. This dataset is ideal for various natural language processing (NLP) tasks, including text classification, information extraction, and document clustering.
PDF Documents: The dataset includes 2,677 PDF files, each representing a unique company document. These documents are derived from the Northwind dataset, which is commonly used for demonstrating database functionalities.
The document types are:
- Invoices
- Inventory reports
- Purchase orders
- Shipping orders

Each row of the accompanying CSV file pairs a document's extracted text with its label and word count.

This dataset can be used for:
- Text classification (see the sketch below)
- Information extraction
- Document clustering
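As a quick illustration of the classification use case, here is a minimal sketch using scikit-learn. The file name documents.csv and the column names text and label are assumptions; adjust them to the actual CSV header.

```python
# Minimal text-classification baseline for the four document types.
# Assumed: the CSV is saved as "documents.csv" with columns "text" and "label".
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("documents.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=0, stratify=df["label"]
)

# TF-IDF features feeding a linear classifier: a common baseline for
# separating invoices, inventory reports, purchase orders, and shipping orders.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```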
In 2012, an inventory of priority invasive plant species in priority areas was conducted at San Diego National Wildlife Refuge. Results from this effort will inform the development of invasive plant management objectives and strategies, and serve as a baseline for assessing change in invasive plant distribution or abundance over time.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For the complete dataset or more information, please email commercialproduct@appen.com
The dataset product can be used in many AI pilot projects and can supplement production models with other data. It can improve model performance cost-effectively; a ready-made dataset is an excellent solution when time and budget are limited. The Appen database team can provide a large number of database products, such as ASR, TTS, video, text, and image, and is constantly building new datasets to expand its resources. The team always strives to deliver as soon as possible to meet the needs of global customers. This OCR database consists of image data in Korean, Vietnamese, Spanish, French, Thai, Japanese, Indonesian, Tamil, and Burmese, as well as handwritten images in both Chinese and English (including annotations). On average, each image contains 30 to 40 frames, including text in various languages, special characters, and numbers. The accuracy requirement is over 99% (both position and content correct). The images include the following categories:
- RECEIPT
- IDCARD
- TRADE
- TABLE
- WHITEBOARD
- NEWSPAPER
- THESIS
- CARD
- NOTE
- CONTRACT
- BOOKCONTENT
- HANDWRITING
| Database | RECEIPT | IDCARD | TRADE | TABLE | WHITEBOARD | NEWSPAPER | THESIS | CARD | NOTE | CONTRACT | BOOKCONTENT | TOTAL |
| 1 | 1500 | 500 | 1012 | 512 | 500 | 500 | 500 | 500 | 499 | 501 | 500 | 7024 |
| 2 | 337 | 100 | 227 | 100 | 111 | 100 | 100 | 100 | 100 | 105 | 700 | 2080 |
| 3 | 1500 | 500 | 1000 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7000 |
| 4 | 300 | 100 | 200 | 100 | 100 | 100 | 103 | 100 | 100 | 100 | 700 | 2003 |
| 5 | 1500 | 500 | 1000 | 537 | 500 | 500 | 500 | 500 | 500 | 500 | 500 | 7037 |
| 6 | 1586 | 500 | 1000 | 552 | 500 | 500 | 509 | 500 | 500 | 500 | 500 | 7147 |
| 7 | 1500 | 500 | 1003 | 500 | 501 | 502 | 500 | 500 | 500 | 500 | 500 | 7006 |
| 8 | 356 | 98 | 475 | 532 | 501 | 500 | 500 | 500 | 501 | 500 | 500 | 4963 |
| 9 | 300 | 100 | 200 | 117 | 110 | 108 | 102 | 100 | 120 | 100 | 761 | 2118 |

| Database Name | Category | Quantity |
| English Handwritten Datasets | HANDWRITING | 2278 |
| Chinese Handwritten Datasets | HANDWRITING | 11118 |
https://data.gov.tw/license
This dataset is updated annually by the Digital Development Department to fulfill the purpose of open government information. It includes the number of documents received by agencies each month, broken down into general documents, legislator inquiries, people's requests, appeal cases, people's petitions, special control cases, and supervisory cases. The data is intended to support analysis of how official documents are handled in government agencies.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Taken from "Classifying document types to enhance search and recommendations in digital libraries"https://www.overleaf.com/read/zzzrvmzmwdckAbstract: In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, thesis and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text specific features. We achieve 0.96 F1-score using the random forest and Adaboost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and thesis than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.The descriptors, as featured in the study, are encoded in the dataset as follows:authors_len: Number of authors associated with the document entry.num_of_pages: Number of pages the document has in total.avg_word_per_page: Average words per page in the document.total_words: Total words in the document.source: The online service from which the document originated (can be either "CORE" or "SlideShare").id: Identifier with which the source's API can be queried to retrieve the corresponding document.label: The document's type, from "research", "thesis" or "slides".
https://www.nist.gov/open/license
The NIST Extensible Resource Data Model (NERDm) is a set of schemas for encoding, in JSON format, metadata that describe digital resources. The variety of digital resources it can describe includes not only digital data sets and collections, but also software, digital services, web sites and portals, and digital twins. It was created to serve as the internal metadata format used by the NIST Public Data Repository and Science Portal to drive rich presentations on the web and to enable discovery; however, it was also designed to enable programmatic access to resources and their metadata by external users. Interoperability was also a key design aim: the schemas are defined using the JSON Schema standard, metadata are encoded as JSON-LD, and their semantics are tied to community ontologies, with an emphasis on DCAT and the US federal Project Open Data (POD) models. Finally, extensibility is central to the design: the schemas are composed of a central core schema and various extension schemas, and new extensions to support richer metadata concepts can be added over time without breaking existing applications.

Validation is central to NERDm's extensibility model. Consuming applications should be able to choose which metadata extensions they care to support and ignore terms and extensions they don't support. Furthermore, they should not fail when a NERDm document leverages extensions they don't recognize, even when on-the-fly validation is required. To support this flexibility, the NERDm framework allows documents to declare which extensions are being used and where. We have developed an optional extension to standard JSON Schema validation (see ejsonschema below) to support flexible validation: while a standard JSON Schema validator can validate a NERDm document against the NERDm core schema, our extension will validate a NERDm document against any recognized extensions and ignore those that are not recognized.

The NERDm data model is based around the concept of a resource, semantically equivalent to a schema.org Resource, and as in schema.org, there can be different types of resources, such as data sets and software. A NERDm document indicates which types the resource qualifies as via the JSON-LD "@type" property. All NERDm Resources are described by metadata terms from the core NERDm schema; however, different resource types can be described by additional metadata properties (often drawing on particular NERDm extension schemas). A Resource contains Components of various types (including DCAT-defined Distributions) that are considered part of the Resource; specifically, these can include downloadable data files, hierarchical data collections, links to web sites (like software repositories), software tools, or other NERDm Resources. Through the NERDm extension system, domain-specific metadata can be included at either the resource or component level. The direct semantic and syntactic connections to the DCAT, POD, and schema.org schemas are intended to ensure unambiguous conversion of NERDm documents into those schemas.

As of this writing, the core NERDm schema and its framework stand at version 0.7 and are compatible with the "draft-04" version of JSON Schema. Version 1.0 is projected to be released in 2025. In that release, the NERDm schemas will be updated to the "draft2020" version of JSON Schema. Other improvements will include stronger support for RDF and the Linked Data Platform through its support of JSON-LD.
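To illustrate the validation behavior described above, here is a minimal sketch using the standard Python jsonschema package. The record and the schema are illustrative assumptions, not the actual NERDm core schema; only the JSON-LD "@type" property and the resource/component structure come from the description above.

```python
# Validate a NERDm-style record with a standard JSON Schema validator.
# A standard validator checks only the (here, toy) core schema and simply
# ignores extension vocabulary it does not recognize.
import jsonschema

core_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",  # NERDm 0.7 targets draft-04
    "type": "object",
    "required": ["@type", "title"],
    "properties": {
        "@type": {"type": "array", "items": {"type": "string"}},
        "title": {"type": "string"},
        "components": {"type": "array", "items": {"type": "object"}},
    },
}

record = {
    "@type": ["nrdp:DataPublication", "dcat:Dataset"],  # resource types (illustrative)
    "title": "Example resource",
    "components": [
        # A downloadable data file as a component of the resource.
        {"@type": ["nrdp:DataFile"], "downloadURL": "https://example.org/data.csv"},
    ],
}

jsonschema.validate(instance=record, schema=core_schema)
print("record is valid against the core schema")
```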
The OpenScience Slovenia metadata dataset contains metadata entries for Slovenian public domain academic documents, including undergraduate and postgraduate theses, research and professional articles, and other academic document types. The data was collected as part of the establishment of the Slovenian Open-Access Infrastructure, which defined a unified document collection and cataloguing process for the participating Slovenian universities' repositories. The data was gathered from several established but separate library systems in Slovenia and merged into a single metadata scheme using metadata deduplication and merging techniques. It consists of text and numerical fields representing attributes that describe documents, including titles, keywords, abstracts, typologies, authors, issue years, and other identifiers such as URL and UDC. The potential of this dataset lies especially in text mining and text classification tasks; it can also be used in the development or benchmarking of content-based recommender systems on real-world data.
This is a text document classification dataset containing 2,225 text samples across five document categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.
About the dataset:
- Number of rows: 2225
- Number of columns: 2
- Text: the text of each document, drawn from the five categories.
- Label: an integer label (0-4) identifying the document's category.
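For the clustering use case, here is a minimal sketch with TF-IDF vectors and k-means. The file name bbc_text.csv and the exact column names are assumptions.

```python
# Cluster the 2225 documents into five groups and compare against the labels.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("bbc_text.csv")  # assumed columns: "text", "label"

# Represent each document as a TF-IDF vector.
vectors = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(df["text"])

# Five clusters, matching the five known categories (labels 0-4).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(vectors)

# The adjusted Rand index measures agreement between clusters and true labels.
print("ARI:", adjusted_rand_score(df["label"], kmeans.labels_))
```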
The Grant Related Forms and Documents collection includes forms, certifications, and assurances that are commonly used in applying for Administration for Children and Families grants and in reporting on the status of grant projects.
Metadata-only record linking to the original dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Last Version: 3
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2022/10/28
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:
- data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers and/or software papers can be published
- data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers and/or software papers can be published
Relationship between files: both files contain the same information; two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 3rd version
- Information updated: number of journals, URLs, document types associated with each journal, publisher normalization, and simplification of document types
- Information added: whether the journal is listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS), and its quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).
Erratum - Data articles in journals Version 3:
Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
Data -- ISSN 2306-5729 -- JCR (JIF) n/a
Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a
Version: 2
Author: Francisco Rubio, Universitat Politècnica de València.
Date of data collection: 2020/06/23
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in a data journal or in a standard academic journal. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
File list:
- data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers and/or software papers can be published
- data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers and/or software papers can be published
Relationship between files: both files contain the same information; two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 2nd version
- Information updated: number of journals, URLs, document types associated with each journal, publisher normalization, and simplification of document types
- Information added: whether the journal is listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS), and its quartile in Scimago Journal and Country Rank (SJR).
Total size: 32 KB
Version 1: Description
This dataset contains a list of journals that publish data articles, code, software articles and database articles.
The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in journal titles.
Acknowledgements:
Xaquín Lores Torres for his invaluable help in preparing this dataset.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The SIDTD dataset is an extension of the MIDV2020 dataset. The MIDV2020 dataset is composed of synthetic ID documents, all generated by means of AI techniques. In SIDTD, these generated documents are treated as representative of bona fide documents, while forged versions of them are produced using two techniques for generating composite PAIs: Crop & Replace and inpainting. The corpus covers ten equally represented European nationalities: Albanian, Azerbaijani, Estonian, Finnish, Greek, Lithuanian, Russian, Serbian, Slovakian, and Spanish. Because MIDV2020 contains videos and clips of captured ID documents with different backgrounds, the same type of data was added for the forged ID document images generated with the techniques described. The protocol employed to generate the dataset was as follows: 191 counterfeit ID documents were printed on paper using an HP Color LaserJet E65050 printer, laminated with 100-micron-thick laminating pouches to enhance realism, and manually cropped. CVC's employees were then asked to use their smartphones to record videos of the forged ID documents, capturing a diverse range of video qualities, backgrounds, durations, and light intensities.
Current and velocity data were collected in Boston Harbor and the North Atlantic in support of the Boston Harbor Current Survey, OPR-501-FE-71, and the South Coastal Plains Expedition, OPR-500-FE-73. Data were collected from NOAA Ship Ferrel and survey buoys from 1971-05-10 to 1973-07-07. This archival package contains only documentation forms for these data, not the data files themselves.
Dataset of 3623 images from 1661 users (~2.18/user), mainly front/back ID documents, ideal for OCR training, document recognition, and automated identity verification tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data collection of the Monitoring Information System (SIS) of the Ministry of Finance of the Republic of Lithuania has compiled data on documents, hierarchy of elements, financial and indicators since 2011. State Strategy Documents Data The data table for State Strategy and Application Documents consists of: document code, title, document type, last version number, process stage, period, institution and its code. Document Item Hierarchy The data table of the hierarchy of state strategies and programme elements consists of: document ID, which referrals to “State Strategy Document Data”, item code, name, class, type, description (for indicators), beginning and end, periodicity, element units of measure, as well as the extent to which the element (country, organization or document owner) is implemented. Data on State Strategy Indicators The data table for indicators of state strategies and programmes consists of: document and object IDs (referenced to “Document item hierarchy” and “State strategy documents data”), indicator period and target and actual values (quantitative and qualitative). Financial data of state strategies The financial data table of state strategies and programmes consists of: document and object IDs (referenced to “Document item hierarchy” and “State strategy document data”), item start and end dates, value in euro, cost type, source of funding, type of funds, state function, as well as indication of whether this is a plan, revised plan, requirement or fact. Data provider means the Ministry of Finance of the Republic of Lithuania. Contact the atverimas@stat.gov.lt for technical questions or possible errors.
https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 7.87 (USD Billion) |
| MARKET SIZE 2025 | 8.37 (USD Billion) |
| MARKET SIZE 2035 | 15.4 (USD Billion) |
| SEGMENTS COVERED | Service Type, Deployment Model, Industry Vertical, Organization Size, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | technological advancements, increasing automation, rising data volume, regulatory compliance demands, cost efficiency pressures |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Hyland Software, IBM, Nuance Communications, CaptureFast, DocuWare, Xerox, ABBYY, Micro Focus, SAP, Laserfiche, FileBound, MFiles, Adobe, OpenText, Kofax |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | AI-driven automation solutions, Cloud-based document management, Enhanced data security measures, Integration with emerging technologies, Market expansion in developing regions |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 6.3% (2025 - 2035) |
https://data.gov.tw/license
This dataset is compiled by the Secretariat for statistical purposes. It records statistics on the various types of official documents received by the Digital Industry Department of the Digital Development Bureau, as well as statistics on general official documents issued for signature and drafting, and is provided for reference and use by data users.
ACRIS Document Type and Class Code mappings for Codes in the ACRIS Real and Personal Property Master Datasets
https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 5.46 (USD Billion) |
| MARKET SIZE 2025 | 5.86 (USD Billion) |
| MARKET SIZE 2035 | 12.0 (USD Billion) |
| SEGMENTS COVERED | Document Type, Storage Type, End Use Industry, Service Type, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | growing digitalization, increasing data security concerns, rising regulatory compliance, cost-effective storage solutions, enhanced access and collaboration |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Kofile Technologies, Access Information Management, Space Saver, DocuVault, Iron Mountain, Xerox, DataSafe, Record Nations, Outsourcing Data Solutions, Citysweeper, Cintas, Metrofile, ECS Ltd, DataBank, Shredit, Recall |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Increased data privacy regulations, Rising demand for cost-effective storage, Growth in remote work culture, Advancements in cloud storage technologies, Increase in digital transformation initiatives |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 7.5% (2025 - 2035) |
A global data set of soil types is available at 0.5-degree latitude by 0.5-degree longitude resolution. There are 106 soil units, based on Zobler's (1986) assessment of the FAO/UNESCO Soil Map of the World. This data set is a conversion of the Zobler 1-degree resolution version to a 0.5-degree resolution. The resolution of the data set was not actually increased; rather, the 1-degree squares were divided into four 0.5-degree squares with the necessary adjustment of continental boundaries and islands. The computer code used to convert the original 1-degree data to 0.5-degree is provided as a companion file, and a JPG image of the data is provided in this document.

The Zobler data (1-degree resolution) as distributed by Webb et al. (1993) [http://www.ngdc.noaa.gov/seg/eco/cdroms/gedii_a/datasets/a12/wr.htm#top] contains two columns, one for continent and one for soil type. The Soil Map of the World consists of 9 maps that represent parts of the world. The texture data that Webb et al. (1993) provided allowed for the fact that a soil type in one part of the world may have different properties than the same soil in a different part of the world. This continent-specific information is retained in this 0.5-degree resolution data set, as well as the soil type information in the second column.

A code was written (one2half.c) to take the file CONTIZOB.LER distributed by Webb et al. (1993) and simply divide the 1-degree cells into quarters. This code also reads in a land/water file (land.wave) that specifies the cells that are land at 0.5 degrees. The code checks for consistency between the newly quartered map and the land/water map to which the quartered map is to be registered. If there is a discrepancy between the two, an attempt is made to make them consistent using the following logic: if the cell is supposed to be water, it is forced to be water; if it is supposed to be land but was resolved to water at 1 degree, the code looks at the surrounding 8 cells and assigns the most frequent soil type among them to the cell. If there are no surrounding land cells, the cell is kept as water in the hope that on the next pass one or more of the surrounding cells might be converted from water to a soil type. The whole map is iterated 5 times.

The remaining cells that should be land but couldn't be determined from surrounding cells (mostly islands that are resolved at 0.5 degree but not at 1 degree) are printed out with coordinate information, and a temporary map is output with -9 indicating where data is required. This is repeated for the continent code in CONTIZOB.LER as well, producing a separate map of temporary continent codes with -9 indicating required data. A nearly identical code (one2half.c) does the same for the continent codes. The printout allows one to consult the printed versions of the soil map and look up the soil type with the largest coverage in the 0.5-degree cell. The program manfix.c then goes through the temporary map and prompts for input to correct both the soil codes and the continent codes. This can be done manually or by preparing a file of changes (new_fix.dat) and redirecting stdin. A new complete version of the map is output in the form of the original CONTIZOB.LER file (contizob.half), but four times larger.

Original documentation and computer codes prepared by Post et al. (1996) are provided as companion files with this data set.
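For readers who want the gist of the conversion without digging into the C sources, here is a minimal Python sketch of the quartering and gap-filling logic described above. It is an assumed reimplementation for illustration, not the distributed one2half.c; the WATER sentinel value is also an assumption.

```python
# Quarter a 1-degree soil-type grid to 0.5 degrees and reconcile it with a
# 0.5-degree land/water mask, following the procedure described above.
import numpy as np

WATER = 0  # assumed sentinel value for water cells


def quarter(grid_1deg):
    """Divide each 1-degree cell into four 0.5-degree cells (no new detail)."""
    return np.kron(grid_1deg, np.ones((2, 2), dtype=grid_1deg.dtype))


def reconcile(soil, land_mask, n_pass=5):
    """Force water where the mask says water; fill land cells that resolved
    to water with the most frequent soil type among the 8 neighbors."""
    soil = soil.copy()
    soil[~land_mask] = WATER  # cells that should be water are forced to water
    for _ in range(n_pass):  # the original procedure iterates the map 5 times
        for i, j in zip(*np.where(land_mask & (soil == WATER))):
            window = soil[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            types, counts = np.unique(window[window != WATER], return_counts=True)
            if types.size:  # otherwise stay water and retry on the next pass
                soil[i, j] = types[np.argmax(counts)]
    return soil
```

Cells that remain unresolved after the five passes correspond to the ones the original workflow printed out for manual correction via manfix.c.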
An image of the 106 global soil types at 0.5-degree by 0.5-degree resolution is included. Additional documentation from Zobler's assessment of FAO soil units is available from the NASA Center for Scientific Information.