Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Sensitive Document Classification
Preventing data violations has become increasingly crucial: several data breaches have been reported in recent years. To prevent data violations, we need to determine the sensitivity level of documents. Deep learning techniques perform well in document classification but require large amounts of data. However, the lack of a public dataset in this context, due to the sensitive nature of the documents, prevents researchers from designing powerful models. We… See the full description on the dataset page: https://huggingface.co/datasets/mouhamet/sensitive_document_classification.
U.S. Government Works
https://www.usa.gov/government-works
License information was derived automatically
This bundle contains documentation about data products that are collected using radio science and supporting equipment. With one exception, each member collection contains one or more versions of a single Software Interface Specification (SIS) or an equivalent document. A SIS describes the format and content of a data file at a granularity sufficient for use -- typically byte-level, but sometimes bit-level. Examples of products and descriptions of their use may also be included in a collection, as appropriate. The exception is the DOCUMENT collection, which contains supporting material -- usually journal publications, technical reports, or other documents that describe investigations, analysis methods, and/or data, but not at the level of a SIS. Members of the DOCUMENT collection were usually released once, whereas a SIS often evolves over many years.
Survey data presented and discussed in the paper 'How to Document Ontology Design Patterns', presented at the Workshop on Ontology and Semantic Web Patterns in conjunction with the International Semantic Web Conference 2016.
The dataset contains two CSV files, each corresponding to one of the two surveys discussed in Section 3 of the paper in question. Both files include the questions (row 1), answer options (row 2), and provided answers (row 3 and onward). OEMS-Data.csv contains the data discussed in Section 3.1 (Table 2/3) and ODPT-Data.csv contains the data discussed in Section 3.2 (Table 4).
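Given the layout described above (questions in row 1, answer options in row 2, responses from row 3 onward), each file can be split into those three parts with a few lines of Python. The miniature CSV below is invented for illustration and is not taken from OEMS-Data.csv:

```python
import csv
import io

# Invented miniature following the described layout: row 1 = questions,
# row 2 = answer options, rows 3 and onward = individual survey responses.
sample_csv = (
    "How do you document your patterns?,Do you reuse patterns?\n"
    "Free text,Yes / No\n"
    "With UML diagrams,Yes\n"
    "With prose descriptions,No\n"
)

rows = list(csv.reader(io.StringIO(sample_csv)))
questions = rows[0]   # row 1: the survey questions
options = rows[1]     # row 2: the answer options offered
responses = rows[2:]  # rows 3+: one row per respondent
```

The same slicing applies to both OEMS-Data.csv and ODPT-Data.csv, since the two files share this layout.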
The dataset was originally published in DiVA and moved to SND in 2024.
Data represents feedback from families on the learning environment. It aids in understanding families' perceptions of the students, teachers, and environment of their school. The survey is aligned to the DOE's Framework for Great Schools and is designed to collect important information about each school's ability to support success.
This is an auto-generated index table corresponding to a folder of files in this dataset with the same name. This table can be used to extract a subset of files based on their metadata, which can then be used for further analysis. You can view the contents of specific files by navigating to the "cells" tab and clicking on an individual file_id.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Document Fraud Detection market size reached USD 8.2 billion in 2024, reflecting robust demand across various sectors. The market is expected to expand at a compound annual growth rate (CAGR) of 13.7% from 2025 to 2033, projecting a value of USD 25.1 billion by 2033. This impressive growth is driven primarily by the increasing sophistication of fraud attempts, rapid digital transformation, and heightened regulatory requirements for identity verification and data security in both public and private sectors worldwide.
A major growth factor for the Document Fraud Detection market is the escalating complexity and frequency of fraudulent activities targeting sensitive documentation. As businesses and governments digitize more of their operations, the risk of document forgery, identity theft, and data manipulation has surged. Organizations are increasingly investing in advanced fraud detection solutions to safeguard their assets, maintain customer trust, and comply with evolving regulations. The proliferation of remote onboarding and digital transactions, especially post-pandemic, has further amplified the need for robust document authentication and identity verification processes. Consequently, innovative technologies such as artificial intelligence, machine learning, and biometrics are being integrated into document fraud detection systems to enhance accuracy, speed, and scalability.
Another significant driver for the market is the tightening of regulatory frameworks across the globe. Governments and regulatory bodies are mandating stringent Know Your Customer (KYC), Anti-Money Laundering (AML), and data privacy standards, especially in sectors like banking, financial services, healthcare, and government services. Failure to comply with these regulations can result in hefty fines and reputational damage. As a result, organizations are prioritizing investment in comprehensive document fraud detection solutions that not only ensure compliance but also provide audit trails and real-time alerts. The need for continuous monitoring and proactive risk management is pushing companies to adopt both on-premises and cloud-based solutions, depending on their operational requirements and data sensitivity.
Furthermore, the rising adoption of digital identity verification in emerging markets is propelling the growth of the Document Fraud Detection market. Countries in Asia Pacific, Latin America, and Africa are experiencing rapid digitalization, with increased access to smartphones and the internet. This has led to a surge in online banking, e-commerce, e-governance, and digital healthcare services, all of which require secure and reliable document verification processes. Local and international vendors are tapping into these opportunities by offering scalable, cloud-native solutions tailored to regional needs and regulatory environments. The growing awareness about the risks of document fraud, coupled with the need for seamless user experiences, is expected to further accelerate market expansion in these regions.
From a regional perspective, North America currently dominates the Document Fraud Detection market due to its advanced technological infrastructure, high incidence of digital fraud, and strict regulatory landscape. However, Asia Pacific is expected to witness the fastest growth over the forecast period, driven by rapid digital adoption, increasing investments in cybersecurity, and supportive government initiatives for digital identity management. Europe remains a significant market, supported by GDPR and other data protection regulations, while the Middle East & Africa and Latin America are emerging as lucrative markets, fueled by ongoing digital transformation projects and rising awareness about document security.
The Component segment of the Document Fraud Detection market is categorized into software, hardware, and services, each playing a crucial role in the overall ecosystem. Software solutions form the backbone of fraud detection initiatives, providing the necessary algorithms, analytics, and user interfaces for identifying and preventing fraudulent activities. These solutions are continuously evolving, incorporating advanced technologies such as artificial intelligence, machine learning, and natural language processing to enhance detection accuracy and reduce false positives. Software vendors are also focusing on developing modular, scalable platforms.
Records associated with claims for compensation.
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is part 6 of the IDNet dataset from our research paper "IDNet: A Novel Identity Document Dataset via Few-Shot and Quality-Driven Synthetic Data Generation". Here's a link to the paper: https://ieeexplore.ieee.org/document/10825017
Citation:
@inproceedings{xie2024idnet,
title={IDNet: A Novel Identity Document Dataset via Few-Shot and Quality-Driven Synthetic Data Generation},
author={Xie, Lulu and Wang, Yancheng and Guan, Hong and Nag, Soham and Goel, Rajeev and Swamy, Niranjan and Yang, Yingzhen and Xiao, Chaowei and Prisby, Jonathan and Maciejewski, Ross and others},
booktitle={2024 IEEE International Conference on Big Data (BigData)},
pages={2244--2253},
year={2024},
organization={IEEE}
}
@article{guan2024idnet,
title={IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection},
author={Guan, Hong and Wang, Yancheng and Xie, Lulu and Nag, Soham and Goel, Rajeev and Swamy, Niranjan Erappa Narayana and Yang, Yingzhen and Xiao, Chaowei and Prisby, Jonathan and Maciejewski, Ross and Zou, Jia},
journal={arXiv preprint arXiv:2408.01690},
year={2024}
}
https://www.arcgis.com/sharing/rest/content/items/89679671cfa64832ac2399a0ef52e414/data
Use the Records Search to search for records such as agreements with other government agencies, maps, and other documents from the Public Works, Transportation, and Planning, Building and Development departments.
To use this dataset and respect copyright, please cite the following paper: https://ieeexplore.ieee.org/abstract/document/9116896/ We present a new dataset that covers almost all the scenarios that may occur in document images taken with a smartphone. The collection includes 1111 images. We tested two state-of-the-art algorithms for finding the corners of a document on our dataset, and the results are also provided. The results indicate that there are still situations in which these algorithms fail, and more research is needed.
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Dictionary template for Tempe Open Data.
The London Borough of Barnet has entered into a contract with Stor-A-File for the provision of electronic scanning and document storage. The contract commenced on 1st August 2023 and will run until 31st July 2024. Further details on the contract award can be found in the link below. Personal data relating to junior officer names and commercial interests has been redacted from the contract attachment.
https://creativecommons.org/publicdomain/zero/1.0/
datafile.csv
datafile.json
datafile.ods
datafile.xls
The data contains the following features:
Year (Col.1)
Geographical Area (Col.2)
Reporting area for Land utilisation statistics (Col.3 = Col.4+Col.7+Col.11+Col.14+Col.15)
Forests (Col.4)
Not available for cultivation - Area under non-agricultural uses (Col.5)
Not available for cultivation - Barren and unculturable Land (Col.6)
Not available for cultivation - Total (Col.7 = Col.5+Col.6)
Other uncultivated Land excluding Fallow Land - Permanent pastures & other Grazing Lands (Col.8)
Other uncultivated Land excluding Fallow Land - Land under Misc. tree crops & groves (not incl. in net area sown) (Col.9)
Other uncultivated Land excluding Fallow Land - Culturable waste Land (Col.10)
Other uncultivated Land excluding Fallow Land - Total (Col.11 = Col.8+Col.9+Col.10)
Fallow Lands - Fallow Lands other than current fallows (Col.12)
Fallow Lands - Current fallows (Col.13)
Fallow Lands - Total (Col.14 = Col.12+Col.13)
Net area Sown (Col.15)
Total cropped area (Col.16)
Area sown more than once (Col.17 = Col.16-Col.15)
Agricultural Land/Cultivable Land/Culturable Land/Arable Land (Col.18 = Col.9+Col.10+Col.14+Col.15)
Cultivated Land (Col.19 = Col.13+Col.15)
Cropping Intensity (Col.20 = % of Col.16 over Col.15)
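The parenthesised column identities can be checked programmatically. The sketch below uses invented figures (not taken from the dataset) to illustrate how the derived columns Col.3, Col.7, Col.11, Col.14, Col.17, and Col.20 follow from the base columns:

```python
# Hypothetical row of land-utilisation figures (thousand hectares),
# invented purely to illustrate the column identities above.
row = {
    "forests": 700,          # Col.4
    "non_agri": 100,         # Col.5
    "barren": 50,            # Col.6
    "pastures": 40,          # Col.8
    "tree_crops": 10,        # Col.9
    "culturable_waste": 30,  # Col.10
    "other_fallow": 20,      # Col.12
    "current_fallow": 25,    # Col.13
    "net_sown": 500,         # Col.15
    "total_cropped": 650,    # Col.16
}

# Derived totals per the stated identities:
not_available = row["non_agri"] + row["barren"]                                 # Col.7
other_uncult = row["pastures"] + row["tree_crops"] + row["culturable_waste"]    # Col.11
fallow_total = row["other_fallow"] + row["current_fallow"]                      # Col.14
reporting_area = (row["forests"] + not_available + other_uncult
                  + fallow_total + row["net_sown"])                             # Col.3
sown_more_than_once = row["total_cropped"] - row["net_sown"]                    # Col.17
cropping_intensity = 100 * row["total_cropped"] / row["net_sown"]               # Col.20 (%)
```

Assertions like these are a quick way to catch transcription errors in any given row of the files.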
I am really thankful to the Indian government for curating and storing these valuable data. Source: https://data.gov.in/
I am inspired by everyone here on Kaggle for the level of their dedication and hard work.
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data Documentation and Metadata session from the 2015 Virginia Data Management Bootcamp. Introduces non-structural (data dictionaries, read me files, code books) and structured ways (XML schemas) to document research data.
This dataset consists of points that represent recorded documents in the Delaware County Recorder's Plat Books, Cabinet/Slides, and Instrument Records which are not represented by subdivision plats that are active. They are documents such as vacations, subdivisions, centerline surveys, surveys, annexations, and miscellaneous documents within Delaware County, Ohio.
This document dataset was the output of a project that aimed to create a 'gold standard' dataset that could be used to train and validate machine learning approaches to natural language processing (NLP). The project was carried out by Aleph Insights and Committed Software on behalf of the Defence Science and Technology Laboratory (Dstl). The dataset focuses specifically on entity and relationship extraction relevant to somebody operating in the role of a defence and security intelligence analyst. The dataset was therefore constructed using documents and structured schemas relevant to the defence and security analysis domain. A number of data subsets were produced (this is the BBC Online data subset). Further information about this data subset (BBC Online) and the others produced (together with licence conditions, attribution, and schemas) may be found at the main project GitHub repository webpage (https://github.com/dstl/re3d). Note that the 'documents.json' file is to be used together with the 'entities.json' and 'relations.json' files (also found on this data.gov.uk webpage); their structures and relationships are described on the given GitHub webpage.
Documents issued by the Protected Documents Office to civil status and passport offices, including passports, cards, certificates, and family books.
In 2012, an invasive plant inventory of priority invasive plant species in priority areas was conducted at San Diego National Wildlife Refuge. Results from this effort will inform the development of invasive plant management objectives and strategies, and serve as a baseline for assessing change in the status of invasive plant distribution or abundance over time.
CC0 1.0 Universal Public Domain Dedication
https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This submission includes publicly available data extracted in its original form. Please reference the Related Publication listed here for source and citation information: TRI Basic Plus data files guides. (2024, September 18). US EPA. https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-plus-data-files-guides If you have questions about the underlying data stored here, please contact tri.help@epa.gov. If you have questions or recommendations related to this metadata entry and extracted data, please contact the CAFE Data Management team at: climatecafe@bu.edu.
"EPA has been collecting Toxics Release Inventory (TRI) data since 1987. The "Basic Plus" data files include ten file types that collectively contain all of the data fields from the TRI Reporting Form R and Form A. The files themselves are in tab-delimited .txt format and then compressed into a .zip file.
1a: Facility, chemical, releases and other waste management summary information
1b: Chemical activities and uses
2a: On- and off-site disposal, treatment, energy recovery, and recycling information; non-production-related waste managed quantities; production/activity ratio information; and source reduction activities
2b: Detailed on-site waste treatment methods and efficiency
3a: Transfers off site for disposal and further waste management
3b: Transfers to Publicly Owned Treatment Works (POTWs) (RY1987 - RY2010)
3c: Transfers to Publicly Owned Treatment Works (POTWs) (RY2011 - Present)
4: Facility information
5: Optional information on source reduction, recycling and pollution control (RY2005 - Present)
6: Additional miscellaneous and optional information (RY2010 - Present)
Quantities of dioxin and dioxin-like compounds are reported in grams, while all other chemicals are reported in pounds. This webpage contains the most recent versions of all TRI data files; facilities may revise previous years' TRI submissions if necessary, and any such changes will be reflected in these files. For this reason, data contained in these files may differ from data used to construct the TRI National Analysis." [Quote from https://www.epa.gov/toxics-release-inventory-tri-program/tri-basic-plus-data-files-calendar-years-1987-present]
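Since the files are tab-delimited .txt compressed into a .zip, one member can be read without unpacking to disk. This is a minimal sketch; the member name "US_1a_2023.txt" and the two-column content in the test below are hypothetical, as the actual file names and field layouts are documented in the Basic Plus file guides:

```python
import csv
import io
import zipfile


def read_tri_file(zip_bytes: bytes, member: str):
    """Read one tab-delimited member of a TRI Basic Plus zip archive.

    Returns the header row and the remaining rows as lists of strings.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            reader = csv.reader(text, delimiter="\t")
            header = next(reader)
            return header, list(reader)
```

Streaming from the zip this way avoids extracting the full archive when only one of the ten file types is needed.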
An Electronic Repository created to streamline the storing/recording of various Security Requests, including SSA-120s/1121s, ATSAFE-613, e-mails, etc.