100+ datasets found
  1. OCR Document Text Recognition Dataset

    • kaggle.com
    Updated Sep 7, 2023
    Cite
    Training Data (2023). OCR Document Text Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/text-detection-in-the-documents/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Training Data
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    OCR Text Detection in the Documents Object Detection dataset

    The dataset is a collection of images that have been annotated with the location of text in the document. The dataset is specifically curated for text detection and recognition tasks in documents such as scanned papers, forms, invoices, and handwritten notes.

    The dataset contains a variety of document types, including different layouts, font sizes, and styles. The images come from diverse sources, ensuring a representative collection of document styles and quality. Each image in the dataset is accompanied by bounding box annotations that outline the exact location of the text within the document.

    💴 For commercial usage: to discuss your requirements, learn about the price, and buy the dataset, leave a request on TrainingData.

    The Text Detection in the Documents dataset provides an invaluable resource for developing and testing algorithms for text extraction, recognition, and analysis. It enables researchers to explore and innovate in various applications, including optical character recognition (OCR), information extraction, and document understanding.


    Dataset structure

    • images - contains the original images of the documents
    • boxes - includes bounding box labeling for the original images
    • annotations.xml - contains coordinates of the bounding boxes and labels, created for the original photos

    Data Format

    Each image in the images folder is accompanied by an XML annotation in the annotations.xml file indicating the coordinates of the bounding boxes and labels for text detection. For each point, the x and y coordinates are provided.

    Labels for the text:

    • "Text Title" - corresponds to titles, the box is red
    • "Text Paragraph" - corresponds to paragraphs of text, the box is blue
    • "Table" - corresponds to the table, the box is green
    • "Handwritten" - corresponds to handwritten text, the box is purple

    Example of XML file structure

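    The original page showed the XML only as an image, which is not reproduced here. Below is a minimal sketch of what a box-style annotations.xml might look like and how to read it with Python's standard library; the element and attribute names are assumptions for illustration, not the dataset's confirmed schema.

    import xml.etree.ElementTree as ET

    # Hypothetical annotations.xml content; the real schema may differ
    # (e.g., polygon points instead of box corners).
    SAMPLE = """<annotations>
      <image id="0" name="document_001.jpg" width="1654" height="2339">
        <box label="Text Title" xtl="120" ytl="80" xbr="980" ybr="160"/>
        <box label="Text Paragraph" xtl="110" ytl="200" xbr="1540" ybr="620"/>
        <box label="Table" xtl="110" ytl="700" xbr="1540" ybr="1400"/>
        <box label="Handwritten" xtl="300" ytl="1500" xbr="900" ybr="1650"/>
      </image>
    </annotations>"""

    root = ET.fromstring(SAMPLE)
    for image in root.iter("image"):
        for box in image.iter("box"):
            # each box carries a label and the corner coordinates of a text region
            print(box.get("label"), (box.get("xtl"), box.get("ytl")), (box.get("xbr"), box.get("ybr")))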

    The Text Detection in the Documents dataset can be tailored to your requirements.

    💴 Buy the dataset: this is just an example of the data. Leave a request on https://trainingdata.pro/datasets to discuss your requirements, learn about the price, and buy the dataset.

    TrainingData provides high-quality data annotation tailored to your needs.

    keywords: text detection, text recognition, optical character recognition, document text recognition, document text detection, detecting text-lines, object detection, scanned documents, deep-text-recognition, text area detection, text extraction, images dataset, image-to-text

  2. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    • +2 more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining.

    Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
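    Given the stated format (document number, then a tilde, then the text, one document per line), a minimal Python parsing sketch might look like this; the file name is a placeholder.

    def parse_asrs_file(path):
        """Parse one competition file: each line is 'docnum~text'."""
        docs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                doc_id, _, text = line.partition("~")  # split on the first tilde only
                docs[doc_id.strip()] = text
        return docs

    # usage (placeholder file name):
    # documents = parse_asrs_file("siam2007_train.txt")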

  3. Annotated Terms of Service of 100 Online Platforms

    • data.mendeley.com
    Updated Dec 12, 2023
    Cite
    Przemyslaw Palka (2023). Annotated Terms of Service of 100 Online Platforms [Dataset]. http://doi.org/10.17632/dtbj87j937.3
    Explore at:
    Dataset updated
    Dec 12, 2023
    Authors
    Przemyslaw Palka
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains information about the contents of 100 Terms of Service (ToS) of online platforms. The documents were analyzed and evaluated from the point of view of the European Union consumer law. The main results have been presented in the table titled "Terms of Service Analysis and Evaluation_RESULTS." This table is accompanied by the instruction followed by the annotators, titled "Variables Definitions," allowing for the interpretation of the assigned values. In addition, we provide the raw data (analyzed ToS, in the folder "Clear ToS") and the annotated documents (in the folder "Annotated ToS," further subdivided).

    SAMPLE: The sample contains 100 contracts of digital platforms operating in sixteen market sectors: Cloud storage, Communication, Dating, Finance, Food, Gaming, Health, Music, Shopping, Social, Sports, Transportation, Travel, Video, Work, and Various. The selected companies' main headquarters span four legal surroundings: the US, the EU, Poland specifically, and Other jurisdictions. The chosen platforms are both privately held and publicly listed and offer both fee-based and free services. Although the sample cannot be treated as representative of all online platforms, it nevertheless accounts for the most popular consumer services in the analyzed sectors and contains a diverse and heterogeneous set.

    CONTENT: Each ToS has been assigned the following information: 1. Metadata: 1.1. the name of the service; 1.2. the URL; 1.3. the effective date; 1.4. the language of ToS; 1.5. the sector; 1.6. the number of words in ToS; 1.7–1.8. the jurisdiction of the main headquarters; 1.9. if the company is public or private; 1.10. if the service is paid or free. 2. Evaluative Variables: remedy clauses (2.1–2.5); dispute resolution clauses (2.6–2.10); unilateral alteration clauses (2.11–2.15); rights to police the behavior of users (2.16–2.17); regulatory requirements (2.18–2.20); and various (2.21–2.25). 3. Count Variables: the number of clauses seen as unclear (3.1) and the number of other documents referred to by the ToS (3.2). 4. Pull-out Text Variables: rights and obligations of the parties (4.1) and descriptions of the service (4.2).

    ACKNOWLEDGEMENT: The research leading to these results has received funding from the Norwegian Financial Mechanism 2014-2021, project no. 2020/37/K/HS5/02769, titled “Private Law of Data: Concepts, Practices, Principles & Politics.”

  4. FATURA Dataset

    • zenodo.org
    zip
    Updated Dec 13, 2023
    Cite
    Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini (2023). FATURA Dataset [Dataset]. http://doi.org/10.5281/zenodo.10371464
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahmoud Limam; Marwa Dhiaf; Yousri Kessentini
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of 10,000 JPG images with white backgrounds, 10,000 JPG images with colored backgrounds (the same colors used in the paper), as well as 3×10,000 JSON annotation files. The images are generated from 50 different templates. For each template, 200 images were generated. We provide annotations in three formats: our own original format, the COCO format, and a format compatible with HuggingFace Transformers. Background color varies across templates but not across instances from the same template.

    In terms of objects, the dataset contains 24 different classes. The classes vary considerably in their numbers of occurrences and thus, the dataset is somewhat imbalanced.

    The annotations contain bounding box coordinates, bounding box text and object classes.

    We propose two methods for training and evaluating models. The models were trained until convergence, i.e., until the model reached its best performance on the validation split and began to overfit. The model version used for evaluation is the one with the best validation performance.

    First Evaluation strategy:
    For each template, the generated images are randomly split into 3 subsets: training, validation and testing.
    In this scenario, the model trains on all templates and is thus tested on new images rather than new layouts.

    Second Evaluation strategy:
    The real templates are randomly split into a training set and a common set of templates for validation and testing. All the variants created from the training templates are used as the training dataset. The same is done to form the validation and testing datasets. The validation and testing sets are made up of the same templates but of different images.
    This approach tests the models' performance on different unseen templates/layouts, rather than the same templates with different content.
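    As a rough illustration of this second, template-disjoint strategy, the sketch below splits template identifiers rather than images; the fractions and names are illustrative, not the paper's exact protocol.

    import random

    def split_templates(template_ids, train_frac=0.8, seed=0):
        """Hold out whole templates so validation/test never see training layouts."""
        rng = random.Random(seed)
        ids = list(template_ids)
        rng.shuffle(ids)
        n_train = int(train_frac * len(ids))
        return set(ids[:n_train]), set(ids[n_train:])

    train_templates, heldout_templates = split_templates(range(50))
    # Images from held-out templates are then divided between validation and
    # test, so both contain the same templates but different images.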

    We provide the data splits we used for every evaluation scenario. We also provide the background colors we used as augmentation for each template.

  5. Metadata Form Template

    • quality-of-life-tempegov.hub.arcgis.com
    • data.tempe.gov
    • +10 more
    Updated Jun 5, 2020
    Cite
    City of Tempe (2020). Metadata Form Template [Dataset]. https://quality-of-life-tempegov.hub.arcgis.com/documents/c450d13c28ed4b1888ed6ab9d0363473
    Explore at:
    Dataset updated
    Jun 5, 2020
    Dataset authored and provided by
    City of Tempe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metadata form template for Tempe Open Data.

  6. Requirements data sets (user stories)

    • zenodo.org
    • data.mendeley.com
    txt
    Updated Jan 13, 2025
    Cite
    Fabiano Dalpiaz (2025). Requirements data sets (user stories) [Dataset]. http://doi.org/10.17632/7zbk8zsd8y.1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 13, 2025
    Dataset provided by
    Mendeley Ltd.
    Authors
    Fabiano Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of 22 data sets of 50+ requirements each, expressed as user stories.

    The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].

    The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light

    This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1

    Overview of the datasets [data and links added in December 2024]

    The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.

    Public administration and transparency

    g02-federalspending.txt (2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to publicly share spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal-spending-related websites, including many more projects than the one described in the shared collection, can be found here.

    g03-loudoun.txt (2018) is a set of requirements extracted from a document by Loudoun County, Virginia, that describes the to-be user stories and use cases for a land management readiness assessment system called Loudoun County LandMARC. The source document can be found here; it is part of the Electronic Land Management System and EPlan Review Project RFP/RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.

    g04-recycling.txt (2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub repository and is the basis of a students' project on website design; the code is available (no license).

    g05-openspending.txt (2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.

    g11-nsf.txt (2018) is a collection of user stories for the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.

    (Research) data and meta-data management

    g08-frictionless.txt (2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.

    g14-datahub.txt (2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.

    g16-mis.txt (2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.

    g17-cask.txt (2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document with requirements regarding dataset management for Cask 4.0, covering the scenarios, the user stories, and a design for their implementation. The raw data is available in the following environment.

    g18-neurohub.txt (2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis, and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.

    g22-rdadmp.txt (2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.

    g23-archivesspace.txt (2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its…

  7. Legal Text Classification Dataset

    • kaggle.com
    Updated Oct 17, 2023
    Cite
    A.Mohan kumar (2023). Legal Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    A.Mohan kumar
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset contains a total of 25000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citations sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.

  8. Example Dataset of Exercise Analysis and Forecasting

    • ieee-dataport.org
    Updated Jun 17, 2025
    Cite
    Chengcheng Guo (2025). Example Dataset of Exercise Analysis and Forecasting [Dataset]. https://ieee-dataport.org/documents/example-dataset-exercise-analysis-and-forecasting
    Explore at:
    Dataset updated
    Jun 17, 2025
    Authors
    Chengcheng Guo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an example of the data set used in the experiments of the paper "A Multilevel Analysis and Hybrid Forecasting Algorithm for Long Short-term Step Data". It contains two parts: hourly step data and daily step data.

  9. Data from: Data Dictionary Template

    • catalog.data.gov
    • data-academy.tempe.gov
    • +9 more
    Updated Mar 18, 2023
    Cite
    City of Tempe (2023). Data Dictionary Template [Dataset]. https://catalog.data.gov/dataset/data-dictionary-template-2e170
    Explore at:
    Dataset updated
    Mar 18, 2023
    Dataset provided by
    City of Tempe
    Description

    Data Dictionary template for Tempe Open Data.

  10. Data Policy Templates

    • fsm-data.sprep.org
    • pacific-data.sprep.org
    • +13 more
    docx
    Updated Feb 20, 2025
    Cite
    Secretariat of the Pacific Regional Environment Programme (2025). Data Policy Templates [Dataset]. https://fsm-data.sprep.org/dataset/data-policy-templates
    Explore at:
    Available download formats: docx (39231), docx (68313), docx (28279)
    Dataset updated
    Feb 20, 2025
    Dataset provided by
    Pacific Regional Environment Programme (https://www.sprep.org/)
    License

    Public Domain Mark 1.0, https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Area covered
    Pacific Region
    Description

    This dataset contains templates of policies and MoUs on data sharing. You can download the Word templates and adapt the documents to your national context.

  11. Sample Dataset for Testing

    • ieee-dataport.org
    Updated Apr 28, 2025
    Cite
    Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    10

  12. Selfies & ID Images Dataset, 95,000 files

    • kaggle.com
    Updated Aug 1, 2023
    Cite
    KUCEV ROMAN (2023). Selfies & ID Images Dataset, 95,000 files [Dataset]. https://www.kaggle.com/datasets/tapakah68/selfies-id-images-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KUCEV ROMAN
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Selfies, ID Images Face Dataset

    5,591 sets, each of which includes 2 photos of a person from their documents and 13 selfies. 571 sets of Hispanics and 3,512 sets of Caucasians.

    The photographed documents contain only a photo of the person; all personal information from the document is hidden.

    💴 For commercial usage: the full version of the dataset includes 95,000+ photos of people. Leave a request on TrainingData to buy the dataset.

    Metadata for the full dataset:

    • assignment_id - unique identifier of the media file
    • worker_id - unique identifier of the person
    • age - age of the person
    • true_gender - gender of the person
    • country - country of the person
    • ethnicity - ethnicity of the person
    • photo_1_extension, photo_2_extension, …, photo_15_extension - photo extensions in the dataset
    • photo_1_resolution, photo_2_resolution, …, photo_15_resolution - photo resolution in the dataset

    Content

    The dataset includes 2 folders:

    • 18_sets_Caucasians - images of Caucasian people
    • 11_sets_Hispanics - images of Hispanic people

    In each folder there is a subfolder for every person in the dataset. Files are named "ID_1", "ID_2" for ID images and "Selfie_1", ..., "Selfie_13" for selfies.
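    A minimal Python traversal sketch for the stated layout; the root path is a placeholder.

    from pathlib import Path

    root = Path("selfies_id_dataset")  # placeholder root directory
    for group in ("18_sets_Caucasians", "11_sets_Hispanics"):
        for person_dir in sorted((root / group).iterdir()):
            if not person_dir.is_dir():
                continue
            ids = sorted(person_dir.glob("ID_*"))          # document photos: ID_1, ID_2
            selfies = sorted(person_dir.glob("Selfie_*"))  # selfies: Selfie_1 ... Selfie_13
            print(person_dir.name, len(ids), "ID images,", len(selfies), "selfies")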


    💴 Buy the dataset: this is just an example of the data. Leave a request on https://trainingdata.pro/datasets to learn about the price and buy the dataset.

    TrainingData provides high-quality data annotation tailored to your needs.

    keywords: biometric system, biometric dataset, face recognition database, face recognition dataset, face detection dataset, facial analysis, object detection dataset, deep learning datasets, computer vision dataset, human images dataset, human faces dataset, machine learning, image-to-image, re-identification, id photos, selfies and paired id photos, id verification models, passport, id card image, digital photo-identification

  13. Data Management Plan Examples Database

    • search.dataone.org
    • borealisdata.ca
    Updated Sep 4, 2024
    Cite
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
    Explore at:
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    Borealis
    Authors
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak
    Time period covered
    Jan 1, 2011 - Jan 1, 2023
    Description

    This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.

  14. Data from: Coarse datasets for the 2002-2010 Tsimane' Amazonian Panel...

    • scholarworks.brandeis.edu
    docx, pdf, xls
    Updated Mar 15, 2022
    Cite
    Ricardo Godoy; William R. Leonard; Victoria Reyes-Garcia; Tomas Huanca (2022). Coarse datasets for the 2002-2010 Tsimane' Amazonian Panel Study(TAPS) - Introduction and authorization [Dataset]. https://scholarworks.brandeis.edu/esploro/outputs/dataset/Coarse-datasets-for-the-2002-2010-Tsimane/9924097301801921
    Explore at:
    Available download formats: xls (1472000 bytes), pdf (140365 bytes), docx (32618 bytes)
    Dataset updated
    Mar 15, 2022
    Authors
    Ricardo Godoy; William R. Leonard; Victoria Reyes-Garcia; Tomas Huanca
    Time period covered
    Mar 2022
    Measurement technique
    See Chapter 4 of "Too little, too late" for general methods, and different chapters for methods on different topics.
    Description

    Introduction. This document provides an overview of an archive composed of four sections.

    [1] An introduction (this document) which describes the scope of the project

    [2] Yearly folders, from 2002 until 2010, of the coarse Microsoft Access datasets plus the surveys used to collect information for each year. The word coarse does not mean the information in the Microsoft Access dataset was not corrected for mistakes; it was, but some mistakes and inconsistencies remain, such as with data on age or education. Furthermore, the coarse dataset provides disaggregated information for selected topics, which appear as summary statistics in the clean dataset. For example, in the coarse dataset one can find the different illnesses afflicting a person during the past 14 days, whereas in the clean dataset only the total number of illnesses appears.

    [3] A letter from the Gran Consejo Tsimane’ authorizing the public use of de-identified data collected in our studies among Tsimane’.

    [4] A Microsoft Excel document with the unique identification number for each person in the panel study.


    Background. During 2002-2010, a team of international researchers, surveyors, and translators gathered longitudinal (panel) data on the demography, economy, social relations, health, nutritional status, local ecological knowledge, and emotions of about 1400 native Amazonians known as Tsimane’ who lived in thirteen villages near and far from towns in the department of Beni in the Bolivian Amazon. A report titled “Too little, too late” summarizes selected findings from the study and is available to the public at the electronic library of Brandeis University:

    https://scholarworks.brandeis.edu/permalink/01BRAND_INST/1bo2f6t/alma9923926194001921


    A copy of the clean, merged, and appended Stata (V17) dataset is available to the public at the following two web addresses:

    [a] Brandeis University:

    https://scholarworks.brandeis.edu/permalink/01BRAND_INST/1bo2f6t/alma9923926193901921

    [b] Inter-university Consortium for Political and Social Research (ICPSR), University of Michigan (only available to users affiliated with institutions belonging to ICPSR)

    http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/37671/utilization

    Chapter 4 of the report “Too little, too late” mentioned above describes the motivation and history of the study, the difference between the coarse and clean datasets, and topics which can be examined only with coarse data.


    Aims. The aims of this archive are to:

    · Make available in Microsoft Access the coarse de-identified dataset: [1] for each of the seven yearly surveys (2004-2010) and [2] one Access dataset based on quarterly surveys done during 2002 and 2003. Together, these two datasets form one longitudinal dataset of individuals, households, and villages.

    · Provide guidance on how to link files within and across years, and

    · Make available a Microsoft Excel file with a unique identification number to link individuals across years.

    The datasets in the archive.

    · Eight Microsoft Access datasets with data on a wide range of variables. Except for the Access file for 2002-2003, all the other information in each of the other Access files refers to one year. Within any Access dataset, users will find two types of files:

    o Thematic files. The name of a thematic file contains the prefix tbl (e.g., 29_tbl_Demography or tbl_29_Demography). The file name (sometimes in Spanish, sometimes in English) indicates the content of the file. For example, in the Access dataset for one year, the micro file tbl_30_Ventas has all the information on sales for that year. Within each micro file, columns contain information on a variable and the name of the column indicates the content of the variable. For instance, the column heading item in the Sales file would indicate the type of good sold. The exac…

  15. ID's photo Dataset | 67 countries | 11 types of documents | Document...

    • datarade.ai
    .jpg, .jpeg, .png
    Updated Jul 25, 2025
    Cite
    FileMarket (2025). ID's photo Dataset | 67 countries | 11 types of documents | Document Recognition | OCR Training | Computer Vision [Dataset]. https://datarade.ai/data-products/id-s-photo-dataset-67-countries-11-types-of-documents-d-filemarket
    Explore at:
    Available download formats: .jpg, .jpeg, .png
    Dataset updated
    Jul 25, 2025
    Dataset authored and provided by
    FileMarket
    Area covered
    Bulgaria, Egypt, France, Indonesia, Mexico, Sri Lanka, Cuba, Peru, Venezuela (Bolivarian Republic of), Benin
    Description

    Total individuals: 1661. Total images: 3623. Images per user: 2.18.

    Top countries (of 67 total):
    • Nigeria 44.6%
    • United States of America 7.2%
    • Bangladesh 7.1%
    • Ethiopia 6.7%
    • Indonesia 4.8%
    • India 4.8%
    • Kenya 2.4%
    • Iran 2.3%
    • Nepal 1.7%
    • Pakistan 1.4%

    Type of documents:
    • Identification Card (ID Card) 63.2%
    • Driver's License 6.4%
    • Student ID 4.9%
    • International passport 2.8%
    • Domestic passport 0.8%
    • Residence Permit 0.7%
    • Military ID 0.4%
    • Health Insurance Card 0.2%

    Data is organized in per‑user folders and includes rich metadata.

    Within a folder you may find: (a) multiple document categories for the same person, and/or (b) repeated captures of the same document against different backgrounds or lighting setups. The maximum volume per individual is 28 images.

    Metadata includes country of document, type of document, created date, last name, first name, day of birth, month of birth and year of birth.

    Every image was provided with explicit user consent. This ensures downstream use cases—such as training and evaluating document detection, classification, text extraction, and identity authentication models—are supported by legally sourced data.

  16. Sentence/Table Pair Data from Wikipedia for Pre-training with...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 29, 2021
    Cite
    Yu Su (2021). Sentence/Table Pair Data from Wikipedia for Pre-training with Distant-Supervision [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5612315
    Explore at:
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    Cong Yu
    Yu Su
    Huan Sun
    Alyssa Lees
    Xiang Deng
    You Wu
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset used for pre-training in "ReasonBERT: Pre-trained to Reason with Distant Supervision", EMNLP'21.

    There are two files:

    sentence_pairs_for_pretrain_no_tokenization.tar.gz -> contains only sentences as evidence (Text-only)

    table_pairs_for_pretrain_no_tokenization.tar.gz -> at least one piece of evidence is a table (Hybrid)

    The data is chunked into multiple tar files for easy loading. We use WebDataset, a PyTorch Dataset (IterableDataset) implementation providing efficient sequential/streaming data access.

    For pre-training code, or if you have any questions, please check our GitHub repo https://github.com/sunlab-osu/ReasonBERT

    Below is a sample code snippet to load the data

    import webdataset as wds

    # path to the uncompressed files, should be a directory with a set of tar files
    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = (
        wds.Dataset(url)
        .shuffle(1000)     # cache 1000 samples and shuffle
        .decode()
        .to_tuple("json")
        .batched(20)       # group every 20 examples into a batch
    )

    Please see the WebDataset documentation for more details about how to use it as a dataloader for PyTorch.

    You can also iterate through all examples and dump them in your preferred data format, as in the sketch below.
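    For example, a minimal sketch (ours, not the authors'; it follows the snippet above, whose wds.Dataset API may differ in newer WebDataset releases) that dumps every example to a JSON-lines file:

    import json
    import webdataset as wds

    url = './sentence_multi_pairs_for_pretrain_no_tokenization/{000000...000763}.tar'
    dataset = wds.Dataset(url).decode().to_tuple("json")  # no shuffling/batching needed for a dump

    with open("examples.jsonl", "w", encoding="utf-8") as out:
        for (example,) in dataset:  # .to_tuple("json") yields 1-tuples of decoded dicts
            out.write(json.dumps(example) + "\n")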

    Below we show how the data is organized with two examples.

    Text-only

    {
        's1_text': 'Sils is a municipality in the comarca of Selva, in Catalonia, Spain.',  # query sentence
        's1_all_links': {  # list of entities and their mentions in the sentence (start, end location)
            'Sils,_Girona': [[0, 4]],
            'municipality': [[10, 22]],
            'Comarques_of_Catalonia': [[30, 37]],
            'Selva': [[41, 46]],
            'Catalonia': [[51, 60]]
        },
        'pairs': [  # other sentences that share a common entity pair with the query, grouped by shared entity pair
            {
                'pair': ['Comarques_of_Catalonia', 'Selva'],  # the common entity pair
                's1_pair_locs': [[[30, 37]], [[41, 46]]],  # mentions of the entity pair in the query
                's2s': [  # list of other sentences that contain the common entity pair, i.e. the evidence
                    {
                        'md5': '2777e32bddd6ec414f0bc7a0b7fea331',
                        'text': 'Selva is a coastal comarque (county) in Catalonia, Spain, located between the mountain range known as the Serralada Transversal or Puigsacalm and the Costa Brava (part of the Mediterranean coast). Unusually, it is divided between the provinces of Girona and Barcelona, with Fogars de la Selva being part of Barcelona province and all other municipalities falling inside Girona province. Also unusually, its capital, Santa Coloma de Farners, is no longer among its larger municipalities, with the coastal towns of Blanes and Lloret de Mar having far surpassed it in size.',
                        's_loc': [0, 27],  # in addition to the sentence containing the common entity pair, we also keep its surrounding context; 's_loc' is the start/end location of the actual evidence sentence
                        'pair_locs': [  # mentions of the entity pair in the evidence
                            [[19, 27]],  # mentions of entity 1
                            [[0, 5], [288, 293]]  # mentions of entity 2
                        ],
                        'all_links': {
                            'Selva': [[0, 5], [288, 293]],
                            'Comarques_of_Catalonia': [[19, 27]],
                            'Catalonia': [[40, 49]]
                        }
                    },
                    ...  # there are multiple evidence sentences
                ]
            },
            ...  # there are multiple entity pairs in the query
        ]
    }

    Hybrid

    {
        's1_text': 'The 2006 Major League Baseball All-Star Game was the 77th playing of the midseason exhibition baseball game between the all-stars of the American League (AL) and National League (NL), the two leagues comprising Major League Baseball.',
        's1_all_links': {...},  # same as Text-only
        'sentence_pairs': [{'pair': ..., 's1_pair_locs': ..., 's2s': [...]}],  # same as Text-only
        'table_pairs': [
            {
                'tid': 'Major_League_Baseball-1',
                'text': [  # table content, list of rows
                    ['World Series Records', 'World Series Records', ...],
                    ['Team', 'Number of Series won', ...],
                    ['St. Louis Cardinals (NL)', '11', ...],
                    ...
                ],
                'index': [  # index of each cell [row_id, col_id]; we keep only a table snippet, but the index here is from the original table
                    [[0, 0], [0, 1], ...],
                    [[1, 0], [1, 1], ...],
                    ...
                ],
                'value_ranks': [  # if the cell contains a numeric value/date, this is its rank ordered from small to large, following TAPAS
                    [0, 0, ...],
                    [0, 0, ...],
                    [0, 10, ...],
                    ...
                ],
                'value_inv_ranks': [],  # inverse rank
                'all_links': {
                    'St._Louis_Cardinals': {
                        '2': [  # list of mentions in the second row, the key is row_id
                            [[2, 0], [0, 19]]  # [[row_id, col_id], [start, end]]
                        ]
                    },
                    'CARDINAL:11': {'2': [[[2, 1], [0, 2]]], '8': [[[8, 3], [0, 2]]]}
                },
                'name': '',  # table name, if it exists
                'pairs': {
                    'pair': ['American_League', 'National_League'],
                    's1_pair_locs': [[[137, 152]], [[162, 177]]],  # mentions in the query
                    'table_pair_locs': {
                        '17': [  # mentions of the entity pair in row 17
                            [
                                [[17, 0], [3, 18]],
                                [[17, 1], [3, 18]],
                                [[17, 2], [3, 18]],
                                [[17, 3], [3, 18]]
                            ],  # mentions of the first entity
                            [
                                [[17, 0], [21, 36]],
                                [[17, 1], [21, 36]]
                            ]  # mentions of the second entity
                        ]
                    }
                }
            }
        ]
    }

  17. Meta data and supporting documentation

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the R code is available online here: https://github.com/warrenjl/SpGPCW.

    Format: Abstract. The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description: Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary):

    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).

  18. Mapping Example

    • catalog.data.gov
    • opendata.maryland.gov
    • +1 more
    Updated Jun 30, 2022
    Cite
    opendata.maryland.gov (2022). Mapping Example [Dataset]. https://catalog.data.gov/dataset/mapping-example
    Explore at:
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    opendata.maryland.gov
    Description

    Tutorial document describing how to go about mapping crash data in Tyler Data & Insights. This was done using the Crash Data dataset.

  19. Example Dataset - Subside_Dataset - DSO Data Discovery

    • ckan.tacc.utexas.edu
    Updated Jul 14, 2025
    Cite
    ckan.tacc.utexas.edu (2025). Example Dataset - Subside_Dataset - DSO Data Discovery [Dataset]. https://ckan.tacc.utexas.edu/dataset/test-dataset
    Explore at:
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    Dataset of example files for use in testing CKAN configurations and settings. Datasets to add / test:
    • tabular data [csv, xlsx]
    • documents [pdf, docx, json, md]
    • web [html]
    • image [jpg, png, tiff]
    • other [tiff]
    • geospatial [?]

  20. ToS;DR policies dataset (raw) - 21/07/2023

    • zenodo.org
    csv
    Updated May 5, 2025
    Cite
    Zenodo (2025). ToS;DR policies dataset (raw) - 21/07/2023 [Dataset]. http://doi.org/10.5281/zenodo.15012282
    Explore at:
    csvAvailable download formats
    Dataset updated
    May 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    GNU General Public License v3.0, https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    This dataset has been collected and annotated by Terms of Service; Didn't Read (ToS;DR), an independent project aimed at analyzing and summarizing the terms of service and privacy policies of various online services. ToS;DR helps users understand the legal agreements they accept when using online platforms by categorizing and evaluating specific cases related to these policies.

    The dataset includes structured information on individual cases, broader topics, specific services, detailed documents, and key points extracted from legal texts; a minimal sketch for joining these tables follows the field listing below.

    • Cases refer to individual legal cases or specific issues related to the terms of service or privacy policies of a particular online service. Each case typically focuses on a specific aspect of a service's terms, such as data collection, user rights, content ownership, or security practices.

      • id, a unique id for each case (incremental).
      • classification, one of those values (good, bad, neutral, blocker).
      • score, values range from 0 to 100.
      • title.
      • description.
      • topic_id, connecting the case with its topic.
      • created_at.
      • updated_at.
      • privacy_related, a flag indicating whether the case is related to privacy.
      • docbot_regex, the regular expression used to check for specific words in the quoted text.
    • Topics are general categories or themes that encompass various cases. They help organize and group similar cases together based on the type of issues they address. For example, "Data Collection" could be a topic that includes cases related to how a service collects and uses user data.

      • id, a unique id for each topic (incremental).
      • title.
      • subtitle, a short description.
      • description.
      • created_at.
      • updated_at.
    • Services represent specific online platforms, websites, or applications that have their own terms of service and privacy policies.

      • id, a unique id for each service (incremental).
      • name.
      • url.
      • created_at.
      • updated_at.
      • wikipedia, the Wikipedia URL of the service.
      • keywords.
      • related, connecting the service with a known similar service in the same field.
      • slug, extracted from the name (lowercase, no spaces, and so on).
      • is_comprehensively_reviewed, a flag indicating whether the service is comprehensively reviewed.
      • rating, the overall rating for the service based on all cases.
      • status, indicating whether the service is deleted (deleted, NaN).
    • Points are individual statements or aspects within a case that highlight important information about a service's terms of service or privacy policy. These points can be positive (e.g., strong privacy protections) or negative (e.g., data sharing with third parties).

      • id, a unique id for each point (incremental).
      • rank, all values are zero.
      • title, mostly similar to the case title.
      • source, url of the source.
      • status, one of those values (approved, declined, pending, changes-requested, disputed, draft).
      • analysis.
      • created_at.
      • updated_at.
      • service_id, connecting the point with its service.
      • quote_text, the quoted text from the source which contains the information for this point.
      • case_id, connecting the point with the related case.
      • old_id, used for data migration.
      • quote_start, index of the first letter of the quoted text in the document.
      • quote_end, index of the last letter of the quoted text in the document.
      • service_needs_rating_update, all values are False.
      • document_id, connecting the point with the related document.
      • annotation_ref.
    • Documents refer to the original terms of service and privacy policies of the services that are being analyzed on TOSDR. These documents are the source of information for the cases, points, and ratings provided on the platform. TOSDR links to the actual documents, so users can review the full details if they choose to.

      • id, a unique id for each document (incremental).
      • name, the name of the document, like privacy policy or cookies policy, etc.
      • url, the url of the document.
      • xpath.
      • text, the actual document text.
      • created_at.
      • updated_at.
      • service_id, connecting the document with its service.
      • reviewed, a flag indicating whether the document has been reviewed.
      • status, indicating whether the document is deleted (deleted, NaN).
      • crawler_server, the server used to crawl the document.
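    Since points reference cases, services, and documents by id, the tables can be joined for analysis. A minimal pandas sketch; the CSV file names are placeholders for the files in this dataset.

    import pandas as pd

    points = pd.read_csv("points.csv")      # placeholder file names
    cases = pd.read_csv("cases.csv")
    services = pd.read_csv("services.csv")

    # attach each point's case classification and service name via the id columns
    merged = (
        points
        .merge(cases[["id", "classification", "score"]],
               left_on="case_id", right_on="id", suffixes=("", "_case"))
        .merge(services[["id", "name", "rating"]],
               left_on="service_id", right_on="id", suffixes=("", "_service"))
    )

    # e.g., count approved points per case classification
    print(merged[merged["status"] == "approved"]["classification"].value_counts())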
