Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains images of handwritten text in Urdu, one of the most widely spoken languages in South Asia. Native-speaking authors from different social domains were invited to copy a pre-written text in their own handwriting. The text was composed to include almost all the characters, ligatures, diacritics, and dots used in Urdu script. Persons with disabilities also took part in the writing, which makes the data collection more comprehensive. The authors' demographic data is also recorded to support research such as author identification and text matching.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
digits
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
UHaT Dataset
UHaT: Urdu Handwritten Text Dataset
This dataset contains handwritten characters and digits of the Urdu language. The samples were written by 900+ individuals.
Description and organization
Size of images: All images are stored at a resolution of 28 by 28.
How many images: The training set contains about 700 images per character on average. For example, there are 811 training images for AYN and 697 training images for ALIF. Similarly, the training set contains about 700 images per digit on average. For example, there are 678 training images for the digit one. The test set contains about 140 images per character on average. For example, there are 145 test images for the character ALIF. The test set also contains about 140 images per digit on average. For example, there are 147 test images for the digit nine.
The dataset is organized into four sub-directories: characters training set, characters test set, digits training set, and digits test set. Each sub-directory contains one sub-folder per character. For example, all the training images for the character ALIF are placed in the sub-folder Alif.
The folder hierarchy is given as:
Data > characters train set > alif
Data > characters train set > ayn
And so on.
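As a rough illustration of the folder layout described above, the following sketch counts the image files in each class sub-folder of one split directory. The function name count_images_per_class and the set of accepted file extensions are assumptions made for this example; the description does not state the image file format.

```python
from pathlib import Path

# Assumed image extensions; the dataset description does not name the format.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(split_dir):
    """Return {class_folder_name: number_of_image_files} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts
```

Calling count_images_per_class("Data/characters train set") on a local copy should then give one entry per character folder, which can be compared against the per-character counts quoted in the description.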
How to load directly?
You can also load it directly from the uhat_dataset.npz file. See the kernel load_dataset
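The load_dataset kernel itself is not reproduced here. A minimal sketch of reading an .npz archive with NumPy might look like the following; the helper name load_npz_arrays is hypothetical, and since the array names stored inside uhat_dataset.npz are not listed in this description, the sketch returns whatever arrays it finds rather than assuming specific keys.

```python
import numpy as np


def load_npz_arrays(path):
    """Load every array stored in an .npz archive into a plain dict."""
    with np.load(path) as archive:
        # archive.files lists the stored array names. The names used in
        # uhat_dataset.npz are not documented here, so inspect them first.
        return {name: archive[name] for name in archive.files}
```

For example, arrays = load_npz_arrays("uhat_dataset.npz") followed by printing each name and its shape shows which arrays hold the images and which hold the labels.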
Acknowledgements
Thanks to all volunteers who contributed by providing handwriting samples.
Inspiration
This is an MNIST-style dataset. The machine learning community in general will find it useful for experimenting with and demonstrating machine learning models.
The dataset will also give researchers an opportunity to work on Urdu text recognition.
Description:
This dataset consists of high-quality images of handwritten text in the Urdu language, one of the most commonly spoken languages in South Asia, especially in Pakistan, India, and surrounding regions. The dataset has been created by inviting native Urdu speakers from diverse social, educational, and cultural backgrounds to write a predefined text in their natural handwriting style. This predefined text was carefully curated to cover the full range of Urdu characters, ligatures, diacritics, dots, and special symbols used in everyday writing.
Dataset Features
Diverse Handwriting Styles: The dataset includes contributions from native speakers across different demographics, ensuring a rich variety of handwriting styles.
Comprehensive Character Set: The predefined text covers all characters, ligatures, diacritics, and dots commonly used in Urdu script.
Inclusivity: Contributions from people with disabilities add unique variations to the dataset, making it more diverse and comprehensive.
Demographic Information
The demographic details of contributors, including age, gender, and educational background, are recorded. This information is particularly valuable for research related to author identification, handwriting analysis, and text-matching algorithms.
Potential Applications
This dataset has numerous applications, including:
OCR Development: Enhancing Optical Character Recognition systems for Urdu text.
Handwriting Authentication: Improving security through handwriting-based user verification.
Linguistic Studies: Supporting research in Urdu language processing, script digitization, and handwriting analysis.
Forensic Handwriting Analysis: Assisting in forensic research for identifying individual handwriting patterns.
Multilingual Handwriting Recognition: Building robust AI models that can recognize handwriting across different languages and scripts.
Quality Control
The dataset has undergone a rigorous quality check to ensure consistency, accuracy, and usability across various academic and commercial research projects, particularly those that involve natural language processing and computer vision technologies.
This dataset is sourced from Kaggle.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Introducing the Urdu Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Urdu language.
Dataset Content & Diversity: Containing more than 2000 images, this Urdu OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll find a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, and comments. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow fewer than three unique images per handwriting, so the dataset covers many different writers. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that visible Urdu text covers at least 80% of the space in each image.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and photographed by native Urdu speakers to ensure text quality, prevent toxic content, and exclude PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata: In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Urdu text recognition models.
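One way to use such per-image metadata is to index the CSV by image file name and then filter on a field. The sketch below assumes hypothetical column names (image_name, orientation); the actual header names in the delivered CSV are not given in this description and would need to be checked first.

```python
import csv


def index_metadata(csv_path, key_column="image_name"):
    """Read the metadata CSV into {image file name: row dict}.

    key_column is an assumed header name, not one confirmed by the dataset.
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if key_column not in (reader.fieldnames or []):
            raise KeyError(f"column {key_column!r} not found in {csv_path}")
        return {row[key_column]: row for row in reader}


def select_images(metadata, **criteria):
    """Return sorted image names whose metadata matches every given field."""
    return sorted(
        name
        for name, row in metadata.items()
        if all(row.get(field) == value for field, value in criteria.items())
    )
```

With these helpers, select_images(index_metadata("metadata.csv"), orientation="portrait") would list the portrait images, assuming the file and column names match.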
Update & Custom Collection: We are committed to continually expanding this dataset by adding more images with the help of our native Urdu crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License: This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion: Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Urdu language. Your journey to improved language understanding and processing begins here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ink color
The Urdu Handwriting Dataset for Demographic Traits Classification was developed at Bahria University, Islamabad, Pakistan, as part of a bachelor's degree final-year thesis project. It is presented as the first dataset of its kind. The dataset is composed of 1000 handwriting images, each taken from a different individual. As the title indicates, the handwriting samples are in Urdu. Because the dataset is intended for the demographic traits classification problem, it includes demographic information for each individual. The following demographic traits are covered in the dataset:
Gender (Male, Female)
Handedness (Left, Right)
Age-Group (15-20, 21-30, 31-40, 41-50, 51-up)
Province (Balochistan, Sindh, Punjab, KPK, Gilgit-Baltistan, None)
Occupation (Student, Employee, Both, None)
Education (Primary(Below Matriculation), Matriculation, Intermediate, Bachelors, Masters, PHD, None)
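For classification experiments, the traits listed above can be turned into integer class indices, one per trait. The sketch below is only an illustration: the dictionary keys, the lower-cased label spellings, and the shortening of "Primary(Below Matriculation)" to "primary" are assumptions, since the exact label strings used in the dataset files are not given here.

```python
# Class vocabularies taken from the trait list above. Spellings are
# lower-cased for matching, and "primary" stands in for
# "Primary(Below Matriculation)"; both choices are assumptions.
TRAITS = {
    "gender": ["male", "female"],
    "handedness": ["left", "right"],
    "age_group": ["15-20", "21-30", "31-40", "41-50", "51-up"],
    "province": ["balochistan", "sindh", "punjab", "kpk",
                 "gilgit-baltistan", "none"],
    "occupation": ["student", "employee", "both", "none"],
    "education": ["primary", "matriculation", "intermediate",
                  "bachelors", "masters", "phd", "none"],
}


def encode_traits(record):
    """Map a dict of trait values to integer class indices, one per trait.

    Raises KeyError if a trait is missing from the record and ValueError
    if a value is not in that trait's vocabulary.
    """
    encoded = {}
    for trait, classes in TRAITS.items():
        value = str(record[trait]).strip().lower()
        encoded[trait] = classes.index(value)
    return encoded
```

Keeping one index per trait, rather than one combined label, lets each trait be trained as its own classification task.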
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Details of hyper-parameters for proposed ET-Network.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The comparison of CER on valid and test splits with proposed models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The comparison of the proposed model with state-of-the-art methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advantages and disadvantages of state-of-the-art methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Evaluation of the proposed model with different attention layers.