12 datasets found

Urdu Handwritten Text Dataset
data.mendeley.com
Updated Aug 9, 2021
Cite
Mujtaba Husnain (2021). Urdu Handwritten Text Dataset [Dataset]. http://doi.org/10.17632/bg2sctsysf.1
Unique identifier
https://doi.org/10.17632/bg2sctsysf.1
Dataset updated
Aug 9, 2021
Authors
Mujtaba Husnain
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains images of handwritten text in Urdu, one of the most widely spoken languages in South Asia. Native-speaking writers from different social backgrounds were invited to copy a pre-written text in their own handwriting. The text was composed so that it includes almost all the characters, ligatures, diacritics, and dots used in Urdu script. People with disabilities also took part, to make the data collection more comprehensive. Demographic data on the writers was also recorded to support research such as author identification and text matching.
MANUU: Handwritten Urdu OCR Dataset
ieee-dataport.org
Updated Dec 15, 2024
Cite
Shaik Ahmed (2024). MANUU: Handwritten Urdu OCR Dataset [Dataset]. https://ieee-dataport.org/documents/manuu-handwritten-urdu-ocr-dataset
Dataset updated
Dec 15, 2024
Authors
Shaik Ahmed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
digits
UHaT Dataset: Urdu Handwritten Text Dataset
zenodo.org
explore.openaire.eu
zip
Updated Feb 18, 2020
Cite
Hazrat Ali; Hazrat Ali (2020). UHaT Dataset: Urdu Handwritten Text Dataset [Dataset]. http://doi.org/10.5281/zenodo.3670611
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3670611
Dataset updated
Feb 18, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hazrat Ali; Hazrat Ali
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
UHaT Dataset

UHaT: Urdu Handwritten Text Dataset

This dataset contains handwritten characters and digits of the Urdu language. The samples were written by more than 900 individuals.

Description and organization

Size of images: All images are stored at 28 × 28 resolution.

How many images: The training set contains about 700 images per character on average. For example, there are 811 training images for AYN and 697 for ALIF. Similarly, the training set contains about 700 images per digit on average; for example, there are 678 training images for the digit one. The test set contains about 140 images per character on average (for example, 145 test images for ALIF) and about 140 images per digit on average (for example, 147 test images for the digit nine).

The dataset is organized into four sub-directories: characters training set, characters test set, digits training set, and digits test set. Each sub-directory contains one sub-folder per character. For example, all the training images for the character ALIF are placed in the sub-folder Alif.

The folder hierarchy is given as:

Data > characters train set > alif

Data > characters train set > ayn

And so on.
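
The folder layout described above can be walked with the standard library. The sketch below counts image files in each class sub-folder of one split directory. The function name, the path in the usage note, and the accepted file extensions are assumptions, since the listing states neither the exact folder names on disk nor the image format.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; the listing does not say which format the files use.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(split_dir):
    """Return {class_folder_name: number_of_image_files} for one split directory."""
    counts = Counter()
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return dict(counts)
```

A call such as `count_images_per_class("Data/characters train set")` (hypothetical path) would be one way to check the per-class counts against the averages quoted in the description.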

How to load directly?

You can also load the data directly from the uhat_dataset.npz file; see the load_dataset kernel.
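
The array names stored inside uhat_dataset.npz are not listed here, so it can help to inspect the archive before loading it. A .npz file is a zip archive holding one .npy member per array, so the names can be read with the standard library alone. The helper below is a hypothetical inspection sketch, not the load_dataset kernel itself.

```python
import zipfile


def list_npz_arrays(npz_path):
    """Return the array names stored in a .npz archive, without loading them."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as "<name>.npy"; strip the suffix to get the key.
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )
```

Once the names are known, `numpy.load("uhat_dataset.npz")` returns a dict-like object that can be indexed by them.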

Acknowledgements

Thanks to all volunteers who contributed by providing handwriting samples.

Inspiration

This is an MNIST-style dataset. The machine learning community in general will find it useful for experimenting with and demonstrating machine learning models.
The dataset will also give researchers an opportunity to work on Urdu text recognition.
Urdu Handwritten Text Dataset Dataset
paperswithcode.com
Updated Mar 18, 2025
Cite
(2025). Urdu Handwritten Text Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/urdu-handwritten-text-dataset
Dataset updated
Mar 18, 2025
Description

This dataset consists of high-quality images of handwritten text in the Urdu language, one of the most commonly spoken languages in South Asia, especially in Pakistan, India, and surrounding regions. The dataset has been created by inviting native Urdu speakers from diverse social, educational, and cultural backgrounds to write a predefined text in their natural handwriting style. This predefined text was carefully curated to cover the full range of Urdu characters, ligatures, diacritics, dots, and special symbols used in everyday writing.

Dataset Features

Diverse Handwriting Styles: The dataset includes contributions from native speakers across different demographics, ensuring a rich variety of handwriting styles.

Comprehensive Character Set: The predefined text covers all characters, ligatures, diacritics, and dots commonly used in Urdu script.

Inclusivity: Contributions from people with disabilities add unique variations to the dataset, making it more diverse and comprehensive.

Demographic Information

The demographic details of contributors, including age, gender, and educational background, are recorded. This information is particularly valuable for research related to author identification, handwriting analysis, and text-matching algorithms.

Potential Applications

This dataset has numerous applications, including:

OCR Development: Enhancing Optical Character Recognition systems for Urdu text.

Handwriting Authentication: Improving security through handwriting-based user verification.

Linguistic Studies: Supporting research in Urdu language processing, script digitalization, and handwriting analysis.

Forensic Handwriting Analysis: Assisting in forensic research for identifying individual handwriting patterns.

Multilingual Handwriting Recognition: Building robust AI models that can recognize handwriting across different languages and scripts.

Quality Control

The dataset has undergone a rigorous quality check to ensure consistency, accuracy, and usability across various academic and commercial research projects, particularly those that involve natural language processing and computer vision technologies.

This dataset is sourced from Kaggle.
Urdu Shopping List OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Urdu Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/urdu-shopping-list-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Urdu Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Urdu language.
Dataset Content & Diversity:
Containing more than 2000 images, this Urdu OCR dataset covers a wide range of shopping list images. It includes a variety of handwritten text found on shopping lists: sentences, individual item names, quantities, comments, and so on. The images showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow fewer than three unique images per handwriting, so that the dataset covers many different handwriting styles. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image at least 80% of the space contains visible Urdu text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and photographed by native Urdu speakers to ensure text quality, prevent toxic content, and exclude PII text. We used the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Urdu text recognition models.
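A metadata file of this kind can be summarized with the standard csv module. The sketch below tallies images by orientation. The column name `orientation` and the file name in the usage note are hypothetical, since the listing names the metadata fields only informally and does not give the actual CSV headers.

```python
import csv
from collections import Counter


def orientation_counts(metadata_csv, column="orientation"):
    """Count rows per value of one metadata column (column name is an assumption)."""
    with open(metadata_csv, newline="", encoding="utf-8") as fh:
        return Counter(row[column] for row in csv.DictReader(fh))
```

For example, `orientation_counts("metadata.csv")` (hypothetical file name) would show how evenly portrait and landscape images are represented.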
Update & Custom Collection:
We are committed to continually expanding this dataset by adding more images with the help of our native Urdu crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License:
This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Urdu language. Your journey to improved language understanding and processing begins here.
Urdu Handwritten Ligature Dataset
ieee-dataport.org
Updated Nov 4, 2024
Cite
AEJAZ GANAI (2024). Urdu Handwritten Ligature Dataset [Dataset]. https://ieee-dataport.org/documents/urdu-handwritten-ligature-dataset
Dataset updated
Nov 4, 2024
Authors
AEJAZ GANAI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ink color
Urdu Handwriting Dataset for Demographic Traits Classification
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Rehman, Huma (2020). Urdu Handwriting Dataset for Demographic Traits Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2573098
Dataset updated
Jan 24, 2020
Dataset provided by
Mirza, Ghulam Ali
Rehman, Huma
Mustufa, Syed Ghulam
Description
The Urdu Handwriting Dataset for Demographic Traits Classification was developed at Bahria University, Islamabad, Pakistan, as part of a bachelor's degree final-year thesis/project. The authors describe it as the first dataset of its kind. It comprises 1000 handwriting images, each taken from a different individual. As the title indicates, the handwriting samples are in Urdu. The dataset targets the classification of demographic traits, so it includes demographic information for each individual. The following demographic traits are covered:

Gender (Male, Female)

Handedness (Left, Right)

Age-Group (15-20, 21-30, 31-40, 41-50, 51-up)

Province (Balochistan, Sindh, Punjab, KPK, Gilgit-Baltistan, None)

Occupation (Student, Employee, Both, None)

Education (Primary (Below Matriculation), Matriculation, Intermediate, Bachelors, Masters, PhD, None)
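
For classification experiments, the traits listed above can be written down as label vocabularies and mapped to integer class indices. This is a minimal sketch: the dictionary keys, the spelling of the values, and the `encode_traits` helper are assumptions for illustration, and the dataset's own annotation files may store these labels differently.

```python
# Label vocabularies transcribed from the trait list; key names are hypothetical.
TRAITS = {
    "gender": ["Male", "Female"],
    "handedness": ["Left", "Right"],
    "age_group": ["15-20", "21-30", "31-40", "41-50", "51-up"],
    "province": ["Balochistan", "Sindh", "Punjab", "KPK", "Gilgit-Baltistan", "None"],
    "occupation": ["Student", "Employee", "Both", "None"],
    "education": [
        "Primary (Below Matriculation)", "Matriculation", "Intermediate",
        "Bachelors", "Masters", "PhD", "None",
    ],
}


def encode_traits(record):
    """Map each trait value in a record to its class index.

    Raises KeyError for an unknown trait and ValueError for an unknown value.
    """
    return {trait: TRAITS[trait].index(value) for trait, value in record.items()}
```

Keeping one vocabulary per trait makes it straightforward to train a separate classifier head for each trait, with as many output classes as the vocabulary has entries.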
Details of hyper-parameters for proposed ET-Network.
plos.figshare.com
xls
Updated May 17, 2024
Cite
Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Details of hyper-parameters for proposed ET-Network. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t007
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302590.t007
Dataset updated
May 17, 2024
Dataset provided by
PLOS ONE
Authors
Ameer Hamza; Shengbing Ren; Usman Saeed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Details of hyper-parameters for proposed ET-Network.
The comparison of CER on valid and test splits with proposed models.
plos.figshare.com
xls
Updated May 17, 2024
Cite
Ameer Hamza; Shengbing Ren; Usman Saeed (2024). The comparison of CER on valid and test splits with proposed models. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302590.t004
Dataset updated
May 17, 2024
Dataset provided by
PLOS ONE
Authors
Ameer Hamza; Shengbing Ren; Usman Saeed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The comparison of CER on valid and test splits with proposed models.
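The listing does not define CER. Assuming it denotes character error rate as commonly used in text recognition, it is the edit distance between the predicted and reference strings divided by the reference length. The sketch below shows that common definition; it is not taken from the paper, which may normalize or tokenize differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Python strings are compared code point by code point here, so in Urdu text a combining diacritic counts as its own character.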
The comparison of the proposed model with state-of-the-art methods.
plos.figshare.com
xls
Updated May 17, 2024
Cite
Ameer Hamza; Shengbing Ren; Usman Saeed (2024). The comparison of the proposed model with state-of-the-art methods. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302590.t005
Dataset updated
May 17, 2024
Dataset provided by
PLOS ONE
Authors
Ameer Hamza; Shengbing Ren; Usman Saeed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The comparison of the proposed model with state-of-the-art methods.
Advantages and disadvantages of state-of-the-art methods.
plos.figshare.com
xls
Updated May 17, 2024
Cite
Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Advantages and disadvantages of state-of-the-art methods. [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t006
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302590.t006
Dataset updated
May 17, 2024
Dataset provided by
PLOS ONE
Authors
Ameer Hamza; Shengbing Ren; Usman Saeed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advantages and disadvantages of state-of-the-art methods.
Ablation study of self-attention layer (SAL).
plos.figshare.com
xls
Updated May 17, 2024
Cite
Ameer Hamza; Shengbing Ren; Usman Saeed (2024). Ablation study of self-attention layer (SAL). [Dataset]. http://doi.org/10.1371/journal.pone.0302590.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302590.t003
Dataset updated
May 17, 2024
Dataset provided by
PLOS ONE
Authors
Ameer Hamza; Shengbing Ren; Usman Saeed
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The table evaluates the proposed model with different attention layers.
92 scholarly articles cite the Urdu Handwritten Text Dataset (Mujtaba Husnain, 2021).