Dataset Card for Dataset Name
Dataset Details
Dataset Description
Content: A version of the Tashkeela Arabic diacritized text dataset, cleaned of non-Arabic content and undiacritized text, then divided into training, development, and test sets. The cleaning process includes removing XML tags and strange symbols, as well as fixing… See the full description on the dataset page: https://huggingface.co/datasets/EmanKhater/Tashkeela.
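The cleaning steps named above (dropping XML tags, strange symbols, and undiacritized text) could look roughly like the following. This is a minimal sketch, not the dataset author's pipeline: the function names `clean_line` and `is_diacritized`, the set of punctuation kept, and the exact character ranges are all assumptions made for illustration.

```python
import re

# Arabic diacritic (tashkeel) marks, fathatan through sukun: U+064B..U+0652.
TASHKEEL = re.compile(r"[\u064B-\u0652]")
# Anything that looks like an XML/HTML tag.
XML_TAG = re.compile(r"<[^>]+>")
# Assumption: keep Arabic letters, tatweel, diacritics, whitespace, and a
# small set of Latin and Arabic punctuation; drop everything else.
NON_ARABIC = re.compile(r"[^\u0621-\u063A\u0640-\u0652\s.,!?:\u060C\u061B\u061F]")


def clean_line(line: str) -> str:
    """Strip tags and non-Arabic symbols, then collapse whitespace."""
    line = XML_TAG.sub(" ", line)
    line = NON_ARABIC.sub("", line)
    return re.sub(r"\s+", " ", line).strip()


def is_diacritized(line: str) -> bool:
    """True if the line carries at least one tashkeel mark."""
    return TASHKEEL.search(line) is not None


def clean_corpus(lines):
    """Clean every line and keep only non-empty, diacritized ones."""
    cleaned = (clean_line(line) for line in lines)
    return [line for line in cleaned if line and is_diacritized(line)]
```

A call such as `clean_corpus(open("tashkeela.txt", encoding="utf-8"))` (a hypothetical file name) would return only the cleaned, diacritized lines; splitting them into training, development, and test sets is a separate step.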
📚 Sadeed Tashkeela Arabic Diacritization Dataset
The Sadeed dataset is a large, high-quality Arabic diacritized corpus optimized for training and evaluating Arabic diacritization models. The training set is built exclusively from the Tashkeela corpus, and the test set is a refined version of the Fadel Tashkeela test set.
Dataset Overview
Training Data:
Source: Cleaned version of the Tashkeela corpus (original data is ~75 million words, mostly Classical… See the full description on the dataset page: https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela.
https://choosealicense.com/licenses/gpl-2.0/
ROOTS Subset: roots_ar_tashkeela
Tashkeela
Dataset uid: tashkeela
Description
The dataset was collected from 97 books in both Modern and Classical Arabic. The dataset contains Arabic diacritics. The dataset is
Homepage
https://sourceforge.net/projects/tashkeela/
Licensing
gpl-2.0: GNU General Public License v2.0 only
Speaker Locations
Sizes
0.2533 % of total
2.3340 % of ar
BigScience processing steps
Filters… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_tashkeela.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Arabic Tashkeel Dataset
This is a fairly large dataset gathered from five main sources:
tashkeela (1.79GB - 45.05%): The entire Tashkeela dataset, repurposed into sentences. Some rows were omitted because they contain a low rate of diacritics (tashkeel characters).
shamela (1.67GB - 42.10%): Random pages from over 2,000 books in the Shamela Library. Pages were selected using the below function (high diacritics rate).
wikipedia (269.94MB - 6.64%): A collection of Wikipedia articles. Diacritics… See the full description on the dataset page: https://huggingface.co/datasets/Abdou/arabic-tashkeel-dataset.
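The selection function mentioned above is not included in this excerpt. As an illustration only, a diacritic-rate filter could be sketched as follows; the names `diacritic_rate` and `keep_row` and the 0.5 threshold are hypothetical and are not taken from the dataset card.

```python
# Tashkeel marks, fathatan through sukun: U+064B..U+0652.
TASHKEEL_MARKS = {chr(code) for code in range(0x064B, 0x0653)}
TATWEEL = "\u0640"  # elongation character, not counted as a letter


def diacritic_rate(text: str) -> float:
    """Ratio of tashkeel marks to Arabic letters in the text."""
    letters = sum(
        1 for ch in text if "\u0621" <= ch <= "\u064A" and ch != TATWEEL
    )
    marks = sum(1 for ch in text if ch in TASHKEEL_MARKS)
    return marks / letters if letters else 0.0


def keep_row(text: str, min_rate: float = 0.5) -> bool:
    """Keep a row only if its diacritic rate reaches min_rate.

    The default of 0.5 is an assumed value for illustration.
    """
    return diacritic_rate(text) >= min_rate
```

Because a letter with shadda plus a vowel carries two marks, the rate can exceed 1.0 for heavily diacritized text; a filter like this only needs a lower bound.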