4 datasets found

Tashkeela
huggingface.co
Updated Aug 4, 2024
Cite
Eman Khater (2024). Tashkeela [Dataset]. https://huggingface.co/datasets/EmanKhater/Tashkeela
Dataset updated
Aug 4, 2024
Authors
Eman Khater
Description
Dataset Details

Dataset Description

Content: A version of the Tashkeela Arabic diacritized text dataset, cleaned of non-Arabic content and undiacritized text, then divided into training, development, and testing sets. The cleaning process includes removing XML tags and strange symbols, as well as fixing… See the full description on the dataset page: https://huggingface.co/datasets/EmanKhater/Tashkeela.
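
The cleaning steps named in this description (removing XML tags and strange symbols, dropping non-Arabic content and undiacritized text) could look roughly like the sketch below. It is an illustration only, not the dataset author's pipeline: the names `clean_line` and `clean_corpus`, the choice of which characters count as "strange symbols", and the rule "keep a line only if it has at least one diacritic" are all assumptions. The train/development/test split is not shown because the listing gives no split ratios.

```python
import re

# Arabic diacritics (tashkeel): U+064B (fathatan) through U+0652 (sukun),
# plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Anything that looks like an XML/HTML tag.
XML_TAG = re.compile(r"<[^>]+>")

# Assumption: "non-Arabic content and strange symbols" means every character
# outside the Arabic Unicode block, whitespace, and basic punctuation.
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s.,!?:;()\-]")


def clean_line(line: str) -> str:
    """Strip tags and non-Arabic characters, then collapse whitespace."""
    line = XML_TAG.sub(" ", line)
    line = NON_ARABIC.sub(" ", line)
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(lines):
    """Return cleaned lines, dropping empty and undiacritized ones."""
    cleaned = []
    for line in lines:
        text = clean_line(line)
        # Keep the line only if something is left and it carries diacritics.
        if text and DIACRITICS.search(text):
            cleaned.append(text)
    return cleaned
```

A real pipeline would probably need a stricter definition of "undiacritized" than a single mark per line; that threshold is left open here.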
Sadeed_Tashkeela
huggingface.co
Updated May 1, 2025
Cite
Misraj Ai (2025). Sadeed_Tashkeela [Dataset]. https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela
Dataset updated
May 1, 2025
Dataset authored and provided by
Misraj Ai
Description
📚 Sadeed Tashkeela Arabic Diacritization Dataset

The Sadeed dataset is a large, high-quality Arabic diacritized corpus optimized for training and evaluating Arabic diacritization models. The training set is built exclusively from the Tashkeela corpus, and the test set is a refined version of the Fadel Tashkeela test set.

Dataset Overview

Training Data:

Source: Cleaned version of the Tashkeela corpus (original data is ~75 million words, mostly Classical… See the full description on the dataset page: https://huggingface.co/datasets/Misraj/Sadeed_Tashkeela.
roots_ar_tashkeela
huggingface.co
Updated Apr 28, 2023
Cite
BigScience Data (2023). roots_ar_tashkeela [Dataset]. https://huggingface.co/datasets/bigscience-data/roots_ar_tashkeela
Dataset updated
Apr 28, 2023
Dataset authored and provided by
BigScience Data
License
https://choosealicense.com/licenses/gpl-2.0/
Description
ROOTS Subset: roots_ar_tashkeela

Tashkeela

Dataset uid: tashkeela

Description

The dataset was collected from 97 books in both modern and classical Arabic. The dataset contains Arabic diacritics. The dataset is

Homepage

https://sourceforge.net/projects/tashkeela/

Licensing

gpl-2.0: GNU General Public License v2.0 only

Speaker Locations

Sizes

0.2533 % of total
2.3340 % of ar

BigScience processing steps Filters… See the full description on the dataset page: https://huggingface.co/datasets/bigscience-data/roots_ar_tashkeela.
arabic-tashkeel-dataset
huggingface.co
Updated Oct 22, 2024
Cite
Rockikz (2024). arabic-tashkeel-dataset [Dataset]. https://huggingface.co/datasets/Abdou/arabic-tashkeel-dataset
Dataset updated
Oct 22, 2024
Authors
Rockikz
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Arabic Tashkeel Dataset

This is a fairly large dataset gathered from five main sources:

tashkeela (1.79GB - 45.05%): The entire Tashkeela dataset, repurposed in sentences. Some rows were omitted because they contain a low rate of diacritics (tashkeel characters).
shamela (1.67GB - 42.10%): Random pages from over 2,000 books in the Shamela Library. Pages were selected for a high diacritics rate, using a function shown in the full description.
wikipedia (269.94MB - 6.64%): A collection of Wikipedia articles. Diacritics…
See the full description on the dataset page: https://huggingface.co/datasets/Abdou/arabic-tashkeel-dataset.
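
Filtering rows by diacritics rate, as this description mentions for the tashkeela and shamela sources, can be sketched as follows. This is not the function the dataset card refers to, which is not reproduced in this listing. The names `diacritics_rate` and `keep_highly_diacritized`, the definition of the rate as marks per Arabic letter, and the `min_rate=0.5` threshold are all hypothetical.

```python
# The eight core tashkeel marks: fathatan, dammatan, kasratan, fatha,
# damma, kasra, shadda, sukun.
DIACRITICS = frozenset("\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652")


def diacritics_rate(text: str) -> float:
    """Return the number of diacritic marks per Arabic letter in `text`.

    Assumption: "letter" is approximated as any code point in the range
    U+0621..U+064A; text with no such characters gets a rate of 0.0.
    """
    letters = sum(1 for ch in text if "\u0621" <= ch <= "\u064A")
    marks = sum(1 for ch in text if ch in DIACRITICS)
    return marks / letters if letters else 0.0


def keep_highly_diacritized(rows, min_rate=0.5):
    """Keep only rows whose diacritics rate reaches `min_rate` (hypothetical)."""
    return [row for row in rows if diacritics_rate(row) >= min_rate]
```

Because a letter can carry more than one mark (for example shadda plus a vowel), the rate defined this way can exceed 1.0, so a suitable threshold would need tuning against real data.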

Tashkeela (EmanKhater/Tashkeela): 162 scholarly articles cite this dataset (View in Google Scholar)