64 datasets found

Arab, AL Annual Population and Growth Analysis Dataset: A Comprehensive...
neilsberg.com
csv, json
Updated Jul 30, 2024
Cite
Neilsberg Research (2024). Arab, AL Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Arab from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/arab-al-population-by-year/
Explore at:
Available download formats: csv, json
Dataset updated
Jul 30, 2024
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Arab, Alabama
Variables measured
Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent
Measurement technique
The data presented in this dataset is derived from U.S. Census Bureau Population Estimates Program (PEP) data for 2000 to 2023. To measure the variables, namely (a) population and (b) population change (in absolute terms and as a percentage), we analyzed and tabulated the data for each year between 2000 and 2023. For further information regarding these estimates, please contact us by email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the population of Arab over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population, both in absolute terms and as a percentage. The dataset can be used to understand how the population of Arab has changed across the last two decades. For example, we can identify whether the population is declining or increasing, when it peaked, or whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.

Key observations

In 2023, the population of Arab was 8,830, a 2.36% year-on-year increase from 2022. Previously, in 2022, the population of Arab was 8,626, an increase of 1.48% compared to a population of 8,500 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Arab increased by 1,401. In this period, the peak population was 8,830, in 2023. The numbers suggest that the population has not reached its peak yet and is showing a trend of further growth. Source: U.S. Census Bureau Population Estimates Program (PEP).

Content

When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

Data Coverage:

From 2000 to 2023

Variables / Data Columns

Year: This column displays the data year (Measured annually and for years 2000 to 2023)

Population: This column shows the population of Arab for the given year.

Year on Year Change: This column displays the change in Arab population for each year compared to the previous year.

Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
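The two derived columns described above can be reproduced from the population column alone. The following is a minimal sketch, assuming only the three population figures quoted in the key observations; the function name `year_on_year` and the dictionary layout are illustrative and are not part of the Neilsberg files.

```python
# Populations quoted in the key observations above (U.S. Census Bureau PEP estimates).
population = {2021: 8500, 2022: 8626, 2023: 8830}


def year_on_year(series):
    """Return {year: (absolute change, percent change)} for each year after the first."""
    years = sorted(series)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        delta = series[curr] - series[prev]
        changes[curr] = (delta, round(100 * delta / series[prev], 2))
    return changes


changes = year_on_year(population)
for year, (delta, pct) in changes.items():
    print(f"{year}: {delta:+d} ({pct:+.2f}%)")
```

The percent change is taken relative to the previous year's population, which matches the 1.48% and 2.36% figures stated above.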

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Arab Population by Year.
Egyptian Arabic Customer Speech Dataset
kaggle.com
Updated Sep 17, 2025
Cite
Macgence (2025). Egyptian Arabic Customer Speech Dataset [Dataset]. https://www.kaggle.com/datasets/macgence/egyptian-arabic-customer-speech-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 17, 2025
Dataset provided by
Kaggle
Authors
Macgence
Description
With an extensive 250-hour collection of high-quality General Conversation audio recordings, this dataset empowers researchers and developers to enhance natural language processing, conversational AI, and generative voice AI algorithms across multiple sectors. Whether it's finance, healthcare, retail, or any other industry, this Egyptian Arabic Customer Speech Dataset provides a rich resource for training and evaluation purposes.

Metadata Availability: Insights into Participant Details:

Each participant is accompanied by comprehensive metadata, which includes detailed information about their age, gender, location, and dialect. Furthermore, this metadata encompasses details such as domain, topic, call type, and outcome, providing valuable insights for both model development and evaluation purposes.

Audio Recording Specifications:

Audio Duration: 250 hours
Formats Utilized: WAV and MP3, providing flexibility and compatibility
Customizable Sample Rate: Variable to meet project specifics, offering flexibility
Recording Equipment Standard: Standard call center devices are utilized for meticulous capture of authentic interactions between Egyptian Arabic speakers and customers
Environment: Recorded within diverse real-world conditions, providing a comprehensive representation of call center interactions

These technical specifications ensure compatibility and optimal performance for a wide range of AI development applications within the general sector.

Speech Data:

Our dataset comprises 250 hours of authentic conversational audio recordings spanning diverse sectors. From unscripted interactions to real-world conversations, each audio file (averaging 5 to 15 minutes) provides valuable insights into customer inquiries, issue resolutions, transactions, and more. The data is available in both MP3 and WAV formats, ensuring compatibility and flexibility for various applications.

Transcription of Datasets:

Manual verbatim transcriptions in JSON format are provided for each call center audio file. These transcriptions, complete with speaker-wise dialogue and time-coded segmentation, facilitate the development of Egyptian Arabic call center conversational AI and ASR models.

License:

Exclusively created by Macgence, this dataset is available for commercial use, empowering AI developers in the general sector.

Updates and Customization:

Regular updates enrich the dataset with new audio data from diverse sectors, ensuring its relevance and diversity. Customization options are available to meet specific project requirements, including tailored transcriptions and linguistic variations.

*****Looking for high-quality datasets to train your AI model? Contact us today to get the dataset you need—fast, reliable, and ready for deployment!*****
Data from: arabic-books
huggingface.co
Updated Nov 28, 2024
Cite
Mohamed Rashad (2024). arabic-books [Dataset]. https://huggingface.co/datasets/MohamedRashad/arabic-books
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 28, 2024
Authors
Mohamed Rashad
License
https://choosealicense.com/licenses/gpl-3.0/
Description
Arabic Books

Dataset Summary

The arabic-books dataset contains 8,500 rows of text, each representing the full text of a single Arabic book. These texts were extracted using the arabic-large-nougat model, showcasing the model’s capabilities in Arabic OCR and text extraction. The dataset spans a total of 1.1 billion tokens, calculated using the GPT-4 tokenizer. This dataset is a testimony to the quality of the Arabic Nougat models and their effectiveness in extracting… See the full description on the dataset page: https://huggingface.co/datasets/MohamedRashad/arabic-books.
ArabLEX: Database of Foreign Names in Arabic (DAF)
catalog.elra.info
Updated Oct 7, 2019
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2019). ArabLEX: Database of Foreign Names in Arabic (DAF) [Dataset]. https://catalog.elra.info/en-us/repository/browse/ELRA-M0106/
Explore at:
Dataset updated
Oct 7, 2019
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalog.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Description
This database is part of the ArabLEX set of data which consists of the Database of Arabic General Vocabulary (DAG), Database of Arabic Place Names (DAP), Database of Foreign Names in Arabic (DAF) and Database of Arab Names (DAN) available from ELRA under references, respectively, ELRA-L0131, ELRA-M0105, ELRA-M0106 and ELRA-M0107.

With over 226 million forms based on 223,000 lemmas, this full-form database covers non-Arab personal names in both Arabic and English, some Arabic script variants, vocalized or unvocalized formats, as well as inflected and cliticized forms. The precise phonemic transcriptions and full vowel diacritics are designed to enhance Arabic speech technology. Orthographic variants are also extensively covered.

This database is provided with three options: 1) proclitics, 2) phonetic information (CARS) and 3) orthographic variants. Subsets excluding some of the three proposed options may be provided upon demand. CARS is an accurate phonemic transcription. Optionally, phonetic transcriptions, IPA and/or SAMPA, can be provided, fine-tuned to a customer's specifications.

Quantity and size: 226,784,907 lines / 32,181 MB (31.4 GB)
File format: flat TSV text files
Samples and a specifications document available upon request.
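As a quick sanity check on the stated quantity and size, the sketch below converts the megabyte figure to gigabytes and estimates the average line length. It assumes binary units (1 GB = 1024 MB, 1 MB = 1,048,576 bytes), which the catalogue entry does not state explicitly.

```python
LINES = 226_784_907   # line count stated in the catalogue entry
SIZE_MB = 32_181      # size in MB stated in the catalogue entry

# Assumption: binary units, since 32,181 / 1024 reproduces the quoted 31.4 GB.
size_gb = SIZE_MB / 1024
avg_bytes_per_line = SIZE_MB * 1024 * 1024 / LINES

print(f"{size_gb:.1f} GB, roughly {avg_bytes_per_line:.0f} bytes per line")
```

At this size the TSV files are probably better streamed line by line than loaded into memory in one piece.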
Dataset for Arabic Classification
kaggle.com
Updated Feb 17, 2021
Cite
Saurabh Shahane (2021). Dataset for Arabic Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/arabic-classification/code
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 17, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Saurabh Shahane
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

The dataset is a collection of Arabic texts covering the modern Arabic language used in newspaper articles. The text contains alphabetic, numeric and symbolic words. The presence of numeric and symbolic words in this dataset can be used to test the efficiency and robustness of Arabic text classification and document indexing methods.

Content

The dataset consists of 111,728 documents (cf. Table 1) and 319,254,124 words (cf. Table 2) structured in text files, collected from 3 Arabic online newspapers: Assabah [9], Hespress [10] and Akhbarona [11], using a semi-automatic web crawling process. The documents in the dataset are categorized into 5 classes: sport, politic, culture, economy and diverse. The number of documents and words varies from one class to another (cf. Tables 1-2).

Acknowledgements

BINIZ, mohamed (2018), “DataSet for Arabic Classification”, Mendeley Data, V2, doi: 10.17632/v524p5dhpj.2
World's Muslims Data Set, 2012
thearda.com
Cite
James Bell, World's Muslims Data Set, 2012 [Dataset]. http://doi.org/10.17605/OSF.IO/C2VE5
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/C2VE5
Dataset provided by
Association of Religion Data Archives
Authors
James Bell
Dataset funded by
The Pew Charitable Trusts
The John Templeton Foundation
Description
"Between October 2011 and November 2012, Pew Research Center, with generous funding from The Pew Charitable Trusts and the John Templeton Foundation, conducted a public opinion survey involving more than 30,000 face-to-face interviews in 26 countries in Africa, Asia, the Middle East and Europe. The survey asked people to describe their religious beliefs and practices, and sought to gauge respondents; knowledge of and attitudes toward other faiths. It aimed to assess levels of political and economic satisfaction, concerns about crime, corruption and extremism, positions on issues such as abortion and polygamy, and views of democracy, religious law and the place of women in society.

"Although the surveys were nationally representative in most countries, the primary goal of the survey was to gauge and compare beliefs and attitudes of Muslims. The findings for Muslim respondents are summarized in the Religion & Public Life Project's reports The World's Muslims: Unity and Diversity and The World's Muslims: Religion, Politics and Society, which are available at www.pewresearch.org. [...] This dataset only contains data for Muslim respondents in the countries surveyed. Please note that this codebook is meant as a guide to the dataset, and is not the survey questionnaire." (2012 Pew Religion Worlds Muslims Codebook)
Arabic Closed Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-closed-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Arabic Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Arabic language, advancing the field of artificial intelligence.
Dataset Content
This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Arabic. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Arabic speakers, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph answers. The answers also contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Arabic Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
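The annotation fields listed above lend themselves to simple filtering once the CSV export is loaded. The sketch below is only an illustration: the column names (`question_type`, `question_complexity`, `answer_type`) are hypothetical, modelled on the field list, and the three rows are made-up placeholders rather than real records from the dataset.

```python
import csv
import io

# Hypothetical column names modelled on the annotation details listed above;
# the real export may name or order them differently. Rows are placeholders.
SAMPLE_CSV = """id,question_type,question_complexity,answer_type
q1,multiple-choice,easy,single-word
q2,direct,hard,paragraph
q3,true/false,medium,single-word
"""


def filter_rows(csv_text, **criteria):
    """Return the rows whose columns match every column=value pair in criteria."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if all(row.get(col) == val for col, val in criteria.items())
    ]


single_word = filter_rows(SAMPLE_CSV, answer_type="single-word")
print([row["id"] for row in single_word])
```

The same function can combine several criteria, for example `filter_rows(SAMPLE_CSV, question_type="direct", question_complexity="hard")`.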
Quality and Accuracy
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Arabic version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Arabic Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Arab, AL Population Breakdown By Race (Excluding Ethnicity) Dataset:...
neilsberg.com
csv, json
Updated Feb 21, 2025
Cite
Neilsberg Research (2025). Arab, AL Population Breakdown By Race (Excluding Ethnicity) Dataset: Population Counts and Percentages for 7 Racial Categories as Identified by the US Census Bureau // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/755bb051-ef82-11ef-9e71-3860777c1fe6/
Explore at:
Available download formats: csv, json
Dataset updated
Feb 21, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Arab
Variables measured
Asian Population, Black Population, White Population, Some other race Population, Two or more races Population, American Indian and Alaska Native Population, Asian Population as Percent of Total Population, Black Population as Percent of Total Population, White Population as Percent of Total Population, Native Hawaiian and Other Pacific Islander Population, and 4 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we analyzed and categorized the data for each of the racial categories identified by the U.S. Census Bureau. The population estimates used in this dataset pertain exclusively to the identified racial categories and do not rely on any ethnicity classification. For further information regarding these estimates, please contact us by email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the population of Arab by race. It includes the population of Arab across racial categories (excluding ethnicity) as identified by the Census Bureau. The dataset can be utilized to understand the population distribution of Arab across relevant racial categories.

Key observations

The percent distribution of Arab population by race (across all racial categories recognized by the U.S. Census Bureau): 90.79% are white, 1.19% are Black or African American, 0.21% are American Indian and Alaska Native, 1.73% are Asian, 0.02% are Native Hawaiian and other Pacific Islander, 1.76% are some other race and 4.30% are multiracial.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Racial categories include:

White

Black or African American

American Indian and Alaska Native

Asian

Native Hawaiian and Other Pacific Islander

Some other race

Two or more races (multiracial)

Variables / Data Columns

Race: This column displays the racial categories (excluding ethnicity) for Arab.

Population: This column shows the population of the racial category (excluding ethnicity) in Arab.

% of Total Population: This column displays the population of each race as a percentage of the total population of Arab. Please note that the sum of all percentages may not equal 100% due to rounding of values.
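The rounding note above can be checked directly against the percentages quoted in the key observations. This is a minimal sketch; the tolerance of half a unit in the last decimal place per category is an assumption about how the published values were rounded.

```python
# Percentages quoted in the key observations above.
shares = {
    "White": 90.79,
    "Black or African American": 1.19,
    "American Indian and Alaska Native": 0.21,
    "Asian": 1.73,
    "Native Hawaiian and Other Pacific Islander": 0.02,
    "Some other race": 1.76,
    "Two or more races": 4.30,
}

total = sum(shares.values())

# Assumption: each share is rounded to two decimals, so each category can be
# off by at most 0.005 percentage points.
tolerance = 0.005 * len(shares)
within_rounding = abs(total - 100.0) <= tolerance

print(f"total = {total:.2f}%, within rounding tolerance: {within_rounding}")
```

For the figures quoted here the shares happen to add up to 100.00%, so the rounding caveat does not bite in this edition.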

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

Neilsberg Research Team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Arab Population by Race & Ethnicity.
Arabic Shopping List OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/arabic-shopping-list-ocr-image-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Arabic Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Arabic language.
Dataset Content & Diversity:
Containing more than 2000 images, this Arabic OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text on shopping lists, including sentences, individual item names, quantities, and comments. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow only a limited number (fewer than three) of unique images per handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of the space contains visible Arabic text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written, and the images captured, by native Arabic speakers to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Arabic text recognition models.
Update & Custom Collection:
We are committed to continually expanding this dataset by adding more images with the help of our native Arabic crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License:
This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Arabic language. Your journey to improved language understanding and processing begins here.
Saudi Arabian Arabic General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Saudi Arabian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-saudiarabia
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Saudi Arabia
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Saudi Arabian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Saudi Arabian Arabic communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Saudi accents and dialects.
Speech Data
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Saudi Arabian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 80 verified native Saudi Arabian Arabic speakers from FutureBeeAI’s contributor community.
•Regions: Representing various provinces of Saudi Arabia to ensure dialectal diversity and demographic balance.
•Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation ranges from 15 to 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
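The word error rate (WER) figure quoted above is conventionally the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition; how FutureBeeAI computed its own figure is not stated, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c d"))
```

Whitespace tokenization is a simplification; Arabic text would normally be normalized (for example, diacritics and punctuation handled consistently) before scoring.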
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
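The recording format stated above (stereo WAV, 16-bit depth, 16 kHz) can be verified per file before training. The following sketch uses Python's standard `wave` module; because the real recordings are not available here, it writes one second of silence to a placeholder file named `demo.wav` and checks that instead.

```python
import os
import tempfile
import wave

# Format stated in the recording details: stereo, 16-bit (2 bytes), 16 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16_000}


def wav_properties(path):
    """Read channel count, sample width and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


# Placeholder file: one second of silence written in the stated format.
path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(16_000)
    wav.writeframes(b"\x00" * (16_000 * 2 * 2))

props = wav_properties(path)
print(props == EXPECTED)
```

In this format uncompressed audio takes 64,000 bytes per second (2 channels × 2 bytes × 16,000 samples), so 40 hours comes to roughly 9.2 GB before file headers.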
Usage and Applications
This dataset is a versatile resource for multiple Arabic speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Saudi Arabian Arabic.
•Voice Assistants: Build smart assistants capable of understanding natural Saudi conversations.

ASL 20-Words Dataset v1
kaggle.com
Updated Nov 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hossam Magdy Balaha (2024). ASL 20-Words Dataset v1 [Dataset]. http://doi.org/10.34740/kaggle/dsv/9797396
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/dsv/9797396
Dataset updated
Nov 3, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Hossam Magdy Balaha
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
The Arabic Sign Language (ASL) 20-Words Dataset v1 was carefully designed to reflect natural conditions, aiming to capture realistic signing environments and circumstances. Recognizing that nearly everyone has access to a smartphone with a camera as of 2020, the dataset was specifically recorded using mobile phones, aligning with how people commonly record videos in daily life. This approach ensures the dataset is grounded in real-world conditions, enhancing its applicability for practical use cases.

Each video in this dataset was recorded directly on the authors' smartphones, without any form of stabilization—neither hardware nor software. As a result, the videos vary in resolution and were captured across diverse locations, places, and backgrounds. This variability introduces natural noise and conditions, supporting the development of robust deep learning models capable of generalizing across environments.

In total, the dataset comprises 8,467 videos of 20 sign language words, contributed by 72 volunteers aged between 20 and 24. Each volunteer performed each sign a minimum of five times, resulting in approximately 100 videos per participant. This repetition standardizes the data and ensures each sign is adequately represented across different performers. The dataset’s mean video count per sign is 423.35, with a standard deviation of 18.58, highlighting the balance and consistency achieved across the signs.
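The per-sign mean quoted above follows directly from the totals in the description. A minimal sketch of that arithmetic, using only the figures stated in the text (the variable names are illustrative, not part of the dataset):

```python
# Figures quoted in the dataset description.
total_videos = 8467
num_signs = 20
min_repetitions = 5  # each volunteer performed each sign at least five times

# Mean number of videos per sign.
mean_per_sign = total_videos / num_signs

# Lower bound on videos per volunteer if every sign is performed the minimum number of times.
min_per_volunteer = num_signs * min_repetitions

print(f"mean videos per sign: {mean_per_sign:.2f}")
print(f"minimum videos per volunteer: {min_per_volunteer}")
```

The standard deviation of 18.58 cannot be reproduced this way, since it depends on the per-sign counts listed in Table 2 of the research article.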

For reference, Table 2 (in the research article) provides the count of videos for each sign, while Figure 2 (in the research article) offers a visual summary of the statistics for each word in the dataset. Additionally, sample frames from each word are displayed in Figure 3 (in the research article), giving a glimpse of the visual content captured.

For in-depth insights into the methodology and the dataset's creation, see the research paper: Balaha, M.M., El-Kady, S., Balaha, H.M., et al. (2023). "A vision-based deep learning approach for independent-users Arabic sign language interpretation". Multimedia Tools and Applications, 82, 6807–6826. https://doi.org/10.1007/s11042-022-13423-9

Please consider citing the following if you use this dataset:

@misc{balaha_asl_2024_db, title={ASL 20-Words Dataset v1}, url={https://www.kaggle.com/dsv/9783691}, DOI={10.34740/KAGGLE/DSV/9783691}, publisher={Kaggle}, author={Mostafa Magdy Balaha and Sara El-Kady and Hossam Magdy Balaha and Mohamed Salama and Eslam Emad and Muhammed Hassan and Mahmoud M. Saafan}, year={2024} }

@article{balaha2023vision, title={A vision-based deep learning approach for independent-users Arabic sign language interpretation}, author={Balaha, Mostafa Magdy and El-Kady, Sara and Balaha, Hossam Magdy and Salama, Mohamed and Emad, Eslam and Hassan, Muhammed and Saafan, Mahmoud M}, journal={Multimedia Tools and Applications}, volume={82}, number={5}, pages={6807--6826}, year={2023}, publisher={Springer} }

This dataset is available under the CC BY-NC-SA 4.0 license, which allows for sharing and adaptation under conditions of non-commercial use, proper attribution, and distribution under the same license.

For further inquiries or information: https://hossambalaha.github.io/.
Falcon-Arabic-7B-Base-details
huggingface.co
Updated Jul 23, 2025
+ more versions
Cite
Technology Innovation Institute (2025). Falcon-Arabic-7B-Base-details [Dataset]. https://huggingface.co/datasets/tiiuae/Falcon-Arabic-7B-Base-details
Explore at:
Dataset updated
Jul 23, 2025
Dataset authored and provided by
Technology Innovation Institute
Description
tiiuae/Falcon-Arabic-7B-Base-details dataset hosted on Hugging Face and contributed by the HF Datasets community
SARD
huggingface.co
Updated May 19, 2025
+ more versions
Cite
Robotics and Internet-of-Things (2025). SARD [Dataset]. https://huggingface.co/datasets/riotu-lab/SARD
Explore at:
Dataset updated
May 19, 2025
Authors
Robotics and Internet-of-Things
License
Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
SARD: Synthetic Arabic Recognition Dataset

Overview

SARD (Synthetic Arabic Recognition Dataset) is a large-scale, synthetically generated dataset designed for training and evaluating Optical Character Recognition (OCR) models for Arabic text. This dataset addresses the critical need for comprehensive Arabic text recognition resources by providing controlled, diverse, and scalable training data that simulates real-world book layouts.

Key Features

Massive… See the full description on the dataset page: https://huggingface.co/datasets/riotu-lab/SARD.
DataSet for Arabic Classification
data.mendeley.com
Updated Mar 15, 2018
Cite
BINIZ mohamed (2018). DataSet for Arabic Classification [Dataset]. http://doi.org/10.17632/v524p5dhpj.1
Explore at:
Unique identifier
https://doi.org/10.17632/v524p5dhpj.1
Dataset updated
Mar 15, 2018
Authors
BINIZ mohamed
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset is a collection of Arabic texts covering the modern Arabic language used in newspaper articles. The text contains alphabetic, numeric and symbolic words. The presence of numeric and symbolic words in this dataset can help test the efficiency and robustness of Arabic text classification and document indexing methods. The dataset consists of 111,728 documents (cf. Table 1) and 319,254,124 words (cf. Table 2) structured in text files, collected from 3 Arabic online newspapers: Assabah [9], Hespress [10] and Akhbarona [11], using a semi-automatic web crawling process. The documents in the dataset are categorized into 5 classes: sport, politic, culture, economy and diverse. The number of documents and words varies from one class to another (cf. Tables 1-2).
Arab West Report Interview Documentation Project: Islam in Egypt - Dataset -...
b2find.eudat.eu
Updated Nov 15, 2024
+ more versions
Cite
(2024). Arab West Report Interview Documentation Project: Islam in Egypt - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/2e268fc7-76d1-50b9-baee-28e3c550ddb6
Explore at:
Dataset updated
Nov 15, 2024
Area covered
Egypt
Description
The following dataset contains 7 audio recordings (5 items) on the subject of Islam in Egypt. All summaries are rendered in English. Interviews were conducted in English and Arabic. This Thematic Collection contains links to the datasets of the Stichting Arab-West Foundation (AWF), in The Netherlands in close cooperation with the Center for Intercultural Dialogue and Translation (CIDT). These datasets cover the period 1994-2016. The data consists of the reporting of Dutch sociologist Cornelis Hulsman, reporting supervised by him, full-transcript interviews, audio recordings and summaries of these audio recordings.
The Arab-West Foundation was established in 2005 to support the work of Cornelis Hulsman and his wife Eng. Sawsan Gabra Ayoub Hulsman-Khalil in Egypt. Cornelis Hulsman left The Netherlands for Egypt in October 1994. Sawsan Hulsman followed suit in 1995. They focused primarily on the study of Muslim-Christian relations and the role of religion in society in Egypt and neighboring countries, while obtaining their income from journalism.
The purpose of this work was to foster greater understanding between Muslims and Christians in Egypt and to show non-Egyptians that relations between the two faiths in Egypt cannot be described in reductive black and white terms, rather they are diverse and complicated. Working towards mutual understanding of different cultures and beliefs helps to reduce tensions and conflicts. Too often, parties present themselves as the victim of the other which results in biased reporting. Sometimes this is done deliberately to gain support. What is lacking in cases like this, is an in-depth understanding of the wider context in which narratives of victimization occur.
Hulsman found several patterns that are key to understanding Muslim-Christian relations in Egypt, such as the impact of a culture of honor and shame and aversion in traditional areas for visible changes in public (which includes church buildings and making one’s conversion to another religion public).
The datasets also include material on the place of Islamists in society, as well as wider information about Egyptian society since this is the context in which religious numerical minorities in Egypt live (the term minority is widely rejected in Egypt since all Egyptians, regardless of religion, are one. But in terms of numbers Christians are a minority).
It was Hulsman’s ambition to obtain a PhD but the challenges of making a living in Egypt prevented him from accomplishing this goal. Up until the year 2001, Cornelis only had an income from traditional media reporting. After 2004 he became largely dependent upon working with Kerk in Actie (Netherlands), Missio and Misereor (Germany).
Hulsman was dedicated towards non-partisan Muslim-Christian understanding. This began starting with a large number of recorded interviews, followed by research into why so many Christian girls convert to Islam (1995-1996). This work in turn led to the creation of an electronic newsletter called Religious News Service from the Arab World (RNSAW) and a growing number of investigative reports. In 2003 the RNSAW was renamed Arab-West Report. In 2004 they attempted to establish an Egyptian NGO but since no answer was obtained from authorities, the procedure was taken to the Council of State who ruled in 2006 that the request for NGO status was valid. This in turn resulted in a formal registration of the NGO with the Ministry of Social Solidarity in 2007. Because the outcome of this process was insecure in 2005 the Hulsmans established the Center for Intercultural Dialogue and Translation (CIDT).
CIDT was established as a tawsiya basita (sole proprietorship) on the name of Sawsan Gabra Ayoub Khalil since it was extremely complicated to do this on the name of a non-Egyptian. In the same year friends of the Hulsman family established the Arab-West Foundation (AWF). CIDT tawsiya basita was closed in 2012. A new company was established under the same name but now as limited liability company and again it was not possible for Cornelis Hulsman to become a partner.
As a consequence the Hulsmans have been working since 2005 with an Egyptian company and a Dutch support NGO. Since 2007 they have also been working with an Egyptian NGO. This was important, since Egyptian law prohibits companies from receiving donations and carrying out not-for-profit work. NGOs, on the other hand, need to request permissions from the Ministry of Social Solidarity for each donation they receive. Such permissions are hard to obtain.
CIDT functions as a thinktank with funding from Kerk in Actie (Netherlands), Missio and Misereor (Germany) and at times projects with other organizations. CIDT produces the electronic newsletter Arab-West Report and has built the Arab West Report Database based on these data. Publication of this data is accomplished through the Arab-West Foundation since it turned out to be extremely hard to register Arab-West Report in Egypt. CAWU became the prime organization hosting student interns from Egypt and countries all over the world, which was possible since CAWU does not charge student interns for its services and neither pays them for any work carried out. Student interns have been contributing on a volunteer basis to the database of Arab-West Report, writing articles and papers and being engaged in social media under the supervision of Cornelis Hulsman. Other student interns contributed to summary translations of Arabic media, always supervised by a professional translator of CIDT.
CAWU has been promoting intercultural dialogue through a variety of programs including meetings and forums with community members, religious leaders and politicians from Egypt and the West. CAWU's aim is to bridge the gap of misunderstanding between Arab and Western communities by exposing biased media reporting and informing the public and important persons on complicated issues.
Availability: AWF's datasets are available to researchers upon request. Please go to the dataset you wish to download and request permission via the button 'Request Permission' on the tab 'Datafiles'. AWF will respond to your request.
Egyptian Arabic General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Egyptian Arabic General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-arabic-egypt
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Egyptian Arabic General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Arabic speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Egyptian Arabic communication.
Curated by FutureBeeAI, this 40-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Arabic speech models that understand and respond to authentic Egyptian accents and dialects.
Speech Data
The dataset comprises 40 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Egyptian Arabic. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
  • Speakers: 80 verified native Egyptian Arabic speakers from FutureBeeAI’s contributor community.
  • Regions: Representing various provinces of Egypt to ensure dialectal diversity and demographic balance.
  • Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
  • Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  • Duration: Each conversation ranges from 15 to 60 minutes.
  • Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
  • Environment: Quiet, echo-free settings with no background noise.
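The stated audio format (stereo, 16-bit, 16 kHz WAV) can be checked on a downloaded file before it enters a training pipeline. A minimal sketch using Python's standard `wave` module follows; the helper names and the synthetic demo file are assumptions for illustration, not part of the dataset's tooling.

```python
import os
import tempfile
import wave

# Format stated in the dataset description: stereo, 16-bit (2 bytes), 16 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_properties(path):
    """Read channel count, sample width, and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_spec(path, expected=EXPECTED):
    """Return True when the file header matches the expected format."""
    return wav_properties(path) == expected


def write_silent_wav(path, seconds=1):
    """Write a silent stereo 16-bit 16 kHz file (a stand-in for a real recording)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        # channels * bytes per sample * frames per second * seconds
        wav.writeframes(b"\x00" * (2 * 2 * 16000 * seconds))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silent_wav(demo_path)
    demo_ok = matches_spec(demo_path)

print(demo_ok)
```

Checking only the header keeps the test cheap; it says nothing about recording quality or background noise.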

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
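The "WER < 5%" figure above refers to word error rate. The description does not say how it was computed; the sketch below uses the standard definition (word-level edit distance divided by the number of reference words) and is offered only as an illustration of the metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```

For Arabic text, results depend on normalization choices (diacritics, punctuation, tokenization), which this sketch does not address.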
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
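As an illustration of metadata-based filtering and demographic analysis, the sketch below works on a few made-up speaker records. The key names (`participant_id`, `age`, `gender`, `province`) and all values are assumptions; the actual metadata files may use a different schema.

```python
from collections import Counter

# Hypothetical speaker records; not taken from the dataset.
speakers = [
    {"participant_id": "spk_001", "age": 24, "gender": "female", "province": "Cairo"},
    {"participant_id": "spk_002", "age": 51, "gender": "male", "province": "Alexandria"},
    {"participant_id": "spk_003", "age": 35, "gender": "male", "province": "Giza"},
    {"participant_id": "spk_004", "age": 67, "gender": "female", "province": "Cairo"},
]


def filter_speakers(records, gender=None, min_age=None, max_age=None):
    """Keep records that satisfy every criterion that was supplied."""
    selected = []
    for rec in records:
        if gender is not None and rec["gender"] != gender:
            continue
        if min_age is not None and rec["age"] < min_age:
            continue
        if max_age is not None and rec["age"] > max_age:
            continue
        selected.append(rec)
    return selected


def gender_share(records):
    """Fraction of records per gender label."""
    counts = Counter(rec["gender"] for rec in records)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


older_speakers = filter_speakers(speakers, min_age=50)
print([rec["participant_id"] for rec in older_speakers])
print(gender_share(speakers))
```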
Usage and Applications
This dataset is a versatile resource for multiple Arabic speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for Egyptian Arabic.
• Voice Assistants: Build smart assistants capable of understanding natural Egyptian conversations.
Arabic Handwritten Characters Dataset
figshare.com
kaggle.com
txt
Updated May 31, 2023
Cite
Mohamed Loey (2023). Arabic Handwritten Characters Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.12236960.v1
Explore at:
txt (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.12236960.v1
Dataset updated
May 31, 2023
Dataset provided by
figshare
Authors
Mohamed Loey
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Arabic Handwritten Characters Dataset
Abstract
Handwritten Arabic character recognition systems face several challenges, including the unlimited variation in human handwriting and large public databases. In this work, we model a deep learning architecture that can be effectively applied to recognizing Arabic handwritten characters. A Convolutional Neural Network (CNN) is a special type of feed-forward multilayer network trained in supervised mode. The CNN was trained and tested on our database, which contains 16,800 handwritten Arabic characters. In this paper, optimization methods are implemented to increase the performance of the CNN. Common machine learning methods usually apply a combination of feature extractor and trainable classifier. The use of CNN leads to significant improvements across different machine-learning classification algorithms. Our proposed CNN gives an average 5.1% misclassification error on testing data.
Context
The motivation of this study is to use cross knowledge learned from multiple works to enhance the performance of Arabic handwritten character recognition. In recent years, Arabic handwritten characters recognition with different handwriting styles as well, making it important to find and work on a new and advanced solution for handwriting recognition. A deep learning system needs a huge amount of data (images) to be able to make good decisions.
Content
The dataset is composed of 16,800 characters written by 60 participants aged between 19 and 40 years; 90% of participants are right-handed. Each participant wrote each character (from ’alef’ to ’yeh’) ten times on two forms as shown in Fig. 7(a) & 7(b). The forms were scanned at a resolution of 300 dpi. Each block is segmented automatically using Matlab 2016a to determine the coordinates for each block. The database is partitioned into two sets: a training set (13,440 characters to 480 images per class) and a test set (3,360 characters to 120 images per class). Writers of the training set and test set are exclusive. The order in which writers are included in the test set is randomized to make sure that writers of the test set are not from a single institution (to ensure variability of the test set). In an experimental section we showed that the results were promising, with a 94.9% classification accuracy rate on testing images. In future work, we plan to work on improving the performance of handwritten Arabic character recognition.
Acknowledgements
Ahmed El-Sawy, Mohamed Loey, Hazem EL-Bakry, Arabic Handwritten Characters Recognition using Convolutional Neural Network, WSEAS, 2017
Inspiration
Creating the proposed database presents more challenges because it deals with many issues such as style of writing, thickness, and the number and position of dots. Some characters have different shapes while written in the same position. For example, the teh character has different shapes in the isolated position.
Benha University
http://bu.edu.eg/staff/mloey
https://mloey.github.io/
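The counts in the Content section are mutually consistent, which a short calculation makes explicit. This sketch assumes 28 classes (the letters of the Arabic alphabet, 'alef' to 'yeh'); the per-split writer counts are an inference from the stated per-class image counts and the ten repetitions per writer, not figures given in the description.

```python
participants = 60
letters = 28        # assumption: one class per letter of the Arabic alphabet
repetitions = 10    # each participant wrote each character ten times

total_characters = participants * letters * repetitions

train_per_class = 480
test_per_class = 120
train_total = train_per_class * letters
test_total = test_per_class * letters

# Inferred: with ten samples per writer per class, images per class / 10 gives writers per split.
train_writers = train_per_class // repetitions
test_writers = test_per_class // repetitions

# Accuracy is the complement of the reported misclassification error.
accuracy_pct = round(100 - 5.1, 1)

print(total_characters, train_total, test_total)
print(train_writers, test_writers, accuracy_pct)
```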
Arabic Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-open-ended-question-answer-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Arabic Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Arabic language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Arabic. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Arabic speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Arabic Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
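To show how the listed annotation fields might be used, the sketch below parses a small JSON sample and filters it by complexity. The field names come from the description above; every value, and the overall file layout (a JSON array of flat records), is an assumption made for illustration.

```python
import json
from collections import Counter

# Annotation fields named in the dataset description.
REQUIRED_FIELDS = {
    "id", "language", "domain", "question_length", "prompt_type",
    "question_category", "question_type", "complexity", "answer_type", "rich_text",
}

# Hypothetical records; the values are placeholders, not dataset content.
sample_json = """
[
  {"id": "q-0001", "language": "Arabic", "domain": "science", "question_length": 12,
   "prompt_type": "instruction", "question_category": "fact-based",
   "question_type": "direct", "complexity": "easy",
   "answer_type": "single sentence", "rich_text": false},
  {"id": "q-0002", "language": "Arabic", "domain": "history", "question_length": 30,
   "prompt_type": "few-shot", "question_category": "opinion-based",
   "question_type": "true/false", "complexity": "hard",
   "answer_type": "paragraph", "rich_text": true}
]
"""

records = json.loads(sample_json)


def missing_fields(record):
    """Return the annotation fields that a record lacks."""
    return REQUIRED_FIELDS - set(record)


incomplete = [rec["id"] for rec in records if missing_fields(rec)]
hard_ids = [rec["id"] for rec in records if rec["complexity"] == "hard"]
complexity_counts = Counter(rec["complexity"] for rec in records)

print(incomplete, hard_ids, dict(complexity_counts))
```

Validating field presence before filtering avoids `KeyError` surprises if the delivered files omit an annotation for some records.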
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Arabic are grammatically accurate, without spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Arabic Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative ai models, improve response generation, and explore new approaches to NLP question-answering tasks.
Arabic Extraction Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Extraction Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-extraction-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Arabic Extraction Type Prompt-Response Dataset, a meticulously curated collection of 1500 prompt and response pairs. This dataset is a valuable resource for enhancing the data extraction abilities of Language Models (LMs), a critical aspect in advancing generative AI.
Dataset Content
This extraction dataset comprises a diverse set of prompts and responses where the prompt contains input text, extraction instruction, constraints, and restrictions while completion contains the most accurate extraction data for the given prompt. Both these prompts and completions are available in Arabic language.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Arabic speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable references.
This dataset encompasses various prompt types, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. Additionally, you'll find prompts and responses containing rich text elements, such as tables, code, JSON, etc., all in proper markdown format.
Prompt Diversity
To ensure diversity, this extraction dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The extraction dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, single sentence, and paragraph type of response. These responses encompass text strings, numerical values, and date and time, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Arabic Extraction Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Arabic version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom extraction prompt and completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Arabic Extraction Prompt-Completion Dataset to enhance the data extraction abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
Arabic Open Ended Classification Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Arabic Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-open-ended-classification-text-dataset
Explore at:
wav (available download formats)
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Arabic Open Ended Classification Prompt-Response Dataset, an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content
This open-ended classification dataset comprises a diverse set of prompts and responses where the prompt contains input text to be classified and may also contain task instruction, context, constraints, and restrictions while completion contains the best classification category as response. Both these prompts and completions are available in Arabic language. As this is an open-ended dataset, there will be no options given to choose the right classification category as a part of the prompt.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Arabic speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity
To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Different types of prompts, such as multiple-choice, direct, and true/false, are included. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Arabic Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Arabic text is free of spelling and grammatical errors. No copyrighted, toxic, or harmful content was used in the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Arabic Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

Cite

Neilsberg Research (2024). Arab, AL Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Arab from 2000 to 2023 // 2024 Edition [Dataset]. https://www.neilsberg.com/insights/arab-al-population-by-year/

Arab, AL Annual Population and Growth Analysis Dataset: A Comprehensive Overview of Population Changes and Yearly Growth Rates in Arab from 2000 to 2023 // 2024 Edition

Explore at:

Available download formats: csv, json

Dataset updated

Jul 30, 2024

Dataset authored and provided by

Neilsberg Research

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

Arab, Alabama

Variables measured

Annual Population Growth Rate, Population Between 2000 and 2023, Annual Population Growth Rate Percent

Measurement technique

The data presented in this dataset is derived from 20-plus years of U.S. Census Bureau Population Estimates Program (PEP) data, 2000 - 2023. To measure the variables, namely (a) population and (b) population change (absolute and as a percentage), we analyzed and tabulated the data for each of the years between 2000 and 2023. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.

Dataset funded by

Neilsberg Research

Description

About this dataset

Context

The dataset tabulates the population of Arab over the last 20-plus years. It lists the population for each year, along with the year-on-year change in population and the same change in percentage terms. The dataset can be used to understand how the population of Arab has changed across the last two decades. For example, using this dataset, we can identify whether the population is declining or increasing and, if it has changed, when it peaked or whether it is still growing and has not yet reached its peak. We can also compare the trend with the overall trend of the United States population over the same period.

Key observations

In 2023, the population of Arab was 8,830, a 2.36% year-over-year increase from 2022. Previously, in 2022, the population of Arab was 8,626, an increase of 1.48% compared to a population of 8,500 in 2021. Over the last 20-plus years, between 2000 and 2023, the population of Arab increased by 1,401. In this period, the peak population was 8,830, in the year 2023. The numbers suggest that the population has not reached its peak yet and is showing a trend of further growth. Source: U.S. Census Bureau Population Estimates Program (PEP).

Content

When available, the data consists of estimates from the U.S. Census Bureau Population Estimates Program (PEP).

Data Coverage:

From 2000 to 2023

Variables / Data Columns

Year: This column displays the data year (Measured annually and for years 2000 to 2023)
Population: This column shows the population of Arab for the specific year.
Year on Year Change: This column displays the change in Arab population for each year compared to the previous year.
Change in Percent: This column displays the year on year change as a percentage. Please note that the sum of all percentages may not equal one due to rounding of values.
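
As a rough illustration of how the Year on Year Change and Change in Percent columns relate to the Population column, the following Python sketch recomputes them from the three population figures quoted under Key observations. The function name `year_on_year` and the two-decimal rounding are assumptions made for this example; the dataset description does not state how Neilsberg Research computes or rounds these columns.

```python
# Population figures quoted under "Key observations" (year -> population).
population = {2021: 8500, 2022: 8626, 2023: 8830}


def year_on_year(pop):
    """Return (year, population, change, percent change) for each year
    that has a preceding year in the input.

    Assumption: the percent change is relative to the previous year's
    population and is rounded to two decimals.
    """
    rows = []
    years = sorted(pop)
    for prev, cur in zip(years, years[1:]):
        change = pop[cur] - pop[prev]
        percent = round(100 * change / pop[prev], 2)
        rows.append((cur, pop[cur], change, percent))
    return rows


for row in year_on_year(population):
    print(row)
# (2022, 8626, 126, 1.48)
# (2023, 8830, 204, 2.36)
```

The recomputed percentages match the 1.48% and 2.36% figures stated in the key observations.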

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for Arab Population by Year. You can refer to the same here
