5 datasets found

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...
data.niaid.nih.gov
Updated Jan 27, 2024
Cite
Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
Dataset updated
Jan 27, 2024
Dataset provided by
Gusmita, Ria Hari
Firmansyah, Asep Fajar
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences

62027 tokens

2475 named entities

18 named entity categories
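
As a rough illustration of the BIO (Beginning-Inside-Outside) tagging format mentioned above, the sketch below groups tagged tokens into entity spans. The function name `bio_to_spans`, the example tokens, and the label strings `Prophet` and `Person` are assumptions made for illustration; they are not taken from the dataset files, whose exact label spelling may differ.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity currently being built, if any.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity,
            # closes the open entity; the token itself is not kept.
            flush()
    flush()
    return spans


# Hypothetical example sentence and labels (illustration only).
example_tokens = ["Musa", "berkata", "kepada", "Bani", "Israil"]
example_tags = ["B-Prophet", "O", "O", "B-Person", "I-Person"]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)
```

Treating a mismatched `I-` tag as a boundary is one of several possible conventions; stricter decoders may instead raise an error on malformed tag sequences.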

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah

Allah's Throne

Artifact

Astronomical body

Event

False deity

Holy book

Language

Angel

Person

Messenger

Prophet

Sentient

Afterlife location

Geographical location

Color

Religion

Food

Fruit

The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri

Muhammad Destamal Junas

Naufaldi Hafidhigbal

Nur Kholis Azzam Ubaidillah

Puspitasari

Septiany Nur Anggita

Wilda Nurjannah

William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at the Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.

Dr. Jauhar Azizy, MA

Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length   Number of epochs   Precision   Recall   F1 score
256                       10                 0.94        0.92     0.93
256                       20                 0.99        0.97     0.98
256                       40                 0.96        0.96     0.96
256                       100                0.97        0.96     0.96
512                       10                 0.92        0.92     0.92
512                       20                 0.96        0.95     0.96
512                       40                 0.97        0.95     0.96
512                       100                0.97        0.95     0.96
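
For readers who want to relate the three reported columns, the sketch below computes F1 as the harmonic mean of precision and recall and checks it against the first two supervised rows. The helper name `f1_score` is an assumption for illustration. The source does not state how the reported scores were aggregated, so if they are averaged across entity classes, a row's F1 need not equal the harmonic mean of its reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for the 256-length runs at 10 and 20 epochs, from the table above.
reported_rows = [(0.94, 0.92), (0.99, 0.97)]
rounded_f1 = [round(f1_score(p, r), 2) for p, r in reported_rows]
print(rounded_f1)  # [0.93, 0.98]
```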

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length   Number of epochs   Precision   Recall   F1 score
256                       10                 0.67        0.65     0.65
256                       20                 0.60        0.59     0.59
256                       40                 0.75        0.72     0.71
256                       100                0.73        0.68     0.68
512                       10                 0.72        0.62     0.64
512                       20                 0.62        0.57     0.58
512                       40                 0.72        0.66     0.67
512                       100                0.68        0.68     0.67
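
The eight fine-tuning runs above can be viewed as a small grid over maximum sequence length and number of epochs, with the learning rate and batch size held fixed. The sketch below only enumerates that grid and looks up the best reported F1 score; it does not run any training. The names `FineTuneConfig`, `build_grid`, and `REPORTED_F1` are hypothetical and are not part of the authors' code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    max_seq_length: int
    num_epochs: int
    learning_rate: float = 2e-5  # fixed across all runs, per the description
    batch_size: int = 16         # fixed across all runs, per the description


MAX_SEQ_LENGTHS = (256, 512)
EPOCH_COUNTS = (10, 20, 40, 100)


def build_grid() -> List[FineTuneConfig]:
    """Enumerate every (sequence length, epochs) combination."""
    return [FineTuneConfig(m, e) for m, e in product(MAX_SEQ_LENGTHS, EPOCH_COUNTS)]


# F1 scores copied from the transfer learning table above.
REPORTED_F1: Dict[Tuple[int, int], float] = {
    (256, 10): 0.65, (256, 20): 0.59, (256, 40): 0.71, (256, 100): 0.68,
    (512, 10): 0.64, (512, 20): 0.58, (512, 40): 0.67, (512, 100): 0.67,
}

grid = build_grid()
best = max(grid, key=lambda c: REPORTED_F1[(c.max_seq_length, c.num_epochs)])
print(len(grid), best.max_seq_length, best.num_epochs)  # 8 256 40
```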

This dataset is also part of the NusaCrowd project, which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,
author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
booktitle="Natural Language Processing and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="170--185",
abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
isbn="978-3-031-35320-8"
}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
Data from: A modest proposal for conducting future research on media...
figshare.com
data.niaid.nih.gov
pdf
Updated Dec 28, 2021
Cite
Harits Masduqi (2021). A modest proposal for conducting future research on media portrayals of Islam and Muslims in Indonesia [Dataset]. http://doi.org/10.6084/m9.figshare.16681825.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.16681825.v1
Dataset updated
Dec 28, 2021
Dataset provided by
figshare
Authors
Harits Masduqi
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Indonesia
Description
Political issues have recently become so dominant in Indonesia that people are divided and have become more intolerant of each other. Indonesia has the largest Muslim population in the world, and the role of Islam in Indonesian politics is significant. The current Indonesian government claims that moderate Muslims are loyal to the present political system, while its opponents, who are often labelled 'intolerant and radical Muslims' by the Indonesian mass media, often disagree with the central interpretation of democracy in Indonesia. Studies are therefore recommended on the contributing factors and discourse strategies used in news and articles in secular and Islamic mass media, which play a vital role in the construction of Muslim and Islamic identities in Indonesia.
Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City
data.mendeley.com
Updated Feb 18, 2025
Cite
Mutiara Dzakiroh (2025). Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City [Dataset]. http://doi.org/10.17632/cdyygzjcky.1
Unique identifier
https://doi.org/10.17632/cdyygzjcky.1
Dataset updated
Feb 18, 2025
Authors
Mutiara Dzakiroh
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Malang
Description
The Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City contains comprehensive data on the Hajj pilgrimage process for 2021 through 2024, gathered explicitly from the Malang City branch of Indonesia's Ministry of Religious Affairs (Kemenag). This dataset captures key information about the Hajj pilgrimage, including payment records, associated costs, and demographic details of the pilgrims, providing valuable insights into the financial aspects and trends over the four years.

Key Data Features:

Yearly Hajj Costs: Information on the financial breakdown of Hajj costs for each year, covering all components, including transportation, accommodation, and other mandatory fees.

Pilgrim Demographics: Data on the number and characteristics of pilgrims from Malang City, including age, gender, and other socioeconomic indicators.

Payment Status and History: Records of payments made by the pilgrims, detailing the timing, amount, and any outstanding balances.

Regulatory Changes: Information on changes in the regulations and policies of the Ministry of Religious Affairs (Kemenag) that may have impacted the cost structure or payment schedule during this period.

Inflation and Currency Impact: Data reflecting the impact of national inflation rates or currency fluctuations, particularly the value of the Indonesian Rupiah (IDR) relative to the Saudi Riyal (SAR), on the overall pilgrimage cost.

Hajj Quota and Registrations: The number of Hajj applicants from Malang City and the annual quota allocated to the region, including details on the selection process and waiting periods.

Potential Use Cases:

Cost Prediction: Analyze cost trends and predict future financial needs for the Hajj pilgrimage.

Policy Analysis: Assess the impact of government policies on the affordability and accessibility of Hajj for pilgrims.

Economic Analysis: Understand how national economic factors (inflation and exchange rates) affect pilgrimage costs.

Social Research: Study demographic patterns and regional participation in Hajj from Malang City.

This dataset provides an essential resource for anyone interested in the economic, social, and policy dimensions of the Hajj pilgrimage in Indonesia, particularly in the context of Malang City's unique data.
Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim...
thearda.com
Updated Oct 29, 2012
Cite
Fatima Sbaity Kassem (2012). Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim Majority Countries Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/K5MDA
Unique identifier
https://doi.org/10.17605/OSF.IO/K5MDA
Dataset updated
Oct 29, 2012
Dataset provided by
Association of Religion Data Archives
Authors
Fatima Sbaity Kassem
Dataset funded by
Fatima Sbaity Kassem
Description
These data were collected for a study of how the characteristics of political parties influence women's chances in assuming leadership positions within the parties' inner structures. Data were compiled by Fatima Sbaity Kassem for a case-study of Lebanon and by national and local researchers for 25 other countries in Asia, Africa and Europe. The researchers collected raw data on women in politics from party administrators and government officials. Researchers gathered information about parties' year of origin, number of seats in parliament, political platform, and all gender-disaggregated party data (in percentages) on overall party membership, shares in executive and decision-making bodies, and nominations on electoral lists. A key variable measures party religiosity, which refers to the religious components on their political platforms or the extent to which religion penetrates their political agendas.

Only parties that have at least one seat in any of the last three parliaments were included. These are referred to as 'relevant' parties. The four data sets combined cover 330 political parties in Lebanon plus 12 other Arab countries (Algeria, Bahrain, Comoros, Djibouti, Egypt, Jordan, Kuwait, Mauritania, Morocco, Palestine, Tunisia, and Yemen), seven non-Arab Muslim-majority countries (Albania, Afghanistan, Bangladesh, Bosnia-Herzegovina, Indonesia, Senegal, and Turkey), five European countries with dominant Christian democratic parties (Austria, Belgium, Italy, Germany, and the Netherlands), and Israel.
Data from: The future of Arabic language learning for non-Muslims as an...
figshare.com
xlsx
Updated Aug 31, 2023
Cite
Mahyudin Ritonga (2023). The future of Arabic language learning for non-Muslims as an actualization of Wasathiyah Islam in Indonesia [Dataset]. http://doi.org/10.6084/m9.figshare.24066195.v1
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.24066195.v1
Dataset updated
Aug 31, 2023
Dataset provided by
figshare
Authors
Mahyudin Ritonga
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Indonesia
Description
The research was carried out with two approaches: quantitative first, followed by qualitative. The research data were analyzed critically and comparatively so that the findings are logical to readers. The results of the study support two conclusions. First, learning Arabic for non-Muslims in Indonesia has a great opportunity; this finding is based on the basic function of language as a communication tool, which means that learning any language is not limited by religion, ethnicity, or race. Second, learning Arabic for non-Muslims in Indonesia will be part of the basis for the actualization of Wasathiyah Islam in Indonesia.

This research uses mixed methods. It follows a sequential explanatory design, which combines quantitative and qualitative approaches sequentially [26]. The quantitative approach took the form of a survey of the respondents in the research sample, while the qualitative approach used focused interviews, which are described descriptively. This type of research was conducted to obtain more comprehensive data on "The Future of Arabic Language Learning for Non-Muslims as the Actualization of Wasathiyah Islam in Indonesia" because it integrates the benefits of the two methods. The sample consisted of 64 respondents. The sampling technique was cluster random sampling combined with convenience sampling, meaning that the sample was taken at random and also selected based on the availability of respondents and the ease of obtaining data.

Quantitative survey data were analyzed by computing percentages from the questionnaire responses and then applying quantitative descriptive analysis. Qualitative data were analyzed using the Miles and Huberman model. First, after the data were collected, the researcher classified the data based on the specified research problem. Second, the researcher presented the data according to the specified problem. Third, the researcher concluded the findings from the research problem. Based on the research findings, this analysis focused on the three problems that have been formulated, and the data were critically examined by following these three stages.