Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IndQNER
IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:
3117 sentences
62027 tokens
2475 named entities
18 named entity categories
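To illustrate the BIO format mentioned above, here is a minimal Python sketch that groups BIO-tagged tokens into entity spans. The example sentence is made up (it is not taken from the dataset), and the label string `GeographicalLocation` is an assumption modeled on the class names in this document; the label strings in the released files may differ.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always starts a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # "I-" continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not match the open entity,
            # closes the open entity; the token itself is skipped.
            flush()
    flush()
    return spans


# Made-up example sentence with hypothetical label names.
tokens = ["Mereka", "berangkat", "menuju", "Masjidil", "Haram"]
tags = ["O", "O", "O", "B-GeographicalLocation", "I-GeographicalLocation"]
print(bio_to_spans(tokens, tags))
# [('Masjidil Haram', 'GeographicalLocation')]
```

The sketch treats an `I-` tag without a matching `B-` as outside any entity; other decoders choose to start a new entity in that case.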
Named Entity Classes
The named entity classes were initially defined by analyzing the existing Quran concepts ontology and were then updated based on information acquired during the annotation process. The final list contains 20 classes:
Allah
Allah's Throne
Artifact
Astronomical body
Event
False deity
Holy book
Language
Angel
Person
Messenger
Prophet
Sentient
Afterlife location
Geographical location
Color
Religion
Food
Fruit
The book of Allah
Annotation Stage
Eight annotators contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta:
Anggita Maharani Gumay Putri
Muhammad Destamal Junas
Naufaldi Hafidhigbal
Nur Kholis Azzam Ubaidillah
Puspitasari
Septiany Nur Anggita
Wilda Nurjannah
William Santoso
Verification Stage
We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at the Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta:
Dr. Eva Nugraha, M.Ag.
Dr. Jauhar Azizy, MA
Dr. Lilik Ummi Kultsum, MA
Evaluation
We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).
Supervised Learning Setting
The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:
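| Maximum sequence length | Number of epochs | Precision | Recall | F1 score |
|---|---|---|---|---|
| 256 | 10 | 0.94 | 0.92 | 0.93 |
| 256 | 20 | 0.99 | 0.97 | 0.98 |
| 256 | 40 | 0.96 | 0.96 | 0.96 |
| 256 | 100 | 0.97 | 0.96 | 0.96 |
| 512 | 10 | 0.92 | 0.92 | 0.92 |
| 512 | 20 | 0.96 | 0.95 | 0.96 |
| 512 | 40 | 0.97 | 0.95 | 0.96 |
| 512 | 100 | 0.97 | 0.95 | 0.96 |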
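For reference, precision, recall and F1 for NER are commonly computed over entity spans. The sketch below shows that standard calculation. Whether the scores reported here were computed at the entity or the token level, and how they were averaged, is not stated in this document, so this is an illustration rather than the authors' evaluation script. The span tuples and label names are hypothetical.

```python
from typing import Set, Tuple

# (start_token_index, end_token_index, label); hypothetical representation.
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if boundaries and label both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: the second prediction has the right boundaries but the wrong label.
gold = {(0, 1, "Prophet"), (4, 6, "GeographicalLocation"), (9, 10, "Allah")}
predicted = {(0, 1, "Prophet"), (4, 6, "Religion"), (9, 10, "Allah")}

p, r, f = precision_recall_f1(gold, predicted)
print(round(p, 2), round(r, 2), round(f, 2))
# 0.67 0.67 0.67
```

Applying the same F1 formula to a precision of 0.99 and a recall of 0.97 gives roughly 0.98.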
Transfer Learning Setting
We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:
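| Maximum sequence length | Number of epochs | Precision | Recall | F1 score |
|---|---|---|---|---|
| 256 | 10 | 0.67 | 0.65 | 0.65 |
| 256 | 20 | 0.60 | 0.59 | 0.59 |
| 256 | 40 | 0.75 | 0.72 | 0.71 |
| 256 | 100 | 0.73 | 0.68 | 0.68 |
| 512 | 10 | 0.72 | 0.62 | 0.64 |
| 512 | 20 | 0.62 | 0.57 | 0.58 |
| 512 | 40 | 0.72 | 0.66 | 0.67 |
| 512 | 100 | 0.68 | 0.68 | 0.67 |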
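Some rows above list an F1 score below the harmonic mean of the listed precision and recall (a precision of 0.75 and a recall of 0.72 give about 0.73, while the listed F1 is 0.71). One possible reason, which is an assumption and not confirmed in this document, is averaging across classes: the mean of per-class F1 scores need not equal the harmonic mean of the averaged precision and recall. The per-class numbers below are invented purely to show the effect.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented per-class (precision, recall) pairs for two hypothetical classes.
per_class = {"ClassA": (1.0, 0.5), "ClassB": (0.5, 1.0)}

macro_precision = sum(p for p, _ in per_class.values()) / len(per_class)
macro_recall = sum(r for _, r in per_class.values()) / len(per_class)
# Macro F1: average the per-class F1 scores.
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)
# Alternative: harmonic mean of the already-averaged precision and recall.
f1_of_macros = f1(macro_precision, macro_recall)

print(round(macro_precision, 2), round(macro_recall, 2),
      round(macro_f1, 2), round(f1_of_macros, 2))
# 0.75 0.75 0.67 0.75
```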
This dataset is also part of the NusaCrowd project, which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.
How to Cite
@InProceedings{10.1007/978-3-031-35320-8_12,
  author = "Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
  editor = "M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
  title = "IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
  booktitle = "Natural Language Processing and Information Systems",
  year = "2023",
  publisher = "Springer Nature Switzerland",
  address = "Cham",
  pages = "170--185",
  abstract = "Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
  isbn = "978-3-031-35320-8"
}
Contact
If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Political issues have recently become so dominant in Indonesia that people are divided and have grown more intolerant of each other. Indonesia has the largest Muslim population in the world, and the role of Islam in Indonesian politics is significant. The current Indonesian government claims that moderate Muslims are loyal to the present political system, while its opponents, whom the Indonesian mass media often label 'intolerant and radical Muslims', often disagree with the central interpretation of democracy in Indonesia. Studies are therefore recommended on the contributing factors and discourse strategies used in news and articles in secular and Islamic mass media, which play a vital role in the construction of Muslim and Islamic identities in Indonesia.
These data were collected for a study of how the characteristics of political parties influence women's chances in assuming leadership positions within the parties' inner structures. Data were compiled by Fatima Sbaity Kassem for a case-study of Lebanon and by national and local researchers for 25 other countries in Asia, Africa and Europe. The researchers collected raw data on women in politics from party administrators and government officials. Researchers gathered information about parties' year of origin, number of seats in parliament, political platform, and all gender-disaggregated party data (in percentages) on overall party membership, shares in executive and decision-making bodies, and nominations on electoral lists. A key variable measures party religiosity, which refers to the religious components on their political platforms or the extent to which religion penetrates their political agendas.
Only parties that have at least one seat in any of the last three parliaments were included. These are referred to as 'relevant' parties. The four data sets combined cover 330 political parties in Lebanon plus 12 other Arab countries (Algeria, Bahrain, Comoros, Djibouti, Egypt, Jordan, Kuwait, Mauritania, Morocco, Palestine, Tunisia, and Yemen), seven non-Arab Muslim-majority countries (Albania, Afghanistan, Bangladesh, Bosnia-Herzegovina, Indonesia, Senegal, and Turkey), five European countries with dominant Christian democratic parties (Austria, Belgium, Italy, Germany, and the Netherlands), and Israel.