4 datasets found

Z
IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...
data.niaid.nih.gov
Updated Jan 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gusmita, Ria Hari; Firmansyah, Asep Fajar (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
Explore at:
Dataset updated
Jan 27, 2024
Dataset provided by
Islamic State University Syarif Hidayatullah Jakarta, Paderborn University
Authors
Gusmita, Ria Hari; Firmansyah, Asep Fajar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences

62027 tokens

2475 named entities

18 named entity categories

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah

Allah's Throne

Artifact

Astronomical body

Event

False deity

Holy book

Language

Angel

Person

Messenger

Prophet

Sentient

Afterlife location

Geographical location

Color

Religion

Food

Fruit

The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri

Muhammad Destamal Junas

Naufaldi Hafidhigbal

Nur Kholis Azzam Ubaidillah

Puspitasari

Septiany Nur Anggita

Wilda Nurjannah

William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.

Dr. Jauhar Azizy, MA

Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.94 0.92 0.93

256 20 0.99 0.97 0.98

256 40 0.96 0.96 0.96

256 100 0.97 0.96 0.96

512 10 0.92 0.92 0.92

512 20 0.96 0.95 0.96

512 40 0.97 0.95 0.96

512 100 0.97 0.95 0.96

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.67 0.65 0.65

256 20 0.60 0.59 0.59

256 40 0.75 0.72 0.71

256 100 0.73 0.68 0.68

512 10 0.72 0.62 0.64

512 20 0.62 0.57 0.58

512 40 0.72 0.66 0.67

512 100 0.68 0.68 0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,author="Gusmita, Ria Hariand Firmansyah, Asep Fajarand Moussallem, Diegoand Ngonga Ngomo, Axel-Cyrille",editor="M{\'e}tais, Elisabethand Meziane, Faridand Sugumaran, Vijayanand Manning, Warrenand Reiff-Marganiec, Stephan",title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",booktitle="Natural Language Processing and Information Systems",year="2023",publisher="Springer Nature Switzerland",address="Cham",pages="170--185",abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",isbn="978-3-031-35320-8"}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
f
Data from: A modest proposal for conducting future research on media...
figshare.com
data.niaid.nih.gov
+2more
pdf
Updated Dec 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Harits Masduqi (2021). A modest proposal for conducting future research on media portrayals of Islam and Muslims in Indonesia [Dataset]. http://doi.org/10.6084/m9.figshare.16681825.v1
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.16681825.v1
Dataset updated
Dec 28, 2021
Dataset provided by
figshare
Authors
Harits Masduqi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Indonesia
Description
Recent issues on politics have been dominant in Indonesia that people are divided and become more intolerant of each other. Indonesia has the biggest Muslim population in the world and the role of Islam in Indonesian politics is significant. The current Indonesian government claim that moderate Muslims are loyal to the present political system while the opposing rivals who are often labelled’intolerant and radical Muslims’ by Indonesian mass media often disagree with the central interpretation of democracy in Indonesia. Studies on contributing factors and discourse strategies used in news and articles in secular and Islamic mass media which play a vital role in the construction of Muslim and Islamic identities in Indonesia are, therefore, recommended.
Z
IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...
nde-dev.biothings.io
Updated Jan 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7454891
Explore at:
Dataset updated
Jan 27, 2024
Dataset provided by
Firmansyah, Asep Fajar
Gusmita, Ria Hari
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences

62027 tokens

2475 named entities

18 named entity categories

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah

Allah's Throne

Artifact

Astronomical body

Event

False deity

Holy book

Language

Angel

Person

Messenger

Prophet

Sentient

Afterlife location

Geographical location

Color

Religion

Food

Fruit

The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri

Muhammad Destamal Junas

Naufaldi Hafidhigbal

Nur Kholis Azzam Ubaidillah

Puspitasari

Septiany Nur Anggita

Wilda Nurjannah

William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.

Dr. Jauhar Azizy, MA

Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.94 0.92 0.93

256 20 0.99 0.97 0.98

256 40 0.96 0.96 0.96

256 100 0.97 0.96 0.96

512 10 0.92 0.92 0.92

512 20 0.96 0.95 0.96

512 40 0.97 0.95 0.96

512 100 0.97 0.95 0.96

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.67 0.65 0.65

256 20 0.60 0.59 0.59

256 40 0.75 0.72 0.71

256 100 0.73 0.68 0.68

512 10 0.72 0.62 0.64

512 20 0.62 0.57 0.58

512 40 0.72 0.66 0.67

512 100 0.68 0.68 0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,author="Gusmita, Ria Hariand Firmansyah, Asep Fajarand Moussallem, Diegoand Ngonga Ngomo, Axel-Cyrille",editor="M{\'e}tais, Elisabethand Meziane, Faridand Sugumaran, Vijayanand Manning, Warrenand Reiff-Marganiec, Stephan",title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",booktitle="Natural Language Processing and Information Systems",year="2023",publisher="Springer Nature Switzerland",address="Cham",pages="170--185",abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",isbn="978-3-031-35320-8"}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim...
thearda.com
Updated Oct 29, 2012
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fatima Sbaity Kassem (2012). Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim Majority Countries Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/K5MDA
Explore at:
Unique identifier
https://doi.org/10.17605/OSF.IO/K5MDA
Dataset updated
Oct 29, 2012
Dataset provided by
Association of Religion Data Archives
Authors
Fatima Sbaity Kassem
Dataset funded by
Fatima Sbaity Kassem
Description
These data were collected for a study of how the characteristics of political parties influence women's chances in assuming leadership positions within the parties' inner structures. Data were compiled by Fatima Sbaity Kassem for a case-study of Lebanon and by national and local researchers for 25 other countries in Asia, Africa and Europe. The researchers collected raw data on women in politics from party administrators and government officials. Researchers gathered information about parties' year of origin, number of seats in parliament, political platform, and all gender-disaggregated party data (in percentages) on overall party membership, shares in executive and decision-making bodies, and nominations on electoral lists. A key variable measures party religiosity, which refers to the religious components on their political platforms or the extent to which religion penetrates their political agendas.

Only parties that have at least one seat in any of the last three parliaments were included. These are referred to as 'relevant' parties. The four data sets combined cover 330 political parties in Lebanon plus 12 other Arab countries (Algeria, Bahrain, Comoros, Djibouti, Egypt, Jordan, Kuwait, Mauritania, Morocco, Palestine, Tunisia, and Yemen), seven non-Arab Muslim-majority countries (Albania, Afghanistan, Bangladesh, Bosnia-Herzegovina, Indonesia, Senegal, and Turkey), five European countries with dominant Christian democratic parties (Austria, Belgium, Italy, Germany, and the Netherlands), and Israel.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Gusmita, Ria Hari; Firmansyah, Asep Fajar (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran

Explore at:

Dataset updated

Jan 27, 2024

Dataset provided by

Islamic State University Syarif Hidayatullah Jakarta, Paderborn University

Authors

Gusmita, Ria Hari; Firmansyah, Asep Fajar

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences

62027 tokens

2475 named entities

18 named entity categories

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah

Allah's Throne

Artifact

Astronomical body

Event

False deity

Holy book

Language

Angel

Person

Messenger

Prophet

Sentient

Afterlife location

Geographical location

Color

Religion

Food

Fruit

The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri

Muhammad Destamal Junas

Naufaldi Hafidhigbal

Nur Kholis Azzam Ubaidillah

Puspitasari

Septiany Nur Anggita

Wilda Nurjannah

William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.

Dr. Jauhar Azizy, MA

Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.94 0.92 0.93

256 20 0.99 0.97 0.98

256 40 0.96 0.96 0.96

256 100 0.97 0.96 0.96

512 10 0.92 0.92 0.92

512 20 0.96 0.95 0.96

512 40 0.97 0.95 0.96

512 100 0.97 0.95 0.96

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length Number of e-poch Precision Recall F1 score

256 10 0.67 0.65 0.65

256 20 0.60 0.59 0.59

256 40 0.75 0.72 0.71

256 100 0.73 0.68 0.68

512 10 0.72 0.62 0.64

512 20 0.62 0.57 0.58

512 40 0.72 0.66 0.67

512 100 0.68 0.68 0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,author="Gusmita, Ria Hariand Firmansyah, Asep Fajarand Moussallem, Diegoand Ngonga Ngomo, Axel-Cyrille",editor="M{\'e}tais, Elisabethand Meziane, Faridand Sugumaran, Vijayanand Manning, Warrenand Reiff-Marganiec, Stephan",title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",booktitle="Natural Language Processing and Information Systems",year="2023",publisher="Springer Nature Switzerland",address="Cham",pages="170--185",abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",isbn="978-3-031-35320-8"}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

Clear search

Close search

Google apps

Main menu

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

Data from: A modest proposal for conducting future research on media...

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim...

IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran