4 datasets found
  1. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari; Firmansyah, Asep Fajar (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    State Islamic University Syarif Hidayatullah Jakarta, Paderborn University
    Authors
    Gusmita, Ria Hari; Firmansyah, Asep Fajar
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
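
To make the BIO format concrete, here is a minimal Python sketch that groups a BIO tag sequence into entity spans. The tag names (`B-Person`, `B-GeographicalLocation`) and the example sequence are illustrative assumptions, not the label strings or data shipped with IndQNER.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group a BIO tag sequence into entity spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the labels match.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Lenient choice: a stray I- tag opens a new span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical five-token sentence: one single-token and one two-token entity.
example_tags = ["O", "B-Person", "O",
                "B-GeographicalLocation", "I-GeographicalLocation"]
print(bio_to_spans(example_tags))
```

Treating a stray `I-` tag as the start of a span is one of several possible conventions; a stricter reader could discard such tags instead.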

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers in the Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    Maximum sequence length    Number of epochs    Precision    Recall    F1 score
    256                        10                  0.94         0.92      0.93
    256                        20                  0.99         0.97      0.98
    256                        40                  0.96         0.96      0.96
    256                        100                 0.97         0.96      0.96
    512                        10                  0.92         0.92      0.92
    512                        20                  0.96         0.95      0.96
    512                        40                  0.97         0.95      0.96
    512                        100                 0.97         0.95      0.96
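
For readers unfamiliar with the three metrics, the following Python sketch computes precision, recall and F1 over exact-match entity spans. The description does not say how the reported scores were computed or averaged, so this is an assumption about a common convention rather than the authors' evaluation code, and the spans are made up for illustration.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level scores: a prediction counts only if
    its label and both boundaries equal a gold span."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical spans: one correct prediction, one with a wrong boundary,
# and one gold entity that was missed entirely.
gold = {("Person", 1, 2), ("GeographicalLocation", 3, 5), ("Allah", 7, 8)}
predicted = {("Person", 1, 2), ("GeographicalLocation", 3, 4)}
print(precision_recall_f1(gold, predicted))
```

When scores are averaged across classes, the reported F1 need not equal the harmonic mean of the reported precision and recall.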

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    Maximum sequence length    Number of epochs    Precision    Recall    F1 score
    256                        10                  0.67         0.65      0.65
    256                        20                  0.60         0.59      0.59
    256                        40                  0.75         0.72      0.71
    256                        100                 0.73         0.68      0.68
    512                        10                  0.72         0.62      0.64
    512                        20                  0.62         0.57      0.58
    512                        40                  0.72         0.66      0.67
    512                        100                 0.68         0.68      0.67
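
The eight fine-tuning runs form a small grid over sequence length and epoch count, with the learning rate and batch size held fixed. A minimal Python sketch of that grid follows; the `RunConfig` class and `build_grid` function are hypothetical names introduced here, and no training code is implied.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """One fine-tuning run (hypothetical container, not from the source)."""
    max_seq_length: int
    num_epochs: int
    batch_size: int = 16         # fixed for all runs, as stated above
    learning_rate: float = 2e-5  # fixed for all runs, as stated above


MAX_SEQ_LENGTHS = (256, 512)
EPOCH_COUNTS = (10, 20, 40, 100)


def build_grid() -> List[RunConfig]:
    """Enumerate every combination of sequence length and epoch count."""
    return [RunConfig(length, epochs)
            for length, epochs in product(MAX_SEQ_LENGTHS, EPOCH_COUNTS)]


for config in build_grid():
    print(config)
```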

    This dataset is also part of the NusaCrowd project, which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
      author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
      editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
      title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
      booktitle="Natural Language Processing and Information Systems",
      year="2023",
      publisher="Springer Nature Switzerland",
      address="Cham",
      pages="170--185",
      abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
      isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  2. Data from: A modest proposal for conducting future research on media...

    • figshare.com
    • data.niaid.nih.gov
    • +2 more
    Updated Dec 28, 2021
    Cite
    Harits Masduqi (2021). A modest proposal for conducting future research on media portrayals of Islam and Muslims in Indonesia [Dataset]. http://doi.org/10.6084/m9.figshare.16681825.v1
    Available download formats: pdf
    Dataset updated
    Dec 28, 2021
    Dataset provided by
    figshare
    Authors
    Harits Masduqi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Indonesia
    Description

    Political issues have recently become so dominant in Indonesia that people are divided and increasingly intolerant of each other. Indonesia has the largest Muslim population in the world, and the role of Islam in Indonesian politics is significant. The current Indonesian government claims that moderate Muslims are loyal to the present political system, while its opponents, who are often labelled 'intolerant and radical Muslims' by the Indonesian mass media, often disagree with the central interpretation of democracy in Indonesia. Studies are therefore recommended on the contributing factors and discourse strategies used in news and articles in secular and Islamic mass media, which play a vital role in the construction of Muslim and Islamic identities in Indonesia.

  3. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • nde-dev.biothings.io
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_7454891
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Firmansyah, Asep Fajar
    Gusmita, Ria Hari
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically


  4. Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim...

    • thearda.com
    Updated Oct 29, 2012
    Cite
    Fatima Sbaity Kassem (2012). Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim Majority Countries Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/K5MDA
    Dataset updated
    Oct 29, 2012
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Fatima Sbaity Kassem
    Dataset funded by
    Fatima Sbaity Kassem
    Description

    These data were collected for a study of how the characteristics of political parties influence women's chances of assuming leadership positions within the parties' inner structures. Data were compiled by Fatima Sbaity Kassem for a case study of Lebanon and by national and local researchers for 25 other countries in Asia, Africa and Europe. The researchers collected raw data on women in politics from party administrators and government officials. Researchers gathered information about parties' year of origin, number of seats in parliament, political platform, and all gender-disaggregated party data (in percentages) on overall party membership, shares in executive and decision-making bodies, and nominations on electoral lists. A key variable measures party religiosity, which refers to the religious components in their political platforms or the extent to which religion penetrates their political agendas.

    Only parties that have at least one seat in any of the last three parliaments were included. These are referred to as 'relevant' parties. The four data sets combined cover 330 political parties in Lebanon plus 12 other Arab countries (Algeria, Bahrain, Comoros, Djibouti, Egypt, Jordan, Kuwait, Mauritania, Morocco, Palestine, Tunisia, and Yemen), seven non-Arab Muslim-majority countries (Albania, Afghanistan, Bangladesh, Bosnia-Herzegovina, Indonesia, Senegal, and Turkey), five European countries with dominant Christian democratic parties (Austria, Belgium, Italy, Germany, and the Netherlands), and Israel.

