4 datasets found
  1. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Firmansyah, Asep Fajar (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Gusmita, Ria Hari
    Firmansyah, Asep Fajar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
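    As a rough illustration of the BIO format mentioned above, the sketch below groups BIO-tagged tokens into entity spans. The example sentence and the label names (`Person`, `GeographicalLocation`) are made up for illustration and are not taken from the dataset files; the function name `bio_to_spans` is likewise hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def close_span() -> None:
        # Emit the entity collected so far, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens = []
        current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_span()
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the open entity of the same label.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) ends the entity.
            close_span()
    close_span()
    return spans


# Hypothetical sentence and labels, for illustration only.
example_tokens = ["Musa", "pergi", "ke", "Gunung", "Sinai"]
example_tags = ["B-Person", "O", "O", "B-GeographicalLocation", "I-GeographicalLocation"]
example_spans = bio_to_spans(example_tokens, example_tags)
print(example_spans)
```

    Here an `I-` tag without a matching open entity is treated like `O`; other conventions (for example, promoting it to a new entity) are equally possible.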

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at the Quran and Tafseer Department of the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    Maximum sequence length   Number of epochs   Precision   Recall   F1 score
    256                       10                 0.94        0.92     0.93
    256                       20                 0.99        0.97     0.98
    256                       40                 0.96        0.96     0.96
    256                       100                0.97        0.96     0.96
    512                       10                 0.92        0.92     0.92
    512                       20                 0.96        0.95     0.96
    512                       40                 0.97        0.95     0.96
    512                       100                0.97        0.95     0.96
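    For readers who want to sanity-check the table, the sketch below recomputes F1 as the harmonic mean of precision and recall for the supervised rows. The description does not say how the reported scores were averaged, so this is only a consistency check under the assumption of a single precision/recall pair per row; scores averaged per class need not satisfy the identity exactly, and the inputs are already rounded to two decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (max sequence length, epochs, precision, recall, reported F1),
# copied from the supervised learning results above.
SUPERVISED_ROWS = [
    (256, 10, 0.94, 0.92, 0.93),
    (256, 20, 0.99, 0.97, 0.98),
    (256, 40, 0.96, 0.96, 0.96),
    (256, 100, 0.97, 0.96, 0.96),
    (512, 10, 0.92, 0.92, 0.92),
    (512, 20, 0.96, 0.95, 0.96),
    (512, 40, 0.97, 0.95, 0.96),
    (512, 100, 0.97, 0.95, 0.96),
]

# Largest gap between the recomputed and the reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - f1) for _, _, p, r, f1 in SUPERVISED_ROWS)

best_row_f1 = round(f1_score(0.99, 0.97), 2)  # 256 tokens, 20 epochs
print(best_row_f1, max_gap)
```

    A gap of about 0.01 or less is what rounding of the inputs alone would explain; larger gaps would suggest a different averaging scheme.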

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    Maximum sequence length   Number of epochs   Precision   Recall   F1 score
    256                       10                 0.67        0.65     0.65
    256                       20                 0.60        0.59     0.59
    256                       40                 0.75        0.72     0.71
    256                       100                0.73        0.68     0.68
    512                       10                 0.72        0.62     0.64
    512                       20                 0.62        0.57     0.58
    512                       40                 0.72        0.66     0.67
    512                       100                0.68        0.68     0.67
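    The eight fine-tuning runs above form a small grid over sequence length and epoch count with a fixed learning rate and batch size. One way to enumerate that grid is sketched below; the class name `FineTuneConfig` and the function `build_grid` are hypothetical and do not come from the authors' code, and no training is performed.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class FineTuneConfig:
    """One fine-tuning run; defaults mirror the values stated in the description."""
    max_seq_length: int
    num_epochs: int
    learning_rate: float = 2e-5
    batch_size: int = 16


def build_grid() -> List[FineTuneConfig]:
    # Two sequence lengths times four epoch counts gives the eight reported runs.
    seq_lengths = (256, 512)
    epoch_counts = (10, 20, 40, 100)
    return [FineTuneConfig(m, e) for m, e in product(seq_lengths, epoch_counts)]


grid = build_grid()
for config in grid:
    print(config)
```

    Keeping the fixed hyperparameters as dataclass defaults makes it obvious which settings vary between runs and which do not.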

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
      author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
      editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
      title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
      booktitle="Natural Language Processing and Information Systems",
      year="2023",
      publisher="Springer Nature Switzerland",
      address="Cham",
      pages="170--185",
      abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
      isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  2. Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City

    • data.mendeley.com
    Updated Feb 18, 2025
    Cite
    Mutiara Dzakiroh (2025). Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City [Dataset]. http://doi.org/10.17632/cdyygzjcky.1
    Dataset updated
    Feb 18, 2025
    Authors
    Mutiara Dzakiroh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Malang
    Description

    The Hajj Dataset 2021-2024: Ministry of Religious Affairs Malang City contains data on the Hajj pilgrimage process for 2021 through 2024, gathered from the Malang City branch of Indonesia's Ministry of Religious Affairs (Kemenag). The dataset covers payment records, associated costs, and demographic details of the pilgrims, giving a view of the financial aspects and trends over the four years.

    Key Data Features:

    Yearly Hajj Costs: Information on the financial breakdown of Hajj costs for each year, covering all components, including transportation, accommodation, and other mandatory fees.

    Pilgrim Demographics: Data on the number and characteristics of pilgrims from Malang City, including age, gender, and other socioeconomic indicators.

    Payment Status and History: Records of payments made by the pilgrims detailing the timing, amount, and any outstanding balances.

    Regulatory Changes: Information on changes in the regulations and policies of the Ministry of Religious Affairs (Kemenag) that may have impacted the cost structure or payment schedule during this period.

    Inflation and Currency Impact: Data reflecting the impact of national inflation rates or currency fluctuations, particularly the value of the Indonesian Rupiah (IDR) relative to the Saudi Riyal (SAR), on the overall pilgrimage cost.

    Hajj Quota and Registrations: The number of Hajj applicants from Malang City and the annual quota allocated to the region, including details on the selection process and waiting periods.

    Potential Use Cases:

    Cost Prediction: Analyze cost trends and predict future financial needs for the Hajj pilgrimage.

    Policy Analysis: Assess the impact of government policies on the affordability and accessibility of Hajj for pilgrims.

    Economic Analysis: Understand how national economic factors (inflation and exchange rates) affect pilgrimage costs.

    Social Research: Study demographic patterns and regional participation in Hajj from Malang City.

    This dataset is a resource for anyone interested in the economic, social, and policy dimensions of the Hajj pilgrimage in Indonesia, particularly as they apply to Malang City.

  3. Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim...

    • thearda.com
    Updated Oct 29, 2012
    Cite
    Fatima Sbaity Kassem (2012). Party Variation in Religiosity and Womens Leadership, Non-Arab Muslim Majority Countries Dataset [Dataset]. http://doi.org/10.17605/OSF.IO/K5MDA
    Dataset updated
    Oct 29, 2012
    Dataset provided by
    Association of Religion Data Archives
    Authors
    Fatima Sbaity Kassem
    Dataset funded by
    Fatima Sbaity Kassem
    Description

    These data were collected for a study of how the characteristics of political parties influence women's chances in assuming leadership positions within the parties' inner structures. Data were compiled by Fatima Sbaity Kassem for a case-study of Lebanon and by national and local researchers for 25 other countries in Asia, Africa and Europe. The researchers collected raw data on women in politics from party administrators and government officials. Researchers gathered information about parties' year of origin, number of seats in parliament, political platform, and all gender-disaggregated party data (in percentages) on overall party membership, shares in executive and decision-making bodies, and nominations on electoral lists. A key variable measures party religiosity, which refers to the religious components on their political platforms or the extent to which religion penetrates their political agendas.

    Only parties that have at least one seat in any of the last three parliaments were included. These are referred to as 'relevant' parties. The four data sets combined cover 330 political parties in Lebanon plus 12 other Arab countries (Algeria, Bahrain, Comoros, Djibouti, Egypt, Jordan, Kuwait, Mauritania, Morocco, Palestine, Tunisia, and Yemen), seven non-Arab Muslim-majority countries (Albania, Afghanistan, Bangladesh, Bosnia-Herzegovina, Indonesia, Senegal, and Turkey), five European countries with dominant Christian democratic parties (Austria, Belgium, Italy, Germany, and the Netherlands), and Israel.

  4. Data of Main Survey - Service quality in the Indonesian Islamic Bank

    • figshare.com
    Updated Jul 2, 2022
    Cite
    Kifayah Amar; Khairun Nadiyah; Syamsul Bahri (2022). Data of Main Survey - Service quality in the Indonesian Islamic Bank [Dataset]. http://doi.org/10.6084/m9.figshare.20217602.v2
    Dataset updated
    Jul 2, 2022
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Kifayah Amar; Khairun Nadiyah; Syamsul Bahri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains the perceptions and expectations of respondents who are customers of Islamic banks in Indonesia. Demographic data are also provided.


