License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
Implementations:
Description:
We present the Nepali Handwriting Dataset (NHD), which is a collection of camera-captured images of Nepali handwritten text from various regions in Nepal. The dataset aims to provide a benchmark for researchers to explore new techniques in handwriting detection and recognition. We also present benchmark results for text localization and recognition using well-established deep-learning frameworks. The dataset and benchmark results are available here.
Key Features:
Data collection and preprocessing are central to research on handwritten text detection, since they determine how comprehensive and diverse the resulting dataset is. To this end, the researchers personally collected 1,000 mobile phone-captured samples from various sources, including schools, government offices, universities, and student councils.
The dataset covers three age-group categories: kids, youth, and adults, with 599, 152, and 249 samples, respectively. The researchers annotated each of the 1,000 pages themselves to ensure accurate labeling. The collection process focused on capturing the range of handwriting styles and variations found across these age groups and settings.
The dataset was used to train and evaluate the handwritten text detection models in the research, with the aim of detecting handwritten text accurately across different age groups and settings.
Use Cases:
Results:
You can find its implementation here: https://github.com/R4j4n/Nepali-Text-Detection-DBnet
Recall: 0.9069154470416869
Precision: 0.9178659178659179
HMean: 0.9123578206927347
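As a quick consistency check on the three figures above, the sketch below recomputes HMean as the harmonic mean of precision and recall, which is the usual definition in text detection benchmarks. The function name `harmonic_mean` is introduced here for illustration and is not taken from the linked implementation.

```python
def harmonic_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (also known as F1 or HMean)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Values reported above for the detection benchmark.
precision = 0.9178659178659179
recall = 0.9069154470416869

hmean = harmonic_mean(precision, recall)
print(f"HMean: {hmean:.6f}")
```

The recomputed value agrees with the reported HMean to about six decimal places.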
Test Image:
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4786384%2Ff8d9aa282a42848b359aeeb021b97937%2Foutput.png?generation=1695433752833462&alt=media
If you find this dataset useful, your support through an upvote would be greatly appreciated ❤️🙂
Thank you
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🧠 NaBI: Nepali Bias & Information Dataset
The NaBI (Nepali Bias & Information) Dataset is a curated and annotated dataset developed for the classification of Nepali language content into four critical categories related to information integrity and harmful content. It is designed to aid in the development of NLP models for moderation, content filtering, and sociopolitical analysis.
💡 Dataset Description
The dataset consists of Nepali text samples sourced from public… See the full description on the dataset page: https://huggingface.co/datasets/Utkarsha666/NaBI.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Techsalerator's News Events Data for Nepal: A Comprehensive Overview
Techsalerator's News Events Data for Nepal offers a valuable resource for businesses, researchers, and media organizations. This dataset aggregates information on key news events throughout Nepal, sourcing data from various media outlets, including news channels, online publications, and social media platforms. It provides essential insights for those interested in tracking trends, analyzing public sentiment, or observing industry-specific developments.
Key Data Fields
- Event Date: Captures the exact date of the news event, crucial for analysts monitoring trends over time or businesses responding to market changes.
- Event Title: A brief headline describing the event, allowing users to quickly categorize and assess news content based on their interests.
- Source: Identifies the news outlet or platform where the event was reported, helping users track credible sources and evaluate the reach and influence of the event.
- Location: Provides geographic information on where the event occurred within Nepal, valuable for regional analysis or localized marketing efforts.
- Event Description: A detailed summary of the event, outlining key developments, participants, and potential impact, aiding researchers and businesses in understanding the context and implications.
Top 5 News Categories in Nepal
- Politics: Major news on government decisions, political movements, elections, and policy changes affecting the national landscape.
- Economy: Covers Nepal’s economic indicators, inflation rates, international trade, and corporate activities influencing business and finance sectors.
- Social Issues: News on protests, public health, education, and other societal concerns driving public discourse.
- Sports: Highlights events in popular sports such as football and cricket, drawing widespread attention and engagement.
- Technology and Innovation: Reports on tech developments, startups, and innovations within Nepal’s growing tech ecosystem, featuring emerging companies and advancements.
Top 5 News Sources in Nepal
- The Kathmandu Post: A leading news outlet providing extensive coverage of national politics, economy, and social issues.
- Republica: A major newspaper known for its timely updates on breaking news, politics, and current affairs.
- Nagarik News: A widely-read source offering insights into local politics, economic developments, and societal trends.
- My Republica: Covers a broad spectrum of topics, including politics, economy, and social issues.
- Khabarhub: The national news agency delivering updates on significant events, public health, and sports across Nepal.
Accessing Techsalerator’s News Events Data for Nepal
To access Techsalerator’s News Events Data for Nepal, please contact info@techsalerator.com with your specific needs. We will provide a customized quote based on the data fields and records you require, with delivery available within 24 hours. Ongoing access options can also be discussed.
Included Data Fields
- Event Date
- Event Title
- Source
- Location
- Event Description
- Event Category (Politics, Economy, Sports, etc.)
- Participants (if applicable)
- Event Impact (Social, Economic, etc.)
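To make the field list above concrete, here is a minimal sketch of how one record could be modeled in Python. The class name `NewsEvent`, the attribute names, the types, and the example values are all assumptions made for illustration; the actual delivery format and column names are defined by Techsalerator and are not described in this listing.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class NewsEvent:
    """Hypothetical record mirroring the included data fields."""
    event_date: date
    event_title: str
    source: str
    location: str
    event_description: str
    event_category: str                      # e.g. Politics, Economy, Sports
    participants: Optional[List[str]] = None  # only "if applicable"
    event_impact: Optional[str] = None        # e.g. Social, Economic


# Made-up example values, not taken from the dataset.
example = NewsEvent(
    event_date=date(2024, 1, 1),
    event_title="Placeholder headline",
    source="Placeholder outlet",
    location="Kathmandu",
    event_description="Placeholder summary of the event.",
    event_category="Politics",
)
print(example.event_title, example.event_category)
```

Optional fields default to `None` because the listing marks participants as present only where applicable.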
Techsalerator’s dataset is an essential tool for tracking significant events in Nepal, supporting informed decisions whether for business strategy, market analysis, or academic research, and offering a clear view of the country’s news landscape.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises compiled data utilized for the integrated seismic risk assessment presented in the following study:
Bhochhibhoya, S., & Maharjan, R. (2022). Integrated seismic risk assessment in Nepal. Natural Hazards and Earth System Sciences, 22(10), 3211-3230. https://doi.org/10.5194/nhess-22-3211-2022
Dataset Contents:
data_used_paper.csv: Municipality-level data used directly in the paper. Note that some entries are available only at the district level; please refer to the study for specific details.
opendata.xlsx: A comprehensive Excel file compiling relevant district-level data obtained from the OpenData website.
additional_survey_district.csv: Census data at the district level that was not included in the analysis.
Data Sources: The data were compiled from publicly available sources and were not originally collected by the authors. Key sources include:
CBS – Central Bureau of Statistics: National Population and Housing Census 2011 (National Report), https://unstats.un.org/unsd/demographic-social/census/documents/Nepal/Nepal-Census-2011-Vol1.pdf (last access: 20 November 2021), 2012.
CBS – Central Bureau of Statistics: Population Monograph of Nepal, Vol. I (Population Dynamics), https://nepal.unfpa.org/sites/default/files/pub-pdf/PopulationMonograph2014Volume1.pdf (last access: 20 November 2021), 2014a.
CBS – Central Bureau of Statistics: Population Monograph of Nepal, Vol. III (Economical Demography), https://nepal.unfpa.org/sites/default/files/pub-pdf/PopulationMonographV02.pdf (last access: 20 November 2021), 2014b.
Sharma, P., Guha-Khasnobis, B., and Khanal, D. R.: Nepal human development report 2014, https://www.npc.gov.np/images/category/NHDR_Report_2014.pdf (last access: 20 November 2021), 2014.
Department of Health Services (2013).
Budget report for year 2070–2071 BS (Bikram Sambat, based on Nepali calendar) (2013–2014 CE).
Department of Education (2013–2014).
Opendata Website.
If the dataset is used, please cite both the dataset and the paper (below).
Bhochhibhoya, S., & Maharjan, R. (2022). Integrated seismic risk assessment in Nepal. Natural Hazards and Earth System Sciences, 22(10), 3211-3230. https://doi.org/10.5194/nhess-22-3211-2022
Roisha, M. & Bhochhibhoya, S. (2024). Population and Economic Data of Nepal 2011 - Municipal-Level Data from different sources, including the National Census (Version v1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14010807
If the files do not work, or for any other queries, contact sonicewrites@gmail.com.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Saleep Shrestha
Released under Apache 2.0
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Lince Dataset is a collection of language technology data that can be used for a range of NLP tasks. It covers six language settings: Spanish, Hindi, Nepali, Spanish-English, Hindi-English and Modern Standard Arabic-Egyptian Arabic (MSAEA). It provides data for language identification (LID), part-of-speech (POS) tagging, Named-Entity Recognition (NER), sentiment analysis (SA) and more. You can train models to perform tasks such as POS tagging or NER on each language setting, or build cross-lingual models across several of them.
The Lince Dataset supports multilingual natural language processing (NLP): it includes training data in all six language settings for LID, POS tagging, NER, SA and more.
Understand what is included in this dataset
This dataset includes language technology data from six different language settings: Spanish, Hindi, Nepali, Spanish-English, Hindi-English and Modern Standard Arabic-Egyptian Arabic (MSAEA). Each file is labelled according to its content, e.g. lid_msaea_test.csv, which contains test data for language identification (LID) with 5 columns containing words, part-of-speech tags and sentiment analysis labels. A brief summary of each file's contents is shown on the dataset's Kaggle page, or can be obtained by running a function such as head() or describe(), depending on your software.
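As a rough, dependency-free analogue of calling head() on a file, the sketch below reads the header and the first few rows of a CSV using only the Python standard library. The helper name `preview_csv` and the two-column sample are made up for illustration; the real column names should be read from the downloaded files.

```python
import csv
import io
from itertools import islice
from typing import List, TextIO, Tuple


def preview_csv(handle: TextIO, n_rows: int = 5) -> Tuple[List[str], List[List[str]]]:
    """Return the header and the first n_rows data rows of a CSV file."""
    reader = csv.reader(handle)
    header = next(reader, [])            # empty list if the file has no lines
    rows = list(islice(reader, n_rows))  # stop reading after n_rows rows
    return header, rows


# In-memory stand-in for a dataset file; the columns here are invented.
sample = io.StringIO("word,label\nhello,lang1\nnamaste,lang2\n")
header, rows = preview_csv(sample, n_rows=1)
print(header)
print(rows)
```

For a real file, pass an open handle instead, for example `open("lid_msaea_test.csv", newline="", encoding="utf-8")`.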
Decide what kind of analysis you want to do
Once you are familiar with the data provided, decide which kind of model or analysis you want to build before writing any code. For example, to build a cross-lingual model for POS tagging, it would be ideal to have training and validation sets from 3 different languages, so that the model can take advantage of knowledge shared between them during training; files such as pos_spaeng_train and pos_hineng_validation then come into play. When designing the model architecture, choose task-specific hyperparameters that complement each other and an appropriate feature representation, since both affect performance.
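Selecting files for a given task can be sketched as below, assuming every file follows the `task_languagepair_split` naming pattern seen in the examples above (lid_msaea_test.csv, pos_spaeng_train, pos_hineng_validation). The names `LinceFile`, `parse_lince_filename` and `select_files` are hypothetical helpers, and the pattern itself is an assumption that should be checked against the actual file list.

```python
from pathlib import Path
from typing import Iterable, List, NamedTuple


class LinceFile(NamedTuple):
    task: str           # e.g. "lid", "pos"
    language_pair: str  # e.g. "spaeng", "hineng", "msaea"
    split: str          # e.g. "train", "validation", "test"


def parse_lince_filename(name: str) -> LinceFile:
    """Split a name like 'pos_spaeng_train.csv' into its three parts."""
    parts = Path(name).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {name!r}")
    return LinceFile(*parts)


def select_files(names: Iterable[str], task: str, split: str) -> List[str]:
    """Keep only the files that belong to the given task and split."""
    selected = []
    for name in names:
        info = parse_lince_filename(name)
        if info.task == task and info.split == split:
            selected.append(name)
    return selected


names = ["pos_spaeng_train.csv", "pos_hineng_validation.csv", "lid_msaea_test.csv"]
print(select_files(names, task="pos", split="train"))
```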
Run appropriate algorithms on the data
With the data and the task understood, run the appropriate algorithms, whatever tools you use, and tune the models using metrics such as accuracy and F1 score. Once tuned, check that the system works reliably by evaluating it on the unseen test set. Hyperparameter tuning plays a significant role during optimization, and how much depends on the algorithm chosen.
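The two metrics named above can be computed without any external library. The sketch below implements accuracy and macro-averaged F1 for label sequences; the choice of macro averaging and the toy labels are assumptions for illustration, not something the dataset description prescribes.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores) if scores else 0.0


# Toy POS-style labels, invented for the example.
y_true = ["NOUN", "VERB", "NOUN", "ADJ"]
y_pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Macro F1 treats every label equally, which makes it more sensitive than accuracy to errors on rare labels.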
- Developing a multilingual sentiment analysis system that can analyze sentiment in any of the six languages.
- Training a model to identify and classify named entities across multiple languages, such as identifying certain words for proper nouns or locations regardless of language or coding scheme.
- Developing an AI-powered cross-lingual translator that is able to effectively translate text from one language to another with minimal errors and maximum accuracy
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: lid_msaea_test.csv...
License: custom license, https://heidata.uni-heidelberg.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.11588/DATA/A9DCZA
DANAM is the Digital Archive for Nepalese Art and Monuments and the heart of the Nepal Heritage Documentation Project (NHDP), located at the Heidelberg Centre of Transcultural Studies (HCTS) and the Academy of Sciences (AdW) and operated in cooperation with Saraf Foundation and the Department of Archaeology, Nepal. The database offers visual and textual documentation of heritage monuments, which are threatened by urbanisation and natural disasters. Data sets contain structured information on the monuments, i.e. details of their location, history, architectural structure, and religious and social activities. It presents photographs, maps, plans and drawings, transcriptions of inscriptions, and historical and anthropological reports. Descriptions are available in both English and Nepali. References to scholarly documentation and resources enable further research interests to be explored. DANAM is based on Arches (v.4), an open-source, geospatially-enabled software platform for cultural heritage inventory and management, developed jointly by the Getty Conservation Institute and World Monuments Fund. Funded by the British Arcadia Foundation, the entire content of DANAM is openly accessible to the general public. A large part of the data is also stored in heidICON and heiDATA.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Suprabal Pandey
Released under CC0: Public Domain
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This is a labeled handwritten Nepali dataset for text recognition. It was created for an Optical Character Recognition (OCR) tool. Because the dataset is small, large classification models will over-fit the training data. To work around this constraint, HuggingFace TransformerOCR was implemented. The model achieved good results but was slow during inference. The dataset is made public to encourage the development of an optimal model that can recognize the text.
Results
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4786384%2F46e658c372c33874e22a5c8cda2713aa%2FScreenshot%20from%202024-05-21%2019-49-38.png?generation=1716301271869838&alt=media
License: Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
This dataset contains images of Air Pollution for different cities in India and Nepal. The dataset is divided into two folders: Combined_Dataset and Country_wise_Dataset.
Total number of images: 12,240
Image size: 224*224
Air Quality Index (AQI) classes and their definitions used in the dataset:
There are a total of six classes of Air Pollution, which we represent in our dataset as follows:
1. Good (0-50): Air quality is considered satisfactory and air pollution poses little or no risk.
2. Moderate (51-100): Air quality is acceptable; however, for some pollutants, there may be a moderate health concern for a very small number of people who are unusually sensitive to air pollution.
3. Unhealthy for Sensitive Groups (101-150): Members of sensitive groups may experience health effects, but the general public is unlikely to be affected.
4. Unhealthy (151-200): Some members of the general public may experience health effects; members of sensitive groups may experience more serious health effects.
5. Very Unhealthy (201-300): Health alert: The risk of health effects is increased for everyone.
6. Hazardous/Severe (301-500): Health warning of emergency conditions: Everyone is more likely to be affected.
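The six ranges above translate directly into a lookup. The sketch below maps an integer AQI value to its class name; the function name `aqi_class` is introduced here for illustration, and the label strings follow the list above, which may differ from the exact strings stored in the dataset's AQI_Class column.

```python
from typing import List, Tuple

# (lower bound, upper bound, class name), taken from the list above.
AQI_CLASSES: List[Tuple[int, int, str]] = [
    (0, 50, "Good"),
    (51, 100, "Moderate"),
    (101, 150, "Unhealthy for Sensitive Groups"),
    (151, 200, "Unhealthy"),
    (201, 300, "Very Unhealthy"),
    (301, 500, "Hazardous/Severe"),
]


def aqi_class(aqi: int) -> str:
    """Return the class name for an integer AQI value between 0 and 500."""
    for low, high, name in AQI_CLASSES:
        if low <= aqi <= high:
            return name
    raise ValueError(f"AQI value out of range: {aqi}")


for value in (42, 100, 180, 450):
    print(value, aqi_class(value))
```

Values outside 0 to 500 raise an error instead of being silently assigned to a class.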
Reference:
https://airtw.epa.gov.tw/ENG/Information/Standard/AirQualityIndicator.aspx
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F11024368%2F3865850ad0720dc148c71b79946f4196%2FAQI%20reference.JPG?generation=1681898437013999&alt=media
Cities of India
1. ITO, Delhi
2. Dimapur, Nagaland
3. Spice Garden, Bengaluru
4. Knowledge Park III, Greater Noida
5. New Ind Town, Faridabad
6. Borivali East, Mumbai
7. Oragadam, Tamil Nadu
City of Nepal
1. Biratnagar
Combined dataset:
The combined dataset folder contains two subfolders.
1. All_img: This subfolder contains all the collected images from all AQI classes.
2. IND_and_NEP: This subfolder contains six different subfolders representing six different classes of AQI.
The csv file in this folder contains all the data and its parameters. Its columns are
Location, Filename, Year, Month, Day, Hour, AQI, PM2.5, PM10, O3, CO, SO2, NO2, and AQI_Class
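A minimal sketch of reading such a csv file with the standard library is shown below. It relies only on the column names listed above; the helper name `load_rows`, the sample rows, and the file names in them are made up for illustration, and the real value formats (for example how Hour is written) should be checked in the actual file.

```python
import csv
import io
from collections import Counter
from typing import Dict, List, TextIO

# Column names as listed in the dataset description.
COLUMNS = ["Location", "Filename", "Year", "Month", "Day", "Hour", "AQI",
           "PM2.5", "PM10", "O3", "CO", "SO2", "NO2", "AQI_Class"]


def load_rows(handle: TextIO) -> List[Dict[str, object]]:
    """Read the csv, check that the expected columns exist, and parse AQI as a number."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows: List[Dict[str, object]] = []
    for row in reader:
        parsed: Dict[str, object] = dict(row)
        parsed["AQI"] = float(row["AQI"])
        rows.append(parsed)
    return rows


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "Biratnagar,img_001.jpg,2023,2,19,10,42,10.0,20.0,5.0,0.3,2.0,8.0,Good\n"
    "Biratnagar,img_002.jpg,2023,2,19,11,120,45.0,80.0,9.0,0.6,4.0,15.0,Unhealthy for Sensitive Groups\n"
)
rows = load_rows(sample)
counts = Counter(row["AQI_Class"] for row in rows)
print(counts)
```

Counting rows per AQI_Class is a quick way to see how balanced the six classes are before training a classifier.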
Country_wise_Dataset:
This folder contains two subfolders representing the countries from which the dataset was collected.
1. India: This subfolder contains one subfolder for each city from which data were collected. Each city subfolder contains one folder per AQI class, as well as a csv file with the details of each image in the same columns as above:
Location, Filename, Year, Month, Day, Hour, AQI, PM2.5, PM10, O3, CO, SO2, NO2, and AQI_Class
2. Nepal: This subfolder contains a subfolder named after the city in Nepal from which data were collected. The city subfolder contains one folder per AQI class, as well as a csv file with the details of each image in the same columns as above:
Location, Filename, Year, Month, Day, Hour, AQI, PM2.5, PM10, O3, CO, SO2, NO2, and AQI_Class
Dataset Collection Process:
1. Visit the site: The first step in collecting the air pollution data was to personally visit the site. This involved physically going to the location and capturing images and videos of the area.
2. Note current parameters: While visiting the site, various parameters related to air pollution were noted. These included measurements of PM2.5, PM10, NO2, SO2, CO, etc. These parameters were noted by referring to publicly available data sources such as the Central Pollution Control Board (CPCB) website. For India we used https://app.cpcbccr.com/AQI_India/ and for Nepal we used: https://www.tomorrow.io/weather/NP/4/Biratnagar/079711/hourly/
3. Preprocess images: Once the images and videos were captured, they were preprocessed to remove any images that were blurry, overexposed, or had other quality issues. Only the images that met the desired quality criteria were selected for further analysis.
4. Extract frames from videos: In addition to the images, videos were also capture...
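The description does not say how blurry or overexposed images were identified in the preprocessing step (step 3). One common heuristic, shown here purely as an assumption, is to reject images whose Laplacian response has low variance (blur) or whose mean brightness is very high (overexposure). The function names and both thresholds are hypothetical, and the image is represented as a plain nested list of grayscale values to keep the sketch dependency-free.

```python
from statistics import mean, pvariance
from typing import List

Gray = List[List[float]]  # grayscale image as rows of pixel values in 0..255


def laplacian_variance(img: Gray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels (low = blurry)."""
    h, w = len(img), len(img[0])
    responses = [
        img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses) if responses else 0.0


def mean_brightness(img: Gray) -> float:
    """Average pixel value over the whole image."""
    return mean(v for row in img for v in row)


def passes_quality(img: Gray, blur_threshold: float = 100.0,
                   bright_threshold: float = 240.0) -> bool:
    """Keep an image only if it is sharp enough and not overexposed (thresholds are guesses)."""
    return (laplacian_variance(img) >= blur_threshold
            and mean_brightness(img) <= bright_threshold)


flat = [[128.0] * 5 for _ in range(5)]  # no detail at all
checker = [[0.0 if (x + y) % 2 == 0 else 255.0 for x in range(5)] for y in range(5)]
print(passes_quality(flat), passes_quality(checker))
```

In practice the thresholds would need tuning on a sample of manually reviewed images.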
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Dimanjan Dahal
Released under CC0: Public Domain
Note: GBIME and JBNL merged in 2019.
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Shivahari Subedi
Released under CC0: Public Domain
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Allah Hitler
Released under CC0: Public Domain
License: CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sundeep Dawadi
Released under CC0: Public Domain