In March 2024, Meta-owned apps Facebook and Instagram were the most downloaded mobile apps worldwide, with 59 million and 58 million downloads, respectively. Social video app TikTok followed with 46 million downloads. Meta-owned microblogging platform Threads generated 24 million downloads during the last month of the year.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises user feedback data collected from 15 globally acclaimed mobile applications, spanning diverse categories. The included applications are among the most downloaded worldwide, providing a rich and varied source for analysis. The dataset is particularly suitable for Natural Language Processing (NLP) applications, such as text classification and topic modeling. List of Included Applications:
TikTok, Instagram, Facebook, WhatsApp, Telegram, Zoom, Snapchat, Facebook Messenger, CapCut, Spotify, YouTube, HBO Max, Cash App, Subway Surfers, Roblox
Data Columns and Descriptions:
- review_id: Unique identifier for each user feedback/application review.
- content: User-generated feedback/review in text format.
- score: Rating or star given by the user.
- TU_count: Number of likes/thumbs up (TU) received for the review.
- app_id: Unique identifier for each application.
- app_name: Name of the application.
- RC_ver: Version of the app when the review was created (RC).
Terms of Use: This dataset is open access for scientific research and non-commercial purposes. Users are required to acknowledge the authors' work and, in the case of scientific publication, cite the most appropriate reference: M. H. Asnawi, A. A. Pravitasari, T. Herawan, and T. Hendrawati, "The Combination of Contextualized Topic Model and MPNet for User Feedback Topic Modeling," in IEEE Access, vol. 11, pp. 130272-130286, 2023, doi: 10.1109/ACCESS.2023.3332644.
Researchers and analysts are encouraged to explore this dataset for insights into user sentiments, preferences, and trends across these top mobile applications. If you have any questions or need further information, feel free to contact the dataset authors.
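As a starting point for exploring the columns listed above, the following is a minimal sketch that averages `score` per `app_name`. It assumes the reviews are available as a CSV file with a header row using exactly those column names; the file layout and the sample rows below are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def mean_score_per_app(csv_file):
    """Return {app_name: mean score} from a reviews CSV with the documented columns."""
    totals = defaultdict(lambda: [0.0, 0])  # app_name -> [sum of scores, review count]
    for row in csv.DictReader(csv_file):
        entry = totals[row["app_name"]]
        entry[0] += float(row["score"])
        entry[1] += 1
    return {app: total / count for app, (total, count) in totals.items()}


# Hypothetical sample rows, for illustration only (not real reviews).
SAMPLE = """review_id,content,score,TU_count,app_id,app_name,RC_ver
r1,Great app,5,3,com.example.a,App A,1.0
r2,Crashes often,1,10,com.example.a,App A,1.1
r3,Works fine,4,0,com.example.b,App B,2.0
"""

print(mean_score_per_app(io.StringIO(SAMPLE)))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.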
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We surveyed 10,208 people from more than 15 countries on their mobile app usage behavior. The countries include USA, China, Japan, Germany, France, Brazil, UK, Italy, Russia, India, Canada, Spain, Australia, Mexico, and South Korea. We asked respondents about:
(1) their mobile app usage behavior, including the app stores they use, what triggers them to look for apps, why they download apps, why they abandon apps, and the types of apps they download;
(2) their demographics, including gender, age, marital status, nationality, country of residence, first language, ethnicity, education level, occupation, and household income;
(3) their personality, using the Big-Five personality traits.
This dataset contains the results of the survey.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
A dataset consisting of 751,500 English app reviews of 12 online shopping apps. The dataset was scraped from the internet using a Python script. This ShoppingAppReviews dataset contains app reviews of the 12 most popular online shopping Android apps: Alibaba, Aliexpress, Amazon, Daraz, eBay, Flipkart, Lazada, Meesho, Myntra, Shein, Snapdeal and Walmart. Each review entry contains metadata such as review score, thumbsupcount, review posting time, and reply content. The dataset is organized as a zip file containing 12 JSON files, one for each of the 12 online shopping apps. This dataset can be used to obtain valuable information about customers' feedback regarding their user experience of these financially important apps.
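The zip-of-JSON layout described above can be read without unpacking the archive to disk. The sketch below counts reviews per JSON file. It assumes each JSON file holds a single array of review objects, which the description does not confirm; the member names and the `score` field in the demo archive are hypothetical.

```python
import io
import json
import zipfile


def count_reviews_per_app(zip_source):
    """Map each JSON member of the archive to the number of reviews it holds."""
    counts = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip anything that is not a per-app JSON file
            with archive.open(name) as member:
                reviews = json.load(member)  # assumed: a JSON array of review objects
            counts[name] = len(reviews)
    return counts


def build_demo_archive():
    """Build a tiny in-memory archive with made-up reviews, for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AppA.json", json.dumps([{"score": 5}, {"score": 2}]))
        archive.writestr("AppB.json", json.dumps([{"score": 4}]))
    buffer.seek(0)
    return buffer


print(count_reviews_per_app(build_demo_archive()))
```

If the real files turn out to store one JSON object per line instead, the `json.load` call would need to be replaced by a line-by-line `json.loads` loop.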
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Automated Insights Dataset (AID) provides metadata for the 200 most downloaded free apps in each of the 32 categories on the Google Play Store, totaling 6,400 apps. It goes beyond the information presented by app stores by also including metadata from AppBrain. The User Interface Depth Dataset (UID) is a high-quality sample of the AID; it identifies 7,540 components of 50 component types and captures 1,948 screenshots of the interfaces of 400 apps. The component set is based on components of Google Material Design and Android Studio.
In September 2024, Ludo King was the most-downloaded gaming app in the Google Play Store worldwide. The board game generated more than 15.47 million downloads from Android users. My Supermarket Simulator 3D was the second-most popular gaming app title with approximately 14.19 million downloads from global users.
As of June 2024, the most popular database management system (DBMS) worldwide was Oracle, with a ranking score of 1244.08; MySQL and Microsoft SQL Server rounded out the top three. Although the database management industry contains some of the largest companies in the tech industry, such as Microsoft, Oracle and IBM, a number of free and open-source DBMSs such as PostgreSQL and MariaDB remain competitive.
Database Management Systems
As the name implies, DBMSs provide a platform through which developers can organize, update, and control large databases. Given the business world's growing focus on big data and data analytics, knowledge of SQL programming languages has become an important asset for software developers around the world, and database management skills are seen as highly desirable. In addition to providing developers with the tools needed to operate databases, DBMSs are also integral to the way that consumers access information through applications, which further illustrates the importance of the software.
As COVID-19 continues to spread across the world, a growing number of malicious campaigns are exploiting the pandemic. COVID-19 is reportedly being used in a variety of online malicious activities, including email scams, ransomware and malicious domains. As the number of cases continues to surge, malicious campaigns that use coronavirus as a lure are increasing. Malicious developers take advantage of this opportunity to lure mobile users into downloading and installing malicious apps.
However, apart from a few media reports, coronavirus-themed mobile malware has not been well studied. Our community lacks a comprehensive understanding of the landscape of coronavirus-themed mobile malware, and no accessible dataset is available to researchers to boost COVID-19 related cybersecurity studies.
We are building a daily-growing dataset of COVID-19 related mobile apps. As of mid-November, we have curated a dataset of 4,322 COVID-19 themed apps, 611 of which are considered malicious. The number grows daily and the dataset is updated weekly. For more details, please visit https://covid19apps.github.io
This dataset includes the following files:
(1) covid19apps.xlsx
In this file, we list information on all the COVID-19 themed apps, including APK file hashes, release date, package name, AV-Rank, etc.
(2) covid19apps.zip
We store the APK samples of the COVID-19 themed apps in zip files. To reduce the size of any single file, the samples are divided across multiple zip files. Each APK file is named after its SHA256 hash.
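Because each APK is named after the file's SHA256, a downloaded sample can be checked for corruption or renaming by recomputing the hash. The following is a minimal sketch of that check. It assumes the name is the hex digest, optionally followed by an extension such as `.apk`; the exact naming pattern is not specified in the description, and the demo files are throwaway bytes rather than real samples.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large samples need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def name_matches_hash(path):
    """True if the file name (without extension) equals its SHA256 hex digest."""
    path = Path(path)
    return path.stem.lower() == sha256_of_file(path)


def demo():
    """Create two throwaway files: one named after its hash, one not."""
    with tempfile.TemporaryDirectory() as tmp:
        payload = b"not a real APK, just demo bytes"
        good = Path(tmp) / (hashlib.sha256(payload).hexdigest() + ".apk")
        bad = Path(tmp) / "renamed.apk"
        good.write_bytes(payload)
        bad.write_bytes(payload)
        return name_matches_hash(good), name_matches_hash(bad)


print(demo())  # (True, False)
```

Since some of the samples are considered malicious, a check like this should only read the bytes, as it does here, and never execute or install the file.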
If your papers or articles use our dataset, please cite our paper using the following BibTeX reference: https://arxiv.org/abs/2005.14619
(Accepted to Empirical Software Engineering)
@misc{wang2021virus,
title={Beyond the Virus: A First Look at Coronavirus-themed Mobile Malware},
author={Liu Wang and Ren He and Haoyu Wang and Pengcheng Xia and Yuanchun Li and Lei Wu and Yajin Zhou and Xiapu Luo and Yulei Sui and Yao Guo and Guoai Xu},
year={2021},
eprint={2005.14619},
archivePrefix={arXiv},
primaryClass={cs.CR}
}
This dataset provides information on the 20 most popular digital health certificate apps in the world. It shows how many times each app has been downloaded, describes their privacy policies, and highlights any potentially invasive permissions.
This dataset describes the 10 most popular contact tracing apps. It provides information on where they are used, how many downloads each app has accumulated, and shows whether or not each has an adequate privacy policy.
WorldPop produces different types of gridded population count datasets, depending on the methods used and end application.
Please make sure you have read our Mapping Populations overview page before choosing and downloading a dataset.
Bespoke methods used to produce datasets for specific individual countries are available through the WorldPop Open Population Repository (WOPR) link below.
These are 100m resolution gridded population estimates using customized methods ("bottom-up" and/or "top-down") developed for the latest data available from each country.
They can also be visualised and explored through the woprVision App.
The remaining datasets in the links below are produced using the "top-down" method,
with either the unconstrained or constrained top-down disaggregation method used.
Please make sure you read the Top-down estimation modelling overview page to decide on which datasets best meet your needs.
Datasets are available to download in GeoTIFF and ASCII XYZ format at a resolution of 3 and 30 arc-seconds (approximately 100m and 1km at the equator, respectively):
- Unconstrained individual countries 2000-2020 (1km resolution): Consistent 1km resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020.
- Unconstrained individual countries 2000-2020 (100m resolution): Consistent 100m resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020.
- Unconstrained individual countries 2000-2020 UN adjusted (100m resolution): Consistent 100m resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020 and adjusted to match United Nations national population estimates (UN 2019).
- Unconstrained individual countries 2000-2020 UN adjusted (1km resolution): Consistent 1km resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020 and adjusted to match United Nations national population estimates (UN 2019).
- Unconstrained global mosaics 2000-2020 (1km resolution): Mosaiced 1km resolution versions of the "Unconstrained individual countries 2000-2020" datasets.
- Constrained individual countries 2020 (100m resolution): Consistent 100m resolution population count datasets created using constrained top-down methods for all countries of the World for 2020.
- Constrained individual countries 2020 UN adjusted (100m resolution): Consistent 100m resolution population count datasets created using constrained top-down methods for all countries of the World for 2020 and adjusted to match United Nations national population estimates (UN 2019).
Older datasets produced for specific individual countries and continents, using a set of tailored geospatial inputs and differing "top-down" methods and time periods are still available for download here: Individual countries and Whole Continent.
Data for earlier dates is available directly from WorldPop.
WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00645
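The "approximately 100m and 1km at the equator" figures quoted for the 3 and 30 arc-second grids follow from simple arithmetic, sketched below. The calculation treats the Earth as a sphere with an equatorial circumference of about 40,075 km, which is a simplifying assumption and not WorldPop's own method; the function name is made up for illustration.

```python
import math

EQUATOR_CIRCUMFERENCE_M = 40_075_017  # approximate equatorial circumference of Earth


def arcsec_to_metres(arcsec, latitude_deg=0.0):
    """East-west ground distance spanned by `arcsec` arc-seconds of longitude."""
    metres_per_degree = EQUATOR_CIRCUMFERENCE_M / 360.0
    shrink = math.cos(math.radians(latitude_deg))  # cells narrow towards the poles
    return arcsec / 3600.0 * metres_per_degree * shrink


for arcsec in (3, 30):
    print(f"{arcsec} arc-seconds at the equator: {arcsec_to_metres(arcsec):.1f} m")

# The same grid cell is only about half as wide at 60 degrees latitude.
print(f"3 arc-seconds at 60 degrees: {arcsec_to_metres(3, 60.0):.1f} m")
```

At the equator this gives roughly 93 m and 928 m, which is why the grids are described as 100m and 1km only approximately, and why cell area varies with latitude.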
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/licence-to-use-copernicus-products/licence-to-use-copernicus-products_b4b9451f54cffa16ecef5c912c9cebd6979925a956e3fa677976e0cf198c2c18.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.
Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, all of which benefit the quality of the reanalysis product.
ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.
ERA5 is updated daily with a latency of about 5 days. If serious flaws are detected in this early release (called ERA5T), this data could differ from the final release 2 to 3 months later. If this occurs, users are notified.
The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 hourly data on single levels from 1940 to present".
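To get a feel for the data volume implied by the regular lat-lon grids mentioned above, the sketch below computes the number of grid points for a given resolution. It assumes a global grid whose latitudes include both poles and whose longitudes do not repeat the 0/360 meridian; that layout is an assumption here, so check the coordinates of an actual download before relying on these shapes.

```python
def regular_grid_shape(resolution_deg):
    """Return (n_lat, n_lon) for a global regular lat-lon grid.

    Assumes latitudes run from +90 to -90 inclusive (both poles present)
    and longitudes cover 0 up to, but not including, 360 degrees.
    """
    n_lat = round(180 / resolution_deg) + 1
    n_lon = round(360 / resolution_deg)
    return n_lat, n_lon


for label, resolution in (("reanalysis", 0.25), ("uncertainty estimate", 0.5)):
    n_lat, n_lon = regular_grid_shape(resolution)
    points = n_lat * n_lon
    print(f"{label}: {n_lat} x {n_lon} = {points} points per field and time step")

# One variable at hourly resolution for a single day on the 0.25 degree grid:
n_lat, n_lon = regular_grid_shape(0.25)
print("values per variable per day:", 24 * n_lat * n_lon)
```

Under these assumptions a single hourly field on the 0.25 degree grid has about a million values, so requesting only the variables, area, and period actually needed keeps downloads manageable.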
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Protein networks have become a popular tool for analyzing and visualizing the often long lists of proteins or genes obtained from proteomics and other high-throughput technologies. One of the most popular sources of such networks is the STRING database, which provides protein networks for more than 2000 organisms, including both physical interactions from experimental data and functional associations from curated pathways, automatic text mining, and prediction methods. However, its web interface is mainly intended for inspection of small networks and their underlying evidence. The Cytoscape software, on the other hand, is much better suited for working with large networks and offers greater flexibility in terms of network analysis, import, and visualization of additional data. To include both resources in the same workflow, we created stringApp, a Cytoscape app that makes it easy to import STRING networks into Cytoscape, retains the appearance and many of the features of STRING, and integrates data from associated databases. Here, we introduce many of the stringApp features and show how they can be used to carry out complex network analysis and visualization tasks on a typical proteomics data set, all through the Cytoscape user interface. stringApp is freely available from the Cytoscape app store: http://apps.cytoscape.org/apps/stringapp.
https://brightdata.com/license
Our travel datasets provide extensive, structured data covering various aspects of the global travel and hospitality industry. These datasets are ideal for businesses, analysts, and developers looking to gain insights into hotel pricing, short-term rentals, restaurant listings, and travel trends. Whether you're optimizing pricing strategies, analyzing market trends, or enhancing travel-related applications, our datasets offer the depth and accuracy you need.
Key Travel Datasets Available:
- Hotel & Rental Listings: Access detailed data on hotel properties, short-term rentals, and vacation stays from platforms like Airbnb, Booking.com, and other OTAs. This includes property details, pricing, availability, guest reviews, and amenities.
- Real-Time & Historical Pricing Data: Track hotel room pricing, rental occupancy rates, and pricing trends to optimize revenue management and competitive analysis.
- Restaurant Listings & Reviews: Explore restaurant data from Tripadvisor, OpenTable, Zomato, Deliveroo, and Talabat, including restaurant details, customer ratings, menus, and delivery availability.
- Market & Trend Analysis: Use structured datasets to analyze travel demand, seasonal trends, and consumer preferences across different regions.
- Geo-Targeted Data: Get location-specific insights with city, state, and country-level segmentation, allowing for precise market research and localized business strategies.
Use Cases for Travel Datasets:
- Dynamic Pricing & Revenue Optimization: Adjust pricing strategies based on real-time market trends and competitor analysis.
- Market Research & Competitive Intelligence: Identify emerging travel trends, monitor competitor performance, and assess market demand.
- Travel & Hospitality App Development: Enhance travel platforms with accurate, up-to-date data on hotels, restaurants, and rental properties.
- Investment & Financial Analysis: Evaluate travel industry performance for investment decisions and economic forecasting.
Our travel datasets are available in multiple formats (JSON, CSV, Excel) and can be delivered via API, cloud storage (AWS, Google Cloud, Azure), or direct download.
Stay ahead in the travel industry with high-quality, structured data that powers smarter decisions.
The Atmosphere Protection Plan (APP) sets out the objectives to reduce concentrations of pollutants in the atmosphere below limit values within agglomerations of more than 250,000 inhabitants or areas where limit values are exceeded or are likely to be exceeded. The structure of plans for the protection of the atmosphere is governed by the Environmental Code (Articles R222-13 to R222-36). The plans for the protection of the atmosphere shall gather the information necessary for the inventory and assessment of the air quality of the area concerned.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of prior works studying mobile app reviews.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Microbiology 1K QA pairs in Burmese Language
Before this Burmese Clinical Microbiology 1K dataset, open-source resources for training Burmese large language models in medical fields were rare. A high-quality dataset covering medical knowledge was therefore needed for the development of LLMs in the Burmese language.
I found an old notebook in my box. The book was from 2019. It contained written notes on microbiology from when I was a third-year medical student. Because of the need for Burmese language resources in medical fields, I added more facts and notes and curated a dataset on microbiology in the Burmese language.
The dataset for microbiology in the Burmese language contains 1262 rows of instruction and output pairs in CSV format. The dataset mainly focuses on clinical microbiology foundational knowledge, abstracting basic facts on culture medium, microbes - bacteria, viruses, fungi, parasites, and diseases caused by these microbes.
ငှက်ဖျားရောဂါဆိုတာ ဘာလဲ?,ငှက်ဖျားရောဂါသည် Plasmodium ကပ်ပါးကောင်ကြောင့် ဖြစ်ပွားသော အသက်အန္တရာယ်ရှိနိုင်သည့် သွေးရောဂါတစ်မျိုးဖြစ်သည်။ ၎င်းသည် ငှက်ဖျားခြင်ကိုက်ခြင်းမှတဆင့် ကူးစက်ပျံ့နှံ့သည်။
Influenza virus အကြောင်း အကျဉ်းချုပ် ဖော်ပြပါ။,Influenza virus သည် တုပ်ကွေးရောဂါ ဖြစ်စေသော RNA ဗိုင်းရပ်စ် ဖြစ်သည်။ Orthomyxoviridae မိသားစုဝင် ဖြစ်ပြီး type A၊ B၊ C နှင့် D ဟူ၍ အမျိုးအစား လေးမျိုး ရှိသည်။
Clostridium tetani ဆိုတာ ဘာလဲ,Clostridium tetani သည် မေးခိုင်ရောဂါ ဖြစ်စေသော gram-positive၊ anaerobic bacteria တစ်မျိုး ဖြစ်သည်။ မြေဆီလွှာတွင် တွေ့ရလေ့ရှိသည်။
Onychomycosis ဆိုတာ ဘာလဲ?,Onychomycosis သည် လက်သည်း သို့မဟုတ် ခြေသည်းများတွင် ဖြစ်ပွားသော မှိုကူးစက်မှုဖြစ်သည်။ ၎င်းသည် လက်သည်း သို့မဟုတ် ခြေသည်းများကို ထူထဲစေပြီး အရောင်ပြောင်းလဲစေသည်။
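Rows like the samples above can be read into instruction/output pairs with a standard CSV parser. The sketch below is one way to do that. It assumes two columns per row in the order instruction, output, and an optional header row; the function name, the header handling, and the English placeholder rows are illustrative assumptions, not part of the dataset.

```python
import csv
import io


def load_pairs(csv_file, has_header=True):
    """Read two-column CSV rows into a list of instruction/output dicts."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the assumed header row
    return [
        {"instruction": row[0], "output": row[1]}
        for row in reader
        if len(row) >= 2  # ignore blank or malformed lines
    ]


# Placeholder rows in English, for illustration only (not dataset content).
SAMPLE = (
    "instruction,output\n"
    'What is X?,"X is a placeholder, nothing more."\n'
    "Describe Y.,Y is another placeholder.\n"
)

pairs = load_pairs(io.StringIO(SAMPLE))
print(len(pairs), pairs[0]["output"])
```

When reading the real file, open it with `encoding="utf-8"` and `newline=""` so that the Burmese text and any quoted line breaks are handled correctly.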
Github - https://github.com/MinSiThu/Burmese-Microbiology-1K/blob/main/data/Microbiology.csv
Zenodo - https://zenodo.org/records/12803638
Hugging Face - https://huggingface.co/datasets/jojo-ai-mst/Burmese-Microbiology-1K
Kaggle - https://www.kaggle.com/datasets/minsithu/burmese-microbiology-1k
Burmese Microbiology 1K Dataset can be used in building various medical-related NLP applications.
Special thanks to magickospace.org for supporting the curation process of Burmese Microbiology 1K Dataset.
https://openstax.org/details/books/microbiology - For medical facts
https://moh.nugmyanmar.org/my/ - For Burmese words for disease names
https://myordbok.com/dictionary/english - English-Myanmar Translation Dictionary
Si Thu, M. (2024). Burmese MicroBiology 1K Dataset (1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12803638
Si Thu, Min, Burmese-Microbiology-1K (July 24, 2024). Available at SSRN: https://ssrn.com/abstract=4904320