https://cdla.io/permissive-1-0/
Top 500 app data from the Google Play Store, based on app rankings as of January 2022, for all available categories. Link to scraping code: https://github.com/Shakthi-Dhar/AppPin Link to backup data files: github data files
The dataset contains the top 500 Android apps available on the Google Play Store for the following categories: All Categories, Art & Design, Auto & Vehicles, Beauty, Books & Reference, Business, Comics, Communication, Education, Entertainment, Events, Finance, Food & Drink, Health & Fitness, House & Home, Libraries & Demo, Lifestyle, Maps & Navigation, Medical, Music & Audio, News & Magazines, Parenting, Personalization, Photography, Productivity, Shopping, Social, Sports, Tools, Travel & Local, and Video Players & Editors.
The app rankings reflect Google Play Store rankings for January 2022.
In the Review and Downloads fields, the letters T, L, and Cr represent Thousands, Lakhs, and Crores, following the Google Play Store naming convention; they are analogous to M and B, which represent millions and billions. 1L (1 Lakh) = 100T (100 Thousand); 10L (10 Lakhs) = 1M (1 Million); 1Cr (1 Crore) = 10M (10 Million).
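For convenience, here is a minimal Python sketch for converting these suffixed counts into plain integers; the exact string formats in the CSV are assumptions:

```python
import re

# Multipliers for the Play Store count suffixes described above
# (T = Thousand, L = Lakh, Cr = Crore) plus their Western
# counterparts M and B.
SUFFIX_MULTIPLIERS = {
    "T": 1_000,          # Thousand
    "L": 100_000,        # Lakh  = 100 Thousand
    "Cr": 10_000_000,    # Crore = 10 Million
    "M": 1_000_000,      # Million
    "B": 1_000_000_000,  # Billion
}

def parse_count(value: str) -> int:
    """Convert a string like '3.2L' or '1Cr' into an integer count."""
    match = re.fullmatch(r"([\d.]+)\s*(Cr|T|L|M|B)?", value.strip())
    if not match:
        raise ValueError(f"Unrecognized count format: {value!r}")
    number, suffix = match.groups()
    return int(float(number) * SUFFIX_MULTIPLIERS.get(suffix, 1))

assert parse_count("1L") == parse_count("100T") == 100_000
assert parse_count("1Cr") == parse_count("10M") == 10_000_000
```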
This data is not provided directly by Google, so I used Appium, an automation tool, with Python to scrape the data from the Google Play Store app.
Inspired by the Fortune 500: it provides data on the world's top companies, so why not have a data source for the world's top apps?
In 2024, the United States was the leading app market, with the Apple App Store and Google Play Store generating approximately 31 billion U.S. dollars in in-app revenues. China was the second-largest app market, with in-app revenues of approximately 17.34 billion U.S. dollars. Japan ranked third, generating around 11.25 billion U.S. dollars in app revenues over the examined period.
To date (April 2020), Android is still the most popular mobile operating system in the world. Given the billions of Android users worldwide, mining this data has the potential to reveal user behaviors and trends at a global scale.
There are two CSV files:
- app.csv, with 53,732 rows and 18 columns.
- comment.csv, with 1,468,173 rows and 4 columns.
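As a quick orientation, a minimal pandas sketch for loading the two files and confirming the stated shapes; the column layout is not described here, so only dimensions are checked:

```python
import pandas as pd

# File names as listed above; only the stated dimensions are checked.
apps = pd.read_csv("app.csv")
comments = pd.read_csv("comment.csv")

print(apps.shape)      # expected: (53732, 18)
print(comments.shape)  # expected: (1468173, 4)
```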
The scraping was done in April 2020.
This dataset is obtained from scraping Google Play Store. Without Google and Android, this dataset wouldn’t have existed.
The dataset was first published in this blog.
Business trends on mobile can be explored by examining this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a preview of a bigger dataset. My Telegram bot will answer your queries for more data and also allow you to contact me.
When dating apps like Tinder were going viral, people wanted to have the best profile in order to get more matches and more potential encounters. Unlike earlier dating platforms, these new ones emphasized the mutuality of attraction before allowing any two people to get in touch and chat. This made it all the more important to create the best profile in order to make the best first impression.
In parallel, we humans have always been in awe of charismatic and inspiring people. More charismatic people tend to be followed and listened to by more people. Through metrics such as the number of friends or followers, social networks offer ways of "measuring" a person's potential charisma.
With all of this in mind, one can ask:
- What makes a great user profile?
- How do you make the best first impression in order to get more matches (and ultimately find love, or new friendships)?
- What makes a person charismatic?
- How do charismatic people present themselves?
In order to explore these social questions, I decided to create a dataset of user profile information using the social network Lovoo when it came out. Using different methodologies, I was able to gather user profile data as well as some usually unavailable metrics (such as the number of profile visits).
The dataset contains user profile information for users of the website Lovoo.
The dataset was gathered during spring 2015 (April and May). At that time, Lovoo was expanding in European countries (among others), while Tinder was trending both in America and in Europe, and the iOS version of the Lovoo app was at version 3.
The dataset references pictures (field pictureId) of user profiles. These pictures are also available for a fraction of users, but they have not been uploaded and must be requested separately.
The idea behind gathering the profile pictures was to determine whether correlations could be identified between a profile picture and the reputation or success of a given profile. Since first impressions matter, a sound hypothesis is that the profile picture might have a great influence on the number of profile visits, matches, and so on. Keep in mind that only a fraction of a user's profile is seen when browsing through a list of users.
[Image: App preview of browsing profiles (https://s1.dmcdn.net/v/BnWkG1M7WuJDq2PKP/x480)]
In order to gather the data, I developed a set of tools that would save the data while browsing through profiles and doing searches. Because of this approach (and the constraints that forced me to develop it), I could only gather user profiles that were recommended by Lovoo's algorithm for the 2 profiles I created for this purpose (male, open to friends, chats & dates). That is why there are only female users in the dataset. Further work could fetch similar data for both genders or other age ranges.
Regarding the number of user profiles: it turned out that the recommendation algorithm always seemed to output the same set of user profiles. This suggests Lovoo's algorithm relied heavily on settings like location (recommending more people nearby than people in different places or countries) and perhaps cookies, which limited the number of distinct user profiles that could be presented and included in the dataset.
As mentioned in the introduction, there are many questions a dataset such as this one can help answer. Some relate to:
- popularity and charisma;
- census and demographic studies;
- statistics about people's reasons for joining dating apps (making friends, finding someone to date, finding true love, ...);
- detecting influencers or potential influencers and studying them.
Previously mentioned:
- What makes a great user profile?
- How do you make the best first impression in order to get more matches (and ultimately find love, or new friendships)?
- What makes a person charismatic?
- How do charismatic people present themselves?
Other works:
- A starter analysis, made using a SQL query, is available on my data.world account; another file was created this way on the dataset page.
- The Kaggle version of the dataset might contain a starter kernel.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
As children increasingly engage with the digital world, mobile apps have become central to their learning, entertainment, and social lives. However, this growing accessibility raises concerns, especially around the collection and handling of children’s personal data. In response, regulations like the Children’s Online Privacy Protection Act (COPPA) in the U.S.—along with similar laws worldwide—seek to safeguard children’s online privacy by requiring developers to secure parental consent before gathering data from users under 13.
This data can be used to build a machine learning model capable of predicting whether a mobile app might pose a risk of violating COPPA. By detecting apps that may be non-compliant, such a model can support app stores, developers, and parents in fostering a safer digital space for children. The model evaluates various app features, such as genre, inferred audience size, privacy policy elements, and developer details, to estimate the likelihood of COPPA violations.
The dataset contains information scraped from a major app marketplace, along with derived features related to privacy and compliance. It includes a mix of categorical, numerical, and boolean features. Missing values are represented by empty strings.
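As a starting point, a minimal scikit-learn sketch of such a classifier. The file name, column names, and target label are assumptions for illustration; only the stated facts (mixed categorical/numerical/boolean features, missing values encoded as empty strings) come from the description above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names; the description only states that the
# features mix categorical, numerical, and boolean types.
CATEGORICAL = ["genre", "developer_country"]
NUMERICAL = ["inferred_audience_size"]
TARGET = "coppa_risk"  # hypothetical label column

df = pd.read_csv("coppa_apps.csv")  # hypothetical file name
df = df.replace("", pd.NA)          # missing values are empty strings
df[CATEGORICAL] = df[CATEGORICAL].fillna("missing")
for col in NUMERICAL:
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0)

model = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL)],
        remainder="passthrough",
    )),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
model.fit(df[CATEGORICAL + NUMERICAL], df[TARGET])
```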
As of May 2024, 44 percent of the total revenues generated by the global app market came from subscriptions, while the remaining 56 percent came from other monetization methods, such as paid downloads and in-app purchases, which together remained the most popular revenue streams for global app publishers.
During the first quarter of 2024, YouTube Shorts recorded the highest engagement rate across all short-video platforms and in-app features analyzed. Content hosted on YouTube in the form of Shorts had an engagement rate of 5.91 percent, while TikTok reported an engagement rate of approximately 5.75 percent. Facebook Reels had an engagement rate of around two percent, ranking last for short-format user engagement.
WorldPop produces different types of gridded population count datasets, depending on the methods used and end application.
Please make sure you have read our Mapping Populations overview page before choosing and downloading a dataset.
Bespoke methods used to produce datasets for specific individual countries are available through the WorldPop Open Population Repository (WOPR) link below.
These are 100m resolution gridded population estimates using customized methods ("bottom-up" and/or "top-down") developed for the latest data available from each country.
They can also be visualised and explored through the woprVision App.
The remaining datasets in the links below are produced using the "top-down" method, with either the unconstrained or constrained top-down disaggregation method used.
Please make sure you read the Top-down estimation modelling overview page to decide on which datasets best meet your needs.
Datasets are available to download in Geotiff and ASCII XYZ format at a resolution of 3 and 30 arc-seconds (approximately 100m and 1km at the equator, respectively):
- Unconstrained individual countries 2000-2020 (1km resolution): Consistent 1km resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020.
- Unconstrained individual countries 2000-2020 (100m resolution): Consistent 100m resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020.
- Unconstrained individual countries 2000-2020 UN adjusted (100m resolution): Consistent 100m resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020, adjusted to match United Nations national population estimates (UN 2019).
- Unconstrained individual countries 2000-2020 UN adjusted (1km resolution): Consistent 1km resolution population count datasets created using unconstrained top-down methods for all countries of the World for each year 2000-2020, adjusted to match United Nations national population estimates (UN 2019).
- Unconstrained global mosaics 2000-2020 (1km resolution): Mosaiced 1km resolution versions of the "Unconstrained individual countries 2000-2020" datasets.
- Constrained individual countries 2020 (100m resolution): Consistent 100m resolution population count datasets created using constrained top-down methods for all countries of the World for 2020.
- Constrained individual countries 2020 UN adjusted (100m resolution): Consistent 100m resolution population count datasets created using constrained top-down methods for all countries of the World for 2020, adjusted to match United Nations national population estimates (UN 2019).
Older datasets produced for specific individual countries and continents, using a set of tailored geospatial inputs and differing "top-down" methods and time periods, are still available for download here: Individual countries and Whole Continent.
Data for earlier dates is available directly from WorldPop.
WorldPop (www.worldpop.org - School of Geography and Environmental Science, University of Southampton; Department of Geography and Geosciences, University of Louisville; Departement de Geographie, Universite de Namur) and Center for International Earth Science Information Network (CIESIN), Columbia University (2018). Global High Resolution Population Denominators Project - Funded by The Bill and Melinda Gates Foundation (OPP1134076). https://dx.doi.org/10.5258/SOTON/WP00645
Researchers in India have developed a global flood mapper tool that runs on Google Earth Engine. The tool allows users to explore the extent of historical floods from 2014 onwards.
https://spdx.org/licenses/CC0-1.0.html
We are publishing a walking activity dataset including inertial and positioning information from 19 volunteers, with reference distance measured using a trundle wheel. The dataset includes a total of 96.7 km walked by the volunteers, split into 203 separate tracks. Two types of trundle wheel were used: an analogue trundle wheel, which provides the total number of meters walked in a single track, or a sensorized trundle wheel, which measures every revolution of the wheel and therefore records a continuous incremental distance.
Each track has data from the accelerometer and gyroscope embedded in the phones, location information from the Global Navigation Satellite System (GNSS), and the step count obtained by the device. The dataset can be used to implement walking distance estimation algorithms and to explore data quality in the context of walking activity and physical capacity tests, fitness, and pedestrian navigation.
Methods
The proposed dataset is a collection of walks where participants used their own smartphones to capture inertial and positioning information. The participants involved in the data collection come from two sites. The first site is the Oxford University Hospitals NHS Foundation Trust, United Kingdom, where 10 participants (7 affected by cardiovascular diseases and 3 healthy individuals) performed unsupervised 6MWTs in an outdoor environment of their choice (ethical approval obtained by the UK National Health Service Health Research Authority, protocol reference number: 17/WM/0355). All participants involved provided informed consent. The second site is at Malmö University, in Sweden, where a group of 9 healthy researchers collected data. This dataset can be used by researchers to develop distance estimation algorithms and to study how data quality impacts the estimation.
All walks were performed by holding a smartphone in one hand, with an app collecting inertial data, the GNSS signal, and the step count. In the other hand, participants held a trundle wheel to obtain the ground truth distance. Two different trundle wheels were used: an analogue trundle wheel that allowed the registration of a single total value of walked distance, and a sensorized trundle wheel which collected timestamps and distance at every 1-meter revolution, resulting in continuous incremental distance information. The latter configuration is innovative and allows the use of temporal windows of the IMU data as input to machine learning algorithms to estimate walked distance. In the case of data collected by researchers, if the walks were done simultaneously and at a close distance from each other, only one person used the trundle wheel, and the reference distance was associated with all walks that were collected at the same time. The walked paths are of variable length, duration, and shape. Participants were instructed to walk paths of increasing curvature, from straight to rounded. Irregular paths are particularly useful in determining limitations in the accuracy of walked distance algorithms. Two smartphone applications were developed for collecting the information of interest from the participants' devices, both available for Android and iOS operating systems. The first is a web application that retrieves inertial data (acceleration, rotation rate, orientation) while connecting to the sensorized trundle wheel to record incremental reference distance [1]. The second app is the Timed Walk app [2], which guides the user in performing a walking test by signalling when to start and when to stop the walk while collecting both inertial and positioning data. All participants in the UK used the Timed Walk app.
The data collected during the walk is from the Inertial Measurement Unit (IMU) of the phone and, when available, the Global Navigation Satellite System (GNSS). In addition, the step count information is retrieved from the sensors embedded in each participant's smartphone. With the dataset, we provide a descriptive table with the characteristics of each recording, including the brand and model of the smartphone, duration, reference total distance, and types of signals included, additionally scoring some relevant parameters related to the quality of the various signals. The path curvature is one of the most relevant parameters: previous literature from our team confirmed the negative impact of curved paths on multiple distance estimation algorithms [3]. We visually inspected the walked paths and clustered them into three groups: a) straight paths, i.e. no turns wider than 90 degrees; b) gently curved paths, i.e. between one and five turns wider than 90 degrees; and c) curved paths, i.e. more than five turns wider than 90 degrees. Other features relevant to the quality of the collected signals are the total amount of time above a threshold (0.05 s and 6 s, respectively) during which inertial and GNSS data were missing, due to technical issues or to the app going into the background and losing access to the sensors; the sampling frequency of the different data streams; the average walking speed; and the smartphone position. The start of each walk is set to 0 ms, so no absolute time information is reported. Walk locations collected in the UK are anonymized using the following approach: the first position is fixed to a central location of the city of Oxford (latitude: 51.7520, longitude: -1.2577), and all other positions are reassigned by applying a translation along the longitudinal and latitudinal axes which maintains the original distance and angle between samples. This way, the exact geographical location is lost, but the path shape and the distances between samples are maintained. The differences between consecutive points "as the crow flies" and the path curvature were numerically and visually inspected to confirm they match the original walks. Computations were made possible by the Haversine Python library.
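A minimal sketch of this translation-based anonymization, using the haversine package mentioned above; the example track coordinates and function name are hypothetical:

```python
from haversine import haversine  # pip install haversine

ANCHOR = (51.7520, -1.2577)  # central Oxford, as in the description

def anonymize(track):
    """Translate a [(lat, lon), ...] track so it starts at ANCHOR,
    preserving the lat/lon offsets (and hence the path shape)."""
    lat0, lon0 = track[0]
    return [(ANCHOR[0] + lat - lat0, ANCHOR[1] + lon - lon0)
            for lat, lon in track]

# Hypothetical three-sample track.
track = [(55.6050, 13.0038), (55.6052, 13.0041), (55.6055, 13.0045)]
anon = anonymize(track)

# Sanity check: consecutive "as the crow flies" distances should be
# (nearly) unchanged, as the authors verified on the real walks.
for (a, b), (c, d) in zip(zip(track, track[1:]), zip(anon, anon[1:])):
    print(haversine(a, b), haversine(c, d))
```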
Multiple datasets are available regarding walking activity recognition among other daily living tasks. However, few published datasets focus on walked distance in both indoor and outdoor environments and provide relevant ground truth information for it. Yan et al. [4] introduced an inertial walking dataset within indoor scenarios using a smartphone placed in 4 positions (on the leg, in a bag, in the hand, and on the body) by six healthy participants. The reference measurement used in this study is a Visual Odometry System embedded in a smartphone that has to be worn at chest level, using a strap to hold it. While interesting and detailed, this dataset lacks GNSS data, which is likely to be used in outdoor scenarios, and the reference used for localization also suffers from accuracy issues, especially outdoors. Vezovcnik et al. [5] analysed estimation models for step length and provided an open-source dataset totalling 22 km of inertial-only walking data from 15 healthy adults. While relevant, their dataset focuses on steps rather than total distance and was acquired on a treadmill, which limits its validity in real-world scenarios. Kang et al. [6] proposed a way to estimate travelled distance by using an Android app that matches outdoor walking patterns to indoor contexts for each participant. They collect data outdoors, including both inertial and positioning information, and use average speed values obtained from the GPS data as reference labels. Afterwards, they use deep learning models to estimate walked distance, obtaining high performance. They report that 3% to 11% of the data for each participant was discarded due to low quality. Unfortunately, the name of the app used is not reported, and the paper does not mention whether the dataset can be made available.
This dataset is heterogeneous in multiple respects. It includes a majority of healthy participants; therefore, it is not possible to generalize the outcomes from this dataset to all walking styles or physical conditions. The dataset is also heterogeneous from a technical perspective, given the differences in devices, acquired data, and smartphone apps used (e.g., some tests lack IMU or GNSS data, and the sampling frequency on iPhones was particularly low). We suggest selecting the appropriate tracks based on the desired characteristics to obtain reliable and consistent outcomes.
This dataset allows researchers to develop algorithms to compute walked distance and to explore data quality and reliability in the context of the walking activity. This dataset was initiated to investigate the digitalization of the 6MWT, however, the collected information can also be useful for other physical capacity tests that involve walking (distance- or duration-based), or for other purposes such as fitness, and pedestrian navigation.
The article related to this dataset will be published in the proceedings of the IEEE MetroXRAINE 2024 conference, held in St. Albans, UK, 21-23 October.
This research is partially funded by the Swedish Knowledge Foundation and the Internet of Things and People research center through the Synergy project Intelligent and Trustworthy IoT Systems.
This dataset provides information on the 20 most popular digital health certificate apps in the world. It shows how many times each app has been downloaded, describes their privacy policies, and highlights any potentially invasive permissions.
https://creativecommons.org/publicdomain/zero/1.0/
FitLife360 is a synthetic dataset that simulates real-world health and fitness tracking data from 3,000 participants over a one-year period. The dataset captures daily activities, vital health metrics, and lifestyle factors, making it valuable for health analytics and predictive modeling.
- participant_id: Unique identifier for each participant
- age: Age of participant (18-65 years)
- gender: Gender (M/F/Other)
- height_cm: Height in centimeters
- weight_kg: Weight in kilograms
- bmi: Body Mass Index calculated from height and weight

- activity_type: Type of exercise (Running, Swimming, Cycling, etc.)
- duration_minutes: Length of activity session
- intensity: Exercise intensity (Low/Medium/High)
- calories_burned: Estimated calories burned during activity
- daily_steps: Daily step count

- avg_heart_rate: Average heart rate during activity
- resting_heart_rate: Resting heart rate
- blood_pressure_systolic: Systolic blood pressure
- blood_pressure_diastolic: Diastolic blood pressure
- health_condition: Presence of health conditions
- smoking_status: Smoking history (Never/Former/Current)

- hours_sleep: Hours of sleep per night
- stress_level: Daily stress level (1-10)
- hydration_level: Daily water intake in liters
- fitness_level: Calculated fitness score based on cumulative activity
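As a quick consistency check on the derived bmi column, a small pandas sketch; the file name is an assumption, the column names are from the list above:

```python
import pandas as pd

df = pd.read_csv("fitlife360.csv")  # hypothetical file name

# BMI = weight (kg) / height (m)^2; recompute it and compare against
# the provided bmi column.
expected_bmi = df["weight_kg"] / (df["height_cm"] / 100) ** 2
print((expected_bmi - df["bmi"]).abs().max())  # should be ~0
```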
- Predict risk of health conditions based on activity patterns
- Forecast potential life expectancy based on health metrics
- Identify early warning signs of health issues

- Develop personalized weight loss prediction models
- Analyze effectiveness of different activities for weight loss
- Study the relationship between sleep, stress, and weight management

- Track fitness level progression over time
- Analyze the impact of consistent exercise on health metrics
- Study recovery patterns and optimal training frequencies

- Analyze the relationship between lifestyle choices and health outcomes
- Study the impact of smoking on fitness performance
- Investigate correlations between sleep patterns and health metrics

- Develop personalized exercise recommendations
- Optimize workout intensity based on individual characteristics
- Create targeted fitness programs based on health conditions

- Study seasonal patterns in exercise behavior
- Analyze the relationship between stress and physical activity
- Research the impact of hydration on exercise performance
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nowadays, mobile applications (a.k.a. apps) are used by over two billion users for every type of need, including social and emergency connectivity. Their pervasiveness in today's world has inspired the software testing research community to devise approaches that allow developers to better test their apps and improve the quality of the tests being developed. In spite of this research effort, we still notice a lack of empirical analyses aiming at assessing the actual quality of test cases manually developed by mobile developers: this perspective could provide evidence-based findings on future research directions in the field as well as on the current status of testing in the wild. As such, we performed a large-scale empirical study targeting 1,780 open-source Android apps, aiming to assess (1) the extent to which these apps are actually tested, (2) how well designed the available tests are, and (3) how effective they are. The key results of our study show that mobile developers still tend not to properly test their apps, possibly because of time-to-market requirements. Furthermore, we discovered that the test cases of the considered apps have low (i) design quality, both in terms of test code metrics and test smells, and (ii) effectiveness, when considering code coverage as well as assertion density.
In August 2024, over half a million unique devices used the Chinese AI tool Aishenqi. Artificial intelligence tools encompass a broad range of AI services; China's leading AI tools include code-writing assistants as well as a digital language-study companion.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Global Lagrangian Dataset of Marine Litter
This dataset comprises 12 yearly files (global-marine-litter-[2010–2021].nc) combining monthly releases of 32,300 particles initially distributed across the globe following global Mismanaged Plastic Waste (MPW) inputs. The particles are advected with OceanParcels (Delandmeter and van Sebille, 2019) using ocean surface velocity, a wind drag coefficient of 1%, and a small random-walk component with a uniform horizontal turbulent diffusion coefficient of Kh = 1 m² s⁻¹ representing unresolved turbulent motions in the ocean (see Chassignet et al. 2021 for more details).
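To illustrate the random-walk component only (this is not the authors' advection code), a short numpy sketch of the standard relation between a uniform diffusivity Kh and the per-step displacement; the time step is an assumption:

```python
import numpy as np

KH = 1.0     # horizontal diffusivity, m^2/s, as stated above
DT = 3600.0  # hypothetical advection time step, in seconds

rng = np.random.default_rng(0)
n_particles = 32_300  # particles per monthly release

# Random-walk displacement per step and particle: zero-mean normal
# with standard deviation sqrt(2 * Kh * dt) in each horizontal
# direction, the usual discretization of uniform diffusion.
sigma = np.sqrt(2 * KH * DT)
dx = rng.normal(0.0, sigma, size=n_particles)  # metres, east-west
dy = rng.normal(0.0, sigma, size=n_particles)  # metres, north-south
```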
Global oceanic current and atmospheric wind
Ocean surface velocities are obtained from GOFS3.1, a global ocean reanalysis based on the HYbrid Coordinate Ocean Model (HYCOM) and the Navy Coupled Ocean Data Assimilation (NCODA; Chassignet et al., 2009; Metzger et al., 2014). NCODA uses a three-dimensional (3D) variational scheme and assimilates satellite and altimeter observations as well as in-situ temperature and salinity measurements from moored buoys, Expendable Bathythermographs (XBTs), and Argo floats (Cummings and Smedstad, 2013). Surface information is projected downward into the water column using Improved Synthetic Ocean Profiles (Helber et al., 2013). The horizontal resolution and temporal frequency of the GOFS3.1 outputs are 1/12° (8 km at the equator, 6 km at mid-latitudes) and 3-hourly, respectively. Details on the validation of the ocean circulation model are available in Metzger et al. (2017).
Wind velocities are obtained from JRA55, the Japanese 55-year atmospheric reanalysis. The JRA55, which spans from 1958 to the present, is the longest third-generation reanalysis that uses the full observing system and a 4D advanced data assimilation variational scheme. The horizontal resolution of JRA55 is about 55 km and the temporal frequency is 3-hourly (see Tsujino et al. (2018) for more details).
Marine Litter Sources
The marine litter sources are obtained by combining MPW direct inputs from coastal regions, which are defined as areas within 50 km of the coastline (Lebreton and Andrady 2019), and indirect inputs from inland regions via rivers (Lebreton et al. 2017).
File Format
The locations (lon, lat), the corresponding weight (tons), and the source (1: land, 0: river) associated with the 32,300 particles are described in the file initial-location-global.csv. The particle trajectories are grouped into yearly files (global-marine-litter-[2010–2021].nc), each containing 12 monthly releases, for a total of 387,600 trajectories per file. More precisely, in each yearly file, the first 32,300 trajectories are the particles released on January 1st, trajectories 32,301–64,600 are those released on February 1st, and so on. The trajectories are recorded daily and are advected from their release until 2021-12-31, resulting in longer time series for the earlier years of the dataset.
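A minimal sketch for slicing one monthly release out of a yearly file with xarray; the trajectory dimension name traj follows typical OceanParcels output and is an assumption:

```python
import pandas as pd
import xarray as xr

N_PARTICLES = 32_300  # particles per monthly release, as stated above

sources = pd.read_csv("initial-location-global.csv")  # lon, lat, weight, source
ds = xr.open_dataset("global-marine-litter-2010.nc")

# Releases are stored in order: indices 0..32299 hold the January 1st
# release, 32300..64599 the February 1st release, and so on.
month = 2  # hypothetical choice: the February release
release = ds.isel(traj=slice((month - 1) * N_PARTICLES,
                             month * N_PARTICLES))
print(release)
```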
References
Chassignet, E. P., Hurlburt, H. E., Metzger, E. J., Smedstad, O. M., Cummings, J., Halliwell, G. R., et al. (2009). U.S. GODAE: global ocean prediction with the hybrid coordinate ocean model (HYCOM). Oceanography 22, 64–75. doi: 10.5670/oceanog.2009.39
Chassignet, E. P., Xu, X., and Zavala-Romero, O. (2021). Tracking Marine Litter With a Global Ocean Model: Where Does It Go? Where Does It Come From?. Frontiers in Marine Science, 8, 414, doi: 10.3389/fmars.2021.667591
Cummings, J. A., and Smedstad, O. M. (2013). “Chapter 13: variational data assimilation for the global ocean”, in Data Assimilation for Atmospheric, Oceanic and Hydrologic Applications, Vol. II, eds S. Park and L. Xu (Berlin: Springer), 303–343. doi: 10.1007/978-3-642-35088-7_13
Delandmeter, P., and van Sebille, E. (2019). The Parcels v2.0 Lagrangian framework: new field interpolation schemes. Geosci. Model Dev. 12, 3571–3584. doi: 10.5194/gmd-12-3571-2019
Helber, R. W., Townsend, T. L., Barron, C. N., Dastugue, J. M., and Carnes, M. R. (2013). Validation Test Report for the Improved Synthetic Ocean Profile (ISOP) System, Part I: Synthetic Profile Methods and Algorithm. NRL Memo. Report, NRL/MR/7320—13-9364 Hancock, MS: Stennis Space Center.
Metzger, E. J., Smedstad, O. M., Thoppil, P. G., Hurlburt, H. E., Cummings, J. A., Wallcraft, A. J., et al. (2014). US Navy operational global ocean and Arctic ice prediction systems. Oceanography 27, 32–43, doi: 10.5670/oceanog.2014.66.
Metzger, E., Helber, R. W., Hogan, P. J., Posey, P. G., Thoppil, P. G., Townsend, T. L., et al. (2017). Global Ocean Forecast System 3.1 validation test. Technical Report. NRL/MR/7320–17-9722. Hancock, MS: Stennis Space Center, 61.
Lebreton, L., and Andrady, A. (2019). Future scenarios of global plastic waste generation and disposal. Palgrave Commun. 5:6, doi: 10.1057/s41599-018-0212-7.
Lebreton, L., van der Zwet, J., Damsteeg, J. W., Slat, B., Andrady, A., and Reisser, J. (2017). River plastic emissions to the world’s oceans. Nat. Commun. 8:15611, doi: 10.1038/ncomms15611.
Tsujino H., S. Urakawa, H. Nakano, R.J. Small, W.M. Kim, S.G. Yeager, G. Danabasoglu, T. Suzuki, J.L. Bamber, M. Bentsen, C. Böning, A. Bozec, E.P. Chassignet, E. Curchitser, F. Boeira Dias, P.J. Durack, S.M. Griffies, Y. Harada, M. Ilicak, S.A. Josey, C. Kobayashi, S. Kobayashi, Y. Komuro, W.G. Large, J. Le Sommer, S.J. Marsland, S. Masina, M. Scheinert, H. Tomita, M. Valdivieso, and D. Yamazaki, 2018. JRA-55 based surface dataset for driving ocean-sea-ice models (JRA55-do). Ocean Modelling, 130, 79-139, doi: 10.1016/j.ocemod.2018.07.002.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘U.S. News and World Report’s College Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/flyingwombat/us-news-and-world-reports-college-data on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Statistics for a large number of US Colleges from the 1995 issue of US News and World Report.
A data frame with 777 observations on the following 18 variables.
- Private: A factor with levels No and Yes indicating private or public university
- Apps: Number of applications received
- Accept: Number of applications accepted
- Enroll: Number of new students enrolled
- Top10perc: Pct. new students from top 10% of H.S. class
- Top25perc: Pct. new students from top 25% of H.S. class
- F.Undergrad: Number of fulltime undergraduates
- P.Undergrad: Number of parttime undergraduates
- Outstate: Out-of-state tuition
- Room.Board: Room and board costs
- Books: Estimated book costs
- Personal: Estimated personal spending
- PhD: Pct. of faculty with Ph.D.'s
- Terminal: Pct. of faculty with terminal degree
- S.F.Ratio: Student/faculty ratio
- perc.alumni: Pct. alumni who donate
- Expend: Instructional expenditure per student
- Grad.Rate: Graduation rate
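For example, a small pandas sketch computing acceptance and yield rates from the funnel variables above; the CSV file name is an assumption:

```python
import pandas as pd

df = pd.read_csv("College.csv", index_col=0)  # hypothetical file name

# Admissions-funnel rates from the Apps -> Accept -> Enroll columns.
df["accept_rate"] = df["Accept"] / df["Apps"]
df["yield_rate"] = df["Enroll"] / df["Accept"]
print(df.groupby("Private")[["accept_rate", "yield_rate"]].mean())
```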
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The dataset was used in the ASA Statistical Graphics Section’s 1995 Data Analysis Exposition.
--- Original source retains full ownership of the source dataset ---
Live Face Anti-Spoof Dataset
A live face dataset is crucial for advancing computer vision tasks such as face detection, anti-spoofing detection, and face recognition. The Live Face Anti-Spoof Dataset offered by Ainnotate is specifically designed to train algorithms for anti-spoofing purposes, ensuring that AI systems can accurately differentiate between real and fake faces in various scenarios.
Key Features:
- Comprehensive Video Collection: The dataset features thousands of videos showcasing a diverse range of individuals, including males and females, with and without glasses. It also includes men with beards, mustaches, and clean-shaven faces.
- Lighting Conditions: Videos are captured in both indoor and outdoor environments, ensuring that the data covers a wide range of lighting conditions, making it highly applicable for real-world use.
- Data Collection Method: Our datasets are gathered through a community-driven approach, leveraging our extensive network of over 700k users across various Telegram apps. This method ensures that the data is not only diverse but also ethically sourced with full consent from participants, providing reliable and real-world applicable data for training AI models.
- Versatility: This dataset is ideal for training models in face detection, anti-spoofing, and face recognition tasks, offering robust support for these essential computer vision applications.

In addition to the Live Face Anti-Spoof Dataset, FileMarket provides specialized datasets across various categories to support a wide range of AI and machine learning projects:
- Object Detection Data: Perfect for training AI in image and video analysis.
- Machine Learning (ML) Data: Offers a broad spectrum of applications, from predictive analytics to natural language processing (NLP).
- Large Language Model (LLM) Data: Designed to support text generation, chatbots, and machine translation models.
- Deep Learning (DL) Data: Essential for developing complex neural networks and deep learning models.
- Biometric Data: Includes diverse datasets for facial recognition, fingerprint analysis, and other biometric applications.

This live face dataset, alongside our other specialized data categories, empowers your AI projects by providing high-quality, diverse, and comprehensive datasets. Whether your focus is on anti-spoofing detection, face recognition, or other biometric and machine learning tasks, our data offerings are tailored to meet your specific needs.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected all available global soil carbon (C) and heterotrophic respiration (RH) maps derived from data-driven estimates, sourcing them from public repositories and supplementary materials of previous studies (Table 1). All spatial datasets were converted to NetCDF format for consistency and ease of use.
Because the maps had varying spatial resolutions (ranging from 0.0083° to 0.5°), we harmonized all datasets to a common resolution of 0.5° (approximately 50 km at the equator). We then merged the processed maps by computing the mean, maximum, and minimum values at each grid cell, resulting in harmonized global maps of soil C (for the top 0–30 cm and 0–100 cm depths) and RH at 0.5° resolution.
Grid cells with fewer than three soil C estimates or fewer than four RH estimates were assigned NA values. Land and water grid cells were automatically distinguished by combining multiple datasets containing soil C and RH information over land.
Soil carbon turnover time (years), denoted as τ, was calculated under the assumption of a quasi-equilibrium state using the formula:
τ = CS / RH
where CS is soil carbon stock and RH is the heterotrophic respiration rate. The uncertainty range of τ was estimated for each grid cell using:
τmax = CS+ / RH−
τmin = CS− / RH+
where CS+ and CS− are the maximum and minimum soil C values, and RH+ and RH− are the maximum and minimum RH values, respectively.
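A short numpy sketch of the turnover-time calculation and its uncertainty bounds as defined above; the arrays here are synthetic placeholders for the harmonized 0.5° grids:

```python
import numpy as np

# Synthetic placeholders for the harmonized 0.5-degree grids
# (360 x 720 cells); real inputs would be the merged soil C and RH maps.
rng = np.random.default_rng(0)
cs_mean = rng.uniform(5.0, 30.0, (360, 720))   # soil C stock
cs_max, cs_min = cs_mean * 1.2, cs_mean * 0.8  # max/min estimates
rh_mean = rng.uniform(0.1, 1.0, (360, 720))    # heterotrophic respiration
rh_max, rh_min = rh_mean * 1.2, rh_mean * 0.8

tau = cs_mean / rh_mean    # turnover time, years
tau_max = cs_max / rh_min  # CS+ / RH-
tau_min = cs_min / rh_max  # CS- / RH+
```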
To calculate the temperature sensitivity of decomposition (Q10)—the factor by which decomposition rates increase with a 10 °C rise in temperature—we followed the method described in Koven et al. (2017). The uncertainty of Q10 (maximum and minimum values) was derived using τmax and τmin, respectively.
More details are provided in:
Shoji Hashimoto, Akihiko Ito, Kazuya Nishina (submitted)
Reference
Koven, C. D., Hugelius, G., Lawrence, D. M. & Wieder, W. R. Higher climatological temperature sensitivity of soil carbon in cold than warm climates. Nat. Clim. Change 7, 817–822 (2017).
Table 1: List of soil carbon and heterotrophic respiration datasets used in this study.

| Dataset | Repository/References (Dataset name) | Depth (cm) | ID in NetCDF file*** |
| --- | --- | --- | --- |
| Global soil C | Global soil data task 2000 (IGBP-DIS) [1] | 0–100 | 3,- |
| | Shangguan et al. 2014 (GSDE) [2,3] | 0–100, 0–30* | 1,1 |
| | Batjes 2016 (WISE30sec) [4,5] | 0–100, 0–30 | 6,7 |
| | Sanderman et al. 2017 (Soil-Carbon-Debt) [6,7] | 0–100, 0–30 | 5,5 |
| | SoilGrids team and Hengl et al. 2017 (SoilGrids) [8,9] | 0–30** | -,6 |
| | Hengl and Wheeler 2018 (LandGIS) [10] | 0–100, 0–30 | 4,4 |
| | FAO 2022 (GSOC) [11] | 0–30 | -,2 |
| | FAO 2023 (HWSD2) [12] | 0–100, 0–30 | 2,3 |
| Circumpolar soil C | Hugelius et al. 2013 (NCSCD) [13–15] | 0–100, 0–30 | 7,8 |
| Global RH | Hashimoto et al. 2015 [16,17] | - | 1 |
| | Warner et al. 2019 (Bond-Lamberty equation based) [18,19] | - | 2 |
| | Warner et al. 2019 (Subke equation based) [18,19] | - | 3 |
| | Tang et al. 2020 [20,21] | - | 4 |
| | Lu et al. 2021 [22,23] | - | 5 |
| | Stell et al. 2021 [24,25] | - | 6 |
| | Yao et al. 2021 [26,27] | - | 7 |
| | He et al. 2022 [28,29] | - | 8 |
* The vertical depth intervals did not exactly match 100 cm and 30 cm; therefore, weighted means were calculated for the 0–100 cm and 0–30 cm depths.
** Only the soil C stock data for the 0–30 cm depth is officially provided in the repository.
*** IDs for 0–100 cm / 0–30 cm.
References
1. Global soil data task. Global Gridded Surfaces of Selected Soil Characteristics (IGBP-DIS). Preprint at https://doi.org/10.3334/ORNLDAAC/569 (2000).
2. Shangguan, W., Dai, Y., Duan, Q., Liu, B. & Yuan, H. A global soil data set for earth system modeling. J. Adv. Model. Earth Syst. 6, 249–263 (2014).
3. Land-atmosphere interaction research group at Sun Yat-sen University. The global soil dataset for Earth system modeling. http://globalchange.bnu.edu.cn/research/soilw (2014).
4. Batjes, N. H. Harmonized soil property values for broad-scale modelling (WISE30sec) with estimates of global soil carbon stocks. Geoderma 269, 61–68 (2016).
5. ISRIC World Soil Information. WISE derived soil properties on a 30 by 30 arc-seconds global grid. https://data.isric.org/geonetwork/srv/eng/catalog.search#/metadata/dc7b283a-8f19-45e1-aaed-e9bd515119bc (2016).
6. Sanderman, J., Hengl, T. & Fiske, G. J. Soil carbon debt of 12,000 years of human land use. Proc. Natl. Acad. Sci. 114, 9575–9580 (2017).
7. Sanderman, J. Soil-Carbon-Debt. https://github.com/whrc/Soil-Carbon-Debt (2017).
8. SoilGrids team. SoilGrids-global gridded soil information. https://files.isric.org/soilgrids/latest/data_aggregated/ (2020).
9. Hengl, T. et al. SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE 12, e0169748 (2017).
10. Hengl, T. & Wheeler, I. Soil organic carbon stock in kg/m2 for 5 standard depth intervals (0–10, 10–30, 30–60, 60–100 and 100–200 cm) at 250 m resolution. Zenodo https://doi.org/10.5281/ZENODO.2536040 (2018).
11. FAO. Global soil organic carbon map. https://data.apps.fao.org/catalog/dataset/global-soil-organic-carbon-map (2022).
12. FAO. Harmonized world soil database v2.0. https://www.fao.org/soils-portal/data-hub/soil-maps-and-databases/harmonized-world-soil-database-v20/en/ (2023).
13. Hugelius, G. et al. A new data set for estimating organic carbon storage to 3 m depth in soils of the northern circumpolar permafrost region. Earth Syst. Sci. Data 5, 393–402 (2013).
14.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Bangladeshi Currency (Coins & Notes) Recognition Dataset is a comprehensive collection of high-quality images of Bangladeshi coins and banknotes. It is designed to facilitate machine learning and computer vision applications for currency recognition, classification, and detection.
This dataset is organized into various denominations of coins and notes, with each folder representing a specific currency denomination. Each folder contains 10,000 images, providing a total of 100,000 images in the dataset.
The images have been resized to a uniform dimension of 256x256 pixels, ensuring consistency and enabling easy integration into machine learning workflows. The images are saved in JPEG format to optimize storage and speed for large-scale training tasks.
Currency Denominations Included:
- 10 Poisha (small denomination coin)
- 1 Poisha
- 1 Taka
- 25 Poisha
- 2 Taka
- 50 Poisha
- 5 Poisha
- 5 Taka
- Commemorative Coins
- Demonetized Notes

Features:
- Image Size: All images have been resized to 256x256 pixels (Width x Height).
- Image Format: JPEG.
- Total Images: 100,000 (10,000 images per folder, one per denomination).
- Categories: Each folder corresponds to a unique denomination of currency. The folder names are aligned with the specific denominations, such as 10_Poisha, 1_Taka, 5_Taka, etc.

Objective: This dataset is ideal for training and evaluating models for the following tasks (a loading sketch follows the task list below):
Currency Classification: Identifying the denomination of a given image of a coin or banknote.
Currency Recognition: Detecting and recognizing specific Bangladeshi coins and notes from real-world images.
Coin and Note Detection: Identifying and classifying multiple coins and notes in a single image.
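Given the folder-per-denomination layout described above, a minimal PyTorch sketch for loading the images; the root directory name is an assumption:

```python
import torch
from torchvision import datasets, transforms

# Hypothetical root directory containing one sub-folder per
# denomination (10_Poisha, 1_Taka, 5_Taka, ...), as described above.
dataset = datasets.ImageFolder(
    root="bd_currency",
    transform=transforms.ToTensor(),  # images are already 256x256 JPEGs
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
print(dataset.classes)  # folder names become the class labels
```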
Possible Use Cases:
Currency detection systems: Automated systems in ATMs, vending machines, or cash counting machines that recognize Bangladeshi coins and banknotes.
Banknote and Coin Classification: Machine learning models that classify various denominations of coins and notes for digital payment applications.
Real-world Applications: Currency recognition for mobile apps, kiosks, or any system that needs to automatically recognize Bangladeshi currency.
Research in Currency Image Recognition: Researchers working on currency recognition problems using computer vision techniques.
Collected from Bangladesh Bank (https://www.bb.org.bd/currency), plus the author's own images.
Note for researchers using the dataset
This dataset was created by Shuvo Kumar Basak. If you use this dataset for research or academic purposes, please cite it appropriately. If you have published research using this dataset, please share a link to your paper. Good luck.