Which country has the most Facebook users?
There are more than 378 million Facebook users in India alone, making it the leading country in terms of Facebook audience size. To put this into context: if India's Facebook audience were a country, it would rank third worldwide by population. Apart from India, several other markets have more than 100 million Facebook users each: the United States, Indonesia, and Brazil, with 193.8 million, 119.05 million, and 112.55 million Facebook users respectively.
Facebook – the most used social media
Meta, the company previously called Facebook, owns four of the most popular social media platforms worldwide: WhatsApp, Facebook Messenger, Facebook, and Instagram. As of the third quarter of 2021, there were around 3.5 billion cumulative monthly users of the company's products worldwide. With around 2.9 billion monthly active users, Facebook is the most popular social media platform worldwide. With an audience of this scale, it is no surprise that the vast majority of Facebook's revenue is generated through advertising.
Facebook usage by device
As of July 2021, 98.5 percent of active users accessed their Facebook account from mobile devices. In fact, 81.8 percent of Facebook audiences worldwide access the platform only via mobile phone. Facebook is not only available through mobile browsers: the company has also published several mobile apps for users to access its products and services. As of the third quarter of 2021, the four core Meta products led the ranking of most downloaded mobile apps worldwide, with WhatsApp amassing approximately six billion downloads.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dominican Republic DO: Population in Urban Agglomerations of More Than 1 Million: as % of Total Population data was reported at 28.759 % in 2017. This records an increase from the previous number of 28.360 % for 2016. The data is updated yearly, averaging 21.404 % (median) from Dec 1960 to 2017, with 58 observations. It reached an all-time high of 28.759 % in 2017 and a record low of 11.151 % in 1960. The data remains in active status in CEIC and is reported by the World Bank. It is categorized under the Global Database's Dominican Republic – Table DO.World Bank: Population and Urbanization Statistics. Population in urban agglomerations of more than one million is the percentage of a country's population living in metropolitan areas that in 2000 had a population of more than one million people. Source: United Nations, World Urbanization Prospects. Aggregation: Weighted Average.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dominican Republic DO: Population in Urban Agglomerations of More Than 1 Million data was reported at 3,094,465.000 Person in 2017. This records an increase from the previous number of 3,018,681.000 Person for 2016. The data is updated yearly, averaging 1,483,522.000 Person (median) from Dec 1960 to 2017, with 58 observations. It reached an all-time high of 3,094,465.000 Person in 2017 and a record low of 367,328.000 Person in 1960. The data remains in active status in CEIC and is reported by the World Bank. It is categorized under the Global Database's Dominican Republic – Table DO.World Bank.WDI: Population and Urbanization Statistics. Population in urban agglomerations of more than one million is the country's population living in metropolitan areas that in 2018 had a population of more than one million people. Source: United Nations, World Urbanization Prospects.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Earlier this year, Dr. Hoffman and Dr. Fafard published a book chapter on the efficacy and legality of border closures enacted by governments in response to changing COVID-19 conditions. The authors concluded that border closures are, at best, powerful symbolic acts taken by governments to show they are acting forcefully, even when the actions lack an epidemiological impact and breach international law. This COVID-19 travel restriction project was developed out of the need to further examine the empirical implications of border closures. The current dataset contains bilateral travel-restriction information on the status of 179 countries between 1 January 2020 and 8 June 2020. The data were extracted from the 'international controls' column of the Oxford COVID-19 Government Response Tracker (OxCGRT), which records a country's change in border-control status in response to COVID-19 conditions. Accompanying source links were further verified through random selection and comparison with external news sources. Greater weight is given to official national government sources, then to provincial and municipal news-affiliated agencies. The database is presented in matrix form for each country pair and date: each cell holds a datum Xdmn indicating the border closure status on date d imposed by country m on country n. The coding is as follows: no border closure (code = 0), targeted border closure (= 1), and total border closure (= 99). The 'notes' column provides further details when the closure is a modified form of a targeted closure: a land or port closure, a flight or visa suspension, or a re-opening of borders to select countries. Visa suspensions and closures of land borders were coded separately as de facto border closures and analyzed as targeted border closures in quantitative analyses.
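As a minimal sketch of how one might tally the Xdmn cells for a single date using the coding above (the data structure and country codes here are hypothetical, not the dataset's actual file layout):

```python
from collections import Counter

# Border-closure codes used in the dataset:
# 0 = no closure, 1 = targeted closure, 99 = total closure.
CODES = {0: "no closure", 1: "targeted", 99: "total"}

def closure_summary(matrix):
    """Tally closure statuses in one day's country-by-country matrix.

    `matrix` maps (imposing_country, targeted_country) pairs to a
    closure code, mirroring the Xdmn cells for a single date d.
    """
    return dict(Counter(CODES[code] for code in matrix.values()))

# Hypothetical sample: three country pairs on one date.
sample = {
    ("CAN", "USA"): 1,   # targeted closure by CAN on USA
    ("USA", "CAN"): 0,   # border open
    ("AUS", "NZL"): 99,  # total closure
}
```

Summing such tallies over dates gives a simple time series of how many targeted versus total closures were in force.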
The file titled 'BTR Supplementary Information' collects supplementary details for the database. The tabs cover the following: 1) Codebook: variable name, format, source links, and description; 2) Sources, Access dates: dates of access for the individual source links, with additional notes; 3) Country groups: breakdown of the EEA, EU, SADC, and Schengen groups, with source links; 4) Newly added sources: for missing countries with a population greater than 1 million (meeting the inclusion criteria), relevant news sources were added for analysis; 5) Corrections: external news sources correcting errors in the coding of international controls retrieved from the OxCGRT dataset. At the time of our study's inception, no existing dataset recorded bilateral travel-restriction decisions between countries. We hope this dataset will be useful in studying the impact of border closures during the COVID-19 pandemic and will widen the capability to study border closures on a global scale, given their interconnected nature and impact, rather than limiting analysis to a single country or region. Statement of contributions: data entry and verification were performed mainly by GL, with assistance from MJP and RN. MP and IW provided further data verification on the nine countries purposively selected for the exploratory analysis of political decision-making.
If you know any further standard populations worth integrating in this dataset, please let me know in the discussion part. I would be happy to integrate further data to make this dataset more useful for everybody.
Standard populations are "artificial populations" with fictitious age structures that are used in age standardization as a uniform basis for calculating comparable measures across the respective reference population(s).
Use: Age standardizations based on a standard population are often used at cancer registries to compare morbidity or mortality rates. If the populations of different regions, or a population in one region over time, have different age structures, the comparability of their mortality or morbidity rates is limited. For interregional or inter-temporal comparisons, an age standardization is therefore necessary. For this purpose, the age structure of a reference population, the so-called standard population, is assumed for the study population: the age-specific mortality or morbidity rates of the study population are weighted according to the age structure of the standard population. Selection of a standard population:
Which standard population is used for comparison basically does not matter. It is important, however, that the same standard population is used for all populations being compared.
The aim of this dataset is to provide a variety of the most commonly used 'standard populations'.
Currently, two files with 22 standard populations are provided: - standard_populations_20_age_groups.csv - 20 age groups: '0', '01-04', '05-09', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85-89', '90+' - 7 standard populations: 'Standard population Germany 2011', 'Standard population Germany 1987', 'Standard population of Europe 2013', 'Standard population Old Laender 1987', 'Standard population New Laender 1987', 'New standard population of Europe', 'World standard population' - source: German Federal Health Monitoring System
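The weighting described above (direct age standardization) can be sketched as follows; the rates and the three coarse age groups here are hypothetical, not taken from the provided files:

```python
# Direct age standardization: weight each age-specific rate by the
# standard population's share in that age group.
def age_standardized_rate(rates, standard_pop):
    """`rates`: age-specific rates (e.g. per 100,000) per age group.
    `standard_pop`: standard-population counts for the same groups."""
    assert len(rates) == len(standard_pop)
    total = sum(standard_pop)
    return sum(r * w for r, w in zip(rates, standard_pop)) / total

# Hypothetical example with three coarse age groups:
rates = [2.0, 10.0, 150.0]           # per 100,000
standard = [25_000, 50_000, 25_000]  # fictitious standard population
# (2*25000 + 10*50000 + 150*25000) / 100000 = 43.0 per 100,000
```

With a real file from this dataset, `standard_pop` would be the column for one of the 22 standard populations, aligned to the 18 or 20 age groups.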
No restrictions are known to the author. Standard populations are published by different organisations for public usage.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the United States population distribution across 18 age groups. It lists the population in each age group along with the percentage each group represents of the total United States population. The dataset can be used to understand the population distribution of the United States by age. For example, using this dataset, we can identify the largest age group in the United States.
Key observations
The largest age group in the United States was 25-29 years, with a population of 22,854,328 (6.93%), according to the 2021 American Community Survey. At the same time, the smallest age group was 80-84 years, with a population of 5,932,196 (1.80%). Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for United States Population by Age. You can refer to it here
Public Domain Mark 1.0https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
The Gridded Population of the World, Version 4 (GPWv4): Population Density, Revision 11 consists of estimates of human population density (number of persons per square kilometer) based on counts consistent with national censuses and population registers, for the years 2000, 2005, 2010, 2015, and 2020. A proportional allocation gridding algorithm, utilizing approximately 13.5 million national and sub-national administrative units, was used to assign population counts to 30 arc-second grid cells. The population density rasters were created by dividing the population count raster for a given target year by the land area raster. The data files were produced as global rasters at 30 arc-second (~1 km at the equator) resolution.
Purpose: To provide estimates of population density for the years 2000, 2005, 2010, 2015, and 2020, based on counts consistent with national censuses and population registers, as raster data to facilitate data integration.
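The density calculation described above (count raster divided by land-area raster, cell by cell) can be sketched with hypothetical cell values:

```python
import numpy as np

# Minimal sketch: density = population count / land area per grid
# cell, with nodata cells (here NaN) propagated. Values are
# hypothetical, not drawn from the GPWv4 rasters.
count = np.array([[100.0, 30.0],
                  [0.0,   np.nan]])   # persons per cell
land_area = np.array([[0.5,  0.25],
                      [0.8,  0.7]])   # km^2 per cell

with np.errstate(invalid="ignore"):
    density = count / land_area       # persons per km^2
```

In the real product the same division is carried out over global 30 arc-second rasters, where cell land areas shrink toward the poles.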
Recommended Citation(s)*: Center for International Earth Science Information Network - CIESIN - Columbia University. 2018. Gridded Population of the World, Version 4 (GPWv4): Population Density, Revision 11. Palisades, NY: NASA Socioeconomic Data and Applications Center (SEDAC). https://doi.org/10.7927/H49C6VHW. Accessed DAY MONTH YEAR.
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
TruMedicines has trained a deep convolutional neural network to autoencode and retrieve a saved image from a large image dataset based on the random pattern of dots on the surface of a pharmaceutical tablet (pill). Using a mobile phone app, a user can query the image database and verify that the queried pill is authentic rather than counterfeit; additional metadata can be displayed to the user: manufacture date, manufacture location, drug expiration date, drug strength, adverse reactions, etc.
TruMedicines pharmaceutical images: 252 speckled pill images. We augmented the images to create a 20,000-image training database via rotations, grayscale and black-and-white conversion, added noise, and non-pill images; images are 292px x 292px in JPEG format.
In this playground competition, Kagglers are challenged to develop a deep convolutional neural network and hash codes to accurately identify images of pills and quickly retrieve them from our database. JPEG images of pills can be autoencoded using a CNN and retrieved using a CNN hash-code index. Our Android app takes a photo of a pill and sends a query to the image database for a match, then returns metadata about the pill: manufacture date, expiration date, ingredients, adverse reactions, etc. Techniques from computer vision, alongside other current technologies, can make recognition of non-counterfeit medications cheaper, faster, and more reliable.
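The hash-index retrieval idea can be sketched with a plain average hash and Hamming-distance lookup, as a simplified stand-in for the learned CNN hash codes (the images here are tiny synthetic arrays, not the actual pill photos):

```python
import numpy as np

def average_hash(img):
    """Binarize a grayscale image against its mean -> flat bit vector."""
    return (img > img.mean()).ravel()

def nearest(query_hash, index):
    """Key in `index` whose hash is closest in Hamming distance."""
    return min(index, key=lambda k: np.count_nonzero(index[k] != query_hash))

# Synthetic 8x8 "pill" images with distinct speckle patterns.
rng = np.random.default_rng(0)
pills = {f"pill_{i}": rng.random((8, 8)) for i in range(5)}
index = {name: average_hash(img) for name, img in pills.items()}

# Querying with a lightly noised copy of pill_3 should match pill_3,
# since only pixels very close to the mean can flip bits.
noisy = pills["pill_3"] + rng.normal(0, 0.01, (8, 8))
```

A learned hash replaces `average_hash` with CNN-derived binary codes, but the index lookup step stays the same shape.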
Special Thanks to Microsoft Paul Debaun and Steve Borg and NWCadence, Bellevue WA for their assistance
TruMedicines is using machine learning in a mobile app to stop the spread of counterfeit medicines around the world. The World Health Organization (WHO) estimates that every year 1 million people die or become disabled due to counterfeit medicine.
Spotify Million Playlist Dataset Challenge
Summary
The Spotify Million Playlist Dataset Challenge consists of a dataset and evaluation to enable research in music recommendations. It is a continuation of the RecSys Challenge 2018, which ran from January to July 2018. The dataset contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. The evaluation task is automatic playlist continuation: given a seed playlist title and/or initial set of tracks in a playlist, to predict the subsequent tracks in that playlist. This is an open-ended challenge intended to encourage research in music recommendations, and no prizes will be awarded (other than bragging rights).
Background
Playlists like Today’s Top Hits and RapCaviar have millions of loyal followers, while Discover Weekly and Daily Mix are just a couple of our personalized playlists made especially to match your unique musical tastes.
Our users love playlists too. In fact, the Digital Music Alliance, in their 2018 Annual Music Report, states that 54% of consumers say playlists are replacing albums in their listening habits.
But our users don’t love just listening to playlists, they also love creating them. To date, over 4 billion playlists have been created and shared by Spotify users. People create playlists for all sorts of reasons: some playlists group together music categorically (e.g., by genre, artist, year, or city), by mood, theme, or occasion (e.g., romantic, sad, holiday), or for a particular purpose (e.g., focus, workout). Some playlists are even made to land a dream job, or to send a message to someone special.
The other thing we love here at Spotify is playlist research. By learning from the playlists that people create, we can learn all sorts of things about the deep relationship between people and music. Why do certain songs go together? What is the difference between "Beach Vibes" and "Forest Vibes"? And what words do people use to describe their playlists?
By learning more about the nature of playlists, we may also be able to suggest other tracks that a listener would enjoy in the context of a given playlist. This can make playlist creation easier, and ultimately help people find more of the music they love.
Dataset
To enable this type of research at scale, in 2018 we sponsored the RecSys Challenge 2018, which introduced the Million Playlist Dataset (MPD) to the research community. Sampled from the over 4 billion public playlists on Spotify, this dataset of 1 million playlists consists of over 2 million unique tracks by nearly 300,000 artists, and represents the largest public dataset of music playlists in the world. The dataset includes public playlists created by US Spotify users between January 2010 and November 2017. The challenge ran from January to July 2018, and received 1,467 submissions from 410 teams. A summary of the challenge and the top scoring submissions was published in the ACM Transactions on Intelligent Systems and Technology.
In September 2020, we re-released the dataset as an open-ended challenge on AIcrowd.com. The dataset can now be downloaded by registered participants from the Resources page.
Each playlist in the MPD contains a playlist title, the track list (including track IDs and metadata), and other metadata fields (last edit time, number of playlist edits, and more). All data is anonymized to protect user privacy. Playlists are sampled with some randomization, are manually filtered for playlist quality and to remove offensive content, and have some dithering and fictitious tracks added to them. As such, the dataset is not representative of the true distribution of playlists on the Spotify platform, and must not be interpreted as such in any research or analysis performed on the dataset.
Dataset Contains
1000 examples of each scenario:
- Title only (no tracks)
- Title and first track
- Title and first 5 tracks
- First 5 tracks only
- Title and first 10 tracks
- First 10 tracks only
- Title and first 25 tracks
- Title and 25 random tracks
- Title and first 100 tracks
- Title and 100 random tracks
Download Link
Full Details: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge
Download Link: https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge/dataset_files
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely analyze large BigQuery datasets.
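As a sketch of the client-library workflow (the `languages` table name follows the bigquery-public-data.github_repos.[TABLENAME] pattern above; the specific query is an illustration, and running it requires Google Cloud credentials):

```python
# Build a query against the GitHub dataset; constructing the SQL
# separately keeps the sketch inspectable without credentials.
TABLE = "bigquery-public-data.github_repos.languages"

def top_languages_sql(limit=10):
    """SQL counting the most common languages across repositories.
    The `language` column is an array, hence the UNNEST."""
    return (
        "SELECT l.name AS language, COUNT(*) AS repos "
        f"FROM `{TABLE}`, UNNEST(language) AS l "
        "GROUP BY language ORDER BY repos DESC "
        f"LIMIT {limit}"
    )

def run(limit=10):
    # Executes the query; needs authenticated credentials at runtime.
    from google.cloud import bigquery
    client = bigquery.Client()
    return list(client.query(top_languages_sql(limit)).result())
```

In a Kernel, `run()` would return rows of (language, repo count); costs are driven by bytes scanned, so prefer selecting specific columns over `SELECT *` on the multi-terabyte tables.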
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
All cities with a population > 1000 or seats of administrative divisions (ca. 80,000).
Sources and Contributions
Sources: GeoNames aggregates over a hundred different data sources.
Ambassadors: GeoNames ambassadors help in many countries.
Wiki: A wiki allows viewing the data and quickly fixing errors and adding missing places.
Donations and Sponsoring: Costs for running GeoNames are covered by donations and sponsoring.
Enrichment: added country name.
The Global Consumption Database (GCD) contains information on consumption patterns at the national level, by urban/rural area, and by income level (4 categories: lowest, low, middle, higher, with thresholds based on a global income distribution), for 92 low- and middle-income countries, as of 2010. The data were extracted from national household surveys. Consumption is presented by category of products and services of the International Comparison Program (ICP) 2005, which mostly corresponds to COICOP. For three countries (Brazil, India, and South Africa), sub-national data are also available. Data on population estimates are also included.
The data file can be used for the production of the following tables (by urban/rural and income class/consumption segment):
- Sample Size by Country, Area and Consumption Segment (Number of Households)
- Population 2010 by Country, Area and Consumption Segment
- Population 2010 by Country, Area and Consumption Segment, as a Percentage of the National Population
- Population 2010 by Country, Area and Consumption Segment, as a Percentage of the Area Population
- Population 2010 by Country, Age Group, Sex and Consumption Segment
- Household Consumption 2010 by Country, Sector, Area and Consumption Segment in Local Currency (Million)
- Household Consumption 2010 by Country, Sector, Area and Consumption Segment in $PPP (Million)
- Household Consumption 2010 by Country, Sector, Area and Consumption Segment in US$ (Million)
- Household Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in Local Currency (Million)
- Household Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in $PPP (Million)
- Household Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in US$ (Million)
- Household Consumption 2010 by Country, Product/Service, Area and Consumption Segment in Local Currency (Million)
- Household Consumption 2010 by Country, Product/Service, Area and Consumption Segment in $PPP (Million)
- Household Consumption 2010 by Country, Product/Service, Area and Consumption Segment in US$ (Million)
- Per Capita Consumption 2010 by Country, Sector, Area and Consumption Segment in Local Currency
- Per Capita Consumption 2010 by Country, Sector, Area and Consumption Segment in US$
- Per Capita Consumption 2010 by Country, Sector, Area and Consumption Segment in $PPP
- Per Capita Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in Local Currency
- Per Capita Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in US$
- Per Capita Consumption 2010 by Country, Category of Product/Service, Area and Consumption Segment in $PPP
- Per Capita Consumption 2010 by Country, Product or Service, Area and Consumption Segment in Local Currency
- Per Capita Consumption 2010 by Country, Product or Service, Area and Consumption Segment in US$
- Per Capita Consumption 2010 by Country, Product or Service, Area and Consumption Segment in $PPP
- Consumption Shares 2010 by Country, Sector, Area and Consumption Segment (Percent)
- Consumption Shares 2010 by Country, Category of Products/Services, Area and Consumption Segment (Percent)
- Consumption Shares 2010 by Country, Product/Service, Area and Consumption Segment (Percent)
- Percentage of Households who Reported Having Consumed the Product or Service by Country, Consumption Segment and Area (as of Survey Year)
For all countries, estimates are provided at the national level and at the urban/rural levels. For Brazil, India, and South Africa, data are also provided at the sub-national level (admin 1): - Brazil: Acre, Alagoas, Amapa, Amazonas, Bahia, Ceara, Distrito Federal, Espirito Santo, Goias, Maranhao, Mato Grosso, Mato Grosso do Sul, Minas Gerais, Para, Paraiba, Parana, Pernambuco, Piaui, Rio de Janeiro, Rio Grande do Norte, Rio Grande do Sul, Rondonia, Roraima, Santa Catarina, Sao Paulo, Sergipe, Tocantins - India: Andaman and Nicobar Islands, Andhra Pradesh, Arunachal Pradesh, Assam, Bihar, Chandigarh, Chhattisgarh, Dadra and Nagar Haveli, Daman and Diu, Delhi, Goa, Gujarat, Haryana, Himachal Pradesh, Jammu and Kashmir, Jharkhand, Karnataka, Kerala, Lakshadweep, Madhya Pradesh, Maharashtra, Manipur, Meghalaya, Mizoram, Nagaland, Orissa, Pondicherry, Punjab, Rajasthan, Sikkim, Tamil Nadu, Tripura, Uttar Pradesh, Uttaranchal, West Bengal - South Africa: Eastern Cape, Free State, Gauteng, KwaZulu-Natal, Limpopo, Mpumalanga, Northern Cape, North West, Western Cape
Data derived from survey microdata
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Historical chart and dataset showing World population growth rate by year from 1961 to 2023.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. It contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data only include ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. It was created for training and simulation purposes and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
ssd
The sample size was set to 8,000 households, with a fixed 25 households to be selected from each enumeration area. In the first stage, the number of enumeration areas to be selected in each stratum was calculated proportionally to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
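The two-stage design described above can be sketched as follows (a Python sketch, not the actual R script; the stratum names and sizes are hypothetical):

```python
import random

# Sketch of the two-stage design: allocate EAs to strata
# proportionally to size, then draw 25 households per selected EA.
HH_PER_EA = 25

def allocate_eas(strata_sizes, total_eas):
    """Proportional-to-size allocation of EA counts to strata."""
    total = sum(strata_sizes.values())
    return {s: round(total_eas * n / total) for s, n in strata_sizes.items()}

def draw_households(ea_households, rng):
    """Second stage: 25 households at random within one EA."""
    return rng.sample(ea_households, HH_PER_EA)

rng = random.Random(42)
alloc = allocate_eas({"north_urban": 6000, "north_rural": 2000}, 320)
sample = draw_households(list(range(1000)), rng)
```

A production version would also handle rounding so allocated EA counts sum exactly to the target, which this sketch omits.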
other
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Some post-processing was also applied to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
There is a lack of publicly available datasets on financial services, especially in the emerging mobile money transactions domain. Financial datasets are important to many researchers, and in particular to us, performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which leads to no publicly available datasets.
We present a synthetic dataset generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, the provider of the mobile financial service, which is currently running in more than 14 countries around the world.
This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.
This is a sample of 1 row; the columns are explained below:
1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0
step - maps a unit of time in the real world; 1 step is 1 hour. Total steps: 744 (31 days × 24 hours, one simulated month).
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction
oldbalanceOrg - initial balance before the transaction
newbalanceOrig - new balance after the transaction
nameDest - customer who is the recipient of the transaction
oldbalanceDest - recipient's balance before the transaction. Note that there is no information for customers whose names start with M (merchants).
newbalanceDest - recipient's balance after the transaction. Note that there is no information for customers whose names start with M (merchants).
isFraud - marks transactions made by fraudulent agents inside the simulation. In this dataset, the fraudulent behavior of the agents aims to profit by taking control of customers' accounts, emptying the funds by transferring them to another account, and then cashing out of the system.
isFlaggedFraud - the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
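The sample row above can be parsed against this column list as follows (a minimal sketch; real use would read the CSV with a proper loader):

```python
# Parse the sample PaySim row into a typed dict, using the column
# names listed in this description.
COLUMNS = [
    "step", "type", "amount", "nameOrig", "oldbalanceOrg",
    "newbalanceOrig", "nameDest", "oldbalanceDest", "newbalanceDest",
    "isFraud", "isFlaggedFraud",
]
ROW = "1,PAYMENT,1060.31,C429214117,1089.0,28.69,M1591654462,0.0,0.0,0,0"

def parse_row(line):
    record = dict(zip(COLUMNS, line.split(",")))
    for col in ("amount", "oldbalanceOrg", "newbalanceOrig",
                "oldbalanceDest", "newbalanceDest"):
        record[col] = float(record[col])
    for col in ("step", "isFraud", "isFlaggedFraud"):
        record[col] = int(record[col])
    return record
```

Note how the sample row is internally consistent: the originator's old balance (1089.0) minus the amount (1060.31) equals the new balance (28.69), while the merchant destination balances stay at 0.0 as documented.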
There are 5 similar files that contain the run of 5 different scenarios. These files are better explained at my PhD thesis chapter 7 (PhD Thesis Available here http://urn.kb.se/resolve?urn=urn:nbn:se:bth-12932).
We ran PaySim several times using random seeds for 744 steps, representing each hour of one month of real time, matching the original logs. Each run took around 45 minutes on an Intel i7 processor with 16 GB of RAM. The final result of a run contains approximately 24 million financial records divided into the 5 transaction types: CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
This work is part of the research project ”Scalable resource-efficient systems for big data analytics” funded by the Knowledge Foundation (grant: 20140032) in Sweden.
Please refer to this dataset using the following citations:
PaySim first paper of the simulator:
E. A. Lopez-Rojas , A. Elmir, and S. Axelsson. "PaySim: A financial mobile money simulator for fraud detection". In: The 28th European Modeling and Simulation Symposium-EMSS, Larnaca, Cyprus. 2016
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform, with 628 million followers; Instagram's own account ranks first overall, with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young: recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media platforms for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel disorders. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70. A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose here is to collect characteristics of heart attacks, or factors that contribute to them. To accomplish this, a form was created using Microsoft Excel. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. Age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and troponin test results are the input fields, while the output field records the presence of a heart attack, divided into two categories (negative and positive): negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows detailed information, including the maximum and minimum attribute values, for the 1,319 cases in the whole database. To confirm the validity of this data, we examined the patient files in the hospital archive and compared them with the data stored in the laboratory system; we also interviewed the patients and specialized doctors. Table 2 shows a sample of 44 of these cases and the factors that lead to a heart attack. After collecting this data, we checked whether it contained null (invalid) values or errors introduced during data collection. A value is null if it is unknown, and null values necessitate special treatment.
This value indicates that the target isn't a valid data element, and you may come across it when trying to retrieve data that isn't present. If you perform arithmetic operations on a numeric column with one or more null values, the outcome will be null. An example of null-value processing is shown in Figure 2. The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs receive equal attention and to eliminate their dimensionality. Normalizing data before applying AI models has two major advantages. The first is to prevent attributes in larger numeric ranges from overshadowing those in smaller numeric ranges. The second is to avoid numerical problems during the computation. After completing the normalization, we split the data set into two parts, using 1,060 cases for training and 259 for testing. Modeling was then implemented using the input and output variables.
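The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the column values are invented for the example, and the split is shown as a reproducible shuffle, which the text does not specify.

```python
# Hedged sketch of the preprocessing steps above: min-max scaling of a
# numeric column to [0, 1], then a 1,060 / 259 train/test split.
# The heart-rate values below are illustrative, not from the dataset.
import random

def min_max_scale(values):
    """Rescale a sequence of numbers linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(rows, n_train, seed=42):
    """Shuffle reproducibly, then take the first n_train rows for training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

heart_rates = [60, 72, 96, 120]
print(min_max_scale(heart_rates))  # [0.0, 0.2, 0.6, 1.0]

train, test = train_test_split(list(range(1319)), n_train=1060)
print(len(train), len(test))  # 1060 259
```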
This data set is for private use within the competition.
According to IBEF, "Domestic automobiles production increased at 2.36% CAGR between FY16-20 with 26.36 million vehicles being manufactured in the country in FY20. Overall, domestic automobiles sales increased at 1.29% CAGR between FY16-FY20 with 21.55 million vehicles being sold in FY20." The rise in vehicles on the road will bring multiple challenges, and roads will be more vulnerable to accidents. Increased accident rates also lead to more insurance claims and rising payouts for insurance companies.
In order to pre-emptively plan for the losses, insurance firms leverage accident data to understand the risk across geographical units, e.g., postal code, district, etc.
In this challenge, we are providing you with a dataset to predict the "Accident_Risk_Index" against postcodes. Accident_Risk_Index (mean casualties at a postcode) = sum(Number_of_casualities) / count(Accident_ID)
Working example:
Train Data (given)
Accident_ID Postcode Number_of_casualities
1 AL1 1JJ 2
2 AL1 1JP 3
3 AL1 3PS 2
4 AL1 3PS 1
5 AL1 3PS 1
Modelling Train Data (Rolled up at Postcode level)
Postcode Derived_feature1 Derived_feature2 Accident_risk_Index
AL1 1JJ _ _ 2
AL1 1JP _ _ 3
AL1 3PS _ _ 1.33
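The roll-up in the working example above can be sketched in plain Python (pandas or SQL would work equally well). The rows mirror the working example; for each postcode, the index is the sum of casualties divided by the number of accidents.

```python
# Hedged sketch of the postcode-level roll-up described above:
# Accident_Risk_Index = sum(Number_of_casualities) / count(Accident_ID).
from collections import defaultdict

# (Accident_ID, Postcode, Number_of_casualities) — from the working example
rows = [
    (1, "AL1 1JJ", 2),
    (2, "AL1 1JP", 3),
    (3, "AL1 3PS", 2),
    (4, "AL1 3PS", 1),
    (5, "AL1 3PS", 1),
]

totals = defaultdict(lambda: [0, 0])  # postcode -> [sum_casualties, n_accidents]
for _accident_id, postcode, casualties in rows:
    totals[postcode][0] += casualties
    totals[postcode][1] += 1

risk_index = {pc: round(s / n, 2) for pc, (s, n) in totals.items()}
print(risk_index)  # {'AL1 1JJ': 2.0, 'AL1 1JP': 3.0, 'AL1 3PS': 1.33}
```

This reproduces the rolled-up table: AL1 3PS gets (2 + 1 + 1) / 3 = 1.33.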
The participants are required to predict the 'Accident_risk_index' against each postcode in the test data (test.csv).
Then submit your 'my_submission_file.csv' on the submission tab of the hackathon page.
Pro-tip: Participants are required to perform feature engineering to first roll up the train data to postcode level, create a column "accident_risk_index", and optimize the model at postcode level.
Few Hypothesis to help you think: "More accidents happen in the later part of the day as those are office hours causing congestion"
"Postal codes with more single carriage roads have more accidents"
(***From the above hypotheses, features such as office_hours_flag and #single_carriage_roads can be formed)
Additionally, we are providing road network data (containing information on the nearest road to a postcode and its characteristics) and population data (containing information about population at area level). This information is provided for feature augmentation and is not mandatory to use.
The provided dataset contains the following files:
train.csv & test.csv:
'Accident_ID', 'Police_Force', 'Number_of_Vehicles', 'Number_of_Casualties', 'Date', 'Day_of_Week', 'Time', 'Local_Authority_(District)', 'Local_Authority_(Highway)', '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit', '2nd_Road_Class', '2nd_Road_Number', 'Pedestrian_Crossing-Human_Control', 'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions', 'Special_Conditions_at_Site', 'Carriageway_Hazards', 'Urban_or_Rural_Area', 'Did_Police_Officer_Attend_Scene_of_Accident', 'state', 'postcode', 'country'
population.csv:
'postcode', 'Rural Urban', 'Variable: All usual residents; measures: Value', 'Variable: Males; measures: Value', 'Variable: Females; measures: Value', 'Variable: Lives in a household; measures: Value', 'Variable: Lives in a communal establishment; measures: Value', 'Variable: Schoolchild or full-time student aged 4 and over at their non term-time address; measures: Value', 'Variable: Area (Hectares); measures: Value', 'Variable: Density (number of persons per hectare); measures: Value'
roads_network.csv:
'WKT', 'roadClassi', 'roadFuncti', 'formOfWay', 'length', 'primaryRou', 'distance to the nearest point on rd', 'postcode'
Overview: Swiss Re is one of the largest reinsurers in the world, headquartered in Zurich with offices in over 25 countries. Swiss Re's core expertise is underwriting in the life, health, and property and casualty insurance space, while its tech strategy focuses on developing smarter, innovative solutions for clients' value chains by leveraging data and technology.
The company's vision is to make the world more resilient. Swiss Re believes in applying fresh perspectives, knowledge and capital to anticipate and manage risk, create smarter solutions, and help the world rebuild, renew and move forward. About 1,300 professionals working in the Swiss Re Global Business Solutions Center (BSC), Bangalore combine experience, expertise and out-of-the-box thinking to bring Swiss Re's core business to life by creating new business opportunities.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This map is part of SDGs Today. Please see sdgstoday.org. Ookla believes that good connectivity should not be a scarce resource. Everything we do is focused on providing objective, accurate performance data and insights to improve connectivity for all. Hundreds of millions of people worldwide use Speedtest® to measure their internet connection. With over 11 million consumer-initiated tests taken daily and billions of data points gathered, Ookla® data paints a clear picture of the performance, quality, and availability of networks around the world. Through our Ookla for Good™ program, Ookla's open datasets are available on a complimentary basis to help like-minded people make informed decisions around internet connectivity, policy, development, education, disaster response, public health, and economic growth. This dataset contains global results from Ookla Speedtest Intelligence® data. These results are aggregated to Web Mercator tiles at zoom level z=16 (which equates to roughly 610.8 meters by 610.8 meters at the equator). Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results with GPS-quality location accuracy. The tiles are aggregated at a quarterly level, beginning in Q1 2019 up until the most recently completed quarter. This map shows tiles aggregated at the administrative 0 and 1 levels.
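The tile-size figure quoted above follows from the Web Mercator scheme: at zoom level z, the projected world (the Earth's equatorial circumference, about 40,075 km) is split into 2**z tiles per side. A quick sketch of that arithmetic (the exact circumference constant, and hence the small difference from the 610.8 m quoted, is an assumption):

```python
# Sketch of the Web Mercator tile-size arithmetic: at zoom z, each tile
# spans circumference / 2**z meters at the equator. At z=16 this comes to
# roughly 611 m; the text quotes 610.8 m, presumably from a slightly
# different circumference value.
EARTH_CIRCUMFERENCE_M = 40_075_016.686  # WGS84 equatorial circumference

def tile_edge_m(zoom: int) -> float:
    """Edge length in meters of one Web Mercator tile at the equator."""
    return EARTH_CIRCUMFERENCE_M / 2 ** zoom

print(round(tile_edge_m(16), 1))
```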
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This layer is part of SDGs Today. Please see sdgstoday.org. Ookla believes that good connectivity should not be a scarce resource. Everything we do is focused on providing objective, accurate performance data and insights to improve connectivity for all. Hundreds of millions of people worldwide use Speedtest® to measure their internet connection. With over 11 million consumer-initiated tests taken daily and billions of data points gathered, Ookla® data paints a clear picture of the performance, quality, and availability of networks around the world. Through our Ookla for Good™ program, Ookla's open datasets are available on a complimentary basis to help like-minded people make informed decisions around internet connectivity, policy, development, education, disaster response, public health, and economic growth. This dataset contains global results from Ookla Speedtest Intelligence® data. These results are aggregated to Web Mercator tiles at zoom level z=16 (which equates to roughly 610.8 meters by 610.8 meters at the equator). Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results with GPS-quality location accuracy. The tiles are aggregated at a quarterly level, beginning in Q1 2019 up until the most recently completed quarter. This dashboard shows tiles aggregated at the administrative 1 level.