http://rightsstatements.org/vocab/InC/1.0/
This competition involves advertisement data provided by BuzzCity Pte. Ltd. BuzzCity is a global mobile advertising network that reaches millions of consumers around the world on mobile phones and devices. In Q1 2012, over 45 billion ad banners were delivered across the BuzzCity network, which consists of more than 10,000 publisher sites that reach an average of over 300 million unique users per month. The number of smartphones active on the network has also grown significantly: smartphones now account for more than 32% of the phones that are served advertisements across the BuzzCity network. The "raw" data used in this competition comes in two parts, a publisher database and a click database, both provided in CSV format. The publisher database records each publisher's (aka partner's) profile and comprises several fields:
publisherid - Unique identifier of a publisher.
Bankaccount - Bank account associated with a publisher (may be empty).
address - Mailing address of a publisher (obfuscated; may be empty).
status - Label of a publisher, which can be one of the following:
    "OK" - Publishers whom BuzzCity deems as having healthy traffic (or those who slipped past its detection mechanisms).
    "Observation" - Publishers who may have just started their traffic or whose traffic statistics deviate from the system-wide average. BuzzCity does not yet have a conclusive stance on these publishers.
    "Fraud" - Publishers who are deemed fraudulent with clear proof. BuzzCity suspends their accounts and their earnings will not be paid.
The click database, on the other hand, records the click traffic and has several fields:
id - Unique identifier of a particular click
numericip - Public IP address of a clicker/visitor
deviceua - Phone model used by a clicker/visitor
publisherid - Unique identifier of a publisher
adscampaignid - Unique identifier of a given advertisement campaign
usercountry - Country the visitor is from
clicktime - Timestamp of a given click (in YYYY-MM-DD format)
publisherchannel - Publisher's channel type, which can be one of the following:
    ad - Adult sites
    co - Community
    es - Entertainment and lifestyle
    gd - Glamour and dating
    in - Information
    mc - Mobile content
    pp - Premium portal
    se - Search, portal, services
referredurl - URL where the ad banners were clicked (obfuscated; may be empty). More details about the HTTP Referer protocol can be found in this article.
Related Publication: R. J. Oentaryo, E.-P. Lim, M. Finegold, D. Lo, F.-D. Zhu, C. Phua, E.-Y. Cheu, G.-E. Yap, K. Sim, M. N. Nguyen, K. Perera, B. Neupane, M. Faisal, Z.-Y. Aung, W. L. Woon, W. Chen, D. Patel, and D. Berrar. (2014). Detecting click fraud in online advertising: A data mining approach, Journal of Machine Learning Research, 15, 99-140.
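As a rough illustration of how the click-database fields might be used, the sketch below aggregates clicks per publisher and counts distinct IP addresses. The sample rows, the `SAMPLE_CLICKS` constant, and the `summarize_clicks` helper are hypothetical; only the column names come from the field list above, and this is not the method used in the competition or the related publication.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; only the column names follow the field list above.
SAMPLE_CLICKS = """id,numericip,deviceua,publisherid,adscampaignid,usercountry,clicktime,publisherchannel,referredurl
1,1001,PhoneA,pub1,camp1,sg,2012-02-09,in,
2,1001,PhoneA,pub1,camp1,sg,2012-02-09,in,
3,1002,PhoneB,pub1,camp2,id,2012-02-09,in,
4,2001,PhoneC,pub2,camp1,in,2012-02-10,es,
"""


def summarize_clicks(csv_text):
    """Return {publisherid: {"clicks": n, "unique_ips": m}} for the given CSV text."""
    clicks = defaultdict(int)
    ips = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pub = row["publisherid"]
        clicks[pub] += 1
        ips[pub].add(row["numericip"])
    return {
        pub: {"clicks": clicks[pub], "unique_ips": len(ips[pub])}
        for pub in clicks
    }


summary = summarize_clicks(SAMPLE_CLICKS)
```

A high ratio of clicks to unique IPs is one simple signal an analyst might inspect first; the summaries could then be joined to the publisher database on publisherid.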
The dataset contains information, divided by month, on the accesses made to the online services offered by the citizen's file and provided by the municipality of Milan. The pageviews column represents the total number of web pages displayed within the time frame used. The visitors column represents the total number of unique visitors who have accessed the web pages. A unique visitor is a visitor counted only once within the time frame used.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains information, divided by month, on the accesses made to the online services offered by the institutional portal and provided by the municipality of Milan. The pageviews column represents the total number of web pages displayed within the time frame used. The visits column represents the total number of visits made within the time frame used. The visitors column represents the total number of unique visitors who have accessed the web pages. A unique visitor is a visitor counted only once within the time frame used.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains 10,000 rows of marketing interaction data, designed to support multi-touch attribution (MTA) analysis. It records user interactions with various marketing channels and campaigns over a two-day period (February 10-11, 2025), along with conversion outcomes. The dataset is structured to help marketers analyze how different touchpoints contribute to customer conversions.
Purpose
The dataset is useful for:
• Multi-touch attribution modelling: Understanding the impact of each touchpoint in the customer journey.
• Marketing performance analysis: Evaluating the effectiveness of different marketing channels and campaigns.
• Machine learning applications: Training models to predict user conversion likelihood based on interaction patterns.
Data Structure
The dataset consists of five columns, described below:
• User ID: A unique identifier for each customer.
• Timestamp: The exact date and time of the interaction.
• Channel: The marketing channel where the interaction occurred.
• Campaign: The specific marketing campaign associated with the interaction. ‘-’ indicates no campaign.
• Conversion: Indicates whether the user converted (Yes) or not (No).
Key Insights:
• Unique Users: 2,847
• Most Frequent Channel: Direct Traffic (~17.2%)
• Campaign Involvement: 31.3% of interactions had no campaign assigned.
• Conversion Rate: 49.44% of interactions resulted in a conversion.
Potential Use Cases:
• Identifying the most influential marketing channels in driving conversions.
• Using machine learning algorithms to predict user conversion probability.
• Comparing rule-based attribution models (e.g., linear, time decay) with data-driven approaches (e.g., Markov Chains, Shapley Value).
This dataset is well-suited for marketing analytics, machine learning experiments, and data-driven decision-making.
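To make the rule-based attribution idea concrete, here is a minimal sketch of linear attribution over rows shaped like the columns described above. The sample rows, the channel names other than Direct Traffic, and the `linear_attribution` helper are hypothetical. The sketch also assumes that a user counts as converted if any of their interactions has Conversion set to Yes, and that every touchpoint of that user shares one unit of credit equally.

```python
from collections import defaultdict

# Hypothetical interactions: (user_id, timestamp, channel, conversion)
rows = [
    ("u1", "2025-02-10 09:00:00", "Email", "No"),
    ("u1", "2025-02-10 12:00:00", "Direct Traffic", "Yes"),
    ("u2", "2025-02-10 10:00:00", "Email", "No"),
    ("u3", "2025-02-11 08:00:00", "Social Media", "Yes"),
]


def linear_attribution(rows):
    """Split one unit of credit equally across each converting user's touchpoints."""
    journeys = defaultdict(list)
    converted = set()
    # Sort by user and timestamp so each journey is in chronological order.
    for user, _ts, channel, conversion in sorted(rows, key=lambda r: (r[0], r[1])):
        journeys[user].append(channel)
        if conversion == "Yes":
            converted.add(user)
    credit = defaultdict(float)
    for user in converted:
        path = journeys[user]
        for channel in path:
            credit[channel] += 1.0 / len(path)
    return dict(credit)


credit = linear_attribution(rows)
```

A time-decay variant would replace the equal share `1.0 / len(path)` with weights that grow toward the most recent touchpoint and sum to one per journey.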
Social media companies are starting to offer users the option to subscribe to their platforms in exchange for monthly fees. Until recently, social media has been predominantly free to use, with tech companies relying on advertising as their main revenue generator. However, advertising revenues have been dropping following the COVID-induced boom. As of July 2023, Meta Verified is the most costly of the subscription services, setting users back almost 15 U.S. dollars per month on iOS or Android. Twitter Blue costs between eight and 11 U.S. dollars per month and ensures users will receive the blue check mark, and have the ability to edit tweets and have NFT profile pictures. Snapchat+, drawing in four million users as of the second quarter of 2023, boasts a Story re-watch function, custom app icons, and a Snapchat+ badge.
Individual visits to El Pueblo museums, per month. *The Museum of Social Justice is an independently operated museum, and reopened to the public May 2021. All El Pueblo-operated museums partially reopened June 10, 2021.
Author: Víctor Yeste. Universitat Politècnica de València.
The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and to explore the possible prediction of the selected success variables.
In this case, because data from two separate areas had to be integrated (web publishing, and the analysis of its shares and related topics on Twitter), the data were collected programmatically by accessing both the Google Analytics v4 reporting API and the Twitter Standard API, always respecting their limits.
The website analyzed is hellofriki.com. It is an online media outlet whose primary intention is to meet the need for information on topics that generate a vast amount of news every day, as well as to offer analyses, reports, interviews, and many other information formats. All of this content falls under the sections of cinema, series, video games, literature, and comics.
This dataset has contributed to the elaboration of the PhD thesis:
Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009
Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis.
All related data are stored in a database, divided into the following tables:
tesis_followers: user ID list of media account followers.
tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.
- status_id: tweet ID
- created_at: date of publication
- text: content of the tweet
- path: URL extracted after processing the shortened URL in text
- post_shared: ID of the WordPress article being shared
- retweet_count: number of retweets
- favorite_count: number of favorites
tesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web (other typologies, automatic Facebook shares, custom tweets without a link to an article, etc.). It has the same fields as tesis_hometimeline.
tesis_posts: data of articles published by the web and processed for some analysis.
- stats_id: analysis ID
- post_id: article ID in WordPress
- post_date: article publication date in WordPress
- post_title: title of the article
- path: URL of the article on the media website
- tags: IDs of the WordPress tags related to the article
- uniquepageviews: unique page views
- entrancerate: entrance rate
- avgtimeonpage: average time on page
- exitrate: exit rate
- pageviewspersession: page views per session
- adsense_adunitsviewed: number of ads viewed by users
- adsense_viewableimpressionpercent: ad display ratio
- adsense_ctr: ad click ratio
- adsense_ecpm: estimated ad revenue per 1000 page views
tesis_stats: data from a particular analysis, performed for each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but totals and averages are saved for faster and easier further processing.
- id: ID of the analysis
- phase: phase of the thesis in which the analysis was carried out (right now all are 1)
- time: "0" if at the time of publication, "1" if 14 days later
- start_date: date and time of the measurement on the day of publication
- end_date: date and time of the measurement made 14 days later
- main_post_id: ID of the published article to be analyzed
- main_post_theme: main section of the published article to be analyzed
- superheroes_theme: "1" if about superheroes, "0" if not
- trailer_theme: "1" if trailer, "0" if not
- name: empty field that allows adding a custom name manually
- notes: empty field that allows adding custom notes manually, for example when a tag has been removed manually for being considered too generic even though the editor added it
- num_articles: number of articles analyzed
- num_articles_with_traffic: number of articles analyzed with traffic (which will be taken into account for traffic analysis)
- num_articles_with_tw_data: number of articles with data from when they were shared on the media's Twitter account
- num_terms: number of terms analyzed
- uniquepageviews_total: total page views
- uniquepageviews_mean: average page views
- entrancerate_mean: average entrance rate
- avgtimeonpage_mean: average duration of visits
- exitrate_mean: average exit rate
- pageviewspersession_mean: average page views per session
- total: total of ads viewed
- adsense_adunitsviewed_mean: average of ads viewed
- adsense_viewableimpressionpercent_mean: average ad display ratio
- adsense_ctr_mean: average ad click ratio
- adsense_ecpm_mean: estimated ad revenue per 1000 page views
- Total: total income
- retweet_count_mean: average retweets
- favorite_count_total: total of favorites
- favorite_count_mean: average of favorites
- terms_ini_num_tweets: total tweets on the terms on the day of publication
- terms_ini_retweet_count_total: total retweets on the terms on the day of publication
- terms_ini_retweet_count_mean: average retweets on the terms on the day of publication
- terms_ini_favorite_count_total: total of favorites on the terms on the day of publication
- terms_ini_favorite_count_mean: average of favorites on the terms on the day of publication
- terms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publication
- terms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publication
- terms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publication
- terms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publication
- terms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms on the day of publication
- terms_end_num_tweets: total tweets on the terms 14 days after publication
- terms_end_retweet_count_total: total retweets on the terms 14 days after publication
- terms_end_retweet_count_mean: average retweets on the terms 14 days after publication
- terms_end_favorite_count_total: total of favorites on the terms 14 days after publication
- terms_end_favorite_count_mean: average of favorites on the terms 14 days after publication
- terms_end_followers_talking_rate: ratio of followers of the media Twitter account who have recently posted a tweet talking about the terms 14 days after publication
- terms_end_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publication
- terms_end_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publication
- terms_end_user_age_mean: average age in days of users who have spoken of the terms 14 days after publication
- terms_end_ur_inclusion_rate: URL inclusion ratio of tweets talking about the terms 14 days after publication
tesis_terms: data of the terms (tags) related to the processed articles.
- stats_id: analysis ID
- time: "0" if at the time of publication, "1" if 14 days later
- term_id: term ID (tag) in WordPress
- name: name of the term
- slug: URL of the term
- num_tweets: number of tweets
- retweet_count_total: total retweets
- retweet_count_mean: average retweets
- favorite_count_total: total of favorites
- favorite_count_mean: average of favorites
- followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the term
- user_num_followers_mean: average followers of users who were talking about the term
- user_num_tweets_mean: average number of tweets published by users who were talking about the term
- user_age_mean: average age in days of users who were talking about the term
- url_inclusion_rate: URL inclusion ratio
This Location Data & Foot traffic dataset, available for all countries, includes enriched raw mobility data and visitation at POIs to answer questions such as:
-How often do people visit a location? (daily, monthly, absolute, and averages).
-What type of places do they visit? (parks, schools, hospitals, etc.)
-Which social characteristics do people at a certain POI have? Breakdown by type: residents, workers, visitors.
-What's their mobility like during night hours and day hours?
-What's the frequency of visits, partitioned by day of the week and hour of the day?
Extra insights
-Visitors' relative income level.
-Visitors' preferences as derived from their visits to shopping, parks, sports facilities, churches, among others.
Overview & Key Concepts
Each record corresponds to a ping from a mobile device, at a particular moment in time and at a particular latitude and longitude. We procure this data from reliable technology partners, which obtain it through partnerships with location-aware apps. The entire process complies with applicable privacy laws.
We clean and process these massive datasets with a number of complex, compute-intensive calculations to make them easier to use in different data science and machine learning applications, especially those related to understanding customer behavior.
Featured attributes of the data
Device speed: based on the distance between each observation and the previous one, we estimate the speed at which the device is moving. This is particularly useful to differentiate between vehicles, pedestrians, and stationary observations.
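The speed estimate described here can be sketched as the great-circle distance between consecutive pings divided by the elapsed time. The ping layout `(unix_seconds, lat, lon)`, the function names, and the classification thresholds below are illustrative assumptions, not the provider's actual pipeline.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speeds_kmh(pings):
    """pings: list of (unix_seconds, lat, lon) sorted by time.

    Returns one speed per consecutive pair, or None when no time has elapsed.
    """
    out = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(pings, pings[1:]):
        dt = t1 - t0
        if dt <= 0:
            out.append(None)
            continue
        out.append(haversine_m(lat0, lon0, lat1, lon1) / dt * 3.6)
    return out


def classify(speed_kmh):
    """Label a speed; the 1 km/h and 8 km/h cut-offs are assumed for illustration."""
    if speed_kmh is None:
        return "unknown"
    if speed_kmh < 1:
        return "stationary"
    if speed_kmh < 8:
        return "pedestrian"
    return "vehicle"


# Hypothetical pings along the equator, one minute apart.
pings = [(0, 0.0, 0.0), (60, 0.0, 0.001), (120, 0.0, 0.001), (180, 0.0, 0.011)]
labels = [classify(s) for s in speeds_kmh(pings)]
```

In practice GPS jitter makes single-pair speeds noisy, so smoothing over several pings before classifying would likely be needed.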
Night base of the device: we calculate the approximate location where the device spends the night, which is usually its owner's home neighborhood.
Day base of the device: we calculate the most common daylight location during weekdays, which is usually its owner's work location.
Income level: we use the night neighborhood of the device, and intersect it with available socioeconomic data, to infer the device’s income level. Depending on the country, and the availability of good census data, this figure ranges from a relative wealth index to a currency-calculated income.
POI visited: we intersect each observation with a number of POI databases, to estimate check-ins to different locations. POI databases can vary significantly, in scope and depth, between countries.
Category of visited POI: for each observation that can be attributable to a POI, we also include a standardized location category (park, hospital, among others).
Coverage: Worldwide.
Delivery schemas
We can deliver the data in three different formats:
Full dataset: one record per mobile ping. These datasets are very large, and should only be consumed by experienced teams with large computing budgets.
Visitation stream: one record per attributable visit. This dataset is considerably smaller than the full one but retains most of its more valuable elements. It helps identify who visited a specific POI and characterize consumer behavior.
Audience profiles: one record per mobile device in a given period of time (usually monthly). The entire visitation stream is aggregated by category. This is the most condensed version of the dataset and is very useful to quickly understand the types of consumers in a particular area and to create cohorts of users.
The All CMS Data Feeds dataset is an expansive resource offering access to 119 unique report feeds, providing in-depth insights into various aspects of the U.S. healthcare system, including contact data for nursing facility owners and accountable care organization participants. With over 25.8 billion rows of data meticulously collected since 2007, this dataset is invaluable for healthcare professionals, analysts, researchers, and businesses seeking to understand and analyze healthcare trends, performance metrics, and demographic shifts over time. The dataset is updated monthly, ensuring that users always have access to the most current and relevant data available.
Dataset Overview:
118 Report Feeds: - The dataset includes a wide array of report feeds, each providing unique insights into different dimensions of healthcare. These topics range from Medicare and Medicaid service metrics, patient demographics, provider information, financial data, and much more. The breadth of information ensures that users can find relevant data for nearly any healthcare-related analysis. - As CMS releases new report feeds, they are automatically added to this dataset, keeping it current and expanding its utility for users.
25.8 Billion Rows of Data:
Historical Data Since 2007: - The dataset spans from 2007 to the present, offering a rich historical perspective that is essential for tracking long-term trends and changes in healthcare delivery, policy impacts, and patient outcomes. This historical data is particularly valuable for conducting longitudinal studies and evaluating the effects of various healthcare interventions over time.
Monthly Updates:
Data Sourced from CMS:
Use Cases:
Market Analysis:
Healthcare Research:
Performance Tracking:
Compliance and Regulatory Reporting:
Data Quality and Reliability:
The All CMS Data Feeds dataset is designed with a strong emphasis on data quality and reliability. Each row of data is meticulously cleaned and aligned, ensuring that it is both accurate and consistent. This attention to detail makes the dataset a trusted resource for high-stakes applications, where data quality is critical.
Integration and Usability:
Ease of Integration:
Our Activity dataset reveals real-world behavior through detailed foot traffic metrics around POIs in the US, Canada, and Mexico — all GDPR-compliant and non-PII.
By capturing total visits, unique visitors, and frequency of visits, our dataset enables a precise view of how consumers move, behave, and engage with locations over time — helping brands uncover demand, evaluate performance, and outpace competitors.
Key data points include:
- Total visits, unique visitors, and visit frequency
- Daily, weekly, monthly, and quarterly aggregation
- Movement patterns around and within trade areas
- Cleaned, normalized, and updated daily
- Non-PII, GDPR-compliant location intelligence
Ideal for demand sensing, competitive benchmarking, and performance analysis, this dataset helps retail, real estate, and investment teams unlock powerful insights across North America.
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After the fourth consecutive year of growth, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has increased continuously over the past years. User figures, shown here for the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads and traffic data. They refer to the average monthly active users over the period and count multiple accounts held by one person only once. The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information).
On 14 August 2025, corrections were made to the museums and galleries monthly visits data tables. Corrections were made to the DCMS total and the museum group totals for September 2020, November 2022 and January 2025 and to the Museum of the Home data for July, September, October and November 2024. These errors were identified on the day of publication and revised tables were uploaded the same day.
14 August 2025
England
Quarterly
Between April and June 2025, there were approximately 11.1 million visits to DCMS sponsored museums and galleries. Overall visits were similar (0.3% higher) to the equivalent period last year (when comparing museums open in both time periods). Overall visits were 14% lower than the equivalent period pre-pandemic in 2019 (wh
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide, and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America had the highest average time spent per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
The data on the use of the datasets on the OGD portal BL (data.bl.ch) are collected and published by the specialist and coordination office OGD BL.
Contains the day the usage was measured.
dataset_title: The title of the dataset.
dataset_id: The technical ID of the dataset.
visitors: Specifies the number of daily visitors to the dataset. Visitors are recorded by counting the unique IP addresses that recorded access on the day of the survey. The IP address represents the network address of the device from which the portal was accessed.
interactions: Includes all interactions with any dataset on data.bl.ch. A visitor can trigger multiple interactions. Interactions include clicks on the website (searching datasets, filters, etc.) as well as API calls (downloading a dataset as a JSON file, etc.).
Remarks
Only calls to publicly available datasets are shown.
IP addresses and interactions of users with a login of the Canton of Basel-Landschaft, in particular of employees of the specialist and coordination office OGD, are removed from the dataset before publication and are therefore not shown.
Calls from actors that are clearly identifiable as bots by the user agent header are also not shown.
Combinations of dataset and date for which no use occurred (Visitors == 0 & Interactions == 0) are not shown.
Due to synchronization problems, data for individual days may be missing.
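The counting rules above can be sketched as follows. The log layout `(date, dataset_id, ip, user_agent)`, the sample records, and the `is_bot` rule are hypothetical, and the sketch assumes that interactions are counted per dataset and day; the portal's actual processing is not described in that detail.

```python
from collections import defaultdict

# Hypothetical access-log records: (date, dataset_id, ip, user_agent)
log = [
    ("2024-01-01", "10010", "192.0.2.1", "Mozilla/5.0"),
    ("2024-01-01", "10010", "192.0.2.1", "Mozilla/5.0"),
    ("2024-01-01", "10010", "192.0.2.2", "Mozilla/5.0"),
    ("2024-01-01", "10010", "192.0.2.3", "ExampleBot/1.0"),
    ("2024-01-02", "10020", "192.0.2.1", "curl/8.0"),
]


def is_bot(user_agent):
    """Crude illustrative rule: treat any user agent containing 'bot' as a bot."""
    return "bot" in user_agent.lower()


def daily_usage(log):
    """Return {(date, dataset_id): {"visitors": unique IPs, "interactions": hits}}."""
    ips = defaultdict(set)
    interactions = defaultdict(int)
    for date, dataset_id, ip, user_agent in log:
        if is_bot(user_agent):
            continue  # bot calls are not shown
        key = (date, dataset_id)
        ips[key].add(ip)
        interactions[key] += 1
    # Combinations with no use never get a key, so they are omitted automatically.
    return {
        key: {"visitors": len(ips[key]), "interactions": interactions[key]}
        for key in interactions
    }


usage = daily_usage(log)
```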
To cover the existing gap in behavioral datasets that model users' interactions with single and multiple devices in a Smart Office, with the aim of later authenticating those users continuously, we publish the following collection of datasets. It was generated by having five users interact with their personal computers and mobile devices for 60 days. A brief description of each dataset follows.
Dataset 1 (2.3 GB). This dataset contains 92975 feature vectors (8096 features per vector) that model the interactions of the five users with their personal computers. Each vector contains aggregated data about keyboard and mouse activity, as well as application usage statistics. More information about the meaning of the features can be found in the readme file. Originally, this dataset had 24 065 features, but filtering out the constant features reduced that number to 8096. Many features were constantly 0 because every possible digraph (two-key combination) was considered when collecting the data. However, many unusual digraphs were never typed by the users, so these features were deleted from the uploaded dataset.
Dataset 2 (8.9 MB). This dataset contains 61918 feature vectors (15 features per vector) that model the interactions of the five users with their mobile devices. Each vector contains aggregated data about application usage statistics. More information about the meaning of the features can be found in the readme file.
Dataset 3 (28.9 MB). This dataset contains 133590 feature vectors (42 features per vector) that model the interactions of the five users with their mobile devices. Each vector contains aggregated data about the gyroscope and accelerometer sensors. More information about the meaning of the features can be found in the readme file.
Dataset 4 (162.4 MB). This dataset contains 145465 feature vectors (241 features per vector) that model the interactions of the five users with both personal computers and mobile devices. Each vector contains the aggregation of the most relevant features of both devices. More information about the meaning of the features can be found in the readme file.
Dataset 5 (878.7 KB). This dataset is composed of 7 datasets. Each one of them contains an aggregation of feature vectors generated from the active/inactive intervals of personal computers and mobile devices by considering different time windows ranging from 1h to 24h.
1h: 4074 vectors
2h: 2149 vectors
3h: 1470 vectors
4h: 1133 vectors
6h: 770 vectors
12h: 440 vectors
24h: 229 vectors
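The constant-feature filtering mentioned for Dataset 1 (24 065 features reduced to 8096) can be illustrated with a small sketch. The `drop_constant_features` helper and the toy vectors are hypothetical; the authors' actual filtering code is not part of this description.

```python
def drop_constant_features(vectors):
    """Remove columns whose value never changes across all vectors.

    Returns (filtered_vectors, kept_indices).
    """
    if not vectors:
        return [], []
    first = vectors[0]
    # Keep column j only if at least one row differs from the first row there.
    kept = [
        j for j in range(len(first))
        if any(row[j] != first[j] for row in vectors)
    ]
    return [[row[j] for j in kept] for row in vectors], kept


# Toy example: columns 0 and 2 are constant (e.g. digraphs that were never typed).
vectors = [
    [0, 3, 0, 1],
    [0, 5, 0, 1],
    [0, 4, 0, 2],
]
filtered, kept = drop_constant_features(vectors)
```

Keeping the list of retained indices makes it possible to apply the same column selection to new vectors collected later.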
Motivation
This dataset is derived and cleaned from the full PULSE project dataset in order to share the data gathered about users during the project.
Disclaimer
Any third party must respect ethics rules and GDPR and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to the data being exploited. A link to the project's website should also be provided: http://www.project-pulse.eu/
The data provided in the files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.
Description of the dataset
The only difference with the original dataset comes from anonymised user information.
The dataset content is described in a dedicated JSON file:
{
"citizen_id": "pseudonymized unique key of each citizen user in the PULSE system",
"city_code": {
"description": "3-letter city codes taken by convention from IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). Exception is Pavia which does not have its own airport,and nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).",
"BCN": "Barcelona",
"BHX": "Birmingham",
"NYC": "New York",
"PAI": "Pavia",
"PAR": "Paris",
"SIN": "Singapore",
"TPE": "Keelung(Taipei)"
},
"zip_code": "Zip or postal code (area) within a city, basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)",
"models": {
"asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1",
"asthma_risk_score_category": {
"description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:",
"low": "low asthma risk, score value below 0,05",
"medium-low": "medium-low asthma risk, score value from 0,05 and below 0,1",
"medium": "medium asthma risk, score value from 0,1 and below 0,15",
"medium-high": "medium-high asthma risk, score value from 0,15 and below 0,2",
"high": "high asthma risk, score value from 0,2 and higher"
},
"T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1",
"T2D_risk_score_category": {
"description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:",
"low": "low T2D risk, score value below 0,05",
"medium-low": "medium-low T2D risk, score value from 0,05 and below 0,1",
"medium": "medium T2D risk, score value from 0,1 and below 0,15",
"medium-high": "medium-high T2D risk, score value from 0,15 and below 0,2",
"high": "high T2D risk, score value from 0,2 and below 0,25",
"very_high": "very high T2D risk, score value from 0,25 and higher"
},
"well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5",
"well-being_score_category": {
"description": "Categorized value of the PULSE well-being model score, with the following possible category options:",
"low": "low well-being, score value below -0,37",
"medium-low": "medium-low well-being, score value from -0,37 and below 0,04",
"medium-high": "medium-high well-being, score value from 0,04 and below 0,36",
"high": "high well-being, score value from 0,36 and higher"
},
"computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
}
}
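The category thresholds listed in the data dictionary above can be applied to raw scores with a simple interval lookup. The following is a minimal sketch, not the PULSE implementation: the constant names and the `categorize` function are hypothetical, and the boundaries are copied from the category descriptions, reading each range as including its lower bound and excluding its upper bound.

```python
from bisect import bisect_right

# Category boundaries copied from the data dictionary descriptions.
# Each boundary is the inclusive lower bound of the next category.
ASTHMA_BOUNDS = [0.05, 0.1, 0.15, 0.2]
ASTHMA_LABELS = ["low", "medium-low", "medium", "medium-high", "high"]

T2D_BOUNDS = [0.05, 0.1, 0.15, 0.2, 0.25]
T2D_LABELS = ["low", "medium-low", "medium", "medium-high", "high", "very_high"]

WELLBEING_BOUNDS = [-0.37, 0.04, 0.36]
WELLBEING_LABELS = ["low", "medium-low", "medium-high", "high"]


def categorize(score, bounds, labels):
    """Return the label of the half-open interval [lower, upper) containing score.

    Hypothetical helper: bisect_right counts how many boundaries are <= score,
    which is exactly the index of the matching label.
    """
    if len(labels) != len(bounds) + 1:
        raise ValueError("need exactly one more label than boundaries")
    return labels[bisect_right(bounds, score)]
```

For example, `categorize(0.12, ASTHMA_BOUNDS, ASTHMA_LABELS)` falls in the range from 0.1 to below 0.15 and so maps to `"medium"`.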
Our Mobility dataset reveals real-world movement patterns by linking visits and visitors to more than 82M POIs, helping businesses decode foot traffic, brand engagement, and cross-visitation trends.
Built for actionable insights, this GDPR-compliant dataset enables companies to analyze how people interact with places, from customer loyalty to dwell time and visit frequency, with monthly or quarterly updates to ensure reliability.
Key data points include:
- Visit counts and unique visitors
- Dwell time and visit frequency
- Cross-visitation patterns
- Foot traffic trends over time
- GDPR-compliant, non-PII data
Covering millions of commercial locations globally, this dataset powers market research, retail site analysis, customer journey modeling, and investment decisions.
https://object-store.os-api.cci2.ecmwf.int:443/cci2-prod-catalogue/licences/cc-by/cc-by_f24dc630aa52ab8c52a0ac85c03bc35e0abc850b4d7453bdc083535b41d5a5c3.pdf
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis. Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations, and when going further back in time, to allow for the ingestion of improved versions of the original observations, which all benefit the quality of the reanalysis product. ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread. ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). 
If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; if this occurs, users are notified. So far this has only been the case for September 2021, and it will also be the case for October, November and December 2021. For months prior to September 2021 the final release has always been equal to ERA5T, and the goal is to align the two again after December 2021. The data set presented here is a regridded subset of the full ERA5 data set on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper air fields) and single levels (atmospheric, ocean-wave and land surface quantities). The present entry is "ERA5 monthly mean data on pressure levels from 1940 to present".
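To give a feel for the size of the regridded fields, the sketch below computes the number of grid points on a regular lat-lon grid for a given resolution. It rests on an assumption not stated in the description: that latitude rows include both poles (90 to -90 degrees) and that longitude columns cover the full circle without repeating the first meridian. The function name `regular_grid_shape` is hypothetical.

```python
def regular_grid_shape(resolution_deg):
    """Return (n_lat, n_lon) for a global regular lat-lon grid.

    Assumed layout: latitudes run from 90 to -90 inclusive (both poles),
    longitudes run from 0 up to, but not including, 360.
    """
    if resolution_deg <= 0:
        raise ValueError("resolution must be positive")
    n_lat = round(180 / resolution_deg) + 1  # +1 because both poles are included
    n_lon = round(360 / resolution_deg)      # 360 degrees wraps back onto 0
    return n_lat, n_lon


for res in (0.25, 0.5, 1.0):
    n_lat, n_lon = regular_grid_shape(res)
    print(f"{res} degrees: {n_lat} x {n_lon} = {n_lat * n_lon} points per level")
```

Under that assumption, halving the grid spacing roughly quadruples the number of points per field, which is worth keeping in mind when choosing between the reanalysis and the coarser uncertainty estimate.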
This Mobility & Foot traffic dataset includes enriched mobility data and visitation at POIs to answer questions such as:
- How often do people visit a location? (daily, monthly, absolute, and averages).
- What type of places do they visit? (parks, schools, hospitals, etc.).
- What are the social characteristics of people at a given POI? (breakdown by type: residents, workers, visitors).
- What is their mobility like during night hours and day hours?
- What is the frequency of visits by day of the week and hour of the day?
Extra insights
- Visitors' relative income level.
- Footfall measurement in all types of establishments (shopping malls, stand-alone stores, etc.).
- Visitors' preferences as derived from their visits to shopping venues, parks, sports facilities, and churches, among others.
- Origin/destination matrix.
- Vehicular traffic, measurement of speed, types of vehicles, among other insights.
Overview & Key Concepts
Each record corresponds to a ping from a mobile device at a particular moment in time and at a particular latitude and longitude. We procure this data from reliable technology partners, which obtain it through partnerships with location-aware apps. The entire process complies with GDPR and all applicable privacy laws.
We clean, process, and enrich these massive datasets with a number of complex, compute-intensive calculations to make them easier to use in tailor-made solutions for companies and in data science and machine learning applications, especially those related to understanding customer behavior.
Featured attributes of the data
Device speed: based on the distance between each observation and the previous one, we estimate the speed at which the device is moving. This is particularly useful to differentiate between vehicles, pedestrians, and stationary observations.
Night base of the device: we calculate the approximate location of where the device spends the night, which is usually its home neighborhood.
Day base of the device: we calculate the most common daylight location during weekdays, which is usually its work location.
Income level: we use the night neighborhood of the device, and intersect it with available socioeconomic data, to infer the device’s income level. Depending on the country, and the availability of good census data, this figure ranges from a relative wealth index to a currency-calculated income.
POI visited: we intersect each observation with a number of POI databases, to estimate check-ins to different locations. POI databases can vary significantly, in scope and depth, between countries.
Category of visited POI: for each observation that can be attributable to a POI, we also include a standardized location category (park, hospital, among others).
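The device speed attribute described above can be illustrated with a short sketch. This is not the provider's actual method: it assumes each ping is a hypothetical `(unix_seconds, lat, lon)` tuple, that pings belong to one device and are sorted by time, and it uses the haversine great-circle distance as the distance measure.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speeds_kmh(pings):
    """Estimate speed in km/h at each ping from the previous ping.

    pings: list of (unix_seconds, lat, lon) for one device, sorted by time
    (a hypothetical record layout). The first ping has no predecessor, and
    pings with a non-positive time gap cannot be rated, so both yield None.
    """
    if not pings:
        return []
    speeds = [None]
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(pings, pings[1:]):
        dt = t1 - t0
        if dt <= 0:
            speeds.append(None)
        else:
            speeds.append(haversine_m(lat0, lon0, lat1, lon1) / dt * 3.6)
    return speeds
```

Thresholds for separating vehicles, pedestrians, and stationary observations are not given in the description, so any cut-off applied to these speeds would be a further assumption.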
Delivery schemas
We can deliver the data in three different formats:
Full dataset: one record per mobile ping. These datasets are very large, and should only be consumed by experienced teams with large computing budgets.
Visitation stream: one record per attributable visit. This dataset is considerably smaller than the full one but retains most of the more valuable elements in the dataset. This helps understand who visited a specific POI, and characterize and understand the consumer's behavior.
Audience profiles: one record per mobile device in a given period of time (usually monthly). The entire visitation stream is aggregated by category. This is the most condensed version of the dataset and is very useful to quickly understand the types of consumers in a particular area and to create cohorts of users.
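The relationship between the visitation stream and the audience profiles can be sketched as a simple aggregation. The example below is illustrative only: the field names `device_id`, `visit_date`, and `category` are hypothetical and do not reflect the actual delivery schema, and the period is assumed to be a calendar month.

```python
from collections import Counter, defaultdict


def audience_profiles(visits):
    """Aggregate a visitation stream into one profile per device per month.

    visits: iterable of dicts with hypothetical keys
      device_id  - identifier of the mobile device
      visit_date - date string in YYYY-MM-DD form
      category   - standardized category of the visited POI
    Returns {(device_id, "YYYY-MM"): {category: visit_count}}.
    """
    profiles = defaultdict(Counter)
    for visit in visits:
        month = visit["visit_date"][:7]  # keep only the "YYYY-MM" prefix
        profiles[(visit["device_id"], month)][visit["category"]] += 1
    return {key: dict(counts) for key, counts in profiles.items()}


sample = [
    {"device_id": "d1", "visit_date": "2024-03-02", "category": "park"},
    {"device_id": "d1", "visit_date": "2024-03-09", "category": "park"},
    {"device_id": "d1", "visit_date": "2024-03-15", "category": "hospital"},
    {"device_id": "d2", "visit_date": "2024-04-01", "category": "park"},
]
profiles = audience_profiles(sample)
```

Each key of the result corresponds to one audience-profile record, which is why this format is so much smaller than the per-ping or per-visit deliveries.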