Average daily time spent on a phone, excluding voice calls, has increased in recent years, reaching 4 hours and 30 minutes as of April 2022. This figure is expected to reach around 4 hours and 39 minutes by 2024.
How much time do people spend on social media? As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.

Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with friends and keep up with current events.

Global impact of social media
Social media has a wide-reaching and significant impact not only on online activities but also on offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased political polarization, and heightened everyday distractions.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset covers average time spent with others, measured in minutes per day and broken down by the age of the respondent. It is based on averages from surveys between 2009 and 2019 by the U.S. Bureau of Labor Statistics American Time Use Survey, accessed via Our World in Data.
Source: U.S. Bureau of Labor Statistics American Time Use Survey, accessed via Our World in Data
Open Government Licence - Canada 2.0
https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 2376 series, with data for years 2015 - 2015 (not all combinations necessarily have data for all years). This table contains data described by the following dimensions (Not all combinations are available): Geography (11 items: Canada; Newfoundland and Labrador; Prince Edward Island; Nova Scotia; ...); Age group (3 items: Total, 6 to 17 years; 6 to 11 years; 12 to 17 years); Sex (3 items: Both sexes; Males; Females); Children's screen time (3 items: Total population for the variable children's screen time; 2 hours or less of screen time per day; More than 2 hours of screen time per day); Characteristics (8 items: Number of persons; Low 95% confidence interval, number of persons; High 95% confidence interval, number of persons; Coefficient of variation for number of persons; ...).
Attribution 4.0 (CC BY 4.0)
https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Related article: Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39.
In this dataset:
We present temporally dynamic population distribution data from the Helsinki Metropolitan Area, Finland, at the level of 250 m by 250 m statistical grid cells. Three hourly population distribution datasets are provided: one for regular workdays (Mon-Thu), one for Saturdays, and one for Sundays. The data are based on aggregated mobile phone data collected by the largest mobile network operator in Finland. Mobile phone data are assigned to statistical grid cells using an advanced dasymetric interpolation method based on ancillary data about land cover, buildings, and a time use survey. The data were validated by comparison against night-time population register data from Statistics Finland and against a daytime workplace registry. The resulting 24-hour population data can be used to reveal the temporal dynamics of the city and examine population variations relevant to, for instance, spatial accessibility analyses, crisis management, and planning.
Please cite this dataset as:
Bergroth, C., Järv, O., Tenkanen, H., Manninen, M., Toivonen, T., 2022. A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland. Scientific Data 9, 39. https://doi.org/10.1038/s41597-021-01113-4
Organization of data
The dataset is packaged into a single zip file, Helsinki_dynpop_matrix.zip, which contains the following files:
HMA_Dynamic_population_24H_workdays.csv represents the dynamic population for an average workday in the study area.
HMA_Dynamic_population_24H_sat.csv represents the dynamic population for an average Saturday in the study area.
HMA_Dynamic_population_24H_sun.csv represents the dynamic population for an average Sunday in the study area.
target_zones_grid250m_EPSG3067.geojson represents the statistical grid in the ETRS89/ETRS-TM35FIN projection; it can be used to visualize the data on a map using e.g. QGIS.
Column names
YKR_ID : a unique identifier for each statistical grid cell (n=13,231). The identifier is compatible with the statistical YKR grid cell data by Statistics Finland and Finnish Environment Institute.
H0, H1 ... H23 : Each field represents the proportional distribution of the total population in the study area between grid cells during a one-hour period. In total, 24 fields are formatted as “Hx”, where x stands for the hour of the day (values ranging from 0 to 23). For example, H0 stands for the first hour of the day: 00:00 - 00:59. The sum of all cell values for each field equals 100 (i.e., 100% of the total population in each one-hour period).
To visualize the data on a map, the result tables can be joined with the target_zones_grid250m_EPSG3067.geojson data, using the field YKR_ID as the common key between the datasets.
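As a minimal sketch (not part of the dataset itself), the join and a basic sanity check might look like the following, using pandas and geopandas; the file and column names follow the description above:

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd

# Load one hourly population table and the statistical grid.
pop = pd.read_csv("HMA_Dynamic_population_24H_workdays.csv")
grid = gpd.read_file("target_zones_grid250m_EPSG3067.geojson")

# Sanity check: each hourly share column H0..H23 should sum to ~100,
# i.e. 100% of the total population in each one-hour period.
hour_cols = [f"H{h}" for h in range(24)]
print(pop[hour_cols].sum().round(2))

# Join the population shares onto the grid via the common YKR_ID key,
# then map the 08:00-08:59 distribution as a quick visual check.
joined = grid.merge(pop, on="YKR_ID", how="left")
joined.plot(column="H8", legend=True)
plt.show()
```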
License: Creative Commons Attribution 4.0 International.
Related datasets
Järv, Olle; Tenkanen, Henrikki & Toivonen, Tuuli. (2017). Multi-temporal function-based dasymetric interpolation tool for mobile phone data. Zenodo. https://doi.org/10.5281/zenodo.252612
Tenkanen, Henrikki, & Toivonen, Tuuli. (2019). Helsinki Region Travel Time Matrix [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3247564
Open Government Licence - Canada 2.0
https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Daily average time in hours and proportion of day spent on various activities by age group and sex, 15 years and over, Canada and provinces.
DATASET DESCRIPTION: This dataset includes the average response time by Call Priority across days of the week and hours of the day. Response times reflect the same information contained in the APD 911 Calls for Service 2019-2024 dataset.

AUSTIN POLICE DEPARTMENT DATA DISCLAIMER
1. The data provided is for informational use only and may differ from official Austin Police Department data. The Austin Police Department’s databases are continuously updated, and changes can be made due to a variety of investigative factors, including but not limited to offense reclassification and dates. Reports run at different times may produce different results. Care should be taken when comparing against other reports, as different data collection methods and different systems of record may have been used.
4. The Austin Police Department does not assume any liability for any decision made or action taken or not taken by the recipient in reliance upon any information or data provided.

City of Austin Open Data Terms of Use: https://data.austintexas.gov/stories/s/ranj-cccq
The S3 dataset contains the behavior (sensors, statistics of applications, and voice) of 21 volunteers interacting with their smartphones for more than 60 days. The users are diverse: males and females ranging in age from 18 to 70 are represented in the dataset. This wide age range is a key aspect, given the impact of age on smartphone usage. To generate the dataset, the volunteers installed a prototype of the smartphone application on their Android mobile phones.
All attributes of the different kinds of data are stored in a vector. The dataset contains the following vector types:
Sensors:
This type of vector contains data from the smartphone sensors (accelerometer and gyroscope) acquired in a given window of time. Each vector is obtained every 20 seconds, and the monitored features are:
- Average of accelerometer and gyroscope values.
- Maximum and minimum of accelerometer and gyroscope values.
- Variance of accelerometer and gyroscope values.
- Peak-to-peak (max-min) of X, Y, Z coordinates.
- Magnitude for gyroscope and accelerometer.
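As a rough illustration (not the authors' code), per-window features of this kind could be computed from raw 3-axis samples as in the sketch below; the sampling rate and the exact feature order are assumptions:

```python
import numpy as np

def window_features(acc: np.ndarray, gyro: np.ndarray) -> np.ndarray:
    """Compute the per-window statistics described above for one
    20-second window. `acc` and `gyro` are (n_samples, 3) arrays of
    X, Y, Z readings; the feature order here is illustrative."""
    feats = []
    for signal in (acc, gyro):
        magnitude = np.linalg.norm(signal, axis=1)      # per-sample magnitude
        feats += [
            signal.mean(),                              # average value
            signal.max(), signal.min(),                 # maximum and minimum
            signal.var(),                               # variance
            *(signal.max(axis=0) - signal.min(axis=0)), # peak-to-peak per axis
            magnitude.mean(),                           # magnitude
        ]
    return np.array(feats)

# Example: 20 s of hypothetical 50 Hz readings -> 1000 samples per sensor.
rng = np.random.default_rng(0)
vec = window_features(rng.normal(size=(1000, 3)), rng.normal(size=(1000, 3)))
print(vec.shape)
```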
Statistics:
These vectors contain data about the applications recently used by the user. Each statistics vector is computed every 60 seconds and contains:
- Foreground application counters (number of distinct and total apps) for the last minute and the last day.
- Most common app ID and its number of usages in the last minute and the last day.
- ID of the currently active app.
- ID of the last active app prior to the current one.
- ID of the application most frequently used prior to the current application.
- Bytes transmitted and received through the network interfaces.
Voice:
This kind of vector is generated when the microphone is active during a call or voice note. The speaker vector is an embedding extracted from the audio that contains information about the user's identity. This vector is usually called an "x-vector" in the speaker recognition field, and it is calculated following the steps detailed in the "egs/sitw/v2" recipe of the Kaldi library, using the models available for extracting the embedding.
A summary of the collected database:
- Users: 21
- Sensor vectors: 417,128
- App usage statistics vectors: 151,034
- Speaker vectors: 2,720
- Call recordings: 629
- Voice messages: 2,091
Statewide Intake serves as the “front door to the front line” for all DFPS programs. As the central point of contact for reports of abuse, neglect and exploitation of vulnerable Texans, SWI staff are available 24 hours a day, 7 days per week, 365 days per year. SWI is the centralized point of intake for the entire State of Texas for child abuse and neglect; abuse, neglect or exploitation of people age 65 or older or adults with disabilities; clients served by DSHS or DADS employees in State Hospitals or State Supported Living Centers; and children in licensed child-care facilities or treatment centers. SWI provides daily reports on call volume and hold times per application, etc., and integrates hardware and software upgrades to phone and computer systems to reduce hold times and improve efficiency.

NOTE: Past Printed Data Books also included EBC, Re-Entry and Support Staff in the all-queues total. An abandoned call is a call that disconnects after completing navigation of the recorded message, but prior to being answered by an intake specialist.

Legislative Budget Board (LBB) Performance Measure Targets are set every two years during Legislative Sessions. LBB Average Hold Time Targets for the English Queue:
- 2010: 11.4 minutes
- 2011: 11.4 minutes
- 2012: 8.7 minutes
- 2013: 8.7 minutes
- 2014: 8.7 minutes
- 2015: 8.7 minutes
- 2016: 7.2 minutes
- 2017: 10.5 minutes
- 2018: 12.0 minutes
- 2019: 9.8 minutes

Visit dfps.state.tx.us for information on all DFPS programs.
Open Government Licence 3.0
http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Average daily time spent by adults on activities including paid work, unpaid household work, unpaid care, travel and entertainment. These are official statistics in development.
Open Government Licence - Canada 2.0
https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Daily average time and proportion of day spent on various activities, by age group and gender, 15 years and over, Canada, Geographical region of Canada, province or territory, 2022.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset presents detailed energy consumption records from various households over a one-month period. With 90,000 rows and multiple features such as temperature, household size, air conditioning usage, and peak-hour consumption, this dataset is well suited to time-series analysis, machine learning, and sustainability research.
Column Name | Data Type Category | Description |
---|---|---|
Household_ID | Categorical (Nominal) | Unique identifier for each household |
Date | Datetime | The date of the energy usage record |
Energy_Consumption_kWh | Numerical (Continuous) | Total energy consumed by the household in kWh |
Household_Size | Numerical (Discrete) | Number of individuals living in the household |
Avg_Temperature_C | Numerical (Continuous) | Average daily temperature in degrees Celsius |
Has_AC | Categorical (Binary) | Indicates if the household has air conditioning (Yes/No) |
Peak_Hours_Usage_kWh | Numerical (Continuous) | Energy consumed during peak hours in kWh |
Library | Purpose |
---|---|
pandas | Reading, cleaning, and transforming tabular data |
numpy | Numerical operations, working with arrays |
matplotlib | Creating static plots (line, bar, histograms, etc.) |
seaborn | Statistical visualizations, heatmaps, boxplots, etc. |
plotly | Interactive charts (time series, pie, bar, scatter, etc.) |
scikit-learn | Preprocessing, regression, classification, clustering |
xgboost / lightgbm | Gradient boosting models for better accuracy |
sklearn.preprocessing | Encoding categorical features, scaling, normalization |
datetime / pandas | Date-time conversion and manipulation |
sklearn.metrics | Accuracy, MAE, RMSE, R² score, confusion matrix, etc. |
✅ These libraries provide a complete toolkit for performing data analysis, modeling, and visualization tasks efficiently.
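As a hedged starting point, the sketch below ties a few of these libraries to the columns documented above; the CSV file name is hypothetical and should be adjusted to the actual download:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name for the household energy dataset.
df = pd.read_csv("household_energy_consumption.csv", parse_dates=["Date"])

# Daily mean consumption, split by air-conditioning ownership (Yes/No).
daily = df.groupby(["Date", "Has_AC"])["Energy_Consumption_kWh"].mean().unstack()
daily.plot(title="Average daily consumption by AC ownership")
plt.ylabel("kWh")
plt.show()

# Share of consumption that falls in peak hours, per household size.
df["peak_share"] = df["Peak_Hours_Usage_kWh"] / df["Energy_Consumption_kWh"]
print(df.groupby("Household_Size")["peak_share"].mean().round(3))
```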
This dataset is ideal for a wide variety of analytics and machine learning projects.
This dataset contains the World Average Degree Days Database for the period 1964-2013. Follow datasource.kapsarc.org for timely data to advance energy economics research.
Summary_64-13_freq=1D Average Degree Days of various indices for respective countries for the period 1964-2013, converted to a 1 day frequency
Summary_64-13_freq=6hrs Average Degree Days of various indices for respective countries for the period 1964-2013, calculated at 6 hrs frequency
T2m.hdd.18C Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=18°C and frequency of 6 hrs
T2m.cdd.18C Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=18°C and frequency of 6 hrs
t2m.hdd.15.6C Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=15.6°C and frequency of 6 hrs
t2m.hdd.18.3C Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=18.3°C and frequency of 6 hrs
t2m.hdd.21.1C Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=21.1°C and frequency of 6 hrs
t2m.cdd.15.6C Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=15.6°C and frequency of 6 hrs
t2m.cdd.18.3C Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=18.3°C and frequency of 6 hrs
t2m.cdd.21.1C Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=21.1°C and frequency of 6 hrs
t2m.hdd.60F Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=60°F and frequency of 6 hrs
t2m.hdd.65F Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=65°F and frequency of 6 hrs
t2m.hdd.70F Calculation of Heating Degree Days using plain temperature at 2 m elevation at Tref=70°F and frequency of 6 hrs
t2m.cdd.60F Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=60°F and frequency of 6 hrs
t2m.cdd.65F Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=65°F and frequency of 6 hrs
t2m.cdd.70F Calculation of Cooling Degree Days using plain temperature at 2 m elevation at Tref=70°F and frequency of 6 hrs
HI.hdd.57.56F Calculation of Heating Degree Days using the Heat Index at Tref=57.56°F and frequency of 6 hrs
HI.hdd.63.08F Calculation of Heating Degree Days using the Heat Index at Tref=63.08°F and frequency of 6 hrs
HI.hdd.68.58F Calculation of Heating Degree Days using the Heat Index at Tref=68.58°F and frequency of 6 hrs
HI.cdd.57.56F Calculation of Cooling Degree Days using the Heat Index at Tref=57.56°F and frequency of 6 hrs
HI.cdd.63.08F Calculation of Cooling Degree Days using the Heat Index at Tref=63.08°F and frequency of 6 hrs
HI.cdd.68.58F Calculation of Cooling Degree Days using the Heat Index at Tref=68.58°F and frequency of 6 hrs
HUM.hdd.13.98C Calculation of Heating Degree Days using the Humidex at Tref=13.98°C and frequency of 6 hrs
HUM.hdd.17.4C Calculation of Heating Degree Days using the Humidex at Tref=17.40°C and frequency of 6 hrs
HUM.hdd.21.09C Calculation of Heating Degree Days using the Humidex at Tref=21.09°C and frequency of 6 hrs
HUM.cdd.13.98C Calculation of Cooling Degree Days using the Humidex at Tref=13.98°C and frequency of 6 hrs
HUM.cdd.17.4C Calculation of Cooling Degree Days using the Humidex at Tref=17.40°C and frequency of 6 hrs
HUM.cdd.21.09C Calculation of Cooling Degree Days using the Humidex at Tref=21.09°C and frequency of 6 hrs
ESI.hdd.12.6C Calculation of Heating Degree Days using the Environmental Stress Index at Tref=12.6°C and frequency of 6 hrs
ESI.hdd.14.9C Calculation of Heating Degree Days using the Environmental Stress Index at Tref=14.9°C and frequency of 6 hrs
ESI.hdd.17.2C Calculation of Heating Degree Days using the Environmental Stress Index at Tref=17.2°C and frequency of 6 hrs
ESI.cdd.12.6C Calculation of Cooling Degree Days using the Environmental Stress Index at Tref=12.6°C and frequency of 6 hrs
ESI.cdd.14.9C Calculation of Cooling Degree Days using the Environmental Stress Index at Tref=14.9°C and frequency of 6 hrs
ESI.cdd.17.2C Calculation of Cooling Degree Days using the Environmental Stress Index at Tref=17.2°C and frequency of 6 hrs
Note:
Divide Degree Days by 4 to convert from 6 hrs to daily frequency
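A minimal sketch of that conversion, assuming a 6-hourly summary column has been loaded into a pandas DataFrame (the column name is illustrative):

```python
import pandas as pd

# Hypothetical frame standing in for the 6-hourly summary table.
df = pd.DataFrame({"t2m.hdd.18C": [8.0, 6.4, 4.0]})

# Per the note above: divide by 4 (four 6-hour periods per day)
# to convert 6-hourly degree-day values to daily frequency.
df["t2m.hdd.18C_daily"] = df["t2m.hdd.18C"] / 4
print(df)
```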
Global Surface Summary of the Day is derived from the Integrated Surface Hourly (ISH) dataset. The ISH dataset includes global data obtained from the USAF Climatology Center, located in the Federal Climate Complex with NCDC. The latest daily summary data are normally available 1-2 days after the date-time of the observations used in the daily summaries. The online data files begin with 1929 and are, at the time of this writing, at the Version 8 software level. Over 9000 stations' data are typically available. The daily elements included in the dataset (as available from each station) are:
- Mean temperature (.1 Fahrenheit)
- Mean dew point (.1 Fahrenheit)
- Mean sea level pressure (.1 mb)
- Mean station pressure (.1 mb)
- Mean visibility (.1 miles)
- Mean wind speed (.1 knots)
- Maximum sustained wind speed (.1 knots)
- Maximum wind gust (.1 knots)
- Maximum temperature (.1 Fahrenheit)
- Minimum temperature (.1 Fahrenheit)
- Precipitation amount (.01 inches)
- Snow depth (.1 inches)
- Indicators for the occurrence of: fog, rain or drizzle, snow or ice pellets, hail, thunder, tornado/funnel cloud

Global summary of day data for 18 surface meteorological elements are derived from the synoptic/hourly observations contained in USAF DATSAV3 Surface data and Federal Climate Complex Integrated Surface Hourly (ISH). Historical data are generally available for 1929 to the present, with data from 1973 to the present being the most complete. For some periods, one or more countries' data may not be available due to data restrictions or communications problems. In deriving the summary of day data, a minimum of 4 observations for the day must be present (this allows for stations which report 4 synoptic observations per day). Since the data are converted to constant units (e.g., knots), slight rounding error from the originally reported values may occur (e.g., 9.9 instead of 10.0). The mean daily values described below are based on the hours of operation for the station. For some stations/countries, the visibility will sometimes 'cluster' around a value (such as 10 miles) due to the practice of not reporting visibilities greater than certain distances. The daily extremes and totals (maximum wind gust, precipitation amount, and snow depth) will only appear if the station reports the data sufficiently to provide a valid value; therefore, these three elements will appear less frequently than other values. Also, these elements are derived from the stations' reports during the day, and may comprise a 24-hour period which includes a portion of the previous day. The data are reported and summarized based on Greenwich Mean Time (GMT, 0000Z - 2359Z), since the original synoptic/hourly data are reported and based on GMT.
Abstract: Building health management is an important part of running an efficient and cost-effective building. Many problems in a building's systems can go undetected for long periods of time, leading to expensive repairs or wasted resources. This project aims to help detect and diagnose the building's health with data-driven methods throughout the day. Orca and IMS are two state-of-the-art algorithms that observe an array of building health sensors and provide feedback on the overall system's health, as well as localize the problem to one, or possibly two, components. With this level of feedback, the hope is to quickly identify problems and provide appropriate maintenance while reducing the number of complaints and service calls.

Introduction: To prepare these technologies for the new installation, the proposed methods are being tested on a current system that behaves similarly to the future green building. Building 241 was determined to best resemble the proposed building 232 and was therefore chosen for this study. Building 241 is currently outfitted with 34 sensors that monitor the heating and cooling temperatures for the air and water systems, as well as various other subsystem states. The daily sensor recordings were logged and sent to the IDU group for analysis. The period of analysis ran from July 1st through August 10th, 2009.

Methodology: The two algorithms used for analysis were Orca and IMS. Both methods look for anomalies using a distance-based scoring approach. Orca can take a single data set and find outliers within it; this tactic was applied to each day. After scoring each time sample throughout a given day, the Orca score profiles were compared by computing the correlation against all other days. Days with high overall correlations were considered normal, while days with lower overall correlations were more anomalous. IMS, on the other hand, needs a normal set of data to build a model, which can then be applied to a set of test data to assess how anomalous that data set is. The typical days identified by Orca were used as the reference/training set for IMS, while all the other days were passed through IMS, resulting in an anomaly score profile for each day. The mean of the IMS score profile was then calculated for each day to produce a summary IMS score. These summary scores were ranked and the top outliers were identified (see Figure 1). Once the anomalies were identified, the contributing parameters were ranked by the algorithm.

Analysis: The contributing parameters identified by IMS were localized to the return air temperature duct system.
- 7/03/09 (Figures 2 & 3): AHU-1 Return Air Temperature (RAT), Calculated Average Return Air Temperature
- 7/19/09 (Figures 3 & 4): AHU-2 Return Air Temperature (RAT), Calculated Average Return Air Temperature
IMS identified significantly higher temperatures compared to other days during the months of July and August.

Conclusion: The proposed algorithms Orca and IMS have shown that they were able to pick up significant anomalies in the building system, as well as diagnose each anomaly by identifying the sensor values that were anomalous. In the future, these methods can be used on live streaming data to produce a real-time anomaly score, helping building maintenance with detection and diagnosis of problems.
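The day-correlation step described in the methodology can be sketched as follows; this is an illustration of the idea with synthetic score profiles, not the Orca or IMS implementations:

```python
import numpy as np

# Hypothetical stand-in for per-day anomaly score profiles: one row per
# day, one column per time sample within the day (e.g. from Orca).
rng = np.random.default_rng(1)
profiles = rng.normal(size=(41, 96))          # 41 days, 96 samples/day

# Correlate each day's profile against all other days; days whose mean
# correlation with the rest is low are treated as more anomalous.
corr = np.corrcoef(profiles)
np.fill_diagonal(corr, np.nan)                # ignore self-correlation
mean_corr = np.nanmean(corr, axis=1)

# Lowest-correlation days are candidate anomalies; highest are "typical"
# days that could seed a reference/training set for a model like IMS.
print("most anomalous days:", np.argsort(mean_corr)[:5])
```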
Annual dataset covering the conterminous U.S., from 1981 to now. Contains spatially gridded annual average daily mean temperature at 4km grid cell resolution. Distribution of the point measurements to the spatial grid was accomplished using the PRISM model, developed and applied by Dr. Christopher Daly of the PRISM Climate Group at Oregon State University.
The Heart Attack Risk Prediction Dataset serves as a valuable resource for delving into the intricate dynamics of heart health and its predictors. Heart attacks, or myocardial infarctions, continue to be a significant global health issue, necessitating a deeper comprehension of their precursors and potential mitigating factors. This dataset encapsulates a diverse range of attributes including age, cholesterol levels, blood pressure, smoking habits, exercise patterns, dietary preferences, and more, aiming to elucidate the complex interplay of these variables in determining the likelihood of a heart attack. By employing predictive analytics and machine learning on this dataset, researchers and healthcare professionals can work towards proactive strategies for heart disease prevention and management. The dataset stands as a testament to collective efforts to enhance our understanding of cardiovascular health and pave the way for a healthier future.
This synthetic dataset provides a comprehensive array of features relevant to heart health and lifestyle choices, encompassing patient-specific details such as age, gender, cholesterol levels, blood pressure, heart rate, and indicators like diabetes, family history, smoking habits, obesity, and alcohol consumption. Additionally, lifestyle factors like exercise hours, dietary habits, stress levels, and sedentary hours are included. Medical aspects comprising previous heart problems, medication usage, and triglyceride levels are considered. Socioeconomic aspects such as income and geographical attributes like country, continent, and hemisphere are incorporated. The dataset, consisting of 8763 records from patients around the globe, culminates in a crucial binary classification feature denoting the presence or absence of a heart attack risk, providing a comprehensive resource for predictive analysis and research in cardiovascular health.
This dataset is a synthetic creation generated using ChatGPT to simulate a realistic experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world scenarios. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation.
Cover Photo by: brgfx on Freepik
Thumbnail by: vectorjuice on Freepik
Apache License, v2.0
https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Instructions:
Dataset Name: Podcast Listening Time Prediction
Dataset Description: The dataset contains information about various podcast episodes and their attributes. The goal is to analyze and predict the average listening duration of podcast episodes based on various features.
Columns in the Dataset:
Podcast_Name (Type: string) Description: Names of popular podcasts. Example Values: "Tech Talk", "Health Hour", "Comedy Central"
Episode_Title (Type: string) Description: Titles of the podcast episodes. Example Values: "The Future of AI", "Meditation Tips", "Stand-Up Special"
Episode_Length (Type: float, minutes) Description: Length of the episode in minutes. Example Values: 5.0, 10.0, 30.0, 45.0, 60.0, 90.0
Genre (Type: string) Description: Genre of the podcast episode. Possible Values: "Technology", "Education", "Comedy", "Health", "True Crime", "Business", "Sports", "Lifestyle", "News", "Music"
Host_Popularity (Type: float, scale 0-100) Description: A score indicating the popularity of the host. Example Values: 50.0, 75.0, 90.0
Publication_Day (Type: string) Description: Day of the week the episode was published. Possible Values: "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"
Publication_Time (Type: string) Description: Time of the day the episode was published. Possible Values: "Morning", "Afternoon", "Evening", "Night"
Guest_Popularity (Type: float, scale 0-100) Description: A score indicating the popularity of the guest (if any). Example Values: 20.0, 50.0, 85.0
Number_of_Ads (Type: int) Description: Number of advertisements within the episode. Example Values: 0, 1, 2, 3
Episode_Sentiment (Type: string) Description: Sentiment of the episode's content. Possible Values: "Positive", "Neutral", "Negative"
Listening_Time (Type: float, minutes) Description: The actual average listening duration (target variable). Example Values: 4.5, 8.0, 30.0, 60.0
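A minimal modeling sketch for the stated prediction goal, using scikit-learn on the columns listed above (the CSV file name is hypothetical; this is one possible approach, not the dataset's prescribed method):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical file name; adjust to the actual download.
df = pd.read_csv("podcast_listening_time.csv")

categorical = ["Podcast_Name", "Genre", "Publication_Day",
               "Publication_Time", "Episode_Sentiment"]
numeric = ["Episode_Length", "Host_Popularity",
           "Guest_Popularity", "Number_of_Ads"]

X = df[categorical + numeric]
y = df["Listening_Time"]          # target: average listening duration

# One-hot encode the categorical columns, pass numeric columns through,
# then fit a random-forest regressor on the training split.
model = Pipeline([
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),
    ("reg", RandomForestRegressor(n_estimators=200, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print(f"R^2 on held-out episodes: {model.score(X_test, y_test):.3f}")
```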
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
📌 Context of the Dataset
The Healthcare Ransomware Dataset was created to simulate real-world cyberattacks in the healthcare industry. Hospitals, clinics, and research labs have become prime targets for ransomware due to their reliance on real-time patient data and legacy IT infrastructure. This dataset provides insight into attack patterns, recovery times, and cybersecurity practices across different healthcare organizations.
Why is this important?
Ransomware attacks on healthcare organizations can shut down entire hospitals, delay treatments, and put lives at risk. Understanding how different healthcare organizations respond to attacks can help develop better security strategies. The dataset allows cybersecurity analysts, data scientists, and researchers to study patterns in ransomware incidents and explore predictive modeling for risk mitigation.
📌 Sources and Research Inspiration
This simulated dataset was inspired by real-world cybersecurity reports and built using insights from official sources, including:
1️⃣ IBM Cost of a Data Breach Report (2024)
- The healthcare sector had the highest average cost of data breaches ($10.93 million per incident).
- On average, organizations recovered only 64.8% of their data after paying ransom.
- Healthcare breaches took 277 days on average to detect and contain.
2️⃣ Sophos State of Ransomware in Healthcare (2024)
- 67% of healthcare organizations were hit by ransomware in 2024, up from 60% in 2023.
- 66% of backup compromise attempts succeeded, making data recovery significantly more difficult.
- The most common attack vectors included exploited vulnerabilities (34%) and compromised credentials (34%).
3️⃣ Health & Human Services (HHS) Cybersecurity Reports
- Ransomware incidents in healthcare have doubled since 2016.
- Organizations that fail to monitor threats frequently experience higher infection rates.
4️⃣ Cybersecurity & Infrastructure Security Agency (CISA) Alerts
- Identified phishing, unpatched software, and exposed RDP ports as top ransomware entry points.
- Only 13% of healthcare organizations monitor cyber threats more than once per day, increasing the risk of undetected attacks.
5️⃣ Emsisoft 2020 Report on Ransomware in Healthcare
- The number of ransomware attacks in healthcare increased by 278% between 2018 and 2023.
- 560 healthcare facilities were affected in a single year, disrupting patient care and emergency services.
📌 Why is This a Simulated Dataset?
This dataset does not contain real patient data or actual ransomware cases. Instead, it was built using probabilistic modeling and structured randomness based on industry benchmarks and cybersecurity reports.
How It Was Created:
1️⃣ Defining the Dataset Structure
The dataset was designed to simulate realistic attack patterns in healthcare, using actual ransomware case studies as inspiration.
Columns were selected based on what real-world cybersecurity teams track, such as:
- Attack methods (phishing, RDP exploits, credential theft).
- Infection rates, recovery time, and backup compromise rates.
- Organization type (hospitals, clinics, research labs) and monitoring frequency.
2️⃣ Generating Realistic Data Using ChatGPT & Python
- ChatGPT assisted in defining relationships between attack factors, ensuring that key cybersecurity concepts were accurately reflected.
- Python's NumPy and pandas libraries were used to introduce randomized attack simulations based on real-world statistics (a minimal sketch of this step follows after this section).
- Data was validated against industry research to ensure it aligns with actual ransomware attack trends.
3️⃣ Ensuring Logical Relationships Between Data Points
- Hospitals take longer to recover due to larger infrastructure and compliance requirements.
- Organizations that track more cyber threats recover faster because they detect attacks earlier.
- Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
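A minimal sketch of this kind of probabilistic generation, using NumPy and pandas with weights loosely echoing the figures cited above (all names, sizes, and parameters are illustrative, not the dataset's actual generator):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500  # number of simulated incidents; purely illustrative

# Draw entry points with weights loosely echoing the Sophos figures
# above (exploited vulnerabilities and credentials at ~34% each).
df = pd.DataFrame({
    "org_type": rng.choice(["hospital", "clinic", "research_lab"], size=n),
    "entry_point": rng.choice(
        ["exploited_vulnerability", "compromised_credentials",
         "phishing", "exposed_rdp"],
        size=n, p=[0.34, 0.34, 0.22, 0.10]),
    "backup_compromised": rng.random(n) < 0.66,
})

# Encode the stated logical relationships: hospitals recover more
# slowly, and a compromised backup lengthens recovery further.
base_days = np.where(df["org_type"] == "hospital", 14, 7)
df["recovery_days"] = rng.poisson(base_days + 10 * df["backup_compromised"])
print(df.head())
```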