Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files with any programming language; for example, in Python, you can load them into a pandas DataFrame with the pandas.read_csv() function.
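A minimal sketch of this in Python (the file name below is a placeholder; substitute whichever daily or hourly CSV you downloaded):

```python
import pandas as pd

# Placeholder file name -- replace with the actual daily or hourly CSV from the release.
daily = pd.read_csv("daily_fitbit_sema_surveys.csv")

print(daily.shape)    # number of rows and columns
print(daily.dtypes)   # quick overview of the available fields
print(daily.head())   # first few records
```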
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the database. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
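Depending on how the dump files are packaged, mongorestore may also need the path to the corresponding BSON dump; a hedged example for the Fitbit collection, assuming a dump file named fitbit.bson in the current directory and access control enabled (file name and credentials are placeholders):
mongorestore --host localhost:27017 --username <user> --password <password> -d rais_anonymized -c fitbit fitbit.bson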
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
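The format example above is truncated in this copy. Rather than guessing the schema, a short pymongo sketch (assuming the database was restored as described above) lets you inspect one document per collection and see the actual fields:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Pull one document from each collection and list its top-level fields.
for name in ("fitbit", "sema", "surveys"):
    doc = db[name].find_one()
    print(name, list(doc.keys()) if doc else "empty collection")
```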
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation.
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
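As a hedged illustration of the market basket analysis described above, the sketch below shows a typical Apriori workflow with mlxtend; the file name and the transaction_id, product, and quantity columns are assumptions, since the column list is not reproduced here:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("market_basket.csv")  # file name is an assumption

# Pivot transactions into a one-hot basket matrix: rows = transactions, columns = products.
basket = (
    df.groupby(["transaction_id", "product"])["quantity"]
      .sum()
      .unstack(fill_value=0)
      .astype(bool)
)

# Frequent itemsets and association rules; support and lift thresholds are illustrative.
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```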
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Field data collection in veterinary and animal behaviour science often faces practical limitations, including time constraints, restricted resources, and difficulties integrating high-quality data capture into real-world clinical workflows. This paper highlights the need for flexible, efficient, and standardised digital solutions that facilitate the collection of multimodal behavioural data in real-world settings. We present a case example using PetsDataLab, a novel cloud-based, “no code” platform designed to enable researchers to create customized apps for efficient and standardised data collection tailored to the behavioural domain, facilitating capture of diverse data types, including video, images, and contextual metadata. We used the platform to develop an app supporting the creation of the Dog Pain Database, a novel comprehensive resource aimed at advancing research on behaviour-based pain indicators in dogs. Using the app, we created a large-scale, structured dataset of dogs with clinically diagnosed conditions expected to be associated with pain and discomfort, including demographic, medical, and pain-related information, alongside high-quality video recordings for future behavioural analyses. To evaluate the app’s usability and its potential for future broader deployment, 14 veterinary professionals tested the app and provided structured feedback via a questionnaire. Results indicated strong usability and clarity, although agreement with using the app in daily clinic life was lower among external testers, pointing to possible barriers to routine integration. This proof-of-concept case study demonstrates the potential of cloud-based platforms like PetsDataLab to bridge research and practice by enabling scalable, standardised, and clinically compatible behavioural data collection. While developed for veterinary pain research, the approach is broadly applicable across behavioural science and supports open science principles through structured, reusable, and interoperable data collection.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset explores how daily digital habits — including social media usage, screen time, and notification exposure — relate to individual productivity, stress, and well-being.
The dataset contains 30,000 real-world-style records simulating behavioral patterns of people with various jobs, social habits, and lifestyle choices. The goal is to understand how different digital behaviors correlate with perceived and actual productivity.
✅ Designed for real-world ML workflows
Includes missing values, noise, and outliers — ideal for practicing data cleaning and preprocessing.
🔗 High correlation between target features
The perceived_productivity_score and actual_productivity_score columns are strongly correlated, making this dataset suitable for experiments in feature selection and multicollinearity (see the sketch after the column table below).
🛠️ Feature Engineering playground
Use this dataset to practice feature scaling, encoding, binning, interaction terms, and more.
🧪 Perfect for EDA, regression & classification
You can model productivity, stress, or satisfaction based on behavior patterns and digital exposure.
| Column Name | Description |
|---|---|
| age | Age of the individual (18–65 years) |
| gender | Gender identity: Male, Female, or Other |
| job_type | Employment sector or status (IT, Education, Student, etc.) |
| daily_social_media_time | Average daily time spent on social media (hours) |
| social_platform_preference | Most-used social platform (Instagram, TikTok, Telegram, etc.) |
| number_of_notifications | Number of mobile/social notifications per day |
| work_hours_per_day | Average hours worked each day |
| perceived_productivity_score | Self-rated productivity score (scale: 0–10) |
| actual_productivity_score | Simulated ground-truth productivity score (scale: 0–10) |
| stress_level | Current stress level (scale: 1–10) |
| sleep_hours | Average hours of sleep per night |
| screen_time_before_sleep | Time spent on screens before sleeping (hours) |
| breaks_during_work | Number of breaks taken during work hours |
| uses_focus_apps | Whether the user uses digital focus apps (True/False) |
| has_digital_wellbeing_enabled | Whether Digital Wellbeing is activated (True/False) |
| coffee_consumption_per_day | Number of coffee cups consumed per day |
| days_feeling_burnout_per_month | Number of burnout days reported per month |
| weekly_offline_hours | Total hours spent offline each week (excluding sleep) |
| job_satisfaction_score | Satisfaction with job/life responsibilities (scale: 0–10) |
👉 Sample notebook coming soon with data cleaning, visualization, and productivity prediction!
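Until the sample notebook is available, here is a minimal sketch (the CSV file name is an assumption) of the multicollinearity check and a simple productivity regression using the documented columns:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("social_media_vs_productivity.csv")  # file name is an assumption

# Correlation between the two productivity scores (expected to be high).
print(df["perceived_productivity_score"].corr(df["actual_productivity_score"]))

# Simple regression of actual productivity on a few behavioral features.
features = ["daily_social_media_time", "sleep_hours", "stress_level", "work_hours_per_day"]
X = df[features].fillna(df[features].mean())  # dataset intentionally contains missing values
y = df["actual_productivity_score"].fillna(df["actual_productivity_score"].mean())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```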
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Synthetic Student Performance Dataset is designed to support research, analytics, and educational projects focused on academic performance, family background, and behavioral factors affecting students. It mirrors real-world educational data and offers diverse features to explore student success patterns.
This dataset captures a comprehensive view of student life, including family background, academic history, health, and lifestyle, and is ideal for multi-disciplinary research across education, sociology, and data science.
CC0 (Public Domain)
https://dataverse.unimi.it/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.13130/RD_UNIMI/LJ6Z8V
Dataset containing real-world and synthetic samples on legit and malware samples in the form of time series. The samples consider machine-level performance metrics: CPU usage, RAM usage, number of bytes read and written from and to disk and network. Synthetic samples are generated using a GAN.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
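A short, hedged sketch of the deletion and imputation strategies mentioned above, applied to a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 71_000]})

# Listwise deletion: keep only complete cases (defensible mainly under MCAR).
complete_cases = df.dropna()

# Mean imputation: simple, but shrinks variance and distorts correlations.
mean_imputed = df.fillna(df.mean(numeric_only=True))

# Stochastic variant: add random noise around the imputed mean to preserve spread.
rng = np.random.default_rng(42)
stochastic = df.copy()
for col in stochastic.columns:
    missing = stochastic[col].isna()
    noise = rng.normal(0.0, stochastic[col].std(skipna=True), missing.sum())
    stochastic.loc[missing, col] = stochastic[col].mean(skipna=True) + noise

print(complete_cases, mean_imputed, stochastic, sep="\n\n")
```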
Interpreting time series models is uniquely challenging because it requires identifying both the location of time series signals that drive model predictions and their matching to an interpretable temporal pattern. While explainers from other modalities can be applied to time series, their inductive biases do not transfer well to the inherently uninterpretable nature of time series. We present TIMEX, a time series consistency model for training explainers. TIMEX trains an interpretable surrogate to mimic the behavior of a pretrained time series model. It addresses the issue of model faithfulness by introducing model behavior consistency, a novel formulation that preserves relations in the latent space induced by the pretrained model with relations in the latent space induced by TIMEX. TIMEX provides discrete attribution maps and, unlike existing interpretability methods, it learns a latent space of explanations that can be used in various ways, such as to provide landmarks to visually aggregate similar explanations and easily recognize temporal patterns. We evaluate TIMEX on 8 synthetic and real-world datasets and compare its performance against state-of-the-art interpretability methods. We also conduct case studies using physiological time series. Quantitative evaluations demonstrate that TIMEX achieves the highest or second-highest performance in every metric compared to baselines across all datasets. Through case studies, we show that the novel components of TIMEX show potential for training faithful, interpretable models that capture the behavior of pretrained time series models.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
This dataset contains a compilation of carefully-crafted Q&A pairs designed to provide AI-based tailored support for mental health. These carefully chosen questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on user input. This comprehensive dataset is crafted by experts in the mental health field, providing insightful content that will further research in this growing area. These data points will be invaluable for developing the next generation of personalized AI-based mental health chatbots capable of truly understanding what people need.
This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it is an excellent starting point for building a conversational model that can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:
Understand your data: Spend time getting to know the conversation text between the user and the chatbot, and familiarize yourself with the types of questions and answers included in this dataset. This will help you formulate better queries for your own conversational model or develop new ones to add yourself.
Refine your language processing models: By studying the patterns in syntax, grammar, tone, and voice within this conversational dataset, you can hone natural language processing capabilities such as keyword extraction or entity extraction before implementing them in a larger bot system (see the sketch after these tips).
Test assumptions: If you have an idea of what may work best for a particular audience or context, apply different variations of text to this dataset to check those assumptions before rolling out changes across other channels or programs that use AI/chatbot services.
Research and analyze results: After testing different scenarios with real-world users via various Q&A pairs from this dataset, analyze and record any results relevant to understanding how users respond to tailored conversations about mental health topics, both passively and actively. The more information you collect, the closer you get to AI-powered conversations that achieve the desired outcomes for your user base.
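For instance, a minimal keyword-extraction sketch over the conversation text (assuming the train.csv file and its text column described further below):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train.csv")  # one Q&A conversation per row in the "text" column

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(df["text"].astype(str))

# Top-weighted terms of the first conversation serve as rough keywords.
row = tfidf[0].toarray().ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[row.argsort()[::-1][:10]])
```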
- Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.
- Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.
- Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------------------------------------------|
| text | The text of the conversation between the user and the chatbot. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Dry Eye Disease Patient Records (Synthetic) dataset is designed for educational and research purposes to analyze patterns in sleep behavior, stress levels, lifestyle factors, and their potential links to dry eye disease. It provides anonymized, synthetic data on various health conditions and behavioral habits.
This synthetic dataset is fully anonymized and complies with data privacy standards. It includes a variety of demographic and lifestyle factors to support a broad range of research and analysis applications.
CC0 (Public Domain)
This real-world customer dataset with 31 variables describes 83,590 instances (customers) from a hotel in Lisbon, Portugal.
The data comprises three full years of customer personal, behavioral, demographic, and geographical information.
Additional information on this dataset can be found in the article A Hotel's customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018), written by Nuno Antonio, Ana de Almeida, and Luis Nunes for Data in Brief (online November 2020).
This dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
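A hedged RFM sketch under assumed column and file names (the actual variable names may differ; see the Data in Brief article for the real schema):

```python
import pandas as pd

# Column and file names below are illustrative assumptions, not the dataset's actual schema.
df = pd.read_csv("hotel_customers.csv")

rfm = pd.DataFrame({
    "recency": df["days_since_last_stay"],   # lower = more recent
    "frequency": df["total_bookings"],
    "monetary": df["total_revenue"],
})

# Quartile-based R, F, M scores (1-4); recency is reversed so recent customers score high.
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["RFM_score"] = rfm[["R", "F", "M"]].sum(axis=1)
print(rfm.head())
```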
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset belongs to a leading online E-commerce company. The company wants to identify customers who are likely to churn, so they can proactively approach these customers with promotional offers.
The dataset contains various features related to customer behavior and characteristics, which can be used to predict customer churn.
The main task is to predict customer churn based on the given features. This is a binary classification problem where the target variable is 'Churn'.
This dataset is provided for educational purposes. While it represents a real-world scenario, the data itself may be simulated or anonymized.
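A minimal sketch of the churn task, assuming the data is available as a CSV with the binary Churn column named in the description (the file name is a placeholder):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("ecommerce_churn.csv")  # placeholder file name

y = df["Churn"]
X = pd.get_dummies(df.drop(columns=["Churn"]))  # one-hot encode categorical features
X = X.fillna(X.median(numeric_only=True))       # simple handling of missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```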