Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files with any programming language; for example, in Python, you can load them into a pandas DataFrame with the pandas.read_csv() function.
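A minimal sketch of this in Python (the file name below is a placeholder; substitute whichever daily or hourly CSV you downloaded):

```python
import pandas as pd

# Placeholder file name -- replace with the actual daily or hourly CSV from the release.
daily = pd.read_csv("daily_fitbit_sema_surveys.csv")

print(daily.shape)    # number of rows and columns
print(daily.dtypes)   # quick overview of the available fields
print(daily.head())   # first few records
```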
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the database. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
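Depending on how the dump files are packaged, mongorestore may also need the path to the corresponding BSON dump; a hedged example for the Fitbit collection, assuming a dump file named fitbit.bson in the current directory and access control enabled (file name and credentials are placeholders):
mongorestore --host localhost:27017 --username <user> --password <password> -d rais_anonymized -c fitbit fitbit.bson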
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
_id:
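The format example above is truncated in this copy. Rather than guessing the schema, a short pymongo sketch (assuming the database was restored as described above) lets you inspect one document per collection and see the actual fields:

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["rais_anonymized"]

# Pull one document from each collection and list its top-level fields.
for name in ("fitbit", "sema", "surveys"):
    doc = db[name].find_one()
    print(name, list(doc.keys()) if doc else "empty collection")
```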
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation.
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
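As a hedged illustration of the market basket analysis described above, the sketch below shows a typical Apriori workflow with mlxtend; the file name and the transaction_id, product, and quantity columns are assumptions, since the column list is not reproduced here:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("market_basket.csv")  # file name is an assumption

# Pivot transactions into a one-hot basket matrix: rows = transactions, columns = products.
basket = (
    df.groupby(["transaction_id", "product"])["quantity"]
      .sum()
      .unstack(fill_value=0)
      .astype(bool)
)

# Frequent itemsets and association rules; support and lift thresholds are illustrative.
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```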
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Field data collection in veterinary and animal behaviour science often faces practical limitations, including time constraints, restricted resources, and difficulties integrating high-quality data capture into real-world clinical workflows. This paper highlights the need for flexible, efficient, and standardised digital solutions that facilitate the collection of multimodal behavioural data in real-world settings. We present a case example using PetsDataLab, a novel cloud-based, “no code” platform designed to enable researchers to create customized apps for efficient and standardised data collection tailored to the behavioural domain, facilitating capture of diverse data types, including video, images, and contextual metadata. We used the platform to develop an app supporting the creation of the Dog Pain Database, a novel comprehensive resource aimed at advancing research on behaviour-based pain indicators in dogs. Using the app, we created a large-scale, structured dataset of dogs with clinically diagnosed conditions expected to be associated with pain and discomfort, including demographic, medical, and pain-related information, alongside high-quality video recordings for future behavioural analyses. To evaluate the app’s usability and its potential for future broader deployment, 14 veterinary professionals tested the app and provided structured feedback via a questionnaire. Results indicated strong usability and clarity, although agreement with using the app in daily clinic life was lower among external testers, pointing to possible barriers to routine integration. This proof-of-concept case study demonstrates the potential of cloud-based platforms like PetsDataLab to bridge research and practice by enabling scalable, standardised, and clinically compatible behavioural data collection. While developed for veterinary pain research, the approach is broadly applicable across behavioural science and supports open science principles through structured, reusable, and interoperable data collection.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset explores how daily digital habits — including social media usage, screen time, and notification exposure — relate to individual productivity, stress, and well-being.
The dataset contains 30,000 real-world-style records simulating behavioral patterns of people with various jobs, social habits, and lifestyle choices. The goal is to understand how different digital behaviors correlate with perceived and actual productivity.
✅ Designed for real-world ML workflows
Includes missing values, noise, and outliers — ideal for practicing data cleaning and preprocessing.
🔗 High correlation between target features
The perceived_productivity_score and actual_productivity_score columns are strongly correlated, making this dataset suitable for experiments in feature selection and multicollinearity (see the sketch after the column table below).
🛠️ Feature Engineering playground
Use this dataset to practice feature scaling, encoding, binning, interaction terms, and more.
🧪 Perfect for EDA, regression & classification
You can model productivity, stress, or satisfaction based on behavior patterns and digital exposure.
| Column Name | Description |
|---|---|
| age | Age of the individual (18–65 years) |
| gender | Gender identity: Male, Female, or Other |
| job_type | Employment sector or status (IT, Education, Student, etc.) |
| daily_social_media_time | Average daily time spent on social media (hours) |
| social_platform_preference | Most-used social platform (Instagram, TikTok, Telegram, etc.) |
| number_of_notifications | Number of mobile/social notifications per day |
| work_hours_per_day | Average hours worked each day |
| perceived_productivity_score | Self-rated productivity score (scale: 0–10) |
| actual_productivity_score | Simulated ground-truth productivity score (scale: 0–10) |
| stress_level | Current stress level (scale: 1–10) |
| sleep_hours | Average hours of sleep per night |
| screen_time_before_sleep | Time spent on screens before sleeping (hours) |
| breaks_during_work | Number of breaks taken during work hours |
| uses_focus_apps | Whether the user uses digital focus apps (True/False) |
| has_digital_wellbeing_enabled | Whether Digital Wellbeing is activated (True/False) |
| coffee_consumption_per_day | Number of coffee cups consumed per day |
| days_feeling_burnout_per_month | Number of burnout days reported per month |
| weekly_offline_hours | Total hours spent offline each week (excluding sleep) |
| job_satisfaction_score | Satisfaction with job/life responsibilities (scale: 0–10) |
👉 Sample notebook coming soon with data cleaning, visualization, and productivity prediction!
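Until the sample notebook is available, here is a minimal sketch (the CSV file name is an assumption) of the multicollinearity check and a simple productivity regression using the documented columns:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("social_media_vs_productivity.csv")  # file name is an assumption

# Correlation between the two productivity scores (expected to be high).
print(df["perceived_productivity_score"].corr(df["actual_productivity_score"]))

# Simple regression of actual productivity on a few behavioral features.
features = ["daily_social_media_time", "sleep_hours", "stress_level", "work_hours_per_day"]
X = df[features].fillna(df[features].mean())  # dataset intentionally contains missing values
y = df["actual_productivity_score"].fillna(df["actual_productivity_score"].mean())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```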
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Synthetic Student Performance Dataset is designed to support research, analytics, and educational projects focused on academic performance, family background, and behavioral factors affecting students. It mirrors real-world educational data and offers diverse features to explore student success patterns.
This dataset captures a comprehensive view of student life, including family background, academic history, health, and lifestyle, and is ideal for multi-disciplinary research across education, sociology, and data science.
CC0 (Public Domain)
https://dataverse.unimi.it/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.13130/RD_UNIMI/LJ6Z8V
Dataset containing real-world and synthetic samples on legit and malware samples in the form of time series. The samples consider machine-level performance metrics: CPU usage, RAM usage, number of bytes read and written from and to disk and network. Synthetic samples are generated using a GAN.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
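A short, hedged sketch of the deletion and imputation strategies mentioned above, applied to a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31, np.nan],
                   "income": [50_000, 62_000, np.nan, 58_000, 71_000]})

# Listwise deletion: keep only complete cases (defensible mainly under MCAR).
complete_cases = df.dropna()

# Mean imputation: simple, but shrinks variance and distorts correlations.
mean_imputed = df.fillna(df.mean(numeric_only=True))

# Stochastic variant: add random noise around the imputed mean to preserve spread.
rng = np.random.default_rng(42)
stochastic = df.copy()
for col in stochastic.columns:
    missing = stochastic[col].isna()
    noise = rng.normal(0.0, stochastic[col].std(skipna=True), missing.sum())
    stochastic.loc[missing, col] = stochastic[col].mean(skipna=True) + noise

print(complete_cases, mean_imputed, stochastic, sep="\n\n")
```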
Interpreting time series models is uniquely challenging because it requires identifying both the location of time series signals that drive model predictions and their matching to an interpretable temporal pattern. While explainers from other modalities can be applied to time series, their inductive biases do not transfer well to the inherently uninterpretable nature of time series. We present TIMEX, a time series consistency model for training explainers. TIMEX trains an interpretable surrogate to mimic the behavior of a pretrained time series model. It addresses the issue of model faithfulness by introducing model behavior consistency, a novel formulation that preserves relations in the latent space induced by the pretrained model with relations in the latent space induced by TIMEX. TIMEX provides discrete attribution maps and, unlike existing interpretability methods, it learns a latent space of explanations that can be used in various ways, such as to provide landmarks to visually aggregate similar explanations and easily recognize temporal patterns. We evaluate TIMEX on 8 synthetic and real-world datasets and compare its performance against state-of-the-art interpretability methods. We also conduct case studies using physiological time series. Quantitative evaluations demonstrate that TIMEX achieves the highest or second-highest performance in every metric compared to baselines across all datasets. Through case studies, we show that the novel components of TIMEX show potential for training faithful, interpretable models that capture the behavior of pretrained time series models.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub
This dataset contains a compilation of carefully-crafted Q&A pairs designed to provide AI-based tailored support for mental health. These carefully chosen questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on user input. This comprehensive dataset is crafted by experts in the mental health field, providing insightful content that will further research in this growing area. These data points will be invaluable for developing the next generation of personalized AI-based mental health chatbots capable of truly understanding what people need.
This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it is an excellent starting point for building a conversational model that can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:
Understand your data: Spend time getting to know the conversation text between the user and the chatbot, and familiarize yourself with the types of questions and answers included in this dataset. This will help you formulate better queries for your own conversational model or develop new ones to add yourself.
Refine your language processing models: By studying the patterns in syntax, grammar, tone, and voice within this conversational dataset, you can hone natural language processing capabilities such as keyword extraction or entity extraction before implementing them in a larger bot system (see the sketch after these tips).
Test assumptions: If you have an idea of what may work best for a particular audience or context, apply different variations of text to this dataset to check those assumptions before rolling out changes across other channels or programs that use AI/chatbot services.
Research and analyze results: After testing different scenarios with real-world users via various Q&A pairs from this dataset, analyze and record any results relevant to understanding how users respond to tailored conversations about mental health topics, both passively and actively. The more information you collect, the closer you get to AI-powered conversations that achieve the desired outcomes for your user base.
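For instance, a minimal keyword-extraction sketch over the conversation text (assuming the train.csv file and its text column described further below):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train.csv")  # one Q&A conversation per row in the "text" column

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(df["text"].astype(str))

# Top-weighted terms of the first conversation serve as rough keywords.
row = tfidf[0].toarray().ravel()
terms = np.array(vectorizer.get_feature_names_out())
print(terms[row.argsort()[::-1][:10]])
```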
- Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.
- Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.
- Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No Copyright: you can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------------------------------------------|
| text | The text of the conversation between the user and the chatbot. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The Dry Eye Disease Patient Records (Synthetic) dataset is designed for educational and research purposes to analyze patterns in sleep behavior, stress levels, lifestyle factors, and their potential links to dry eye disease. It provides anonymized, synthetic data on various health conditions and behavioral habits.
This synthetic dataset is fully anonymized and complies with data privacy standards. It includes a variety of demographic and lifestyle factors to support a broad range of research and analysis applications.
CC0 (Public Domain)
This real-world customer dataset with 31 variables describes 83,590 instances (customers) from a hotel in Lisbon, Portugal.
The data comprises three full years of customer personal, behavioral, demographic, and geographical information.
Additional information on this dataset can be found in the article A Hotel's customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018), written by Nuno Antonio, Ana de Almeida, and Luis Nunes for Data in Brief (online November 2020).
This dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
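A hedged RFM sketch under assumed column and file names (the actual variable names may differ; see the Data in Brief article for the real schema):

```python
import pandas as pd

# Column and file names below are illustrative assumptions, not the dataset's actual schema.
df = pd.read_csv("hotel_customers.csv")

rfm = pd.DataFrame({
    "recency": df["days_since_last_stay"],   # lower = more recent
    "frequency": df["total_bookings"],
    "monetary": df["total_revenue"],
})

# Quartile-based R, F, M scores (1-4); recency is reversed so recent customers score high.
rfm["R"] = pd.qcut(rfm["recency"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
rfm["RFM_score"] = rfm[["R", "F", "M"]].sum(axis=1)
print(rfm.head())
```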
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset belongs to a leading online E-commerce company. The company wants to identify customers who are likely to churn, so they can proactively approach these customers with promotional offers.
The dataset contains various features related to customer behavior and characteristics, which can be used to predict customer churn.
The main task is to predict customer churn based on the given features. This is a binary classification problem where the target variable is 'Churn'.
This dataset is provided for educational purposes. While it represents a real-world scenario, the data itself may be simulated or anonymized.
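A minimal sketch of the churn task, assuming the data is available as a CSV with the binary Churn column named in the description (the file name is a placeholder):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("ecommerce_churn.csv")  # placeholder file name

y = df["Churn"]
X = pd.get_dummies(df.drop(columns=["Churn"]))  # one-hot encode categorical features
X = X.fillna(X.median(numeric_only=True))       # simple handling of missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```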