14 datasets found
  1. Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive...

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Oct 20, 2022
    Cite
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
    Available download formats
    zip
    Dataset updated
    Oct 20, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LifeSnaps Dataset Documentation

    Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

    The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

    Data Import: Reading CSV

    For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
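
    As a minimal sketch of the CSV route, the snippet below loads a daily-granularity file with pandas.read_csv(). The file name and the id, date, and steps columns are hypothetical placeholders, not the actual LifeSnaps schema; an in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for a LifeSnaps daily CSV; the real column names may differ.
SAMPLE_CSV = """id,date,steps
p01,2021-05-24,8123
p01,2021-05-25,10456
p02,2021-05-24,
"""


def load_daily(source):
    """Read a daily-granularity CSV into a DataFrame, parsing the date column."""
    return pd.read_csv(source, parse_dates=["date"])


# Pass a path such as "daily_fitbit.csv" (hypothetical name) instead of the buffer.
daily = load_daily(io.StringIO(SAMPLE_CSV))
print(daily.dtypes)
print(daily.groupby("id")["steps"].mean())
```

    Empty cells are read as NaN, so per-participant aggregates such as the mean above skip them by default.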

    Data Import: Setting up a MongoDB (Recommended)

    To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.

    To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed first.

    For the Fitbit data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

    For the SEMA data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c sema 

    For the survey data, run the following:

    mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

    If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
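
    The three commands differ only in the collection name, so they can be generated in a loop. The sketch below builds the argument list, adding --username and --password only when credentials are supplied. The dump path is a hypothetical placeholder: the listing above does not show where the collection files live, so substitute the real location before running anything.

```python
import shlex


def mongorestore_args(collection, dump_path, host="localhost:27017",
                      db="rais_anonymized", username=None, password=None):
    """Build the argument list for restoring one collection with mongorestore."""
    args = ["mongorestore", "--host", host, "-d", db, "-c", collection]
    if username is not None:
        # Only needed when access control is enabled on the server.
        args += ["--username", username]
        if password is not None:
            args += ["--password", password]
    args.append(dump_path)  # hypothetical path to the collection's dump file
    return args


for name in ("fitbit", "sema", "surveys"):
    # "<path-to-dump>" is a placeholder; the source does not give the file location.
    cmd = mongorestore_args(name, f"<path-to-dump>/{name}.bson")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

    Printing the commands first makes it easy to check them before executing anything against a database.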

    Data Availability

    The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:

    {
      _id: 
  2. Lifesnaps Fitbit dataset

    • kaggle.com
    Updated Feb 3, 2023
    Cite
    Skylar Carroll (2023). Lifesnaps Fitbit dataset [Dataset]. https://www.kaggle.com/datasets/skywescar/lifesnaps-fitbit-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skylar Carroll
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Taken verbatim from the source: Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

  3. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data that can be used for educational and research purposes. It can be used for data mining, machine learning, and other analytical problems within data science. Because its unit of analysis is the customer, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
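
    To illustrate the RFM idea mentioned above, here is a minimal sketch that scores each customer from 1 to 3 on recency, frequency, and monetary value using rank-based terciles. The DataFrame and its column names are hypothetical and do not reflect the dataset's 31 actual variables; tercile scoring is one common convention, not a method prescribed by the dataset's author.

```python
import pandas as pd

# Hypothetical per-customer summary; these column names are illustrative only.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "days_since_last_stay": [10, 200, 45, 400, 5, 90],
    "bookings": [5, 1, 3, 1, 8, 2],
    "revenue": [1200.0, 150.0, 600.0, 90.0, 2500.0, 300.0],
})


def tercile_score(values, higher_is_better=True):
    """Rank values and split them into three equal-sized groups scored 1 to 3."""
    ranks = values.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranks, 3, labels=[1, 2, 3]).astype(int)


rfm = customers[["customer_id"]].copy()
# Recency: fewer days since the last stay is better, so the ranking is reversed.
rfm["R"] = tercile_score(customers["days_since_last_stay"], higher_is_better=False)
rfm["F"] = tercile_score(customers["bookings"])
rfm["M"] = tercile_score(customers["revenue"])
rfm["segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```

    Ranking before binning avoids duplicate bin edges when many customers share the same value, which is common for counts such as bookings.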

  4. Retail Transactions Dataset

    • kaggle.com
    Updated May 18, 2024
    Cite
    Prasad Patil (2024). Retail Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/prasad22/retail-transactions-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prasad Patil
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:

    Context:

    Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.

    Inspiration:

    The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.

    Dataset Information:

    The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the Dataset:

    • Transaction_ID: A unique identifier for each transaction, represented as a 10-digit number. This column is used to uniquely identify each purchase.
    • Date: The date and time when the transaction occurred. It records the timestamp of each purchase.
    • Customer_Name: The name of the customer who made the purchase. It provides information about the customer's identity.
    • Product: A list of products purchased in the transaction. It includes the names of the products bought.
    • Total_Items: The total number of items purchased in the transaction. It represents the quantity of products bought.
    • Total_Cost: The total cost of the purchase, in currency. It represents the financial value of the transaction.
    • Payment_Method: The method used for payment in the transaction, such as credit card, debit card, cash, or mobile payment.
    • City: The city where the purchase took place. It indicates the location of the transaction.
    • Store_Type: The type of store where the purchase was made, such as a supermarket, convenience store, department store, etc.
    • Discount_Applied: A binary indicator (True/False) representing whether a discount was applied to the transaction.
    • Customer_Category: A category representing the customer's background or age group.
    • Season: The season in which the purchase occurred, such as spring, summer, fall, or winter.
    • Promotion: The type of promotion applied to the transaction, such as "None," "BOGO (Buy One Get One)," or "Discount on Selected Items."

    Use Cases:

    • Market Basket Analysis: Discover associations between products and uncover buying patterns.
    • Customer Segmentation: Group customers based on purchasing behavior.
    • Pricing Optimization: Optimize pricing strategies and identify opportunities for discounts and promotions.
    • Retail Analytics: Analyze store performance and customer trends.
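
    The market basket use case above can be sketched with the standard library alone. The example counts how often each pair of products appears in the same transaction and reports the pair's support (the share of baskets containing both). The transactions are invented, and the real Product column would first need to be parsed from its stored form into Python lists, which is not shown here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each inner list mimics one row's Product list.
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "eggs"],
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered product pair."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and fixes the order within each pair.
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)
for pair, share in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, round(share, 2))
```

    Pair counting is only the first step of association-rule mining; confidence and lift can be derived from the same counts plus single-item frequencies.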

    Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.

  5. Table 2_Digitalising behavioural data collection through cloud-based...

    • figshare.com
    • frontiersin.figshare.com
    xlsx
    Updated Jun 19, 2025
    Cite
    Michelle Braghetti; Liat Vichman; Nareed Farhat; Daniel Simon Mills; Claudia Spadavecchia; Anna Zamansky; Annika Bremhorst (2025). Table 2_Digitalising behavioural data collection through cloud-based technology in veterinary science and beyond.xlsx [Dataset]. http://doi.org/10.3389/fvets.2025.1600619.s001
    Available download formats
    xlsx
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    Frontiers
    Authors
    Michelle Braghetti; Liat Vichman; Nareed Farhat; Daniel Simon Mills; Claudia Spadavecchia; Anna Zamansky; Annika Bremhorst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Field data collection in veterinary and animal behaviour science often faces practical limitations, including time constraints, restricted resources, and difficulties integrating high-quality data capture into real-world clinical workflows. This paper highlights the need for flexible, efficient, and standardised digital solutions that facilitate the collection of multimodal behavioural data in real-world settings. We present a case example using PetsDataLab, a novel cloud-based, “no code” platform designed to enable researchers to create customized apps for efficient and standardised data collection tailored to the behavioural domain, facilitating capture of diverse data types, including video, images, and contextual metadata. We used the platform to develop an app supporting the creation of the Dog Pain Database, a novel comprehensive resource aimed at advancing research on behaviour-based pain indicators in dogs. Using the app, we created a large-scale, structured dataset of dogs with clinically diagnosed conditions expected to be associated with pain and discomfort, including demographic, medical, and pain-related information, alongside high-quality video recordings for future behavioural analyses. To evaluate the app’s usability and its potential for future broader deployment, 14 veterinary professionals tested the app and provided structured feedback via a questionnaire. Results indicated strong usability and clarity, although agreement with using the app in daily clinic life was lower among external testers, pointing to possible barriers to routine integration. This proof-of-concept case study demonstrates the potential of cloud-based platforms like PetsDataLab to bridge research and practice by enabling scalable, standardised, and clinically compatible behavioural data collection. 
    While developed for veterinary pain research, the approach is broadly applicable across behavioural science and supports open science principles through structured, reusable, and interoperable data collection.

  6. Social Media vs Productivity

    • kaggle.com
    Updated May 15, 2025
    Cite
    Mahdi Mashayekhi (2025). Social Media vs Productivity [Dataset]. https://www.kaggle.com/datasets/mahdimashayekhi/social-media-vs-productivity/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle
    Authors
    Mahdi Mashayekhi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    📊 Social Media vs Productivity — Realistic Behavioral Dataset (30,000 Users)

    This dataset explores how daily digital habits — including social media usage, screen time, and notification exposure — relate to individual productivity, stress, and well-being.

    🔍 What’s Inside?

    The dataset contains 30,000 real-world-style records simulating behavioral patterns of people with various jobs, social habits, and lifestyle choices. The goal is to understand how different digital behaviors correlate with perceived and actual productivity.

    🧠 Why This Dataset is Valuable

    • Designed for real-world ML workflows
      Includes missing values, noise, and outliers — ideal for practicing data cleaning and preprocessing.

    • 🔗 High correlation between target features
      The perceived_productivity_score and actual_productivity_score are strongly correlated, making this dataset suitable for experiments in feature selection and multicollinearity.

    • 🛠️ Feature Engineering playground
      Use this dataset to practice feature scaling, encoding, binning, interaction terms, and more.

    • 🧪 Perfect for EDA, regression & classification
      You can model productivity, stress, or satisfaction based on behavior patterns and digital exposure.

    🧾 Columns & Feature Info

    Column Name: Description
    • age: Age of the individual (18–65 years)
    • gender: Gender identity: Male, Female, or Other
    • job_type: Employment sector or status (IT, Education, Student, etc.)
    • daily_social_media_time: Average daily time spent on social media (hours)
    • social_platform_preference: Most-used social platform (Instagram, TikTok, Telegram, etc.)
    • number_of_notifications: Number of mobile/social notifications per day
    • work_hours_per_day: Average hours worked each day
    • perceived_productivity_score: Self-rated productivity score (scale: 0–10)
    • actual_productivity_score: Simulated ground-truth productivity score (scale: 0–10)
    • stress_level: Current stress level (scale: 1–10)
    • sleep_hours: Average hours of sleep per night
    • screen_time_before_sleep: Time spent on screens before sleeping (hours)
    • breaks_during_work: Number of breaks taken during work hours
    • uses_focus_apps: Whether the user uses digital focus apps (True/False)
    • has_digital_wellbeing_enabled: Whether Digital Wellbeing is activated (True/False)
    • coffee_consumption_per_day: Number of coffee cups consumed per day
    • days_feeling_burnout_per_month: Number of burnout days reported per month
    • weekly_offline_hours: Total hours spent offline each week (excluding sleep)
    • job_satisfaction_score: Satisfaction with job/life responsibilities (scale: 0–10)

    📌 Notes

    • Contains NaN values in critical columns (productivity, sleep, stress) for data imputation tasks
    • Includes outliers in media usage, coffee intake, and notification count
    • Target columns are strongly correlated for multicollinearity testing
    • Multi-purpose: regression, classification, clustering, visualization
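
    The cleaning tasks these notes point to can be sketched as follows: median imputation for missing values and clipping of outliers to the interquartile-range fences. The rows are invented and use only three of the listed column names; median imputation and the 1.5 × IQR rule are common defaults chosen here for illustration, not steps recommended by the dataset's author.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using three of the listed column names; values are invented.
df = pd.DataFrame({
    "sleep_hours": [7.0, np.nan, 6.5, 8.0, 5.5],
    "stress_level": [4.0, 6.0, np.nan, 3.0, 8.0],
    "number_of_notifications": [40, 55, 60, 48, 900],
})


def impute_median(frame, columns):
    """Fill missing values in the given columns with each column's median."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out


def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


clean = impute_median(df, ["sleep_hours", "stress_level"])
clean["number_of_notifications"] = clip_iqr(clean["number_of_notifications"])
print(clean)
```

    Clipping keeps every row while limiting the influence of extreme values; dropping the rows instead is the alternative when outliers are believed to be errors.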

    💡 Use Cases

    • Exploratory Data Analysis (EDA)
    • Feature engineering pipelines
    • Machine learning model benchmarking
    • Statistical hypothesis testing
    • Burnout and mental health prediction projects

    📥 Bonus

    👉 Sample notebook coming soon with data cleaning, visualization, and productivity prediction!

  7. Synthetic Student Profiles with Academic Outcomes Dataset

    • opendatabay.com
    Updated Jun 17, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Student Profiles with Academic Outcomes Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/41933042-6ec7-49c4-b151-508fc8f5592b
    Dataset updated
    Jun 17, 2025
    Dataset provided by
    Opendatabay
    Authors
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Education & Learning Analytics
    Description

    The Synthetic Student Performance Dataset is designed to support research, analytics, and educational projects focused on academic performance, family background, and behavioral factors affecting students. It mirrors real-world educational data and offers diverse features to explore student success patterns.

    Dataset Features

    • student_id: Unique identifier for each student.
    • school: Attended school (e.g., GP or MS).
    • sex: Gender of the student (F/M).
    • age: Student's age in years.
    • address_type: Urban or Rural home location.
    • family_size: Family size (Less than or equal to 3 / Greater than 3).
    • parent_status: Parental cohabitation status (Living together / Apart).
    • mother_education / father_education: Highest education level completed (e.g., Primary, Secondary, Higher).
    • mother_job / father_job: Occupation of the student's parents.
    • school_choice_reason: Reason for choosing the school (e.g., Reputation, Proximity).
    • guardian: Primary caregiver (e.g., Mother, Father, Other).
    • travel_time: Daily travel time to school.
    • study_time: Weekly study time outside school.
    • class_failures: Number of past class failures.
    • school_support / family_support: Extra academic support received at school and from family (Yes/No).
    • extra_paid_classes: Attending paid private tutoring (Yes/No).
    • activities: Participation in extracurricular activities (Yes/No).
    • nursery_school: Attended preschool (Yes/No).
    • higher_ed: Desire to pursue higher education (Yes/No).
    • internet_access: Access to the internet at home (Yes/No).
    • romantic_relationship: Currently in a romantic relationship (Yes/No).
    • family_relationship: Quality of family relationships (numeric scale).
    • free_time: Amount of free time after school (numeric scale).
    • social: Frequency of social activities with peers (numeric scale).
    • weekday_alcohol / weekend_alcohol: Alcohol consumption levels on weekdays and weekends.
    • health: Current health status (1–5 scale).
    • absences: Number of school absences.
    • grade_1 / grade_2 / final_grade: First and second period grades and final academic performance.

    Distribution

    Figure: Synthetic student performance data visuals and distribution (https://storage.googleapis.com/opendatabay_public/41933042-6ec7-49c4-b151-508fc8f5592b/7537d999da0b_student_performance_visuals.png)

    Usage

    This dataset is ideal for:

    • Academic Performance Prediction: Predict final grades based on behavioral and background features.
    • Feature Importance Analysis: Identify key influences on student success.
    • Sociological Insights: Understand the impact of family, relationship, and lifestyle factors on education.
    • Model Training: Suitable for classification, regression, and clustering tasks in educational data mining.
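
    As a minimal sketch of the performance-prediction use case, the snippet below fits an ordinary least squares model that predicts final_grade from grade_1 and grade_2. The grade values are invented for illustration and are not drawn from the dataset; a real analysis would hold out a test set and include the other features.

```python
import numpy as np

# Invented grades for six students; not drawn from the actual dataset.
grade_1 = np.array([10.0, 12.0, 8.0, 15.0, 14.0, 9.0])
grade_2 = np.array([11.0, 12.0, 7.0, 16.0, 14.0, 10.0])
final_grade = np.array([11.0, 13.0, 7.0, 16.0, 15.0, 10.0])


def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict(coef, features):
    """Apply fitted coefficients (intercept first) to a feature matrix."""
    return coef[0] + features @ coef[1:]


X = np.column_stack([grade_1, grade_2])
coef = fit_linear(X, final_grade)
rmse = float(np.sqrt(np.mean((predict(coef, X) - final_grade) ** 2)))
print("coefficients:", coef)
print("training RMSE:", rmse)
```

    The RMSE here is measured on the training rows, so it understates the error to expect on unseen students.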

    Coverage

    Captures a comprehensive view of student life, including family background, academic history, health, and lifestyle. The dataset supports multi-disciplinary research across education, sociology, and data science.

    License

    CC0 (Public Domain)

    Who Can Use It

    • Educational Researchers: For testing interventions and identifying risk factors.
    • Data Scientists and ML Practitioners: For building predictive models in education.
    • Instructors and Students: For coursework in data analysis, machine learning, and statistics.
  8. Replication data for "Lightweight Behavior-Based Malware Detection"

    • dataverse.unimi.it
    Updated Nov 3, 2024
    Cite
    Nicola Bena; Marco Anisetti; Claudio A. Ardagna; Gabriele Gianini; Vincenzo Giandomenico (2024). Replication data for "Lightweight Behavior-Based Malware Detection" [Dataset]. http://doi.org/10.13130/RD_UNIMI/LJ6Z8V
    Available download formats
    bin(4245), txt(240000), txt(2018), bin(36712), bin(156998), text/x-python(1339), text/markdown(13542), bin(55), text/x-python(1436), txt(1694), tsv(119217), text/x-python(1147), tsv(119113), zip(4469422), text/x-python(1126), zip(52781371), application/x-ipynb+json(8672), zip(3251335), bin(27523), tsv(112040), tsv(10111), application/x-ipynb+json(137862), tsv(119218), tsv(10289), tsv(112144), application/x-ipynb+json(1736968), application/x-ipynb+json(11533), tsv(118946), application/x-ipynb+json(121867), bin(1228541)
    Dataset updated
    Nov 3, 2024
    Dataset provided by
    UNIMI Dataverse
    Authors
    Nicola Bena; Marco Anisetti; Claudio A. Ardagna; Gabriele Gianini; Vincenzo Giandomenico
    License

    https://dataverse.unimi.it/api/datasets/:persistentId/versions/2.1/customlicense?persistentId=doi:10.13130/RD_UNIMI/LJ6Z8V

    Description

    Dataset containing real-world and synthetic samples of legitimate and malware behavior in the form of time series. The samples record machine-level performance metrics: CPU usage, RAM usage, and the number of bytes read from and written to disk and network. Synthetic samples are generated using a GAN.

  9. Understanding and Managing Missing Data.pdf

    • figshare.com
    pdf
    Updated Jun 9, 2025
    Cite
    Ibrahim Denis Fofanah (2025). Understanding and Managing Missing Data.pdf [Dataset]. http://doi.org/10.6084/m9.figshare.29265155.v1
    Available download formats
    pdf
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    figshare
    Authors
    Ibrahim Denis Fofanah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This document provides a clear and practical guide to understanding missing data mechanisms, including Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). Through real-world scenarios and examples, it explains how different types of missingness impact data analysis and decision-making. It also outlines common strategies for handling missing data, including deletion techniques and imputation methods such as mean imputation, regression, and stochastic modeling. Designed for researchers, analysts, and students working with real-world datasets, this guide helps ensure statistical validity, reduce bias, and improve the overall quality of analysis in fields like public health, behavioral science, social research, and machine learning.
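
    The difference between missingness mechanisms can be made concrete with a small simulation. This sketch is not taken from the guide itself: it generates synthetic values, removes some under an MCAR rule and some under an MNAR rule (here, an assumed rule that values above a threshold are never recorded), applies mean imputation to both, and compares the resulting means with the true one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated "true" values; the distribution and sizes are arbitrary choices.
true_values = rng.normal(loc=50.0, scale=10.0, size=10_000)


def observe_mcar(values, rate, rng):
    """MCAR: every value has the same chance of being missing."""
    out = values.copy()
    out[rng.random(values.size) < rate] = np.nan
    return out


def observe_mnar(values, threshold):
    """MNAR: values above a threshold are never recorded."""
    out = values.copy()
    out[values > threshold] = np.nan
    return out


def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    return np.where(np.isnan(values), np.nanmean(values), values)


mcar = mean_impute(observe_mcar(true_values, 0.3, rng))
mnar = mean_impute(observe_mnar(true_values, 60.0))
print("true mean:", true_values.mean())
print("MCAR + mean imputation:", mcar.mean())
print("MNAR + mean imputation:", mnar.mean())
```

    Under MCAR the imputed mean stays close to the true mean, while under this MNAR rule it is pulled downward because the missing values are systematically the large ones. In both cases mean imputation shrinks the variance, since every filled-in entry sits exactly at the observed mean.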

  10. Replication Data for: TimeX

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Queen, Owen (2023). Replication Data for: TimeX [Dataset]. http://doi.org/10.7910/DVN/B0DEQJ
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Queen, Owen
    Description

    Interpreting time series models is uniquely challenging because it requires identifying both the location of time series signals that drive model predictions and their matching to an interpretable temporal pattern. While explainers from other modalities can be applied to time series, their inductive biases do not transfer well to the inherently uninterpretable nature of time series. We present TIMEX, a time series consistency model for training explainers. TIMEX trains an interpretable surrogate to mimic the behavior of a pretrained time series model. It addresses the issue of model faithfulness by introducing model behavior consistency, a novel formulation that preserves relations in the latent space induced by the pretrained model with relations in the latent space induced by TIMEX. TIMEX provides discrete attribution maps and, unlike existing interpretability methods, it learns a latent space of explanations that can be used in various ways, such as to provide landmarks to visually aggregate similar explanations and easily recognize temporal patterns. We evaluate TIMEX on 8 synthetic and real-world datasets and compare its performance against state-of-the-art interpretability methods. We also conduct case studies using physiological time series. Quantitative evaluations demonstrate that TIMEX achieves the highest or second-highest performance in every metric compared to baselines across all datasets. Through case studies, we show that the novel components of TIMEX show potential for training faithful, interpretable models that capture the behavior of pretrained time series models.

  11. Mental Health Chatbot Pairs

    • kaggle.com
    Updated Nov 27, 2023
    Cite
    The Devastator (2023). Mental Health Chatbot Pairs [Dataset]. https://www.kaggle.com/datasets/thedevastator/mental-health-chatbot-pairs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Mental Health Chatbot Pairs

    AI-based Tailored Support for Mental Health Conversation

    By Huggingface Hub [source]

    About this dataset

    This dataset contains a compilation of carefully crafted Q&A pairs designed to provide AI-based tailored support for mental health. These questions and answers offer an avenue for those looking for help to gain the assistance they need. With these pre-processed conversations, Artificial Intelligence (AI) solutions can be developed and deployed to better understand and respond appropriately to individual needs based on their input. The dataset is described as crafted by experts in the mental health field, providing content that can further research in this growing area. These data points are intended to support the development of personalized AI-based mental health chatbots that better understand what people need.

    How to use the dataset

    This dataset contains pre-processed Q&A pairs for AI-based tailored support for mental health. As such, it represents an excellent starting point in building a conversational model which can handle conversations about mental health issues. Here are some tips on how to use this dataset to its fullest potential:

    • Understand your data: Spend time getting to know the text of the conversation between the user and the chatbot and familiarize yourself with what type of questions and answers are included in this specific dataset. This will help you better formulate queries for your own conversational model or develop new ones you can add yourself.

    • Refine your language processing models: By studying the patterns in syntax, grammar, tone, and voice within this conversational dataset, you can hone natural language processing capabilities, such as keyword or entity extraction, before integrating them into a larger bot system.

    • Test assumptions: Have an idea of what may work best with a particular audience or context? Check whether the assumption holds by trying different variations of text against this dataset before rolling out changes across other channels or programs that use AI/chatbot services.

    • Research and analyze results: After testing different scenarios with real-world users, using various forms of Q&A drawn from this dataset, analyze and record any results that help you understand how users behave after being exposed, both passively and actively, to tailored text conversations about mental health topics. The more information you collect, the closer you get to AI-powered conversations that produce the desired outcomes for your customer base.

    Research Ideas

    • Developing a chatbot for personalized mental health advice and guidance tailored to individuals' unique needs, experiences, and struggles.
    • Creating an AI-driven diagnostic system that can interpret mental health conversations and provide targeted recommendations for interventions or treatments based on clinical expertise.
    • Designing an AI-powered recommendation engine to suggest relevant content such as articles, videos, or podcasts based on users’ questions or topics of discussion during their conversation with the chatbot.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------|
    | text | The text of the conversation between the user and the chatbot. (String) |
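As a rough illustration of working with the single `text` column described above, the sketch below reads a CSV with Python's standard `csv` module and reports basic length statistics. The two sample rows are made-up placeholders rather than rows from the dataset, and the helper name `load_texts` is hypothetical.

```python
import csv
import io

# Stand-in for train.csv: one "text" column, as in the Columns table.
# These rows are invented placeholders, not data from the real file.
SAMPLE_CSV = 'text\n"Example question? Example answer."\n"Another exchange."\n'


def load_texts(handle):
    """Return the non-empty values of the 'text' column from an open CSV file."""
    texts = []
    for row in csv.DictReader(handle):
        text = (row.get("text") or "").strip()
        if text:
            texts.append(text)
    return texts


texts = load_texts(io.StringIO(SAMPLE_CSV))
lengths = [len(t.split()) for t in texts]  # whitespace-delimited word counts
print(f"{len(texts)} conversations, mean length {sum(lengths) / len(lengths):.1f} words")
```

With the real file, the same helper could be called as `load_texts(open("train.csv", newline="", encoding="utf-8"))`, assuming the file sits in the working directory.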

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  12. Synthetic Dry Eye Disease Patient Records

    • opendatabay.com
    Updated Jun 24, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Dry Eye Disease Patient Records [Dataset]. https://www.opendatabay.com/data/synthetic/f4e9ad52-5d13-4d2e-ac19-207a5b71522e
    Dataset updated
    Jun 24, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Patient Health Records & Digital Health
    Description

    The Dry Eye Disease Patient Records (Synthetic) dataset is designed for educational and research purposes, to analyze patterns in sleep behavior, stress levels, and lifestyle factors and their potential links to dry eye disease. It provides anonymized, synthetic data on various health conditions and behavioral habits.

    Dataset Features

    • Gender: Gender of the individual (Male/Female).
    • Age: Age of the individual.
    • Sleep Duration: Average sleep duration in hours.
    • Sleep Quality: Subjective assessment of sleep quality (scale-based).
    • Stress Level: Measured stress level (scale-based).
    • Heart Rate: Resting heart rate (bpm).
    • Daily Steps: Number of steps taken per day.
    • Physical Activity: Minutes of physical activity per day.
    • Height & Weight: Individual’s height (cm) and weight (kg).
    • Sleep Disorder: Presence of a diagnosed sleep disorder (Yes/No).
    • Wake Up During Night: Whether the individual wakes up during the night (Yes/No).
    • Feel Sleepy During Day: Self-reported daytime sleepiness (Yes/No).
    • Caffeine Consumption: Whether the individual consumes caffeine (Yes/No).
    • Alcohol Consumption: Whether the individual consumes alcohol (Yes/No).
    • Smoking: Smoking habits (Yes/No).
    • Medical Issue: Presence of any medical conditions (Yes/No).
    • Ongoing Medication: Use of any ongoing medication (Yes/No).
    • Smart Device Before Bed: Usage of smart devices before sleeping (Yes/No).
    • Average Screen Time: Daily screen time in hours.
    • Blue-Light Filter: Use of blue-light filters on devices (Yes/No).
    • Eye Discomfort & Strain: Presence of discomfort and eye strain (Yes/No).
    • Redness in Eye: Occurrence of eye redness (Yes/No).
    • Itchiness/Irritation in Eye: Symptoms of eye itchiness or irritation (Yes/No).
    • Dry Eye Disease: Diagnosis of Dry Eye Disease (Yes/No).
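The feature list above mixes numeric fields with Yes/No flags, so a first step in most analyses is to turn the flags into booleans. The sketch below does that and compares the share of dry-eye cases below and above a screen-time threshold. It assumes the file's column headers match the feature names listed above, which the listing does not confirm; the four sample rows and the 6-hour threshold are invented for illustration.

```python
def yes_no(value):
    """Map a 'Yes'/'No' string to a boolean; reject anything else."""
    v = value.strip().lower()
    if v == "yes":
        return True
    if v == "no":
        return False
    raise ValueError(f"expected Yes/No, got {value!r}")


def rate_by_screen_time(rows, threshold_hours):
    """Share of rows flagged with dry eye disease, below vs. at-or-above the threshold."""
    groups = {"below": [], "at_or_above": []}
    for row in rows:
        hours = float(row["Average Screen Time"])  # assumed column name
        key = "at_or_above" if hours >= threshold_hours else "below"
        groups[key].append(yes_no(row["Dry Eye Disease"]))  # assumed column name
    # sum() over booleans counts the True values
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


# Invented sample rows, not taken from the dataset.
rows = [
    {"Average Screen Time": "8.5", "Dry Eye Disease": "Yes"},
    {"Average Screen Time": "2.0", "Dry Eye Disease": "No"},
    {"Average Screen Time": "9.1", "Dry Eye Disease": "Yes"},
    {"Average Screen Time": "3.5", "Dry Eye Disease": "Yes"},
]
rates = rate_by_screen_time(rows, threshold_hours=6.0)
print(rates)
```

Because the data are synthetic, any association found this way describes the generator rather than real patients.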

    Distribution

    [Figure: Dry Eye Disease Patient Records Synthetic Data] https://storage.googleapis.com/opendatabay_public/f4e9ad52-5d13-4d2e-ac19-207a5b71522e/2e2e949519d7_eye.png

    Usage

    This dataset can be used for the following applications:

    • Healthcare Analytics: Identify patterns between lifestyle factors and dry eye disease.
    • Predictive Modeling: Develop machine learning models to predict eye health risks.
    • Clinical Research: Investigate associations between screen time, sleep habits, and eye conditions.
    • Educational Purposes: Provide a dataset for students in medical, data science, and public health fields to analyze real-world health trends.

    Coverage

    This synthetic dataset is fully anonymized and complies with data privacy standards. It includes a variety of demographic and lifestyle factors to support a broad range of research and analysis.

    License

    CC0 (Public Domain)

    Who Can Use It

    • Healthcare Researchers: To explore correlations between lifestyle habits and dry eye disease.
    • Clinicians and Medical Practitioners: To analyze factors contributing to eye health issues.
    • Data Scientists and Machine Learning Practitioners: To develop predictive models for eye-related conditions.
    • Educators and Students: As a resource for studying health analytics and medical research.
  13. A hotel's customers dataset

    • kaggle.com
    Updated Nov 27, 2020
    Cite
    Nuno Antonio (2020). A hotel's customers dataset [Dataset]. https://www.kaggle.com/nantonio/a-hotels-customers-dataset/metadata
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 27, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nuno Antonio
    Description

    Context

    This real-world customer dataset with 31 variables describes 83,590 instances (customers) from a hotel in Lisbon, Portugal.

    Content

    The data comprises three full years of customer personal, behavioral, demographic, and geographical information.

    Acknowledgements

    Additional information on this dataset can be found in the article "A Hotel's customers personal, behavioral, demographic, and geographic dataset from Lisbon, Portugal (2015-2018)", written by Nuno Antonio, Ana de Almeida, and Luis Nunes for Data in Brief (online November 2020).

    Inspiration

    This dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Because its unit of analysis is the customer, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
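To make the RFM idea concrete, here is a minimal sketch of rank-based RFM scoring in plain Python. It is one common way to build such a model, not the method used by the dataset's authors. The field names (`days_since_last_stay`, `bookings`, `revenue`), the customer records, and the choice of three score bands are all assumptions made for illustration; the real file's columns may differ.

```python
def tertile_scores(values, higher_is_better=True):
    """Score each value from 1 to 3 by rank tertile, where 3 is the best band."""
    # Sort indices so that the worst value comes first (rank 0).
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = 1 + (3 * rank) // len(values)
    return scores


# Invented customers; field names are hypothetical.
customers = [
    {"id": "A", "days_since_last_stay": 10,  "bookings": 5, "revenue": 1200.0},
    {"id": "B", "days_since_last_stay": 200, "bookings": 1, "revenue": 150.0},
    {"id": "C", "days_since_last_stay": 45,  "bookings": 3, "revenue": 600.0},
    {"id": "D", "days_since_last_stay": 400, "bookings": 1, "revenue": 90.0},
    {"id": "E", "days_since_last_stay": 5,   "bookings": 8, "revenue": 2500.0},
    {"id": "F", "days_since_last_stay": 90,  "bookings": 2, "revenue": 300.0},
]

# Recency: fewer days since the last stay is better, so the ordering is inverted.
r = tertile_scores([c["days_since_last_stay"] for c in customers], higher_is_better=False)
f = tertile_scores([c["bookings"] for c in customers])
m = tertile_scores([c["revenue"] for c in customers])

rfm = {c["id"]: f"{r[i]}{f[i]}{m[i]}" for i, c in enumerate(customers)}
print(rfm)
```

Customers with the same three-digit code fall into the same segment; a clustering model could use the three scores as input features instead.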

  14. E-commerce Customer Churn

    • kaggle.com
    Updated Aug 6, 2024
    Cite
    Samuel Semaya (2024). E-commerce Customer Churn [Dataset]. https://www.kaggle.com/datasets/samuelsemaya/e-commerce-customer-churn
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    Kaggle
    Authors
    Samuel Semaya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    E-commerce Customer Churn Dataset

    Context

    This dataset belongs to a leading online E-commerce company. The company wants to identify customers who are likely to churn, so they can proactively approach these customers with promotional offers.

    Content

    The dataset contains various features related to customer behavior and characteristics, which can be used to predict customer churn.

    Features

    1. Tenure: Tenure of a customer in the company (numeric)
    2. WarehouseToHome: Distance from the warehouse to the customer's home (numeric)
    3. NumberOfDeviceRegistered: Total number of devices registered to a particular customer (numeric)
    4. PreferedOrderCat: Preferred order category of a customer in the last month (categorical)
    5. SatisfactionScore: Satisfaction score of a customer on service (numeric)
    6. MaritalStatus: Marital status of a customer (categorical)
    7. NumberOfAddress: Total number of addresses added for a particular customer (numeric)
    8. Complaint: Whether any complaint has been raised in the last month (binary)
    9. DaySinceLastOrder: Days since last order by customer (numeric)
    10. CashbackAmount: Average cashback in last month (numeric)
    11. Churn: Churn flag (target variable, binary)

    Task

    The main task is to predict customer churn based on the given features. This is a binary classification problem where the target variable is 'Churn'.
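Before training a model for the task above, it can help to have a simple baseline to beat. The sketch below evaluates a one-rule classifier that predicts churn whenever `Complaint` is 1. The rule is an illustrative assumption, not a finding about the data, and the five rows are invented; `Complaint` and `Churn` follow the feature list above and are assumed to be encoded as 0/1.

```python
def evaluate(rows, predict):
    """Compute accuracy, precision, and recall of `predict` against the Churn flag."""
    tp = fp = tn = fn = 0
    for row in rows:
        actual = int(row["Churn"]) == 1
        predicted = bool(predict(row))
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(rows),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def complaint_rule(row):
    """Baseline: flag a customer as likely to churn if a complaint was raised."""
    return row["Complaint"] == 1


# Invented rows for illustration only.
rows = [
    {"Complaint": 1, "Tenure": 2,  "Churn": 1},
    {"Complaint": 0, "Tenure": 24, "Churn": 0},
    {"Complaint": 1, "Tenure": 12, "Churn": 0},
    {"Complaint": 0, "Tenure": 1,  "Churn": 1},
    {"Complaint": 0, "Tenure": 30, "Churn": 0},
]
metrics = evaluate(rows, complaint_rule)
print(metrics)
```

Any trained classifier can be passed to the same `evaluate` function as a callable, which keeps the comparison with the baseline like for like.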

    Potential Applications

    1. Customer Retention: Identify at-risk customers and take proactive measures to retain them.
    2. Targeted Marketing: Design specific marketing campaigns for customers likely to churn.
    3. Service Improvement: Analyze features contributing to churn and improve those aspects of the service.

    Acknowledgements

    This dataset is provided for educational purposes. While it represents a real-world scenario, the data itself may be simulated or anonymized.

Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas (2022). LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild [Dataset]. http://doi.org/10.5281/zenodo.6832242
Data from: LifeSnaps: a 4-month multi-modal dataset capturing unobtrusive snapshots of our lives in the wild

Available download formats: zip
Dataset updated
Oct 20, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sofia Yfantidou; Christina Karagianni; Stefanos Efstathiou; Athena Vakali; Joao Palotti; Dimitrios Panteleimon Giakatos; Thomas Marchioro; Andrei Kazlouski; Elena Ferrari; Šarūnas Girdzijauskas
License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

LifeSnaps Dataset Documentation

Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.

The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.

Data Import: Reading CSV

For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files in any programming language. For example, in Python, you can read a file into a pandas DataFrame with the pandas.read_csv() function.
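A minimal sketch of the `pandas.read_csv()` route is shown below. The column names (`id`, `date`, `steps`) and the sample rows are hypothetical stand-ins, since the actual CSV schema is not listed here; an in-memory buffer is used so the snippet runs without the dataset, and a real file path would take its place.

```python
import io

import pandas as pd

# Hypothetical daily-granularity extract; the real LifeSnaps columns may differ.
SAMPLE = (
    "id,date,steps\n"
    "p01,2021-05-24,8123\n"
    "p01,2021-05-25,10456\n"
    "p02,2021-05-24,4310\n"
)

# parse_dates converts the date column to datetime64 while reading.
df = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["date"])

print(df.dtypes)
print(df.groupby("id")["steps"].mean())  # per-participant daily average
```

For a file on disk, the first argument would simply be its path, for example `pd.read_csv("path/to/file.csv", parse_dates=["date"])`, where the path and the date column name are placeholders.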

Data Import: Setting up a MongoDB (Recommended)

To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data by importing the LifeSnaps MongoDB database.

To do so, open the terminal/command prompt and run the following command for each collection in the database. Ensure you have the MongoDB Database Tools installed.

For the Fitbit data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c fitbit 

For the SEMA data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c sema 

For the survey data, run the following:

mongorestore --host localhost:27017 -d rais_anonymized -c surveys 

If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
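As a sketch of how the three restore commands and the optional credentials fit together, the helper below builds the `mongorestore` argument list for each collection without running it. The function name `restore_command`, the `bson_path` values, and the example credentials are hypothetical placeholders; the location of each collection's dump file is not given here, so substitute the path from your own download.

```python
import shlex

COLLECTIONS = ("fitbit", "sema", "surveys")


def restore_command(collection, bson_path, username=None, password=None,
                    host="localhost:27017", db="rais_anonymized"):
    """Build the mongorestore argument list for one collection."""
    cmd = ["mongorestore", "--host", host]
    # Credentials are only needed when access control is enabled.
    if username is not None:
        cmd += ["--username", username]
    if password is not None:
        cmd += ["--password", password]
    # bson_path is a placeholder for the collection's dump file.
    cmd += ["-d", db, "-c", collection, bson_path]
    return cmd


for name in COLLECTIONS:
    cmd = restore_command(name, f"path/to/{name}.bson",
                          username="alice", password="s3cret")
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to execute it. Supplying `--username` without `--password` should make `mongorestore` prompt for the password, which keeps it out of the shell history.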

Data Availability

The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:

{
  _id: 