3 datasets found

FAIR Dataset for Disease Prediction in Healthcare Applications
test.researchdata.tuwien.ac.at
bin, csv, json, png
Updated Apr 14, 2025
Cite
Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
Available download formats: csv, json, bin, png
Unique identifier
https://doi.org/10.70124/5n77a-dnf02
Dataset updated
Apr 14, 2025
Dataset provided by
TU Wien
Authors
Sufyan Yousaf
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Description

Context and Methodology

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model that predicts disease outcomes in patients from a set of features. It was used to train, validate, and test that model.

Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

Dataset Creation:
Data preprocessing involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was curated to ensure its quality and relevance to the prediction task. Missing values and outliers were handled with appropriate techniques (e.g., imputation or removal).

Technical Details

Structure of the Dataset:
The dataset consists of several files organized into folders by data type:

Training Data: Contains the training dataset used to train the machine learning model.

Validation Data: Used for hyperparameter tuning and model selection.

Test Data: Reserved for final model evaluation.

Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
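As a minimal sketch of how the three files might be loaded: the file names are taken from the description above, but the folder names are not stated, so the hypothetical helper load_splits below searches a root directory recursively. It uses only the Python standard library; pandas.read_csv would serve equally well.

```python
import csv
from pathlib import Path

# File names as given in the dataset description. The enclosing folder
# names are not documented, so they are discovered by a recursive search.
SPLIT_FILES = {
    "train": "train_data.csv",
    "validation": "validation_data.csv",
    "test": "test_data.csv",
}


def load_splits(root):
    """Return {split name: list of row dicts} for the split files found under root."""
    splits = {}
    for split, filename in SPLIT_FILES.items():
        matches = sorted(Path(root).rglob(filename))
        if not matches:
            raise FileNotFoundError(f"{filename} not found under {root}")
        # Each row becomes a dict keyed by the CSV header; values stay strings.
        with matches[0].open(newline="", encoding="utf-8") as handle:
            splits[split] = list(csv.DictReader(handle))
    return splits
```

Calling load_splits on the unpacked download directory would then give one list of rows per split; column values remain strings and need converting before modelling.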

Software Requirements:
To open and work with this dataset, use an environment such as VS Code or Jupyter together with tools like:

Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)

Further Details

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated on the held-out test set so that the reported performance reflects unseen data.

Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.
Tour Recommendation Model
test.researchdata.tuwien.at
test.researchdata.tuwien.ac.at
bin, png +1
Updated May 14, 2025
Cite
Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
Available download formats: text/markdown, png, bin
Unique identifier
https://doi.org/10.70124/akpf6-8p175
Dataset updated
May 14, 2025
Dataset provided by
TU Wien
Authors
Muhammad Mobeel Akbar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 28, 2025
Description
Dataset Description for Tour Recommendation Model

Context and Methodology:

Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.
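The cleaning step described above could look roughly like the following. This is a sketch, not the authors' code: the helper name clean_ratings is hypothetical, and the assumption is that each row is a dict with a rating field, as produced by csv.DictReader.

```python
def clean_ratings(rows):
    """Keep only rows whose rating parses as a number between 1 and 5."""
    cleaned = []
    for row in rows:
        raw = str(row.get("rating") or "").strip()
        try:
            rating = float(raw)
        except ValueError:
            # Drops empty cells and spreadsheet errors such as "#NAME?".
            continue
        if 1 <= rating <= 5:
            cleaned.append({**row, "rating": rating})
    return cleaned
```

Rows with a missing rating, a spreadsheet error string, or a value outside the documented 1 to 5 range are dropped; all other fields pass through unchanged.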

Technical Details:

Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

place_or_event_id: Unique identifier for each tourist place or event.

rating: Rating given by the user, ranging from 1 to 5.

The data is split into three subsets:

Training Set: 80% of the dataset used to train the model.

Validation Set: A small portion used for hyperparameter tuning.

Test Set: 20% used to evaluate model performance.
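Since the stated training and test shares already add up to 100%, one possible reading is that the validation set is carved out of the training portion. The sketch below follows that reading; it is an assumption, and the validation_fraction default of 0.1 is a hypothetical value, because the description only says "a small portion".

```python
import random


def split_rows(rows, test_fraction=0.2, validation_fraction=0.1, seed=42):
    """Shuffle rows, hold out a test set, then carve validation out of the rest.

    validation_fraction is relative to the non-test portion and is a
    hypothetical value, not one documented for this dataset.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    test, remainder = shuffled[:n_test], shuffled[n_test:]
    n_validation = round(len(remainder) * validation_fraction)
    validation, train = remainder[:n_validation], remainder[n_validation:]
    return train, validation, test
```

With these defaults, 100 rows would yield 72 training, 8 validation, and 20 test rows.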

Folder and File Naming Conventions:
The dataset files are stored in the following structure:

user_ratings_dataset.csv: The original dataset file containing user ratings.

tour_recommendation_model.pkl: The saved model after training.

actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

Software Requirements:
To open and work with this dataset, the following software and libraries are required:

Python 3.x

Pandas for data manipulation

Scikit-learn for training and evaluating machine learning models

Matplotlib for chart generation

Joblib for saving and loading the trained model

The dataset can be opened and processed using any Python environment that supports these libraries.

Additional Resources:

The model training code, README file, and performance chart are available in the project repository.

For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

Further Details:

Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

Train other types of models (e.g., regression, classification).

Experiment with different features or add more metadata to enrich the dataset.

Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Privacy-Sensitive Conversations between Care Workers and Care Home Residents...
researchdata.tuwien.ac.at
test.researchdata.tuwien.ac.at
bin, text/markdown
Updated Feb 25, 2025
Cite
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2025). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.48436/q1kt0-edc53
Available download formats: bin, text/markdown
Unique identifier
https://doi.org/10.48436/q1kt0-edc53
Dataset updated
Feb 25, 2025
Dataset provided by
TU Wien
Authors
Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 2024 - Aug 2024
Description
Dataset Card for "privacy-care-interactions"

Table of Contents

Dataset Description

Purpose and Features

Dataset Overview

Language Distribution

Locale Distribution

Key Facts

Dataset Structure

Data Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

Dataset Description

Purpose and Features

🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

Dataset Overview

Total entries: 95

Number of distinct taxonomy categories in the public dataset: 4

Number of distinct conversational categories in public dataset: 7

Papers:

Continues the work of: Privacy Agents: Utilizing Large Language Models to Safeguard Contextual Integrity in Elderly Care

Continues the work of: Prototype of a care documentation support system using audio recordings of care actions and large language models

Language Distribution 🌍

English (en): 95

Locale Distribution 🌎

United States (US) 🇺🇸: 95

Key Facts 🔑

This is synthetic data! Generated using proprietary algorithms - no privacy violations!

Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).

The data was manually labeled by an expert.

Dataset Structure

Data Instances

The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

{ "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
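A short sketch of reading this format with the standard library follows. The helper name read_jsonl is hypothetical; the sample string is the example entry quoted above, written as it would appear on a single line of a .jsonl file.

```python
import json


def read_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


# The example entry from the dataset card, as one line of a .jsonl file.
sample = (
    '{"text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", '
    '"taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", '
    '"locale": "US", "data_type": 1, "uid": 16, "split": "train"}'
)
entries = read_jsonl([sample, ""])
```

Because an open text file iterates line by line, the same function can be passed a file handle directly, for instance one opened on split-train-en.jsonl with encoding="utf-8".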

Data Fields

The data fields are:

text: a string feature. The speaker abbreviations refer to the care worker (CW) and the care recipient (CR).

taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.

category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.

affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.

language: a string feature. Language code as defined by ISO 639.

locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.

data_type: a classification label, with possible values including real (0), synthetic (1).

uid: an int64 feature. A unique identifier within the dataset.

split: a string feature. Either train, validation or test.
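The integer labels listed above can be turned back into readable names with simple lookup lists. In this sketch the label names and their order are copied from the field descriptions; the constant names and the helper decode_entry are hypothetical.

```python
# Label names in index order, as listed in the field descriptions.
TAXONOMY = [
    "informational", "invasion", "collection", "processing", "dissemination",
    "physical", "personal-space", "territoriality", "intrusion", "obtrusion",
    "contamination", "modesty", "psychological", "interrogation",
    "psychological-distance", "social", "association", "crowding-isolation",
    "public-gaze", "solitude", "intimacy", "anonymity", "reserve",
]
CATEGORY = [
    "personal-information", "family", "health", "thoughts",
    "values", "acquaintance", "appointment",
]
AFFECTED_SPEAKER = ["care-worker", "care-recipient", "other", "both"]
DATA_TYPE = ["real", "synthetic"]

LABEL_NAMES = {
    "taxonomy": TAXONOMY,
    "category": CATEGORY,
    "affected_speaker": AFFECTED_SPEAKER,
    "data_type": DATA_TYPE,
}


def decode_entry(entry):
    """Return a copy of entry with integer class labels replaced by their names."""
    decoded = dict(entry)
    for field, names in LABEL_NAMES.items():
        decoded[field] = names[entry[field]]
    return decoded
```

Applied to the example entry from the Data Instances section, this maps taxonomy 0 to informational, category 0 to personal-information, affected_speaker 1 to care-recipient, and data_type 1 to synthetic.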

Dataset Splits

The dataset has 2 subsets:

split: with a total of 95 examples split into train, validation and test (70%-15%-15%)

unsplit: with a total of 95 examples in a single train split

name     train  validation  test
split    66     14          15
unsplit  95     n/a         n/a

The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

split-train-en.jsonl

split-validation-en.jsonl

split-test-en.jsonl

unsplit-train-en.jsonl
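The naming convention can be parsed mechanically, for example to select files by subset or split. This is a minimal sketch; the helper name parse_dataset_filename is hypothetical, and it assumes that none of the three components contains a hyphen, which holds for the four files listed.

```python
from pathlib import Path


def parse_dataset_filename(filename):
    """Split a 'subset-split-language.jsonl' file name into its three components."""
    path = Path(filename)
    if path.suffix != ".jsonl":
        raise ValueError(f"expected a .jsonl file, got {filename!r}")
    parts = path.stem.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected subset-split-language, got {path.stem!r}")
    subset, split, language = parts
    return {"subset": subset, "split": split, "language": language}
```

For split-validation-en.jsonl this gives the subset split, the split validation, and the language en.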

Dataset Creation

Curation Rationale

Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

Source Data

Initial Data Collection

The initial data was collected in the Caring Robots project of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support care workers' documentation by generating summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

Data Processing

The transcriptions were thoroughly reviewed, and two experts identified and marked the sections containing privacy-sensitive information using qualitative data analysis software. The sections were then translated from German to U.S. English using the locally executed LLM icky/translate. In the next step, a second locally run model, llama3.1:70b, was used to synthesize the conversation segments, generating similar yet distinct new conversations that are not linked to the original data. The dataset was split using the train_test_split function from scikit-learn (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).