EUCA dataset description Associated Paper: EUCA: the End-User-Centered Explainable AI Framework
Authors: Weina Jin, Jianyu Fan, Diane Gromala, Philippe Pasquier, Ghassan Hamarneh
Introduction: The EUCA dataset is for modelling personalized or interactive explainable AI. It contains 309 data points of 32 end-users' preferences on 12 forms of explanation (including feature-, example-, and rule-based explanations). The data were collected from a user study with 32 layperson participants in the Greater Vancouver area in 2019-2020. In the user study, the participants (P01-P32) were presented with AI-assisted critical tasks on house price prediction, health status prediction, purchasing a self-driving car, and studying for a biological exam [1]. Within each task and for its given explanation goal [2], the participants selected and ranked the explanatory forms [3] that they considered most suitable.
1 EUCA_EndUserXAI_ExplanatoryFormRanking.csv
Column description:
Index - Participant number
Case - Task-explanation goal combination
accept to use AI? trust it? - Participant's response to whether they would use the AI given the task and explanation goal
require explanation? - Participant's response to whether they request an explanation from the AI
1st, 2nd, 3rd, ... - Explanatory form card selection and ranking
cards fulfill requirement? - After the card selection, participants were asked whether the selected card combination fulfilled their explainability requirement.
2 EUCA_EndUserXAI_demography.csv
It contains the participants demographics, including their age, gender, educational background, and their knowledge and attitudes toward AI.
EUCA dataset zip file for download
More Context for EUCA Dataset
[1] Critical tasks
There are four tasks. The task labels and their corresponding task titles are:
house - Selling your house
car - Buying an autonomous driving vehicle
health - Personal health decision
bird - Learning bird species
Please refer to the EUCA quantitative data analysis report for the storyboard of the tasks and explanation goals presented in the user study.
[2] Explanation goal
End-users may have different goals or purposes when checking an explanation from AI. The EUCA dataset includes the following 11 explanation goals, each listed with its [label] in the dataset, full name, and description:
[trust] Calibrate trust: trust is a key to establish human-AI decision-making partnership. Since users can easily distrust or overtrust AI, it is important to calibrate the trust to reflect the capabilities of AI systems.
[safe] Ensure safety: users need to ensure safety of the decision consequences.
[bias] Detect bias: users need to ensure the decision is impartial and unbiased.
[unexpect] Resolve disagreement with AI: the AI prediction is unexpected and there are disagreements between users and AI.
[expected] Expected: the AI's prediction is expected and aligns with users' expectations.
[differentiate] Differentiate similar instances: due to the consequences of wrong decisions, users sometimes need to discern similar instances or outcomes. For example, a doctor differentiates whether the diagnosis is a benign or malignant tumor.
[learning] Learn: users need to gain knowledge, improve their problem-solving skills, and discover new knowledge.
[control] Improve: users seek causal factors to control and improve the predicted outcome.
[communicate] Communicate with stakeholders: many critical decision-making processes involve multiple stakeholders, and users need to discuss the decision with them.
[report] Generate reports: users need to utilize the explanations to perform particular tasks such as report production. For example, a radiologist generates a medical report on a patient's X-ray image.
[multi] Trade-off multiple objectives: AI may be optimized on an incomplete objective while the users seek to fulfill multiple objectives in real-world applications. For example, a doctor needs to ensure a treatment plan is effective as well as has acceptable patient adherence. Ethical and legal requirements may also be included as objectives.
[3] Explanatory form
The following 12 explanatory forms are end-user-friendly, i.e., no technical knowledge is required for the end-user to interpret the explanation.
Feature-Based Explanation
Feature Attribution - fa
Note: for tasks that have images as input data, the feature attribution is denoted by the following two cards:
ir: important regions (a.k.a. heat map or saliency map)
irc: important regions with their feature contribution percentage
Feature Shape - fs
Feature Interaction - fi
Example-Based Explanation
Similar Example - se
Typical Example - te
Counterfactual Example - ce
Note: for the counterfactual example, two visual variations were used in the user study:
cet: counterfactual example with transition from one example to the counterfactual one
ceh: counterfactual example with the contrastive feature highlighted
Rule-Based Explanation
Rule - rt
Decision Tree - dt
Decision Flow - df
Supplementary Information
Input
Output
Performance
Dataset - prior (output prediction with prior distribution of each class in the training set)
Note: occasionally there is a wild card, which means the participant drew the card themselves. It is indicated as 'wc'.
For visual examples of each explanatory form card, please refer to the Explanatory_form_labels.pdf document.
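As a reading aid, the card labels listed above can be collected into a lookup table for decoding the ranking columns. This is only a minimal sketch: the `decode_ranking` helper and the sample row are hypothetical, and it assumes each ranking cell holds exactly one of the labels from this section. Input, Output, and Performance are left out because no labels are given for them here.

```python
# Labels as listed in the "Explanatory form" section of this description.
EXPLANATORY_FORMS = {
    "fa": "Feature Attribution",
    "ir": "Important regions (image input)",
    "irc": "Important regions with feature contribution percentage (image input)",
    "fs": "Feature Shape",
    "fi": "Feature Interaction",
    "se": "Similar Example",
    "te": "Typical Example",
    "ce": "Counterfactual Example",
    "cet": "Counterfactual example with transition",
    "ceh": "Counterfactual example with contrastive feature highlighted",
    "rt": "Rule",
    "dt": "Decision Tree",
    "df": "Decision Flow",
    "prior": "Dataset (prior class distribution)",
    "wc": "Wild card (drawn by the participant)",
}


def decode_ranking(labels):
    """Map ranked card labels to readable names; unknown labels pass through."""
    return [EXPLANATORY_FORMS.get(label.strip().lower(), label) for label in labels]


# Hypothetical 1st, 2nd, 3rd choices of one participant (not taken from the data).
print(decode_ranking(["fa", "ce", "dt"]))
```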
Link to the details on users' requirements on different explanatory forms
Code and report for EUCA data quantitative analysis
EUCA data analysis code
EUCA quantitative data analysis report
EUCA data citation
@article{jin2021euca,
  title={EUCA: the End-User-Centered Explainable AI Framework},
  author={Weina Jin and Jianyu Fan and Diane Gromala and Philippe Pasquier and Ghassan Hamarneh},
  year={2021},
  eprint={2102.02437},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets presented in this repository are obtained by applying the decision tree (DecisionTree) algorithm with the following specific hyperparameter setting and the different inputs that have been described in the journal paper: "Data mining techniques for endometriosis detection in a data-scarce medical dataset".
Hyperparameters
max_features: sqrt
max_iter: 1000000
Files
result_eb.csv: Results for EB sample type.
result_ef.csv: Results for EF sample type.
result_vagina.csv: Results for vagina sample type.
result_oral.csv: Results for oral sample type.
result_feces.csv: Results for feces sample type.
result_frt.csv: Results for FRT (EB + EF + vagina) sample type.
result_frt2.csv: Results for FRT2 (EB + vagina) sample type.
properties_eb.log: Arguments and result information for EB sample type (n_split, max_features, max_iter, random_state, len_scores_before_filtering, len_scores_after_filtering, len_f1).
properties_ef.log: Arguments and result information for EF sample type.
properties_vagina.log: Arguments and result information for vagina sample type.
properties_oral.log: Arguments and result information for oral sample type.
properties_feces.log: Arguments and result information for feces sample type.
properties_frt.log: Arguments and result information for FRT sample type.
properties_frt2.log: Arguments and result information for FRT2 sample type.
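The file names follow a regular pattern per sample type, so they can be generated rather than typed out. The sketch below assumes only that pattern; the `expected_files` helper is hypothetical, and nothing is assumed about the contents of the files.

```python
# Sample types as they appear in the file names listed above.
SAMPLE_TYPES = ["eb", "ef", "vagina", "oral", "feces", "frt", "frt2"]


def expected_files(sample_type):
    """Return the (results, properties) file names for one sample type."""
    return (f"result_{sample_type}.csv", f"properties_{sample_type}.log")


for sample_type in SAMPLE_TYPES:
    print(expected_files(sample_type))
```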
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research Domain/Project:
This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.
Purpose:
The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.
Creation Methodology:
The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.
Structure of the Dataset:
The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:
place_or_event_id: Unique identifier for each tourist place or event.
rating: Rating given by the user, ranging from 1 to 5.
The data is split into three subsets:
Training Set: 80% of the dataset used to train the model.
Validation Set: A small portion used for hyperparameter tuning.
Test Set: 20% used to evaluate model performance.
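The listed shares (80% training, 20% test, plus a validation portion) do not say where the validation rows come from. The sketch below assumes they are carved out of the training share after the test set is held out. The 10% validation fraction, the seed, the `split_rows` helper, and the toy rows are all hypothetical choices for illustration, not values from the project.

```python
import random


def split_rows(rows, test_share=0.20, val_share_of_train=0.10, seed=42):
    """Shuffle rows, hold out the test share, then carve validation from the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    test, remainder = shuffled[:n_test], shuffled[n_test:]
    n_val = round(len(remainder) * val_share_of_train)  # assumed fraction
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test


# Toy (place_or_event_id, rating) pairs standing in for the CSV rows.
rows = [(place_id, 1 + place_id % 5) for place_id in range(100)]
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))  # 72 8 20
```

Fixing the seed keeps the three subsets reproducible across runs.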
Folder and File Naming Conventions:
The dataset files are stored in the following structure:
user_ratings_dataset.csv: The original dataset file containing user ratings.
tour_recommendation_model.pkl: The saved model after training.
actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.
Software Requirements:
To open and work with this dataset, the following software and libraries are required:
Python 3.x
Pandas for data manipulation
Scikit-learn for training and evaluating machine learning models
Matplotlib for chart generation
Joblib for saving and loading the trained model
The dataset can be opened and processed using any Python environment that supports these libraries.
Additional Resources:
The model training code, README file, and performance chart are available in the project repository.
For a detailed explanation and the code, please refer to the project's GitHub repository.
Dataset Reusability:
The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:
Train other types of models (e.g., regression, classification).
Experiment with different features or add more metadata to enrich the dataset.
Data Integrity:
The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.
Licensing:
The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. 
Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
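The two preprocessing steps described above, z-score standardization and label encoding of the diagnosis column, can be sketched with the standard library alone. This is an illustration, not the code in breast_cancer_classification_models.py: the helper names and sample values are hypothetical, the "M"/"B" diagnosis codes are an assumption about the source file, and the population standard deviation is used on the assumption that it mirrors scikit-learn's scaler.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one feature column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def encode_diagnosis(label):
    """Label-encode the diagnosis: malignant -> 1, benign -> 0."""
    return {"M": 1, "B": 0}[label]  # "M"/"B" codes are assumed


# Hypothetical values for a single feature and three diagnosis labels.
radius_mean = [10.0, 12.0, 14.0]
print(zscore(radius_mean))
print([encode_diagnosis(d) for d in ["M", "B", "B"]])
```

When a train/test split is involved, the mean and standard deviation are usually estimated on the training portion only and then reused for the test portion.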
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to the problems analyzed in the work "Assessing Reproducibility in Screenshot-Based Task Mining: A Decision Discovery Perspective," published in the Information Systems Journal.
The artifacts provided correspond to multiple instances of the execution of a particular process model, covering all its variants in the form of UI Logs. These UI Logs are divided into two groups:
Additionally, UI Logs are synthetically generated from an original UI Log in both cases.
For generating the UI Logs, a real-world process based on handling unsubscription requests from users of a telephone company has been selected. This case was selected based on the following criteria: (1) the process is replicated from a real company, and (2) decision discovery relies on visual elements present in the screenshots, specifically an email attachment and a checkbox in a web form. The selected process consists of 10 activities, a single decision point, and 4 process variants.
The dataset includes:
ProblemType_LogSize_Balanced, where LogSize is one of {75, 100, 300, 500} and Balanced is either Balanced or Imbalanced. Therefore, each problem subfolder contains the corresponding UI Log and associated screenshots, organized into subfolders based on different problem characteristics. Each subfolder includes:
log.csv: A CSV file containing the UI log data.
1_img.png: A sample screenshot image.
1_img.png.json: JSON file containing metadata for the corresponding screenshot.
flattened_dataset.csv: A flattened version of the dataset used for decision tree analysis.
preprocessed_df.csv: Preprocessed data frame used for analysis.
decision_tree.log: Log file documenting the decision tree process.
CHAID-tree-feature-importance.csv: CSV file detailing feature importance from the CHAID decision tree.
bpmn.bpmn: BPMN file representing the process model.
bpmn.dot: DOT file representing the BPMN process model.
pn.dot: DOT file representing the Petri net process model.
traceability.json: JSON file mapping decision point branches to rules from decision model.
collect_results.py: Script to collect experiment results.
db_populate.json: Configuration file for populating the database.
hierarchy_constructor.py: Script to construct the hierarchy of UI elements.
models_populate.json: Configuration file for populating models.
process_logs.py: Script to process UI logs.
process_reproducibility_data.py: Script to process reproducibility data.
process_uielements.py: Script to process UI elements.
run_experiments.py: Script to run experiments.
run_experiments.sh: Shell script to execute the experiments.
To create the evaluation objects, we generated event logs of different sizes (|L|) by deriving events from the sample event log. We consider log sizes of {75, 100, 300, 500} events. Each log contains complete process instances, ensuring that if an additional instance exceeds |L|, it is removed.
To average results across different problem instances, we trained decision trees 30 times on synthetic variations of the dataset, obtaining the mean of the metrics as experiment metadata.
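The size rule for the generated logs (only complete process instances, never more than |L| events) can be read as the following sketch. It is one interpretation of the description, not the authors' generator: the `build_log` helper, the event naming, and the assumption that every instance has 10 events are hypothetical.

```python
def build_log(instances, max_events):
    """Append complete process instances until the next one would exceed max_events."""
    log = []
    for instance in instances:
        if len(log) + len(instance) > max_events:
            break  # the extra instance is dropped, so the log stays within |L|
        log.extend(instance)
    return log


# Hypothetical instances of 10 events each (one event per activity).
instances = [[f"case{c}_event{e}" for e in range(10)] for c in range(60)]

for log_size in (75, 100, 300, 500):
    print(log_size, len(build_log(instances, log_size)))
```

Under these assumptions a target of 75 yields 70 events, because an eighth instance would push the log past the limit.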
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This data release documents the steps performed to create classified cropland fallow maps for the Northern Great Plains region of the United States from the years 2010 to 2019. The data release consists of the following: (i) an XML metadata file, (ii) a table of reference data, (iii) two decision tree models, and (iv) 10 single band GeoTIFFs. The XML file named ‘Metadata.xml’ describes steps that were used to create this dataset. The table of reference data named ‘RefrenceSamples.csv’ lists the training and validation point values used to train the decision tree classifiers and perform accuracy assessments. The two decision trees named ‘decisionTree_0.005.txt’ and ‘decisionTree_0.007.txt’ were used to classify remote sensing imagery to produce the classified cropland fallow maps. Ten GeoTIFFs named ‘nPlains_ACFAv1_2010.tif’ to ‘nPlains_ACFAv1_2019.tif’ contain the classified maps for each year, with the first image for year 2010 and the last image for 2019. The classification for ea ...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Raw data for the article: Gradient boosted decision trees reveal nuances of auditory discrimination behaviour (PLOS Computational Biology).
This data repository contains the csv files after extraction of the raw MATLAB metadata files into pandas (Python) dataframes (helper function author: Jules Lebert). The csv files can easily be loaded back into dataframe objects using pandas before the subsampling steps (as documented in the paper, we used subsampling to ensure the number of F0-roved and control F0 trials were relatively equal) are completed.
Link to GitHub repository to run the models on this data: https://github.com/carlacodes/boostmodels
A full description of each of the variables within the dataframe can be found in the data_description_instructions_for_datasets_plos_bio.pdf.
Abstract: Animal psychophysics can generate rich behavioral datasets, often comprised of many 1000s of trials for an individual subject. Gradient-boosted models are a promising machine learning approach for analyzing such data, partly due to the tools that allow users to gain insight into how the model makes predictions. We trained ferrets to report a target word’s presence, timing, and lateralization within a stream of consecutively presented non-target words. To assess the animals’ ability to generalize across pitch, we manipulated the fundamental frequency (F0) of the speech stimuli across trials, and to assess the contribution of pitch to streaming, we roved the F0 from word token-to-token. We then implemented gradient-boosted regression and decision trees on the trial outcome and reaction time data to understand the behavioral factors behind the ferrets’ decision-making. We visualized model contributions by implementing SHAPs feature importance and partial dependency plots.
While ferrets could accurately perform the task across all pitch-shifted conditions, our models reveal subtle effects of shifting F0 on performance, with within-trial pitch shifting elevating false alarms and extending reaction times. Our models identified a subset of non-target words that animals commonly false alarmed to. Follow-up analysis demonstrated that the spectrotemporal similarity of target and non-target words rather than similarity in duration or amplitude waveform was the strongest predictor of the likelihood of false alarming. Finally, we compared the results with those obtained with traditional mixed effects models, revealing equivalent or better performance for the gradient-boosted models over these approaches.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Disease Symptom Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/itachi9604/disease-symptom-description-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
A dataset to provide students with a source for creating a healthcare-related system. A project on the same using double Decision Tree Classification is available at: https://github.com/itachi9604/healthcare-chatbot
Get_dummies processed file will be available at https://www.kaggle.com/rabisingh/symptom-checker?select=Training.csv
There are columns containing diseases, their symptoms, precautions to be taken, and their weights. This dataset can be easily cleaned using file handling in any language. The user only needs to understand how rows and columns are arranged.
I created this dataset with the help of a friend, Pratik Rathod, because an existing dataset like this was difficult to clean.
uchihaitachi9604@gmail.com
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data presented here were used to produce the following paper:
Archibald, Twine, Mthabini, Stevens (2021) Browsing is a strong filter for savanna tree seedlings in their first growing season. J. Ecology.
The project under which these data were collected is: Mechanisms Controlling Species Limits in a Changing World. NRF/SASSCAL Grant number 118588
For information on the data or analysis please contact Sally Archibald: sally.archibald@wits.ac.za
Description of file(s):
File 1: cleanedData_forAnalysis.csv (required to run the R code: "finalAnalysis_PostClipResponses_Feb2021_requires_cleanData_forAnalysis_.R")
The data represent monthly survival and growth data for ~740 seedlings from 10 species under various levels of clipping.
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
File 2: Herbivory_SurvivalEndofSeason_march2017.csv (required to run the R code: "FinalAnalysisResultsSurvival_requires_Herbivory_SurvivalEndofSeason_march2017.R")
The data consist of one .csv file with the following column names:
treatment Clipping treatment (1 - 5 months clip plus control unclipped)
plot_rep One of three randomised plots per treatment
matrix_no Where in the plot the individual was placed
species_code First three letters of the genus name and first three letters of the species name; uniquely identifies the species
species Full species name
sample_period Classification of sampling period into time since clip
status Alive or Dead
standing.height Vertical height above ground (in mm)
height.mm Length of the longest branch (in mm)
total.branch.length Total length of all the branches (in mm)
stemdiam.mm Basal stem diameter (in mm)
maxSpineLength.mm Length of the longest spine
postclipStemNo Number of resprouting stems (only recorded AFTER clipping)
date.clipped Date clipped
date.measured Date measured
date.germinated Date germinated
Age.of.plant Date measured - Date germinated
newtreat Treatment as a numeric variable, with 8 being the control plot (for plotting purposes)
genus Genus
MAR Mean Annual Rainfall for that species' distribution (mm)
rainclass High/medium/low
File 3: allModelParameters_byAge.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
Age.of.plant Age in days
species_code Species
pred_SD_mm Predicted stem diameter in mm
pred_SD_up Top 75th quantile of stem diameter in mm
pred_SD_low Bottom 25th quantile of stem diameter in mm
treatdate Date when clipped
pred_surv Predicted survival probability
pred_surv_low Predicted 25th quantile survival probability
pred_surv_high Predicted 75th quantile survival probability
species_code species code
Bite.probability Daily probability of being eaten
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm mean etc
duiker_mean_sd standard deviation etc
mean_bite_diameter_kudu_mm mean etc
kudu_mean_sd standard deviation etc
genus genus
rainclass low/med/high
File 4: EatProbParameters_June2020.csv (required to run the R code: "FinalModelSeedlingSurvival_June2021_.R")
Consists of a .csv file with the following column headings
shtspec species name
species_code species code
genus genus
rainclass low/medium/high
seed mass mass of seed (g per 1000 seeds)
Surv_intercept coefficient of the model predicting survival from age of clip for this species
Surv_slope coefficient of the model predicting survival from age of clip for this species
GR_intercept coefficient of the model predicting stem diameter from seedling age for this species
GR_slope coefficient of the model predicting stem diameter from seedling age for this species
species_code species code
max_bite_diam_duiker_mm Maximum bite diameter of a duiker for this species
duiker_sd standard deviation of bite diameter for a duiker for this species
max_bite_diameter_kudu_mm Maximum bite diameter of a kudu for this species
kudu_sd standard deviation of bite diameter for a kudu for this species
mean_bite_diam_duiker_mm mean etc
duiker_mean_sd standard deviation etc
mean_bite_diameter_kudu_mm mean etc
kudu_mean_sd standard deviation etc
AgeAtEscape_duiker[t] age of plant when its stem diameter is larger than a mean duiker bite
AgeAtEscape_duiker_min[t] age of plant when its stem diameter is larger than a min duiker bite
AgeAtEscape_duiker_max[t] age of plant when its stem diameter is larger than a max duiker bite
AgeAtEscape_kudu[t] age of plant when its stem diameter is larger than a mean kudu bite
AgeAtEscape_kudu_min[t] age of plant when its stem diameter is larger than a min kudu bite
AgeAtEscape_kudu_max[t] age of plant when its stem diameter is larger than a max kudu bite
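The AgeAtEscape columns relate the growth coefficients (GR_intercept, GR_slope) to the bite diameters. As an illustration only, if the stem-diameter model were linear in seedling age (an assumption: this description does not state the model form), the escape age would be the age at which the predicted diameter reaches the bite diameter. The `age_at_escape` helper and the numbers below are hypothetical and do not come from the data.

```python
def age_at_escape(gr_intercept, gr_slope, bite_diam_mm):
    """Age at which a linearly growing stem reaches the given bite diameter.

    Assumes diameter = gr_intercept + gr_slope * age (a hypothetical linear form).
    Returns None when the stem never grows (non-positive slope).
    """
    if gr_slope <= 0:
        return None
    return max(0.0, (bite_diam_mm - gr_intercept) / gr_slope)


# Hypothetical coefficients and bite diameter, chosen only to show the arithmetic.
print(age_at_escape(gr_intercept=0.5, gr_slope=0.125, bite_diam_mm=3.0))  # 20.0
```

Passing the mean, minimum, or maximum bite diameter would correspond to the plain, _min, and _max variants of the columns under the same assumption.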
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset of the paper titled "Context-Aware Code Change Embedding for Better Patch Correctness Assessment".
This is the online repository of the paper "Context-Aware Code Change Embedding for Better Patch Correctness Assessment" under review by ASE2021. We release the source code of Cache, the patches used in our evaluation, as well as the experiment results.
Patches: Two patch benchmarks included in our study.
Results
RQ1: The detailed result files in RQ1, which are named in the format [model]_[classifier].csv. For example, the file named BERT_DT.csv in the folder Small contains the results for patches from the Small dataset embedded by BERT and classified by Decision Tree.
Cross: The detailed result files of representation learning techniques when training on Large dataset and testing on Small dataset.
RQ2: The detailed result files in RQ2.
ODS_Cache.csv: The detailed result of Cache on the dataset from Xiong's ICSE18 paper. We directly compare against the results reported by the authors of ODS on 139 patches from Xiong's paper since the data and source code of ODS are unavailable.
Table_5_Effectiveness_APCA.xlsx: The detailed version of Table 5 in the paper.
Table_6_Effectiveness_ODS.xlsx: The detailed version of Table 6 in the paper.
source/Readme.md. We will build a homepage for Cache on GitHub upon acceptance.
Datasets Description:
The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.
Classification and Regression Tasks: These datasets are suitable for both classification and regression analyses. The classes are ordered but imbalanced: for instance, there are many more normal wines than excellent or poor ones.
Dataset Contents: For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include:
1. Fixed acidity
2. Volatile acidity
3. Citric acid
4. Residual sugar
5. Chlorides
6. Free sulfur dioxide
7. Total sulfur dioxide
8. Density
9. pH
10. Sulphates
11. Alcohol
The output variable, based on sensory data, is denoted by:
12. Quality (score ranging from 0 to 10)
Usage Tips: A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.
Operational Workflow: To efficiently utilize the dataset, the following steps are recommended:
1. Connect a File Reader (for CSV) to a Linear Correlation node and an Interactive Histogram for basic Exploratory Data Analysis (EDA).
2. Connect a File Reader to a Rule Engine node to transform the 10-point scale into a dichotomous variable indicating 'good wine' and 'rest'.
3. Connect the Rule Engine node output to the input of a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
4. Connect the Column Filter node output to the input of a Partitioning node to execute a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified').
5. Feed the Partitioning node's train split output into the input of a Decision Tree Learner node.
6. Connect the Partitioning node's test split output to the input of a Decision Tree Predictor node.
7. Link the Decision Tree Learner node output to the model input of the Decision Tree Predictor node.
8. Finally, connect the Decision Tree Predictor output to the input of a ROC node for model evaluation based on the AUC value.
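For readers working outside a graphical tool, the thresholding from the usage tips and the leakage-avoiding column filter from the workflow can be sketched in plain Python. The `binarize_quality` helper, the `good` column name, and the sample rows are hypothetical; only the threshold of 7 comes from the text above.

```python
def binarize_quality(rows, threshold=7):
    """Replace the 'quality' score with a 0/1 'good' label and drop the original column."""
    labelled = []
    for row in rows:
        # Dropping 'quality' keeps the original score out of the model inputs.
        features = {key: value for key, value in row.items() if key != "quality"}
        features["good"] = 1 if row["quality"] >= threshold else 0
        labelled.append(features)
    return labelled


# Hypothetical rows with two of the physicochemical inputs and the quality score.
wines = [
    {"alcohol": 9.4, "pH": 3.51, "quality": 5},
    {"alcohol": 12.8, "pH": 3.20, "quality": 7},
    {"alcohol": 11.0, "pH": 3.30, "quality": 8},
]

for wine in binarize_quality(wines):
    print(wine)
```

Because the classes are imbalanced, a stratified split on the new label is usually the safer choice when partitioning.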
Tools and Acknowledgments: For an efficient analysis, consider using KNIME, a valuable graphical user interface (GUI) tool. Additionally, the dataset is available on the UCI machine learning repository, and proper acknowledgment and citation of the dataset source by Cortez et al. (2009) are essential for use.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Preventive Maintenance for Marine Engines: Data-Driven Insights
Introduction:
Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.
Overview:
This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.
Key steps include:
1. Data Simulation: Creating a realistic dataset with engine performance metrics.
2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
Tools Used:
1. Python: Data processing, analysis and modeling
2. Pandas & NumPy: Data manipulation
3. Scikit-Learn & XGBoost: Machine learning model training
4. Matplotlib & Seaborn: Data visualization
Skills Demonstrated:
✔ Data Simulation & Preprocessing
✔ Exploratory Data Analysis (EDA)
✔ Feature Engineering & Encoding
✔ Supervised Machine Learning (Classification)
✔ Model Evaluation & Hyperparameter Tuning
Key Insights & Findings:
📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
📌 Maintenance Status Distribution: The three classes are balanced, so model training is not skewed toward a majority class.
📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.
Challenges Faced:
🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
🚧 Model Performance: Accuracy was limited (~35%), close to the roughly 33% chance level for three balanced classes; the project attributes this to the complexity of failure prediction.
🚧 Feature Selection: Identifying the most impactful features required extensive analysis.
Call to Action:
🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.
https://www.gnu.org/licenses/gpl-3.0.html
Data & File Overview
Directory of Files:
All_Patches_resized.zip
Size: 9.44 GB
Description: This is a compressed folder that contains preprocessed patches. Each patch is an image of one of the following classes:
full capsid
partially full capsid
empty capsid
aggregation
ice
broken capsid
background
names_labels_df.csv.zip
Size: 293 KB
Description: This is a compressed .csv file that contains 94,830 patch instances, each with a patch name and the associated class label.
Additional Notes: This data was used in conjunction with the code at https://github.com/lcgutier/capsid-eyes to create the Capsidize app at https://github.com/lcgutier/Capsidize.
File Naming Convention:
Annotated and Preprocessed Images: The classes are defined as follows: 1=Full capsid, 2=Partially full capsid, 3=empty capsid, 4=aggregation, 5=ice, 6=broken capsid, 7=background. Training and testing sets were randomly sampled from the resulting patches and each class was balanced. Images are named according to the dataset they belong to, followed by the original image ID number and lastly the annotation ID number.
Data Description
names_labels_df.csv.zip
Columns: patch_name, label
Rows: 94,830 patch instances
Class Types
Full Capsids, label=1, Description: A capsid with a dark ring and even dark infill density.
Partial Capsids, label=2, Description: A capsid with a dark ring and a mid-range infill density.
Empty Capsids, label=3, Description: A capsid with a dark ring and a light infill density.
Aggregation, label=4, Description: A cluster of capsids.
Ice, label=5, Description: Dark globular crystals that range in size.
Broken Capsids, label=6, Description: Full, partial, or empty capsids that have broken. These are often only fragments.
Background, label=7, Description: This class contains all background masks including those as a result of background differences seen at the carbon grid hole/grid material boundary.
Methodological Information
Software-specific information:
Name: Capsidize
Version: 1.0.0
System Requirements: macOS or Python 3.9
Open Source: Yes
Developer: Lilianna Gutierrez
Product Repository
Source Repository
Additional Notes: The source code held in the capsid-eyes GitHub repository can be run without macOS, but the app was compiled on darwin-arm64.
Equipment-specific information:
Sample Preparation
Manufacturer: Thermo Fisher Scientific (TFS), Waltham, MA, USA
Model: Vitrobot Mk 4
Use: cooled in a bath of liquid nitrogen to freeze samples for imaging
Imaging
Manufacturer: Thermo Fisher Scientific
Model: Krios 3Gi microscope
Use: To collect TEM images. Operating at 300 kV and equipped with a Selectris energy filter and Falcon 4i direct electron detecting camera operating in electron counting mode.
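As a quick way to work with the label file, the sketch below maps the numeric labels to the class names given above and counts patches per class. The column names `patch_name` and `label` come from the description; the sample rows, the patch names in them, and the helper `count_classes` are made up for illustration, and the real file would first need to be unzipped.

```python
import csv
import io
from collections import Counter

# Label-to-class mapping as listed in the data description.
CLASS_NAMES = {
    1: "full capsid",
    2: "partially full capsid",
    3: "empty capsid",
    4: "aggregation",
    5: "ice",
    6: "broken capsid",
    7: "background",
}

def count_classes(csv_file):
    """Count patches per class name from a file with patch_name,label columns."""
    reader = csv.DictReader(csv_file)
    return Counter(CLASS_NAMES[int(row["label"])] for row in reader)

# Hypothetical sample standing in for names_labels_df.csv.
sample = io.StringIO(
    "patch_name,label\n"
    "setA_0001_17,1\n"
    "setA_0001_18,3\n"
    "setB_0042_05,7\n"
    "setB_0042_06,1\n"
)
counts = count_classes(sample)
print(counts)
```

With the real file, `open("names_labels_df.csv", newline="")` could replace the in-memory sample, assuming that is the name of the extracted file.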
Life-habit/functional-trait codings for the Kope and Waynesville Formation species pool
KWTraits.csv is a comma-separated value (.csv) format file listing the aggregate species pool for the Kope and Waynesville Formation used in empirical analyses. (The file is also included as a data file within the 'ecospace' R package.) The first three columns list taxonomic information. The remaining columns list ecospace character states (functional traits). See supplementary appendix A and Novack-Gottshall (2007) for information on characters and states. See text for explanation of how multistate characters were rescaled.
K&WTraits.csv

Two-model model-selection support data files for Kope and Waynesville Formation samples, stratigraphic section, member, and formation aggregates
File is in comma-separated value (.csv) format. The first five columns describe the Paleobiology Database collection identification number, scale (hand sample, stratigraphic section, etc.) of the sample, and stratigraphic/section names. Columns 6–14 list sample size (S, species richness) and values for eight disparity statistics (with NA designating when a statistic could not be calculated, because there were fewer than four unique life habits in the sample); see text for descriptions and abbreviations of statistics. The last column identifies which model has the best support among those candidates considered. The remaining columns list the classification-tree support each sample has for each candidate model considered. emp2-modelfits.csv lists model support using the classification tree trained on the 50% and 100%-strength training data sets. emp3-modelfits.csv lists model support for the tree trained on 50%, 90%, and 100% training data.
emp2-modelfits.csv

Three-model model-selection support data files for Kope and Waynesville Formation samples, stratigraphic section, member, and formation aggregates
File is in comma-separated value (.csv) format. The first five columns describe the Paleobiology Database collection identification number, scale (hand sample, stratigraphic section, etc.) of the sample, and stratigraphic/section names. Columns 6–14 list sample size (S, species richness) and values for eight disparity statistics (with NA designating when a statistic could not be calculated, because there were fewer than four unique life habits in the sample); see text for descriptions and abbreviations of statistics. The last column identifies which model has the best support among those candidates considered. The remaining columns list the classification-tree support each sample has for each candidate model considered. emp3-modelfits.csv lists model support for the tree trained on 50%, 90%, and 100% training data.
emp3-modelfits.csv

Five-model model-selection support data files for Kope and Waynesville Formation samples, stratigraphic section, member, and formation aggregates
File is in comma-separated value (.csv) format. The first five columns describe the Paleobiology Database collection identification number, scale (hand sample, stratigraphic section, etc.) of the sample, and stratigraphic/section names. Columns 6–14 list sample size (S, species richness) and values for eight disparity statistics (with NA designating when a statistic could not be calculated, because there were fewer than four unique life habits in the sample); see text for descriptions and abbreviations of statistics. The last column identifies which model has the best support among those candidates considered. The remaining columns list the classification-tree support each sample has for each candidate model considered. emp5-modelfits.csv lists model support for the tree trained on 50%, 75%, 90%, 95%, and 100% training data.
emp5-modelfits.csv

Supplementary Appendices 1-4 for manuscript
Appendix 1 gives an example of how life-habit character states were inferred and coded. Appendix 2 describes technical details on classification tree methods and confusion matrices. Appendices 3-4 give further details for the other Supplementary data files on Data Dryad.
EcomodelsII_Appendices.docx

Supplementary Figure 6
Comparing statistical dynamics for different ecospace framework structures: varying number of characters, (A) 5 characters, (B) 15 characters, and (C) 25 characters. Each framework had mixed character types, in identical proportions (40% binary, 20% three-state factor, 20% five-state factor, and 20% five-state ordered numeric character types). 5 "seed" species were chosen at random to begin each simulation. Other simulation details and graphical interpretation are the same as in Figure 2. Trends in total variance were excluded because the inclusion of factors prevented their calculation. The dynamics are generally similar, although larger frameworks allow modestly more powerful model selection using classification-tree methods (83%, 85%, and 86% of training models, respectively, classified correctly using classificat...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Maps of the number, size, and species of trees in forests across the western United States are desirable for many applications such as estimating terrestrial carbon resources, predicting tree mortality following wildfires, and for forest inventory. Detailed mapping of trees for large areas is not feasible with current technologies, but statistical methods for matching forest plot data with biophysical characteristics of the landscape offer a practical means to populate landscapes with a limited set of forest plot inventory data. We used a modified random forests approach with Landscape Fire and Resource Management Planning Tools (LANDFIRE) vegetation and biophysical predictors as the target data, to which we imputed plot data collected by the USDA Forest Service’s Forest Inventory and Analysis (FIA) program at 30-meter (m) grid resolution (Riley et al. 2016). This method imputes the plot with the best statistical match, according to a “forest” of decision trees, to each pixel of gridded landscape data. In this work, we used the LANDFIRE data set as the gridded target data because it is publicly available, offers seamless coverage of variables needed for fire models, and is consistent with other data sets, including burn probabilities and flame length probabilities generated for the continental United States. The main output of this project (the GeoTIFF available in this data publication) is a map of imputed plot identifiers at 30×30 m spatial resolution for the western United States for landscape conditions circa 2009. The map of plot identifiers can be linked to the FIA databases available through the FIA DataMart or to the ACCDB/CSV files included in this data publication to produce tree-level maps or to map other plot attributes.
These ACCDB/CSV files also contain attributes regarding the FIA PLOT CN (a unique identifier for each time a plot is measured), the inventory year, the state code and abbreviation, the unit code, the county code, the plot number, the subplot number, the tree record number, and for each tree: the status (live or dead), species, diameter, height, actual height (where broken), crown ratio, number of trees per acre, and a unique identifier for each tree and tree visit. Application of the dataset to research questions other than those related to aboveground biomass and carbon should be investigated by the researcher before proceeding. The dataset may be suitable for other applications and for use across various scales (stand, landscape, and region), however, the researcher should test the dataset's applicability to a particular research question before proceeding.

Geospatial data describing tree species or forest structure are required for many analyses and models of forest landscape dynamics. Forest data must have resolution and continuity sufficient to reflect site gradients in mountainous terrain and stand boundaries imposed by historical events, such as wildland fire and timber harvest. Such detailed forest structure data are not available for large areas of public and private lands in the United States, which rely on forest inventory at fixed plot locations at sparse densities. While direct sampling technologies such as light detection and ranging (LiDAR) may eventually make broad coverage of detailed forest inventory feasible, no such data sets at the scale of the western United States are currently available.

When linking the tree list raster (“CN_text” field) to the FIA data via the plot CN field (“CN” in the “PLOT” table and “PLT_CN” in other tables), note that this field is unique to a single visit to a plot. The raster contains a “Value” field, which also appears in the ACCDB/CSV files in the “tl_id” field in order to facilitate this linkage.
All plot CNs utilized in this analysis were single condition, 100% forested, and physically located in the Rocky Mountain Research Station (RMRS) and Pacific Northwest Research Station (PNW), and were obtained from FIA in December of 2012.
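The linkage described above (raster "Value" to "tl_id", then plot CN to "PLT_CN" in the tree records) can be sketched as two dictionary lookups. This is a minimal illustration, not the publication's workflow: only `tl_id`, `CN`, and `PLT_CN` are field names from the description, while the CN strings, the tree attributes, the toy raster, and the use of 0 for "no plot" are invented for the example.

```python
from collections import defaultdict

# Hypothetical rows standing in for the ACCDB/CSV lookup and an FIA tree table.
plot_lookup = [
    {"tl_id": 1, "CN": "40000001"},
    {"tl_id": 2, "CN": "40000002"},
]
tree_table = [
    {"PLT_CN": "40000001", "species": "A", "diameter": 12.4},
    {"PLT_CN": "40000001", "species": "B", "diameter": 8.1},
    {"PLT_CN": "40000002", "species": "A", "diameter": 20.3},
]
# Toy 2x3 grid of raster "Value" cells; 0 is a made-up "no plot" value.
raster_values = [[1, 1, 2], [2, 0, 1]]

def build_index(plot_lookup, tree_table):
    """Index plot CN by raster value and tree records by plot CN."""
    cn_by_value = {row["tl_id"]: row["CN"] for row in plot_lookup}
    trees_by_cn = defaultdict(list)
    for tree in tree_table:
        trees_by_cn[tree["PLT_CN"]].append(tree)
    return cn_by_value, trees_by_cn

def trees_for_pixel(value, cn_by_value, trees_by_cn):
    """Return the tree records imputed to a pixel, or an empty list."""
    cn = cn_by_value.get(value)
    if cn is None:
        return []
    return trees_by_cn.get(cn, [])

cn_by_value, trees_by_cn = build_index(plot_lookup, tree_table)
tree_counts = [
    [len(trees_for_pixel(v, cn_by_value, trees_by_cn)) for v in row]
    for row in raster_values
]
print(tree_counts)
```

Because the plot CN is unique to a single plot visit, each raster value resolves to exactly one measurement of a plot in this scheme.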
Original metadata date was 01/03/2018. Minor metadata updates made on 04/30/2019.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises curated subsets of the VeReMi (Vehicular Misbehavior Detection) dataset, specifically extracted and preprocessed to facilitate the detection of False Position Attacks (FPAs) in Vehicular Ad-hoc Networks (VANETs). The data preparation was conducted as part of the research presented in the manuscript “Detection of False Position Attacks in VANETs through Bagging Ensemble Learning”, submitted to PLOS ONE (Manuscript ID: PONE-D-25-17735).

The original VeReMi dataset provides synthetic vehicular communication traces that simulate both benign and malicious behaviors in VANET scenarios. From this, we selected five specific attack types and reformatted the data into a machine learning–friendly structure to support the supervised classification of FPAs.

Original Source:
The original VeReMi dataset is publicly available at: https://arxiv.org/abs/1804.06701
(Cite as: Van der Heijden, R. W., et al., "VeReMi: A Dataset for Comparable Evaluation of Misbehavior Detection in VANETs", arXiv preprint arXiv:1804.06701, 2018.)

Included Files:
The dataset includes the following five CSV files, each corresponding to a different simulated false position attack scenario:
at1.csv - Attack Type 1
at2.csv - Attack Type 2
at4.csv - Attack Type 4
at8.csv - Attack Type 8
at16.csv - Attack Type 16

Feature Description:
Each row represents a single instance of vehicular behavior, including the following features:
pos-x1, pos-y1: Position coordinates of vehicle 1
spd-x1, spd-y1: Velocity components of vehicle 1
pos-x2, pos-y2: Position coordinates of vehicle 2
spd-x2, spd-y2: Velocity components of vehicle 2
sendtime_1, sendtime_2: Message send timestamps, used to derive the feature time_interval
AttackerType: Class label indicating the type of attacker (used as the target variable for classification)

Use Case and Analysis:
The dataset was used to evaluate the effectiveness of multiple machine learning models in detecting FPAs. Specifically, the following classifiers were tested:
Decision Tree (CART)
Random Forest (RF)
K-Nearest Neighbors (KNN)
Multilayer Perceptron (MLP)
Each model was assessed both with and without bagging to examine the impact of ensemble learning on classification performance. Results from our study demonstrate that KNN enhanced with bagging consistently outperforms other configurations across all attack types.

Reuse Potential:
This dataset is intended for researchers and practitioners in the fields of vehicular network security, misbehavior detection, and machine learning, particularly those exploring ensemble methods for anomaly detection. It enables reproducibility of our results and provides a foundation for benchmarking alternative detection techniques.
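The description says that time_interval is derived from the two send timestamps but does not give the formula. The sketch below assumes the simplest reading, sendtime_2 minus sendtime_1; that formula, the sample values, and the helper name `add_time_interval` are assumptions for illustration only.

```python
import csv
import io

def add_time_interval(rows):
    """Return copies of rows with time_interval = sendtime_2 - sendtime_1 (assumed)."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        new_row["time_interval"] = float(row["sendtime_2"]) - float(row["sendtime_1"])
        enriched.append(new_row)
    return enriched

# Made-up rows using a subset of the column names listed in the description.
sample = io.StringIO(
    "pos-x1,pos-y1,sendtime_1,sendtime_2,AttackerType\n"
    "100.0,200.0,10.0,10.5,0\n"
    "105.0,198.0,12.0,14.0,1\n"
)
rows = add_time_interval(csv.DictReader(sample))
print([r["time_interval"] for r in rows])
```

If the published files already contain a time_interval column, this step is unnecessary and the stored values should take precedence.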
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This Zenodo entry contains the results of a feature selection algorithm, implemented through a backward variable elimination strategy in chopin2 (powered by hdlib), applied to MetaPhlAn3 microbial profiles of a public dataset of metagenomic stool samples collected from patients affected by colorectal cancer (CRC) as well as from healthy individuals.
Microbial profiles have been extracted through the curatedMetagenomicData package for R under the IDs ThomasAM_2018a, ThomasAM_2018b, and ThomasAM_2019_a.
The feature selection algorithm is implemented as a backward variable elimination method, and it makes use of the vector-symbolic architecture described in Cumbo F 2020.
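To make the idea of backward variable elimination concrete, here is a generic greedy sketch. It is not the chopin2 or hdlib implementation: the scoring function is a toy stand-in for a real model's accuracy, and the feature names and the helper `backward_elimination` are hypothetical.

```python
def backward_elimination(features, score, min_features=1):
    """Greedily drop the feature whose removal gives the best score,
    stopping as soon as every removal would lower the current score."""
    selected = list(features)
    best_score = score(selected)
    while len(selected) > min_features:
        trials = []
        for candidate in selected:
            reduced = [f for f in selected if f != candidate]
            trials.append((score(reduced), candidate))
        trial_score, to_drop = max(trials)
        if trial_score < best_score:
            break  # removing anything else would hurt the score
        selected.remove(to_drop)
        best_score = trial_score
    return selected, best_score

# Toy score: reward informative features, penalize subset size.
INFORMATIVE = {"species_a", "species_c"}

def toy_score(subset):
    return 10 * len(INFORMATIVE & set(subset)) - len(subset)

selected, final_score = backward_elimination(
    ["species_a", "species_b", "species_c", "species_d"], toy_score
)
print(selected, final_score)
```

In practice the score would come from cross-validated classifier accuracy on the microbial profiles, which makes each round far more expensive than in this toy example.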
Deposited data is described below:
datasets.tar.gz: it contains the datasets used as input of chopin2 as the result of merging the three datasets with relative abundances mentioned above, also stratified by age and sex (with prefix RA). The same datasets have been also binarized (with prefix BIN);
hd-models.tar.gz: it contains the output of the feature selection performed with chopin2 (powered by hdlib) on the datasets with both relative abundance and binary profiles (RA and BIN);
ml-models.tar.gz: it contains the result of the feature selection produced with classical wrapper-based techniques (i.e., Random Forest, Decision Tree, Support Vector Machine, Logistic Regression, and Extreme Gradient Boosting) in addition to a Python 3.8 script to reproduce the results.
Please note that the datasets RA_ThomasAM_species.csv and BIN_ThomasAM_species.csv are also included in the datasets.tar.gz archive.
https://github.com/gruizmer/COW2NUTRIENT/tree/master/ToolPaper_DataFiles
* These folders supply supporting datasets for the manuscript "COW2NUTRIENT: An environmental GIS-based decision support tool for the assessment of nutrient recovery systems in livestock facilities."
* The datasets are recorded as comma-separated values (.csv) and Microsoft Excel® (.xlsx) files. Column data entries have names and units. Some data are about animal facility population and location, amount of nutrient-rich waste generated (kg/yr), amount of nutrient recovered (kg P/yr), installing, capital, and maintenance costs (USD), technologies and their ranking and frequency of being selected for each combination of normalization-aggregation methods, average chlorophyll-a concentration in water in the watershed (ug/L), and average phosphorus concentration in water in the watershed (ug/L).
* The folder “Manuscript” has subfolders with datasets for creating manuscript Figures 4, 8, 9, and 10 as well as datasets for Tables 9 and 10.
* The folder “Supplementary Material” holds subfolders with datasets for creating Supplementary Material Figures 1-5, 8, 9, 11, and 12.
This dataset is associated with the following publication: Martin-Hernandez, E., M. Martin, and G.J. Ruiz-Mercado. A geospatial environmental and techno-economic framework for sustainable phosphorus management at livestock facilities. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 175: 105843, (2021).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 1,888 records merged from five publicly available heart disease datasets. It includes 14 features that are crucial for predicting heart attack and stroke risks, covering both medical and demographic factors. Merging these datasets provides a more robust foundation for training machine learning models aimed at predicting heart attack risk. The five source datasets are listed below.
Heart Attack Analysis & Prediction Dataset
Number of Records: 304
Reference: Rahman, 2021
Heart Disease Dataset
Number of Records: 1,026
Reference: Lapp, 2019
Heart Attack Prediction (Dataset 3)
Number of Records: 295
Reference: Damarla, 2020
Heart Attack Prediction (Dataset 4)
Number of Records: 271
Reference: Anand, 2018
Heart CSV Dataset
Number of Records: 290
Reference: Nandal, 2022
This dataset includes 14 features known to contribute to heart attack risk. It is ideal for training machine learning models aimed at early detection and prevention of heart disease. The records have been cleaned by removing missing data to ensure data integrity. This dataset can be applied to various machine learning algorithms, including classification models such as Decision Trees, Neural Networks, and others.
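The per-source record counts listed above add up to more than the 1,888 records in the merged dataset. The short check below makes the gap explicit. Reading the gap as rows dropped during cleaning or merging is an assumption, since the description does not state how many rows were removed; the dictionary keys are just the source names from the list.

```python
# Record counts as stated for each source dataset.
source_counts = {
    "Heart Attack Analysis & Prediction Dataset": 304,
    "Heart Disease Dataset": 1026,
    "Heart Attack Prediction (Dataset 3)": 295,
    "Heart Attack Prediction (Dataset 4)": 271,
    "Heart CSV Dataset": 290,
}
merged_records = 1888  # size of the merged dataset as stated

total_listed = sum(source_counts.values())
difference = total_listed - merged_records
print(total_listed, difference)
```

Anyone reproducing the merge may want to compare their own row count against both figures before training a model.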
The minute weather dataset comes from the same source as the daily weather dataset that we used in the decision tree based classifier notebook. The main difference between the two is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data is in the file minute_weather.csv, which is a comma-separated file. As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.
Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:
rowID: unique number for each row (Unit: NA)
hpwren_timestamp: timestamp of measurement (Unit: year-month-day hour:minute:second)
air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise)
avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being North and increasing clockwise)
min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
rain_duration: length of time rain has fallen as measured at the timestamp (Unit: seconds)
relative_humidity: relative humidity measured at the timestamp (Unit: percent)
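A row with these variables can be parsed using only the standard library. The sketch below is illustrative: the sample values are made up, only a subset of the columns is shown, and the helper names (`to_float`, `parse_row`) and the Celsius conversion are additions for the example rather than part of the dataset. Empty fields are mapped to None because raw sensor data may contain gaps, which is an assumption about the file.

```python
import csv
import io
from datetime import datetime

def to_float(text):
    """Convert a CSV field to float, mapping empty strings to None."""
    return float(text) if text.strip() else None

def parse_row(row):
    """Parse one minute_weather.csv row into typed values."""
    temp_f = to_float(row["air_temp"])
    return {
        "row_id": int(row["rowID"]),
        "timestamp": datetime.strptime(row["hpwren_timestamp"], "%Y-%m-%d %H:%M:%S"),
        "air_pressure_hpa": to_float(row["air_pressure"]),
        "air_temp_f": temp_f,
        # Fahrenheit to Celsius, added for convenience.
        "air_temp_c": None if temp_f is None else (temp_f - 32.0) * 5.0 / 9.0,
        "relative_humidity_pct": to_float(row["relative_humidity"]),
    }

# Made-up sample rows with a subset of the documented columns.
sample = io.StringIO(
    "rowID,hpwren_timestamp,air_pressure,air_temp,relative_humidity\n"
    "0,2011-09-10 00:00:49,912.3,68.0,60.5\n"
    "1,2011-09-10 00:01:49,912.3,,61.0\n"
)
parsed = [parse_row(r) for r in csv.DictReader(sample)]
print(parsed[0]["timestamp"], parsed[0]["air_temp_c"])
```

For the full three-year file, reading in chunks or downsampling may be preferable to loading every one-minute row at once.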
EUCA dataset description Associated Paper: EUCA: the End-User-Centered Explainable AI Framework
Authors: Weina Jin, Jianyu Fan, Diane Gromala, Philippe Pasquier, Ghassan Hamarneh
Introduction: The EUCA dataset is for modelling personalized or interactive explainable AI. It contains 309 data points of 32 end-users' preferences on 12 forms of explanation (including feature-, example-, and rule-based explanations). The data were collected from a user study on 32 layperson participants in the Greater Vancouver city area in 2019-2020. In the user study, the participants (P01-P32) were presented with AI-assisted critical tasks on house price prediction, health status prediction, purchasing a self-driving car, and studying for a biological exam [1]. Within each task and for its given explanation goal [2], the participants selected and ranked the explanatory forms [3] that they considered the most suitable.
1 EUCA_EndUserXAI_ExplanatoryFormRanking.csv
Column description:
Index - Participant number
Case - Task-explanation goal combination
accept to use AI? trust it? - Participants' response to whether they would use the AI given the task and explanation goal
require explanation? - Participants' response to the question of whether they request an explanation from the AI
1st, 2nd, 3rd, ... - Explanatory form card selection and ranking
cards fulfill requirement? - After the card selection, participants were asked whether the selected card combination fulfills their explainability requirement.
2 EUCA_EndUserXAI_demography.csv
It contains the participants' demographics, including their age, gender, educational background, and their knowledge of and attitudes toward AI.
EUCA dataset zip file for download
More Context for EUCA Dataset
[1] Critical tasks
There are four tasks. The task labels and their corresponding task titles are:
house - Selling your house
car - Buying an autonomous driving vehicle
health - Personal health decision
bird - Learning bird species
Please refer to the EUCA quantitative data analysis report for the storyboard of the tasks and explanation goals presented in the user study.
[2] Explanation goal
End-users may have different goals or purposes when checking an explanation from AI. The EUCA dataset includes the following 11 explanation goals, each listed with its [label] in the dataset, full name, and description:
[trust] Calibrate trust: trust is a key to establish human-AI decision-making partnership. Since users can easily distrust or overtrust AI, it is important to calibrate the trust to reflect the capabilities of AI systems.
[safe] Ensure safety: users need to ensure safety of the decision consequences.
[bias] Detect bias: users need to ensure the decision is impartial and unbiased.
[unexpect] Resolve disagreement with AI: the AI prediction is unexpected and there are disagreements between users and AI.
[expected] Expected: the AI's prediction is expected and aligns with users' expectations.
[differentiate] Differentiate similar instances: due to the consequences of wrong decisions, users sometimes need to discern similar instances or outcomes. For example, a doctor differentiates whether the diagnosis is a benign or malignant tumor.
[learning] Learn: users need to gain knowledge, improve their problem-solving skills, and discover new knowledge.
[control] Improve: users seek causal factors to control and improve the predicted outcome.
[communicate] Communicate with stakeholders: many critical decision-making processes involve multiple stakeholders, and users need to discuss the decision with them.
[report] Generate reports: users need to utilize the explanations to perform particular tasks such as report production. For example, a radiologist generates a medical report on a patient's X-ray image.
[multi] Trade-off multiple objectives: AI may be optimized on an incomplete objective while the users seek to fulfill multiple objectives in real-world applications. For example, a doctor needs to ensure a treatment plan is effective as well as has acceptable patient adherence. Ethical and legal requirements may also be included as objectives.
[3] Explanatory form
The following 12 explanatory forms are end-user-friendly, i.e., no technical knowledge is required for the end-user to interpret the explanation.
Feature-Based Explanation
Feature Attribution - fa
Note: for tasks that have images as input data, the feature attribution is denoted by the following two cards:
ir: important regions (a.k.a. heat map or saliency map)
irc: important regions with their feature contribution percentage
Feature Shape - fs
Feature Interaction - fi
Example-Based Explanation
Similar Example - se
Typical Example - te
Counterfactual Example - ce
Note: for the counterfactual example, there were two visual variations used in the user study:
cet: counterfactual example with transition from one example to the counterfactual one
ceh: counterfactual example with the contrastive feature highlighted
Rule-Based Explanation
Rule - rt
Decision Tree - dt
Decision Flow - df
Supplementary Information
Input
Output
Performance
Dataset - prior (output prediction with prior distribution of each class in the training set)
Note: occasionally there is a wild card, which means the participant drew the card themselves. It is indicated as 'wc'.
For visual examples of each explanatory form card, please refer to the Explanatory_form_labels.pdf document.
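When analysing the ranking file, it can help to translate the card codes above into readable names. The sketch below builds that lookup from the codes listed in this section. The function `decode_ranking`, the example ranking, and the handling of unknown codes are illustrative assumptions; Input, Output, and Performance are omitted because no codes are given for them here.

```python
# Card code -> (explanatory form, family), taken from the list above.
EXPLANATORY_FORMS = {
    "fa": ("Feature Attribution", "Feature-Based Explanation"),
    "ir": ("Important Regions", "Feature-Based Explanation"),
    "irc": ("Important Regions with Contribution Percentage", "Feature-Based Explanation"),
    "fs": ("Feature Shape", "Feature-Based Explanation"),
    "fi": ("Feature Interaction", "Feature-Based Explanation"),
    "se": ("Similar Example", "Example-Based Explanation"),
    "te": ("Typical Example", "Example-Based Explanation"),
    "ce": ("Counterfactual Example", "Example-Based Explanation"),
    "cet": ("Counterfactual Example with Transition", "Example-Based Explanation"),
    "ceh": ("Counterfactual Example with Highlight", "Example-Based Explanation"),
    "rt": ("Rule", "Rule-Based Explanation"),
    "dt": ("Decision Tree", "Rule-Based Explanation"),
    "df": ("Decision Flow", "Rule-Based Explanation"),
    "prior": ("Dataset (prior distribution)", "Supplementary Information"),
    "wc": ("Wild Card", "Participant-Drawn"),
}

def decode_ranking(codes):
    """Turn an ordered list of card codes into (rank, form, family) tuples."""
    decoded = []
    for rank, code in enumerate(codes, start=1):
        key = code.strip().lower()
        form, family = EXPLANATORY_FORMS.get(key, ("Unknown: " + code, "Unknown"))
        decoded.append((rank, form, family))
    return decoded

# Hypothetical ranking as it might appear in the 1st, 2nd, 3rd, ... columns.
example = decode_ranking(["fa", "cet", "dt", "wc"])
for item in example:
    print(item)
```

Keeping unknown codes visible rather than raising an error makes it easier to spot spelling variants in the raw file.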
Link to the details on users' requirements on different explanatory forms
Code and report for EUCA data quantitative analysis
EUCA data analysis code
EUCA quantitative data analysis report
EUCA data citation
@article{jin2021euca,
  title={EUCA: the End-User-Centered Explainable AI Framework},
  author={Weina Jin and Jianyu Fan and Diane Gromala and Philippe Pasquier and Ghassan Hamarneh},
  year={2021},
  eprint={2102.02437},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}