Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Source/Credit: Michael Grogan https://github.com/MGCodesandStats https://github.com/MGCodesandStats/datasets/blob/master/cars.csv
Sample dataset for regression analysis. Given five attributes (age, gender, miles driven per day, debt, and income), predict how much someone will spend on purchasing a car. All five input attributes have been scaled to the 0–1 range. The training set has 723 examples and the test set has 242 examples.
This dataset will be used in an upcoming Galaxy Training Network tutorial (https://training.galaxyproject.org/training-material/topics/statistics/) on the use of feedforward neural networks for regression analysis.
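For orientation, here is a minimal sketch of the kind of regression this dataset supports, using scikit-learn's MLPRegressor as a stand-in for the feedforward network covered in the tutorial. The column names and the split file names are assumptions, not taken from the source repository.

```python
# Minimal sketch, not the dataset authors' code. The predictor/target column
# names and the split file names are assumptions; check the header of cars.csv
# before running.
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

train = pd.read_csv("cars_train.csv")   # hypothetical file names for the
test = pd.read_csv("cars_test.csv")     # 723/242 train/test split described above

features = ["age", "gender", "miles", "debt", "income"]  # assumed predictor columns
target = "sales"                                          # assumed target column

model = MLPRegressor(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(train[features], train[target])
print("test R^2:", r2_score(test[target], model.predict(test[features])))
```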
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gender microaggressions, especially their subtler forms, microinsults and microinvalidations, are by definition hard to discern. We aim to construct and validate a scale reflecting two facets of the microaggression taxonomy, microinsults and microinvalidations toward women in the workplace: the MIMI-16. Two studies were conducted (N1 = 500, N2 = 612). Using a genetic algorithm, a 16-item scale was developed and subsequently validated via confirmatory factor analyses (CFA) in three separate validation samples. Correlational analyses with organizational outcome measures were performed. The MIMI-16 exhibits good model fit in all validation samples (CFI = 0.936–0.960, TLI = 0.926–0.954, RMSEA = 0.046–0.062, SRMR = 0.042–0.049). Multigroup CFA suggested strict measurement invariance across all validation samples. Correlations were as expected and indicate internal and external validity. Research on gender microaggressions has mostly been qualitative. With the newly developed MIMI-16 we provide a reliable and valid quantitative instrument for measuring gender microaggressions in the workplace.
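As an illustration only, a two-factor CFA of this kind could be fit in Python with the semopy package; the item names, item-to-factor assignment, and file name below are hypothetical placeholders, not the authors' specification.

```python
# Illustrative only: a two-factor CFA (microinsults, microinvalidations) for a
# 16-item scale, estimated with semopy. Item names mi1..mi8 / mv1..mv8 and the
# file name are hypothetical placeholders.
import pandas as pd
import semopy

data = pd.read_csv("mimi16_validation_sample.csv")  # hypothetical file name

model_desc = """
microinsults      =~ mi1 + mi2 + mi3 + mi4 + mi5 + mi6 + mi7 + mi8
microinvalidation =~ mv1 + mv2 + mv3 + mv4 + mv5 + mv6 + mv7 + mv8
"""

model = semopy.Model(model_desc)
model.fit(data)
stats = semopy.calc_stats(model)   # includes CFI, TLI and RMSEA, among other indices
print(stats[["CFI", "TLI", "RMSEA"]])
```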
There are two CSV datasets in this publication, used initially in the master's thesis in sociology of Xenia Leontyeva at HSE University Saint Petersburg, titled "Popularity Factors of Domestic Films: Gender Characteristics and State Support Measures" (2022), and later for the article by Leontyeva, Xenia, Olessia Koltsova, and Deb Verhoeven, titled "Gender (Im)Balance in Russian Cinema: On the Screen and behind the Camera" (accepted in January 2024 in The Journal of Cultural Analytics). The first dataset (N = 1285) includes all Russian films produced between 2008 and 2019 and theatrically released between December 1, 2008, and December 31, 2019. Distribution statistics cover the territory of the CIS, of which the Russian Federation is the biggest market. Budget information is available for 644 films. The second dataset contains markup for the Bechdel-Wallace test (as modified by Leontyeva) for 243 films, 193 of which have budget information. There is also a supplement with a detailed description of all variables and R code producing the tables, plots, and models for the article.
The database was collected by Xenia Leontyeva while working at Nevafilm Research (until 2018) and later. In terms of distribution data, it is based on sources such as the open Russian Cinema Fund Analytics base – RCFA (since 2015), the closed comScore/Rentrak base ("International Box Office Essential") serving major Hollywood studios (data from it has been used since 2008 to fill gaps in open databases), Bookers' Bulletin (since 2011), and Russian Film Business Today magazines (since 2004), as well as data collected directly by Nevafilm Research employees from film distributors and producers; the rights to use and continue this dataset have been received from the Nevafilm company. In terms of production data, the information was taken from the State register of film distribution certificates, Kinopoisk.ru, and the films' credits.
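A minimal pandas sketch of how the two CSVs might be combined is shown below; the file names, the budget column, and the join key are hypothetical placeholders and must be adapted to the actual variable names documented in the supplement.

```python
# Illustrative sketch only; file and column names are assumptions, not taken
# from the publication. It mirrors the counts described above: films with
# budget data and the Bechdel-Wallace markup subset.
import pandas as pd

films = pd.read_csv("russian_films_2008_2019.csv")       # hypothetical name, N = 1285
bechdel = pd.read_csv("bechdel_markup.csv")              # hypothetical name, N = 243

with_budget = films[films["budget"].notna()]             # expected to yield 644 rows
merged = bechdel.merge(films, on="film_id", how="left")  # "film_id" is a placeholder key
print(len(with_budget), len(merged))
```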
EUCA dataset description
Associated Paper: EUCA: the End-User-Centered Explainable AI Framework
Authors: Weina Jin, Jianyu Fan, Diane Gromala, Philippe Pasquier, Ghassan Hamarneh
Introduction: The EUCA dataset is for modelling personalized or interactive explainable AI. It contains 309 data points of 32 end-users' preferences on 12 forms of explanation (including feature-, example-, and rule-based explanations). The data were collected in a user study with 32 layperson participants in the Greater Vancouver area in 2019-2020. In the user study, the participants (P01-P32) were presented with AI-assisted critical tasks on house price prediction, health status prediction, purchasing a self-driving car, and studying for a biology exam 1. Within each task and for its given explanation goal 2, the participants selected and ranked the explanatory forms 3 that they considered most suitable.
1 EUCA_EndUserXAI_ExplanatoryFormRanking.csv
Column description:
Index - participant number
Case - task-explanation goal combination
accept to use AI? trust it? - participant's response to whether they would use the AI given the task and explanation goal
require explanation? - participant's response to whether they requested an explanation from the AI
1st, 2nd, 3rd, ... - explanatory form card selection and ranking
cards fulfill requirement? - after the card selection, participants were asked whether the selected card combination fulfilled their explainability requirement
2 EUCA_EndUserXAI_demography.csv
It contains the participants' demographics, including their age, gender, educational background, and their knowledge of and attitudes toward AI.
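A minimal sketch for working with the two files is given below; it assumes the column headers match the shorthand used above (e.g. "Index" and "1st"), which should be verified against the CSVs.

```python
# Sketch under assumptions: exact column headers in the CSVs may differ from
# the shorthand used in the description above.
import pandas as pd

ranking = pd.read_csv("EUCA_EndUserXAI_ExplanatoryFormRanking.csv")
demo = pd.read_csv("EUCA_EndUserXAI_demography.csv")

# How often was each explanatory form card chosen as the top-ranked card?
top_card_counts = ranking["1st"].value_counts()       # assumes the column is literally named "1st"
print(top_card_counts)

# Join rankings with participant demographics on the participant number.
joined = ranking.merge(demo, on="Index", how="left")  # assumes "Index" is shared by both files
print(joined.head())
```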
EUCA dataset zip file for download
More Context for EUCA Dataset
1 Critical tasks
There are four tasks. The task labels and their corresponding task titles are:
house - Selling your house
car - Buying an autonomous driving vehicle
health - Personal health decision
bird - Learning bird species
Please refer to the EUCA quantitative data analysis report for the storyboard of the tasks and explanation goals presented in the user study.
2 Explanation goal
End-users may have different goals/purposes when checking an explanation from AI. The EUCA dataset includes the following 11 explanation goals, each listed with its [label] in the dataset, full name, and description (a label-to-name mapping is sketched in code after this list).
[trust] Calibrate trust: trust is key to establishing a human-AI decision-making partnership. Since users can easily distrust or overtrust AI, it is important to calibrate trust to reflect the capabilities of the AI system.
[safe] Ensure safety: users need to ensure the safety of the decision's consequences.
[bias] Detect bias: users need to ensure the decision is impartial and unbiased.
[unexpect] Resolve disagreement with AI: the AI prediction is unexpected and there are disagreements between users and AI.
[expected] Expected: the AI's prediction is expected and aligns with users' expectations.
[differentiate] Differentiate similar instances: due to the consequences of wrong decisions, users sometimes need to discern similar instances or outcomes. For example, a doctor differentiates whether the diagnosis is a benign or malignant tumor.
[learning] Learn: users need to gain knowledge, improve their problem-solving skills, and discover new knowledge.
[control] Improve: users seek causal factors to control and improve the predicted outcome.
[communicate] Communicate with stakeholders: many critical decision-making processes involve multiple stakeholders, and users need to discuss the decision with them.
[report] Generate reports: users need to utilize the explanations to perform particular tasks such as report production. For example, a radiologist generates a medical report on a patient's X-ray image.
[multi] Trade-off multiple objectives: AI may be optimized on an incomplete objective while the users seek to fulfill multiple objectives in real-world applications. For example, a doctor needs to ensure that a treatment plan is effective and also has acceptable patient adherence. Ethical and legal requirements may also be included as objectives.
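Since the Case column combines a task with one of these goal labels, a small label-to-name mapping can make summaries more readable. The mapping below only restates the list above; the assumption that Case is formatted as "task-goal" is illustrative.

```python
# Goal labels (from the list above) mapped to their full names.
# Splitting the Case column as "task-goal" is an assumption about its format.
GOALS = {
    "trust": "Calibrate trust",
    "safe": "Ensure safety",
    "bias": "Detect bias",
    "unexpect": "Resolve disagreement with AI",
    "expected": "Expected",
    "differentiate": "Differentiate similar instances",
    "learning": "Learn",
    "control": "Improve",
    "communicate": "Communicate with stakeholders",
    "report": "Generate reports",
    "multi": "Trade-off multiple objectives",
}

def describe_case(case: str) -> str:
    """Turn a 'task-goal' string such as 'health-bias' into a readable label."""
    task, goal = case.split("-", 1)
    return f"{task}: {GOALS.get(goal, goal)}"
```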
3 Explanatory form
The following 12 explanatory forms are end-user-friendly, i.e., no technical knowledge is required for end-users to interpret the explanation. (Their short codes are collected into a lookup table in the sketch at the end of this section.)
Feature-Based Explanation
Feature Attribution - fa
Note: for tasks that have images as input data, the feature attribution is denoted by the following two cards:
ir: important regions (a.k.a. heat map or saliency map)
irc: important regions with their feature contribution percentage
Feature Shape - fs
Feature Interaction - fi
Example-Based Explanation
Similar Example - se
Typical Example - te
Counterfactual Example - ce
Note: for the counterfactual example, there were two visual variations used in the user study:
cet: counterfactual example with a transition from one example to its counterfactual
ceh: counterfactual example with the contrastive feature highlighted
Rule-Based Explanation
Rule - rt
Decision Tree - dt
Decision Flow - df
Supplementary Information
Input
Output
Performance
Dataset - prior (output prediction with the prior distribution of each class in the training set)
Note: occasionally there is a wild card, which means the participant drew the card by themselves. It is indicated as 'wc'.
For visual examples of each explanatory form card, please refer to the Explanatory_form_labels.pdf document.
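For convenience, the short card codes mentioned in this section can be collected into a single lookup table, as in the sketch below; codes not spelled out in this description (e.g. for the Input, Output, and Performance cards) are omitted and should be taken from Explanatory_form_labels.pdf.

```python
# Short codes for the explanatory form cards, collected from the list above.
# Codes for the supplementary Input/Output/Performance cards are not given in
# this description, so they are omitted here; see Explanatory_form_labels.pdf
# for the authoritative set.
FORM_CODES = {
    "fa": "Feature Attribution",
    "ir": "Important regions (image feature attribution)",
    "irc": "Important regions with feature contribution percentage",
    "fs": "Feature Shape",
    "fi": "Feature Interaction",
    "se": "Similar Example",
    "te": "Typical Example",
    "ce": "Counterfactual Example",
    "cet": "Counterfactual example shown as a transition",
    "ceh": "Counterfactual example with contrastive feature highlighted",
    "rt": "Rule",
    "dt": "Decision Tree",
    "df": "Decision Flow",
    "prior": "Output prediction with class prior distribution",
    "wc": "Wild card (participant-drawn)",
}
```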
Link to the details on users' requirements on different explanatory forms
Code and report for EUCA data quantitative analysis
EUCA data analysis code
EUCA quantitative data analysis report
EUCA data citation
@article{jin2021euca,
  title={EUCA: the End-User-Centered Explainable AI Framework},
  author={Weina Jin and Jianyu Fan and Diane Gromala and Philippe Pasquier and Ghassan Hamarneh},
  year={2021},
  eprint={2102.02437},
  archivePrefix={arXiv},
  primaryClass={cs.HC}
}
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Customer log dataset is a 12.5 GB JSON file containing 18 columns and 26,259,199 records. There are 12 string columns and 6 numeric columns, which may also contain null or NaN values. The columns are userId, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts and userAgent. As the column names suggest, the dataset contains various user-related information, such as user identifiers (userId), demographic details (firstName, lastName, gender), interaction details (artist, song, length, itemInSession, sessionId, registration, ts) and technical details (userAgent, method, page, location, status, level, auth).
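At 12.5 GB, the log is more comfortably processed with Spark than with pandas. A minimal PySpark sketch (the file path is a placeholder) reads the log, prints the schema to check it against the 18 columns listed, and drops records without a usable userId:

```python
# Minimal PySpark sketch; the file path is a placeholder. Reads the event log,
# prints the schema to verify the 18 columns described above, and filters out
# records without a usable userId.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-log").getOrCreate()

df = spark.read.json("customer_log.json")   # hypothetical file name
df.printSchema()                            # expect the 18 columns described above

clean = df.filter(F.col("userId").isNotNull() & (F.col("userId") != ""))
print(df.count(), clean.count())
```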
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: To construct and validate a predictive model for risk factors in children with severe adenoviral pneumonia based on chest low-dose CT imaging and clinical features.
Methods: A total of 177 patients with adenoviral pneumonia who underwent low-dose CT examination between January 2019 and August 2019 were collected. Based on the assessment criteria for severe pneumonia, patients were divided into a mild group (N = 125) and a severe group (N = 52). All cases were divided into a training cohort (N = 125) and a validation cohort (N = 52). We constructed a prediction model by drawing a nomogram and verified the predictive efficacy of the model through the ROC curve, calibration curve and decision curve analysis.
Results: The differences between the mild and severe adenovirus pneumonia groups were statistically significant (P < 0.05) in gender, age, weight, body temperature, L/N ratio, LDH, ALT, AST, CK-MB, ADV DNA, bronchial inflation sign, emphysema, ground-glass sign, bronchial wall thickening, bronchiectasis, pleural effusion, consolidation score, and lobular inflammation score. Multivariate logistic regression analysis showed that gender, LDH value, emphysema, consolidation score, and lobular inflammation score were independent risk factors for severe adenovirus pneumonia in children. Logistic regression was employed to construct a clinical model, an imaging semantic feature model, and a combined model. The AUC values of the training sets of the three models were 0.85 (0.77–0.94), 0.83 (0.75–0.91), and 0.91 (0.85–0.97), and the AUCs of the validation sets were 0.77 (0.64–0.91), 0.83 (0.71–0.94), and 0.85 (0.73–0.96), respectively. The calibration curves of the three models fit well. The clinical decision curve analysis demonstrated the clinical application value of the nomogram prediction model.
Conclusion: The prediction model based on chest low-dose CT image characteristics and clinical characteristics has relatively clear predictive value in distinguishing mild from severe adenovirus pneumonia in children and might provide a new method for early clinical prediction of the outcome of adenovirus pneumonia in children.
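For readers who want to reproduce a comparable baseline, the sketch below fits a logistic regression on the independent risk factors named in the abstract and reports a validation ROC AUC; it is not the authors' code, and the file and column names are placeholders.

```python
# Hedged sketch, not the authors' code: a logistic regression on the
# independent risk factors named above (gender, LDH, emphysema, consolidation
# score, lobular inflammation score), evaluated by ROC AUC as in the abstract.
# File and column names are placeholders; predictors are assumed to be
# numerically coded already.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

features = ["gender", "LDH", "emphysema", "consolidation_score", "lobular_inflammation_score"]

train = pd.read_csv("adv_pneumonia_train.csv")   # hypothetical file, N = 125
valid = pd.read_csv("adv_pneumonia_valid.csv")   # hypothetical file, N = 52

clf = LogisticRegression(max_iter=1000)
clf.fit(train[features], train["severe"])        # "severe" = 1 for the severe group

auc = roc_auc_score(valid["severe"], clf.predict_proba(valid[features])[:, 1])
print("validation AUC:", round(auc, 2))
```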