Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced, contains no missing values, and the data were standardized across features. The small number of samples precluded a full and robust statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.
Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments was performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
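For readers who want to reproduce a comparable workflow in code rather than in Orange's GUI, the sketch below approximates these settings with SciPy and scikit-learn. The data are synthetic placeholders, and scikit-learn offers neither a gain-ratio criterion nor a "majority reaches 95%" stop rule, so the entropy criterion stands in for both.

```python
# A minimal sketch, not the authors' Orange workflow.
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 11))       # stand-in for the standardized feature matrix
y = np.repeat(np.arange(9), 4)      # 9 classes, 4 samples per class

# Hierarchical clustering with Euclidean distance and weighted linkage
Z = linkage(X, method="weighted", metric="euclidean")

# Decision tree with settings analogous to those reported above,
# evaluated with stratified cross-validation
tree = DecisionTreeClassifier(criterion="entropy",
                              min_samples_leaf=2, min_samples_split=5)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
print(cross_val_score(tree, X, y, cv=cv).mean())
```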
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This file is the data set from the famous publication Francis J. Anscombe, "*Graphs in Statistical Analysis*", The American Statistician 27, pp. 17-21 (1973) (doi: 10.1080/00031305.1973.10478966). It consists of four data sets of 11 points each. Note the peculiarity that the same 'x' values are used for the first three data sets; I have followed this exactly as in the original publication (originally done to save space), i.e. the first column (x123) serves as the 'x' for the next three 'y' columns: y1, y2 and y3.
In the dataset Anscombe_quintet_data.csv there is a new column (y5) as an example of Simpson's paradox (C. McBride Ellis, "*Anscombe dataset No. 5: Simpson's paradox*", Zenodo, doi: 10.5281/zenodo.15209087 (2025)).
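The quartet's point, that near-identical summary statistics can mask very different structures, is easy to verify. A minimal sketch, assuming the column layout described above (x123 shared by y1, y2 and y3):

```python
import pandas as pd

df = pd.read_csv("Anscombe_quintet_data.csv")  # column names assumed from the description
for y in ["y1", "y2", "y3"]:
    print(y,
          round(df[y].mean(), 2),            # ~7.50 for each set
          round(df[y].var(ddof=1), 2),       # ~4.12 for each set
          round(df["x123"].corr(df[y]), 3))  # ~0.816 for each set
```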
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains YouTube trending video statistics for various Mediterranean countries. Its primary purpose is to provide insights into popular video content, channels, and viewer engagement across the region over specific periods. It is valuable for analysing content trends, understanding regional audience preferences, and assessing video performance metrics on the YouTube platform.
The dataset is structured in a tabular format, typically provided as a CSV file. It consists of 15 distinct columns detailing various aspects of YouTube trending videos. While the exact total number of rows or records is not specified, the data includes trending video counts for several date ranges in 2022:
* 06/04/2022 - 06/08/2022: 31 records
* 06/08/2022 - 06/11/2022: 56 records
* 06/11/2022 - 06/15/2022: 57 records
* 06/15/2022 - 06/19/2022: 111 records
* 06/19/2022 - 06/22/2022: 130 records
* 06/22/2022 - 06/26/2022: 207 records
* 06/26/2022 - 06/29/2022: 321 records
* 06/29/2022 - 07/03/2022: 523 records
* 07/03/2022 - 07/07/2022: 924 records
* 07/07/2022 - 07/10/2022: 861 records

The dataset features 19 unique countries and 1347 unique video IDs. View counts for videos in the dataset range from approximately 20.9 thousand to 123 million.
This dataset is well-suited for a variety of analytical applications and use cases:
* Exploratory Data Analysis (EDA): Discovering patterns, anomalies, and relationships within YouTube trending content.
* Data Manipulation and Querying: Practising data handling using libraries such as pandas or NumPy in Python, or executing queries with SQL.
* Natural Language Processing (NLP): Analysing video titles, tags, and descriptions to extract key themes, sentiment, and trending topics.
* Trend Prediction: Developing models to forecast future trending videos or content categories.
* Cross-Country Comparison: Examining how trending content varies across different Mediterranean nations (see the sketch below).
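A minimal sketch of the cross-country comparison referenced above; the file name and column names (country, view_count) are assumptions, since the exact schema is not specified here:

```python
import pandas as pd

df = pd.read_csv("youtube_trending_mediterranean.csv")  # assumed file name
summary = (df.groupby("country")["view_count"]          # assumed column names
             .agg(["count", "median", "max"])
             .sort_values("median", ascending=False))
print(summary)
```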
CC0
Original Data Source: YouTube Trending Videos of the Day
https://creativecommons.org/publicdomain/zero/1.0/
Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the over 118,000 cases of the Coronavirus illness in over 110 countries and territories around the world at the time.
This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:
* confirmed tested cases of Coronavirus infection
* the number of people who have reportedly died while sick with Coronavirus
* the number of people who have reportedly recovered from it
Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been doing a great public service from an early point by collating data from around the world.
We have cleaned and normalized that data, for example tidying dates and consolidating several files into normalized time series. We have also added some metadata such as column descriptions and packaged the data.
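A minimal sketch of the kind of tidying described above, assuming the wide layout (one column per date after the location columns) used by the upstream JHU CSSE files; the file name follows that repository's convention but should be treated as an assumption here:

```python
import pandas as pd

wide = pd.read_csv("time_series_covid19_confirmed_global.csv")  # assumed upstream file
tidy = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)
tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y")
print(tidy.head())
```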
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented using various packages and syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is built on Python and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
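ImputEHR itself is a graphical tool, but the gradient-boosted, tree-based style of imputation it describes can be sketched with scikit-learn's iterative imputer; this illustrates the general approach, not the tool's implementation:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan   # ~20% missingness, as is common in EHRs

imputer = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),  # gradient-boosted trees per feature
    max_iter=10,
    random_state=0,
)
X_complete = imputer.fit_transform(X)
print(np.isnan(X_complete).sum())  # 0: no missing values remain
```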
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Unsupervised exploratory data analysis (EDA) is often the first step in understanding complex data sets. While summary statistics are among the most efficient and convenient tools for exploring and describing sets of data, they are often overlooked in EDA. In this paper, we show multiple case studies that compare the performance, including clustering, of a series of summary statistics in EDA. The summary statistics considered here are pattern recognition entropy (PRE), the mean, standard deviation (STD), 1-norm, range, sum of squares (SSQ), and X4, which are compared with principal component analysis (PCA), multivariate curve resolution (MCR), and/or cluster analysis. PRE and the other summary statistics are direct methods for analyzing data; they are not factor-based approaches. To quantify the performance of summary statistics, we use the concept of the "critical pair," which is employed in chromatography. The data analyzed here come from different analytical methods. Hyperspectral images, including one of a biological material, are also analyzed. In general, PRE outperforms the other summary statistics, especially in image analysis, although a suite of summary statistics is useful in exploring complex data sets. While PRE results were generally comparable to those from PCA and MCR, PRE is easier to apply. For example, there is no need to determine the number of factors that describe a data set. Finally, we introduce the concept of divided spectrum-PRE (DS-PRE) as a new EDA method. DS-PRE increases the discrimination power of PRE. We also show that DS-PRE can be used to provide the inputs for the k-nearest neighbor (kNN) algorithm. We recommend PRE and DS-PRE as rapid new tools for unsupervised EDA.
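A minimal sketch of the direct, non-factor-based summary statistics named above, computed row-wise on a synthetic spectral matrix (PRE itself is defined in the paper and is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(50, 200)))  # synthetic spectra: 50 samples x 200 channels

summaries = {
    "mean": X.mean(axis=1),
    "STD": X.std(axis=1, ddof=1),
    "1-norm": np.abs(X).sum(axis=1),
    "range": X.max(axis=1) - X.min(axis=1),
    "SSQ": (X ** 2).sum(axis=1),
}
for name, values in summaries.items():
    print(f"{name:7s} first 3 samples: {np.round(values[:3], 2)}")
```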
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is an R package, OutliersO3, implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zipped file contains the following:
- data (as csv, in the 'data' folder),
- R scripts (as Rmd, in the 'rro' folder),
- figures (as pdf, in the 'figs' folder), and
- presentation (as html, in the root folder).
The Iris Dataset. This data set consists of the petal and sepal measurements of 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray, with rows as samples and columns as: Sepal Length, Sepal Width, Petal Length and Petal Width.
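The same array ships with scikit-learn, which offers a quick way to load it:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4)
print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```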
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Samples relating to 12 analyses of lay-theories of resilience among participants from the USA, New Zealand, India, Iran, and Russia (Moscow; Kazan). Central variables relate to participant endorsements of resilience descriptors. Demographic data includes (though not for all samples) Sex/Gender, Age, Ethnicity, Work, and Educational Status.
Analysis 1. USA Exploratory Factor Analysis data
Analysis 2. New Zealand Exploratory Factor Analysis data
Analysis 3. India Exploratory Factor Analysis data
Analysis 4. Iran Exploratory Factor Analysis data
Analysis 5. Russian (Moscow) Exploratory Factor Analysis data
Analysis 6. Russian (Kazan) Exploratory Factor Analysis data
Analysis 7. USA Confirmatory Factor Analysis data
Analysis 8. New Zealand Confirmatory Factor Analysis data
Analysis 9. India Confirmatory Factor Analysis data
Analysis 10. Iran Confirmatory Factor Analysis data
Analysis 11. Russian (Moscow) Confirmatory Factor Analysis data
Analysis 12. Russian (Kazan) Confirmatory Factor Analysis data
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains customer reviews for British Airways, a prominent airline operating in the United Kingdom [1, 2]. It offers a wide range of experiences and opinions shared by travellers [2]. The dataset is valuable for analysing customer sentiment, identifying recurring issues, tracking feedback trends over time, and segmenting reviews for more targeted insights [2].
The dataset is typically provided in CSV format [3]. While specific row counts are not explicitly stated, the data includes counts indicating review distribution, such as 293 entries for various label ranges, suggesting approximately 2930 records in total [1, 4]. It also notes that 2923 unique values are present for "British Airways customer review" and 1615 unique values for "never fly British Airways again" [4].
This dataset is ideal for various analytical applications, including:
* Sentiment analysis: To gauge overall customer sentiment concerning British Airways [2].
* Theme identification: Pinpointing common themes or issues frequently mentioned by reviewers [2].
* Trend tracking: Monitoring changes in customer feedback and satisfaction over time [2].
* Targeted analysis: Segmenting reviews based on specific customer attributes for more focused insights [2].
The geographic scope of the reviews primarily includes the United Kingdom (62%) and the United States (11%), with other locations making up 26% of the data [4]. The dataset contains a 'date' column for time-based analysis, but a specific time range for the reviews is not specified in the provided information [1, 2]. Demographic details about the reviewers are not included.
CC0
This dataset is suitable for:
* Data analysts and scientists: For building sentiment models or conducting exploratory data analysis.
* Market research professionals: To understand customer perceptions and identify areas for service improvement.
* Airline industry stakeholders: To monitor brand reputation and competitive landscape.
* Students and researchers: For academic projects related to natural language processing (NLP), text mining, or customer experience studies.
Original Data Source: British Airways Customer Reviews
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
This component contains the data and syntax code used to conduct the Exploratory Factor Analysis and compute Velicer’s minimum average partial test in sample 1
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Goodreads Book Reviews dataset encapsulates a wealth of reviews and various attributes concerning the books listed on the Goodreads platform. A distinguishing feature of this dataset is its capture of multiple tiers of user interaction, ranging from adding a book to a "shelf", to rating and reading it. This dataset is a treasure trove for those interested in understanding user behavior, book recommendations, sentiment analysis, and the interplay between various attributes of books and user interactions.
Basic Statistics:
- Items: 1,561,465
- Users: 808,749
- Interactions: 225,394,930
Metadata:
- Reviews: The text of the reviews provided by users.
- Add-to-shelf, Read, Review Actions: Various interactions users have with the books.
- Book Attributes: Attributes describing the books, including title and ISBN.
- Graph of Similar Books: A graph depicting similarity relations between books.
Example (interaction data):
```json
{
  "user_id": "8842281e1d1347389f2ab93d60773d4d",
  "book_id": "130580",
  "review_id": "330f9c153c8d3347eb914c06b89c94da",
  "isRead": true,
  "rating": 4,
  "date_added": "Mon Aug 01 13:41:57 -0700 2011",
  "date_updated": "Mon Aug 01 13:42:41 -0700 2011",
  "read_at": "Fri Jan 01 00:00:00 -0800 1988",
  "started_at": ""
}
```
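A minimal sketch for streaming interaction records of this shape, assuming a JSON-lines file (one record per line; the file name is an assumption):

```python
import json

with open("goodreads_interactions.json") as f:
    for n, line in enumerate(f):
        record = json.loads(line)
        if record["isRead"] and record["rating"] >= 4:
            print(record["user_id"], record["book_id"], record["rating"])
        if n >= 99:  # inspect only the first 100 records
            break
```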
Use Cases:
- Book Recommendations: Creating personalized book recommendations based on user interactions and preferences.
- Sentiment Analysis: Analyzing sentiment in reviews and understanding how different book attributes influence sentiment.
- User Behavior Analysis: Understanding user interaction patterns with books and deriving insights to enhance user engagement.
- Natural Language Processing: Training models to process and analyze user-generated text in reviews.
- Similarity Analysis: Analyzing the graph of similar books to understand book similarities and clustering.
Citation:
Please cite the following if you use the data:
Item recommendation on monotonic behavior chains
Mengting Wan, Julian McAuley
RecSys, 2018
[PDF](https://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18e.pdf)
Code Samples: A curated set of code samples is provided in the dataset's GitHub repository, aiding in seamless interaction with the datasets. These include:
- Downloading datasets without GUI: Facilitating dataset download in a non-GUI environment.
- Displaying Sample Records: Showcasing sample records to get a glimpse of the dataset structure.
- Calculating Basic Statistics: Computing basic statistics to understand the dataset's distribution and characteristics.
- Exploring the Interaction Data: Delving into interaction data to grasp user-book interaction patterns.
- Exploring the Review Data: Analyzing review data to extract valuable insights from user reviews.
Additional Dataset:
- Complete book reviews (~15m multilingual reviews about ~2m books and 465k users): This dataset comprises a comprehensive collection of reviews, showcasing a multilingual facet with reviews about around 2 million books from 465,000 users.
Datasets:
Description: The COVID-19 dataset used for this EDA project encompasses comprehensive data on COVID-19 cases, deaths, and recoveries worldwide. It includes information gathered from authoritative sources such as the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), and national health agencies. The dataset covers global, regional, and national levels, providing a holistic view of the pandemic's impact.
Purpose: This dataset is instrumental in understanding the multifaceted impact of the COVID-19 pandemic through data exploration. It aligns with the objectives of the EDA project, aiming to unveil insights, patterns, and trends related to COVID-19. Here are the key objectives:
1. Data Collection and Cleaning:
• Gather reliable COVID-19 datasets from authoritative sources (such as WHO, CDC, or national health agencies).
• Clean and preprocess the data to ensure accuracy and consistency.
2. Descriptive Statistics:
• Summarize key statistics: total cases, recoveries, deaths, and testing rates.
• Visualize temporal trends using line charts, bar plots, and heat maps.
3. Geospatial Analysis:
• Map COVID-19 cases across countries, regions, or cities.
• Identify hotspots and variations in infection rates.
4. Demographic Insights:
• Explore how age, gender, and pre-existing conditions impact vulnerability.
• Investigate disparities in infection rates among different populations.
5. Healthcare System Impact:
• Analyze hospitalization rates, ICU occupancy, and healthcare resource allocation.
• Assess the strain on medical facilities.
6. Economic and Social Effects:
• Investigate the relationship between lockdown measures, economic indicators, and infection rates.
• Explore behavioral changes (e.g., mobility patterns, remote work) during the pandemic.
7. Predictive Modeling (Optional):
• If data permits, build simple predictive models (e.g., time series forecasting) to estimate future cases.
Data Sources: The primary sources of the COVID-19 dataset include the Johns Hopkins CSSE COVID-19 Data Repository, Google Health’s COVID-19 Open Data, and the U.S. Economic Development Administration (EDA). These sources provide reliable and up-to-date information on COVID-19 cases, deaths, testing rates, and other relevant variables. Additionally, GitHub repositories and platforms like Medium host supplementary datasets and analyses, enriching the available data resources.
Data Format: The dataset is available in various formats, such as CSV and JSON, facilitating easy access and analysis. Before conducting the EDA, the data underwent preprocessing steps to ensure accuracy and consistency. Data cleaning procedures were performed to address missing values, inconsistencies, and outliers, enhancing the quality and reliability of the dataset.
License: The COVID-19 dataset may be subject to specific usage licenses or restrictions imposed by the original data sources. Proper attribution is essential to acknowledge the contributions of the WHO, CDC, national health agencies, and other entities providing the data. Users should adhere to any licensing terms and usage guidelines associated with the dataset.
Attribution: We acknowledge the invaluable contributions of the World Health Organization (WHO), the Centers for Disease Control and Prevention (CDC), national health agencies, and other authoritative sources in compiling and disseminating the COVID-19 data used for this EDA project. Their efforts in collecting, curating, and sharing data have been instrumental in advancing our understanding of the pandemic and guiding public health responses globally.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the files used in the ASL analyses of my study: all of the data and calculations for my primary analysis, my exploratory analyses (except the one using a video from The Daily Moth, which can be found in a separate folder), and the ASL portions of my secondary analysis. As described in my dissertation, I am not sharing the original video files in order to protect the privacy of those who participated in my study. Each file is shared in one or more of the formats listed below, as appropriate:
- PDF
- .csv files (one file for each sheet)
- Link to my Google Sheets file
The average American’s diet does not align with the Dietary Guidelines for Americans (DGA) provided by the U.S. Department of Agriculture and the U.S. Department of Health and Human Services (2020). The present study aimed to compare fruit and vegetable consumption among those who had and had not heard of the DGA, identify characteristics of DGA users, and identify barriers to DGA use. A nationwide survey of 943 Americans revealed that those who had heard of the DGA ate more fruits and vegetables than those who had not. Men, African Americans, and those who have more education had greater odds of using the DGA as a guide when preparing meals relative to their respective counterparts. Disinterest, effort, and time were among the most cited reasons for not using the DGA. Future research should examine how to increase DGA adherence among those unaware of or who do not use the DGA. Comparative analyses of fruit and vegetable consumption among those who were aware/unaware and use/do not use the DGA were completed using independent samples t tests. Fruit and vegetable consumption variables were log-transformed for analysis. Binary logistic regression was used to examine whether demographic features (race, gender, and age) predict DGA awareness and usage. Data were analyzed using SPSS version 28.1 and SAS/STAT® version 9.4 TS1M7 (2023 SAS Institute Inc).
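The reported pipeline used SPSS and SAS; below is a minimal Python sketch of the same two analyses (an independent-samples t test on log-transformed consumption, and a binary logistic regression) on synthetic data, purely to illustrate the methods, not to reproduce the study's results:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
aware = rng.lognormal(mean=1.1, sigma=0.5, size=500)    # servings/day, heard of DGA
unaware = rng.lognormal(mean=1.0, sigma=0.5, size=443)  # servings/day, had not

# Independent-samples t test on log-transformed consumption
t, p = stats.ttest_ind(np.log(aware), np.log(unaware))
print(f"t = {t:.2f}, p = {p:.4f}")

# Binary logistic regression: demographic predictor vs. DGA use
gender = rng.integers(0, 2, size=943)              # synthetic 0/1 predictor
uses_dga = rng.binomial(1, 0.25 + 0.10 * gender)   # synthetic outcome
fit = sm.Logit(uses_dga, sm.add_constant(gender)).fit(disp=0)
print(fit.params)  # intercept and log-odds for the predictor
```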
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains material related to the analysis performed in the article "Best Practices for Your Exploratory Factor Analysis: Factor Tutorial". The material includes the data used in the analyses in .dat format, the labels (.txt) of the variables used in the Factor software, the outputs (.txt) evaluated in the article, and videos (.mp4 with English subtitles) recorded for the purpose of explaining the article. The videos can also be accessed in the following playlist: https://youtube.com/playlist?list=PLDfyRtHbxiZ3R-T3H1cY8dusz273aUFVe. Below is a summary of the article:
"Exploratory Factor Analysis (EFA) is one of the statistical methods most widely used in Administration, however, its current practice coexists with rules of thumb and heuristics given half a century ago. The purpose of this article is to present the best practices and recent recommendations for a typical EFA in Administration through a practical solution accessible to researchers. In this sense, in addition to discussing current practices versus recommended practices, a tutorial with real data on Factor is illustrated, a software that is still little known in the Administration area, but freeware, easy to use (point and click) and powerful. The step-by-step illustrated in the article, in addition to the discussions raised and an additional example, is also available in the format of tutorial videos. Through the proposed didactic methodology (article-tutorial + video-tutorial), we encourage researchers/methodologists who have mastered a particular technique to do the same. Specifically, about EFA, we hope that the presentation of the Factor software, as a first solution, can transcend the current outdated rules of thumb and heuristics, by making best practices accessible to Administration researchers"
This dataset provides a collection of public tweets related to traffic in Metro Manila, Philippines. It was compiled by scraping Twitter using the keywords "traffic manila", offering insights into the daily experiences and sentiments of people enduring the urban traffic situation. The dataset serves as a valuable resource for understanding public discourse and patterns concerning traffic congestion in the region.
The dataset typically comes as a data file, often in CSV format. It covers tweets posted from January 1, 2022, to December 23, 2022. The data includes various counts for likes and tweet timestamps across this period, reflecting a significant volume of public social media activity related to Manila traffic.
This dataset is ideal for a range of applications, including:
* Exploratory data analysis of social media content.
* Natural Language Processing (NLP) tasks, such as sentiment analysis or topic modelling on tweets about traffic (see the sketch below).
* Text mining for patterns and trends in urban discussions.
* Research on urban areas and cities, particularly focusing on traffic dynamics and public opinion.
* Analysing social networks and public interactions regarding transport issues.
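A minimal text-mining sketch for the NLP item referenced above; the file name and the tweet text column are assumptions:

```python
import re
from collections import Counter

import pandas as pd

df = pd.read_csv("manila_traffic_tweets.csv")  # assumed file name
counts = Counter()
for text in df["tweet"].dropna():              # assumed text column
    counts.update(re.findall(r"[a-z']+", text.lower()))
print(counts.most_common(20))
```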
The dataset's geographic scope is focused on Metro Manila, covering public tweets using the keywords "traffic manila". It spans a time range from January 1, 2022, to December 23, 2022. The data reflects the general public's social media activity during this period without specific demographic breakdowns beyond the public nature of the tweets.
CC-BY
Original Data Source: Traffic Tweets in Manila - Jan 2022 - Dec 2022