100+ datasets found
  1. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented across various packages and syntaxes, so a full suite of methods is out of reach for all except experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, when it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is implemented in Python and offers a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing-data problem, the tool is primarily motivated by problems encountered with EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
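ImputEHR's own code is not shown here, but the simplest end of the imputation spectrum it covers (column-mean imputation) can be sketched in a few lines of Python; the `mean_impute` helper and the toy matrix below are illustrative only, not part of the tool:

```python
def mean_impute(rows):
    """Replace None entries with the column mean -- a toy stand-in for
    the simplest imputation strategy a tool like ImputEHR offers."""
    cols = list(zip(*rows))
    means = []
    for col in cols:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

# Tiny invented EHR-style matrix with two missing cells.
ehr = [
    [1.0, 2.0],
    [None, 4.0],
    [3.0, None],
]
print(mean_impute(ehr))
```

Tree-based and neural imputers replace the column mean with a model fit on the observed columns, but the fill-in-the-gaps interface is the same.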

  2. ‘US Health Insurance Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 29, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘US Health Insurance Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-health-insurance-dataset-920a/latest
    Explore at:
    Dataset updated
    Feb 29, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared to other industries, even within the BFSI sphere (compared to the banking sector, for example). Extremely complex underwriting rule-sets that differ radically across product lines, many non-KYC environments lacking a centralized customer information base, a complex relationship with consumers in traditional risk underwriting where customer centricity sometimes runs counter to business profit, and the inertia of regulatory compliance are some of the unique challenges faced by the insurance business.

    Despite this, emergent technologies like AI and blockchain have brought radical change to insurance, and data analytics sits at the core of this transformation. We can identify four key factors behind the emergence of analytics as a crucial part of InsurTech:

    • Big Data: the explosion of unstructured data in the form of images, videos, text, emails, and social media
    • AI: recent advances in machine learning and deep learning that enable businesses to gain insight, do predictive analytics, and build cost- and time-efficient innovative solutions
    • Real-time processing: the ability to process information in real time through various data feeds (e.g., social media, news)
    • Increased computing power: a complex ecosystem of new analytics vendors and solutions that enable carriers to combine data sources, external insights, and advanced modeling techniques to glean insights that were not possible before

    This dataset can support a simple yet illuminating study of risk underwriting in health insurance: the interplay of various attributes of the insured, and how they affect the insurance premium.

    Content

    This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.

    Inspiration

    This relatively simple dataset should be an excellent starting point for EDA, Statistical Analysis and Hypothesis testing and training Linear Regression models for predicting Insurance Premium Charges.

    Proposed tasks:

    • Exploratory data analytics
    • Statistical hypothesis testing
    • Statistical modeling
    • Linear regression
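One of the proposed tasks, a linear regression predicting charges, can be sketched with plain least squares; the rows below are invented stand-ins, not records from the Kaggle file, and the categorical columns (sex, region) are omitted for brevity:

```python
import numpy as np

# Toy stand-in rows: (age, bmi, children, smoker) -> charges.
# Invented values; the real file has 1338 rows and would need
# one-hot encoding for sex and region first.
X = np.array([[19, 27.9, 0, 1],
              [33, 22.7, 1, 0],
              [28, 33.0, 3, 0],
              [45, 25.7, 2, 1]], dtype=float)
y = np.array([16000.0, 4400.0, 5400.0, 22000.0])

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept and coefficients:", coef)
```

With real data one would hold out a test split and inspect residuals rather than fit on four rows, but the model form is the same.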

    --- Original source retains full ownership of the source dataset ---

  3. Data for "Nitrate-containing groundwater in Denmark: Exploratory data...

    • dataverse.geus.dk
    • search.dataone.org
    txt, xlsx
    Updated Apr 8, 2025
    Cite
    Denitza Voutchkova; Denitza Voutchkova (2025). Data for "Nitrate-containing groundwater in Denmark: Exploratory data analysis at the national scale" [Dataset]. http://doi.org/10.22008/FK2/SDJPUD
    Explore at:
    Available download formats: xlsx (334785), xlsx (4882151), xlsx (380164), txt (9269)
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    GEUS Dataverse
    Authors
    Denitza Voutchkova
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Denmark
    Description

    The three data products included in this repository were created in relation to the project Nret24 ("Forbedret kvælstof-retentionskortlægning til ny reguleringsmodel af landbruget", i.e., improved nitrogen-retention mapping for a new agricultural regulation model). They are reported in detail in GEUS report 2025/8, "Nitrate-containing groundwater in Denmark: Exploratory data analysis at the national scale" (https://doi.org/10.22008/gpub/34765). A content description of the three data products is available in Read_me.txt. Details on data sources and data processing are provided in the report.

  4. SEM regression for H1-5.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). SEM regression for H1-5. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  5. Vietnamese Online News .csv dataset

    • opendatabay.com
    .csv
    Updated Jun 14, 2025
    Cite
    Datasimple (2025). Vietnamese Online News .csv dataset [Dataset]. https://www.opendatabay.com/data/dataset/bfe7c501-da11-4802-8bce-b044bcce3e8c
    Explore at:
    Available download formats: .csv
    Dataset updated
    Jun 14, 2025
    Dataset authored and provided by
    Datasimple
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Area covered
    Social Media and Networking
    Description

    Initially, the format of this dataset was .json, so I converted it to .csv for ease of data processing.

    "Online articles from the 25 most popular news sites in Vietnam in July 2022, suitable for practicing Natural Language Processing in Vietnamese.

    Online news outlets are an unavoidable part of our society today due to their easy, mostly free access. Their effects on the way communities think and act are becoming a concern for a multitude of groups, including legislators, content creators, and marketers, to name a few. Beyond these effects, what is written in the news should be a good reflection of people’s will, attention, and even cultural standards.

    In Vietnam, even though journalists have received much criticism, especially in recent years, news outlets still receive a lot of traffic (27%) compared to other methods of receiving information."

    Original Data Source: Vietnamese Online News .csv dataset

  6. The five-step co-duction cycle.

    • plos.figshare.com
    xls
    Updated Nov 4, 2024
    Cite
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn (2024). The five-step co-duction cycle. [Dataset]. http://doi.org/10.1371/journal.pone.0309318.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Nov 4, 2024
    Dataset provided by
    PLOS, http://plos.org/
    Authors
    Daan Kolkman; Gwendolyn K. Lee; Arjen van Witteloostuijn
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.

  7. Youtube cookery channels viewers comments in Hinglish

    • zenodo.org
    csv
    Updated Jan 24, 2020
    + more versions
    Cite
    Abhishek Kaushik; Abhishek Kaushik; Gagandeep Kaur; Gagandeep Kaur (2020). Youtube cookery channels viewers comments in Hinglish [Dataset]. http://doi.org/10.5281/zenodo.2841848
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo, http://zenodo.org/
    Authors
    Abhishek Kaushik; Gagandeep Kaur
    License

    Open Data Commons Attribution License (ODC-By) v1.0, https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    The data was collected from popular cookery YouTube channels in India, with a major focus on viewers' comments in Hinglish. The datasets are taken from the top two Indian cooking channels, Nisha Madhulika and Kabita’s Kitchen.

    Comments in both datasets are divided into seven categories:

    Label 1- Gratitude

    Label 2- About the recipe

    Label 3- About the video

    Label 4- Praising

    Label 5- Hybrid

    Label 6- Undefined

    Label 7- Suggestions and queries

    All the labelling has been done manually.

    Nisha Madhulika dataset:

    Dataset characteristics: Multivariate

    Number of instances: 4900

    Area: Cooking

    Attribute characteristics: Real

    Number of attributes: 3

    Date donated: March, 2019

    Associate tasks: Classification

    Missing values: Null

    Kabita Kitchen dataset:

    Dataset characteristics: Multivariate

    Number of instances: 4900

    Area: Cooking

    Attribute characteristics: Real

    Number of attributes: 3

    Date donated: March, 2019

    Associate tasks: Classification

    Missing values: Null

    There are two separate dataset files for each channel: a preprocessing file and a main file.

    The preprocessing files were generated after preprocessing and exploratory data analysis on both datasets. Each includes:

    • Id
    • Comment text
    • Labels
    • Count of stop-words
    • Uppercase words
    • Hashtags
    • Word count
    • Char count
    • Average words
    • Numeric

    The main file includes:

    • Id
    • comment text
    • Labels
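The derived columns of the preprocessing files can be recomputed from the raw comment text; a toy sketch, using a placeholder stop-word list rather than the authors' list:

```python
# Tiny placeholder stop-word set; the authors' actual list is not published here.
STOPWORDS = {"the", "is", "a", "to", "and", "for"}

def comment_features(text):
    """Recompute per-comment columns like those in the preprocessing files:
    counts of words, characters, stop-words, uppercase words, hashtags,
    and numeric tokens."""
    words = text.split()
    return {
        "word_count": len(words),
        "char_count": len(text),
        "stopwords": sum(w.lower() in STOPWORDS for w in words),
        "uppercase_words": sum(w.isupper() for w in words),
        "hashtags": sum(w.startswith("#") for w in words),
        "numeric": sum(w.isdigit() for w in words),
    }

print(comment_features("Thanks for the recipe #yummy 10 STARS"))
```

The exact definitions (e.g., how "average words" is computed) are the dataset authors'; this sketch only shows the general feature-engineering step.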

    Please cite the paper

    https://www.mdpi.com/2504-2289/3/3/37

    MDPI and ACS Style

    Kaur, G.; Kaushik, A.; Sharma, S. Cooking Is Creating Emotion: A Study on Hinglish Sentiments of Youtube Cookery Channels Using Semi-Supervised Approach. Big Data Cogn. Comput. 2019, 3, 37.

  8. Depression Dataset

    • opendatabay.com
    Updated Jun 9, 2025
    Cite
    Datasimple (2025). Depression Dataset [Dataset]. https://www.opendatabay.com/data/healthcare/5b3fe7f8-08d0-499e-bffc-3aea3fc7816c
    Explore at:
    Available download formats: unspecified
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Mental Health & Wellness
    Description

    Dataset Overview This dataset contains user reports about depression collected from various Reddit forums focused on depression-related topics. The dataset has been anonymized to protect user privacy by removing the user ID and publication dates. It consists of three main columns: title, content, and score.

    Dataset Columns The dataset contains the following columns:

    title: This column represents the title of the user report. It provides a concise summary or description of the report's content.

    content: The content column contains the detailed report provided by the user. It may include personal experiences, thoughts, feelings, or any relevant information related to depression.

    score: The score column represents the score or rating assigned to the publication by other users. The score could indicate the level of engagement, agreement, or relevance as determined by the Reddit community.

    Data Usage The dataset can be used for various purposes, including but not limited to:

    • Text analysis and natural language processing tasks
    • Sentiment analysis and emotion detection
    • Topic modeling and clustering
    • Depression research and analysis
    • Machine learning model training and evaluation
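Loading the three described columns and filtering by community score is the usual first step before any of these tasks; the rows below are invented stand-ins in the title/content/score layout, not actual posts:

```python
import csv
import io

# Hypothetical rows in the title/content/score layout described above;
# text and scores are invented for illustration.
raw = io.StringIO(
    "title,content,score\n"
    "Feeling low,Some days are harder than others,12\n"
    "Small win,Managed a walk outside today,30\n"
)
rows = list(csv.DictReader(raw))

# e.g., keep highly engaged posts before topic modeling or sentiment analysis
high = [r["title"] for r in rows if int(r["score"]) >= 20]
print(high)
```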

    Original Data Source: Depression Dataset

  9. EDA augmentation parameters.

    • plos.figshare.com
    xls
    Updated Sep 26, 2024
    + more versions
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). EDA augmentation parameters. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
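Easy Data Augmentation (EDA), one of the techniques the study evaluates, consists of four simple text operations (synonym replacement, random insertion, random swap, random deletion). One of them, random swap, can be sketched generically; this is an illustration of the operation, not the authors' implementation:

```python
import random

def random_swap(words, n, seed=0):
    """One of the four EDA operations: exchange n randomly chosen
    pairs of word positions, yielding a perturbed copy of the sentence."""
    rng = random.Random(seed)
    out = words[:]  # leave the original sentence untouched
    for _ in range(n):
        i = rng.randrange(len(out))
        j = rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

sentence = "el clasificador mejora con datos aumentados".split()
print(" ".join(random_swap(sentence, 2)))
```

Each augmented copy keeps the original label, so a small labeled Spanish dataset can be expanded severalfold before training.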

  10. Reddit r/AskScience Flair Dataset

    • data.mendeley.com
    Updated May 23, 2022
    + more versions
    Cite
    Sumit Mishra (2022). Reddit r/AskScience Flair Dataset [Dataset]. http://doi.org/10.17632/k9r2d9z999.3
    Explore at:
    Dataset updated
    May 23, 2022
    Authors
    Sumit Mishra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit is a social news, content rating, and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million users who use it once a month. Reddit is organized into subreddits; here we use the r/AskScience subreddit.

    The dataset is extracted from the subreddit r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022 and contains 612,668 data points and 25 columns. It includes information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with some cleaning done using NumPy and pandas (see the descriptions of individual columns below).

    The dataset contains the following columns and descriptions:

    • author - Redditor name
    • author_fullname - Redditor full name
    • contest_mode - Contest mode (implements obscured scores and randomized sorting)
    • created_utc - Time the submission was created, represented in Unix time
    • domain - Domain of the submission
    • edited - Whether the post was edited
    • full_link - Link to the post on the subreddit
    • id - ID of the submission
    • is_self - Whether the submission is a self post (text-only)
    • link_flair_css_class - CSS class used to identify the flair
    • link_flair_text - The link flair’s text content
    • locked - Whether the submission has been locked
    • num_comments - Number of comments on the submission
    • over_18 - Whether the submission has been marked as NSFW
    • permalink - A permalink for the submission
    • retrieved_on - Time ingested
    • score - Number of upvotes for the submission
    • description - Description of the submission
    • spoiler - Whether the submission has been marked as a spoiler
    • stickied - Whether the submission is stickied
    • thumbnail - Thumbnail of the submission
    • question - Question asked in the submission
    • url - The URL the submission links to, or the permalink if a self post
    • year - Year of the submission
    • banned - Whether banned by a moderator

    This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also reveal trends and patterns over the years.
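Setting up the flair-prediction task amounts to filtering on a few of the columns above; the rows here are invented stand-ins, not records from the actual dump:

```python
import csv
import io

# Toy rows using a few of the listed columns (link_flair_text, over_18,
# question); invented values -- the real dump has 612,668 rows and 25 columns.
raw = io.StringIO(
    "link_flair_text,over_18,question\n"
    "Physics,False,Why is the sky blue?\n"
    "Biology,False,How do vaccines work?\n"
    "Physics,True,What happens inside a black hole?\n"
)
rows = list(csv.DictReader(raw))

# SFW filter, then the label set a flair classifier would predict
# from the question text.
sfw = [r for r in rows if r["over_18"] == "False"]
flairs = {r["link_flair_text"] for r in sfw}
print(flairs)
```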

  11. Data from: Supplementary Material for "Sonification for Exploratory Data...

    • search.datacite.org
    Updated Feb 5, 2019
    Cite
    Thomas Hermann (2019). Supplementary Material for "Sonification for Exploratory Data Analysis" [Dataset]. http://doi.org/10.4119/unibi/2920448
    Explore at:
    Dataset updated
    Feb 5, 2019
    Dataset provided by
    DataCite, https://www.datacite.org/
    Bielefeld University
    Authors
    Thomas Hermann
    License

    Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Sonification for Exploratory Data Analysis

    #### Chapter 8: Sonification Models

    In Chapter 8 of the thesis, 6 sonification models are presented as examples for the framework of Model-Based Sonification developed in Chapter 7. Sonification models determine the rendering of the sonification and the possible interactions. The "model in mind" helps the user to interpret the sound with respect to the data.

    ##### 8.1 Data Sonograms

    Data Sonograms use spherical expanding shock waves to excite linear oscillators, which are represented by point masses in model space.

    * Table 8.2, page 87: Data Sonogram sound examples for synthetic datasets and the Iris dataset (duration: about 5 s). Iris dataset: started in plot (a) at S0, (b) at S1, (c) at S2. 10d noisy circle dataset: started in plot (c) at S0 (mean), (d) at S1 (edge). 10d Gaussian: plot (d), started at S0. 3 clusters: Example 1. 3 clusters, invisible columns used as output variables: Example 2.

    ##### 8.2 Particle Trajectory Sonification Model

    This sonification model explores features of a data distribution by computing the trajectories of test particles which are injected into model space and move according to Newton's laws of motion in a potential given by the dataset.

    * Sound example, page 93, PTSM-Ex-1: Audification of 1 particle in the potential of phi(x).
    * Sound example, page 93, PTSM-Ex-2: Audification of a sequence of 15 particles in the potential of a dataset with 2 clusters.
    * Sound example, page 94, PTSM-Ex-3: Audification of 25 simultaneous particles in a potential of a dataset with 2 clusters.
    * Sound example, page 94, PTSM-Ex-4: Audification of 25 simultaneous particles in a potential of a dataset with 1 cluster.
    * Sound example, page 95, PTSM-Ex-5: sigma-step sequence for a mixture of three Gaussian clusters.
    * Sound example, page 95, PTSM-Ex-6: sigma-step sequence for a Gaussian cluster.
    * Sound example, page 96, PTSM-Iris-1: Sonification for the Iris dataset with 20 particles per step.
    * Sound example, page 96, PTSM-Iris-2: Sonification for the Iris dataset with 3 particles per step.
    * Sound example, page 96, PTSM-Tetra-1: Sonification for a 4d tetrahedron clusters dataset.

    ##### 8.3 Markov chain Monte Carlo Sonification

    The McMC Sonification Model defines an exploratory process in the domain of a given density p such that the acoustic representation summarizes features of p, particularly concerning the modes of p, by sound.

    * Sound example, page 105, MCMC-Ex-1: McMC Sonification, stabilization of amplitudes.
    * Sound example, page 106, MCMC-Ex-2: Trajectory audification for 100 McMC steps in a 3-cluster dataset.
    * McMC Sonification for cluster analysis, dataset with three clusters, page 107: Stream 1 MCMC-Ex-3.1, Stream 2 MCMC-Ex-3.2, Stream 3 MCMC-Ex-3.3, Mix MCMC-Ex-3.4.
    * McMC Sonification for cluster analysis, dataset with three clusters, T = 0.002 s, page 107: Stream 1 MCMC-Ex-4.1, Stream 2 MCMC-Ex-4.2, Stream 3 MCMC-Ex-4.3, Mix MCMC-Ex-4.4.
    * McMC Sonification for cluster analysis, density with 6 modes, T = 0.008 s, page 107: Stream 1 MCMC-Ex-5.1, Stream 2 MCMC-Ex-5.2, Stream 3 MCMC-Ex-5.3, Mix MCMC-Ex-5.4.
    * McMC Sonification for the Iris dataset, page 108: MCMC-Ex-6.1 through MCMC-Ex-6.8.

    ##### 8.4 Principal Curve Sonification

    Principal Curve Sonification represents data by synthesizing the soundscape while a virtual listener moves along the principal curve of the dataset through the model space.

    * Noisy spiral dataset: PCS-Ex-1.1, page 113.
    * Noisy spiral dataset with variance modulation: PCS-Ex-1.2, page 114.
    * 9d tetrahedron cluster dataset (10 clusters): PCS-Ex-2, page 114.
    * Iris dataset, class label used as pitch of auditory grains: PCS-Ex-3, page 114.

    ##### 8.5 Data Crystallization Sonification Model

    * Table 8.6, page 122: Sound examples for Crystallization Sonification for a 5d Gaussian distribution; DCS for a dataset sampled from N{0, I_5}, excited at different locations (started at center, in tail, from far outside). Duration: 1.4 s.
    * Mixture of 2 Gaussians, page 122: DCS started at point A (DCS-Ex1A); DCS started at point B (DCS-Ex1B).
    * Table 8.7, page 124: Sound examples on variation of the harmonics factor (h_omega = 1, 2, 3, 4, 5, 6) for a mixture of two Gaussians. Duration: 1.4 s.
    * Table 8.8, page 124: Sound examples on variation of the energy decay time (tau_(1/2) = 0.001, 0.005, 0.01, 0.05, 0.1, 0.2) for a mixture of two Gaussians. Duration: 1.4 s.
    * Table 8.9, page 125: Sound examples on variation of the sonification time (T = 0.2, 0.5, 1, 2, 4, 8) for a mixture of two Gaussians. Duration: 0.2 s -- 8 s.
    * Table 8.10, page 125: Sound examples on variation of model space dimension (selected columns of the dataset: (x0), (x0,x1), (x0,...,x2), (x0,...,x3), (x0,...,x4), (x0,...,x5)) for a mixture of two Gaussians. Duration: 1.4 s.
    * Table 8.11, page 126: Sound examples for different excitation locations (starting point: C0, C1, C2); DCS for a mixture of three Gaussians in 10d space with different rank(S) = {2,4,8}. Duration: 1.9 s.
    * Table 8.12, page 126: Sound examples for the mixture of a 2d distribution and a 5d cluster; condensation nucleus in the (x0,x1)-plane at (-6,0)=C1, (-3,0)=C2, (0,0)=C0; DCS for a mixture of a uniform 2d and a 5d Gaussian. Duration: 2.16 s.
    * Table 8.13, page 127: Sound examples for the cancer dataset; condensation nucleus in the (x0,x1)-plane at: benign 1, benign 2, malignant 1, malignant 2. Duration: 2.16 s.

    ##### 8.6 Growing Neural Gas Sonification

    * Table 8.14, page 133: Sound examples for GNGS probing; Cluster C0 (2d): a, b, c; Cluster C1 (4d): a, b, c; Cluster C2 (8d): a, b, c. GNGS for a mixture of 3 Gaussians in 10d space. Duration: 1 s.
    * Table 8.15, page 134: Sound examples for GNGS for the noisy spiral dataset: (a) GNG with 3 neurons: 1, 2; (b) GNG with 20 neurons: end, middle, inner end; (c) GNG with 45 neurons: outer end, middle, close to inner end, at inner end; (d) GNG with 150 neurons: outer end, in the middle, inner end; (e) GNG with 20 neurons: outer end, in the middle, inner end; (f) GNG with 45 neurons: outer end, in the middle, inner end. Duration: 1 s.
    * Table 8.16, page 136: Sound examples for GNG Process Monitoring Sonification for different data distributions: noisy spiral with 1 rotation, noisy spiral with 2 rotations, Gaussian in 5d, mixture of 5d and 2d distributions. Duration: 5 s.

    #### Chapter 9: Extensions

    In this chapter, two extensions for Parameter Mapping

  12. Detailed characterization of the dataset.

    • figshare.com
    xls
    Updated Sep 26, 2024
    Cite
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda (2024). Detailed characterization of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0310707.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Rodrigo Gutiérrez Benítez; Alejandra Segura Navarrete; Christian Vidal-Castro; Claudia Martínez-Araneda
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
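The "easy data augmentation" (EDA) operations mentioned above are simple token-level edits. A minimal sketch of two of them, random swap and random deletion, follows; the function names, operation counts, and example sentence are illustrative and not taken from the study:

```python
import random

def random_swap(tokens, n_swaps=1, seed=0):
    """EDA-style 'random swap': exchange the positions of two random tokens."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(tokens, p=0.1, seed=0):
    """EDA-style 'random deletion': drop each token with probability p."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]  # never return an empty sentence

sentence = "me gusta mucho esta canción".split()
print(random_swap(sentence, n_swaps=2, seed=1))
print(random_deletion(sentence, p=0.3, seed=1))
```

Each augmented sentence keeps (a subset of) the original vocabulary, which is why these operations tend to preserve the original label while adding surface variety.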

  13. Comprehensive Supply Chain Analysis

    • kaggle.com
    Updated Sep 15, 2023
    Cite
    Dorothy Joel (2023). Comprehensive Supply Chain Analysis [Dataset]. https://www.kaggle.com/datasets/dorothyjoel/us-regional-sales
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dorothy Joel
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This supply chain analysis provides a comprehensive view of the company's order and distribution processes, allowing for in-depth analysis and optimization of various aspects of the supply chain, from procurement and inventory management to sales and customer satisfaction. It empowers the company to make data-driven decisions to improve efficiency, reduce costs, and enhance customer experiences. The provided supply chain analysis dataset contains various columns that capture important information related to the company's order and distribution processes:

    • OrderNumber • Sales Channel • WarehouseCode • ProcuredDate • CurrencyCode • OrderDate • ShipDate • DeliveryDate • SalesTeamID • CustomerID • StoreID • ProductID • Order Quantity • Discount Applied • Unit Cost • Unit Price
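With pandas, these columns support straightforward cycle-time and margin calculations. A small sketch on made-up rows (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# A few illustrative rows using the column names listed above (values are made up).
df = pd.DataFrame({
    "OrderNumber": ["SO-1", "SO-2"],
    "Sales Channel": ["In-Store", "Online"],
    "OrderDate": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "ShipDate": pd.to_datetime(["2020-01-05", "2020-01-04"]),
    "DeliveryDate": pd.to_datetime(["2020-01-09", "2020-01-08"]),
    "Order Quantity": [5, 3],
    "Discount Applied": [0.075, 0.05],
    "Unit Cost": [100.0, 80.0],
    "Unit Price": [150.0, 120.0],
})

# Order-cycle and unit-economics features commonly derived from these columns.
df["DaysToDeliver"] = (df["DeliveryDate"] - df["OrderDate"]).dt.days
df["Revenue"] = df["Order Quantity"] * df["Unit Price"] * (1 - df["Discount Applied"])
df["Margin"] = df["Revenue"] - df["Order Quantity"] * df["Unit Cost"]

print(df[["OrderNumber", "DaysToDeliver", "Revenue", "Margin"]])
```

The same derived columns can then be grouped by Sales Channel, WarehouseCode, or SalesTeamID to compare delivery performance and profitability across segments.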

  14. Replication Package for 'Data-Driven Analysis and Optimization of Machine...

    • zenodo.org
    zip
    Updated Jun 11, 2025
    Cite
    Joel Castaño; Joel Castaño (2025). Replication Package for 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data' [Dataset]. http://doi.org/10.5281/zenodo.15643706
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 11, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joel Castaño; Joel Castaño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data

    This repository contains the full replication package for the Master's thesis 'Data-Driven Analysis and Optimization of Machine Learning Systems Using MLPerf Benchmark Data'. The project focuses on leveraging public MLPerf benchmark data to analyze ML system performance and develop a multi-objective optimization framework for recommending optimal hardware configurations.
    The framework considers the trade-offs between three key objectives:
    1. Performance (maximizing throughput)
    2. Energy Efficiency (minimizing estimated energy per unit)
    3. Cost (minimizing estimated hardware cost)

    Repository Structure

    This repository is organized as follows:
    • Data_Analysis.ipynb: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the eda_plots/ directory.
    • Dataset_Extension.ipynb: A Jupyter Notebook used for the data enrichment process. It takes the raw Inference_data.csv and produces the Inference_data_Extended.csv by adding detailed hardware specifications, cost estimates, and derived energy metrics.
    • Optimization_Model.ipynb: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
    • Inference_data.csv: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
    • Inference_data_Extended.csv: The final, enriched dataset used for all analysis and modeling. This is the output of the Dataset_Extension.ipynb notebook.
    • eda_log.txt: A text log file containing summary statistics generated during the exploratory data analysis.
    • requirements.txt: A list of all necessary Python libraries and their versions required to run the code in this repository.
    • eda_plots/: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
    • optimization_models_final/: A directory where the trained and saved final model files (.joblib) are stored after running the optimization notebook.
    • pareto_validation_plot_fold_0.png: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
    • shap_waterfall_final_model.png: The SHAP plot used for the model interpretability analysis, as presented in the thesis.

    Requirements and Installation

    To reproduce the results, it is recommended to use a Python virtual environment to avoid conflicts with other projects.
    1. Clone the repository:
    bash
    git clone
    cd
    2. Create and activate a virtual environment (optional but recommended):
    bash
    python -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
    3. Install the required packages:
    All dependencies are listed in the `requirements.txt` file. Install them using pip:
    bash
    pip install -r requirements.txt

    Step-by-Step Reproduction Workflow

    The notebooks are designed to be run in a logical sequence.

    Step 1: Data Enrichment (Optional)

    The final enriched dataset (Inference_data_Extended.csv) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the Dataset_Extension.ipynb notebook. It will take Inference_data.csv as input and generate the extended version.

    Step 2: Exploratory Data Analysis (Optional)

    All plots from the EDA are pre-generated and available in the eda_plots/ directory. To regenerate them, run the Data_Analysis.ipynb notebook. This will overwrite the existing plots and the eda_log.txt file.

    Step 3: Main Model Training, Validation, and Recommendation

    This is the core of the thesis. Running the Optimization_Model.ipynb notebook will execute the entire pipeline described in the paper:
    1. It will perform the 5-fold group-aware cross-validation to validate the performance of the predictive models.
    2. It will train the final production models on the entire dataset and save them to the optimization_models_final/ directory.
    3. It will generate the final Pareto front recommendations and single-best recommendations for the Computer Vision task.
    4. It will generate the final figures used in the results section, including pareto_validation_plot_fold_0.png and shap_waterfall_final_model.png.
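The Pareto-front step above rests on a simple idea: discard any hardware configuration that another configuration beats or matches on all three objectives. A minimal sketch of that dominance filter (the function name and example numbers are illustrative, not taken from the replication package):

```python
import numpy as np

def pareto_front(points):
    """Return a boolean mask of non-dominated rows.

    Each row is (throughput, energy, cost); throughput is maximized,
    energy and cost are minimized. A point is dominated if another point
    is at least as good on every objective and strictly better on one.
    """
    pts = np.asarray(points, dtype=float)
    # Negate throughput so every objective becomes "smaller is better".
    costs = pts * np.array([-1.0, 1.0, 1.0])
    mask = np.ones(len(costs), dtype=bool)
    for i in range(len(costs)):
        # A row dominates row i if it is <= everywhere and < somewhere.
        dominates = np.all(costs <= costs[i], axis=1) & np.any(costs < costs[i], axis=1)
        if dominates.any():
            mask[i] = False
    return mask

configs = [
    (1000.0, 50.0, 8000.0),   # efficient and cheap
    (1000.0, 60.0, 9000.0),   # same throughput, worse energy and cost
    (1500.0, 80.0, 15000.0),  # faster but pricier
]
print(pareto_front(configs))  # first and third configs are non-dominated
```

The notebook's recommendations additionally rely on predicted (rather than measured) objective values, but the filtering logic is the same.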
  15. EDA for Automotive Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Feb 15, 2025
    Cite
    Data Insights Market (2025). EDA for Automotive Report [Dataset]. https://www.datainsightsmarket.com/reports/eda-for-automotive-538211
    Explore at:
    pdf, ppt, doc (available download formats)
    Dataset updated
    Feb 15, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Engineering Design Automation (EDA) in the automotive industry is projected to reach a value of USD 3.7 billion by 2033, expanding at a CAGR of 5.8% during the forecast period (2025-2033). The growth of the market is primarily driven by the increasing adoption of advanced driver-assistance systems (ADAS) and autonomous vehicles, which require sophisticated software and electronic components. Moreover, the growing demand for lightweight and fuel-efficient vehicles is also contributing to the adoption of EDA tools, as they enable engineers to optimize vehicle designs for better performance and efficiency.

    The key trends shaping the automotive EDA market include the increasing adoption of cloud-based EDA solutions, the growing popularity of Model-Based Design (MBD) methodologies, and the integration of EDA tools with other software applications. The adoption of cloud-based EDA solutions is gaining traction as it offers several advantages, such as improved accessibility, scalability, and cost-effectiveness. MBD methodologies are also becoming increasingly popular as they enable engineers to create virtual prototypes of vehicles, which can be used to evaluate design performance and identify potential issues early in the design process. The integration of EDA tools with other software applications, such as computer-aided design (CAD) and product lifecycle management (PLM) systems, is also enhancing the overall efficiency of the vehicle design process.

  16. Industrial Production Statistical Analysis Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 7, 2025
    Cite
    Data Insights Market (2025). Industrial Production Statistical Analysis Software Report [Dataset]. https://www.datainsightsmarket.com/reports/industrial-production-statistical-analysis-software-504068
    Explore at:
    doc, pdf, ppt (available download formats)
    Dataset updated
    May 7, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Industrial Production Statistical Analysis Software is experiencing robust growth, projected at a Compound Annual Growth Rate (CAGR) of 5.2% from 2025 to 2033. In 2025, the market size reached $3748 million. This expansion is fueled by several key factors. Firstly, the increasing adoption of Industry 4.0 and digital transformation initiatives across manufacturing sectors is driving demand for sophisticated data analytics solutions. Businesses are increasingly reliant on data-driven decision-making to optimize production processes, improve efficiency, and enhance product quality. Secondly, the growing complexity of industrial processes necessitates advanced software capable of handling large datasets and providing actionable insights. This includes real-time monitoring, predictive maintenance, and quality control applications. The software’s ability to identify patterns and anomalies crucial to preventing production bottlenecks and maximizing output contributes significantly to its appeal. Finally, stringent regulatory compliance requirements and a growing focus on sustainability are further pushing adoption. Companies need robust data analysis tools to comply with environmental standards and track their carbon footprint. Segmentation reveals a diverse market landscape. The application segment is dominated by architecture, mechanical engineering, and the automotive industry, each leveraging the software for unique purposes such as design optimization, simulation, and performance analysis. Within types, 3D modeling and analysis software are gaining traction due to their ability to represent complex geometries and improve design accuracy. The geographical distribution shows a strong presence in North America and Europe, driven by technological advancements and robust manufacturing industries in these regions. 
However, the Asia-Pacific region is expected to witness significant growth in the coming years, fuelled by rapid industrialization and rising technological adoption in countries like China and India. Leading players such as Autodesk, Siemens EDA, and Dassault Systèmes are actively shaping the market through technological innovation and strategic partnerships. The forecast period, 2025-2033, promises continued market growth driven by these factors and the wider adoption of advanced data analytics in industrial production.

  17. Analytic_Provenance

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Jan 17, 2018
    Cite
    Texas A&M University (2018). Analytic_Provenance [Dataset]. https://opendatalab.com/OpenDataLab/Analytic_Provenance
    Explore at:
    zip, 321803532 bytes (available download formats)
    Dataset updated
    Jan 17, 2018
    Dataset provided by
    Texas A&M University
    Description

    Analytic provenance is a data repository that can be used to study human analysis activity, thought processes, and software interaction with visual analysis tools during exploratory data analysis. It was collected during a series of user studies involving exploratory data analysis scenarios with textual and cyber security data. Interaction logs, think-alouds, videos, and all coded data in this study are available online for research purposes. Analysis sessions are segmented into multiple sub-task steps based on user think-alouds, video, and audio captured during the studies. These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history.

  18. Electronic Design Automation Software Developers in the US - Market Research...

    • ibisworld.com
    Updated Apr 15, 2025
    Cite
    IBISWorld (2025). Electronic Design Automation Software Developers in the US - Market Research Report (2015-2030) [Dataset]. https://www.ibisworld.com/united-states/market-research-reports/electronic-design-automation-software-developers-industry/
    Explore at:
    Dataset updated
    Apr 15, 2025
    Dataset authored and provided by
    IBISWorld
    License

    https://www.ibisworld.com/about/termsofuse/

    Time period covered
    2015 - 2030
    Area covered
    United States
    Description

    The electronic design automation (EDA) software industry is experiencing a wave of transformation driven by technological innovations. Artificial Intelligence (AI) integration is becoming a cornerstone for the industry as it enhances EDA software capabilities by automating complex design processes, optimizing workflows and enabling predictive analytics. AI techniques now assist in crucial tasks such as logic synthesis, layout planning, timing analysis and design rule checking, dramatically reducing the likelihood of human error and shortening design cycles. Cloud computing technology has revolutionized the industry, offering scalable, flexible and collaborative platforms for chip design, reducing costs, speeding up design cycles and fostering collaboration. Overall, the EDA software industry has expanded, climbing at a CAGR of 8.4% to $16.5 billion through the end of 2025, including a 5.9% climb in 2025 alone. Rising complexity and miniaturization in electronic systems are driving the evolution of advanced functions in EDA tools. This evolution has been spurred by growing demand in automotive, aerospace, telecommunications and healthcare industries, which require sophisticated system-on-chip (SoC) designs. The surging integration of AI, IoT, and 5G technologies in these industries brings significant growth opportunities for EDA software vendors, demanding specialized and high-performance chips that can handle massive data, real-time processing and low power consumption. Industry profit has faced pressure from rising R&D costs, geopolitical risks and competitive investments in innovation. The EDA software industry will undergo significant changes, predominantly driven by artificial intelligence, generative technologies and digital twin technology. AI and generative technologies will foster product innovation, automating and optimizing chip design, verification and simulation processes.
Companies that successfully integrate AI into their tools will enjoy higher demand, particularly from industries such as data centers, automotive and robotics. In the future, digital twin technology will become an essential tool in electronics design, enabling EDA software developers to simulate, test and optimize designs before physical prototypes are brought to life. Amid this tech-driven transformation, future growth won't come without challenges. EDA software vendors must invest in R&D, AI and cloud infrastructure while addressing associated data security and latency issues, or consider alliance or acquisition strategies to remain competitive in a rapidly consolidating industry landscape. Through the end of 2030 revenue will climb at a CAGR of 6.1% to reach $22.1 billion in 2030.

  19. Sample village coordinates.

    • plos.figshare.com
    zip
    Updated Dec 16, 2024
    + more versions
    Cite
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen (2024). Sample village coordinates. [Dataset]. http://doi.org/10.1371/journal.pone.0314242.s001
    Explore at:
    zip (available download formats)
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Siwei Yu; Ding Fan; Ma Ge; Zihang Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The article examines the spatial distribution characteristics and influencing factors of traditional Tibetan “Bengke” residential architecture in Luhuo County, Ganzi Tibetan Autonomous Prefecture, Sichuan Province. The study utilizes spatial statistical methods, including Average Nearest Neighbor Analysis, Getis-Ord Gi*, and Kernel Density Estimation, to identify significant clustering patterns of Bengke architecture. Spatial autocorrelation was tested using Moran’s Index, with results indicating no significant spatial autocorrelation, suggesting that the distribution mechanisms are complex and influenced by multiple factors. Additionally, exploratory data analysis (EDA), the Analytic Hierarchy Process (AHP), and regression methods such as Lasso and Elastic Net were used to identify and validate key factors influencing the distribution of these buildings. The analysis reveals that road density, population density, economic development quality, and industrial structure are the most significant factors. The study also highlights that these factors vary in impact between high-density and low-density areas, depending on the regional environment. These findings offer a comprehensive understanding of the spatial patterns of Bengke architecture and provide valuable insights for the preservation and sustainable development of this cultural heritage.
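Moran's Index, used above to test spatial autocorrelation, compares how similar each observation is to its spatial neighbors relative to overall variance. A minimal NumPy sketch with a toy binary weights matrix (the sites and values are illustrative, not the study's data):

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for values x under a spatial weights matrix W (zero diagonal).

    I = (n / sum(W)) * sum_ij W_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2
    Values near 0 indicate no spatial autocorrelation; positive values
    indicate clustering of similar values.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()
    num = (w * np.outer(z, z)).sum()   # cross-products between neighbors
    den = (z ** 2).sum()               # total variance term
    return (len(x) / w.sum()) * num / den

# Toy example: four sites on a line, adjacent sites are neighbors.
vals = [1.0, 2.0, 3.0, 4.0]
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
print(round(morans_i(vals, W), 3))  # smoothly increasing values -> positive I
```

In practice, significance is judged against a null distribution (analytical or permutation-based), which is what allows a study to report "no significant spatial autocorrelation" rather than just the raw index.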

  20. Industrial Analysis Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jun 27, 2025
    Cite
    Archive Market Research (2025). Industrial Analysis Software Report [Dataset]. https://www.archivemarketresearch.com/reports/industrial-analysis-software-562258
    Explore at:
    doc, ppt, pdf (available download formats)
    Dataset updated
    Jun 27, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Industrial Analysis Software market is experiencing robust growth, driven by increasing automation in manufacturing, the expanding adoption of Industry 4.0 technologies, and a rising demand for improved operational efficiency and predictive maintenance. The market size in 2025 is estimated at $15 billion, projected to grow at a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This significant expansion is fueled by several key factors. Firstly, the convergence of data analytics, cloud computing, and advanced simulation technologies is creating sophisticated software solutions capable of handling massive datasets and providing actionable insights. Secondly, the increasing complexity of modern industrial processes necessitates advanced analytical tools to optimize performance, reduce downtime, and improve product quality. Finally, stringent regulatory requirements and environmental concerns are driving the adoption of industrial analysis software to enhance sustainability and reduce environmental impact. Major players like Siemens EDA, Autodesk, and Dassault Systèmes are leading the innovation in this space, constantly improving their offerings and expanding their market reach through strategic partnerships and acquisitions. The market segmentation reveals a diverse landscape with various specialized software solutions catering to specific industries and needs. While the current data doesn't specify exact segment sizes, it's expected that process manufacturing, discrete manufacturing, and energy & utilities sectors will comprise a significant portion of the market share. The geographical distribution is anticipated to reflect strong growth in North America and Asia-Pacific regions, driven by high industrial output and technology adoption rates. However, Europe and other regions will also contribute to the overall market growth due to the increasing focus on digitalization and industrial automation across various sectors. 
The competitive landscape is intense, with numerous established players and emerging startups vying for market share. Future growth will likely depend on the ability of companies to offer innovative solutions, strong customer support, and seamless integration with existing industrial systems.
