Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Complete datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented across various packages and syntaxes. Thus, implementing a full suite of methods is generally out of reach of all but experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, when it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is Python-based and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
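As an illustration of the gradient-boosted, tree-based imputation the tool supports, here is a minimal scikit-learn sketch on a toy EHR-like table (an assumption-laden stand-in, not ImputEHR's own code):

```python
# Minimal sketch: iterative imputation with a gradient-boosted tree model.
# Toy data and settings are illustrative, not ImputEHR's own defaults.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy EHR-like table with missing values.
ehr = pd.DataFrame({
    "age": [34, 51, np.nan, 62, 45, 58],
    "bmi": [22.1, np.nan, 27.5, 31.0, np.nan, 29.3],
    "sbp": [118, 135, 142, np.nan, 121, 139],
})

# Each column with missing entries is modeled from the remaining columns.
imputer = IterativeImputer(
    estimator=HistGradientBoostingRegressor(),
    max_iter=10,
    random_state=0,
)
completed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
print(completed)
```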
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘US Health Insurance Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/teertha/ushealthinsurancedataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The venerable insurance industry is no stranger to data-driven decision making. Yet in today's rapidly transforming digital landscape, insurance is struggling to adapt to and benefit from new technologies compared with other industries, even within the BFSI sphere (compared with the banking sector, for example). Extremely complex underwriting rule-sets that differ radically across product lines, many non-KYC environments lacking a centralized customer information base, a complex relationship with consumers in traditional risk underwriting where customer centricity sometimes runs counter to business profit, and the inertia of regulatory compliance are some of the unique challenges faced by the insurance business.
Despite this, emergent technologies like AI and blockchain have brought radical change to insurance, and data analytics sits at the core of this transformation. Four key factors lie behind the emergence of analytics as a crucial part of InsurTech.
This dataset supports a simple yet illuminating study of risk underwriting in health insurance: examining the interplay of various attributes of the insured and how they affect the insurance premium.
This dataset contains 1338 rows of insured data, where the Insurance charges are given against the following attributes of the insured: Age, Sex, BMI, Number of Children, Smoker and Region. There are no missing or undefined values in the dataset.
This relatively simple dataset is an excellent starting point for EDA, statistical analysis, hypothesis testing, and training linear regression models to predict insurance premium charges (a minimal modeling sketch follows the task list below).
Proposed tasks:
- Exploratory data analytics
- Statistical hypothesis testing
- Statistical modeling
- Linear regression
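A minimal sketch of the proposed linear-regression task, assuming the Kaggle file is saved locally as insurance.csv (the file name is an assumption):

```python
# Minimal sketch: predict charges from the six attributes with linear regression.
# "insurance.csv" is an assumed local export of the Kaggle dataset.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("insurance.csv")  # columns: age, sex, bmi, children, smoker, region, charges
X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)  # one-hot encode sex/smoker/region
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.3f}")
```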
--- Original source retains full ownership of the source dataset ---
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The three data products included in this repository were created under the project Nret24 ("Forbedret kvælstof-retentionskortlægning til ny reguleringsmodel af landbruget"; "Improved nitrogen-retention mapping for a new agricultural regulation model"). They are reported in detail in GEUS report 2025/8, "Nitrate-containing groundwater in Denmark: Exploratory data analysis at the national scale" (https://doi.org/10.22008/gpub/34765). A description of the contents of the three data products is available in Read_me.txt. Details on data sources and data processing are provided in the report.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Recent calls to take up data science either revolve around the superior predictive performance associated with machine learning or the potential of data science techniques for exploratory data analysis. Many believe that these strengths come at the cost of explanatory insights, which form the basis for theorization. In this paper, we show that this trade-off is false. When used as a part of a full research process, including inductive, deductive and abductive steps, machine learning can offer explanatory insights and provide a solid basis for theorization. We present a systematic five-step theory-building and theory-testing cycle that consists of: 1. Element identification (reduction); 2. Exploratory analysis (induction); 3. Hypothesis development (retroduction); 4. Hypothesis testing (deduction); and 5. Theorization (abduction). We demonstrate the usefulness of this approach, which we refer to as co-duction, in a vignette where we study firm growth with real-world observational data.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Initially, the format of this dataset was .json, so I converted it to .csv for ease of data processing.
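A minimal sketch of such a conversion with pandas, assuming the original release is an array of article records (file names are illustrative):

```python
# Minimal sketch: convert the original .json release to .csv with pandas.
# File names are illustrative assumptions.
import pandas as pd

articles = pd.read_json("vietnamese_news.json")   # assumes an array of article records
articles.to_csv("vietnamese_news.csv", index=False)
```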
"Online articles from the 25 most popular news sites in Vietnam in July 2022, suitable for practicing Natural Language Processing in Vietnamese.
Online news outlets are an unavoidable part of our society today due to their easy, mostly free access. Their effects on the way communities think and act are becoming a concern for a multitude of groups, including legislators, content creators, and marketers, to name a few. Beyond those effects, what is written in the news should be a good reflection of people's will, attention, and even cultural standards.
In Vietnam, even though journalists have received much criticism, especially in recent years, news outlets still receive a large share of traffic (27%) compared with other channels for receiving information."
Original Data Source: Vietnamese Online News .csv dataset
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The data was collected from popular cookery YouTube channels in India, with a major focus on collecting viewers' comments in Hinglish. The datasets are taken from the top two Indian cooking channels, the Nisha Madhulika channel and Kabita's Kitchen channel.
Comments in both datasets are divided into seven categories:
Label 1- Gratitude
Label 2- About the recipe
Label 3- About the video
Label 4- Praising
Label 5- Hybrid
Label 6- Undefined
Label 7- Suggestions and queries
All the labelling has been done manually.
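A minimal sketch of the associated classification task, assuming a CSV export with comment and label columns (file and column names are assumptions; the cited paper itself uses a semi-supervised approach):

```python
# Minimal sketch: TF-IDF + logistic regression over the seven comment labels.
# File and column names are assumptions, not the paper's method.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("nisha_madhulika_main.csv")  # hypothetical export with 'comment' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["label"], stratify=df["label"], random_state=0
)
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),       # unigrams and bigrams of Hinglish text
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```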
Nisha Madhulika dataset:
Dataset characteristics: Multivariate
Number of instances: 4900
Area: Cooking
Attribute characteristics: Real
Number of attributes: 3
Date donated: March, 2019
Associated tasks: Classification
Missing values: None
Kabita Kitchen dataset:
Dataset characteristics: Multivariate
Number of instances: 4900
Area: Cooking
Attribute characteristics: Real
Number of attributes: 3
Date donated: March, 2019
Associated tasks: Classification
Missing values: None
There are two separate dataset files for each channel: a preprocessing file and a main file.
The files with "preprocessing" in the name were generated after performing preprocessing and exploratory data analysis on both datasets. These files include:
The main file includes:
Please cite the paper
https://www.mdpi.com/2504-2289/3/3/37
Kaur, G.; Kaushik, A.; Sharma, S. Cooking Is Creating Emotion: A Study on Hinglish Sentiments of Youtube Cookery Channels Using Semi-Supervised Approach. Big Data Cogn. Comput. 2019, 3, 37.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dataset Overview
This dataset contains user reports about depression collected from various Reddit forums focused on depression-related topics. The dataset has been anonymized to protect user privacy by removing user IDs and publication dates. It consists of three main columns: title, content, and score.
Dataset Columns
The dataset contains the following columns:
title: This column represents the title of the user report. It provides a concise summary or description of the report's content.
content: The content column contains the detailed report provided by the user. It may include personal experiences, thoughts, feelings, or any relevant information related to depression.
score: The score column represents the score or rating assigned to the publication by other users. The score could indicate the level of engagement, agreement, or relevance as determined by the Reddit community.
Data Usage
The dataset can be used for various purposes (a small topic-modeling sketch follows the list), including but not limited to:
- Text analysis and natural language processing tasks
- Sentiment analysis and emotion detection
- Topic modeling and clustering
- Depression research and analysis
- Machine learning model training and evaluation
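A minimal topic-modeling sketch over the content column, assuming a local CSV export named depression_dataset.csv (the file name is an assumption):

```python
# Minimal sketch: LDA topic modeling over the 'content' column.
# The file name is an assumption; the columns are as documented above.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = pd.read_csv("depression_dataset.csv")   # columns: title, content, score
vec = CountVectorizer(max_features=5000, stop_words="english")
doc_term = vec.fit_transform(posts["content"].fillna(""))

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(doc_term)
terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = ", ".join(terms[i] for i in topic.argsort()[-8:][::-1])
    print(f"Topic {k}: {top_terms}")
```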
Original Data Source: Depression Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
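For concreteness, here is a minimal pure-Python sketch of two of the Easy Data Augmentation (EDA) operations the study draws on, random swap and random deletion (illustrative only, not the study's implementation):

```python
# Minimal sketch: two Easy Data Augmentation (EDA) operations in pure Python.
import random

rng = random.Random(0)

def random_swap(tokens, n=1):
    """Randomly swap two token positions n times."""
    tokens = tokens.copy()
    for _ in range(n):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Delete each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

sentence = "las redes sociales permiten expresar opiniones y emociones".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```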
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content-rating, and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million monthly users. Reddit is organized into subreddits; here we use the r/AskScience subreddit.
The dataset is extracted from the r/AskScience subreddit. The data was collected between 01-01-2016 and 20-05-2022 and contains 612,668 data points and 25 columns, including information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with some cleaning done using NumPy and pandas (see the descriptions of individual columns below; a sketch of such a query follows the column list).
The dataset contains the following columns and descriptions:
- author - Redditor name
- author_fullname - Redditor full name
- contest_mode - Contest mode (implements obscured scores and randomized sorting)
- created_utc - Time the submission was created, in Unix time
- domain - Domain of the submission
- edited - Whether the post was edited
- full_link - Link to the post on the subreddit
- id - ID of the submission
- is_self - Whether the submission is a self post (text-only)
- link_flair_css_class - CSS class used to identify the flair
- link_flair_text - The link flair's text content
- locked - Whether the submission has been locked
- num_comments - Number of comments on the submission
- over_18 - Whether the submission has been marked as NSFW
- permalink - Permalink for the submission
- retrieved_on - Time ingested
- score - Number of upvotes for the submission
- description - Description of the submission
- spoiler - Whether the submission has been marked as a spoiler
- stickied - Whether the submission is stickied
- thumbnail - Thumbnail of the submission
- question - Question asked in the submission
- url - The URL the submission links to, or the permalink if a self post
- year - Year of the submission
- banned - Whether banned by a moderator
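A sketch of the kind of Pushshift submission query such a collection pass might use (endpoint and parameters shown as an assumption; Pushshift's public API has since been restricted, so it may no longer respond as shown):

```python
# Minimal sketch of a Pushshift submission query (illustrative only).
import requests

resp = requests.get(
    "https://api.pushshift.io/reddit/search/submission",
    params={
        "subreddit": "askscience",
        "after": 1451606400,   # 2016-01-01 UTC, as a Unix timestamp
        "size": 100,           # submissions per page
    },
    timeout=30,
)
submissions = resp.json().get("data", [])
print(f"Fetched {len(submissions)} submissions")
```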
This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be performed to gain insights and observe trends and patterns over the years.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This supply chain dataset provides a comprehensive view of the company's order and distribution processes, allowing in-depth analysis and optimization of various aspects of the supply chain, from procurement and inventory management to sales and customer satisfaction. It empowers the company to make data-driven decisions to improve efficiency, reduce costs, and enhance customer experiences. The dataset contains the following columns, which capture important information related to the company's order and distribution processes (a small analysis sketch follows the column list):
• OrderNumber
• Sales Channel
• WarehouseCode
• ProcuredDate
• CurrencyCode
• OrderDate
• ShipDate
• DeliveryDate
• SalesTeamID
• CustomerID
• StoreID
• ProductID
• Order Quantity
• Discount Applied
• Unit Cost
• Unit Price
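A minimal pandas sketch of the kind of analysis these columns support, assuming a local CSV export and that Discount Applied is a fraction of the price (both assumptions):

```python
# Minimal sketch: gross margin and shipping lead time per sales channel.
# The file name is assumed, as is Discount Applied being a fraction of price.
import pandas as pd

orders = pd.read_csv("supply_chain_orders.csv", parse_dates=["OrderDate", "ShipDate"])
orders["Margin"] = (
    orders["Unit Price"] * (1 - orders["Discount Applied"]) - orders["Unit Cost"]
) * orders["Order Quantity"]
orders["LeadTimeDays"] = (orders["ShipDate"] - orders["OrderDate"]).dt.days

print(orders.groupby("Sales Channel")[["Margin", "LeadTimeDays"]].mean())
```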
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following files:

- `Data_Analysis.ipynb`: A Jupyter Notebook containing the code for the Exploratory Data Analysis (EDA) presented in the thesis. Running this notebook reproduces the plots in the `eda_plots/` directory.
- `Dataset_Extension.ipynb`: A Jupyter Notebook used for the data enrichment process. It takes the raw `Inference_data.csv` and produces `Inference_data_Extended.csv` by adding detailed hardware specifications, cost estimates, and derived energy metrics.
- `Optimization_Model.ipynb`: The main Jupyter Notebook for the core contribution of this thesis. It contains the code to perform the 5-fold cross-validation, train the final predictive models, generate the Pareto-optimal recommendations, and create the final result figures.
- `Inference_data.csv`: The raw, unprocessed data collected from the official MLPerf Inference v4.0 results.
- `Inference_data_Extended.csv`: The final, enriched dataset used for all analysis and modeling. This is the output of the `Dataset_Extension.ipynb` notebook.
- `eda_log.txt`: A text log file containing summary statistics generated during the exploratory data analysis.
- `requirements.txt`: A list of all necessary Python libraries and their versions required to run the code in this repository.
- `eda_plots/`: A directory containing all plots (correlation matrices, scatter plots, box plots) generated by the EDA notebook.
- `optimization_models_final/`: A directory where the trained and saved final model files (`.joblib`) are stored after running the optimization notebook.
- `pareto_validation_plot_fold_0.png`: The validation plot comparing the true vs. predicted Pareto fronts, as presented in the thesis.
- `shap_waterfall_final_model.png`: The SHAP plot used for the model interpretability analysis, as presented in the thesis.
Clone the repository:

```bash
git clone   # repository URL omitted in the original
cd          # repository directory omitted in the original
```

Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the dependencies:

```bash
pip install -r requirements.txt
```
The enriched dataset (`Inference_data_Extended.csv`) is already provided. However, if you wish to reproduce the enrichment process from scratch, you can run the **`Dataset_Extension.ipynb`** notebook. It will take `Inference_data.csv` as input and generate the extended version.

The EDA plots are provided in the `eda_plots/` directory. To regenerate them, run the **`Data_Analysis.ipynb`** notebook. This will overwrite the existing plots and the `eda_log.txt` file.

Running the **`Optimization_Model.ipynb`** notebook will execute the entire pipeline described in the paper: it performs the 5-fold cross-validation, trains the final predictive models, and generates the Pareto-optimal recommendations. The trained models are saved to the `optimization_models_final/` directory, and the final figures are written to `pareto_validation_plot_fold_0.png` and `shap_waterfall_final_model.png`.
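As a hedged illustration of the Pareto-optimality idea behind the recommendations (not code from this repository): a configuration is kept only if no other configuration is at least as good on every objective and strictly better on at least one.

```python
# Minimal sketch of Pareto-front extraction (illustrative; not repository code).
# A row is kept if no other row is <= on all objectives and < on at least one.
import pandas as pd

def pareto_front(df, objectives):
    vals = df[objectives].to_numpy()
    keep = []
    for i, v in enumerate(vals):
        dominated = ((vals <= v).all(axis=1) & (vals < v).any(axis=1)).any()
        if not dominated:
            keep.append(i)
    return df.iloc[keep]

configs = pd.DataFrame({
    "cost_usd":   [10, 12, 8, 15, 13],   # hypothetical objective 1 (minimize)
    "energy_kwh": [5, 3, 6, 2, 6],       # hypothetical objective 2 (minimize)
})
print(pareto_front(configs, ["cost_usd", "energy_kwh"]))  # the (13, 6) row is dominated
```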
https://www.datainsightsmarket.com/privacy-policy
The global market for Engineering Design Automation (EDA) in the automotive industry is projected to reach a value of USD 3.7 billion by 2033, expanding at a CAGR of 5.8% during the forecast period (2025-2033). The growth of the market is primarily driven by the increasing adoption of advanced driver-assistance systems (ADAS) and autonomous vehicles, which require sophisticated software and electronic components. Moreover, the growing demand for lightweight and fuel-efficient vehicles is also contributing to the adoption of EDA tools, as they enable engineers to optimize vehicle designs for better performance and efficiency. The key trends shaping the automotive EDA market include the increasing adoption of cloud-based EDA solutions, the growing popularity of Model-Based Design (MBD) methodologies, and the integration of EDA tools with other software applications. The adoption of cloud-based EDA solutions is gaining traction as it offers several advantages, such as improved accessibility, scalability, and cost-effectiveness. MBD methodologies are also becoming increasingly popular as they enable engineers to create virtual prototypes of vehicles, which can be used to evaluate design performance and identify potential issues early in the design process. The integration of EDA tools with other software applications, such as computer-aided design (CAD) and product lifecycle management (PLM) systems, is also enhancing the overall efficiency of the vehicle design process.
https://www.datainsightsmarket.com/privacy-policy
The global market for Industrial Production Statistical Analysis Software is experiencing robust growth, projected at a Compound Annual Growth Rate (CAGR) of 5.2% from 2025 to 2033. In 2025, the market size reached $3748 million. This expansion is fueled by several key factors. Firstly, the increasing adoption of Industry 4.0 and digital transformation initiatives across manufacturing sectors is driving demand for sophisticated data analytics solutions. Businesses are increasingly reliant on data-driven decision-making to optimize production processes, improve efficiency, and enhance product quality. Secondly, the growing complexity of industrial processes necessitates advanced software capable of handling large datasets and providing actionable insights. This includes real-time monitoring, predictive maintenance, and quality control applications. The software’s ability to identify patterns and anomalies crucial to preventing production bottlenecks and maximizing output contributes significantly to its appeal. Finally, stringent regulatory compliance requirements and a growing focus on sustainability are further pushing adoption. Companies need robust data analysis tools to comply with environmental standards and track their carbon footprint. Segmentation reveals a diverse market landscape. The application segment is dominated by architecture, mechanical engineering, and the automotive industry, each leveraging the software for unique purposes such as design optimization, simulation, and performance analysis. Within types, 3D modeling and analysis software are gaining traction due to their ability to represent complex geometries and improve design accuracy. The geographical distribution shows a strong presence in North America and Europe, driven by technological advancements and robust manufacturing industries in these regions. However, the Asia-Pacific region is expected to witness significant growth in the coming years, fuelled by rapid industrialization and rising technological adoption in countries like China and India. Leading players such as Autodesk, Siemens EDA, and Dassault Systèmes are actively shaping the market through technological innovation and strategic partnerships. The forecast period, 2025-2033, promises continued market growth driven by these factors and the wider adoption of advanced data analytics in industrial production.
Analytic provenance is a data repository that can be used to study human analysis activity, thought processes, and software interaction with visual analysis tools during exploratory data analysis. It was collected during a series of user studies involving exploratory data analysis scenarios with textual and cyber security data. Interaction logs, think-alouds, videos, and all coded data from these studies are available online for research purposes. Analysis sessions are segmented into multiple sub-task steps based on user think-alouds, video, and audio captured during the studies. These analytic provenance datasets can be used for research involving tools and techniques for analyzing interaction logs and analysis history.
https://www.ibisworld.com/about/termsofuse/
The electronic design automation (EDA) software industry is experiencing a wave of transformation driven by technological innovations. Artificial intelligence (AI) integration is becoming a cornerstone for the industry, as it enhances EDA software capabilities by automating complex design processes, optimizing workflows, and enabling predictive analytics. AI techniques now assist in crucial tasks such as logic synthesis, layout planning, timing analysis, and design rule checking, dramatically reducing the likelihood of human error and shortening design cycles. Cloud computing technology has revolutionized the industry, offering scalable, flexible, and collaborative platforms for chip design, reducing costs, speeding up design cycles, and fostering collaboration. Overall, the EDA software industry has expanded, climbing at a CAGR of 8.4% to $16.5 billion through the end of 2025, including a 5.9% climb in 2025 alone. Rising complexity and miniaturization in electronic systems are driving the evolution of advanced functions in EDA tools. This evolution has been spurred by growing demand in the automotive, aerospace, telecommunications, and healthcare industries, which require sophisticated system-on-chip (SoC) designs. The surging integration of AI, IoT, and 5G technologies in these industries brings significant growth opportunities for EDA software vendors, demanding specialized, high-performance chips that can handle massive data, real-time processing, and low power consumption. Industry profit has faced pressure from rising R&D costs, geopolitical risks, and competitive investments in innovation. The EDA software industry will endure significant changes, predominantly driven by artificial intelligence, generative technologies, and digital twin technology. AI and generative technologies will foster product innovation, automating and optimizing chip design, verification, and simulation processes. Companies that successfully integrate AI into their tools will enjoy higher demand, particularly from industries such as data centers, automotive, and robotics. In the future, digital twin technology will become an essential tool in electronics design, enabling EDA software developers to simulate, test, and optimize designs before physical prototypes are brought to life. Amid this tech-driven transformation, future growth won't come without challenges. EDA software vendors must invest in R&D, AI, and cloud infrastructure while addressing associated data security and latency issues, or consider alliance or acquisition strategies to remain competitive in a rapidly consolidating industry landscape. Through the end of 2030, revenue will climb at a CAGR of 6.1% to reach $22.1 billion.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The article examines the spatial distribution characteristics and influencing factors of traditional Tibetan “Bengke” residential architecture in Luhuo County, Ganzi Tibetan Autonomous Prefecture, Sichuan Province. The study utilizes spatial statistical methods, including Average Nearest Neighbor Analysis, Getis-Ord Gi*, and Kernel Density Estimation, to identify significant clustering patterns of Bengke architecture. Spatial autocorrelation was tested using Moran’s Index, with results indicating no significant spatial autocorrelation, suggesting that the distribution mechanisms are complex and influenced by multiple factors. Additionally, exploratory data analysis (EDA), the Analytic Hierarchy Process (AHP), and regression methods such as Lasso and Elastic Net were used to identify and validate key factors influencing the distribution of these buildings. The analysis reveals that road density, population density, economic development quality, and industrial structure are the most significant factors. The study also highlights that these factors vary in impact between high-density and low-density areas, depending on the regional environment. These findings offer a comprehensive understanding of the spatial patterns of Bengke architecture and provide valuable insights for the preservation and sustainable development of this cultural heritage.
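A minimal sketch of the Lasso/Elastic Net screening step described above, on synthetic stand-in data (variable meanings are assumptions; this is not the study's code or data):

```python
# Minimal sketch: Elastic Net screening of candidate influencing factors.
# Synthetic stand-in data; variable meanings are assumptions.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))   # stand-ins: road density, population density, economic quality, industry mix
true_coef = np.array([0.8, 0.5, 0.3, 0.0])   # last factor contributes nothing
y = X @ true_coef + rng.normal(scale=0.5, size=120)

model = make_pipeline(StandardScaler(), ElasticNetCV(cv=5, random_state=0))
model.fit(X, y)
print("estimated coefficients:", model.named_steps["elasticnetcv"].coef_)
```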
https://www.archivemarketresearch.com/privacy-policy
The Industrial Analysis Software market is experiencing robust growth, driven by increasing automation in manufacturing, the expanding adoption of Industry 4.0 technologies, and a rising demand for improved operational efficiency and predictive maintenance. The market size in 2025 is estimated at $15 billion, projected to grow at a Compound Annual Growth Rate (CAGR) of 12% from 2025 to 2033. This significant expansion is fueled by several key factors. Firstly, the convergence of data analytics, cloud computing, and advanced simulation technologies is creating sophisticated software solutions capable of handling massive datasets and providing actionable insights. Secondly, the increasing complexity of modern industrial processes necessitates advanced analytical tools to optimize performance, reduce downtime, and improve product quality. Finally, stringent regulatory requirements and environmental concerns are driving the adoption of industrial analysis software to enhance sustainability and reduce environmental impact. Major players like Siemens EDA, Autodesk, and Dassault Systèmes are leading the innovation in this space, constantly improving their offerings and expanding their market reach through strategic partnerships and acquisitions. The market segmentation reveals a diverse landscape with various specialized software solutions catering to specific industries and needs. While the current data doesn't specify exact segment sizes, it's expected that process manufacturing, discrete manufacturing, and energy & utilities sectors will comprise a significant portion of the market share. The geographical distribution is anticipated to reflect strong growth in North America and Asia-Pacific regions, driven by high industrial output and technology adoption rates. However, Europe and other regions will also contribute to the overall market growth due to the increasing focus on digitalization and industrial automation across various sectors. The competitive landscape is intense, with numerous established players and emerging startups vying for market share. Future growth will likely depend on the ability of companies to offer innovative solutions, strong customer support, and seamless integration with existing industrial systems.