100+ datasets found
  1. DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hƶrst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hƶrst
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (ā€œclustersā€) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.
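A minimal sketch of the PCA-versus-UMAP and clustering comparison described above, using scikit-learn and the umap-learn package; the array shapes, number of clusters, and parameters are illustrative assumptions, not values from the paper:

import numpy as np
import umap  # umap-learn package
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder stand-in for the IRMS spectra: one row per spectrum,
# one column per m/z channel
X = np.random.rand(100, 500)
X_scaled = StandardScaler().fit_transform(X)

# Linear (PCA) vs. nonlinear (UMAP) projection to two dimensions
X_pca = PCA(n_components=2).fit_transform(X_scaled)
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)

# Cluster each embedding; in the study, clusters would then be mapped to
# experimental conditions such as seawater composition and CO2 concentration
for name, emb in [("PCA", X_pca), ("UMAP", X_umap)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(emb)
    print(name, np.bincount(labels))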

  2. Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    + more versions
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54164
Available download formats: ppt, pdf, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across various industries. The market, estimated at $1.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $5 billion by 2033. This expansion is fueled by several key factors. Firstly, the rising adoption of big data analytics and business intelligence initiatives across large enterprises and SMEs is creating a significant demand for efficient EDA tools. Secondly, the growing need for faster, more insightful data analysis to support better decision-making is driving the preference for user-friendly graphical EDA tools over traditional non-graphical methods. Furthermore, advancements in artificial intelligence and machine learning are seamlessly integrating into EDA tools, enhancing their capabilities and broadening their appeal. The market segmentation reveals a significant portion held by large enterprises, reflecting their greater resources and data handling needs. However, the SME segment is rapidly gaining traction, driven by the increasing affordability and accessibility of cloud-based EDA solutions. Geographically, North America currently dominates the market, but regions like Asia-Pacific are exhibiting high growth potential due to increasing digitalization and technological advancements. Despite this positive outlook, certain restraints remain. The high initial investment cost associated with implementing advanced EDA solutions can be a barrier for some SMEs. Additionally, the need for skilled professionals to effectively utilize these tools can create a challenge for organizations. However, the ongoing development of user-friendly interfaces and the availability of training resources are actively mitigating these limitations. The competitive landscape is characterized by a mix of established players like IBM and emerging innovative companies offering specialized solutions. Continuous innovation in areas like automated data preparation and advanced visualization techniques will further shape the future of the EDA tools market, ensuring its sustained growth trajectory.

  3. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
Available download formats: utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Brian William Stacy
    Semantic Scholar Open Research Corpus (S2ORC)
    License

https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and parsed PDF or latex file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and latex file are important for extracting important information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical system: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.
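A minimal sketch of this regular-expression approach, with a tiny stand-in for the full ISO 3166 country-name list:

import re

COUNTRIES = ["Kenya", "Brazil", "Indonesia", "Republic of Korea"]  # stand-in list

# Word-boundary patterns avoid matching substrings (e.g. "Oman" inside "Romania")
PATTERNS = {c: re.compile(r"\b" + re.escape(c) + r"\b", re.IGNORECASE)
            for c in COUNTRIES}

def countries_mentioned(text):
    return [c for c, pattern in PATTERNS.items() if pattern.search(text)]

print(countries_mentioned("We analyze household survey data from Kenya and Brazil."))
# ['Kenya', 'Brazil']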


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
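In spaCy, this step reduces to running the NER pipeline and keeping geopolitical entities (label GPE); a sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def ner_countries(text):
    # GPE covers countries, cities, and states; a full pipeline would still map
    # each entity to an ISO 3166 country before linking it to the article
    return {ent.text for ent in nlp(text).ents if ent.label_ == "GPE"}

print(ner_countries("This study uses administrative data from the Republic of Korea."))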


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

1. Identifying whether an academic article is using data from any country

    2. Identifying from which country that data came.

For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Some clues indicating that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable. For instance, if the research is theoretical and has no specific country application. In some cases, the research article may involve multiple countries. In these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


The median amount of time a worker spent on an article, measured as the time between when the worker accepted the article for classification and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming $3 per article, as was paid to MTurk workers).
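Both figures follow directly from the median labeling time and the per-article rate; a quick arithmetic check (the "years" figure assumes continuous, around-the-clock work):

n_articles = 1_037_748
minutes_per_article = 25.4
cost_per_article = 3  # USD paid per article to MTurk workers

years = n_articles * minutes_per_article / 60 / 24 / 365  # ~50.2 years
cost = n_articles * cost_per_article                      # $3,113,244
print(round(years, 1), cost)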


A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language understanding capabilities, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles hand-coded by the MTurk workers, 900 are fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of ā€œuses dataā€ was assigned if the model predicted an article used data with at least 90% confidence.
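A sketch of how such a classifier could be applied with the Hugging Face transformers library; the checkpoint name and binary label layout are assumptions (the study specifies DistilBERT and PyTorch, but not this exact code):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased"  # assumed name; fine-tune on the labeled articles first
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def uses_data(abstract, threshold=0.90):
    inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)
    # Assign "uses data" only when predicted with at least 90% confidence
    return probs[0, 1].item() >= threshold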


The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate the model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People's Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  4. Descriptive statistics.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    + more versions
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t003
Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student's t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and an MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and an MAE of 2.68 (CI: 1.83, 3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
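A compact, hypothetical sketch of the modeling protocol described above (3-fold CV for Random Forest hyperparameters plus bootstrapped confidence intervals for RMSE), using scikit-learn with placeholder data rather than the authors' code:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data: 50 patients, four predictors standing in for Hb, CRP, ESR, age
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 4)), rng.normal(25.0, 8.0, size=50)  # y ~ vitamin D level
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 3-fold cross-validation to tune RF hyperparameters and limit overfitting
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      {"n_estimators": [100, 300], "max_depth": [3, None]}, cv=3)
search.fit(X_tr, y_tr)
pred = search.predict(X_te)

# Bootstrap the held-out predictions for a 95% CI on RMSE
rmses = []
for idx in rng.integers(0, len(y_te), size=(1000, len(y_te))):
    rmses.append(np.sqrt(mean_squared_error(y_te[idx], pred[idx])))
print(np.percentile(rmses, [2.5, 97.5]))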

  5. Data from: Subsurface Characterization and Machine Learning Predictions at...

    • catalog.data.gov
    • data.openei.org
• +2 more
    Updated Jan 20, 2025
    + more versions
    Cite
    National Renewable Energy Laboratory (2025). Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs [Dataset]. https://catalog.data.gov/dataset/subsurface-characterization-and-machine-learning-predictions-at-brady-hot-springs-0f304
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    National Renewable Energy Laboratory
    Description

Subsurface data analysis, reservoir modeling, and machine learning (ML) techniques have been applied to the Brady Hot Springs (BHS) geothermal field in Nevada, USA to further characterize the subsurface and assist with optimizing reservoir management. Hundreds of reservoir simulations have been conducted in TETRAD-G and CMG STARS to explore different injection and production fluid flow rates and allocations and to develop a training data set for ML. This process included simulating the historical injection and production since 1979 and predicting future performance through 2040. ML networks were created and trained using TensorFlow based on multilayer perceptron, long short-term memory, and convolutional neural network architectures. These networks took as input selected flow rates, injection temperatures, and historical field operation data and produced estimates of future production temperatures. This approach was first successfully tested on a simplified single-fracture doublet system, followed by application to the BHS reservoir. Using an initial BHS data set with 37 simulated scenarios, the trained and validated network predicted the production temperature for six production wells with a mean absolute percentage error of less than 8%. In a complementary analysis effort, principal component analysis applied to 13 BHS geological parameters revealed that vertical fracture permeability shows the strongest correlation with fault density and fault intersection density. A new BHS reservoir model was developed considering the fault intersection density as a proxy for permeability. This new reservoir model helps to explore underexploited zones in the reservoir. A data gathering plan to obtain additional subsurface data was developed; it includes temperature surveying for three idle injection wells at which the reservoir simulations indicate high bottom-hole temperatures. The collected data assist with calibrating the reservoir model. Data gathering activities are planned for the first quarter of 2021. This GDR submission includes a preprint of the paper titled "Subsurface Characterization and Machine Learning Predictions at Brady Hot Springs" presented at the 46th Stanford Geothermal Workshop (SGW) on Geothermal Reservoir Engineering from February 16-18, 2021.
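Of the three architectures mentioned, the multilayer perceptron is the simplest to sketch in TensorFlow/Keras; the layer sizes and input/output shapes below are illustrative assumptions, not the study's configuration:

import numpy as np
import tensorflow as tf

# Placeholder data: 37 simulated scenarios; inputs stand in for flow rates,
# injection temperatures, and historical operation features, outputs for
# future production temperatures of six wells
X = np.random.rand(37, 24).astype("float32")
y = np.random.rand(37, 6).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(24,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(6),  # one output per production well
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)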

  6. Statistical analysis (Execution time).

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Jun 30, 2025
    + more versions
    Cite
    Alazmi, Meshari; Ayub, Nasir (2025). Statistical analysis (Execution time). [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002092332
    Dataset updated
    Jun 30, 2025
    Authors
    Alazmi, Meshari; Ayub, Nasir
    Description

Predicting student performance is crucial for providing personalized support and enhancing academic outcomes. As educational data grows, advanced machine-learning approaches are being used to understand the variables behind student performance. A large dataset from several Chinese institutions and high schools is used to develop a credible student performance prediction technique. The dataset includes 80 features and 200,000 records, making it one of the most extensive data collections available for educational research. Initially, the data is preprocessed to address outliers and missing values. In addition, we developed a novel hybrid feature selection model that combines correlation filtering, mutual information, Recursive Feature Elimination (RFE) with Cross-Validation (CV), and stability selection to identify the most impactful features. This study then develops EffiXNet, a refined version of EfficientNet augmented with self-attention mechanisms, dynamic convolutions, improved normalization methods, and the Sparrow Search Optimization Algorithm for hyperparameter optimization. The developed model was tested using an 80/20 train-test split, where 160,000 records were used for training and 40,000 for testing. The reported results, including accuracy, precision, recall, and F1-score, are based on the full test dataset; for better visualization, the confusion matrices display only a representative subset of test results. EffiXNet achieved an AUC of 0.99, a 25% reduction in logarithmic loss relative to the baseline models, a precision of 97.8%, an F1-score of 98.1%, and reliable optimization of memory usage. The developed model showed a consistently high performance level across various metrics, indicating that it is proficient at capturing intricate data patterns. The key insights the current research provides are the necessity of early intervention and directed training support in the educational domain. The EffiXNet framework offers a robust, scalable, and efficient solution for predicting student performance, with potential applications in academic institutions worldwide.
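The hybrid feature-selection stages named above (correlation filtering, mutual information, and cross-validated recursive feature elimination) can be sketched with scikit-learn; this is a simplified stand-in with synthetic data, and it does not reproduce EffiXNet or the stability-selection step:

import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 80)), columns=[f"f{i}" for i in range(80)])
y = rng.integers(0, 2, size=1000)

# Stage 1: correlation filtering -- drop one feature of any highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

# Stage 2: keep the 40 features with the highest mutual information with the target
mi = mutual_info_classif(X, y, random_state=0)
X = X[X.columns[np.argsort(mi)[-40:]]]

# Stage 3: recursive feature elimination with 3-fold cross-validation
selector = RFECV(LogisticRegression(max_iter=1000), cv=3).fit(X, y)
print(list(X.columns[selector.support_]))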

  7. Student Mental Health Survey - Cleaned / Scaled

    • kaggle.com
    zip
    Updated Sep 8, 2024
    Cite
    Avinash Bunga (2024). Student Mental Health Survey - Cleaned / Scaled [Dataset]. https://www.kaggle.com/datasets/avinashbunga/student-mental-health-survey-cleaned-scaled
Available download formats: zip (5,773 bytes)
    Dataset updated
    Sep 8, 2024
    Authors
    Avinash Bunga
    Description

**Student Mental Health Survey: Scaled Data on IT Students' Academic and Emotional Well-being**

**Overview**

This dataset contains survey responses from IT students, focusing on academic stress, mental health, and lifestyle factors. It includes two files that capture different stages of data preparation to suit various analytical needs.

**Files Included**

MentalHealthSurvey.csv: Contains the original survey data with raw categorical and numerical variables. Ideal for initial data exploration and for understanding the unprocessed patterns before any data transformation.

MentalHealthSurvey_Cleaned.csv: Contains cleaned and preprocessed data with scaled numerical variables. The data was scaled using standard scaling techniques, which adjust the values so that each variable has a mean of 0 and a standard deviation of 1.

**Why Scaling is Useful**

Scaling ensures that all numerical variables contribute equally to statistical models, particularly in factor analysis, where varying scales can skew the results. Scaled data improves model performance, stability, and interpretability, making it especially valuable for advanced analyses like predictive modeling and machine learning.

**Applications**

Initial Data Exploration: Use the raw data to explore variable distributions and correlations, and to identify potential data quality issues.

Advanced Analysis: The cleaned and scaled data is optimal for statistical analysis, helping to uncover meaningful patterns and insights into the factors affecting students' mental health and academic performance.

Both files offer a complete view of the dataset, from raw data exploration to scaled data ready for rigorous analysis.
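The standard scaling described above corresponds to scikit-learn's StandardScaler; a minimal sketch (only numeric columns are scaled, and the output filename is illustrative):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("MentalHealthSurvey.csv")  # raw file included in this dataset

# Subtract each numeric column's mean and divide by its standard deviation,
# so every scaled variable has mean 0 and standard deviation 1
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

df.to_csv("MentalHealthSurvey_Scaled.csv", index=False)  # illustrative output name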

  8. Find Ideal Location for Business in Bangladesh

    • data.mendeley.com
    Updated Sep 22, 2021
    Cite
    Faisal Bin Ashraf (2021). Find Ideal Location for Business in Bangladesh [Dataset]. http://doi.org/10.17632/v2k2jvjwrh.1
    Dataset updated
    Sep 22, 2021
    Authors
    Faisal Bin Ashraf
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

The dataset has 21 columns that carry the features (questions) of 988 respondents. The efficiency of any machine learning model is heavily dependent on its raw initial dataset. For this reason, we had to be extra careful in gathering our information. We determined that for our particular problem, we had to proceed with data that was not only authentic but also versatile enough to draw the proper information from relevant sources. Hence we opted to build our dataset by dispatching a survey questionnaire among targeted audiences. First, we built the questionnaire with inquiries formulated after keen observation. Studying the behavior of our intended audience, we came up with factual and informative queries that generated appropriate data. Our prime audience was those who were highly interested in buying fashion accessories, and hence we created a set of questionnaires that emphasized questions related to that field. We had a total of twenty-one well-revised questions that gave us an overview of all the answers that were going to be needed within the proximity of our system. As such, we had the opportunity to gather over half a thousand authentic leads and concluded our initial raw dataset accordingly.

  9. Comprehensive Synthetic Skin Disease Data

    • kaggle.com
    zip
    Updated Mar 14, 2025
    Cite
    Arif Miah (2025). Comprehensive Synthetic Skin Disease Data [Dataset]. https://www.kaggle.com/datasets/miadul/comprehensive-synthetic-skin-disease-data/data
Available download formats: zip (319,234 bytes)
    Dataset updated
    Mar 14, 2025
    Authors
    Arif Miah
    License

MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description


    šŸ“‚ Dataset Description:

This synthetic skin disease dataset was generated to support machine learning and data analysis tasks related to dermatological conditions. It contains 34,000 rows and 10 columns, covering various aspects of skin diseases, patient demographics, treatment history, and disease severity.

    🌟 Why This Dataset?

    Skin diseases are a prevalent health issue affecting millions of people globally. Accurate diagnosis and effective treatment planning are crucial for improving patient outcomes. This dataset provides a comprehensive representation of various skin disease conditions, making it ideal for:
    - Classification tasks: Predicting disease type or severity.
    - Predictive modeling: Estimating treatment effectiveness.
    - Data visualization: Analyzing demographic patterns.
    - Exploratory Data Analysis (EDA): Understanding distribution and correlations.
    - Healthcare analytics: Gaining insights into treatment efficacy and disease prevalence.

    šŸ—ƒļø Dataset Content:

    The dataset contains the following 10 columns:

    1. Patient_ID: Unique identifier for each patient (e.g., P00001).
    2. Age: Age of the patient (range: 18 to 90).
    3. Gender: Gender of the patient (Male/Female).
    4. Skin_Color: The skin tone of the patient (Fair/Medium/Dark).
    5. Disease_Type: The diagnosed skin disease (Eczema, Psoriasis, Acne, Rosacea, Vitiligo, Melanoma).
    6. Severity: The severity level of the disease (Mild, Moderate, Severe).
    7. Duration: Duration of the disease in months (range: 1 to 120).
    8. Affected_Area: The body part affected by the disease (Face, Arms, Legs, Back, Chest, Scalp).
    9. Previous_Treatment: Indicates whether the patient has received prior treatment (Yes/No).
    10. Treatment_Effectiveness: The effectiveness of previous treatments (High, Moderate, Low).

    šŸ”„ Key Features:

    • Balanced Distribution: The dataset is synthetically generated to ensure a balanced distribution of disease types and severity levels.
    • Comprehensive Coverage: Multiple features capture patient demographics, disease characteristics, and treatment outcomes.
    • Versatile Applications: Suitable for classification, prediction, clustering, and data visualization tasks.
    • Data Integrity: Synthetic data eliminates privacy concerns while retaining the structure and characteristics of real-world data.

    šŸš€ Potential Use Cases:

    • Disease Classification: Using machine learning to classify skin disease types.
    • Severity Prediction: Predicting the severity level based on demographic and disease characteristics.
    • Treatment Effectiveness Analysis: Analyzing how previous treatments correlate with disease severity and affected areas.
    • Health Insights: Gaining insights into how skin color and demographics impact disease prevalence and severity.

    šŸ› ļø Recommended Techniques:

    • Exploratory Data Analysis (EDA) for initial data inspection and visualization.
    • Machine Learning Algorithms such as Decision Trees, Random Forest, SVM, and Neural Networks for classification tasks.
    • Data Preprocessing Techniques like handling missing values, encoding categorical data, and scaling numerical values.
    • Model Evaluation Metrics including accuracy, precision, recall, F1-score, and ROC-AUC.

    šŸ“ˆ License:

    This dataset is licensed under the CC BY 4.0 License. You are free to use, share, and modify the dataset with proper attribution.

    šŸ’¬ Inspiration:

    • Can machine learning accurately classify skin disease types based on demographic and clinical features?
    • How effective are various treatments for different skin conditions?
    • Can we predict the severity of skin diseases using patient attributes?

    šŸ“¬ Acknowledgments:

    This dataset is synthetically generated and does not represent real patient data. It is designed purely for educational and research purposes in machine learning and data analysis.

  10. Table_1_Predicting Success of a Digital Self-Help Intervention for Alcohol...

    • frontiersin.figshare.com
    • figshare.com
    docx
    Updated Jun 8, 2023
    Cite
    Lucas A. Ramos; Matthijs Blankers; Guido van Wingen; Tamara de Bruijn; Steffen C. Pauws; Anneke E. Goudriaan (2023). Table_1_Predicting Success of a Digital Self-Help Intervention for Alcohol and Substance Use With Machine Learning.DOCX [Dataset]. http://doi.org/10.3389/fpsyg.2021.734633.s001
Available download formats: docx
    Dataset updated
    Jun 8, 2023
    Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
    Authors
    Lucas A. Ramos; Matthijs Blankers; Guido van Wingen; Tamara de Bruijn; Steffen C. Pauws; Anneke E. Goudriaan
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Background: Digital self-help interventions for reducing the use of alcohol, tobacco, and other drugs (ATOD) have generally shown positive but small effects in controlling substance use and improving the quality of life of participants. Nonetheless, low adherence rates remain a major drawback of these digital interventions, with mixed results in (prolonged) participation and outcome. To prevent non-adherence, we developed models to predict success in the early stages of an ATOD digital self-help intervention and explored the predictors associated with participants' goal achievement.

Methods: We included previous and current participants from a widely used, evidence-based ATOD intervention from the Netherlands (Jellinek Digital Self-help). Participants were considered successful if they completed all intervention modules and reached their substance use goals (i.e., stop/reduce). Early dropout was defined as finishing only the first module. During model development, participants were split per substance (alcohol, tobacco, cannabis) and features were computed based on the log data of the first 3 days of intervention participation. Machine learning models were trained, validated, and tested using a nested k-fold cross-validation strategy.

Results: Of the 32,398 participants enrolled in the study, 80% did not complete the first module of the intervention and were excluded from further analysis. Among the remaining participants, the percentage of success for each substance was 30% for alcohol, 22% for cannabis, and 24% for tobacco. The area under the Receiver Operating Characteristic curve was highest for the Random Forest models trained on data from the alcohol and tobacco programs (0.71, 95% CI 0.69–0.73 and 0.71, 95% CI 0.67–0.76, respectively), followed by cannabis (0.67, 95% CI 0.59–0.75). Quitting substance use instead of moderation as an intervention goal, initial daily consumption, no substance use on the weekends as a target goal, and intervention engagement were strong predictors of success.

Discussion: Using log data from the first 3 days of intervention use, machine learning models showed positive results in identifying successful participants. Our results suggest the models were especially able to identify participants at risk of early dropout. Multiple variables were found to have high predictive value, which can be used to further improve the intervention.
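Nested k-fold cross-validation of a Random Forest scored by ROC AUC, as described in the Methods, has a standard scikit-learn formulation; the sketch below uses placeholder data and assumed hyperparameter grids:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))    # features computed from the first 3 days of log data
y = rng.integers(0, 2, size=500)  # 1 = completed all modules and reached the use goal

# Inner loop tunes hyperparameters; outer loop estimates generalization AUC
inner = GridSearchCV(RandomForestClassifier(random_state=0),
                     {"n_estimators": [100, 300], "max_depth": [5, None]}, cv=3)
outer_auc = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
print(outer_auc.mean(), outer_auc.std())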

  11. Big Data Intelligence Engine Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 21, 2025
    Cite
    Data Insights Market (2025). Big Data Intelligence Engine Report [Dataset]. https://www.datainsightsmarket.com/reports/big-data-intelligence-engine-1991939
Available download formats: ppt, doc, pdf
    Dataset updated
    May 21, 2025
    Dataset authored and provided by
    Data Insights Market
    License

https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Big Data Intelligence Engine market is experiencing robust growth, driven by the increasing need for advanced analytics across diverse sectors. The market's expansion is fueled by several key factors: the exponential growth of data volume from various sources (IoT devices, social media, etc.), the rising adoption of cloud computing for data storage and processing, and the increasing demand for real-time insights to support faster and more informed decision-making. Applications spanning data mining, machine learning, and artificial intelligence are significantly contributing to this market expansion. Furthermore, the rising adoption of programming languages like Java, Python, and Scala, which are well-suited for big data processing, is further fueling market growth. Technological advancements, such as the development of more efficient and scalable algorithms and the emergence of specialized hardware like GPUs, are also playing a crucial role. While data security and privacy concerns, along with the high initial investment costs associated with implementing Big Data Intelligence Engine solutions, pose some restraints, the overall market outlook remains extremely positive. The competitive landscape is dominated by a mix of established technology giants like IBM, Microsoft, Google, and Amazon, and emerging players such as Alibaba Cloud, Tencent Cloud, and Baidu Cloud. These companies are aggressively investing in research and development to enhance their offerings and expand their market share. The market is geographically diverse, with North America and Europe currently holding significant market shares. However, the Asia-Pacific region, particularly China and India, is expected to witness the fastest growth in the coming years due to increasing digitalization and government initiatives promoting technological advancements. This growth is further segmented by application (Data Mining, Machine Learning, AI) and programming languages (Java, Python, Scala), offering opportunities for specialized solutions and services. The forecast period of 2025-2033 promises substantial growth, driven by continued innovation and widespread adoption across industries.

  12. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jul 25, 2023
    + more versions
    Cite
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
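Drawing every label distribution with equal probability amounts to sampling prevalence vectors uniformly from the probability simplex, which a flat Dirichlet provides; the sketch below is a simplified illustration, and the smoothness criterion shown is an assumption rather than the paper's exact definition:

import numpy as np

rng = np.random.default_rng(0)
n_classes, n_samples = 5, 1000

# APP: uniform draws from the simplex = Dirichlet with all concentrations set to 1
prevalences = rng.dirichlet(np.ones(n_classes), size=n_samples)

# APP-OQ keeps only the smoothest 20%; here, roughness is scored as the summed
# squared difference between neighboring (ordered) class prevalences
roughness = np.sum(np.diff(prevalences, axis=1) ** 2, axis=1)
smoothest = prevalences[roughness <= np.quantile(roughness, 0.2)]
print(smoothest.shape)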

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  13. Machine Learning market size was USD 24,345.76 million in 2021!

    • cognitivemarketresearch.com
pdf, excel, csv, ppt
    Cite
    Cognitive Market Research, Machine Learning market size was USD 24,345.76 million in 2021! [Dataset]. https://www.cognitivemarketresearch.com/machine-learning-market-report
Available download formats: pdf, excel, csv, ppt
    Dataset authored and provided by
    Cognitive Market Research
    License

https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

As per Cognitive Market Research's latest published report, the Global Machine Learning market size was USD 24,345.76 million in 2021, and it is forecasted to reach USD 206,235.41 million by 2028. The Machine Learning industry's Compound Annual Growth Rate will be 42.64% from 2023 to 2030.

Market Dynamics of Machine Learning Market

    Key Drivers for Machine Learning Market

    Explosion of Big Data Across Industries: The substantial increase in both structured and unstructured data generated by sensors, social media, transactions, and IoT devices is driving the demand for machine learning-based data analysis.

    Widespread Adoption of AI in Business Processes: Machine learning is facilitating automation, predictive analytics, and optimization in various sectors such as healthcare, finance, manufacturing, and retail, thereby enhancing efficiency and outcomes.

    Increased Availability of Open-Source Frameworks and Cloud Platforms: Resources like TensorFlow, PyTorch, and scalable cloud infrastructure are simplifying the process for developers and enterprises to create and implement machine learning models.

    Growing Investments in AI-Driven Innovation: Governments, venture capitalists, and major technology companies are making substantial investments in machine learning research and startups, which is accelerating progress and market entry.

    Key Restraints for Machine Learning Market

    Shortage of Skilled Talent in ML and AI: The need for data scientists, machine learning engineers, and domain specialists significantly surpasses the available supply, hindering scalability and implementation in numerous organizations.

    High Computational and Operational Costs: The training of intricate machine learning models necessitates considerable computing power, energy, and infrastructure, resulting in high costs for startups and smaller enterprises.

    Data Privacy and Regulatory Compliance Challenges: Issues related to user privacy, data breaches, and adherence to regulations such as GDPR and HIPAA present obstacles in the collection and utilization of data for machine learning.

    Lack of Model Transparency and Explainability: The opaque nature of certain machine learning models undermines trust, particularly in sensitive areas like finance and healthcare, where the need for explainable AI is paramount.

    Key Trends for Machine Learning Market

    Growth of AutoML and No-Code ML Platforms: Automated machine learning tools are making AI development more accessible, enabling individuals without extensive coding or mathematical expertise to construct models.

    Integration of ML with Edge Computing: Executing machine learning models locally on edge devices (such as cameras and smartphones) is enhancing real-time performance and minimizing latency in applications.

    Ethical AI and Responsible Machine Learning Practices: Increasing emphasis on fairness, bias reduction, and accountability is shaping ethical frameworks and governance in ML adoption.

    Industry-Specific ML Applications on the Rise: Custom ML solutions are rapidly emerging in sectors like agriculture (crop prediction), logistics (route optimization), and education (personalized learning).

    COVID-19 Impact:

Similar to other industries, the COVID-19 situation has affected the machine learning industry. Despite the dire conditions and uncertain outlook, some industries have continued to grow during the pandemic. During COVID-19, the machine learning market remained stable, with positive growth and opportunities; the global machine learning market faced minimal impact compared to some other industries. The market's progress has been shaped by automation developments and technological advancements, and the machines and smartphones widely used for remote work have contributed to its positive growth. Several industries have advanced using new machine learning technologies. In June 2020, DeCaprio et al. published research on COVID-19 pandemic risk, noting that machine learning had been used to build an initial vulnerability index for the coronavirus, and that more practical applications of machine learning in predicting infection risk will emerge as more data and results from ongoing research become available.

  14. Housing_HandsonML_Chapter2

    • kaggle.com
    zip
    Updated Sep 7, 2020
    Cite
    Simran Jain (2020). Housing_HandsonML_Chapter2 [Dataset]. https://www.kaggle.com/datasets/simranjain17/housing-handsonml-chapter2/code
Available download formats: zip (409,382 bytes)
    Dataset updated
    Sep 7, 2020
    Authors
    Simran Jain
    Description

This is a version of the housing dataset used in the O'Reilly book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. It helps beginners understand EDA and initial modeling, and it is a good starting point before exploring the complete dataset on Kaggle.

  15. Data Intelligence Solution Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jul 6, 2025
    Cite
    Archive Market Research (2025). Data Intelligence Solution Report [Dataset]. https://www.archivemarketresearch.com/reports/data-intelligence-solution-559484
Available download formats: ppt, doc, pdf
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Archive Market Research
    License

https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Intelligence Solutions market is experiencing robust growth, driven by the increasing volume and complexity of data generated by businesses across all sectors. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033. This significant expansion is fueled by several key factors, including the rising adoption of cloud-based solutions, the growing need for advanced analytics and data visualization tools, and the imperative to improve decision-making through data-driven insights. Key trends shaping the market include the integration of AI and machine learning capabilities into data intelligence platforms, the rise of real-time data processing, and the increasing demand for data governance and security solutions to comply with evolving regulatory frameworks. The competitive landscape is dynamic, with established players like SAP, Microsoft, and Qlik alongside innovative startups vying for market share. The market segmentation is evolving, with solutions tailored to specific industries and data types gaining traction. The increasing focus on data democratization, allowing broader access to data insights within organizations, further fuels market growth. The robust growth is anticipated to continue throughout the forecast period (2025-2033), driven by the expanding digital economy and the ever-increasing importance of data-driven decision-making. However, challenges remain. The high initial investment costs associated with implementing data intelligence solutions can be a barrier for smaller enterprises. Furthermore, the need for skilled data scientists and analysts to effectively utilize these solutions poses a significant hurdle. Despite these constraints, the long-term prospects for the Data Intelligence Solutions market are exceptionally positive, fueled by the undeniable value that data intelligence delivers to businesses of all sizes across diverse industries. This includes streamlining operations, improving customer experience, and uncovering new revenue streams.

  16. Dataset of Understanding Guest Review From Google Play Using NaĆÆve...

    • data.mendeley.com
    Updated Jul 28, 2025
    Cite
    Misyle Ariel Juarsa (2025). Dataset of Understanding Guest Review From Google Play Using NaĆÆve Bayes-Based Data Analysis: A Study on Nanovest [Dataset]. http://doi.org/10.17632/8frwrry7w6.1
    Dataset updated
    Jul 28, 2025
    Authors
    Misyle Ariel Juarsa
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset contains user reviews of Nanovest, an investment application for US stocks, gold, and cryptocurrency. The data was collected from user reviews of the Nanovest app on Google Play. The reviews, written in Indonesian, reflect users' experiences and opinions regarding the app's features, security, and functionality. By analyzing this review data, this study aims to determine the proportion of positive and negative reviews and identify the key aspects frequently mentioned by users. The findings of this study can provide recommendations for improving service quality and application performance for cryptocurrency investment platforms in Indonesia.

    This dataset was collected through web scraping using Python. A total of 2,000 reviews were gathered. After removing duplicate and irrelevant reviews through data cleaning, the final dataset consisted of 1,921 reviews. This study will classify the data into positive and negative sentiments using machine learning.
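A minimal sketch of the NaĆÆve Bayes sentiment classification named in the title, using scikit-learn; the file and column names are hypothetical placeholders for the review text and its sentiment label:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("nanovest_reviews.csv")  # hypothetical file/column names
X_tr, X_te, y_tr, y_te = train_test_split(
    df["review"], df["sentiment"], test_size=0.2, random_state=0)

# Bag-of-words features feeding a multinomial naive Bayes classifier
vec = TfidfVectorizer(min_df=2)
clf = MultinomialNB().fit(vec.fit_transform(X_tr), y_tr)
print(classification_report(y_te, clf.predict(vec.transform(X_te))))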

    In the initial observation, the reviews showed a mix of feedback. Many users found the app beginner-friendly, while others raised concerns about its stability and compatibility.

    This dataset can be useful for researchers conducting sentiment analysis in the cryptocurrency investment industry in Indonesia, helping them understand user experiences and identify areas for improvement. Additionally, it can serve as training data for machine learning models in sentiment classification. By analyzing user feedback, this dataset can serve as a foundation for investment apps to enhance application performance and preserve essential features while refining areas that need improvement in the development of cryptocurrency investment applications in Indonesia.

  17. UCI Automobile Dataset

    • kaggle.com
    Updated Feb 12, 2023
    Cite
    Otrivedi (2023). UCI Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/otrivedi/automobile-data/suggestions
Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Otrivedi
    Description

    In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:

1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

Number of Instances: 398
Number of Attributes: 9, including the class attribute

    Attribute Information:

mpg: continuous
cylinders: multi-valued discrete
displacement: continuous
horsepower: continuous
weight: continuous
acceleration: continuous
model year: multi-valued discrete
origin: multi-valued discrete
car name: string (unique for each instance)

    This data set consists of three types of entities:

    I - The specification of an auto in terms of various characteristics

II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is riskier (or less risky) than its price indicates, the symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".

    III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.

    The analysis is divided into two parts (a condensed pandas sketch of these steps follows the lists):

    Data Wrangling

    1. Pre-processing data in Python
    2. Dealing with missing values
    3. Data formatting
    4. Data normalization
    5. Binning

    Exploratory Data Analysis

    1. Descriptive statistics
    2. Groupby
    3. Analysis of variance
    4. Correlation
    5. Correlation statistics
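
    A condensed sketch of these steps with pandas (column names follow the UCI documentation; the actual notebook code may differ):

    ```python
    import pandas as pd
    from scipy import stats

    url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
           "autos/imports-85.data")
    cols = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
            "num-of-doors", "body-style", "drive-wheels", "engine-location",
            "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
            "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
            "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
            "highway-mpg", "price"]

    # The raw file has no header row and encodes missing values as '?'
    df = pd.read_csv(url, names=cols, na_values="?")

    # Missing values: drop rows without a price, impute normalized-losses with the mean
    df = df.dropna(subset=["price"])
    df["normalized-losses"] = df["normalized-losses"].fillna(df["normalized-losses"].mean())

    # Normalization: simple max-scaling of the dimensional columns
    for c in ["length", "width", "height"]:
        df[c] = df[c] / df[c].max()

    # Binning: group horsepower into three equal-width bins
    df["horsepower-binned"] = pd.cut(df["horsepower"], bins=3,
                                     labels=["low", "medium", "high"])

    # EDA: descriptive statistics, groupby, correlation, and a one-way ANOVA
    print(df.describe())
    print(df.groupby("drive-wheels")["price"].mean())
    print(df[["engine-size", "horsepower", "price"]].corr())
    f_stat, p_val = stats.f_oneway(
        *[g["price"].values for _, g in df.groupby("drive-wheels")])
    print(f"ANOVA on price across drive-wheels: F={f_stat:.2f}, p={p_val:.4f}")
    ```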

    Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

  18. Adversarial Machine Learning Dataset

    • kaggle.com
    zip
    Updated Jul 8, 2022
    Network security group CNR-IEIIT (2022). Adversarial Machine Learning Dataset [Dataset]. https://www.kaggle.com/datasets/cnrieiit/adversarial-machine-learning-dataset/discussion
    Explore at:
    zip (4786927 bytes). Available download formats
    Dataset updated
    Jul 8, 2022
    Authors
    Network security group CNR-IEIIT
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Adversarial Machine Learning Dataset

    This repository contains the datasets used in the following paper, published in IEEE Access. If you use this repository in your research work, please consider citing our paper.

    I. Vaccari, A. Carlevaro, S. Narteni, E. Cambiaso and M. Mongelli, "eXplainable and Reliable Against Adversarial Machine Learning in Data Analytics," in IEEE Access, vol. 10, pp. 83949-83970, 2022, DOI: 10.1109/ACCESS.2022.3197299. URL: https://ieeexplore.ieee.org/document/9852204

    @ARTICLE{9852204,
     author={Vaccari, Ivan and Carlevaro, Alberto and Narteni, Sara and Cambiaso, Enrico and Mongelli, Maurizio},
     journal={IEEE Access}, 
     title={eXplainable and Reliable Against Adversarial Machine Learning in Data Analytics}, 
     year={2022},
     volume={10},
     number={},
     pages={83949-83970},
     doi={10.1109/ACCESS.2022.3197299}}
    

    Description

    We consider three different applications:

    * DNS tunneling, referring to network data captured during a DNS tunneling attack
    * Platooning, referring to simulated data from a vehicle platooning scenario
    * Remaining useful life (RUL), related to predictive maintenance of aircraft engines

    Each application scenario has been targeted through different Adversarial Machine Learning methods. In particular, the following methods are considered (a minimal FGSM sketch follows the list):

    * Carlini-Wagner (CW): Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP) (pp. 39-57). IEEE.
    * Fast Gradient Sign Method (FGSM): Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
    * Jacobian-based Saliency Map Attack (JSMA): Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016, March). The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P) (pp. 372-387). IEEE.
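
    As an illustration of the simplest of the listed attacks, here is a minimal FGSM sketch in PyTorch; the model, loss function, and epsilon are placeholders rather than the paper's actual configuration.

    ```python
    import torch

    def fgsm_attack(model, loss_fn, x, y, epsilon=0.03):
        """One FGSM step: perturb x by epsilon in the direction that increases the loss."""
        x = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x), y)
        loss.backward()
        return (x + epsilon * x.grad.sign()).detach()

    # Usage (placeholders):
    # x_adv = fgsm_attack(net, torch.nn.CrossEntropyLoss(), x_batch, y_batch)
    ```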

    Data structure

    We have a folder for each application scenario (DNS tunneling, platooning, RUL). Each application folder contains two subfolders:

    * legitimate, including the original data
    * malicious, including data attacked by Adversarial Machine Learning methods; in this case, both training and test data are reported, together with a combined dataset

    The target variable for adversarial attack detection is called attack in all cases, whereas the target variables in the original problems (legitimate data) are the following (a loading sketch follows the list):

    * g for the DNS tunneling application
    * collision for the platooning application
    * RUL_binary for the RUL application
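
    Under that layout, loading one scenario could look like the sketch below; the file names are hypothetical, so check the actual contents of the archive.

    ```python
    import pandas as pd

    # Hypothetical file names; the real archive layout may differ.
    legit = pd.read_csv("dns_tunneling/legitimate/data.csv")
    malicious = pd.read_csv("dns_tunneling/malicious/combined.csv")

    # Original-task target for the legitimate data ('g' for DNS tunneling)
    X_task, y_task = legit.drop(columns=["g"]), legit["g"]

    # Adversarial-detection target for the attacked data
    X_det, y_det = malicious.drop(columns=["attack"]), malicious["attack"]
    ```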

    DNS tunneling data structure

    Concerning the DNS tunneling application, the following features are considered (a sketch showing how such moment-based features can be computed follows the list):

    * mDt: mean inter-packet time interval
    * mA: mean answer packet size
    * mQ: mean query packet size
    * vDt: variance of inter-packet time interval
    * vA: variance of answer packet size
    * vQ: variance of query packet size
    * sDt: skewness of inter-packet time interval
    * sA: skewness of answer packet size
    * sQ: skewness of query packet size
    * kDt: kurtosis of inter-packet time interval
    * kA: kurtosis of answer packet size
    * kQ: kurtosis of query packet size
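
    These are the first four statistical moments of three per-flow measurements. A sketch of how such features could be derived from raw capture data follows; the raw arrays here are made up purely for illustration.

    ```python
    import numpy as np
    from scipy import stats

    def moment_features(x, suffix):
        """Mean, variance, skewness, kurtosis of a 1-D sample, keyed like the dataset columns."""
        return {
            f"m{suffix}": np.mean(x),
            f"v{suffix}": np.var(x),
            f"s{suffix}": stats.skew(x),
            f"k{suffix}": stats.kurtosis(x),
        }

    # Hypothetical raw measurements for one DNS flow
    dt = np.diff(np.array([0.0, 0.5, 1.1, 1.6]))   # inter-packet time intervals
    a = np.array([240.0, 255.0, 260.0])            # answer packet sizes
    q = np.array([86.0, 88.0, 87.0])               # query packet sizes

    row = {**moment_features(dt, "Dt"), **moment_features(a, "A"), **moment_features(q, "Q")}
    ```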

    A portion of the legitimate data is reported in the following.

    mDt | mA | mQ | vDt | vA | vQ | sDt | sA | sQ | kDt | kA | kQ | g
    --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
    0.46915 | 239.8778 | 86.6326 | 81.8955 | 26109.835267 | 71.490417 | 63.185077 | 2.380883 | 0.552061 | 4284.358131 | 7.187715 | 5623.187219 | 0
    0.584831 | 254.0284 | 87.3976 | 11.010015 | 29161.123993 | 922.955914 | 6.777654 | 1.980254 | 42.142166 | 46.147515 | 4.539171 | -2.993006 | 0
    0.633453 | 269.3278 | 88.255 | 11.800109 | 38263.725547 | 62.161175 | 6.453627 | 2.051226 | 0.469637 | 41.80766 | 4.600109 | -1.385322 | 0
    2.649329 | 258.529 | 88.3704 | 25991.830887 | 35950.372759 | 1297.225604 | 70.664632 | 2.064281 | 37.215584 | 4992.657556 | 4.664802 | 2005555.720075 | 0
    ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...

    Platooning data structure

    Concerning the platooning application, the following features are considered:

    * N: number of vehicles in the platoon
    * F0: braking force applied by the leader
    * PER: packet error rate (probability of packet loss)
    * d0: initial distance between vehicles
    * v0: initial speed between vehicles

    A portion of the legitimate data is reported in the following.

    N | F0 | PER | d0 | v0 | collision
    --- | --- | --- | --- | --- | ---
    ...

  19. High Performance Data Analytics Industry Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 3, 2025
    Data Insights Market (2025). High Performance Data Analytics Industry Report [Dataset]. https://www.datainsightsmarket.com/reports/high-performance-data-analytics-industry-14076
    Explore at:
    pdf, ppt, doc. Available download formats
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The High-Performance Data Analytics (HPDA) market is experiencing robust growth, projected to reach $97.19 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 23.63% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing volume and velocity of data generated across industries necessitate advanced analytical capabilities to extract actionable insights. The rise of cloud computing and the adoption of on-demand services are also making HPDA solutions more accessible and cost-effective for businesses of all sizes. The BFSI (Banking, Financial Services, and Insurance), Government & Defense, and Energy & Utilities sectors are leading adopters, leveraging HPDA to enhance operational efficiency, improve risk management, and gain a competitive edge. Technological advancements in areas like artificial intelligence (AI), machine learning (ML), and big data processing further contribute to market expansion. While the initial investment in HPDA infrastructure can be a restraint for smaller enterprises, the long-term benefits in improved decision-making and cost savings are proving compelling. The market is segmented by component (hardware, software, services), deployment (on-premise, on-demand), organization size (SMEs, large enterprises), and end-user industry. Competition is intense, with major players such as SAS Institute, Amazon Web Services, and Juniper Networks vying for market share through innovation and strategic partnerships. The North American market currently holds a significant share due to high technological adoption rates and the presence of major technology companies, while Asia-Pacific regions, particularly China and India, are witnessing rapid growth and present lucrative opportunities for HPDA vendors in the coming years.

    The projected market trajectory indicates substantial growth opportunities for HPDA solution providers. The continued expansion of data-intensive applications across diverse sectors will remain a primary driver, intensified by advances in data analytics techniques and the ongoing digital transformation across industries. The shift toward cloud-based HPDA deployments is expected to accelerate, offering scalability and cost-optimization benefits. Geographic expansion, particularly in developing economies, will unlock significant untapped potential. While competitive pressures remain, companies that differentiate their offerings through superior performance, robust security features, and solutions tailored to specific industry needs will be well positioned to capitalize on the ongoing expansion. Strategic partnerships and mergers & acquisitions are also anticipated to shape the competitive landscape in the coming years.

    Recent developments include:

    * May 2023: NeuroBlade announced its partnership with Dell Technologies to accelerate data analytics. The solution offers customers security and reliability, coupled with the industry's first processor architecture proven to accelerate high-throughput data analytics workloads. Through the partnership, NeuroBlade strengthens its market strategy and reinforces demand for advanced solutions.
    * January 2023: Atos announced that it was selected by Austrian AVL List GmbH to deliver a new high-performance computing cluster based on BullSequana XH2000 servers, along with a five-year maintenance service. As a major mobility technology provider for development, simulation, and testing in the automotive industry, AVL will rely on Atos' supercomputer to drive more complex and powerful simulations while optimizing its energy consumption.

    Key drivers for this market are the growing IT and database industry across the globe, growing data volumes, and advancements in high-performance computing. Potential restraints include high investment costs and stringent government regulations. A notable trend is the expected growth of on-demand deployment.

  20. Table_1_DeepBehavior: A Deep Learning Toolbox for Automated Analysis of Animal and Human Behavior Imaging Data

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated May 7, 2019
    Golshani, Peyman; Carmichael, S. Thomas; Dobkin, Bruce H.; Arac, Ahmet; Zhao, Pingping (2019). Table_1_DeepBehavior: A Deep Learning Toolbox for Automated Analysis of Animal and Human Behavior Imaging Data.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000176965
    Explore at:
    Dataset updated
    May 7, 2019
    Authors
    Golshani, Peyman; Carmichael, S. Thomas; Dobkin, Bruce H.; Arac, Ahmet; Zhao, Pingping
    Description

    Detailed behavioral analysis is key to understanding the brain-behavior relationship. Here, we present deep learning-based methods for analysis of behavior imaging data in mice and humans. Specifically, we use three different convolutional neural network architectures and five different behavior tasks in mice and humans and provide detailed instructions for rapid implementation of these methods for the neuroscience community. We provide examples of three dimensional (3D) kinematic analysis in the food pellet reaching task in mice, three-chamber test in mice, social interaction test in freely moving mice with simultaneous miniscope calcium imaging, and 3D kinematic analysis of two upper extremity movements in humans (reaching and alternating pronation/supination). We demonstrate that the transfer learning approach accelerates the training of the network when using images from these types of behavior video recordings. We also provide code for post-processing of the data after initial analysis with deep learning. Our methods expand the repertoire of available tools using deep learning for behavior analysis by providing detailed instructions on implementation, applications in several behavior tests, and post-processing methods and annotated code for detailed behavior analysis. Moreover, our methods in human motor behavior can be used in the clinic to assess motor function during recovery after an injury such as stroke.
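
    As a generic illustration of the transfer-learning approach mentioned above (not the authors' exact pipeline), one might fine-tune only the head of an ImageNet-pretrained network on labeled behavior video frames:

    ```python
    import torch
    import torchvision

    # Start from an ImageNet-pretrained backbone and freeze its weights.
    model = torchvision.models.resnet18(
        weights=torchvision.models.ResNet18_Weights.DEFAULT)
    for p in model.parameters():
        p.requires_grad = False

    num_classes = 4  # assumed number of behavior classes, for illustration only
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    # Training then proceeds over batches of labeled video frames as usual;
    # only the new final layer is updated, which speeds up convergence.
    ```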
