100+ datasets found
  1. Data from: Discovering System Health Anomalies using Data Mining Techniques

    • catalog.data.gov
    • data.nasa.gov
    • +2 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://catalog.data.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
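
    As a rough illustration of the HMM-based approach described above, the sketch below fits a Gaussian HMM to nominal multivariate sensor sequences and flags sequences with unusually low per-sample log-likelihood. It uses synthetic data and the hmmlearn library as a stand-in for the paper's models; the state count and threshold are assumptions.

```python
# Minimal sketch: HMM-based anomaly scoring for multivariate sensor sequences.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Hypothetical training set: 50 nominal flights, each a (T, 4) array of
# continuous sensor channels (discretized channels could be appended as well).
train_flights = [rng.normal(size=(200, 4)) for _ in range(50)]

X_train = np.vstack(train_flights)
lengths = [len(f) for f in train_flights]

# Fit a Gaussian HMM to nominal behaviour only.
hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50, random_state=0)
hmm.fit(X_train, lengths)

def anomaly_score(flight: np.ndarray) -> float:
    """Negative per-sample log-likelihood: higher means more anomalous."""
    return -hmm.score(flight) / len(flight)

# Flag flights whose score exceeds a high percentile of the training scores.
train_scores = np.array([anomaly_score(f) for f in train_flights])
threshold = np.percentile(train_scores, 99)

test_flight = rng.normal(loc=0.5, size=(200, 4))  # hypothetical off-nominal flight
print("anomalous" if anomaly_score(test_flight) > threshold else "nominal")
```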

  2. Data Science Jobs Analysis

    • kaggle.com
    Updated Feb 8, 2023
    Cite
    Niyal Thakkar (2023). Data Science Jobs Analysis [Dataset]. https://www.kaggle.com/datasets/niyalthakkar/data-science-jobs-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Niyal Thakkar
    Description

    Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.

    The data used for analysis can come from many different sources and be presented in various formats. Data science is an essential part of many industries today, given the massive amounts of data that are produced, and is one of the most debated topics in IT circles.

  3. Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...

    • frontiersin.figshare.com
    docx
    Updated Jun 3, 2023
    Cite
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adequate imputation of missing data significantly preserves statistical power and avoids erroneous conclusions. In the era of big data, machine learning is a great tool to infer missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the Cox proportional hazards model combined with various imputation types requires deliberate study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors of the different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest”, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods are not valid for testing when the missing pattern is informative. Improperly conducted statistical analysis with missing data may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
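
    As a rough illustration of the two accuracy metrics named in the abstract, the sketch below imputes a synthetic data matrix and computes RMSE on the continuous columns and PFC on a categorical column. scikit-learn's IterativeImputer and SimpleImputer stand in for missForest-style imputation; the data and missingness rate are assumptions, not the authors' pipeline.

```python
# Minimal sketch: scoring imputation accuracy with RMSE and PFC.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
X_true = rng.normal(size=(300, 3))                        # continuous block
C_true = rng.integers(0, 3, size=(300, 1)).astype(float)  # categorical block

def mask(a, rate=0.2):
    """Introduce missingness completely at random (MCAR)."""
    out = a.copy()
    out[rng.random(a.shape) < rate] = np.nan
    return out

X_miss, C_miss = mask(X_true), mask(C_true)

X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)
C_imp = SimpleImputer(strategy="most_frequent").fit_transform(C_miss)

miss_X, miss_C = np.isnan(X_miss), np.isnan(C_miss)
rmse = np.sqrt(np.mean((X_imp[miss_X] - X_true[miss_X]) ** 2))  # continuous error
pfc = np.mean(C_imp[miss_C] != C_true[miss_C])                  # categorical error
print(f"RMSE = {rmse:.3f}, PFC = {pfc:.3f}")
```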

  4. Comparative Analysis of Data-Driven Anomaly Detection Methods

    • data.nasa.gov
    • s.cnmilf.com
    • +2 more
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Comparative Analysis of Data-Driven Anomaly Detection Methods [Dataset]. https://data.nasa.gov/dataset/comparative-analysis-of-data-driven-anomaly-detection-methods
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This paper provides a review of three advanced machine learning algorithms for anomaly detection in continuous data streams from a ground-test firing of a subscale Solid Rocket Motor (SRM). The study compares Orca, one-class support vector machines, and the Inductive Monitoring System (IMS) for anomaly detection on the data streams. We measure the performance of each algorithm with respect to the detection horizon for situations where fault information is available. These algorithms have also been studied by the present authors (and other co-authors) as applied to liquid propulsion systems. The trade space between these algorithms is explored for both types of propulsion systems.
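
    Of the three methods compared, the one-class support vector machine is readily available in scikit-learn. The sketch below trains it on synthetic nominal data and reports the first alarm index in a held-out stream with an injected fault, a rough proxy for the detection horizon discussed above; the data and parameters are assumptions.

```python
# Minimal sketch: one-class SVM anomaly detection on a continuous data stream.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
nominal = rng.normal(size=(1000, 6))                      # nominal test-stand measurements
stream = np.vstack([rng.normal(size=(200, 6)),
                    rng.normal(loc=3.0, size=(20, 6))])   # injected fault at the end

scaler = StandardScaler().fit(nominal)
ocsvm = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(scaler.transform(nominal))

# predict() returns +1 for nominal and -1 for anomalous; the index of the
# first -1 gives a rough detection horizon for the injected fault.
labels = ocsvm.predict(scaler.transform(stream))
first_alarm = int(np.argmax(labels == -1)) if (labels == -1).any() else -1
print("first alarm at sample", first_alarm)
```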

  5. Data from: FEP Augmentation as a Means to Solve Data Paucity Problems for...

    • figshare.com
    zip
    Updated Apr 23, 2024
    Cite
    Pieter B. Burger; Xiaohu Hu; Ilya Balabin; Morné Muller; Megan Stanley; Fourie Joubert; Thomas M. Kaiser (2024). FEP Augmentation as a Means to Solve Data Paucity Problems for Machine Learning in Chemical Biology [Dataset]. http://doi.org/10.1021/acs.jcim.4c00071.s005
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    ACS Publications
    Authors
    Pieter B. Burger; Xiaohu Hu; Ilya Balabin; Morné Muller; Megan Stanley; Fourie Joubert; Thomas M. Kaiser
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the realm of medicinal chemistry, the primary objective is to swiftly optimize a multitude of chemical properties of a set of compounds to yield a clinical candidate poised for clinical trials. In recent years, two computational techniques, machine learning (ML) and physics-based methods, have evolved substantially and are now frequently incorporated into the medicinal chemist’s toolbox to enhance the efficiency of both hit optimization and candidate design. Both computational methods come with their own set of limitations, and they are often used independently of each other. ML’s capability to screen extensive compound libraries expediently is tempered by its reliance on quality data, which can be scarce especially during early-stage optimization. Contrarily, physics-based approaches like free energy perturbation (FEP) are frequently constrained by low throughput and high cost by comparison; however, physics-based methods are capable of making highly accurate binding affinity predictions. In this study, we harnessed the strength of FEP to overcome data paucity in ML by generating virtual activity data sets which then inform the training of algorithms. Here, we show that ML algorithms trained with an FEP-augmented data set could achieve comparable predictive accuracy to data sets trained on experimental data from biological assays. Throughout the paper, we emphasize key mechanistic considerations that must be taken into account when aiming to augment data sets and lay the groundwork for successful implementation. Ultimately, the study advocates for the synergy of physics-based methods and ML to expedite the lead optimization process. We believe that the physics-based augmentation of ML will significantly benefit drug discovery, as these techniques continue to evolve.
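
    The augmentation idea, combining scarce experimental labels with abundant FEP-predicted labels before training, can be sketched as follows. The descriptor matrices, set sizes, and the down-weighting of the FEP rows are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch: training a regressor on an FEP-augmented affinity data set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
# Placeholder molecular descriptors; in practice these would be fingerprints
# or physics-derived features for each compound.
X_exp, y_exp = rng.normal(size=(40, 128)), rng.normal(size=40)    # scarce assay data
X_fep, y_fep = rng.normal(size=(400, 128)), rng.normal(size=400)  # FEP-predicted data

X = np.vstack([X_exp, X_fep])
y = np.concatenate([y_exp, y_fep])
# Trust experimental labels more than computed ones (the 0.5 weight is an assumption).
w = np.concatenate([np.full(len(y_exp), 1.0), np.full(len(y_fep), 0.5)])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y, sample_weight=w)
print(model.predict(X_exp[:3]))
```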

  6. Techniques of Data Collection

    • paper.erudition.co.in
    html
    Updated Apr 6, 2021
    Cite
    Einetic (2021). Techniques of Data Collection [Dataset]. https://paper.erudition.co.in/makaut/bachelor-of-business-administration/5/research-methodology
    Explore at:
    Available download formats: html
    Dataset updated
    Apr 6, 2021
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question paper solutions for the chapter "Techniques of Data Collection" of Research Methodology, 5th semester, Bachelor of Business Administration.

  7. Technologies used in big data analysis 2015

    • statista.com
    Updated Jul 29, 2015
    Cite
    Statista (2015). Technologies used in big data analysis 2015 [Dataset]. https://www.statista.com/statistics/491267/big-data-technologies-used/
    Explore at:
    Dataset updated
    Jul 29, 2015
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2014 - Feb 2015
    Area covered
    North America, Worldwide, Europe
    Description

    This graph presents the results of a survey, conducted by BARC in 2014/15, into the current and planned use of technology for the analysis of big data. At the beginning of 2015, 13 percent of respondents indicated that their company was already using a big data analytical appliance.

  8. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
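
    A toy PyTorch illustration of the denoising-autoencoder idea behind MIDAS is sketched below: observed entries are randomly corrupted during training, reconstruction error is measured only on observed cells, and missing cells are filled from the reconstruction. This is not the authors' open-source MIDAS software; the network size and training schedule are arbitrary.

```python
# Minimal sketch: denoising-autoencoder imputation in the spirit of MIDAS.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10)).astype(np.float32)
X[rng.random(X.shape) < 0.2] = np.nan           # 20% missing cells

obs = torch.tensor(~np.isnan(X))                # observed-entry mask
X0 = torch.tensor(np.nan_to_num(X))             # missing cells start at 0

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(),
                      nn.Linear(16, 8), nn.ReLU(),
                      nn.Linear(8, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(300):
    drop = (torch.rand_like(X0) > 0.2).float()  # corrupt a further 20% of inputs
    recon = model(X0 * drop * obs)
    # Reconstruction error is measured only on originally observed entries.
    loss = ((recon - X0)[obs] ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    X_imputed = torch.where(obs, X0, model(X0 * obs))  # fill missing cells
print(X_imputed.shape)
```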

  9. Statistical Methods in Water Resources - Supporting Materials

    • data.usgs.gov
    • catalog.data.gov
    Updated Apr 7, 2020
    Cite
    Robert Hirsch; Karen Ryberg; Stacey Archfield; Edward Gilroy; Dennis Helsel (2020). Statistical Methods in Water Resources - Supporting Materials [Dataset]. http://doi.org/10.5066/P9JWL6XR
    Explore at:
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Robert Hirsch; Karen Ryberg; Stacey Archfield; Edward Gilroy; Dennis Helsel
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset contains all of the supporting materials to accompany Helsel, D.R., Hirsch, R.M., Ryberg, K.R., Archfield, S.A., and Gilroy, E.J., 2020, Statistical methods in water resources: U.S. Geological Survey Techniques and Methods, book 4, chapter A3, 454 p., https://doi.org/10.3133/tm4a3. [Supersedes USGS Techniques of Water-Resources Investigations, book 4, chapter A3, version 1.1.]. Supplemental materials (SM) for each chapter are available to re-create all examples and figures, and to solve the exercises at the end of each chapter, with relevant datasets provided in an electronic format readable by R. The SM provide (1) datasets as .Rdata files for immediate input into R, (2) datasets as .csv files for input into R or for use with other software programs, (3) R functions that are used in the textbook but not part of a published R package, (4) R scripts to produce virtually all of the figures in the book, and (5) solutions to the exercises as .html and .Rmd files. The suff ...

  10. Data from: Data on the Construction Processes of Regression Models

    • jstagedata.jst.go.jp
    jpeg
    Updated Jul 27, 2023
    Cite
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa (2023). Data on the Construction Processes of Regression Models [Dataset]. http://doi.org/10.50931/data.kona.22180318.v2
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    Hosokawa Powder Technology Foundation
    Authors
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV dataset (files numbered 1–8) demonstrates the construction processes of the regression models built with machine learning methods and is used to plot Figs. 2–7. The CSV file 1.LSM_R^2 (plotting Fig. 2) contains the relationship between estimated and actual values when the least-squares method was used for model construction. In the CSV file 2.PCR_R^2 (plotting Fig. 3), the number of principal components was varied from 1 to 5 during construction of a model using principal component regression. The data in the CSV file 3.SVR_R^2 (plotting Fig. 4) are the result of construction using support vector regression; the hyperparameters were chosen from a comprehensive combination of the listed candidates by searching for the maximum R^2 value. When a deep neural network was applied to the construction of a regression model, N_Neur., N_H.L. and N_L.T. were varied. The CSV file 4.DNN_HL (plotting Fig. 5a) shows the changes in the relationship between estimated and actual values at each N_H.L.; the corresponding changes when N_Neur. or N_L.T. were varied are in the CSV files 5.DNN_Neur (plotting Fig. 5b) and 6.DNN_LT (plotting Fig. 5c). The data in the CSV file 7.DNN_R^2 (plotting Fig. 6) are the result using the optimal N_Neur., N_H.L. and N_L.T. In the CSV file 8.R^2 (plotting Fig. 7), the validity of each machine learning method is compared by showing the optimal results for each method.

    Experimental conditions
    • Supply volume of the raw material: 25–125 mL
    • Addition rate of TiO2: 5.0–15.0 wt%
    • Operation time: 1–15 min
    • Rotation speed: 2,200–5,700 min⁻¹
    • Temperature: 295–319 K

    Nomenclature
    • N_Neur.: the number of neurons
    • N_H.L.: the number of hidden layers
    • N_L.T.: the number of learning times
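
    A generic scikit-learn sketch of the workflow these files document (least-squares regression, principal component regression, grid-searched SVR, and a small neural network, each scored by R^2) is given below; the synthetic features merely stand in for the experimental conditions listed above.

```python
# Minimal sketch: comparing regression model constructions by R^2.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 5))                       # e.g. volume, wt%, time, speed, T
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LSM": make_pipeline(StandardScaler(), LinearRegression()),
    "PCR(3)": make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()),
    "SVR": GridSearchCV(make_pipeline(StandardScaler(), SVR()),
                        {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1]}, cv=5),
    "DNN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                                      random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```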

  11. Indigenous data analysis methods for research

    • osf.io
    url
    Updated Jun 12, 2024
    Cite
    Nina Sivertsen; Tahlia Johnson; Annette Briley; Shanamae Davies; Tara Struck; Larissa Taylor; Susan Smith; Megan Cooper; Jaclyn Davey (2024). Indigenous data analysis methods for research [Dataset]. http://doi.org/10.17605/OSF.IO/VNZD9
    Explore at:
    Available download formats: url
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Center For Open Science
    Authors
    Nina Sivertsen; Tahlia Johnson; Annette Briley; Shanamae Davies; Tara Struck; Larissa Taylor; Susan Smith; Megan Cooper; Jaclyn Davey
    Description

    Objective: The objective of this review is to identify what is known about Indigenous data analysis methods for research.
    Introduction: Understanding Indigenous data analysis methods for research is crucial in health research with Indigenous participants, to support culturally appropriate interpretation of research data and culturally inclusive analyses in cross-cultural research teams.
    Inclusion Criteria: This review will consider primary research studies that report on Indigenous data analysis methods for research.
    Method: Medline (via Ovid SP), PsycINFO (via Ovid SP), Web of Science (Clarivate Analytics), Scopus (Elsevier), the Cumulated Index to Nursing and Allied Health Literature CINAHL (EBSCOhost), ProQuest Central, and ProQuest Social Sciences Premium (Clarivate) will be searched. ProQuest (Theses and Dissertations) will be searched for unpublished material. Studies published from inception onwards and written in English will be assessed for inclusion. Studies meeting the inclusion criteria will be assessed for methodological quality and data will be extracted.

  12. Data Science Tweets

    • figshare.com
    zip
    Updated May 14, 2024
    Cite
    Jesus Rogel-Salazar (2024). Data Science Tweets [Dataset]. http://doi.org/10.6084/m9.figshare.2062551.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 14, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Quantum Tunnel Tweets

    The data set contains tweets sourced from @quantum_tunnel and @dt_science as a demo for classifying text using Naive Bayes. The demo is detailed in the book Data Science and Analytics with Python by Dr J Rogel-Salazar.

    Data contents:
    • Train_QuantumTunnel_Tweets.csv: Labelled tweets for text related to "Data Science", with three features:
      • DataScience: [0/1] indicating whether the text is about "Data Science" or not
      • Date: Date when the tweet was published
      • Tweet: Text of the tweet
    • Test_QuantumTunnel_Tweets.csv: Testing data with Twitter utterances without labels:
      • id: A unique identifier for tweets
      • Date: Date when the tweet was published
      • Tweet: Text of the tweet

    For further information, please get in touch with Dr J Rogel-Salazar.
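
    A minimal sketch of the kind of Naive Bayes demo this dataset supports is shown below, using the column names given in the description; it assumes the two CSV files have been downloaded to the working directory.

```python
# Minimal sketch: bag-of-words Naive Bayes classification of the tweets.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("Train_QuantumTunnel_Tweets.csv")
test = pd.read_csv("Test_QuantumTunnel_Tweets.csv")

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train["Tweet"], train["DataScience"])

test["DataScience_pred"] = clf.predict(test["Tweet"])
print(test[["id", "DataScience_pred"]].head())
```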

  13. Intelligent Monitor

    • kaggle.com
    Updated Apr 12, 2024
    Cite
    ptdevsecops (2024). Intelligent Monitor [Dataset]. http://doi.org/10.34740/kaggle/ds/4383210
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ptdevsecops
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).

    If you use this dataset and code or any herein modified part of it in any publication, please cite these papers:

    P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

    For any questions and research queries - please reach out via Email.

    Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.

    Components The key components that would need to be implemented are:

    • Data Collection - Collect performance metrics and log data from the distributed system components. Could use technology like Kafka or telemetry libraries.
    • Data Processing - Preprocess and aggregate the collected data into an analyzable format. Could use Spark for distributed data processing.
    • Anomaly Detection - Apply machine learning algorithms to detect anomalies in the performance metrics. Could use isolation forest or LSTM models.
    • Alerting - Generate alerts when anomalies are detected. It could integrate with tools like PagerDuty.
    • Visualization - Create dashboards to visualize system health and key metrics. Could use Grafana or Kibana.
    • Data Storage - Store the collected metrics and log data. Could use Elasticsearch or InfluxDB.

    Implementation Details
    The core of the implementation would involve the following:
    • Setting up the data collection pipelines.
    • Building and training anomaly detection ML models on historical data.
    • Developing a real-time data processing pipeline.
    • Creating an alerting framework that ties into the ML models.
    • Building visualizations and dashboards.
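
    For the anomaly-detection component, the isolation-forest option mentioned in the component list could look roughly like the sketch below; the metric names, values, and injected fault are synthetic stand-ins for what the collection pipeline would supply.

```python
# Minimal sketch: isolation-forest anomaly detection over system metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
metrics = pd.DataFrame({
    "cpu_pct": rng.normal(40, 5, 1000),
    "mem_pct": rng.normal(60, 8, 1000),
    "net_kbps": rng.normal(300, 50, 1000),
})
# Inject a short burst of abnormal load to detect.
metrics.loc[990:, ["cpu_pct", "mem_pct"]] += 40

model = IsolationForest(contamination=0.01, random_state=0).fit(metrics)
metrics["anomaly"] = model.predict(metrics) == -1   # True where flagged
print(metrics["anomaly"].tail(15))
```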

    The code would need to handle scaled-out, distributed execution for production environments.

    Proper code documentation, logging, and testing would be added throughout the implementation.

    Usage Examples Usage examples could include:

    • Running the data collection agents on each system component.
    • Visualizing system metrics through Grafana dashboards.
    • Investigating anomalies detected by the ML models.
    • Tuning the alerting rules to minimize false positives.
    • Correlating metrics with log data to troubleshoot issues.

    References The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

    Any additional external libraries or sources used would be properly cited.

    Tags - DevOps, Software Development, Collaboration, Streamlini...

  14. Yield Curve Models and Data - TIPS Yield Curve and Inflation Compensation

    • catalog.data.gov
    Updated Dec 18, 2024
    Cite
    Board of Governors of the Federal Reserve System (2024). Yield Curve Models and Data - TIPS Yield Curve and Inflation Compensation [Dataset]. https://catalog.data.gov/dataset/yield-curve-models-and-data-tips-yield-curve-and-inflation-compensation
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Federal Reserve System (http://www.federalreserve.gov/)
    Description

    The yield curve, also called the term structure of interest rates, refers to the relationship between the remaining time-to-maturity of debt securities and the yield on those securities. Yield curves have many practical uses, including pricing of various fixed-income securities, and are closely watched by market participants and policymakers alike for potential clues about the market's perception of the path of the policy rate and the macroeconomic outlook. This page provides daily estimated real yield curve parameters, smoothed yields on hypothetical TIPS, and implied inflation compensation, from 1999 to the present. Because this is a staff research product and not an official statistical release, it is subject to delay, revision, or methodological changes without advance notice.

  15. Data from: Investigating Online Art Search through Quantitative Behavioral...

    • data.niaid.nih.gov
    Updated Mar 16, 2023
    Cite
    Kouretsis, Alexandros (2023). Investigating Online Art Search through Quantitative Behavioral Data and Machine Learning Techniques - Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7741134
    Explore at:
    Dataset updated
    Mar 16, 2023
    Dataset provided by
    Pergantis, Minas
    Giannakoulopoulos, Andreas
    Kouretsis, Alexandros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the detailed values and scripts used to study behavioral aspects of users searching online for Art and Culture by analyzing quantitative data collected by the Art Boulevard search engine using machine learning techniques. This dataset is part of the core methodology, results and discussion sections of the research paper entitled "Investigating Online Art Search through Quantitative Behavioral Data and Machine Learning Techniques"

  16. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview
    This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:
    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
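
    A minimal sketch of steps 3 and 4, training a Random Forest and tuning it with GridSearchCV to predict the three maintenance classes, is shown below. The column names and simulated values are assumptions; the real dataset's schema may differ.

```python
# Minimal sketch: tuned Random Forest for three-class maintenance status.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "engine_temp": rng.normal(85, 10, 2000),
    "vibration_level": rng.normal(3, 1, 2000),
    "oil_pressure": rng.normal(5, 0.5, 2000),
    "maintenance_status": rng.choice(
        ["Normal", "Requires Maintenance", "Critical"], 2000),
})
X = df.drop(columns="maintenance_status")
y = df["maintenance_status"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5, scoring="f1_macro",
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, f"test macro-F1 = {grid.score(X_te, y_te):.2f}")
```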

    Tools Used
    1. Python: Data processing, analysis, and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated
    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings
    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced
    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action
    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  17. Data to Support the Development of Rapid GC-MS Methods for Seized Drug...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Feb 23, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Data to Support the Development of Rapid GC-MS Methods for Seized Drug Analysis [Dataset]. https://catalog.data.gov/dataset/data-to-support-the-development-of-rapid-gc-ms-methods-for-seized-drug-analysis
    Explore at:
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    National Institute of Standards and Technology
    Description

    This dataset contains raw datafiles that support the development of rapid gas chromatography mass spectrometry (GC-MS) methods for seized drug analysis. Files are provided in the native ".D" format collected from an Agilent GC-MS system. Files can be opened using Agilent proprietary software or freely available software such as AMDIS (which can be downloaded at chemdata.nist.gov). Included here is data of seized drug mixtures and adjudicated case samples that were analyzed as part of the method development process for rapid GC-MS. Information about the naming of datafiles and the contents of each mixture and case sample can be found in the associated Excel sheet ("File Names and Comments.xlsx").

  18. Data Labeling Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 8, 2025
    Cite
    Data Insights Market (2025). Data Labeling Market Report [Dataset]. https://www.datainsightsmarket.com/reports/data-labeling-market-20383
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 8, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data labeling market is experiencing robust growth, projected to reach $3.84 billion in 2025 and to maintain a Compound Annual Growth Rate (CAGR) of 28.13% from 2025 to 2033. This expansion is fueled by the increasing demand for high-quality training data across various sectors, including healthcare, automotive, and finance, which rely heavily on machine learning and artificial intelligence (AI). The surge in AI adoption, particularly in areas like autonomous vehicles, medical image analysis, and fraud detection, necessitates vast quantities of accurately labeled data. The market is segmented by sourcing type (in-house vs. outsourced), data type (text, image, audio), labeling method (manual, automatic, semi-supervised), and end-user industry. Outsourcing is expected to dominate the sourcing segment due to cost-effectiveness and access to specialized expertise. Similarly, image data labeling is likely to hold a significant share, given the visual nature of many AI applications. The shift towards automation and semi-supervised techniques aims to improve efficiency and reduce labeling costs, though manual labeling will remain crucial for tasks requiring high accuracy and nuanced understanding. Geographical distribution shows strong potential across North America and Europe, with Asia-Pacific emerging as a key growth region driven by increasing technological advancements and digital transformation.

    Competition in the data labeling market is intense, with a mix of established players like Amazon Mechanical Turk and Appen alongside emerging specialized companies. The market's future trajectory will likely be shaped by advancements in automation technologies, the development of more efficient labeling techniques, and the increasing need for specialized data labeling services catering to niche applications. Companies are focusing on improving the accuracy and speed of data labeling through innovations in AI-powered tools and techniques. Furthermore, the rise of synthetic data generation offers a promising avenue for supplementing real-world data, potentially addressing data scarcity challenges and reducing labeling costs in certain applications. This will, however, require careful attention to ensure that the synthetic data generated is representative of real-world data, so that model accuracy is maintained.

    This comprehensive report provides an in-depth analysis of the global data labeling market, offering insights for businesses, investors, and researchers. The study period covers 2019-2033, with 2025 as the base and estimated year and 2025-2033 as the forecast period. It covers market size, segmentation, growth drivers, challenges, and emerging trends, examining the impact of technological advancements and regulatory changes on this rapidly evolving sector. The market is projected to reach multi-billion dollar valuations by 2033, fueled by the increasing demand for high-quality data to train sophisticated machine learning models.

    Recent developments include:
    • September 2024: The National Geospatial-Intelligence Agency (NGA) is poised to invest heavily in artificial intelligence, earmarking up to USD 700 million for data labeling services over the next five years. This initiative aims to enhance NGA's machine-learning capabilities, particularly in analyzing satellite imagery and other geospatial data. The agency has opted for a multi-vendor indefinite-delivery/indefinite-quantity (IDIQ) contract, emphasizing the importance of annotating raw data, be it images or videos, to render it understandable for machine learning models. For instance, when dealing with satellite imagery, the focus could be on labeling distinct entities such as buildings, roads, or patches of vegetation.
    • October 2023: Refuel.ai unveiled a new platform, Refuel Cloud, and a specialized large language model (LLM) for data labeling. Refuel Cloud harnesses advanced LLMs, including its proprietary model, to automate data cleaning, labeling, and enrichment at scale, catering to diverse industry use cases. Recognizing that clean data underpins modern AI and data-centric software, Refuel Cloud addresses the historical challenge of human labor bottlenecks in data production. With Refuel Cloud, enterprises can swiftly generate the expansive, precise datasets they require in mere minutes, a task that traditionally spanned weeks.

    Key drivers for this market are: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology; Advances in Big Data Analytics based on AI and ML. Potential restraints include: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology; Advances in Big Data Analytics based on AI and ML. Notable trends are: Healthcare is Expected to Witness Remarkable Growth.

  19. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy, it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their findings into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires thorough technical preparation. Job seekers are often asked to demonstrate their technical skills to show they can effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're proficient with data manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
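
    For reference, these four metrics are one-liners in scikit-learn; the labels below are made up for illustration.

```python
# Minimal sketch: the evaluation metrics named in tip 6.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```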

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  20. Leading data collection methods among UK consumers 2023

    • statista.com
    Updated Jun 26, 2025
    Cite
    Statista (2025). Leading data collection methods among UK consumers 2023 [Dataset]. https://www.statista.com/statistics/1453941/data-collection-method-consumers-uk/
    Explore at:
    Dataset updated
    Jun 26, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2023 - Dec 2023
    Area covered
    United Kingdom
    Description

    During a late 2023 survey among working-age consumers in the United Kingdom, **** percent of respondents stated that they preferred for their data to be collected via interactive surveys. Meanwhile, **** percent of respondents mentioned loyalty cards/programs as their favored data collection method.
