100+ datasets found
  1. Data from: Discovering System Health Anomalies using Data Mining Techniques

    • catalog.data.gov
    • data.nasa.gov
    • +2 more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). Discovering System Health Anomalies using Data Mining Techniques [Dataset]. https://catalog.data.gov/dataset/discovering-system-health-anomalies-using-data-mining-techniques
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    We discuss a statistical framework that underlies envelope detection schemes as well as dynamical models based on Hidden Markov Models (HMM) that can encompass both discrete and continuous sensor measurements for use in Integrated System Health Management (ISHM) applications. The HMM allows for the rapid assimilation, analysis, and discovery of system anomalies. We motivate our work with a discussion of an aviation problem where the identification of anomalous sequences is essential for safety reasons. The data in this application are discrete and continuous sensor measurements and can be dealt with seamlessly using the methods described here to discover anomalous flights. We specifically treat the problem of discovering anomalous features in the time series that may be hidden from the sensor suite and compare those methods to standard envelope detection methods on test data designed to accentuate the differences between the two methods. Identification of these hidden anomalies is crucial to building stable, reusable, and cost-efficient systems. We also discuss a data mining framework for the analysis and discovery of anomalies in high-dimensional time series of sensor measurements that would be found in an ISHM system. We conclude with recommendations that describe the tradeoffs in building an integrated scalable platform for robust anomaly detection in ISHM applications.
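
    As a rough illustration of the HMM-based approach described above, the sketch below fits a Gaussian HMM to nominal multivariate sensor sequences and flags sequences with unusually low per-sample log-likelihood. It uses synthetic data and the hmmlearn library as a stand-in for the paper's models; the state count and threshold are assumptions.

```python
# Minimal sketch: HMM-based anomaly scoring for multivariate sensor sequences.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
# Hypothetical training set: 50 nominal flights, each a (T, 4) array of
# continuous sensor channels (discretized channels could be appended as well).
train_flights = [rng.normal(size=(200, 4)) for _ in range(50)]

X_train = np.vstack(train_flights)
lengths = [len(f) for f in train_flights]

# Fit a Gaussian HMM to nominal behaviour only.
hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50, random_state=0)
hmm.fit(X_train, lengths)

def anomaly_score(flight: np.ndarray) -> float:
    """Negative per-sample log-likelihood: higher means more anomalous."""
    return -hmm.score(flight) / len(flight)

# Flag flights whose score exceeds a high percentile of the training scores.
train_scores = np.array([anomaly_score(f) for f in train_flights])
threshold = np.percentile(train_scores, 99)

test_flight = rng.normal(loc=0.5, size=(200, 4))  # hypothetical off-nominal flight
print("anomalous" if anomaly_score(test_flight) > threshold else "nominal")
```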

  2. Data Science Jobs Analysis

    • kaggle.com
    Updated Feb 8, 2023
    Cite
    Niyal Thakkar (2023). Data Science Jobs Analysis [Dataset]. https://www.kaggle.com/datasets/niyalthakkar/data-science-jobs-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Niyal Thakkar
    Description

    Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.

    The data used for analysis can come from many different sources and be presented in various formats. Data science is an essential part of many industries today, given the massive amounts of data that are produced, and is one of the most debated topics in IT circles.

  3. Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for...

    • frontiersin.figshare.com
    docx
    Updated Jun 3, 2023
    Cite
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen (2023). Data_Sheet_1_The Optimal Machine Learning-Based Missing Data Imputation for the Cox Proportional Hazard Model.docx [Dataset]. http://doi.org/10.3389/fpubh.2021.680054.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Chao-Yu Guo; Ying-Chen Yang; Yi-Hau Chen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adequate imputation of missing data significantly preserves statistical power and avoids erroneous conclusions. In the era of big data, machine learning is a great tool to infer missing values. The root mean square error (RMSE) and the proportion of falsely classified entries (PFC) are two standard statistics for evaluating imputation accuracy. However, the Cox proportional hazards model combined with various imputation types requires deliberate study, and its validity under different missing mechanisms is unknown. In this research, we propose supervised and unsupervised imputations and examine four machine learning-based imputation strategies. We conducted a simulation study under various scenarios with several parameters, such as sample size, missing rate, and different missing mechanisms. The results revealed the type-I errors of the different imputation techniques in the survival data. The simulation results show that the non-parametric “missForest”, based on unsupervised imputation, is the only robust method without inflated type-I errors under all missing mechanisms. In contrast, the other methods are not valid for testing when the missing pattern is informative. Improperly conducted statistical analysis with missing data may lead to erroneous conclusions. This research provides a clear guideline for valid survival analysis using the Cox proportional hazards model with machine learning-based imputations.
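
    As a rough illustration of the two accuracy metrics named in the abstract, the sketch below imputes a synthetic data matrix and computes RMSE on the continuous columns and PFC on a categorical column. scikit-learn's IterativeImputer and SimpleImputer stand in for missForest-style imputation; the data and missingness rate are assumptions, not the authors' pipeline.

```python
# Minimal sketch: scoring imputation accuracy with RMSE and PFC.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
X_true = rng.normal(size=(300, 3))                        # continuous block
C_true = rng.integers(0, 3, size=(300, 1)).astype(float)  # categorical block

def mask(a, rate=0.2):
    """Introduce missingness completely at random (MCAR)."""
    out = a.copy()
    out[rng.random(a.shape) < rate] = np.nan
    return out

X_miss, C_miss = mask(X_true), mask(C_true)

X_imp = IterativeImputer(random_state=0).fit_transform(X_miss)
C_imp = SimpleImputer(strategy="most_frequent").fit_transform(C_miss)

miss_X, miss_C = np.isnan(X_miss), np.isnan(C_miss)
rmse = np.sqrt(np.mean((X_imp[miss_X] - X_true[miss_X]) ** 2))  # continuous error
pfc = np.mean(C_imp[miss_C] != C_true[miss_C])                  # categorical error
print(f"RMSE = {rmse:.3f}, PFC = {pfc:.3f}")
```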

  4. Comparative Analysis of Data-Driven Anomaly Detection Methods

    • data.nasa.gov
    • s.cnmilf.com
    • +2 more
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Comparative Analysis of Data-Driven Anomaly Detection Methods [Dataset]. https://data.nasa.gov/dataset/comparative-analysis-of-data-driven-anomaly-detection-methods
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This paper provides a review of three advanced machine learning algorithms for anomaly detection in continuous data streams from a ground-test firing of a subscale Solid Rocket Motor (SRM). The study compares Orca, one-class support vector machines, and the Inductive Monitoring System (IMS) for anomaly detection on the data streams. We measure the performance of each algorithm with respect to the detection horizon for situations where fault information is available. These algorithms have also been studied by the present authors (and other co-authors) as applied to liquid propulsion systems. The trade space between these algorithms is explored for both types of propulsion systems.
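
    Of the three methods compared, the one-class support vector machine is readily available in scikit-learn. The sketch below trains it on synthetic nominal data and reports the first alarm index in a held-out stream with an injected fault, a rough proxy for the detection horizon discussed above; the data and parameters are assumptions.

```python
# Minimal sketch: one-class SVM anomaly detection on a continuous data stream.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
nominal = rng.normal(size=(1000, 6))                      # nominal test-stand measurements
stream = np.vstack([rng.normal(size=(200, 6)),
                    rng.normal(loc=3.0, size=(20, 6))])   # injected fault at the end

scaler = StandardScaler().fit(nominal)
ocsvm = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(scaler.transform(nominal))

# predict() returns +1 for nominal and -1 for anomalous; the index of the
# first -1 gives a rough detection horizon for the injected fault.
labels = ocsvm.predict(scaler.transform(stream))
first_alarm = int(np.argmax(labels == -1)) if (labels == -1).any() else -1
print("first alarm at sample", first_alarm)
```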

  5. Data from: FEP Augmentation as a Means to Solve Data Paucity Problems for...

    • figshare.com
    zip
    Updated Apr 23, 2024
    Cite
    Pieter B. Burger; Xiaohu Hu; Ilya Balabin; Morné Muller; Megan Stanley; Fourie Joubert; Thomas M. Kaiser (2024). FEP Augmentation as a Means to Solve Data Paucity Problems for Machine Learning in Chemical Biology [Dataset]. http://doi.org/10.1021/acs.jcim.4c00071.s005
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 23, 2024
    Dataset provided by
    ACS Publications
    Authors
    Pieter B. Burger; Xiaohu Hu; Ilya Balabin; Morné Muller; Megan Stanley; Fourie Joubert; Thomas M. Kaiser
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the realm of medicinal chemistry, the primary objective is to swiftly optimize a multitude of chemical properties of a set of compounds to yield a clinical candidate poised for clinical trials. In recent years, two computational techniques, machine learning (ML) and physics-based methods, have evolved substantially and are now frequently incorporated into the medicinal chemist’s toolbox to enhance the efficiency of both hit optimization and candidate design. Both computational methods come with their own set of limitations, and they are often used independently of each other. ML’s capability to screen extensive compound libraries expediently is tempered by its reliance on quality data, which can be scarce especially during early-stage optimization. Contrarily, physics-based approaches like free energy perturbation (FEP) are frequently constrained by low throughput and high cost by comparison; however, physics-based methods are capable of making highly accurate binding affinity predictions. In this study, we harnessed the strength of FEP to overcome data paucity in ML by generating virtual activity data sets which then inform the training of algorithms. Here, we show that ML algorithms trained with an FEP-augmented data set could achieve comparable predictive accuracy to data sets trained on experimental data from biological assays. Throughout the paper, we emphasize key mechanistic considerations that must be taken into account when aiming to augment data sets and lay the groundwork for successful implementation. Ultimately, the study advocates for the synergy of physics-based methods and ML to expedite the lead optimization process. We believe that the physics-based augmentation of ML will significantly benefit drug discovery, as these techniques continue to evolve.
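
    The augmentation idea, combining scarce experimental labels with abundant FEP-predicted labels before training, can be sketched as follows. The descriptor matrices, set sizes, and the down-weighting of the FEP rows are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch: training a regressor on an FEP-augmented affinity data set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
# Placeholder molecular descriptors; in practice these would be fingerprints
# or physics-derived features for each compound.
X_exp, y_exp = rng.normal(size=(40, 128)), rng.normal(size=40)    # scarce assay data
X_fep, y_fep = rng.normal(size=(400, 128)), rng.normal(size=400)  # FEP-predicted data

X = np.vstack([X_exp, X_fep])
y = np.concatenate([y_exp, y_fep])
# Trust experimental labels more than computed ones (the 0.5 weight is an assumption).
w = np.concatenate([np.full(len(y_exp), 1.0), np.full(len(y_fep), 0.5)])

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y, sample_weight=w)
print(model.predict(X_exp[:3]))
```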

  6. Techniques of Data Collection

    • paper.erudition.co.in
    html
    Updated Apr 6, 2021
    Cite
    Einetic (2021). Techniques of Data Collection [Dataset]. https://paper.erudition.co.in/makaut/bachelor-of-business-administration/5/research-methodology
    Explore at:
    Available download formats: html
    Dataset updated
    Apr 6, 2021
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question paper solutions for the chapter "Techniques of Data Collection" of Research Methodology, 5th semester, Bachelor of Business Administration.

  7. Technologies used in big data analysis 2015

    • statista.com
    Updated Jul 29, 2015
    Cite
    Statista (2015). Technologies used in big data analysis 2015 [Dataset]. https://www.statista.com/statistics/491267/big-data-technologies-used/
    Explore at:
    Dataset updated
    Jul 29, 2015
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Dec 2014 - Feb 2015
    Area covered
    North America, Worldwide, Europe
    Description

    This graph presents the results of a survey, conducted by BARC in 2014/15, into the current and planned use of technology for the analysis of big data. At the beginning of 2015, 13 percent of respondents indicated that their company was already using a big data analytical appliance.

  8. Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 23, 2023
    Cite
    Lall, Ranjit; Robinson, Thomas (2023). Replication Data for: The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning [Dataset]. http://doi.org/10.7910/DVN/UPL4TT
    Explore at:
    Dataset updated
    Nov 23, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Lall, Ranjit; Robinson, Thomas
    Description

    Replication and simulation reproduction materials for the article "The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning." Please see the README file for a summary of the contents and the Replication Guide for a more detailed description. Article abstract: Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests on simulated as well as real social science data, together with an applied example involving a large-scale electoral survey, illustrate MIDAS's accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.
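
    A toy PyTorch illustration of the denoising-autoencoder idea behind MIDAS is sketched below: observed entries are randomly corrupted during training, reconstruction error is measured only on observed cells, and missing cells are filled from the reconstruction. This is not the authors' open-source MIDAS software; the network size and training schedule are arbitrary.

```python
# Minimal sketch: denoising-autoencoder imputation in the spirit of MIDAS.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 10)).astype(np.float32)
X[rng.random(X.shape) < 0.2] = np.nan           # 20% missing cells

obs = torch.tensor(~np.isnan(X))                # observed-entry mask
X0 = torch.tensor(np.nan_to_num(X))             # missing cells start at 0

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(),
                      nn.Linear(16, 8), nn.ReLU(),
                      nn.Linear(8, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(300):
    drop = (torch.rand_like(X0) > 0.2).float()  # corrupt a further 20% of inputs
    recon = model(X0 * drop * obs)
    # Reconstruction error is measured only on originally observed entries.
    loss = ((recon - X0)[obs] ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    X_imputed = torch.where(obs, X0, model(X0 * obs))  # fill missing cells
print(X_imputed.shape)
```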

  9. Statistical Methods in Water Resources - Supporting Materials

    • data.usgs.gov
    • catalog.data.gov
    Updated Apr 7, 2020
    Cite
    Robert Hirsch; Karen Ryberg; Stacey Archfield; Edward Gilroy; Dennis Helsel (2020). Statistical Methods in Water Resources - Supporting Materials [Dataset]. http://doi.org/10.5066/P9JWL6XR
    Explore at:
    Dataset updated
    Apr 7, 2020
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Robert Hirsch; Karen Ryberg; Stacey Archfield; Edward Gilroy; Dennis Helsel
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    This dataset contains all of the supporting materials to accompany Helsel, D.R., Hirsch, R.M., Ryberg, K.R., Archfield, S.A., and Gilroy, E.J., 2020, Statistical methods in water resources: U.S. Geological Survey Techniques and Methods, book 4, chapter A3, 454 p., https://doi.org/10.3133/tm4a3. [Supersedes USGS Techniques of Water-Resources Investigations, book 4, chapter A3, version 1.1.]. Supplemental materials (SM) for each chapter are available to re-create all examples and figures, and to solve the exercises at the end of each chapter, with relevant datasets provided in an electronic format readable by R. The SM provide (1) datasets as .Rdata files for immediate input into R, (2) datasets as .csv files for input into R or for use with other software programs, (3) R functions that are used in the textbook but not part of a published R package, (4) R scripts to produce virtually all of the figures in the book, and (5) solutions to the exercises as .html and .Rmd files. The suff ...

  10. Data from: Data on the Construction Processes of Regression Models

    • jstagedata.jst.go.jp
    jpeg
    Updated Jul 27, 2023
    Cite
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa (2023). Data on the Construction Processes of Regression Models [Dataset]. http://doi.org/10.50931/data.kona.22180318.v2
    Explore at:
    Available download formats: jpeg
    Dataset updated
    Jul 27, 2023
    Dataset provided by
    Hosokawa Powder Technology Foundation
    Authors
    Taichi Kimura; Riko Iwamoto; Mikio Yoshida; Tatsuya Takahashi; Shuji Sasabe; Yoshiyuki Shirakawa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV dataset (files numbered 1–8) demonstrates the construction processes of the regression models built with machine learning methods and is used to plot Figs. 2–7. The CSV file 1.LSM_R^2 (plotting Fig. 2) contains the relationship between estimated and actual values when the least-squares method was used for model construction. In the CSV file 2.PCR_R^2 (plotting Fig. 3), the number of principal components was varied from 1 to 5 during construction of a model using principal component regression. The data in the CSV file 3.SVR_R^2 (plotting Fig. 4) are the result of construction using support vector regression; the hyperparameters were chosen from a comprehensive combination of the listed candidates by searching for the maximum R^2 value. When a deep neural network was applied to the construction of a regression model, N_Neur., N_H.L. and N_L.T. were varied. The CSV file 4.DNN_HL (plotting Fig. 5a) shows the changes in the relationship between estimated and actual values at each N_H.L.; the corresponding changes when N_Neur. or N_L.T. were varied are in the CSV files 5.DNN_Neur (plotting Fig. 5b) and 6.DNN_LT (plotting Fig. 5c). The data in the CSV file 7.DNN_R^2 (plotting Fig. 6) are the result using the optimal N_Neur., N_H.L. and N_L.T. In the CSV file 8.R^2 (plotting Fig. 7), the validity of each machine learning method is compared by showing the optimal results for each method.

    Experimental conditions
    • Supply volume of the raw material: 25–125 mL
    • Addition rate of TiO2: 5.0–15.0 wt%
    • Operation time: 1–15 min
    • Rotation speed: 2,200–5,700 min⁻¹
    • Temperature: 295–319 K

    Nomenclature
    • N_Neur.: the number of neurons
    • N_H.L.: the number of hidden layers
    • N_L.T.: the number of learning times
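
    A generic scikit-learn sketch of the workflow these files document (least-squares regression, principal component regression, grid-searched SVR, and a small neural network, each scored by R^2) is given below; the synthetic features merely stand in for the experimental conditions listed above.

```python
# Minimal sketch: comparing regression model constructions by R^2.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(5)
X = rng.uniform(size=(200, 5))                       # e.g. volume, wt%, time, speed, T
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LSM": make_pipeline(StandardScaler(), LinearRegression()),
    "PCR(3)": make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()),
    "SVR": GridSearchCV(make_pipeline(StandardScaler(), SVR()),
                        {"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1]}, cv=5),
    "DNN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                                      random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```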

  11. Indigenous data analysis methods for research

    • osf.io
    url
    Updated Jun 12, 2024
    Cite
    Nina Sivertsen; Tahlia Johnson; Annette Briley; Shanamae Davies; Tara Struck; Larissa Taylor; Susan Smith; Megan Cooper; Jaclyn Davey (2024). Indigenous data analysis methods for research [Dataset]. http://doi.org/10.17605/OSF.IO/VNZD9
    Explore at:
    Available download formats: url
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Center For Open Science
    Authors
    Nina Sivertsen; Tahlia Johnson; Annette Briley; Shanamae Davies; Tara Struck; Larissa Taylor; Susan Smith; Megan Cooper; Jaclyn Davey
    Description

    Objective: The objective of this review is to identify what is known about Indigenous data analysis methods for research.
    Introduction: Understanding Indigenous data analysis methods for research is crucial in health research with Indigenous participants, to support culturally appropriate interpretation of research data and culturally inclusive analyses in cross-cultural research teams.
    Inclusion Criteria: This review will consider primary research studies that report on Indigenous data analysis methods for research.
    Method: Medline (via Ovid SP), PsycINFO (via Ovid SP), Web of Science (Clarivate Analytics), Scopus (Elsevier), the Cumulated Index to Nursing and Allied Health Literature CINAHL (EBSCOhost), ProQuest Central, and ProQuest Social Sciences Premium (Clarivate) will be searched. ProQuest (Theses and Dissertations) will be searched for unpublished material. Studies published from inception onwards and written in English will be assessed for inclusion. Studies meeting the inclusion criteria will be assessed for methodological quality and data will be extracted.

  12. Data Science Tweets

    • figshare.com
    zip
    Updated May 14, 2024
    Cite
    Jesus Rogel-Salazar (2024). Data Science Tweets [Dataset]. http://doi.org/10.6084/m9.figshare.2062551.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 14, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    Quantum Tunnel Tweets

    The data set contains tweets sourced from @quantum_tunnel and @dt_science as a demo for classifying text using Naive Bayes. The demo is detailed in the book Data Science and Analytics with Python by Dr J Rogel-Salazar.

    Data contents:
    • Train_QuantumTunnel_Tweets.csv: Labelled tweets for text related to "Data Science", with three features:
      • DataScience: [0/1] indicating whether the text is about "Data Science" or not
      • Date: Date when the tweet was published
      • Tweet: Text of the tweet
    • Test_QuantumTunnel_Tweets.csv: Testing data with Twitter utterances without labels:
      • id: A unique identifier for tweets
      • Date: Date when the tweet was published
      • Tweet: Text of the tweet

    For further information, please get in touch with Dr J Rogel-Salazar.
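
    A minimal sketch of the kind of Naive Bayes demo this dataset supports is shown below, using the column names given in the description; it assumes the two CSV files have been downloaded to the working directory.

```python
# Minimal sketch: bag-of-words Naive Bayes classification of the tweets.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = pd.read_csv("Train_QuantumTunnel_Tweets.csv")
test = pd.read_csv("Test_QuantumTunnel_Tweets.csv")

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(train["Tweet"], train["DataScience"])

test["DataScience_pred"] = clf.predict(test["Tweet"])
print(test[["id", "DataScience_pred"]].head())
```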

  13. Intelligent Monitor

    • kaggle.com
    Updated Apr 12, 2024
    Cite
    ptdevsecops (2024). Intelligent Monitor [Dataset]. http://doi.org/10.34740/kaggle/ds/4383210
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ptdevsecops
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IntelligentMonitor: Empowering DevOps Environments With Advanced Monitoring and Observability aims to improve monitoring and observability in complex, distributed DevOps environments by leveraging machine learning and data analytics. This repository contains a sample implementation of the IntelligentMonitor system proposed in the research paper, presented and published as part of the 11th International Conference on Information Technology (ICIT 2023).

    If you use this dataset and code or any herein modified part of it in any publication, please cite these papers:

    P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

    For any questions and research queries - please reach out via Email.

    Abstract - In the dynamic field of software development, DevOps has become a critical tool for enhancing collaboration, streamlining processes, and accelerating delivery. However, monitoring and observability within DevOps environments pose significant challenges, often leading to delayed issue detection, inefficient troubleshooting, and compromised service quality. These issues stem from DevOps environments' complex and ever-changing nature, where traditional monitoring tools often fall short, creating blind spots that can conceal performance issues or system failures. This research addresses these challenges by proposing an innovative approach to improve monitoring and observability in DevOps environments. Our solution, IntelligentMonitor, leverages real-time data collection, intelligent analytics, and automated anomaly detection powered by advanced technologies such as machine learning and artificial intelligence. The experimental results demonstrate that IntelligentMonitor effectively manages data overload, reduces alert fatigue, and improves system visibility, thereby enhancing performance and reliability. For instance, the average CPU usage across all components showed a decrease of 9.10%, indicating improved CPU efficiency. Similarly, memory utilization and network traffic showed an average increase of 7.33% and 0.49%, respectively, suggesting more efficient use of resources. By providing deep insights into system performance and facilitating rapid issue resolution, this research contributes to the DevOps community by offering a comprehensive solution to one of its most pressing challenges. This fosters more efficient, reliable, and resilient software development and delivery processes.

    Components The key components that would need to be implemented are:

    • Data Collection - Collect performance metrics and log data from the distributed system components. Could use technology like Kafka or telemetry libraries.
    • Data Processing - Preprocess and aggregate the collected data into an analyzable format. Could use Spark for distributed data processing.
    • Anomaly Detection - Apply machine learning algorithms to detect anomalies in the performance metrics. Could use isolation forest or LSTM models.
    • Alerting - Generate alerts when anomalies are detected. It could integrate with tools like PagerDuty.
    • Visualization - Create dashboards to visualize system health and key metrics. Could use Grafana or Kibana.
    • Data Storage - Store the collected metrics and log data. Could use Elasticsearch or InfluxDB.

    Implementation Details
    The core of the implementation would involve the following:
    • Setting up the data collection pipelines.
    • Building and training anomaly detection ML models on historical data.
    • Developing a real-time data processing pipeline.
    • Creating an alerting framework that ties into the ML models.
    • Building visualizations and dashboards.
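
    For the anomaly-detection component, the isolation-forest option mentioned in the component list could look roughly like the sketch below; the metric names, values, and injected fault are synthetic stand-ins for what the collection pipeline would supply.

```python
# Minimal sketch: isolation-forest anomaly detection over system metrics.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
metrics = pd.DataFrame({
    "cpu_pct": rng.normal(40, 5, 1000),
    "mem_pct": rng.normal(60, 8, 1000),
    "net_kbps": rng.normal(300, 50, 1000),
})
# Inject a short burst of abnormal load to detect.
metrics.loc[990:, ["cpu_pct", "mem_pct"]] += 40

model = IsolationForest(contamination=0.01, random_state=0).fit(metrics)
metrics["anomaly"] = model.predict(metrics) == -1   # True where flagged
print(metrics["anomaly"].tail(15))
```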

    The code would need to handle scaled-out, distributed execution for production environments.

    Proper code documentation, logging, and testing would be added throughout the implementation.

    Usage Examples Usage examples could include:

    • Running the data collection agents on each system component.
    • Visualizing system metrics through Grafana dashboards.
    • Investigating anomalies detected by the ML models.
    • Tuning the alerting rules to minimize false positives.
    • Correlating metrics with log data to troubleshoot issues.

    References The implementation would follow the details provided in the original research paper: P. Thantharate, "IntelligentMonitor: Empowering DevOps Environments with Advanced Monitoring and Observability," 2023 International Conference on Information Technology (ICIT), Amman, Jordan, 2023, pp. 800-805, doi: 10.1109/ICIT58056.2023.10226123.

    Any additional external libraries or sources used would be properly cited.

    Tags - DevOps, Software Development, Collaboration, Streamlini...

  14. Yield Curve Models and Data - TIPS Yield Curve and Inflation Compensation

    • catalog.data.gov
    Updated Dec 18, 2024
    Cite
    Board of Governors of the Federal Reserve System (2024). Yield Curve Models and Data - TIPS Yield Curve and Inflation Compensation [Dataset]. https://catalog.data.gov/dataset/yield-curve-models-and-data-tips-yield-curve-and-inflation-compensation
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset provided by
    Federal Reserve System (http://www.federalreserve.gov/)
    Description

    The yield curve, also called the term structure of interest rates, refers to the relationship between the remaining time-to-maturity of debt securities and the yield on those securities. Yield curves have many practical uses, including pricing of various fixed-income securities, and are closely watched by market participants and policymakers alike for potential clues about the market's perception of the path of the policy rate and the macroeconomic outlook. This page provides daily estimated real yield curve parameters, smoothed yields on hypothetical TIPS, and implied inflation compensation, from 1999 to the present. Because this is a staff research product and not an official statistical release, it is subject to delay, revision, or methodological changes without advance notice.

  15. Data from: Investigating Online Art Search through Quantitative Behavioral...

    • data.niaid.nih.gov
    Updated Mar 16, 2023
    Cite
    Kouretsis, Alexandros (2023). Investigating Online Art Search through Quantitative Behavioral Data and Machine Learning Techniques - Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7741134
    Explore at:
    Dataset updated
    Mar 16, 2023
    Dataset provided by
    Pergantis, Minas
    Giannakoulopoulos, Andreas
    Kouretsis, Alexandros
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes the detailed values and scripts used to study behavioral aspects of users searching online for Art and Culture by analyzing quantitative data collected by the Art Boulevard search engine using machine learning techniques. This dataset is part of the core methodology, results and discussion sections of the research paper entitled "Investigating Online Art Search through Quantitative Behavioral Data and Machine Learning Techniques"

  16. Preventive Maintenance for Marine Engines

    • kaggle.com
    Updated Feb 13, 2025
    Cite
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Kaggle
    Authors
    Fijabi J. Adekunle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview
    This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include:
    1. Data Simulation: Creating a realistic dataset with engine performance metrics.
    2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior.
    3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs.
    4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
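
    A minimal sketch of steps 3 and 4, training a Random Forest and tuning it with GridSearchCV to predict the three maintenance classes, is shown below. The column names and simulated values are assumptions; the real dataset's schema may differ.

```python
# Minimal sketch: tuned Random Forest for three-class maintenance status.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "engine_temp": rng.normal(85, 10, 2000),
    "vibration_level": rng.normal(3, 1, 2000),
    "oil_pressure": rng.normal(5, 0.5, 2000),
    "maintenance_status": rng.choice(
        ["Normal", "Requires Maintenance", "Critical"], 2000),
})
X = df.drop(columns="maintenance_status")
y = df["maintenance_status"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5, scoring="f1_macro",
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, f"test macro-F1 = {grid.score(X_te, y_te):.2f}")
```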

    Tools Used
    1. Python: Data processing, analysis, and modeling
    2. Pandas & NumPy: Data manipulation
    3. Scikit-Learn & XGBoost: Machine learning model training
    4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated
    ✔ Data Simulation & Preprocessing
    ✔ Exploratory Data Analysis (EDA)
    ✔ Feature Engineering & Encoding
    ✔ Supervised Machine Learning (Classification)
    ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings
    📌 Engine Temperature & Vibration Level: Strong indicators of potential failures.
    📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better.
    📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training.
    📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced
    🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge.
    🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction.
    🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action
    🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters.
    📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques.
    🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  17. Data to Support the Development of Rapid GC-MS Methods for Seized Drug...

    • catalog.data.gov
    • datasets.ai
    • +1 more
    Updated Feb 23, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Data to Support the Development of Rapid GC-MS Methods for Seized Drug Analysis [Dataset]. https://catalog.data.gov/dataset/data-to-support-the-development-of-rapid-gc-ms-methods-for-seized-drug-analysis
    Explore at:
    Dataset updated
    Feb 23, 2023
    Dataset provided by
    National Institute of Standards and Technology
    Description

    This dataset contains raw datafiles that support the development of rapid gas chromatography mass spectrometry (GC-MS) methods for seized drug analysis. Files are provided in the native ".D" format collected from an Agilent GC-MS system. Files can be opened using Agilent proprietary software or freely available software such as AMDIS (which can be downloaded at chemdata.nist.gov). Included here is data of seized drug mixtures and adjudicated case samples that were analyzed as part of the method development process for rapid GC-MS. Information about the naming of datafiles and the contents of each mixture and case sample can be found in the associated Excel sheet ("File Names and Comments.xlsx").

  18. Data Labeling Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 8, 2025
    Cite
    Data Insights Market (2025). Data Labeling Market Report [Dataset]. https://www.datainsightsmarket.com/reports/data-labeling-market-20383
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 8, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The data labeling market is experiencing robust growth, projected to reach $3.84 billion in 2025 and to maintain a Compound Annual Growth Rate (CAGR) of 28.13% from 2025 to 2033. This expansion is fueled by the increasing demand for high-quality training data across various sectors, including healthcare, automotive, and finance, which rely heavily on machine learning and artificial intelligence (AI). The surge in AI adoption, particularly in areas like autonomous vehicles, medical image analysis, and fraud detection, necessitates vast quantities of accurately labeled data. The market is segmented by sourcing type (in-house vs. outsourced), data type (text, image, audio), labeling method (manual, automatic, semi-supervised), and end-user industry. Outsourcing is expected to dominate the sourcing segment due to cost-effectiveness and access to specialized expertise. Similarly, image data labeling is likely to hold a significant share, given the visual nature of many AI applications. The shift towards automation and semi-supervised techniques aims to improve efficiency and reduce labeling costs, though manual labeling will remain crucial for tasks requiring high accuracy and nuanced understanding. Geographical distribution shows strong potential across North America and Europe, with Asia-Pacific emerging as a key growth region driven by increasing technological advancements and digital transformation.

    Competition in the data labeling market is intense, with a mix of established players like Amazon Mechanical Turk and Appen alongside emerging specialized companies. The market's future trajectory will likely be shaped by advancements in automation technologies, the development of more efficient labeling techniques, and the increasing need for specialized data labeling services catering to niche applications. Companies are focusing on improving the accuracy and speed of data labeling through innovations in AI-powered tools and techniques. Furthermore, the rise of synthetic data generation offers a promising avenue for supplementing real-world data, potentially addressing data scarcity challenges and reducing labeling costs in certain applications. This will, however, require careful attention to ensure that the synthetic data generated is representative of real-world data, so that model accuracy is maintained.

    This comprehensive report provides an in-depth analysis of the global data labeling market, offering insights for businesses, investors, and researchers. The study period covers 2019-2033, with 2025 as the base and estimated year and 2025-2033 as the forecast period. It covers market size, segmentation, growth drivers, challenges, and emerging trends, examining the impact of technological advancements and regulatory changes on this rapidly evolving sector. The market is projected to reach multi-billion dollar valuations by 2033, fueled by the increasing demand for high-quality data to train sophisticated machine learning models.

    Recent developments include:
    • September 2024: The National Geospatial-Intelligence Agency (NGA) is poised to invest heavily in artificial intelligence, earmarking up to USD 700 million for data labeling services over the next five years. This initiative aims to enhance NGA's machine-learning capabilities, particularly in analyzing satellite imagery and other geospatial data. The agency has opted for a multi-vendor indefinite-delivery/indefinite-quantity (IDIQ) contract, emphasizing the importance of annotating raw data, be it images or videos, to render it understandable for machine learning models. For instance, when dealing with satellite imagery, the focus could be on labeling distinct entities such as buildings, roads, or patches of vegetation.
    • October 2023: Refuel.ai unveiled a new platform, Refuel Cloud, and a specialized large language model (LLM) for data labeling. Refuel Cloud harnesses advanced LLMs, including its proprietary model, to automate data cleaning, labeling, and enrichment at scale, catering to diverse industry use cases. Recognizing that clean data underpins modern AI and data-centric software, Refuel Cloud addresses the historical challenge of human labor bottlenecks in data production. With Refuel Cloud, enterprises can swiftly generate the expansive, precise datasets they require in mere minutes, a task that traditionally spanned weeks.

    Key drivers for this market are: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology; Advances in Big Data Analytics based on AI and ML. Potential restraints include: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology; Advances in Big Data Analytics based on AI and ML. Notable trends are: Healthcare is Expected to Witness Remarkable Growth.

  19. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy, it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However, these roles come with high expectations, requiring applicants to possess comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their findings into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires thorough technical preparation. Job seekers are often asked to demonstrate their technical skills to show they can effectively fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're proficient with data manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in the use of SQL language to extract and process data from databases.

    5 Feature Engineering

    Understand and know the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
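
    For reference, these four metrics are one-liners in scikit-learn; the labels below are made up for illustration.

```python
# Minimal sketch: the evaluation metrics named in tip 6.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```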

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

  20. Leading data collection methods among UK consumers 2023

    • statista.com
    Updated Jun 26, 2025
    Cite
    Statista (2025). Leading data collection methods among UK consumers 2023 [Dataset]. https://www.statista.com/statistics/1453941/data-collection-method-consumers-uk/
    Explore at:
    Dataset updated
    Jun 26, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2023 - Dec 2023
    Area covered
    United Kingdom
    Description

    During a late 2023 survey among working-age consumers in the United Kingdom, **** percent of respondents stated that they preferred for their data to be collected via interactive surveys. Meanwhile, **** percent of respondents mentioned loyalty cards/programs as their favored data collection method.
