Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allows us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
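As an illustration of the general approach described in this abstract (not the authors' implementation), the sketch below trains an autoencoder on valid samples and then classifies ambiguous samples by their latent coordinates; the layer sizes, latent dimension, and the k-nearest-neighbour classifier are illustrative assumptions.

```python
# Minimal sketch (PyTorch): learn a low-dimensional representation of valid
# samples with an autoencoder, then classify ambiguous samples in that space.
# Layer sizes, latent dimension, and the k-NN classifier are illustrative choices.
import torch
import torch.nn as nn
from sklearn.neighbors import KNeighborsClassifier

class AE(nn.Module):
    def __init__(self, n_features, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_ae(model, x_valid, epochs=200, lr=1e-3):
    """Train the AE to reconstruct samples already known to be valid."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        recon, _ = model(x_valid)
        loss = loss_fn(recon, x_valid)
        loss.backward()
        opt.step()
    return model

def classify_ambiguous(model, x_labelled, y_labelled, x_ambiguous, k=5):
    """Classify ambiguous samples in the latent space with a standard classifier.

    x_labelled / y_labelled: torch tensor of labelled samples and their labels;
    x_ambiguous: torch tensor of samples whose validity is ambiguous.
    """
    with torch.no_grad():
        z_lab = model.encoder(x_labelled).numpy()
        z_amb = model.encoder(x_ambiguous).numpy()
    clf = KNeighborsClassifier(n_neighbors=k).fit(z_lab, y_labelled)
    return clf.predict(z_amb)
```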
National, regional
Households
Sample survey data [ssd]
The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as the sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected in each EA for interview. We use the large module to select the households for the official VHFPS interviews, and the small-module households serve as a reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.
Computer Assisted Telephone Interview [cati]
The questionnaire for Round 2 consisted of the following sections:
Section 2. Behavior
Section 3. Health
Section 5. Employment (main respondent)
Section 6. Coping
Section 7. Safety Nets
Section 8. FIES
Data cleaning began during the data collection process. Inputs for the cleaning process include the interviewers’ notes following each question item, the interviewers’ notes at the end of the tablet form, and the supervisors’ notes taken during monitoring. The data cleaning process was conducted in the following steps:
• Append households interviewed in ethnic minority languages to the main dataset of households interviewed in Vietnamese.
• Remove unnecessary variables which were automatically calculated by SurveyCTO
• Remove household duplicates in the dataset where the same form is submitted more than once.
• Remove observations of households which were not supposed to be interviewed following the identified replacement procedure.
• Format variables as their object type (string, integer, decimal, etc.)
• Read through interviewers’ notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were instructed to choose the most appropriate one and write down the respondent’s answer in detail so that the survey management team could decide which code best suited the answer.
• Correct data based on supervisors’ note where enumerators entered wrong code.
• Recode the answer option “Other, please specify”. This option is usually followed by a blank line allowing enumerators to type or write text to specify the answer. The data cleaning team thoroughly checked these answers to decide whether each one needed to be recoded into one of the available categories or kept as originally recorded. In some cases, an answer was assigned a completely new code if it appeared many times in the survey dataset.
• Examine the data accuracy of outlier values, defined as values that lie below the 5th percentile or above the 95th percentile, by listening to interview recordings (a short code sketch of this percentile check follows this list).
• Final check on matching the main dataset with the different sections; sections where information is collected at the individual level are kept in separate data files in long form.
• Label variables using the full question text.
• Label variable values where necessary.
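A minimal pandas sketch of the percentile rule used in the outlier-review step above; the file and column names are hypothetical.

```python
# Minimal sketch: flag values outside the 5th-95th percentile range so the
# corresponding interview recordings can be reviewed. Column names are hypothetical.
import pandas as pd

def flag_outliers(df: pd.DataFrame, column: str) -> pd.DataFrame:
    lo, hi = df[column].quantile([0.05, 0.95])
    return df[(df[column] < lo) | (df[column] > hi)]

# Example (hypothetical file and variable):
# households = pd.read_stata("round2.dta")
# to_review = flag_outliers(households, "food_expenditure")
```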
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
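As a rough illustration of the robust-regression step described above, the sketch below fits a Wagner-type vapor-pressure correlation with a Huber loss using SciPy; the data arrays, critical constants, and the specific 2.5/5 Wagner form are assumptions for the example, not values from the study.

```python
# Minimal sketch: fit the Wagner vapor-pressure equation with a Huber loss so
# that outlying data points are down-weighted (robust regression).
import numpy as np
from scipy.optimize import least_squares

def wagner_ln_pr(params, T, Tc):
    # Wagner 2.5/5 form: ln(p/pc) = (Tc/T) * (a*tau + b*tau^1.5 + c*tau^2.5 + d*tau^5)
    a, b, c, d = params
    tau = 1.0 - T / Tc
    return (a * tau + b * tau**1.5 + c * tau**2.5 + d * tau**5) / (T / Tc)

def residuals(params, T, p, Tc, pc):
    return wagner_ln_pr(params, T, Tc) - np.log(p / pc)

# Placeholder experimental data [K, Pa] and assumed critical constants.
T = np.array([300.0, 320.0, 340.0, 360.0, 380.0])
p = np.array([3.5e3, 1.1e4, 2.9e4, 6.6e4, 1.3e5])
Tc, pc = 513.0, 6.1e6

fit = least_squares(residuals, x0=[-7.0, 1.0, -2.0, -3.0],
                    args=(T, p, Tc, pc), loss="huber", f_scale=0.1)
# Residuals larger than f_scale contribute only linearly to the cost, which is
# what makes the fit robust; points with large residuals can then be flagged
# as outliers and removed before regressing the PCP-SAFT parameters.
```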
By UCI [source]
This dataset explores the phenomenon of credit card application acceptance or rejection. It includes a range of both continuous and categorical attributes, such as the applicant's gender, credit score, and income, as well as details about recent credit card activity including balance transfers and delinquency. This data presents a unique opportunity to investigate how these different attributes interact in determining application status. With careful analysis of this dataset, we can gain valuable insights into understanding what factors help ensure a successful application outcome. This could lead us to developing more effective strategies for predicting and improving financial credit access for everyone.
This dataset is an excellent resource for researching the credit approval process, as it provides a variety of attributes from both continuous and categorical sources. The aim of this guide is to provide tips and advice on how to make the most of this dataset.
- Understand the data: Before attempting to work with this dataset, it's important to understand what kind of information it contains. Since there is a mix of continuous and categorical attributes in this data set, make sure you familiarise yourself with all the different columns before proceeding further.
- Exploratory Analysis: It is recommended that you conduct some exploratory analysis on your data in order to gain an overall understanding of its characteristics and distributions. By investigating things like missing values and correlations between different independent variables (IVs) or dependent variables (DVs), you can better prepare yourself for making meaningful analyses or predictions in further steps.
- Data Cleaning: Once you have familiarised yourself with your data, begin cleaning up any potential discrepancies such as missing values or outliers by replacing them appropriately or removing them from your dataset if necessary (a minimal loading-and-cleaning sketch appears after the column table below).
- Feature Selection/Engineering: After cleansing your data set, feature selection/engineering may be necessary if certain columns are redundant or not proving useful for constructing meaningful models/analyses over your data set (usually observed after exploratory analysis). You should be very mindful when deciding which features should be removed so that no information about potentially important relationships is lost!
- Model Building/Analysis: Now that our data has been pre-processed appropriately we can move forward with developing our desired models / analyses over our newly transformed datasets!
- Developing predictive models to identify customers who are likely to default on their credit card payments.
- Creating a risk analysis system that can identify customers who pose a higher risk for fraud or misuse of their credit cards.
- Developing an automated lending decision system that can use the data points provided in the dataset (i.e., gender, average monthly balance, etc.) to decide whether or not to approve applications for new credit lines and loans
If you use this dataset in your research, please credit the original authors.
Data Source
See the dataset description for more information.
File: crx.data.csv

| Column name | Description |
|:------------|:------------|
| b | Gender (Categorical) |
| 30.83 | Average Monthly Balance (Continuous) |
| 0 | Number of Months Since Applicant's Last Delinquency (Continuous) |
| w | Number of Months Since Applicant's Last Credit Card Approval (Continuous) |
| 1.25 | Number of Months Since the Applicant's Last Balance Increase (Continuous) |
| ... | ... |
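Since the column "names" above are really the first data record, a typical way to load the file is header-less with placeholder names; the sketch below also treats '?' entries as missing values, the convention used in the UCI credit-approval data. Adjust if your copy of the file already includes a proper header row.

```python
# Minimal loading-and-cleaning sketch for the credit approval data.
import pandas as pd

cols = [f"A{i}" for i in range(1, 17)]          # A1..A15 attributes plus the class label
df = pd.read_csv("crx.data.csv", header=None, names=cols, na_values="?")

print(df.isna().sum())                           # missing values per column
df = df.dropna()                                 # or impute, depending on the analysis
```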
https://spdx.org/licenses/CC0-1.0.html
Recent studies have demonstrated that conflict is common among gene trees in phylogenomic studies, and that less than one percent of genes may ultimately drive species tree inference in supermatrix analyses. Here, we examined two datasets where supermatrix and coalescent-based species trees conflict. We identified two highly influential “outlier” genes in each dataset. When removed from each dataset, the inferred supermatrix trees matched the topologies obtained from coalescent analyses. We also demonstrate that, while the outlier genes in the vertebrate dataset have been shown in a previous study to be the result of errors in orthology detection, the outlier genes from a plant dataset did not exhibit any obvious systematic error and therefore may be the result of some biological process yet to be determined. While topological comparisons among a small set of alternate topologies can be helpful in discovering outlier genes, they can be limited in several ways, such as assuming all genes share the same topology. Coalescent species tree methods relax this assumption but do not explicitly facilitate the examination of specific edges. Coalescent methods often also assume that conflict is the result of incomplete lineage sorting (ILS). Here we explored a framework that allows for quickly examining alternative edges and support for large phylogenomic datasets that does not assume a single topology for all genes. For both datasets, these analyses provided detailed results confirming the support for coalescent-based topologies. This framework suggests that we can improve our understanding of the underlying signal in phylogenomic datasets by asking more targeted edge-based questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.
The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:
[1] Example benchmark of anomaly detection in time series: Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779-1797, 2022. doi:10.14778/3538598.3538602
About Solenix
Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The income or expenditure-related data sets are often nonlinear, heteroscedastic, skewed even after the transformation, and contain numerous outliers. We propose a class of robust nonlinear models that treat outlying observations effectively without removing them. For this purpose, case-specific parameters and a related penalty are employed to detect and modify the outliers systematically. We show how the existing nonlinear models such as smoothing splines and generalized additive models can be robustified by the case-specific parameters. Next, we extend the proposed methods to the heterogeneous models by incorporating unequal weights. The details of estimating the weights are provided. Two real data sets and simulated data sets show the potential of the proposed methods when the nature of the data is nonlinear with outlying observations.
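One common way to formalize case-specific parameters of this kind (a hedged sketch, not necessarily the authors' exact formulation) is to give every observation its own shift parameter and penalize those shifts, so that only genuine outliers receive a non-zero adjustment:

```latex
% gamma_i is the case-specific parameter for observation i, J(f) the roughness
% penalty of the smoothing spline / GAM, and w_i the unequal weights of the
% heterogeneous extension; the l1 penalty keeps most gamma_i exactly zero.
\[
\min_{f,\;\gamma}\;\sum_{i=1}^{n} w_i\,\bigl(y_i - f(x_i) - \gamma_i\bigr)^{2}
\;+\;\lambda_{1}\,J(f)\;+\;\lambda_{2}\sum_{i=1}^{n}\lvert\gamma_i\rvert
\]
```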
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
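A minimal sketch of offline augmentation with Pillow (one of the libraries listed in the references); the folder layout and transformation parameters are illustrative assumptions.

```python
# Minimal sketch: create flipped, rotated, and brightened copies of each image.
from pathlib import Path
from PIL import Image, ImageEnhance

src = Path("data/train/diseased")            # hypothetical folder layout
dst = Path("data/train_augmented/diseased")
dst.mkdir(parents=True, exist_ok=True)

for img_path in src.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    img.transpose(Image.FLIP_LEFT_RIGHT).save(dst / f"{img_path.stem}_flip.jpg")
    img.rotate(15, expand=True).save(dst / f"{img_path.stem}_rot15.jpg")
    ImageEnhance.Brightness(img).enhance(1.3).save(dst / f"{img_path.stem}_bright.jpg")
```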
Exploratory Data Analysis (EDA)
The dataset was examined using visuals such as scatter plots and histograms and was inspected for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology
Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.
Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development
The CNN architecture consists of convolutional layers, units, and activation functions. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.
Model Training
During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.
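A minimal transfer-learning sketch in Keras along the lines described above (MobileNet backbone, dropout, early stopping, checkpointing); the directory layout, image size, and hyperparameters are illustrative assumptions rather than the project's actual configuration.

```python
# Minimal sketch: frozen MobileNetV2 backbone with dropout, early stopping,
# and model checkpoints for a healthy-vs-diseased classifier.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                            # freeze the pre-trained backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),   # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),                 # regularization against overfitting
    tf.keras.layers.Dense(1, activation="sigmoid"),      # healthy vs. diseased
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]
model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=callbacks)
```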
Model Evaluation
Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets.
Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion
Key project findings include model performance and disease detection precision, a comparison of the models employed (showing the benefits and drawbacks of each), and the challenges faced throughout the project together with the methods used to solve them.
Conclusion
The project's key learnings are recapped, its importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References
Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib
Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111
In the present era of large-scale surveys, big data present new challenges to the discovery process for anomalous data. Such data can be indicative of systematic errors, extreme (or rare) forms of known phenomena, or most interestingly, truly novel phenomena that exhibit as-of-yet unobserved behaviours. In this work, we present an outlier scoring methodology to identify and characterize the most promising unusual sources to facilitate discoveries of such anomalous data. We have developed a data mining method based on k-nearest neighbour distance in feature space to efficiently identify the most anomalous light curves. We test variations of this method including using principal components of the feature space, removing select features, the effect of the choice of k, and scoring to subset samples. We evaluate the performance of our scoring on known object classes and find that our scoring consistently scores rare (<1000) object classes higher than common classes. We have applied scoring to all long cadence light curves of Quarters 1-17 of Kepler's prime mission and present outlier scores for all 2.8 million light curves for the roughly 200k objects.
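A minimal sketch of the k-nearest-neighbour distance scoring described above; the feature matrix, standardization step, and choice of k are illustrative assumptions.

```python
# Minimal sketch: score each object by its distance to its k-th nearest
# neighbour in feature space; larger scores are more anomalous.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def knn_outlier_scores(features: np.ndarray, k: int = 20) -> np.ndarray:
    X = StandardScaler().fit_transform(features)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0th neighbour
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]                               # distance to the k-th true neighbour

# Usage (hypothetical feature matrix derived from light curves):
# scores = knn_outlier_scores(light_curve_features, k=20)
# top = np.argsort(scores)[::-1][:100]                # most anomalous light curves
```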
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
nuts-STeauRY dataset: hydrochemical and catchment characteristics dataset for large sample studies of Carbon, Nitrogen, Phosphorus and Silicon in French watercourses
Antoine Casquin, Marie Silvestre, Vincent Thieu
10.5281/zenodo.10830852
v0.1, 18th March 2024
Brief overview of data:
· Carbon and nutrients data for 5470 continental French catchments
· Modelled discharge for 5128 of the 5470 catchments
· Geopackages with catchment delineations and outlets
· DEM conditioned to delimit additional catchments
· Land-use and climatic data for 5470 continental French catchments
Citation of this work
A data paper is currently being submitted with details of methods and results. Once published, it will be the preferential source to cite. The data paper will be linked to the new version of the dataset that will be updated on doi.org/10.5281/zenodo.10830852. If you use this dataset in your research or report, you must cite it.
Motivations
Data was collected and curated for the nuts-STeauRY project (http://nuts-steaury.cnrs.fr), which deployed a national generic land to sea modelling chain.
Data was primarily used (see related works):
To calibrate concentrations of dissolved organic carbon and dissolved silica in headwaters
To validate the modelling chain spatially and temporally (DOC, NO3-, NH4+, TP, SRP, DSi)
Hydrochemical large sample datasets have numerous other uses: trend computations, elucidation of transfer mechanisms, machine learning, retrospective studies, etc.
The objective here is to provide a large sample curated dataset of carbon and nutrient concentrations along with modelled discharges, catchment characteristics and delineations for continental France. Such a large sample dataset aims at easing large sample studies over France and/or Europe. Although part of the data gathered here is obtainable via public sources, the catchment delineations, their characteristics and the modelled hydrology were not publicly available yet. Moreover, a unification of units and a detection and removal of outliers were performed on the carbon and nutrient data.
Data sources & processing
Sampling points were snapped onto the CCM database v2.1 (http://data.europa.eu/89h/fe1878e8-7541-4c66-8453-afdae7469221) (Vogt et al., 2007) and catchments were delineated using a 100 m resolution Digital Elevation Model (DEM) conditioned by the hydrographic network and the elementary catchment delineations of CCM v2.1. More than 6000 catchments were delineated and screened manually to check consistency: 5470 were retained.
Nutrient data was collected mainly through the Naiades portal (https://naiades.eaufrance.fr/), a database collecting water quality data produced by different water-related actors across France. Nutrient data was also collected directly from the regional water agencies (https://www.eau-seine-normandie.fr/, https://eau-grandsudouest.fr/, https://www.eaurmc.fr/, https://www.eau-artois-picardie.fr/, https://www.eau-rhin-meuse.fr/ and https://agence.eau-loire-bretagne.fr/home.html), and pre-processed using a database management system relying on PostgreSQL with the PostGIS extension (Thieu & Silvestre, 2015). A three-pass strategy was used to curate the raw carbon and nutrient data: (1) removal of “obvious” outliers; (2) detection of baseline changes and correction where possible (or removal of the data); (3) removal of outliers using a quantile-based approach by element and temporal series.
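A minimal pandas sketch of the third pass (quantile-based outlier removal per element and per time series); the quantile bounds are illustrative, and the column names follow the CNPSi.csv attribute list below.

```python
# Minimal sketch: drop values outside per-station, per-parameter quantile bounds.
import pandas as pd

def remove_quantile_outliers(df: pd.DataFrame, lo: float = 0.01, hi: float = 0.99) -> pd.DataFrame:
    grouped = df.groupby(["sta_code", "var"])["value"]
    lower = grouped.transform(lambda s: s.quantile(lo))
    upper = grouped.transform(lambda s: s.quantile(hi))
    return df[(df["value"] >= lower) & (df["value"] <= upper)]

# cnpsi = pd.read_csv("CNPSi.csv")
# cnpsi_clean = remove_quantile_outliers(cnpsi)
```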
Hydrological time series are interpolated through hydrograph transfer (de Lavenne et al., 2023) from 1664 discharge time series completed with the GR4J model (Pelletier & Andréassian, 2020; Pelletier 2021).
Land cover data was extracted from Corine Land Cover dataset for years 2000, 2006, 2012, and 2018 (EEA, 2020). Raw CLC typology contains 44 classes. Results of percent cover per year per class were computed for each catchment. An aggregated typology of 8 classes is also proposed.
Climatological data was extracted from daily reconstructions at 5 arcmin for temperatures and 1 arcmin for precipitation over Europe (Thiemig et al., 2022). Catchment means of daily minimum and maximum temperature and of precipitation were computed for each catchment for the 1990-2019 period.
Nuts-STeauRY dataset
Carbon and nutrients time series
Time series of carbon and nutrients within the 1962-2019 period on 5470 stations: Dissolved Organic Carbon (DOC), Total Organic Carbon (TOC), Nitrates (NO3-), Nitrites (NO2-), Ammonia (NH4+), Soluble Reactive Phosphorus (SRP), Total Phosphorus (TP) and Dissolved Silica (DSi).
|var | n_unique_station| n_total_meas| mean_duration_y| mean_frequency_y|
|:---|----------------:|------------:|---------------:|----------------:|
|DOC | 4 992| 658 147| 14.3| 9.0|
|DSi | 3 299| 333 866| 12.9| 8.3|
|NH4 | 5 318| 907 343| 19.3| 8.7|
|NO2 | 5 264| 891 886| 19.2| 8.6|
|NO3 | 5 465| 939 279| 19.0| 9.0|
|SRP | 5 361| 910 107| 19.1| 8.7|
|TOC | 935| 111 993| 13.6| 9.6|
|TP | 5 199| 802 841| 17.1| 8.8|
Note that some SRP and DSi measurements were declared as performed on raw water. A thorough analysis of the time series shows no evidence of a difference in baselines. For more accuracy, it is advised to filter out those analyses using the “fraction” attribute of each measurement.
Discharge modelled daily time series
Modelled naturalized discharge through hydrograph transfer and interpolated measured discharges when available for the 1980-2019 period.
A daily discharge was computed for 5128 catchments. For small catchments (< 1000 km2, n = 4530), hydrograph transfer was used, while for large catchments a direct interpolation of measured/completed discharges was performed. Direct interpolation was only possible for 598 catchments > 1000 km2. The criterion retained for direct interpolation is 0.8*area_discharge_station < area_quality < 1.2*area_discharge_station when the discharge and quality stations are nested.
Uncertainties in the hydrological time series vary considerably depending on the quality of the data source, the distance from pseudo-gauged outlets, the land cover of the catchment, the natural spatial and temporal variability of discharge, and the size of the catchment (de Lavenne et al., 2016). We advise cautious use of these modelled discharges, as uncertainties could not be computed.
Catchments, outlets and conditioned DEM
5470 catchments and outlets are delivered as geopackages (EPSG: 3035).
The DEM, conditioned by CCM 2.1, is also delivered as a GeoTIFF (EPSG: 3035) as a way to delineate additional catchments in the area that are consistent with the dataset.
Catchments characteristics and climate
Refer to Data sources & processing and File descriptions.
File and attributes descriptions:
The key “sta_code” is present across all files. For time varying records, “date” can be a secondary key.
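A minimal sketch of joining the files on the shared key, assuming the file names and columns described below.

```python
# Minimal sketch: join concentration records to per-station summary statistics
# using the shared "sta_code" key (plus "var" for the parameter).
import pandas as pd

cnpsi = pd.read_csv("CNPSi.csv", parse_dates=["date"])
stats = pd.read_csv("CNPSi_stats.csv")

merged = cnpsi.merge(stats, on=["sta_code", "var"], how="left",
                     suffixes=("", "_station"))
```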
Description of CNPSi.csv data attributes
Each line corresponds to one measurement of a given parameter at a given station
· sta_code: Code of the station in the Sandre referentiel (public french "dataverse" for water data)
· sta_name: Name of the station in the Sandre referentiel (public french "dataverse" for water data)
· var: Abbreviation of parameter name
· fraction: "water_filtrated" or "water_raw"
· date: date of sampling
· hour: hour of sampling
· value: analytical result (concentration)
· provider: provider of the data
· producer: producer of the data
· from_db: "Naiades2022" (https://naiades.eaufrance.fr/france-entiere#/ dump from 2022) or "DoNuts" (Thieu, V., Silvestre, M., 2015. DoNuts: un système d’information sur les observations environnementales. Présentation Séminaire UMR Métis)
· n_meas: number of observations for a given parameter / station
· unit: unit of concentration
· element: "C" "N" "P" or "Si"
· year: year of observation
· month: month of observation
· day: day of observation
· julian_day: julian day observation (1-366)
· decade: decade of observation (one of "1961-1970", "1971-1980", "1981-1990", "1991-2000", "2001-2010", "2011-2020")
Description of CNPSi_stats.csv data attributes
Each line corresponds to a parameter / station pair
· sta_code: Code of the station in the Sandre referentiel (public french "dataverse" for water data)
· sta_name: Name of the station in the Sandre referentiel (public french "dataverse" for water data)
· var: Abbreviation of parameter name
· n_meas: number of observations for a given parameter / station
· start_year: year of first observation for a given parameter / station
· end_year: year of last observation for a given parameter / station
· duration_y_tot: total duration of observation in years for a given parameter / station
· duration_y_tot: duration of observation in years for a given parameter / station for years with at least 1 meas
· mean_nmeas_per_y_tot: mean number of observations per year considering total duration
· mean_nmeas_per_y_meas: mean number of observations per year considering years with measurements
· is_fully_continuous: TRUE if at least one measurement per year for a given parameter / station
· start_cont_seq: year in which the longest continuous sequence starts for a given parameter / station
· end_cont_seq: year in which the longest continuous sequence ends for a given parameter / station
· duration_y_cont_seq: duration in years for the longest continuous sequence for a given parameter / station
· nmeas_cont_seq:
https://spdx.org/licenses/CC0-1.0.html
Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector. This makes it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms such as K-Nearest Neighbors, Bayes Theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB) have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospitals data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the relationships between attributes, with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier which has 100% accuracy and the RF which also has 100% accuracy.
Methods
Prior to data collection, the researcher was guided by the ethical training certification on data collection and the rights to confidentiality and privacy required by the Institutional Review Board (IRB). Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked against the laboratory confirmation of the diagnosis. The data were divided into two tables: the first table, called data1, contains the data used in phase 1 of the classification, while the second table, data2, contains the data used in phase 2 of the classification.
Data Source Collection
The malaria incidence data set was obtained from public hospitals for 2017 to 2021. These are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors available for the patients inhabiting those areas. Multinomial Naive Bayes is the model used to analyze the collected data for malaria disease prediction and grading.
Data Preprocessing:
Data preprocessing shall be done to remove noise and outliers.
Transformation:
The data shall be transformed from analog to electronic records.
Data Partitioning
The collected data will be divided into two portions: one portion will be extracted as a training set, while the other will be used for testing. The training portion taken from one table stored in the database will be called training set 1, while the training portion taken from the other table will be called training set 2.
The dataset was split into two parts: 70% for training and 30% held out for testing. Then, using MNB classification algorithms implemented in Python, the models were trained on the training sample. The resulting models were tested on the remaining 30% of the data, and the results were compared with the other machine learning models using standard metrics.
Classification and prediction:
Based on the nature of the variables in the dataset, this study uses Multinomial Naïve Bayes classification in two phases: classification phase 1 and classification phase 2. The operation of the framework is as follows:
i. Data collection and preprocessing are carried out.
ii. The preprocessed data are stored in training set 1 and training set 2. These datasets are used during classification.
iii. The test data set is stored in the database.
iv. Part of the test data set is classified using classifier 1 and the remaining part is classified with classifier 2, as follows:
Classifier phase 1: classifies records into positive or negative classes. If the patient has malaria, the patient is classified as positive (P); if the patient does not have malaria, the patient is classified as negative (N).
Classifier phase 2: classifies only the records labelled positive by classifier 1, further separating them into complicated and uncomplicated classes. This classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that the core determining parameters supply their values.
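A minimal scikit-learn sketch of this two-phase Multinomial Naive Bayes setup; the synthetic feature matrix and label names below are placeholders, not the study's hospital data.

```python
# Minimal sketch of the two-phase Multinomial Naive Bayes classification.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 15))                 # 15 encoded attributes (placeholder)
y1 = rng.choice(["positive", "negative"], size=200)    # phase 1 labels: malaria yes/no
y2 = np.where(y1 == "positive",
              rng.choice(["complicated", "uncomplicated"], size=200), "none")

X_tr, X_te, y1_tr, y1_te, y2_tr, y2_te = train_test_split(
    X, y1, y2, test_size=0.30, random_state=42)        # 70% training, 30% testing

clf1 = MultinomialNB().fit(X_tr, y1_tr)                # classifier phase 1
pred1 = clf1.predict(X_te)
print(classification_report(y1_te, pred1))

pos_tr = y1_tr == "positive"                           # phase 2 learns from positives only
clf2 = MultinomialNB().fit(X_tr[pos_tr], y2_tr[pos_tr])

pos_te = pred1 == "positive"                           # phase 2 applied to predicted positives
pred2 = clf2.predict(X_te[pos_te])                     # complicated vs. uncomplicated
```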
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
In this work, the authors present the most comprehensive INTEGRAL active galactic nucleus (AGN) sample. It lists 272 AGN for which they have secure optical identifications, precise optical spectroscopy and measured redshift values plus X-ray spectral information, i.e. 2-10 and 20-100 keV fluxes plus column densities. In their paper, the authors mainly use this sample to study the absorption properties of active galaxies, to probe new AGN classes and to test the AGN unification scheme. The authors find that half (48%) of the sample is absorbed, while the fraction of Compton-thick AGN is small (~7%). In line with their previous analysis, they have however shown that when the bias towards heavily absorbed objects which are lost if weak and at large distance is removed, as is possible in the local Universe, the above fractions increase to become 80% and 17%, respectively. The authors also find that absorption is a function of source luminosity, which implies some evolution in the obscuration properties of AGN. A few peculiar classes, so far poorly studied in the hard X-ray band, have been detected and studied for the first time, such as 5 X-ray bright optically normal galaxies (XBONGs), 5 type 2 QSOs and 11 low-ionization nuclear emission regions. In terms of optical classification, this sample contains 57% type 1 and 43% type 2 AGN; this subdivision is similar to that found in X-rays if unabsorbed versus absorbed objects are considered, suggesting that the match between optical and X-ray classifications is on the whole good. Only a small percentage of sources (12%) does not fulfill the expectation of the unified theory, as the authors find 22 type 1 AGN which are absorbed and 10 type 2 AGN which are unabsorbed. Studying these outliers in depth, the authors found that most of the absorbed type 1 AGN have X-ray spectra characterized by either complex or warm/ionized absorption more likely due to ionized gas located in an accretion disc wind or in the bi-conical structure associated with the central nucleus, therefore unrelated to the toroidal structure. Among the 10 type 2 AGN which are unabsorbed, at most 3-4% are still eligible to be classified as 'true' type 2 AGN. In the fourth INTEGRAL/IBIS survey (Bird et al. 2010, ApJS, 186, 1, available in the HEASARC database as the IBISCAT4 table), there are 234 objects which have been identified with AGN. To this set of sources, the present authors then added 38 galaxies listed in the INTEGRAL all-sky survey by Krivonos et al. (2007, A&A, 475, 775, available in the HEASARC database as the INTIBISASS table) updated on the website (http://hea.iki.rssi.ru/integral/survey/catalog.php) but not included in the Bird et al. catalog due to the different sky coverage (these latter sources are indicated with hard_flag = 'h' values in this HEASARC table). The final data set presented and discussed in the reference paper and constituting this table therefore comprises 272 AGN and was last updated in March 2011. It represents the most complete view of the INTEGRAL extragalactic sky as of the date of publication in 2012. This table was created by the HEASARC in October 2014 based on CDS Catalog J/MNRAS/426/1750 files tablea1.dat and refs.dat. This is a service provided by NASA HEASARC.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Exploring the California Housing Dataset: A Modern Take on a Classic Dataset
The California Housing dataset is a well-known and widely-used dataset in the data science and machine learning communities. Originating from the 1990 U.S. Census, it contains housing data for California districts, and serves as a perfect playground for learners and practitioners to explore real-world regression problems.
This dataset was originally published by Pace and Barry in 1997, and later adapted by Scikit-learn as a toy dataset for machine learning. While it's based on 1990 data, it remains relevant and invaluable due to its simplicity, interpretability, and richness in geographic and socioeconomic features.
The dataset includes 20,000+ observations, each representing a block group in California. Key features include: - Median income of households - Median house value - Housing median age - Total rooms and bedrooms - Population and household counts - Latitude and longitude coordinates
It's great for beginners, as it has many perks for analysis -
![Few rows of CA Housing](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21414749%2F78e6885a905025b3478c7c87bc61194f%2FCA_few.png?generation=1747798227111665&alt=media)
and Visualization -
![Distribution of numerical features](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21414749%2F562002482656c84f6a4a7b923f11512f%2FCA%20dist.png?generation=1747798372627633&alt=media)
with Geospatial Visuals -
![CA population as bubble map](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21414749%2F1da1d024bee19cf7579573835d089905%2FCA%20Population%20as%20Bubble.png?generation=1747798539969864&alt=media)
I achieved an appreciable variance score by doing the following sequential steps (a code sketch of this workflow appears at the end of this section):
Outlier Removal- Carefully identified and removed outliers based on statistical thresholds in key features like Income vs House Value, Room vs House Value, and Population vs House Value. This step drastically improved the signal-to-noise ratio in the data, making downstream modeling more robust and interpretable.
Geo-Spatial Clustering- Using the latitude and longitude, I performed geo-based clustering to group similar regions. This not only uncovered meaningful spatial patterns in housing prices but also allowed for richer feature augmentation. KMeans was used to super-localize the clusters, and the results were validated with geographic visualizations and cluster analysis.
Variance Explained: A Huge Win!- I was able to explain up to 75% of the variance in the dataset with just linear models (OLS) — a strong indicator of how well the features (post-cleaning and clustering) capture the underlying data structure.
![Model comparison based on raw and processed datasets](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21414749%2Fba06ee6605b57aef11ed6427210daffe%2FModel%20Comparision%20based%20on%20raw%20and%20processed%20datasets.png?generation=1747799721015242&alt=media)
This lays a great foundation for predictive modeling and insightful analysis.
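A minimal sketch of the workflow described above, using the scikit-learn copy of the California Housing data: a crude quantile trim standing in for the careful outlier removal, KMeans on latitude/longitude for geo-clustering, and an OLS fit whose held-out R^2 approximates the explained variance. The cluster count, trim thresholds, and split are illustrative assumptions.

```python
# Minimal sketch: outlier trim -> geo-clustering -> OLS explained variance.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = fetch_california_housing(as_frame=True).frame

# crude quantile trim on target and income (placeholder for careful thresholds)
df = df[(df["MedHouseVal"] < df["MedHouseVal"].quantile(0.99)) &
        (df["MedInc"] < df["MedInc"].quantile(0.99))]

# geo-spatial clustering as an extra categorical feature
km = KMeans(n_clusters=20, n_init=10, random_state=0)
df["geo_cluster"] = km.fit_predict(df[["Latitude", "Longitude"]])

X = pd.get_dummies(df.drop(columns="MedHouseVal"), columns=["geo_cluster"])
y = df["MedHouseVal"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
ols = LinearRegression().fit(X_tr, y_tr)
print("R^2 on held-out data:", r2_score(y_te, ols.predict(X_te)))
```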
Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

Lymbery et al 2018 Full dataset: Contains all the data used in the statistical analyses for the associated manuscript. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. (Lymbery et al Full Dataset.xlsx)
Lymbery et al 2018 Reduced dataset 1: Contains data used in the attached manuscript following the removal of three outliers for the purposes of data distribution, as described in the associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. (Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx)
Lymbery et al 2018 Reduced dataset 2: Contains the data used in the statistical analyses for the associated manuscript, after the removal of all outliers stated in the manuscript and associated R code. The file contains two spreadsheets: one containing the data and one containing a legend relating to column titles. (Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx)
Lymbery et al 2018 R Script: Contains all the R code used for statistical analysis in this manuscript, with annotations to aid interpretation.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
We have produced a global dataset of ~4000 GPS vertical velocities that can be used as observational estimates of glacial isostatic adjustment (GIA) uplift rates. GIA is the response of the solid Earth to past ice loading, primarily, since the Last Glacial Maximum, about 20 K yrs BP. Modelling GIA is challenging because of large uncertainties in ice loading history and also the viscosity of the upper and lower mantle. GPS data contain the signature of GIA in their uplift rates but these also contain other sources of vertical land motion (VLM) such as tectonics, human and natural influences on water storage that can mask the underlying GIA signal. A novel fully-automatic strategy was developed to post-process the GPS time series and to correct for non-GIA artefacts. Before estimating vertical velocities and uncertainties, we detected outliers and jumps and corrected for atmospheric mass loading displacements. We corrected the resulting velocities for the elastic response of the solid Earth to global changes in ice sheets, glaciers, and ocean loading, as well as for changes in the Earth's rotational pole relative to the 20th century average. We then applied a spatial median filter to remove sites where local effects were dominant to leave approximately 4000 GPS sites. The resulting novel global GPS dataset shows a clean GIA signal at all post-processed stations and is suitable to investigate the behaviour of global GIA forward models. The results are transformed from a frame with its origin in the centre of mass of the total Earth's system (CM) into a frame with its origin in the centre of mass of the solid Earth (CE) before comparison with 13 global GIA forward model solutions, with best fits with Pur-6-VM5 and ICE-6G predictions. The largest discrepancies for all models were identified for Antarctica and Greenland, which may be due to either uncertain mantle rheology, ice loading history/magnitude and/or GPS errors.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the rapid increase of large-scale datasets, biomedical data visualization is facing challenges. The data may be large, have different orders of magnitude, contain extreme values, and the data distribution is not clear. Here we present an R package ggbreak that allows users to create broken axes using ggplot2 syntax. It can effectively use the plotting area to deal with large datasets (especially long sequential data), data with different magnitudes, and data that contain outliers. The ggbreak package increases the available visual space for a better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The ggbreak package is fully compatible with ggplot2, and it is easy to superpose additional layers and apply scales and themes to adjust the plot using the ggplot2 syntax. The ggbreak package is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and Github (https://github.com/YuLab-SMU/ggbreak).
https://creativecommons.org/publicdomain/zero/1.0/
🚗 Cleaned Drivers License Dataset
"This dataset is designed for those who believe in building smart, ethical AI models that can assist in real-world decision-making like licensing, risk assessment, and more."
📂 Dataset Overview
This dataset is a cleaned and preprocessed version of a drivers license dataset containing details like:
Age Group
Gender
Reaction Time
Driving Skill Level
Training Received
License Qualification Status
And many more: 20 columns in total, so you can train a strong ML model with this dataset.
The raw dataset contained missing values, categorical anomalies, and inconsistencies which have been fixed to make this dataset ML-ready.
💡 Key Highlights
✅ Cleaned missing values with intelligent imputation
🧠 Encoded categorical columns with appropriate techniques (OneHot, Ordinal)
🔍 Removed outliers using statistical logic
🔧 Feature engineered to preserve semantic meaning
💾 Ready-to-use for classification tasks (e.g., Predicting who qualifies for a license)
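A minimal scikit-learn sketch of the kind of preprocessing listed above (imputation plus one-hot and ordinal encoding) wired to a classifier for the Qualified target; the file name, category orders, and model choice are assumptions based on the column table below.

```python
# Minimal sketch: impute, encode, and classify whether a person qualifies.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("drivers_license_cleaned.csv")      # hypothetical file name
X, y = df.drop(columns="Qualified"), df["Qualified"]

nominal = ["Gender", "Race"]
ordinal = ["Age Group", "Reactions", "Training"]
orders = [["Teen", "Adult", "Senior"], ["Slow", "Average", "Fast"], ["Basic", "Advanced"]]

pre = ColumnTransformer([
    ("nom", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), nominal),
    ("ord", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("enc", OrdinalEncoder(categories=orders))]), ordinal),
])
model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
print("Held-out accuracy:", model.score(X_te, y_te))
```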
📊 Columns Description
| Column Name | Description |
|-------------|-------------|
| Gender | Gender of the individual (Male, Female) |
| Age Group | Age segment (Teen, Adult, Senior) |
| Race | Ethnicity of the driver |
| Reactions | Reaction speed categorized (Fast, Average, Slow) |
| Training | Training received (Basic, Advanced) |
| Driving Skill | Skill level (Expert, Beginner, etc.) |
| Qualified | Whether the person qualified for a license (Yes, No) |
🤖 Perfect For
📚 Machine Learning (Classification)
📊 Exploratory Data Analysis (EDA)
📉 Feature Engineering Practice
🧪 Model Evaluation & Experimentation
🚥 AI for Transport & Safety Projects
🏷️ Tags
📌 Author Notes
This dataset is part of a data cleaning and feature engineering project. One column is intentionally left unprocessed for developers to test their own pipeline or transformation strategies 😎.
🔗 Citation
If you use this dataset in your projects, notebooks, or blog posts — feel free to tag me or credit the original dataset and this cleaned version.
If you use this dataset in your work, please cite it as:
Divyanshu_Coder, 2025. Cleaned Driver License Dataset. Kaggle.
Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this "k-shrink" problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset. All the raw data are obtained from other publications, as shown in the table below; we further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.
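For illustration only, the brute-force sketch below evaluates the k-shrink objective (find the k leaves whose removal reduces the tree diameter the most) on a toy leaf-to-leaf path-length matrix; TreeShrink itself solves this in polynomial time, which this sketch does not attempt.

```python
# Brute-force illustration of the k-shrink objective on toy data.
from itertools import combinations

def diameter(dist, leaves):
    """Tree diameter restricted to a subset of leaves (max pairwise path length)."""
    return max((dist[a][b] for a, b in combinations(leaves, 2)), default=0.0)

def k_shrink_bruteforce(dist, k):
    leaves = list(dist)
    best = min(combinations(leaves, len(leaves) - k),
               key=lambda kept: diameter(dist, kept))
    removed = set(leaves) - set(best)
    return removed, diameter(dist, best)

# Toy path-length matrix with one long-branch leaf "E" inflating the diameter.
dist = {
    "A": {"B": 2, "C": 3, "D": 4, "E": 12},
    "B": {"A": 2, "C": 3, "D": 4, "E": 12},
    "C": {"A": 3, "B": 3, "D": 3, "E": 11},
    "D": {"A": 4, "B": 4, "C": 3, "E": 10},
    "E": {"A": 12, "B": 12, "C": 11, "D": 10},
}
print(k_shrink_bruteforce(dist, k=1))   # removing "E" shrinks the diameter to 4
```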
| Dataset | Species | Genes | Download |
|---------|---------|-------|----------|
| Plants | 104 | 852 | DOI 10.1186/2047-217X-3-17 |
| Mammals | 37 | 424 | DOI 10.13012/C5BG2KWG |
| Insects | 144 | 1478 | http://esayyari.github.io/InsectsData |
| Cannon | 78 | 213 | DOI 10.5061/dryad.493b7 |
| Rouse | 26 | 393 | DOI 10.5061/dryad.79dq1 |
| Frogs | 164 | 95 | DOI 10.5061/dryad.12546.2 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is updated more frequently and can be visualized on NCWQR's data portal.
If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.
The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.
Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.
At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.
Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.
2017 Ohio EPA Project Study Plan and Quality Assurance Plan
Data quality control and data screening
The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.
This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.
A note on detection limits and zero and negative concentrations
It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.
The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “less than the detection limit.”
Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.
For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
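A minimal pandas sketch of how a user might apply this convention: treat -9 and -1 as missing-value codes while keeping zero and negative concentrations as valid results. The file name and column names are hypothetical.

```python
# Minimal sketch: convert missing-value codes to NaN, keep true negatives/zeros.
import numpy as np
import pandas as pd

river = pd.read_csv("maumee_river_data.csv")                 # hypothetical export
analytes = ["TP", "DRP", "SS", "TKN", "NH4", "NO23", "Cl", "F", "SO4"]

river[analytes] = river[analytes].replace({-9: np.nan, -1: np.nan})
means = river[analytes].mean()        # negatives are retained, so the mean is unbiased
```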
Analyte Detection Limits
https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024
For more information, please visit https://ncwqr.org/