36 datasets found
  1. Data from: Outlier classification using autoencoders: application for...

    • osti.gov
    Updated Jun 2, 2021
    + more versions
    Cite
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1882649-outlier-classification-using-autoencoders-application-fluctuation-driven-flows-fusion-plasmas
    Explore at:
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    United States Department of Energy (http://energy.gov/)
    Office of Science (http://www.er.doe.gov/)
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Authors
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B.
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allows us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid or invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that most characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
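    The description above amounts to a general recipe: train an autoencoder on clearly valid samples, embed everything into its latent space, and let a standard vectorial classifier label the ambiguous samples there. The snippet below is a minimal, hypothetical sketch of that recipe only; the layer sizes, the k-NN classifier, and the random stand-in arrays are illustrative assumptions, not details of the authors' pipeline.

```python
# Hypothetical sketch: latent-space classification of ambiguous samples with an
# autoencoder. Array names, shapes, and the classifier are placeholders.
import numpy as np
from tensorflow.keras import layers, Model
from sklearn.neighbors import KNeighborsClassifier

n_features, latent_dim = 64, 8

# Toy stand-ins for data samples (replace with real feature vectors).
x_valid = np.random.rand(1000, n_features).astype("float32")     # clearly valid
x_invalid = np.random.rand(200, n_features).astype("float32")    # clearly invalid
x_ambiguous = np.random.rand(300, n_features).astype("float32")  # to be classified

# Autoencoder trained on valid samples only, so the latent space
# captures the features that characterize valid data.
inputs = layers.Input(shape=(n_features,))
z = layers.Dense(32, activation="relu")(inputs)
z = layers.Dense(latent_dim, activation="relu", name="latent")(z)
out = layers.Dense(32, activation="relu")(z)
out = layers.Dense(n_features, activation="linear")(out)
autoencoder = Model(inputs, out)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_valid, x_valid, epochs=20, batch_size=64, verbose=0)

# The encoder maps any sample into the learned low-dimensional space.
encoder = Model(inputs, autoencoder.get_layer("latent").output)

# A standard vectorial classifier separates valid from invalid in latent space,
# then labels the ambiguous samples without hand-tuned threshold rules.
z_train = encoder.predict(np.vstack([x_valid, x_invalid]), verbose=0)
y_train = np.r_[np.ones(len(x_valid)), np.zeros(len(x_invalid))]
clf = KNeighborsClassifier(n_neighbors=5).fit(z_train, y_train)
labels = clf.predict(encoder.predict(x_ambiguous, verbose=0))
```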

  2. COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam

    • microdata.worldbank.org
    • catalog.ihsn.org
    Updated Oct 26, 2023
    + more versions
    Cite
    World Bank (2023). COVID-19 High Frequency Phone Survey of Households 2020, Round 2 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/4061
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    World Bank Group (http://www.worldbank.org/)
    Authors
    World Bank
    Time period covered
    2020
    Area covered
    Vietnam
    Description

    Geographic coverage

    National, regional

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The 2020 Vietnam COVID-19 High Frequency Phone Survey of Households (VHFPS) uses a nationally representative household survey from 2018 as its sampling frame. The 2018 baseline survey includes 46,980 households from 3,132 communes (about 25% of all communes in Vietnam). In each commune, one enumeration area (EA) is randomly selected, and 15 households are then randomly selected in each EA for interview. The large module was used to select the households for official interview in the VHFPS survey, with the small-module households held in reserve for replacement. After data processing, the final sample size for Round 2 is 3,935 households.

    Mode of data collection

    Computer Assisted Telephone Interview [cati]

    Research instrument

    The questionnaire for Round 2 consisted of the following sections:

    • Section 2. Behavior
    • Section 3. Health
    • Section 5. Employment (main respondent)
    • Section 6. Coping
    • Section 7. Safety Nets
    • Section 8. FIES

    Cleaning operations

    Data cleaning began during the data collection process. Inputs for the cleaning process included the interviewers' notes following each question item, the interviewers' notes at the end of the tablet form, and the supervisors' notes taken during monitoring. The data cleaning process was conducted in the following steps:

    • Append households interviewed in ethnic minority languages to the main dataset of interviews conducted in Vietnamese.
    • Remove unnecessary variables that were automatically calculated by SurveyCTO.
    • Remove household duplicates where the same form was submitted more than once.
    • Remove observations for households that were not supposed to be interviewed according to the identified replacement procedure.
    • Format variables according to their object type (string, integer, decimal, etc.).
    • Read through the interviewers' notes and make adjustments accordingly. During interviews, whenever interviewers found it difficult to choose a correct code, they were advised to choose the most appropriate one and write down the respondent's answer in detail so that the survey management team could decide which code best fit the answer.
    • Correct data based on supervisors' notes where enumerators entered the wrong code.
    • Recode the answer option "Other, please specify". This option is usually followed by a blank line allowing enumerators to type or write text specifying the answer. The data cleaning team checked these answers thoroughly to decide whether each one needed recoding into one of the available categories or should be kept as originally recorded. In some cases an answer was assigned a completely new code if it appeared many times in the survey dataset.
    • Examine the accuracy of outlier values, defined as values lying outside the 5th-95th percentile range, by listening to interview recordings (a minimal sketch of this percentile check follows below).
    • Run a final check on matching the main dataset with the sections where information is collected at the individual level; these are kept in separate data files in long form.
    • Label variables using the full question text.
    • Label variable values where necessary.
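    The percentile check mentioned in the outlier bullet above can be reproduced in a few lines of pandas. The sketch below is hypothetical: the DataFrame, column name, and the use of exactly the 5th/95th percentiles as review bounds are stand-ins; the survey's actual cleaning scripts are not distributed with this listing.

```python
# Hypothetical sketch: flag values outside the 5th-95th percentile range of a
# numeric survey variable for manual review (e.g., against interview recordings).
import pandas as pd

def flag_percentile_outliers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    lo, hi = df[col].quantile([0.05, 0.95])
    out = df.copy()
    out[f"{col}_review"] = (out[col] < lo) | (out[col] > hi)
    return out

# Example with made-up data; real use would load the cleaned survey file.
data = pd.DataFrame({"hours_worked": [0, 12, 35, 40, 43, 45, 48, 50, 60, 120]})
flagged = flag_percentile_outliers(data, "hours_worked")
print(flagged[flagged["hours_worked_review"]])
```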

  3. Data from: PCP-SAFT Parameters of Pure Substances Using Large Experimental...

    • acs.figshare.com
    zip
    Updated Sep 6, 2023
    Cite
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross (2023). PCP-SAFT Parameters of Pure Substances Using Large Experimental Databases [Dataset]. http://doi.org/10.1021/acs.iecr.3c02255.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
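    As a rough illustration of the outlier screening described above, the sketch below fits one common (3,6) form of the Wagner vapor-pressure equation with a Huber loss via scipy. The critical constants, starting coefficients, and toy data are illustrative assumptions (loosely based on water); this is not the authors' actual workflow or parameter set.

```python
# Hypothetical sketch: robust fit of a Wagner-type vapor-pressure correlation
# with a Huber loss, in the spirit of the outlier screening described above.
import numpy as np
from scipy.optimize import least_squares

Tc, pc = 647.1, 22.064e6          # critical point (water, as an example)
T = np.array([300.0, 320.0, 350.0, 380.0, 420.0, 470.0, 520.0, 570.0, 620.0])
tau = 1.0 - T / Tc
p_exp = pc * np.exp((Tc / T) * (-7.86 * tau + 1.84 * tau**1.5
                                - 11.8 * tau**3 + 22.7 * tau**6))
p_exp[3] *= 3.0                    # inject one gross outlier

def residuals(params, T, p):
    a, b, c, d = params
    t = 1.0 - T / Tc
    ln_pr = (Tc / T) * (a * t + b * t**1.5 + c * t**3 + d * t**6)
    return ln_pr - np.log(p / pc)  # residuals in ln(p/pc)

# The Huber loss down-weights large residuals, so the outlier barely shifts the
# fit; points with large final residuals can then be flagged and removed.
fit = least_squares(residuals, x0=[-7.0, 1.5, -10.0, 20.0],
                    args=(T, p_exp), loss="huber", f_scale=0.05)
resid = residuals(fit.x, T, p_exp)
print("suspected outliers:", np.where(np.abs(resid) > 3 * np.median(np.abs(resid)))[0])
```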

  4. Credit Approval (Mixed Attributes)

    • kaggle.com
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Credit Approval (Mixed Attributes) [Dataset]. https://www.kaggle.com/datasets/thedevastator/improving-credit-approval-with-mixed-attributes/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Credit Approval (Mixed Attributes)

    Continuous and Categorical Features

    By UCI [source]

    About this dataset

    This dataset explores the phenomenon of credit card application acceptance or rejection. It includes a range of both continuous and categorical attributes, such as the applicant's gender, credit score, and income, as well as details about recent credit card activity including balance transfers and delinquency. This data presents a unique opportunity to investigate how these different attributes interact in determining application status. With careful analysis of this dataset, we can gain valuable insights into what factors help ensure a successful application outcome. This could lead to the development of more effective strategies for predicting and improving financial credit access for everyone.

    How to use the dataset

    This dataset is an excellent resource for researching the credit approval process, as it provides a variety of attributes from both continuous and categorical sources. The aim of this guide is to provide tips and advice on how to make the most of this dataset (see the starter sketch after this list).

    - Understand the data: Before attempting to work with this dataset, it's important to understand what kind of information it contains. Since there is a mix of continuous and categorical attributes, make sure you familiarise yourself with all the different columns before proceeding further.
    - Exploratory analysis: Conduct some exploratory analysis to gain an overall understanding of the data's characteristics and distributions. By investigating things like missing values and correlations between different independent variables (IVs) or dependent variables (DVs), you can better prepare yourself for meaningful analyses or predictions in later steps.
    - Data cleaning: Once you are familiar with your data, clean up any discrepancies such as missing values or outliers by replacing them appropriately or removing them from the dataset if necessary.
    - Feature selection/engineering: After cleaning, feature selection or engineering may be necessary if certain columns are redundant or not useful for the models or analyses you plan to build (usually apparent after exploratory analysis). Be mindful when deciding which features to remove so that no information about potentially important relationships is lost.
    - Model building/analysis: With the data pre-processed appropriately, you can move forward with developing your desired models and analyses over the transformed dataset.
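    As a concrete, hypothetical starting point for the workflow above, the snippet below loads crx.data.csv with pandas and fits a simple scikit-learn pipeline. The assumptions (no header row, '?' as the missing-value marker, the last column holding the approval status, and logistic regression as the model) should be checked against the actual download.

```python
# Hypothetical starter sketch for the workflow above; check the file layout
# assumptions against the real crx.data.csv before relying on them.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("crx.data.csv", header=None, na_values="?")
X, y = df.iloc[:, :-1], df.iloc[:, -1]

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```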

    Research Ideas

    • Developing predictive models to identify customers who are likely to default on their credit card payments.
    • Creating a risk analysis system that can identify customers who pose a higher risk for fraud or misuse of their credit cards.
    • Developing an automated lending decision system that can use the data points provided in the dataset (i.e., gender, average monthly balance, etc.) to decide whether or not to approve applications for new credit lines and loans

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: crx.data.csv

    | Column name | Description |
    |:------------|:------------|
    | b | Gender (Categorical) |
    | 30.83 | Average Monthly Balance (Continuous) |
    | 0 | Number of Months Since Applicant's Last Delinquency (Continuous) |
    | w | Number of Months Since Applicant's Last Credit Card Approval (Continuous) |
    | 1.25 | Number of Months Since the Applicant's Last Balance Increase (Continuous) |
    | ... | ... |

  5. Data from: Analyzing contentious relationships and outlier genes in...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 5, 2018
    Cite
    Joseph F. Walker; Joseph W. Brown; Stephen A. Smith (2018). Analyzing contentious relationships and outlier genes in phylogenomics [Dataset]. http://doi.org/10.5061/dryad.br381mg
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 5, 2018
    Dataset provided by
    University of Sheffield
    University of Michigan
    Authors
    Joseph F. Walker; Joseph W. Brown; Stephen A. Smith
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Recent studies have demonstrated that conflict is common among gene trees in phylogenomic studies, and that less than one percent of genes may ultimately drive species tree inference in supermatrix analyses. Here, we examined two datasets where supermatrix and coalescent-based species trees conflict. We identified two highly influential “outlier” genes in each dataset. When removed from each dataset, the inferred supermatrix trees matched the topologies obtained from coalescent analyses. We also demonstrate that, while the outlier genes in the vertebrate dataset have been shown in a previous study to be the result of errors in orthology detection, the outlier genes from a plant dataset did not exhibit any obvious systematic error and therefore may be the result of some biological process yet to be determined. While topological comparisons among a small set of alternate topologies can be helpful in discovering outlier genes, they can be limited in several ways, such as assuming all genes share the same topology. Coalescent species tree methods relax this assumption but do not explicitly facilitate the examination of specific edges. Coalescent methods often also assume that conflict is the result of incomplete lineage sorting (ILS). Here we explored a framework that allows for quickly examining alternative edges and support for large phylogenomic datasets that does not assume a single topology for all genes. For both datasets, these analyses provided detailed results confirming the support for coalescent-based topologies. This framework suggests that we can improve our understanding of the underlying signal in phylogenomic datasets by asking more targeted edge-based questions.

  6. Controlled Anomalies Time Series (CATS) Dataset

    • zenodo.org
    bin
    Updated Jul 12, 2024
    + more versions
    Cite
    Patrick Fleith; Patrick Fleith (2024). Controlled Anomalies Time Series (CATS) Dataset [Dataset]. http://doi.org/10.5281/zenodo.7646897
    Explore at:
    Available download formats: bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Solenix Engineering GmbH
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Controlled Anomalies Time Series (CATS) Dataset consists of commands, external stimuli, and telemetry readings of a simulated complex dynamical system with 200 injected anomalies.

    The CATS Dataset exhibits a set of desirable properties that make it very suitable for benchmarking Anomaly Detection Algorithms in Multivariate Time Series [1]:

    • Multivariate (17 variables), including sensor readings and control signals. It simulates the operational behaviour of an arbitrary complex system, including:
      • 4 Deliberate Actuations / Control Commands sent by a simulated operator / controller, for instance, commands of an operator to turn ON/OFF some equipment.
      • 3 Environmental Stimuli / External Forces acting on the system and affecting its behaviour, for instance, the wind affecting the orientation of a large ground antenna.
      • 10 Telemetry Readings representing the observable states of the complex system by means of sensors, for instance, a position, a temperature, a pressure, a voltage, current, humidity, velocity, acceleration, etc.
    • 5 million timestamps. Sensor readings are at a 1 Hz sampling frequency.
      • 1 million nominal observations (the first 1 million datapoints). This is suitable for starting to learn the "normal" behaviour.
      • 4 million observations that include both nominal and anomalous segments. This is suitable for evaluating both semi-supervised approaches (novelty detection) and unsupervised approaches (outlier detection); a minimal sketch follows after this list.
    • 200 anomalous segments. One anomalous segment may contain several successive anomalous observations / timestamps. Only the last 4 million observations contain anomalous segments.
    • Different types of anomalies to understand what anomaly types can be detected by different approaches.
    • Fine control over ground truth. As this is a simulated system with deliberate anomaly injection, the start and end time of the anomalous behaviour is known very precisely. In contrast to real world datasets, there is no risk that the ground truth contains mislabelled segments which is often the case for real data.
    • Obvious anomalies. The simulated anomalies have been designed to be "easy" to detect by the human eye (i.e., there are very large spikes or oscillations), and hence detectable by most algorithms. This makes the synthetic dataset useful for screening tasks (i.e., to eliminate algorithms that are not capable of detecting these obvious anomalies). However, during our initial experiments, the dataset turned out to be challenging enough even for state-of-the-art anomaly detection approaches, making it suitable also for regular benchmark studies.
    • Context provided. Some variables can only be considered anomalous in relation to other behaviours. A typical example consists of a light and switch pair. The light being either on or off is nominal, the same goes for the switch, but having the switch on and the light off shall be considered anomalous. In the CATS dataset, users can choose (or not) to use the available context, and external stimuli, to test the usefulness of the context for detecting anomalies in this simulation.
    • Pure signal ideal for robustness-to-noise analysis. The simulated signals are provided without noise: while this may seem unrealistic at first, it is an advantage since users of the dataset can decide to add on top of the provided series any type of noise and choose an amplitude. This makes it well suited to test how sensitive and robust detection algorithms are against various levels of noise.
    • No missing data. You can drop whatever data you want to assess the impact of missing values on your detector with respect to a clean baseline.
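    Following the forward reference in the list above, here is a minimal, hypothetical baseline that uses the first 1 million nominal timestamps for training and scores the remaining observations. The file name, column layout, and the choice of IsolationForest are assumptions, not part of the dataset's documentation.

```python
# Hypothetical sketch: fit a simple anomaly-scoring baseline on the nominal-only
# segment (first 1 million rows), then score the remaining 4 million rows.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_parquet("cats.parquet")          # placeholder path for the download
signals = df.columns[:17]                      # commands + stimuli + telemetry (assumed order)

train = df.iloc[:1_000_000][signals]           # nominal-only segment
test = df.iloc[1_000_000:][signals]            # nominal + anomalous segments

model = IsolationForest(n_estimators=200, random_state=0).fit(train)
scores = -model.score_samples(test)            # higher = more anomalous
df.loc[df.index[1_000_000:], "anomaly_score"] = scores
```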

    [1] Example Benchmark of Anomaly Detection in Time Series: “Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. Anomaly Detection in Time Series: A Comprehensive Evaluation. PVLDB, 15(9): 1779 - 1797, 2022. doi:10.14778/3538598.3538602”

    About Solenix

    Solenix is an international company providing software engineering, consulting services and software products for the space market. Solenix is a dynamic company that brings innovative technologies and concepts to the aerospace market, keeping up to date with technical advancements and actively promoting spin-in and spin-out technology activities. We combine modern solutions which complement conventional practices. We aspire to achieve maximum customer satisfaction by fostering collaboration, constructivism, and flexibility.

  7. Data from: Nonlinear regression models for heterogeneous data with massive...

    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Yoonsuh Jung (2023). Nonlinear regression models for heterogeneous data with massive outliers [Dataset]. http://doi.org/10.6084/m9.figshare.7398524.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Yoonsuh Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Income- or expenditure-related data sets are often nonlinear, heteroscedastic, and skewed even after transformation, and they contain numerous outliers. We propose a class of robust nonlinear models that treat outlying observations effectively without removing them. For this purpose, case-specific parameters and a related penalty are employed to detect and modify the outliers systematically. We show how existing nonlinear models such as smoothing splines and generalized additive models can be robustified by the case-specific parameters. Next, we extend the proposed methods to heterogeneous models by incorporating unequal weights. The details of estimating the weights are provided. Two real data sets and simulated data sets show the potential of the proposed methods when the nature of the data is nonlinear with outlying observations.
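    The case-specific-parameter penalty proposed in the paper is not reproduced here. As a rough, off-the-shelf stand-in, the sketch below combines a spline basis with a Huber-loss regressor in scikit-learn to show robust nonlinear fitting on data with gross outliers; all names and settings are illustrative.

```python
# Hypothetical stand-in, not the paper's method: Huber-loss regression on a
# spline basis gives a feel for robust nonlinear fitting with massive outliers.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + 0.1 * x + rng.normal(0, 0.2, x.size)
y[rng.choice(x.size, 15, replace=False)] += rng.normal(8, 2, 15)  # gross outliers

robust_fit = make_pipeline(
    SplineTransformer(n_knots=8, degree=3),   # flexible nonlinear basis
    HuberRegressor(epsilon=1.35),             # down-weights outlying cases
).fit(x.reshape(-1, 1), y)

y_hat = robust_fit.predict(x.reshape(-1, 1))  # largely unaffected by the outliers
```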

  8. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/3
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement. Problem Definition: The project uses image analysis methods to address the issue of automating the identification of diseases, including Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing. Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and damaged specimens.

    Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA): The dataset was examined using visuals such as scatter plots and histograms and was checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology. Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered.

    Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.

    6. Model Development: The CNN model's architecture consists of layers, units, and activation operations. Hyperparameters, including the learning rate, batch size, and optimizer, were chosen on the basis of experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used. A minimal sketch of such a model follows.
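    The Keras sketch below is hypothetical; the directory layout, image size, layer sizes, and hyperparameters are illustrative choices, not the project's actual model.

```python
# Hypothetical sketch of a small image classifier of the kind described above.
import tensorflow as tf
from tensorflow.keras import layers, models

train_ds = tf.keras.utils.image_dataset_from_directory(
    "cucumber/train", image_size=(128, 128), batch_size=32)   # placeholder paths
val_ds = tf.keras.utils.image_dataset_from_directory(
    "cucumber/val", image_size=(128, 128), batch_size=32)

num_classes = len(train_ds.class_names)
model = models.Sequential([
    layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.3),                     # regularization against overfitting
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```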

    7. Model Training: During training, the model was fed the prepared dataset across a number of epochs, and the loss function was minimized using an optimization method. Early stopping and model checkpoints were used to ensure convergence.

    8. Model Evaluation. Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and the confusion matrix were used to assess model performance. Results were computed for both the training and test datasets.

    Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants, and the strengths and weaknesses of the model were identified.

    9. Results and Discussion: Key project findings include model performance and disease detection precision, a comparison of the models employed (showing the benefits and drawbacks of each), and the challenges faced throughout the project together with the methods used to solve them.

    10. Conclusion: A recap of the project's key learnings, highlighting the project's importance for early disease detection in agriculture. Future enhancements and potential research directions are suggested.

    11. References. Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository: https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

  9. Density-based outlier scoring on Kepler data - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 23, 2024
    Cite
    (2024). Density-based outlier scoring on Kepler data - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/049456b7-7080-5ff0-a5ff-bbb6180c4120
    Explore at:
    Dataset updated
    Apr 23, 2024
    Description

    In the present era of large-scale surveys, big data present new challenges to the discovery process for anomalous data. Such data can be indicative of systematic errors, extreme (or rare) forms of known phenomena, or, most interestingly, truly novel phenomena that exhibit as-yet unobserved behaviours. In this work, we present an outlier scoring methodology to identify and characterize the most promising unusual sources and so facilitate discoveries of such anomalous data. We have developed a data mining method based on k-nearest-neighbour distance in feature space to efficiently identify the most anomalous light curves. We test variations of this method, including using principal components of the feature space, removing select features, varying the choice of k, and scoring on subset samples. We evaluate the performance of our scoring on known object classes and find that it consistently scores rare (<1000 members) object classes higher than common classes. We have applied the scoring to all long-cadence light curves of Quarters 1-17 of Kepler's prime mission and present outlier scores for all 2.8 million light curves of the roughly 200k objects.
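    The core of the method described above (a k-nearest-neighbour distance score in a standardized feature space) can be sketched as follows; the random feature matrix and the choice of k are placeholders, not the study's actual light-curve features or settings.

```python
# Hypothetical sketch of a k-nearest-neighbour distance outlier score over a
# light-curve feature table; the feature matrix here is random stand-in data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

features = np.random.rand(10_000, 20)          # rows = light curves, cols = features
X = StandardScaler().fit_transform(features)   # put features on a common scale

k = 20
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)                     # first neighbour is the point itself
outlier_score = dist[:, 1:].mean(axis=1)       # mean k-NN distance in feature space

most_anomalous = np.argsort(outlier_score)[::-1][:100]  # top-scoring light curves
```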

  10. nuts-STeauRY dataset: hydrochemical and catchment characteristics dataset...

    • data.niaid.nih.gov
    Updated Jul 6, 2024
    Cite
    Casquin, Antoine (2024). nuts-STeauRY dataset: hydrochemical and catchment characteristics dataset for large sample studies of Carbon, Nitrogen, Phosphorus and Silicon in french watercourses [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10830851
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    Casquin, Antoine
    Thieu, Vincent
    Silvestre, Marie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    nuts-STeauRY dataset: hydrochemical and catchment characteristics dataset for large sample studies of Carbon, Nitrogen, Phosphorus and Silicon in French watercourses

    Antoine Casquin, Marie Silvestre, Vincent Thieu

    10.5281/zenodo.10830852

    v0.1, 18th March 2024

    Brief overview of data:

    · Carbon and nutrients data for 5470 continental French catchments

    · Modelled discharge for 5128 of the 5470 catchments

    · Geopackages with catchment delineations and outlets

    · DEM conditioned to delimit additional catchments

    · Land-use and climatic data for 5470 continental French catchments

    Citation of this work

    A data paper with details of the methods and results is currently being submitted. Once published, it will be the preferred source to cite. The data paper will be linked to the new version of the dataset, which will be updated at doi.org/10.5281/zenodo.10830852. If you use this dataset in your research or a report, you must cite it.

    Motivations

    Data was collected and curated for the nuts-STeauRY project (http://nuts-steaury.cnrs.fr), which deployed a national generic land to sea modelling chain.

    Data was primarily used (see related works):

    To calibrate concentrations of dissolved organic carbon and dissolved silica in headwaters

    To validate spatially and temporally the modelling chain (DOC, NO3-, NH4+, TP, SRP, DSi)

    Hydrochemical large-sample datasets have numerous other uses: trend computation, elucidation of transfer mechanisms, machine learning, retrospective studies, etc.

    The objective here is to provide a large-sample curated dataset of carbon and nutrient concentrations, along with modelled discharges, catchment characteristics and delineations, for continental France. Such a large-sample dataset aims to ease large-sample studies over France and/or Europe. Although part of the data gathered here is obtainable via public sources, the catchment delineations, their characteristics and the modelled hydrology were not publicly available until now. Moreover, units were unified and outliers were detected and removed in the carbon and nutrient data.

    Data sources & processing

    Sampling points were snapped onto the CCM database v2.1 (http://data.europa.eu/89h/fe1878e8-7541-4c66-8453-afdae7469221) (Vogt et al., 2007), and catchments were delineated using a 100 m resolution Digital Elevation Model (DEM) conditioned by the hydrographic network and the elementary catchment delineations of the CCM data v2.1. More than 6000 catchments were delineated and screened manually to check consistency: 5470 were retained.

    Nutrient data was collected mainly through the Naiades portal (https://naiades.eaufrance.fr/), a database collecting water quality data produced by different water-related actors across France. Nutrient data was also collected directly from regional water agencies (https://www.eau-seine-normandie.fr/, https://eau-grandsudouest.fr/, https://www.eaurmc.fr/, https://www.eau-artois-picardie.fr/, https://www.eau-rhin-meuse.fr/ and https://agence.eau-loire-bretagne.fr/home.html), and pre-processed using a database management system relying on PostgreSQL with the PostGIS extension (Thieu & Silvestre, 2015). A three-pass strategy was used to curate the raw carbon and nutrient data: 1. removal of "obvious outliers"; 2. detection of baseline changes and correction where possible (or removal of the data); 3. removal of outliers using a quantile-based approach per element and temporal series (a minimal sketch of this third pass follows below).
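    The third curation pass mentioned above can be sketched as a per-station, per-parameter quantile screen. The bounds below are illustrative, not the project's exact thresholds, and the column names follow the CNPSi.csv attribute description given further down.

```python
# Hypothetical sketch of a quantile-based outlier screen applied per parameter
# ("var") and per station ("sta_code") time series.
import pandas as pd

def quantile_screen(df: pd.DataFrame, lo: float = 0.005, hi: float = 0.995) -> pd.DataFrame:
    def keep(group: pd.DataFrame) -> pd.DataFrame:
        qlo, qhi = group["value"].quantile([lo, hi])
        return group[(group["value"] >= qlo) & (group["value"] <= qhi)]
    return df.groupby(["sta_code", "var"], group_keys=False).apply(keep)

raw = pd.read_csv("CNPSi.csv", parse_dates=["date"])
curated = quantile_screen(raw)
```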

    Hydrological time series are interpolated through hydrograph transfer (de Lavenne et al., 2023) from 1664 discharge time series completed with the GR4J model (Pelletier & Andréassian, 2020; Pelletier, 2021).

    Land cover data was extracted from Corine Land Cover dataset for years 2000, 2006, 2012, and 2018 (EEA, 2020). Raw CLC typology contains 44 classes. Results of percent cover per year per class were computed for each catchment. An aggregated typology of 8 classes is also proposed.

    Climatological data was extracted from daily reconstructions over Europe at 5 arcmin resolution for temperature and 1 arcmin for precipitation (Thiemig et al., 2022). Catchment means of daily minimum and maximum temperature and of precipitation were computed for each catchment for the 1990-2019 period.

    Nuts-STeauRY dataset

    Carbon and nutrients time series

    Time series of carbon and nutrients within the 1962-2019 period at 5470 stations: Dissolved Organic Carbon (DOC), Total Organic Carbon (TOC), Nitrates (NO3-), Nitrites (NO2-), Ammonia (NH4+), Soluble Reactive Phosphorus (SRP), Total Phosphorus (TP) and Dissolved Silica (DSi).

    | var | n_unique_station | n_total_meas | mean_duration_y | mean_frequency_y |
    |:----|-----------------:|-------------:|----------------:|-----------------:|
    | DOC | 4 992 | 658 147 | 14.3 | 9.0 |
    | DSi | 3 299 | 333 866 | 12.9 | 8.3 |
    | NH4 | 5 318 | 907 343 | 19.3 | 8.7 |
    | NO2 | 5 264 | 891 886 | 19.2 | 8.6 |
    | NO3 | 5 465 | 939 279 | 19.0 | 9.0 |
    | SRP | 5 361 | 910 107 | 19.1 | 8.7 |
    | TOC | 935 | 111 993 | 13.6 | 9.6 |
    | TP | 5 199 | 802 841 | 17.1 | 8.8 |

    Note that some SRP and DSi measurements were declared as being performed on raw water. A thorough analysis of the time series shows no evidence of a difference in baselines. For more accuracy, it is advised to filter out those analyses using the "fraction" attribute of each measurement.

    Discharge modelled daily time series

    Modelled naturalized discharge through hydrograph transfer and interpolated measured discharges when available for the 1980-2019 period.

    A daily discharge was computed for 5128 catchments. For small catchments (< 1000 km2, n = 4530), hydrograph transfer was used, while for big catchments a direct interpolation of measured/completed discharges was performed. The direct interpolation was only possible for 598 catchments > 1000 km2. The criterion retained for direct interpolation is 0.8 × area_discharge_station < area_quality < 1.2 × area_discharge_station where the discharge and quality stations were nested.

    Hydrological time series uncertainties vary considerably depending on the quality of the data source, the distance from pseudo-gauged outlets, the land cover of the catchments, the natural spatial and temporal variability of discharge, and the size of the catchment (de Lavenne et al., 2016). We advise cautious use of these modelled discharges, as uncertainties could not be computed.

    Catchments, outlets and conditioned DEM

    5470 catchments and outlets are delivered as geopackages (EPSG: 3035).

    The DEM, conditioned by CCM 2.1, is also delivered as a GeoTIFF (EPSG: 3035) as a way to delimit new catchments for the area that are consistent with the dataset.

    Catchments characteristics and climate

    Refer to Data sources & processing and File descriptions.

    File and attributes descriptions:

    The key “sta_code” is present across all files. For time varying records, “date” can be a secondary key.

    Description of CNPSi.csv data attributes

    Each line is a couple measurement/parameter/station

    · sta_code: Code of the station in the Sandre referentiel (public french "dataverse" for water data)

    · sta_name: Name of the station in the Sandre referentiel (public french "dataverse" for water data)

    · var: Abbreviation of parameter name

    · fraction: "water_filtrated" or "water_raw"

    · date: date of sampling

    · hour: hour of sampling

    · value: analytical result (concentration)

    · provider: provider of the data

    · producer: producer of the data

    · from_db: "Naiades2022" (https://naiades.eaufrance.fr/france-entiere#/ dump from 2022) or "DoNuts" (Thieu, V., Silvestre, M., 2015. DoNuts: un système d’information sur les observations environnementales. Présentation Séminaire UMR Métis)

    · n_meas: number of observations for a given parameter / station

    · unit: unit of concentration

    · element: "C" "N" "P" or "Si"

    · year: year of observation

    · month: month of observation

    · day: day of observation

    · julian_day: julian day observation (1-366)

    · decade: decade of observation (one of "1961-1970", "1971-1980", "1981-1990", "1991-2000", "2001-2010", "2011-2020")
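    A small, hypothetical example of using the documented keys: filter the concentration table on the "fraction" attribute (as advised above for SRP and DSi) and join the station-level statistics via the shared "sta_code" and "var" keys.

```python
# Hypothetical sketch: apply the recommended "fraction" filter and join
# CNPSi.csv with CNPSi_stats.csv on the documented keys.
import pandas as pd

cnpsi = pd.read_csv("CNPSi.csv", parse_dates=["date"])
stats = pd.read_csv("CNPSi_stats.csv")

# Drop SRP and DSi analyses declared on raw water, keeping everything else.
filtered = cnpsi[~(cnpsi["var"].isin(["SRP", "DSi"]) &
                   (cnpsi["fraction"] == "water_raw"))]

merged = filtered.merge(stats, on=["sta_code", "var"], how="left",
                        suffixes=("", "_station"))
```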

    Description of CNPSi_stats.csv data attributes

    Each line is a couple parameter / station

    · sta_code: Code of the station in the Sandre referentiel (public french "dataverse" for water data)

    · sta_name: Name of the station in the Sandre referentiel (public french "dataverse" for water data)

    · var: Abbreviation of parameter name

    · n_meas: number of observations for a given parameter / station

    · start_year: year of first observation for a given parameter / station

    · end_year: year of last observation for a given parameter / station

    · duration_y_tot: total duration of observation in years for a given parameter / station

    · duration_y_tot: duration of observation in years for a given parameter / station for years with at least 1 meas

    · mean_nmeas_per_y_tot: mean number of observations per year considering total duration

    · mean_nmeas_per_y_meas: mean number of observations per year considering years with measurements

    · is_fully_continuous: TRUE if at least one measurement per year for a given parameter / station

    · start_cont_seq: year in which the longest continuous sequence starts for a given parameter / station

    · end_cont_seq: year in which the longest continuous sequence ends for a given parameter / station

    · duration_y_cont_seq: duration in years for the longest continuous sequence for a given parameter / station

    · nmeas_cont_seq:

  11. Malaria disease and grading system dataset from public hospitals reflecting...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Nov 10, 2023
    Cite
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie (2023). Malaria disease and grading system dataset from public hospitals reflecting complicated and uncomplicated conditions [Dataset]. http://doi.org/10.5061/dryad.4xgxd25gn
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Nasarawa State University
    Authors
    Temitope Olufunmi Atoyebi; Rashidah Funke Olanrewaju; N. V. Blamah; Emmanuel Chinanu Uwazie
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Malaria is the leading cause of death in the African region. Data mining can help extract valuable knowledge from available data in the healthcare sector, making it possible to train models to predict patient health faster than in clinical trials. Implementations of various machine learning algorithms, such as K-Nearest Neighbors, Bayes' theorem, Logistic Regression, Support Vector Machines, and Multinomial Naïve Bayes (MNB), have been applied to malaria datasets in public hospitals, but there are still limitations in modeling using the multinomial Naive Bayes algorithm. This study applies the MNB model to explore the relationship between 15 relevant attributes of public hospital data. The goal is to examine how the dependency between attributes affects the performance of the classifier. MNB creates a transparent and reliable graphical representation of the attributes with the ability to predict new situations. The model (MNB) has 97% accuracy. It is concluded that this model outperforms the GNB classifier, which has 100% accuracy, and the RF, which also has 100% accuracy.

    Methods: Prior to data collection, the researcher was guided by ethical training and certification on data collection and on the rights to confidentiality and privacy, under Institutional Review Board (IRB) oversight. Data were collected from the manual archives of hospitals purposively selected using a stratified sampling technique, transformed to electronic form, and stored in a MySQL database called malaria. Each patient file was extracted and reviewed for signs and symptoms of malaria, then checked for a laboratory-confirmed diagnosis. The data were divided into two tables: the first table, data1, contains the data used in phase 1 of the classification, while the second table, data2, contains the data used in phase 2 of the classification. Data Source Collection: the malaria incidence data set was obtained from public hospitals for 2017 to 2021; these are the data used for modeling and analysis, taking into account the geographical location and socio-economic factors of the areas the patients inhabit. Naive Bayes (Multinomial) is the model used to analyze the collected data for malaria disease prediction and grading. Data Preprocessing: preprocessing was done to remove noise and outliers. Transformation: the data were transformed from analog to electronic records. Data Partitioning: the collected data were divided into two portions, one extracted as a training set and the other used for testing; the training portion taken from the first database table is called training set 1, and the training portion taken from the second table is called training set 2. The dataset was split into a sample containing 70% of the data for training and 30% for testing. Then, using the MNB classification algorithm implemented in Python, the models were trained on the training sample; the resulting models were tested on the remaining 30% of the data, and the results were compared with other machine learning models using standard metrics. Classification and prediction: based on the nature of the variables in the dataset, this study uses Naïve Bayes (Multinomial) classification in two phases, classification phase 1 and classification phase 2.

    The operation of the framework is as follows: (i) data collection and preprocessing are performed; (ii) the preprocessed data are stored in training set 1 and training set 2, which are used during classification; (iii) the test data set is stored in a test database; (iv) part of the test data set is classified using classifier 1 and the remaining part is classified with classifier 2. Classifier phase 1 classifies records into positive or negative classes: if the patient has malaria, the patient is classified as positive (P), while a patient is classified as negative (N) if the patient does not have malaria.

    Classifier phase 2 classifies only the records that classifier 1 labelled as positive, and further classifies them into complicated and uncomplicated class labels. This classifier also captures data on environmental factors, genetics, gender and age, and cultural and socio-economic variables. The system is designed so that these core parameters, as determining factors, supply their values.
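    A minimal, hypothetical sketch of the 70/30 Multinomial Naive Bayes setup described above; the file name, feature columns, and label column are placeholders for the actual tables (data1/data2).

```python
# Hypothetical sketch of a 70/30 Multinomial Naive Bayes classification run.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("malaria_phase1.csv")       # placeholder: 15 attributes + class label
X = data.drop(columns=["label"])               # MNB expects non-negative count-like features
y = data["label"]                              # e.g. positive / negative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
model = MultinomialNB().fit(X_tr, y_tr)

pred = model.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print(classification_report(y_te, pred))
```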

  12. Superstore Sales Analysis

    • kaggle.com
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 21, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
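    The project itself is carried out in Excel and Power Query, but steps 3-5 translate directly into a few lines of pandas. The sketch below is a hypothetical equivalent with made-up column names, intended only to make the calculations concrete.

```python
# Hypothetical pandas equivalent of steps 3-5 above (the project uses Excel and
# Power Query); column names are placeholders for the real sheets.
import pandas as pd

orders = pd.read_csv("orders.csv")     # one row per order line

# 3) COGS per line: purchase price plus shipping and other allocated costs
orders["cogs"] = (orders["purchase_price"] * orders["quantity"]
                  + orders["shipping_cost"] + orders["other_costs"])

# 4) Discount analysis: average discount and the discount value per line
orders["discount_value"] = (orders["list_price"] * orders["quantity"]
                            * orders["discount_pct"])
avg_discount = orders["discount_pct"].mean()

# 5) Sales metrics: revenue, profit, and margin by category
orders["revenue"] = orders["list_price"] * orders["quantity"] - orders["discount_value"]
orders["profit"] = orders["revenue"] - orders["cogs"]
summary = orders.groupby("category")[["revenue", "profit"]].sum()
summary["margin_pct"] = 100 * summary["profit"] / summary["revenue"]

print(f"average discount: {avg_discount:.1%}")
print(summary)
```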

  13. INTEGRAL IBIS AGN Catalog - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    + more versions
    Cite
    nasa.gov (2025). INTEGRAL IBIS AGN Catalog - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/integral-ibis-agn-catalog
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    In this work, the authors present the most comprehensive INTEGRAL active galactic nucleus (AGN) sample. It lists 272 AGN for which they have secure optical identifications, precise optical spectroscopy and measured redshift values plus X-ray spectral information, i.e. 2-10 and 20-100 keV fluxes plus column densities. In their paper, the authors mainly use this sample to study the absorption properties of active galaxies, to probe new AGN classes and to test the AGN unification scheme. The authors find that half (48%) of the sample is absorbed, while the fraction of Compton-thick AGN is small (~7%). In line with their previous analysis, they have however shown that when the bias towards heavily absorbed objects which are lost if weak and at large distance is removed, as is possible in the local Universe, the above fractions increase to become 80% and 17%, respectively. The authors also find that absorption is a function of source luminosity, which implies some evolution in the obscuration properties of AGN. A few peculiar classes, so far poorly studied in the hard X-ray band, have been detected and studied for the first time, such as 5 X-ray bright optically normal galaxies (XBONGs), 5 type 2 QSOs and 11 low-ionization nuclear emission regions. In terms of optical classification, this sample contains 57% type 1 and 43% type 2 AGN; this subdivision is similar to that found in X-rays if unabsorbed versus absorbed objects are considered, suggesting that the match between optical and X-ray classifications is on the whole good. Only a small percentage of sources (12%) do not fulfill the expectation of the unified theory, as the authors find 22 type 1 AGN which are absorbed and 10 type 2 AGN which are unabsorbed. Studying these outliers in depth, the authors found that most of the absorbed type 1 AGN have X-ray spectra characterized by either complex or warm/ionized absorption, more likely due to ionized gas located in an accretion disc wind or in the bi-conical structure associated with the central nucleus, and therefore unrelated to the toroidal structure. Among the 10 type 2 AGN which are unabsorbed, at most 3-4% are still eligible to be classified as 'true' type 2 AGN. In the fourth INTEGRAL/IBIS survey (Bird et al. 2010, ApJS, 186, 1, available in the HEASARC database as the IBISCAT4 table), there are 234 objects which have been identified with AGN. To this set of sources, the present authors then added 38 galaxies listed in the INTEGRAL all-sky survey by Krivonos et al. (2007, A&A, 475, 775, available in the HEASARC database as the INTIBISASS table) updated on the website (http://hea.iki.rssi.ru/integral/survey/catalog.php) but not included in the Bird et al. catalog due to the different sky coverage (these latter sources are indicated with hard_flag = 'h' values in this HEASARC table). The final data set presented and discussed in the reference paper and constituting this table therefore comprises 272 AGN and was last updated in March 2011. It represents the most complete view of the INTEGRAL extragalactic sky as of the date of publication in 2012. This table was created by the HEASARC in October 2014 based on CDS Catalog J/MNRAS/426/1750 files tablea1.dat and refs.dat. This is a service provided by NASA HEASARC.

  14. CA Housing | Processed data for Featured Models

    • kaggle.com
    Updated May 26, 2025
    Cite
    Punyak (2025). CA Housing | Processed data for Featured Models [Dataset]. https://www.kaggle.com/datasets/punyakdei/ca-housing-processed-data-for-linear-models
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 26, 2025
    Dataset provided by
    Kaggle
    Authors
    Punyak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    California
    Description

    Exploring the California Housing Dataset: A Modern Take on a Classic Dataset

    The California Housing dataset is a well-known and widely-used dataset in the data science and machine learning communities. Originating from the 1990 U.S. Census, it contains housing data for California districts, and serves as a perfect playground for learners and practitioners to explore real-world regression problems.

    This dataset was originally published by Pace and Barry in 1997, and later adapted by Scikit-learn as a toy dataset for machine learning. While it's based on 1990 data, it remains relevant and invaluable due to its simplicity, interpretability, and richness in geographic and socioeconomic features.

    The dataset includes 20,000+ observations, each representing a block group in California. Key features include: - Median income of households - Median house value - Housing median age - Total rooms and bedrooms - Population and household counts - Latitude and longitude coordinates

    It's great for beginners, as it has many perks for analysis:

    [Image: a few rows of the CA Housing dataset]

    and visualization:

    [Image: distribution of the numerical features]

    with geospatial visuals:

    [Image: CA population rendered as bubble sizes on a map]

    I achieved a respectable variance score by carrying out the following sequential steps:

    1. Outlier Removal- Carefully identified and removed outliers based on statistical thresholds in key features like Income vs House Value, Room vs House Value, and Population vs House Value. This step drastically improved the signal-to-noise ratio in the data, making downstream modeling more robust and interpretable.

    2. Geo-Spatial Clustering- Using the latitude and longitude, I performed geo-based clustering to group similar regions. This not only uncovered meaningful spatial patterns in housing prices but also allowed for richer feature augmentation. KMeans was used to super-localize the clusters, and the results were validated with geographic visualizations and cluster analysis.

    3. Variance Explained: A Huge Win!- I was able to explain up to 75% of the variance in the dataset with just linear models (OLS) — a strong indicator of how well the features (post-cleaning and clustering) capture the underlying data structure.

    [Image: model comparison based on the raw and processed datasets]

    This lays a great foundation for predictive modeling and insightful analysis.
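    As a rough, hypothetical illustration of the workflow described above, the snippet below runs the same three steps (quantile-based outlier trimming, KMeans geo-clustering, and a linear model) on scikit-learn's built-in copy of the 1990 California Housing data. Thresholds and the number of clusters are illustrative choices, not the author's.

```python
# Hypothetical sketch: outlier trimming, geo-clustering, and a linear baseline
# on scikit-learn's California Housing copy (not the author's processed files).
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = fetch_california_housing(as_frame=True)
df = pd.concat([data.data, data.target.rename("MedHouseVal")], axis=1)

# 1) Simple quantile-based outlier trim on key features
for col in ["MedInc", "AveRooms", "Population"]:
    lo, hi = df[col].quantile([0.01, 0.99])
    df = df[df[col].between(lo, hi)]

# 2) Geo-spatial clustering on latitude/longitude as an extra feature
df["geo_cluster"] = KMeans(n_clusters=20, n_init=10, random_state=0) \
    .fit_predict(df[["Latitude", "Longitude"]])
X = pd.get_dummies(df.drop(columns="MedHouseVal"), columns=["geo_cluster"])
y = df["MedHouseVal"]

# 3) Linear model; R^2 is the "variance explained" figure discussed above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("R^2:", LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
```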

  15. Data from: Male responses to sperm competition risk when rivals vary in...

    • researchdata.edu.au
    • search.dataone.org
    • +1more
    Updated 2019
    Cite
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences (2019). Data from: Male responses to sperm competition risk when rivals vary in their number and familiarity [Dataset]. http://doi.org/10.5061/DRYAD.M097580
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DRYAD
    The University of Western Australia
    Authors
    Leigh W. Simmons; Joseph L. Tomkins; Samuel J. Lymbery; School of Biological Sciences
    Description

    Males of many species adjust their reproductive investment to the number of rivals present simultaneously. However, few studies have investigated whether males sum previous encounters with rivals, and the total level of competition has never been explicitly separated from social familiarity. Social familiarity can be an important component of kin recognition and has been suggested as a cue that males use to avoid harming females when competing with relatives. Previous work has succeeded in independently manipulating social familiarity and relatedness among rivals, but experimental manipulations of familiarity are confounded with manipulations of the total number of rivals that males encounter. Using the seed beetle Callosobruchus maculatus, we manipulated three factors: familiarity among rival males, the number of rivals encountered simultaneously, and the total number of rivals encountered over a 48-hour period. Males produced smaller ejaculates when exposed to more rivals in total, regardless of the maximum number of rivals they encountered simultaneously. Males did not respond to familiarity. Our results demonstrate that males of this species can sum the number of rivals encountered over separate days, and therefore the confounding of familiarity with the total level of competition in previous studies should not be ignored.

    Files included:
    Lymbery et al 2018 Full dataset (Lymbery et al Full Dataset.xlsx): all the data used in the statistical analyses for the associated manuscript; two spreadsheets, one containing the data and one containing a legend relating to column titles.
    Lymbery et al 2018 Reduced dataset 1 (Lymbery et al Reduced Dataset After 1st Round of Outlier Removal.xlsx): data following the removal of three outliers for the purposes of data distribution, as described in the associated R code; two spreadsheets, one containing the data and one containing a legend relating to column titles.
    Lymbery et al 2018 Reduced dataset 2 (Lymbery et al Reduced Dataset After Final Outlier Removal.xlsx): data after the removal of all outliers stated in the manuscript and associated R code; two spreadsheets, one containing the data and one containing a legend relating to column titles.
    Lymbery et al 2018 R Script: all the R code used for statistical analysis in this manuscript, with annotations to aid interpretation.

  16. A new global GPS dataset for testing and improving modelled GIA uplift rates...

    • service.tib.eu
    Updated Nov 29, 2024
    + more versions
    Cite
    (2024). A new global GPS dataset for testing and improving modelled GIA uplift rates - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-889923
    Explore at:
    Dataset updated
    Nov 29, 2024
    License

    Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    We have produced a global dataset of ~4000 GPS vertical velocities that can be used as observational estimates of glacial isostatic adjustment (GIA) uplift rates. GIA is the response of the solid Earth to past ice loading, primarily since the Last Glacial Maximum, about 20,000 years BP. Modelling GIA is challenging because of large uncertainties in the ice loading history and in the viscosity of the upper and lower mantle. GPS uplift rates contain the signature of GIA, but they also contain other sources of vertical land motion (VLM), such as tectonics and human and natural influences on water storage, that can mask the underlying GIA signal. A novel, fully automatic strategy was developed to post-process the GPS time series and to correct for non-GIA artefacts. Before estimating vertical velocities and uncertainties, we detected outliers and jumps and corrected for atmospheric mass loading displacements. We corrected the resulting velocities for the elastic response of the solid Earth to global changes in ice sheets, glaciers, and ocean loading, as well as for changes in the Earth's rotational pole relative to the 20th-century average. We then applied a spatial median filter to remove sites where local effects were dominant, leaving approximately 4000 GPS sites. The resulting global GPS dataset shows a clean GIA signal at all post-processed stations and is suitable for investigating the behaviour of global GIA forward models. The results are transformed from a frame with its origin in the centre of mass of the total Earth system (CM) into a frame with its origin in the centre of mass of the solid Earth (CE) before comparison with 13 global GIA forward model solutions, with the best fits given by the Pur-6-VM5 and ICE-6G predictions. The largest discrepancies for all models were found for Antarctica and Greenland, which may be due to uncertain mantle rheology, ice loading history/magnitude, and/or GPS errors.
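
    The spatial median screen described above can be sketched in a few lines of Python. This is an illustration of the general idea, not the authors' processing code, and the neighbourhood radius and deviation threshold are arbitrary assumptions.

    ```python
    # Flag a GPS site when its vertical velocity deviates strongly from the
    # median velocity of neighbouring sites within a given radius.
    import numpy as np

    def spatial_median_outliers(lon, lat, vel, radius_deg=5.0, max_dev_mm_yr=2.0):
        """Return a boolean array that is True for sites flagged as outliers."""
        lon, lat, vel = map(np.asarray, (lon, lat, vel))
        flagged = np.zeros(vel.size, dtype=bool)
        for i in range(vel.size):
            # Crude angular distance; a real implementation would use
            # great-circle distances and a more careful neighbourhood rule.
            d = np.hypot(lon - lon[i], lat - lat[i])
            neighbours = (d < radius_deg) & (d > 0)
            if neighbours.sum() >= 3:
                local_median = np.median(vel[neighbours])
                flagged[i] = abs(vel[i] - local_median) > max_dev_mm_yr
        return flagged

    # Tiny synthetic example: one site with a strong local (non-GIA) signal.
    lon = [10.0, 10.5, 11.0, 10.2, 10.8]
    lat = [60.0, 60.3, 60.1, 59.8, 60.2]
    vel = [3.1, 3.0, 2.9, 3.2, -7.5]   # mm/yr; the last site is anomalous
    print(spatial_median_outliers(lon, lat, vel))
    ```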

  17. DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With...

    • frontiersin.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu (2023). DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With Large Datasets and Outliers.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.774846.s001
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid increase of large-scale datasets, biomedical data visualization faces new challenges: the data may be large, span different orders of magnitude, contain extreme values, or have an unclear distribution. Here we present an R package, ggbreak, that allows users to create broken axes using ggplot2 syntax. It makes effective use of the plotting area when dealing with large datasets (especially long sequential data), data of different magnitudes, and data containing outliers. The ggbreak package increases the available visual space for a better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The package is fully compatible with ggplot2, making it easy to superpose additional layers and apply scales and themes to adjust the plot using ggplot2 syntax. ggbreak is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and GitHub (https://github.com/YuLab-SMU/ggbreak).
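
    ggbreak itself is an R package, so the snippet below is not its API; it is a rough Python/matplotlib analogue of the broken-axis idea, shown only to make the concept concrete for readers outside the R ecosystem.

    ```python
    # A generic broken-y-axis effect in matplotlib: the same data are drawn on
    # two stacked panels, each limited to its own y-window, so one extreme
    # value no longer compresses the rest of the plot.
    import matplotlib.pyplot as plt
    import numpy as np

    values = np.array([3.0, 5.0, 4.2, 6.1, 250.0, 5.5, 4.8])  # one extreme value
    x = np.arange(len(values))

    fig, (ax_top, ax_bot) = plt.subplots(2, 1, sharex=True, figsize=(6, 4))
    for ax in (ax_top, ax_bot):
        ax.bar(x, values, color="steelblue")

    ax_top.set_ylim(240, 260)   # window around the outlier
    ax_bot.set_ylim(0, 10)      # window around the bulk of the data

    # Hide the adjoining spines and draw small diagonal "break" marks.
    ax_top.spines["bottom"].set_visible(False)
    ax_bot.spines["top"].set_visible(False)
    ax_top.tick_params(bottom=False)

    d = 0.015  # size of the break marks, in axes coordinates
    marks = dict(transform=ax_top.transAxes, color="k", clip_on=False)
    ax_top.plot((-d, +d), (-d, +d), **marks)
    ax_top.plot((1 - d, 1 + d), (-d, +d), **marks)
    marks.update(transform=ax_bot.transAxes)
    ax_bot.plot((-d, +d), (1 - d, 1 + d), **marks)
    ax_bot.plot((1 - d, 1 + d), (1 - d, 1 + d), **marks)

    plt.tight_layout()
    plt.show()
    ```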

  18. Cleaned Driver License Dataset

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Divyanshu_CODER (2025). Cleaned Driver License Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/12605959
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Divyanshu_CODER
    License

    Public Domain Dedication (CC0 1.0) https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚗 Cleaned Drivers License Dataset

    "This dataset is designed for those who believe in building smart, ethical AI models that can assist in real-world decision-making like licensing, risk assessment, and more."

    📂 Dataset Overview

    This dataset is a cleaned and preprocessed version of a drivers license dataset containing details like:

    Age Group

    Gender

    Reaction Time

    Driving Skill Level

    Training Received

    License Qualification Status

    And more besides: 20 columns in total, enough to train a solid ML model on this dataset.

    The raw dataset contained missing values, categorical anomalies, and inconsistencies, all of which have been fixed to make it ML-ready.

    💡 Key Highlights

    ✅ Cleaned missing values with intelligent imputation

    🧠 Encoded categorical columns with appropriate techniques (OneHot, Ordinal)

    🔍 Removed outliers using statistical logic

    🔧 Feature engineered to preserve semantic meaning

    💾 Ready-to-use for classification tasks (e.g., predicting who qualifies for a license); a minimal preprocessing sketch follows this list
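
    The highlights above map onto a fairly standard scikit-learn pipeline. The sketch below is illustrative only: the file name, the split between nominal and ordinal columns, and the category orderings are my assumptions, not part of the dataset documentation.

    ```python
    # Impute missing values, one-hot encode nominal columns, and ordinally
    # encode ordered columns, then build a binary target from "Qualified".
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    df = pd.read_csv("cleaned_driver_license.csv")   # hypothetical file name

    nominal = ["Gender", "Race"]
    ordinal = ["Age Group", "Reactions", "Training", "Driving Skill"]
    orders = [                      # assumed category orderings
        ["Teen", "Adult", "Senior"],
        ["Slow", "Average", "Fast"],
        ["Basic", "Advanced"],
        ["Beginner", "Expert"],
    ]

    preprocess = ColumnTransformer([
        ("nominal", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), nominal),
        ("ordinal", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OrdinalEncoder(categories=orders,
                                      handle_unknown="use_encoded_value",
                                      unknown_value=-1)),
        ]), ordinal),
    ])

    X = preprocess.fit_transform(df[nominal + ordinal])
    y = (df["Qualified"] == "Yes").astype(int)   # binary target
    print(X.shape, y.mean())
    ```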

    📊 Columns Description

    Column Name: Description

    Gender: Gender of the individual (Male, Female)
    Age Group: Age segment (Teen, Adult, Senior)
    Race: Ethnicity of the driver
    Reactions: Reaction speed categorized (Fast, Average, Slow)
    Training: Training received (Basic, Advanced)
    Driving Skill: Skill level (Expert, Beginner, etc.)
    Qualified: Whether the person qualified for a license (Yes, No)

    🤖 Perfect For

    📚 Machine Learning (Classification)

    📊 Exploratory Data Analysis (EDA)

    📉 Feature Engineering Practice

    🧪 Model Evaluation & Experimentation

    🚥 AI for Transport & Safety Projects

    🏷️ Tags

    #MLReady #LicensePrediction #ClassificationDataset #CleanedData #FeatureEngineering #DriversLicense #Transportation #AIProjects #Imputation #OrdinalEncoding

    📌 Author Notes

    This dataset is part of a data cleaning and feature engineering project. One column is intentionally left unprocessed for developers to test their own pipeline or transformation strategies 😎.

    🔗 Citation

    If you use this dataset in your projects, notebooks, or blog posts — feel free to tag me or credit the original dataset and this cleaned version.

    📚 Citation

    If you use this dataset in your work, please cite it as:

    Divyanshu_Coder, 2025. Cleaned Driver License Dataset. Kaggle.

  19. TreeShrink: fast and accurate detection of outlier long branches in...

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Jul 22, 2025
    Cite
    Siavash Mirarab; Uyen Mai (2025). TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees [Dataset]. http://doi.org/10.6076/D1HC71
    Explore at:
    Dataset updated
    Jul 22, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Siavash Mirarab; Uyen Mai
    Time period covered
    Jan 1, 2023
    Description

    Phylogenetic trees include errors for a variety of reasons. We argue that one way to detect errors is to build a phylogeny with all the data and then detect taxa that artificially inflate the tree diameter. We formulate an optimization problem that seeks to find k leaves that can be removed to reduce the tree diameter maximally. We present a polynomial time solution to this “k-shrink” problem. Given this solution, we then use non-parametric statistics to find an outlier set of taxa that have an unexpectedly high impact on the tree diameter. We test our method, TreeShrink, on five biological datasets, and show that it is more conservative than rogue taxon removal using RogueNaRok. When the amount of filtering is controlled, TreeShrink outperforms RogueNaRok in three out of the five datasets, and they tie in another dataset.

    All the raw data are obtained from other publications as shown below. We further analyzed the data and provide the results of the analyses here. The methods used to analyze the data are described in the paper.

    Dataset   Species   Genes   Download
    Plants    104       852     DOI 10.1186/2047-217X-3-17
    Mammals   37        424     DOI 10.13012/C5BG2KWG
    Insects   144       1478    http://esayyari.github.io/InsectsData
    Cannon    78        213     DOI 10.5061/dryad.493b7
    Rouse     26        393     DOI 10.5061/dryad.79dq1
    Frogs     164       95      DOI 10.5061/dryad.12546.2
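
    The k-shrink idea described above (choose leaves whose removal maximally reduces the tree diameter) can be illustrated with a deliberately naive sketch. This is not the polynomial-time algorithm from the paper, just the k = 1 case computed by brute force on a hypothetical leaf-to-leaf distance matrix.

    ```python
    # Find the single leaf whose removal shrinks the tree diameter the most,
    # where the diameter is the largest pairwise path length between leaves.
    from itertools import combinations

    # Hypothetical pairwise path lengths between five leaves (symmetric).
    leaves = ["A", "B", "C", "D", "E"]
    dist = {
        ("A", "B"): 2.0, ("A", "C"): 2.5, ("A", "D"): 9.5, ("A", "E"): 3.0,
        ("B", "C"): 1.5, ("B", "D"): 9.0, ("B", "E"): 2.0,
        ("C", "D"): 8.5, ("C", "E"): 2.5,
        ("D", "E"): 9.8,
    }

    def diameter(active):
        """Largest pairwise distance among the active leaves."""
        return max(dist[tuple(sorted(p))] for p in combinations(active, 2))

    full = diameter(leaves)
    best = max(leaves, key=lambda leaf: full - diameter([l for l in leaves if l != leaf]))
    reduced = diameter([l for l in leaves if l != best])
    print(f"diameter {full:.1f} -> {reduced:.1f} after dropping leaf {best}")
    # Leaf D sits on an unusually long branch, so removing it helps the most.
    ```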

  20. Heidelberg Tributary Loading Program (HTLP) Dataset

    • zenodo.org
    bin, png
    Updated Jul 16, 2024
    Cite
    NCWQR; NCWQR (2024). Heidelberg Tributary Loading Program (HTLP) Dataset [Dataset]. http://doi.org/10.5281/zenodo.6606950
    Explore at:
    bin, png (available download formats)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    NCWQR; NCWQR
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is updated more frequently and can be visualized on NCWQR's data portal.

    If you have any questions, please contact Dr. Laura Johnson or Dr. Nathan Manning.

    The National Center for Water Quality Research (NCWQR) is a research laboratory at Heidelberg University in Tiffin, Ohio, USA. Our primary research program is the Heidelberg Tributary Loading Program (HTLP), where we currently monitor water quality at 22 river locations throughout Ohio and Michigan, effectively covering ~half of the land area of Ohio. The goal of the program is to accurately measure the total amounts (loads) of pollutants exported from watersheds by rivers and streams. Thus these data are used to assess different sources (nonpoint vs point), forms, and timing of pollutant export from watersheds. The HTLP officially began with high-frequency monitoring for sediment and nutrients from the Sandusky and Maumee rivers in 1974, and has continually expanded since then.

    Each station where samples are collected for water quality is paired with a US Geological Survey gage for quantifying discharge (http://waterdata.usgs.gov/usa/nwis/rt). Our stations cover a wide range of watershed areas upstream of the sampling point from 11.0 km2 for the unnamed tributary to Lost Creek to 19,215 km2 for the Muskingum River. These rivers also drain a variety of land uses, though a majority of the stations drain over 50% row-crop agriculture.

    At most sampling stations, submersible pumps located on the stream bottom continuously pump water into sampling wells inside heated buildings where automatic samplers collect discrete samples (4 unrefrigerated samples/d at 6-h intervals, 1974–1987; 3 refrigerated samples/d at 8-h intervals, 1988-current). At weekly intervals the samples are returned to the NCWQR laboratories for analysis. When samples either have high turbidity from suspended solids or are collected during high flow conditions, all samples for each day are analyzed. As stream flows and/or turbidity decreases, analysis frequency shifts to one sample per day. At the River Raisin and Muskingum River, a cooperator collects a grab sample from a bridge at or near the USGS station approximately daily and all samples are analyzed. Each sample bottle contains sufficient volume to support analyses of total phosphorus (TP), dissolved reactive phosphorus (DRP), suspended solids (SS), total Kjeldahl nitrogen (TKN), ammonium-N (NH4), nitrate-N and nitrite-N (NO2+3), chloride, fluoride, and sulfate. Nitrate and nitrite are commonly added together when presented; henceforth we refer to the sum as nitrate.

    Upon return to the laboratory, all water samples are analyzed within 72h for the nutrients listed below using standard EPA methods. For dissolved nutrients, samples are filtered through a 0.45 um membrane filter prior to analysis. We currently use a Seal AutoAnalyzer 3 for DRP, silica, NH4, TP, and TKN colorimetry, and a DIONEX Ion Chromatograph with AG18 and AS18 columns for anions. Prior to 2014, we used a Seal TRAACs for all colorimetry.

    2017 Ohio EPA Project Study Plan and Quality Assurance Plan

    Project Study Plan

    Quality Assurance Plan

    Data quality control and data screening

    The data provided in the River Data files have all been screened by NCWQR staff. The purpose of the screening is to remove outliers that staff deem likely to reflect sampling or analytical errors rather than outliers that reflect the real variability in stream chemistry. Often, in the screening process, the causes of the outlier values can be determined and appropriate corrective actions taken. These may involve correction of sample concentrations or deletion of those data points.

    This micro-site contains data for approximately 126,000 water samples collected beginning in 1974. We cannot guarantee that each data point is free from sampling bias/error, analytical errors, or transcription errors. However, since its beginnings, the NCWQR has operated a substantial internal quality control program and has participated in numerous external quality control reviews and sample exchange programs. These programs have consistently demonstrated that data produced by the NCWQR is of high quality.

    A note on detection limits and zero and negative concentrations

    It is routine practice in analytical chemistry to determine method detection limits and/or limits of quantitation, below which analytical results are considered less reliable or unreliable. This is something that we also do as part of our standard procedures. Many laboratories, especially those associated with agencies such as the U.S. EPA, do not report individual values that are less than the detection limit, even if the analytical equipment returns such values. This is in part because as individual measurements they may not be considered valid under litigation.

    The measured concentration consists of the true but unknown concentration plus random instrument error, which is usually small compared to the range of expected environmental values. In a sample for which the true concentration is very small, perhaps even essentially zero, it is possible to obtain an analytical result of 0 or even a small negative concentration. Results of this sort are often “censored” and replaced with a statement such as “less than the detection limit”.

    Censoring these low values creates a number of problems for data analysis. How do you take an average? If you leave out these numbers, you get a biased result because you did not toss out any other (higher) values. Even if you replace negative concentrations with 0, a bias ensues, because you’ve chopped off some portion of the lower end of the distribution of random instrument error.

    For these reasons, we do not censor our data. Values of -9 and -1 are used as missing value codes, but all other negative and zero concentrations are actual, valid results. Negative concentrations make no physical sense, but they make analytical and statistical sense. Users should be aware of this, and if necessary make their own decisions about how to use these values. Particularly if log transformations are to be used, some decision on the part of the user will be required.
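
    As a practical illustration of the conventions above, a user working in Python might handle the missing-value codes like this (the file and column names are hypothetical, not the published file layout):

    ```python
    # Treat -9 and -1 as missing-value codes, but keep genuine zero and
    # negative concentrations so that averages remain unbiased.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("maumee_river_htlp.csv")          # hypothetical file name
    conc_cols = ["TP", "DRP", "SS", "NO23", "Cl"]      # hypothetical column names

    # Replace only the coded missing values; all other values stay untouched.
    df[conc_cols] = df[conc_cols].replace([-9, -1], np.nan)

    # Uncensored averaging: zero and negative results are included deliberately.
    print(df[conc_cols].mean(skipna=True))
    ```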

    Analyte Detection Limits

    https://ncwqr.files.wordpress.com/2021/12/mdl-june-2019-epa-methods.jpg?w=1024

    For more information, please visit https://ncwqr.org/
