14 datasets found
  1. Supporting data for "A Standard Operating Procedure for Outlier Removal in...

    • search.dataone.org
    • dataverse.no
    • +1 more
    Updated Jul 29, 2024
    Cite
    Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS
    Dataset updated
    Jul 29, 2024
    Dataset provided by
    DataverseNO
    Authors
    Holsbø, Einar
    Description

    This dataset is example data from the Norwegian Women and Cancer study (NOWAC). It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" (in submission). The bulk of the data comes from measuring gene expression in NOWAC blood samples on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.

  2. Dataset on the Human Body as a Signal Propagation Medium

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Dataset on the Human Body as a Signal Propagation Medium [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8214496
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    V. Abolins
    A. Elsts
    A. Sevcenko
    V. Aristovs
    V. Medvedevs
    J. Ormanis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternating-current sinusoidal signals. The signal frequencies range from 50 kHz to 20 MHz.

    Applications: This dataset is intended to support investigation of the human body as a signal propagation medium, and to capture how the properties of the human body (age, sex, composition etc.), the measurement locations, and the signal frequencies affect the signal loss over the human body.

    Overview statistics:

    Number of subjects: 30

    Number of transmitter locations: 6

    Number of receiver locations: 6

    Number of measurement frequencies: 19

    Input voltage: 1 V

    Load resistance: 50 ohm and 1 megaohm

    Measurement group statistics:

    Height, cm: 174.10 (7.15)

    Weight, kg: 72.85 (16.26)

    BMI, kg/m²: 23.94 (4.70)

    Body fat %: 21.53 (7.55)

    Age group: 29.00 (11.25)

    Male/female ratio: 50%

    Included files:

    experiment_protocol_description.docx - protocol used in the experiments

    electrode_placement_schematic.png - schematic of placement locations

    electrode_placement_photo.jpg - photo of the experimental setup on a volunteer subject

    RawData - the full measurement results and experiment info sheets

    all_measurements.csv - the most important results extracted to .csv

    all_measurements_filtered.csv - same, but after z-score filtering

    all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row

    all_measurements_by_freq_filtered.csv - same, but after z-score filtering

    summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets

    process_json_files.py - script that creates .csv from the raw data

    filter_results.py - outlier removal based on z-score

    plot_sample_curves.py - visualization of a randomly selected measurement result subset

    plot_measurement_group.py - visualization of the measurement group
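
    The file list above mentions outlier removal based on z-score (filter_results.py). As a rough illustration of the general technique only, a minimal sketch follows. It is not the dataset's actual script: the function name, the threshold of 3.0, and the sample values are all assumptions made for this example.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Return the values whose absolute z-score does not exceed `threshold`."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        return list(values)  # all values identical: nothing to reject
    return [v for v in values if abs(v - mean) / sd <= threshold]

# Made-up receiver gains in dB: 20 clustered readings plus one obvious outlier.
gains_db = [-40.0 + 0.1 * (i % 5) for i in range(20)] + [-5.0]
filtered = zscore_filter(gains_db)
```

    Note that with a population standard deviation, the largest possible z-score in a sample of n points is sqrt(n - 1), so a threshold of 3 can never reject anything in samples of 10 points or fewer.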

    CSV file columns:

    subject_id - participant's random unique ID

    experiment_id - measurement session's number for the participant

    height - participant's height, cm

    weight - participant's weight, kg

    BMI - body mass index, computed from the values above

    body_fat_% - body fat composition, as measured by bioimpedance scales

    age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.

    male - 1 if male, 0 if female

    tx_point - transmitter point number

    rx_point - receiver point number

    distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!

    tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.

    rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.

    total_fat_level - sum of rx and tx fat levels

    bias - constant term to simplify data analytics, always equal to 1.0

    CSV file columns, frequency-specific:

    tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py script from the voltage drop

    rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance

    rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
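
    The receiver gain columns are given in dB. A small sketch of converting such a value to a linear ratio is shown below. It assumes the gain is a voltage gain, i.e. 20·log10(V_rx / V_tx); the description does not state this, so treat it as an assumption, along with the hypothetical function names. The 1 V default mirrors the input voltage listed in the overview statistics.

```python
def db_to_voltage_ratio(gain_db):
    """Convert a gain in dB to a linear voltage ratio (assumes 20*log10 convention)."""
    return 10 ** (gain_db / 20)

def rx_voltage(gain_db, input_voltage=1.0):
    """Estimated receiver-side voltage for a given input voltage, in volts."""
    return input_voltage * db_to_voltage_ratio(gain_db)
```

    Under this convention a gain of -40 dB with a 1 V input would correspond to about 0.01 V at the receiver.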

    Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.

    References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.

    Contact information: info@edi.lv

  3. Stream water-quality summary statistics and outliers, streamwater load...

    • data.usgs.gov
    • search.dataone.org
    • +1 more
    Cite
    Brent Aulenbach; John Joiner, Stream water-quality summary statistics and outliers, streamwater load models and yield estimates, and peak flow modeling parameters for 13 watersheds in Gwinnett County, Georgia [Dataset]. http://doi.org/10.5066/F7639MXG
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Brent Aulenbach; John Joiner
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Mar 12, 2001 - Sep 30, 2015
    Area covered
    Gwinnett County, Georgia
    Description

    Data release includes the following five data tables: (1) water-quality constituent outliers that were removed from the calibration of regression models used to estimate streamwater solute loads, (2) parameters used to model peak streamflow recurrence intervals, (3) models used to estimate streamwater constituent loads, (4) statistical summaries of water-quality observations, and (5) estimated annual streamwater constituent yields. An associated metadata file is included for each of the five data tables.

  4. Data from: Outlier classification using autoencoders: application for...

    • osti.gov
    Updated Jun 2, 2021
    Cite
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. http://doi.org/10.7910/DVN/SKEHRJ
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    Office of Science: http://www.er.doe.gov/
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  5. Identification of Performance Changes at Code Level (Measurement...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated Aug 8, 2022
    Cite
    Anonymous for Reviewing (2022). Identification of Performance Changes at Code Level (Measurement Configuration Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6300863
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Anonymous for Reviewing
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Measurement Configuration Dataset

    This is the anonymous reviewing version; the source code repository will be added after the review.

    This dataset provides reproduction data for performance measurement configuration at source code level in Java. You can obtain the measurement data yourself using the precision-experiments repository https://anonymous.4open.science/r/precision-experiments-C613/ (Examining Different Repetition Counts). The data contained here are those we obtained from execution on an i7-4770 CPU @ 3.40GHz.

    The analysis was tested on Ubuntu 20.04 and gnuplot 5.2.8. It will not work with older gnuplot versions.

    To execute the analysis, extract the data by

    tar -xvf basic-parameter-comparison.tar
    tar -xvf parallel-sequential-comparison.tar

    and afterwards build the precision-experiments repo and execute the analysis by

    cd precision-experiments/precision-analysis/
    ../gradlew fatJar
    cd scripts/configuration-analysis/
    ./executeCompleteAnalysis.sh ../../../../basic-parameter-comparison ../../../../parallel-sequential-comparison

    Afterwards, the following files will be present:

    precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_all_en.pdf (Heatmaps for different repetition counts)

    precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_outlierRemoval_en.pdf (Heatmap with and without outlier removal for 1000 repetitions)

    precision-experiments/precision-analysis/scripts/configuration-analysis/histogram_outliers_en.pdf (Histogram of the outliers)

    precision-experiments/precision-analysis/scripts/configuration-analysis/heatmap_parallel_en.pdf (Heatmap with sequential and parallel execution)

  6. 11: Streamwater sample constituent concentration outliers from 15 watersheds...

    • data.usgs.gov
    • catalog.data.gov
    Cite
    Brent Aulenbach; Joshua Henley; Kristina Hopkins, 11: Streamwater sample constituent concentration outliers from 15 watersheds in Gwinnett County, Georgia for water years 2003-2020 [Dataset]. http://doi.org/10.5066/P9G8HZTY
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Brent Aulenbach; Joshua Henley; Kristina Hopkins
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Oct 10, 2002 - Sep 29, 2020
    Area covered
    Gwinnett County, Georgia
    Description

    This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). In total, 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits o ...

  7. Data from: Fast robust SUR with economical and actuarial applications

    • search.datacite.org
    • wiley.figshare.com
    Updated Jul 14, 2016
    Cite
    Mia Hubert; Tim Verdonck (2016). Data from: Fast robust SUR with economical and actuarial applications [Dataset]. http://doi.org/10.6084/m9.figshare.3408073
    Dataset updated
    Jul 14, 2016
    Dataset provided by
    DataCite: https://www.datacite.org/
    Wiley
    Authors
    Mia Hubert; Tim Verdonck
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard Feasible Generalized Least Squares (FGLS) estimator is efficient as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277-288, 2000) can accommodate outliers, but it is hard to compute. First we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real data set from economics. Next we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the General Multivariate Chain Ladder (GMCL) model that employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.

  8. Input data for chloride-specific conductance regression models

    • data.usgs.gov
    • catalog.data.gov
    Cite
    Rosemary Fanelli; Andrew Sekellick; Joel Moore, Input data for chloride-specific conductance regression models [Dataset]. http://doi.org/10.5066/P9YN2QST
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Rosemary Fanelli; Andrew Sekellick; Joel Moore
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    Sep 17, 1953 - Sep 28, 2018
    Description

    This data set includes input data for the development of regression models to predict chloride from specific conductance (SC) data at 56 U.S. Geological Survey water quality monitoring stations in the eastern United States. Each site has 20 or more simultaneous observations of SC and chloride. Data were downloaded from the National Water Information System (NWIS) using the R package dataRetrieval. Datasets for each site were evaluated and outliers were removed prior to the development of the regression model. This file contains only the final input dataset for the regression models. Please refer to Moore and others (in review) for more details. Moore, J., R. Fanelli, and A. Sekellick. In review. High-frequency data reveal deicing salts drive elevated conductivity and chloride along with pervasive and frequent exceedances of the EPA aquatic life criteria for chloride in urban streams. Submitted to Environmental Science and Technology.

  9. CBP Water Quality Monitoring Subset (1984-2018), CB8 1E

    • data.ioos.us
    • erddap.maracoos.org
    • +1 more
    erddap +2
    Updated May 19, 2025
    Cite
    MARACOOS (2025). CBP Water Quality Monitoring Subset (1984-2018), CB8 1E [Dataset]. https://data.ioos.us/dataset/cbp-water-quality-monitoring-subset-1984-2018-cb8-1e
    Available download formats: erddap, erddap-tabledap, opendap
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    MARACOOS
    Description

    This product was developed as part of the project supported by the grant from and the National Oceanic and Atmospheric Administration’s Ocean Acidification Program under award NA18OAR0170430 to the Virginia Institute of Marine Science. The data product consists of water quality data for 98 tidal stations for 1984–2018. The source data used to generate this product were downloaded from the Chesapeake Bay Program’s (CBP) data hub. Out of the total of 255 monitoring stations in the Tidal Monitoring Program, we selected 98 with a long monitoring record (30 years or longer). The following variables were downloaded from the data hub at the native temporal and vertical resolution (between one and four cruises per month and approximately 10 depth levels sampled between 0 and 37 m) for 1984–2018: water temperature (T), salinity (S), pH, total alkalinity (TA), dissolved oxygen (DO), and chlorophyll (Chl). All pH data prior to 1998 were removed because of data quality concerns (Herrmann et al., 2020). Briefly, we found a dramatic difference in long-term trends between stations measured by institutions in the state of Virginia and stations measured by the state of Maryland, particularly from late spring to early fall. The boundary between the station groups runs east–west within the mesohaline portion of the bay, where the Potomac River estuary intersects the mainstem bay. The boundary separates strong negative linear trends to the south (Virginia stations) from neutral and weakly positive linear trends to the north (Maryland stations). For all variables, data entries marked with CBP’s “Problem” and “Qualifier” flags were removed.
    Additionally, all variables were scanned for extreme outliers: for each variable, data from all stations, depths, and times were combined into a single composite sample for which the 75th and 25th percentiles (i.e., the upper and lower quartiles) and the interquartile range (the difference between the upper and lower quartiles) were calculated. Extreme outliers were defined as the values falling outside of a certain number (censoring criterion) of interquartile ranges from the upper and lower quartiles.
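
    The interquartile-range rule described above can be sketched as follows. This is only an illustration of the general technique: the description does not give the censoring criterion, so the multiplier k = 3.0, the function names, the choice of the "inclusive" quantile method, and the sample values are all assumptions.

```python
import statistics

def iqr_bounds(values, k=3.0):
    """Return (low, high) limits lying k interquartile ranges beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_extreme_outliers(values, k=3.0):
    """Keep only the values that fall within the IQR-based limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]

# Made-up composite sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
cleaned = drop_extreme_outliers(sample)
```

    A larger k makes the rule more conservative, so fewer values are treated as extreme.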

  10. A-TWAIN Physical Oceanography Mooring Data 2021-2022

    • data.npolar.no
    bin, nc, pdf
    Updated Oct 29, 2024
    Cite
    Renner, Angelika H. H. (angelika.renner@hi.no); Sundfjord, Arild (arild.sundfjord@npolar.no); Foss, Øyvind (oyvind.foss@npolar.no) (2024). A-TWAIN Physical Oceanography Mooring Data 2021-2022 [Dataset]. http://doi.org/10.21334/npolar.2024.86ec6869
    Available download formats: bin, nc, pdf
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Norwegian Polar Data Centre
    Authors
    Renner, Angelika H. H. (angelika.renner@hi.no); Sundfjord, Arild (arild.sundfjord@npolar.no); Foss, Øyvind (oyvind.foss@npolar.no)
    License

    http://spdx.org/licenses/CC0-1.0

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Nov 9, 2021 - Oct 6, 2022
    Description

    A-TWAIN Physical Oceanography mooring data 2021-2022

    A-TWAIN (Long-term variability and trends in the Atlantic Water inflow region) was established to gain understanding on how the inflowing current system is distributed at different depths along the continental slope, how it responds to local, short-lived atmospheric changes, and how it varies on seasonal and interannual timescales.

    Overview

    As part of A-TWAIN, three moorings were redeployed near the continental slope of the Nansen Basin in the Arctic Ocean, near 31°E north of the Barents Sea. The moorings were operational between November 2021 and October 2022. All moorings have previously been deployed in the same respective locations; these data constitute the 2021-2022 continuation of the A-TWAIN mooring time series.

    AT800-7* and AT200-6 moorings were instrumented mooring lines extending from the bottom anchor to a sub-surface buoy, while AT500-2 was a bottom lander. CTD and ADCP data from the moorings will be made available here; other datasets from these moorings will be published elsewhere. Processed data will be added here as they become available.

    * "AT800-7" denotes the 7th deployment of the AT800 mooring.

    Table: Details of the mooring deployments


    Mooring  Type               Bottom depth  Latitude  Longitude  Deployment date  Recovery date  Data status
    AT200-6  Instrumented line  205 m         81.4105   31.2433    09.11.21         06.10.22       CTD data published
    AT500-2  Bottom lander      488 m         81.4577   31.0753    09.11.21         04.10.22
    AT800-7  Instrumented line  889 m         81.5501   30.8777    09.11.21         04.10.22


    [Figure: ATWAIN map] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/atwain_map.png

    A-TWAIN mooring locations showing IBCAO v4 bathymetry.

    All three moorings were deployed in November 2021 during the joint Nansen Legacy and A-TWAIN/SIOS-InfraNor mooring service cruise (KPH20217123), and recovered in October 2022 during the Nansen Legacy Mooring Service Cruise (KPH2022712).

    Data details

    Processed data are made available as one NetCDF file per instrument. Raw instrument data are also available. Details of the processing of the respective datasets are shown below.

    AT200-6 CTD data


    Instrument  Median depth  Serial number  Sampling frequency  File name
    SBE16plus   46 m          50241          45 min              AT200_2021_2022_SBE16plus_50241_pres_temp_sal_46m.nc
    SBE37SMP    59 m          20773          15 min              AT200_2021_2022_SBE37SMP_20773_pres_temp_sal_59m.nc
    SBE37SM     113 m         15252          15 min              AT200_2021_2022_SBE37SM_15252_pres_temp_sal_113m.nc
    SBE37SM     191 m         9293           15 min              AT200_2021_2022_SBE37SM_9293_pres_temp_sal_191m.nc

    AT200-6 data processing


    Data processing

    AT200 CTD data were processed to .cnv using SBEDataProcessing software. Additional processing was done in Python using the kval library (v0.0.2-beta, this commit).

    Processing steps as well as a python script for reproducing the post-processing from .cnv can be found in the PROCESSING variable of each file.

    All records were chopped to the time range 2021-11-09 20:00 - 2022-10-06 07:30 in order to remove data from recovery/deployment and deck times.

    Salinity outlier editing

    After visual inspection, no editing was applied to temperature and pressure.

    Salinity has been lightly edited in order to remove noise and outliers (see PSAL variable attributes for details). The identification of outliers is complicated by the large hydrographic variability in this location, reflecting sharp lateral gradients near the continental slope in combination with an energetic background environment and relatively strong tides. The processing has therefore been done using a relatively light approach, described below. This editing may or may not be appropriate or sufficient for specific research purposes. Users who want to apply their own editing are encouraged to work with unedited salinity, which can easily be obtained by reprocessing salinity from TEMP and CNDC (both of which have been left unedited).

    For SBE37 instruments:

    • PSAL was recomputed from modified conductivity CNDC_mod and temperature TEMP_mod in order to reduce (presumably artificial) high-frequency noise:
      • CNDC_mod was despiked using a 31-pt rolling window (rejecting outliers >3 SD from the median).
      • A rolling 3-point median was applied to CNDC_mod and TEMP_mod.
      • PSAL was recomputed from temperature, conductivity and pressure using the GSW-Python library.
        • (No filtering or editing has been applied to the fields TEMP and CNDC stored in the netCDF files.)
      • PSAL was despiked using a 15-pt rolling window (rejecting outliers >3 SD from the median).
      • Finally, a rolling 3-point median was applied to CNDC_mod and TEMP_mod.
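
    The two building blocks named in the list above (rolling-window despiking around the median, and a rolling 3-point median) can be sketched roughly as follows. This is not the kval implementation: the function names, the handling of window edges (truncated windows, endpoints left unchanged), the use of the population standard deviation, and the use of NaN to mark rejected points are all assumptions made for illustration.

```python
import math
import statistics

def despike(values, window=31, n_sd=3.0):
    """Replace points more than n_sd standard deviations from the rolling median with NaN."""
    half = window // 2
    out = []
    for i, v in enumerate(values):
        chunk = values[max(0, i - half): i + half + 1]  # window truncated at the edges
        med = statistics.median(chunk)
        sd = statistics.pstdev(chunk)
        is_spike = sd > 0 and abs(v - med) > n_sd * sd
        out.append(math.nan if is_spike else v)
    return out

def rolling_median3(values):
    """Apply a centred 3-point median, ignoring NaNs; endpoints are left unchanged."""
    out = list(values)
    for i in range(1, len(values) - 1):
        valid = [x for x in values[i - 1: i + 2] if not math.isnan(x)]
        out[i] = statistics.median(valid) if valid else math.nan
    return out

# Made-up conductivity-like series with a single spike at index 20.
series = [35.0 + 0.01 * (i % 3) for i in range(40)]
series[20] = 40.0
cleaned = despike(series)
smoothed = rolling_median3(cleaned)
```

    In this sketch the input is assumed to contain no NaNs before despiking; real records with gaps would need those handled first.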

    For the SBE16plus instrument:

    • Major outliers in PSAL were removed using a threshold value of 25.
    • Additional outliers were removed using a 31-pt rolling window (rejecting outliers >3 SD from the median).

    Validation against shipboard CTDs

    Measured variables were found to agree well with post-deployment CTD profiles (from an SBE911+ on the R/V Kronprins Haakon) from the start of the record. At the end of the record, all sensors were found to agree reasonably well with a pre-recovery shipboard CTD profile with the exception of the upper instrument (SBE16plus S/N 50241). We attribute this to the profile being complex around 50 m depth at this time (region of an ~1 °C cold intrusion and salinity inversion on the background of a strong halocline). The temperature-salinity distribution is broadly consistent with the measurement being physically sensible, as is the salinity increase from the sensor near 46 m to the one near 59 m. Users should be aware that the SBE16plus salinity data could not be validated against other measurements.

    [Figure: Temperature and Salinity profile comparison] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/comparison_at200_pre_deployment_ctd_timeseries_profile.png
    [Figure: T-S comparison] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/comparison_at200_pre_deployment_ctd_timeseries_T_S.png

    Comparison of temperature (left) and practical salinity (middle) profiles and temperature-salinity distributions (right) between moored CTDs (colors) and shipboard CTD SBE911+ profile (black) on Oct 5 2022, the day before mooring recovery. Coloured dots indicate the moored CTD value closest to the profile timestamp, and coloured lines show values collected within ±1h of the profile timestamp.

    [Figure] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/comparison_at200_pre_deployment_ctd_timeseries.png

    Comparison between moored CTDs (black/blue) and shipboard CTD (red) on Oct 5 2022, the day before mooring recovery. Blue lines highlight the moored CTD values within ±one hour of the ship CTD profile.

    A post-deployment calibration CTD cast was performed after recovery of the moorings, on 09.10.22. Here, two of the SBE37 instruments (#15252 and #20773) were attached to the ship CTD rosette and submerged with resting stops at 75 m, 30 m, and 20 m. Comparing the values between the two microcats and against the ship CTD suggests that these two instruments were internally consistent within approximately 0.005 psu and consistent with the ship CTD within ±0.02 psu.

    [Figure: Temperature Comparison] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/TEMP_CTD_comparison_microcats_on_rosette.png
    [Figure: Salinity Comparison] https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/atwain_moorings_21_22/at200/PSAL_CTD_comparison_microcats_on_rosette.png

    Comparison of temperature (upper) and practical salinity (lower) values from the "calibration CTD cast" on 09.10.22 where two of the AT200-6 SBE37 instruments were mounted on the rosette and resting stops were made near 75, 30, and 20 m. Black: Shipboard CTD, Red: SBE37 #15252, Blue: SBE37 #20773. Small dots show all data points from the depth stops, triangles and

  11. Data from: Outlier classification using autoencoders: application for...

    • osti.gov
    Updated Jun 2, 2021
    Cite
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B. (2021). Outlier classification using autoencoders: application for fluctuation driven flows in fusion plasmas [Dataset]. https://www.osti.gov/dataexplorer/biblio/dataset/1882649-outlier-classification-using-autoencoders-application-fluctuation-driven-flows-fusion-plasmas
    Dataset updated
    Jun 2, 2021
    Dataset provided by
    United States Department of Energy: http://energy.gov/
    Office of Science: http://www.er.doe.gov/
    Massachusetts Inst. of Technology (MIT), Cambridge, MA (United States). Plasma Science and Fusion Center
    Authors
    Bianchi, F. M.; Brunner, D.; Kube, R.; LaBombard, B.
    Description

    Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.

  12. Environmental data used for PCA

    • figshare.com
    txt
    Updated Jun 17, 2023
    Cite
    James Whitehead (2023). Environmental data used for PCA [Dataset]. http://doi.org/10.6084/m9.figshare.20088632.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    James Whitehead
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset has had outliers removed and has been formatted in order to be appropriate for use in Principal Component Analysis. The raw data was provided by Moritz von der Lippe and Anne Hiller from TU Berlin, and field measurements were carried out by Lena Fiechter.

  13. Data from: MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered...

    • data.mendeley.com
    Updated Mar 20, 2025
    Cite
    ABDERRAHMANE AQACHTOUL (2025). MQTTEEB-D: A Real-World IoT Cybersecurity Dataset for AI-Powered Threat Detection in MQTT Networks [Dataset]. http://doi.org/10.17632/jfttfjn6tr.1
    Explore at:
    Dataset updated
    Mar 20, 2025
    Authors
    ABDERRAHMANE AQACHTOUL
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the research article on MQTTEEB-D and is intended for public use in cybersecurity research. The MQTTEEB-D dataset is a practical real-world dataset for improving intrusion detection in Message Queuing Telemetry Transport (MQTT)-based Internet of Things (IoT) networks. In contrast to existing datasets built on simulated network traffic, MQTTEEB-D is obtained from a real-time IoT deployment at the International University of Rabat (UIR), Morocco. Using MySignals IoT health sensors, Raspberry Pi 4, and an MQTT broker server, this dataset represents the actual complexity of the active IoT communication process, which synthetic data fails to offer. To narrow the gap between simulated and real-world attack scenarios, various cyberattacks including Denial of Service (DoS), Slow DoS against Internet of Things Environments (SlowITe), Malformed Data Injection, Brute Force, and MQTT publish flooding were carried out in real-time, permitting close monitoring of network traffic anomalies. The data was captured using the Python wrapper for tshark (PyShark) and organized into multiple Comma-Separated Values (CSV) files. To ensure high data quality, we performed pre-processing steps, such as outlier removal, normalization, standardization, and class balancing. Several processed forms of the dataset (raw, cleaned, normalized, standardized, and balanced with the Synthetic Minority Over-sampling Technique (SMOTE)) are provided, along with detailed metadata to facilitate ease of use in cybersecurity research. This dataset provides an opportunity for researchers to develop and validate intrusion detection models in a real-world MQTT environment - a critical ingredient in Artificial Intelligence (AI)-driven cybersecurity solutions for IoT networks. The dataset will support future research in the IoT security and anomaly detection domains.
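
The description above names outlier removal, normalization, and standardization as pre-processing steps but does not say how they were carried out. The following is only a minimal sketch of what such steps commonly look like, assuming an interquartile-range (IQR) rule for outliers, min-max scaling for normalization, and z-scores for standardization. The function names and the `packet_sizes` sample are hypothetical and are not taken from the MQTTEEB-D files.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def normalize_min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Convert values to z-scores (zero mean, unit population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical feature column: packet sizes with one extreme value.
packet_sizes = [60, 62, 61, 63, 59, 64, 60, 5000]

cleaned = remove_outliers_iqr(packet_sizes)
normalized = normalize_min_max(cleaned)
standardized = standardize(cleaned)
```

Applying the scaling steps after outlier removal, as here, keeps a single extreme value from compressing the remaining data into a narrow band. Whether the dataset authors used this order is not stated.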

  14. Global Ocean Data Analysis Project version 2.2019 (GLODAPv2.2019) (NCEI...

    • cmr.earthdata.nasa.gov
    • catalog.data.gov
    not provided
    Updated Sep 26, 2019
    + more versions
    Cite
    (2019). Global Ocean Data Analysis Project version 2.2019 (GLODAPv2.2019) (NCEI Accession 0186803) [Dataset]. http://doi.org/10.25921/xnme-wr20
    Explore at:
    Available download formats: not provided (1425.488 KB)
    Dataset updated
    Sep 26, 2019
    Time period covered
    Jan 1, 1972 - Mar 5, 2017
    Area covered
    Earth
    Description

    This NCEI Accession consists of GLODAPv2.2019 data product composed of data from 840 scientific cruises covering the global ocean between 1972 and 2017. It includes full depth discrete bottle measurements of salinity, oxygen, nitrate, silicate, phosphate, dissolved inorganic carbon (TCO2), total alkalinity (TAlk), pH, chlorofluorocarbons (CFC-11, CFC-12, CFC-113, and CCl4), various isotopes and organic compounds. It was created by appending data from 116 cruises to GLODAPv2 (Olsen et al., 2016, NCEI Accession 0162565). The data for salinity, oxygen, nitrate, silicate, phosphate, TCO2, TAlk, pH, CFC-11, CFC-12, CFC-113, and CCl4 were subjected to primary and secondary quality control. Severe biases in these data have been corrected for, and outliers removed. However, differences in data related to any known or likely time trends or variations have not been corrected for. These data are believed to be accurate to 0.005 in salinity, 1% in oxygen, 2% in nitrate, 2% in silicate, 2% in phosphate, 4 µmol kg⁻¹ in TCO2, 4 µmol kg⁻¹ in TAlk, and for the halogenated transient tracers: 5%.


Cite
Holsbø, Einar (2024). Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets" [Dataset]. http://doi.org/10.18710/FGVLKS

Supporting data for "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets"

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Jul 29, 2024
Dataset provided by
DataverseNO
Authors
Holsbø, Einar
Description

This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets." (In submission) The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details
