This dataset is example data from the Norwegian Women and Cancer study. It is supporting information to our article "A Standard Operating Procedure for Outlier Removal in Large-Sample Epidemiological Transcriptomics Datasets." (In submission) The bulk of the data comes from measuring gene expression in blood samples from the Norwegian Women and Cancer study (NOWAC) on Illumina Whole-Genome Gene Expression Bead Chips, HumanHT-12 v4. Please see README.txt for details.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This is a large-scale dataset with impedance and signal loss data recorded on volunteer test subjects using low-voltage alternating-current sine-shaped signals. The signal frequencies range from 50 kHz to 20 MHz.
Applications: This dataset is intended to support investigation of the human body as a signal propagation medium, and to capture how the properties of the human body (age, sex, composition, etc.), the measurement locations, and the signal frequencies affect signal loss over the human body.
Overview statistics:
Number of subjects: 30
Number of transmitter locations: 6
Number of receiver locations: 6
Number of measurement frequencies: 19
Input voltage: 1 V
Load resistance: 50 ohm and 1 megaohm
Measurement group statistics:
Height: 174.10 (7.15)
Weight: 72.85 (16.26)
BMI: 23.94 (4.70)
Body fat %: 21.53 (7.55)
Age group: 29.00 (11.25)
Male/female ratio: 50%
Included files:
experiment_protocol_description.docx - protocol used in the experiments
electrode_placement_schematic.png - schematic of placement locations
electrode_placement_photo.jpg - visualization of the experiment, on a volunteer subject
RawData - the full measurement results and experiment info sheets
all_measurements.csv - the most important results extracted to .csv
all_measurements_filtered.csv - same, but after z-score filtering
all_measurements_by_freq.csv - the most important results extracted to .csv, single frequency per row
all_measurements_by_freq_filtered.csv - same, but after z-score filtering
summary_of_subjects.csv - key statistics on the subjects from the experiment info sheets
process_json_files.py - script that creates .csv from the raw data
filter_results.py - outlier removal based on z-score
plot_sample_curves.py - visualization of a randomly selected measurement result subset
plot_measurement_group.py - visualization of the measurement group
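The z-score filtering that filter_results.py is described as performing can be illustrated roughly as follows. This is a minimal sketch, not the dataset's actual script: the function name zscore_filter, the threshold of 3, and the sample values are assumptions made for illustration.

```python
import statistics


def zscore_filter(values, threshold=3.0):
    """Split values into (kept, removed) using |z| > threshold.

    The threshold of 3.0 is an assumed default, not taken from the dataset.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    if std == 0:
        # All values identical: nothing can be an outlier.
        return list(values), []
    kept, removed = [], []
    for v in values:
        z = (v - mean) / std
        if abs(z) > threshold:
            removed.append(v)
        else:
            kept.append(v)
    return kept, removed


# Synthetic gain readings in dB (made up for illustration); one is far off.
gains_db = [-40.1, -39.8, -40.3, -40.0, -39.9, -40.2, -40.1,
            -39.7, -40.0, -40.2, -39.9, -40.1, -12.0]
kept, removed = zscore_filter(gains_db)
print("removed:", removed)
```

Because the mean and standard deviation are themselves inflated by the outlier, a plain z-score filter only catches a single extreme value when the sample is large enough; with very few points a robust variant (median-based) may be preferable.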
CSV file columns:
subject_id - participant's random unique ID
experiment_id - measurement session's number for the participant
height - participant's height, cm
weight - participant's weight, kg
BMI - body mass index, computed from the values above
body_fat_% - body fat composition, as measured by bioimpedance scales
age_group - age rounded to 10 years, e.g. 20, 30, 40 etc.
male - 1 if male, 0 if female
tx_point - transmitter point number
rx_point - receiver point number
distance - distance, in relative units, between the tx and rx points. Not scaled in terms of participant's height and limb lengths!
tx_point_fat_level - transmitter point location's average fat content metric. Not scaled for each participant individually.
rx_point_fat_level - receiver point location's average fat content metric. Not scaled for each participant individually.
total_fat_level - sum of rx and tx fat levels
bias - constant term to simplify data analytics, always equal to 1.0
CSV file columns, frequency-specific:
tx_abs_Z_... - transmitter-side impedance, as computed by the process_json_files.py script from the voltage drop
rx_gain_50_f_... - experimentally measured gain on the receiver, in dB, using 50 ohm load impedance
rx_gain_1M_f_... - experimentally measured gain on the receiver, in dB, using 1 megaohm load impedance
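When working with the gain columns it can help to convert between decibels and a linear amplitude ratio. The sketch below assumes the gain is a voltage ratio, i.e. gain_dB = 20 log10(V_rx / V_tx); that convention and the helper names are assumptions, so the data article should be checked before relying on it.

```python
import math


def gain_db_to_voltage_ratio(gain_db):
    """Convert a gain in dB to a linear voltage ratio (assumes 20*log10)."""
    return 10 ** (gain_db / 20)


def voltage_ratio_to_gain_db(v_rx, v_tx=1.0):
    """Convert received/transmitted voltages to a gain in dB.

    v_tx defaults to 1.0 V, matching the input voltage listed above.
    """
    return 20 * math.log10(v_rx / v_tx)


# Hypothetical reading: -40 dB corresponds to 1/100 of the input amplitude.
ratio = gain_db_to_voltage_ratio(-40.0)
print(f"-40 dB is a voltage ratio of {ratio:.4f}")
```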
Acknowledgments: The dataset collection was funded by the Latvian Council of Science, project “Body-Coupled Communication for Body Area Networks”, project No. lzp-2020/1-0358.
References: For more detailed information, see this article: J. Ormanis, V. Medvedevs, A. Sevcenko, V. Aristovs, V. Abolins, and A. Elsts. Dataset on the Human Body as a Signal Propagation Medium for Body Coupled Communication. Submitted to Elsevier Data in Brief, 2023.
Contact information: info@edi.lv
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Data release includes the following five data tables: (1) water-quality constituent outliers that were removed from the calibration of regression models used to estimate streamwater solute loads, (2) parameters used to model peak streamflow recurrence intervals, (3) models used to estimate streamwater constituent loads, (4) statistical summaries of water-quality observations, and (5) estimated annual streamwater constituent yields. An associated metadata file is included for each of the five data tables.
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Measurement Configuration Dataset
This is the anonymous reviewing version; the source code repository will be added after the review.
This dataset provides reproduction data for performance measurement configuration at source code level in Java. You can obtain the measurement data yourself using the precision-experiments repository https://anonymous.4open.science/r/precision-experiments-C613/ (Examining Different Repetition Counts). The data contained here are the data we obtained from execution on an i7-4770 CPU @ 3.40GHz.
The analysis was tested on Ubuntu 20.04 and gnuplot 5.2.8. It will not work with older gnuplot versions.
To execute the analysis, extract the data by
tar -xvf basic-parameter-comparison.tar
tar -xvf parallel-sequential-comparison.tar
and afterwards build the precision-experiments repo and execute the analysis by
cd precision-experiments/precision-analysis/
../gradlew fatJar
cd scripts/configuration-analysis/
./executeCompleteAnalysis.sh ../../../../basic-parameter-comparison ../../../../parallel-sequential-comparison
Afterwards, the following files will be present:
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_all_en.pdf (Heatmaps for different repetition counts)
precision-experiments/precision-analysis/scripts/configuration-analysis/repetitionHeatmaps/heatmap_outlierRemoval_en.pdf (Heatmap with and without outlier removal for 1000 repetitions)
precision-experiments/precision-analysis/scripts/configuration-analysis/histogram_outliers_en.pdf (Histogram of the outliers)
precision-experiments/precision-analysis/scripts/configuration-analysis/heatmap_parallel_en.pdf (Heatmap with sequential and parallel execution)
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This dataset contains a list of outlier sample concentrations identified for 17 water quality constituents from streamwater samples collected at 15 study watersheds in Gwinnett County, Georgia for water years 2003 to 2020. The 17 water quality constituents are: biochemical oxygen demand (BOD), chemical oxygen demand (COD), total suspended solids (TSS), suspended sediment concentration (SSC), total nitrogen (TN), total nitrate plus nitrite (NO3NO2), total ammonia plus organic nitrogen (TKN), dissolved ammonia (NH3), total phosphorus (TP), dissolved phosphorus (DP), total organic carbon (TOC), total calcium (Ca), total magnesium (Mg), total copper (TCu), total lead (TPb), total zinc (TZn), and total dissolved solids (TDS). 885 outlier concentrations were identified. Outliers were excluded from model calibration datasets used to estimate streamwater constituent loads for 12 of these constituents. Outlier concentrations were removed because they had a high influence on the model fits o ...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard Feasible Generalized Least Squares (FGLS) estimator is efficient as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277-288, 2000) can accommodate outliers, but it is hard to compute. First we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real data set from economics. Next we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the General Multivariate Chain Ladder (GMCL) model that employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug in our FastSUR algorithm into the GMCL model to obtain a robust version.
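For orientation, the SUR model and the classical (non-robust) FGLS estimator mentioned above can be written in standard textbook notation. The symbols below (m equations, n observations per equation, stacked response y, block-diagonal design X) are assumptions introduced here and are not taken from the abstract; the robust estimator and FastSUR itself are not reproduced.

```latex
\begin{align}
  y_i &= X_i \beta_i + \varepsilon_i, \qquad i = 1, \dots, m, \\
  \operatorname{Cov}(\varepsilon_i, \varepsilon_j) &= \sigma_{ij} I_n,
  \qquad \Sigma = (\sigma_{ij})_{i,j=1}^{m}, \\
  \hat{\beta}_{\mathrm{FGLS}} &=
  \bigl( X^{\top} (\hat{\Sigma}^{-1} \otimes I_n) X \bigr)^{-1}
  X^{\top} (\hat{\Sigma}^{-1} \otimes I_n)\, y .
\end{align}
```

Here X = diag(X_1, ..., X_m), y stacks y_1, ..., y_m, and the estimate of Σ is typically formed from equation-by-equation least-squares residuals, which is where a few outlying observations can distort the whole fit.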
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
This data set includes input data for the development of regression models to predict chloride from specific conductance (SC) data at 56 U.S. Geological Survey water quality monitoring stations in the eastern United States. Each site has 20 or more simultaneous observations of SC and chloride. Data were downloaded from the National Water Information System (NWIS) using the R package dataRetrieval. Datasets for each site were evaluated and outliers were removed prior to the development of the regression model. This file contains only the final input dataset for the regression models. Please refer to Moore and others (in review) for more details. Moore, J., R. Fanelli, and A. Sekellick. In review. High-frequency data reveal deicing salts drive elevated conductivity and chloride along with pervasive and frequent exceedances of the EPA aquatic life criteria for chloride in urban streams. Submitted to Environmental Science and Technology.
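As an illustration of the kind of per-site model this input data is meant for, a straight-line fit of chloride on SC could look like the sketch below. The description does not state the model form, so ordinary least squares on untransformed values is an assumption, fit_line and predict_chloride are hypothetical helper names, and the numbers are synthetic (a real site would have 20 or more paired observations).

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic paired observations for one hypothetical site:
# specific conductance (microsiemens/cm) and chloride (mg/L).
sc = [200.0, 400.0, 800.0, 1600.0, 3200.0]
chloride = [30.0, 80.0, 180.0, 380.0, 780.0]

slope, intercept = fit_line(sc, chloride)


def predict_chloride(sc_value):
    """Predict chloride (mg/L) from a specific conductance reading."""
    return slope * sc_value + intercept


print(f"chloride ~ {slope:.3f} * SC + {intercept:.1f}")
```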
This product was developed as part of a project supported by a grant from the National Oceanic and Atmospheric Administration’s Ocean Acidification Program under award NA18OAR0170430 to the Virginia Institute of Marine Science. The data product consists of water quality data for 98 tidal stations for 1984–2018. The source data used to generate this product were downloaded from the Chesapeake Bay Program’s (CBP) data hub. Out of the total of 255 monitoring stations in the Tidal Monitoring Program, we selected 98 with a long monitoring record (30 years or longer). The following variables were downloaded from the data hub at the native temporal and vertical resolution (between one and four cruises per month and approximately 10 depth levels sampled between 0 and 37 m) for 1984–2018: water temperature (T), salinity (S), pH, total alkalinity (TA), dissolved oxygen (DO), and chlorophyll (Chl). All pH data prior to 1998 were removed because of data quality concerns (Herrmann et al., 2020). Briefly, we found a dramatic difference in long-term trends between stations measured by institutions in the state of Virginia and stations measured by the state of Maryland, particularly from late spring to early fall. The boundary between the station groups runs east–west within the mesohaline portion of the bay, where the Potomac River estuary intersects the mainstem bay. The boundary separates strong negative linear trends to the south (Virginia stations) from neutral and weakly positive linear trends to the north (Maryland stations). For all variables, data entries marked with CBP’s “Problem” and “Qualifier” flags were removed.
Additionally, all variables were scanned for extreme outliers: for each variable, data from all stations, depths, and times were combined into a single composite sample for which the 75th and 25th percentiles (i.e., the upper and lower quartiles) and the interquartile range (the difference between the upper and lower quartiles) were calculated. Extreme outliers were defined as the values falling outside of a certain number (censoring criterion) of interquartile ranges from the upper and lower quartiles.
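The interquartile-range screening described above can be sketched as follows. The censoring criterion is not given in this description, so the multiplier k = 3.0, the function names, and the sample values are all assumptions for illustration.

```python
import statistics


def iqr_outlier_bounds(values, k=3.0):
    """Return (lower, upper) bounds: k interquartile ranges beyond the quartiles.

    k is the censoring criterion; 3.0 is an assumed value.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_extreme_outliers(values, k=3.0):
    """Keep only the values that lie within the IQR-based bounds."""
    lower, upper = iqr_outlier_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Synthetic composite sample of water temperatures (deg C); 95 is a bad entry.
temperatures = [10, 11, 12, 13, 14, 15, 16, 17, 18, 95]
lower, upper = iqr_outlier_bounds(temperatures)
cleaned = remove_extreme_outliers(temperatures)
print(f"bounds: [{lower}, {upper}], kept {len(cleaned)} of {len(temperatures)}")
```

Because the bounds come from quartiles rather than from the mean and standard deviation, a handful of extreme entries has little effect on where the cut-offs fall.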
http://spdx.org/licenses/CC0-1.0
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A-TWAIN (Long-term variability and trends in the Atlantic Water inflow region) was established to gain understanding on how the inflowing current system is distributed at different depths along the continental slope, how it responds to local, short-lived atmospheric changes, and how it varies on seasonal and interannual timescales.
As part of A-TWAIN, three moorings were redeployed near the continental slope of the Nansen Basin in the Arctic Ocean, near 31°E north of the Barents Sea. The moorings were operational between November 2021 and October 2022. All moorings have previously been deployed in the same respective locations; these data constitute the 2021-2022 continuation of the A-TWAIN mooring time series.
AT800-7* and AT200-6 moorings were instrumented mooring lines extending from the bottom anchor to a sub-surface buoy, while AT500-2 was a bottom lander. CTD and ADCP data from the moorings will be made available here; other datasets from these moorings will be published elsewhere. Processed data will be added here as they become available.
* "AT800-7" denotes the 7th deployment of the AT800 mooring.
Table: Details of the mooring deployments

Mooring | Type | Bottom depth | Latitude (°N) | Longitude (°E) | Deployment date | Recovery date | Data status |
---|---|---|---|---|---|---|---|
AT200-6 | Instrumented line | 205 m | 81.4105 | 31.2433 | 09.11.21 | 06.10.22 | CTD data published |
AT500-2 | Bottom lander | 488 m | 81.4577 | 31.0753 | 09.11.21 | 04.10.22 | |
AT800-7 | Instrumented line | 889 m | 81.5501 | 30.8777 | 09.11.21 | 04.10.22 | |

A-TWAIN mooring locations showing IBCAO v4 bathymetry.
All three moorings were deployed in November 2021 during the joint Nansen Legacy and A-TWAIN/SIOS-InfraNor mooring service cruise (KPH20217123), and recovered in October 2022 during the Nansen Legacy Mooring Service Cruise (KPH2022712).
Processed data are made available as one NetCDF file per instrument. Raw instrument data are also available. Details of the processing of the respective datasets are given below.
AT200-6 CTD data

Instrument | Median depth | Serial number | Sampling frequency | File name |
---|---|---|---|---|
SBE16plus | 46 m | 50241 | 45 min | AT200_2021_2022_SBE16plus_50241_pres_temp_sal_46m.nc |
SBE37SMP | 59 m | 20773 | 15 min | AT200_2021_2022_SBE37SMP_20773_pres_temp_sal_59m.nc |
SBE37SM | 113 m | 15252 | 15 min | AT200_2021_2022_SBE37SM_15252_pres_temp_sal_113m.nc |
SBE37SM | 191 m | 9293 | 15 min | AT200_2021_2022_SBE37SM_9293_pres_temp_sal_191m.nc |
AT200 CTD data were processed to .cnv using SBEDataProcessing software. Additional processing was done in Python using the kval library (v0.0.2-beta, this commit).
Processing steps as well as a Python script for reproducing the post-processing from .cnv can be found in the PROCESSING variable of each file.
All records were chopped to the time range 2021-11-09 20:00 - 2022-10-06 07:30 in order to remove data from recovery/deployment and deck times.
After visual inspection, no editing was applied to temperature and pressure.
Salinity has been lightly edited in order to remove noise and outliers (see PSAL variable attributes for details).
The identification of outliers is complicated by the large hydrographic variability in this location, reflecting sharp lateral gradients near the continental slope in combination with an energetic background environment and relatively strong tides. The processing has therefore been done using a relatively light approach, described below. This editing may or may not be appropriate or sufficient for specific research purposes. Users who want to apply their own editing are encouraged to work with unedited salinity, which can easily be obtained by reprocessing salinity from TEMP and CNDC (both of which have been left unedited).
For SBE37 instruments:
PSAL was recomputed from modified conductivity CNDC_mod and temperature TEMP_mod in order to reduce (presumably artificial) high-frequency noise:
CNDC_mod was despiked using a 31-pt rolling window (rejecting outliers >3 SD from the median).
[...] CNDC_mod and TEMP_mod.
PSAL was recomputed from temperature, conductivity and pressure using the GSW-Python library.
([...] TEMP and CNDC stored in the netCDF files.)
PSAL was despiked using a 15-pt rolling window (rejecting outliers >3 SD from the median).
[...] CNDC_mod and TEMP_mod.
For the SBE16plus instrument:
[...] PSAL were removed using a threshold value of 25.
Measured variables were found to agree well with post-deployment CTD profiles (from an SBE911+ on the R/V Kronprins Haakon) from the start of the record. At the end of the record, all sensors were found to agree reasonably well with a pre-recovery shipboard CTD profile with the exception of the upper instrument (SBE16plus S/N 50241). We attribute this to the profile being complex around 50 m depth at this time (region of an ~1 °C cold intrusion and salinity inversion on the background of a strong halocline). The temperature-salinity distribution is broadly consistent with the measurement being physically sensible, as is the salinity increase from the sensor near 46 m to the one near 59 m. Users should be aware that the SBE16plus salinity data could not be validated against other measurements.
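The rolling-window despiking described for the SBE37 conductivity and salinity records (reject points more than 3 SD from the window median) can be sketched in plain Python as below. This is not the kval implementation: the function name rolling_despike, the centred window that is truncated at the record edges, the use of the population standard deviation, and the synthetic data are all assumptions.

```python
import math
import statistics


def rolling_despike(values, window=31, n_sd=3.0):
    """Replace points further than n_sd standard deviations from the
    median of a centred rolling window with NaN.

    The window is truncated at the start and end of the record.
    """
    half = window // 2
    cleaned = []
    for i, v in enumerate(values):
        chunk = values[max(0, i - half): i + half + 1]
        med = statistics.median(chunk)
        sd = statistics.pstdev(chunk)  # population SD of the window
        if sd > 0 and abs(v - med) > n_sd * sd:
            cleaned.append(math.nan)  # flag as outlier
        else:
            cleaned.append(v)
    return cleaned


# Synthetic salinity record (made up): small alternation plus one spike.
salinity = [34.80 + 0.01 * (i % 2) for i in range(100)]
salinity[50] = 33.0
despiked = rolling_despike(salinity, window=31, n_sd=3.0)
n_rejected = sum(math.isnan(v) for v in despiked)
print("rejected points:", n_rejected)
```

A shorter window (such as the 15-pt one mentioned for PSAL) adapts faster to real hydrographic changes but gives a noisier estimate of the local spread, which is the trade-off behind choosing a "relatively light" editing approach.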
Comparison of temperature (left) and practical salinity (middle) profiles and temperature-salinity distributions (right) between moored CTDs (colors) and shipboard CTD SBE911+ profile (black) on Oct 5 2022, the day before mooring recovery. Coloured dots indicate the moored CTD value closest to the profile timestamp, and coloured lines show values collected within ±1h of the profile timestamp.
Comparison between moored CTDs (black/blue) and shipboard CTD (red) on Oct 5 2022, the day before mooring recovery. Blue lines highlight the moored CTD values within ±one hour of the ship CTD profile.
A post-deployment calibration CTD cast was performed after recovery of the moorings, on 09.10.22. Here, two of the SBE37 instruments (#15252 and #20773) were attached to the ship CTD rosette and submerged with resting stops at 75 m, 30 m, and 20 m. Comparing the values between the two microcats and against the ship CTD suggests that these two instruments were internally consistent within approximately 0.005 psu and consistent with the ship CTD within ±0.02 psu.
Comparison of temperature (upper) and practical salinity (lower) values from the "calibration CTD cast" on 09.10.22 where two of the AT200-6 SBE37 instruments were mounted on the rosette and resting stops were made near 75, 30, and 20 m. Black: Shipboard CTD, Red: SBE37 #15252, Blue: SBE37 #20773. Small dots show all data points from the depth stops, triangles and
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset has had outliers removed and has been formatted in order to be appropriate for use in Principal Component Analysis. The raw data was provided by Moritz von der Lippe and Anne Hiller from TU Berlin, and field measurements were carried out by Lena Fiechter.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the research article on MQTTEEB-D and is intended for public use in cybersecurity research. The MQTTEEB-D dataset is a practical real-world data set for intrusion detection improvement in Message Queuing Telemetry Transport (MQTT)-based Internet of Things (IoT) networks. In contrast to already existing datasets that are constructed on simulated network traffic, MQTTEEB-D is obtained from a real-time IoT deployment at the International University of Rabat (UIR), Morocco. Using MySignals IoT health sensors, Raspberry Pi 4, and an MQTT broker server, this dataset represents the actual complexity of the active IoT communication process, which synthetic data fails to offer. To narrow the gap between simulated and real-world attack scenarios, various cyberattacks including Denial of Service (DoS), Slow DoS against Internet of Things Environments (SlowITe), Malformed Data Injection, Brute Force, and MQTT publish flooding were carried out in real-time, permitting close monitoring of network traffic anomalies. The data was captured using the Python wrapper for tshark (PyShark) and organized into multiple Comma-Separated Values (CSV) files. To ensure high data quality, we performed pre-processing steps, such as outlier removal, normalization, standardization, and class balancing. Several processed forms of this dataset (raw, cleaned, normalized, standardized, and balanced with the Synthetic Minority Over-sampling Technique (SMOTE)) are provided, along with detailed metadata to facilitate ease of use in cybersecurity research. This dataset provides an opportunity for researchers to develop and validate intrusion detection models in a real-world MQTT environment - a critical ingredient in Artificial Intelligence (AI)-driven cybersecurity solutions for IoT networks. The dataset will support future research in the IoT security and anomaly detection domains.
This NCEI Accession consists of GLODAPv2.2019 data product composed of data from 840 scientific cruises covering the global ocean between 1972 and 2017. It includes full depth discrete bottle measurements of salinity, oxygen, nitrate, silicate, phosphate, dissolved inorganic carbon (TCO2), total alkalinity (TAlk), pH, chlorofluorocarbons (CFC-11, CFC-12, CFC-113, and CCl4), various isotopes and organic compounds. It was created by appending data from 116 cruises to GLODAPv2 (Olsen et al., 2016, NCEI Accession 0162565). The data for salinity, oxygen, nitrate, silicate, phosphate, TCO2, TAlk, pH, CFC-11, CFC-12, CFC-113, and CCl4 were subjected to primary and secondary quality control. Severe biases in these data have been corrected for, and outliers removed. However, differences in data related to any known or likely time trends or variations have not been corrected for. These data are believed to be accurate to 0.005 in salinity, 1% in oxygen, 2% in nitrate, 2% in silicate, 2% in phosphate, 4 µmol kg-1 in TCO2, 4 µmol kg-1 in TAlk, and for the halogenated transient tracers: 5%.