Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The objects are numbered. The Y-variable is the boiling point. The other features are structural features of the molecules. In the outlier column, outliers are assigned a value of 1.
The data is derived from a published chemical dataset on boiling point measurements [1] and from public data [2]. Features were generated by means of the RDKit Python library [3]. The dataset was infused with known outliers (~5%) based on significant structural differences, i.e. polar and non-polar molecules.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The study of reaction times and their underlying cognitive processes is an important field in Psychology. Reaction times are often modeled through the ex-Gaussian distribution, because it provides a good fit to many empirical datasets. The complexity of this distribution makes computational tools essential, so there is a strong need for efficient and versatile tools for research in this area. In this manuscript we discuss some mathematical details of the ex-Gaussian distribution and apply the ExGUtils package, a set of functions and numerical tools written in Python for the numerical analysis of data involving the ex-Gaussian probability density. In order to validate the package, we present an extensive analysis of fits obtained with it, discuss advantages and differences between the least squares and maximum likelihood methods, and quantitatively evaluate the goodness of the obtained fits (a point usually overlooked in the literature in the area). The analysis allows one to identify outliers in the empirical datasets and to determine, on principled grounds, whether data trimming is needed and at which points it should be done.
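For orientation, the ex-Gaussian density (the sum of a normal variable with mean μ and standard deviation σ and an independent exponential variable with mean τ) can be written and checked numerically in a few lines. The sketch below uses only the standard library and is not the ExGUtils API; the function names and parameter values are illustrative assumptions.

```python
import math


def exgauss_pdf(x, mu, sigma, tau):
    """Ex-Gaussian density: normal(mu, sigma) plus an exponential with mean tau."""
    if sigma <= 0 or tau <= 0:
        raise ValueError("sigma and tau must be positive")
    exponent = sigma**2 / (2 * tau**2) - (x - mu) / tau
    z = (mu + sigma**2 / tau - x) / (math.sqrt(2) * sigma)
    return math.exp(exponent) * math.erfc(z) / (2 * tau)


def trapezoid(func, a, b, n=20000):
    """Plain trapezoidal rule, enough for a sanity check."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


# Illustrative parameters (milliseconds), not taken from the dataset.
mu, sigma, tau = 400.0, 40.0, 100.0
area = trapezoid(lambda x: exgauss_pdf(x, mu, sigma, tau), 0.0, 2500.0)
mean = trapezoid(lambda x: x * exgauss_pdf(x, mu, sigma, tau), 0.0, 2500.0)
```

The density should integrate to approximately 1 and its mean should be close to μ + τ, which gives a quick consistency check before fitting.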
These data are single-beam bathymetry points compiled in comma separated values (CSV) file format, generated from a hydrographic survey of the northern portion of Lake Calumet in Cook County, Illinois. Hydrographic data were collected July 18-19, 2023, using a single-beam echosounder (SBES) integrated with a Global Navigation Satellite System (GNSS) mounted on a marine survey vessel. Surface water elevation data were collected July 18 utilizing a single-base real-time kinematic (RTK)/GNSS unit. Bathymetric data points were collected as the vessel traversed the northern portions of the lake along overlapping survey lines. The SBES internally collected and stored the depth data from the echosounder and the horizontal and vertical position data of the vessel from the GNSS in real time. Data processing required specialized computer software to export bathymetry data from the raw data files. A Python script was written to calculate the lakebed elevations and identify outliers in the dataset. These data are provided in comma separated values (CSV) format as LakeCalumet_SBES_20230718.csv. Data points are stored as a series of x (longitude), y (latitude), and z (elevation or depth) points along with variable length records specific to the data transects.
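The published script is not reproduced here. As a rough sketch of the kind of calculation described, the example below assumes that lakebed elevation is the surveyed water-surface elevation minus the sounded depth (depth positive downward, with any transducer draft already applied), and it flags outliers with a Hampel-style test against the median of neighbouring soundings. The function names, window size, thresholds and sample values are all hypothetical.

```python
from statistics import median


def lakebed_elevations(water_surface_elev, depths):
    """Convert depths (positive downward) to lakebed elevations."""
    return [water_surface_elev - d for d in depths]


def flag_outliers(values, window=5, k=5.0, min_scale=0.05):
    """Flag values far from the median of their neighbours along a transect."""
    half = window // 2
    flags = []
    for i, v in enumerate(values):
        neighbours = values[max(0, i - half): i + half + 1]
        med = median(neighbours)
        mad = median([abs(n - med) for n in neighbours])
        # 1.4826 * MAD approximates a standard deviation for normal data;
        # min_scale avoids a zero threshold on perfectly flat stretches.
        scale = max(1.4826 * mad, min_scale)
        flags.append(abs(v - med) > k * scale)
    return flags


# Hypothetical soundings (m) and water-surface elevation (m).
depths = [2.0, 2.1, 2.2, 9.0, 2.3, 2.4, 2.5]
elevations = lakebed_elevations(177.0, depths)
flags = flag_outliers(elevations)
```

In this toy transect only the 9.0 m sounding is flagged; real data would need thresholds tuned to the lakebed relief and sounding density.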
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For the metabolome data, all calculations and statistical analyses were performed using Python. The Shapiro-Wilk test was performed to identify the metabolites whose concentrations in the blood showed a normal distribution, and Student’s t-test was used to compare their concentrations in blood samples for the IUGR and NORM groups. Metabolites whose concentrations did not show a normal distribution were compared between the two groups using the non-parametric Mann–Whitney test. The Benjamini–Hochberg correction was applied in both cases to account for the inflation of type I error risk associated with multiple comparisons. Before being subjected to unsupervised and supervised algorithms, the concentration of each metabolite was normalised and centred. Principal component analysis (PCA) and orthogonal projection to latent structures-discriminant analysis (OPLS-DA) were employed as unsupervised and supervised methods in the multivariate analysis, respectively. PCA was used for the identification of outliers (Mahalanobis distance metric) as well as the spontaneous clustering of similar samples in the scatter plot of the two principal components. In the OPLS-DA analysis, the X matrix consisted of metabolite concentrations, while the Y vector contained information regarding the group (IUGR or NORM). The goodness of fit of the OPLS-DA model (R2Y) was reported, and predictive performance was assessed through cross-validation. Metrics such as the predictive ability of the model (Q2Y) and the predictive ability of permuted models (Q2Y-perm) were calculated for evaluation. OPLS-DA loading plots were used to illustrate the metabolites that contributed the most to the separation between the IUGR and NORM groups. The identification of metabolites of interest was made through the combination of the variable importance in the projection (VIP) and the loading between the metabolite in the X matrix and the predictive latent variable (pLV) of the model.
Metabolites with VIP > 1.0 and high absolute loading values were considered important in the metabolomics signature (De la Barca et al., 2022).
References:
Chao de la Barca JM, Chabrun F, Lefebvre T, Roche O, Huetz N, Blanchet O, Legendre G, Simard G, Reynier P, Gascoin G: A Metabolomic Profiling of Intra-Uterine Growth Restriction in Placenta and Cord Blood Points to an Impairment of Lipid and Energetic Metabolism. Biomedicines 2022, 10:1411.
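The multiple-comparison step mentioned above can be illustrated with a small, dependency-free implementation of Benjamini–Hochberg adjusted p-values. This is a sketch of the standard procedure, not the authors' code; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all equal or larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-metabolite p-values from the t-test or Mann-Whitney test.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
```

With these inputs, two raw p-values that are below 0.05 on their own (0.039 and 0.041) are no longer significant after adjustment, which is the effect the correction is meant to have.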
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in bioinformatics work. It also highlights the importance of data-driven outlier detection in assessing spectral outputs (here demonstrated using a machine learning approach based on support vector machine regression combined with leave-one-out cross validation), as well as manual curation, in order to identify software-driven errors caused by closely related lipids and by co-elution issues.
The lipidomics case study dataset used in this work comprises a lipid extract of a human pancreatic adenocarcinoma cell line (PANC-1, Merck, UK, cat no. 87092802), analysed using an Acquity M-Class UPLC system (Waters, UK) coupled to a ZenoToF 7600 mass spectrometer (Sciex, UK). Raw output files are included alongside data processed with MS-DIAL (v4.9.221218) and Lipostar (v2.1.4), and a Jupyter notebook with Python code to analyse the outputs for outlier detection.
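The notebook itself is not reproduced here. To show the general shape of leave-one-out residual screening, the sketch below predicts each point from a model fitted to all the other points and flags points whose held-out residual is unusually large. To stay dependency-free it swaps the support vector regression for an ordinary least-squares line, and it assumes a single numeric descriptor predicting retention time; both choices, and all names and values, are illustrative assumptions rather than the published method.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least-squares line (stand-in for the SVR used in the study)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_residuals(xs, ys):
    """Residual of each point against a model fitted without that point."""
    residuals = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        residuals.append(ys[i] - (slope * xs[i] + intercept))
    return residuals


def flag_large_residuals(residuals, k=3.5):
    """Flag residuals more than k robust standard deviations from the median."""
    med = median(residuals)
    scale = 1.4826 * median([abs(r - med) for r in residuals])
    return [abs(r - med) > k * scale for r in residuals]


# Hypothetical descriptor (e.g. total carbon number) and retention times (min).
descriptor = [30, 32, 34, 36, 38, 40, 42, 44]
retention = [5.0, 5.6, 6.1, 6.6, 9.5, 7.7, 8.2, 8.8]
flags = flag_large_residuals(loo_residuals(descriptor, retention))
```

Holding each point out before predicting it keeps a suspect annotation from pulling the fit towards itself, which is why the held-out residual is a sharper outlier signal than the in-sample one.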
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a detailed inventory of road intersections and their corresponding suburbs within Cape Town, meticulously curated to highlight instances of high pedestrian crash counts resulting in serious injuries observed in "high-high" cluster and "high-low" outlier fishnet grid cells across the years 2017, 2018 and 2019. To enhance its utility, the dataset meticulously colour-codes each month associated with elevated crash occurrences, providing a nuanced perspective. Furthermore, the dataset categorises road intersections based on their placement within "high-high" clusters (marked with pink tabs) or "high-low" outlier cells (indicated by red tabs). For ease of navigation, the intersections are further organised alphabetically by suburb name, ensuring accessibility and clarity.
Data Specifics
Data Type: Geospatial-temporal categorical data with numeric attributes
File Format: Word document (.docx)
Size: 231 KB
Number of Files: The dataset contains a total of 245 road intersection records (7 "high-high" clusters and 238 "high-low" outliers)
Date Created: 21st May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Open Refine, Python, SQL
Processing Steps: The raw road traffic crash data underwent a comprehensive refining process using Python software to ensure its accuracy and consistency. Following this, duplicates were eliminated to retain only one entry per crash incident. Subsequently, the data underwent further refinement with Open Refine software, focusing specifically on isolating unique crash descriptions for subsequent geocoding in ArcGIS Pro. Notably, during this process, only the road intersection crashes were retained, as they were the only incidents with spatial definitions.
Once geocoded, road intersection crashes that involved a pedestrian with a severe or fatal injury type were extracted so that subsequent spatio-temporal analyses would focus on these crashes only. The spatio-temporal analysis methods by which these pedestrian crashes were analysed included spatial autocorrelation, hotspot analysis, and cluster and outlier analysis. Leveraging these methods, road intersections with pedestrian crashes that resulted in a severe injury identified as either "high-high" clusters or "high-low" outliers were extracted for inclusion in the dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2019
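The "high-high" and "high-low" labels come from the cluster and outlier analysis run in ArcGIS Pro. The sketch below shows only the underlying idea, using a local Moran's I statistic on a small fishnet grid with rook (edge-sharing) neighbours. It omits the permutation-based significance test that decides which cells are actually reported, and the grid values, neighbour definition and function name are hypothetical.

```python
def local_moran_labels(grid):
    """Label each cell by the sign of its deviation and its neighbours' mean deviation."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    mean = sum(values) / len(values)
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    if m2 == 0:
        raise ValueError("all cells are equal; local Moran's I is undefined")
    labels, stats = [], []
    for r in range(rows):
        label_row, stat_row = [], []
        for c in range(cols):
            neighbours = [
                grid[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]
            z = grid[r][c] - mean
            lag = sum(n - mean for n in neighbours) / len(neighbours)
            stat_row.append(z * lag / m2)  # negative values indicate spatial outliers
            label_row.append(("H" if z > 0 else "L") + ("H" if lag > 0 else "L"))
        labels.append(label_row)
        stats.append(stat_row)
    return labels, stats


# Hypothetical crash counts per fishnet cell.
counts = [
    [9, 8, 1, 0],
    [8, 9, 0, 1],
    [1, 0, 0, 7],
    [0, 1, 1, 0],
]
labels, stats = local_moran_labels(counts)
```

Here the top-left block of high counts is labelled "HH" (a cluster), while the isolated cell with 7 crashes surrounded by low counts is labelled "HL" (an outlier).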
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a detailed inventory of road intersections and their corresponding suburbs within Cape Town, meticulously curated to highlight instances of high motorcycle (Motorcycle: Above 125cc, Motorcycle: 125cc and under, Quadru-cycle, Motor Tricycle) crash counts that resulted in injuries (slight, serious, fatalities) observed in "high-high" cluster and "high-low" outlier fishnet grid cells across the years 2017, 2018 and 2019. To enhance its utility, the dataset meticulously colour-codes each month associated with elevated crash occurrences, providing a nuanced perspective. Furthermore, the dataset categorises road intersections based on their placement within "high-high" clusters (marked with pink tabs) or "high-low" outlier cells (indicated by red tabs). For ease of navigation, the intersections are further organised alphabetically by suburb name, ensuring accessibility and clarity.
Data Specifics
Data Type: Geospatial-temporal categorical data with numeric attributes
File Format: Word document (.docx)
Size: 157 KB
Number of Files: The dataset contains a total of 158 road intersection records (11 "high-high" clusters and 147 "high-low" outliers)
Date Created: 22nd May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Open Refine, Python, SQL
Processing Steps: The raw road traffic crash data underwent a comprehensive refining process using Python software to ensure its accuracy and consistency. Following this, duplicates were eliminated to retain only one entry per crash incident. Subsequently, the data underwent further refinement with Open Refine software, focusing specifically on isolating unique crash descriptions for subsequent geocoding in ArcGIS Pro. Notably, during this process, only the road intersection crashes were retained, as they were the only incidents with spatial definitions.
Once geocoded, road intersection crashes that involved a motor tricycle, a motorcycle above 125cc, a motorcycle below 125cc or a quadru-cycle, and that were additionally associated with a slight, severe or fatal injury type, were extracted so that subsequent spatio-temporal analyses would focus on these crashes only. The spatio-temporal analysis methods by which these motorcycle crashes were analysed included spatial autocorrelation, hotspot analysis, and cluster and outlier analysis. Leveraging these methods, road intersections with motorcycle crashes identified as either "high-high" clusters or "high-low" outliers were extracted for inclusion in the dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2019
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains hydrography (CTD and T instruments) and current (ADCP instruments) data from the mooring M1 in the northwestern Barents Sea from 2021 onward.
Details of the mooring deployments and the data processing can be found below.
Data will be added to this dataset as they become available and are processed. A summary of the dataset history is found at the bottom of this page.
![M1 map](https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/m1_4/m1_map.png)
M1 mooring location showing IBCAO v4 bathymetry.
The M1 mooring in the northwestern Barents Sea was first deployed in late 2018 as part of the Nansen Legacy project. The mooring was located on the topographic slope at a site meant to capture inflows from the north.
The base instrumentation on the mooring consists of hydrographic instruments (CTD/T) and current profilers (ADCP). Other sensors have also been carried on the mooring during various deployments; these data will be published elsewhere.
Data from the first three deployments of M1 (M1-1, M1-2, and M1-3) have been published and are available at the following DOI: https://doi.org/10.21334/npolar.2022.1a68b156. While the Nansen Legacy project has ended, the M1 mooring is currently maintained by the Norwegian Polar Institute.

| Mooring | Deployed | Recovered | Bottom depth | Latitude (N) | Longitude (E) | Status |
|--|--|--|--|--|--|--|
| M1-4 | 10.11.21 | 07.10.22 | 263 m | 79.5829 | 28.0717 | Hydrography data published |
| M1-5 | 07.10.22 | - | 268 m | 79.5819 | 28.0866 | Mooring lost |
Details, M1-4 mooring (2021-2022) hydrography data
The salinity (PSAL) observed at the two upper CTD sensors (#60600 and #204991) at M1-4 compared somewhat poorly with salinity measured by the ship CTD at the mooring site before recovery. Ad-hoc corrections, detailed below, have been applied to these two salinity records, but in-situ calibration is complicated by the hydrographic variability in the area.
- Users should be aware that salinity values from these two sensors are uncertain.
- Original, unedited salinity can be recomputed from CNDC (which has not been adjusted) along with PRES and TEMP.
M1-4 CTD instruments:

| Instrument | S/N | Depth (m) | Sampling rate | File |
|--|--|--|--|--|
| RBR Concerto<sup>x</sup> | 60600 | 22 | 1 min | M1_2021_2022_RBR_CONC_60600_pres_temp_sal_22m_v1.nc |
| RBR Concerto<sup>x,\*</sup> | 204991 | 26 | 10 min | M1_2021_2022_RBR_CONC_204991_pres_temp_sal_26m_v1.nc |
| RBR Concerto | 201405 | 59 | 1 min | M1_2021_2022_RBR_CONC_201405_pres_temp_sal_59m_v1.nc |
| RBR Concerto | 60591 | 92 | 1 min | M1_2021_2022_RBR_CONC_60591_pres_temp_sal_92m_v1.nc |
| RBR Solo | 102486 | 154 | 5 sec | M1_2021_2022_RBR_SOLO_102486_temp_154m_v1.nc |
| RBR Concerto | 60592 | 174 | 1 min | M1_2021_2022_RBR_CONC_60592_pres_temp_sal_174m_v1.nc |
| RBR Solo | 102477 | 216 | 5 sec | M1_2021_2022_RBR_SOLO_102477_temp_216m_v1.nc |
| Seabird SBE37SMP | 23180 | 250 | 1 hr | M1_2021_2022_SBE37_23180_pres_temp_sal_250m_v1.nc |

<sup>x</sup> Drift corrections have been applied to `PSAL` (details below). `CNDC` has not been edited.

<sup>\*</sup> With CHLA and PAR (not included here)
The pressure records were complete and without obvious spikes. No major drift was noticeable. After recovery, all sensors were found to be within ±10.5 dbar. No corrections were made to pressure in post-processing.
Assigning pressure to RBR Solos (temperature-only instruments)
RBR Solo instrument pressure was estimated by interpolating between the pressure records of the CTD instruments located above and below the instrument, based on nominal/target depths.
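One plausible reading of this step is a linear interpolation in nominal depth, applied sample by sample. The sketch below assumes that the two bracketing pressure records have already been put on the Solo's time base; the function name and the example pressures are hypothetical, and the actual procedure is documented in the `PROCESSING` variable of each file.

```python
def interpolate_pressure(p_above, p_below, depth_above, depth_below, depth_target):
    """Linearly interpolate pressure at a nominal depth between two CTD records."""
    if not depth_above < depth_target < depth_below:
        raise ValueError("target depth must lie between the two CTD depths")
    if len(p_above) != len(p_below):
        raise ValueError("pressure records must share a common time base")
    weight = (depth_target - depth_above) / (depth_below - depth_above)
    return [pa + weight * (pb - pa) for pa, pb in zip(p_above, p_below)]


# Hypothetical pressures (dbar) at the 92 m and 174 m Concertos,
# used to estimate pressure at the 154 m Solo.
p_92 = [93.0, 95.0]
p_174 = [175.0, 177.0]
p_154 = interpolate_pressure(p_92, p_174, 92.0, 174.0, 154.0)
```

Because the weight depends only on nominal depths, mooring knock-down recorded by the neighbouring CTDs is carried over proportionally to the temperature-only instrument.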
Comparison between moored and shipboard CTDs
Temperature and salinity from the moored instruments were compared with shipboard CTD profiles collected at the mooring site shortly after deployment and shortly before recovery. The figures below show temperature and salinity profiles collected at the mooring sites and T-S plots from the R/V Kronprins Haakon's SBE911+ CTD (thick black lines). Other CTD profiles collected within 5 km of the mooring sites are included as thin gray lines to illustrate the background variability. Colored circles show the moored CTD values at the time closest to the profile time stamp, with smaller colored dots showing values within ±1 hour of the profile time stamp.
The water properties at the mooring sites are highly variable on small spatial and temporal scales, with rapid property changes and a large degree of interleaving. This makes in-situ calibration challenging. We have only made corrections where differences were rather stark. We found that the temperature and salinity values generally compared well with shipboard observations given the limitations above.
![CTD profile comparison at the start of M1-4](https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/m1_4/startctd_profile_comparison.png)
Comparison between shipboard and moored CTDs near the mooring site after deployment of M1-4 (2021-11-10).
![CTD profile comparison at the end of M1-4](https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/m1_4/end_ctd_profile_comparison.png)
Comparison between shipboard and moored CTDs (before adjustment of the top sensors) near the mooring site before recovery of M1-4 (2022-10-07).
Salinity drift corrections
Salinity values at the two upper sensors agreed well with ship CTDs on deployment but clearly deviated from shipboard profile values at the end of the record. Post-deployment laboratory calibration of the instrument sensors agreed well with pre-deployment ones, indicating that the observed discrepancy was likely due to biofouling of the sensors.
As a corrective measure, we applied a drift factor $f$ to the salinity record, where $f$ increased linearly from 1 at the deployment start to $f_{END}$ at the end of the record. Here $f_{END}$ is the ratio between shipboard- and mooring-observed salinity, $f_{END} = PSAL_{SHIP} / PSAL_{MOOR}$, evaluated where the pressure and time align most closely during the pre-recovery cast.
The adjusted salinity $PSAL_{ADJ}$ was computed from initial salinity $PSAL_{INI}$ as

$$PSAL_{ADJ} = f \cdot PSAL_{INI}$$

| Instrument serial number | Final correction factor $f_{END}$ |
|--|--|
| 60600 | 1.0037 |
| 204991 | 1.0062 |

![CTD profile comparison at the end of M1-4, after adjustment](https://gitlab.com/NPIOcean/npi-figure-store/-/raw/main/figures/m1_4/end_ctd_profile_comparison_after_adjustment.png)
Comparison between shipboard and moored CTDs (after adjustment of the top sensors) near the mooring site before recovery of M1-4 (2022-10-07).
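The salinity drift correction described above amounts to a factor that ramps linearly in time from 1 to the final correction factor. The sketch below illustrates this for sensor #60600, assuming the ramp runs over the retained data period (2021-11-10 21:00 to 2022-10-07 06:30); the function names and the constant example salinity are hypothetical.

```python
from datetime import datetime


def drift_factor(t, t_start, t_end, f_end):
    """Correction factor rising linearly from 1 at t_start to f_end at t_end."""
    fraction = (t - t_start) / (t_end - t_start)  # timedelta ratio -> float
    return 1.0 + (f_end - 1.0) * fraction


def adjust_salinity(times, psal_ini, t_start, t_end, f_end):
    """Apply PSAL_ADJ = f * PSAL_INI sample by sample."""
    return [
        drift_factor(t, t_start, t_end, f_end) * s
        for t, s in zip(times, psal_ini)
    ]


t_start = datetime(2021, 11, 10, 21, 0)
t_end = datetime(2022, 10, 7, 6, 30)
t_mid = t_start + (t_end - t_start) / 2

# Hypothetical constant raw salinity, to make the ramp easy to see.
times = [t_start, t_mid, t_end]
psal_adj = adjust_salinity(times, [34.5, 34.5, 34.5], t_start, t_end, 1.0037)
```

The first sample is left unchanged, the midpoint is scaled by about 1.00185, and the last sample by the full 1.0037.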
Temperature records were found to be well-behaved and no obvious sensor drift was observed.
Data outside the range 2021-11-10 21:00 - 2022-10-07 06:30 were removed from the dataset in order not to include samples from deck time or during recovery/deployment.
The following processing steps were applied in order to gently denoise the Concerto salinity records and remove salinity outliers. This processing may or may not be sufficient for specific scientific uses. Interested users can recalculate unaltered salinity from conductivity, temperature and pressure.

- A running mean\* (denoted `_roll`) was applied to `CNDC` and `TEMP`.
- Salinity (`PSAL`) was recalculated from `CNDC_roll`, `TEMP_roll`, and pressure using the gsw-Python package.
- Salinity outliers were identified and rejected by:
  - (RBR Concertos): Comparing conductivity and temperature, after normalizing both to the same scale, against a difference criterion $\alpha$, which was given a value between 0.15 and 0.3. Instances of large conductivity spikes with no corresponding spike in temperature were interpreted as erroneous due to e.g. the passage of biological matter through the conductivity cell. Salinity from samples where the criterion was met was rejected. This removed less than 0.1% of the samples.

    $$\bigg| \frac{CNDC\_roll - \text{mean}(CNDC\_roll)}{\text{sd}(CNDC\_roll)} - \frac{TEMP\_roll - \text{mean}(TEMP\_roll)}{\text{sd}(TEMP\_roll)}\bigg| > \alpha$$
  - Rejecting outliers identified as `PSAL` values deviating from the 7-day running median by more than 5 rolling standard deviations.
  - Rejecting any additional obvious salinity outliers based on visual inspection of density.

Details of the processing steps taken, as well as a Python script to reproduce the processing based on source data, can be found in the `PROCESSING` variable of each file.

\* A 15-point (15-minute) running mean for all Concertos except #204991 where a 3-point
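The conductivity/temperature criterion above can be sketched as follows. The inputs stand in for the smoothed `CNDC_roll` and `TEMP_roll` series. The synthetic data, the function names and the choice of population standard deviation are assumptions made for illustration; the script actually used is stored in the `PROCESSING` variable of each file.

```python
import math
from statistics import fmean, pstdev


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean, sd = fmean(values), pstdev(values)
    return [(v - mean) / sd for v in values]


def conductivity_spike_flags(cndc_roll, temp_roll, alpha=0.3):
    """Flag samples where normalized conductivity and temperature differ by more than alpha."""
    z_cndc = standardize(cndc_roll)
    z_temp = standardize(temp_roll)
    return [abs(c - t) > alpha for c, t in zip(z_cndc, z_temp)]


# Synthetic record: conductivity follows temperature, except for one spike
# with no matching temperature signal.
n = 200
temp = [1.0 + 0.5 * math.sin(2 * math.pi * i / 50) for i in range(n)]
cndc = [28.0 + 0.9 * t for t in temp]
cndc[120] += 0.5

flags = conductivity_spike_flags(cndc, temp)
flagged = [i for i, bad in enumerate(flags) if bad]
```

Only the spiked sample exceeds the criterion here. The test relies on spikes being rare: a large fraction of bad conductivity values would distort the mean and standard deviation used for normalization.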
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset offers a detailed inventory of road intersections and their corresponding suburbs within Cape Town, meticulously curated to highlight instances of high crash counts observed in "high-high" cluster and "high-low" outlier fishnet grid cells across the years 2017, 2018, 2019, and 2021. To enhance its utility, the dataset meticulously colour-codes each month associated with elevated crash occurrences, providing a nuanced perspective. Furthermore, the dataset categorises road intersections based on their placement within "high-high" clusters (marked with pink tabs) or "high-low" outlier cells (indicated by red tabs). For ease of navigation, the intersections are further organised alphabetically by suburb name, ensuring accessibility and clarity.
Data Specifics
Data Type: Geospatial-temporal categorical data with numeric attributes
File Format: Word document (.docx)
Size: 602 KB
Number of Files: The dataset contains a total of 625 road intersection records (606 "high-high" clusters and 19 "high-low" outliers)
Date Created: 21st May 2024
Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Open Refine, Python, SQL
Processing Steps: The raw road traffic crash data underwent a comprehensive refining process using Python software. Following this, duplicate crash records were eliminated to retain only one entry per crash. Subsequently, the data underwent further refinement with Open Refine software, focusing specifically on isolating unique crash descriptions for subsequent geocoding in ArcGIS Pro. Notably, during this process, only the road intersection crashes were retained, as they were the only crashes that could be spatially defined.
Once geocoded, the road traffic crash data underwent rigorous spatio-temporal analyses, encompassing spatial autocorrelation, hotspot analysis, and cluster and outlier analysis. Leveraging these methods, road intersections identified as either "high-high" clusters or "high-low" outliers were extracted for inclusion in the dataset.
Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection
Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2021 (2020 data omitted)