52 datasets found
  1. Comparison experiments by using IF.

    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Gen Li; Jason J. Jung (2023). Comparison experiments by using IF. [Dataset]. http://doi.org/10.1371/journal.pone.0247119.t001
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Gen Li; Jason J. Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison experiments by using IF.

  2. Data from: Multivariate Functional Data Visualization and Outlier Detection

    • datasetcatalog.nlm.nih.gov
    Updated May 22, 2018
    Cite
    Genton, Marc G.; Dai, Wenlin (2018). Multivariate Functional Data Visualization and Outlier Detection [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000679969
    Dataset updated
    May 22, 2018
    Authors
    Genton, Marc G.; Dai, Wenlin
    Description

    This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data. The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. A dividing curve or surface is provided to separate nonoutlying data from the outliers. Both the simulated data and the practical examples confirm that the MS-plot is superior to existing tools for visualizing centrality and detecting outliers for functional data. Supplementary material for this article is available online.

  3. Performance of DynGPE.

    • plos.figshare.com
    xls
    Updated Jun 11, 2023
    Cite
    Gen Li; Jason J. Jung (2023). Performance of DynGPE. [Dataset]. http://doi.org/10.1371/journal.pone.0247119.t002
    Available download formats: xls
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Gen Li; Jason J. Jung
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of DynGPE.

  4. API security: Access behavior anomaly dataset

    • kaggle.com
    Updated Nov 22, 2021
    Cite
    Ravi Guntur (2021). API security: Access behavior anomaly dataset [Dataset]. https://www.kaggle.com/datasets/tangodelta/api-access-behaviour-anomaly-dataset/data
    Dataset updated
    Nov 22, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Ravi Guntur
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Distributed microservices-based applications are typically accessed via APIs, either through apps or directly by programmatic means. API access is often abused by attackers trying to exploit the business logic these APIs expose, and attackers access the APIs differently from normal users. Many applications have hundreds of APIs that are called in a specific order. Depending on factors such as browser refreshes, session refreshes, network errors, or programmatic access, these behaviors are not static and can vary for the same user. API calls in long-running sessions form access graphs that need to be analysed to discover attack patterns and anomalies. Graphs don't lend themselves to numerical computation. We address this issue by providing a dataset in which user access behavior is quantified as numerical features. In addition, we provide a dataset of raw API call graphs. To support the use of these datasets, two notebooks on classification, node embeddings and clustering are also provided.

    About the dataset

    There are 4 files provided. Two files are in CSV format and two files are in JSON format. The files in CSV format are user behavior graphs represented as behavior metrics. The JSON files are the actual API call graphs. The two datasets can be joined on a key so that those who want to combine graphs with metrics could do so in novel ways.

    What is new in this dataset

    This data set captures API access patterns in terms of behavior metrics. Behaviors are captured by tracking users' API call graphs which are then summarized in terms of metrics. In some sense a categorical sequence of entities has been reduced to numerical metrics.

    CSV dataset

    There are two files provided. One called supervised_dataset.csv has behaviors labeled as normal or outlier. The second file called remaining_behavior_ext.csv has a larger number of samples that are not labeled but has additional insights as well as a classification created by another algorithm.

    What is each row

    Each row is one instance of an observed behavior that has been manually classified as normal or outlier.

    JSON dataset

    There are two files provided, corresponding to the two CSV files.

    What is each item

    Each item has an _id field that can be used to join against the CSV data sets. Then we have the API behavior graph represented as a list of edges.
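
    The join described above can be sketched as follows. This is a minimal illustration, not the dataset author's code: only the `_id` key comes from the description, while the other column names, the `edges` key, and all sample values are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical sample data: only the `_id` join key comes from the dataset
# description; every other field name and value here is an assumption.
CSV_TEXT = """_id,metric_a,label
u1,0.25,normal
u2,0.90,outlier
"""
JSON_TEXT = """[
  {"_id": "u1", "edges": [["login", "search"], ["search", "cart"]]},
  {"_id": "u2", "edges": [["login", "login"]]}
]"""


def join_on_id(csv_text, json_text):
    """Attach each behavior graph to the metrics row that shares its _id."""
    graphs = {item["_id"]: item for item in json.loads(json_text)}
    joined = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        graph = graphs.get(row["_id"])
        if graph is None:
            continue  # metrics row without a matching graph: skip it
        joined.append({**row, "edges": graph["edges"]})
    return joined


records = join_on_id(CSV_TEXT, JSON_TEXT)
```

    Indexing the graphs by `_id` first keeps the join to a single pass over the metrics rows; rows without a matching graph are dropped here, which is a design choice of this sketch.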

    Inspiration

    1. To model the classification label with a skewed distribution of normal and abnormal cases and with very few labeled samples available. Use supervised_dataset.csv
    2. To verify where the predicted class differs from the class determined by a second algorithm. Use remaining_behavior_ext.csv
  5. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Resize the window as needed and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
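
    The three-column input layout from Step 1 can also be produced and checked programmatically. The sketch below uses Python rather than the Excel and R workflow the protocol describes, so it is an alternative illustration and not the authors' method; the column names follow Step 1, while the rows are made-up values.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only: the column names follow Step 1, the values are made up.
ROWS = [
    {"Replicate": "Jan 2016", "Condition": "WT", "Value": 1.2},
    {"Replicate": "Jan 2016", "Condition": "A", "Value": 3.4},
    {"Replicate": "Feb 2016", "Condition": "WT", "Value": 1.5},
    {"Replicate": "Feb 2016", "Condition": "B", "Value": 0.7},
]


def to_csv(rows):
    """Serialize rows in the Replicate/Condition/Value layout from Step 1."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Replicate", "Condition", "Value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def values_by_condition(csv_text):
    """Group the continuous values by condition, as the plot's x-axis would."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Condition"]].append(float(row["Value"]))
    return dict(groups)


csv_text = to_csv(ROWS)
groups = values_by_condition(csv_text)
```

    Reading the file back and grouping by ‘Condition’ is a quick way to confirm that each condition has the expected number of values before plotting.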

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates

    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  6. Data from: Multivariate Outliers and the O3 Plot

    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Antony Unwin (2023). Multivariate Outliers and the O3 Plot [Dataset]. http://doi.org/10.6084/m9.figshare.7792115.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Antony Unwin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identifying and dealing with outliers is an important part of data analysis. A new visualization, the O3 plot, is introduced to aid in the display and understanding of patterns of multivariate outliers. It uses the results of identifying outliers for every possible combination of dataset variables to provide insight into why particular cases are outliers. The O3 plot can be used to compare the results from up to six different outlier identification methods. There is an R package, OutliersO3, implementing the plot. The article is illustrated with outlier analyses of German demographic and economic data. Supplementary materials for this article are available online.
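
    To give a sense of how quickly "every possible combination of dataset variables" grows, the sketch below enumerates the combinations for a small variable list. It reads the phrase as every non-empty subset of the variables, which is an assumption, and it does not use or reproduce the OutliersO3 package; the variable names are placeholders.

```python
from itertools import combinations


def variable_combinations(variables):
    """Yield every non-empty subset of the variable names, smallest first."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


# Placeholder variable names; n variables give 2**n - 1 combinations.
combos = list(variable_combinations(["x1", "x2", "x3"]))
```

    With three variables there are seven combinations; the count roughly doubles with each added variable, so the number of outlier searches rises fast as variables are added.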

  7. Additional file 2 of Outlier identification and monitoring of institutional...

    • springernature.figshare.com
    txt
    Updated Jun 21, 2023
    Cite
    Menelaos Pavlou; Gareth Ambler; Rumana Z. Omar; Andrew T. Goodwin; Uday Trivedi; Peter Ludman; Mark de Belder (2023). Additional file 2 of Outlier identification and monitoring of institutional or clinician performance: an overview of statistical methods and application to national audit data [Dataset]. http://doi.org/10.6084/m9.figshare.22612465.v1
    Available download formats: txt
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Menelaos Pavlou; Gareth Ambler; Rumana Z. Omar; Andrew T. Goodwin; Uday Trivedi; Peter Ludman; Mark de Belder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 2.

  8. Citation Network Graph

    • shibatadb.com
    Updated Sep 23, 2011
    Cite
    Yubetsu (2011). Citation Network Graph [Dataset]. https://www.shibatadb.com/article/SKbAgbBG
    Dataset updated
    Sep 23, 2011
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Description

    Network of 42 papers and 71 citation links related to "Sparse online low-rank projection and outlier rejection (SOLO) for 3-D rigid-body motion registration".

  9. Topology Bench: Systematic Graph Based Benchmarking for Optical Networks

    • rdr.ucl.ac.uk
    • zenodo.org
    bin
    Updated Oct 14, 2024
    + more versions
    Cite
    Robin Matzner; Ahuja, Akanksha; Rasoul Sadeghi Yamchi; Michael Doherty; Alejandra Beghelli Zapata; Seb J. Savory; Polina Bayvel (2024). Topology Bench: Systematic Graph Based Benchmarking for Optical Networks [Dataset]. http://doi.org/10.5522/04/27212457.v2
    Available download formats: bin
    Dataset updated
    Oct 14, 2024
    Dataset provided by
    University College London
    Authors
    Robin Matzner; Ahuja, Akanksha; Rasoul Sadeghi Yamchi; Michael Doherty; Alejandra Beghelli Zapata; Seb J. Savory; Polina Bayvel
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    TopologyBench is a systematic graph theoretical approach to benchmarking optical network topologies. Network datasets are combined with their corresponding graph theoretical analysis to provide a systematic methodology for selecting diverse sets of optical networks for benchmarking. This topology benchmark comprises a network dataset and a systematic graph theoretic analysis. The dataset provides (a) 105 real optical networks and (b) synthetic topologies, generated by the SNR-BA model, divided into (i) Syn-small of 900 synthetic networks and (ii) Syn-large of 270,000 synthetic networks. The systematic graph theoretical analysis identifies and analyses structural, spatial and spectral properties of both the real-world and synthetic networks. The graph theoretical correlation analysis reveals network design strategies leading to sparse yet efficient networks. An outlier analysis identifies networks that deviate from standard network designs. The analysis also identifies the limitations of real data in terms of network diversity and provides a justification for using synthetic data to complement the real dataset. We conclude the paper by providing a systematic methodology to cluster networks based on unsupervised machine learning and to select a diverse set of topologies for benchmarking. TopologyBench is a novel, high-quality and unified benchmark designed to facilitate research collaborations in long-haul fibre infrastructure by providing a systematic graph theoretical approach to benchmarking optical networks.

  10. DataSheet1_AEROS: AdaptivE RObust Least-Squares for Graph-Based SLAM.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 2, 2023
    Cite
    Milad Ramezani; Matias Mattamala; Maurice Fallon (2023). DataSheet1_AEROS: AdaptivE RObust Least-Squares for Graph-Based SLAM.pdf [Dataset]. http://doi.org/10.3389/frobt.2022.789444.s001
    Available download formats: pdf
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Milad Ramezani; Matias Mattamala; Maurice Fallon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In robot localisation and mapping, outliers are unavoidable when loop-closure measurements are taken into account. A single false-positive loop-closure can have a very negative impact on SLAM problems, causing an inferior trajectory to be produced or even the optimisation to fail entirely. To address this issue, popular existing approaches define a hard switch for each loop-closure constraint. This paper presents AEROS, a novel approach to adaptively solve a robust least-squares minimisation problem by adding just a single extra latent parameter. It can be used in the back-end component of the SLAM system to enable generalised robust cost minimisation by simultaneously estimating the continuous latent parameter along with the set of sensor poses in a single joint optimisation. This leads to a very close curve fit to the distribution of the residuals, thereby reducing the effect of outliers. Additionally, we formulate the robust optimisation problem using standard Gaussian factors so that it can be solved by direct application of popular incremental estimation approaches such as iSAM. Experimental results on publicly available synthetic datasets and real LiDAR-SLAM datasets collected from 2D and 3D LiDAR systems show the competitiveness of our approach with state-of-the-art techniques and its superiority in real-world scenarios.

  11. DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With...

    • frontiersin.figshare.com
    pdf
    Updated Jun 6, 2023
    Cite
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu (2023). DataSheet1_Use ggbreak to Effectively Utilize Plotting Space to Deal With Large Datasets and Outliers.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.774846.s001
    Available download formats: pdf
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Shuangbin Xu; Meijun Chen; Tingze Feng; Li Zhan; Lang Zhou; Guangchuang Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    With the rapid increase of large-scale datasets, biomedical data visualization is facing challenges. The data may be large, span different orders of magnitude, contain extreme values, and have an unclear distribution. Here we present an R package, ggbreak, that allows users to create broken axes using ggplot2 syntax. It can effectively use the plotting area to deal with large datasets (especially long sequential data), data with different magnitudes, and data that contain outliers. The ggbreak package increases the available visual space for a better presentation of the data and detailed annotation, thus improving our ability to interpret the data. The ggbreak package is fully compatible with ggplot2, and it is easy to superpose additional layers and apply scales and themes to adjust the plot using the ggplot2 syntax. The ggbreak package is open-source software released under the Artistic-2.0 license, and it is freely available on CRAN (https://CRAN.R-project.org/package=ggbreak) and Github (https://github.com/YuLab-SMU/ggbreak).

  12. GOPI Resource - Stacked Column Chart - Change in Jobs in Maryland by Month...

    • data.wu.ac.at
    csv, json, xml
    Updated Apr 27, 2017
    + more versions
    Cite
    U.S. Bureau of Labor Statistics (2017). GOPI Resource - Stacked Column Chart - Change in Jobs in Maryland by Month (with Feb and March 2010 outliers filtered out) [Dataset]. https://data.wu.ac.at/schema/data_maryland_gov/NWk4aS1ieDU2
    Available download formats: xml, csv, json
    Dataset updated
    Apr 27, 2017
    Dataset provided by
    Bureau of Labor Statistics: http://www.bls.gov/
    Area covered
    Maryland
    Description

    This dataset represents the CHANGE in the number of jobs per industry category and sub-category from the previous month, not the raw counts of actual jobs. The data behind these monthly change values is from the Bureau of Labor Statistics (BLS) Current Employment Statistics (CES) program. CES data represents businesses and government agencies, providing detailed industry data on employment on nonfarm payrolls.
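
    The distinction between raw counts and month-over-month change can be illustrated with a short sketch. The job counts below are made-up numbers, not BLS data, and the function name is a placeholder.

```python
def monthly_changes(counts):
    """Return the change from the previous month for each month after the first."""
    return [current - previous for previous, current in zip(counts, counts[1:])]


# Made-up job counts for four consecutive months (not BLS data).
raw_counts = [1000, 1012, 1007, 1020]
changes = monthly_changes(raw_counts)
```

    Four months of counts yield three change values, because the first month has no predecessor to compare against.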

  13. Building and updating software datasets: an empirical assessment

    • zenodo.org
    zip
    Updated Aug 19, 2024
    Cite
    Juan Andrés Carruthers; Juan Andrés Carruthers (2024). Building and updating software datasets: an empirical assessment [Dataset]. http://doi.org/10.5281/zenodo.11395573
    Available download formats: zip
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Juan Andrés Carruthers; Juan Andrés Carruthers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This is the repository for the scripts and data of the study "Building and updating software datasets: an empirical assessment".

    Data collected

    The data generated for the study can be downloaded as a zip file. Each folder inside the file corresponds to one of the datasets of projects employed in the study (qualitas, currentSample and qualitasUpdated). Every dataset comprises three files, "class.csv", "method.csv" and "sample.csv", with class metrics, method metrics and repository metadata of the projects respectively. Here is a description of the datasets:

    • qualitas: includes code metrics and repository metrics from the projects in the release 20130901r of the Qualitas Corpus.
    • currentSample: includes code metrics and repository metrics from a recent sample collected with our sampling procedure.
    • qualitasUpdated: includes code metrics and repository metrics from an updated version of the Qualitas Corpus applying our maintenance procedure.
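
    The folder and file layout listed above can be sketched as a small lookup. The folder and file names come from the description; the root folder name "datasets" and the assumption that each CSV sits directly inside its dataset folder are guesses, so adjust the paths to the actual archive.

```python
from pathlib import Path

DATASETS = ["qualitas", "currentSample", "qualitasUpdated"]
FILES = ["class.csv", "method.csv", "sample.csv"]


def expected_files(root):
    """Map each dataset folder to the three CSV paths the description lists."""
    root = Path(root)
    return {name: [root / name / file for file in FILES] for name in DATASETS}


layout = expected_files("datasets")  # "datasets" root folder is an assumption
```

    Checking `path.exists()` for each entry in `layout` after unzipping is one way to confirm the archive matches this description before running any analysis.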

    Plot graphics

    To plot the results and graphics in the article there is a Jupyter Notebook "Experiment.ipynb". It is initially configured to use the data in "datasets" folder.

    Replication Kit

    For replication purposes, the datasets containing recent projects from GitHub can be re-generated. To do so, install the dependencies listed in the "requirements.txt" file in the virtual environment, add GitHub tokens to the "./token" file, re-define or leave as is the paths declared in the constants (variables written in caps) in the main method, and finally run the "main.py" script. The source code scanner SourceMeter for Windows is already installed in the project. If a new release becomes available or if the tool needs to be run on a different OS, it can be replaced in the "./Sourcemeter/tool" directory.

    The script comprises 5 steps:

    1. Project retrieval from GitHub: first, the sampling frame of projects complying with specific quality criteria is retrieved from GitHub's API.
    2. Create samples: with the sampling frame retrieved, the current samples are selected (currentSample and qualitasUpdated). In the case of qualitasUpdated, the "sample.csv" file must first be present inside the qualitas folder of the dataset originally created for the study. This file contains the metadata of the projects in the Qualitas Corpus.
    3. Project download and analysis: when all the samples are selected from the sampling frame (currentSample and qualitasUpdated), the repositories are downloaded and scanned with SourceMeter. Where the analysis is not possible, the project is replaced with another one of similar size.
    4. Outlier detection: once the datasets are collected, it is necessary to manually look for possible outliers in the code metrics under study. The notebook "Experiment.ipynb" has a specific section dedicated to this ("Outlier detection (Section 4.2.2)").
    5. Outlier replacement: once the outliers are detected, the same notebook has a section for outlier replacement ("Replace Outliers") where the outliers' URLs have to be listed to find the appropriate replacement.
    • If required, the metrics from the Qualitas Corpus can also be re-generated. First, download the release 20130901r from its official webpage. Second, decompress the downloaded .tar files. Third, make sure that the compressed files with source code from the projects (.java files) are placed in the "compressed" folder; in some cases it is necessary to read the "QC_README" file in the project's folder. Finally, run the "Generate metrics for the Qualitas Corpus (QC) dataset" part of the original main script.
  14. NDWR Water Levels Dashboard

    • data-ndwr.hub.arcgis.com
    Updated Nov 19, 2020
    Cite
    Nevada Division of Water Resources (2020). NDWR Water Levels Dashboard [Dataset]. https://data-ndwr.hub.arcgis.com/datasets/ndwr-water-levels-dashboard
    Dataset updated
    Nov 19, 2020
    Dataset authored and provided by
    Nevada Division of Water Resources
    Description

    This dashboard was created for exploring water level data in Nevada using a Nevada Division of Water Resources dataset known as the Well Net Database. The main map displays selectable active and inactive water level measurement sites in each of Nevada's hydrographic basins. Water level data for selected sites display in the graph. Negative water level values represent depth below the surface in feet. The dashboard's list feature displays up to 100 of the wells that are shown in the map extent. One or more sites in the list can be selected to show on the graph. Data entry errors may exist in the Well Net data. Use caution when interpreting any outlier data points.

  15. Goodness-of-fit filtering in classical metric multidimensional scaling with...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Jan Graffelman (2023). Goodness-of-fit filtering in classical metric multidimensional scaling with large datasets [Dataset]. http://doi.org/10.6084/m9.figshare.11389830.v1
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Jan Graffelman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Metric multidimensional scaling (MDS) is a widely used multivariate method with applications in almost all scientific disciplines. Eigenvalues obtained in the analysis are usually reported in order to calculate the overall goodness-of-fit of the distance matrix. In this paper, we refine MDS goodness-of-fit calculations, proposing additional point and pairwise goodness-of-fit statistics that can be used to filter poorly represented observations in MDS maps. The proposed statistics are especially relevant for large data sets that contain outliers, with typically many poorly fitted observations, and are helpful for improving MDS output and emphasizing the most important features of the dataset. Several goodness-of-fit statistics are considered, and both Euclidean and non-Euclidean distance matrices are considered. Some examples with data from demographic, genetic and geographic studies are shown.
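
    As a rough illustration of how eigenvalues give an overall goodness-of-fit, the sketch below computes the share of total absolute eigenvalue mass captured by the k largest eigenvalues. This is one commonly used definition and is an assumption here; it is not the refined point and pairwise statistics the paper proposes, and the eigenvalues are made-up numbers.

```python
def overall_goodness_of_fit(eigenvalues, k=2):
    """Share of total absolute eigenvalue mass captured by the k largest."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(abs(value) for value in ordered)
    if total == 0:
        raise ValueError("eigenvalues sum to zero")
    return sum(ordered[:k]) / total


# Made-up eigenvalues for a two-dimensional MDS map.
gof = overall_goodness_of_fit([6.0, 3.0, 1.0, 0.0], k=2)
```

    Using absolute values in the denominator keeps the ratio meaningful when a non-Euclidean distance matrix produces negative eigenvalues.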

  16. Mood swings of pro-Russian posts on Twitter in Poland 2022-2023

    • statista.com
    Updated Jul 10, 2025
    Cite
    Statista (2025). Mood swings of pro-Russian posts on Twitter in Poland 2022-2023 [Dataset]. https://www.statista.com/statistics/1365159/mood-swings-of-pro-russian-twitter-posts-poland/
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Statista: http://statista.com/
    Time period covered
    Jan 2022 - Jan 2023
    Area covered
    Poland
    Description

    The changes in posts by pro-Russian disinformation profiles on Twitter in Poland were analyzed in comparison with the entire period from ************ to ************. In general, the posts were negative, but the graph represents the extent to which there were positive and negative outliers and polarization. In addition, the negative intensity increased after the war began in *************. Whenever positive outliers increased in a given month, negative outliers increased at the same time. This was particularly noticeable in January and *********.

  17. Two Variable Artificial Dataset.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Hong Choon Ong; Ekele Alih (2023). Two Variable Artificial Dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0125835.t007
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hong Choon Ong; Ekele Alih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Two Variable Artificial Dataset.

  18. Data Sheet 1_Outliers and anomalies in training and testing datasets for...

    • figshare.com
    pdf
    Updated Jul 15, 2025
    Cite
    Yuriy Vasilev; Anastasia Pamova; Tatiana Bobrovskaya; Anton Vladzimirskyy; Olga Omelyanskaya; Elena Astapenko; Artem Kruchinkin; Novik Vladimir; Kirill Arzamasov (2025). Data Sheet 1_Outliers and anomalies in training and testing datasets for AI-powered morphometry—evidence from CT scans of the spleen.pdf [Dataset]. http://doi.org/10.3389/frai.2025.1607348.s001
    Available download formats: pdf
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Yuriy Vasilev; Anastasia Pamova; Tatiana Bobrovskaya; Anton Vladzimirskyy; Olga Omelyanskaya; Elena Astapenko; Artem Kruchinkin; Novik Vladimir; Kirill Arzamasov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Creating training and testing datasets for machine learning algorithms to measure linear dimensions of organs is a tedious task. There are no universally accepted methods for evaluating outliers or anomalies in such datasets. This can cause errors in machine learning and compromise the quality of end products. The goal of this study is to identify optimal methods for detecting organ anomalies and outliers in medical datasets designed to train and test neural networks in morphometrics.

    Methods: A dataset was created containing linear measurements of the spleen obtained from CT scans. Labelling was performed by three radiologists. The total number of studies included in the sample was N = 197 patients. The analysis used visual methods (1.5 interquartile range; heat map; boxplot; histogram; scatter plot), machine learning algorithms (Isolation forest; Density-Based Spatial Clustering of Applications with Noise; K-nearest neighbors algorithm; Local outlier factor; One-class support vector machines; EllipticEnvelope; Autoencoders), and mathematical statistics (z-score; Grubbs' test; Rosner's test).

    Results: We identified measurement errors, input errors, abnormal size values and non-standard shapes of the organ (sickle-shaped, round, triangular, additional lobules). The most effective methods included visual techniques (including boxplots and histograms) and machine learning algorithms such as OSVM, KNN and autoencoders. A total of 32 outlier anomalies were found.

    Discussion: Curation of complex morphometric datasets must involve thorough mathematical and clinical analyses. Relying solely on mathematical statistics or machine learning methods appears inadequate.
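
    Two of the simpler checks named in the methods, the 1.5 interquartile range rule and the z-score, can be sketched as below. This is an illustration under stated assumptions, not the study's code: the measurements are made-up values, the quartile method and the thresholds are choices made for the example, and the function names are placeholders.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def z_score_outliers(values, threshold=3.0):
    """Flag values whose distance from the mean exceeds threshold * stdev."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # all values identical: nothing can be flagged
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Made-up organ lengths in millimetres, with one implausibly large value.
lengths_mm = [98, 101, 103, 104, 105, 107, 108, 110, 112, 180]
by_iqr = iqr_outliers(lengths_mm)
by_z = z_score_outliers(lengths_mm)
```

    On this small sample the IQR rule flags the value 180 while the z-score with a threshold of 3 flags nothing, because a single extreme value inflates the standard deviation it is measured against. Different rules can therefore disagree on the same data.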

  19. The Pulp-fibre Dataset.

    • plos.figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Hong Choon Ong; Ekele Alih (2023). The Pulp-fibre Dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0125835.t012
    Available download formats: xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hong Choon Ong; Ekele Alih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Pulp-fibre Dataset.

  20. The crcc T2 Revised statistics.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    + more versions
    Cite
    Hong Choon Ong; Ekele Alih (2023). The crcc T2 Revised statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0125835.t015
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Hong Choon Ong; Ekele Alih
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The crcc T2 Revised statistics.
