100+ datasets found

Data from: A fast algorithm for computing a matrix transform used to detect...
data.mendeley.com
narcis.nl
Updated Jun 9, 2020
Cite
Dan Kestner (2020). A fast algorithm for computing a matrix transform used to detect trends in noisy data [Dataset]. http://doi.org/10.17632/mkcxrky9jc.1
Explore at:
Unique identifier
https://doi.org/10.17632/mkcxrky9jc.1
Dataset updated
Jun 9, 2020
Authors
Dan Kestner
License
MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Description
A recently discovered universal rank-based matrix method to extract trends from noisy time series is described in Ierley and Kostinski (2019), but the formula for the output matrix elements, implemented there as open-access supplementary MATLAB code, is O(N^4), with N the matrix dimension. This cost can become prohibitive for time series with hundreds of sample points or more. Based on recurrence relations, here we derive a much faster O(N^2) algorithm and provide code implementations in MATLAB and in open-source Julia. In some cases one has the output matrix and needs to solve an inverse problem to obtain the input matrix. A fast algorithm and code for this companion problem, also based on the recurrence relations, are given. Finally, in the narrower but common domains of (i) trend detection and (ii) parameter estimation of a linear trend, users require not the individual matrix elements but simply their accumulated mean value. For this latter case we provide a yet faster O(N) heuristic approximation that relies on a series of rank-one matrices. These algorithms are illustrated on a time series of high-energy cosmic rays with N > 4 x 10^4.
Multidimensional Scaling With Very Large Datasets
tandf.figshare.com
txt
Updated Jun 2, 2023
Cite
Emmanuel Paradis (2023). Multidimensional Scaling With Very Large Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.6238991.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.6238991.v2
Dataset updated
Jun 2, 2023
Dataset provided by
Taylor & Francis, https://taylorandfrancis.com/
Authors
Emmanuel Paradis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Multidimensional scaling has a wide range of applications when observations are not continuous but it is possible to define a distance (or dissimilarity) among them. However, standard implementations are limited when analyzing very large datasets because they rely on eigendecomposition of the full distance matrix and require very long computing times and large quantities of memory. Here, a new approach is developed based on projection of the observations in a space defined by a subset of the full dataset. The method is easily implemented. A simulation study showed that its performance is satisfactory in different situations and that it can be run in a short time when the standard method takes a very long time or cannot be run because of memory requirements.
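The idea of embedding all observations through a subset can be illustrated with landmark MDS, a related published technique that is swapped in here for illustration and is not necessarily the algorithm of this dataset's paper. The sketch below assumes squared Euclidean distances; the function names, the synthetic data, and the choice of ten landmarks are all hypothetical.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def landmark_mds(landmark_sq_dists, sq_dists_to_landmarks, k=2):
    """Embed points using only their distances to a small landmark subset.

    landmark_sq_dists: (m, m) squared distances among the m landmarks.
    sq_dists_to_landmarks: (n, m) squared distances from every point to each landmark.
    """
    m = landmark_sq_dists.shape[0]
    # Classical MDS on the landmarks: double-centre, then eigendecompose an m x m matrix.
    J = np.eye(m) - np.ones((m, m)) / m
    B = -0.5 * J @ landmark_sq_dists @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]
    lam, V = eigvals[top], eigvecs[:, top]
    # Project every point from its distances to the landmarks.
    mean_sq = landmark_sq_dists.mean(axis=0)
    return -0.5 * (sq_dists_to_landmarks - mean_sq) @ V / np.sqrt(lam)

# Synthetic 2-D data, for illustration only.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
landmarks = points[:10]
coords = landmark_mds(sq_dists(landmarks, landmarks), sq_dists(points, landmarks), k=2)
```

Only the 10 x 10 landmark matrix is eigendecomposed, so the full 200 x 200 distance matrix is never needed for the embedding itself.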
Data from: Inferring complex phylogenies using parsimony: an empirical...
data.niaid.nih.gov
search.dataone.org
+1 more
zip
Updated Feb 22, 2008
Cite
Douglas E. Soltis; Pamela S. Soltis; Mark E. Mort; Mark W. Chase; Vincent Savolainen; Sara B. Hoot; Cynthia M. Morton (2008). Inferring complex phylogenies using parsimony: an empirical approach using three large DNA data sets for angiosperms [Dataset]. http://doi.org/10.5061/dryad.64
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.64
Dataset updated
Feb 22, 2008
Authors
Douglas E. Soltis; Pamela S. Soltis; Mark E. Mort; Mark W. Chase; Vincent Savolainen; Sara B. Hoot; Cynthia M. Morton
License
https://spdx.org/licenses/CC0-1.0.html
Description
To explore the feasibility of parsimony analysis for large data sets, we conducted heuristic parsimony searches and bootstrap analyses on separate and combined DNA data sets for 190 angiosperms and three outgroups. Separate data sets of 18S rDNA (1,855 bp), rbcL (1,428 bp), and atpB (1,450 bp) sequences were combined into a single matrix 4,733 bp in length. Analyses of the combined data set show great improvements in computer run times compared to those of the separate data sets and of the data sets combined in pairs. Six searches of the 18S rDNA + rbcL + atpB data set were conducted; in all cases TBR branch swapping was completed, generally within a few days. In contrast, TBR branch swapping was not completed for any of the three separate data sets, or for the pairwise combined data sets. These results illustrate that it is possible to conduct a thorough search of tree space with large data sets, given sufficient signal. In this case, and probably most others, sufficient signal for a large number of taxa can only be obtained by combining data sets. The combined data sets also have higher internal support for clades than the separate data sets, and more clades receive bootstrap support of 50% in the combined analysis than in analyses of the separate data sets. These data suggest that one solution to the computational and analytical dilemmas posed by large data sets is the addition of nucleotides, as well as taxa.
Data from: Two Sample Test for Covariance Matrices in Ultra-High Dimension
tandf.figshare.com
txt
Updated Aug 11, 2025
Cite
Xiucai Ding; Yichen Hu; Zhenggang Wang (2025). Two Sample Test for Covariance Matrices in Ultra-High Dimension [Dataset]. http://doi.org/10.6084/m9.figshare.27609570.v2
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.27609570.v2
Dataset updated
Aug 11, 2025
Dataset provided by
Taylor & Francis
Authors
Xiucai Ding; Yichen Hu; Zhenggang Wang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In this article, we propose a new test for the equality of two population covariance matrices in the ultra-high dimensional setting, in which the dimension is much larger than the sizes of both samples. Our proposed methodology relies on a data splitting procedure and a comparison of a set of well-selected eigenvalues of the sample covariance matrices on the split datasets. Compared to the existing methods, our methodology is adaptive in the sense that (i) it does not require specific assumptions (e.g., comparable or balancing, etc.) on the sizes of the two samples; (ii) it does not need quantitative or structural assumptions on the population covariance matrices; (iii) it does not need the parametric distributions or detailed knowledge of the moments of the two populations. Theoretically, we establish the asymptotic distributions of the statistics used in our method and conduct the power analysis. We justify that our method is powerful under weak alternatives. We conduct extensive numerical simulations and show that our method significantly outperforms the existing ones both in terms of size and power. Analysis of two real datasets is also carried out to demonstrate the usefulness and superior performance of our proposed methodology. An R package UHDtst is developed for easy implementation of our proposed methodology. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
A Data Driven Network Approach to Rank Countries Production Diversity and...
plos.figshare.com
tiff
Updated Jun 2, 2023
Cite
Chengyi Tu; Joel Carr; Samir Suweis (2023). A Data Driven Network Approach to Rank Countries Production Diversity and Food Specialization [Dataset]. http://doi.org/10.1371/journal.pone.0165941
Explore at:
Available download formats: tiff
Unique identifier
https://doi.org/10.1371/journal.pone.0165941
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS, http://plos.org/
Authors
Chengyi Tu; Joel Carr; Samir Suweis
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Easy access to large data sets has allowed methodology from network physics and complexity science to be leveraged to disentangle patterns and processes directly from the data, leading to key insights into the behavior of systems. Here we use country-specific food production data to study binary and weighted topological properties of the bipartite country-food production matrix. This country-food production matrix can be: 1) transformed into overlap matrices which embed information regarding shared production of products among countries and/or shared countries for individual products; 2) used to identify subsets of countries which produce similar commodities, or subsets of commodities shared by a given country, allowing for visualization of correlations in large networks; and 3) used to rank country fitness (the ability to produce a diverse array of products weighted on the type of food commodities) and food specialization (quantified on the number of countries producing a specific food product weighted on their fitness). Our results show that, on average, countries with high fitness produce both low and high specialization food commodities, whereas nations with low fitness tend to produce a small basket of diverse food products, typically comprised of low specialization food commodities.
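A minimal sketch of the overlap matrices mentioned in step 1, assuming a binary country-by-product matrix. The toy matrix and all variable names are hypothetical, and the fitness and specialization rankings of step 3 are not reproduced here.

```python
import numpy as np

# Hypothetical binary matrix: M[c, p] = 1 if country c produces product p.
M = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])

# country_overlap[c, d]: number of products that countries c and d both produce.
country_overlap = M @ M.T
# product_overlap[p, q]: number of countries that produce both products p and q.
product_overlap = M.T @ M

# The diagonals recover simple per-country and per-product counts.
products_per_country = M.sum(axis=1)
countries_per_product = M.sum(axis=0)
```

For a weighted production matrix the same two products still apply, although the entries then mix volumes rather than count shared items.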
Data from: Reference transcriptomics of porcine peripheral immune cells...
catalog.data.gov
agdatacommons.nal.usda.gov
+2 more
Updated Jun 5, 2025
Cite
Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
Explore at:
Dataset updated
Jun 5, 2025
Dataset provided by
Agricultural Research Service
Description
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

Resources in this dataset:

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.
File Name: PBMC7_AllCells.zip
Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.
File Name: PBMC7_AllCells_meta.csv
Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:
  nCount_RNA = the number of transcripts detected in a cell
  nFeature_RNA = the number of genes detected in a cell
  Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
  prcntMito = percent mitochondrial reads in a cell
  Scrublet = doublet probability score assigned to a cell
  seurat_clusters = cluster ID assigned to a cell
  PaperIDs = sample ID for a cell
  celltypes = cell type ID assigned to a cell

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.
File Name: PBMC7_AllCells_PCAcoord.csv
Resource Description: .csv file containing first 100 PCA coordinates for cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.
File Name: PBMC7_AllCells_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for all cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.
File Name: PBMC7_AllCells_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for all cells.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.
File Name: PBMC7_CD4only_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.
File Name: PBMC7_CD4only_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.
File Name: PBMC7_GDonly_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.
File Name: PBMC7_GDonly_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.
File Name: UnfilteredGeneInfo.txt
Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.

Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.
File Name: PBMC7.tar
Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
Data from: Two dimensional fast Fourier transform for large data matrices
elsevier.digitalcommonsdata.com
search.datacite.org
Updated Jan 1, 1989
Cite
Mulugeta H. Serzu (1989). Two dimensional fast Fourier transform for large data matrices [Dataset]. http://doi.org/10.17632/wvcvxk7ykn.1
Explore at:
Unique identifier
https://doi.org/10.17632/wvcvxk7ykn.1
Dataset updated
Jan 1, 1989
Authors
Mulugeta H. Serzu
License
https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/
Description
Abstract: Most of the commonly used fast Fourier transform subroutines cannot handle large data matrices because of the restriction imposed by the system's core memory. In this paper we present a two-dimensional FFT program (SW2DFFT) and its long write-up. SW2DFFT is a Fortran program capable of handling large data matrices, both square and rectangular. The data matrix is stored externally in direct-access mass storage. The program uses a stepwise approach in computing the large matrices based on the...

Title of program: SW2DFFT
Catalogue Id: ABFB_v1_0

Nature of problem: Any problem that requires Fourier Transformation of a large 2-D data matrix.

Versions of this program held in the CPC repository in Mendeley Data: ABFB_v1_0; SW2DFFT; 10.1016/0010-4655(89)90108-2

This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
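The abstract is truncated before it describes the stepwise approach, so the following is only a generic illustration of how a 2-D FFT can be computed strip by strip with the standard row-column decomposition. It is written in Python with NumPy rather than Fortran, it is not the SW2DFFT algorithm, and the function name and block size are hypothetical.

```python
import numpy as np

def fft2_in_strips(data, block=64):
    """2-D FFT by row-column decomposition, transforming `block` rows or columns per step.

    `data` may be a numpy.memmap backed by a file on disk. For simplicity the
    output is held in memory here; a truly out-of-core version would make it a
    memmap as well.
    """
    rows, cols = data.shape
    out = np.empty((rows, cols), dtype=np.complex128)
    # Pass 1: 1-D FFT along each row, one strip of rows at a time.
    for r in range(0, rows, block):
        out[r:r + block, :] = np.fft.fft(data[r:r + block, :], axis=1)
    # Pass 2: 1-D FFT along each column, one strip of columns at a time.
    for c in range(0, cols, block):
        out[:, c:c + block] = np.fft.fft(out[:, c:c + block], axis=0)
    return out

# Small rectangular example; the result matches a direct 2-D transform.
rng = np.random.default_rng(1)
matrix = rng.normal(size=(100, 37))
spectrum = fft2_in_strips(matrix, block=16)
```

The two passes are independent 1-D transforms, which is what allows each strip to be loaded, transformed, and written back separately.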
Block-GP: Scalable Gaussian Process Regression for Multimodal Data - Dataset...
data.nasa.gov
Updated Mar 31, 2025
Cite
nasa.gov (2025). Block-GP: Scalable Gaussian Process Regression for Multimodal Data - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/block-gp-scalable-gaussian-process-regression-for-multimodal-data
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA, http://nasa.gov/
Description
Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finances. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture of experts model, classification and regression trees (CART), and others have been developed, motivated by the fact that a variability in the local distribution of inputs may be reflective of a significant change in the target variable. While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large scale data mining applications. In this paper we develop Block-GP, a Gaussian Process regression framework for multimodal data, that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and provides higher prediction accuracy than a single global model with very high confidence. The method relies on approximating the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, and can therefore be parallelized for faster execution. Theoretical analysis and empirical studies on various synthetic and real data sets show high accuracy and scalability of Block-GP compared to existing nonlinear regression techniques.
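The core idea of replacing one large covariance matrix with several smaller, independently modeled ones can be sketched as follows. This is a simplified illustration and not the Block-GP implementation: it assumes 1-D inputs, a fixed RBF kernel, partitions given by hand-picked boundaries, and that every partition queried at prediction time contains training data. All names and parameter values are hypothetical.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two 1-D input arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)

def fit_local_gps(x, y, edges, noise=1e-2):
    """Fit one GP per partition; `edges` are the interior partition boundaries."""
    labels = np.digitize(x, edges)
    models = {}
    for b in np.unique(labels):
        xb, yb = x[labels == b], y[labels == b]
        K = rbf(xb, xb) + noise * np.eye(len(xb))
        # Each solve involves only the points of one block.
        models[b] = (xb, np.linalg.solve(K, yb))
    return models

def predict(models, x_new, edges):
    """Posterior mean, using the local model of the partition each input falls in."""
    labels = np.digitize(x_new, edges)
    out = np.empty(len(x_new))
    for b in np.unique(labels):
        xb, alpha = models[b]
        mask = labels == b
        out[mask] = rbf(x_new[mask], xb) @ alpha
    return out

# Synthetic target with a jump at x = 5, for illustration only.
x = np.linspace(0.0, 10.0, 200)
y = np.where(x < 5.0, np.sin(x), 3.0 + np.sin(x))
edges = [5.0]
models = fit_local_gps(x, y, edges)
fitted = predict(models, x, edges)
jump = predict(models, np.array([4.9, 5.1]), edges)
```

Because the blocks never share a kernel matrix, the per-block fits could be run in parallel, and the jump at the boundary is not smoothed over as it would be by a single stationary model.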
Brain Connectivity Matrix Dataset
kaggle.com
zip
Updated Jul 24, 2022
Cite
Behzad Aslani Avilaq (2022). Brain Connectivity Matrix Dataset [Dataset]. https://www.kaggle.com/datasets/avilaqba/brain-connectivity-matrix-dataset/discussion
Explore at:
Available download formats: zip (73032239 bytes)
Dataset updated
Jul 24, 2022
Authors
Behzad Aslani Avilaq
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
In this dataset, as an input, you are given low-resolution (LR) encodings of brain connectivity in a symmetric connectivity matrix $\mathbf{X}_{LR} \in \mathbb{R}^{160 \times 160}$, where element $\mathbf{X}_{LR}(i, j)$ denotes the strength of the connectivity between two brain regions i and j. The goal in this dataset is to train a machine learning model that predicts the high-resolution (HR) connectivity matrix $\mathbf{X}_{HR} \in \mathbb{R}^{268 \times 268}$, given the LR connectivity matrix $\mathbf{X}_{LR}$ of the same brain, a task called brain graph super-resolution. By vectorizing the off-diagonal upper triangular part of $\mathbf{X}_{LR}$ and $\mathbf{X}_{HR}$, we generate feature vectors $x_{LR} \in \mathbb{R}^{1 \times 12720}$ and $x_{HR} \in \mathbb{R}^{1 \times 35778}$ representing the LR and HR connectivity features of a single sample. By stacking the sample vectors vertically across N = 189 subjects, we construct the LR data matrix $\mathbf{D}_{LR} \in \mathbb{R}^{N \times 12720}$ and the HR data matrix $\mathbf{D}_{HR} \in \mathbb{R}^{N \times 35778}$.
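The vectorization step can be sketched with NumPy as follows. The function names are hypothetical, the matrices are random stand-ins rather than the dataset's files, and only three subjects are stacked to keep the example small.

```python
import numpy as np

def vectorize_upper(conn):
    """Return the off-diagonal upper-triangular entries of a square matrix as a 1-D vector."""
    rows, cols = np.triu_indices(conn.shape[0], k=1)
    return conn[rows, cols]

def n_features(n_regions):
    """Number of off-diagonal upper-triangular entries of an n x n matrix."""
    return n_regions * (n_regions - 1) // 2

# Random symmetric stand-ins for three subjects' LR connectivity matrices.
rng = np.random.default_rng(0)
subjects = []
for _ in range(3):
    a = rng.normal(size=(160, 160))
    subjects.append((a + a.T) / 2)

# One row per subject, one column per region pair.
D_lr = np.vstack([vectorize_upper(s) for s in subjects])
```

The counts follow from n(n - 1)/2: 160 regions give 12720 features and 268 regions give 35778, matching the dimensions quoted in the description.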
Sparse Inverse Gaussian Process Regression with Application to Climate...
data.nasa.gov
Updated Mar 31, 2025
+ more versions
Cite
nasa.gov (2025). Sparse Inverse Gaussian Process Regression with Application to Climate Network Discovery - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/sparse-inverse-gaussian-process-regression-with-application-to-climate-network-discovery
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA, http://nasa.gov/
Description
Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finances. Gaussian Process regression is a popular technique for modeling the input-output relations of a set of variables under the assumption that the weight vector has a Gaussian prior. However, it is challenging to apply Gaussian Process regression to large data sets since prediction based on the learned model requires inversion of an order n kernel matrix. Approximate solutions for sparse Gaussian Processes have been proposed for sparse problems. However, in almost all cases, these solution techniques are agnostic to the input domain and do not preserve the similarity structure in the data. As a result, although these solutions sometimes provide excellent accuracy, the models do not have interpretability. Such interpretable sparsity patterns are very important for many applications. We propose a new technique for sparse Gaussian Process regression that allows us to compute a parsimonious model while preserving the interpretability of the sparsity structure in the data. We discuss how the inverse kernel matrix used in Gaussian Process prediction gives valuable domain information and then adapt the inverse covariance estimation from Gaussian graphical models to estimate the Gaussian kernel. We solve the optimization problem using the alternating direction method of multipliers that is amenable to parallel computation. We demonstrate the performance of our method in terms of accuracy, scalability and interpretability on a climate data set.
Orkut Social Network and Communities (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). Orkut Social Network and Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-orkut/discussion
Explore at:
Available download formats: zip (925908495 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Orkut social network and ground-truth communities

https://snap.stanford.edu/data/com-Orkut.html

Dataset information

Orkut (http://www.orkut.com/) is a free on-line social network where users form friendships with each other. Orkut also allows users to form a group which
other members can then join. We consider such user-defined groups as
ground-truth communities. We provide the Orkut friendship social network
and ground-truth communities. This data is provided by Alan Mislove et al. (http://socialnetworks.mpi-sws.org/data-imc2007.html)

We regard each connected component in a group as a separate ground-truth
community. We remove the ground-truth communities which have fewer than 3
nodes. We also provide the top 5,000 communities with highest quality
which are described in our paper (http://arxiv.org/abs/1205.6233). As for
the network, we provide the largest connected component.

Dataset statistics
Nodes 3,072,441
Edges 117,185,083
Nodes in largest WCC 3072441 (1.000)
Edges in largest WCC 117185083 (1.000)
Nodes in largest SCC 3072441 (1.000)
Edges in largest SCC 117185083 (1.000)
Average clustering coefficient 0.1666
Number of triangles 627584181
Fraction of closed triangles 0.01414
Diameter (longest shortest path) 9
90-percentile effective diameter 4.8

Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

Files
File Description
com-orkut.ungraph.txt.gz Undirected Orkut network
com-orkut.all.cmty.txt.gz Orkut communities
com-orkut.top5000.cmty.txt.gz Orkut communities (Top 5,000)

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The graph in the SNAP data set is 1-based, with nodes numbered 1 to
3,072,626.

In the SuiteSparse Matrix Collection, Problem.A is the undirected
Orkut network, a matrix of size n-by-n with n=3,072,441, which is
the number of unique user id's appearing in any edge.

Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).

C = Problem.aux.Communities_all is a sparse matrix of size n by 15,301,901 whose columns correspond to the same number of communities as in the com-orkut.all.cmty.txt file. The kth line in that file defines the kth community, and is the
column C(:,k), where C(i,k)=1 if person nodeid(i) is in the kth
community. Row C(i,:) and row/column i of the A matrix thus refer to the
same person, nodeid(i).
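The relationship between the original node ids, the compact row/column indices, and the two matrices can be sketched on a toy graph. The edge list and communities below are hypothetical stand-ins for the SNAP files, indices are 0-based here (the collection itself is 1-based), and dense arrays are used only because the example is tiny.

```python
import numpy as np

# Hypothetical edge list with non-contiguous 1-based ids.
edges = np.array([[1, 4], [4, 9], [1, 9], [9, 12]])
# Hypothetical communities, one list of member ids per community.
communities = [[1, 4, 9], [9, 12]]

# nodeid[i] is the original id of compact index i.
nodeid, compact = np.unique(edges, return_inverse=True)
compact = compact.reshape(edges.shape)
n = len(nodeid)

# A[i, j] = 1 if nodeid[i] and nodeid[j] are friends; mirrored because the graph is undirected.
A = np.zeros((n, n), dtype=np.int8)
A[compact[:, 0], compact[:, 1]] = 1
A[compact[:, 1], compact[:, 0]] = 1

# C[i, k] = 1 if nodeid[i] belongs to the kth community.
C = np.zeros((n, len(communities)), dtype=np.int8)
for k, members in enumerate(communities):
    C[np.searchsorted(nodeid, members), k] = 1
```

Row i of C and row/column i of A refer to the same person, which is the property the notes above describe. At the real scale both matrices would need a sparse format.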

Ctop = Problem.aux.Communities_to...
Data from: Block-GP: Scalable Gaussian Process Regression for Multimodal...
catalog.data.gov
Updated Apr 11, 2025
Cite
Dashlink (2025). Block-GP: Scalable Gaussian Process Regression for Multimodal Data [Dataset]. https://catalog.data.gov/dataset/block-gp-scalable-gaussian-process-regression-for-multimodal-data
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finances. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture of experts model, classification and regression trees (CART), and others have been developed, motivated by the fact that a variability in the local distribution of inputs may be reflective of a significant change in the target variable. While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large scale data mining applications. In this paper we develop Block-GP, a Gaussian Process regression framework for multimodal data, that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and provides higher prediction accuracy than a single global model with very high confidence. The method relies on approximating the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, and can therefore be parallelized for faster execution. Theoretical analysis and empirical studies on various synthetic and real data sets show high accuracy and scalability of Block-GP compared to existing nonlinear regression techniques.
Tensor decomposition-based unsupervised feature extraction applied to matrix...
plos.figshare.com
pdf
Updated Jun 2, 2023
Cite
Y-h. Taguchi (2023). Tensor decomposition-based unsupervised feature extraction applied to matrix products for multi-view data processing [Dataset]. http://doi.org/10.1371/journal.pone.0183933
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0183933
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS, http://plos.org/
Authors
Y-h. Taguchi
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the current era of big data, the amount of data available is continuously increasing. Both the number and types of samples, or features, are on the rise. The mixing of distinct features often makes interpretation more difficult. However, separate analysis of individual types requires subsequent integration. A tensor is a useful framework to deal with distinct types of features in an integrated manner without mixing them. On the other hand, tensor data is not easy to obtain since it requires the measurement of huge numbers of combinations of distinct features; if there are m kinds of features, each of which has N dimensions, the number of measurements needed is as many as N^m, which is often too large to measure. In this paper, I propose a new method where a tensor is generated from individual features without combinatorial measurements, and the generated tensor is decomposed back to matrices, by which unsupervised feature extraction is performed. In order to demonstrate the usefulness of the proposed strategy, it was applied to synthetic data, as well as three omics datasets. It outperformed other matrix-based methodologies.
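One way to read "a tensor generated from individual features" and "decomposed back to matrices" is sketched below. This is an assumption based on the title's mention of matrix products, not a confirmed description of the paper's procedure: two views that share the same samples are multiplied elementwise along the sample index to form a tensor, and summing over that index yields a matrix product that an ordinary SVD can decompose. The data and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5
X1 = rng.normal(size=(6, n_samples))   # hypothetical view 1: 6 features x 5 samples
X2 = rng.normal(size=(4, n_samples))   # hypothetical view 2: 4 features x 5 samples

# Tensor built from the two views without measuring every feature combination:
# T[i, k, j] = X1[i, j] * X2[k, j]
T = np.einsum("ij,kj->ikj", X1, X2)

# Summing over the shared sample index collapses the tensor to a matrix product.
P = T.sum(axis=2)

# SVD of the product gives paired singular vectors over the two feature sets.
U, s, Vt = np.linalg.svd(P, full_matrices=False)
```

Only 6 x 5 + 4 x 5 values are measured here, while the tensor has 6 x 4 x 5 entries, which is the saving the abstract alludes to.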
Data from: Why do phylogenomic data sets yield conflicting trees? Data type...
data-staging.niaid.nih.gov
dataone.org
+2 more
zip
Updated Mar 23, 2017
Cite
Sushma Reddy; Rebecca T. Kimball; Akanksha Pandey; Peter A. Hosner; Michael J. Braun; Shannon J. Hackett; Kin-Lan Han; John Harshman; Christopher J. Huddleston; Sarah Kingston; Ben D. Marks; Kathleen J. Miglia; William S. Moore; Frederick H. Sheldon; Christopher C. Witt; Tamaki Yuri; Edward L. Braun (2017). Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling [Dataset]. http://doi.org/10.5061/dryad.6536v
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.6536v
Dataset updated
Mar 23, 2017
Authors
Sushma Reddy; Rebecca T. Kimball; Akanksha Pandey; Peter A. Hosner; Michael J. Braun; Shannon J. Hackett; Kin-Lan Han; John Harshman; Christopher J. Huddleston; Sarah Kingston; Ben D. Marks; Kathleen J. Miglia; William S. Moore; Frederick H. Sheldon; Christopher C. Witt; Tamaki Yuri; Edward L. Braun
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Global, Worldwide
Description
Phylogenomics, the use of large-scale data matrices in phylogenetic analyses, has been viewed as the ultimate solution to the problem of resolving difficult nodes in the tree of life. However, it has become clear that analyses of these large genomic data sets can also result in conflicting estimates of phylogeny. Here, we use the early divergences in Neoaves, the largest clade of extant birds, as a “model system” to understand the basis for incongruence among phylogenomic trees. We were motivated by the observation that trees from two recent avian phylogenomic studies exhibit conflicts. Those studies used different strategies: 1) collecting many characters [42 mega base pairs (Mbp) of sequence data] from 48 birds, sometimes including only one taxon for each major clade; and 2) collecting fewer characters (0.4 Mbp) from 198 birds, selected to subdivide long branches. However, the studies also used different data types: the taxon-poor data matrix comprised 68% non-coding sequences whereas coding exons dominated the taxon-rich data matrix. This difference raises the question of whether the primary reason for incongruence is the number of sites, the number of taxa, or the data type. To test among these alternative hypotheses we assembled a novel, large-scale data matrix comprising 90% non-coding sequences from 235 bird species. Although increased taxon sampling appeared to have a positive impact on phylogenetic analyses, the most important variable was data type. Indeed, by analyzing different subsets of the taxa in our data matrix we found that increased taxon sampling actually resulted in increased congruence with the tree from the previous taxon-poor study (which had a majority of non-coding data) instead of the taxon-rich study (which largely used coding data).
We suggest that the observed differences in the estimates of topology for these studies reflect data-type effects due to violations of the models used in phylogenetic analyses, some of which may be difficult to detect. If incongruence among trees estimated using phylogenomic methods largely reflects problems with model fit, developing more “biologically-realistic” models is likely to be critical for efforts to reconstruct the tree of life.
Sparse-Matrix Compression Engine Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Sparse-Matrix Compression Engine Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/sparse-matrix-compression-engine-market
Explore at:
Available download formats: pdf, pptx, csv
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Sparse-Matrix Compression Engine Market Outlook

According to our latest research, the global Sparse-Matrix Compression Engine market size reached USD 1.42 billion in 2024, reflecting robust adoption across high-performance computing and advanced analytics sectors. The market is poised for substantial expansion, with a projected CAGR of 15.8% during the forecast period. By 2033, the market is forecasted to achieve a value of USD 5.18 billion, driven by escalating data complexity, the proliferation of machine learning applications, and the imperative for efficient storage and computational solutions. The surge in demand for real-time analytics and the growing penetration of artificial intelligence across industries are primary factors fueling this remarkable growth trajectory.
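
The relationship between the quoted market sizes and the quoted growth rate can be checked with the standard compound-growth formula. The sketch below is only an illustration: the function names are hypothetical, and it assumes nine compounding periods with 2024 as the base year, which the report summary does not state explicitly. Rounding or a different base-year convention can make the quoted rate and the rate implied by the endpoints differ slightly.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` for `years` periods."""
    return start_value * (1.0 + rate) ** years


# Figures quoted in the report summary (USD billions).
size_2024 = 1.42
size_2033 = 5.18
quoted_cagr = 0.158

# Assumption: nine compounding periods, 2024 (base year) through 2033.
years = 2033 - 2024

print(f"implied CAGR: {implied_cagr(size_2024, size_2033, years):.2%}")
print(f"2033 value at quoted CAGR: {project(size_2024, quoted_cagr, years):.2f}")
```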

One of the key growth drivers for the Sparse-Matrix Compression Engine market is the exponential increase in data generation and the corresponding need for efficient data processing and storage. As organizations in sectors such as scientific computing, finance, and healthcare grapple with large-scale, high-dimensional datasets, the requirement for optimized storage solutions becomes paramount. Sparse-matrix compression engines enable significant reduction in data redundancy, leading to lower storage costs and faster data retrieval. This efficiency is particularly crucial in high-performance computing environments where memory bandwidth and storage limitations can hinder computational throughput. The adoption of these engines is further propelled by advancements in hardware accelerators and software algorithms that enhance compression ratios without compromising data integrity.

Another significant factor contributing to market growth is the rising adoption of machine learning and artificial intelligence across diverse industry verticals. Modern AI and ML algorithms often operate on sparse datasets, especially in areas such as natural language processing, recommendation systems, and scientific simulations. Sparse-matrix compression engines play a pivotal role in minimizing memory footprint and optimizing computational resources, thereby accelerating model training and inference. The integration of these engines into cloud-based and on-premises solutions allows enterprises to scale their AI workloads efficiently, driving widespread deployment in both research and commercial applications. Additionally, the ongoing evolution of lossless and lossy compression techniques is expanding the applicability of these engines to new and emerging use cases.

The market is also benefiting from the increasing emphasis on cost optimization and energy efficiency in data centers and enterprise IT infrastructure. As organizations strive to reduce operational expenses and carbon footprints, the adoption of compression technologies that minimize data movement and storage requirements becomes a strategic imperative. Sparse-matrix compression engines facilitate this by enabling higher data throughput and lower energy consumption, making them attractive for deployment in large-scale analytics, telecommunications, and industrial automation. Furthermore, the growing ecosystem of service providers and solution integrators is making these technologies more accessible to small and medium enterprises, contributing to broader market penetration.

The development of High-Speed Hardware Compression Chip technology is revolutionizing the Sparse-Matrix Compression Engine market. These chips are designed to accelerate data compression processes, significantly enhancing the performance of high-performance computing systems. By integrating these chips, organizations can achieve faster data processing speeds, which is crucial for handling large-scale datasets in real-time analytics and AI applications. The chips offer a unique advantage by reducing latency and improving throughput, making them an essential component in modern data centers. As the demand for efficient data management solutions grows, the adoption of high-speed hardware compression chips is expected to rise, driving further innovation and competitiveness in the market.

From a regional perspective, North America continues to dominate the Sparse-Matrix Compression Engine market, accounting for the largest revenue share in 2024 owing to the presence of leading technology companies, advanced research institutions, and
Data from: Adaptive genomic signatures of globally invasive populations of...
zenodo.org
zip
Updated Apr 5, 2025
Cite
Alejandro Nabor Lozada-Chávez; Irma Lozada-Chávez; Niccolò Alfano; Umberto Palatini; Davide Sogliani; Samia Elfekih; Teshome Degefa; Maria V. Sharakhova; Athanase Badolo; Sriwichai Patchara; Mauricio Casas-Martinez; Bianca C. Carlos; Rebeca Carballar-Lejarazú; Louis Lambrechts; Jayme A. Souza-Neto; Mariangela Bonizzoni (2025). Adaptive genomic signatures of globally invasive populations of the yellow fever mosquito Aedes aegypti [Dataset]. http://doi.org/10.5281/zenodo.14948092
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14948092
Dataset updated
Apr 5, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alejandro Nabor Lozada-Chávez; Irma Lozada-Chávez; Niccolò Alfano; Umberto Palatini; Davide Sogliani; Samia Elfekih; Teshome Degefa; Maria V. Sharakhova; Athanase Badolo; Sriwichai Patchara; Mauricio Casas-Martinez; Bianca C. Carlos; Rebeca Carballar-Lejarazú; Louis Lambrechts; Jayme A. Souza-Neto; Mariangela Bonizzoni
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Nov 29, 2024
Description
* These authors contributed equally: Alejandro N. Lozada-Chávez, Irma Lozada-Chávez.

Supplementary Dataset

This repository contains the Supplementary Data (from 1 to 12) cited in our paper "Adaptive genomic signatures of globally invasive populations of the yellow fever mosquito Aedes aegypti" in Nature Ecology and Evolution: https://doi.org/10.1038/s41559-025-02643-5

These datasets are available in the section "Supplementary Information" of our paper, except for SD-9, which is omitted there because of its large size (~3Gb after decompression). Here you can find the complete set of datasets in a single ZIP file:

41559_2025_2643_MOESM5_ESM_Supplementary_Data.zip

LIST OF DATASETS:

1) Supplementary Data 1. SNP statistics for populations through genomic regions (TXT).
2) Supplementary Data 2. Sequences of new detected nrEVEs (FASTA).
3) Supplementary Data 3. Phylogenetic trees for populations and individuals (NEWICK).
4) Supplementary Data 4. Information for 8,120 hard selective sweeps detected with RAiSD in out-of-Africa populations (TXT).
5) Supplementary Data 5. Information for 1,030 SNP outliers detected with PCAdapt within 2,266 genes (VCF format).
6) Supplementary Data 6. Matrix with DoS scores for 11,651 orthologous protein-coding genes in AaegL5 and each Ae. aegypti population (TXT).
7) Supplementary Data 7. Matrix with MKT scores for 11,651 orthologous protein-coding genes in AaegL5 and each Ae. aegypti population (TXT).
8) Supplementary Data 8. Matrix with DoS scores used to estimate relaxed selection (TXT).
9) Supplementary Data 9. Matrix with SNPs and genomic coordinates within adaptive protein-coding genes and ncRNAs that are shared or private for out-of-Africa populations against African populations (TXT).
10) Supplementary Data 10. Matrix with 483 nonsynonymous SNPs and their allele frequencies for our 40 populations Florida and Colombia (TXT).
11) Supplementary Data 11. Genomic coordinates of SNPs in AaegL5 obtained from the literature and VectorBase (TXT).
12) Supplementary Data 12. Source data of metrics used to plot Figure 4b (TXT).

UPDATES NOTE:

Repository version 3. Final version of datasets for the accepted manuscript.

Repository version 2. Incomplete datasets: files are preliminary versions and their content may vary. The SD-10 is not present (matrix with 483 SNPs) was added. The SD-7 is a broken file (cannot be opened).

Repository version 1. Incomplete datasets: files are preliminary versions and their content may vary. Two final SD files are not present.

CITATION OF THIS REPOSITORY:

Lozada-Chávez, A. N., Lozada-Chávez, I., Alfano, N., Palatini, U., Sogliani, D., Elfekih, S., Degefa, T., Sharakhova, M. V., Badolo, A., Patchara, S., Casas-Martinez, M., Carlos, B. C., Carballar-Lejarazú, R., Lambrechts, L., Souza-Neto, J. A., & Bonizzoni, M. (2024). Adaptive genomic signatures of globally invasive populations of the yellow fever mosquito Aedes aegypti [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14948092
122 CAIDA Autonomous systems Graphs (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). 122 CAIDA Autonomous systems Graphs (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-as-caida
Explore at:
Available download formats: zip (40197223 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/ . Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots.

Dataset statistics are calculated for the graph with the highest number of
nodes, the dataset from November 5 2007.

Nodes 26475
Edges 106762
Nodes in largest WCC 26475 (1.000)
Edges in largest WCC 106762 (1.000)
Nodes in largest SCC 26475 (1.000)
Edges in largest SCC 106762 (1.000)
Average clustering coefficient 0.2082
Number of triangles 36365
Fraction of closed triangles 0.007319
Diameter (longest shortest path) 17
90-percentile effective diameter 4.6

Source (citation)

J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time: Densification
Laws, Shrinking Diameters and Possible Explanations. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (KDD), 2005.

Files
File Description
as-caida20071105.txt.gz CAIDA AS graph from November 5 2007
as-caida.tar.gz 122 CAIDA AS graphs from January 2004 to November 2007

NOTE for UF Sparse Matrix Collection: these graphs are weighted. In the
original SNAP data set, the edge weights are in the set {-1, 0, 1, 2}. Note
that "0" is an edge weight. This can be handled in the UF collection for the
primary sparse matrix in a Problem, but not when the matrices are in a sequence in the Problem.aux MATLAB struct. The entries with zero edge weight would
become lost. To correct for this, the weights are modified by adding 2 to each weight. This preserves the structure of the original graphs, so that edges
with weight zero are not lost. (A non-edge is not the same as an edge with
weight zero in this problem).

old weight   new weight
    -1           1
     0           2
     1           3
     2           4

So to obtain the original weights, subtract 2 from each entry.
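
The reason for the offset can be sketched in a few lines. The example below is an assumption-laden illustration, not the collection's own code: a plain dictionary stands in for a sparse store that keeps only nonzero values, the three edges are made up, and the helper names are hypothetical.

```python
# Edges as (row, col, weight); weights are in {-1, 0, 1, 2} as in the SNAP data.
# These three edges are made-up illustration values, not taken from the dataset.
edges = [(0, 1, -1), (1, 2, 0), (2, 0, 2)]

SHIFT = 2  # offset described in the note above


def to_sparse(entries):
    """Mimic a sparse store that keeps only nonzero values."""
    return {(i, j): w for i, j, w in entries if w != 0}


def shift_weights(entries, offset):
    """Add `offset` to every edge weight."""
    return [(i, j, w + offset) for i, j, w in entries]


naive = to_sparse(edges)                          # the weight-0 edge disappears
shifted = to_sparse(shift_weights(edges, SHIFT))  # every edge survives

# Recover the original weights by subtracting the offset again.
recovered = {key: w - SHIFT for key, w in shifted.items()}

print(len(naive), len(shifted))
print(recovered[(1, 2)])
```

Because the smallest original weight is -1, adding 2 makes every stored value strictly positive, so no edge can collide with the implicit zero of a non-edge.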

The primary sparse matrix for this problem is the as-caida20071105 matrix, or
Problem.aux.G{121}, the second-to-the-last graph in the sequence.

The nodes are uniform across all graphs in the sequence in the UF collection.
That is, nodes do not come and go. A node that is "gone" simply has no edges. This is to allow comparisons across each node in the graphs.
Problem.aux.nodenames gives the node numbers of the original problem. So
row/column i in the matrix is always node number Problem.aux.nodenames(i) in
all the graphs.

Problem.aux.G{k} is the kth graph in the sequence.
Problem.aux.Gname(k,:) is the name of the kth graph.

Some Machine Learning Matrices

kaggle.com

zip

Updated Apr 21, 2022

Cite

Giovanni Manzini (2022). Some Machine Learning Matrices [Dataset]. https://www.kaggle.com/datasets/giovannimanzini/some-machine-learning-matrices

Explore at:

Available download formats: zip (5483965630 bytes)

Dataset updated

Apr 21, 2022

Authors

Giovanni Manzini

Description

Context

To test linear algebra algorithms for Machine Learning problems one needs ready-to-use large matrices coming from real world problems. Unfortunately, it is surprisingly difficult to find such matrices and they often need some cleaning and some preprocessing to have all matrices in the same format.

To avoid duplicate efforts, this repository collects matrices which are already available on the net and provides them in a single place and in a single format (csv).

Banner image by Mick Haupt on Unsplash

Content

All matrices are in textual csv format, one line per row. No comments, headers, or other info. Here are some statistics:

Name	rows	columns	nonzero	distinct nonzero	`gzip`	`xz`
Susy	5,000,000	18	98.82%	20,352,142	53.27%	43.94%
Higgs	11,000,000	28	92.11%	8,083,943	48.38%	31.47%
Airline78	14,462,943	29	72.66%	7,794	13.27%	7.01%
Covtype	581,012	54	22.00%	6,682	6.25%	3.34%
Census	2,458,285	68	43.03%	45	5.54%	2.79%
Optical	325,834	174	97.50%	897,176	53.54%	27.13%
Mnist2m	2,000,000	784	25.25%	255	6.46%	4.25%
ImageNet	1,262,102	900	30.99%	824	5.52%	3.63%

gzip and xz columns report the compressed size relative to the dense representation taking 8 x rows x columns bytes.
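
As a rough guide to reading the table, the dense baseline and the compressed sizes implied by the percentages can be computed as follows. This is a minimal sketch: the function names are hypothetical, and it assumes the percentages apply directly to the 8 x rows x columns figure described above.

```python
BYTES_PER_ENTRY = 8  # bytes per matrix entry in the dense representation


def dense_bytes(rows: int, columns: int) -> int:
    """Size of the dense representation in bytes."""
    return BYTES_PER_ENTRY * rows * columns


def compressed_bytes(rows: int, columns: int, ratio_percent: float) -> float:
    """Approximate compressed size implied by a percentage from the table."""
    return dense_bytes(rows, columns) * ratio_percent / 100.0


# Susy row from the table: 5,000,000 rows, 18 columns, gzip 53.27%, xz 43.94%.
rows, columns = 5_000_000, 18
print(dense_bytes(rows, columns))               # 720000000
print(round(compressed_bytes(rows, columns, 53.27)))
print(round(compressed_bytes(rows, columns, 43.94)))
```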

Acknowledgements

Many thanks to the maintainers of the UCI machine learning repository from which we got the data for Susy, Higgs, Covtype, Census, Optical and Mnist2m, and to the American Statistical Association that provided the data on Airline on Time Performance from which we derived Airline78. The ImageNet matrix was provided by the authors of the paper Approximate kernel k-means: solution to large scale kernel clustering.

StAnD (Large problems)
kaggle.com
zip
Updated Nov 8, 2025
Cite
Zuru Tech (2025). StAnD (Large problems) [Dataset]. https://www.kaggle.com/zurutech/stand-large-problems
Explore at:
Available download formats: zip (65760152712 bytes)
Dataset updated
Nov 8, 2025
Authors
Zuru Tech
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Static Analysis Dataset (StAnD) is a large collection of solved linear static analysis problems on frame structures. This dataset contains the subset of StAnD composed of large problems.

Context

Many algorithms for solving sparse linear systems are published at a great pace, but currently there is no standard dataset to compare their running time on real problems. To the best of our knowledge, there is no existing large dataset of sparse linear systems. A few sparse matrices (often fewer than 10) are published for many engineering problems in the Matrix Market or in the SuiteSparse matrix collection, but their limited number is not sufficient to measure the running time in the average case or to measure reliably how the resolution algorithm scales with the size and the number of non-zeros in the sparse matrix. Additionally, no constant term is provided in conjunction with the matrices. The constant term may not be fundamental when direct methods are used, but it becomes important if we want to measure the effectiveness of iterative methods, whose behavior depends on the relationship between the initialization of the solution and the real solution.

Content

StAnD is composed of 303,000 static analysis problems of frame structures divided into 6 parts: 100,000 small training problems, 100,000 medium training problems, 100,000 large training problems, 1,000 small test problems, 1,000 medium test problems and 1,000 large test problems. The size of a problem is determined by the number of degrees of freedom (DOFs) of the structure model or, equivalently, by the number of rows or columns of the stiffness matrix associated with the structure. Small problems have 2115 DOFs on average (and at most 5166 DOFs), medium problems have about 7000 DOFs (and at most 14718 DOFs), while large problems can have up to 31770 DOFs with about 15500 DOFs on average.

The dataset is programmatically created using OpenSeesPy, the Python interface to the OpenSees finite element solver. For each structural model with its loading configuration, a static analysis is performed in order to compute nodal displacements. Therefore, every problem in the dataset is a tuple (K, f, u), where K is the sparse (symmetric positive-definite) stiffness matrix associated to the structure, f is a load vector and u is the ground-truth displacement vector such that K · u = f. In the training set, for the same structure (i.e. the same stiffness matrix K), we apply different load configurations. In the test set, we use a single load configuration for every structure to maximize the variability.

Every solved system is saved in a separate .npz file. The name of the file follows the scheme {seed}_{K hash}_{u f hash}.npz. Problems in files with the same seed and the same K hash come from the same structure. The sparse stiffness matrix is saved in COO format.
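
The naming scheme makes it possible to group files by structure without opening them. The sketch below is one possible way to do that; the helper names and the example file names are hypothetical, and it assumes none of the three fields contains an underscore.

```python
from collections import defaultdict
from pathlib import Path


def parse_name(filename: str):
    """Split '{seed}_{K hash}_{u f hash}.npz' into its three fields.

    Assumes none of the fields contains an underscore.
    """
    seed, k_hash, uf_hash = Path(filename).stem.split("_")
    return seed, k_hash, uf_hash


def group_by_structure(filenames):
    """Map (seed, K hash) to the files that share that stiffness matrix."""
    groups = defaultdict(list)
    for name in filenames:
        seed, k_hash, _ = parse_name(name)
        groups[(seed, k_hash)].append(name)
    return dict(groups)


# Hypothetical file names, made up for illustration.
names = ["7_ab12_ff01.npz", "7_ab12_ff02.npz", "9_cd34_0e0e.npz"]
groups = group_by_structure(names)
print(groups[("7", "ab12")])
```

Grouping this way keeps all load configurations of one structure together, which matters when splitting data so that the same stiffness matrix does not appear on both sides of a split.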

Let n be the number of rows (and columns) of K and nnz the number of non-null elements of K. The keys of the .npz file are then:
- A_indices: (2, nnz)-shaped array of row and column coordinates of the non-null elements of K;
- A_values: nnz-shaped array of coefficients of the non-null elements of K;
- x: n-shaped array containing the linear system solution u;
- b: n-shaped array containing the linear system constant term f.
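
A quick consistency check on one problem is to multiply the COO matrix by x and compare the result with b. The sketch below uses plain Python lists in place of the arrays stored under the keys above (in practice they would be read from the .npz file, for example with numpy.load); the function names and the tiny 2-by-2 system are made up for illustration and do not come from StAnD.

```python
def coo_matvec(a_indices, a_values, x, n):
    """Compute K @ x for K stored in COO form.

    a_indices: pair (rows, cols), each of length nnz
    a_values:  nnz coefficients
    """
    rows, cols = a_indices
    y = [0.0] * n
    for i, j, v in zip(rows, cols, a_values):
        y[i] += v * x[j]
    return y


def max_residual(a_indices, a_values, x, b):
    """Largest absolute entry of K @ x - b."""
    kx = coo_matvec(a_indices, a_values, x, len(b))
    return max(abs(p - q) for p, q in zip(kx, b))


# Tiny made-up symmetric positive-definite system (not from StAnD):
# K = [[4, 1], [1, 3]], u = [1, 2], so f = K u = [6, 7].
A_indices = ([0, 0, 1, 1], [0, 1, 0, 1])
A_values = [4.0, 1.0, 1.0, 3.0]
x = [1.0, 2.0]
b = [6.0, 7.0]

print(max_residual(A_indices, A_values, x, b))
```

The accumulation `y[i] += v * x[j]` also handles COO data that lists the same coordinate more than once, since repeated entries simply add up.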

Inspiration

We introduce StAnD to ease the comparison and the evaluation of new sparse linear system solvers and to spur research in resolution methods specifically tailored to structural engineering and static analysis problems.
YouTube Social Network with Communities (SNAP)
kaggle.com
zip
Updated Dec 16, 2021
Cite
Subhajit Sahu (2021). YouTube Social Network with Communities (SNAP) [Dataset]. https://www.kaggle.com/datasets/wolfram77/graphs-snap-com-youtube
Explore at:
Available download formats: zip (13777811 bytes)
Dataset updated
Dec 16, 2021
Authors
Subhajit Sahu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
Youtube social network and ground-truth communities

https://snap.stanford.edu/data/com-Youtube.html

Dataset information

Youtube (http://www.youtube.com/) is a video-sharing web site that includes a social network. In the Youtube social network, users form friendships with each other and can create groups which other users can join. We consider
such user-defined groups as ground-truth communities. This data is provided by Alan Mislove et al.
(http://socialnetworks.mpi-sws.org/data-imc2007.html)

We regard each connected component in a group as a separate ground-truth
community. We remove the ground-truth communities which have fewer than 3
nodes. We also provide the top 5,000 communities with highest quality
which are described in our paper (http://arxiv.org/abs/1205.6233). As for
the network, we provide the largest connected component.

Network statistics
Nodes 1,134,890
Edges 2,987,624
Nodes in largest WCC 1134890 (1.000)
Edges in largest WCC 2987624 (1.000)
Nodes in largest SCC 1134890 (1.000)
Edges in largest SCC 2987624 (1.000)
Average clustering coefficient 0.0808
Number of triangles 3056386
Fraction of closed triangles 0.002081
Diameter (longest shortest path) 20
90-percentile effective diameter 6.5
Community statistics
Number of communities 8,385
Average community size 13.50
Average membership size 0.10

Source (citation)
J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. ICDM, 2012. http://arxiv.org/abs/1205.6233

Files
File Description
com-youtube.ungraph.txt.gz Undirected Youtube network
com-youtube.all.cmty.txt.gz Youtube communities
com-youtube.top5000.cmty.txt.gz Youtube communities (Top 5,000)

Notes on inclusion into the SuiteSparse Matrix Collection, July 2018:

The graph in the SNAP data set is 1-based, with nodes numbered 1 to
1,157,827.

In the SuiteSparse Matrix Collection, Problem.A is the undirected Youtube
network, a matrix of size n-by-n with n=1,134,890, which is the number of
unique user id's appearing in any edge.

Problem.aux.nodeid is a list of the node id's that appear in the SNAP data set. A(i,j)=1 if person nodeid(i) is friends with person nodeid(j). The
node id's are the same as the SNAP data set (1-based).
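
The relationship between the original node ids and the matrix rows can be sketched as follows. This is an illustration under stated assumptions: the helper names and the three edges are made up, and it assumes nodeid lists the ids in ascending order, which the notes above do not state.

```python
def build_nodeid(edges):
    """Sorted list of the unique node ids that appear in any edge."""
    return sorted({node for edge in edges for node in edge})


def remap_edges(edges, nodeid):
    """Translate original ids to compact 1-based row/column indices."""
    index_of = {node: i + 1 for i, node in enumerate(nodeid)}
    return [(index_of[u], index_of[v]) for u, v in edges]


# Made-up edges with gaps in the id range (ids 3, 4 and 6 never appear).
edges = [(1, 2), (2, 5), (5, 7)]
nodeid = build_nodeid(edges)
compact = remap_edges(edges, nodeid)

print(nodeid)
print(compact)
```

Compacting the ids this way is why the matrix dimension n can be smaller than the largest id in the SNAP file: ids that never appear in an edge get no row or column.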

C = Problem.aux.Communities_all is a sparse matrix of size n by 16,386
which represents the communities in the com-youtube.all.cmty.txt file.
The kth line in that file defines the kth community, and is the column
C(:,k), where C(i,k)=1 if person ...
