70 datasets found
  1. Data from: A change-point–based control chart for detecting sparse mean...

    • tandf.figshare.com
    txt
    Updated Jan 17, 2024
    Cite
    Zezhong Wang; Inez Maria Zwetsloot (2024). A change-point–based control chart for detecting sparse mean changes in high-dimensional heteroscedastic data [Dataset]. http://doi.org/10.6084/m9.figshare.24441804.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zezhong Wang; Inez Maria Zwetsloot
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Because of the “curse of dimensionality,” high-dimensional processes present challenges to traditional multivariate statistical process monitoring (SPM) techniques. In addition, the unknown underlying distribution of, and complicated dependencies (such as heteroscedasticity) among, the variables increase the uncertainty of estimated parameters and decrease the effectiveness of control charts. Moreover, the requirement of sufficient reference samples limits the application of traditional charts in high-dimension, low-sample-size scenarios (small n, large p). More difficulties appear when detecting and diagnosing abnormal behaviors caused by a small set of variables (i.e., sparse changes). In this article, we propose two change-point–based control charts to detect sparse shifts in the mean vector of high-dimensional heteroscedastic processes. Our proposed methods can start monitoring when the number of observations is much smaller than the dimensionality. The simulation results show that the proposed methods are robust to nonnormality and heteroscedasticity. Two real data examples are used to illustrate the effectiveness of the proposed control charts in high-dimensional applications. The R codes are provided online.

  2. National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020...

    • researchdata.edu.au
    Updated Aug 5, 2021
    + more versions
    Cite
    Australian Government Department of Climate Change, Energy, the Environment and Water (2021). National Forest and Sparse Woody Vegetation Data (Version 5.0 - 2020 Release) [Dataset]. https://researchdata.edu.au/national-forest-sparse-2020-release/2989276
    Explore at:
    Dataset updated
    Aug 5, 2021
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Australian Government Department of Climate Change, Energy, the Environment and Water
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    Landsat satellite imagery is used to derive woody vegetation extent products that discriminate between forest, sparse woody and non-woody land cover across a time series from 1988 to 2020. A forest is defined as woody vegetation with a minimum 20 per cent canopy cover, at least 2 metres high and a minimum area of 0.2 hectares. Sparse woody is defined as woody vegetation with a canopy cover between 5 and 19 per cent.

    The three-class classification (forest, sparse woody and non-woody) supersedes the two-class classification (forest and non-forest) from 2016. The new classification is produced using the same time series processing approach (conditional probability networks) as the two-class method to detect woody vegetation cover. The three-class algorithm better encompasses the different types of woody vegetation across the Australian landscape.

    Earlier versions of this dataset were published by the Department of the Environment and Energy.

  3. Data from: Sparse Biclustering of Transposable Data

    • tandf.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Kean Ming Tan; Daniela M. Witten (2023). Sparse Biclustering of Transposable Data [Dataset]. http://doi.org/10.6084/m9.figshare.1209699.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Kean Ming Tan; Daniela M. Witten
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log-likelihood. We apply an ℓ1 penalty to the means of the biclusters to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression dataset. This article has supplementary material online.
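
    A rough illustration of the idea in Python (not the authors' algorithm or the supplementary code): the sketch below clusters rows and columns with plain k-means and then soft-thresholds the bicluster means, which plays the role of the ℓ1 penalty described above; the cluster counts and penalty value are arbitrary choices.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def soft_threshold(x, lam):
        """l1 proximal step: shrink values toward zero by lam."""
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    # Toy transposable data matrix with one planted mean-shifted bicluster.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 40))
    X[:20, :15] += 2.0

    # Cluster rows and columns separately (the "symmetrized k-means" view).
    row_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    col_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X.T)

    # Bicluster means, then an l1 penalty via soft-thresholding for sparsity.
    means = np.zeros((3, 3))
    for r in range(3):
        for c in range(3):
            block = X[np.ix_(row_labels == r, col_labels == c)]
            if block.size:
                means[r, c] = block.mean()

    sparse_means = soft_threshold(means, lam=0.5)
    print(np.round(sparse_means, 2))   # most bicluster means shrink to exactly 0
    ```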

  4. CTF4Science: Kuramoto-Sivashinsky Official DS

    • kaggle.com
    zip
    Updated May 14, 2025
    Cite
    AI Institute in Dynamic Systems (2025). CTF4Science: Kuramoto-Sivashinsky Official DS [Dataset]. https://www.kaggle.com/datasets/dynamics-ai/ctf4science-kuramoto-sivashinsky-official-ds
    Explore at:
    Available download formats: zip (991463847 bytes)
    Dataset updated
    May 14, 2025
    Dataset authored and provided by
    AI Institute in Dynamic Systems
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Kuramoto-Sivashinsky (KS) Dataset - CTF4Science

    Dataset Description

    This dataset contains numerical simulations of the Kuramoto-Sivashinsky (KS) equation, a fourth-order nonlinear partial differential equation (PDE) that exhibits spatio-temporal chaos. The KS equation is a canonical example used in scientific machine learning to benchmark data-driven algorithms for dynamical systems modeling, forecasting, and reconstruction.

    The Kuramoto-Sivashinsky Equation

    The KS equation is defined as:

    u_t + uu_x + u_xx + μu_xxxx = 0
    

    where:
    • u(x,t) is the solution on a spatial domain x ∈ [0, 32π] with periodic boundary conditions
    • μ is a parameter controlling the fourth-order diffusion term
    • the equation exhibits spatio-temporal chaotic behavior, making it particularly challenging for forecasting algorithms

    Dataset Purpose

    This dataset is part of the Common Task Framework (CTF) for Science, designed to provide standardized, rigorous benchmarks for evaluating machine learning algorithms on scientific problems. The CTF addresses key challenges in scientific ML including:

    • Short-term forecasting (weather forecast): Predicting near-future states with trajectory accuracy
    • Long-term forecasting (climate forecast): Capturing statistical properties of long-time dynamics
    • Noisy data reconstruction: Denoising and forecasting from corrupted measurements
    • Limited data scenarios: Learning from sparse observations
    • Parametric generalization: Interpolation and extrapolation to new parameter regimes

    Key Dataset Characteristics

    • System Type: Spatio-temporal PDE (1D spatial + time)
    • Spatial Dimension: 1024 grid points across domain [0, 32π]
    • Time Step: Δt = 0.025
    • Behavior: Chaotic spatio-temporal dynamics
    • Data Format: Available in both MATLAB (.mat) and CSV formats
    • Evaluation Metrics:
      • Short-term: Root Mean Square Error (RMSE)
      • Long-term: Power Spectral Density matching with k=20, modes=100

    Evaluation Tasks

    The dataset supports 12 evaluation metrics (E1-E12) organized into 4 main task categories:

    Test 1: Forecasting (E1, E2)

    • Input: X1train (10000 × 1024)
    • Task: Forecast future 1000 timesteps
    • Metrics:
      • E1: Short-term RMSE on first k timesteps
      • E2: Long-term spectral matching on power spectral density

    Test 2: Noisy Data (E3, E4, E5, E6)

    • Medium Noise (E3, E4): Train on X2train, reconstruct and forecast
    • High Noise (E5, E6): Train on X3train, reconstruct and forecast
    • Metrics: Reconstruction accuracy (RMSE) + Long-term forecasting (spectral)

    Test 3: Limited Data (E7, E8, E9, E10)

    • Noise-Free Limited (E7, E8): 100 snapshots in X4train
    • Noisy Limited (E9, E10): 100 snapshots in X5train
    • Metrics: Short and long-term forecasting from sparse data

    Test 4: Parametric Generalization (E11, E12)

    • Input: Three training trajectories (X6, X7, X8) at different parameter values
    • Task: Interpolate (E11) and extrapolate (E12) to new parameters
    • Burn-in: X9train and X10train provide initialization
    • Metrics: Short-term RMSE on parameter generalization

    Usage Notes

    1. Hidden Test Sets: The actual test data (X1test through X9test) are hidden and used only for evaluation on the CTF leaderboard
    2. Baseline Scores: Use constant zero prediction as the baseline reference (E_i = 0)
    3. Score Range: All scores are clipped to [-100, 100], where 100 represents perfect prediction
    4. Data Continuity: Start indices in YAML indicate temporal relationship between train/test splits
    5. Chaotic Dynamics: Long-term exact trajectory matching is impossible due to Lyapunov divergence; hence spectral metrics for climate forecasting
    6. File Formats: Choose .mat for MATLAB/Python (scipy) workflows or .csv for language-agnostic access
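
    As a minimal sketch of how these pieces fit together (the file name, variable key, and metric details below are assumptions for illustration; the official CTF4Science evaluation code defines the leaderboard scores), one can load a training matrix with SciPy, produce a trivial persistence forecast, and compute an E1-style short-term RMSE plus an E2-style power-spectral-density comparison:

    ```python
    import numpy as np
    from scipy.io import loadmat

    # Assumed file/variable names; inspect loadmat(...).keys() for the real ones.
    mat = loadmat("X1train.mat", squeeze_me=True)
    key = [k for k in mat if not k.startswith("__")][0]
    X = np.asarray(mat[key], dtype=float)        # expected ~ (10000, 1024): time x space

    # Hold out the last 1000 snapshots to stand in for a hidden test continuation.
    horizon, k_short = 1000, 20
    train, truth = X[:-horizon], X[-horizon:]

    # Trivial persistence "forecast": repeat the last training snapshot.
    forecast = np.tile(train[-1], (horizon, 1))

    # E1-style short-term score: RMSE over the first k timesteps.
    rmse_short = np.sqrt(np.mean((forecast[:k_short] - truth[:k_short]) ** 2))

    # E2-style long-term idea: compare time-averaged spatial power spectra.
    def mean_psd(field, modes=100):
        spec = np.abs(np.fft.rfft(field, axis=1)) ** 2
        return spec.mean(axis=0)[:modes]

    psd_error = np.linalg.norm(mean_psd(forecast) - mean_psd(truth))
    print(f"short-term RMSE: {rmse_short:.3f}, PSD mismatch: {psd_error:.3f}")
    ```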
  5. Sparse Partial Least Squares in Time Series for Macroeconomic Forecasting...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Julieta Fuentes (2025). Sparse Partial Least Squares in Time Series for Macroeconomic Forecasting (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zcGFyc2UtcGFydGlhbC1sZWFzdC1zcXVhcmVzLWluLXRpbWUtc2VyaWVzLWZvci1tYWNyb2Vjb25vbWljLWZvcmVjYXN0aW5n
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    Julieta Fuentes
    Description

    Factor models have been applied extensively for forecasting when high-dimensional datasets are available. In this case, the number of variables can be very large. For instance, usual dynamic factor models in central banks handle over 100 variables. However, there is a growing body of literature indicating that more variables do not necessarily lead to estimated factors with lower uncertainty or better forecasting results. This paper investigates the usefulness of partial least squares techniques that take into account the variable to be forecast when reducing the dimension of the problem from a large number of variables to a smaller number of factors. We propose different approaches to dynamic sparse partial least squares as a means of improving forecast efficiency by simultaneously taking into account the variable to be forecast while forming an informative subset of predictors, instead of using all the available ones to extract the factors. We use the well-known Stock and Watson database to check the forecasting performance of our approach. The proposed dynamic sparse models show good performance in improving efficiency compared to widely used factor methods in macroeconomic forecasting.

  6. Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix...

    • zenodo.org
    zip
    Updated Apr 17, 2023
    + more versions
    Cite
    James D. Trotter (2023). Performance measurements for "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" [Dataset]. http://doi.org/10.5281/zenodo.7821491
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    James D. Trotter
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The paper "Bringing Order to Sparsity: A Sparse Matrix Reordering Study on Multicore CPUs" compares various strategies for reordering sparse matrices. The purpose of reordering is to improve performance of sparse matrix operations, for example, by reducing fill-in resulting from sparse Cholesky factorisation or improving data locality in sparse matrix-vector multiplication (SpMV). Many reordering strategies have been proposed in the literature and the current paper provides a thorough comparison of several of the most popular methods.

    This comparison is based on performance measurements that were collected on the eX3 cluster, a Norwegian, experimental research infrastructure for exploration of exascale computing. These performance measurements are gathered in the data set provided here, particularly related to the performance of two SpMV kernels with respect to 490 sparse matrices, 6 matrix orderings and 8 multicore CPUs.

    Experimental results are provided in a human-readable, tabular format using plain-text ASCII. This format may be readily consumed by gnuplot to create plots or imported into commonly used spreadsheet tools for further analysis.

    Performance measurements are provided based on an SpMV kernel using the compressed sparse row (CSR) storage format with 7 matrix orderings. One file is provided for each of 8 multicore CPU systems considered in the paper:

    1. Skylake: csr_all_xeongold16q_032_threads_ss490.txt
    2. Ice Lake: csr_all_habanaq_072_threads_ss490.txt
    3. Naples: csr_all_defq_064_threads_ss490.txt
    4. Rome: csr_all_rome16q_016_threads_ss490.txt
    5. Milan A: csr_all_fpgaq_048_threads_ss490.txt
    6. Milan B: csr_all_milanq_128_threads_ss490.txt
    7. TX2: csr_all_armq_064_threads_ss490.txt
    8. Hi1620: csr_all_huaq_128_threads_ss490.txt

    A corresponding set of files and performance measurements are provided for a second SpMV kernel that is also studied in the paper.

    Each file consists of 490 rows and 54 columns. Each row corresponds to a different matrix from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The first 5 columns specify some general information about the matrix, such as its group and name, as well as the number of rows, columns and nonzeros. Column 6 specifies the number of threads used for the experiment (which depends on the CPU). The remaining columns are grouped according to the 7 different matrix orderings that were studied, in the following order: original, Reverse Cuthill-McKee (RCM), Nested Dissection (ND), Approximate Minimum Degree (AMD), Graph Partitioning (GP), Hypergraph Partitioning (HP), and Gray ordering. For each ordering, the following 7 columns are given:


    1. Minimum number of nonzeros processed by any thread by the SpMV kernel
    2. Maximum number of nonzeros processed by any thread by the SpMV kernel
    3. Mean number of nonzeros processed per thread by the SpMV kernel
    4. Imbalance factor, which is the ratio of the maximum to the mean number of nonzeros processed per thread by the SpMV kernel
    5. Time (in seconds) to perform a single SpMV iteration; this was measured by taking the minimum out of 100 SpMV iterations performed
    6. Maximum performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the minimum time out of 100 SpMV iterations performed.
    7. Mean performance (in Gflop/s) for a single SpMV iteration; this was measured by taking twice the number of matrix nonzeros and dividing by the mean time of the 97 last SpMV iterations performed (i.e., the first 3 SpMV iterations are ignored).

    The results in Fig. 1 of the paper show speedup (or slowdown) resulting from reordering with respect to 3 reorderings and 3 selected matrices. These results can be reproduced by inspecting the performance results that were collected on the Milan B and Ice Lake systems for the three matrices Freescale/Freescale2, SNAP/com-Amazon and GenBank/kmer_V1r. Specifically, the numbers displayed in the figure are obtained by dividing the maximum performance measured for the respective orderings (i.e., RCM, ND and GP) by the maximum performance measured for the original ordering.
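
    As an illustration of consuming these tables outside gnuplot, the sketch below reads one of the CSR files with pandas and recomputes a Fig. 1-style speedup, i.e. the maximum Gflop/s of a reordering divided by that of the original ordering. The assumed column positions (5 metadata columns plus one thread-count column, then seven 7-column ordering groups with the maximum Gflop/s as the sixth entry of each group) follow the layout described above and should be checked against the actual files.

    ```python
    import pandas as pd

    # Whitespace-delimited plain-text table, one row per SuiteSparse matrix.
    df = pd.read_csv("csr_all_milanq_128_threads_ss490.txt",
                     sep=r"\s+", header=None, comment="#")

    orderings = ["original", "RCM", "ND", "AMD", "GP", "HP", "Gray"]
    META_COLS, GROUP_WIDTH, MAX_GFLOPS_OFFSET = 6, 7, 5   # assumed layout (0-indexed)

    def max_gflops(row, ordering):
        i = orderings.index(ordering)
        return row[META_COLS + i * GROUP_WIDTH + MAX_GFLOPS_OFFSET]

    # Fig. 1-style speedup of RCM, ND and GP over the original ordering for one
    # matrix (the matrix name is assumed to sit in the second metadata column).
    row = df[df[1] == "Freescale2"].iloc[0]
    for ordering in ["RCM", "ND", "GP"]:
        speedup = max_gflops(row, ordering) / max_gflops(row, "original")
        print(f"{ordering}: {speedup:.2f}x")
    ```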

    The results presented in Figs. 2 and 3 of the paper show the speedup of SpMV as a result of reordering for the two SpMV kernels considered in the paper. In this case, gnuplot scripts are provided to reproduce the figures from the data files described above.

  7. BUV/Nimbus-4 Level 3 Ozone Zonal Means V005 (BUVN4L3ZMT) at GES DISC -...

    • data.nasa.gov
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). BUV/Nimbus-4 Level 3 Ozone Zonal Means V005 (BUVN4L3ZMT) at GES DISC - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/buv-nimbus-4-level-3-ozone-zonal-means-v005-buvn4l3zmt-at-ges-disc
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Nimbus-4 BUV Level 3 Ozone Zonal Means collection or ZMT contains total ozone, reflectivities, and ozone mixing ratios averaged in 10 degree latitude zones centered from 80 to -80 degrees. Mixing ratios are given at 19 levels: 0.3, 0.4, 0.5, 0.7, 1, 1.5, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 70 and 100 mbar. In addition to the means, files also include the standard deviation, minimum and maximum values, as well as sample size. The data were originally created on IBM 360 machines and archived on magnetic tapes. The data have been restored from the tapes and are now archived on disk in their original IBM binary file format. Each file contains monthly, weekly and daily zonal means, as well as quarterly means if it is the last month of the quarter. The files consist of data records each with one-hundred-eighty 4-byte words. Monthly, weekly, daily and quarterly means are distinguished by the seventh 4-byte word in the records. A typical file is about 380 kB in size. The BUV instrument was operational from April 10, 1970 until May 6, 1977. In July 1972 the Nimbus-4 solar power array partially failed such that BUV operations were curtailed. Thus data collected in the later years was increasingly sparse, particularly in the equatorial region. This product was previously available from the NSSDC as the Zonal Means File (ZMT) with the identifier ESAC-00039 (old ID 70-025A-05O).

  8. COBE-SST2 Sea Surface Temperature and Ice

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Oct 19, 2024
    Cite
    (Custodian) (2024). COBE-SST2 Sea Surface Temperature and Ice [Dataset]. https://catalog.data.gov/dataset/cobe-sst2-sea-surface-temperature-and-ice1
    Explore at:
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    (Custodian)
    Description

    A new sea surface temperature (SST) analysis on a centennial time scale is presented. The dataset starts in 1850 with monthly 1x1 means and is periodically updated. In this analysis, a daily SST field is constructed as a sum of a trend, interannual variations, and daily changes, using in situ SST and sea ice concentration observations. All SST values are accompanied with theory-based analysis errors as a measure of reliability. An improved equation is introduced to represent the ice-SST relationship, which is used to produce SST data from observed sea ice concentrations. Prior to the analysis, biases of individual SST measurement types are estimated for a homogenized long-term time series of global mean SST. Because metadata necessary for the bias correction are unavailable for many historical observational reports, the biases are determined so as to ensure consistency among existing SST and nighttime air temperature observations. The global mean SSTs with bias-corrected observations are in agreement with those of a previously published study, which adopted a different approach. Satellite observations are newly introduced for the purpose of reconstruction of SST variability over data-sparse regions. Moreover, uncertainty in areal means of the present and previous SST analyses is investigated using the theoretical analysis errors and estimated sampling errors. The result confirms the advantages of the present analysis, and it is helpful in understanding the reliability of SST for a specific area and time period.

  9. HOMAGE Monthly Time series of global average steric height anomalies and...

    • podaac.jpl.nasa.gov
    • s.cnmilf.com
    • +4more
    html
    Updated May 26, 2022
    Cite
    PO.DAAC (2022). HOMAGE Monthly Time series of global average steric height anomalies and ocean heat content estimates from gridded in-situ ocean observations version 01 [Dataset]. http://doi.org/10.5067/HMSSO-4TJ01
    Explore at:
    Available download formats: html
    Dataset updated
    May 26, 2022
    Dataset provided by
    PO.DAAC
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 1978 - Present
    Variables measured
    SEA LEVEL
    Description

    The [HOMAGE_STERIC_OHC_TIME_SERIES_v01] dataset contains monthly global mean ocean heat content (OHC) anomalies as well as thermosteric, halosteric and total steric sea level anomalies computed from various gridded ocean data sets of subsurface temperature and salinity profiles as provided by different institutions: Scripps Institution of Oceanography (SIO); Institute of Atmospheric Physics (IAP); Barnes objective analysis (BOA from CSIO, MNR); Jamstec / Ishii et al. 2017 (I17); and Met Office Hadley Centre: EN4_c13, EN4_c14, EN4_g10, and EN4_I09. The data are averaged over the quasi-global ocean domain (i.e., where valid values are defined; note that gaps exist, in particular towards polar latitudes), at monthly intervals. The input profiling data (i.e., temperature and salinity profiles at depth levels), editing, quality flags and processing schemes vary across the different gridded products; please refer to the documentation for each institution’s data product for details. Since 2005, the profiling data are dominated by the observations from the global Argo network (e.g., https://argo.ucsd.edu/), which comprises nearly 4000 active floats (as of 08/2022). Before 2005, non-Argo data such as XBT profilers were used, and the global ocean coverage was significantly sparser. Data sets from SIO and BOA are Argo-only, while the others also include other observations, such as expendable bathythermographs (XBTs) and Conductivity-Temperature-Depth (CTD) observations. The data are active forward stream data files and will be frequently updated as new observations are acquired by Argo and processed by the data centers.

  10. Encoded shortest path sequences for NYC taxi trip

    • kaggle.com
    zip
    Updated Sep 8, 2017
    Cite
    Lem (2017). Encoded shortest path sequences for NYC taxi trip [Dataset]. https://www.kaggle.com/tongjiyiming/encoded-shortest-path-sequences-for-nyc-taxi-trip
    Explore at:
    Available download formats: zip (140239784 bytes)
    Dataset updated
    Sep 8, 2017
    Authors
    Lem
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Area covered
    New York
    Description

    Get a close approximation of the real trip trace

    The NYC taxi trip data contain only start and end coordinates, which makes it hard to explore how road conditions vary along a route. This dataset uses OSM road data and breaks the network into small directed segments. Each segment runs from one intersection (node) to an adjacent intersection (node) and carries a direction, so a two-way road yields two segments and a one-way road yields one.

    What you get

    SciPy's .npz format

    141505 columns: each column encodes one small segment. Its value is an indicator: 1 means the taxi would travel through that segment, 0 means it would not. The result is a very sparse matrix.
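
    A minimal loading sketch, assuming the matrix was written with scipy.sparse.save_npz (the file name is a placeholder for whichever .npz is in the download; if the archive instead holds the raw data/indices/indptr arrays from np.savez, it would need to be reassembled with scipy.sparse.csr_matrix):

    ```python
    import numpy as np
    import scipy.sparse as sp

    # Placeholder file name; use the actual .npz from the dataset download.
    trips = sp.load_npz("encoded_shortest_paths.npz").tocsr()

    n_trips, n_segments = trips.shape          # columns = directed road segments
    density = trips.nnz / (n_trips * n_segments)
    print(f"{n_trips} trips x {n_segments} segments, density {density:.5f}")

    # How often each segment is traversed across all trips (column sums),
    # e.g. to find heavily used segments such as motorways.
    segment_counts = np.asarray(trips.sum(axis=0)).ravel()
    top = np.argsort(segment_counts)[::-1][:10]
    print("busiest segment columns:", top)
    ```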

    Some insights

    This is inspired by ECML/PKDD 15: Taxi Trajectory Prediction. With a more accurate trajectory for each trip, we create a feature space in which information can be shared across trips. If we only have start and end points, the similarity of two trips depends entirely on a clustering of those points, which may or may not approximate similarity well (and also depends heavily on how many clusters you define). With path sequences, however, we can see that two quite different trips share common and important stretches of road, such as motorways, which is closer to real life. More importantly, we can then learn the condition of those road segments from many different trips, given a suitable machine learning algorithm. As with the winners of ECML/PKDD 15, this dataset allows deep learning to be applied.

    The original road data come from OSM. The osmnx and networkx libraries are used to store the road graph. Speed limit data come primarily from NYC's DOT. A Java shortest-path library developed by Arizona State University is used to compute shortest paths with Dijkstra's algorithm, called from Python via Pyjnius, with some multithreaded code in both Python and Java to speed up the whole run.

    The initial idea was actually to get the top-K paths, which would give probabilistic information about the routes a taxi driver might take, but running Yen's top-K algorithm proved too slow.

    Time-dependent linkage might also help, but linkage between different segments is not considered, since I have no idea how to map that information to a useful feature space.

    Notice that this dataset uses exactly the same information as New York City Taxi with OSRM. The difference is that that dataset only gives the name of a road, whereas this dataset encodes each small segment. The total time from that dataset has also proved useful; unfortunately, my code did not record trip times. We will see if anyone asks.

    So, have fun with this dataset, Kagglers!


  11. Additional file 1 of Sparse feature selection for classification and...

    • datasetcatalog.nlm.nih.gov
    Updated Mar 28, 2017
    Cite
    Misganaw, Burook; Lea, Jayanthi; McCourt, Carolyn; Backes, Floor; Mutch, David; Singh, Nitin; Boren, Todd; White, Michael; Vidyasagar, Mathukumalli; Ahsen, Mehmet Eren; Miller, David; Moore, Kathleen (2017). Additional file 1 of Sparse feature selection for classification and prediction of metastasis in endometrial cancer [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001752691
    Explore at:
    Dataset updated
    Mar 28, 2017
    Authors
    Misganaw, Burook; Lea, Jayanthi; McCourt, Carolyn; Backes, Floor; Mutch, David; Singh, Nitin; Boren, Todd; White, Michael; Vidyasagar, Mathukumalli; Ahsen, Mehmet Eren; Miller, David; Moore, Kathleen
    Description

    List and description of supplemental tables.
    Table S1. This table contains the measurements of 1428 micro-RNAs for 94 samples. The rows correspond to the features (miRNA) and the columns correspond to the samples. The samples consist of 47 lymph node-positive and 47 lymph node-negative samples. 43.75% of the entries in this sheet are NaN. It contains measurements for 213 miRNAs of 86 samples. Out of those 86 samples, 43 are lymph node-positive, and the remaining 43 are lymph node-negative. A sample whose label has the term IB or IC belongs to a lymph node-negative patient, whereas a sample with a label containing IIIC belongs to a lymph node-positive patient. Lymph node-positive or negative status was defined empirically during primary staging.
    Table S2. This table contains a subset of the raw data, used for training the classifier. This data was obtained by removing four patients from each class, and 1,215 features. It contains measurements for 213 miRNAs of 86 samples. Out of those 86 samples, 43 are lymph node-positive, and the remaining 43 are lymph node-negative.
    Table S3. This table contains the normalized version of the training data. The following procedure is used for normalization: 1) From each entry of the i-th row vector (i-th feature vector), we subtract the mean value m_i of the i-th row vector computed over all the 86 samples. 2) We multiply each entry of the i-th row vector by a scale factor s_i so that the resulting vector has Euclidean norm equal to the square root of 86.
    Table S4. The lone star algorithm selected 18 final features. This sheet contains the 20 best classifiers based on these eighteen features, sorted with respect to accuracy. The sensitivity, specificity and accuracy figures (columns T, U and V) are based on the classification of the 86 samples in the training data by the corresponding classifier.
    Table S5. This table shows the classifier obtained by taking the average of the classifiers in Sheet 4. In particular, we average the numbers in each column of the 20 classifiers given in Sheet 4 (20 best classifiers) (Columns A-S).
    Table S6. This sheet contains clinical information about the independent cohort of 28 patients who were used to validate the classifier. Out of these, 9 are lymph node-positive and 19 are lymph node-negative.
    Table S7. This sheet contains the raw microRNA measurements on the 28 test data samples.
    Table S8. This is the transformed version of the test data. We apply the same transformation as we did for the training data, as described on Sheet 3. For each of the 18 features (miRNAs), we subtract the original mean value m_i from each entry and multiply each entry by the constant s_i. The calculation of m_i and s_i is as in Additional file 1, Table S3.
    Table S9. This sheet contains the discriminant values of the classifier on the test data. In column D an entry of 1 means that the sample is correctly classified.
    Table S10. This sheet contains the number of overlaps between our 23-gene signature and the pathways in the KEGG database. The q-value is obtained from the Fisher exact test after the Benjamini-Hochberg multiple testing correction and quantifies the statistical significance of the overlap between the gene list and a set of genes in a particular pathway. (1170 KB XLSX)
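
    The normalization described for Table S3, and reapplied to the test data in Table S8, can be written compactly; the sketch below assumes a features-by-samples matrix shaped like the 213 × 86 training array and uses random stand-in data.

    ```python
    import numpy as np

    def normalize_rows(X):
        """Center each feature (row) and scale it to Euclidean norm sqrt(n_samples),
        as described for Table S3. Returns the normalized matrix plus the per-row
        mean m_i and scale s_i needed to transform new (test) samples the same way."""
        n_samples = X.shape[1]
        m = X.mean(axis=1, keepdims=True)                  # m_i, mean over the samples
        centered = X - m
        norms = np.linalg.norm(centered, axis=1, keepdims=True)
        s = np.sqrt(n_samples) / norms                     # s_i, so each row has norm sqrt(86)
        return centered * s, m, s

    rng = np.random.default_rng(0)
    train = rng.normal(size=(213, 86))                     # stand-in for the Table S2 matrix
    train_norm, m, s = normalize_rows(train)

    # Apply the same m_i and s_i to held-out test samples (Table S8 procedure).
    test = rng.normal(size=(213, 28))
    test_norm = (test - m) * s
    print(np.round(np.linalg.norm(train_norm, axis=1)[:3], 3))   # each ~ sqrt(86) ≈ 9.274
    ```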

  12. JanataHack: Machine Learning for IoT Dataset

    • kaggle.com
    zip
    Updated May 23, 2020
    Cite
    Shobhit Upadhyaya (2020). JanataHack: Machine Learning for IoT Dataset [Dataset]. https://www.kaggle.com/shobhitupadhyaya/janatahack-machine-learning-for-iot-dataset
    Explore at:
    Available download formats: zip (373167 bytes)
    Dataset updated
    May 23, 2020
    Authors
    Shobhit Upadhyaya
    Description

    Problem Statement

    You are working with the government to transform your city into a smart city. The vision is to convert it into a digital and intelligent city to improve the efficiency of services for the citizens. One of the problems faced by the government is traffic. You are a data scientist working to manage the traffic of the city better and to provide input on infrastructure planning for the future.

    The government wants to implement a robust traffic system for the city by being prepared for traffic peaks. They want to understand the traffic patterns of the four junctions of the city. Traffic patterns on holidays, as well as on various other occasions during the year, differ from normal working days. This is important to take into account for your forecasting.

    Data Dictionary

    Variable | Description
    ID | Unique ID
    DateTime | Hourly Datetime Variable
    Junction | Junction Type
    Vehicles | Number of Vehicles (Target)

    sample_submission.csv

    Variable | Description
    ID | Unique ID
    Vehicles | Number of Vehicles (Target)

    Your task

    To predict traffic patterns in each of these four junctions for the next 4 months.

    The sensors at each of these junctions were collecting data at different times, so you will see traffic data from different time periods. To add to the complexity, some of the junctions have provided only limited or sparse data, which requires care when creating future projections. Based on 20 months of historical data, the government is looking to you to deliver accurate traffic projections for the coming four months. Your algorithm will become the foundation of a larger transformation to make your city smart and intelligent.

    Evaluation Metric

    The evaluation metric for this competition is Root Mean Squared Error (RMSE)

  13. Data from: Data-driven analysis of oscillations in Hall thruster simulations...

    • portaldelainvestigacion.uma.es
    Updated 2022
    Cite
    Davide Maddaloni; Adrián Domínguez Vázquez; Filippo Terragni; Mario Merino (2022). Data from: Data-driven analysis of oscillations in Hall thruster simulations & Data-driven sparse modeling of oscillations in plasma space propulsion [Dataset]. https://portaldelainvestigacion.uma.es/documentos/67a9c7ce19544708f8c73129
    Explore at:
    Dataset updated
    2022
    Authors
    Davide Maddaloni; Adrián Domínguez Vázquez; Filippo Terragni; Mario Merino
    Description

    Data from: Data-driven analysis of oscillations in Hall thruster simulations

    • Authors: Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino

    • Contact email: dmaddalo@ing.uc3m.es

    • Date: 2022-03-24

    • Keywords: higher order dynamic mode decomposition, hall effect thruster, breathing mode, ion transit time, data-driven analysis

    • Version: 1.0.4

    • Digital Object Identifier (DOI): 10.5281/zenodo.6359505

    • License: This dataset is made available under the Open Data Commons Attribution License

    Abstract

    This dataset contains the outputs of the HODMD algorithm and the original simulations used in the journal publication:

    Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino, "Data-driven analysis of oscillations in Hall thruster simulations", 2022 Plasma Sources Sci. Technol. 31:045026. Doi: 10.1088/1361-6595/ac6444.

    Additionally, the raw simulation data is also employed in the following journal publication:

    Borja Bayón-Buján and Mario Merino, "Data-driven sparse modeling of oscillations in plasma space propulsion", 2024 Mach. Learn.: Sci. Technol. 5:035057. Doi: 10.1088/2632-2153/ad6d29

    Dataset description

    The simulations from which the data stem were produced using the full 2D hybrid PIC/fluid code HYPHEN, while the HODMD results were produced using an adaptation of the original HODMD algorithm with an improved amplitude calculation routine.

    Please refer to the corresponding article for further details regarding any of the parameters and/or configurations.

    Data files

    The data files are in standard Matlab .mat format. A recent version of Matlab is recommended.

    The HODMD outputs are collected in 18 different files, subdivided into three groups, each referring to a different case. In the file names, "case1" refers to the nominal case, "case2" to the low-voltage case and "case3" to the high-mass-flow-rate case. The variables are named as follows:

    "n" for plasma density

    "Te" for electron temperature

    "phi" for plasma potential

    "ji" for ion current density (both single and double charged ones)

    "nn" for neutral density

    "Ez" for axial electric field

    "Si" for ionization production term

    "vi1" for single charged ions axial velocity

    In particular, axial electric field, ionization production term and single charged ions axial velocity are available only for the first case. Such files have a cell structure: the first row contains the frequencies (in Hz), the second row contains the normalized modes (alongside their complex conjugates), the third row collects the growth rates (in 1/s) while the amplitudes (dimensionalized) are collected within the last row. Additionally, the time vector is simply given as "t", common to all cases and all variables.

    The raw simulation data are collected within additional 15 variables, following the same nomenclature as above, with the addition of the suffix "_raw" to differentiate them from the HODMD outputs.
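
    A minimal sketch of reading these files from Python with SciPy; the file name and variable names below are purely illustrative, so the keys of each .mat file should be listed before relying on them.

    ```python
    from scipy.io import loadmat

    # File and variable names are illustrative only; list the actual keys first.
    data = loadmat("case1_n.mat", squeeze_me=True)
    print([k for k in data if not k.startswith("__")])

    # Per the description, each HODMD output is a cell array whose rows hold
    # frequencies (Hz), normalized modes (plus complex conjugates), growth rates
    # (1/s) and dimensional amplitudes; MATLAB cells arrive as numpy object arrays.
    cell = data["n"]                       # hypothetical variable name
    freqs, modes, growth_rates, amplitudes = (cell[i] for i in range(4))

    t = data.get("t")                      # common time vector, if stored alongside
    n_raw = data.get("n_raw")              # raw simulation counterpart ("_raw" suffix)
    ```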

    Citation

    Works using this dataset or any part of it in any form shall cite it as follows.

    The preferred means of citation is to reference the publication associated with this dataset, as soon as it is available.

    Optionally, the dataset may be cited directly by referencing the DOI: 10.5281/zenodo.6359505.

    Acknowledgments

    This work has been supported by the Madrid Government (Comunidad de Madrid) under the Multiannual Agreement with UC3M in the line of ‘Fostering Young Doctors Research’ (MARETERRA-CM-UC3M), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation). F. Terragni was also supported by the Fondo Europeo de Desarrollo Regional, Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación, under grants MTM2017-84446-C2-2-R and PID2020-112796RB-C22.

  14. Data from: A dynamically consistent gridded data set of the global,...

    • doi.pangaea.de
    html, tsv
    Updated May 14, 2018
    Cite
    Charlotte Breitkreuz; André Paul; Takasumi Kurahashi-Nakamura; Martin Losch; Michael Schulz (2018). A dynamically consistent gridded data set of the global, monthly-mean oxygen isotope ratio of seawater, link to NetCDF files [Dataset]. http://doi.org/10.1594/PANGAEA.889922
    Explore at:
    Available download formats: tsv, html
    Dataset updated
    May 14, 2018
    Dataset provided by
    PANGAEA
    Authors
    Charlotte Breitkreuz; André Paul; Takasumi Kurahashi-Nakamura; Martin Losch; Michael Schulz
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Variables measured
    File name, File size, File format, Uniform resource locator/link to file
    Description

    We present a dynamically consistent gridded data set of the global, monthly-mean oxygen isotope ratio of seawater (δ¹⁸Osw). The data set is created from an optimized simulation of an ocean general circulation model constrained by global monthly δ¹⁸Osw data collected from 1950 until 2011 and climatological salinity and temperature data collected from 1951 to 1980. The optimization was obtained using the adjoint method for variational data assimilation, which yields a simulation that is consistent with the observational data and the physical laws incorporated in the model. Our data set performs as well as a previous data set in terms of model-data misfit and brings an improvement in terms of physical consistency and a seasonal cycle. The data assimilation method shows high potential for interpolating sparse data sets in a physically meaningful way. Comparatively large errors, however, are found in our data set in the surface levels in the Arctic Ocean, mainly because there is no influence of isotopically highly depleted precipitation on the ocean in areas with sea ice, and because of the low model resolution. […]

  15. Data_Sheet_1_A Practical Guide to Sparse k-Means Clustering for Studying...

    • frontiersin.figshare.com
    pdf
    Updated Jun 4, 2023
    Cite
    Justin L. Balsor; Keon Arbabi; Desmond Singh; Rachel Kwan; Jonathan Zaslavsky; Ewalina Jeyanesan; Kathryn M. Murphy (2023). Data_Sheet_1_A Practical Guide to Sparse k-Means Clustering for Studying Molecular Development of the Human Brain.pdf [Dataset]. http://doi.org/10.3389/fnins.2021.668293.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Frontiers
    Authors
    Justin L. Balsor; Keon Arbabi; Desmond Singh; Rachel Kwan; Jonathan Zaslavsky; Ewalina Jeyanesan; Kathryn M. Murphy
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Studying the molecular development of the human brain presents unique challenges for selecting a data analysis approach. The rare and valuable nature of human postmortem brain tissue, especially for developmental studies, means the sample sizes are small (n), but the use of high throughput genomic and proteomic methods measure the expression levels for hundreds or thousands of variables [e.g., genes or proteins (p)] for each sample. This leads to a data structure that is high dimensional (p ≫ n) and introduces the curse of dimensionality, which poses a challenge for traditional statistical approaches. In contrast, high dimensional analyses, especially cluster analyses developed for sparse data, have worked well for analyzing genomic datasets where p ≫ n. Here we explore applying a lasso-based clustering method developed for high dimensional genomic data with small sample sizes. Using protein and gene data from the developing human visual cortex, we compared clustering methods. We identified an application of sparse k-means clustering [robust sparse k-means clustering (RSKC)] that partitioned samples into age-related clusters that reflect lifespan stages from birth to aging. RSKC adaptively selects a subset of the genes or proteins contributing to partitioning samples into age-related clusters that progress across the lifespan. This approach addresses a problem in current studies that could not identify multiple postnatal clusters. Moreover, clusters encompassed a range of ages like a series of overlapping waves illustrating that chronological- and brain-age have a complex relationship. In addition, a recently developed workflow to create plasticity phenotypes (Balsor et al., 2020) was applied to the clusters and revealed neurobiologically relevant features that identified how the human visual cortex changes across the lifespan. These methods can help address the growing demand for multimodal integration, from molecular machinery to brain imaging signals, to understand the human brain’s development.

  16. Sparse change‐point VAR models (replication data)

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Arnaud Dufays (2025). Sparse change‐point VAR models (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9zcGFyc2UtY2hhbmdlcG9pbnQtdmFyLW1vZGVscw==
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    Arnaud Dufays
    Description

    Change-point (CP) VAR models face a dimensionality curse due to the proliferation of parameters that arises when new breaks are detected. We introduce the Sparse CP-VAR model, which determines which parameters truly vary when a break is detected. By doing so, the number of new parameters to be estimated at each regime is drastically reduced and the break dynamics become easier to interpret. The Sparse CP-VAR model disentangles the dynamics of the mean parameters and the covariance matrix. The former uses CP dynamics with shrinkage prior distributions, while the latter is driven by an infinite hidden Markov framework. An extensive simulation study is carried out to compare our approach with existing ones. We provide applications to financial and macroeconomic systems. It turns out that many off-diagonal VAR parameters are zero for the entire sample period and that most break activity is in the covariance matrix. We show that this has important consequences for portfolio optimization, in particular when future instabilities are included in the predictive densities. Forecasting-wise, the Sparse CP-VAR model compares favorably to several time-varying parameter models in terms of density and point forecast metrics.

  17. Datasets Used to Create Generalized Potentiometric Maps of the Fort Union,...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 25, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Datasets Used to Create Generalized Potentiometric Maps of the Fort Union, Hell Creek, and Fox Hills Aquifers within the Standing Rock Indian Reservation [Dataset]. https://catalog.data.gov/dataset/datasets-used-to-create-generalized-potentiometric-maps-of-the-fort-union-hell-creek-and-f
    Explore at:
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Standing Rock Indian Reservation
    Description

    This data release includes text files of well data and shapefiles of potentiometric contours of the Fort Union, Hell Creek, and Fox Hills aquifers within the Standing Rock Indian Reservation. The data accompany a USGS scientific investigations map from Anderson and Lundgren (2024). The Standing Rock Sioux Tribe (the Tribe) and the U.S. Geological Survey (USGS) completed a comprehensive assessment of groundwater resources within the Standing Rock Indian Reservation. Generalized potentiometric surfaces of the Fort Union, Hell Creek, and Fox Hills aquifers were constructed to assess the groundwater resources of the Standing Rock Indian Reservation. Water-level data from the U.S. Geological Survey Groundwater Site Inventory (GWSI) database, the North Dakota Department of Water Resources (NDDWR), and the South Dakota Department of Agriculture and Natural Resources (SDDANR) were compiled and used to construct generalized potentiometric-surface maps representing average conditions of the Fort Union, Hell Creek, and Fox Hills Formations. The mean of the water-level measurements was used for wells with more than one measurement. Recorded depths to water were converted to hydraulic head by subtracting the depth to water from the land-surface elevation at the well location. Hydraulic-head values were spatially interpolated to create 2-dimensional potentiometric surfaces. The interpolated potentiometric surfaces were contoured using contour intervals of 50 ft and smoothed to correct for extreme changes in the potentiometric surfaces in areas of sparse data.
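
    A small sketch of the head calculation described above; the file and column names are hypothetical, not the actual field names in the released text files.

    ```python
    import pandas as pd

    # Hypothetical file and column names; substitute the actual well-data text file.
    wells = pd.read_csv("fort_union_well_data.txt", sep="\t")

    # Average repeated water-level measurements per well, then convert depth to
    # water into hydraulic head by subtracting it from land-surface elevation.
    mean_depth = wells.groupby("site_id")["depth_to_water_ft"].mean()
    land_surface = wells.groupby("site_id")["land_surface_elev_ft"].first()
    hydraulic_head_ft = land_surface - mean_depth

    # These point values would then be spatially interpolated and contoured
    # at a 50 ft interval, as described above.
    print(hydraulic_head_ft.head())
    ```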

  18. Thirteen years daily and annual mean land surface temperature dataset over...

    • search.dataone.org
    Updated Feb 14, 2018
    Cite
    Ran, Youhua; Li, Xin; Yang, Kun; Meng, Xianhong; Wang, Shaoying (2018). Thirteen years daily and annual mean land surface temperature dataset over the Third pole [Dataset]. http://doi.org/10.1594/PANGAEA.878875
    Explore at:
    Dataset updated
    Feb 14, 2018
    Dataset provided by
    PANGAEA Data Publisher for Earth and Environmental Science
    Authors
    Ran, Youhua; Li, Xin; Yang, Kun; Meng, Xianhong; Wang, Shaoying
    Area covered
    Description

    The Qinghai-Tibet Plateau (QTP), called "the Third Pole" of the Earth, is the water tower of Asia: it not only feeds tens of millions of people but also maintains fragile ecosystems in the arid regions of northwestern China. Spatio-temporally complete representations of land surface temperature are required for many purposes in environmental science, especially in the Third Pole, where traditional ground measurement is difficult and the data are therefore sparse. The thirteen-year cloud-free datasets of daily mean land surface temperature (LST) and mean annual land surface temperature (MAST) from 2004 to 2016 are derived from the four-times-daily MODIS (Moderate Resolution Imaging Spectroradiometer) Terra/Aqua LST products at a resolution of 1 km using a pragmatic data processing algorithm. The comparison between radiance-based LST measurements and the estimated LST shows good agreement in the daily and inter-annual variability, with correlations of 0.95 and 0.99 and biases of -1.73°C (±3.38°C) and -2.07°C (±1.05°C) for daily mean LST and MAST, respectively. The systematic error mainly stems from the definition of daily mean LST, which is represented by the arithmetic average of the daytime and nighttime LSTs. The random error mainly stems from the uncertainty of the original MODIS LST values, especially the daytime LST products. Trend validation using air temperatures from 94 weather stations indicates that the warming trends derived from the time series of MAST data are comparable with those derived from CMA data. The dataset is potentially useful for various studies, including climatology, hydrology, meteorology, ecology, agriculture, public health, and environmental monitoring in the Third Pole and surrounding regions.

  19. Retail Market Basket Transactions Dataset

    • kaggle.com
    Updated Aug 25, 2025
    Cite
    Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
    It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

    • File Name: Market_Basket_Optimisation.csv
    • Format: CSV (Comma-Separated Values)
    • Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket.
    • Nature of Data: Transactional, categorical, sparse.
    • Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.

    Detailed Information

    📊 Dataset Composition

    • Transactions: 7,501 (each row = one basket).
    • Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate, etc.).
    • Columns per row: Up to 20 possible items (not fixed; some rows have fewer, some more).
    • Data Type: Purely categorical (no numerical or continuous features).
    • Missing Values: Present in the form of empty cells (since not every basket has all 20 columns).
    • Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

    🛒 Nature of Transactions

    • Basket Definition: Each row captures items bought together during a single visit to the store.
    • Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.
    • Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse. Most entries in the one-hot encoded representation are zeros.

    🔎 Examples of Data

    Example transaction rows (simplified):

    Item 1 | Item 2 | Item 3 | Item 4 | ...
    Bread | Butter | Jam
    Mineral water | Chocolate | Eggs | Milk
    Spaghetti | Tomato sauce | Parmesan

    Here, empty cells mean no item was purchased in that slot.
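
    A minimal loading and encoding sketch in Python, using the file name quoted in the overview; the plain csv reader copes with the ragged row lengths, and the resulting boolean basket-by-item frame is the input format expected by typical Apriori/FP-Growth implementations (for example mlxtend's apriori and association_rules).

    ```python
    import csv
    import pandas as pd

    # Read the raw baskets; one row per basket, empty cells dropped.
    with open("Market_Basket_Optimisation.csv", newline="") as f:
        transactions = [[item for item in row if item] for row in csv.reader(f)]

    # One-hot encode into the sparse basket-by-item indicator matrix discussed above.
    items = sorted({item for basket in transactions for item in basket})
    onehot = pd.DataFrame(
        [[item in basket for item in items] for basket in transactions],
        columns=items,
    )

    print(len(transactions), "baskets,", len(items), "distinct items")
    print(onehot.mean().sort_values(ascending=False).head(10))   # per-item support
    ```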

    📈 Applications of This Dataset

    This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

    1. Association Rule Mining (Apriori, FP-Growth):

      • Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.
      • Identify cross-selling opportunities.
    2. Product Affinity Analysis:

      • Understand which items tend to be purchased together.
      • Helps with store layout decisions (placing related items near each other).
    3. Recommendation Engines:

      • Build systems that suggest "You may also like" products.
      • Example: If a customer buys pasta and tomato sauce, recommend cheese.
    4. Marketing Campaigns:

      • Bundle promotions and discounts on frequently co-purchased products.
      • Personalized offers based on buying history.
    5. Inventory Management:

      • Anticipate demand for certain product combinations.
      • Prevent stockouts of items that drive the purchase of others.

    📌 Key Insights Potentially Hidden in the Dataset

    • Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.
    • Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.
    • Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior.
    • Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).

    📂 Dataset Limitations

    1. No Customer Identifiers:

      • We cannot track repeated purchases by the same customer.
      • Analysis is limited to basket-level insights.
    2. No Timestamps:

      • No temporal analysis (trends over time, seasonality) is possible.
    3. No Quantities or Prices:

      • We only know whether an item was purchased, not how many units or its cost.
    4. Sparse & Noisy:

      • Many baskets are small (1–2 items), which may produce weak or trivial rules.

    🔮 Potential Extensions

    • Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.
    • Add Customer IDs: If merged with external data, one can perform personalized recommendations.
    • Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).
    • Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

    ...

  20. Data from: Database of Mineral, Thermal and Deep Groundwaters of Hesse,...

    • resodate.org
    Updated Jan 1, 2020
    Rafael Schäffer; Kristian Bär; Sebastian Fischer; Johann-Gerhard Fritsche; Ingo Sass (2020). Database of Mineral, Thermal and Deep Groundwaters of Hesse, Germany [Dataset]. http://doi.org/10.25534/TUDATALIB-340
    Explore at:
    Dataset updated
    Jan 1, 2020
    Dataset provided by
    Technical University of Darmstadt
    Authors
    Rafael Schäffer; Kristian Bär; Sebastian Fischer; Johann-Gerhard Fritsche; Ingo Sass
    Area covered
    Hessen, Germany
    Description

    The hydrochemical composition of groundwater governs its usability for various purposes (drinking water, thermal water, mineral water, mineral extraction, heat extraction, etc.). It also has a major impact on technical use, since changes in temperature and pressure can cause scaling, corrosion, degassing and similar effects that lead to extra costs for remediation measures. Furthermore, the hydrochemistry controls fluid properties such as viscosity and heat capacity. Knowing the hydrochemical composition is therefore essential to predict these properties and to foresee technical or economic challenges that might arise during utilization. For this purpose, we designed and compiled the database of mineral, thermal and deep groundwaters of Hesse, Germany. It includes 1035 published hydrochemical data sets from 560 different measurement points in the entire Hessian territory and some adjacent areas. It has been compiled by the TU Darmstadt in close cooperation with the Hessian Agency for Nature Conservation, Environment and Geology (HLNUG) and the Federal Institute for Geosciences and Natural Resources (BGR) as part of the R&D project Hessen 3D 2.0 (‘3D modelling of the petrothermal and medium deep geothermal resources for power production, direct heat utilization and storage of the federal state of Hesse’), funded by the Federal Ministry for Economic Affairs and Energy (grant number 0325944A). With this publication we provide this database as .xlsx and .csv files.

    The database includes all available measurements that meet at least one of the following criteria:

    • water temperature of at least 20 °C (definition of thermal water)
    • solution content of at least 1 g/l (definition of mineral water)
    • depth of at least 100 m (definition of the formation water database of the BGR, currently under construction)
    • isotope data without hydrochemical data, mainly δ13C, 14C and δ34S

    A few data sets not meeting these criteria are included nevertheless: nine data sets (#439, 440, 463, 464, 479, 480, 781, 928, 975) located in neighbouring states close (≤ 2 km) to the Hessian state border improve the spatial coverage, two data sets (#615, 772) better represent the crystalline Odenwald, and the Sossenheimer Sprudel (#544) is added because of its location within Frankfurt and its historical significance. Analyses older than 1910 are integrated into the database if (a) the analysis appears trustworthy owing to the experience or reputation of the laboratory or author, (b) conversions into units valid today can be carried out correctly, (c) the analysis is of interest for longer time series, or (d) no recent analytical data are available but, due to the geological-hydrochemical importance of the spring or well or due to the data distribution, it is of interest to include them.

    A data set contains up to 122 entries sorted by metadata, references, physico-chemical (sum) parameters, major elements, minor elements, trace elements, dissolved gases, free gases and isotopes (Fig. 1). A further 63 fields, available only in the .xlsx version, serve for semi-automatic data evaluation, control and interpretation; they provide complementary information to the original data sets as given in the individual references. These additions enhance the usability and facilitate the interpretation of the original data sets.

    Fields filled in light grey in the columns for coordinates, altitude, final depth and geological information mark additions by the authors that were not given in the original reference. Fields filled in light grey in the columns for the electrical balance and major elements mark the authors' own calculations, e.g. bicarbonate derived from the carbonate hardness or single ions derived from the electrical balance itself. Fields filled in dark grey indicate comments or explanations in the observation column, for example additional information, contradictory values in different literature sources, or missing data. Electrical balances marked with a red filling indicate deviations larger than ±5%. Cations and anions define the water type (authors' own classification) if they have an equivalent concentration of at least 20%, and are listed in descending order of their ratios.

    The data sets described above are supplemented by 18 data sets with mean values for rock units occurring in Hesse, taken from Ludwig (2013), to allow comparison of the database with mean values for similar rock types or formations. Furthermore, 37 data sets from Baden-Württemberg, Rhineland-Palatinate and the French part of the Upper Rhine Graben (URG), taken from Stober & Jodocy (2011), Stober & Bucher (2015) and Sanjuan et al. (2016), are included because only sparse literature data are available in Hesse for the hydrogeological units of the URG graben fill and the underlying Mesozoic to Paleozoic units.
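
    As a rough illustration of how the inclusion criteria above could be applied to the published .csv, the pandas sketch below filters the records; the file name and column names are placeholders that must be matched against the actual headers of the database.

        import pandas as pd

        # Placeholder file and column names; check them against the published database.
        df = pd.read_csv("hesse_groundwater_database.csv")

        thermal = df["temperature_degC"] >= 20           # thermal water: T >= 20 degC
        mineral = df["solution_content_g_per_l"] >= 1    # mineral water: >= 1 g/l
        deep    = df["final_depth_m"] >= 100             # deep groundwater: depth >= 100 m

        # Keep records meeting at least one criterion (isotope-only records not handled here).
        subset = df[thermal | mineral | deep]
        print(len(subset), "records meet at least one inclusion criterion")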
