CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
See the article "Targeting model based on principal component analysis and extreme learning machine" for the meaning of the data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Previously, we have established two distinct progressive multiple sclerosis (MS) models by induction of experimental autoimmune encephalomyelitis (EAE) with myelin oligodendrocyte glycoprotein (MOG) in two mouse strains. A.SW mice develop ataxia with antibody deposition, but no T cell infiltration, in the central nervous system (CNS), while SJL/J mice develop paralysis with CNS T cell infiltration. In this study, we determined biomarkers contributing to the homogeneity and heterogeneity of two models. Using the CNS and spleen microarray transcriptome and cytokine data, we conducted computational analyses. We identified up-regulation of immune-related genes, including immunoglobulins, in the CNS of both models. Pro-inflammatory cytokines, interferon (IFN)-γ and interleukin (IL)-17, were associated with the disease progression in SJL/J mice, while the expression of both cytokines was detected only at the EAE onset in A.SW mice. Principal component analysis (PCA) of CNS transcriptome data demonstrated that down-regulation of prolactin may reflect disease progression. Pattern matching analysis of spleen transcriptome with CNS PCA identified 333 splenic surrogate markers, including Stfa2l1, which reflected the changes in the CNS. Among them, we found that two genes (PER1/MIR6883 and FKBP5) and one gene (SLC16A1/MCT1) were also significantly up-regulated and down-regulated, respectively, in human MS peripheral blood, using data mining.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Abstract: Civil structures are prone to damage during their service life, which causes them to lose their serviceability and safety. Damage assessment can therefore guarantee the integrity of structures. A structural damage detection approach with two main components, a set of accelerometers to record the response data and a data mining (DM) procedure, is widely used to extract information on the structural health condition. In recent decades, DM has provided numerous solutions to structural health monitoring (SHM) problems as an all-inclusive technique owing to its powerful computational ability. This paper presents the first attempt to illustrate the applications of data mining techniques (DMTs) in SHM through an intensive review of articles that use DMTs for classification-, prediction- and optimization-based data mining methods. Applications of DMTs in the SHM research area are classified according to this categorization, and it is concluded that applications of DMTs in the SHM domain have been implemented increasingly over the last decade and that the most popular techniques in the area were artificial neural network (ANN), principal component analysis (PCA) and genetic algorithm (GA), in that order.
The standardized database registered 1322 cropping events
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Clustering texts is an essential task in data mining and information retrieval; its aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is complex because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC uses only a single-view representation for clustering. A single-view representation is inefficient because it cannot represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by searching for the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the principal component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We therefore conclude that MVR improves the performance of STC; however, designing an MVR requires careful selection of single-view representations.
Contents of the database: this database contains the results of the micro-Particle Induced X-ray Emission (µ-PIXE) analysis of calcite crystals in lapis lazuli rocks. The samples are reference rocks of known origin coming from different lapis lazuli mining areas (Afghanistan, Tajikistan, Siberia, Myanmar), procured through on-site geological expeditions, museum collections or international markets. The source of each rock is indicated in the data sheet. The data were collected during a long-term study started in 2007 by the Solid State Physics Group at the Department of Physics of the University of Turin (Italy) with the aim of creating a non-invasive analytical protocol to investigate the provenance of lapis lazuli material used in artworks throughout history. The analysis was carried out over several years and in different facilities using accelerated proton microbeams; this information and some basic details about the measurement conditions (e.g. beam energy) are reported in the data sheet. The software used for data processing is GUPIXWIN or derivatives.
The full description of the analysis methodology and the discussion of the results in terms of the development of a lapis lazuli provenance protocol can be found in:
(Magalini et al., 2024) Magalini, M., Guidorzi, L., Re, A., Marabotto, M., Borghi, A., Gallo, P., Visale, M., La Torre, L., Campostrini, M., Lemasson, Q., Pichon, L., Moignard, B., Pacheco, C., Couture, P., Palitsin, V., Lo Giudice, A. (Unpublished) Study of compositional and luminescence properties of calcite in lapis lazuli for provenance investigations of archaeological findings. Paper submitted to Eur. Phys. J. Plus.
and in previous publications related to the analysis of diopside and pyrite:
(Lo Giudice et al., 2017) Lo Giudice, A., Angelici, D., Re, A., Gariani, G., Borghi, A., Calusi, S., Giuntini, L., Massi, M., Castelli, L., Taccetti, F., Calligaro, T., Pacheco, C., Lemasson, Q., Pichon, L., Moignard, B., Pratesi, G., Guidotti, M.C., 2017. Protocol for lapis lazuli provenance determination: evidence for an Afghan origin of the stones used for ancient carved artefacts kept at the Egyptian Museum of Florence (Italy). Archaeol. Anthropol. Sci. 9, 637–651. https://doi.org/10.1007/s12520-016-0430-0
(Guidorzi et al., 2023) Guidorzi, L., Re, A., Magalini, M., Angelici, D., Borghi, A., Vaggelli, G., Fantino, F., Rigato, V., La Torre, L., Lemasson, Q., Pacheco, C., Pichon, L., Moignard, B., Lo Giudice, A., 2023. Micro-PIXE and micro-IBIL characterisation of lapis lazuli samples from Myanmar mines and implications for provenance study. The European Physical Journal Plus, 138(2), 175. https://doi.org/10.1140/epjp/s13360-023-03768-x
The characterisation of reference rocks is an ongoing activity of the group, and the database will be updated in further versions as new data are collected.
Contents of the PCA dataset: the PCA dataset reports the part of the database that can be used to perform Principal Component Analysis to differentiate among four lapis lazuli provenances: Afghanistan (AFG), Tajikistan (TAJ), Myanmar (MYA) and Siberia (SIB). It contains only the concentration values of those elements in calcite that are relevant for provenance discrimination (Mg, Mn, Sr, Y). If an element was determined to be below the limit of detection (LOD), the LOD/2 value is reported in the dataset. The details of the application of PCA to lapis lazuli provenance investigation are reported in the following publication, where the approach was presented for the first time:
(Guidorzi et al., 2023b) Guidorzi, L., Re, A., Magalini, M., & Lo Giudice, A., 2023. Application of principal component analysis to µ-PIXE data in lapis lazuli provenance studies. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 540, 45-50. https://doi.org/10.1016/j.nimb.2023.04.007
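The LOD/2 substitution and the PCA step described above can be illustrated with a small sketch. It is not the authors' pipeline: the concentration values and detection limits below are made up, columns are standardized before PCA (an assumption, since the description does not state the preprocessing), and only the first principal component is computed, by power iteration, so that the example needs nothing beyond the Python standard library.

```python
import math
from typing import Dict, List, Optional

ELEMENTS = ["Mg", "Mn", "Sr", "Y"]  # elements named in the PCA dataset description


def impute_half_lod(rows: List[List[Optional[float]]],
                    lod: Dict[str, float]) -> List[List[float]]:
    """Replace below-LOD entries (None) with half the element's detection limit."""
    return [[lod[el] / 2 if v is None else v for el, v in zip(ELEMENTS, row)]
            for row in rows]


def standardize(matrix: List[List[float]]) -> List[List[float]]:
    """Centre each column and scale it to unit sample standard deviation."""
    n = len(matrix)
    cols = list(zip(*matrix))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in matrix]


def correlation_matrix(z: List[List[float]]) -> List[List[float]]:
    """Covariance of already standardized columns, i.e. the correlation matrix."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov: List[List[float]], iterations: int = 500) -> List[float]:
    """Leading eigenvector of a symmetric matrix, found by power iteration."""
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)  # start along the first axis
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical measurements (arbitrary units); None marks "below LOD".
rows = [
    [1200.0, 35.0, 410.0, None],
    [1500.0, 42.0, 380.0, 3.1],
    [5200.0, None, 150.0, 9.5],
    [4800.0, 12.0, 170.0, 8.7],
    [2100.0, 30.0, 300.0, 4.0],
]
lod = {"Mg": 50.0, "Mn": 10.0, "Sr": 5.0, "Y": 2.0}  # hypothetical limits

imputed = impute_half_lod(rows, lod)
z = standardize(imputed)
pc1 = first_component(correlation_matrix(z))
scores = [sum(a * b for a, b in zip(row, pc1)) for row in z]
print("PC1 loadings:", dict(zip(ELEMENTS, pc1)))
print("PC1 scores:", scores)
```

In practice a library routine that returns all components would replace the power iteration; the sketch only shows where the LOD/2 values enter the computation.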
https://creativecommons.org/publicdomain/zero/1.0/
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
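As a quick check of the imbalance figures quoted above, the following sketch recomputes the fraud rate from the two counts given in the description. The function names are illustrative only. The quotient comes out at about 0.1727%, which the description quotes as 0.172%.

```python
# Counts quoted in the dataset description.
N_TRANSACTIONS = 284_807
N_FRAUDS = 492


def fraud_rate(n_frauds: int, n_total: int) -> float:
    """Fraction of transactions that belong to the positive (fraud) class."""
    return n_frauds / n_total


def legitimate_per_fraud(n_frauds: int, n_total: int) -> float:
    """Number of legitimate transactions for every fraudulent one."""
    return (n_total - n_frauds) / n_frauds


rate = fraud_rate(N_FRAUDS, N_TRANSACTIONS)
ratio = legitimate_per_fraud(N_FRAUDS, N_TRANSACTIONS)
print(f"fraud rate: {rate:.4%}")
print(f"legitimate transactions per fraud: {ratio:.1f}")
```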
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Empirical Advances with, Annex 4, The 1015 residents’ health table, content: the 1015 de-identified residents’ medical histories and pathologies on September 30th 2013, as well as their falling history and 10-anonymized NHs (XLS 256 kb)
NASA is emphasizing the use of larger, more integrated models in conjunction with systems engineering tools and decision support systems. These tools place a corresponding stress on legacy engineering visualization systems which now are required to handle larger data sets, provide more intuition to the user, integrate well with many other tools, and help the user with his/her ultimate goal: improving the design of complex systems.
Phoenix Integration proposes to complete the prototype visualization environment created during Phase I to the point where it is a commercially viable product. New features, refinements, and integration with other tools will be accomplished in Phase II. In particular, the work will involve major improvements to whitespace exploration algorithms, techniques that enable users to unconstrain or modify the underlying engineering model in an effort to obtain results in previously unattainable areas. Work will also include more data mining algorithms (e.g. Principal Component Analysis), new graph types (e.g. spider plots), export formats to 3-D tools (e.g. Tecplot), integration with MBSE/SysML tools, integration with web-based decision support environments, and incorporation of probabilistic analysis. A rich integration with ModelCenter, the company's engineering integration and trade study environment, is planned, although a standalone capability will also be offered. The visualizer's architecture will be based on OpenGL and will use the GPU to parallelize rendering computations. Design will focus on usability and responsiveness, with the goal of providing quick insight into complex data. The tool will be user-tested through early adopters to ensure relevance and to guide development.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.
Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.
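To make the point about accuracy concrete, here is a minimal sketch on made-up toy data. It uses average precision, one common step-wise estimate of the area under the precision-recall curve; the labels, scores and the 0.5 decision threshold are all assumptions chosen for illustration. Two score vectors yield the same accuracy after thresholding, yet rank the frauds very differently.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of precision@rank over the ranks of the positives.

    Assumes no tied scores; ties would need explicit handling.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(int(y == p) for y, p in zip(labels, predictions)) / len(labels)


def threshold(scores: Sequence[float], cutoff: float = 0.5) -> List[int]:
    """Turn scores into 0/1 predictions with a fixed cutoff."""
    return [int(s >= cutoff) for s in scores]


# Toy data (made up): 2 frauds among 10 transactions.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
# Frauds receive the two highest scores, but every score stays below 0.5.
good_scores = [0.45, 0.30, 0.10, 0.40, 0.20, 0.25, 0.05, 0.15, 0.22, 0.12]
# Frauds receive the two lowest scores.
bad_scores = [0.01, 0.30, 0.10, 0.02, 0.20, 0.25, 0.05, 0.15, 0.22, 0.12]

for name, scores in (("good ranking", good_scores), ("bad ranking", bad_scores)):
    acc = accuracy(labels, threshold(scores))
    ap = average_precision(labels, scores)
    print(f"{name}: accuracy={acc:.2f}, average precision={ap:.3f}")
```

Both models predict "legitimate" for every transaction at the 0.5 cutoff, so accuracy cannot separate them, while average precision does.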
The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.
The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project
Please cite the following works:
Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
Andrea Dal Pozzolo, Olivier Caelen, Yann-Aël Le Borgne, Serge Waterschoot and Gianluca Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.
Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi and Gianluca Bontempi. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.
Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).
Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer and Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.
Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen and Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.
Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé and Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection. INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé and Gianluca Bontempi. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection. Information Sciences, 2019.
Yann-Aël Le Borgne and Gianluca Bontempi. Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook.
Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé and Gianluca Bontempi. Incremental learning strategies for credit cards fraud detection. International Journal of Data Science and Analytics.
http://opendatacommons.org/licenses/dbcl/1.0/
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable; it takes value 1 in case of fraud and 0 otherwise.
Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.
MIT License https://opensource.org/licenses/MIT
As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.
About the Dataset:
This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.
Content of the Data:
Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.
The only features that have not been transformed by PCA are 'Time' and 'Amount'.
The target variable for this classification task is 'Class', which takes value 1 in case of fraud and 0 otherwise.
Important Note on Evaluation:
Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
How to Use This Dataset:
Acknowledgements and Citation:
This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).
When using this dataset in your research or projects, please cite the following works as appropriate:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Performance of STC with the best result for single-view and each type of MVRs on the Twitter dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Performance of STC with the best result for single-view and each type of MVRs on the StackOverflow dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Traumatic brain injury is highly prevalent in the United States. However, despite its frequency and significance, there is little understanding of how the brain responds during injurious loading. A confounding problem is that because testing conditions vary between assessment methods, brain biomechanics cannot be fully understood. Data mining techniques, which are commonly used to determine patterns in large datasets, were applied to discover how changes in testing conditions affect the mechanical response of the brain. Data at various strain rates were collected from published literature and sorted into datasets based on strain rate and tension vs. compression. Self-organizing maps were used to conduct a sensitivity analysis to rank the testing condition parameters by importance. Fuzzy C-means clustering was applied to determine if there were any patterns in the data. The parameter rankings and clustering for each dataset varied, indicating that the strain rate and type of deformation influence the role of these parameters in the datasets.
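For readers unfamiliar with fuzzy C-means, the sketch below shows the standard alternating update of cluster centres and soft memberships. It is not the study's implementation: the fuzzifier m = 2, the fixed iteration count, the random initialisation and the two-dimensional toy points are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def fuzzy_c_means(points: Sequence[Sequence[float]], n_clusters: int,
                  m: float = 2.0, iterations: int = 100,
                  seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (centres, memberships); each membership row sums to 1."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, normalised per point.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])
    centres: List[List[float]] = []
    for _ in range(iterations):
        # Centres are means of the points weighted by membership**m.
        centres = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(len(points))]
            total = sum(weights)
            centres.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                            for d in range(dim)])
        # Memberships follow from relative distances to the centres.
        for i, p in enumerate(points):
            dists = [max(math.dist(p, c), 1e-12) for c in centres]
            for k in range(n_clusters):
                u[i][k] = 1.0 / sum((dists[k] / dj) ** (2.0 / (m - 1.0))
                                    for dj in dists)
    return centres, u


# Hypothetical two-feature observations forming two loose groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
          (10.0, 10.2), (10.1, 9.9), (9.8, 10.0)]
centres, memberships = fuzzy_c_means(points, n_clusters=2)
for point, row in zip(points, memberships):
    print(point, [round(x, 3) for x in row])
```

Unlike hard clustering, each point keeps a degree of membership in every cluster, which is what allows borderline observations to be identified.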
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Description of PCA outputs.
This is a metadata only record. Datasets described below were provided by the Macau Institute for Tourism Studies by Dr Fanny Vong. Direct enquiries for access to: fanny@iftm.edu.mo.
The four datasets (ELogbookReport 2011–2014) were extracted from the electronic internship logbooks of interns held by the Macau Institute for Tourism Studies. The datasets contain two main types of data: textual data and numerical data about the internship experiences of the interns and the interns’ performance as assessed by their internship supervisors. The relevant textual data were content analysed both manually and automatically using the text-mining tool Leximancer. The numerical data were analysed using appropriate statistical tools including Principal Component Analysis (PCA), ANOVA, t-tests, Mann-Whitney, and Kruskal-Wallis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Additional file 12. The active molecules from AID 2559 and AID 2561 were used as the test set. Both are confirmatory high-throughput screening bioassay datasets. AID 2559 consisted of 58 active and 67 inactive molecules, whereas AID 2561 had 37 active and 148 inactive molecules. The actives from both were combined to form the test set, provided as an ARFF file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Semi-supervised Multi-View Learning for Gene Network Reconstruction
SynTReN Data:
E. coli and Yeast sub-networks, generated expression data and gold standards (Input_Datasets.zip)
Interactions predicted by base methods (Base_Method_Predictions.zip)
Interactions predicted by our approach - Clustering performed with PCA (Predictions.zip)
Interactions predicted by our approach - Clustering performed with K-means (PredictionsK.zip)
DREAM5 Data:
Expression data and gold standards provided by Marbach et al. 2012 [1]
Interactions predicted by the considered DREAM5 base methods provided by Marbach et al. 2012 [1]
Interactions predicted by our approach - Clustering performed with PCA (Predictions_D5.zip)
Interactions predicted by our approach - Clustering performed with K-means (PredictionsK_D5.zip)
[1] Marbach, D., Costello, J. C., Kuffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G., Wisdom of crowds for robust gene network inference, Nature Methods, 9, 796-804, 2012.