35 datasets found

Raw data of the ships of priority II in 2017
figshare.com
xlsx
Updated Jul 28, 2020
Cite
Jiaqi Mu (2020). Raw data of the ships of priority II in 2017 [Dataset]. http://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
Available download formats
xlsx
Unique identifier
https://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
Dataset updated
Jul 28, 2020
Dataset provided by
4TU.ResearchData
Authors
Jiaqi Mu
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
See the article "Targeting model based on principal component analysis and extreme learning machine" for the meaning of the data.
Data_Sheet_1_Bioinformatics Analyses Determined the Distinct CNS and...
frontiersin.figshare.com
docx
Updated May 31, 2023
Cite
Seiichi Omura; Fumitaka Sato; Nicholas E. Martinez; Ah-Mee Park; Mitsugu Fujita; Nikki J. Kennett; Urška Cvek; Alireza Minagar; J. Steven Alexander; Ikuo Tsunoda (2023). Data_Sheet_1_Bioinformatics Analyses Determined the Distinct CNS and Peripheral Surrogate Biomarker Candidates Between Two Mouse Models for Progressive Multiple Sclerosis.docx [Dataset]. http://doi.org/10.3389/fimmu.2019.00516.s001
Available download formats
docx
Unique identifier
https://doi.org/10.3389/fimmu.2019.00516.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Seiichi Omura; Fumitaka Sato; Nicholas E. Martinez; Ah-Mee Park; Mitsugu Fujita; Nikki J. Kennett; Urška Cvek; Alireza Minagar; J. Steven Alexander; Ikuo Tsunoda
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Previously, we have established two distinct progressive multiple sclerosis (MS) models by induction of experimental autoimmune encephalomyelitis (EAE) with myelin oligodendrocyte glycoprotein (MOG) in two mouse strains. A.SW mice develop ataxia with antibody deposition, but no T cell infiltration, in the central nervous system (CNS), while SJL/J mice develop paralysis with CNS T cell infiltration. In this study, we determined biomarkers contributing to the homogeneity and heterogeneity of two models. Using the CNS and spleen microarray transcriptome and cytokine data, we conducted computational analyses. We identified up-regulation of immune-related genes, including immunoglobulins, in the CNS of both models. Pro-inflammatory cytokines, interferon (IFN)-γ and interleukin (IL)-17, were associated with the disease progression in SJL/J mice, while the expression of both cytokines was detected only at the EAE onset in A.SW mice. Principal component analysis (PCA) of CNS transcriptome data demonstrated that down-regulation of prolactin may reflect disease progression. Pattern matching analysis of spleen transcriptome with CNS PCA identified 333 splenic surrogate markers, including Stfa2l1, which reflected the changes in the CNS. Among them, we found that two genes (PER1/MIR6883 and FKBP5) and one gene (SLC16A1/MCT1) were also significantly up-regulated and down-regulated, respectively, in human MS peripheral blood, using data mining.
Data from: Recent Developments in Damage Identification of Structures Using...
scielo.figshare.com
jpeg
Updated May 31, 2023
Cite
Meisam Gordan; Hashim Abdul Razak; Zubaidah Ismail; Khaled Ghaedi (2023). Recent Developments in Damage Identification of Structures Using Data Mining [Dataset]. http://doi.org/10.6084/m9.figshare.5931043.v1
Available download formats
jpeg
Unique identifier
https://doi.org/10.6084/m9.figshare.5931043.v1
Dataset updated
May 31, 2023
Dataset provided by
SciELO journals
Authors
Meisam Gordan; Hashim Abdul Razak; Zubaidah Ismail; Khaled Ghaedi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abstract: Civil structures are usually prone to damage during their service life, which leads them to lose their serviceability and safety. Thus, damage assessment can guarantee the integrity of structures. As a result, a structural damage detection approach with two main components, a set of accelerometers to record the response data and a data mining (DM) procedure, is widely used to extract information on the structural health condition. In recent decades, DM has provided numerous solutions to structural health monitoring (SHM) problems as an all-inclusive technique owing to its powerful computational ability. This paper presents the first attempt to illustrate the applications of data mining techniques (DMTs) in SHM through an intensive review of articles that use DMTs for classification-, prediction- and optimization-based data mining methods. According to this categorization, applications of DMTs in the SHM research area are classified, and it is concluded that DMTs have been increasingly applied in the SHM domain over the last decade and that the most popular techniques in the area were artificial neural network (ANN), principal component analysis (PCA) and genetic algorithm (GA), in that order.
Plantain harvest events for Colombia
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Fondo Nacional de Fomento Hortifruticola de Colombia; International Center for Tropical Agriculture (2023). Plantain harvest events for Colombia [Dataset]. http://doi.org/10.7910/DVN/GCMHMZ
Unique identifier
https://doi.org/10.7910/DVN/GCMHMZ
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Fondo Nacional de Fomento Hortifruticola de Colombia; International Center for Tropical Agriculture
Time period covered
Jan 1, 2011 - Dec 1, 2012
Area covered
Colombia
Description
The standardized database registered 1322 cropping events
Statistic information of the STC datasets.
plos.figshare.com
xls
Updated Aug 23, 2024
Cite
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Statistic information of the STC datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309206.t001
Dataset updated
Aug 23, 2024
Dataset provided by
PLOS ONE
Authors
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Clustering texts is an essential task in data mining and information retrieval; its aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is complex because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC considers only a single-view representation for clustering. A single-view representation is inefficient for representing text because it cannot represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by searching for the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the principal component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We therefore conclude that MVR improves the performance of STC; however, the design of an MVR requires a selective choice of single-view representations.
Database of elemental concentration of pyrite in lapis lazuli rocks of...
explore.openaire.eu
zenodo.org
Updated Oct 8, 2024
Cite
Alessandro Lo Giudice; Alessandro Re; Laura Guidorzi; Marta Magalini (2024). Database of elemental concentration of pyrite in lapis lazuli rocks of different provenances obtained by micro-PIXE [Dataset]. http://doi.org/10.5281/zenodo.13902915
Unique identifier
https://doi.org/10.5281/zenodo.13902915
Dataset updated
Oct 8, 2024
Authors
Alessandro Lo Giudice; Alessandro Re; Laura Guidorzi; Marta Magalini
Description
Contents of the database: this database contains the results of the micro-Particle Induced X-ray Emission (µ-PIXE) analysis of calcite crystals in lapis lazuli rocks. The samples are reference rocks of known origin from different lapis lazuli mining areas (Afghanistan, Tajikistan, Siberia, Myanmar), procured through on-site geological expeditions, museum collections or international markets. The source of each rock is indicated in the data sheet. The data were collected during a long-term study started in 2007 by the Solid State Physics Group at the Department of Physics of the University of Turin (Italy) with the aim of creating a non-invasive analytical protocol to investigate the provenance of lapis lazuli material used in artworks throughout history. The analysis was carried out over several years and in different facilities using accelerated proton microbeams; this information and some basic details about the measurement conditions (e.g. beam energy) are reported in the data sheet. The software used for data processing is GUPIXWIN or derivatives.

The full description of the analysis methodology and the discussion of the results in terms of the development of a lapis lazuli provenance protocol can be found in:

(Magalini et al., 2024) Magalini, M., Guidorzi, L., Re, A., Marabotto, M., Borghi, A., Gallo, P., Visale, M., La Torre, L., Campostrini, M., Lemasson, Q., Pichon, L., Moignard, B., Pacheco, C., Couture, P., Palitsin, V., Lo Giudice, A. (Unpublished) Study of compositional and luminescence properties of calcite in lapis lazuli for provenance investigations of archaeological findings. Paper submitted to Eur. Phys. J. Plus.

And in previous publications related to the analysis of diopside and pyrite:

(Lo Giudice et al., 2017) Lo Giudice, A., Angelici, D., Re, A., Gariani, G., Borghi, A., Calusi, S., Giuntini, L., Massi, M., Castelli, L., Taccetti, F., Calligaro, T., Pacheco, C., Lemasson, Q., Pichon, L., Moignard, B., Pratesi, G., Guidotti, M.C., 2017. Protocol for lapis lazuli provenance determination: evidence for an Afghan origin of the stones used for ancient carved artefacts kept at the Egyptian Museum of Florence (Italy). Archaeol. Anthropol. Sci. 9, 637–651. https://doi.org/10.1007/s12520-016-0430-0

(Guidorzi et al., 2023) Guidorzi, L., Re, A., Magalini, M., Angelici, D., Borghi, A., Vaggelli, G., Fantino, F., Rigato, V., La Torre, L., Lemasson, Q., Pacheco, C., Pichon, L., Moignard, B., Lo Giudice, A., 2023. Micro-PIXE and micro-IBIL characterisation of lapis lazuli samples from Myanmar mines and implications for provenance study. The European Physical Journal Plus, 138(2), 175. https://doi.org/10.1140/epjp/s13360-023-03768-x

The characterisation of reference rocks is an ongoing activity of the group, and further versions of this database will be released as new data are collected.

Contents of the PCA dataset: the PCA dataset reports the part of the database data that can be used to perform Principal Component Analysis to differentiate among four lapis lazuli provenances: Afghanistan (AFG), Tajikistan (TAJ), Myanmar (MYA) and Siberia (SIB). It contains only the elemental concentration values of those elements in calcite that are relevant for provenance discrimination (Mg, Mn, Sr, Y). If an element was determined to be below the limit of detection (LOD), the LOD/2 value is reported in the dataset. The details of the application of PCA to lapis lazuli provenance investigation are reported in the following publication, where the approach was presented for the first time:

(Guidorzi et al., 2023b) Guidorzi, L., Re, A., Magalini, M., & Lo Giudice, A., 2023. Application of principal component analysis to µ-PIXE data in lapis lazuli provenance studies. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 540, 45-50. https://doi.org/10.1016/j.nimb.2023.04.007
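
The LOD/2 convention described for the PCA dataset can be illustrated with a short sketch. This is not the authors' processing code: the function name `substitute_lod`, the use of `None` to mark a non-detect, and every numeric value below are hypothetical and serve only to show the substitution rule.

```python
from typing import Optional


def substitute_lod(value: Optional[float], lod: float) -> float:
    """Return the measured concentration, or LOD/2 for a non-detect.

    Assumption of this sketch: `None` or a value below `lod` marks a
    measurement below the limit of detection.
    """
    if value is None or value < lod:
        return lod / 2.0
    return value


# Hypothetical readings for one element with a hypothetical LOD of 10.
readings = [35.0, None, 12.5, 4.0]
cleaned = [substitute_lod(v, lod=10.0) for v in readings]
print(cleaned)  # [35.0, 5.0, 12.5, 5.0]
```

Replacing non-detects with a fixed value keeps every sample usable in PCA, which needs a complete numeric matrix.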
MVR performance t-test result.
figshare.com
xls
Updated Aug 23, 2024
Cite
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). MVR performance t-test result. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t005
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0309206.t005
Dataset updated
Aug 23, 2024
Dataset provided by
PLOS ONE
Authors
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Clustering texts is an essential task in data mining and information retrieval; its aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, short text clustering (STC) is complex because short texts are typically sparse, ambiguous, noisy, and lacking in information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC considers only a single-view representation for clustering. A single-view representation is inefficient for representing text because it cannot represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR), found by searching for the best combination of different single-view representations, to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the principal component analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on the experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). We therefore conclude that MVR improves the performance of STC; however, the design of an MVR requires a selective choice of single-view representations.
Credit card fraud detection Date 25th of June 2015
kaggle.com
Updated Oct 29, 2023
Cite
Zohair ahmed (2023). Credit card fraud detection Date 25th of June 2015 [Dataset]. https://www.kaggle.com/datasets/qnqfbqfqo/credit-card-fraud-detection-date-25th-of-june-2015
Dataset updated
Oct 29, 2023
Dataset provided by
Kaggle
Authors
Zohair ahmed
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
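
As a quick sanity check on the figures and feature names quoted in this description, the column layout and the fraud share can be reproduced in a few lines. The column ordering is an assumption of this sketch; the description names the columns but does not state their order in the file.

```python
# Figures quoted in the description.
N_TRANSACTIONS = 284_807
N_FRAUDS = 492

# Time, the 28 PCA components, Amount, and the Class label.
# Assumption: this ordering is illustrative, not confirmed by the source.
columns = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]

fraud_share = N_FRAUDS / N_TRANSACTIONS
print(len(columns))          # 31
print(f"{fraud_share:.4%}")  # 0.1727%
```

The computed share, 0.1727%, agrees with the quoted 0.172% once the last digit is truncated.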

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.
Additional file 4b: of Empirical advances with text mining of electronic...
figshare.com
xls
Updated Jun 4, 2023
Cite
T. Delespierre; P. Denormandie; A. Bar-Hen; L. Josseran (2023). Additional file 4b: of Empirical advances with text mining of electronic health records [Dataset]. http://doi.org/10.6084/m9.figshare.c.3860563_D4.v1
Available download formats
xls
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3860563_D4.v1
Dataset updated
Jun 4, 2023
Dataset provided by
figshare
Authors
T. Delespierre; P. Denormandie; A. Bar-Hen; L. Josseran
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Empirical Advances with, Annex 4, The 1015 residents’ health table, content: the 1015 de-identified residents’ medical histories and pathologies on September 30th 2013, as well as their falling history and 10-anonymized NHs (XLS 256 kb)
Integrated Visualization Environment for Science Mission Modeling, Phase II
cloud.csiss.gmu.edu
html
Updated Jan 29, 2020
Cite
United States (2020). Integrated Visualization Environment for Science Mission Modeling, Phase II [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/integrated-visualization-environment-for-science-mission-modeling-phase-ii
Available download formats
html
Dataset updated
Jan 29, 2020
Dataset provided by
United States
Description
NASA is emphasizing the use of larger, more integrated models in conjunction with systems engineering tools and decision support systems. These tools place a corresponding stress on legacy engineering visualization systems which now are required to handle larger data sets, provide more intuition to the user, integrate well with many other tools, and help the user with his/her ultimate goal: improving the design of complex systems.
Phoenix Integration proposes to complete the prototype visualization environment created during Phase I to the point where it is a commercially viable product. New features, refinements, and integration with other tools will be accomplished in Phase II. In particular, the work will involve major improvements to whitespace exploration algorithms, techniques that enable users to unconstrain or modify the underlying engineering model in an effort to obtain results in previously unattainable areas. Work will also include more data mining algorithms (e.g. Principal Component Analysis), new graph types (e.g. spider plots), export formats to 3-D tools (e.g. Tecplot), integration with MBSE/SysML tools, integration with web-based decision support environments, and incorporation of probabilistic analysis. A rich integration with ModelCenter, the company's engineering integration and trade study environment, is planned, although a standalone capability will also be offered. The visualizer's architecture will be based on OpenGL and will use the GPU to parallelize rendering computations. Design will focus on usability and responsiveness, with the goal of providing quick insight into complex data. The tool will be user-tested through early adopters to ensure relevance and to guide development.
Data from: Financial Fraud Detection Dataset
opendatabay.com
Updated Jun 25, 2025
Cite
Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
Dataset updated
Jun 25, 2025
Dataset authored and provided by
Review Nexus
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Area covered
Fraud Detection & Risk Management
Description
This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

Performance Note:

Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.
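
To see why accuracy is uninformative here, consider a baseline that labels every transaction as legitimate. The sketch below uses only the record counts quoted in this description; treating precision as 0 when nothing is flagged is a convention chosen for the sketch, not something the source specifies.

```python
N_TOTAL = 284_807
N_FRAUD = 492

# Confusion counts for a baseline that never predicts fraud.
tp, fp = 0, 0
fn, tn = N_FRAUD, N_TOTAL - N_FRAUD

accuracy = (tp + tn) / N_TOTAL
# Convention for this sketch: undefined ratios are reported as 0.0.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9983
print(f"F1 = {f1:.4f}")              # F1 = 0.0000
```

The baseline detects no fraud at all yet scores above 99.8% accuracy, while its F1-score is zero.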

Features:

Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.

Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.

Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.

Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.

Usage:

Machine learning model training for fraud detection.

Evaluation of anomaly detection and imbalanced classification methods.

Development of cost-sensitive learning approaches using the Amount variable.
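
One way to use the Amount variable for cost-sensitive evaluation is sketched below. The cost rule is an assumption made for illustration (a missed fraud costs its transaction amount, and every flagged transaction costs a fixed review fee); the source does not define a cost matrix, and `ADMIN_COST`, `total_cost`, and the sample values are hypothetical.

```python
# Hypothetical fixed cost of reviewing one flagged transaction.
ADMIN_COST = 5.0


def total_cost(amounts, labels, predictions, admin_cost=ADMIN_COST):
    """Sum an example-dependent cost over all transactions.

    Assumed rule: flagged transactions (true or false positives) cost
    `admin_cost`; missed frauds (false negatives) cost their amount;
    correctly ignored legitimate transactions cost nothing.
    """
    cost = 0.0
    for amount, y, y_hat in zip(amounts, labels, predictions):
        if y_hat == 1:
            cost += admin_cost
        elif y == 1:
            cost += amount
    return cost


amounts = [20.0, 500.0, 3.0, 75.0]
labels = [0, 1, 1, 0]
predictions = [0, 0, 1, 1]
print(total_cost(amounts, labels, predictions))  # 510.0
```

Under this rule, missing the single 500.0 fraud dominates the total, which is the behaviour a cost-sensitive learner would be trained to avoid.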

Data Summary:

Total Records: 284,807

Fraud Cases: 492

Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.

Columns: 31 total (28 PCA features, plus Time, Amount, and Class)

License:

The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics
Credit Card Fraud Detection
kaggle.com
test.researchdata.tuwien.ac.at
zip
Updated Mar 23, 2018
Cite
Machine Learning Group - ULB (2018). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/mlg-ulb/creditcardfraud
Available download formats
zip (69155672 bytes)
Dataset updated
Mar 23, 2018
Dataset authored and provided by
Machine Learning Group - ULB
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
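
The AUPRC recommended above is often estimated as average precision, the mean of the precision values at each rank where a fraud is found. The following is a minimal pure-Python sketch of that estimate; the function name and the toy scores are hypothetical, and tied scores are not given any special handling.

```python
def average_precision(labels, scores):
    """Estimate AUPRC as the mean precision at each true-positive rank.

    `labels` are 0/1 ground-truth values, `scores` are model outputs
    where a higher score means "more likely fraud".
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / n_pos


labels = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.8, 0.05, 0.15]
print(average_precision(labels, scores))  # 0.75
```

In the toy data the two frauds land at ranks 1 and 4, giving precisions of 1.0 and 0.5 and an average of 0.75. The metric depends only on how the frauds are ranked, so the large number of legitimate transactions cannot inflate it the way it inflates accuracy.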

Update (03/05/2021)

A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics
Credit Card Fraud Detection Dataset
kaggle.com
Updated May 15, 2025
Cite
Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
Dataset updated
May 15, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Ghanshyam Saini
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

About the Dataset:

This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

Content of the Data:

Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

The only features that have not been transformed by PCA are:

Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.

Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.
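Because Time is an offset from the first recorded transaction, it can be broken into elapsed days, hours, minutes and seconds, but not into a clock time, since the description does not say when the first transaction occurred. The helper below is a hypothetical illustration of that conversion, not part of the dataset or of any library.

```python
def elapsed_parts(seconds):
    """Split an elapsed-seconds offset into (days, hours, minutes, seconds).

    The result is relative to the first transaction in the dataset;
    it is not a time of day, because the start time is not provided.
    """
    days, remainder = divmod(int(seconds), 86_400)   # 24 * 60 * 60
    hours, remainder = divmod(remainder, 3_600)
    minutes, secs = divmod(remainder, 60)
    return days, hours, minutes, secs


# An offset a little over one day into the two-day window.
example_offset = 90_061
print(elapsed_parts(example_offset))
```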

The target variable for this classification task is:

Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

Important Note on Evaluation:

Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
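To illustrate why accuracy misleads here, the sketch below compares a classifier that labels everything as legitimate with a step-wise estimate of the area under the precision-recall curve (often called average precision). The data is a made-up toy sample of 1,000 transactions with 2 frauds, and the function is a plain-Python illustration that does not group tied scores, so it is an assumption-laden sketch rather than a reference implementation.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Rows are ranked by descending score; precision is accumulated at each
    rank where a positive (fraud) label is found. Tied scores are not
    grouped, so results with many ties depend on input order.
    """
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            area += true_positives / rank  # precision at this recall step
    return area / positives


# Toy data (hypothetical): 998 legitimate rows followed by 2 frauds.
labels = [0] * 998 + [1, 1]

# "Always legitimate" baseline: high accuracy, yet it finds no fraud.
always_legit = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(always_legit, labels)) / len(labels)

# Hypothetical model scores: one fraud ranked first, one buried mid-list.
scores = [i / 10_000 for i in range(998)] + [0.9, 0.05005]
ap = average_precision(labels, scores)

print(f"accuracy of always-legitimate baseline: {accuracy:.3f}")
print(f"average precision of the toy scores: {ap:.3f}")
```

The baseline scores close to perfect accuracy without detecting a single fraud, while the precision-recall summary drops sharply as soon as one fraud is ranked below many legitimate transactions.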

How to Use This Dataset:

Download the dataset file (likely in CSV format).

Load the data using libraries like Pandas.

Understand the class imbalance: Be aware that fraudulent transactions are rare.

Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).

Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.

Build and train binary classification models to predict the 'Class' variable.

Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
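The resampling step in the list above can be sketched with the standard library alone. The example below shows random undersampling of the majority class on made-up rows; the function name, the `ratio` parameter and the toy rows are all assumptions for illustration, and only the column names Time, Amount and Class come from the description. A common precaution is to resample the training split only and to evaluate on data that keeps the original class ratio.

```python
import random


def undersample_majority(rows, label_key="Class", ratio=1.0, seed=0):
    """Keep every fraud row plus a random subset of legitimate rows.

    ratio is the number of legitimate rows kept per fraud row.
    """
    rng = random.Random(seed)
    fraud = [row for row in rows if row[label_key] == 1]
    legit = [row for row in rows if row[label_key] == 0]
    keep = min(len(legit), int(round(ratio * len(fraud))))
    balanced = fraud + rng.sample(legit, keep)
    rng.shuffle(balanced)  # avoid all frauds sitting at the front
    return balanced


# Toy rows (hypothetical values): 10 frauds among 1,000 transactions.
rows = [
    {"Time": float(i), "Amount": 10.0 + i, "Class": 1 if i % 100 == 0 else 0}
    for i in range(1000)
]

balanced = undersample_majority(rows, ratio=2.0)
print(len(balanced), sum(row["Class"] for row in balanced))
```

Oversampling the minority class follows the same pattern in reverse: the fraud rows are drawn with replacement until the desired ratio is reached.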

Acknowledgements and Citation:

This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

When using this dataset in your research or projects, please cite the following works as appropriate:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications, 41, 10, 4915-4928, 2014, Pergamon.

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems, 29, 8, 3784-3797, 2018, IEEE.

Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi).

Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion, 41, 182-194, 2018, Elsevier.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5, 4, 285-300, 2018, Springer International Publishing.

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
Performance of STC with the best result for single-view and each type of...
plos.figshare.com
xls
Updated Aug 23, 2024
Cite
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Performance of STC with the best result for single-view and each type of MVRs on the Twitter dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t002
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0309206.t002
Dataset updated
Aug 23, 2024
Dataset provided by
PLOS ONE
Authors
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Performance of STC with the best result for single-view and each type of MVRs on the Twitter dataset.
Performance of STC with the best result for single-view and each type of...
plos.figshare.com
xls
Updated Aug 23, 2024
Cite
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Performance of STC with the best result for single-view and each type of MVRs on the StackOverflow dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t004
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0309206.t004
Dataset updated
Aug 23, 2024
Dataset provided by
PLOS ONE
Authors
Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Performance of STC with the best result for single-view and each type of MVRs on the StackOverflow dataset.
Data from: Data mining the effects of testing conditions and specimen...
tandf.figshare.com
omicsdi.org
docx
Updated May 31, 2023
Cite
Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu (2023). Data mining the effects of testing conditions and specimen properties on brain biomechanics [Dataset]. http://doi.org/10.6084/m9.figshare.8221103.v1
Explore at:
docx (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.8221103.v1
Dataset updated
May 31, 2023
Dataset provided by
Taylor & Francis
Authors
Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Traumatic brain injury is highly prevalent in the United States. However, despite its frequency and significance, there is little understanding of how the brain responds during injurious loading. A confounding problem is that because testing conditions vary between assessment methods, brain biomechanics cannot be fully understood. Data mining techniques, which are commonly used to determine patterns in large datasets, were applied to discover how changes in testing conditions affect the mechanical response of the brain. Data at various strain rates were collected from published literature and sorted into datasets based on strain rate and tension vs. compression. Self-organizing maps were used to conduct a sensitivity analysis to rank the testing condition parameters by importance. Fuzzy C-means clustering was applied to determine if there were any patterns in the data. The parameter rankings and clustering for each dataset varied, indicating that the strain rate and type of deformation influence the role of these parameters in the datasets.
Description of PCA outputs.
figshare.com
xls
Updated Jun 21, 2023
Cite
Ching-Hsue Cheng; Hsien-Hsiu Chen (2023). Description of PCA outputs. [Dataset]. http://doi.org/10.1371/journal.pone.0217591.t005
Explore at:
xls (available download formats)
Unique identifier
https://doi.org/10.1371/journal.pone.0217591.t005
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Ching-Hsue Cheng; Hsien-Hsiu Chen
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Description of PCA outputs.
The Internship Program of Instituto de Formação Turística de Macau (IFTM):...
researchdata.edu.au
Updated Jan 1, 2021
Cite
Haynes John; Whannell Robert; Cheng Wan; Wan Lok Alan Cheng; Wan Lok Alan Cheng; Robert Whannell; John Haynes; Institute for Tourism Studies Macau (2021). The Internship Program of Instituto de Formação Turística de Macau (IFTM): An Evaluation Study [Dataset]. http://doi.org/10.25952/XD5K-EK97
Explore at:
Unique identifier
https://doi.org/10.25952/XD5K-EK97
Dataset updated
Jan 1, 2021
Dataset provided by
University of New England, Australia
Macau Institute for Tourism Studies
Authors
Haynes John; Whannell Robert; Cheng Wan; Wan Lok Alan Cheng; Wan Lok Alan Cheng; Robert Whannell; John Haynes; Institute for Tourism Studies Macau
Area covered
Macao
Description
This is a metadata only record. Datasets described below were provided by the Macau Institute for Tourism Studies by Dr Fanny Vong. Direct enquiries for access to: fanny@iftm.edu.mo.
The four datasets (ELogbookReport 2011–2014) were extracted from the electronic internship logbooks of interns held by the Macau Institute for Tourism Studies. The datasets contain two main types of data: textual data and numerical data about the internship experiences of the interns and the interns' performance as assessed by their internship supervisors. The relevant textual data were content-analysed both manually and automatically using the text-mining tool Leximancer. The numerical data were analysed using appropriate statistical tools, including Principal Component Analysis (PCA), ANOVA, t-tests, the Mann-Whitney test, and the Kruskal-Wallis test.
MOESM12 of Feature optimization in high dimensional chemical space:...
springernature.figshare.com
figshare.com
application/x-rar
Updated May 30, 2023
Cite
Jinuraj K. R.; Rakhila M.; Dhanalakshmi M.; Sajeev R.; Akshata Gad; Jayan K.; Muhammed Iqbal P.; Andrew Manuel; Abdul Jaleel U. C. (2023). MOESM12 of Feature optimization in high dimensional chemical space: statistical and data mining solutions [Dataset]. http://doi.org/10.6084/m9.figshare.6813848.v1
Explore at:
application/x-rar (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.6813848.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jinuraj K. R.; Rakhila M.; Dhanalakshmi M.; Sajeev R.; Akshata Gad; Jayan K.; Muhammed Iqbal P.; Andrew Manuel; Abdul Jaleel U. C.
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Additional file 12. The active molecules from AID 2559 and AID 2561 were used as the test set. Both are high-throughput-screened confirmatory bioassay datasets. AID 2559 consists of 58 active and 67 inactive molecules, whereas AID 2561 has 37 active and 148 inactive molecules. The actives from both were combined to form the test set, provided as an ARFF file.
Data from: Semi-supervised Multi-View Learning for Gene Network...
figshare.com
zip
Updated Jan 20, 2016
Cite
Gianvito Pio (2016). Semi-supervised Multi-View Learning for Gene Network Reconstruction [Dataset]. http://doi.org/10.6084/m9.figshare.1604827.v8
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.6084/m9.figshare.1604827.v8
Dataset updated
Jan 20, 2016
Dataset provided by
figshare
Authors
Gianvito Pio
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Semi-supervised Multi-View Learning for Gene Network Reconstruction

SynTReN Data:
E.coli and Yeast sub-networks, generated expression data and gold standards (Input_Datasets.zip)
Interactions predicted by base methods (Base_Method_Predictions.zip)
Interactions predicted by our approach - Clustering performed with PCA (Predictions.zip)
Interactions predicted by our approach - Clustering performed with K-means (PredictionsK.zip)

Dream5 Data:
Expression data and gold standards provided by Marbach et al. 2012 [1]
Interactions predicted by the considered DREAM5 base methods provided by Marbach et al. 2012 [1]
Interactions predicted by our approach - Clustering performed with PCA (Predictions_D5.zip)
Interactions predicted by our approach - Clustering performed with K-means (PredictionsK_D5.zip)

[1] Marbach, D., Costello, J. C., Kuffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G., Wisdom of crowds for robust gene network inference, Nature Methods, 9, 796-804, 2012.
