35 datasets found
  1. Raw data of the ships of priority II in 2017

    • figshare.com
    xlsx
    Updated Jul 28, 2020
    Cite
    Jiaqi Mu (2020). Raw data of the ships of priority II in 2017 [Dataset]. http://doi.org/10.4121/uuid:92a6ce98-5001-4768-bec1-49881a367d36
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jul 28, 2020
    Dataset provided by
    4TU.ResearchData
    Authors
    Jiaqi Mu
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    See the article "Targeting model based on principal component analysis and extreme learning machine" for the meaning of the data.

  2. Data_Sheet_1_Bioinformatics Analyses Determined the Distinct CNS and...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Seiichi Omura; Fumitaka Sato; Nicholas E. Martinez; Ah-Mee Park; Mitsugu Fujita; Nikki J. Kennett; Urška Cvek; Alireza Minagar; J. Steven Alexander; Ikuo Tsunoda (2023). Data_Sheet_1_Bioinformatics Analyses Determined the Distinct CNS and Peripheral Surrogate Biomarker Candidates Between Two Mouse Models for Progressive Multiple Sclerosis.docx [Dataset]. http://doi.org/10.3389/fimmu.2019.00516.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Seiichi Omura; Fumitaka Sato; Nicholas E. Martinez; Ah-Mee Park; Mitsugu Fujita; Nikki J. Kennett; Urška Cvek; Alireza Minagar; J. Steven Alexander; Ikuo Tsunoda
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Previously, we have established two distinct progressive multiple sclerosis (MS) models by induction of experimental autoimmune encephalomyelitis (EAE) with myelin oligodendrocyte glycoprotein (MOG) in two mouse strains. A.SW mice develop ataxia with antibody deposition, but no T cell infiltration, in the central nervous system (CNS), while SJL/J mice develop paralysis with CNS T cell infiltration. In this study, we determined biomarkers contributing to the homogeneity and heterogeneity of two models. Using the CNS and spleen microarray transcriptome and cytokine data, we conducted computational analyses. We identified up-regulation of immune-related genes, including immunoglobulins, in the CNS of both models. Pro-inflammatory cytokines, interferon (IFN)-γ and interleukin (IL)-17, were associated with the disease progression in SJL/J mice, while the expression of both cytokines was detected only at the EAE onset in A.SW mice. Principal component analysis (PCA) of CNS transcriptome data demonstrated that down-regulation of prolactin may reflect disease progression. Pattern matching analysis of spleen transcriptome with CNS PCA identified 333 splenic surrogate markers, including Stfa2l1, which reflected the changes in the CNS. Among them, we found that two genes (PER1/MIR6883 and FKBP5) and one gene (SLC16A1/MCT1) were also significantly up-regulated and down-regulated, respectively, in human MS peripheral blood, using data mining.

  3. Data from: Recent Developments in Damage Identification of Structures Using...

    • scielo.figshare.com
    jpeg
    Updated May 31, 2023
    Cite
    Meisam Gordan; Hashim Abdul Razak; Zubaidah Ismail; Khaled Ghaedi (2023). Recent Developments in Damage Identification of Structures Using Data Mining [Dataset]. http://doi.org/10.6084/m9.figshare.5931043.v1
    Explore at:
    Available download formats: jpeg
    Dataset updated
    May 31, 2023
    Dataset provided by
    SciELO journals
    Authors
    Meisam Gordan; Hashim Abdul Razak; Zubaidah Ismail; Khaled Ghaedi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract: Civil structures are usually prone to damage during their service life, which causes them to lose their serviceability and safety. Thus, damage assessment can guarantee the integrity of structures. As a result, a structural damage detection approach with two main components, a set of accelerometers to record the response data and a data mining (DM) procedure, is widely used to extract information on the structural health condition. In recent decades, DM has provided numerous solutions to structural health monitoring (SHM) problems as an all-inclusive technique due to its powerful computational ability. This paper presents the first attempt to illustrate the applications of data mining techniques (DMTs) in SHM through an intensive review of articles dealing with the use of DMTs for classification-, prediction- and optimization-based data mining methods. According to this categorization, applications of DMTs in the SHM research area are classified, and it is concluded that applications of DMTs in the SHM domain have increasingly been implemented in the last decade and that the most popular techniques in the area were artificial neural network (ANN), principal component analysis (PCA) and genetic algorithm (GA), respectively.

  4. Plantain harvest events for Colombia

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Fondo Nacional de Fomento Hortifruticola de Colombia; International Center for Tropical Agriculture (2023). Plantain harvest events for Colombia [Dataset]. http://doi.org/10.7910/DVN/GCMHMZ
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Fondo Nacional de Fomento Hortifruticola de Colombia; International Center for Tropical Agriculture
    Time period covered
    Jan 1, 2011 - Dec 1, 2012
    Area covered
    Colombia
    Description

    The standardized database registered 1322 cropping events

  5. Statistic information of the STC datasets.

    • plos.figshare.com
    xls
    Updated Aug 23, 2024
    Cite
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Statistic information of the STC datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Clustering texts together is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, clustering short texts (STC) is complex because they typically contain sparse, ambiguous, noisy, and incomplete information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC considers only a single-view representation for clustering. The single-view representation is inefficient for representing text due to its inability to represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR) (by finding the best combination of different single-view representations) to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). Based on that, we can conclude that MVR improves the performance of STC; however, the design of the MVR requires selective single-view representations.
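    The description does not spell out how the "fixed length concatenation via PCA" is computed. One plausible reading, sketched here under that assumption, is to join the per-document vectors of each view and then project the result onto a fixed number of principal components. The function names, the toy vectors, and the power-iteration solver are all illustrative and not taken from the paper.

```python
import math
import random


def concat_views(*views):
    """Join the per-document vectors of several views into one long vector each."""
    return [[x for vec in row for x in vec] for row in zip(*views)]


def center(rows):
    """Subtract the column means so every feature has zero mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_components(cov, k, iters=300, seed=0):
    """Approximate the k leading eigenvectors by power iteration with deflation."""
    rng = random.Random(seed)
    cov = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left in this direction
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
        components.append(v)
        # Deflate: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return components


def reduce_views(views, k):
    """Concatenate the views, then project onto k principal components."""
    centered = center(concat_views(*views))
    components = top_components(covariance(centered), k)
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]


# Hypothetical toy vectors for four documents in two views.
view_a = [[1.0, 0.0, 2.0], [0.5, 1.0, 1.5], [3.0, 0.2, 0.1], [2.5, 0.3, 0.4]]
view_b = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.7, 0.2]]
reduced = reduce_views([view_a, view_b], k=2)
```

    Whatever the widths of the individual views, every document ends up with a vector of length k, which is the sense in which the concatenation becomes fixed length in this sketch.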

  6. Database of elemental concentration of pyrite in lapis lazuli rocks of...

    • explore.openaire.eu
    • zenodo.org
    Updated Oct 8, 2024
    Cite
    Alessandro Lo Giudice; Alessandro Re; Laura Guidorzi; Marta Magalini (2024). Database of elemental concentration of pyrite in lapis lazuli rocks of different provenances obtained by micro-PIXE [Dataset]. http://doi.org/10.5281/zenodo.13902915
    Dataset updated
    Oct 8, 2024
    Authors
    Alessandro Lo Giudice; Alessandro Re; Laura Guidorzi; Marta Magalini
    Description

    Contents of the database: this database contains the results of the micro-Particle Induced X-ray Emission (µ-PIXE) analysis of calcite crystals in lapis lazuli rocks. The samples are reference rocks of known origin from different lapis lazuli mining areas (Afghanistan, Tajikistan, Siberia, Myanmar), procured through on-site geological expeditions, museum collections or international markets. The source of each rock is indicated in the data sheet. The data were collected during a long-term study started in 2007 by the Solid State Physics Group at the Department of Physics of the University of Turin (Italy) with the aim of creating a non-invasive analytical protocol to investigate the provenance of lapis lazuli material used in artworks throughout history. The analysis was carried out over several years and in different facilities using accelerated proton microbeams; this information and some basic details about the measurement conditions (e.g. beam energy) are reported in the data sheet. The software used for data processing is GUPIXWIN or derivatives.

    The full description of the analysis methodology and the discussion of the results in terms of the development of a lapis lazuli provenance protocol can be found in:

    (Magalini et al., 2024) Magalini, M., Guidorzi, L., Re, A., Marabotto, M., Borghi, A., Gallo, P., Visale, M., La Torre, L., Campostrini, M., Lemasson, Q., Pichon, L., Moignard, B., Pacheco, C., Couture, P., Palitsin, V., Lo Giudice, A. (Unpublished) Study of compositional and luminescence properties of calcite in lapis lazuli for provenance investigations of archaeological findings. Paper submitted to Eur. Phys. J. Plus.

    And in previous publications related to the analysis of diopside and pyrite:

    (Lo Giudice et al., 2017) Lo Giudice, A., Angelici, D., Re, A., Gariani, G., Borghi, A., Calusi, S., Giuntini, L., Massi, M., Castelli, L., Taccetti, F., Calligaro, T., Pacheco, C., Lemasson, Q., Pichon, L., Moignard, B., Pratesi, G., Guidotti, M.C., 2017. Protocol for lapis lazuli provenance determination: evidence for an Afghan origin of the stones used for ancient carved artefacts kept at the Egyptian Museum of Florence (Italy). Archaeol. Anthropol. Sci. 9, 637–651. https://doi.org/10.1007/s12520-016-0430-0

    (Guidorzi et al., 2023) Guidorzi, L., Re, A., Magalini, M., Angelici, D., Borghi, A., Vaggelli, G., Fantino, F., Rigato, V., La Torre, L., Lemasson, Q., Pacheco, C., Pichon, L., Moignard, B., Lo Giudice, A., 2023. Micro-PIXE and micro-IBIL characterisation of lapis lazuli samples from Myanmar mines and implications for provenance study. The European Physical Journal Plus, 138(2), 175. https://doi.org/10.1140/epjp/s13360-023-03768-x

    The characterisation of reference rocks is an ongoing activity of the group, and the database will be updated in further versions as new data are collected.

    Contents of the PCA dataset: the PCA dataset reports the part of the database data that can be used to perform Principal Component Analysis to differentiate among four lapis lazuli provenances: Afghanistan (AFG), Tajikistan (TAJ), Myanmar (MYA) and Siberia (SIB). It contains only the elemental concentration values of those elements in calcite that are relevant for provenance discrimination (Mg, Mn, Sr, Y). If the element was determined to be below the limit of detection, the LOD/2 value is reported in the dataset. The details of the application of PCA to lapis lazuli provenance investigation are reported in the following publication, where the approach was presented for the first time:

    (Guidorzi et al., 2023b) Guidorzi, L., Re, A., Magalini, M., & Lo Giudice, A., 2023. Application of principal component analysis to µ-PIXE data in lapis lazuli provenance studies. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 540, 45-50. https://doi.org/10.1016/j.nimb.2023.04.007
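    The LOD/2 rule described for the PCA dataset can be sketched as a small preprocessing step. The concentrations and detection limits below are invented for illustration, and the standardization step is an assumption about typical PCA preparation, not something the dataset description states.

```python
from statistics import mean, stdev

ELEMENTS = ("Mg", "Mn", "Sr", "Y")  # elements named in the description


def substitute_lod(value, lod):
    """Return LOD/2 for a missing or below-detection value, else the value."""
    if value is None or value < lod:
        return lod / 2
    return value


def standardize(column):
    """Scale a column to zero mean and unit sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [0.0 if s == 0 else (x - m) / s for x in column]


# Hypothetical concentrations and detection limits, for illustration only.
measured = {"Mg": [1200.0, None, 950.0], "Mn": [40.0, 55.0, 3.0]}
lod = {"Mg": 100.0, "Mn": 5.0}

cleaned = {el: [substitute_lod(v, lod[el]) for v in vals]
           for el, vals in measured.items()}
scaled = {el: standardize(vals) for el, vals in cleaned.items()}
```

    Substituting LOD/2 keeps every sample in the analysis instead of dropping rows with undetected elements, at the cost of introducing an artificial value for those cells.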

  7. MVR performance t-test result.

    • figshare.com
    xls
    Updated Aug 23, 2024
    Cite
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). MVR performance t-test result. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Clustering texts together is an essential task in data mining and information retrieval, whose aim is to group unlabeled texts into meaningful clusters that facilitate extracting and understanding useful information from large volumes of textual data. However, clustering short texts (STC) is complex because they typically contain sparse, ambiguous, noisy, and incomplete information. One of the challenges for STC is finding a proper representation for short text documents to generate cohesive clusters. Typically, however, STC considers only a single-view representation for clustering. The single-view representation is inefficient for representing text due to its inability to represent different aspects of the target text. In this paper, we propose the most suitable multi-view representation (MVR) (by finding the best combination of different single-view representations) to enhance STC. Our work explores different types of MVR based on different sets of single-view representation combinations. The single-view representations are combined by a fixed-length concatenation via the Principal Component Analysis (PCA) technique. Three standard datasets (Twitter, Google News, and StackOverflow) are used to evaluate the performance of various sets of MVRs on STC. Based on experimental results, the most effective combination of single-view representations for STC was the 5-view MVR (a combination of BERT, GPT, TF-IDF, FastText, and GloVe). Based on that, we can conclude that MVR improves the performance of STC; however, the design of the MVR requires selective single-view representations.

  8. Credit card fraud detection Date 25th of June 2015

    • kaggle.com
    Updated Oct 29, 2023
    Cite
    Zohair ahmed (2023). Credit card fraud detection Date 25th of June 2015 [Dataset]. https://www.kaggle.com/datasets/qnqfbqfqo/credit-card-fraud-detection-date-25th-of-june-2015
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Kaggle
    Authors
    Zohair ahmed
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
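    The imbalance figure can be re-derived from the two counts quoted in the description. This is a minimal arithmetic sketch; the function and variable names are illustrative.

```python
# Counts quoted in the dataset description.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def positive_rate(positives: int, total: int) -> float:
    """Return the share of positive examples as a fraction in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    return positives / total


fraud_rate = positive_rate(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
legit_per_fraud = (TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS) / FRAUD_TRANSACTIONS

print(f"fraud share: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

    The computed share is a little above 0.1727%, so the quoted 0.172% appears to be truncated rather than rounded.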

    It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.

  9. Additional file 4b: of Empirical advances with text mining of electronic...

    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    T. Delespierre; P. Denormandie; A. Bar-Hen; L. Josseran (2023). Additional file 4b: of Empirical advances with text mining of electronic health records [Dataset]. http://doi.org/10.6084/m9.figshare.c.3860563_D4.v1
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare
    Authors
    T. Delespierre; P. Denormandie; A. Bar-Hen; L. Josseran
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Empirical Advances with, Annex 4, The 1015 residents’ health table, content: the 1015 de-identified residents’ medical histories and pathologies on September 30th 2013, as well as their falling history and 10-anonymized NHs (XLS 256 kb)

  10. Integrated Visualization Environment for Science Mission Modeling, Phase II

    • cloud.csiss.gmu.edu
    html
    Updated Jan 29, 2020
    Cite
    United States (2020). Integrated Visualization Environment for Science Mission Modeling, Phase II [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/integrated-visualization-environment-for-science-mission-modeling-phase-ii
    Explore at:
    Available download formats: html
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    United States
    Description

    NASA is emphasizing the use of larger, more integrated models in conjunction with systems engineering tools and decision support systems. These tools place a corresponding stress on legacy engineering visualization systems which now are required to handle larger data sets, provide more intuition to the user, integrate well with many other tools, and help the user with his/her ultimate goal: improving the design of complex systems.
    Phoenix Integration proposes to complete the prototype visualization environment created during Phase I to the point where it is a commercially viable product. New features, refinements, and integration with other tools will be accomplished in Phase II. In particular, the work will involve major improvements to whitespace exploration algorithms, techniques that enable users to unconstrain or modify the underlying engineering model in an effort to obtain results in previously unattainable areas. Work will also include more data mining algorithms (e.g. Principal Component Analysis), new graph types (e.g. spider plots), export formats to 3-D tools (e.g. Tecplot), integration with MBSE/SysML tools, integration with web-based decision support environments, and incorporation of probabilistic analysis. A rich integration with ModelCenter, the company's engineering integration and trade study environment, is planned, although a standalone capability will also be offered. The visualizer's architecture will be based on OpenGL and will use the GPU to parallelize rendering computations. Design will focus on usability and responsiveness, with the goal of providing quick insight into complex data. The tool will be user-tested through early adopters to ensure relevance and to guide development.

  11. Data from: Financial Fraud Detection Dataset

    • opendatabay.com
    Updated Jun 25, 2025
    Cite
    Review Nexus (2025). Financial Fraud Detection Dataset [Dataset]. https://www.opendatabay.com/data/financial/d226c56e-5929-4059-a30d-13632e07b344
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Review Nexus
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Area covered
    Fraud Detection & Risk Management
    Description

    This dataset is designed to support research and model development in the area of fraud detection. It consists of real-world credit card transactions made by European cardholders over a two-day period in September 2013. Out of 284,807 transactions, 492 are labeled as fraudulent (positive class), making this a highly imbalanced classification problem.

    Performance Note:

    Due to the extreme class imbalance, standard accuracy metrics are not informative. We recommend using the Area Under the Precision-Recall Curve (AUPRC) or F1-score for model evaluation.
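    To see why accuracy is uninformative here, consider a baseline that labels every transaction as legitimate. The sketch below uses the counts from the description; the helper names are illustrative.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "Always legitimate" baseline: no transaction is ever flagged,
# so all 492 frauds become false negatives.
tp, fp = 0, 0
fn = 492
tn = 284_807 - 492

baseline_accuracy = accuracy(tp, fp, fn, tn)
baseline_f1 = f1_score(tp, fp, fn)
print(f"accuracy: {baseline_accuracy:.4f}, F1: {baseline_f1:.1f}")
```

    The baseline is right more than 99.8% of the time while detecting no fraud at all, which the F1-score of zero makes visible.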

    Features:

    • Time Series Data: Each row represents a transaction, with the Time feature indicating the number of seconds elapsed since the first transaction.
    • Dimensionality Reduction Applied: Features V1 through V28 are anonymized principal components derived from a PCA transformation due to confidentiality constraints.
    • Raw Transaction Amount: The Amount field reflects the transaction value, useful for cost-sensitive modeling.
    • Binary Classification Target: The Class label is 1 for fraud and 0 for legitimate transactions.
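    Because Time is an offset in seconds from the first transaction, it can be split into a day index and an hour offset within the two-day window. This is a sketch under the assumption that only relative time matters; the description does not give the wall-clock time of the first transaction, so the hour here is not an hour of the day.

```python
from typing import NamedTuple

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


class ElapsedTime(NamedTuple):
    day: int   # 0 for the first 24 hours, 1 for the next, and so on
    hour: int  # whole hours elapsed within that 24-hour block


def split_elapsed(seconds: float) -> ElapsedTime:
    """Split a Time value (seconds since the first transaction)."""
    if seconds < 0:
        raise ValueError("Time cannot be negative")
    whole = int(seconds)
    return ElapsedTime(
        day=whole // SECONDS_PER_DAY,
        hour=(whole % SECONDS_PER_DAY) // SECONDS_PER_HOUR,
    )


print(split_elapsed(90_000))
```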

    Usage:

    • Machine learning model training for fraud detection.
    • Evaluation of anomaly detection and imbalanced classification methods.
    • Development of cost-sensitive learning approaches using the Amount variable.

    Data Summary:

    • Total Records: 284,807
    • Fraud Cases: 492
    • Imbalance Ratio: Fraudulent transactions account for just 0.172% of the dataset.
    • Columns: 31 total (28 PCA features, plus Time, Amount, and Class)
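    The column count in the summary can be checked against a file header before any modeling. The names Time, V1 to V28, Amount, and Class come from the description; their order in the list below and the helper name are assumptions.

```python
PCA_FEATURES = [f"V{i}" for i in range(1, 29)]  # V1 ... V28
EXPECTED_COLUMNS = ["Time", *PCA_FEATURES, "Amount", "Class"]


def validate_header(header):
    """Return (missing, unexpected) column names for a given header row."""
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    unexpected = [c for c in header if c not in EXPECTED_COLUMNS]
    return missing, unexpected


print(len(EXPECTED_COLUMNS))
print(validate_header(["Time", "V1", "Amount", "Label"]))
```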

    License:

    The dataset is provided under the CC0 (Public Domain) license, allowing users to freely use, modify, and distribute the data without any restrictions.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

  12. Credit Card Fraud Detection

    • kaggle.com
    • test.researchdata.tuwien.ac.at
    zip
    Updated Mar 23, 2018
    Cite
    Machine Learning Group - ULB (2018). Credit Card Fraud Detection [Dataset]. https://www.kaggle.com/mlg-ulb/creditcardfraud
    Explore at:
    Available download formats: zip (69155672 bytes)
    Dataset updated
    Mar 23, 2018
    Dataset authored and provided by
    Machine Learning Group - ULB
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

    Content

    The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

    It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes value 1 in case of fraud and 0 otherwise.
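    Example-dependent cost-sensitive learning weights each error by a per-transaction cost instead of counting all errors equally. The cost rule below is a hypothetical illustration and not one defined by the dataset authors: a flagged transaction costs a fixed review fee, and a missed fraud costs its Amount.

```python
def example_cost(is_fraud: bool, flagged: bool, amount: float,
                 review_cost: float = 1.0) -> float:
    """Cost of one decision under the hypothetical rule described above."""
    if flagged:
        return review_cost  # every alert is reviewed, fraud or not
    return amount if is_fraud else 0.0  # a missed fraud loses the full Amount


def total_cost(labels, flags, amounts, review_cost: float = 1.0) -> float:
    """Sum the per-transaction costs over a batch of decisions."""
    return sum(
        example_cost(bool(y), bool(f), a, review_cost)
        for y, f, a in zip(labels, flags, amounts)
    )


# Hypothetical batch: Class labels, model flags, and Amount values.
labels = [1, 0, 1, 0]
flags = [1, 0, 0, 1]
amounts = [100.0, 20.0, 250.0, 5.0]
print(total_cost(labels, flags, amounts, review_cost=2.0))
```

    Under this rule the single missed fraud of 250 dominates the total, which is the behavior a count-based error metric would hide.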

    Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
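    The area under the precision-recall curve is often summarized as average precision: the mean of the precision values at each rank where a true fraud appears. The following is a minimal pure-Python sketch with invented scores. It breaks tied scores by input order, which library implementations such as scikit-learn's average_precision_score handle differently.

```python
def average_precision(labels, scores):
    """Average precision of a ranking; labels are 0/1, higher score = more suspicious."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / total_positives


# Hypothetical labels and model scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(labels, scores))
```

    Unlike accuracy, this value only rises when frauds are ranked above legitimate transactions, so a model that never flags fraud cannot score well.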

    Update (03/05/2021)

    A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.

    Acknowledgements

    The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project

    Please cite the following works:

    Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

    Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

    Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi)

    Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

    Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

    Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

    Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019

    Yann-Aël Le Borgne, Gianluca Bontempi Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

    Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

  13. Credit Card Fraud Detection Dataset

    • kaggle.com
    Updated May 15, 2025
    Cite
    Ghanshyam Saini (2025). Credit Card Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/ghnshymsaini/credit-card-fraud-detection-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    May 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ghanshyam Saini
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Credit Card Fraud Detection Dataset (European Cardholders, September 2013)

    As a data contributor, I'm sharing this crucial dataset focused on the detection of fraudulent credit card transactions. Recognizing these illicit activities is paramount for protecting customers and the integrity of financial systems.

    About the Dataset:

    This dataset encompasses credit card transactions made by European cardholders during a two-day period in September 2013. It presents a real-world scenario with a significant class imbalance, where fraudulent transactions are considerably less frequent than legitimate ones. Out of a total of 284,807 transactions, only 492 are instances of fraud, representing a mere 0.172% of the entire dataset.

    Content of the Data:

    Due to confidentiality concerns, the majority of the input features in this dataset have undergone a Principal Component Analysis (PCA) transformation. This means the original meaning and context of features V1, V2, ..., V28 are not directly provided. However, these principal components capture the variance in the underlying transaction data.

    The only features that have not been transformed by PCA are:

    • Time: Numerical. Represents the number of seconds elapsed between each transaction and the very first transaction recorded in the dataset.
    • Amount: Numerical. The transaction amount in Euros (€). This feature could be valuable for cost-sensitive learning approaches.

    The target variable for this classification task is:

    • Class: Integer. Takes the value 1 in the case of a fraudulent transaction and 0 otherwise.

    Important Note on Evaluation:

    Given the substantial class imbalance (far more legitimate transactions than fraudulent ones), traditional accuracy metrics based on the confusion matrix can be misleading. It is strongly recommended to evaluate models using the Area Under the Precision-Recall Curve (AUPRC), as this metric is more sensitive to the performance on the minority class (fraudulent transactions).
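    The point about misleading accuracy can be made concrete with a few lines of arithmetic. The sketch below uses only the two counts quoted in the description; the `average_precision` helper is a hypothetical, simplified step-wise estimate of the area under the precision-recall curve (it assumes untied scores), and the toy labels and scores are made up for illustration rather than taken from the dataset.

```python
# Counts quoted in the dataset description.
N_TOTAL = 284_807
N_FRAUD = 492

fraud_rate = N_FRAUD / N_TOTAL
# Accuracy of a useless model that labels every transaction as legitimate.
baseline_accuracy = (N_TOTAL - N_FRAUD) / N_TOTAL


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Hypothetical helper for illustration; assumes scores contain no ties.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank transactions from most to least suspicious.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at the rank where recall increases
    return ap / total_pos


# Toy example (made-up labels and scores, not taken from the dataset).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_ap = average_precision(toy_labels, toy_scores)

print(f"fraud rate:         {fraud_rate:.4%}")
print(f"all-legit accuracy: {baseline_accuracy:.4%}")
print(f"toy AUPRC estimate: {toy_ap:.3f}")
```

    A model that never flags fraud scores above 99.8% accuracy on these counts while detecting nothing, which is why a ranking-based metric such as AUPRC is the more informative choice here.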

    How to Use This Dataset:

    1. Download the dataset file (likely in CSV format).
    2. Load the data using libraries like Pandas.
    3. Understand the class imbalance: Be aware that fraudulent transactions are rare.
    4. Explore the features: Analyze the distributions of 'Time', 'Amount', and the PCA-transformed features (V1-V28).
    5. Address the class imbalance: Consider using techniques like oversampling the minority class, undersampling the majority class, or using specialized algorithms designed for imbalanced datasets.
    6. Build and train binary classification models to predict the 'Class' variable.
    7. Evaluate your models using AUPRC to get a meaningful assessment of performance in detecting fraud.
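    Steps 2, 3 and 5 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the contributor's own code: the in-memory CSV stands in for the real file (whose name the description does not give), its values are made up, only one of the V1-V28 columns is shown, and the `load_rows` and `undersample` helpers are hypothetical. The standard-library `csv` module is used in place of Pandas so that the sketch runs without extra dependencies.

```python
import csv
import io
import random
from collections import Counter

# Tiny stand-in for the real file; rows are made up for illustration.
SAMPLE_CSV = """Time,V1,Amount,Class
0,-1.36,149.62,0
1,1.19,2.69,0
2,-1.35,378.66,0
4,-0.97,123.50,0
7,-2.31,0.00,1
9,1.23,69.99,0
10,-3.04,529.00,1
12,0.38,9.99,0
"""


def load_rows(handle):
    """Read transactions and convert the Class label to an integer."""
    return [{**row, "Class": int(row["Class"])} for row in csv.DictReader(handle)]


def undersample(rows, ratio=1, seed=0):
    """Keep every fraud row and at most `ratio` legitimate rows per fraud row."""
    fraud = [r for r in rows if r["Class"] == 1]
    legit = [r for r in rows if r["Class"] == 0]
    keep = min(len(legit), ratio * len(fraud))
    return fraud + random.Random(seed).sample(legit, keep)


rows = load_rows(io.StringIO(SAMPLE_CSV))
counts = Counter(r["Class"] for r in rows)            # step 3: inspect the imbalance
balanced = undersample(rows)                           # step 5: undersample the majority class
balanced_counts = Counter(r["Class"] for r in balanced)

print("original class counts:", dict(counts))
print("after undersampling:  ", dict(balanced_counts))
```

    Undersampling should be applied to the training split only; the evaluation split should keep the original class ratio so that the reported AUPRC reflects realistic conditions.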

    Acknowledgements and Citation:

    This dataset has been collected and analyzed through a research collaboration between Worldline and the Machine Learning Group (MLG) of ULB (Université Libre de Bruxelles).

    When using this dataset in your research or projects, please cite the following works as appropriate:

    • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.
    • Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon.
    • Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE.
    • Andrea Dal Pozzolo. Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi).
    • Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, Gianluca Bontempi. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing.
    • Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019.
    • Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi *Combining Unsupervised and Supervised...
  14. Performance of STC with the best result for single-view and each type of...

    • plos.figshare.com
    xls
    Updated Aug 23, 2024
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Performance of STC with the best result for single-view and each type of MVRs on the Twitter dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of STC with the best result for single-view and each type of MVRs on the Twitter dataset.

  15. Performance of STC with the best result for single-view and each type of...

    • plos.figshare.com
    xls
    Updated Aug 23, 2024
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani (2024). Performance of STC with the best result for single-view and each type of MVRs on the StackOverflow dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0309206.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of STC with the best result for single-view and each type of MVRs on the StackOverflow dataset.

  16. Data from: Data mining the effects of testing conditions and specimen...

    • tandf.figshare.com
    • omicsdi.org
    docx
    Updated May 31, 2023
    Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu (2023). Data mining the effects of testing conditions and specimen properties on brain biomechanics [Dataset]. http://doi.org/10.6084/m9.figshare.8221103.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traumatic brain injury is highly prevalent in the United States. However, despite its frequency and significance, there is little understanding of how the brain responds during injurious loading. A confounding problem is that because testing conditions vary between assessment methods, brain biomechanics cannot be fully understood. Data mining techniques, which are commonly used to determine patterns in large datasets, were applied to discover how changes in testing conditions affect the mechanical response of the brain. Data at various strain rates were collected from published literature and sorted into datasets based on strain rate and tension vs. compression. Self-organizing maps were used to conduct a sensitivity analysis to rank the testing condition parameters by importance. Fuzzy C-means clustering was applied to determine if there were any patterns in the data. The parameter rankings and clustering for each dataset varied, indicating that the strain rate and type of deformation influence the role of these parameters in the datasets.

  17. Description of PCA outputs.

    • figshare.com
    xls
    Updated Jun 21, 2023
    Ching-Hsue Cheng; Hsien-Hsiu Chen (2023). Description of PCA outputs. [Dataset]. http://doi.org/10.1371/journal.pone.0217591.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Ching-Hsue Cheng; Hsien-Hsiu Chen
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of PCA outputs.

  18. The Internship Program of Instituto de Formação Turística de Macau (IFTM):...

    • researchdata.edu.au
    Updated Jan 1, 2021
    Haynes John; Whannell Robert; Cheng Wan; Wan Lok Alan Cheng; Wan Lok Alan Cheng; Robert Whannell; John Haynes; Institute for Tourism Studies Macau (2021). The Internship Program of Instituto de Formação Turística de Macau (IFTM): An Evaluation Study [Dataset]. http://doi.org/10.25952/XD5K-EK97
    Explore at:
    Dataset updated
    Jan 1, 2021
    Dataset provided by
    University of New England, Australia
    Macau Institute for Tourism Studies
    Authors
    Haynes John; Whannell Robert; Cheng Wan; Wan Lok Alan Cheng; Wan Lok Alan Cheng; Robert Whannell; John Haynes; Institute for Tourism Studies Macau
    Area covered
    Macao
    Description

    This is a metadata-only record. The datasets described below were provided by the Macau Institute for Tourism Studies through Dr Fanny Vong. Direct enquiries about access to: fanny@iftm.edu.mo.
    The four datasets (ELogbookReport 2011–2014) were extracted from the electronic internship logbooks of interns held by the Macau Institute for Tourism Studies. The datasets contain two main types of data, textual and numerical, covering the interns' internship experiences and their performance as assessed by their internship supervisors. The relevant textual data were content-analysed both manually and automatically using the Leximancer text-mining tool. The numerical data were analysed using appropriate statistical tools, including Principal Component Analysis (PCA), ANOVA, t-tests, and the Mann-Whitney and Kruskal-Wallis tests.

  19. MOESM12 of Feature optimization in high dimensional chemical space:...

    • springernature.figshare.com
    • figshare.com
    application/x-rar
    Updated May 30, 2023
    Jinuraj K. R.; Rakhila M.; Dhanalakshmi M.; Sajeev R.; Akshata Gad; Jayan K.; Muhammed Iqbal P.; Andrew Manuel; Abdul Jaleel U. C. (2023). MOESM12 of Feature optimization in high dimensional chemical space: statistical and data mining solutions [Dataset]. http://doi.org/10.6084/m9.figshare.6813848.v1
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jinuraj K. R.; Rakhila M.; Dhanalakshmi M.; Sajeev R.; Akshata Gad; Jayan K.; Muhammed Iqbal P.; Andrew Manuel; Abdul Jaleel U. C.
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 12. The active molecules from AID 2559 and AID 2561 were used as the test set. Both are high-throughput-screened confirmatory bioassay datasets. AID 2559 consists of 58 active and 67 inactive molecules, whereas AID 2561 has 37 active and 148 inactive molecules. The actives from both were combined to form the test set, provided as an ARFF file.

  20. Data from: Semi-supervised Multi-View Learning for Gene Network...

    • figshare.com
    zip
    Updated Jan 20, 2016
    Gianvito Pio (2016). Semi-supervised Multi-View Learning for Gene Network Reconstruction [Dataset]. http://doi.org/10.6084/m9.figshare.1604827.v8
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 20, 2016
    Dataset provided by
    figshare
    Authors
    Gianvito Pio
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Semi-supervised Multi-View Learning for Gene Network Reconstruction

    SynTReN Data:

    • E.coli and Yeast sub-networks, generated expression data and gold standards (Input_Datasets.zip)
    • Interactions predicted by base methods (Base_Method_Predictions.zip)
    • Interactions predicted by our approach - Clustering performed with PCA (Predictions.zip)
    • Interactions predicted by our approach - Clustering performed with K-means (PredictionsK.zip)

    Dream5 Data:

    • Expression data and gold standards provided by Marbach et al. 2012 [1]
    • Interactions predicted by the considered DREAM5 base methods provided by Marbach et al. 2012 [1]
    • Interactions predicted by our approach - Clustering performed with PCA (Predictions_D5.zip)
    • Interactions predicted by our approach - Clustering performed with K-means (PredictionsK_D5.zip)

    [1] Marbach, D., Costello, J. C., Kuffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., and Stolovitzky, G., Wisdom of crowds for robust gene network inference, Nature Methods, 9, 796-804, 2012.
