100+ datasets found
  1. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Available download formats: zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. The given dataset contains a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most useful when you want to discover associations between different objects in a set, such as frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between those items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

    • Rule: bought computer mouse => bought mouse mat
    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
    • lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
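    As a quick check of the example (using the standard definitions, where confidence conditions on the antecedent of the rule), the arithmetic can be reproduced directly:

```python
# Recomputing the example: 100 customers, 10 bought a computer mouse,
# 9 bought a mouse mat, 8 bought both.
n_customers, n_mouse, n_mat, n_both = 100, 10, 9, 8

support = n_both / n_customers              # P(mouse & mat) = 0.08
confidence = n_both / n_mouse               # P(mat | mouse) = 0.80
lift = confidence / (n_mat / n_customers)   # 0.80 / 0.09 ≈ 8.89
```

    A lift well above 1 indicates that the two items co-occur far more often than they would if purchases were independent.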

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule
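    The analysis itself is done in R with the arules package; as a language-neutral illustration of the core step (finding frequent itemsets above a minimum support), here is a minimal Apriori-style sketch in Python. The basket data is invented for illustration and is not drawn from the dataset:

```python
from itertools import combinations

def apriori(transactions, min_support=0.1):
    """Return frequent itemsets mapped to their support (fraction of transactions)."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count each candidate's support and keep only those above the threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(survivors)
        # Candidate generation: join surviving k-itemsets into (k+1)-itemsets
        keys = list(survivors)
        k += 1
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k})
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(baskets, min_support=0.5)
```

    On these toy baskets, {bread, milk} survives at support 0.5 while {milk, butter} is pruned; association rules are then read off the surviving itemsets.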

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each library is described briefly below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator forwards a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R and read the dataset. Now we can see our data in R.

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png] [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean the data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
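    In R this grouping is what read.transactions() from arules produces; a hypothetical Python sketch of the same transformation, grouping rows into one basket per invoice, might look like this. The column names (BillNo, Itemname) follow the dataset description above, and the sample rows are invented:

```python
from collections import defaultdict

def to_transactions(rows):
    """Group retail rows into one basket (set of items) per invoice."""
    baskets = defaultdict(set)
    for row in rows:
        item = row["Itemname"].strip()
        if item:                        # drop rows with a missing item name
            baskets[row["BillNo"]].add(item)
    return list(baskets.values())

rows = [
    {"BillNo": "536365", "Itemname": "WHITE HANGING HEART"},
    {"BillNo": "536365", "Itemname": "WHITE METAL LANTERN"},
    {"BillNo": "536366", "Itemname": "HAND WARMER"},
]
transactions = to_transactions(rows)    # two baskets: one of 2 items, one of 1
```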

  2. Data Mining in Systems Health Management

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). Data Mining in Systems Health Management [Dataset]. https://catalog.data.gov/dataset/data-mining-in-systems-health-management
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    This chapter presents theoretical and practical aspects associated with the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows estimating the probability of failure at future time instants (RUL PDF) in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as the hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as has been shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it is helpful for providing quick insight into the manner in which the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
    Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications to the system input to lengthen its RUL. Results of this test indicate that the method was able to successfully suggest the correction that the system required. In this sense, future work will be focused on the development and testing of similar strategies using different input-output uncertainty metrics.
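    The predict/update cycle described in this abstract can be sketched for a scalar fault indicator. Everything here is invented for illustration (the degradation model, noise levels, and measurement value are not from the chapter); it only shows the two-step structure of a particle filter iteration:

```python
import math
import random

def particle_filter_step(particles, weights, measurement,
                         growth=1.05, proc_std=0.05, meas_std=0.2):
    """One predict/update cycle for a scalar fault indicator (toy model)."""
    # Prediction step: propagate each particle through the state model + noise
    particles = [growth * p + random.gauss(0.0, proc_std) for p in particles]
    # Update step: reweight by the Gaussian likelihood of the new measurement
    likelihoods = [math.exp(-0.5 * ((measurement - p) / meas_std) ** 2)
                   for p in particles]
    weights = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(weights) or 1.0          # guard against total weight collapse
    return particles, [w / total for w in weights]

random.seed(0)
n = 500
particles = [random.gauss(1.0, 0.1) for _ in range(n)]   # initial state PDF
weights = [1.0 / n] * n
particles, weights = particle_filter_step(particles, weights, measurement=1.1)
posterior_mean = sum(p * w for p, w in zip(particles, weights))
```

    Iterating the prediction step alone (without measurements) past the current time yields the long-term predictions from which the RUL PDF is read off against a hazard zone.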

  3. Datas of Disease Patterns

    • figshare.com
    zip
    Updated Jun 2, 2017
    Cite
    Jichang Zhao (2017). Datas of Disease Patterns [Dataset]. http://doi.org/10.6084/m9.figshare.5035775.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 2, 2017
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Jichang Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    1. The file "dingxiang_datas.xls" contains all the original data crawled from the DingXiang forum, along with the word-segmentation result for each medical record.
    2. The file "pmi_new_words.txt" contains the new medical words found by calculating mutual information.
    3. The "association_rules" folder contains the association rules mined from the dataset, with the h-confidence threshold set to 0.3 and the support threshold set to 0.0001.
    4. The file "network_communities.csv" describes the complication communities.

    P.S. If you encounter a "d", the word is a disease-description vocabulary item, while "z" or "s" represents a symptom-description vocabulary item.

  4. Data Mining in Systems Health Management - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). Data Mining in Systems Health Management - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/data-mining-in-systems-health-management
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    This chapter presents theoretical and practical aspects associated with the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows estimating the probability of failure at future time instants (RUL PDF) in real time, providing information about time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as the hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as has been shown in a case study where feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it is helpful for providing quick insight into the manner in which the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to manage non-Gaussian PDFs since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
    Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications to the system input to lengthen its RUL. Results of this test indicate that the method was able to successfully suggest the correction that the system required. In this sense, future work will be focused on the development and testing of similar strategies using different input-output uncertainty metrics.

  5. The SAR difference of different confidence degree thresholds in D = 3.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Xin Liu; Xuefeng Sang; Jiaxuan Chang; Yang Zheng; Yuping Han (2023). The SAR difference of different confidence degree thresholds in D = 3. [Dataset]. http://doi.org/10.1371/journal.pone.0255684.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xin Liu; Xuefeng Sang; Jiaxuan Chang; Yang Zheng; Yuping Han
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SAR difference of different confidence degree thresholds in D = 3.

  6. Data_Sheet_1_Machine Learning-Based Data Mining Method for Sentiment...

    • frontiersin.figshare.com
    zip
    Updated Jun 5, 2023
    Cite
    Min-Joon Lee; Tae-Ro Lee; Seo-Joon Lee; Jin-Soo Jang; Eung Ju Kim (2023). Data_Sheet_1_Machine Learning-Based Data Mining Method for Sentiment Analysis of the Sewol Ferry Disaster's Effect on Social Stress.ZIP [Dataset]. http://doi.org/10.3389/fpsyt.2020.505673.s001
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    Frontiers
    Authors
    Min-Joon Lee; Tae-Ro Lee; Seo-Joon Lee; Jin-Soo Jang; Eung Ju Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Sewol Ferry Disaster, which took place on the 16th of April, 2014, was a national-level disaster in South Korea that caused severe social distress nationwide. No research at the domestic level thus far has examined the influence of the disaster on social stress through a sentiment analysis of social media data. Data extracted from YouTube, Twitter, and Facebook were used in this study. The population was users randomly selected from the aforementioned social media platforms who had posted texts related to the disaster from April 2014 to March 2015. ANOVA was used for statistical comparison between negative, neutral, and positive sentiments at a 95% confidence level. For the NLP-based data mining results, bar graph and word cloud analyses as well as analyses of phrases, entities, and queries were implemented. Research results showed a significantly negative sentiment on all social media platforms. This was mainly related to fundamental agents such as ex-president Park and her related political parties and politicians. YouTube, Twitter, and Facebook results showed negative sentiment in phrases (63.5, 69.4, and 58.9%, respectively), entities (81.1, 69.9, and 76.0%, respectively), and query topics (75.0, 85.4, and 75.0%, respectively). All results were statistically significant (p < 0.001). This research provides scientific evidence of the negative psychological impact of the disaster on the Korean population. This study is significant because it is the first research to conduct sentiment analysis of data extracted from the three largest existing social media platforms regarding the issue of the disaster.

  7. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts support the arguments that shape the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = (number of occurrences in which the statement is true) / (total number of statements)
    Confidence = (support of the statement) / (number of occurrences of the premise)
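    Those two definitions can be sketched directly on toy paper records. The feature labels below are invented for illustration and are not the repository's actual attribute values:

```python
# Each record is the set of features extracted for one paper (toy data).
papers = [
    {"supervised", "irreproducible"},
    {"supervised", "irreproducible"},
    {"supervised", "reproducible"},
    {"unsupervised", "irreproducible"},
]

def rule_stats(records, premise, conclusion):
    """Support and confidence for the rule premise -> conclusion."""
    n = len(records)
    both = sum(1 for r in records if premise in r and conclusion in r)
    prem = sum(1 for r in records if premise in r)
    support = both / n                          # occurrences of the full statement
    confidence = both / prem if prem else 0.0   # support relative to the premise
    return support, confidence

support, confidence = rule_stats(papers, "supervised", "irreproducible")
```

    Here the rule "Supervised Learning => irreproducible" holds in 2 of 4 records (support 0.5) and in 2 of the 3 records where the premise occurs (confidence ≈ 0.67).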

  8. Data Sheet 1_The effect of acupuncture-related therapies in animal model of...

    • figshare.com
    docx
    Updated Oct 15, 2025
    Cite
    Guangbin Yu; YingYing Gao; Hongyuan Song; Guizhen Chen; Yunxiang Xu (2025). Data Sheet 1_The effect of acupuncture-related therapies in animal model of postmenopausal osteoporosis: a meta-analysis and data mining approach.docx [Dataset]. http://doi.org/10.3389/fendo.2025.1617154.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    Frontiers
    Authors
    Guangbin Yu; YingYing Gao; Hongyuan Song; Guizhen Chen; Yunxiang Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: This meta-analysis and data mining study aimed to investigate the effectiveness of acupuncture-related therapies in animal models of postmenopausal osteoporosis (PMOP) and to summarize the acupoints involved.

    Methods: This systematic review conducted a comprehensive search of animal experiments using acupuncture-related therapies for the treatment of PMOP up to April 1, 2025. The primary outcome was bone mineral density (BMD). The secondary outcome indicators were estradiol (E2), blood calcium, osteocalcin (OC), and alkaline phosphatase (ALP). Meta-analysis was used to evaluate efficacy, and data mining was used to explore the protocol for acupoint selection.

    Results: 27 animal experiments encompassing 548 animals with PMOP were analyzed. Meta-analysis showed that, compared with the conventional drug group, the acupuncture-related therapy group significantly increased estradiol (mean difference [MD] 2.37, 95% confidence interval [95% CI] 1.15 to 3.58), femoral BMD (MD 1.25, 95% CI 0.65 to 1.87), lumbar BMD (MD 1.88, 95% CI 1.27 to 2.49), and tibia BMD (MD 1.63, 95% CI 0.56 to 2.69). Data mining revealed that Zusanli (ST36), Shenshu (BL23), Guanyuan (CV4), and Sanyinjiao (SP6) were the core acupoints for PMOP treated by acupuncture-related therapies.

    Conclusion: Acupuncture improved BMD and estrogen levels in animal models of PMOP. ST36, BL23, CV4, and SP6 are the core acupoints for acupuncture in PMOP, and this program is expected to become a supplementary treatment for PMOP.

  9. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying: Metal...

    • ceicdata.com
    Updated Feb 3, 2018
    Cite
    CEICdata.com (2018). Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying: Metal Ores [Dataset]. https://www.ceicdata.com/en/russia/entrepreneur-confidence-indicator/entrepreneur-confidence-indicator-okved2-mining--quarrying-metal-ores
    Explore at:
    Dataset updated
    Feb 3, 2018
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2018 - Jan 1, 2019
    Area covered
    Russia
    Variables measured
    Business Confidence Survey
    Description

    Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying: Metal Ores data was reported at 6.000 % Point in Jan 2019. This stayed constant from the previous number of 6.000 % Point for Dec 2018. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying: Metal Ores data is updated monthly, averaging 10.000 % Point from Jan 2017 (Median) to Jan 2019, with 25 observations. The data reached an all-time high of 20.000 % Point in Aug 2017 and a record low of 3.000 % Point in Jan 2017. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying: Metal Ores data remains active status in CEIC and is reported by Federal State Statistics Service. The data is categorized under Global Database’s Russian Federation – Table RU.SA001: Entrepreneur Confidence Indicator.

  10. Transactional BMS-WebView-2

    • data.mendeley.com
    Updated Feb 9, 2023
    + more versions
    Cite
    Uday kiran RAGE (2023). Transactional BMS-WebView-2 [Dataset]. http://doi.org/10.17632/yrjwczzdg7.1
    Explore at:
    Dataset updated
    Feb 9, 2023
    Authors
    Uday kiran RAGE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The statistical details of this dataset are provided below:

    • Database size (total number of transactions): 77411
    • Number of items: 3340
    • Minimum transaction size: 1
    • Average transaction size: 4.623451447468706
    • Maximum transaction size: 161
    • Standard deviation of transaction size: 6.073661955533351
    • Variance in transaction sizes: 36.88984609536578
    • Sparsity: 0.9986157330995603
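    These summary statistics follow a simple recipe that can be reproduced for any transactional database; the sketch below uses toy transactions (the real file lists one transaction of item IDs per line). The sparsity here is assumed to be 1 minus the density (average transaction size divided by the number of items), which matches the listed figures: 1 - 4.6235/3340 ≈ 0.99862.

```python
import statistics

def transaction_stats(transactions, num_items):
    """Summary statistics for a transactional database."""
    sizes = [len(t) for t in transactions]
    avg = statistics.mean(sizes)
    return {
        "transactions": len(transactions),
        "min_size": min(sizes),
        "avg_size": avg,
        "max_size": max(sizes),
        "stdev_size": statistics.pstdev(sizes),   # population std. deviation
        # density = average transaction size / number of items
        "sparsity": 1.0 - avg / num_items,
    }

stats = transaction_stats([[1, 2, 3], [2, 4], [1]], num_items=4)
```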

    Please cite the following reference:

    Rage Uday Kiran, Masaru Kitsuregawa: Efficient discovery of correlated patterns using multiple minimum all-confidence thresholds. J. Intell. Inf. Syst. 45(3): 357-377 (2015)

    URL: https://link.springer.com/article/10.1007/s10844-014-0314-7

    Disclaimer: This dataset was downloaded from fimi repository - http://fimi.uantwerpen.be/data/

  11. Data from: Comprehensive Evaluation of Association Measures for Fault...

    • researchdata.smu.edu.sg
    rar
    Updated May 31, 2023
    Cite
    LUCIA Lucia; David LO; Lingxiao JIANG; Aditya Budi (2023). Data from: Comprehensive Evaluation of Association Measures for Fault Localization [Dataset]. http://doi.org/10.25440/smu.12062796.v1
    Explore at:
    Available download formats: rar
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    LUCIA Lucia; David LO; Lingxiao JIANG; Aditya Budi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the underlying research data for the publication "Comprehensive Evaluation of Association Measures for Fault Localization"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/1330

    In statistics and data mining communities, many measures have been proposed to gauge the strength of association between two variables of interest, such as odds ratio, confidence, Yule-Y, Yule-Q, Kappa, and the Gini index. These association measures have been used in various domains, for example, to evaluate whether a particular medical practice is associated positively with the cure of a disease, or whether a particular marketing strategy is associated positively with an increase in revenue. This paper models the problem of locating faults as association between the execution or non-execution of particular program elements and failures. Special measures, termed suspiciousness measures, have been proposed for the task. Two state-of-the-art measures are Tarantula and Ochiai, which are different from many other statistical measures. To the best of our knowledge, there is no study that comprehensively investigates the effectiveness of various association measures in localizing faults. This paper fills in the gap by evaluating 20 well-known association measures and comparing their effectiveness in fault localization tasks with Tarantula and Ochiai. Evaluation on the Siemens programs shows that a number of association measures perform statistically comparably to Tarantula and Ochiai.

  12. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying

    • ceicdata.com
    Updated Mar 15, 2025
    + more versions
    Cite
    CEICdata.com (2025). Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying [Dataset]. https://www.ceicdata.com/en/russia/entrepreneur-confidence-indicator/entrepreneur-confidence-indicator-okved2-mining--quarrying
    Explore at:
    Dataset updated
    Mar 15, 2025
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Mar 1, 2024 - Feb 1, 2025
    Area covered
    Russia
    Variables measured
    Business Confidence Survey
    Description

    Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying data was reported at 1.000 % Point in Mar 2025. This stayed constant from the previous number of 1.000 % Point for Feb 2025. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying data is updated monthly, averaging 0.000 % Point from Jan 2017 (Median) to Mar 2025, with 99 observations. The data reached an all-time high of 3.000 % Point in Jul 2024 and a record low of -6.000 % Point in May 2020. Russia Entrepreneur Confidence Indicator: OKVED2: Mining & Quarrying data remains active status in CEIC and is reported by Federal State Statistics Service. The data is categorized under Global Database’s Russian Federation – Table RU.SC001: Entrepreneur Confidence Indicator. [COVID-19-IMPACT]

  13. 2018 Database of Effect Sizes (ES)

    • figshare.com
    txt
    Updated Sep 15, 2018
    Cite
    Paul Monsarrat; Jean-Noel Vergnes (2018). 2018 Database of Effect Sizes (ES) [Dataset]. http://doi.org/10.6084/m9.figshare.7066397.v3
    Explore at:
    Available download formats: txt
    Dataset updated
    Sep 15, 2018
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Paul Monsarrat; Jean-Noel Vergnes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The database contains 18 fields:

    • xmlfile: identification of the XML file from which the ES was extracted.
    • id: a unique identifier of the extracted ES within a given xmlfile.
    • pmid: the PMID identifier of the abstract from which the ES was extracted.
    • year: year of publication of the abstract.
    • month: month of publication of the abstract.
    • or: value of the ES (T#0).
    • lci: value of the lower bound of the confidence interval.
    • hci: value of the upper bound of the confidence interval.
    • orhrrr: classification of the type of ES (OR, HR, RR, and PR). PR was considered as OR at the step of the statistical script.
    • ci: type of confidence interval (90%, 95%, or 99%). If no CI was provided, 95% was assumed at the step of the statistical script.
    • counteies: continents extracted from authors' affiliations, using MapAfill and text mining (pipe '|' delimited).
    • adjusted: 1 if an adjusted ES was detected (e.g. adjusted OR, aOR), 0 otherwise.
    • multivariate: 1 if a multivariate analysis was found within the abstract, 0 otherwise.
    • sr: 1 if the abstract was a "Review" (review, systematic review, or meta-analysis), 0 otherwise.
    • ccj: 1 if the journal was found within the Core Clinical Journals list, 0 otherwise.
    • doaj: 1 if the journal was found within the Directory of Open Access Journals, 0 otherwise.
    • pmc: PubMed Central identifier, if it exists.
    • nlm: unique identifier of the journal abstract.

  14. Table1_A real-world disproportionality analysis of Everolimus: data mining...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Mar 12, 2024
    Luo, Lan; Chen, Xiangning; Fu, Yumei; Zhao, Bin; Liu, Shu; Cui, Shichao (2024). Table1_A real-world disproportionality analysis of Everolimus: data mining of the public version of FDA adverse event reporting system.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001334331
    Explore at:
    Dataset updated
    Mar 12, 2024
    Authors
    Luo, Lan; Chen, Xiangning; Fu, Yumei; Zhao, Bin; Liu, Shu; Cui, Shichao
    Description

    Background: Everolimus is an inhibitor of the mammalian target of rapamycin (mTOR) and is used to treat various tumors. The present study aimed to evaluate Everolimus-associated adverse events (AEs) through data mining of the US Food and Drug Administration Adverse Event Reporting System (FAERS).

    Methods: AE records were selected by searching the FAERS database from the first quarter of 2009 to the first quarter of 2022. Potential AE signals were mined using disproportionality analysis, including the reporting odds ratio (ROR), the proportional reporting ratio (PRR), the Bayesian confidence propagation neural network (BCPNN), and the empirical Bayes geometric mean (EBGM); MedDRA was used to systematically classify the results.

    Results: A total of 24,575 AE reports for Everolimus were obtained from the FAERS database, and Everolimus-induced AEs involved 24 system organ classes (SOCs) after conforming to all four algorithms simultaneously. The common significant SOCs included benign, malignant and unspecified neoplasms, and reproductive system and breast disorders, among others. The significant AEs were then mapped to preferred terms such as stomatitis, pneumonitis, and impaired insulin secretion, which are commonly reported in patients receiving Everolimus. Of note, unexpected significant AEs not described in the label, including biliary ischaemia, angiofibroma, and tuberous sclerosis complex, were uncovered.

    Conclusion: This study provides novel insights into the monitoring, surveillance, and management of adverse drug reactions associated with Everolimus. The serious adverse events and their corresponding detection signals, as well as the unexpected significant AE signals, deserve attention in order to improve medication safety during treatment with Everolimus.
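The four disproportionality algorithms named above are standard in pharmacovigilance. As an illustration (not the study's own code), here is the reporting odds ratio with its 95% confidence interval; the counts below are hypothetical:

```python
import math

def ror(a, b, c, d):
    """Reporting odds ratio for a 2x2 drug/event contingency table.
    a: reports with the drug and the event of interest
    b: reports with the drug and other events
    c: reports with other drugs and the event
    d: reports with other drugs and other events
    Returns (ROR, 95% CI lower bound, 95% CI upper bound)."""
    est = (a / b) / (c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(est) - 1.96 * se)
    hi = math.exp(math.log(est) + 1.96 * se)
    return est, lo, hi

# Hypothetical counts: 20 reports pair the drug with the event.
est, lo, hi = ror(20, 80, 100, 9800)
# A commonly used signal criterion: CI lower bound > 1 with >= 3 reports.
signal = lo > 1 and 20 >= 3
```

The PRR, BCPNN, and MGPS/EBGM methods operate on the same 2x2 table with different statistics.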

  15. Tourism research from its inception to present day: Subject area, geography,...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated May 30, 2023
    Andrei P. Kirilenko; Svetlana Stepchenkova (2023). Tourism research from its inception to present day: Subject area, geography, and gender distributions [Dataset]. http://doi.org/10.1371/journal.pone.0206820
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Andrei P. Kirilenko; Svetlana Stepchenkova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper uses text data mining to identify long-term developments in tourism academic research from the perspectives of thematic focus, geography, and gender of tourism authorship. Abstracts of papers published between 1970 and 2017 in high-ranking tourism journals were extracted from the Scopus database and served as the data source for the analysis. Fourteen subject areas were identified using the Latent Dirichlet Allocation (LDA) text mining approach. LDA integrated with GIS information made it possible to obtain the geographic distribution and trends of scholarly output, while probabilistic methods of gender identification based on social network data mining were used to track gender dynamics with sufficient confidence. The findings indicate that, while all 14 topics have been prominent from the inception of tourism studies to the present day, the geography of scholarship has notably expanded and the share of female authorship has increased over time, currently almost equaling that of male authorship.

  16. Table1_A real-world disproportionality analysis of Tivozanib data mining of...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jun 13, 2024
    Wang, Mengmeng; Wang, Kaixuan; Wang, Xiaohui; Li, Wensheng (2024). Table1_A real-world disproportionality analysis of Tivozanib data mining of the public version of FDA adverse event reporting system.xlsx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001408505
    Explore at:
    Dataset updated
    Jun 13, 2024
    Authors
    Wang, Mengmeng; Wang, Kaixuan; Wang, Xiaohui; Li, Wensheng
    Description

    Background: Tivozanib, a vascular endothelial growth factor tyrosine kinase inhibitor, has demonstrated efficacy in a phase III clinical trial for the treatment of renal cell carcinoma. However, a comprehensive evaluation of its long-term safety profile in a large sample population remains elusive. The current study assessed real-world Tivozanib-related adverse events (AEs) through data mining of the US Food and Drug Administration Adverse Event Reporting System (FAERS).

    Methods: Disproportionality analyses, utilizing the reporting odds ratio (ROR), proportional reporting ratio (PRR), Bayesian confidence propagation neural network (BCPNN), and multi-item gamma Poisson shrinker (MGPS) algorithms, were conducted to quantify signals of Tivozanib-related AEs. A Weibull distribution was used to model the varying risk of AE onset over time.

    Results: Out of 5,361,420 reports collected from the FAERS database, 1,366 reports of Tivozanib-associated AEs were identified. A total of 94 significantly disproportionate preferred terms (PTs) conforming to all four algorithms simultaneously were retained. The most common AEs included fatigue, diarrhea, nausea, increased blood pressure, decreased appetite, and dysphonia, consistent with prior specifications and clinical trials. Unexpected significant AEs such as dyspnea, constipation, pain in extremity, stomatitis, and palmar-plantar erythrodysaesthesia syndrome were observed. The median onset time of Tivozanib-related AEs was 37 days (interquartile range [IQR] 11.75–91 days), with the majority (n = 127, 46.35%) occurring within the first month after Tivozanib initiation.

    Conclusion: Our observations align with clinical assertions regarding Tivozanib's safety profile. Additionally, we unveil potential novel and unexpected AE signals associated with Tivozanib administration, highlighting the need for prospective clinical studies to validate these findings and elucidate their causal relationships. These results furnish valuable evidence to steer future clinical inquiries into the safety profile of Tivozanib.
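The median/IQR onset summary reported above can be reproduced with the Python standard library (a language-neutral sketch; the sample onset times below are made up, not the study's data — Weibull fitting itself would need an optimization routine and is omitted):

```python
import statistics

def onset_summary(days):
    """Median and interquartile range of AE onset times, in days."""
    q1, med, q3 = statistics.quantiles(days, n=4, method="inclusive")
    return med, q1, q3

# Hypothetical onset times (days) for a handful of reports:
med, q1, q3 = onset_summary([5, 12, 20, 37, 60, 90, 120])
# med is 37.0 days here; q1 and q3 bound the middle half of the data.
```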

  17. Chile IMCE: Mining

    • ceicdata.com
    Updated Jan 15, 2025
    CEICdata.com (2025). Chile IMCE: Mining [Dataset]. https://www.ceicdata.com/en/chile/business-confidence-index/imce-mining
    Explore at:
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2018 - Jan 1, 2019
    Area covered
    Chile
    Description

    Chile IMCE: Mining data was reported at 50.333 Index in Jan 2019. This records a decrease from the previous number of 53.758 Index for Dec 2018. Chile IMCE: Mining data is updated monthly, averaging 63.366 Index from Nov 2003 (Median) to Jan 2019, with 183 observations. The data reached an all-time high of 82.879 Index in Jul 2005 and a record low of 35.335 Index in Feb 2015. Chile IMCE: Mining data remains active status in CEIC and is reported by Central Bank of Chile. The data is categorized under Global Database’s Chile – Table CL.S001: Business Confidence Index.

  18. Chile IMCE: excl Mining

    • ceicdata.com
    CEICdata.com, Chile IMCE: excl Mining [Dataset]. https://www.ceicdata.com/en/chile/business-confidence-index/imce-excl-mining
    Explore at:
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 1, 2018 - Jan 1, 2019
    Area covered
    Chile
    Description

    Chile IMCE: excl Mining data was reported at 50.754 Index in Jan 2019. This records an increase from the previous number of 47.663 Index for Dec 2018. Chile IMCE: excl Mining data is updated monthly, averaging 53.290 Index from Nov 2003 (Median) to Jan 2019, with 183 observations. The data reached an all-time high of 62.878 Index in Feb 2011 and a record low of 31.533 Index in Feb 2009. Chile IMCE: excl Mining data remains active status in CEIC and is reported by Central Bank of Chile. The data is categorized under Global Database’s Chile – Table CL.S001: Business Confidence Index.

  19. Data from: Improving Sensitivity in Shotgun Proteomics Using a...

    • figshare.com
    xls
    Updated Jun 1, 2023
    Chia-Yu Yen; Steve Russell; Alex M. Mendoza; Karen Meyer-Arendt; Shaojun Sun; Krzysztof J. Cios; Natalie G. Ahn; Katheryn A. Resing (2023). Improving Sensitivity in Shotgun Proteomics Using a Peptide-Centric Database with Reduced Complexity:  Protease Cleavage and SCX Elution Rules from Data Mining of MS/MS Spectra [Dataset]. http://doi.org/10.1021/ac051127f.s003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Chia-Yu Yen; Steve Russell; Alex M. Mendoza; Karen Meyer-Arendt; Shaojun Sun; Krzysztof J. Cios; Natalie G. Ahn; Katheryn A. Resing
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Correct identification of a peptide sequence from MS/MS data is still a challenging research problem, particularly in proteomic analyses of higher eukaryotes where protein databases are large. The scoring methods of search programs often generate cases where incorrect peptide sequences score higher than correct peptide sequences (referred to as distraction). Because smaller databases yield less distraction and better discrimination between correct and incorrect assignments, we developed a method for editing a peptide-centric database (PC-DB) to remove unlikely sequences and strategies for enabling search programs to utilize this peptide database. Rules for unlikely missed cleavage and nontryptic proteolysis products were identified by data mining 11 849 high-confidence peptide assignments. We also evaluated ion exchange chromatographic behavior as an editing criterion to generate subset databases. When used to search a well-annotated test data set of MS/MS spectra, we found no loss of critical information using PC-DBs, validating the methods for generating and searching against the databases. On the other hand, improved confidence in peptide assignments was achieved for tryptic peptides, measured by changes in ΔCN and RSP. Decreased distraction was also achieved, consistent with the 3−9-fold decrease in database size. Data mining identified a major class of common nonspecific proteolytic products corresponding to leucine aminopeptidase (LAP) cleavages. Large improvements in identifying LAP products were achieved using the PC-DB approach when compared with conventional searches against protein databases. These results demonstrate that peptide properties can be used to reduce database size, yielding improved accuracy and information capture due to reduced distraction, but with little loss of information compared to conventional protein database searches.
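The paper's peptide-centric database is built from cleavage rules mined from real spectra. As a simplified, hedged illustration of how such a database is generated, here is a standard in-silico tryptic digest (cleave C-terminal to K or R, except before P) — the textbook rule, not the paper's exact mined rules:

```python
def tryptic_peptides(seq, missed_cleavages=1):
    """In-silico trypsin digest of a protein sequence.
    Cleaves after K or R unless the next residue is P; returns the set of
    peptides allowing up to `missed_cleavages` missed cleavage sites."""
    sites = [i + 1 for i, aa in enumerate(seq)
             if aa in "KR" and not (i + 1 < len(seq) and seq[i + 1] == "P")]
    bounds = [0] + sites
    if not sites or sites[-1] != len(seq):
        bounds.append(len(seq))
    peptides = set()
    for m in range(missed_cleavages + 1):
        for j in range(len(bounds) - 1 - m):
            peptides.add(seq[bounds[j]:bounds[j + 1 + m]])
    return peptides
```

The paper's editing step would then prune this candidate set further, e.g. by removing unlikely missed-cleavage products or peptides inconsistent with observed SCX elution behavior.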

  20. Quran Text and Data Mining

    • kaggle.com
    zip
    Updated Sep 14, 2024
    Sultan Almujaiwel (2024). Quran Text and Data Mining [Dataset]. https://www.kaggle.com/datasets/sultanalmujaiwel/quran-mining/code
    Explore at:
    Available download formats: zip (603024 bytes)
    Dataset updated
    Sep 14, 2024
    Authors
    Sultan Almujaiwel
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Data for the second chapter of the book: Linguistic Statistics and Arabic Text and Data Mining (الإحصاء اللغوي وتنقيب النصوص والبيانات العربية)


Market Basket Analysis

Analyzing Consumer Behaviour Using MBA Association Rule Mining

Description

Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that occurred over a period of time. The retailer will use the results to grow the business and to provide customers with itemset suggestions; this makes it possible to increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rules are most often used when you want to find associations between different objects in a set. They work well when you want to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought computer mouse => bought mouse mat:

  • support = P(mouse & mat) = 8/100 = 0.08
  • confidence = support / P(computer mouse) = 0.08 / 0.10 = 0.80
  • lift = confidence / P(mouse mat) = 0.80 / 0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
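Using the standard definitions (confidence = support / P(antecedent), lift = confidence / P(consequent)), the worked example can be checked in a few lines of Python:

```python
def rule_metrics(transactions, antecedent, consequent):
    """support, confidence, and lift for the rule antecedent => consequent.
    `transactions` is a list of item sets; antecedent/consequent are sets."""
    n = len(transactions)
    p_a = sum(antecedent <= t for t in transactions) / n
    p_c = sum(consequent <= t for t in transactions) / n
    support = sum((antecedent | consequent) <= t for t in transactions) / n
    confidence = support / p_a
    lift = confidence / p_c
    return support, confidence, lift

# Recreate the 100-customer example: 8 bought both, 2 mouse only, 1 mat only.
transactions = ([{"mouse", "mat"}] * 8 + [{"mouse"}] * 2
                + [{"mat"}] + [set()] * 89)
s, c, l = rule_metrics(transactions, {"mouse"}, {"mat"})
# s ≈ 0.08, c ≈ 0.80, l ≈ 8.89 (rounds to 8.9)
```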

Strategy

  • Data Import
  • Data Understanding and Exploration
  • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
  • Running association rules
  • Exploring the rules generated
  • Filtering the generated rules
  • Visualization of Rules
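The steps above are carried out with R's arules package later in this document. As a language-neutral sketch of the "running association rules" step, here is a brute-force frequent-itemset counter in Python (a real Apriori implementation prunes candidates level by level; this toy version simply enumerates all small itemsets):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, max_len=2):
    """Count support for every itemset up to max_len items and keep
    those whose support meets min_support. Itemsets are sorted tuples."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset: c / n for itemset, c in counts.items()
            if c / n >= min_support}
```

Rules would then be generated from the frequent itemsets by splitting each one into antecedent and consequent and filtering on confidence.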

Dataset Description

  • File name: Assignment-1_Data
  • List name: retaildata
  • File format: .xlsx
  • Number of Rows: 522065
  • Number of Attributes: 7

    • BillNo: 6-digit number assigned to each transaction. Nominal.
    • Itemname: Product name. Nominal.
    • Quantity: The quantities of each product per transaction. Numeric.
    • Date: The day and time when each transaction was generated. Numeric.
    • Price: Product price. Numeric.
    • CustomerID: 5-digit number assigned to each customer. Nominal.
    • Country: Name of the country where each customer resides. Nominal.
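A minimal Python model of one record with these seven attributes (the field names, types, and sample values below are my own reading of the list above, not the file's exact schema):

```python
from dataclasses import dataclass

@dataclass
class RetailRow:
    """One row of the retail dataset, as described above."""
    bill_no: str      # BillNo: 6-digit transaction number (nominal)
    item_name: str    # Itemname: product name (nominal)
    quantity: int     # Quantity: units of the product in this transaction
    date: str         # Date: day and time the transaction was generated
    price: float      # Price: product price
    customer_id: str  # CustomerID: 5-digit customer number (nominal)
    country: str      # Country: customer's country of residence (nominal)

# Hypothetical row for illustration only:
row = RetailRow("536365", "HAND WARMER", 6,
                "2010-12-01 08:26", 2.55, "17850", "United Kingdom")
```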

Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Below, I briefly describe each of them.

  • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
  • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
  • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
  • readxl - Read Excel Files in R.
  • plyr - Tools for Splitting, Applying and Combining Data.
  • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  • knitr - Dynamic Report generation in R.
  • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
  • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we will clean our data frame by removing missing values.

Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
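The original workflow performs this conversion in R (building a transactions object for arules). As a language-neutral sketch, grouping items by invoice number in Python might look like this (assuming rows arrive as (BillNo, Itemname) pairs):

```python
from collections import defaultdict

def to_transactions(rows):
    """Group (bill_no, item_name) pairs into one item set per invoice.
    Returns the baskets in first-seen invoice order."""
    baskets = defaultdict(set)
    for bill_no, item in rows:
        baskets[bill_no].add(item)
    return list(baskets.values())
```

Each resulting set is one market basket, ready to feed into an association rules algorithm.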
