8 datasets found

Data from: Mining significant crisp-fuzzy spatial association rules
tandf.figshare.com
pdf
Updated May 30, 2023
Cite
Wenzhong Shi; Anshu Zhang; Geoffrey I. Webb (2023). Mining significant crisp-fuzzy spatial association rules [Dataset]. http://doi.org/10.6084/m9.figshare.5873139.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.5873139.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Wenzhong Shi; Anshu Zhang; Geoffrey I. Webb
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spatial association rule mining (SARM) is an important data mining task for understanding implicit and sophisticated interactions in spatial data. The usefulness of SARM results, represented as sets of rules, depends on their reliability: the abundance of rules, control over the risk of spurious rules, and accuracy of rule interestingness measure (RIM) values. This study presents crisp-fuzzy SARM, a novel SARM method that can enhance the reliability of resultant rules. The method first prunes dubious rules using statistically sound tests and crisp supports for the patterns involved, and then evaluates RIMs of accepted rules using fuzzy supports. For the RIM evaluation stage, the study also proposes a Gaussian-curve-based fuzzy data discretization model for SARM with improved design for spatial semantics. The proposed techniques were evaluated on both synthetic and real-world data. The synthetic data were generated with predesigned rules and RIM values, so the reliability of SARM results could be confidently and quantitatively evaluated. The proposed techniques showed high efficacy in enhancing the reliability of SARM results in all three aspects. The abundance of resultant rules was improved by 50% or more compared with using conventional fuzzy SARM. Minimal risk of spurious rules was guaranteed by statistically sound tests. The probability that the entire result contained any spurious rules was below 1%. The RIM values also avoided large positive errors committed by crisp SARM, which typically exceeded 50% for representative RIMs. The real-world case study on New York City points of interest reconfirms the improved reliability of crisp-fuzzy SARM results, and demonstrates that such improvement is critical for practical spatial data analytics and decision support.
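To make the distinction between crisp and fuzzy supports concrete, here is a minimal sketch. It is not the authors' discretization model: the Gaussian membership form, the `center` and `sigma` parameters, the function names, and the distance values are all assumptions chosen for illustration.

```python
import math

def gaussian_membership(x, center, sigma):
    """Degree (0..1) to which x belongs to a fuzzy category centred at `center`."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def crisp_support(values, low, high):
    """Fraction of records whose value falls inside the hard interval [low, high]."""
    return sum(1 for v in values if low <= v <= high) / len(values)

def fuzzy_support(values, center, sigma):
    """Mean membership degree of all records in the fuzzy category."""
    return sum(gaussian_membership(v, center, sigma) for v in values) / len(values)

# Hypothetical distances (metres) from four points of interest to the nearest park.
distances = [0.0, 100.0, 250.0, 600.0]

# "Near" as a hard 0-200 m interval: a record either counts fully or not at all.
print(crisp_support(distances, 0.0, 200.0))

# "Near" as a Gaussian centred at 0 m: the record at 250 m still contributes partially.
print(fuzzy_support(distances, 0.0, 200.0))
```

In this toy setting the crisp support ignores the record at 250 m entirely, while the fuzzy support gives it partial weight, which is the kind of boundary effect a fuzzy discretization is meant to soften.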
Data set for the paper "Predicting Relevance of Change Recommendations"
zenodo.org
zip
Updated Aug 2, 2024
Cite
Thomas Rolfsnes; Leon Moonen; David Binkley (2024). Data set for the paper "Predicting Relevance of Change Recommendations" [Dataset]. http://doi.org/10.5281/zenodo.1040118
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.1040118
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodo, http://zenodo.org/
Authors
Thomas Rolfsnes; Leon Moonen; David Binkley
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data set for the paper "Predicting Relevance of Change Recommendations" by Thomas Rolfsnes, Leon Moonen, and David Binkley, in International Conference on Automated Software Engineering (ASE), pp. 694–705, 2017, IEEE.

Please cite this work by referring to the corresponding conference publication (a preprint is included in this package).

Abstract: Software change recommendation seeks to suggest artifacts (e.g., files or methods) that are related to changes made by a developer, and thus identifies possible omissions or next steps. While one obvious challenge for recommender systems is to produce accurate recommendations, a complementary challenge is to rank recommendations based on their relevance. In this paper, we address this challenge for recommendation systems that are based on evolutionary coupling. Such systems use targeted association-rule mining to identify relevant patterns in a software system's change history. Traditionally, this process involves ranking artifacts using interestingness measures such as confidence and support. However, these measures often fall short when used to assess recommendation relevance. We propose the use of random forest classification models to assess recommendation relevance. This approach improves on past use of various interestingness measures by learning from previous change recommendations. We empirically evaluate our approach on fourteen open source systems and two systems from our industry partners. Furthermore, we consider complementing two mining algorithms: CO-CHANGE and TARMAQ. The results find that random forest classification significantly outperforms previous approaches, receives lower Brier scores, and has a superior trade-off between precision and recall. The results are consistent across software system and mining algorithm.
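The abstract compares approaches by Brier score. As a reminder of what that metric measures, the sketch below computes the standard Brier score for binary outcomes (mean squared difference between predicted probability and actual outcome; lower is better). The function name and the sample probabilities are illustrative assumptions and are not taken from the data set.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(predicted_probs) != len(outcomes):
        raise ValueError("predicted_probs and outcomes must have the same length")
    if not outcomes:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical relevance probabilities for three recommendations,
# and whether each recommendation turned out to be relevant (1) or not (0).
probs = [0.9, 0.2, 0.7]
relevant = [1, 0, 0]
print(brier_score(probs, relevant))  # (0.01 + 0.04 + 0.49) / 3, roughly 0.18
```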
Reference list of 265 sources used for the discovery of relationships...
search.dataone.org
doi.pangaea.de
Updated Feb 28, 2018
Cite
Bernard, Jürgen; Ruppert, Tobias; Scherer, Maximilian; Schreck, Tobias; Kohlhammer, Jörn (2018). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
Unique identifier
https://doi.org/10.1594/PANGAEA.785666
Dataset updated
Feb 28, 2018
Dataset provided by
PANGAEA Data Publisher for Earth and Environmental Science
Authors
Bernard, Jürgen; Ruppert, Tobias; Scherer, Maximilian; Schreck, Tobias; Kohlhammer, Jörn
Time period covered
Jan 1, 1992 - Jun 30, 2016
Description
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of associated categorical, numerical, or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata.
Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.
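The idea of scoring clusters by the distribution of a categorical metadata attribute can be sketched as follows. The score used here (one minus normalised Shannon entropy of the per-cluster histogram) is an assumption chosen for illustration and is not claimed to be one of the measures implemented in the described system; the cluster ids and labels are made up.

```python
import math
from collections import Counter

def concentration_score(labels, n_categories):
    """Return 1 - normalised Shannon entropy of the category histogram.

    1.0 means every element shares one category; 0.0 means a uniform spread.
    """
    if n_categories < 2:
        return 1.0
    total = len(labels)
    entropy = 0.0
    for count in Counter(labels).values():
        p = count / total
        entropy -= p * math.log(p)
    return 1.0 - entropy / math.log(n_categories)

def rank_clusters(clusters):
    """Sort cluster ids from most to least concentrated metadata distribution."""
    categories = {label for labels in clusters.values() for label in labels}
    scores = {
        cid: concentration_score(labels, len(categories))
        for cid, labels in clusters.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

# Hypothetical clusters, each listing the metadata category of its members.
clusters = {
    "c1": ["A", "A", "A", "A"],
    "c2": ["A", "B", "A", "B"],
    "c3": ["A", "A", "A", "B"],
}
ranking, scores = rank_clusters(clusters)
print(ranking)  # most concentrated cluster first
```

A ranking like this could then drive which clusters are highlighted for inspection, which is the guidance role the description assigns to interestingness scores.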
‘Groceries dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 15, 2015
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2015). ‘Groceries dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-groceries-dataset-b6be/136ba9af/?iid=001-023&v=presentation
Dataset updated
Aug 15, 2015
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Groceries dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/heeraldedhia/groceries-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Association Rule Mining

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in transaction data using measures of interestingness.

Details of the dataset

The dataset has 38765 rows of purchase orders from grocery stores. These orders can be analysed with Market Basket Analysis, and association rules can be generated by algorithms such as Apriori.

Apriori Algorithm

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
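The level-wise procedure described above can be sketched in a few lines. This is a minimal, unoptimised illustration of the Apriori idea for small in-memory data; the function name, the toy baskets, and the 0.6 threshold are assumptions and do not come from the Groceries dataset.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent individual items.
    items = {item for b in baskets for item in b}
    current = keep_frequent({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets.
baskets = [
    ["milk", "butter", "bread"],
    ["milk", "butter"],
    ["milk", "bread"],
    ["butter"],
    ["milk", "butter", "bread"],
]
for itemset, s in sorted(apriori(baskets, 0.6).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run, {butter, bread} falls below the threshold, so the three-item candidate is pruned without counting it, which is the saving the Apriori property provides.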

An example of Association Rules

Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both. For the rule bought milk => bought butter:
support = P(Milk & Butter) = 6/100 = 0.06
confidence = support/P(Milk) = 0.06/0.10 = 0.60
lift = confidence/P(Butter) = 0.60/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Some important terms:

Support: This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.

Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears.

Lift: This says how likely item Y is purchased when item X is purchased while controlling for how popular item Y is.
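The three terms can be computed directly from transaction counts. The sketch below follows the definitions given here (confidence as the share of X-transactions that also contain Y, lift as confidence divided by the popularity of Y) and reuses the milk and butter counts from the example; the function and parameter names are illustrative assumptions.

```python
def rule_metrics(n_total, n_x, n_y, n_both):
    """Support, confidence and lift for the rule {X -> Y} from raw counts."""
    support = n_both / n_total                  # P(X & Y)
    confidence = n_both / n_x                   # P(Y | X)
    lift = (n_both * n_total) / (n_x * n_y)     # P(Y | X) / P(Y)
    return support, confidence, lift

# 100 customers, 10 bought milk (X), 8 bought butter (Y), 6 bought both.
support, confidence, lift = rule_metrics(100, 10, 8, 6)
print(f"support={support}, confidence={confidence}, lift={lift}")
# support=0.06, confidence=0.6, lift=7.5
```

A lift well above 1, as here, indicates that the two items are bought together far more often than would be expected if the purchases were independent.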

--- Original source retains full ownership of the source dataset ---
Hotspots within a hotspot: Evolutionary measures unveil interesting...
datadryad.org
search.dataone.org
zip
Updated Jun 9, 2022
Cite
Rosa Scherson; Daniela Mardones (2022). Hotspots within a hotspot: Evolutionary measures unveil interesting biogeographic patterns for the conservation of the coastal forest in Chile [Dataset]. http://doi.org/10.5061/dryad.h44j0zpnw
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.h44j0zpnw
Dataset updated
Jun 9, 2022
Dataset provided by
Dryad
Authors
Rosa Scherson; Daniela Mardones
Time period covered
2022
Area covered
Chile
Description
This is a concatenated data matrix obtained from GenBank and laboratory work, that was used to perform phylogenetic analyses.
Equivalences between measures of information and inequality.
plos.figshare.com
xls
Updated Nov 20, 2024
Cite
Tobias Mages; Christian Rohner (2024). Equivalences between measures of information and inequality. [Dataset]. https://plos.figshare.com/articles/dataset/Equivalences_between_measures_of_information_and_inequality_/27869980
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0313281.t004
Dataset updated
Nov 20, 2024
Dataset provided by
PLOS ONE
Authors
Tobias Mages; Christian Rohner
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Equivalences between measures of information and inequality.
Descriptive statistics.
plos.figshare.com
xls
Updated Sep 3, 2024
Cite
Deng Lujie; Chunhua Lin; Qiong Liao; Shuicai Qiu (2024). Descriptive statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0305290.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0305290.t004
Dataset updated
Sep 3, 2024
Dataset provided by
PLOS ONE
Authors
Deng Lujie; Chunhua Lin; Qiong Liao; Shuicai Qiu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The objective of this study is to evaluate users’ perceptions and preferences on the design features of the COVID-19 prevention promotion icon from the perspective of users’ aesthetic and perceptual needs. In this study, 120 officially published icons from 24 countries and regions were collected from online platforms for ranking tests, and then the top-ranked icons were subjectively rated by the semantic differential method. By evaluating the quality of users’ perceptions of multiple semantic dimensions of icons, we extracted the perceptual semantic words that users valued as the main icon design features. Spearman correlations were applied to derive possible correlations between user rankings and semantic scales, and a Friedman test was also conducted to determine the true differences in user perceptions and preferences for different styles of icons. Factor analysis was conducted to extract six perceptual words that influence the design features of the COVID-19 prevention promotion icon. The methodology adopted in this study facilitated the screening of design features related to icon effectiveness, and the findings show that “Interesting,” “Simple,” “Familiar,” “Recognizable,” “Concrete,” and “Close (semantic distance)” are the key features that influence users’ perception and preference of COVID-19 icon design. The results of this study can be used as the basis for designing and improving publicity icons for preventive measures in COVID-19, and the methods adopted in this study can be applied to evaluate other types of icon design.
Rules sorted by top indicator (lift, support, and confidence).
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Chenwei Gu; Jinliang Xu; Chao Gao; Minghao Mu; Guangxun E; Yongji Ma (2023). Rules sorted by top indicator (lift, support, and confidence). [Dataset]. http://doi.org/10.1371/journal.pone.0276817.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0276817.t003
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Chenwei Gu; Jinliang Xu; Chao Gao; Minghao Mu; Guangxun E; Yongji Ma
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Rules sorted by top indicator (lift, support, and confidence).