This document describes a dataset curated for association rule mining, a data mining technique central to market basket analysis. The dataset contains items commonly found in retail transactions, each encoded as a binary variable, with "1" denoting presence and "0" denoting absence in an individual transaction.
The dataset is organised into distinct columns, each representing a specific item:
The purpose of this dataset is to enable the discovery of associations and patterns hidden within customer transactions. Each row represents a single transaction, and the value in each column indicates whether the corresponding item was included in that transaction.
The data is binary: "1" indicates that an item was purchased and "0" indicates that it was not. This encoding keeps the focus on item presence rather than quantity.
The dataset supports a range of prospective applications, including but not limited to:
The dataset lends itself to standard techniques such as the Apriori and FP-Growth algorithms, which find frequent itemsets and association rules and shed light on customer behaviour and item co-occurrence patterns.
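As a minimal sketch (assuming pandas and the mlxtend library; the file name is hypothetical), frequent itemsets and rules could be mined from such a one-hot table like this:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: rows = transactions, columns = items (0/1).
transactions = pd.read_csv("transactions.csv").astype(bool)

# Frequent itemsets appearing in at least 2% of transactions.
frequent = apriori(transactions, min_support=0.02, use_colnames=True)

# Rules with confidence >= 0.5, sorted by lift.
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(rules.sort_values("lift", ascending=False).head())
```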
In closing, this association rules dataset offers the chance to discover valuable patterns and associations hidden in transactional data. With data mining algorithms, businesses and analysts can surface insights that inform strategic decisions, improve customer experiences, and optimise operations.
These are artificially created beginner data mining datasets for learning purposes.
Case study:
The aim of the FeelsLikeHome_Campaign dataset is a project in which you build a predictive model (using a sample of 2,500 clients’ data) that forecasts the highest profit from the next marketing campaign by indicating the customers most likely to accept the offer.
The aim of the FeelsLikeHome_Cluster dataset is a project in which you split the company’s customer base into homogeneous clusters (using 5,000 clients’ data) and propose draft marketing strategies for these groups based on customer behaviour and information about their profile.
The FeelsLikeHome_Score dataset can be used to calculate the total profit from a marketing campaign and to produce a list of customers sorted by the predicted probability of the dependent variable in the predictive modelling problem.
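As an illustrative sketch of that scoring step (the model choice, file names, and the "accepted" column are hypothetical assumptions, not part of the datasets’ documentation):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training table: campaign features plus a binary "accepted" target.
train = pd.read_csv("FeelsLikeHome_Campaign.csv")
X, y = train.drop(columns=["accepted"]), train["accepted"]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score the customer list and sort by predicted acceptance probability.
score = pd.read_csv("FeelsLikeHome_Score.csv")
score["p_accept"] = model.predict_proba(score[X.columns])[:, 1]
print(score.sort_values("p_accept", ascending=False).head())
```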
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Introduction: Hospitals have seen a rise in Medical Emergency Team (MET) reviews. We hypothesised that the commonest MET calls result in similar treatments. Our aim was to design a pre-emptive management algorithm that allowed direct institution of treatment to patients without having to wait for attendance of the MET team, and to model its potential impact on MET call incidence and patient outcomes.
Methods: Data were extracted for all MET calls from the hospital database. Association rule data mining techniques were used to identify the most common combinations of MET call causes, outcomes and therapies.
Results: There were 13,656 MET calls during the 34-month study period in 7,936 patients. The most common MET call was for hypotension [31% (2459/7936)]. These MET calls were strongly associated with the immediate administration of intravenous fluid (70% [1714/2459] v 13% [739/5477], p
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The constantly increasing volume and complexity of available biological data require new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross- and intra-ontology pairwise generalized association rules. Its advantage is its sensitivity to rare associations, which are important for biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take rare cases into account. They favor rules whose actual interestingness value exceeds the expected value, where the latter is calculated taking the parent rule into account. We demonstrate this approach by applying it to the analysis of data from the Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The obtained results show that the produced rules represent meaningful and quite reliable associations.
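The paper’s exact measures are not reproduced here, but one common formulation for hierarchically organized rules (in the spirit of Srikant and Agrawal’s generalized association rules, and only an assumption about the form used above) compares a child rule’s observed confidence with the value expected from its parent:

\[
E[\mathrm{conf}(A' \Rightarrow B')] = \mathrm{conf}(A \Rightarrow B)\cdot\frac{P(B')}{P(B)},
\qquad
I(A' \Rightarrow B') = \frac{\mathrm{conf}(A' \Rightarrow B')}{E[\mathrm{conf}(A' \Rightarrow B')]}
\]

where \(A \Rightarrow B\) is the parent rule and \(A', B'\) are child terms in the ontology hierarchy; rules with \(I > 1\) are more interesting than the parent predicts.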
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts provide arguments that inform the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases process, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks and methods.
Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis, for which we performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
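As a minimal sketch of that PCA step (assuming scikit-learn; the file name is hypothetical, and the nominal features must first be one-hot encoded):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical table of the 35 nominal features extracted from the papers.
papers = pd.read_csv("dl4se_features.csv")

# Nominal features need a numeric encoding before PCA.
X = OneHotEncoder(sparse_output=False).fit_transform(papers)

# Project onto 2 components for visualization.
coords = PCA(n_components=2).fit_transform(X)
print(coords[:5])
```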
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncovering hidden relationships among the extracted features (Correlations and Association Rules) and to categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support(premise ⇒ conclusion) = (number of statements in which both premise and conclusion hold) / (total number of statements)
Confidence(premise ⇒ conclusion) = Support(premise ⇒ conclusion) / Support(premise)
Title: Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics
Authors: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian
Conference: The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering, https://www.iceccme.com/home
This dataset was created to support research focused on understanding the factors influencing entrepreneurs’ adoption of data mining techniques for business analytics. The dataset contains carefully curated data points that reflect entrepreneurial behaviors, decision-making criteria, and the role of data mining in enhancing business insights.
Researchers and practitioners can leverage this dataset to explore patterns, conduct statistical analyses, and build predictive models to gain a deeper understanding of entrepreneurial adoption of data mining.
Intended Use: This dataset is designed for research and academic purposes, especially in the fields of business analytics, entrepreneurship, and data mining. It is suitable for conducting exploratory data analysis, hypothesis testing, and model development.
Citation: If you use this dataset in your research or publication, please cite the paper presented at the ICECCME 2024 conference using the following format: Edward Matthew Dominica, Feylin Wijaya, Andrew Giovanni Winoto, Christian. Identifying Factors that Affect Entrepreneurs’ Use of Data Mining for Analytics. The 4th International Conference on Electrical, Computer, Communications, and Mechatronics Engineering (2024).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The LSC (Leicester Scientific Corpus)

August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes. The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.

Getting Started

This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains a title, a list of authors, a list of categories, a list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. Each document in the corpus contains the following parts:

1. Authors: the list of authors of the paper.
2. Title: the title of the paper.
3. Abstract: the abstract of the paper.
4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file ‘List_of_Categories.txt’.
5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file ‘List_of_Research_Areas.txt’.
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4].
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4].

We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents in LSC have a non-empty abstract, title, categories, research areas and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.

Data Processing

This section describes the steps taken for the LSC to be collected, cleaned and made available to researchers. Processing the data consists of six main steps.

Step 1: Downloading the Data Online. The dataset was collected online by manually exporting documents as tab-delimited files. All downloaded documents are available online.

Step 2: Importing the Dataset to R. The collection was converted to RData format for processing. The LSC was collected as TXT files; all documents are extracted to R.

Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category. Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, inaccurate documents were detected and removed in advance: all documents with empty abstracts or without categories are removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts. Traditionally, abstracts are written as an executive summary in one paragraph of continuous writing, known as an ‘unstructured abstract’. However, medicine-related publications in particular use ‘structured abstracts’, which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates the section headings with the first word of the section. As a result, some structured abstracts in the LSC require an additional correction step to split such concatenated words; for instance, we observe words such as ‘ConclusionHigher’ and ‘ConclusionsRT’ in the corpus. The detection and identification of concatenated words cannot be fully automated: human intervention is needed to identify the possible headings of sections. We only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Such words were identified by sampling medicine-related publications. The section headings identified in structured abstracts are given in List 1.

List 1. Headings of sections identified in structured abstracts: Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy.

All words containing the headings in List 1 are detected in the entire corpus and split into two words. For instance, the word ‘ConclusionHigher’ is split into ‘Conclusion’ and ‘Higher’.

Step 5: Extracting (Sub-setting) the Data Based on the Lengths of Abstracts. After the correction of concatenated words is completed, the lengths of abstracts are calculated. ‘Length’ indicates the total number of words in the text, calculated by the same rule as Microsoft Word’s word count [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words; however, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that ‘Clinical and basic research studies must include a structured abstract of 400 words or less’ [7]. In LSC, the length of abstracts varies from 1 to 3,805 words. We decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts are removed.

Step 6: Saving the Dataset in CSV Format. The corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.

The Structure of Fields in CSV Files

In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields. To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. American Psychological Association, Washington, DC, 1983.
[7] P. Gloviczki and P. F. Lawrence, “Information for authors,” Journal of Vascular Surgery, vol. 65, no. 1, pp. A16–A22, 2017.
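As a minimal sketch of the Step 4 correction described above (not the authors’ actual code; the heading list is abridged and illustrative):

```python
import re

# Abridged subset of the section headings from List 1.
HEADINGS = ["Background", "Conclusions", "Conclusion", "Methods", "Results", "Objective"]

# Longest headings first so "Conclusions" wins over "Conclusion".
pattern = re.compile(
    r"\b(" + "|".join(sorted(HEADINGS, key=len, reverse=True)) + r")([A-Z][a-z]+)"
)

def split_headings(text: str) -> str:
    """Insert a space between a known heading and the word fused to it."""
    return pattern.sub(r"\1 \2", text)

print(split_headings("ConclusionHigher doses were associated with better outcomes."))
# -> "Conclusion Higher doses were associated with better outcomes."
```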
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Objective: Glucolipotoxicity is a major pathophysiological mechanism in the development of insulin resistance and type 2 diabetes mellitus (T2D). We aimed to detect subtle changes in the circulating lipid profile by shotgun lipidomics analyses and to associate them with four different insulin sensitivity indices.
Methods: The cross-sectional study comprised 90 men with a broad range of insulin sensitivity, including normal glucose tolerance (NGT, n = 33), impaired glucose tolerance (IGT, n = 32) and newly detected T2D (n = 25). Prior to an oral glucose challenge, plasma was obtained and quantitatively analyzed for 198 lipid molecular species from 13 different lipid classes, including triacylglycerols (TAGs), phosphatidylcholine plasmalogens/ethers (PC O-s), sphingomyelins (SMs), and lysophosphatidylcholines (LPCs). To identify a lipidomic signature of individual insulin sensitivity we applied three data mining approaches, namely least absolute shrinkage and selection operator (LASSO), Support Vector Regression (SVR) and Random Forests (RF), to the following insulin sensitivity indices: homeostasis model of insulin resistance (HOMA-IR), glucose insulin sensitivity index (GSI), insulin sensitivity index (ISI), and disposition index (DI). The LASSO procedure offers high prediction accuracy and easier interpretability than SVR and RF.
Results: After LASSO selection, the plasma lipidome explained from 3% (DI) to a maximum of 53% (HOMA-IR) of the variability of the sensitivity indices. Among the lipid species with the highest positive LASSO regression coefficients were TAG 54:2 (HOMA-IR), PC O-32:0 (GSI), and SM 40:3:1 (ISI). The highest negative regression coefficients were obtained for LPC 22:5 (HOMA-IR), TAG 51:1 (GSI), and TAG 58:6 (ISI).
Conclusion: Although a substantial proportion of lipid molecular species showed a significant correlation with insulin sensitivity indices, we were able to identify a limited number of lipid metabolites of particular importance based on the LASSO approach. These few selected lipids, with the closest connection to sensitivity indices, may help to further improve disease risk prediction and disease and therapy monitoring.
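As an illustrative sketch of the LASSO step (assuming scikit-learn; the file and column names are hypothetical, not the study’s actual data):

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical table: 198 lipid species per subject plus a HOMA-IR column.
data = pd.read_csv("lipidomics.csv")
X = StandardScaler().fit_transform(data.drop(columns=["HOMA_IR"]))
y = data["HOMA_IR"]

# L1 regularisation shrinks most coefficients to zero, selecting few lipids.
lasso = LassoCV(cv=5).fit(X, y)
selected = data.drop(columns=["HOMA_IR"]).columns[lasso.coef_ != 0]
print(f"R^2 = {lasso.score(X, y):.2f}; selected lipids: {list(selected)}")
```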
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Sales forecasting uses historical sales figures, in association with product characteristics and peculiarities, to predict short-term or long-term future performance in a business, and it can be used to derive sound financial and business plans. Using publicly available data, we build an accurate regression model for online sales forecasting, obtained via a novel feature selection methodology that applies the multi-objective evolutionary algorithm ENORA (Evolutionary NOn-dominated Radial slots based Algorithm) as the search strategy in a wrapper method driven by the well-known regression model learner Random Forest. Our proposal integrates feature selection for regression, model evaluation, and decision making, in order to choose the most satisfactory model according to an a posteriori process in a multi-objective context. We test and compare the performance of ENORA as a multi-objective evolutionary search strategy against a standard multi-objective evolutionary search strategy such as NSGA-II (Non-dominated Sorting Genetic Algorithm II), against a classical backward search strategy such as RFE (Recursive Feature Elimination), and against the original data set.
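ENORA itself is a bespoke evolutionary algorithm, but the RFE baseline with a Random Forest learner can be sketched with scikit-learn (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical online-sales table with a numeric sales target.
data = pd.read_csv("online_sales.csv")
X, y = data.drop(columns=["sales"]), data["sales"]

# Recursively drop the least important features according to the forest.
selector = RFE(RandomForestRegressor(n_estimators=200, random_state=0),
               n_features_to_select=10)
selector.fit(X, y)
print(list(X.columns[selector.support_]))
```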
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer’s transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow in his industry and to provide customers with itemset suggestions, so we will be able to increase customer engagement, improve customer experience, and identify customer behaviour. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rules are most useful when you are planning to build associations between different objects in a set, and they work well for finding frequent patterns in a transaction database. They can tell you what items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.
- rule: bought Computer Mouse => bought Mouse Mat
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support / P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(Mouse Mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
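A few lines of Python reproduce that arithmetic (a sketch using only the counts above):

```python
n, mouse, mat, both = 100, 10, 9, 8

support = both / n                  # P(mouse and mat) = 0.08
confidence = support / (mouse / n)  # 0.08 / 0.10 = 0.80
lift = confidence / (mat / n)       # 0.80 / 0.09 ~= 8.9

print(support, confidence, round(lift, 1))
```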
Number of Attributes: 7
First, we need to load the required libraries; I briefly describe each of them below.
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
After that, we clean our data frame by removing missing values.
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
Context: Predicting which acromegaly patients could benefit from somatostatin receptor ligands (SRLs) is crucial to avoid months of ineffective treatment in non-responding cases. Although many biomarkers linked to SRL response have been identified, there is no consensus criterion on how to assign pharmacologic treatment according to biomarker levels. Objective: Our aim is to provide better predictive tools for a more accurate stratification of acromegaly patients regarding their ability to respond to SRLs. Design and patients: Retrospective multicenter study of 71 acromegaly patients. Methods: We used advanced mathematical modelling and artificial intelligence to predict SRL response, combining molecular and clinical information. Results: Different models of patient stratification were obtained regarding SRL response, with much higher accuracy when the studied cohort is fragmented according to relevant clinical characteristics. Considering all the models, a patient stratification based on the extrasellar growth of the tumor, sex, age and the expression of E-cadherin, GHRL, IN1-GHRL, DRD2, SSTR5 and PEBP1 is proposed, with accuracies ranging from 71% to 95%. Furthermore, we show an association between extrasellar growth and high BMI for SRL non-responding patients. Conclusion: The use of data mining is necessary for the implementation of personalized medicine in acromegaly and requires an interdisciplinary effort between computer science, mathematics, biology and medicine. This new methodology opens a door to more precise personalized medicine for acromegaly patients.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Abstract
Background: With regard to oral health, little progress has been made on how to improve quality within dental care.
Objective: The aim of this study was to identify the demographic factors affecting the satisfaction of users of the public dental service, using a strategic and highly consistent methodology.
Method: Data mining was applied to a secondary database comprising 91 features, segmented into 9 demographic factors, 17 facets, and 5 domains. Descriptive statistics were extracted for the demographic data and for user satisfaction by facet and domain, with patterns discovered through decision trees and association rules.
Results: The analysis showed a relation between the demographic factor ‘professional occupation’ and satisfaction in all of the domains. The occupations of general assistant and home assistant paid by daily wage stood out in the association rules as representing the lowest level of satisfaction with the worst-evaluated facets. The factor ‘health unit’s name’ also showed a relation with most of the investigated domains; the difference between health units was even more evident in the association rules.
Conclusion: Data mining allowed the identification of relations complementary to users’ perception of the quality of public oral health services, constituting a reliable tool to support the management of Brazilian public health and the basis of future plans.
MIT License: https://opensource.org/licenses/MIT
This dataset contains transactional data collected for market basket analysis. Each row represents a single transaction with items purchased together. It is ideal for implementing association rule mining techniques such as Apriori, FP-Growth, and other machine learning algorithms.
Objective: Using data mining, the present study aimed to discover the most effective acupoints and combinations in the acupuncture treatment of asthma. Methods: The main acupoints prescribed in the included clinical trials were collected and quantified. A network analysis was performed to uncover their interconnections. Additionally, hierarchical clustering analysis and association rule mining were conducted to discover potential acupoint combinations. Results: Feishu (BL13), Dingchuan (EX-B1), Dazhui (GV14), Shenshu (BL23), Pishu (BL20), and Fengmen (BL12) appeared to be the most frequently used acupoints for asthma, while the Bladder Meridian of Foot Taiyang, the Governor Vessel, and the Conception Vessel were found to be the most commonly selected meridians. In the acupoint interconnection network, Feishu (BL13), Fengmen (BL12), Dingchuan (EX-B1), and Dazhui (GV14) were identified as key node acupoints. Association rule mining demonstrated that the combination of Pishu, Shenshu, Feishu, and Dingchuan, as well as that of Feishu, Dazhui, and Fengmen, were potential acupoint combinations that should be selected with priority in asthma treatment. Conclusion: This study provides valuable information regarding the selection of the most effective acupoints and combinations for clinical acupuncture practice and for experimental studies aimed at the prevention and treatment of asthma.
Air pollution presents a significant environmental risk, impacting human health, accelerating climate change, and disrupting ecosystems. The main aim of air pollution research is to pinpoint the most harmful pollutants identified in previous studies and to map regions exposed to high pollution levels. This study introduces a large-scale, high-quality dataset to advance the analysis of PM2.5 pollution and reveal hidden patterns through pattern mining techniques. The dataset covers five years of hourly PM2.5 measurements collected from approximately 1,900 sensors across Japan, sourced from the Ministry of the Environment's Soramame platform. This platform offers hourly pollutant records, downloadable as monthly raw data files. The unorganised raw data files are systematically organised and stored in database tables using an Entity-Relationship (ER) schema. The primary objective of this dataset is to aid in developing and validating pattern mining models, enabling the accurate detection of...

The air pollution data was collected from Japan’s Soramame platform, which provides hourly updates on pollutant levels nationwide. The data files were collected from January 1, 2018, 01:00:00, to April 25, 2023, 22:00:00, covering records from approximately 1,900 sensors stationed in various locations across Japan. These files are initially unorganised in CSV format and require systematic organisation by year, month, time, sensor, and pollutant type. To maintain data integrity, we structured the dataset using an Entity-Relationship (ER) schema within a PostgreSQL database, comprising two main tables: the Sensor table (storing sensor name, ID, address, and location) and the Observations table (recording pollutant types and their values). A detailed step-by-step process is provided in the README, and this organization created a consolidated CSV file containing PM2.5 levels, timestamps, and sensor details.

# AEROS PM2.5 Dataset
The AEROS PM2.5 Dataset provides a comprehensive collection of hourly PM2.5 measurements recorded over a period of five years from sensors located across Japan. This dataset is a valuable resource for studying air quality trends, pollution patterns, and environmental health impacts.
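As a sketch of the two-table ER schema described above (assuming PostgreSQL via psycopg2; column names beyond those stated in the description are hypothetical):

```python
import psycopg2

# Hypothetical connection; the two tables follow the ER schema described above.
conn = psycopg2.connect(dbname="aeros", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sensor (
            sensor_id   INTEGER PRIMARY KEY,
            name        TEXT,
            address     TEXT,
            location    TEXT
        );
        CREATE TABLE IF NOT EXISTS observations (
            sensor_id   INTEGER REFERENCES sensor(sensor_id),
            observed_at TIMESTAMP,
            pollutant   TEXT,   -- e.g. 'PM2.5'
            value       REAL
        );
    """)
```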
FINAL_DATASET.csv: The dataset includes the following columns:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This data set provides mine footprint areas for seven proposed coal mines in the Galilee subregion that were included in surface water modelling.
This data set was used to identify surface water node catchments that were partly or fully occupied by proposed coal mine developments in the Galilee subregion. This information is essential for estimating the changes in surface water flow due to mining operations.
Footprint maps of seven proposed coal mines in the Galilee subregion were obtained from the mine EIS and SEIS reports (see lineage) to produce general outlines for the surface water model.
Bioregional Assessment Programme (2015) Galilee mines footprints. Bioregional Assessment Derived Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/de808d14-b62b-47dd-b12c-4370b6c23b8e.
Derived From Onsite and offsite mine infrastructure for the Carmichael Coal Mine and Rail Project, Adani Mining Pty Ltd 2012
Derived From China Stone Coal Project initial advice statement
Derived From Alpha Coal Project Environmental Impact Statement
Derived From China First Galilee Coal Project Environmental Impact Assessment
Derived From Kevin's Corner Project Environmental Impact Statement
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This paper explores several data mining and time series analysis methods for predicting the magnitude of the largest seismic event in the next year based on the previously recorded seismic events in the same region. The methods are evaluated on a catalog of 9,042 earthquake events, which took place between 01/01/1983 and 31/12/2010 in the area of Israel and its neighboring countries. The data was obtained from the Geophysical Institute of Israel. Each earthquake record in the catalog is associated with one of 33 seismic regions. The data was cleaned by removing foreshocks and aftershocks. In our study, we have focused on the ten most active regions, which account for more than 80% of the total number of earthquakes in the area. The goal is to predict whether the maximum earthquake magnitude in the following year will exceed the median of maximum yearly magnitudes in the same region. Since the analyzed catalog includes only 28 years of complete data, the last five annual records of each region (referring to the years 2006–2010) are kept for testing while using the previous annual records for training. The predictive features are based on the Gutenberg-Richter Ratio as well as on some new seismic indicators based on the moving averages of the number of earthquakes in each area. The new predictive features prove to be much more useful than the indicators traditionally used in the earthquake prediction literature. The most accurate result (AUC = 0.698) is reached by the Multi-Objective Info-Fuzzy Network (M-IFN) algorithm, which takes into account the association between two target variables: the number of earthquakes and the maximum earthquake magnitude during the same year.
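As an illustrative sketch of the target construction described above (not the authors’ code; file and column names are hypothetical):

```python
import pandas as pd

# Hypothetical catalog with region, year, and magnitude columns.
catalog = pd.read_csv("earthquakes.csv")

# Maximum yearly magnitude per region.
yearly_max = (catalog.groupby(["region", "year"])["magnitude"]
                     .max().reset_index())

# Binary target: does next year's maximum exceed the region's median?
yearly_max["median_max"] = yearly_max.groupby("region")["magnitude"].transform("median")
yearly_max["target"] = (yearly_max.groupby("region")["magnitude"].shift(-1)
                        > yearly_max["median_max"]).astype(int)
print(yearly_max.head())
```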
Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
Coal seam gas (CSG) is an unconventional natural gas (UNG) that is extracted from wells via coal seams, and reserves are found in Australia, the USA and the UK. Other UNG include shale and tight gas, which are sourced from different geological formations and utilise similar processes to CSG mining, and are extracted in Canada, Europe, Asia, the Middle East and Australia. In recent decades, UNG extraction has grown exponentially, with hydraulic fracturing or ‘fracking’ occurring across regional and rural landscapes and in close proximity to communities. Whilst major development projects can facilitate employment and other opportunities in surrounding communities through population growth and increased demand for services, there is evidence that negative impacts on health and wellbeing can outweigh any benefits.
Commonly referred to as the ‘resource curse’, when the costs of extraction and exporting natural resources outweigh the economic benefits, the expansion of CSG activity was often met with trepidation from local communities and the broader public. There was uncertainty around the impacts and consequences of rapid development, particularly in the USA and Australia, stemming from a lack of prior experience, mixed messages in the media, perceived lack of governmental support, and little empirical evidence. Presented with the opportunity to address the gap in the literature, this research explores the broader implications of mining activity on surrounding communities, with a focus on CSG and the social determinants of health and wellbeing.
The level of community interaction throughout a project lifecycle is greater in CSG mine settings compared to traditional mining methods (like coal, for example) because of their proximity to communities, and so there is a greater expectation of the mining company to monitor and mitigate impacts on the communities in which they operate. There is emerging evidence that the extractives industry may play a more diverse role in regional communities than previously expected, but the pathways in which they do this in the health sector are not clear. Integral to the provision of health services in regional areas is the integration of services and partnerships – it is common for stakeholders external to the health sector, like transport, police or environmental departments to be involved in the planning and availability of health services. There is a dearth of scientific evidence of the ways in which the extractives industry interacts with the health system in the communities in which they operate; what the costs and benefits of this interaction might be and how the relationship might be optimized to enable long-lasting health improvements.
This is particularly important in mining communities, where health outcomes could fluctuate with the various stages of mining activity, and more so in communities where mining activity is soon to cease, leading to uncertainty and economic downturn.
Objectives
This research was conducted in order to inform the regional and rural health sector, the extractives industry, and communities who are undergoing a period of uncertainty with little peer-reviewed evidence to provide objective direction. The research aims to: respond to the demand for understanding broader public health and wellbeing outcomes of mining beyond direct, physical and biological outcomes; contribute to the growing evidence base around CSG development and potential community-level impacts; and comment on the interaction between stakeholders in the health system and the extractives industry at a local level.
Methods
This thesis has been organised into three parts to meet the stated objectives:
1. Two systematic reviews to synthesise the evidence for broader, indirect health and wellbeing implications at community level associated with mining activity in low, middle and high income countries in order to provide a comprehensive account of how communities may be affected by mining;
2. Synthesis of qualitative data collected via a Health Needs Assessment (HNA) in Queensland, Australia to explore the determinants of health and wellbeing in communities living in proximity to CSG developments in order to strengthen understanding of how community and health services can prepare for fluctuations that might come with a mining boom or bust; and
3. Critically review regional health systems and the interaction between the extractives industry and key stakeholders at a local level in order to compile a set of recommendations that optimise health outcomes for local communities.
Results
Sixteen publications were included in the systematic review of high-income countries, covering studies that took place in the USA, Australia and Canada. The mining activities studied included coal mining and mountain-top mining. There was evidence that mining activity can affect the social, physical and economic environment in which communities live, and these factors can in turn have adverse effects on health and wellbeing if not adequately measured and mitigated. Specific examples of self-reported health implications included increased risk of chronic disease and poor overall health, relationship breakdown, lack of social connectedness, and decreased access to health services.
Twelve publications were included in the systematic review of low- and middle-income countries, covering studies that took place in Ghana, Namibia, South Africa, Tanzania, India, Brazil, Guatemala and French Guiana. Products mined included gold and silver, iron ore and platinum. Mining was perceived to influence health behaviours, employment conditions, livelihoods and socio-political factors, which were linked to poorer health outcomes. Family relationships, mental health and community cohesion were negatively associated with mining activity. High-risk health behaviours, population growth and changes in vector ecology from environmental modification were associated with increased infectious disease prevalence.
The HNA was implemented in four towns in regional Queensland situated in proximity to CSG development. Eleven focus group discussions, nine in-depth interviews, and forty-five key informant interviews (KIIs) with health and community service providers and community members were conducted. Framework analysis was conducted following a recurrent theme that emerged from the qualitative data around health and wellbeing implications of the CSG industry. CSG mining was deemed a rapid development in the otherwise predominantly agricultural, rural communities. With this rapid development came fluctuations in the local economy, population, social structure and environmental conditions. There were perceived direct and indirect effects of CSG activity at an individual and community level, including impacts on alcohol and drug use; family relationships; social capital and mental health; and social connectedness, civic engagement and trust.
Before examining the interaction between the health system and mining sector, it was important to describe the rural health system and its complementary parts. Systems theory underpinned analysis of qualitative data from KIIs to assist in describing the characteristics of the health system and unique influences on its functionality. Results showed that communities are closely interconnected with the health system, and that the rural health systems in the case study were defined by geography, climate and economic fluctuations. Understanding unique system pressures is important for recognising the impact that policy decisions may have on rural health. Decentralisation of decision making, greater flexibility and predictability of programs will assist in health system strengthening in rural areas.
Another key theme emerged from the HNA: the mining sector played a diverse role in health and community service planning and delivery. Key informant transcripts were analysed again using phenomenology theory. Of these, 23 mentioned the presence of CSG or mining activity at least once during the interview without any specific reference to the extractives industry. Mining activity was perceived to influence the ability of service providers to meet demand, recruit and retain staff, and effectively plan and maintain programs. The level of interaction between mining companies with service providers and regulatory bodies varied and was commented on extensively. Several key informants identified pathways for the mining sector to engage with services more effectively, which included strengthening multi-sectoral engagement and enabling transparent, public consultation and evidence-based funding initiatives.
Conclusion
Unconventional natural gas extraction and the implications of mining activity for nearby communities are a subject of major concern internationally. Through the application of core public health theories and methodologies, including the Social Determinants of Health model, complex adaptive systems theory and health needs assessments, this thesis has significantly contributed to the discourse and demonstrated a significant association between mining activity and health.
This thesis sought to strengthen the evidence base of the association between the extractives industry and the social determinants of health of surrounding communities, with a focus on the potential impacts of CSG developments. The hypothesis that there may be broader, direct and indirect impacts on health and wellbeing at an individual or community-level was tested and proven. The secondary aim was to examine the relationship of stakeholders in the local health system with the mining sector, with the intention to develop recommendations that improve measurement, monitoring and response to potential impacts of mining in surrounding communities.
This research established that there are both common and unique health and wellbeing issues experienced by communities living in proximity to mining internationally. Our understanding
This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data.
DETAILED DATA DESCRIPTION
THE INSURANCE COMPANY (TIC) 2000
(c) Sentient Machine Research 2000
DISCLAIMER
This dataset is owned and supplied by the Dutch data mining company Sentient Machine Research, and is based on real-world business data. You are allowed to use this dataset and accompanying information for non-commercial research and education purposes only. It is explicitly not allowed to use this dataset for commercial education or demonstration purposes. For any other use, please contact Peter van der Putten, info@smr.nl.
This dataset has been used in the CoIL Challenge 2000 data mining competition. For papers describing results on this dataset, see the TIC 2000 homepage: http://www.wi.leidenuniv.nl/~putten/library/cc2000/
REFERENCE P. van der Putten and M. van Someren (eds). CoIL Challenge 2000: The Insurance Company Case. Published by Sentient Machine Research, Amsterdam. Also a Leiden Institute of Advanced Computer Science Technical Report 2000-09. June 22, 2000. See http://www.liacs.nl/~putten/library/cc2000/
RELEVANT FILES
tic_2000_train_data.csv: Dataset to train and validate prediction models and build a description (5822 customer records). Each record consists of 86 attributes, containing sociodemographic data (attribute 1-43) and product ownership (attributes 44-86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Attribute 86, "CARAVAN: Number of mobile home policies", is the target variable.
tic_2000_eval_data.csv: Dataset for predictions (4000 customer records). It has the same format as the training data (tic_2000_train_data.csv), only the target is missing. Participants are supposed to return the list of predicted targets only. All datasets are in CSV format. The meaning of the attributes and attribute values is given in dictionary.csv.
tic_2000_target_data.csv: Targets for the evaluation set.
dictionary.txt: Data description with descriptions of the numerically labeled categories. It contains columnar description data and the labels of the dummy/label encoding.
Original Task description Link: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/problem.html UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29
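A minimal sketch of the prediction task (assuming pandas and scikit-learn; the "CARAVAN" column name follows the attribute description above, and the model choice is illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train on the 5822-record training set; attribute 86 (CARAVAN) is the target.
train = pd.read_csv("tic_2000_train_data.csv")
X, y = train.drop(columns=["CARAVAN"]), train["CARAVAN"]

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Predict mobile home policy ownership for the 4000 evaluation records.
eval_data = pd.read_csv("tic_2000_eval_data.csv")
print(model.predict(eval_data[X.columns])[:20])
```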