This chapter presents theoretical and practical aspects associated with the implementation of a combined model-based/data-driven approach for failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: the prediction step, based on the process model, and the update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as shown in a case study in which feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into how the system reacts to changes in its input signals, in terms of its predicted RUL. The method is able to handle non-Gaussian PDFs, since it includes concepts such as nonlinear state estimation and confidence intervals in its formulation.
Real data from a fault-seeded test showed that the proposed framework was able to anticipate modifications of the system input to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction that the system required. Future work will focus on the development and testing of similar strategies using different input-output uncertainty metrics.
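The prediction/update cycle described above can be sketched as a minimal bootstrap particle filter. The one-dimensional state model, noise levels, and measurements below are illustrative stand-ins of mine, not the chapter's actual crack-growth model:

```python
import math
import random

def predict(particles, state_model, process_noise):
    # Prediction step: propagate each particle through the process model
    # and add process noise (the a priori state estimate).
    return [state_model(x) + random.gauss(0.0, process_noise) for x in particles]

def update(particles, measurement, meas_noise):
    # Update step: weight each particle by the likelihood of the new
    # measurement, then resample (the a posteriori state estimate).
    weights = [math.exp(-0.5 * ((measurement - x) / meas_noise) ** 2)
               for x in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    return random.choices(particles, weights=weights, k=len(particles))

random.seed(0)
# Hypothetical fault indicator near 1.0, growing ~2% per step
particles = [random.gauss(1.0, 0.1) for _ in range(1000)]
for z in (1.05, 1.10, 1.16):                      # synthetic measurements
    particles = predict(particles, lambda x: 1.02 * x, 0.02)
    particles = update(particles, z, 0.05)

estimate = sum(particles) / len(particles)        # mean of the state PDF estimate
```

Long-term RUL prediction then amounts to running only the prediction step forward from the current particle population until each trajectory crosses the hazard zone, which yields the RUL PDF.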
Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods that leverage the power of multiple models to achieve better prediction accuracy than any of the individual models could on their own. The basic goal when designing an ensemble is the same as when establishing a committee of people: each member of the committee should be as competent as possible, but the members should be complementary to one another. If the members are not complementary, i.e., if they always agree, then the committee is unnecessary---any one member is sufficient. If the members are complementary, then when one or a few members make an error, the probability is high that the remaining members can correct this error. Research in ensemble methods has largely revolved around designing ensembles consisting of competent yet complementary models.
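The committee intuition can be made quantitative: if the members are complementary enough to err independently, a majority vote is more accurate than any single member. A small sketch (the member accuracy of 0.7 and committee size of 11 are arbitrary illustrations):

```python
from math import comb

def majority_vote_accuracy(n_members, p):
    # Probability that a strict majority of n independent members,
    # each correct with probability p, votes for the right answer.
    majority = n_members // 2 + 1
    return sum(comb(n_members, k) * p ** k * (1 - p) ** (n_members - k)
               for k in range(majority, n_members + 1))

single = 0.7                                    # one member's accuracy
committee = majority_vote_accuracy(11, single)  # 11 complementary members
```

With eleven independent members the vote is right roughly 92% of the time; if the members always agreed, the committee would stay at 0.7, which is exactly the point about complementarity above.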
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
August 2019, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
The data is extracted from the Web of Science® [1]. You may not copy or distribute this data in whole or in part without the written consent of Clarivate Analytics.
Getting Started
This text provides background information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created to be used in future work on the quantification of the sense of research texts. One of the goals of publishing the data is to make it available for further analysis and use in Natural Language Processing projects.
LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [1]. Each document contains a title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018.
Each document in the corpus contains the following parts:
1. Authors: the list of authors of the paper.
2. Title: the title of the paper.
3. Abstract: the abstract of the paper.
4. Categories: one or more categories from the list of categories [2]. The full list of categories is presented in the file 'List_of_Categories.txt'.
5. Research Areas: one or more research areas from the list of research areas [3]. The full list of research areas is presented in the file 'List_of_Research_Areas.txt'.
6. Total Times Cited: the number of times the paper was cited by other items from all databases within the Web of Science platform [4].
7. Times Cited in Core Collection: the total number of times the paper was cited by other papers within the WoS Core Collection [4].
We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,824. All documents in LSC have a non-empty abstract, title, categories, research areas, and times cited in WoS databases. There are 119 documents with an empty authors list; we did not exclude these documents.
Data Processing
This section describes the steps required for the LSC to be collected, cleaned, and made available to researchers. Processing the data consists of six main steps.
Step 1: Downloading the Data Online. This is the step of collecting the dataset online. It is done manually by exporting documents as tab-delimited files. All downloaded documents are available online.
Step 2: Importing the Dataset to R. This is the process of converting the collection to RData format for processing. The LSC was collected as TXT files; all documents are imported into R.
Step 3: Cleaning the Data of Documents with an Empty Abstract or without a Category. Not all papers in the collection have an abstract and categories. As our research is based on the analysis of abstracts and categories, inaccurate documents were detected and removed first: all documents with empty abstracts and all documents without categories are removed.
Step 4: Identification and Correction of Concatenated Words in Abstracts. Traditionally, abstracts are written as an executive summary in one paragraph of continuous writing, known as an 'unstructured abstract'. However, medicine-related publications in particular use 'structured abstracts', which are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates such section headings with the first word of the section. As a result, some structured abstracts in the LSC require an additional correction step to split these concatenated words; for instance, we observe words such as 'ConclusionHigher' and 'ConclusionsRT' in the corpus. The detection and identification of concatenated words cannot be fully automated: human intervention is needed to identify the possible section headings. We note that we only consider concatenated words in section headings, as it is not possible to detect all concatenated words without deep knowledge of the research areas. Such words were identified by sampling medicine-related publications. The section headings identified in structured abstracts are listed in List 1.
List 1: Headings of sections identified in structured abstracts
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy
All words containing the headings in List 1 are detected in the entire corpus, and each such word is split into two words. For instance, the word 'ConclusionHigher' is split into 'Conclusion' and 'Higher'.
Step 5: Extracting (Sub-setting) the Data Based on the Lengths of Abstracts. After the correction of concatenated words is completed, the lengths of the abstracts are calculated. 'Length' indicates the total number of words in the text, calculated by the same rule as Microsoft Word's 'word count' [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words; however, word limits vary from journal to journal. For instance, the Journal of Vascular Surgery recommends that 'Clinical and basic research studies must include a structured abstract of 400 words or less' [7]. In LSC, the length of abstracts varies from 1 to 3,805 words. We decided to limit the length of abstracts to between 30 and 500 words, in order to study documents with abstracts of typical length and to avoid the effect of length on the analysis. Documents containing fewer than 30 or more than 500 words in their abstracts are removed.
Step 6: Saving the Dataset in CSV Format. The corrected and extracted documents are saved into 36 CSV files. The structure of the files is described in the following section.
The Structure of Fields in CSV Files
In the CSV files, the information is organised with one record per line; the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in separate fields.
To access the LSC for research purposes, please email ns433@le.ac.uk.
References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] American Psychological Association, Publication Manual. Washington, DC: American Psychological Association, 1983.
[7] P. Gloviczki and P. F. Lawrence, "Information for authors," Journal of Vascular Surgery, vol. 65, no. 1, pp. A16-A22, 2017.
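The heading-splitting of Step 4 can be sketched with a regular expression: match a known heading immediately followed by a capitalised word and insert a space. The heading list below is an abridged sample of List 1, not the full set used to build the corpus:

```python
import re

# Abridged sample of the List 1 headings (the corpus build uses the full list)
HEADINGS = ["Background", "Conclusion", "Conclusions", "Results", "Methods"]

# Longest alternatives first, so "Conclusions" wins over "Conclusion"
_PATTERN = re.compile(r"\b(" + "|".join(sorted(HEADINGS, key=len, reverse=True))
                      + r")(?=[A-Z0-9])")

def split_headings(text):
    # Insert a space between a section heading and the word fused onto it
    return _PATTERN.sub(r"\1 ", text)

fixed = split_headings("ConclusionHigher doses were effective.")
```

The lookahead `(?=[A-Z0-9])` keeps ordinary occurrences of the headings (followed by a space or lowercase text) untouched, which matches the description's observation that only fused headings need splitting.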
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LScDC (Leicester Scientific Dictionary-Core)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.
[Version 3] The third version of LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary), Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. The files provided with this description are also the same as described for LScDC Version 2. The numbers of words in the third versions of LScD and LScDC are summarised below.
LScD (v3): 972,060 words
LScDC (v3): 103,998 words
* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2
[Version 2] Getting Started
This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus); see [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing them, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created to be used in future work on the quantification of the sense of research texts.
The objective of sub-setting the LScD is to discard words which appear too rarely in the corpus. In text mining algorithms, the use of an enormous number of words challenges both the performance and the accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Rarely occurring words are not useful for discriminating texts in large corpora, as rare words are likely to be non-informative signals (noise) and redundant in the collection of texts. The selection of relevant words also holds out the possibility of more effective and faster operation of text mining algorithms. To build the LScDC, we decided on the following process on the LScD: removing words that appear in no more than 10 documents (
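The sub-setting rule can be sketched as document-frequency filtering: count, for each word, how many documents contain it, and keep only words above the threshold. A minimal illustration (the 10-document rule follows the description; the toy corpus and the `core_words` helper are mine):

```python
from collections import Counter

def core_words(documents, min_docs):
    # Document frequency: count each word at most once per document
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    # Keep words appearing in at least `min_docs` documents
    return {w for w, df in doc_freq.items() if df >= min_docs}

# Toy corpus; the LScDC rule corresponds to min_docs=11
# (drop words appearing in no more than 10 documents)
core = core_words(["data mining", "data science", "pure noise"], min_docs=2)
```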
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sweden Index: SSE: Basic Materials: Industrial Metals and Mining data was reported at 650.370 30Jun2011=1000 in Nov 2018. This records a decrease from the previous number of 723.550 30Jun2011=1000 for Oct 2018. Sweden Index: SSE: Basic Materials: Industrial Metals and Mining data is updated monthly, averaging 689.180 30Jun2011=1000 from Jan 2000 (Median) to Nov 2018, with 227 observations. The data reached an all-time high of 2,256.910 30Jun2011=1000 in Jun 2007 and a record low of 318.020 30Jun2011=1000 in Jan 2016. Sweden Index: SSE: Basic Materials: Industrial Metals and Mining data remains active status in CEIC and is reported by Stockholm Stock Exchange. The data is categorized under Global Database’s Sweden – Table SE.Z001: OMX Stockholm Stock Exchange: Index.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource collects teaching materials originally created for the in-person course 'GEOSC/GEOG 497 – Data Mining in Environmental Sciences' at Penn State University (co-taught by Tao Wen, Susan Brantley, and Alan Taylor), and then refined/revised by Tao Wen for use in the online teaching module 'Data Science in Earth and Environmental Sciences' hosted on the NSF-sponsored HydroLearn platform.
This resource includes both R Notebooks and Python Jupyter Notebooks to teach the basics of R and Python coding, data analysis and data visualization, as well as building machine learning models in both programming languages by using authentic research data and questions. All of these R/Python scripts can be executed either on the CUAHSI JupyterHub or on your local machine.
This resource is shared under the CC-BY license. Please contact the creator Tao Wen at Syracuse University (twen08@syr.edu) for any questions you have about this resource. If you identify any errors in the files, please contact the creator.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Data organization and data mining represent one of the main challenges for modern high-throughput technologies in pharmaceutical and medicinal chemistry. The presented open-source documentation and analysis system provides an integrated solution (tutorial, setup protocol, sources, executables) aimed at substituting the traditionally used lab book. The data management solution incorporates detailed information about the processing of the gels and the experimental conditions used, and includes basic data analysis facilities which can easily be extended. The sample database and user interface are available free of charge under the GNU license from http://webber.physik.uni-freiburg.de/~fallerd/tutorial.htm.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The tabular and visual dataset focuses on South African basic education and provides insights into the distribution of schools and basic population statistics across the country. The tabular and visual data are stratified across different quintiles for each provincial and district boundary. The quintile system is used by the South African government to classify schools based on their level of socio-economic disadvantage, with quintile 1 being the most disadvantaged and quintile 5 the least disadvantaged. The data was joined by combining information extracted from the Department of Basic Education with StatsSA population census data. Thereafter, all tabular and geolocated data were transformed into maps using GIS software and the Python integrated development environment. The dataset includes information on the number of schools and students in each quintile, as well as the population density in each area. The data is displayed through a combination of charts, maps, and tables, allowing for easy analysis and interpretation of the information.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "The association of serum vitamin K2 levels with Parkinson's disease: from basic case-control study to big data mining analysis".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Given that metals, minerals and energy resources extracted through mining are fundamental to human society, it follows that accurate data describing mine production are equally important. Although national statistical sources often exist, these typically include data only for metals (e.g., gold), minerals (e.g., iron ore) or energy resources (e.g., coal). To date, no study has compiled a national mine production data set that includes basic mining data such as ore processed, grades, extracted products (e.g., metals, concentrates, saleable ore) and waste rock. These data are crucial for geological assessments of mineable resources, environmental impacts, and material flows (including losses during mining, smelting-refining, use and disposal or recycling), as well as for facilitating more quantitative assessments of critical mineral potential (including possible extraction from tailings and/or waste rock left by mining). This data set meets these needs for Australia, providing a world-first, comprehensive review of a national mining industry and an exemplar of what can be achieved for other countries with mining industry sectors.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset from a retailer; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and provide customers with itemset suggestions, so we can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association rule mining is most often used when you want to discover associations between different objects in a set. It works well when you are looking for frequent patterns in a transaction database: it can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support/P(mouse) = 0.08/0.10 = 0.8 - lift = confidence/P(mat) = 0.8/0.09 ≈ 8.9 This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
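These three metrics can be computed directly from transaction counts. A minimal Python sketch with toy data mirroring the 100-customer example (item names and the `rule_metrics` helper are illustrative; note that confidence is P(both)/P(antecedent) and lift is confidence/P(consequent)):

```python
def rule_metrics(transactions, antecedent, consequent):
    n = len(transactions)
    p_a = sum(1 for t in transactions if antecedent in t) / n
    p_c = sum(1 for t in transactions if consequent in t) / n
    support = sum(1 for t in transactions
                  if antecedent in t and consequent in t) / n  # P(A and C)
    confidence = support / p_a                                 # P(C | A)
    lift = confidence / p_c                                    # vs. independence
    return support, confidence, lift

# 100 customers: 10 bought a mouse, 9 a mouse mat, 8 bought both
transactions = ([{"mouse", "mat"}] * 8 + [{"mouse"}] * 2
                + [{"mat"}] * 1 + [set()] * 89)
support, confidence, lift = rule_metrics(transactions, "mouse", "mat")
```

A lift well above 1, as here, means the two items co-occur far more often than they would if purchases were independent, which is what makes the rule interesting to the retailer.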
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries. Below I briefly describe each library.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in "Fake News Detection on Social Media: A Data Mining Perspective". We also perform a detailed analysis of the FakeNewsNet dataset, and build a fake news detection model on this dataset in "Exploiting Tri-Relationship for Fake News Detection".
A JSON version of this dataset is available on GitHub. The new version of this dataset, described in the FakeNewsNet paper, will be published soon; you can also email the authors for more information.
It includes all the fake news articles, with the news content attributes as follows:
It includes the social engagements of fake news articles from Twitter. We extract profiles, posts and social network information for all relevant users.
If you use this dataset, please cite the following papers:
@article{shu2017fake,
title={Fake News Detection on Social Media: A Data Mining Perspective},
author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan},
journal={ACM SIGKDD Explorations Newsletter},
volume={19},
number={1},
pages={22--36},
year={2017},
publisher={ACM}
}
@article{shu2017exploiting,
title={Exploiting Tri-Relationship for Fake News Detection},
author={Shu, Kai and Wang, Suhang and Liu, Huan},
journal={arXiv preprint arXiv:1712.07709},
year={2017}
}
@article{shu2018fakenewsnet,
title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media},
author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan},
journal={arXiv preprint arXiv:1809.01286},
year={2018}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iran GDP: Basic Prices: Non Oil: Industries & Mining: Mining data was reported at 103,776.200 IRR bn in 2018. This records an increase from the previous number of 86,721.000 IRR bn for 2017. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Mining data is updated yearly, averaging 76.902 IRR bn from Mar 1960 (Median) to 2018, with 59 observations. The data reached an all-time high of 103,776.200 IRR bn in 2018 and a record low of 0.837 IRR bn in 1962. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Mining data remains active status in CEIC and is reported by Central Bank of the Islamic Republic of Iran. The data is categorized under Global Database’s Iran – Table IR.A012: GDP: Basic Price: by Industry: Current Price: Annual.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of basic properties of empirical distributions that are interesting for data mining.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iran GDP: Basic Prices: Non Oil: Industries & Mining: Construction data was reported at 760,509.700 IRR bn in 2018. This records an increase from the previous number of 661,502.000 IRR bn for 2017. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Construction data is updated yearly, averaging 1,753.102 IRR bn from Mar 1960 (Median) to 2018, with 59 observations. The data reached an all-time high of 850,897.696 IRR bn in 2015 and a record low of 10.582 IRR bn in 1960. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Construction data remains active status in CEIC and is reported by Central Bank of the Islamic Republic of Iran. The data is categorized under Global Database’s Iran – Table IR.A012: GDP: Basic Price: by Industry: Current Price: Annual.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Iran GDP: Basic Prices: Non Oil: Industries & Mining: Manufacturing data was reported at 1,837,174.000 IRR bn in 2018. This records an increase from the previous number of 1,564,416.000 IRR bn for 2017. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Manufacturing data is updated yearly, averaging 2,477.015 IRR bn from Mar 1960 (Median) to 2018, with 59 observations. The data reached an all-time high of 1,837,174.000 IRR bn in 2018 and a record low of 23.275 IRR bn in 1960. Iran GDP: Basic Prices: Non Oil: Industries & Mining: Manufacturing data remains active status in CEIC and is reported by Central Bank of the Islamic Republic of Iran. The data is categorized under Global Database’s Iran – Table IR.A012: GDP: Basic Price: by Industry: Current Price: Annual.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Oman GDP: at Basic Prices: NP: Industry: Mining and Quarrying: Others data was reported at 87.700 OMR mn in 2015. This records an increase from the previous number of 79.500 OMR mn for 2014. Oman GDP: at Basic Prices: NP: Industry: Mining and Quarrying: Others data is updated yearly, averaging 20.800 OMR mn from Dec 1990 (Median) to 2015, with 26 observations. The data reached an all-time high of 87.700 OMR mn in 2015 and a record low of 4.000 OMR mn in 1990. Oman GDP: at Basic Prices: NP: Industry: Mining and Quarrying: Others data remains active status in CEIC and is reported by National Center for Statistics and Information. The data is categorized under Global Database’s Oman – Table OM.A006: GDP: by Industry: Current Price: Annual.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The peer-reviewed publication for this dataset has been published in Data & Policy, and can be accessed here: https://arxiv.org/abs/2406.16527 Please cite this when using the dataset.
This dataset has been produced as a result of the “Systematic Review of Outcomes Contracts using Machine Learning” (SyROCCo) project. The goal of the project was to apply machine learning techniques to a systematic review process of outcomes-based contracting (OBC). The purpose of the systematic review was to gather and curate, for the first time, all of the existing evidence on OBC. We aimed to map the current state of the evidence, synthesise key findings from across the published studies, and provide accessible insights to our policymaker and practitioner audiences.
OBC is a model for the provision of public services wherein a service provider receives payment, in-part or in-full, only upon the achievement of pre-agreed outcomes.
The data used to conduct the review consists of 1,952 individual studies of OBC. They include peer reviewed journal articles, book chapters, doctoral dissertations, and assorted ‘grey literature’ - that is, reports and evaluations produced outside of traditional academic publications. Those studies were manually filtered by experts on the topic from an initial search of over 11,000 results.
The full text of the articles was obtained from their PDF versions and preprocessed. This involved text format normalisation, removing acknowledgements and bibliographic references.
The corpus was then linked to the INDIGO Impact Bond Dataset: projects and organisations listed in that dataset were searched for in the article corpus to relate the two datasets.
Other types of information identified in the texts were: 1) financial mechanisms (the type of outcomes-based instrument), using a list of terms related to those financial mechanisms based on prior discussions with a policy advisory group (Picker et al., 2021); 2) references to the 17 Sustainable Development Goals (SDGs) defined by the United Nations General Assembly in the 2030 Agenda; and 3) country names mentioned in each article and the income levels of those countries, according to the World Bank's World Classification of Income Levels 2022.
Three machine learning techniques were applied to the corpus:
Policy area identification. A query-driven topic model (QDTM) (Fang et al., 2021) was used to determine the probability of an article belonging to different policy areas (health, education, homelessness, criminal justice, employment and training, child and family welfare, and agriculture and environment), using the full text of the article as input. The QDTM is a semi-supervised machine learning algorithm that allows users to specify their prior knowledge in the form of simple queries (words or phrases) and returns query-related topics.
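The QDTM itself is a full semi-supervised topic model; as a rough illustration only, the sketch below shows the general idea of turning per-sector seed queries into per-article probabilities. The seed words and the scoring rule are invented for this sketch and are not the published algorithm.

```python
# Toy illustration of query-driven sector scoring (NOT the actual QDTM):
# count seed-query words per sector, then normalise counts to probabilities.
from collections import Counter

SECTOR_QUERIES = {  # hypothetical seed queries, one per policy area
    "health": ["health", "hospital", "patient"],
    "education": ["education", "school", "student"],
    "homelessness": ["homelessness", "housing", "shelter"],
}

def sector_probabilities(text):
    """Score each sector by seed-word counts, normalised to sum to 1."""
    words = Counter(text.lower().split())
    scores = {s: sum(words[w] for w in q) for s, q in SECTOR_QUERIES.items()}
    total = sum(scores.values()) or 1  # avoid division by zero
    return {s: v / total for s, v in scores.items()}
```

A real QDTM additionally infers latent topics around the queries rather than matching seed words literally.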
Named Entity Recognition. Three named entity recognition models were applied: the “en_core_web_lg” and “en_core_web_trf” models from the Python package ‘spaCy’ and the “ner-ontonotes-large” English model from ‘Flair’. “en_core_web_trf” is based on the RoBERTa-base transformer model; ‘Flair’ is a bi-LSTM character-based model. All models were trained on the “OntoNotes 5” data source (Marcus et al., 2011) and are able to identify geographical locations, organisation names, and laws and regulations. An ensemble method was adopted, treating the entities that appear simultaneously in the results of any two models as correct entities.
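The ensemble rule above (accept an entity when any two models agree on it) can be sketched as follows; the model outputs here are invented stand-ins, since actually running spaCy and Flair is outside the scope of this sketch.

```python
# Ensemble vote: an entity is accepted when it appears in the output of
# at least two of the NER models (pairwise intersection, then union).
def ensemble_entities(*model_outputs):
    """Keep entities found by at least two models."""
    accepted = set()
    for i, a in enumerate(model_outputs):
        for b in model_outputs[i + 1:]:
            accepted |= set(a) & set(b)
    return accepted

# Hypothetical outputs from the three models for one article:
lg = {"World Bank", "London", "Oman"}
trf = {"World Bank", "London"}
flair = {"World Bank", "Paris"}
# "World Bank" is found by all three, "London" by two, the rest by one each.
```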
Semantic text similarity. We calculated a similarity score between each pair of articles. The 10,000 most frequently mentioned words were first extracted from all the articles’ titles and abstracts, and the text vectorisation technique TF*IDF was applied to convert each article’s abstract into an importance-score vector over these words. Using these numerical vectors, the cosine similarity between different articles was calculated.
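A minimal sketch of this similarity step, using scikit-learn's `TfidfVectorizer` in place of the project's own 10,000-word vocabulary; the toy abstracts are invented for illustration.

```python
# TF*IDF vectors over abstracts, then pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstracts = [
    "social impact bonds for homelessness services",
    "impact bonds and outcomes contracts for homelessness",
    "mining and quarrying output statistics",
]
vectors = TfidfVectorizer().fit_transform(abstracts)  # importance-score vectors
sim = cosine_similarity(vectors)                      # pairwise cosine matrix
```

The first two abstracts share several terms, so their similarity score is higher than that between the first and the third.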
The SyROCCo Dataset includes references to the 1,952 studies of OBC mentioned above, together with the results of the processing steps and techniques described. Each entry of the dataset contains the following information.
The basic information for each document comprises its title, abstract, authors, publication year, DOI, and Article ID:
Title: Title of the document.
Abstract: Text of the abstract.
Authors: Authors of a study.
Published Years: Publication year(s) of the study.
DOI: DOI link of a study.
Article ID: ID of the document selected during the screening process.
The probability of a study belonging to each policy area:
policy_sector_health: The probability that the study belongs to the policy sector “health”.
policy_sector_education: The probability that the study belongs to the policy sector “education”.
policy_sector_homelessness: The probability that the study belongs to the policy sector “homelessness”.
policy_sector_criminal: The probability that the study belongs to the policy sector “criminal justice”.
policy_sector_employment: The probability that the study belongs to the policy sector “employment and training”.
policy_sector_child: The probability that the study belongs to the policy sector “child and family welfare”.
policy_sector_environment: The probability that the study belongs to the policy sector “agriculture and environment”.
Other types of information such as financial mechanisms, Sustainable Development Goals, and different types of named entities:
financial_mechanisms: Financial mechanisms mentioned in a study.
top_financial_mechanisms: The financial mechanisms mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
top_sgds: Sustainable Development Goals mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
top_countries: Country names mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions. This entry is also used to determine the income level of the mentioned countries.
top_Project: Indigo projects mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
top_GPE: Geographical locations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
top_LAW: Relevant laws and regulations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
top_ORG: Organisations mentioned in a study are listed in descending order according to the number of times they are mentioned, and include the corresponding context of the mentions.
This chapter presents theoretical and practical aspects of the implementation of a combined model-based/data-driven approach to failure prognostics based on particle filtering algorithms, in which the current estimate of the state PDF is used to determine the operating condition of the system and to predict the progression of a fault indicator, given a dynamic state model and a set of process measurements. In this approach, the task of estimating the current value of the fault indicator, as well as other important changing parameters in the environment, involves two basic steps: a prediction step, based on the process model, and an update step, which incorporates the new measurement into the a priori state estimate. This framework allows the probability of failure at future time instants (the RUL PDF) to be estimated in real time, providing time-to-failure (TTF) expectations, statistical confidence intervals, and long-term predictions, using for this purpose empirical knowledge about critical conditions for the system (also referred to as hazard zones). This information is of paramount significance for improving system reliability and the cost-effective operation of critical assets, as shown in a case study in which feedback correction strategies (based on uncertainty measures) were implemented to lengthen the RUL of a rotorcraft transmission system with propagating fatigue cracks on a critical component. Although the feedback loop is implemented using simple linear relationships, it provides quick insight into how the system's predicted RUL reacts to changes in its input signals. The method can manage non-Gaussian PDFs, since it incorporates nonlinear state estimation and confidence intervals in its formulation.
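The two-step cycle described above (prediction via the process model, update via the new measurement) can be sketched as a minimal bootstrap particle filter. The fault-growth model, noise levels, and measurement value below are invented for illustration; they are not the chapter's actual model.

```python
# Minimal particle-filter cycle: predict with the state model, then reweight
# particles by the likelihood of the new measurement (hypothetical numbers).
import numpy as np

rng = np.random.default_rng(0)
N = 1000
particles = rng.normal(1.0, 0.1, N)  # initial estimate of the fault indicator
weights = np.full(N, 1.0 / N)

def predict(particles, growth=0.05, q=0.02):
    """Prediction step: propagate each particle through the process model."""
    return particles + growth + rng.normal(0.0, q, particles.size)

def update(particles, weights, z, r=0.05):
    """Update step: reweight particles by the Gaussian measurement likelihood."""
    lik = np.exp(-0.5 * ((z - particles) / r) ** 2)
    w = weights * lik
    return w / w.sum()  # normalise so the weights form a discrete PDF

particles = predict(particles)                 # a priori state estimate
weights = update(particles, weights, z=1.06)   # incorporate new measurement
estimate = np.sum(weights * particles)         # posterior mean of the indicator
```

Long-term RUL prediction would repeat the prediction step (without updates) until the particle cloud crosses the empirically defined hazard zone, yielding the RUL PDF.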
Real data from a fault-seeded test showed that the proposed framework was able to anticipate the modifications to the system input needed to lengthen its RUL. Results of this test indicate that the method successfully suggested the correction the system required. In this sense, future work will focus on the development and testing of similar strategies using different input-output uncertainty metrics.