License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Improving the accuracy of predicting future values from past and current observations has been pursued by enhancing prediction methods, combining those methods, or performing data pre-processing. In this paper, another approach is taken, namely increasing the number of inputs in the dataset. This approach is especially useful for short time series. By filling in the in-between values of the time series, the size of the training set can be increased, thus increasing the generalization capability of the predictor. The algorithm used for prediction is a neural network, as it is widely used in the literature for time series tasks. For comparison, Support Vector Regression is also employed. The datasets used in the experiment are the frequencies of USPTO patents and PubMed scientific publications in the field of health, namely on Apnea, Arrhythmia, and Sleep Stages. Another time series, designated for the NN3 Competition and drawn from the field of transportation, is also used for benchmarking. The experimental results show that prediction performance can be significantly increased by filling in in-between data in the time series. Furthermore, the use of detrending and deseasonalization, which separates the data into trend, seasonal, and stationary components, also improves the prediction performance on both the original and the filled datasets. The optimal increase in this experiment is to about five times the length of the original dataset.
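The interpolation scheme is not spelled out in the abstract; a minimal sketch of the idea, assuming simple linear interpolation between consecutive observations (all numbers hypothetical), could look like this:

```python
import numpy as np

def fill_in_between(series, factor=5):
    """Linearly interpolate extra points between consecutive observations,
    expanding a short series to roughly `factor` times its original length."""
    x_old = np.arange(len(series))
    x_new = np.linspace(0, len(series) - 1, factor * (len(series) - 1) + 1)
    return np.interp(x_new, x_old, series)

# Example: yearly publication counts (toy numbers), densified before building
# sliding-window training samples for the predictor.
yearly_counts = np.array([12, 15, 21, 19, 30, 41, 38, 55], dtype=float)
dense = fill_in_between(yearly_counts, factor=5)
print(len(yearly_counts), "->", len(dense))
```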
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). EDA comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review (SLR) of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes that can be found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features, such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to be used when tuning the explainable models (a minimal sketch of this step follows the list).
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (Correlations and Association Rules) and toward categorizing the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
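The Transformation stage describes PCA for both 2-D visualization and choosing the cluster count; a minimal scikit-learn sketch of that step (the feature values below are hypothetical stand-ins for the 35 real attributes) might look like:

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical nominal features extracted from papers (venue, learning type, metric).
papers = [["ICSE", "supervised", "Accuracy"],
          ["FSE", "unsupervised", "MRR"],
          ["ICSE", "supervised", "BLEU Score"],
          ["ASE", "reinforcement", "Other Metrics"],
          ["ICSE", "supervised", "Accuracy"],
          ["FSE", "supervised", "F1 Measure"]]

# One-hot encode the nominal attributes, then project onto 2 components.
X = OneHotEncoder().fit_transform(papers).toarray()
X2 = PCA(n_components=2).fit_transform(X)

# Pick the cluster count where the within-cluster variance (inertia) stops dropping sharply.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X2).inertia_
            for k in range(1, 5)}
print(inertias)
```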
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated with it. For example, given that an author used Supervised Learning, we can conclude, with a certain Support and Confidence, that their approach is irreproducible.
Support = the number of occurrences in which the statement is true, divided by the total number of statements. Confidence = the support of the statement divided by the number of occurrences of the premise.
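As a toy illustration of these two definitions for the rule "Supervised Learning -> Irreproducible" (the records below are hypothetical):

```python
# Each record stands for one extracted paper.
records = [
    {"learning": "supervised",   "reproducible": False},
    {"learning": "supervised",   "reproducible": False},
    {"learning": "supervised",   "reproducible": True},
    {"learning": "unsupervised", "reproducible": True},
]

premise   = [r["learning"] == "supervised" for r in records]          # premise holds
statement = [p and not r["reproducible"] for p, r in zip(premise, records)]  # full rule holds

support = sum(statement) / len(records)      # 2 / 4 = 0.50
confidence = sum(statement) / sum(premise)   # 2 / 3 ≈ 0.67
print(support, confidence)
```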
According to Cognitive Market Research, the global Data Mining Software market size will be USD XX million in 2025. It will expand at a compound annual growth rate (CAGR) of XX% from 2025 to 2031.
North America held the major market share, at more than XX% of global revenue, with a market size of USD XX million in 2025, and will grow at a CAGR of XX% from 2025 to 2031. Europe, Asia Pacific, Latin America, and the Middle East and Africa each accounted for a share of XX% of global revenue, with market sizes of USD XX million in 2025, and will likewise grow at CAGRs of XX% from 2025 to 2031.
KEY DRIVERS
Increasing Focus on Customer Satisfaction to Drive Data Mining Software Market Growth
In today’s hyper-competitive and digitally connected marketplace, customer satisfaction has emerged as a critical factor for business sustainability and growth. The growing focus on enhancing customer satisfaction is proving to be a significant driver in the expansion of the data mining software market. Organizations are increasingly leveraging data mining tools to sift through vast volumes of customer data—ranging from transactional records and website activity to social media engagement and call center logs—to uncover insights that directly influence customer experience strategies. Data mining software empowers companies to analyze customer behavior patterns, identify dissatisfaction triggers, and predict future preferences. Through techniques such as classification, clustering, and association rule mining, businesses can break down large datasets to understand what customers want, what they are likely to purchase next, and how they feel about the brand. These insights not only help in refining customer service but also in shaping product development, pricing strategies, and promotional campaigns.
For instance, Netflix uses data mining to recommend personalized content by analyzing a user's viewing history, ratings, and preferences. This has led to increased user engagement and retention, highlighting how a deep understanding of customer preferences—made possible through data mining—can translate into competitive advantage.
Moreover, companies are increasingly using these tools to create highly targeted and customer-specific marketing campaigns. By mining data from e-commerce transactions, browsing behavior, and demographic profiles, brands can tailor their offerings and communications to suit individual customer segments. For instance, Amazon continuously mines customer purchasing and browsing data to deliver personalized product recommendations, tailored promotions, and timely follow-ups. This not only enhances customer satisfaction but also significantly boosts conversion rates and average order value. According to a report by McKinsey, personalization can deliver five to eight times the ROI on marketing spend and lift sales by 10% or more—a powerful incentive for companies to adopt data mining software as part of their customer experience toolkit. (Source: https://www.mckinsey.com/capabilities/growth-marketing-and-sales/our-insights/personalizing-at-scale#/)
The utility of data mining tools extends beyond e-commerce and streaming platforms. In the banking and financial services industry, for example, institutions use data mining to analyze customer feedback, call center transcripts, and usage data to detect pain points and improve service delivery. Bank of America, for instance, utilizes data mining and predictive analytics to monitor customer interactions and provide proactive service suggestions or fraud alerts, significantly improving user satisfaction and trust. (Source: https://futuredigitalfinance.wbresearch.com/blog/bank-of-americas-erica-client-interactions-future-ai-in-banking) Similarly, telecom companies like Vodafone use data mining to understand customer churn behavior and implement retention strategies based on insights drawn from service usage patterns and complaint histories. In addition to p...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data, programs, results, and analysis software for the paper "Comparison of 14 different families of classification algorithms on 115 binary data sets" https://arxiv.org/abs/1606.00930
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: The seed sector faces several challenges in ensuring quick and accurate decision making when working with large amounts of data on the physiological quality of seed lots, which makes the process time-consuming and inefficient. Thus, artificial intelligence (AI) emerges as a new technological option in the seed sector to solve database problems in the post-harvest stages. This study aims to use machine learning to classify maize seed lots. Data were obtained from eight maize seed crops from a private company. These data were mined using the following classifiers: J48 (DecisionTree), RandomForest, CVR (ClassificationViaRegression), IBk (lazy.IBk), MLP (MultilayerPerceptron), and NaiveBayes. Ten-fold cross-validation was used for evaluation, with the data set, including training and testing data, divided into 10 subsets. The described steps were performed using the Weka software. It is concluded that the results allow the classification of maize seed lots with high accuracy and precision, and that these algorithms can classify a maize seed lot well from vigor attributes, thus enabling more accurate decision making based on vigor tests in a reduced evaluation time.
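The study itself was run in Weka; a rough scikit-learn analogue of the same 10-fold cross-validation protocol (with synthetic stand-in data and no ClassificationViaRegression counterpart) could be:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# Stand-in data: the real study used vigor attributes of maize seed lots.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = {
    "DecisionTree (J48 analogue)": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "k-NN (IBk analogue)": KNeighborsClassifier(),
    "MLP (MultilayerPerceptron)": MLPClassifier(max_iter=2000, random_state=0),
    "NaiveBayes": GaussianNB(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```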
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Issue tracking systems enable users and developers to comment on problems plaguing a software system. Empirical Software Engineering (ESE) researchers study (open-source) project issues and the comments and threads within to discover---among others---challenges developers face when, e.g., incorporating new technologies, platforms, and programming language constructs. However, issue discussion threads accumulate over time and thus can become unwieldy, hindering any insight that researchers may gain. While existing approaches alleviate this burden by classifying issue thread comments, there is a gap between searching popular open-source software repositories (e.g., those on GitHub) for issues containing particular keywords and feeding the results into a classification model. In this paper, we demonstrate a research infrastructure tool called QuerTCI that bridges this gap by integrating the GitHub issue comment search API with the classification models found in existing approaches. Using queries, ESE researchers can retrieve GitHub issues containing particular keywords, e.g., those related to a certain programming language construct, and subsequently classify the kinds of discussions occurring in those issues. Using our tool, our hope is that ESE researchers can uncover challenges related to particular technologies using certain keywords through popular open-source repositories more seamlessly than previously possible. A tool demonstration video may be found at: https://youtu.be/fADKSxn0QUk.
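QuerTCI's internals are not reproduced here; an illustrative sketch of the two pieces it bridges, the GitHub issue search API and a comment classifier (the repository, keyword, and the trivial placeholder classifier below are assumptions), might look like:

```python
import requests

def search_issues(keyword, repo="rust-lang/rust", per_page=5):
    """Query the GitHub issue search API for issues mentioning a keyword."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"{keyword} repo:{repo} is:issue", "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    return resp.json()["items"]

def classify_comment(text):
    """Placeholder for the classification model used by the real tool."""
    return "question" if "?" in text else "statement"

for issue in search_issues("async trait"):
    print(issue["number"], classify_comment(issue.get("body") or ""))
```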
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.93 (USD Billion) |
| MARKET SIZE 2025 | 3.22 (USD Billion) |
| MARKET SIZE 2035 | 8.5 (USD Billion) |
| SEGMENTS COVERED | Application, Deployment Type, End User, Organization Size, Output Format, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | growing data volume, rising demand for insights, advancements in natural language processing, increasing adoption of AI technologies, need for competitive intelligence |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | RapidMiner, IBM, Clarabridge, Lexalytics, Oracle, Tableau, Dell Technologies, Information Builders, SAP, MonkeyLearn, Microsoft, Talend, TIBCO Software, SAS Institute, Alteryx, Qlik |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Increased demand for data analytics, Integration with artificial intelligence, Growth in social media monitoring, Expansion in healthcare applications, Rising need for consumer sentiment analysis |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 10.2% (2025 - 2035) |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, advances in computing and sensing technologies have contributed to the development of effective human activity recognition systems. In context-aware and ambient assisted living applications, the classification of body postures and movements aids the development of health systems that improve the quality of life of the disabled and the elderly. In this paper we describe a comparative analysis of data-driven activity recognition techniques against a novel supervised learning technique called artificial hydrocarbon networks (AHN). We show that artificial hydrocarbon networks are suitable for efficient classification of body postures and movements, comparing their performance with other well-known supervised learning methods.
In this paper we implement and test the recently described nearest subspace classifier on a range of microarray cancer datasets. Its classification accuracy is tested against nearest neighbor and nearest centroid algorithms, and is shown to give a significant improvement. This classification system uses class-dependent PCA to construct a subspace for each class. Test vectors are assigned the class label of the nearest subspace, defined as the subspace with the minimum reconstruction error among all subspaces. Furthermore, we demonstrate that this distance measure is equivalent to the null-space component of the vector being analyzed. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
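The exact construction is given in the proceedings; a minimal numpy sketch of the idea, class-dependent PCA plus minimum reconstruction error, on synthetic stand-in data could be:

```python
import numpy as np

class NearestSubspace:
    """Nearest subspace classifier: one PCA subspace per class; a test vector
    gets the label of the subspace with the smallest reconstruction error."""

    def __init__(self, n_components=5):
        self.n_components = n_components

    def fit(self, X, y):
        self.models_ = {}
        for c in np.unique(y):
            Xc = X[y == c]
            mu = Xc.mean(axis=0)
            # Principal directions of the class from the SVD of the centered data.
            _, _, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
            self.models_[c] = (mu, Vt[: self.n_components])
        return self

    def predict(self, X):
        errors = []
        for mu, V in self.models_.values():
            Z = (X - mu) @ V.T            # project onto the class subspace
            recon = Z @ V + mu            # reconstruct in the original space
            errors.append(np.linalg.norm(X - recon, axis=1))
        keys = list(self.models_.keys())
        return np.array([keys[i] for i in np.vstack(errors).argmin(axis=0)])

# Tiny synthetic check (stand-in for microarray data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 20)), rng.normal(2, 1, (40, 20))])
y = np.array([0] * 40 + [1] * 40)
clf = NearestSubspace(n_components=3).fit(X, y)
print((clf.predict(X) == y).mean())
```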
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
Research into the prevalence of hospitalisation among childhood asthma cases is undertaken, using a data set local to the Barwon region of Victoria. Participants were the parents/guardians, responding on behalf of children aged 5 to 11 years. Various data mining techniques are used, including segmentation, association, and classification, to assist in predicting and exploring instances of childhood hospitalisation due to asthma. Results from this study indicate that children in inner city and metropolitan areas may overutilise emergency department services. In addition, this study found that the prediction of hospitalisation for asthma in children was greater for those with a written asthma management plan. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
A visual classification method is introduced as a learning strategy for pattern classification problems in bioinformatics. In this paper, we show the strong convergence property of the proposed method. In particular, the method is shown to converge to the Bayes estimator, i.e., its learning error tends to achieve the posterior expected minimal value. The method is successfully applied to some practical disease diagnosis problems, and the experimental results verify the validity and effectiveness of the theoretical conclusions. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
For over a decade, genomic and proteomic datasets have presented a challenge for various statistical and machine learning methods. Most microarray- or mass spectrometry-based datasets consist of a small number of samples with a large number of gene or protein expression measurements, but in the past few years new types of datasets with an additional time component have become available. These datasets offer new opportunities for the development of new classification and gene selection techniques, where one of the problems is the reduction of high dimensionality. This paper presents a novel classification technique which combines feature extraction and feature selection to obtain the optimal set of genes available to a classifier. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
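The paper's specific technique is not reproduced here; as a generic illustration of chaining feature selection with feature extraction ahead of a classifier (the data are a synthetic stand-in for a small-sample, high-dimensional expression matrix):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Few samples, many "genes" -- the typical shape of expression data.
X, y = make_classification(n_samples=60, n_features=2000, n_informative=20,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),  # keep the 100 most relevant genes
    ("extract", PCA(n_components=10)),          # compress them into 10 components
    ("clf", SVC(kernel="linear")),
])

print(cross_val_score(pipe, X, y, cv=5).mean())
```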
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Gene expression analysis is one of the most important tasks in genomic medicine; using expression profiles it is possible to classify tumors, which are directly related to the development of cancer. This paper presents a clustering method for tumor classification, vector quantization, using gene expression profiles from mRNA microarrays with samples of cervical cancer and normal cervix. Vector quantization is used to divide the space into regions, and the centroids of the regions represent patients with tumors or healthy ones. The regions found by the vector quantizer are also used as the basis for classifying other tumors, which could help in the prognosis of the illness or in finding new groups of tumors. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
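The quantizer itself is not given in code in the abstract; a minimal sketch of the idea, using k-means as the vector quantizer on hypothetical expression profiles, could be:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical expression profiles: rows = patients, columns = genes.
rng = np.random.default_rng(1)
tumor = rng.normal(1.0, 0.5, (25, 50))
normal = rng.normal(0.0, 0.5, (25, 50))
X = np.vstack([tumor, normal])
y = np.array(["tumor"] * 25 + ["normal"] * 25)

# Vector quantization: partition the expression space into regions (centroids).
vq = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Label each region with the majority class of the samples it captured.
region_label = {}
for r in range(vq.n_clusters):
    members = y[vq.labels_ == r]
    classes, counts = np.unique(members, return_counts=True)
    region_label[r] = classes[counts.argmax()]

# A new profile takes the label of the region of its nearest centroid.
new_profile = rng.normal(0.9, 0.5, (1, 50))
print(region_label[vq.predict(new_profile)[0]])
```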
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
The LANDFIRE vegetation layers describe the following elements of existing and potential vegetation for each LANDFIRE mapping zone: environmental site potentials, biophysical settings, existing vegetation types, canopy cover, and vegetation height. Vegetation is mapped using predictive landscape models based on extensive field reference data, satellite imagery, biophysical gradient layers, and classification and regression trees.
DATA SUMMARY: The environmental site potential (ESP) data layer represents the vegetation that could be supported at a given site based on the biophysical environment. Map units are named according to NatureServe's Ecological Systems classification, which is a nationally consistent set of mid-scale ecological units (Comer and others 2003). Usage of these classification units to describe environmental site potential, however, differs from the original intent of Ecological Systems as units of existing vegetation. As used in LANDFIRE, map unit names represent the natural plant communities that would become established at late or climax stages of successional development in the absence of disturbance. They reflect the current climate and physical environment, as well as the competitive potential of native plant species. The ESP layer is similar in concept to other approaches to classifying potential vegetation in the western United States, including habitat types (for example, Daubenmire 1968 and Pfister and others 1977) and plant associations (for example, Henderson and others 1989). It is important to note that ESP is an abstract concept and represents neither current nor historical vegetation.
To create the ESP data layer, we first assign field plots to one of the ESP map unit classes. Go to http://www.landfire.gov/participate_acknowledgements.php for more information regarding contributors of field plot data. Assignments are based on presence and abundance of indicator plant species recorded on the plots and on the ecological amplitude and competitive potential of these species. We then intersect plot locations with a series of 30-meter spatially explicit gradient layers. Most of the gradient layers used in the predictive modeling of ESP are derived using the WX-BGC simulation model (Keane and Holsinger, in preparation; Keane and others 2002). WX-BGC simulations are based largely on spatially extrapolated weather data from DAYMET (Thornton and others 1997; Thornton and Running 1999; http://www.daymet.org/) and on soils data in STATSGO (NRCS 1994). Additional indirect gradient layers, such as elevation, slope, and indices of topographic position, are also used. We use data from plot locations to develop predictive classification tree models, using See5 data mining software (Quinlan 1993; Rulequest Research 1997), for each LANDFIRE map zone. These decision trees are applied spatially to predict the ESP for every pixel across the landscape. Finally, ESP pixel values are, in some cases, modified based on a comparison with the LANDFIRE existing vegetation type (EVT) layer created with the use of 30-meter Landsat ETM satellite imagery. We make such modifications only in non-vegetated areas (such as water, rock, snow, or ice) and where information in the EVT layer clearly enables a better depiction of the environmental site potential concept.
Although the ESP data layer is intended to represent current site potential, the actual time period for this data set is variable. The weather data used in DAYMET were compiled from 1980 to 1997.
Refer to spatial metadata for date ranges of field plot data and satellite imagery for each LANDFIRE map zone. A number of changes were implemented for the LF2010 ESP product that worked with this original data. LF2010 updates to mapping EVT map units for Barren, Snow-Ice, and Water were translated to the LF2010 ESP product so those map units will coincide with the EVT. Subsequent to that, each ESP map unit was stratified spatially two different ways. First, each ESP map unit was stratified by LANDFIRE map zone. Second, each ESP map unit was stratified by an ESP life form classification layer that incorporated NLCD 2001 data, LF2001 EVC data, a Vegetation Change Tracker (VCT) dataset (Huang, 2010), and the National Wetlands Inventory (NWI) data. Each layer was leveraged against each other to determine areas of stable Sparse, Upland Herb, Upland Shrub, Upland Woodland, Upland Forest, Wetland Shrub-herb, Wetland Forest, Wetland Shrub, and Wetland Herb. Areas mapped as agriculture, urban, barren, snow-ice, and water were described as Undetermined.
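LANDFIRE used See5 for the classification tree modeling described above; a rough scikit-learn analogue of the train-on-plots, apply-per-pixel step (all values hypothetical) might be:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins: per-pixel gradient layers (e.g., elevation, slope,
# a WX-BGC variable) stacked as features, and ESP classes assigned at field plots.
rng = np.random.default_rng(0)
plot_features = rng.normal(size=(500, 3))   # training plots x gradient layers
plot_esp = rng.integers(0, 4, 500)          # ESP map-unit class ids at those plots

tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(plot_features, plot_esp)

# Apply the tree spatially: predict a class for every pixel of a raster stack.
raster = rng.normal(size=(100, 100, 3))                      # rows x cols x layers
esp_map = tree.predict(raster.reshape(-1, 3)).reshape(100, 100)
print(esp_map.shape)
```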
Extracting motifs from a protein family or subfamily is still a hot topic in bioinformatics. It not only contributes to understanding the functions of proteins and predicting the class to which an unknown protein sequence belongs, but also helps to study protein-protein interactions. In this paper, we present a novel algorithm to extract motifs of a subfamily, based on feature selection and position connection. Position connection is applied to generate motifs, and a hybrid method with a vote-based decision-making mechanism is used to construct the classifier for the ligase subfamilies. In tests on the database, a predictive accuracy of more than 95.87% is achieved. The results demonstrate that this novel method is practical. In addition, the method shows that motifs play an important role in classifying proteins and in studying the characteristics of subfamilies or families in a protein database. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The entries of a confusion matrix have been calculated for a classification threshold of 1.5. In case of unweighted data, the class label is if and otherwise .
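Assuming the usual reading of such a rule, predicting the positive class when the score exceeds the 1.5 threshold, the confusion matrix entries can be computed as in this sketch (scores and labels are hypothetical):

```python
import numpy as np

# Hypothetical scores and true labels; the 1.5 threshold comes from the text,
# the thresholding rule itself is an assumption.
scores = np.array([0.7, 1.9, 2.3, 1.1, 1.6, 0.4])
truth  = np.array([0,   1,   1,   0,   1,   1])

pred = (scores > 1.5).astype(int)

tp = np.sum((pred == 1) & (truth == 1))
fp = np.sum((pred == 1) & (truth == 0))
fn = np.sum((pred == 0) & (truth == 1))
tn = np.sum((pred == 0) & (truth == 0))
print(np.array([[tp, fp], [fn, tn]]))
```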
Rights: In Copyright, http://rightsstatements.org/vocab/InC/1.0/
Pairwise alignment approaches for time-varying gene expression profiles have recently been developed for the detection of co-expressions in time-series microarray data sets. In this paper, we analyze multiple expression profile alignment (MEPA) methods for classifying microarray time-course data. We apply a nearest centroid classification technique, in which the centroid of each class is computed by means of a MEPA algorithm. MEPA aligns the expression profiles in such a way as to minimize the total area between all aligned profiles. We propose four MEPA approaches whose effectiveness is demonstrated on the well-known budding yeast (S. cerevisiae) data set. PRIB 2008 proceedings found at: http://dx.doi.org/10.1007/978-3-540-88436-1
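The MEPA alignment itself is not shown here; a minimal sketch of the nearest centroid step on already-aligned, hypothetical time-course profiles could be:

```python
import numpy as np

def nearest_centroid_fit(profiles, labels):
    """Class centroid = mean time-course profile of its members (the MEPA
    alignment step is omitted; profiles are assumed to share a time grid)."""
    return {c: profiles[labels == c].mean(axis=0) for c in np.unique(labels)}

def nearest_centroid_predict(centroids, profile):
    # Distance taken as the area between the two curves (sum of |differences|).
    return min(centroids, key=lambda c: np.abs(profile - centroids[c]).sum())

# Hypothetical expression time courses: rows = genes, columns = time points.
rng = np.random.default_rng(0)
up = np.cumsum(rng.normal(0.5, 0.2, (10, 8)), axis=1)     # rising profiles
down = np.cumsum(rng.normal(-0.5, 0.2, (10, 8)), axis=1)  # falling profiles
X = np.vstack([up, down])
y = np.array(["up"] * 10 + ["down"] * 10)

centroids = nearest_centroid_fit(X, y)
print(nearest_centroid_predict(centroids, X[0]))
```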
Contributors: Monash University. Faculty of Information Technology. Gippsland School of Information Technology ; Chetty, Madhu ; Ahmad, Shandar ; Ngom, Alioune ; Teng, Shyh Wei ; Third IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB) (3rd : 2008 : Melbourne, Australia) ; Coverage: Rights: Copyright by Third IAPR International Conference on Pattern Recognition in Bioinformatics. All rights reserved.
Quantitative structure–activity relationships (QSAR) modeling is a well-known computational technique with wide applications in fields such as drug design, toxicity predictions, nanomaterials, etc. However, QSAR researchers still face certain problems to develop robust classification-based QSAR models, especially while handling response data pertaining to diverse experimental and/or theoretical conditions. In the present work, we have developed an open source standalone software “QSAR-Co” (available to download at https://sites.google.com/view/qsar-co) to setup classification-based QSAR models that allow mining the response data coming from multiple conditions. The software comprises two modules: (1) the Model development module and (2) the Screen/Predict module. This user-friendly software provides several functionalities required for developing a robust multitasking or multitarget classification-based QSAR model using linear discriminant analysis or random forest techniques, with appropriate validation, following the principles set by the Organisation for Economic Co-operation and Development (OECD) for applying QSAR models in regulatory assessments.
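QSAR-Co is a standalone application; as a rough illustration of the kind of model it builds, a random forest classifier on molecular descriptors with a validation split (all data hypothetical), one might write:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical descriptor matrix (rows = compounds measured under various
# conditions, columns = descriptors) and binary activity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 15))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.5, 300) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_val)
print("accuracy:", accuracy_score(y_val, pred))
print("MCC:", matthews_corrcoef(y_val, pred))
```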
According to our latest research, the global Continuous Road Edge Case Mining market size reached USD 1.16 billion in 2024, driven by the accelerating adoption of advanced analytics and artificial intelligence in automotive and transportation sectors. The market is expected to grow at a robust CAGR of 17.8% during the forecast period, reaching an estimated USD 5.18 billion by 2033. This significant growth is underpinned by the rising demand for enhanced road safety, the proliferation of autonomous vehicles, and the increasing integration of real-time data analytics in traffic management systems.
One of the primary growth factors for the Continuous Road Edge Case Mining market is the rapid advancement in autonomous vehicle technologies. As automotive OEMs and technology companies race to develop fully autonomous vehicles, the need for comprehensive edge case mining solutions becomes paramount. Edge cases—rare or unusual scenarios encountered on the road—pose significant challenges for the safe deployment of autonomous vehicles. Continuous road edge case mining leverages machine learning and big data analytics to identify, catalog, and address these scenarios, ensuring that vehicles can safely navigate even the most unpredictable conditions. This not only enhances the safety and reliability of autonomous vehicles but also accelerates their path to commercial deployment.
Another critical driver is the increasing emphasis on road safety and regulatory compliance. Governments and transportation agencies worldwide are mandating stricter safety standards for both autonomous and human-driven vehicles. Continuous road edge case mining enables organizations to proactively detect potential hazards and anomalies in real-world driving environments, facilitating timely interventions and policy adjustments. By systematically analyzing vast amounts of driving data, these solutions help stakeholders reduce accident rates, improve traffic flow, and ensure compliance with evolving safety regulations. The growing collaboration between public agencies and private sector innovators is further fueling the adoption of these technologies.
The proliferation of connected infrastructure and the rise of smart cities are also propelling the growth of the Continuous Road Edge Case Mining market. With the deployment of IoT sensors, high-definition cameras, and connected traffic management systems, unprecedented volumes of real-time data are being generated. Continuous edge case mining systems can harness this data to provide actionable insights for urban planners, traffic authorities, and automotive manufacturers. The integration of these solutions into smart city initiatives is enabling more efficient traffic management, reducing congestion, and enhancing overall urban mobility. This trend is particularly pronounced in regions with significant investments in digital infrastructure, such as North America, Europe, and Asia Pacific.
From a regional perspective, North America currently leads the global market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The region’s dominance is attributed to the early adoption of autonomous vehicle technologies, a robust ecosystem of technology providers, and supportive regulatory frameworks. Meanwhile, Asia Pacific is emerging as the fastest-growing market, driven by rapid urbanization, increasing investments in smart transportation, and the presence of leading automotive manufacturers. Europe continues to make significant strides, propelled by stringent safety regulations and a strong focus on innovation in mobility solutions.
The Component segment of the Continuous Road Edge Case Mining market is broadly categorized into Software, Hardware, and Services. Each component plays a vital role in the overall ecosystem, contributing to the efficiency and effectiveness of edge case mining solutions. Software solutions form the backbone of the market, encompassing advanced analytics platforms, machine learning algorithms, and data visualization tools. These software solutions enable the automated identification and classification of edge cases from vast datasets, facilitating continuous improvement in vehicle safety and performance. The demand for customizable and scalable software platforms is on the rise, as organizations seek to tailor solutions to their specific operational needs.
Hardwar
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
It contains 128 BCG recordings (61 hypertensive and 67 normotensive) and the software code of the association classifier.