Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
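As a rough illustration of the baseline association rule learning step (not the refined algorithm evaluated above), the sketch below treats each protein's set of GO molecular function terms as a transaction and mines high-confidence rules between terms with mlxtend; the input file name and the support/confidence thresholds are assumptions for illustration only.

```python
# Minimal sketch of baseline association rule learning over GO annotations.
# Assumes a hypothetical TSV with columns: protein_id, go_term (one annotation per row).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

annotations = pd.read_csv("uniprot_go_mf.tsv", sep="\t")  # hypothetical file name
# One "transaction" per protein: the set of molecular function terms annotated to it.
transactions = annotations.groupby("protein_id")["go_term"].apply(list).tolist()

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Frequent term sets and rules; thresholds here are illustrative, not the paper's.
frequent = apriori(onehot, min_support=0.001, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# A high-confidence rule "GO:A -> GO:B" suggests an implicit relationship between
# terms that a curator could review for missing links or inconsistent annotations.
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```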
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset simulates real-world streaming data with controlled drift patterns across multiple features. It's specifically designed for developing and testing data mining techniques related to detecting and managing model drift in machine learning systems. The dataset contains intentional data quality issues that allow practitioners to experiment with preprocessing techniques and drift detection algorithms.
The dataset contains 100,000 records with timestamps and multiple features exhibiting different drift patterns across four distinct phases:
- timestamp: Time-series index with some gaps and duplicates
- feature1: Numeric feature following a normal distribution with shifting parameters
- feature2: Numeric feature following an exponential distribution with inconsistent formatting
- feature3: Categorical feature with inconsistent casing and typos
- log_message: Text field containing embedded information about system status
- target: Binary classification target with concept drift across phases
- date_str: Date representation with irregular formats
- phase: Marker indicating which distribution regime the sample belongs to
- irrelevant1, irrelevant2: Noise features with no predictive value
- feature1_noisy: Correlated version of feature1 with added noise
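A minimal sketch of loading the stream, cleaning the documented quality issues, and running a simple drift check is shown below; the file name and exact column handling are assumptions about the distributed layout, and the Kolmogorov-Smirnov test stands in for whichever drift detector a practitioner prefers.

```python
# Minimal sketch: clean the documented quality issues, then compare the feature1
# distribution between the first and last phase as a simple drift check.
import pandas as pd
from scipy.stats import ks_2samp

df = pd.read_csv("streaming_drift.csv")  # hypothetical file name

df = df.drop_duplicates(subset="timestamp")                          # duplicate timestamps
df["feature2"] = pd.to_numeric(df["feature2"], errors="coerce")      # inconsistent formatting
df["feature3"] = df["feature3"].astype(str).str.strip().str.lower()  # casing issues and typos
df["date"] = pd.to_datetime(df["date_str"], errors="coerce")         # irregular date formats

phases = sorted(df["phase"].unique())
early = df.loc[df["phase"] == phases[0], "feature1"].dropna()
late = df.loc[df["phase"] == phases[-1], "feature1"].dropna()

# Two-sample KS test: a small p-value indicates the feature1 distribution has shifted.
stat, p_value = ks_2samp(early, late)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```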
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset originates from a Portuguese higher education institution and was developed as part of a national project aiming to combat student dropout and academic failure in universities. It brings together rich information from 4,424 undergraduate students across 8 degree programs, such as Agronomy, Design, Education, Nursing, Journalism, Management, Social Service, and Technologies.
The core objective is to support early intervention by using machine learning models to predict a student’s academic outcome, that is, whether they will:
- Drop out
- Remain Enrolled
- Successfully Graduate
This is framed as a three-class classification problem with a known class imbalance, offering real-world challenges for predictive modeling and education analytics.
Target classes: Dropout, Enrolled, Graduate
Demographics & Socioeconomic:
Academic History:
External Factors:
The original researchers performed extensive data cleaning, handling:
- Outliers
- Inconsistent entries
- Anomalies
- Missing values
Final dataset contains no missing values.
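A minimal modeling sketch for the three-class task, using class weighting as one simple answer to the known imbalance, might look as follows; the file and column names are hypothetical and should be adjusted to the actual UCI export.

```python
# Minimal sketch of the three-class prediction task (Dropout / Enrolled / Graduate)
# with a simple class-imbalance handling strategy. File/column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("students.csv")   # hypothetical file name
X = df.drop(columns=["Target"])    # predictors: demographics, academic history, external factors
y = df["Target"]                   # labels: Dropout, Enrolled, Graduate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights the minority classes to counter the imbalance.
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```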
This dataset was created under the SATDAP - Capacitação da Administração Pública project funded by POCI-05-5762-FSE-000191 (Portugal) and is available through the UCI Machine Learning Repository.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data is captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects have been submitted to SRA, double the number of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest from SRA because the data structure is complicated and experimental conditions, along with raw sequences, are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is the clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) terms extracted from the articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called “Gendoo”. We generated hyperlinks between the diseases extracted from SRA and their feature profiles in Gendoo. The project, publication, and disease lists developed in this study are available at our web service, called “DBCLS SRA” (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.
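The ID extraction step can be sketched with a simple regular expression; the pattern below only approximates common SRA accession formats and is an illustration, not the pipeline used in the study.

```python
# Simplified sketch of extracting SRA accession IDs from an article's full text and
# pairing them with the article's PubMed ID. The regex approximates common accession
# formats (DDBJ/EMBL/NCBI prefixes) and is an assumption, not the archive's exact rules.
import re

SRA_ID = re.compile(r"\b[DES]R[APRSXZ]\d{6,}\b")  # e.g. SRA012345, SRP001234, ERR123456

def sra_pmid_pairs(pmid: str, fulltext: str) -> list[tuple[str, str]]:
    """Return unique (SRA ID, PMID) pairs found in one article's full text."""
    return [(acc, pmid) for acc in sorted(set(SRA_ID.findall(fulltext)))]

text = ("Raw reads were deposited in the Sequence Read Archive "
        "under accession SRA012345 (run SRR0123456).")
print(sra_pmid_pairs("23456789", text))
# [('SRA012345', '23456789'), ('SRR0123456', '23456789')]
```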
https://www.technavio.com/content/privacy-notice
The AI for process optimization market size is forecast to increase by USD 17.2 billion, at a CAGR of 36.3%, between 2024 and 2029.
The global AI for process optimization market is defined by the corporate need for greater operational efficiency and cost reduction. Enterprises are adopting intelligent process automation to automate labor-intensive tasks and minimize human error. These AI-powered systems enable a shift from static workflows to dynamic, intelligent operations that adapt in real time. This includes generative AI in manufacturing, where AI algorithms optimize routing and inventory levels. The integration of generative AI capabilities is expanding automation into cognitive and creative domains, validating AI adoption as a means of gaining a competitive advantage through enhanced operational performance. This is particularly relevant for the artificial intelligence (AI) market in the manufacturing industry, where efficiency gains directly impact the bottom line.

A transformative trend is the rapid integration of generative AI, which is accelerating the shift toward hyperautomation. This moves beyond traditional robotic process automation (RPA) by introducing systems capable of understanding natural language and making context-aware decisions. The trend is pushing organizations toward a disciplined approach to identifying and automating as many business and IT processes as possible, using a combination of technologies such as AI and machine learning. However, the market faces a fundamental challenge in enterprise data infrastructure: AI models depend on high-quality data, but many organizations struggle with siloed, inconsistent, and incomplete datasets, which presents a significant barrier to effective AI implementation and model accuracy.
What will be the Size of the AI For Process Optimization Market during the forecast period?
Explore in-depth regional segment analysis with market size data (historical 2019-2023 and forecasts 2025-2029) in the full report.
The market is characterized by the continuous integration of intelligent workflow orchestration and decision management systems to enhance operational agility. Enterprises are deploying these technologies to move beyond static procedures, enabling dynamic adjustments in response to real-time data. This involves the application of process discovery algorithms and task mining solutions to identify bottlenecks and optimization opportunities. The focus is on creating a self-optimizing ecosystem where prescriptive analytics engines guide automated actions, ensuring persistent efficiency gains in areas such as generative AI in manufacturing and intelligent process automation.

Ongoing developments are centered on improving predictive process monitoring and deploying machine learning models for more accurate forecasting. This includes leveraging AI-powered root cause analysis to understand deviations from standard operating procedures and implementing exception handling automation to manage anomalies without human intervention. The use of a digital twin of an organization provides a simulated environment in which to test and refine these models. This analytical depth is crucial for sectors such as generative AI in fulfillment and logistics, where operational precision directly impacts profitability and customer satisfaction.
How is this AI For Process Optimization Industry segmented?
The AI for process optimization industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.

Deployment
- Cloud-based
- On-premises

Sector
- Large enterprises
- SME

End-user
- BFSI
- IT and telecom
- Retail
- Manufacturing
- Healthcare

Solution
- Automation
- Process modeling
- Monitoring and optimization
- Content and document management
- Integration

Geography
- North America (US, Canada)
- APAC (China, Japan, India, South Korea)
- Europe (Germany, UK, France)
- Middle East and Africa (UAE, South Africa)
- South America (Brazil, Argentina)
- Rest of World (ROW)
By Deployment Insights
The cloud-based segment is estimated to witness significant growth during the forecast period. The cloud-based deployment model is the dominant and most rapidly expanding segment. Its ascendancy is fueled by advantages such as scalability, allowing organizations to adjust computational resources for training complex machine learning models. This elasticity eliminates the need for substantial upfront capital expenditure on physical hardware, shifting costs to a more manageable operational model. This model is particularly prevalent in the Middle East and Africa, which accounts for 4.65% of the market's growth potential, where cloud adoption supports rapid digitalization efforts. Furthermore, cloud platforms facilitate seamless integration and interoperability, enabling
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
There is an urgent need for better biomarkers for the detection of early-stage breast cancer. Utilizing untargeted metabolomics and lipidomics in conjunction with advanced data mining approaches for metabolism-centric biomarker discovery and validation may enhance the identification and validation of novel biomarkers for breast cancer screening. In this study, we employed a multimodal omics approach to identify and validate potential biomarkers capable of differentiating between patients with breast cancer and those with benign tumors. Our findings indicated that ether-linked phosphatidylcholine exhibited a significant difference between invasive ductal carcinoma and benign tumors, including cases with inconsistent mammography results. We observed alterations in numerous lipid species, including sphingomyelin, triacylglycerol, and free fatty acids, in the breast cancer group. Furthermore, we identified several dysregulated hydrophilic metabolites in breast cancer, such as glutamate, glycochenodeoxycholate, and dimethyluric acid. Through robust multivariate receiver operating characteristic analysis utilizing machine learning models, either linear support vector machines or random forest models, we successfully distinguished between cancerous and benign cases with promising outcomes. These results emphasize the potential of metabolic biomarkers to complement other criteria in breast cancer screening. Future studies are essential to further validate the metabolic biomarkers identified in our study and to develop assays for clinical applications.
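As a rough sketch of the multivariate ROC analysis described above, the snippet below cross-validates a linear SVM and a random forest on a hypothetical metabolite-intensity matrix and reports ROC AUC; the file names, preprocessing, and hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# Rough sketch of cross-validated ROC AUC for cancer-vs-benign classification from a
# metabolite/lipid intensity matrix. File names and settings are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = pd.read_csv("metabolite_matrix.csv", index_col=0)  # samples x features (hypothetical)
y = np.loadtxt("labels.txt", dtype=int)                 # 1 = invasive carcinoma, 0 = benign

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear SVM": make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000)),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.2f} +/- {scores.std():.2f}")
```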
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The data provided in this database were used to establish the temporality, composition, and petrogenetic aspects of the so-called Cretaceous-Eocene Mexican Magmatic Arc (CEMMA). This magmatic arc is continuous throughout western Mexico, and its understanding is critical to elucidating the tectonics of southwestern North America. The data were carefully compiled from the literature, including published articles, theses, and internal reports from mining companies, and to a large extent come from unpublished information generated by the working group (CONACYT Research Grant 49528-F). Data considered unreliable because of the methodology used, or whose information was inconsistent, were not included.
The Cretaceous-Eocene Mexican Magmatic Arc: Conceptual framework from geochemical and geochronological data of plutonic rocks
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The study on Early Childhood Development (ECD) practices in T/A Zilakoma, Nkhata Bay South Constituency, Malawi, employs the Ecological Systems Theory to explore recruitment, role definitions, and support systems. This theoretical construct enables an intricate examination of interactions within various environmental systems, emphasizing micro, meso, exo, macro, and chrono systems. Specifically, it illuminates the dynamics within immediate settings, interconnections among diverse systems, broader indirect influences, cultural ideologies, societal values, and temporal dimensions, offering a comprehensive lens for understanding educational contexts. The descriptive qualitative research was conducted within a case study framework to explore the practical experiences of stakeholders within ECD using semi-structured interview guides. Ethical standards were upheld, ensuring voluntary participation and confidentiality. Purposive sampling was used to collect data from diverse and knowledgeable participants involved in ECD domains, providing comprehensive insights aligned with the study’s objectives. Thematic analysis and sentiment mining were performed using Atlas 23 software. The results revealed themes such as recruitment practices relying on community-driven approaches, role ambiguity due to undefined responsibilities, informal evaluation processes, inconsistent training opportunities, and a dependency on community and volunteerism. These themes highlight the absence of formal structures and standardized processes in various aspects of ECD programs. Additionally, sentiment analysis illustrated diverse perspectives among stakeholders, reflecting their distinct experiences and challenges within the ECD landscape. The study concludes with policy recommendations aimed at addressing these systemic challenges.