Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
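As a rough illustration of the baseline association rule learning step (not the refined algorithm evaluated above), the sketch below treats each protein's set of GO molecular function terms as a transaction and mines high-confidence rules between terms with mlxtend; the input file name and the support/confidence thresholds are assumptions for illustration only.

```python
# Minimal sketch of baseline association rule learning over GO annotations.
# Assumes a hypothetical TSV with columns: protein_id, go_term (one annotation per row).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

annotations = pd.read_csv("uniprot_go_mf.tsv", sep="\t")  # hypothetical file name
# One "transaction" per protein: the set of molecular function terms annotated to it.
transactions = annotations.groupby("protein_id")["go_term"].apply(list).tolist()

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Frequent term sets and rules; thresholds here are illustrative, not the paper's.
frequent = apriori(onehot, min_support=0.001, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# A high-confidence rule "GO:A -> GO:B" suggests an implicit relationship between
# terms that a curator could review for missing links or inconsistent annotations.
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```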
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset simulates real-world streaming data with controlled drift patterns across multiple features. It's specifically designed for developing and testing data mining techniques related to detecting and managing model drift in machine learning systems. The dataset contains intentional data quality issues that allow practitioners to experiment with preprocessing techniques and drift detection algorithms.
The dataset contains 100,000 records with timestamps and multiple features exhibiting different drift patterns across four distinct phases:
- timestamp: Time-series index with some gaps and duplicates
- feature1: Numeric feature following a normal distribution with shifting parameters
- feature2: Numeric feature following an exponential distribution with inconsistent formatting
- feature3: Categorical feature with inconsistent casing and typos
- log_message: Text field containing embedded information about system status
- target: Binary classification target with concept drift across phases
- date_str: Date representation with irregular formats
- phase: Marker indicating which distribution regime the sample belongs to
- irrelevant1, irrelevant2: Noise features with no predictive value
- feature1_noisy: Correlated version of feature1 with added noise
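A minimal sketch of loading the stream, cleaning the documented quality issues, and running a simple drift check is shown below; the file name and exact column handling are assumptions about the distributed layout, and the Kolmogorov-Smirnov test stands in for whichever drift detector a practitioner prefers.

```python
# Minimal sketch: clean the documented quality issues, then compare the feature1
# distribution between the first and last phase as a simple drift check.
import pandas as pd
from scipy.stats import ks_2samp

df = pd.read_csv("streaming_drift.csv")  # hypothetical file name

df = df.drop_duplicates(subset="timestamp")                          # duplicate timestamps
df["feature2"] = pd.to_numeric(df["feature2"], errors="coerce")      # inconsistent formatting
df["feature3"] = df["feature3"].astype(str).str.strip().str.lower()  # casing issues and typos
df["date"] = pd.to_datetime(df["date_str"], errors="coerce")         # irregular date formats

phases = sorted(df["phase"].unique())
early = df.loc[df["phase"] == phases[0], "feature1"].dropna()
late = df.loc[df["phase"] == phases[-1], "feature1"].dropna()

# Two-sample KS test: a small p-value indicates the feature1 distribution has shifted.
stat, p_value = ks_2samp(early, late)
print(f"KS statistic={stat:.3f}, p-value={p_value:.3g}")
```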
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset originates from a Portuguese higher education institution and was developed as part of a national project aiming to combat student dropout and academic failure in universities. It brings together rich information from 4,424 undergraduate students across 8 degree programs, such as Agronomy, Design, Education, Nursing, Journalism, Management, Social Service, and Technologies.
The core objective is to support early intervention by using machine learning models to predict a student’s academic outcome, that is, whether they will:
- Drop out
- Remain Enrolled
- Successfully Graduate
This is framed as a three-class classification problem with a known class imbalance, offering real-world challenges for predictive modeling and education analytics.
Target classes: Dropout, Enrolled, Graduate
Demographics & Socioeconomic:
Academic History:
External Factors:
The original researchers performed extensive data cleaning, handling:
- Outliers
- Inconsistent entries
- Anomalies
- Missing values
Final dataset contains no missing values.
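A minimal modeling sketch for the three-class task, using class weighting as one simple answer to the known imbalance, might look as follows; the file and column names are hypothetical and should be adjusted to the actual UCI export.

```python
# Minimal sketch of the three-class prediction task (Dropout / Enrolled / Graduate)
# with a simple class-imbalance handling strategy. File/column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("students.csv")   # hypothetical file name
X = df.drop(columns=["Target"])    # predictors: demographics, academic history, external factors
y = df["Target"]                   # labels: Dropout, Enrolled, Graduate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights the minority classes to counter the imbalance.
model = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```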
This dataset was created under the SATDAP - Capacitação da Administração Pública project funded by POCI-05-5762-FSE-000191 (Portugal) and is available through the UCI Machine Learning Repository.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data is captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects have been submitted to SRA, double the number of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest from SRA because the data structure is complicated and experimental conditions, along with raw sequences, are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is the clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) terms extracted from the articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called “Gendoo”. We generated hyperlinks between the diseases extracted from SRA and their feature profiles in Gendoo. The project, publication, and disease lists developed in this study are available at our web service, called “DBCLS SRA” (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.
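The ID extraction step can be sketched with a simple regular expression; the pattern below only approximates common SRA accession formats and is an illustration, not the pipeline used in the study.

```python
# Simplified sketch of extracting SRA accession IDs from an article's full text and
# pairing them with the article's PubMed ID. The regex approximates common accession
# formats (DDBJ/EMBL/NCBI prefixes) and is an assumption, not the archive's exact rules.
import re

SRA_ID = re.compile(r"\b[DES]R[APRSXZ]\d{6,}\b")  # e.g. SRA012345, SRP001234, ERR123456

def sra_pmid_pairs(pmid: str, fulltext: str) -> list[tuple[str, str]]:
    """Return unique (SRA ID, PMID) pairs found in one article's full text."""
    return [(acc, pmid) for acc in sorted(set(SRA_ID.findall(fulltext)))]

text = ("Raw reads were deposited in the Sequence Read Archive "
        "under accession SRA012345 (run SRR0123456).")
print(sra_pmid_pairs("23456789", text))
# [('SRA012345', '23456789'), ('SRR0123456', '23456789')]
```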
https://www.technavio.com/content/privacy-notice
The AI for process optimization market size is forecast to increase by USD 17.2 billion, at a CAGR of 36.3%, between 2024 and 2029.
The global AI for process optimization market is defined by the corporate need for greater operational efficiency and cost reduction. Enterprises are adopting intelligent process automation to automate labor-intensive tasks and minimize human error. These AI-powered systems enable a shift from static workflows to dynamic, intelligent operations that adapt in real time. This includes generative AI in manufacturing, where AI algorithms optimize routing and inventory levels. The integration of generative AI capabilities is expanding automation into cognitive and creative domains, validating AI adoption as a means of gaining a competitive advantage through enhanced operational performance. This is particularly relevant for the artificial intelligence (AI) market in the manufacturing industry, where efficiency gains directly impact the bottom line.

A transformative trend is the rapid integration of generative AI, which is accelerating the shift toward hyperautomation. This moves beyond traditional robotic process automation (RPA) by introducing systems capable of understanding natural language and making context-aware decisions. The trend is pushing organizations toward a disciplined approach to identifying and automating as many business and IT processes as possible, using a combination of technologies such as AI and machine learning. However, the market faces a fundamental challenge in enterprise data infrastructure: AI models depend on high-quality data, but many organizations struggle with siloed, inconsistent, and incomplete datasets, which presents a significant barrier to effective AI implementation and model accuracy.
What will be the Size of the AI For Process Optimization Market during the forecast period?
Explore in-depth regional segment analysis with market size data (historical 2019-2023 and forecasts 2025-2029) in the full report.
The market is characterized by the continuous integration of intelligent workflow orchestration and decision management systems to enhance operational agility. Enterprises are deploying these technologies to move beyond static procedures, enabling dynamic adjustments in response to real-time data. This involves the application of process discovery algorithms and task mining solutions to identify bottlenecks and optimization opportunities. The focus is on creating a self-optimizing ecosystem where prescriptive analytics engines guide automated actions, ensuring persistent efficiency gains in areas such as generative AI in manufacturing and intelligent process automation.

Ongoing developments are centered on improving predictive process monitoring and deploying machine learning models for more accurate forecasting. This includes leveraging AI-powered root cause analysis to understand deviations from standard operating procedures and implementing exception handling automation to manage anomalies without human intervention. The use of a digital twin of an organization provides a simulated environment in which to test and refine these models. This analytical depth is crucial for sectors such as generative AI in fulfillment and logistics, where operational precision directly impacts profitability and customer satisfaction.
How is this AI For Process Optimization Industry segmented?
The AI for process optimization industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, as well as historical data from 2019-2023, for the following segments.

Deployment
- Cloud-based
- On-premises

Sector
- Large enterprises
- SME

End-user
- BFSI
- IT and telecom
- Retail
- Manufacturing
- Healthcare

Solution
- Automation
- Process modeling
- Monitoring and optimization
- Content and document management
- Integration

Geography
- North America (US, Canada)
- APAC (China, Japan, India, South Korea)
- Europe (Germany, UK, France)
- Middle East and Africa (UAE, South Africa)
- South America (Brazil, Argentina)
- Rest of World (ROW)
By Deployment Insights
The cloud-based segment is estimated to witness significant growth during the forecast period. The cloud-based deployment model is the dominant and most rapidly expanding segment. Its ascendancy is fueled by advantages such as scalability, allowing organizations to adjust computational resources for training complex machine learning models. This elasticity eliminates the need for substantial upfront capital expenditure on physical hardware, shifting costs to a more manageable operational model. This model is particularly prevalent in the Middle East and Africa, which accounts for 4.65% of the market's growth potential, where cloud adoption supports rapid digitalization efforts. Furthermore, cloud platforms facilitate seamless integration and interoperability, enabling
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
There is an urgent need for better biomarkers for the detection of early-stage breast cancer. Utilizing untargeted metabolomics and lipidomics in conjunction with advanced data mining approaches for metabolism-centric biomarker discovery and validation may enhance the identification and validation of novel biomarkers for breast cancer screening. In this study, we employed a multimodal omics approach to identify and validate potential biomarkers capable of differentiating between patients with breast cancer and those with benign tumors. Our findings indicated that ether-linked phosphatidylcholine exhibited a significant difference between invasive ductal carcinoma and benign tumors, including cases with inconsistent mammography results. We observed alterations in numerous lipid species, including sphingomyelin, triacylglycerol, and free fatty acids, in the breast cancer group. Furthermore, we identified several dysregulated hydrophilic metabolites in breast cancer, such as glutamate, glycochenodeoxycholate, and dimethyluric acid. Through robust multivariate receiver operating characteristic analysis utilizing machine learning models, either linear support vector machines or random forest models, we successfully distinguished between cancerous and benign cases with promising outcomes. These results emphasize the potential of metabolic biomarkers to complement other criteria in breast cancer screening. Future studies are essential to further validate the metabolic biomarkers identified in our study and to develop assays for clinical applications.
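As a rough sketch of the multivariate ROC analysis described above, the snippet below cross-validates a linear SVM and a random forest on a hypothetical metabolite-intensity matrix and reports ROC AUC; the file names, preprocessing, and hyperparameters are illustrative assumptions, not the study's actual pipeline.

```python
# Rough sketch of cross-validated ROC AUC for cancer-vs-benign classification from a
# metabolite/lipid intensity matrix. File names and settings are illustrative only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X = pd.read_csv("metabolite_matrix.csv", index_col=0)  # samples x features (hypothetical)
y = np.loadtxt("labels.txt", dtype=int)                 # 1 = invasive carcinoma, 0 = benign

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "linear SVM": make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000)),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.2f} +/- {scores.std():.2f}")
```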
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The data provided in this database were used to establish the temporality, composition, and petrogenetic aspects of the so-called Cretaceous-Eocene Mexican Magmatic Arc (CEMMA). This magmatic arc is continuous throughout western Mexico, and its understanding is critical to elucidating the tectonics of southwestern North America. The data were carefully compiled from the literature, including published articles, theses, and internal reports from mining companies, and to a large extent come from unpublished information generated by the working group (CONACYT Research Grant 49528-F). Data considered unreliable because of the methodology used, or whose information was inconsistent, were not included.
The Cretaceous-Eocene Mexican Magmatic Arc: Conceptual framework from geochemical and geochronological data of plutonic rocks
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The study on Early Childhood Development (ECD) practices in T/A Zilakoma, Nkhata Bay South Constituency, Malawi, employs the Ecological Systems Theory to explore recruitment, role definitions, and support systems. This theoretical construct enables an intricate examination of interactions within various environmental systems, emphasizing micro, meso, exo, macro, and chrono systems. Specifically, it illuminates the dynamics within immediate settings, interconnections among diverse systems, broader indirect influences, cultural ideologies, societal values, and temporal dimensions, offering a comprehensive lens for understanding educational contexts. The descriptive qualitative research was conducted within a case study framework to explore the practical experiences of stakeholders within ECD using semi-structured interview guides. Ethical standards were upheld, ensuring voluntary participation and confidentiality. Purposive sampling was used to collect data from diverse and knowledgeable participants involved in ECD domains, providing comprehensive insights aligned with the study’s objectives. Thematic analysis and sentiment mining were performed using Atlas 23 software. The results revealed themes such as recruitment practices relying on community-driven approaches, role ambiguity due to undefined responsibilities, informal evaluation processes, inconsistent training opportunities, and a dependency on community and volunteerism. These themes highlight the absence of formal structures and standardized processes in various aspects of ECD programs. Additionally, sentiment analysis illustrated diverse perspectives among stakeholders, reflecting their distinct experiences and challenges within the ECD landscape. The study concludes with policy recommendations aimed at addressing these systemic challenges.