8 datasets found
  1. Data from: Mining GO Annotations for Improving Annotation Consistency

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Jul 25, 2012
    Cite
    Ferreira, António E. N.; Faria, Daniel; Pesquita, Catia; Falcão, André O.; Albrecht, Mario; Schlicker, Andreas; Bastos, Hugo (2012). Mining GO Annotations for Improving Annotation Consistency [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001153517
    Authors
    Ferreira, António E. N.; Faria, Daniel; Pesquita, Catia; Falcão, André O.; Albrecht, Mario; Schlicker, Andreas; Bastos, Hugo
    Description

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
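
    As a rough illustration of the basic association rule learning idea mentioned above (not the authors' algorithm, which the description does not specify), the sketch below counts how often two terms are annotated to the same protein and keeps rules that meet a support and confidence threshold. The protein IDs, term names, thresholds, and the function name `mine_rules` are all hypothetical.

```python
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.8):
    """Return (antecedent, consequent, support, confidence) for term pairs.

    support    = number of proteins annotated with both terms
    confidence = support / number of proteins annotated with the antecedent
    """
    term_count = {}
    pair_count = {}
    for terms in annotations.values():
        unique = set(terms)
        for term in unique:
            term_count[term] = term_count.get(term, 0) + 1
        # Ordered pairs, because "a implies b" and "b implies a" differ.
        for a, b in permutations(sorted(unique), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), n_ab in pair_count.items():
        confidence = n_ab / term_count[a]
        if n_ab >= min_support and confidence >= min_confidence:
            rules.append((a, b, n_ab, confidence))
    return sorted(rules)


# Hypothetical toy annotations: protein ID -> molecular function terms.
annotations = {
    "P1": ["kinase activity", "ATP binding"],
    "P2": ["kinase activity", "ATP binding"],
    "P3": ["kinase activity", "ATP binding"],
    "P4": ["ATP binding"],
    "P5": ["ATP binding", "helicase activity"],
}
rules = mine_rules(annotations)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent} (support={support}, confidence={confidence:.2f})")
```

    On this toy input only the rule from "kinase activity" to "ATP binding" passes both thresholds; the reverse rule fails on confidence (3 of 5 proteins).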

  2. Raw drift dataset

    • kaggle.com
    Updated Mar 25, 2025
    Cite
    andrew espira (2025). Raw drift dataset [Dataset]. https://www.kaggle.com/datasets/espirado/raw-drift-dataset/discussion
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    andrew espira
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset simulates real-world streaming data with controlled drift patterns across multiple features. It's specifically designed for developing and testing data mining techniques related to detecting and managing model drift in machine learning systems. The dataset contains intentional data quality issues that allow practitioners to experiment with preprocessing techniques and drift detection algorithms.

    The dataset contains 100,000 records with timestamps and multiple features exhibiting different drift patterns across four distinct phases:

    • timestamp: Time-series index with some gaps and duplicates
    • feature1: Numeric feature following normal distribution with shifting parameters
    • feature2: Numeric feature following exponential distribution with inconsistent formatting
    • feature3: Categorical feature with inconsistent casing and typos
    • log_message: Text field containing embedded information about system status
    • target: Binary classification target with concept drift across phases
    • date_str: Date representation with irregular formats
    • phase: Marker indicating which distribution regime the sample belongs to
    • irrelevant1, irrelevant2: Noise features with no predictive value
    • feature1_noisy: Correlated version of feature1 with added noise
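
    A minimal sketch of two steps such a dataset invites: normalising the casing of a categorical column, and measuring how far each phase's mean of a numeric feature sits from a baseline phase. The column names `phase`, `feature1`, and `feature3` come from the description above; the sample rows, the helper names, and the choice of a mean-shift statistic are assumptions for illustration, and typo correction is not attempted.

```python
from statistics import mean, stdev


def normalise_category(value):
    """Strip surrounding whitespace and lower-case a categorical value."""
    return value.strip().lower()


def phase_shift(rows, feature, baseline_phase):
    """Shift of each phase's mean from the baseline, in baseline std units."""
    by_phase = {}
    for row in rows:
        by_phase.setdefault(row["phase"], []).append(row[feature])
    base = by_phase[baseline_phase]
    mu, sigma = mean(base), stdev(base)
    return {phase: (mean(values) - mu) / sigma for phase, values in by_phase.items()}


# Hypothetical rows; the real dataset has 100,000 records and four phases.
rows = [
    {"phase": 1, "feature1": 0.0, "feature3": " A"},
    {"phase": 1, "feature1": 1.0, "feature3": "a"},
    {"phase": 1, "feature1": 2.0, "feature3": "B "},
    {"phase": 2, "feature1": 3.0, "feature3": "b"},
    {"phase": 2, "feature1": 4.0, "feature3": "A"},
    {"phase": 2, "feature1": 5.0, "feature3": "B"},
]

categories = {normalise_category(row["feature3"]) for row in rows}
shifts = phase_shift(rows, "feature1", baseline_phase=1)
print(categories)
print(shifts)
```

    A large shift for a later phase suggests the feature's distribution has moved; a real pipeline would likely use a proper statistical test rather than this simple ratio.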

  3. Student Dropout & Success Prediction Dataset

    • kaggle.com
    zip
    Updated Apr 23, 2025
    Cite
    Adil Shamim (2025). Student Dropout & Success Prediction Dataset [Dataset]. https://www.kaggle.com/adilshamim8/predict-students-dropout-and-academic-success
    Available download formats: zip (106181 bytes)
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset originates from a Portuguese higher education institution and was developed as part of a national project aiming to combat student dropout and academic failure in universities. It brings together rich information from 4,424 undergraduate students across 8 degree programs, such as Agronomy, Design, Education, Nursing, Journalism, Management, Social Service, and Technologies.

    Objective

    The core objective is to support early intervention by using machine learning models to predict a student’s academic outcome, that is, whether they will:

    • Drop out
    • Remain Enrolled
    • Successfully Graduate

    This is framed as a three-class classification problem with a known class imbalance, offering real-world challenges for predictive modeling and education analytics.

    Dataset Highlights

    • Instances (Rows): 4,424 students
    • Features (Columns): 36 total
      • Types: Integer, Categorical, and Real-valued
      • Includes both demographic and academic information
    • Target Variable: 'Target' (Categorical)
      • Classes: Dropout, Enrolled, Graduate

    Feature Categories

    1. Demographics & Socioeconomic:

      • Gender, Age, Marital Status
      • Nationality
      • Parental Education and Occupation
      • Scholarship, Tuition Fees, Application Mode
    2. Academic History:

      • Degree Program, Curricular Units Enrolled & Approved
      • Grades from 1st and 2nd semesters
      • Admission Grade, Previous Qualification
    3. External Factors:

      • GDP, Inflation Rate at Enrollment Time

    Data Preprocessing

    The original researchers performed extensive data cleaning, handling:

    • Outliers
    • Inconsistent entries
    • Anomalies
    • Missing values

    The final dataset contains no missing values.

    Suggested Use Cases

    • Educational Data Mining
    • Early Warning Systems for Student Dropout
    • Classification Benchmarking
    • Feature Importance & Interpretability Studies
    • Policy-making simulations for academic retention

    Recommended Setup

    • Task Type: Multiclass Classification
    • Evaluation Metrics: Accuracy, F1 Score (macro), Confusion Matrix
    • Suggested Split: 80% Training / 20% Testing
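
    The recommended setup can be sketched without any external library. The snippet below shows a shuffled 80/20 split and a macro-averaged F1 score over the three classes named above; the labels and predictions are made up, and the function names are hypothetical rather than part of the dataset or any package.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


classes = ["Dropout", "Enrolled", "Graduate"]

# Hypothetical labels and predictions, only to exercise the metric.
y_true = ["Dropout", "Dropout", "Enrolled", "Graduate", "Graduate", "Graduate"]
y_pred = ["Dropout", "Graduate", "Enrolled", "Graduate", "Graduate", "Graduate"]

train, test = train_test_split(list(range(10)))
score = macro_f1(y_true, y_pred, classes)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(len(train), len(test), round(score, 3), round(accuracy, 3))
```

    Macro F1 is a reasonable companion to accuracy here because the description notes a known class imbalance.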

    Citation & Source

    This dataset was created under the SATDAP - Capacitação da Administração Pública project funded by POCI-05-5762-FSE-000191 (Portugal) and is available through the UCI Machine Learning Repository.

  4. Experimental Design-Based Functional Mining and Characterization of...

    • plos.figshare.com
    ai
    Updated May 31, 2023
    Cite
    Takeru Nakazato; Tazro Ohta; Hidemasa Bono (2023). Experimental Design-Based Functional Mining and Characterization of High-Throughput Sequencing Data in the Sequence Read Archive [Dataset]. http://doi.org/10.1371/journal.pone.0077910
    Available download formats: ai
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Takeru Nakazato; Tazro Ohta; Hidemasa Bono
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    High-throughput sequencing technology, also called next-generation sequencing (NGS), has the potential to revolutionize the whole process of genome sequencing, transcriptomics, and epigenetics. Sequencing data is captured in a public primary data archive, the Sequence Read Archive (SRA). As of January 2013, data from more than 14,000 projects have been submitted to SRA, which is double that of the previous year. Researchers can download raw sequence data from the SRA website to perform further analyses and to compare with their own data. However, it is extremely difficult to search entries and download raw sequences of interest with SRA because the data structure is complicated, and experimental conditions along with raw sequences are partly described in natural language. Additionally, some sequences are of inconsistent quality because anyone can submit sequencing data to SRA with no quality check. Therefore, as a criterion of data quality, we focused on SRA entries that were cited in journal articles. We extracted SRA IDs and PubMed IDs (PMIDs) from SRA and full-text versions of journal articles and retrieved 2748 SRA ID-PMID pairs. We constructed a publication list referring to SRA entries. Since one of the main themes of -omics analyses is clarification of disease mechanisms, we also characterized SRA entries by disease keywords, according to the Medical Subject Headings (MeSH) extracted from articles assigned to each SRA entry. We obtained 989 SRA ID-MeSH disease term pairs, and constructed a disease list referring to SRA data. We previously developed feature profiles of diseases in a system called “Gendoo”. We generated hyperlinks between diseases extracted from SRA and their feature profiles. The developed project, publication and disease lists resulting from this study are available at our web service, called “DBCLS SRA” (http://sra.dbcls.jp/). This service will improve accessibility to high-quality data from SRA.

  5. AI For Process Optimization Market Analysis, Size, and Forecast 2025-2029 :...

    • technavio.com
    pdf
    Updated Oct 9, 2025
    Cite
    Technavio (2025). AI For Process Optimization Market Analysis, Size, and Forecast 2025-2029 : North America (US and Canada), APAC (China, Japan, India, and South Korea), Europe (Germany, UK, and France), Middle East and Africa (UAE and South Africa), South America (Brazil and Argentina), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/ai-for-process-optimization-market-industry-analysis
    Available download formats: pdf
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United Kingdom, Canada, United States
    Description

    AI For Process Optimization Market Size 2025-2029

    The AI for process optimization market size is forecast to increase by USD 17.2 billion, at a CAGR of 36.3% between 2024 and 2029.

    The global AI for process optimization market is defined by the corporate need for greater operational efficiency and cost reduction. Enterprises are adopting intelligent process automation to automate labor-intensive tasks and minimize human error. These AI-powered systems enable a shift from static workflows to dynamic, intelligent operations that adapt in real-time. This includes generative AI in manufacturing, where AI algorithms optimize routing and inventory levels. The integration of generative AI capabilities is expanding automation into cognitive and creative domains, validating AI adoption as a means to gain a competitive advantage through enhanced operational performance. This is particularly relevant for the artificial intelligence (AI) market in the manufacturing industry, where efficiency gains directly impact the bottom line.

    A transformative trend is the rapid integration of generative AI, which is accelerating the shift toward hyperautomation. This moves beyond traditional robotic process automation (RPA) by introducing systems capable of understanding natural language and making context-aware decisions. This trend is pushing organizations toward a disciplined approach to identify and automate as many business and IT processes as possible, using a combination of technologies such as AI and machine learning. However, the market faces a fundamental challenge with enterprise data infrastructure. AI models depend on high-quality data, but many organizations struggle with siloed, inconsistent, and incomplete datasets, which presents a significant barrier to effective AI implementation and model accuracy.

    What will be the Size of the AI For Process Optimization Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019 - 2023 and forecasts 2025-2029 - in the full report.

    The market is characterized by the continuous integration of intelligent workflow orchestration and decision management systems to enhance operational agility. Enterprises are deploying these technologies to move beyond static procedures, enabling dynamic adjustments in response to real-time data. This involves the application of process discovery algorithms and task mining solutions to identify bottlenecks and optimization opportunities. The focus is on creating a self-optimizing ecosystem where prescriptive analytics engines guide automated actions, ensuring persistent efficiency gains in areas like generative AI in manufacturing and intelligent process automation.

    Ongoing developments are centered on improving predictive process monitoring and the deployment of machine learning models for more accurate forecasting. This includes leveraging AI-powered root cause analysis to understand deviations from standard operating procedures and implementing exception handling automation to manage anomalies without human intervention. The use of a digital twin of an organization provides a simulated environment to test and refine these models. This analytical depth is crucial for sectors like generative AI in fulfillment and logistics, where operational precision directly impacts profitability and customer satisfaction.

    How is this AI For Process Optimization Industry segmented?

    The AI for process optimization industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in "USD million" for the period 2025-2029, as well as historical data from 2019 - 2023 for the following segments.

    • Deployment
      • Cloud-based
      • On-premises
    • Sector
      • Large enterprises
      • SME
    • End-user
      • BFSI
      • IT and telecom
      • Retail
      • Manufacturing
      • Healthcare
    • Solution
      • Automation
      • Process modeling
      • Monitoring and optimization
      • Content and document management
      • Integration
    • Geography
      • North America: US, Canada
      • APAC: China, Japan, India, South Korea
      • Europe: Germany, UK, France
      • Middle East and Africa: UAE, South Africa
      • South America: Brazil, Argentina
      • Rest of World (ROW)

    By Deployment Insights

    The cloud-based segment is estimated to witness significant growth during the forecast period.

    The cloud-based deployment model is the dominant and most rapidly expanding segment. Its ascendancy is fueled by advantages such as scalability, allowing organizations to adjust computational resources for training complex machine learning models. This elasticity eliminates the need for substantial upfront capital expenditure on physical hardware, shifting costs to a more manageable operational model. This model is particularly prevalent in the Middle East and Africa, which accounts for 4.65% of the market's growth potential, where cloud adoption supports rapid digitalization efforts.

    Furthermore, cloud platforms facilitate seamless integration and interoperability, enabling

  6. S1 Data -

    • figshare.com
    xlsx
    Updated Oct 21, 2024
    Cite
    Nguyen Ky Anh; Anbok Lee; Nguyen Ky Phat; Nguyen Thi Hai Yen; Nguyen Quang Thu; Nguyen Tran Nam Tien; Ho-Sook Kim; Tae Hyun Kim; Dong Hyun Kim; Hee-Yeon Kim; Nguyen Phuoc Long (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0311810.s004
    Available download formats: xlsx
    Dataset provided by
    PLOS ONE
    Authors
    Nguyen Ky Anh; Anbok Lee; Nguyen Ky Phat; Nguyen Thi Hai Yen; Nguyen Quang Thu; Nguyen Tran Nam Tien; Ho-Sook Kim; Tae Hyun Kim; Dong Hyun Kim; Hee-Yeon Kim; Nguyen Phuoc Long
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    There is an urgent need for better biomarkers for the detection of early-stage breast cancer. Utilizing untargeted metabolomics and lipidomics in conjunction with advanced data mining approaches for metabolism-centric biomarker discovery and validation may enhance the identification and validation of novel biomarkers for breast cancer screening. In this study, we employed a multimodal omics approach to identify and validate potential biomarkers capable of differentiating between patients with breast cancer and those with benign tumors. Our findings indicated that ether-linked phosphatidylcholine exhibited a significant difference between invasive ductal carcinoma and benign tumors, including cases with inconsistent mammography results. We observed alterations in numerous lipid species, including sphingomyelin, triacylglycerol, and free fatty acids, in the breast cancer group. Furthermore, we identified several dysregulated hydrophilic metabolites in breast cancer, such as glutamate, glycochenodeoxycholate, and dimethyluric acid. Through robust multivariate receiver operating characteristic analysis utilizing machine learning models, either linear support vector machines or random forest models, we successfully distinguished between cancerous and benign cases with promising outcomes. These results emphasize the potential of metabolic biomarkers to complement other criteria in breast cancer screening. Future studies are essential to further validate the metabolic biomarkers identified in our study and to develop assays for clinical applications.

  7. Geochronologic and geochemical data of plutonic rocks from the...

    • data.mendeley.com
    Updated Jun 25, 2021
    Cite
    Teresa Orozco-Esquivel (2021). Geochronologic and geochemical data of plutonic rocks from the Cretaceous-Eocene Mexican Magmatic Arc (CEMMA) [Dataset]. http://doi.org/10.17632/6jm7z683tn.1
    Authors
    Teresa Orozco-Esquivel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The data provided in this database were used to establish the temporality, composition and petrogenetic aspects of the so-called Cretaceous-Eocene Mexican Magmatic Arc (CEMMA). This magmatic arc is continuous throughout western Mexico and its understanding is critical to elucidating the tectonics of southwestern North America. The data were carefully compiled from the literature, including published articles, theses, and internal reports from mining companies, and to a large extent come from unpublished information generated by the working group (CONACYT Research Grant 49528-F). Data judged unreliable because of the methodology used, and data with inconsistent information, were not included. The Cretaceous-Eocene Mexican Magmatic Arc: Conceptual framework from geochemical and geochronological data of plutonic rocks

  8. S1 Data -

    • plos.figshare.com
    zip
    Updated Feb 21, 2025
    Cite
    Lazarus Obed Livingstone Banda; Chigonjetso Victoria Banda; Jane Thokozani Banda (2025). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0314530.s001
    Available download formats: zip
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lazarus Obed Livingstone Banda; Chigonjetso Victoria Banda; Jane Thokozani Banda
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The study on Early Childhood Development (ECD) practices in T/A Zilakoma, Nkhata Bay South Constituency, Malawi, employs the Ecological Systems Theory to explore recruitment, role definitions, and support systems. This theoretical construct enables an intricate examination of interactions within various environmental systems, emphasizing micro, meso, exo, macro, and chrono systems. Specifically, it illuminates the dynamics within immediate settings, interconnections among diverse systems, broader indirect influences, cultural ideologies, societal values, and temporal dimensions, offering a comprehensive lens for understanding educational contexts. The descriptive qualitative research was conducted within a case study framework to explore the practical experiences of stakeholders within ECD using semi-structured interview guides. Ethical standards were upheld, ensuring voluntary participation and confidentiality. Purposive sampling was used to collect data from diverse and knowledgeable participants involved in ECD domains, providing comprehensive insights aligned with the study’s objectives. Thematic analysis and sentiment mining were performed using Atlas 23 software. The results revealed themes such as recruitment practices relying on community-driven approaches, role ambiguity due to undefined responsibilities, informal evaluation processes, inconsistent training opportunities, and a dependency on community and volunteerism. These themes highlight the absence of formal structures and standardized processes in various aspects of ECD programs. Additionally, sentiment analysis illustrated diverse perspectives among stakeholders, reflecting their distinct experiences and challenges within the ECD landscape. The study concludes with policy recommendations aimed at addressing these systemic challenges.
