Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are powered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, although several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open-bases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global In-Database Machine Learning market size reached USD 2.77 billion in 2024. The market is exhibiting robust momentum, with a compound annual growth rate (CAGR) of 28.4% projected over the forecast period. By 2033, the In-Database Machine Learning market is expected to escalate to USD 21.13 billion globally, driven by increasing enterprise adoption of advanced analytics and artificial intelligence embedded directly within databases. This exponential growth is fueled by the surging demand for real-time data processing, operational efficiency, and the seamless integration of machine learning (ML) models within business-critical applications.
A significant growth factor in the In-Database Machine Learning market is the rising need for organizations to derive actionable insights from massive volumes of data in real time. Traditional machine learning workflows often require extracting data from databases, leading to latency, security risks, and operational bottlenecks. In-database machine learning addresses these challenges by enabling ML algorithms to operate directly where the data resides, eliminating the need for data movement. This approach not only accelerates the analytics lifecycle but also enhances data security and compliance, which is particularly crucial in regulated industries such as banking, healthcare, and finance. Organizations are increasingly recognizing the strategic value of embedding ML capabilities within their database environments to unlock deeper insights, automate decision-making, and drive competitive advantage.
Another pivotal driver is the evolution of database technologies and the proliferation of cloud-based database platforms. Modern relational and NoSQL databases are now equipped with native machine learning functionalities, making it easier for enterprises to deploy, train, and operationalize ML models at scale. The shift towards cloud-based and hybrid database infrastructures further amplifies the adoption of in-database ML, as organizations seek scalable and flexible solutions that can handle diverse data types and workloads. Vendors are responding by offering integrated ML toolkits and APIs, lowering the entry barrier for data scientists and business analysts. Furthermore, the convergence of big data, artificial intelligence, and advanced analytics is fostering innovation, enabling organizations to tackle complex use cases such as fraud detection, predictive maintenance, and personalized customer experiences.
The increasing emphasis on digital transformation across industries is also propelling the growth of the In-Database Machine Learning market. Enterprises are under pressure to modernize their data architectures and leverage AI-driven insights to optimize operations, reduce costs, and enhance customer engagement. In-database ML empowers organizations to streamline their analytics workflows, achieve real-time intelligence, and respond swiftly to market changes. The technology’s ability to scale across large datasets and integrate seamlessly with existing business processes makes it an attractive proposition for both large enterprises and small and medium-sized enterprises (SMEs). As a result, investments in in-database ML solutions are expected to surge, with vendors continuously innovating to deliver enhanced performance, automation, and explainability.
From a regional perspective, North America currently leads the global In-Database Machine Learning market, accounting for the largest revenue share in 2024. This dominance is attributed to the region’s advanced IT infrastructure, high adoption of cloud technologies, and the strong presence of leading technology vendors. Europe follows closely, driven by stringent data privacy regulations and growing investments in AI-driven analytics across sectors such as BFSI, healthcare, and manufacturing. The Asia Pacific region is emerging as a high-growth market, propelled by rapid digitalization, expanding enterprise data volumes, and government initiatives to foster AI innovation. Latin America and the Middle East & Africa are also witnessing increased adoption, albeit at a slower pace, as organizations in these regions gradually embrace data-driven decision-making and cloud-based analytics platforms.
The In-Database Machine Learning market is segmented by component into Software and S
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database was first created for the scientific article entitled "Reviewing Machine Learning of corrosion prediction: a data-oriented perspective"
L.B. Coelho 1 , D. Zhang 2 , Y.V. Ingelgem 1 , D. Steckelmacher 3 , A. Nowé 3 , H.A. Terryn 1
1 Department of Materials and Chemistry, Research Group Electrochemical and Surface Engineering, Vrije Universiteit Brussel, Brussels, Belgium 2 A Beijing Advanced Innovation Center for Materials Genome Engineering, National Materials Corrosion and Protection Data Center, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, China 3 VUB Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium
Different metrics can be used to evaluate the prediction accuracy of regression models. However, only papers providing relative metrics (MAPE, R²) were included in this database. We tried as much as possible to include descriptors of all major ML procedure steps, including data collection (“Data acquisition”), data cleaning and feature engineering (“Feature reduction”), model validation (“Train-Test split”*), etc.
*The total dataset is typically split into training sets and testing (unknown data) sets for performance evaluation of the model. Nonetheless, sometimes only the training or the testing performance was reported (“?” marks were added in the respective evaluation metric field(s)). The “Average R²” was sometimes considered for studies employing “CV” (cross-validation) on the dataset. For a detailed description of the basic ML procedures, the reader can refer to the References section of the Review article.
According to our latest research, the global in-database machine learning market size in 2024 stands at USD 2.74 billion, reflecting the sector’s rapid adoption across diverse industries. The market is expected to grow at a robust CAGR of 28.6% from 2025 to 2033, reaching a projected value of USD 24.19 billion by the end of the forecast period. This exceptional growth is primarily driven by the increasing demand for advanced analytics, real-time data processing, and the seamless integration of machine learning capabilities directly within database environments, which are essential for accelerating business insights and operational efficiency.
The primary growth factor propelling the in-database machine learning market is the exponential surge in data volumes generated by enterprises worldwide. As organizations transition to digital-first operations, the need to analyze vast datasets in real time has become paramount. Traditional machine learning workflows, which require data extraction and movement to external environments, are increasingly seen as inefficient and prone to latency and security issues. In-database machine learning eliminates these bottlenecks by enabling algorithms to run directly within the database, thus reducing data movement, minimizing latency, and ensuring higher data security. This approach not only streamlines the analytics pipeline but also empowers businesses to derive actionable insights faster, supporting critical functions such as fraud detection, predictive maintenance, and customer personalization.
Another significant factor fueling market expansion is the growing adoption of cloud-based data platforms and the proliferation of hybrid IT infrastructures. Enterprises are leveraging cloud-native databases and data warehouses to centralize and scale their analytics capabilities. In-database machine learning solutions are designed to seamlessly integrate with these modern architectures, allowing organizations to harness the power of machine learning without the need for extensive data migration or IT overhead. This integration facilitates agile development, lowers total cost of ownership, and enables organizations to respond swiftly to market changes. Furthermore, the rise of open-source machine learning frameworks and APIs has democratized access to advanced analytics, making it easier for businesses of all sizes to implement and benefit from in-database ML capabilities.
A third pivotal growth driver is the increasing emphasis on regulatory compliance, data privacy, and security in highly regulated industries such as BFSI and healthcare. In-database machine learning offers a compelling solution by keeping sensitive data within secure database environments, thereby reducing the risk of data breaches and ensuring compliance with stringent data protection regulations such as GDPR and HIPAA. This capability is particularly valuable for organizations operating in regions with complex regulatory landscapes, where data residency and sovereignty are critical concerns. As a result, the adoption of in-database ML is accelerating among enterprises that prioritize security, governance, and auditability in their analytics workflows.
From a regional perspective, North America continues to dominate the in-database machine learning market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology vendors, early adoption of advanced analytics, and a mature digital infrastructure contribute to North America’s leadership. However, rapid economic development, digitization initiatives, and expanding IT ecosystems in Asia Pacific are positioning the region as a significant growth engine for the forecast period. Meanwhile, Europe’s focus on data privacy and innovation is driving substantial investments in secure and compliant in-database ML solutions, further fueling market growth across the continent.
The in-database machine learning mark
https://www.datainsightsmarket.com/privacy-policy
The global Vector Database Software market is poised for substantial growth, projected to reach an estimated $XXX million in 2025, with an impressive Compound Annual Growth Rate (CAGR) of XX% during the forecast period of 2025-2033. This rapid expansion is fueled by the increasing adoption of AI and machine learning across industries, necessitating efficient storage and retrieval of unstructured data like images, audio, and text. The burgeoning demand for enhanced search capabilities, personalized recommendations, and advanced anomaly detection is driving the market forward. Key market drivers include the widespread implementation of large language models (LLMs), the growing need for semantic search functionalities, and the continuous innovation in AI-powered applications. The market is segmenting into applications catering to both Small and Medium-sized Enterprises (SMEs) and Large Enterprises, with a clear shift towards Cloud-based solutions owing to their scalability, cost-effectiveness, and ease of deployment. The vector database landscape is characterized by dynamic innovation and fierce competition, with prominent players like Pinecone, Weaviate, Supabase, and Zilliz Cloud leading the charge. Emerging trends such as the development of hybrid search capabilities, integration with existing data infrastructure, and enhanced security features are shaping the market's trajectory. While the market shows immense promise, certain restraints, including the complexity of data integration and the need for specialized technical expertise, may pose challenges. Geographically, North America is expected to dominate the market share due to its early adoption of AI technologies and robust R&D investments, followed closely by Asia Pacific, which is witnessing rapid digital transformation and a surge in AI startups. Europe and other emerging regions are also anticipated to contribute significantly to market growth as AI adoption becomes more widespread. 
This report delves into the rapidly evolving Vector Database Software Market, providing a detailed analysis of its landscape from 2019 to 2033. With a Base Year of 2025, the report offers crucial insights for the Estimated Year of 2025 and projects market dynamics through the Forecast Period of 2025-2033, building upon the Historical Period of 2019-2024. The global vector database software market is poised for significant expansion, with an estimated market size projected to reach hundreds of millions of dollars by 2025, and anticipated to grow exponentially in the coming years. This growth is fueled by the increasing adoption of AI and machine learning across various industries, necessitating efficient storage and retrieval of high-dimensional vector data.
Collection of databases, domain theories, and data generators that are used by the machine learning community for empirical analysis of machine learning algorithms. Datasets approved for the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered in developing V1.0 of the NABat machine-learning algorithm; however, the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From the files available in the NABat database, we considered files from 35 classes (34 species and a noise class).
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until the files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (the test set being the holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
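The selection and split procedure described above can be sketched roughly as follows. This is only an illustration, not the NABat pipeline itself: the `(path, species, grid_cell)` tuple layout, the function names, the reading of the 1,250 limit as a cap per species/grid cell combination, and the 80/10/10 split fractions are all assumptions, since the description does not state them.

```python
import random
from collections import defaultdict


def cap_per_group(files, cap=1250, seed=0):
    """Randomly keep at most `cap` files per (species, grid_cell) group.

    `files` is a list of (path, species, grid_cell) tuples; this layout is
    hypothetical and chosen only for the example.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path, species, cell in files:
        groups[(species, cell)].append(path)

    selected = []
    for key in sorted(groups):           # sorted for reproducible ordering
        paths = groups[key]
        rng.shuffle(paths)               # random selection within the group
        selected.extend((p, key[0]) for p in paths[:cap])
    return selected


def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and split into train/validation/test (fractions are assumed)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train = int(len(items) * fractions[0])
    n_val = int(len(items) * fractions[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # holdout set, withheld from release
    return train, val, test
```

A group with fewer files than the cap is simply used in full, which corresponds to the "exhausted" case in the description.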
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of advanced structural materials, such as preplaced aggregate concrete (PAC), fiber-reinforced concrete (FRC), and FRC beams, has revolutionized the field of civil engineering. The associated paper, "RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials", published in Computers and Structures, introduces a novel RAGN-R approach and proposes a comprehensive predictive model. The dataset used in that research is published here for use by other researchers; for more details, please see the paper.
The original contributions presented in the study are included in the article and online through the TAME Toolkit, available at: https://uncsrp.github.io/Data-Analysis-Training-Modules/, with underlying code and datasets available in the parent UNC-SRP GitHub website (https://github.com/UNCSRP). This dataset is associated with the following publication: Roell, K., L. Koval, R. Boyles, G. Patlewicz, C. Ring, C. Rider, C. Ward-Caviness, D. Reif, I. Jaspers, R. Fry, and J. Rager. Development of the InTelligence And Machine LEarning (TAME) Toolkit for Introductory Data Science, Chemical-Biological Analyses, Predictive Modeling, and Database Mining for Environmental Health Research. Frontiers in Toxicology. Frontiers, Lausanne, SWITZERLAND, 4: 893924, (2022).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is designed to test Machine-Learning techniques on Computational Fluid Dynamics (CFD) data.
It contains two-dimensional RANS simulations of the turbulent flow around NACA 4-digit airfoils at a fixed angle of attack (10 degrees) and a fixed Reynolds number (3x10^6). The whole NACA 4-digit family is spanned.
The present dataset contains 425 geometries; 2,600 further geometries are published in an accompanying repository (10.5281/zenodo.4106752).
For further information refer to: Schillaci, A., Quadrio, M., Pipolo, C., Restelli, M., Boracchi, G. "Inferring Functional Properties from Fluid Dynamics Features" 2020 25th International Conference on Pattern Recognition (ICPR) Milan, Italy, Jan 10-15, 2021
https://dataintelo.com/privacy-and-policy
The global database market size was valued at approximately USD 67 billion in 2023 and is projected to reach USD 138 billion by 2032, growing at a compound annual growth rate (CAGR) of 8.3%. The market is poised for significant growth due to the increasing demand for data storage solutions and the rapid digital transformation across various industries. As businesses continue to generate massive volumes of data, the need for efficient and scalable database solutions is becoming more critical than ever. This growth is further propelled by advancements in cloud computing and the increasing adoption of artificial intelligence and machine learning technologies, which require robust database management systems to handle complex data sets.
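As a quick sanity check on figures like these, the compound annual growth rate implied by a start value, an end value, and a number of years can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes the growth is compounded over the nine years from 2023 to 2032, which the report does not state explicitly. Under that assumption the implied rate falls between 8% and 9%, close to the quoted 8.3%.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


# Figures quoted in the report (USD billion); the 9-year span is an assumption.
implied_rate = cagr(67, 138, 2032 - 2023)
print(f"Implied CAGR: {implied_rate:.2%}")
print(f"67 compounded at 8.3% for 9 years: {project(67, 0.083, 9):.1f}")
```

Small differences between the implied and quoted rates usually come down to which year is treated as the base of the forecast period.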
One of the primary growth factors for the database market is the exponential increase in data generation from various sources, including social media, IoT devices, and enterprise applications. As organizations strive to leverage data for competitive advantage, the demand for sophisticated database technologies that can manage, process, and analyze large volumes of data is on the rise. These technologies enable businesses to gain actionable insights, improve decision-making, and enhance customer experiences. Additionally, the proliferation of connected devices and the Internet of Things (IoT) are contributing to the surge in data volume, necessitating the deployment of advanced database systems to handle the influx of information efficiently.
The cloud computing revolution is another significant growth driver for the database market. With the increasing adoption of cloud-based services, organizations are shifting from traditional on-premises database solutions to cloud-based database management systems. This transition is driven by the need for scalability, flexibility, and cost-effectiveness, as cloud solutions offer the ability to scale resources up or down based on demand. Cloud databases also provide enhanced data security, disaster recovery, and backup solutions, making them an attractive option for businesses of all sizes. Moreover, cloud service providers continuously innovate by offering managed database services, reducing the burden on IT departments and allowing organizations to focus on core business activities.
The rise of artificial intelligence (AI) and machine learning (ML) technologies is also playing a crucial role in shaping the future of the database market. These technologies require robust and dynamic database systems capable of handling complex algorithms and large data sets. Databases optimized for AI and ML applications enable organizations to harness the power of predictive analytics, automation, and data-driven decision-making. The integration of AI and ML with database systems enhances the ability to identify patterns, detect anomalies, and predict future trends, further driving the demand for advanced database solutions.
From a regional perspective, North America is expected to dominate the database market, owing to the presence of established technology companies and the rapid adoption of advanced technologies. The region's mature IT infrastructure and the increasing need for data-driven insights in various industries contribute to the market's growth. Asia Pacific is anticipated to witness the highest growth rate during the forecast period, driven by the increasing digitization efforts, rising internet penetration, and the growing popularity of cloud-based solutions. Europe is also expected to experience significant growth due to the expanding IT sector and the increasing adoption of data analytics solutions across industries.
The database market can be segmented by type into relational, non-relational, cloud, and others. Relational databases are among the oldest and most established types of database systems, widely used across industries due to their ability to handle structured data efficiently. These databases rely on structured query language (SQL) for managing and manipulating data, making them suitable for applications that require complex querying and transaction processing. Despite their maturity, relational databases continue to evolve, with advancements such as NewSQL and distributed SQL databases enhancing their scalability and performance for modern applications.
Non-relational databases, also known as NoSQL databases, have gained popularity in recent years due to their flexibility and ability to handle unstructured data. These databases are designed to accommodate a diverse range of data types, making them ideal for applications involving large v
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data was collected by a Kinect V2 as a set of X, Y, Z coordinates at 60 fps during 6 different yoga-inspired back stretches. There are 541 files in the dataset, each containing position and velocity for 25 body joints. These joints include: Head, Neck, SpineShoulder, SpineMid, SpineBase, ShoulderRight, ShoulderLeft, HipRight, HipLeft, ElbowRight, WristRight, HandRight, HandTipRight, ThumbRight, ElbowLeft, WristLeft, HandLeft, HandTipLeft, ThumbLeft, KneeRight, AnkleRight, FootRight, KneeLeft, AnkleLeft, FootLeft.
The program used to record this data was adapted from Thomas Sanchez Langeling’s skeleton recording code. The program was set to record data for each body part as a separate file, repeated for each exercise. Each body part for a specific exercise is stored in a distinct folder. These folders are named with the following convention: subjNumber_stretchName_trialNumber
The subjNumber ranged from 0 – 8. The stretchName was one of the following: Mermaid, Seated, Sumo, Towel, Wall, Y. The trialNumber ranged from 0 – 9 and represented the repetition number.
These coordinates were chosen to have an origin centered at the subject’s upper chest.
The data was standardized to the following conditions:
1) Kinect placed at the height of 2 ft and 3 in
2) Subject consistently positioned 6.5 ft away from the camera with their chest facing the camera
3) Each participant completed 10 repetitions of each stretch before continuing on
Data was collected from the following population:
* Adults ages 18-21
* Females: 4
* Males: 5
The following types of pre-processing occurred at the time of data collection.
Velocity Data: Calculated using a discrete derivative equation with a spacing of 5 frames, chosen to reduce the sensitivity of the velocity function: v[n] = (x[n] - x[n-5]) / 5. This is applied to all body parts and all axes individually.
Related manuscript: Capella, B., Subrmanian, D., Klatzky, R., & Siewiorek, D. Action Pose Recognition from 3D Camera Data Using Inter-frame and Inter-joint Dependencies. Preprint at link in references.
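The folder naming convention and the 5-frame velocity formula above can be illustrated with a short sketch. This is not the recording program's code: the function names are hypothetical, and returning `None` for the first five frames (where x[n-5] does not exist) is an assumption, since the description does not say how those frames are handled. The result is in coordinate units per frame, as in the stated formula.

```python
def parse_folder_name(name):
    """Split 'subjNumber_stretchName_trialNumber' into its three fields."""
    subj, stretch, trial = name.split("_")
    return int(subj), stretch, int(trial)


def lagged_velocity(x, lag=5):
    """Discrete derivative v[n] = (x[n] - x[n - lag]) / lag for one axis.

    Frames with n < lag have no defined value and are returned as None
    (an assumption made for this example).
    """
    return [
        None if n < lag else (x[n] - x[n - lag]) / lag
        for n in range(len(x))
    ]


# Example: one coordinate of one joint moving 2 units per frame.
positions = [2.0 * n for n in range(10)]
print(parse_folder_name("3_Mermaid_7"))
print(lagged_velocity(positions))
```

Differencing over five frames instead of one smooths out frame-to-frame jitter in the joint positions, at the cost of a slightly delayed velocity estimate.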
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ECGs from the MIT-NSR database, with some modifications to make them more suitable as a playground dataset for machine learning.
Original data set description:
MIT-BIH Normal Sinus Rhythm Database
George Moody, Published: Aug. 3, 1999. Version: 1.0.0
This database includes 18 long-term ECG recordings of subjects referred to the Arrhythmia Laboratory at Boston's Beth Israel Hospital (now the Beth Israel Deaconess Medical Center). Subjects included in this database were found to have had no significant arrhythmias; they include 5 men, aged 26 to 45, and 13 women, aged 20 to 50.
DOI: https://doi.org/10.13026/C2NK5R
Link: https://www.physionet.org/content/nsrdb/1.0.0/
Ref: Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
https://www.datainsightsmarket.com/privacy-policy
The global database market is experiencing robust growth, driven by the increasing adoption of cloud computing, big data analytics, and the expanding digital transformation initiatives across various industries. The market, estimated at $150 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $450 billion by 2033. This expansion is fueled by several key factors. The shift towards cloud-based database solutions offers scalability, cost-efficiency, and enhanced accessibility, attracting both small and large enterprises. Furthermore, the burgeoning need for real-time data processing and advanced analytics is driving demand for high-performance databases capable of handling massive datasets. The rise of artificial intelligence (AI) and machine learning (ML) applications, which rely heavily on efficient data management, further accelerates market growth. However, the market also faces certain restraints. Data security and privacy concerns remain paramount, requiring robust security measures and compliance with evolving regulations like GDPR. The complexity of integrating new database solutions into existing IT infrastructures can pose a challenge for some organizations. Furthermore, the high cost of implementing and maintaining advanced database systems can be a barrier to entry for smaller companies. Despite these challenges, the long-term outlook for the database market remains positive, with significant growth opportunities in emerging technologies like edge computing and the Internet of Things (IoT), which generate vast amounts of data requiring efficient storage and processing. Segmentation analysis reveals a strong demand across all enterprise sizes, with large enterprises leading the adoption of cloud-based solutions, while smaller enterprises show increasing preference for cost-effective cloud-based options and SaaS offerings.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset of venomous Bangladeshi snakes, comprising images of Russell's Viper, King Cobra, and Common Krait, offers a valuable resource for developing an image processing-based deep learning algorithm. This algorithm can aid in the rapid identification of these snakes, enhancing both ecological preservation and public safety. The comprehensive methods and protocols described ensure that the data collection and processing are rigorous, enabling others to reproduce and build upon this research effectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ACI FRP databases - a database to accompany, "Machine learning assessment of frp-strengthened and reinforced concrete members"
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process.
Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer was used as training and testing sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance based on different datasets made of titles, abstracts and metadata.
Results: The top performance classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database.
Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while only misclassifying 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarkers datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
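For readers unfamiliar with the F2-score reported above, it is the F-beta measure with beta = 2, which weights recall more heavily than precision; this suits literature screening, where missing a relevant article is costlier than reviewing an irrelevant one. The sketch below shows the standard formula. The function name and the confusion-matrix counts are made up for illustration and are not the study's results.

```python
def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta from true positives, false positives, and false negatives.

    F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision
    and R is recall. beta > 1 favors recall; beta = 1 gives the usual F1.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical screening outcome: high recall, modest precision.
tp, fp, fn = 80, 70, 20
print(f"F1 = {fbeta_score(tp, fp, fn, beta=1.0):.3f}")
print(f"F2 = {fbeta_score(tp, fp, fn, beta=2.0):.3f}")
```

With recall higher than precision, as in this hypothetical case, F2 comes out above F1, reflecting the extra weight placed on finding relevant articles.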
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Final database with the cleaned articles.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of 35,608 materials with their topological properties is constructed by combining the density functional theory (DFT) results of Materiae and the Topological Materials Database. Thanks to this, machine-learning approaches are developed to categorize materials into five distinct topological types, with the XGBoost model achieving an impressive 85.2% classification accuracy. By conducting generalization tests on different sub-datasets, differences are identified between the original datasets in terms of topological types, chemical elements, unknown magnetic compounds, and feature space coverage. Their impact on model performance is analyzed. Turning to the simpler binary classification between trivial insulators and nontrivial topological materials, three different approaches are also tested. Key characteristics influencing material topology are identified, with the maximum packing efficiency and the fraction of p valence electrons being highlighted as critical features.