Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
They are available at https://github.com/nerdyqx/ML. (ZIP)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are empowered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, while several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases are not off-the-shelf resources for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types, with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global In-Database Machine Learning market size reached USD 2.77 billion in 2024. The market is exhibiting robust momentum, with a compound annual growth rate (CAGR) of 28.4% projected over the forecast period. By 2033, the In-Database Machine Learning market is expected to escalate to USD 21.13 billion globally, driven by increasing enterprise adoption of advanced analytics and artificial intelligence embedded directly within databases. This exponential growth is fueled by the surging demand for real-time data processing, operational efficiency, and the seamless integration of machine learning (ML) models within business-critical applications.
A significant growth factor in the In-Database Machine Learning market is the rising need for organizations to derive actionable insights from massive volumes of data in real time. Traditional machine learning workflows often require extracting data from databases, leading to latency, security risks, and operational bottlenecks. In-database machine learning addresses these challenges by enabling ML algorithms to operate directly where the data resides, eliminating the need for data movement. This approach not only accelerates the analytics lifecycle but also enhances data security and compliance, which is particularly crucial in regulated industries such as banking, healthcare, and finance. Organizations are increasingly recognizing the strategic value of embedding ML capabilities within their database environments to unlock deeper insights, automate decision-making, and drive competitive advantage.
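The core idea described above — running the model where the data resides instead of exporting rows first — can be shown with a toy sketch using SQLite and hypothetical, pre-trained linear-model coefficients (illustrative only, not any vendor's actual in-database ML API): the scoring expression is pushed into the SQL query, so no data leaves the database before prediction.

```python
import sqlite3

# Hypothetical coefficients from a model trained offline (illustrative only).
W_AMOUNT, W_HOUR, BIAS = 0.004, 0.12, -3.0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, hour INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 120.0, 14), (2, 980.0, 3), (3, 45.5, 9)],
)

# Score every row with a linear decision function directly in SQL:
# the raw rows are never extracted before scoring.
rows = conn.execute(
    "SELECT id, ? * amount + ? * hour + ? AS score "
    "FROM transactions ORDER BY id",
    (W_AMOUNT, W_HOUR, BIAS),
).fetchall()
for rid, score in rows:
    print(rid, round(score, 3))
```

Only the per-row scores (or an aggregate of them) cross the database boundary, which is the latency and security argument made above.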
Another pivotal driver is the evolution of database technologies and the proliferation of cloud-based database platforms. Modern relational and NoSQL databases are now equipped with native machine learning functionalities, making it easier for enterprises to deploy, train, and operationalize ML models at scale. The shift towards cloud-based and hybrid database infrastructures further amplifies the adoption of in-database ML, as organizations seek scalable and flexible solutions that can handle diverse data types and workloads. Vendors are responding by offering integrated ML toolkits and APIs, lowering the entry barrier for data scientists and business analysts. Furthermore, the convergence of big data, artificial intelligence, and advanced analytics is fostering innovation, enabling organizations to tackle complex use cases such as fraud detection, predictive maintenance, and personalized customer experiences.
The increasing emphasis on digital transformation across industries is also propelling the growth of the In-Database Machine Learning market. Enterprises are under pressure to modernize their data architectures and leverage AI-driven insights to optimize operations, reduce costs, and enhance customer engagement. In-database ML empowers organizations to streamline their analytics workflows, achieve real-time intelligence, and respond swiftly to market changes. The technology’s ability to scale across large datasets and integrate seamlessly with existing business processes makes it an attractive proposition for both large enterprises and small and medium-sized enterprises (SMEs). As a result, investments in in-database ML solutions are expected to surge, with vendors continuously innovating to deliver enhanced performance, automation, and explainability.
From a regional perspective, North America currently leads the global In-Database Machine Learning market, accounting for the largest revenue share in 2024. This dominance is attributed to the region’s advanced IT infrastructure, high adoption of cloud technologies, and the strong presence of leading technology vendors. Europe follows closely, driven by stringent data privacy regulations and growing investments in AI-driven analytics across sectors such as BFSI, healthcare, and manufacturing. The Asia Pacific region is emerging as a high-growth market, propelled by rapid digitalization, expanding enterprise data volumes, and government initiatives to foster AI innovation. Latin America and the Middle East & Africa are also witnessing increased adoption, albeit at a slower pace, as organizations in these regions gradually embrace data-driven decision-making and cloud-based analytics platforms.
The In-Database Machine Learning market is segmented by component into Software and S
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Large-Scale AI Models database documents over 200 models trained with more than 10²³ floating point operations, at the leading edge of scale and capabilities.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Heart failure (HF) is the final stage of many heart diseases. Mortality rates among HF patients are highly variable, ranging from 5% to 75%. Evaluating the all-cause mortality of HF patients is an important means of preventing death and improving patient health. In practice, however, machine learning models struggle to achieve good results on HF data with missing values, high dimensionality, and class imbalance. We therefore propose a deep learning system. In this system, an indicator vector marks whether each value is observed or padded, which quickly handles missing values and helps expand the data dimensions. A convolutional neural network with different kernel sizes then extracts feature information, and a multi-head self-attention mechanism captures whole-channel information, which is essential for improving the system's performance. In addition, a focal loss function is introduced to better handle the class imbalance. The experimental data come from the public MIMIC-III database and contain valid records for 10,311 patients. The proposed system effectively and quickly predicts four death types: death within 30 days, within 180 days, within 365 days, and after 365 days. Our study uses Deep SHAP to interpret the deep learning model and obtains the top 15 characteristics. These characteristics further confirm the effectiveness and rationality of the system and can help provide better medical care.
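The indicator-vector idea from the abstract can be sketched in a few lines (an illustrative reconstruction, not the authors' code): each feature is paired with a 0/1 flag marking whether it was observed or padded, which both fills the gap and expands the feature dimension.

```python
def encode_with_indicator(record, pad_value=0.0):
    """Replace missing values (None) with pad_value and append a
    0/1 indicator per feature: 1 = observed, 0 = padded."""
    values = [pad_value if v is None else float(v) for v in record]
    indicators = [0.0 if v is None else 1.0 for v in record]
    return values + indicators

# A toy patient record with two missing lab values.
encoded = encode_with_indicator([72.0, None, 1.3, None])
print(encoded)  # [72.0, 0.0, 1.3, 0.0, 1.0, 0.0, 1.0, 0.0]
```

The downstream network can then learn to treat padded positions differently from genuine zeros, which is what makes this simple scheme effective.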
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This database was first created for the scientific article entitled "Reviewing Machine Learning of corrosion prediction: a data-oriented perspective"
L.B. Coelho 1, D. Zhang 2, Y.V. Ingelgem 1, D. Steckelmacher 3, A. Nowé 3, H.A. Terryn 1
1 Department of Materials and Chemistry, Research Group Electrochemical and Surface Engineering, Vrije Universiteit Brussel, Brussels, Belgium
2 Beijing Advanced Innovation Center for Materials Genome Engineering, National Materials Corrosion and Protection Data Center, Institute for Advanced Materials and Technology, University of Science and Technology Beijing, Beijing, China
3 VUB Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium
Several metrics can be used to evaluate the prediction accuracy of regression models; however, only papers providing relative metrics (MAPE, R²) were included in this database. We tried as much as possible to include descriptors of all major ML procedure steps, including data collection ("Data acquisition"), data cleaning and feature engineering ("Feature reduction"), model validation ("Train-Test split"*), etc.
*The total dataset is typically split into a training set and a testing set (unseen data) for performance evaluation of the model. Nonetheless, sometimes only the training or the testing performance was reported ("?" marks were added in the respective evaluation metric field(s)). The "Average R²" was sometimes considered for studies employing cross-validation ("CV") on the dataset. For a detailed description of the basic ML procedures, the reader can refer to the References topic in the Review article.
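For reference, the two relative metrics used as inclusion criteria (MAPE and R²) can be computed as follows (a self-contained sketch with toy numbers):

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy measured vs. predicted corrosion rates.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.1, 1.9, 4.2]
print(round(mape(y_true, y_pred), 3), round(r_squared(y_true, y_pred), 4))
```

Both are relative (scale-free) measures, which is why they allow comparison across studies that report corrosion rates in different units — the reason given above for restricting the database to them.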
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: People with traumatic brain injury (TBI) are at high risk for infection and sepsis. The aim of the study was to develop and validate an explainable machine learning (ML) model based on clinical features for early prediction of the risk of sepsis in TBI patients. Methods: We enrolled all patients with TBI in the Medical Information Mart for Intensive Care IV database from 2008 to 2019. All patients were randomly divided into a training set (70%) and a test set (30%). Univariate and multivariate regression analyses were used for feature selection. Six ML methods were applied to develop the model. The predictive performance of the different models was determined based on the area under the curve (AUC) and calibration curves in the test cohort. In addition, we selected the eICU Collaborative Research Database version 1.2 as the external validation dataset. Finally, we used Shapley additive explanations to account for the effects of the features attributed to the model. Results: Of the 1555 patients enrolled in the final cohort, 834 (53.6%) developed sepsis after TBI. Six variables were associated with concomitant sepsis and were used to develop the ML models. Of the six models constructed, the Extreme Gradient Boosting (XGB) model achieved the best performance, with an AUC of 0.807 and an accuracy of 74.5% in the internal validation cohort, and an AUC of 0.762 in external validation. Feature importance analysis revealed that use of mechanical ventilation, SAPS II score, use of intravenous pressors, blood transfusion on admission, history of diabetes, and presence of post-stroke sequelae were the six most influential features of the XGB model. Conclusion: As shown in the study, the ML model could be used to predict the occurrence of sepsis in patients with TBI in the intensive care unit.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present the full database of the article "Explainable Supervised Machine Learning Model to Predict Solvation Free Energy".
This is the database used for the ML model, containing a variety of solvent-solute pairs with known experimental solvation free energy (ΔGsolv) values. Data entries were collected from two separate databases: the FreeSolv library, with 642 experimental aqueous ΔGsolv determinations, and the Solv@TUM database, with 5597 entries for non-aqueous solvents. Both databases were selected for their wide range of solute/solvent pairs, amassing 6239 experimental values across light- and heavy-atom solutes with diverse solvent structures and small value uncertainties.
Experimental ΔGsolv values range from -14 to 4 kcal mol-1, and each solute/solvent pair is represented by its chemical family, SMILES string, and InChIKey. We generated 213 chemical descriptors for every solvent and solute in each entry using the RDKit software, version 2022.09.4, running on top of Python 3.9. Descriptors were calculated from the "MolFromSmiles" function in "rdkit.Chem", and descriptors with non-numerical values were removed. The descriptors encode significant chemical information and are used to represent physicochemical characteristics of the compounds, building a relationship between structure and ΔGsolv.
Through Machine Learning regression algorithms, our models were able to make ΔGsolv predictions with high accuracy, based on the information encoded in each chemical feature.
https://www.datainsightsmarket.com/privacy-policy
The global Vector Database Software market is poised for substantial growth, projected to reach an estimated $XXX million in 2025, with an impressive Compound Annual Growth Rate (CAGR) of XX% during the forecast period of 2025-2033. This rapid expansion is fueled by the increasing adoption of AI and machine learning across industries, necessitating efficient storage and retrieval of unstructured data like images, audio, and text. The burgeoning demand for enhanced search capabilities, personalized recommendations, and advanced anomaly detection is driving the market forward. Key market drivers include the widespread implementation of large language models (LLMs), the growing need for semantic search functionalities, and the continuous innovation in AI-powered applications. The market is segmenting into applications catering to both Small and Medium-sized Enterprises (SMEs) and Large Enterprises, with a clear shift towards Cloud-based solutions owing to their scalability, cost-effectiveness, and ease of deployment.
The vector database landscape is characterized by dynamic innovation and fierce competition, with prominent players like Pinecone, Weaviate, Supabase, and Zilliz Cloud leading the charge. Emerging trends such as the development of hybrid search capabilities, integration with existing data infrastructure, and enhanced security features are shaping the market's trajectory. While the market shows immense promise, certain restraints, including the complexity of data integration and the need for specialized technical expertise, may pose challenges.
Geographically, North America is expected to dominate the market share due to its early adoption of AI technologies and robust R&D investments, followed closely by Asia Pacific, which is witnessing rapid digital transformation and a surge in AI startups. Europe and other emerging regions are also anticipated to contribute significantly to market growth as AI adoption becomes more widespread.
This report delves into the rapidly evolving Vector Database Software Market, providing a detailed analysis of its landscape from 2019 to 2033. With a Base Year of 2025, the report offers crucial insights for the Estimated Year of 2025 and projects market dynamics through the Forecast Period of 2025-2033, building upon the Historical Period of 2019-2024. The global vector database software market is poised for significant expansion, with an estimated market size projected to reach hundreds of millions of dollars by 2025, and anticipated to grow exponentially in the coming years. This growth is fueled by the increasing adoption of AI and machine learning across various industries, necessitating efficient storage and retrieval of high-dimensional vector data.
Here you can find all data and all information regarding each generated dataset.
For each dataset there are 4 files:
json_info: contains the number of features (with their names) and the number of subjects available for the dataset
data_testing: data frame with the data used to test the trained models
data_training: data frame with the data used to train the models
results: direct, unfiltered data from the database
Files are written in the Feather format.
Here is an example of the data structure for each file in the repository.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This database studies performance inconsistency in biomass HHV models based on ultimate analysis. The research null hypothesis is consistency in the rank of a biomass HHV model. Fifteen biomass models are trained and tested on four datasets. In each dataset, the rank invariability of these 15 models indicates performance consistency.
The database includes the datasets and source codes used to analyze the performance consistency of biomass HHV models. The datasets are stored in tabular form in an Excel workbook. The source codes implement the biomass HHV machine learning models through MATLAB object-oriented programming (OOP). These machine learning models consist of eight regressions, four supervised learning methods, and three neural networks.
An Excel workbook, "BiomassDataSetUltimate.xlsx," collects the research datasets in six worksheets. The first worksheet, "Ultimate," contains 908 HHV data points from 20 pieces of literature. The worksheet column names indicate the elements of the ultimate analysis on a % dry basis. The HHV column refers to the higher heating value in MJ/kg. The following worksheet, "Full Residuals," backs up the model testing residuals based on the 20-fold cross-validations. The article (Kijkarncharoensin & Innet, 2021) verifies the performance consistency through these residuals. The other worksheets present the literature datasets used to train and test model performance in many pieces of literature.
A file named "SourceCodeUltimate.rar" collects the MATLAB machine learning models implemented in the article. The list of folders in this file reflects the class structure of the machine learning models. These classes extend the features of MATLAB's Statistics and Machine Learning Toolbox to support, e.g., k-fold cross-validation. The MATLAB script named "runStudyUltimate.m" is the article's main program for analyzing the performance consistency of the biomass HHV models through the ultimate analysis. The script loads the datasets from the Excel workbook and automatically fits the biomass models through the OOP classes.
The first section of the MATLAB script generates the most accurate model by optimizing the model's hyperparameters. The first run takes a few hours to train the machine learning models via a trial-and-error process. The trained models can be saved in a MATLAB .mat file and loaded back into the MATLAB workspace. The remaining script, separated by a script section break, performs the residual analysis to inspect the performance consistency. Furthermore, a 3D scatter plot of the biomass data and box plots of the prediction residuals are exhibited. Finally, the interpretations of these results are examined in the author's article.
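The rank-invariability check at the heart of the study can be illustrated outside MATLAB (a Python sketch with invented RMSE values, not the OOP classes shipped in the archive): rank the models by test error within each dataset and compare the orderings.

```python
def rank_models(rmse_by_model):
    """Return model names sorted from best (lowest RMSE) to worst."""
    return sorted(rmse_by_model, key=rmse_by_model.get)

# Invented RMSE results for three models on two datasets (illustrative only).
dataset_a = {"linreg": 1.8, "svm": 1.2, "ann": 1.5}
dataset_b = {"linreg": 2.1, "svm": 1.4, "ann": 1.7}

ranks_a = rank_models(dataset_a)
ranks_b = rank_models(dataset_b)
consistent = ranks_a == ranks_b  # rank invariability across datasets
print(ranks_a, consistent)
```

If the orderings disagree on any dataset, the null hypothesis of rank consistency is rejected for that model set.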
Reference: Kijkarncharoensin, A., & Innet, S. (2022). Performance inconsistency of the Biomass Higher Heating Value (HHV) Models derived from Ultimate Analysis [Manuscript in preparation]. University of the Thai Chamber of Commerce.
https://www.marketresearchforecast.com/privacy-policy
The data modeling tool market is experiencing robust growth, driven by the increasing demand for efficient data management and the rise of big data analytics. The market, estimated at $5 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching approximately $15 billion by 2033. This expansion is fueled by several key factors, including the growing adoption of cloud-based data modeling solutions, the increasing need for data governance and compliance, and the expanding use of data visualization and business intelligence tools that rely on well-structured data models. The market is segmented by tool type (e.g., ER diagramming tools, UML modeling tools), deployment mode (cloud, on-premise), and industry vertical (e.g., BFSI, healthcare, retail). Competition is intense, with established players like IBM, Oracle, and SAP vying for market share alongside numerous specialized vendors offering niche solutions. The market's growth is being further accelerated by the adoption of agile methodologies and DevOps practices that necessitate faster and more iterative data modeling processes.
The major restraints impacting market growth include the high cost of advanced data modeling software, the complexity associated with implementing and maintaining these solutions, and the lack of skilled professionals adept at data modeling techniques. The increasing availability of open-source tools, coupled with the growth of professional training programs focused on data modeling, is gradually alleviating this constraint.
Future growth will likely be shaped by innovations in artificial intelligence (AI) and machine learning (ML) that are being integrated into data modeling tools to automate aspects of model creation and validation. The trend towards data mesh architecture and the growing importance of data literacy are also driving demand for user-friendly and accessible data modeling tools.
Furthermore, the development of integrated platforms that combine data modeling with other data management functions is a key market trend that is likely to significantly impact future growth.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of MOFs constructed from building blocks of stable MOFs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset necessary for the DocTOR utility.
DocTOR (Direct fOreCast Target On Reaction) is a utility written in Python 3.9 (using the conda framework) that allows the user to upload a list of UniProt IDs and adverse reactions (from the available models) in order to study the relationship between the two.
On output, the program assigns a positive or negative class to each protein, assessing its possible involvement in the onset of the selected ADRs.
DocTOR exploits data from T-ARDIS [https://doi.org/10.1093/database/baab068] to train different machine learning approaches (SVM, RF, NN) using network topological measurements as features.
The predictions from the individual trained models are combined in a meta-predictor exploiting three different voting systems.
The results of the meta-predictor, together with those from the individual ML methods, are available in the output log file (named "predictions_community" or "predictions_curated" depending on the database type).
The DocTOR utility is available at https://github.com/cristian931/DocTOR
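The voting step of the meta-predictor can be illustrated with a simple majority vote (a sketch only; the three voting systems actually used in DocTOR may differ, and the UniProt IDs below are just examples):

```python
def majority_vote(predictions):
    """Combine per-model 0/1 class calls into one consensus call
    (1 = involved in the ADR, 0 = not involved)."""
    return int(sum(predictions) >= (len(predictions) + 1) // 2)

# Hypothetical calls from three trained models (e.g. SVM, RF, NN) per protein.
per_protein = {
    "P04637": [1, 1, 0],  # two of three models flag involvement
    "P38398": [0, 0, 1],
}
consensus = {prot: majority_vote(calls) for prot, calls in per_protein.items()}
print(consensus)  # {'P04637': 1, 'P38398': 0}
```

A meta-predictor of this kind trades the variance of any single model for the agreement of the ensemble, which is the rationale for combining SVM, RF, and NN outputs.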
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Static PBE calculations for 1D, 2D, and 3D compounds can be found in 1D_pbe.tar.gz, 2D_pbe.tar.gz, and 3D_pbe.tar.gz, in batches of 100k materials. The latter also contains a separate convex hull pickle with all compounds on the PBE convex hull (convex_hull_pbe_2023.12.29.json.bz2) and a list of prototypes in the database (prototypes.json.bz2). The systematic 3D calculations performed for the article "Improving machine-learning models in materials science through large datasets" (referred to in the paper as rounds 2 and 3) can be found via the location keyword in the data dictionary of each ComputedStructureEntry, containing "cgat_comp/quaternaries" (round 2) and "cgat_comp2/" (round 3). Round 1 (10.1002/adma.202210788) can be found under "cgat_comp/ternaries" and "cgat_comp/binaries".
Static PBEsol calculations for 3D compounds can be found in 3D_ps.tar (still zip-compressed), in batches of 100k materials. The folder also contains a separate convex hull pickle with all compounds on the PBEsol convex hull (convex_hull_ps_2023.12.29.json.bz2).
Static SCAN calculations for 3D compounds can be found in 3D_scan.tar (still zip-compressed), in batches of 100k materials. The folder also contains a separate convex hull pickle with all compounds on the SCAN convex hull (convex_hull_scan_2023.12.29.json.bz2).
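The convex-hull files above are bz2-compressed JSON, so they can be opened with the Python standard library alone (a minimal sketch; the internal field layout of the entries is an assumption, so inspect the keys after loading):

```python
import bz2
import json

def load_hull(path):
    """Load a bz2-compressed JSON convex-hull file and return the parsed object."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)

# e.g. entries = load_hull("convex_hull_pbe_2023.12.29.json.bz2")
# print(type(entries), len(entries))
```

For full ComputedStructureEntry objects, pymatgen's deserialization is the natural next step, but plain JSON inspection as above is enough to see what each record contains.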
Geometry relaxation curves for 1D, 2D, and 3D compounds calculated with PBE can be found in geo_opt_1D.tar.gz, geo_opt_2D.tar.gz, and geo_opt_3D.tar. Each file in each folder contains a batch of up to 10k relaxation trajectories.
PBEsol relaxation trajectories for 3D compounds can be found in geo_opt_ps.tar.
The data can be used with the code at https://github.com/hyllios/CGAT/tree/main/CGAT. Note: when using the code on GitHub, it will predict the distance to the convex hull not normalized per atom.
ALIGNN models, as well as M3GNet and MACE models corresponding to the publication, can be found in alexandria_v2.tar.gz.
scripts.tar.gz: some scripts used for generating CGAT input data, performing parallel predictions, and running relaxations with M3GNet/MACE force fields.
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by dillsunnyb11
Released under: Database — Open Database License; Contents — Database Contents License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of MOFs constructed from building blocks of stable MOFs.
Note: the columns labeled "rho" in features_and_properties are actually cell volume, not density.