Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With recent success in supervised learning, artificial intelligence (AI) and machine learning (ML) can play a vital role in precision medicine. Deep learning neural networks have been used in drug discovery when large datasets are available. However, applications of machine learning in clinical trials with small sample sizes (around a few hundred) are limited. We propose a Similarity-Principle-Based Machine Learning (SBML) method, which is applicable to both small- and large-sample problems. In SBML, attribute-scaling factors are introduced to objectively determine the relative importance of each attribute (predictor). The gradient method is used in learning (training), that is, in updating the attribute-scaling factors. We evaluate SBML when the sample size is small and investigate the effects of tuning parameters. Simulations show that, in various complicated nonlinear situations, SBML achieves better predictions in terms of mean squared error than full linear models, optimal and ridge regressions, mixed-effect models, support vector machines, and decision tree methods.
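The abstract does not state the SBML formulas, so the following is only a minimal sketch of the general idea under stated assumptions: an exponential similarity on scaled attribute differences, predictions formed as similarity-weighted averages of training outcomes, and scaling factors updated by plain gradient descent (numerical gradients) on the leave-one-out squared error. The function names, the similarity form, and the learning-rate settings are hypothetical and are not taken from the paper.

```python
import numpy as np

def similarity(X_a, X_b, R):
    # Assumed similarity: exp(-sum_k (R_k * (x_ak - x_bk))^2) for every pair of rows.
    diff = X_a[:, None, :] - X_b[None, :, :]
    return np.exp(-np.sum((R * diff) ** 2, axis=2))

def predict(X_train, y_train, X_new, R):
    # Similarity-weighted average of the training outcomes.
    S = similarity(X_new, X_train, R)
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)
    return W @ y_train

def loo_mse(X, y, R):
    # Leave-one-out error: each subject is predicted from all other subjects.
    S = similarity(X, X, R)
    np.fill_diagonal(S, 0.0)
    W = S / (S.sum(axis=1, keepdims=True) + 1e-12)
    return float(np.mean((W @ y - y) ** 2))

def fit_scaling(X, y, lr=0.1, steps=200, eps=1e-4):
    # Gradient descent on the attribute-scaling factors R (central differences).
    R = np.ones(X.shape[1])
    best_R, best_loss = R.copy(), loo_mse(X, y, R)
    for _ in range(steps):
        grad = np.zeros_like(R)
        for k in range(R.size):
            step = np.zeros_like(R)
            step[k] = eps
            grad[k] = (loo_mse(X, y, R + step) - loo_mse(X, y, R - step)) / (2 * eps)
        R = R - lr * grad
        loss = loo_mse(X, y, R)
        if loss < best_loss:  # keep the best factors seen so far
            best_R, best_loss = R.copy(), loss
    return best_R

# Toy data (synthetic): only the first attribute carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)
R_fit = fit_scaling(X, y)
y_hat = predict(X, y, X[:5], R_fit)
```

In this sketch a larger fitted factor for an attribute means that differences in that attribute reduce similarity more strongly, which is one way to read "relative importance of each attribute".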
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
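As an illustration of the text transformations mentioned above, the sketch below implements two of the four Easy Data Augmentation operations (random swap and random deletion). The other two, synonym replacement and random insertion, need a Spanish thesaurus and are omitted. The function names, default probabilities, and the example sentence are illustrative assumptions, not the study's code or data.

```python
import random

def random_swap(tokens, n_swaps, rng):
    # Swap the positions of two randomly chosen tokens, n_swaps times.
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p, rng):
    # Drop each token independently with probability p; never return an empty text.
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(list(tokens))]

def augment(sentence, n_aug=4, p_delete=0.1, n_swaps=1, seed=0):
    # Alternate between the two operations to produce n_aug variants.
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for k in range(n_aug):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, n_swaps, rng)
        else:
            new_tokens = random_deletion(tokens, p_delete, rng)
        variants.append(" ".join(new_tokens))
    return variants

sentence = "la película fue muy buena"
variants = augment(sentence)
```

Each variant keeps the label of the original sentence, which is how such transformations expand a small labeled dataset.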
The dataset consists of 5 varieties of soil images in 5 directories or folders. It is a very small dataset meant for beginners who want to build ML models. A small dataset helps in learning without spending half an hour or more on training and without expensive computation. The results are unlikely to be great because the dataset is so small, which invites overfitting and related issues. It will probably be updated soon to a larger dataset with better images.
The dataset was made because I was unable to find any reliable image dataset for soil varieties. It will soon be enriched, for which I will be crowdsourcing.
The dataset has no annotations; real-time augmentation can be applied instead. One can go through this notebook to learn how... https://www.kaggle.com/prasanshasatpathy/soil-type-image-classification
I have made the initial set of this small dataset. I expect collaborations to increase the size and types of the dataset soon. The method for contributing will be released soon.
A better dataset with less complexity is really necessary.
This dataset can be used to replicate the findings in "A Pragmatic Machine Learning Approach to Quantify Tumor Infiltrating Lymphocytes in Whole Slide Images". The motivation for this paper is that increased levels of tumor infiltrating lymphocytes (TILs) indicate favorable outcomes in many types of cancer. Our aim is to leverage computational pathology to automatically quantify TILs in standard diagnostic whole-tissue hematoxylin and eosin stained section slides (H&E slides). Our approach is to transfer an open source machine learning method for segmentation and classification of nuclei in H&E slides, trained on public data, to TIL quantification without manual labeling of our data. Our results show that improved data augmentation improves immune cell detection in H&E whole slide images (WSIs). Moreover, the resulting TIL quantification correlates with patient prognosis and compares favorably to the current state-of-the-art method for immune cell detection in non-small cell lung cancer (current standard, CD8 cells in DAB stained TMAs: HR 0.34, 95% CI 0.17-0.68; TILs in H&E WSIs with the HoVer-Net PanNuke model: HR 0.30, 95% CI 0.15-0.60). Moreover, we implemented a cloud based system to train, deploy, and visually inspect machine learning based annotation for H&E slides. Our pragmatic approach bridges the gap between machine learning research, translational clinical research and clinical implementation. However, validation in prospective studies is needed to assert that the method works in a clinical setting. The dataset comprises three parts: 1) twenty image patches with and without overlays used by pathologists to manually evaluate the output of the deep learning models, 2) the models trained and subsequently used for inference in the paper, 3) the patient dataset with corresponding image patches used to clinically validate the output of the deep learning models. The tissue samples were collected from patients diagnosed between 1993 and 2003. 
Supplementing information was collected retrospectively in the time period 2006-2017. The images were produced in 2017.
Description: This repository presents data collected to investigate the role of embodiment and supervision in learning. This is done inside a simulated 3D maze world with a navigation task using mainly visual input in the form of RGB images. The main contribution of this data repository is to provide a network model trained in this environment with weak supervision and a closed loop between action and perception. Additionally, control networks are provided which were trained with varying degrees of supervision and embodiment. In the corresponding paper [1] the representations of these networks are compared based on sparsity measures as well as the content of the encodings and the possibility to extract semantic labels. For the training of the control conditions several new data sets were created which are also included here. They contain a collection of images from the simulated world with corresponding semantic labels. Overall, they provide a good basis for further analysis and a more in-depth investigation of representation learning and the effect of embodiment and supervision on representations.
Steps to reproduce: Data was generated through a 3D simulation of a maze environment called Obstacle Tower. The data of interest are the trained neural network weights and the network activations corresponding to different input frames. Three main networks were trained: a reinforcement learning agent trained through interaction with the simulated environment, an autoencoder trained to reconstruct images collected by the agent, and a classifier trained to classify objects in the images. Exact training and testing conditions, hyperparameters and network structure are provided in the corresponding paper. For the training of the reinforcement learning agent the Unity ml-agents toolkit PPO implementation is used with small modifications for extra data collection and control experiments. 
The code we used can be found here: https://github.com/vkakerbeck/ml-agents-dev . Model checkpoint files are saved for different points in training, but mostly the final version of the network is analysed in the corresponding paper [1]. The autoencoder and classifier are trained using Python with TensorFlow and Keras. The corresponding code can be found here: https://github.com/vkakerbeck/Learning-World-Representations/tree/master/DataAnalysis . The data also contains activations in the hidden layer of the network corresponding to 4000 test images for all three networks. Code for this can be found in the same GitHub repository. The datasets used for training the autoencoder and classifier were created by collecting observations in the Obstacle Tower environment using the trained agent. These observations were then labelled automatically, and the labels were cross-checked by hand. A description of the individual files is included in the data folder (Description.txt). Due to storage constraints, not all model checkpoint files used to create figure 6 of the paper could be uploaded. However, feel free to contact me (vkakerbeck[at]uos.de) if you are interested in these detailed checkpoint files of the control runs and I will make them available to you.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data
Funding
This research is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Award Number FWP PS-030. This research also used theory and computational resources of the Center for Functional Nanomaterials, which is a U.S. Department of Energy Office of Science User Facility, and the Scientific Data and Computing Center, a component of the Computational Science Initiative, at Brookhaven National Laboratory under Contract No. DE-SC0012704.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In order to ensure transparency and reproducibility, we have made everything available publicly here, including the Code, Models, Datasets and more. All the files and their functionality used in this paper are explained clearly in the README.md file.
Background: Technical Debt (TD) needs to be controlled and tracked during software development. Current support, such as static analysis tools and even ML-based automatic tagging, is still ineffective, especially for context-dependent TD.
Aim: We study the usage of a large TD dataset in combination with cutting-edge Natural Language Processing (NLP) approaches to classify TD automatically in issue trackers, allowing the identification and tracking of informal TD conversations.
Method: We mine and analyse more than 160GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (GTD-dataset). We then use our dataset to train state-of-the-art Transformer ML models, before performing a quantitative case study on three projects and evaluating the performance metrics during inference. Additionally, we study the adaptation of our model to classify context-dependent TD in an unseen project, by retraining the model including different percentages of the TD issues in the target project.
Results: (i) We provide the GTD-dataset, the most comprehensive dataset of TD issues to date, including issues from 6,401 unique public repositories with various contexts;
(ii) By training state-of-the-art Transformers using the GTD-dataset, we achieve performance metrics that outperform previous approaches;
(iii) We show that our model can provide a relatively reliable tool to automatically classify TD in issue trackers, especially when adapted to unseen projects where the training includes a small portion of TD issues in the new project.
Conclusion: Our results indicate that we have taken significant steps towards closing the gap to practically and semi-automatically track TD issues in issue trackers.
To develop predictive models for the reactivity of organic contaminants toward four oxidants (SO4•–, HClO, O3, and ClO2), all with small sample sizes, we proposed two approaches: combining small data sets and transferring knowledge between them. We first merged these data sets and developed a unified model using machine learning (ML), which showed better predictive performance than the individual models for HClO (RMSEtest: 2.1 to 2.04), O3 (2.06 to 1.94), ClO2 (1.77 to 1.49), and SO4•– (0.75 to 0.70) because the model “corrected” the wrongly learned effects of several atom groups. We further developed knowledge transfer models for three pairs of the data sets and observed different predictive performances: improved for O3 (RMSEtest: 2.06 to 2.01)/HClO (2.10 to 1.98), mixed for O3 (2.06 to 2.01)/ClO2 (1.77 to 1.95), and unchanged for ClO2 (1.77 to 1.77)/HClO (2.1 to 2.1). The effectiveness of the latter approach depended on whether there was consistent knowledge shared between the data sets and on the performance of the individual models. We also compared our approaches with multitask learning and image-based transfer learning and found that our approaches consistently improved the predictive performance for all data sets while the other two did not. This study demonstrated the effectiveness of combining small, similar data sets and transferring knowledge between them to improve ML model performance.
This dataset contains aerodynamic quantities - including flow field values (momentum, energy, and vorticity) and summary values (coefficients of lift, drag, and momentum) - for 1,830 airfoil shapes computed using the HAM2D CFD (computational fluid dynamics) model. The airfoil shapes were designed using the separable shape tensor parameterization that encodes two-dimensional shapes as elements of the Grassmann manifold. This data-driven approach learns two independent parameter spaces from a collection of sample airfoils. The first captures large-scale, linear perturbations, and the second defines small-scale, higher-order perturbations. For this dataset, we used the G2Aero database of over 19,000 airfoil shapes to learn a parameter space that captured a wide array of shape characteristics. We sampled airfoil designs over both parameter spaces to explore the full range of possible shape variations. The aerodynamic quantities for the generated airfoils were obtained using the HAM2D code, which is a finite-volume Reynolds-averaged Navier-Stokes (RANS) flow solver. We employ a fifth-order WENO scheme for spatial reconstruction with Roe's flux difference scheme for inviscid flux and second-order central differencing for viscous flux. A preconditioned GMRES method is applied for implicit integration. The Spalart-Allmaras 1-eq turbulence model is used for the turbulence closure, and the Medida-Baeder 2-eq transition model is applied to account for the effects of laminar-turbulent transition. The airfoil grid is generated with a total of 400 points on the airfoil surface, an initial wall-normal spacing of y+ = 1, and an outer boundary located 300 chord lengths away from the wall. The CFD simulations are performed at a freestream Mach number of 0.1, for three different Reynolds numbers (3M, 6M, and 9M), and for 25 angles of attack from -4 deg. to 20 deg. in 1 degree increments. 
Across all these various parameters, this dataset includes the results from over 250,000 CFD simulations. The simulations were performed using the Bridges-2 system at the Pittsburgh Supercomputing Center in February 2023 as part of the INTEGRATE project funded by the Advanced Research Projects Agency - Energy, in the U.S. Department of Energy. The data was collected, reformatted, and preprocessed for this OEDI submission in July 2023 under the Foundational AI for Wind Energy project funded by the U.S. Department of Energy Wind Energy Technologies Office. This dataset is intended to serve as a benchmark against which new artificial intelligence (AI) or machine learning (ML) tools may be tested. Baseline AI/ML methods for analyzing this dataset have been implemented, and a link to their repository containing those models has been provided. The .h5 data file structure can be found in the GitHub Repository resource under explore_airfoil_2k_data.ipynb.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The accurate description of the structural and thermodynamic properties of ferroelectrics has been one of the most remarkable achievements of Density Functional Theory (DFT). However, running large simulation cells with DFT is computationally demanding, while simulations of small cells are often plagued with non-physical effects that are a consequence of the system's finite size. Therefore, to perform simulations that do not suffer from finite size effects, one is often forced to use empirical models that describe the physics of the material in terms of effective interaction terms fitted to DFT results. In this study we use a machine-learning (ML) potential trained on DFT, in combination with accelerated sampling techniques, to converge the thermodynamic properties of Barium Titanate (BTO) with first-principles accuracy and a full atomistic description. Our results indicate that the predicted Curie temperature depends strongly on the choice of DFT functional and system size, due to the presence of emergent long-range directional correlations in the local dipole fluctuations. Our findings demonstrate how the combination of ML models and traditional bottom-up modeling allows one to investigate emergent phenomena with the accuracy of first-principles calculations and the large size and time scales afforded by empirical models.
https://researchintelo.com/privacy-and-policy
According to our latest research, the global Animal Health Machine Learning market size reached USD 1.42 billion in 2024, demonstrating robust momentum driven by technological advancements and increasing demand for intelligent animal health solutions. The market is projected to grow at a CAGR of 17.3% from 2025 to 2033, reaching a forecasted value of USD 6.14 billion by 2033. This exceptional growth trajectory is primarily fueled by the rising adoption of machine learning (ML) technologies in veterinary diagnostics, disease surveillance, and precision livestock management, enabling data-driven decision-making and improved animal welfare across the globe.
One of the primary growth factors for the Animal Health Machine Learning market is the increasing prevalence of zoonotic diseases and the subsequent need for advanced diagnostic and monitoring tools. Machine learning algorithms are revolutionizing the way veterinarians and animal health professionals detect, diagnose, and manage diseases in both companion and livestock animals. By leveraging vast datasets from electronic health records, imaging, and biosensors, ML models can identify subtle patterns and predict disease outbreaks with remarkable accuracy. This capability is especially critical in preventing the spread of infectious diseases, reducing economic losses in the livestock industry, and ensuring food safety. Furthermore, the integration of ML with remote monitoring devices and wearable sensors is enabling continuous health surveillance, early intervention, and personalized treatment plans, marking a paradigm shift in animal healthcare management.
Another significant driver is the expanding role of precision livestock farming, which relies heavily on machine learning to optimize productivity, resource utilization, and animal welfare. As the global demand for animal protein rises, farmers and producers are increasingly adopting ML-powered solutions to monitor herd health, track behavioral changes, and automate feeding and breeding processes. These technologies not only enhance operational efficiency but also contribute to sustainable farming practices by minimizing the use of antibiotics and reducing environmental impact. Additionally, ML-driven drug discovery platforms are accelerating the development of novel therapeutics and vaccines for animal diseases, shortening research timelines and improving success rates. The growing collaboration between technology providers, research institutes, and veterinary organizations is further catalyzing innovation and expanding the application landscape of machine learning in animal health.
The market's growth is also underpinned by favorable regulatory frameworks, increased investments in veterinary informatics, and the rising awareness of animal welfare among consumers and stakeholders. Governments and industry bodies across North America, Europe, and Asia Pacific are actively promoting digital transformation in the animal health sector through grants, pilot projects, and public-private partnerships. The proliferation of cloud-based platforms and advancements in big data analytics are making ML solutions more accessible and scalable, even for small and medium-sized animal farms. Despite challenges such as data privacy concerns and the need for skilled professionals, the overall outlook for the Animal Health Machine Learning market remains highly optimistic, with significant opportunities for innovation and value creation in the coming decade.
From a regional perspective, North America currently dominates the Animal Health Machine Learning market, accounting for a substantial share of global revenues, followed by Europe and Asia Pacific. The United States, in particular, is a frontrunner due to its advanced veterinary infrastructure, high adoption of digital health technologies, and strong presence of leading market players. Europe is witnessing rapid growth driven by stringent animal welfare regulations and increased research funding, while Asia Pacific is emerging as a lucrative market owing to its large livestock population, rising pet ownership, and government initiatives to modernize the agricultural sector. Latin America and the Middle East & Africa are gradually catching up, supported by improving veterinary services and growing awareness of the benefits of ML-based animal health solutions.
The Animal Health Machine Learning market is segmented by co
Background: Effective treatment with the antibiotic vancomycin requires close monitoring of serum drug levels due to its narrow therapeutic index. In current practice, physicians use various dosing algorithms for dosage titration, but these algorithms have reported low success in achieving therapeutic targets. We explored using artificial intelligence to assist vancomycin dosage titration. Methods: We used a novel method to generate the label for each record and only included records with appropriate label data to generate a clean cohort with 2,282 patients and 7,912 injection records. Among them, 64% of patients were used to train two machine learning models, one for initial dose recommendation and another for subsequent dose recommendation. The model performance was evaluated using two metrics: PAR, a pharmacologically meaningful metric defined by us, and Mean Absolute Error (MAE), a commonly used regression metric. Results: In our 3-year data, only a small portion (34.1%) of current injection doses could reach the desired vancomycin trough level (14–20 mcg/ml). Both PAR and MAE of our machine learning models were better than those of the classical pharmacokinetic models. Our model also showed better performance than the other previously developed machine learning models in our test data. Conclusion: We developed machine learning models to recommend vancomycin dosage. Our results show that the new AI-assisted dosage titration approach has the potential to improve on the traditional approaches. This is especially useful for guiding inexperienced doctors in making consistent and safe dosing recommendations for high-risk medications like vancomycin.
Objectives: This study aimed to evaluate and compare the diagnostic accuracy of machine learning (ML)-fractional flow reserve (FFR) based on optical coherence tomography (OCT) with wire-based FFR irrespective of the coronary territory. Background: ML techniques for assessing hemodynamic features including FFR in coronary artery disease have been developed based on various imaging modalities. However, there is no study using OCT-based ML models for all coronary artery territories. Methods: OCT and FFR data were obtained for 356 individual coronary lesions in 130 patients. The training and testing groups were divided in a ratio of 4:1. The ML-FFR was derived for the testing group and compared with the wire-based FFR in terms of the diagnosis of ischemia (FFR ≤ 0.80). Results: The mean age of the subjects was 62.6 years. The numbers of the left anterior descending, left circumflex, and right coronary arteries were 130 (36.5%), 110 (30.9%), and 116 (32.6%), respectively. Using seven major features, the ML-FFR showed strong correlation (r = 0.8782, P < 0.001) with the wire-based FFR. The ML-FFR predicted wire-based FFR ≤ 0.80 in the test set with sensitivity of 98.3%, specificity of 61.5%, and overall accuracy of 91.7% (area under the curve: 0.948). External validation showed good correlation (r = 0.7884, P < 0.001) and accuracy of 83.2% (area under the curve: 0.912). Conclusion: OCT-based ML-FFR showed good diagnostic performance in predicting FFR irrespective of the coronary territory. Because this was a small study, the performance should be validated in further large-scale research.
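The sensitivity, specificity, and accuracy figures above follow from comparing predicted and wire-based FFR against the ischemia cutoff (FFR ≤ 0.80). The sketch below shows that computation; the function name and the five example values are made up for illustration and are not data from the study.

```python
def diagnostic_metrics(ffr_wire, ffr_pred, threshold=0.80):
    # A lesion counts as ischemic (positive) when FFR <= threshold.
    tp = fp = tn = fn = 0
    for wire, pred in zip(ffr_wire, ffr_pred):
        actual = wire <= threshold
        predicted = pred <= threshold
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical values, not taken from the study.
wire = [0.70, 0.75, 0.85, 0.90, 0.78]
pred = [0.72, 0.82, 0.88, 0.79, 0.60]
metrics = diagnostic_metrics(wire, pred)
# sensitivity 2/3, specificity 1/2, accuracy 3/5
```

High sensitivity combined with lower specificity, as reported above, means that few ischemic lesions are missed while a larger share of non-ischemic lesions are flagged.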
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Insect Detect - insect classification dataset v2 contains mainly images of various insects sitting on or flying above an artificial flower platform. All images were automatically recorded with the Insect Detect DIY camera trap, a hardware combination of the Luxonis OAK-1, Raspberry Pi Zero 2 W and PiJuice Zero pHAT for automated insect monitoring (bioRxiv preprint).
Most of the images were captured by camera traps deployed at different sites in 2023. For some classes (e.g. ant, bee_bombus, beetle_cocci, bug, bug_grapho, hfly_eristal, hfly_myathr, hfly_syrphus) additional images were captured with a lab setup of the camera trap. For some classes (e.g. bee_apis, fly, hfly_episyr, wasp) images from the first dataset version were transferred to this dataset.
This dataset is also available on Roboflow Universe. The images in the dataset from Roboflow are automatically compressed, which decreases model accuracy when used for training. Therefore it is recommended to use this uncompressed Zenodo version and split the dataset into train/val/test subsets in the provided training notebook.
This dataset contains the following 27 classes:
For the classes hfly_eupeo and hfly_syrphus a precise taxonomic distinction is not possible with images only, due to a potentially high variability in the appearance of the respective species. While most specimens will show the visual features that are important for a classification into one of these classes, some specimens of Syrphus sp. might look more like Eupeodes sp. and vice versa.
The images were sorted to the respective class by considering taxonomic and visual distinctions. However, this dataset is still rather small regarding the visually extremely diverse Insecta. Insects that are not included in this dataset can therefore be classified to the wrong class. All results should always be manually validated and false classifications can be used to extend this basic dataset and retrain your custom classification model.
You can use this dataset as starting point to train your own insect classification models with the provided Google Colab training notebook. Read the model training instructions for more information.
An insect classification model trained on this dataset is available in the insect-detect-ml GitHub repo. To deploy the model on your PC (ONNX format for fast CPU inference), follow the provided step-by-step instructions.
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships of the heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard ACs to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate superior standard AC performance to other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting in comparison to 15–20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO) are compared. Random-forest- or LASSO-selected subsets 4–5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal–ligand bond length prediction (0.004–5 Å MUE) and redox potential on a smaller data set (0.2–0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin-splitting and distal, steric effects in redox potential and bond lengths.
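For orientation, a standard autocorrelation on a molecular graph sums products of an atomic property over all atom pairs separated by a given number of bonds. The sketch below implements only that standard form (summing over ordered pairs, including each atom with itself at depth 0); it does not reproduce the revised variants introduced in the paper. The function names and the water example are illustrative assumptions; the electronegativities are Pauling values.

```python
from collections import deque

def graph_distances(adjacency, start):
    # Breadth-first search: number of bonds from `start` to every reachable atom.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def autocorrelation(adjacency, prop, max_depth):
    # AC_d = sum_i sum_j prop[i] * prop[j] over ordered pairs with bond distance d.
    ac = [0.0] * (max_depth + 1)
    for i in adjacency:
        for j, d in graph_distances(adjacency, i).items():
            if d <= max_depth:
                ac[d] += prop[i] * prop[j]
    return ac

# Water as a toy graph: atom 0 is O, atoms 1 and 2 are H.
adjacency = {0: [1, 2], 1: [0], 2: [0]}
electronegativity = {0: 3.44, 1: 2.20, 2: 2.20}
ac = autocorrelation(adjacency, electronegativity, max_depth=2)
# approximately [21.51, 30.27, 9.68]
```

The resulting vector has a fixed length set by `max_depth`, regardless of molecule size, which is what makes this kind of descriptor convenient as ML input.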
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model performance results based on random forest, gradient boosting, penalized logistic regression, XGBoost, SVM, neural network, and stacking, with EMBARC data as the training set and APAT data as the testing set, after multiple imputation repeated 10 times.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The method numbers correspond to method numbers in Fig 1. FPD denotes the full-positive-dataset.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Over the last ten years, social media has become a crucial data source for businesses and researchers, providing a space where people can express their opinions and emotions. To analyze this data and classify emotions and their polarity in texts, natural language processing (NLP) techniques such as emotion analysis (EA) and sentiment analysis (SA) are employed. However, the effectiveness of these tasks using machine learning (ML) and deep learning (DL) methods depends on large labeled datasets, which are scarce in languages like Spanish. To address this challenge, researchers use data augmentation (DA) techniques to artificially expand small datasets. This study aims to investigate whether DA techniques can improve classification results using ML and DL algorithms for sentiment and emotion analysis of Spanish texts. Various text manipulation techniques were applied, including transformations, paraphrasing (back-translation), and text generation using generative adversarial networks, to small datasets such as song lyrics, social media comments, headlines from national newspapers in Chile, and survey responses from higher education students. The findings show that the Convolutional Neural Network (CNN) classifier achieved the most significant improvement, with an 18% increase using the Generative Adversarial Networks for Sentiment Text (SentiGan) on the Aggressiveness (Seriousness) dataset. Additionally, the same classifier model showed an 11% improvement using the Easy Data Augmentation (EDA) on the Gender-Based Violence dataset. The performance of the Bidirectional Encoder Representations from Transformers (BETO) also improved by 10% on the back-translation augmented version of the October 18 dataset, and by 4% on the EDA augmented version of the Teaching survey dataset. These results suggest that data augmentation techniques enhance performance by transforming text and adapting it to the specific characteristics of the dataset. 
Through experimentation with various augmentation techniques, this research provides valuable insights into the analysis of subjectivity in Spanish texts and offers guidance for selecting algorithms and techniques based on dataset features.
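As an illustration of the transformation-style augmentation mentioned above, here is a minimal sketch of two of the four operations usually associated with EDA (random swap and random deletion). It is not the study's code: the function names, the default probabilities, and the Spanish example sentence are assumptions made for this sketch, and the synonym-based EDA operations are left out because they need a lexical resource.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p.

    If every token would be dropped, keep one random token so the
    augmented example is never empty.
    """
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_copies=4, p_delete=0.1, n_swaps=1, seed=0):
    """Return n_copies perturbed variants of a whitespace-tokenized sentence.

    Hypothetical helper: alternates between the two operations above.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for k in range(n_copies):
        if k % 2 == 0:
            new_tokens = random_swap(tokens, n_swaps, rng)
        else:
            new_tokens = random_deletion(tokens, p_delete, rng)
        variants.append(" ".join(new_tokens))
    return variants


if __name__ == "__main__":
    # Illustrative sentence only; not taken from any of the study's datasets.
    for variant in augment("la clase fue muy buena y clara"):
        print(variant)
```

Each variant keeps the label of its source sentence, which is why aggressive settings (a high deletion probability, many swaps) can hurt as easily as help on short texts such as headlines.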
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vitamin D insufficiency appears to be prevalent in patients with systemic lupus erythematosus (SLE). Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLE Disease Activity Index (SLEDAI) score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83, 3.52).
Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
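The bootstrapped confidence intervals reported for RMSE and MAE can be illustrated with a percentile bootstrap over held-out (observed, predicted) pairs. This is a minimal sketch under stated assumptions: the abstract does not say which bootstrap variant was used, the function names are hypothetical, and the numbers in the example are made up for illustration rather than taken from the study.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric.

    Resamples (observed, predicted) pairs with replacement, recomputes the
    metric on each resample, and returns the alpha/2 and 1 - alpha/2
    empirical quantiles of the resampled values.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Made-up vitamin D values, for illustration only.
    observed = [18.0, 22.5, 30.1, 15.4, 27.8, 21.0, 19.6, 25.3]
    predicted = [20.1, 21.0, 27.5, 17.9, 26.0, 23.4, 18.2, 24.9]
    print("RMSE", rmse(observed, predicted), bootstrap_ci(observed, predicted, rmse))
    print("MAE ", mae(observed, predicted), bootstrap_ci(observed, predicted, mae))
```

With only a few dozen patients, intervals from this kind of resampling tend to be wide, which is consistent with the broad ranges the abstract reports for the linear model.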
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Progress in the application of machine learning (ML) methods to materials design is hindered by the lack of understanding of the reliability of ML predictions, in particular for the small data sets often found in materials science. Using ML prediction for transparent conductor oxide formation energy and band gap, dilute solute diffusion, and perovskite formation energy, band gap, and lattice parameter as examples, we demonstrate that (1) construction of a convex hull in feature space that encloses accurately predicted systems can be used to identify regions in feature space for which ML predictions are highly reliable; (2) analysis of the systems enclosed by the convex hull can be used to extract physical understanding; and (3) materials that satisfy all well-known chemical and physical principles that make a material physically reasonable are likely to be similar and show strong relationships between the properties of interest and the standard features used in ML. We also show that, as with composition–structure–property relationships, including materials from classes with different chemical properties in the ML training data set does not improve the accuracy of ML prediction, and that an ML model is likely to give reliable results for narrow classes of similar materials even when it shows large errors on a data set consisting of several classes of materials.
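The convex-hull idea in point (1) can be sketched in two dimensions: build the hull of the feature vectors of accurately predicted systems, then treat a new system as lying in the reliable region only if its features fall inside that hull. This is an illustration under strong assumptions, not the paper's procedure: real feature spaces have more than two dimensions, the choice of which systems count as accurately predicted is left open here, and all names and coordinates below are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q, tol=1e-12):
    """True if q lies inside or on the boundary of a counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], q) >= -tol for i in range(n))


if __name__ == "__main__":
    # Hypothetical 2D feature vectors of systems the model predicted accurately.
    accurate = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
    hull = convex_hull(accurate)
    for candidate in [(1.0, 0.5), (3.0, 1.0)]:
        label = "inside reliable region" if inside_hull(hull, candidate) else "outside"
        print(candidate, label)
```

In higher dimensions the same membership test is usually done with a library routine rather than by hand; the two-dimensional version is used here only because it fits in a few lines of standard-library Python.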