Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are powered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, although several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
https://academictorrents.com/nolicensespecified
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
A collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. Datasets approved for the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the goal of improving machine learning approaches in inverse scattering
The Snippets database contains audio recordings from many kinds of venues (restaurants, bars, arenas, churches, movie theaters, retail stores, factories, parks, libraries, gyms, hotels, offices, and many more), with varying noise levels (Quiet, Moderate, Loud, Very Loud), noise types, and acoustic environments, along with valuable metadata.
This is valuable for any audio-based software product or company that wants to run or test its algorithms against various acoustic environments, including:
Hearing aid companies wanting to test their software's ability to identify or separate certain sounds and background noise and mitigate them
Audio or video conferencing platforms that want to identify a user's location (e.g., a user joins a call from a coffee shop, and the platform can identify and mitigate such sounds for better audio)
Other audio-based use cases
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)
Abstract
This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
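The selection and split described above can be sketched as follows. This is only one possible reading of the description, not the NABat pipeline itself: it assumes the 1,250 limit applies per species, that files are drawn one grid cell at a time in turn, and that the split fractions are 80/10/10 (the description does not state them). The record layout, function names, and demo data are hypothetical.

```python
import random
from collections import defaultdict


def select_recordings(records, cap=1250, seed=0):
    """Pick up to `cap` files per species, cycling over grid cells.

    `records` is a list of (path, species, grid_cell) tuples
    (a hypothetical layout, not the NABat database schema).
    """
    rng = random.Random(seed)
    pools = defaultdict(lambda: defaultdict(list))
    for path, species, cell in records:
        pools[species][cell].append(path)

    selected = []
    for species in sorted(pools):
        cells = pools[species]
        for paths in cells.values():
            rng.shuffle(paths)
        chosen = 0
        # Draw one file per grid cell in turn until the cap is reached
        # or every species/grid cell pool is exhausted.
        while chosen < cap and any(cells.values()):
            for cell in sorted(cells):
                if cells[cell] and chosen < cap:
                    selected.append((cells[cell].pop(), species, cell))
                    chosen += 1
    return selected


def split_dataset(selected, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly split into training, validation, and test (holdout) sets."""
    rng = random.Random(seed)
    shuffled = list(selected)
    rng.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Synthetic demo: one abundant species in two grid cells, one rare species.
demo = [(f"MYLU_{c}_{i}.wav", "MYLU", c) for c in ("cell_a", "cell_b") for i in range(1000)]
demo += [(f"EPFU_{i}.wav", "EPFU", "cell_a") for i in range(3)]

selected = select_recordings(demo)
train, val, test = split_dataset(selected)
print(len(selected), len(train), len(val), len(test))
```

Cycling over grid cells keeps any single location from dominating a species' sample, which is one plausible motivation for selecting by species/grid cell combination.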
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In practical media distribution systems
https://www.datainsightsmarket.com/privacy-policy
The vector database solution market is experiencing explosive growth, projected to reach $3218.3 million in 2025 and exhibiting a robust Compound Annual Growth Rate (CAGR) of 22.6% from 2025 to 2033. This rapid expansion is driven by the increasing adoption of AI and machine learning applications across various sectors, including e-commerce, finance, and healthcare. These applications rely heavily on efficient similarity search capabilities offered by vector databases, making them a crucial component of modern data infrastructure. The rising volume of unstructured data, such as images, videos, and text, further fuels the demand, as vector databases excel at handling and querying such data types effectively. Key market drivers include advancements in deep learning algorithms, the need for real-time search functionalities, and the growing emphasis on personalized user experiences. This market is characterized by a diverse range of players, including established tech giants like Redis and emerging specialized vendors like Zilliz (with its Milvus offering), Pinecone, Weaviate, and others. Competition is fierce, prompting continuous innovation in areas such as query performance, scalability, and ease of integration. While challenges remain, such as the complexity of managing and deploying vector databases, the overall market outlook remains positive. Future growth will likely be influenced by the continued development of AI/ML applications, the maturation of cloud-based vector database services, and the increased accessibility of these solutions for businesses of all sizes. The ongoing development of standardized interfaces and improved tooling will also play a significant role in broader adoption.
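As a quick check on what these figures imply, the compound growth can be worked out directly. The sketch below assumes the 22.6% rate compounds annually on the 2025 base over the eight years to 2033; the resulting 2033 figure is an extrapolation under that assumption, not a number stated in the report summary, and the function name is illustrative.

```python
def project_value(base_value, cagr, years):
    """Compound `base_value` at annual rate `cagr` for `years` years."""
    return base_value * (1.0 + cagr) ** years


base_2025 = 3218.3   # USD million, from the text
cagr = 0.226         # 22.6% per year, from the text
value_2033 = project_value(base_2025, cagr, 2033 - 2025)

print(f"Implied 2033 market size: ${value_2033:,.1f} million")
```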
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is designed to test Machine-Learning techniques on Computational Fluid Dynamics (CFD) data.
It contains two-dimensional RANS simulations of the turbulent flow around NACA 4-digit airfoils, at a fixed angle of attack (10 degrees) and a fixed Reynolds number (3x10^6). The whole NACA 4-digit family is spanned. The present dataset contains 2600 geometries, and 425 further geometries are published in an accompanying repository (10.5281/zenodo.4638071).
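For readers unfamiliar with the naming scheme, the standard NACA 4-digit definition maps the four digits to maximum camber, camber position, and thickness. The sketch below evaluates the textbook camber-line and half-thickness formulas at a chordwise position; it assumes a unit chord and the open-trailing-edge coefficient (-0.1015), and it is not taken from the dataset's own geometry generator. The function name is illustrative.

```python
import math


def naca4(code, x):
    """Return (camber, half_thickness) of a NACA 4-digit airfoil at x in [0, 1].

    `code` is the 4-digit designation as a string, e.g. "2412":
    first digit = max camber (% chord), second = camber position (tenths
    of chord), last two = max thickness (% chord).
    """
    m = int(code[0]) / 100.0
    p = int(code[1]) / 10.0
    t = int(code[2:]) / 100.0

    # Half-thickness distribution (open trailing edge form).
    yt = 5.0 * t * (0.2969 * math.sqrt(x) - 0.1260 * x - 0.3516 * x**2
                    + 0.2843 * x**3 - 0.1015 * x**4)

    # Mean camber line; symmetric airfoils have zero camber.
    if m == 0.0 or p == 0.0:
        yc = 0.0
    elif x < p:
        yc = m / p**2 * (2.0 * p * x - x**2)
    else:
        yc = m / (1.0 - p)**2 * ((1.0 - 2.0 * p) + 2.0 * p * x - x**2)
    return yc, yt


camber_0012, thickness_0012 = naca4("0012", 0.3)
camber_2412, thickness_2412 = naca4("2412", 0.4)
print(camber_0012, thickness_0012, camber_2412, thickness_2412)
```

Each geometry in a dataset of this kind is therefore fully described by three numbers (m, p, t), which makes the family convenient as a low-dimensional input space for learning tasks.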
For further information refer to: Schillaci, A., Quadrio, M., Pipolo, C., Restelli, M., Boracchi, G. "Inferring Functional Properties from Fluid Dynamics Features" 2020 25th International Conference on Pattern Recognition (ICPR) Milan, Italy, Jan 10-15, 2021
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Combinatorial and guided screening of materials space with density-functional theory and related approaches has provided a wealth of hypothetical inorganic materials, which are increasingly tabulated in open databases. The OPTIMADE API is a standardised format for representing crystal structures, their measured and computed properties, and the methods for querying and filtering them from remote resources. Currently, the OPTIMADE federation spans over 20 data providers, rendering over 30 million structures accessible in this way, many of which are novel and have only recently been suggested by machine learning-based approaches. In this work, we outline our approach to non-exhaustively screen this dynamic trove of structures for the next-generation of optical materials. By applying MODNet, a neural network-based model for property prediction that has been shown to perform especially well for small materials datasets, within a combined active learning and high-throughput computation framework, we isolate particular structures and chemistries that should be most fruitful for further theoretical calculations and for experimental study as high-refractive-index materials. By making explicit use of automated calculations, federated dataset curation and machine learning, and by releasing these publicly, the workflows presented here can be periodically re-assessed as new databases implement OPTIMADE, and new hypothetical materials are suggested.
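To illustrate what querying and filtering through OPTIMADE looks like in practice, the sketch below builds a structures query URL using the specification's `filter`, `response_fields`, and `page_limit` parameters. The base URL is a placeholder for a provider, the filter is an arbitrary example rather than the screening criterion used in this work, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://example.org/optimade"  # hypothetical provider base URL


def structures_url(base_url, filter_expr, response_fields, page_limit=10):
    """Build an OPTIMADE /v1/structures query URL (no request is made)."""
    query = urlencode({
        "filter": filter_expr,
        "response_fields": ",".join(response_fields),
        "page_limit": page_limit,
    })
    return f"{base_url.rstrip('/')}/v1/structures?{query}"


# Example filter: binary compounds containing both Ti and O.
filter_expr = 'elements HAS ALL "Ti","O" AND nelements=2'
url = structures_url(BASE_URL, filter_expr,
                     ["chemical_formula_reduced", "nelements"])

# Round-trip check: decoding the query string recovers the filter.
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Because every provider in the federation accepts the same query grammar, the same URL suffix can be reused against each provider's base URL.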
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset, contained within Database.csv, is a comprehensive collection tailored for machine learning applications in the field of catalysis and materials science, focusing on single-atom alloys. It encompasses a wide array of data with 10,950 entries, each featuring 85 intrinsic descriptors alongside novel information on the predicted C-H dissociation energy barriers and reaction rates. These intrinsic descriptors include a variety of element and surface properties extracted from renowned databases like the Materials Project and Pymatgen, as well as surface structural features and characteristics derived through expert knowledge.
Intrinsic Descriptors
The 85 intrinsic descriptors provided in this dataset offer a detailed insight into the properties of single-atom alloys. These descriptors cover:
Element Properties: Extracted from the Materials Project and Pymatgen databases, these properties include atomic size, electronegativity, and other elemental characteristics critical for the study of material properties.
Surface Properties: Features related to the surface characteristics of the alloys, which play a significant role in their catalytic behavior and interaction with reactants.
Surface Structural Features: Detailed information on the structural aspects of the alloy surfaces, which can influence the material's catalytic activity and stability.
Expert-Derived Features: A set of features developed through expert knowledge, combining various data points to form comprehensive descriptors for machine learning applications.
Predicted Properties
C-H Dissociation Energy Barrier: A key metric for evaluating the catalytic efficiency of single-atom alloys, particularly in processes involving hydrocarbons.
Reaction Rates: Provides valuable insights into the kinetics of reactions facilitated by single-atom alloys, crucial for the development and optimization of catalytic processes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of advanced structural materials, such as preplaced aggregate concrete (PAC), fiber-reinforced concrete (FRC), and FRC beams, has revolutionized the field of civil engineering. The associated research paper, "RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials", published in Computers and Structures, introduces a novel RAGN-R approach for building a comprehensive predictive model. The dataset used in that research is published here for use by other researchers; for more details, please see the paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was shown to also be highly dependent on the molecular library dimension. Generally, with the increasing size of the screened database, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
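The central variable in this study, the ratio of negative to positive training instances, can be illustrated with a small helper that assembles a training set at a chosen ratio. This is a minimal sketch with hypothetical names and placeholder molecule identifiers; it does not reproduce the fingerprints, classifiers, or the cost-effectiveness threshold used by the authors.

```python
import random


def build_training_set(actives, inactives, ratio, seed=0):
    """Return labeled examples with about `ratio` negatives per positive.

    `actives` and `inactives` are lists of molecule identifiers
    (placeholders here); labels are 1 for active and 0 for inactive.
    """
    rng = random.Random(seed)
    # Never request more negatives than are available.
    n_negatives = min(len(inactives), int(round(ratio * len(actives))))
    negatives = rng.sample(inactives, n_negatives)
    return [(mol, 1) for mol in actives] + [(mol, 0) for mol in negatives]


actives = [f"active_{i}" for i in range(10)]
inactives = [f"inactive_{i}" for i in range(100)]

training_set = build_training_set(actives, inactives, ratio=5)
n_pos = sum(label for _, label in training_set)
n_neg = len(training_set) - n_pos
print(f"{n_neg}:{n_pos} negatives to positives")
```

Sweeping `ratio` and retraining at each value is the kind of experiment needed to see how the optimal training ratio shifts with the size of the screened library.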
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Imbalanced dataset for benchmarking
=======================
The algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets with varying degrees of class imbalance. This benchmark was proposed in [1]. The following section presents its main characteristics.
Characteristics
-------------------
|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
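The Ratio column reports the number of majority samples per minority sample once the listed target is treated as the minority class. A minimal sketch of that computation is shown below; the label vector is synthetic, with counts chosen to be consistent with the Ecoli row (336 samples, 8.6:1), and the function name is illustrative rather than part of `imbalanced-learn`.

```python
from collections import Counter


def imbalance_ratio(labels, minority_label):
    """Return the majority-to-minority ratio for a one-vs-rest target."""
    counts = Counter(labels)
    minority = counts[minority_label]
    majority = len(labels) - minority
    return majority / minority


# Synthetic labels consistent with the Ecoli row: 35 target vs 301 other.
labels = ["imU"] * 35 + ["other"] * 301
ratio = imbalance_ratio(labels, "imU")
print(f"{ratio:.1f}:1 over {len(labels)} samples")
```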
References
----------
[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).
[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).
[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.
[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IPLID integrates protein-ligand interaction data from multiple well-known resources, including BindingDB, ChEMBL, DrugBank, GPCRDB, PubChem, LINCS-HMS KinomeScan, and four published kinome assay results. Our database can facilitate projects in machine learning or deep learning-based drug development and other applications by providing integrated data sets appropriate for many research interests. Our database can be utilized for small-scale (e.g. kinases or GPCRs only) and large-scale (e.g. proteome-wide), qualitative or quantitative projects. With its ease of use and straightforward data format, IPLID offers a great educational resource for computer science and data science trainees who lack familiarity with chemistry and biology.
Data statistics
Target (data type) | Activities | Unique chemicals | Unique proteins | File name
All (binary) | 96318 | 18107 | 3107 | integrated_binary_activity.tsv
All (numerical) | 2798365 | 683009 | 5876 | integrated_continuous_activity.tsv
CYP450 (binary) | 67552 | 17273 | 47 | integrated_cyp450_binary.tsv
CRT (binary) | 4152 | 1219 | 412 | integrated_cancer_related_targets_binary.tsv
CDT (binary) | 519 | 349 | 88 | integrated_cardio_targets_binary.tsv
DRT (binary) | 4433 | 1325 | 852 | integrated_disease_related_targets_binary.tsv
FDA (binary) | 6217 | 1521 | 592 | integrated_fda_approved_targets_binary.tsv
GPCR (binary) | 1958 | 545 | 129 | integrated_gpcr_binary.tsv
NR (binary) | 1335 | 657 | 264 | integrated_nr_binary.tsv
PDT (binary) | 1469 | 674 | 404 | integrated_potential_drug_targets_binary.tsv
TF (binary) | 1966 | 998 | 304 | integrated_tf_binary.tsv
*Abbreviations: CYP450 (Cytochrome P450), CRT (Cancer-Related Target), CDT (Cardiovascular Disease candidate Target), DRT (Disease-Related Target), FDA (FDA-approved target), GPCR (G-Protein Coupled Receptor), NR (Nuclear Receptor), PDT (Potential Drug Target), TF (Transcription Factor)
*These protein classifications are from the UniProt database and the Human Protein Atlas (https://www.proteinatlas.org/)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the supplemental search query instructions for the JAAD International article titled as above.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This work presents a heterogeneous big dataset comprising electrocardiogram (ECG) signals, blood pressure signals, oxygen saturation (SpO2) signals, and text input. It extends our earlier dataset formulation presented in [1], and the signals were acquired from a trustworthy and relevant medical dataset library (PhysioNet [2]). The dataset includes medical features from heterogeneous sources, both sensory and non-sensory. First, the ECG sensor signals provide QRS width, ST elevation, peak numbers, and cycle interval. Second, the SpO2 sensor signals provide the SpO2 level. Third, the blood pressure sensor signals provide high (systolic) and low (diastolic) values. Finally, the text input is considered non-sensory data. The text inputs were formulated based on doctors' diagnostic procedures for chronic heart diseases. A Python software environment was used, and the simulated big data is presented along with analyses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a Real-Time Operating System.