100+ datasets found

Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning
figshare.com
bin
Updated May 25, 2025
+ more versions
Cite
Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
Explore at:
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.28729127.v2
Dataset updated
May 25, 2025
Dataset provided by
figshare
Authors
Rikuto Kotoge
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. Empowering these successful machine learning models are high-quality training datasets with sufficient data volume and adequate preprocessing. However, while there exist several public data portals, including The Cancer Genome Atlas (TCGA) multi-omics initiative, and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database that aims to better serve the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.
UCI Machine Learning Datasets 12/2013
academictorrents.com
bittorrent
Updated Dec 20, 2013
Cite
UCI (2013). UCI Machine Learning Datasets 12/2013 [Dataset]. https://academictorrents.com/details/7fafb101f9c7961f9b840daeb4af43039107ddef
Explore at:
Available download formats: bittorrent (16365432846)
Dataset updated
Dec 20, 2013
Dataset authored and provided by
UCI
License
https://academictorrents.com/nolicensespecified
Description
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the web site was designed in 2007 by Arthur Asuncion and David Newman, and this project is in collaboration with Rexa.info at the University of Massachusetts Amherst. Funding support from the National Science Foundation is gratefully acknowledged. Many people deserve thanks for making the repository a success. Foremost among them are the d
UCI Machine Learning Repository
scicrunch.org
Cite
UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
Explore at:
Unique identifier
https://identifiers.org/RRID:SCR_026571
Description
Collection of databases, domain theories, and data generators that are used by the machine learning community for empirical analysis of machine learning algorithms. Datasets approved to be in the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.
A 2D Near-Field Microwave Imaging Database for Machine Learning Training
ieee-dataport.org
Updated Mar 18, 2024
Cite
Colin Gilmore (2024). A 2D Near-Field Microwave Imaging Database for Machine Learning Training [Dataset]. https://ieee-dataport.org/open-access/2d-near-field-microwave-imaging-database-machine-learning-training
Explore at:
Dataset updated
Mar 18, 2024
Authors
Colin Gilmore
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
With the goal of improving machine learning approaches in inverse scattering
AI-Machine Learning Sound / Audio / Snippet Recordings Database
datarade.ai
Updated Jun 19, 2025
Cite
SoundPrint (2025). AI-Machine Learning Sound / Audio / Snippet Recordings Database [Dataset]. https://datarade.ai/data-products/ai-machine-learning-sound-audio-snippet-recordings-database-soundprint
Explore at:
Dataset updated
Jun 19, 2025
Dataset authored and provided by
SoundPrint
Area covered
Congo, Solomon Islands, Greenland, Iran (Islamic Republic of), Turkey, Peru, Taiwan, Nauru, Palau, Mongolia
Description
The Snippets database has sound / audio / sonic recordings across all kinds of venues (restaurants, bars, arenas, churches, movie theaters, retail stores, factories, parks, libraries, gyms, hotels, offices, and many more), with variance in noise levels (Quiet, Moderate, Loud, Very Loud), noise types, and acoustic environments, along with valuable metadata.

This is valuable for any audio-based software product/company to run/test its algorithm against various acoustic environments including:

Hearing aid companies wanting to test their software's ability to identify or separate certain sounds and background noise and mitigate them

Audio or video conferencing platforms that want to be able to identify a user's location (e.g., a user joins a call from a coffee shop and the platform can identify and mitigate such sounds for better audio)

Other audio-based use cases
A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and...
zenodo.org
data.niaid.nih.gov
+2 more
csv
Updated Jul 20, 2024
+ more versions
Cite
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian (2024). A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and other sources about the 2024 outbreak of Measles [Dataset]. http://doi.org/10.5281/zenodo.11711230
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.11711230
Dataset updated
Jul 20, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nirmalya Thakur; Vanessa Su; Mingchen Shao; Kesha A. Patel; Hongseok Jeong; Victoria Knieling; Andrew Bian
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jun 15, 2024
Area covered
YouTube
Description
Please cite the following paper when using this dataset:

N. Thakur, V. Su, M. Shao, K. Patel, H. Jeong, V. Knieling, and A. Bian “A labelled dataset for sentiment analysis of videos on YouTube, TikTok, and other sources about the 2024 outbreak of measles,” Proceedings of the 26th International Conference on Human-Computer Interaction (HCII 2024), Washington, USA, 29 June - 4 July 2024. (Accepted as a Late Breaking Paper, Preprint Available at: https://doi.org/10.48550/arXiv.2406.07693)

Abstract

This dataset contains the data of 4011 videos about the ongoing outbreak of measles published on 264 websites on the internet between January 1, 2024, and May 31, 2024. These websites primarily include YouTube and TikTok, which account for 48.6% and 15.2% of the videos, respectively. The remainder of the websites include Instagram and Facebook as well as the websites of various global and local news organizations. For each of these videos, the URL of the video, title of the post, description of the post, and the date of publication of the video are presented as separate attributes in the dataset. After developing this dataset, sentiment analysis (using VADER), subjectivity analysis (using TextBlob), and fine-grain sentiment analysis (using DistilRoBERTa-base) of the video titles and video descriptions were performed. This included classifying each video title and video description into (i) one of the sentiment classes i.e. positive, negative, or neutral, (ii) one of the subjectivity classes i.e. highly opinionated, neutral opinionated, or least opinionated, and (iii) one of the fine-grain sentiment classes i.e. fear, surprise, joy, sadness, anger, disgust, or neutral. These results are presented as separate attributes in the dataset for the training and testing of machine learning algorithms for performing sentiment analysis or subjectivity analysis in this field as well as for other applications. The paper associated with this dataset (please see the above-mentioned citation) also presents a list of open research questions that may be investigated using this dataset.
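The three sentiment classes named above are commonly derived from VADER's compound score. The following is a minimal sketch of that mapping, assuming the ±0.05 cutoffs conventionally recommended for VADER; the description does not state which thresholds the authors used, and `sentiment_class` is a hypothetical helper name.

```python
def sentiment_class(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a sentiment class.

    The 0.05 cutoff is the conventional VADER recommendation and is an
    assumption here; the dataset authors' exact rule is not stated.
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores only; these are not values taken from the dataset.
labels = [sentiment_class(score) for score in (0.62, -0.31, 0.01)]
print(labels)
```

In practice the compound score would come from a sentiment analyzer run on each video title or description; here it is passed in directly so the sketch has no external dependencies.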
Training dataset for NABat Machine Learning V1.0
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
+ more versions
Cite
U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
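The selection and splitting procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the NABat pipeline: the cap of 1,250 is assumed to apply per species/grid cell combination, the 80/10/10 split fractions are invented for the example, and the record layout and function names (`capped_sample`, `random_split`) are hypothetical.

```python
import random
from collections import defaultdict


def capped_sample(files, cap=1250, seed=0):
    """Randomly keep at most `cap` files per (species, grid_cell) group.

    `files` holds (path, species, grid_cell) tuples; this layout is a
    hypothetical stand-in, not the NABat database schema.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path, species, cell in files:
        groups[(species, cell)].append(path)
    selected = []
    for species, cell in sorted(groups):
        paths = groups[(species, cell)]
        rng.shuffle(paths)  # random selection within the group
        selected.extend((p, species, cell) for p in paths[:cap])
    return selected


def random_split(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split into train/validation/holdout (fractions assumed)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Toy records with a small cap so the capping is visible.
files = [(f"mylu_{i}.wav", "MYLU", 1) for i in range(3)]
files += [(f"epfu_{i}.wav", "EPFU", 2) for i in range(5)]
selected = capped_sample(files, cap=4)
train, val, holdout = random_split(selected)
print(len(selected), len(train), len(val), len(holdout))
```

Fixing the seed makes the selection reproducible, which matters when the holdout set must stay excluded from later releases.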
DR IQA Database V2
ieee-dataport.org
Updated Dec 23, 2022
+ more versions
Cite
Shahrukh Athar (2022). DR IQA Database V2 [Dataset]. https://ieee-dataport.org/documents/dr-iqa-database-v2
Explore at:
Dataset updated
Dec 23, 2022
Authors
Shahrukh Athar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In practical media distribution systems
Vector Database Solution Report
datainsightsmarket.com
doc, pdf, ppt
Updated May 31, 2025
Cite
Data Insights Market (2025). Vector Database Solution Report [Dataset]. https://www.datainsightsmarket.com/reports/vector-database-solution-1930729
Explore at:
Available download formats: pdf, doc, ppt
Dataset updated
May 31, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The vector database solution market is experiencing explosive growth, projected to reach $3218.3 million in 2025 and exhibiting a robust Compound Annual Growth Rate (CAGR) of 22.6% from 2025 to 2033. This rapid expansion is driven by the increasing adoption of AI and machine learning applications across various sectors, including e-commerce, finance, and healthcare. These applications rely heavily on efficient similarity search capabilities offered by vector databases, making them a crucial component of modern data infrastructure. The rising volume of unstructured data, such as images, videos, and text, further fuels the demand, as vector databases excel at handling and querying such data types effectively. Key market drivers include advancements in deep learning algorithms, the need for real-time search functionalities, and the growing emphasis on personalized user experiences. This market is characterized by a diverse range of players, including established tech giants like Redis and emerging specialized vendors like Zilliz (with its Milvus offering), Pinecone, Weaviate, and others. Competition is fierce, prompting continuous innovation in areas such as query performance, scalability, and ease of integration. While challenges remain, such as the complexity of managing and deploying vector databases, the overall market outlook remains positive. Future growth will likely be influenced by the continued development of AI/ML applications, the maturation of cloud-based vector database services, and the increased accessibility of these solutions for businesses of all sizes. The ongoing development of standardized interfaces and improved tooling will also play a significant role in broader adoption.
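The description gives a 2025 base value and a CAGR but no end-of-period figure. As a rough worked example, the sketch below compounds the stated base forward, assuming the 22.6% rate holds constant over the eight years from 2025 to 2033; the resulting 2033 value is implied by that assumption and is not a number quoted in the report. The function name `project_value` is hypothetical.

```python
def project_value(base: float, cagr: float, years: int) -> float:
    """Compound `base` forward by `years` at a constant annual rate `cagr`."""
    return base * (1.0 + cagr) ** years


base_2025_musd = 3218.3  # USD millions, from the report summary
cagr = 0.226             # 22.6% per year, from the report summary

# Assumes constant growth over 2033 - 2025 = 8 compounding periods.
projected_2033_musd = project_value(base_2025_musd, cagr, 2033 - 2025)
print(f"Implied 2033 market size: ${projected_2033_musd:,.1f} million")
```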
A database of CFD-computed flow fields around airfoils for machine-learning...
data.niaid.nih.gov
explore.openaire.eu
Updated Mar 26, 2021
+ more versions
Cite
Quadrio, Maurizio (2021). A database of CFD-computed flow fields around airfoils for machine-learning applications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4106751
Explore at:
Dataset updated
Mar 26, 2021
Dataset provided by
Quadrio, Maurizio
Schillaci, Andrea
Boracchi, Giacomo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is designed to test Machine-Learning techniques on Computational Fluid Dynamics (CFD) data.

It contains two-dimensional RANS simulations of the turbulent flow around NACA 4-digit airfoils, at a fixed angle of attack (10 degrees) and a fixed Reynolds number (3x10^6). The whole NACA family is spanned. The present dataset contains 2600 geometries, and 425 further geometries are published in an accompanying repository (10.5281/zenodo.4638071).

For further information refer to: Schillaci, A., Quadrio, M., Pipolo, C., Restelli, M., Boracchi, G. "Inferring Functional Properties from Fluid Dynamics Features" 2020 25th International Conference on Pattern Recognition (ICPR) Milan, Italy, Jan 10-15, 2021
Data from: Optical materials discovery and design with federated databases...
archive.materialscloud.org
application/gzip, bin +1
Updated Aug 5, 2024
Cite
Victor Trinquet; Matthew L. Evans; Cameron Hargreaves; Pierre-Paul De Breuck; Gian-Marco Rignanese (2024). Optical materials discovery and design with federated databases and machine learning [Dataset]. http://doi.org/10.24435/materialscloud:5p-vq
Explore at:
Available download formats: bin, text/markdown, application/gzip
Unique identifier
https://doi.org/10.24435/materialscloud:5p-vq
Dataset updated
Aug 5, 2024
Dataset provided by
Materials Cloud
Authors
Victor Trinquet; Matthew L. Evans; Cameron Hargreaves; Pierre-Paul De Breuck; Gian-Marco Rignanese
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Combinatorial and guided screening of materials space with density-functional theory and related approaches has provided a wealth of hypothetical inorganic materials, which are increasingly tabulated in open databases. The OPTIMADE API is a standardised format for representing crystal structures, their measured and computed properties, and the methods for querying and filtering them from remote resources. Currently, the OPTIMADE federation spans over 20 data providers, rendering over 30 million structures accessible in this way, many of which are novel and have only recently been suggested by machine learning-based approaches. In this work, we outline our approach to non-exhaustively screen this dynamic trove of structures for the next-generation of optical materials. By applying MODNet, a neural network-based model for property prediction that has been shown to perform especially well for small materials datasets, within a combined active learning and high-throughput computation framework, we isolate particular structures and chemistries that should be most fruitful for further theoretical calculations and for experimental study as high-refractive-index materials. By making explicit use of automated calculations, federated dataset curation and machine learning, and by releasing these publicly, the workflows presented here can be periodically re-assessed as new databases implement OPTIMADE, and new hypothetical materials are suggested.
Single-Atom Alloy Dataset for Machine Learning
figshare.com
txt
Updated Jul 9, 2024
Cite
Jikai Sun; Huan Wang (2024). Single-Atom Alloy Dataset for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.26200007.v3
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.26200007.v3
Dataset updated
Jul 9, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Jikai Sun; Huan Wang
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview

This dataset, contained within Database.csv, is a comprehensive collection tailored for machine learning applications in the field of catalysis and materials science, focusing on single-atom alloys. It encompasses a wide array of data with 10,950 entries, each featuring 85 intrinsic descriptors alongside novel information on the predicted C-H dissociation energy barriers and reaction rates. These intrinsic descriptors include a variety of element and surface properties extracted from renowned databases like the Materials Project and Pymatgen, as well as surface structural features and characteristics derived through expert knowledge.

Intrinsic Descriptors

The 85 intrinsic descriptors provided in this dataset offer a detailed insight into the properties of single-atom alloys. These descriptors cover:

Element Properties: Extracted from the Materials Project and Pymatgen databases, these properties include atomic size, electronegativity, and other elemental characteristics critical for the study of material properties.

Surface Properties: Features related to the surface characteristics of the alloys, which play a significant role in their catalytic behavior and interaction with reactants.

Surface Structural Features: Detailed information on the structural aspects of the alloy surfaces, which can influence the material's catalytic activity and stability.

Expert-Derived Features: A set of features developed through expert knowledge, combining various data points to form comprehensive descriptors for machine learning applications.

Predicted Properties

C-H Dissociation Energy Barrier: A key metric for evaluating the catalytic efficiency of single-atom alloys, particularly in processes involving hydrocarbons.

Reaction Rates: Provides valuable insights into the kinetics of reactions facilitated by single-atom alloys, crucial for the development and optimization of catalytic processes.
Data from: RAGN-R: A multi-subject ensemble machine-learning method for...
data.mendeley.com
Updated May 14, 2025
Cite
Farzin Kazemi (2025). RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials [Dataset]. http://doi.org/10.17632/zv2cdhhxrn.2
Explore at:
Unique identifier
https://doi.org/10.17632/zv2cdhhxrn.2
Dataset updated
May 14, 2025
Authors
Farzin Kazemi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The utilization of advanced structural materials, such as preplaced aggregate concrete (PAC), fiber-reinforced concrete (FRC), and FRC beams, has revolutionized the field of civil engineering. The associated paper, "RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials", published in Computers and Structures, introduces a novel RAGN-R approach for building a comprehensive predictive model. The dataset used in that research is published here for use by other researchers; for details, see the paper.
The influence of the negative-positive ratio and screening database size on...
plos.figshare.com
pdf
Updated Jun 1, 2023
Cite
Rafał Kurczab; Andrzej J. Bojarski (2023). The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening [Dataset]. http://doi.org/10.1371/journal.pone.0175410
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0175410
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Rafał Kurczab; Andrzej J. Bojarski
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was shown to also be highly dependent on the molecular library dimension. Generally, with the increasing size of the screened database, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
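To make the notion of a negative-to-positive training ratio concrete, here is a minimal sketch of assembling a training set at a chosen ratio. It does not reproduce the authors' protocol or their cost-effectiveness threshold; the function name `build_training_set`, the molecule identifiers, and the ratio of 5 are all hypothetical.

```python
import random


def build_training_set(actives, decoys, neg_pos_ratio, seed=0):
    """Pair all actives with `neg_pos_ratio` times as many sampled decoys."""
    n_neg = int(round(len(actives) * neg_pos_ratio))
    if n_neg > len(decoys):
        raise ValueError("not enough decoys for the requested ratio")
    rng = random.Random(seed)
    negatives = rng.sample(decoys, n_neg)  # sampling without replacement
    # Label 1 = active (positive), 0 = inactive/decoy (negative).
    return [(m, 1) for m in actives] + [(m, 0) for m in negatives]


# Placeholder identifiers; real inputs would be molecular fingerprints.
actives = [f"active_{i}" for i in range(10)]
decoys = [f"decoy_{i}" for i in range(200)]
training = build_training_set(actives, decoys, neg_pos_ratio=5)
n_pos = sum(label for _, label in training)
print(len(training), n_pos, len(training) - n_pos)
```

Sweeping `neg_pos_ratio` over several values and retraining at each one is the kind of experiment the abstract describes, with the preferred ratio depending on the size of the library being screened.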
Notable AI Models
epoch.ai
csv
Cite
Epoch AI, Notable AI Models [Dataset]. https://epoch.ai/data/notable-ai-models
Explore at:
Available download formats: csv
Dataset authored and provided by
Epoch AI
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Global
Variables measured
https://epoch.ai/data/notable-ai-models-documentation#records
Measurement technique
https://epoch.ai/data/notable-ai-models-documentation#records
Description
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
Imbalanced dataset for benchmarking
zenodo.org
data.niaid.nih.gov
application/gzip
Updated Jan 24, 2020
Cite
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira (2020). Imbalanced dataset for benchmarking [Dataset]. http://doi.org/10.5281/zenodo.61452
Explore at:
Available download formats: application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.61452
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Guillaume Lemaitre; Fernando Nogueira; Christos K. Aridas; Dayvid V. R. Oliveira
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Imbalanced dataset for benchmarking
=======================

The different algorithms of the `imbalanced-learn` toolbox are evaluated on a set of common datasets, which are more or less balanced. These benchmarks have been proposed in [1]. The following section presents the main characteristics of this benchmark.

Characteristics
-------------------

|ID |Name |Repository & Target |Ratio |# samples| # features |
|:---:|:----------------------:|--------------------------------------|:------:|:-------------:|:--------------:|
|1 |Ecoli |UCI, target: imU |8.6:1 |336 |7 |
|2 |Optical Digits |UCI, target: 8 |9.1:1 |5,620 |64 |
|3 |SatImage |UCI, target: 4 |9.3:1 |6,435 |36 |
|4 |Pen Digits |UCI, target: 5 |9.4:1 |10,992 |16 |
|5 |Abalone |UCI, target: 7 |9.7:1 |4,177 |8 |
|6 |Sick Euthyroid |UCI, target: sick euthyroid |9.8:1 |3,163 |25 |
|7 |Spectrometer |UCI, target: >=44 |11:1 |531 |93 |
|8 |Car_Eval_34 |UCI, target: good, v good |12:1 |1,728 |6 |
|9 |ISOLET |UCI, target: A, B |12:1 |7,797 |617 |
|10 |US Crime |UCI, target: >0.65 |12:1 |1,994 |122 |
|11 |Yeast_ML8 |LIBSVM, target: 8 |13:1 |2,417 |103 |
|12 |Scene |LIBSVM, target: >one label |13:1 |2,407 |294 |
|13 |Libras Move |UCI, target: 1 |14:1 |360 |90 |
|14 |Thyroid Sick |UCI, target: sick |15:1 |3,772 |28 |
|15 |Coil_2000 |KDD, CoIL, target: minority |16:1 |9,822 |85 |
|16 |Arrhythmia |UCI, target: 06 |17:1 |452 |279 |
|17 |Solar Flare M0 |UCI, target: M->0 |19:1 |1,389 |10 |
|18 |OIL |UCI, target: minority |22:1 |937 |49 |
|19 |Car_Eval_4 |UCI, target: vgood |26:1 |1,728 |6 |
|20 |Wine Quality |UCI, wine, target: <=4 |26:1 |4,898 |11 |
|21 |Letter Img |UCI, target: Z |26:1 |20,000 |16 |
|22 |Yeast_ME2 |UCI, target: ME2 |28:1 |1,484 |8 |
|23 |Webpage |LIBSVM, w7a, target: minority|33:1 |49,749 |300 |
|24 |Ozone Level |UCI, ozone, data |34:1 |2,536 |72 |
|25 |Mammography |UCI, target: minority |42:1 |11,183 |6 |
|26 |Protein homo. |KDD CUP 2004, minority |111:1|145,751 |74 |
|27 |Abalone_19 |UCI, target: 19 |130:1|4,177 |8 |
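The Ratio column above can be read as the number of majority-class samples per minority-class sample, with every label other than the target treated as majority. That reading is an assumption, and the sketch below only illustrates it. The class counts in the example (35 target samples, 301 others) are chosen to be consistent with the Ecoli row's 336 samples and 8.6:1 ratio; they were not read from the data files. `imbalance_ratio` is a hypothetical helper.

```python
from collections import Counter


def imbalance_ratio(labels, minority_label):
    """Return majority/minority count, treating all other labels as majority."""
    counts = Counter(label == minority_label for label in labels)
    n_min, n_maj = counts[True], counts[False]
    if n_min == 0:
        raise ValueError("minority label not present")
    return n_maj / n_min


# Assumed counts that match the Ecoli row: 35 + 301 = 336 samples.
labels = ["imU"] * 35 + ["other"] * 301
ratio = imbalance_ratio(labels, "imU")
print(f"{ratio:.1f}:1")
```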

References
----------
[1] Ding, Zejin, "Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics." Dissertation, Georgia State University, (2011).

[2] Blake, Catherine, and Christopher J. Merz. "UCI Repository of machine learning databases." (1998).

[3] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3 (2011): 27.

[4] Caruana, Rich, Thorsten Joachims, and Lars Backstrom. "KDD-Cup 2004: results and analysis." ACM SIGKDD Explorations Newsletter 6.2 (2004): 95-108.
Integrated Protein-Ligand Interaction Database
zenodo.org
data.niaid.nih.gov
application/gzip, csv +1
Updated Jan 24, 2020
Cite
Hansaim Lim; Lei Xie (2020). Integrated Protein-Ligand Interaction Database [Dataset]. http://doi.org/10.7706/iplid.01
Explore at:
Available download formats: application/gzip, tsv, csv
Unique identifier
https://doi.org/10.7706/iplid.01
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Hansaim Lim; Lei Xie
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IPLID integrates protein-ligand interaction data from multiple well-known resources, including BindingDB, ChEMBL, DrugBank, GPCRDB, PubChem, LINCS-HMS KinomeScan, and four published kinome assay results. Our database can facilitate projects in machine learning or deep learning-based drug development and other applications by providing integrated data sets appropriate for many research interests. Our database can be utilized for small-scale (e.g. kinases or GPCRs only) and large-scale (e.g. proteome-wide), qualitative or quantitative projects. With its ease of use and straightforward data format, IPLID offers a great educational resource for computer science and data science trainees who lack familiarity with chemistry and biology.

Data statistics

|Target (data type) |Activities |Unique chemicals |Unique proteins |File name |
|:------------------|----------:|----------------:|---------------:|:---------|
|All (binary) |96318 |18107 |3107 |integrated_binary_activity.tsv |
|All (numerical) |2798365 |683009 |5876 |integrated_continuous_activity.tsv |
|CYP450 (binary) |67552 |17273 |47 |integrated_cyp450_binary.tsv |
|CRT (binary) |4152 |1219 |412 |integrated_cancer_related_targets_binary.tsv |
|CDT (binary) |519 |349 |88 |integrated_cardio_targets_binary.tsv |
|DRT (binary) |4433 |1325 |852 |integrated_disease_related_targets_binary.tsv |
|FDA (binary) |6217 |1521 |592 |integrated_fda_approved_targets_binary.tsv |
|GPCR (binary) |1958 |545 |129 |integrated_gpcr_binary.tsv |
|NR (binary) |1335 |657 |264 |integrated_nr_binary.tsv |
|PDT (binary) |1469 |674 |404 |integrated_potential_drug_targets_binary.tsv |
|TF (binary) |1966 |998 |304 |integrated_tf_binary.tsv |

*Abbreviations: CYP450 (Cytochrome P450), CRT (Cancer-Related Target), CDT (Cardiovascular Disease candidate Target), DRT (Disease-Related Target), FDA (FDA-approved target), GPCR (G-Protein Coupled Receptor), NR (Nuclear Receptor), PDT (Potential Drug Target), TF (Transcription Factor)

*These protein classifications are from the UniProt database and the Human Protein Atlas (https://www.proteinatlas.org/)

Data from: Scoping Review of Genetic Databases for Rare Dermatologic...
data.mendeley.com
Updated Mar 23, 2023
Cite
Celine Schreidah (2023). Scoping Review of Genetic Databases for Rare Dermatologic Diseases: Opportunity for Artificial Intelligence and Machine Learning [Dataset]. http://doi.org/10.17632/msz9s7htnp.1
Unique identifier
https://doi.org/10.17632/msz9s7htnp.1
Dataset updated
Mar 23, 2023
Authors
Celine Schreidah
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These are the supplemental search query instructions for the JAAD International article titled as above.
Generating Heterogeneous Big Data Set for Healthcare and Telemedicine...
data.mendeley.com
Updated Jan 23, 2023
Cite
Omar Al-Obidi (2023). Generating Heterogeneous Big Data Set for Healthcare and Telemedicine Research Based on ECG, Spo2, Blood Pressure Sensors, and Text Inputs: Data set classified, Analyzed, Organized, And Presented in Excel File Format. [Dataset]. http://doi.org/10.17632/gsmjh55sfy.1
Unique identifier
https://doi.org/10.17632/gsmjh55sfy.1
Dataset updated
Jan 23, 2023
Authors
Omar Al-Obidi
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This work presents a heterogeneous big dataset comprising electrocardiogram (ECG) signals, blood pressure signals, oxygen saturation (SpO2) signals, and text input. It extends the dataset formulation presented in [1]; the signals were acquired from PhysioNet [2], a trusted medical dataset library. The dataset includes medical features from heterogeneous sources (sensory and non-sensory data): first, ECG sensor signals, which contain QRS width, ST elevation, peak numbers, and cycle interval; second, the SpO2 level from SpO2 sensor signals; third, blood pressure sensor signals, which contain high (systolic) and low (diastolic) values; and finally, text input, which is considered non-sensory data. The text inputs were formulated based on doctors' diagnostic procedures for chronic heart diseases. A Python software environment was used, and the simulated big data is presented along with analyses.
Online Machine Learning for Energy-Aware Multicore Real-Time Embedded...
ieee-dataport.org
Updated May 18, 2022
Cite
Jose Luis Hoffmann (2022). Online Machine Learning for Energy-Aware Multicore Real-Time Embedded Systems Database [Dataset]. https://ieee-dataport.org/documents/online-machine-learning-energy-aware-multicore-real-time-embedded-systems-database
Dataset updated
May 18, 2022
Authors
Jose Luis Hoffmann
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
a Real-Time Operating System.
