100+ datasets found
  1. UCI Machine Learning Repository

    • dknet.org
    • rrid.site
    Cite
    UCI Machine Learning Repository [Dataset]. http://identifiers.org/RRID:SCR_026571
    Description

    Collection of databases, domain theories, and data generators that are used by the machine learning community for empirical analysis of machine learning algorithms. Datasets approved for the repository will be assigned a Digital Object Identifier (DOI) if they do not already possess one. Datasets will be licensed under a Creative Commons Attribution 4.0 International license (CC BY 4.0), which allows for the sharing and adaptation of the datasets for any purpose, provided that appropriate credit is given.

  2. Data from: MLOmics: Cancer Multi-Omics Database for Machine Learning

    • figshare.com
    bin
    Updated May 25, 2025
    Cite
    Rikuto Kotoge (2025). MLOmics: Cancer Multi-Omics Database for Machine Learning [Dataset]. http://doi.org/10.6084/m9.figshare.28729127.v2
    Available download formats: bin
    Dataset updated
    May 25, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Rikuto Kotoge
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Framing the investigation of diverse cancers as a machine learning problem has recently shown significant potential in multi-omics analysis and cancer research. These successful machine learning models are powered by high-quality training datasets with sufficient data volume and adequate preprocessing. However, while several public data portals exist, including The Cancer Genome Atlas (TCGA) multi-omics initiative and open databases such as LinkedOmics, these databases are not off-the-shelf for existing machine learning models. We propose MLOmics, an open cancer multi-omics database aimed at better serving the development and evaluation of bioinformatics and machine learning models. MLOmics contains 8,314 patient samples covering all 32 cancer types with four omics types, stratified features, and extensive baselines. Complementary support for downstream analysis and bio-knowledge linking is also included to support interdisciplinary analysis.

  3. Cancer Multiple Dataset UCI MLR

    • kaggle.com
    zip
    Updated Aug 5, 2025
    Cite
    Medi Hunter - 4004 (2025). Cancer Multiple Dataset UCI MLR [Dataset]. https://www.kaggle.com/datasets/shuvokumarbasakbd/cancer-multiple-dataset-uci-mlr/suggestions
    Available download formats: zip (74213598 bytes)
    Dataset updated
    Aug 5, 2025
    Authors
    Medi Hunter - 4004
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Source and more info: https://archive.ics.uci.edu/datasets

    The **UCI Machine Learning Repository** is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

    The datasets collected in this project represent a diverse and comprehensive set of cancer-related data sourced from the UCI Machine Learning Repository. They cover a wide spectrum of cancer types and research perspectives, including breast cancer datasets such as the original, diagnostic, prognostic, and Coimbra variants, which focus on tumor features, recurrence, and biochemical markers. Cervical cancer is represented through datasets focusing on behavioral risks and general risk factors. The lung cancer dataset provides categorical diagnostic attributes, while the primary tumor dataset offers insights into tumor locations based on metastasis data. Additionally, specialized datasets like differentiated thyroid cancer recurrence, glioma grading with clinical and mutation features, and gene expression RNA-Seq data expand the scope into genetic and molecular-level cancer analysis. Together, these datasets support a wide range of machine learning applications including classification, prediction, survival analysis, and feature correlation across various types of cancer.

    RRA_Think Differently, Create history’s next line.

    Hello Data Hunters! Hope you're doing well. https://www.kaggle.com/shuvokumarbasak4004 (More Dataset) https://www.kaggle.com/shuvokumarbasak2030

  4. In-Database Machine Learning Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Aug 4, 2025
    Cite
    Growth Market Reports (2025). In-Database Machine Learning Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/in-database-machine-learning-market
    Available download formats: csv, pdf, pptx
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    In-Database Machine Learning Market Outlook



    According to our latest research, the global in-database machine learning market size in 2024 stands at USD 2.74 billion, reflecting the sector’s rapid adoption across diverse industries. The market is expected to grow at a robust CAGR of 28.6% from 2025 to 2033, reaching a projected value of USD 24.19 billion by the end of the forecast period. This exceptional growth is primarily driven by the increasing demand for advanced analytics, real-time data processing, and the seamless integration of machine learning capabilities directly within database environments, which are essential for accelerating business insights and operational efficiency.
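
The compound-growth arithmetic behind a projection of this kind can be sketched as follows. The helper names are illustrative, and the number of compounding periods is an assumption: the summary does not say whether 2024 to 2033 is counted as eight or nine periods, so the sketch only shows how the quoted figures relate to each other and does not reproduce the report's model.

```python
def project_value(start_value, annual_rate, years):
    """Compound start_value forward at annual_rate for `years` periods."""
    return start_value * (1.0 + annual_rate) ** years


def implied_cagr(start_value, end_value, years):
    """Annual growth rate that takes start_value to end_value in `years` periods."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    base_2024 = 2.74     # USD billion, as quoted in the summary
    target_2033 = 24.19  # USD billion, as quoted in the summary
    for periods in (8, 9):  # assumption: the period count is not stated
        rate = implied_cagr(base_2024, target_2033, periods)
        print(f"{periods} periods: implied CAGR {rate:.1%}")
    # Forward projection at the quoted 28.6% rate over nine periods.
    print(f"{project_value(base_2024, 0.286, 9):.2f}")
```

Running both period counts side by side makes it easy to see how sensitive the implied rate is to that unstated convention.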




    The primary growth factor propelling the in-database machine learning market is the exponential surge in data volumes generated by enterprises worldwide. As organizations transition to digital-first operations, the need to analyze vast datasets in real time has become paramount. Traditional machine learning workflows, which require data extraction and movement to external environments, are increasingly seen as inefficient and prone to latency and security issues. In-database machine learning eliminates these bottlenecks by enabling algorithms to run directly within the database, thus reducing data movement, minimizing latency, and ensuring higher data security. This approach not only streamlines the analytics pipeline but also empowers businesses to derive actionable insights faster, supporting critical functions such as fraud detection, predictive maintenance, and customer personalization.




    Another significant factor fueling market expansion is the growing adoption of cloud-based data platforms and the proliferation of hybrid IT infrastructures. Enterprises are leveraging cloud-native databases and data warehouses to centralize and scale their analytics capabilities. In-database machine learning solutions are designed to seamlessly integrate with these modern architectures, allowing organizations to harness the power of machine learning without the need for extensive data migration or IT overhead. This integration facilitates agile development, lowers total cost of ownership, and enables organizations to respond swiftly to market changes. Furthermore, the rise of open-source machine learning frameworks and APIs has democratized access to advanced analytics, making it easier for businesses of all sizes to implement and benefit from in-database ML capabilities.




    A third pivotal growth driver is the increasing emphasis on regulatory compliance, data privacy, and security in highly regulated industries such as BFSI and healthcare. In-database machine learning offers a compelling solution by keeping sensitive data within secure database environments, thereby reducing the risk of data breaches and ensuring compliance with stringent data protection regulations such as GDPR and HIPAA. This capability is particularly valuable for organizations operating in regions with complex regulatory landscapes, where data residency and sovereignty are critical concerns. As a result, the adoption of in-database ML is accelerating among enterprises that prioritize security, governance, and auditability in their analytics workflows.




    From a regional perspective, North America continues to dominate the in-database machine learning market, accounting for the largest revenue share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology vendors, early adoption of advanced analytics, and a mature digital infrastructure contribute to North America’s leadership. However, rapid economic development, digitization initiatives, and expanding IT ecosystems in Asia Pacific are positioning the region as a significant growth engine for the forecast period. Meanwhile, Europe’s focus on data privacy and innovation is driving substantial investments in secure and compliant in-database ML solutions, further fueling market growth across the continent.





    Component Analysis



    The in-database machine learning mark

  5. DR IQA Database V1

    • ieee-dataport.org
    Updated Dec 23, 2022
    Cite
    Shahrukh Athar (2022). DR IQA Database V1 [Dataset]. https://ieee-dataport.org/documents/dr-iqa-database-v1
    Dataset updated
    Dec 23, 2022
    Authors
    Shahrukh Athar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In practical media distribution systems

  6. Data from: RAGN-R: A multi-subject ensemble machine-learning method for...

    • data.mendeley.com
    Updated May 14, 2025
    Cite
    Farzin Kazemi (2025). RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials [Dataset]. http://doi.org/10.17632/zv2cdhhxrn.2
    Dataset updated
    May 14, 2025
    Authors
    Farzin Kazemi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The use of advanced structural materials, such as preplaced aggregate concrete (PAC), fiber-reinforced concrete (FRC), and FRC beams, has revolutionized the field of civil engineering. The study "RAGN-R: A multi-subject ensemble machine-learning method for estimating mechanical properties of advanced structural materials", published in Computers and Structures, introduces a novel RAGN-R approach for building a comprehensive predictive model. The dataset used in that study is published here for use by other researchers; see the paper for details.

  7. Data from: Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). 
Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N = 3; Eumops floridanus, N = 3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N = 11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reached 1,250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.
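
The selection and splitting procedure described above could be sketched roughly as below. This is not the USGS pipeline: the record layout, the function names, and the 80/10/10 split fractions are all hypothetical, and the sketch assumes the 1,250 cap applies per species/grid cell combination, which is one possible reading of the description.

```python
import random
from collections import defaultdict


def cap_per_group(files, cap=1250, seed=0):
    """Randomly keep at most `cap` files per (species, grid_cell) pair.

    `files` is an iterable of (path, species, grid_cell) tuples; this
    record layout is hypothetical.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for path, species, grid_cell in files:
        groups[(species, grid_cell)].append(path)
    selected = []
    for key in sorted(groups):
        paths = groups[key]
        rng.shuffle(paths)
        # A group smaller than the cap is simply exhausted.
        selected.extend((p, key[0], key[1]) for p in paths[:cap])
    return selected


def split_dataset(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into train, validation, and test lists.

    The default fractions are an assumption; the description does not
    give the proportions actually used.
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the holdout set, withheld from the release
    return train, val, test
```

Seeding the generator keeps the selection reproducible, which matters when a holdout set has to stay fixed across model versions.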

  8. Breast Cancer Wisconsin (Prognostic) Data Set

    • kaggle.com
    zip
    Updated Mar 31, 2017
    Cite
    Sarah VCH (2017). Breast Cancer Wisconsin (Prognostic) Data Set [Dataset]. https://www.kaggle.com/sarahvch/breast-cancer-wisconsin-prognostic-data-set
    Available download formats: zip (49800 bytes)
    Dataset updated
    Mar 31, 2017
    Authors
    Sarah VCH
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Data From: UCI Machine Learning Repository http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names

    Content

    "Each record represents follow-up data for one breast cancer case. These are consecutive patients seen by Dr. Wolberg since 1984, and include only those cases exhibiting invasive breast cancer and no evidence of distant metastases at the time of diagnosis.

    The first 30 features are computed from a digitized image of a
    fine needle aspirate (FNA) of a breast mass. They describe
    characteristics of the cell nuclei present in the image.
    A few of the images can be found at
    http://www.cs.wisc.edu/~street/images/
    
    The separation described above was obtained using
    Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
    Construction Via Linear Programming." Proceedings of the 4th
    Midwest Artificial Intelligence and Cognitive Science Society,
    pp. 97-101, 1992], a classification method which uses linear
    programming to construct a decision tree. Relevant features
    were selected using an exhaustive search in the space of 1-4
    features and 1-3 separating planes.
    
    The actual linear program used to obtain the separating plane
    in the 3-dimensional space is that described in:
    [K. P. Bennett and O. L. Mangasarian: "Robust Linear
    Programming Discrimination of Two Linearly Inseparable Sets",
    Optimization Methods and Software 1, 1992, 23-34].
    
    The Recurrence Surface Approximation (RSA) method is a linear
    programming model which predicts Time To Recur using both
    recurrent and nonrecurrent cases. See references (i) and (ii)
    above for details of the RSA method. 
    
    This database is also available through the UW CS ftp server:
    
    ftp ftp.cs.wisc.edu
    cd math-prog/cpo-dataset/machine-learn/WPBC/
    

    1) ID number
    2) Outcome (R = recur, N = nonrecur)
    3) Time (recurrence time if field 2 = R, disease-free time if field 2 = N)
    4-33) Ten real-valued features are computed for each cell nucleus:

    a) radius (mean of distances from center to points on the perimeter)
    b) texture (standard deviation of gray-scale values)
    c) perimeter
    d) area
    e) smoothness (local variation in radius lengths)
    f) compactness (perimeter^2 / area - 1.0)
    g) concavity (severity of concave portions of the contour)
    h) concave points (number of concave portions of the contour)
    i) symmetry 
    j) fractal dimension ("coastline approximation" - 1)"
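
As a small worked illustration of feature (f), the compactness formula from the list above can be written directly in code. The function name and the circular test outline are hypothetical and are not taken from the dataset; a circle is used only because it gives the smallest value the expression can take, 4π - 1.

```python
import math


def compactness(perimeter, area):
    """Compactness as defined in the feature list: perimeter^2 / area - 1.0."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / area - 1.0


if __name__ == "__main__":
    # Idealised circular outline of radius r (hypothetical input, not a
    # value from the dataset): perimeter = 2*pi*r, area = pi*r^2.
    r = 3.0
    print(compactness(2 * math.pi * r, math.pi * r ** 2))
```

More irregular outlines have a longer perimeter for the same area, so their compactness value is larger.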
    

    Acknowledgements

    Creators:

    Dr. William H. Wolberg, General Surgery Dept., University of
    Wisconsin, Clinical Sciences Center, Madison, WI 53792
    wolberg@eagle.surgery.wisc.edu
    
    W. Nick Street, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    street@cs.wisc.edu 608-262-6619
    
    Olvi L. Mangasarian, Computer Sciences Dept., University of
    Wisconsin, 1210 West Dayton St., Madison, WI 53706
    olvi@cs.wisc.edu 
    

    Inspiration

    I'm really interested in trying out various machine learning algorithms on some real life science data.

  9. iCubWorld28 - Full Images

    • data.mendeley.com
    Updated Aug 29, 2016
    Cite
    Giulia Pasquale (2016). iCubWorld28 - Full Images [Dataset]. http://doi.org/10.17632/5txngd6td6.1
    Dataset updated
    Aug 29, 2016
    Authors
    Giulia Pasquale
    License

    http://www.gnu.org/licenses/gpl-3.0.en.html

    Description

    A complete description of this dataset is available at https://robotology.github.io/iCubWorld .

  10. Vector Database Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Sep 20, 2025
    Cite
    Data Insights Market (2025). Vector Database Software Report [Dataset]. https://www.datainsightsmarket.com/reports/vector-database-software-529421
    Available download formats: pdf, ppt, doc
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Vector Database Software market is poised for substantial growth, projected to reach an estimated $XXX million in 2025, with an impressive Compound Annual Growth Rate (CAGR) of XX% during the forecast period of 2025-2033. This rapid expansion is fueled by the increasing adoption of AI and machine learning across industries, necessitating efficient storage and retrieval of unstructured data like images, audio, and text. The burgeoning demand for enhanced search capabilities, personalized recommendations, and advanced anomaly detection is driving the market forward. Key market drivers include the widespread implementation of large language models (LLMs), the growing need for semantic search functionalities, and the continuous innovation in AI-powered applications. The market is segmenting into applications catering to both Small and Medium-sized Enterprises (SMEs) and Large Enterprises, with a clear shift towards Cloud-based solutions owing to their scalability, cost-effectiveness, and ease of deployment. The vector database landscape is characterized by dynamic innovation and fierce competition, with prominent players like Pinecone, Weaviate, Supabase, and Zilliz Cloud leading the charge. Emerging trends such as the development of hybrid search capabilities, integration with existing data infrastructure, and enhanced security features are shaping the market's trajectory. While the market shows immense promise, certain restraints, including the complexity of data integration and the need for specialized technical expertise, may pose challenges. Geographically, North America is expected to dominate the market share due to its early adoption of AI technologies and robust R&D investments, followed closely by Asia Pacific, which is witnessing rapid digital transformation and a surge in AI startups. Europe and other emerging regions are also anticipated to contribute significantly to market growth as AI adoption becomes more widespread. 
This report delves into the rapidly evolving Vector Database Software Market, providing a detailed analysis of its landscape from 2019 to 2033. With a Base Year of 2025, the report offers crucial insights for the Estimated Year of 2025 and projects market dynamics through the Forecast Period of 2025-2033, building upon the Historical Period of 2019-2024. The global vector database software market is poised for significant expansion, with an estimated market size projected to reach hundreds of millions of dollars by 2025, and anticipated to grow exponentially in the coming years. This growth is fueled by the increasing adoption of AI and machine learning across various industries, necessitating efficient storage and retrieval of high-dimensional vector data.

  11. UCI Automobile Dataset

    • kaggle.com
    Updated Feb 12, 2023
    Cite
    Otrivedi (2023). UCI Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/otrivedi/automobile-data/suggestions
    Dataset updated
    Feb 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Otrivedi
    Description

    In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

    This dataset consists of data from the 1985 Ward's Automotive Yearbook. The sources are:

    1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
    2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038
    3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

    Number of Instances: 398
    Number of Attributes: 9 including the class attribute

    Attribute Information:

    mpg: continuous
    cylinders: multi-valued discrete
    displacement: continuous
    horsepower: continuous
    weight: continuous
    acceleration: continuous
    model year: multi-valued discrete
    origin: multi-valued discrete
    car name: string (unique for each instance)

    This data set consists of three types of entities:

    I - The specification of an auto in terms of various characteristics

    II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".

    III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.

    The analysis is divided into two parts:

    Data Wrangling

    1. Pre-processing data in Python
    2. Dealing with missing values
    3. Data formatting
    4. Data normalization
    5. Binning

    Exploratory Data Analysis

    1. Descriptive statistics
    2. Groupby
    3. Analysis of variance
    4. Correlation
    5. Correlation stats
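
Three of the wrangling steps listed above (missing values, normalization, binning) might look like this in plain Python. This is a minimal sketch, not the author's notebook: the function names and the sample column are made up, and it assumes missing entries are marked with "?" and that mean imputation, min-max scaling, and three equal-width bins are acceptable choices.

```python
def fill_missing_with_mean(values, missing="?"):
    """Replace missing markers with the mean of the observed values."""
    observed = [float(v) for v in values if v != missing]
    mean = sum(observed) / len(observed)
    return [mean if v == missing else float(v) for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, labels=("low", "medium", "high")):
    """Assign each value to one of len(labels) equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        index = int((v - lo) / width) if width else 0
        # The maximum value lands on the upper edge, so clamp it into the last bin.
        binned.append(labels[min(index, len(labels) - 1)])
    return binned


if __name__ == "__main__":
    horsepower = ["111", "154", "?", "102", "262"]  # made-up sample column
    filled = fill_missing_with_mean(horsepower)
    print(min_max_normalize(filled))
    print(equal_width_bins(filled))
```

Imputing before normalizing keeps the filled value inside the observed range, so it cannot distort the scaling or the bin edges.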

    Acknowledgment
    Dataset: UCI Machine Learning Repository
    Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

  12. 3D Kinect Total Body Database for Back Stretches

    • kilthub.cmu.edu
    txt
    Updated May 30, 2023
    Cite
    Blake Capella; Deepak Subramanian; Roberta Klatzky; Daniel Siewiorek (2023). 3D Kinect Total Body Database for Back Stretches [Dataset]. http://doi.org/10.1184/R1/7999364.v2
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Carnegie Mellon University
    Authors
    Blake Capella; Deepak Subramanian; Roberta Klatzky; Daniel Siewiorek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data was collected by a Kinect V2 as a set of X, Y, Z coordinates at 60 fps during 6 different yoga-inspired back stretches. There are 541 files in the dataset, each containing position and velocity for 25 body joints. These joints include: Head, Neck, SpineShoulder, SpineMid, SpineBase, ShoulderRight, ShoulderLeft, HipRight, HipLeft, ElbowRight, WristRight, HandRight, HandTipRight, ThumbRight, ElbowLeft, WristLeft, HandLeft, HandTipLeft, ThumbLeft, KneeRight, AnkleRight, FootRight, KneeLeft, AnkleLeft, FootLeft.

    The program used to record this data was adapted from Thomas Sanchez Langeling’s skeleton recording code. The file was set to record data for each body part as a separate file, repeated for each exercise. Each body part for a specific exercise is stored in a distinct folder. These folders are named with the following convention: subjNumber_stretchName_trialNumber

    The subjNumber ranged from 0 – 8. The stretchName was one of the following: Mermaid, Seated, Sumo, Towel, Wall, Y. The trialNumber ranged from 0 – 9 and represented the repetition number. These coordinates were chosen to have an origin centered at the subject’s upper chest.

    The data was standardized to the following conditions:
    1) Kinect placed at the height of 2 ft and 3 in
    2) Subject consistently positioned 6.5 ft away from the camera with their chests facing the camera
    3) Each participant completed 10 repetitions of each stretch before continuing on

    Data was collected from the following population:
    * Adults ages 18-21
    * Females: 4
    * Males: 5

    The following types of pre-processing occurred at the time of data collection.

    Velocity Data:
    Calculated using a discrete derivative equation with a spacing of 5 frames chosen to reduce sensitivity of the velocity function
    v[n]=(x[n]-x[n-5])/5
    Occurs for all body parts and all axes individually

    Related manuscript: Capella, B., Subramanian, D., Klatzky, R., & Siewiorek, D. Action Pose Recognition from 3D Camera Data Using Inter-frame and Inter-joint Dependencies. Preprint at link in references.
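
The velocity formula and the folder naming convention described above can be sketched as follows. The function names are hypothetical, and returning None for the first five frames is an assumption, since the description does not say how the recorder handled frames where x[n-5] does not exist.

```python
def velocity(positions, spacing=5):
    """Discrete derivative v[n] = (x[n] - x[n - spacing]) / spacing.

    The result is in position units per frame. The first `spacing`
    entries are None because x[n - spacing] is undefined there; this
    handling is an assumption.
    """
    result = []
    for n, x in enumerate(positions):
        if n < spacing:
            result.append(None)
        else:
            result.append((x - positions[n - spacing]) / spacing)
    return result


def parse_folder_name(name):
    """Split 'subjNumber_stretchName_trialNumber' into typed fields."""
    subject, stretch, trial = name.split("_")
    return int(subject), stretch, int(trial)


if __name__ == "__main__":
    x_axis = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # made-up joint coordinates
    print(velocity(x_axis))
    print(parse_folder_name("3_Mermaid_7"))  # folder name built from the stated convention
```

The same function would be applied once per joint and per axis, matching the note that the calculation occurs for all body parts and all axes individually.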

  13. Image data sets for machine learning

    • kaggle.com
    zip
    Updated Mar 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    InkiYinji (2022). Image data sets for machine learning [Dataset]. https://www.kaggle.com/datasets/inkiyinji/image
    Available download formats: zip (1300016613 bytes)
    Dataset updated
    Mar 22, 2022
    Authors
    InkiYinji
    Description

    Dataset

    This dataset was created by InkiYinji

    Contents

  14. Distributed Vector Search System Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 9, 2025
    Cite
    Data Insights Market (2025). Distributed Vector Search System Report [Dataset]. https://www.datainsightsmarket.com/reports/distributed-vector-search-system-502348
    Available download formats: pdf, ppt, doc
    Dataset updated
    Jun 9, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Distributed Vector Search (DVS) system market is experiencing rapid growth, driven by the increasing adoption of artificial intelligence (AI) and machine learning (ML) applications across diverse sectors. The market's expansion is fueled by the need for efficient and scalable solutions to manage and query large-scale vector databases, crucial for applications like recommendation engines, image and video search, and natural language processing. While precise market sizing data is unavailable, considering the high CAGR (let's assume a conservative 30% based on industry trends for similar rapidly growing technologies) and a likely 2025 market size in the low billions (e.g., $2 billion), we can project substantial growth in the coming years. Key drivers include the rising volume of unstructured data, advancements in deep learning models generating high-dimensional vectors, and the need for real-time search capabilities. The market is segmented by deployment (cloud, on-premise), application (recommendation systems, similarity search), and organization size (SMEs, large enterprises). Companies like Pinecone, Vespa, Zilliz, Weaviate, Elastic, Meta, Microsoft, Qdrant, and Spotify are major players, fostering competition and innovation within the space. However, challenges such as the complexity of implementing DVS systems and the need for specialized expertise can act as restraints to broader adoption. The forecast period (2025-2033) promises even more significant market expansion, driven by continuous technological advancements and increased awareness of DVS solutions' potential. The increasing integration of DVS into various industry verticals – from e-commerce to healthcare – will further fuel growth. While challenges exist, the potential benefits, including improved search accuracy, faster query response times, and better scalability, are compelling enterprises to invest in DVS systems. 
The competitive landscape is dynamic, with both established tech giants and specialized startups vying for market share. This dynamic environment will likely lead to further innovation and improved accessibility of DVS technology, driving even faster market growth in the coming decade.

  15. kr-vs-kp

    • openml.org
    Updated Apr 6, 2014
    Alen Shapiro (2014). kr-vs-kp [Dataset]. https://www.openml.org/d/3
    Authors
    Alen Shapiro
    Description

    Author: Alen Shapiro
    Source: UCI
    Please cite: UCI citation policy

    1. Title: Chess End-Game -- King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7). The pawn on a7 means it is one square away from queening. It is the King+Rook's side (white) to move.

    2. Sources: (a) Database originally generated and described by Alen Shapiro. (b) Donor/Coder: Rob Holte (holte@uottawa.bitnet). The database was supplied to Holte by Peter Clark of the Turing Institute in Glasgow (pete@turing.ac.uk). (c) Date: 1 August 1989

    3. Past Usage:

       - Alen D. Shapiro (1983,1987), "Structured Induction in Expert Systems", Addison-Wesley. This book is based on Shapiro's Ph.D. thesis (1983) at the University of Edinburgh entitled "The Role of Structured Induction in Expert Systems".

       - Stephen Muggleton (1987), "Structuring Knowledge by Asking Questions", pp.218-229 in "Progress in Machine Learning", edited by I. Bratko and Nada Lavrac, Sigma Press, Wilmslow, England SK9 5BB.

       - Robert C. Holte, Liane Acker, and Bruce W. Porter (1989), "Concept Learning and the Problem of Small Disjuncts", Proceedings of IJCAI. Also available as technical report AI89-106, Computer Sciences Department, University of Texas at Austin, Austin, Texas 78712.

    4. Relevant Information: The dataset format is described below. Note: the format of this database was modified on 2/26/90 to conform with the format of all the other databases in the UCI repository of machine learning databases.

    5. Number of Instances: 3196 total

    6. Number of Attributes: 36

    7. Attribute Summaries: Classes (2): -- White-can-win ("won") and White-cannot-win ("nowin"). I believe that White is deemed to be unable to win if the Black pawn can safely advance. Attributes: see Shapiro's book.

    8. Missing Attributes: -- none

    9. Class Distribution: In 1669 of the positions (52%), White can win. In 1527 of the positions (48%), White cannot win.

    The format for instances in this database is a sequence of 37 attribute values. Each instance is a board-description for this chess endgame. The first 36 attributes describe the board. The last (37th) attribute is the classification: "won" or "nowin". There are 0 missing values. A typical board-description is

    f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,t,f,f,f,f,f,f,f,t,t,n,won

    The names of the features do not appear in the board-descriptions. Instead, each feature corresponds to a particular position in the feature-value list. For example, the head of this list is the value for the feature "bkblk". The following is the list of features, in the order in which their values appear in the feature-value list:

    [bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd, hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr, skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg]

    In the file, there is one instance (board position) per line.
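
The positional format described above can be decoded by pairing each value with the feature name at the same index. The following is a minimal sketch using only the feature list and the sample board-description quoted in this entry. The function name `parse_board` is hypothetical and not part of any dataset tooling.

```python
# Feature names in the order given in the dataset description (36 entries).
FEATURE_NAMES = (
    "bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,"
    "dwipd,hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,"
    "skach,skewr,skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg"
).split(",")


def parse_board(line: str) -> tuple[dict[str, str], str]:
    """Split one comma-separated instance into (feature -> value, class label)."""
    values = [v.strip() for v in line.strip().split(",")]
    expected = len(FEATURE_NAMES) + 1  # 36 board features plus the class
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    *board, label = values
    return dict(zip(FEATURE_NAMES, board)), label


# The sample board-description quoted in the dataset description.
EXAMPLE = (
    "f,f,f,f,f,f,f,f,f,f,f,f,l,f,n,f,f,t,f,f,f,f,f,f,f,"
    "t,f,f,f,f,f,f,f,t,t,n,won"
)

features, label = parse_board(EXAMPLE)
print(label, features["bkblk"], features["wtoeg"])
```

Checking the value count up front is a deliberate choice: because the features are identified only by position, a dropped or extra field would otherwise shift every later value onto the wrong name without any error.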

    Num Instances: 3196
    Num Attributes: 37
    Num Continuous: 0 (Int 0 / Real 0)
    Num Discrete: 37
    Missing values: 0 / 0.0%

  16. The IQ-OTH/NCCD lung cancer dataset

    • data.mendeley.com
    • kaggle.com
    Updated Jan 3, 2023
    hamdalla alyasriy (2023). The IQ-OTH/NCCD lung cancer dataset [Dataset]. http://doi.org/10.17632/bhmdr45bh2.3
    Authors
    hamdalla alyasriy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) lung cancer dataset was collected in the above-mentioned specialist hospitals over a period of three months in fall 2019. It includes CT scans of patients diagnosed with lung cancer in different stages, as well as healthy subjects. IQ-OTH/NCCD slides were marked by oncologists and radiologists in these two centers. The dataset contains a total of 1190 images representing CT scan slices of 110 cases. These cases are grouped into three classes: normal, benign, and malignant. Of these, 40 cases are diagnosed as malignant, 15 as benign, and 55 are classified as normal. The CT scans were originally collected in DICOM format. The scanner used is a SOMATOM from Siemens. The CT protocol includes 120 kV and a slice thickness of 1 mm; a window width ranging from 350 to 1200 HU and a window center from 50 to 600 were used for reading. Scans were acquired with breath hold at full inspiration. All images were de-identified before performing analysis. Written consent was waived by the oversight review board. The study was approved by the institutional review board of participating medical centers. Each scan contains several slices. The number of slices ranges from 80 to 200, and each slice represents an image of the human chest from different sides and angles. The 110 cases vary in gender, age, educational attainment, area of residence and living status. Some of them are employees of the Iraqi ministries of Transport and Oil, others are farmers and gainers. Most of them come from places in the middle region of Iraq, particularly the provinces of Baghdad, Wasit, Diyala, Salahuddin, and Babylon.

  17. fundrazrPandaSays

    • kaggle.com
    zip
    Updated Dec 1, 2021
    Popescu Aura (2021). fundrazrPandaSays [Dataset]. https://www.kaggle.com/datasets/popescuaura/fundrazrpandasays
    zip (110266 bytes)
    Authors
    Popescu Aura
    Description

    Dataset

    This dataset was created by Popescu Aura

    Contents

  18. A machine learning based prediction model for life expectancy

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    zip
    Updated Nov 14, 2022
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo (2022). A machine learning based prediction model for life expectancy [Dataset]. http://doi.org/10.5061/dryad.z612jm6fv
    Dataset provided by
    Strathmore University
    University of South Carolina Upstate
    Authors
    Evans Omondi; Brian Lipesa; Elphas Okango; Bernard Omolo
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    The social and financial systems of many nations throughout the world are significantly impacted by life expectancy (LE) models. Numerous studies have pointed out the crucial effects that life expectancy projections will have on societal issues and the administration of the global healthcare system. The computation of life expectancy has primarily entailed building an ordinary life table. However, the life table is limited by its long duration, the assumption of homogeneity of cohorts, and censoring. As a result, a more robust and accurate approach is needed. In this study, a supervised machine learning model for estimating life expectancy rates is developed. The model takes into consideration health, socioeconomic, and behavioral characteristics by applying the eXtreme Gradient Boosting (XGBoost) algorithm to data from 193 UN member states. The effectiveness of the model's prediction is compared to that of the Random Forest (RF) and Artificial Neural Network (ANN) regressors utilized in earlier research. XGBoost attains an MAE of 1.554 and an RMSE of 2.402, outperforming the RF model (MAE 7.938, RMSE 11.304) and the ANN model (MAE 3.86, RMSE 5.002). The overall results of this study support XGBoost as a reliable and efficient model for estimating life expectancy.

    Methods

    Secondary data were used: a sample of 2832 observations of 21 variables was sourced from the World Health Organization (WHO) and the United Nations (UN) databases. The data cover 193 UN member states for the years 2000–2015, with the LE health-related factors drawn from the Global Health Observatory data repository.
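
The two error metrics quoted in the description above, MAE and RMSE, can be illustrated with a short calculation. The following is a minimal sketch in plain Python; the sample values are made up for illustration and are not drawn from the dataset, and the function names are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference; penalises large errors more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Made-up life expectancy values in years (illustrative only, not from the dataset).
actual = [70.0, 65.0, 80.0, 75.0]
predicted = [72.0, 64.0, 77.0, 75.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE = {mae:.3f} years, RMSE = {rmse:.3f} years")
```

Both metrics are expressed in the unit of the target (years here), and RMSE is never smaller than MAE on the same predictions, which is consistent with each MAE/RMSE pair reported in the description.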

  19. Machine Learning Acquisitions Database

    • acquirezy.com
    Updated Nov 15, 2025
    Acquirezy (2025). Machine Learning Acquisitions Database [Dataset]. https://acquirezy.com/acquisitions/industry/machine-learning
    Dataset authored and provided by
    Acquirezy
    Description

    Comprehensive database of mergers and acquisitions in the Machine Learning industry

  20. Sport Services Dataset

    • data.mendeley.com
    Updated Sep 26, 2020
    Paulo Pinheiro (2020). Sport Services Dataset [Dataset]. http://doi.org/10.17632/yprk4jdgnv.1
    Authors
    Paulo Pinheiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains real operational data from a sports facility and covers all new users who signed up between June 1st 2014 and October 31st 2019. Demographic and service level agreement (SLA) data are collected by operators when enrolling users in the activities they intend to practice. Attendance data for the facility and its classes were obtained from the access control system, in which each user identifies themselves with an RFID card to access the facilities on the days and times agreed in their SLA.

UCI Machine Learning Repository

RRID:SCR_026571, r3d100010960, UCI Machine Learning Repository (RRID:SCR_026571), UC Irvine Machine Learning Repository
