100+ datasets found
  1. Machine Learning Dataset

    • brightdata.com
    .json, .csv, .xlsx
    Updated Dec 23, 2024
    Cite
    Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
    Explore at:
    Available download formats
    .json, .csv, .xlsx
    Dataset updated
    Dec 23, 2024
    Dataset authored and provided by
    Bright Data (https://brightdata.com/)
    License

    https://brightdata.com/license

    Area covered
    Worldwide
    Description

    Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.

  2. A Dataset for Machine Learning Algorithm Development

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 1, 2024
    Cite
    (Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
    Explore at:
    Dataset updated
    May 1, 2024
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.

  3. Data for Machine Learning

    • kaggle.com
    zip
    Updated Nov 20, 2024
    Cite
    jiankang wang663 (2024). Data for Machine Learning [Dataset]. https://www.kaggle.com/datasets/jiankangwang663/data-for-machine-learning
    Explore at:
    Available download formats
    zip (25942 bytes)
    Dataset updated
    Nov 20, 2024
    Authors
    jiankang wang663
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by jiankang wang663

    Released under MIT

    Contents

    Data for machine learning course.

  4. SYNERGY - Open machine learning dataset on study selection in systematic...

    • dataverse.nl
    csv, json, txt, zip
    Updated Apr 24, 2023
    Cite
    Jonathan De Bruin; Yongchao Ma; Gerbrich Ferdinands; Jelle Teijema; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
    Explore at:
    Available download formats
    csv, json, txt, zip (multiple files of each type)
    Dataset updated
    Apr 24, 2023
    Dataset provided by
    DataverseNL
    Authors
    Jonathan De Bruin; Yongchao Ma; Gerbrich Ferdinands; Jelle Teijema; Rens Van de Schoot
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This sparsity makes SYNERGY a distinctive dataset for developing information retrieval algorithms, especially ones that must cope with sparse labels. Because many variables are available per record (i.e. titles, abstracts, authors, references, topics), the dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
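
    The label sparsity quoted above can be checked with a few lines of arithmetic. The sketch below recomputes the inclusion rate from the two counts in the description and derives inverse-frequency class weights. The weighting scheme is an illustrative assumption (one common way to compensate for sparse labels), not something the SYNERGY authors prescribe, and the function names are hypothetical.

```python
# Counts taken from the SYNERGY description above.
TOTAL_RECORDS = 169_288
INCLUDED_RECORDS = 2_834


def inclusion_rate(included: int, total: int) -> float:
    """Fraction of records labeled as included in a review."""
    return included / total


def inverse_frequency_weights(included: int, total: int) -> dict:
    """Illustrative class weights: total / (n_classes * class_count)."""
    excluded = total - included
    return {
        "included": total / (2 * included),
        "excluded": total / (2 * excluded),
    }


rate = inclusion_rate(INCLUDED_RECORDS, TOTAL_RECORDS)
weights = inverse_frequency_weights(INCLUDED_RECORDS, TOTAL_RECORDS)
print(f"inclusion rate: {rate:.2%}")
print(f"class weights: {weights}")
```

    With labels this sparse, the rare "included" class receives a much larger weight than the "excluded" class, which is the effect a weighted loss would rely on.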

  5. Insurance Data for Machine Learning

    • kaggle.com
    zip
    Updated Apr 8, 2023
    Cite
    SRIDHAR STREAKS (2023). Insurance Data for Machine Learning [Dataset]. https://www.kaggle.com/datasets/sridharstreaks/insurance-data-for-machine-learning
    Explore at:
    Available download formats
    zip (22284586 bytes)
    Dataset updated
    Apr 8, 2023
    Authors
    SRIDHAR STREAKS
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    "Insurance Dataset for Predicting Health Insurance Premiums in the US" is a collection of data on various factors that can influence medical costs and premiums for health insurance in the United States. The dataset includes information on 10 variables: age, gender, body mass index (BMI), number of children, smoking status, region, income, education, occupation, and type of insurance plan. The dataset was created using a script that generated a million records of randomly sampled data points, ensuring that the data represented the population of insured individuals in the US. The dataset can be used to build and test machine learning models for predicting insurance premiums and exploring the relationship between different factors and medical costs.

  6. Data from: Data for Machine Learning Predictions of Nitrate in Shallow...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data for Machine Learning Predictions of Nitrate in Shallow Groundwater in the Conterminous United States [Dataset]. https://catalog.data.gov/dataset/data-for-machine-learning-predictions-of-nitrate-in-shallow-groundwater-in-the-conterminou
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    An extreme gradient boosting (XGB) machine learning model was developed to predict the distribution of nitrate in shallow groundwater across the conterminous United States (CONUS). Nitrate was predicted at a 1-square-kilometer (km) resolution at a depth below the water table of 10 m. The model builds on a previous XGB machine learning model developed to predict nitrate at domestic and public supply groundwater zones (Ransom and others, 2022) by incorporating additional monitoring well samples and modifying and adding predictor variables. The shallow zone model included variables representing well characteristics, hydrologic conditions, soil type, geology, climate, oxidation/reduction, and nitrogen inputs. Predictor variables derived from empirical or numerical process-based models were also included to integrate information on controlling processes and conditions. This data release documents the model and provides the model results. Included in this data release are: 1) a model archive of the R project: source code, input files (including model training and testing data, rasters of all final predictor variables, and an output raster representing predicted nitrate concentration in the shallow zone), 2) a read_me.txt file describing the model archive and an explanation of its use and the modeling details, and 3) a table describing the model variables.

  7. Weather Prediction

    • kaggle.com
    • zenodo.org
    zip
    Updated Mar 10, 2024
    Cite
    The Devastator (2024). Weather Prediction [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-prediction
    Explore at:
    Available download formats
    zip (958204 bytes)
    Dataset updated
    Mar 10, 2024
    Authors
    The Devastator
    Description

    Credit to the original author: The dataset was originally published here

    Weather prediction dataset

    A dataset for teaching machine learning and deep learning

    Hands-on teaching of modern machine learning and deep learning techniques heavily relies on the use of well-suited datasets. The "weather prediction dataset" is a novel tabular dataset that was specifically created for teaching machine learning and deep learning to an academic audience. The dataset contains intuitively accessible weather observations from 18 locations in Europe. It was designed to be suitable for a large variety of different training goals, many of which do not easily yield unrealistically high prediction accuracy. Teachers or instructors can thus choose the difficulty of the training goals and match it to the respective learner audience or lesson objective. The compact size and complexity of the dataset make it possible to quickly train common machine learning and deep learning models on a standard laptop so that they can be used in live hands-on sessions.

    The dataset can be found in the `\dataset` folder and can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.4980359

    References

    If you make use of this dataset, in particular in the form of an academic contribution, please cite the following two references:

    • Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface air temperature and precipitation series for the European Climate Assessment. Int. J. of Climatol., 22, 1441-1453. Data and metadata available at http://www.ecad.eu
    • Florian Huber, Dafne van Kuppevelt, Peter Steinbach, Colin Sauze, Yang Liu, Berend Weel, "Will the sun shine? – An accessible dataset for teaching machine learning and deep learning", DOI TO BE ADDED!

    Map of the locations of the 18 weather stations from which data was collected


  8. [Coursera] Neural Networks for Machine Learning (University of Toronto)...

    • academictorrents.com
    bittorrent
    Updated Mar 5, 2017
    Cite
    University of Toronto (2017). [Coursera] Neural Networks for Machine Learning (University of Toronto) (neuralnets) [Dataset]. https://academictorrents.com/details/3e6f1876bbd46780602e72f4b122329fb668bd2c
    Explore at:
    Available download formats
    bittorrent (1026708997)
    Dataset updated
    Mar 5, 2017
    Dataset authored and provided by
    University of Toronto
    License

    https://academictorrents.com/nolicensespecified

    Area covered
    Toronto
    Description

    [Coursera] Neural Networks for Machine Learning (University of Toronto) (neuralnets)

  9. Banana Machine Learning Dataset

    • universe.roboflow.com
    zip
    Updated Dec 11, 2023
    Cite
    MHFaisalb (2023). Banana Machine Learning Dataset [Dataset]. https://universe.roboflow.com/mhfaisalb/banana-machine-learning
    Explore at:
    Available download formats
    zip
    Dataset updated
    Dec 11, 2023
    Dataset authored and provided by
    MHFaisalb
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Variables measured
    Pisang Bounding Boxes
    Description

    Banana Machine Learning

    ## Overview
    
    Banana Machine Learning is a dataset for object detection tasks - it contains Pisang annotations for 200 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [Public Domain license](https://creativecommons.org/publicdomain/zero/1.0/).
    
  10. Data from: Assessing predictive performance of supervised machine learning...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1 more
    zip
    Updated May 23, 2023
    Cite
    Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
    Explore at:
    Available download formats
    zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Strathmore University
    Authors
    Evans Omondi
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work, including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R² score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.

    Methods

    Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
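
    The two headline metrics in this description measure different things: the R² score applies to the regressors and accuracy to the classifiers. A minimal sketch of both computations follows. The toy values are hypothetical, and the "low"/"high" price bands are an assumed illustration of how a price could be turned into a class label, not the study's actual setup.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(labels_true, labels_pred):
    """Share of predictions that match the true class label."""
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)


# Hypothetical toy values, not taken from the diamond data.
prices_true = [1000.0, 2000.0, 3000.0, 4000.0]
prices_pred = [1100.0, 1900.0, 3200.0, 3900.0]
bands_true = ["low", "low", "high", "high"]
bands_pred = ["low", "high", "high", "high"]

print(r2_score(prices_true, prices_pred))
print(accuracy(bands_true, bands_pred))
```

    Because the two metrics are on different scales and answer different questions, a high R² for a regressor and a lower accuracy for a classifier built on the same features are not contradictory.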

  11. A dataset for machine learning research in the field of stress analyses of...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 25, 2020
    Cite
    Jaroslav Matej (2020). A dataset for machine learning research in the field of stress analyses of mechanical structures [Dataset]. http://doi.org/10.17632/wzbzznk8z3.2
    Explore at:
    Dataset updated
    Jul 25, 2020
    Authors
    Jaroslav Matej
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset is prepared and intended as a data source for development of a stress analysis method based on machine learning. It consists of finite element stress analyses of randomly generated mechanical structures. The dataset contains more than 270,794 pairs of stress analyses images (von Mises stress) of randomly generated 2D structures with predefined thickness and material properties. All the structures are fixed at their bottom edges and loaded with gravity force only. See PREVIEW directory with some examples. The zip file contains all the files in the dataset.

  12. mmlu-machine-learning

    • huggingface.co
    Updated Feb 7, 2024
    + more versions
    Cite
    Bruce W. Lee (2024). mmlu-machine-learning [Dataset]. https://huggingface.co/datasets/brucewlee1/mmlu-machine-learning
    Explore at:
    Croissant: Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 7, 2024
    Authors
    Bruce W. Lee
    Description

    brucewlee1/mmlu-machine-learning dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. Data from: Project Machine Learning Dataset

    • universe.roboflow.com
    zip
    Updated Jun 6, 2024
    + more versions
    Cite
    soda (2024). Project Machine Learning Dataset [Dataset]. https://universe.roboflow.com/soda-fj5ov/project-machine-learning-8sjsi
    Explore at:
    Available download formats
    zip
    Dataset updated
    Jun 6, 2024
    Dataset authored and provided by
    soda
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Variables measured
    Deteksi Rempah Rempah Bounding Boxes
    Description

    Project Machine Learning

    ## Overview
    
    Project Machine Learning is a dataset for object detection tasks - it contains Deteksi Rempah Rempah annotations for 1,270 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. Solar Dynamics Observatory (SDO) Machine Learning Dataset

    • registry.opendata.aws
    Updated May 18, 2023
    Cite
    NASA (2023). Solar Dynamics Observatory (SDO) Machine Learning Dataset [Dataset]. https://registry.opendata.aws/sdoml-fdl/
    Explore at:
    Dataset updated
    May 18, 2023
    Dataset provided by
    NASA (http://www.nasa.gov/)
    Description

    The v1 dataset includes AIA/HMI observations from 2010-2018, and v2 includes AIA/HMI observations from 2010-2020, in all 10 wavebands (94A, 131A, 171A, 193A, 211A, 304A, 335A, 1600A, 1700A, 4500A), with 512x512 resolution and a 6-minute cadence; HMI vector magnetic field observations in Bx, By, and Bz components, with 512x512 resolution and a 12-minute cadence; and EVE observations in 39 wavelengths from 2010-05-01 to 2014-05-26, with a 10-second cadence.

  15. Data from: MLFMF: Data Sets for Machine Learning for Mathematical...

    • data.niaid.nih.gov
    Updated Oct 26, 2023
    Cite
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo (2023). MLFMF: Data Sets for Machine Learning for Mathematical Formalization [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10041074
    Explore at:
    Dataset updated
    Oct 26, 2023
    Dataset provided by
    University of Ljubljana
    Institute of Mathematics, Physics, and Mechanics
    Authors
    Bauer, Andrej; Petković, Matej; Todorovski, Ljupčo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    MLFMF

    MLFMF (Machine Learning for Mathematical Formalization) is a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. The MLFMF data sets provide solid benchmarking support for further investigation of the numerous machine learning approaches to formalized mathematics. With more than 250,000 entries in total, this is currently the largest collection of formalized mathematical knowledge in machine learnable format. In addition to benchmarking the recommendation systems, the data sets can also be used for benchmarking node classification and link prediction algorithms.

    The four data sets

    Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes

    • the largest Lean 4 library Mathlib,
    • the three largest Agda libraries:
      • the standard library,
      • the library of univalent mathematics Agda-unimath, and
      • the TypeTopology library.

    Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of syntax trees of all the entries in the library. The network contains the (modular) structure of the library and the references between entries, while the syntax trees give complete and easily parsed information about each entry. The Lean library data set was obtained by converting .olean files into s-expressions (see the lean2sexp tool). The Agda data sets were obtained with an s-expression extension of the official Agda repository (use either master-sexp or release-2.6.3-sexp branch). For more details, see our arXiv copy of the paper.

    Directory structure

    First, the mlfmf.zip archive needs to be unzipped. It contains a separate directory for every library (for example, the standard library of Agda can be found in the stdlib directory) and some auxiliary files. Every library directory contains

    • the network file from which the heterogeneous network can be loaded,
    • a zip of the entries directory that contains (many) files with abstract syntax trees. Each of those files describes a single entry of the library.

    In addition to the auxiliary files which are used for loading the data (and described below), the zipped sources of lean2sexp and Agda s-expression extension are present.

    Loading the data

    In addition to the data files, there is also a simple python script main.py for loading the data. To run it, you will have to install the packages listed in the file requirements.txt: tqdm and networkx. The easiest way to do so is calling pip install -r requirements.txt. When running main.py for the first time, the script will unzip the entry files into the directory named entries. After that, the script loads the syntax trees of the entries (see the Entry class) and the network (as networkx.MultiDiGraph object).

    Note. The entry files have extension .dag (directed acyclic graph), since Lean uses node sharing, which breaks the tree structure (a shared node has more than one parent node).

    More information

    For more information about the data collection process, detailed data (and data format) description, and baseline experiments that were already performed with these data, see our arXiv copy of the paper. For the code that was used to perform the experiments and data format description, visit our github repository https://github.com/ul-fmf/mlfmf-data.

    Funding

    Since not all the funders are available in Zenodo's database, we list them here:

    • This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0024.
    • The authors also acknowledge the financial support of the Slovenian Research Agency via the research core funding No. P2-0103 and No. P1-0294.
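
    The description above says the entries are stored as s-expressions. As a rough illustration of what reading such a file involves, here is a generic s-expression reader using only the Python standard library. It is not the project's main.py or its Entry class: the example string and its node names are hypothetical, and the node sharing that the .dag files rely on is not handled.

```python
from typing import List, Union

Tree = Union[str, list]  # an atom, or a list of subtrees


def tokenize(text: str) -> List[str]:
    """Split an s-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Tree:
    """Recursively build nested lists from a token list (consumes tokens)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens and tokens[0] != ")":
            node.append(parse(tokens))
        if not tokens:
            raise ValueError("missing closing parenthesis")
        tokens.pop(0)  # drop the ")"
        return node
    if token == ")":
        raise ValueError("unexpected closing parenthesis")
    return token


def count_nodes(tree: Tree) -> int:
    """Count atoms and list nodes in the parsed tree."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree)


# Hypothetical entry; real MLFMF files use their own node vocabulary.
example = "(entry (name add-comm) (type (pi nat nat)))"
tree = parse(tokenize(example))
print(tree)
print(count_nodes(tree))
```

    For the actual data, the loader shipped with the archive is the safer choice, since it knows the real file layout and builds the networkx graph as well.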

  16. Data from: SalmonScan: A Novel Image Dataset for Machine Learning and Deep...

    • data.mendeley.com
    Updated Apr 2, 2024
    + more versions
    Cite
    Md Shoaib Ahmed (2024). SalmonScan: A Novel Image Dataset for Machine Learning and Deep Learning Analysis in Fish Disease Detection in Aquaculture [Dataset]. http://doi.org/10.17632/x3fz2nfm4w.3
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Md Shoaib Ahmed
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The SalmonScan dataset is a collection of images of salmon fish, including healthy fish and infected fish. The dataset consists of two classes of images:

    • Fresh salmon 🐟
    • Infected Salmon 🐠

    This dataset is ideal for various computer vision tasks in machine learning and deep learning applications. Whether you are a researcher, developer, or student, the SalmonScan dataset offers a rich and diverse data source to support your projects and experiments.

    So, dive in and explore the fascinating world of salmon health and disease!

    The SalmonScan dataset (raw) consists of 24 fresh fish and 91 infected fish. [Due to server cleaning in the past, some raw datasets have been deleted]

    The SalmonScan dataset (augmented) consists of approximately 1,208 images of salmon fish, classified into two classes:

    • Fresh salmon (healthy fish with no visible signs of disease), 456 images
    • Infected Salmon containing disease, 752 images

    Each class contains a representative and diverse collection of images, capturing a range of different perspectives, scales, and lighting conditions. The images have been carefully curated to ensure that they are of high quality and suitable for use in a variety of computer vision tasks.

    Data Preprocessing

    The input images were preprocessed to enhance their quality and suitability for further analysis. The following steps were taken:

    • Resizing 📏: All the images were resized to a uniform size of 600 pixels in width and 250 pixels in height to ensure compatibility with the learning algorithm.
    • Image Augmentation 📸: To compensate for the small number of images, various image augmentation techniques were applied to the input images. These included:
      • Horizontal Flip ↩️: The images were horizontally flipped to create additional samples.
      • Vertical Flip ⬆️: The images were vertically flipped to create additional samples.
      • Rotation 🔄: The images were rotated to create additional samples.
      • Cropping 🪓: A portion of the image was randomly cropped to create additional samples.
      • Gaussian Noise 🌌: Gaussian noise was added to the images to create additional samples.
      • Shearing 🌆: The images were sheared to create additional samples.
      • Contrast Adjustment (Gamma) ⚖️: Gamma correction was applied to the images to adjust their contrast.
      • Contrast Adjustment (Sigmoid) ⚖️: The sigmoid function was applied to the images to adjust their contrast.
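
    A few of the augmentation steps listed above are simple enough to sketch in plain Python on a grayscale image stored as nested lists. This is an illustration only: the description does not say which tools, rotation angles, or gamma values were used, so the 90-degree rotation, the gamma of 2.0, and the tiny 2x2 image are all assumptions.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Reverse the order of the rows (top to bottom)."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(column) for column in zip(*image[::-1])]


def gamma_adjust(image, gamma):
    """Apply gamma correction: out = 255 * (in / 255) ** gamma."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


# Hypothetical 2x2 grayscale image with pixel values in 0..255.
tiny = [[0, 64], [128, 255]]

print(horizontal_flip(tiny))
print(vertical_flip(tiny))
print(rotate_90_clockwise(tiny))
print(gamma_adjust(tiny, 2.0))
```

    Each function returns a new image and leaves the input unchanged, so several augmented variants can be derived from the same original.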

    Usage

    To use the salmon scan dataset in your ML and DL projects, follow these steps:

    • Clone or download the salmon scan dataset repository from GitHub.
    • Use standard libraries such as numpy or pandas to convert the images into arrays, which can be input into a machine learning or deep learning model.
    • Split the dataset into training, validation, and test sets as per your requirement.
    • Preprocess the data as needed, such as resizing and normalizing the images.
    • Train your ML/DL model using the preprocessed training data.
    • Evaluate the model on the test set and make predictions on new, unseen data.
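
    The splitting step in the list above can be sketched with the standard library alone. The file names, the 70/15/15 proportions, and the fixed seed are assumptions made for illustration; only the class counts (456 fresh, 752 infected) come from the description.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of items and cut it into train/validation/test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names; class counts follow the augmented set above.
labeled = [(f"fresh_{i}.png", "fresh") for i in range(456)]
labeled += [(f"infected_{i}.png", "infected") for i in range(752)]

train, val, test = split_dataset(labeled)
print(len(train), len(val), len(test))
```

    One caveat worth keeping in mind: when a dataset has already been augmented, a random split can place near-duplicates of the same source image in both training and test sets, so splitting the raw images first is often the more cautious choice.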
  17. Data from: QFlow 2.0: Quantum dot data for machine learning

    • catalog.data.gov
    • nist.gov
    • +4 more
    Updated Sep 30, 2025
    + more versions
    Cite
    National Institute of Standards and Technology (2025). QFlow 2.0: Quantum dot data for machine learning [Dataset]. https://catalog.data.gov/dataset/quantum-dot-data-for-machine-learning
    Explore at:
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Using a modified Thomas-Fermi approximation, we model a reference semiconductor system comprising a quasi-1D nanowire with a series of five depletion gates whose voltages determine the number of quantum dots (QDs), the charges on each of the QDs, as well as the conductance through the wire. The original dataset, QFlow lite, consists of 1 001 idealized simulated measurements with gate configurations sampling over different realizations of the same type of device. Each sample data is stored as a 100 x 100-pixel map from plunger gate voltages to (i) current through the device at infinitesimal bias, (ii) output of the charge sensor evaluated as the Coulomb potential at the sensor location - the experimentally relevant parameters that can be measured, (iii) information about the number of charges on each dot (with a default value 0 for short circuit and a barrier), and (iv) a label determining the state of the device, distinguishing between a single dot, a double dot, a short circuit, and a barrier state. The expanded dataset, QFlow 2.0, consists of 1599 idealized simulated measurements stored as roughly 250 x 250-pixel maps from plunger gate voltages to (i) output of the charge sensor, (ii) net charge on each dot, and (iii) a label determining the state of the device, distinguishing between a left, central, and right single QD, a double QD, and a barrier or short circuit (no QD) state. In addition, the QFlow 2.0 dataset includes two sets of noisy simulated measurements, one with the noise level varied around 1.5 times the optimized noise level and the other one with the noise level ranging from 0 to 7 times the optimized noise level. See the "Project description" and "Data structure" documents for additional information about these datasets.

    Acknowledgments: This research is sponsored in part by the Army Research Office (ARO), through Grant No. W911NF-17-1-0274. The development and maintenance of the growth facilities used for fabricating samples were supported by the Department of Energy, through Grant No. DE-FG02-03ER46028. We acknowledge the use of clean room facilities supported by The National Science Foundation (NSF) through the UW-Madison MRSEC (DMR-1720415) and electron beam lithography equipment acquired with the support of the NSF MRI program (DMR-1625348). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the ARO or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright noted herein. Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

  18. LinkedIn Job Postings - Machine Learning Data Set

    • kaggle.com
    zip
    Updated Nov 28, 2023
    Cite
    RealDealAdamP (2023). LinkedIn Job Postings - Machine Learning Data Set [Dataset]. https://www.kaggle.com/datasets/adampq/linkedin-jobs-machine-learning-data-set
    Available download formats: zip (40278565 bytes)
    Dataset updated
    Nov 28, 2023
    Authors
    RealDealAdamP
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The data comprises job-related information from LinkedIn job postings scraped over a 2-day period. Key features include company details and job-specific information such as title, description, and salary. The dataset provides a comprehensive view for exploring factors influencing job posting characteristics and has been reformatted from its original source to improve its compatibility with various machine learning algorithms.

  19. Machine Learning market size was USD 24,345.76 million in 2021!

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Cite
    Cognitive Market Research, Machine Learning market size was USD 24,345.76 million in 2021! [Dataset]. https://www.cognitivemarketresearch.com/machine-learning-market-report
    Available download formats: pdf, excel, csv, ppt
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2021 - 2033
    Area covered
    Global
    Description

    According to Cognitive Market Research's latest published report, the global machine learning market size was USD 24,345.76 million in 2021 and is forecast to reach USD 206,235.41 million by 2028. The report puts the industry's compound annual growth rate (CAGR) at 42.64% from 2023 to 2030.

    Market Dynamics of Machine Learning Market

    Key Drivers for Machine Learning Market

    Explosion of Big Data Across Industries: The substantial increase in both structured and unstructured data generated by sensors, social media, transactions, and IoT devices is driving the demand for machine learning-based data analysis.

    Widespread Adoption of AI in Business Processes: Machine learning is facilitating automation, predictive analytics, and optimization in various sectors such as healthcare, finance, manufacturing, and retail, thereby enhancing efficiency and outcomes.

    Increased Availability of Open-Source Frameworks and Cloud Platforms: Resources like TensorFlow, PyTorch, and scalable cloud infrastructure are simplifying the process for developers and enterprises to create and implement machine learning models.

    Growing Investments in AI-Driven Innovation: Governments, venture capitalists, and major technology companies are making substantial investments in machine learning research and startups, which is accelerating progress and market entry.

    Key Restraints for Machine Learning Market

    Shortage of Skilled Talent in ML and AI: The need for data scientists, machine learning engineers, and domain specialists significantly surpasses the available supply, hindering scalability and implementation in numerous organizations.

    High Computational and Operational Costs: The training of intricate machine learning models necessitates considerable computing power, energy, and infrastructure, resulting in high costs for startups and smaller enterprises.

    Data Privacy and Regulatory Compliance Challenges: Issues related to user privacy, data breaches, and adherence to regulations such as GDPR and HIPAA present obstacles in the collection and utilization of data for machine learning.

    Lack of Model Transparency and Explainability: The opaque nature of certain machine learning models undermines trust, particularly in sensitive areas like finance and healthcare, where the need for explainable AI is paramount.

    Key Trends for Machine Learning Market

    Growth of AutoML and No-Code ML Platforms: Automated machine learning tools are making AI development more accessible, enabling individuals without extensive coding or mathematical expertise to construct models.

    Integration of ML with Edge Computing: Executing machine learning models locally on edge devices (such as cameras and smartphones) is enhancing real-time performance and minimizing latency in applications.

    Ethical AI and Responsible Machine Learning Practices: Increasing emphasis on fairness, bias reduction, and accountability is shaping ethical frameworks and governance in ML adoption.

    Industry-Specific ML Applications on the Rise: Custom ML solutions are rapidly emerging in sectors like agriculture (crop prediction), logistics (route optimization), and education (personalized learning).

    COVID-19 Impact:

    Like other industries, the machine learning industry has been affected by COVID-19. Despite the dire conditions and uncertain collapse, some industries have continued to grow during the pandemic. During COVID-19, the machine learning market remained stable, with positive growth and opportunities, and it faced minimal impact compared to some other industries. The growth of the global machine learning market has stagnated owing to automation developments and technological advancements. Pre-owned machines and smartphones widely used for remote work are leading to positive growth of the market. Several industries have transplanted the market progress using new technologies of machine learning systems. In June 2020, DeCaprio et al. published COVID-19 pandemic risk research that is still in its early stages. In the report, DeCaprio et al. mention that they used machine learning to build an initial vulnerability index for the coronavirus. The lab further noted that as more data and results from ongoing research become available, it will be able to see more practical applications of machine learning in predicting infection risk. What is...
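
    The market-size figures quoted at the start of this description (USD 24,345.76 million in 2021 and USD 206,235.41 million by 2028) can be turned into an implied growth rate with the standard compound annual growth rate formula. The sketch below is only an illustrative check, not the report's own method; the function name `cagr` and the variable names are assumptions introduced here.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the description, in USD million.
size_2021 = 24_345.76
size_2028 = 206_235.41

# Growth rate implied by the 2021 and 2028 endpoints (7 compounding years).
implied = cagr(size_2021, size_2028, years=2028 - 2021)
print(f"Implied CAGR 2021-2028: {implied:.2%}")
```

    The description's own rate of 42.64% is stated for 2023 to 2030, a different window from the 2021 to 2028 endpoints used here, so the two numbers are not directly comparable.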

  20. Artificial Intelligence Training Dataset Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 3, 2025
    Cite
    Data Insights Market (2025). Artificial Intelligence Training Dataset Report [Dataset]. https://www.datainsightsmarket.com/reports/artificial-intelligence-training-dataset-1958994
    Available download formats: doc, ppt, pdf
    Dataset updated
    May 3, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Artificial Intelligence (AI) Training Dataset market is experiencing robust growth, driven by the increasing adoption of AI across diverse sectors. The market's expansion is fueled by the burgeoning need for high-quality data to train sophisticated AI algorithms capable of powering applications like smart campuses, autonomous vehicles, and personalized healthcare solutions. The demand for diverse dataset types, including image classification, voice recognition, natural language processing, and object detection datasets, is a key factor contributing to market growth. The exact market size in 2025 is unavailable. Assuming a conservative estimate of a $10 billion market in 2025, based on the growth trend and reported market sizes of related industries, and a projected compound annual growth rate (CAGR) of 25%, the market is poised for significant expansion in the coming years. Key players in this space are leveraging technological advancements and strategic partnerships to enhance data quality and expand their service offerings. The increasing availability of cloud-based data annotation and processing tools is also streamlining operations and making AI training datasets more accessible to businesses of all sizes. Growth is expected to be particularly strong in regions with burgeoning technological advancements and substantial digital infrastructure, such as North America and Asia Pacific. However, challenges such as data privacy concerns, the high cost of data annotation, and the scarcity of skilled professionals capable of handling complex datasets remain obstacles to broader market penetration. The ongoing evolution of AI technologies and the expanding applications of AI across multiple sectors will continue to shape the demand for AI training datasets, pushing this market toward higher growth trajectories in the coming years.
    The diversity of applications, from smart homes and medical diagnoses to advanced robotics and autonomous driving, creates significant opportunities for companies specializing in this market. Maintaining data quality, security, and ethical considerations will be crucial for future market leadership.
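
    To make the description's estimate concrete, the sketch below compounds the assumed $10 billion 2025 base at the projected 25% CAGR through 2033, the end of the listed coverage period. It assumes the rate stays constant every year, which the description does not state explicitly; the function name `project_market_size` is introduced here for illustration only.

```python
def project_market_size(base_value: float, rate: float, years: int) -> float:
    """Value after compounding `base_value` at `rate` for `years` years."""
    return base_value * (1 + rate) ** years


# Assumptions taken from the description: a USD 10 billion market in 2025
# and a 25% CAGR, held constant here for every year (a simplification).
base_2025 = 10.0      # USD billion
growth_rate = 0.25

for year in range(2025, 2034):
    size = project_market_size(base_2025, growth_rate, year - 2025)
    print(f"{year}: USD {size:.1f} billion")
```

    Under these assumptions the 2033 figure is roughly six times the 2025 base, so the result is highly sensitive to both the starting estimate and the assumed rate.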
