100+ datasets found

TREC 2022 Deep Learning test collection
catalog.data.gov
s.cnmilf.com
+1more
Updated May 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection
Explore at:
Dataset updated
May 9, 2023
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.
Machine Learning Dataset
brightdata.com
.json, .csv, .xlsx
Updated Dec 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bright Data (2024). Machine Learning Dataset [Dataset]. https://brightdata.com/products/datasets/machine-learning
Explore at:
.json, .csv, .xlsxAvailable download formats
Dataset updated
Dec 23, 2024
Dataset authored and provided by
Bright Datahttps://brightdata.com/
License
https://brightdata.com/licensehttps://brightdata.com/license
Area covered
Worldwide
Description
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
d
A Dataset for Machine Learning Algorithm Development
catalog.data.gov
s.cnmilf.com
+1more
Updated May 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(Point of Contact, Custodian) (2024). A Dataset for Machine Learning Algorithm Development [Dataset]. https://catalog.data.gov/dataset/a-dataset-for-machine-learning-algorithm-development2
Explore at:
Dataset updated
May 1, 2024
Dataset provided by
(Point of Contact, Custodian)
Description
This dataset consists of imagery, imagery footprints, associated ice seal detections and homography files associated with the KAMERA Test Flights conducted in 2019. This dataset was subset to include relevant data for detection algorithm development. This dataset is limited to data collected during flights 4, 5, 6 and 7 from our 2019 surveys.
Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...
figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21967265.v1
Dataset updated
May 30, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those projects to curate ML projects of high quality. The limited availability of such high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidences of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide "NICHE.csv" file that contains the list of the project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

GitHub page: https://github.com/soarsmu/NICHE
Machine Learning Software Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Machine Learning Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-machine-learning-software-market
Explore at:
pptx, csv, pdfAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Machine Learning Software Market Outlook

The global machine learning software market is anticipated to experience a robust growth trajectory, with the market size projected to expand from USD 15 billion in 2023 to an estimated USD 85 billion by 2032, at a Compound Annual Growth Rate (CAGR) of 20.5%. This remarkable growth is driven by a confluence of technological advancements, increased adoption of AI across various sectors, and a surge in demand for intelligent business solutions that automate processes and enhance decision-making efficiency. The ability of machine learning software to process large volumes of data and generate actionable insights is transforming industries, making it a crucial element in the digital transformation strategies of organizations worldwide.

One of the primary growth factors contributing to this market expansion is the exponential increase in data generation from various sources, including IoT devices, social media, and enterprise applications. The need to derive strategic insights from this massive data pool is pushing organizations to adopt machine learning solutions. Furthermore, the evolution of big data technologies and cloud computing platforms has made it feasible for businesses of all sizes to implement sophisticated machine learning models without incurring prohibitive costs. This democratization of machine learning technologies is particularly beneficial for small and medium enterprises (SMEs), enabling them to leverage data analytics to drive business growth.

Another significant driver for the machine learning software market is the growing emphasis on customer-centric business strategies. Industries such as retail, BFSI, and healthcare are increasingly adopting machine learning algorithms to gain a deeper understanding of customer behavior, tailor personalized experiences, and improve customer satisfaction. For example, predictive analytics and natural language processing (NLP) technologies are being used to anticipate customer needs, optimize pricing strategies, and enhance customer service. Additionally, the integration of machine learning with automation processes is enabling industries to streamline operations, reduce operational costs, and enhance productivity, thereby further fueling market growth.

The growing focus on innovation and technological advancements in artificial intelligence is also a potent growth catalyst. Governments and private sectors across the globe are investing heavily in AI research and development to gain technological superiority, fostering an environment conducive to the proliferation of machine learning applications. The rise of edge computing and 5G technology further amplifies this growth, as it enables faster data processing and real-time analytics, crucial for applications such as autonomous driving and IoT. Consequently, the synergy between machine learning and these emerging technologies is anticipated to unlock new avenues for market expansion over the forecast period.

The advent of Deep Learning System Software is revolutionizing the capabilities of machine learning applications. These systems are designed to mimic the human brain's neural networks, allowing for more complex data processing and pattern recognition. As a result, industries are able to tackle more sophisticated challenges, such as image and speech recognition, with unprecedented accuracy. This advancement is particularly transformative in sectors like healthcare, where deep learning is being used to analyze medical images and predict patient outcomes. The integration of deep learning systems with existing machine learning frameworks is enhancing the overall efficiency and effectiveness of AI-driven solutions, paving the way for groundbreaking innovations.

Regionally, North America is anticipated to dominate the machine learning software market due to the presence of leading technology firms and a highly mature digital ecosystem. The region's focus on innovation, coupled with substantial investments in R&D, has positioned it as a leader in AI and machine learning technologies. Meanwhile, the Asia Pacific region is expected to exhibit the highest growth rate, driven by rapid digitalization, growing internet penetration, and significant government initiatives promoting AI adoption. Countries such as China and India are emerging as key markets, leveraging their large IT talent pool and increasing industrial automation to boost machine learning software deployment.

Component Analysis
D
SYNERGY - Open machine learning dataset on study selection in systematic...
dataverse.nl
csv, json, txt, zip
Updated Apr 24, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot (2023). SYNERGY - Open machine learning dataset on study selection in systematic reviews [Dataset]. http://doi.org/10.34894/HE6NAQ
Explore at:
txt(212), json(702), zip(16028323), json(19426), txt(263), zip(3560967), txt(305), json(470), txt(279), zip(2355371), json(23201), csv(460956), txt(200), json(685), json(546), csv(63996), zip(2989015), zip(5749455), txt(331), txt(315), json(691), json(23775), csv(672721), json(468), txt(415), json(22778), csv(31919), csv(746832), json(18392), zip(62992826), csv(234822), txt(283), zip(34788857), json(475), txt(242), json(533), csv(42227), json(24548), zip(738232), json(22477), json(25491), zip(11463283), json(17741), csv(490660), json(19662), json(578), csv(19786), zip(14708207), zip(24619707), zip(2404439), json(713), json(27224), json(679), json(26426), txt(185), json(906), zip(18534723), json(23550), txt(266), txt(317), zip(6019723), json(33943), txt(436), csv(388378), json(469), zip(2106498), txt(320), csv(451336), txt(338), zip(19428163), json(14326), json(31652), txt(299), csv(96153), txt(220), csv(114789), json(15452), csv(5372708), json(908), csv(317928), csv(150923), json(465), csv(535584), json(26090), zip(8164831), json(19633), txt(316), json(23494), csv(133950), json(18638), csv(3944082), json(15345), json(473), zip(4411063), zip(10396095), zip(835096), txt(255), json(699), csv(654705), txt(294), csv(989865), zip(1028035), txt(322), zip(15085090), txt(237), txt(310), json(756), json(30628), json(19490), json(25908), txt(401), json(701), zip(5543909), json(29397), zip(14007470), json(30058), zip(58869042), csv(852937), json(35711), csv(298011), csv(187163), txt(258), zip(3526740), json(568), json(21552), zip(66466788), csv(215250), json(577), csv(103010), txt(306), zip(11840006)Available download formats
Unique identifier
https://doi.org/10.34894/HE6NAQ
Dataset updated
Apr 24, 2023
Dataset provided by
DataverseNL
Authors
Jonathan De Bruin; Jonathan De Bruin; Yongchao Ma; Yongchao Ma; Gerbrich Ferdinands; Gerbrich Ferdinands; Jelle Teijema; Jelle Teijema; Rens Van de Schoot; Rens Van de Schoot
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
SYNERGY is a free and open dataset on study selection in systematic reviews, comprising 169,288 academic works from 26 systematic reviews. Only 2,834 (1.67%) of the academic works in the binary classified dataset are included in the systematic reviews. This makes the SYNERGY dataset a unique dataset for the development of information retrieval algorithms, especially for sparse labels. Due to the many available variables available per record (i.e. titles, abstracts, authors, references, topics), this dataset is useful for researchers in NLP, machine learning, network analysis, and more. In total, the dataset contains 82,668,134 trainable data points. The easiest way to get the SYNERGY dataset is via the synergy-dataset Python package. See https://github.com/asreview/synergy-dataset for all information.
US Deep Learning Market Analysis, Size, and Forecast 2025-2029
technavio.com
Updated Jul 14, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio (2017). US Deep Learning Market Analysis, Size, and Forecast 2025-2029 [Dataset]. https://www.technavio.com/report/us-deep-learning-market-industry-analysis
Explore at:
Dataset updated
Jul 14, 2017
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United States
Description
Snapshot img

US Deep Learning Market Size 2025-2029

The deep learning market size in US is forecast to increase by USD 5.02 billion at a CAGR of 30.1% between 2024 and 2029.

The deep learning market is experiencing robust growth, driven by the increasing adoption of artificial intelligence (AI) in various industries for advanced solutioning. This trend is fueled by the availability of vast amounts of data, which is a key requirement for deep learning algorithms to function effectively. Industry-specific solutions are gaining traction, as businesses seek to leverage deep learning for specific use cases such as image and speech recognition, fraud detection, and predictive maintenance. Alongside, intuitive data visualization tools are simplifying complex neural network outputs, helping stakeholders understand and validate insights. However, challenges remain, including the need for powerful computing resources, data privacy concerns, and the high cost of implementing and maintaining deep learning systems. Despite these hurdles, the market's potential for innovation and disruption is immense, making it an exciting space for businesses to explore further. Semi-supervised learning, data labeling, and data cleaning facilitate efficient training of deep learning models. Cloud analytics is another significant trend, as companies seek to leverage cloud computing for cost savings and scalability.

What will be the Size of the market During the Forecast Period?

Request Free Sample

Deep learning, a subset of machine learning, continues to shape industries by enabling advanced applications such as image and speech recognition, text generation, and pattern recognition. Reinforcement learning, a type of deep learning, gains traction, with deep reinforcement learning leading the charge. Anomaly detection, a crucial application of unsupervised learning, safeguards systems against security vulnerabilities. Ethical implications and fairness considerations are increasingly important in deep learning, with emphasis on explainable AI and model interpretability. Graph neural networks and attention mechanisms enhance data preprocessing for sequential data modeling and object detection. Time series forecasting and dataset creation further expand deep learning's reach, while privacy preservation and bias mitigation ensure responsible use.

In summary, deep learning's market dynamics reflect a constant pursuit of innovation, efficiency, and ethical considerations. The Deep Learning Market in the US is flourishing as organizations embrace intelligent systems powered by supervised learning and emerging self-supervised learning techniques. These methods refine predictive capabilities and reduce reliance on labeled data, boosting scalability. BFSI firms utilize AI image recognition for various applications, including personalizing customer communication, maintaining a competitive edge, and automating repetitive tasks to boost productivity. Sophisticated feature extraction algorithms now enable models to isolate patterns with high precision, particularly in applications such as image classification for healthcare, security, and retail.

How is this market segmented and which is the largest segment?

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Application Image recognition Voice recognition Video surveillance and diagnostics Data mining Type Software Services Hardware End-user Security Automotive Healthcare Retail and commerce Others Geography North America US

By Application Insights

The Image recognition segment is estimated to witness significant growth during the forecast period. In the realm of artificial intelligence (AI) and machine learning, image recognition, a subset of computer vision, is gaining significant traction. This technology utilizes neural networks, deep learning models, and various machine learning algorithms to decipher visual data from images and videos. Image recognition is instrumental in numerous applications, including visual search, product recommendations, and inventory management. Consumers can take photographs of products to discover similar items, enhancing the online shopping experience. In the automotive sector, image recognition is indispensable for advanced driver assistance systems (ADAS) and autonomous vehicles, enabling the identification of pedestrians, other vehicles, road signs, and lane markings.

Furthermore, image recognition plays a pivotal role in augmented reality (AR) and virtual reality (VR) applications, where it tracks physical objects and overlays digital content onto real-world scenarios. The model training process involves the backpropagation algorithm, which calculates
Data sources used by companies for training AI models South Korea 2024
statista.com
Updated May 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Data sources used by companies for training AI models South Korea 2024 [Dataset]. https://www.statista.com/statistics/1452822/south-korea-data-sources-for-training-artificial-intelligence-models/
Explore at:
Dataset updated
May 27, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Sep 2024 - Nov 2024
Area covered
South Korea
Description
As of 2024, customer data was the leading source of information used to train artificial intelligence (AI) models in South Korea, with nearly ** percent of surveyed companies answering that way. About ** percent responded to use public sector support initiatives.
Machine Learning model data
ecmwf.int
Updated Jan 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Centre for Medium-Range Weather Forecasts (2023). Machine Learning model data [Dataset]. https://www.ecmwf.int/en/forecasts/dataset/machine-learning-model-data
Explore at:
Dataset updated
Jan 1, 2023
Dataset authored and provided by
European Centre for Medium-Range Weather Forecastshttp://ecmwf.int/
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
three of these models are available:
D
Notable AI Models
epoch.ai
csv
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Epoch AI, Notable AI Models [Dataset]. https://epoch.ai/data/notable-ai-models
Explore at:
csvAvailable download formats
Dataset authored and provided by
Epoch AI
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Global
Variables measured
https://epoch.ai/data/notable-ai-models-documentation#records
Measurement technique
https://epoch.ai/data/notable-ai-models-documentation#records
Description
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
D
Data Collection and Labelling Report
marketresearchforecast.com
doc, pdf, ppt
Updated Mar 13, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Forecast (2025). Data Collection and Labelling Report [Dataset]. https://www.marketresearchforecast.com/reports/data-collection-and-labelling-33030
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
Mar 13, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policyhttps://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The data collection and labeling market is experiencing robust growth, fueled by the escalating demand for high-quality training data in artificial intelligence (AI) and machine learning (ML) applications. The market, estimated at $15 billion in 2025, is projected to achieve a Compound Annual Growth Rate (CAGR) of 25% over the forecast period (2025-2033), reaching approximately $75 billion by 2033. This expansion is primarily driven by the increasing adoption of AI across diverse sectors, including healthcare (medical image analysis, drug discovery), automotive (autonomous driving systems), finance (fraud detection, risk assessment), and retail (personalized recommendations, inventory management). The rising complexity of AI models and the need for more diverse and nuanced datasets are significant contributing factors to this growth. Furthermore, advancements in data annotation tools and techniques, such as active learning and synthetic data generation, are streamlining the data labeling process and making it more cost-effective. However, challenges remain. Data privacy concerns and regulations like GDPR necessitate robust data security measures, adding to the cost and complexity of data collection and labeling. The shortage of skilled data annotators also hinders market growth, necessitating investments in training and upskilling programs. Despite these restraints, the market’s inherent potential, coupled with ongoing technological advancements and increased industry investments, ensures sustained expansion in the coming years. Geographic distribution shows strong concentration in North America and Europe initially, but Asia-Pacific is poised for rapid growth due to increasing AI adoption and the availability of a large workforce. This makes strategic partnerships and global expansion crucial for market players aiming for long-term success.
Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...
data.nist.gov
s.cnmilf.com
+1more
Updated Oct 23, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Brian DeCost (2020). Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models [Dataset]. http://doi.org/10.18434/mds2-2301
Explore at:
Unique identifier
https://doi.org/10.18434/mds2-2301, https://identifiers.org/ark:/88434/mds2-2301
Dataset updated
Oct 23, 2020
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Authors
Brian DeCost
License
https://www.nist.gov/open/licensehttps://www.nist.gov/open/license
Description
The open dataset, software, and other files accompanying the manuscript "An Open Combinatorial Diffraction Dataset Including Consensus Human and Machine Learning Labels with Quantified Uncertainty for Training New Machine Learning Models," submitted for publication to Integrated Materials and Manufacturing Innovations. Machine learning and autonomy are increasingly prevalent in materials science, but existing models are often trained or tuned using idealized data as absolute ground truths. In actual materials science, "ground truth" is often a matter of interpretation and is more readily determined by consensus. Here we present the data, software, and other files for a study using as-obtained diffraction data as a test case for evaluating the performance of machine learning models in the presence of differing expert opinions. We demonstrate that experts with similar backgrounds can disagree greatly even for something as intuitive as using diffraction to identify the start and end of a phase transformation. We then use a logarithmic likelihood method to evaluate the performance of machine learning models in relation to the consensus expert labels and their variance. We further illustrate this method's efficacy in ranking a number of state-of-the-art phase mapping algorithms. We propose a materials data challenge centered around the problem of evaluating models based on consensus with uncertainty. The data, labels, and code used in this study are all available online at data.gov, and the interested reader is encouraged to replicate and improve the existing models or to propose alternative methods for evaluating algorithmic performance.
n
Data from: Assessing predictive performance of supervised machine learning...
data.niaid.nih.gov
datadryad.org
+1more
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evans Omondi (2023). Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model [Dataset]. http://doi.org/10.5061/dryad.wh70rxwrh
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.wh70rxwrh
Dataset updated
May 23, 2023
Dataset provided by
Strathmore University
Authors
Evans Omondi
License
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Description
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, the study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work including Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metrics values and analysis, it was discovered that eXtreme Gradient Boosting was the most optimal algorithm in both classification and regression, with a R2 score of 97.45% and an Accuracy value of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen. Methods Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
Metatasks for AutoGluon - ROC AUC and Balanced Accuracy
figshare.com
bin
Updated Jul 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lennart Purucker (2023). Metatasks for AutoGluon - ROC AUC and Balanced Accuracy [Dataset]. http://doi.org/10.6084/m9.figshare.23609361.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23609361.v1
Dataset updated
Jul 1, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Lennart Purucker
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Prediction Data of Base Models from AutoGluon on 71 classification datasets from the AutoML Benchmark for Balanced Accuracy and ROC AUC.

The files of this figshare item include data that was collected for the paper: CMA-ES for Post Hoc Ensembling in AutoML: A Great Success and Salvageable Failure, Lennart Purucker, Joeran Beel, Second International Conference on Automated Machine Learning, 2023.

The data was stored and used with the assembled framework: https://github.com/ISG-Siegen/assembled.

In detail, the data contains the predictions of base models on validation and test as produced by running AutoGluon for 4 hours. Such prediction data is included for each model produced by AutoGluon on each fold of 10-fold cross-validation on the 71 classification datasets from the AutoML Benchmark. The data exists for two metrics (ROC AUC and Balanced Accuracy). More details can be found in the paper.

The data was collected by code created for the paper and is available in its reproducibility repository: https://doi.org/10.6084/m9.figshare.23609226.

Its usage is intended for but not limited to using assembled to evaluate post hoc ensembling methods for AutoML.

Details The link above points to a hosted server that facilitates the download. We opted for a hosted server, as we found no other suitable solution to share these large files (due to file size or storage limits) for a reasonable price. If you want to obtain the data in another way or know of a more suitable alternative, please contact Lennart Purucker.

The link resolves to a directory containing the following:

example_metatasks: contains an example metatask for test purposes before committing to downloading all files.
metatasks_roc_auc.zip: The Metatasks obtained by running AutoGluon for ROC AUC. metatasks_bacc.zip: The Metatasks obtained by running AutoGluon for Balanced Accuracy.

The size after unzipping is:

metatasks_roc_auc.zip: ~85GB metatasks_bacc.zip: ~100GB

The metatask .zip files contain 2 files per metatask. One .json file with metadata information and a .hdf file containing the prediction data. The details on how this should be read and used as a Metatask can be found in the assembled framework and the reproducibility repository. To obtain the data without Metataks, we advise looking at the file content and metadata individually or parsing them by using Metatasks first.
o
Replication data for: Big Data: New Tricks for Econometrics
openicpsr.org
Updated May 1, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hal R. Varian (2014). Replication data for: Big Data: New Tricks for Econometrics [Dataset]. http://doi.org/10.3886/E113925V1
Explore at:
Unique identifier
https://doi.org/10.3886/E113925V1
Dataset updated
May 1, 2014
Dataset provided by
American Economic Association
Authors
Hal R. Varian
Time period covered
May 1, 2014
Description
Computers are now involved in many economic transactions and can capture data associated with these transactions, which can then be manipulated and analyzed. Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships. In this essay, I will describe a few of these tools for manipulating and analyzing big data. I believe that these methods have a lot to offer and should be more widely known and used by economists.
8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...
datarade.ai
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2023). 8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech Recognition Data| Machine Learning (ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multilingual-conversational-speech-data-8khz-tele-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Dec 10, 2023
Dataset authored and provided by
Nexdata
Area covered
Czech Republic, United Arab Emirates, Vietnam, Argentina, Romania, Philippines, Poland, Singapore, Netherlands, United States of America
Description
Specifications Format : 8kHz, 8bit, u-law/a-law pcm, mono channel;

Environment : quiet indoor environment, without echo;

Recording content : No preset linguistic data，dozens of topics are specified, and the speakers make dialogue under those topics while the recording is performed;

Demographics : Speakers are evenly distributed across all age groups, covering children, teenagers, middle-aged, elderly, etc.

Annotation : annotating for the transcription text, speaker identification, gender and noise symbols;

Device : Telephony recording system;

Language : 100+ Languages;

Application scenarios : speech recognition; voiceprint recognition;

Accuracy rate : the word accuracy rate is not less than 98%

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/speechrecog?source=Datarade
Global Machine Learning Market Size By Component (Hardware, Software), By...
verifiedmarketresearch.com
Updated Oct 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
VERIFIED MARKET RESEARCH (2024). Global Machine Learning Market Size By Component (Hardware, Software), By Enterprise Size (SMEs, Large Enterprises), By End-User (Healthcare, Retail), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/global-machine-learning-market-size-and-forecast/
Explore at:
Dataset updated
Oct 10, 2024
Dataset provided by
Verified Market Researchhttps://www.verifiedmarketresearch.com/
Authors
VERIFIED MARKET RESEARCH
License
https://www.verifiedmarketresearch.com/privacy-policy/https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2024 - 2031
Area covered
Global
Description
Machine Learning Market size was valued at USD 10.24 Billion in 2024 and is projected to reach USD 200.08 Billion by 2031, growing at a CAGR of 10.9% from 2024 to 2031.

Key Market Drivers:

Increasing Data Volume and Complexity: The explosion of digital data is fueling ML adoption across industries. Organizations are leveraging ML to extract insights from vast, complex datasets. According to the European Commission, the volume of data globally is projected to grow from 33 zettabytes in 2018 to 175 zettabytes by 2025. For instance, on September 15, 2023, Google Cloud announced new ML-powered data analytics tools to help enterprises handle increasing data complexity.

Advancements in AI and Deep Learning Algorithms: Continuous improvements in AI algorithms are expanding ML capabilities. Deep learning breakthroughs are enabling more sophisticated applications. The U.S. National Science Foundation reported a 63% increase in AI research publications from 2017 to 2021. For instance, on August 24, 2023, DeepMind unveiled Graphcast, a new ML weather forecasting model achieving unprecedented accuracy.
f
Data from: Advanced machine learning techniques for building performance...
tandf.figshare.com
txt
Updated May 30, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Debaditya Chakraborty; Hazem Elzarka (2023). Advanced machine learning techniques for building performance simulation: a comparative analysis [Dataset]. http://doi.org/10.6084/m9.figshare.6848453.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6848453.v1
Dataset updated
May 30, 2023
Dataset provided by
Taylor & Francis
Authors
Debaditya Chakraborty; Hazem Elzarka
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Energy consumption predictions for buildings play an important role in energy efficiency and sustainability research. Accurate energy predictions have numerous application in real-time performance monitoring, fault detection, identifying prime targets for energy conservation, quantifying savings resulting from energy efficiency projects, etc. Machine learning-based energy models have proved to be more efficient and accurate where historical time series data is available. This paper presents various machine learning concepts that will aid in the generation of more accurate and efficient energy models. We have shown in detail the development of energy models using extreme gradient boosting (XGBoost), artificial neural network (ANN), and degree-day-based ordinary least square regression. We have presented a thorough description of the workflow, including intermediate steps for feature engineering, feature selection, hyper-parameter optimization and the Python source code. Our results indicate that XGBoost produces highly accurate energy models, and the intermediate steps are particularly important for XGBoost and ANN model development.
f
Data from: Count-Based Morgan Fingerprint: A More Efficient and...
acs.figshare.com
xlsx
Updated Jul 5, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shifa Zhong; Xiaohong Guan (2023). Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties [Dataset]. http://doi.org/10.1021/acs.est.3c02198.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.est.3c02198.s002
Dataset updated
Jul 5, 2023
Dataset provided by
ACS Publications
Authors
Shifa Zhong; Xiaohong Guan
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only qualifies the presence or absence of an atom group but also quantifies its counts in a molecule. We employ six different ML algorithms (ridge regression, SVM, KNN, RF, XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets based on C-MF and B-MF to compare them in terms of the model’s predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in nine of 10 data sets in terms of model predictive performance. The advantage of C-MF over B-MF is dependent on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of data sets calculated by B-MF and C-MF. Model interpretation results show that the C-MF-based model can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.
Machine Learning Market Report | Global Forecast From 2025 To 2033
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Machine Learning Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/machine-learning-market
Explore at:
csv, pptx, pdfAvailable download formats
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Machine Learning Market Outlook

The global machine learning market is projected to witness a remarkable growth trajectory, with the market size estimated to reach USD 21.17 billion in 2023 and anticipated to expand to USD 209.91 billion by 2032, growing at a compound annual growth rate (CAGR) of 29.2% over the forecast period. This extraordinary growth is primarily propelled by the escalating demand for artificial intelligence-driven solutions across various industries. As businesses seek to leverage machine learning for improving operational efficiency, enhancing customer experience, and driving innovation, the market is poised to expand rapidly. Key factors contributing to this growth include advancements in data generation, increasing computational power, and the proliferation of big data analytics.

A pivotal growth factor for the machine learning market is the ongoing digital transformation across industries. Enterprises globally are increasingly adopting machine learning technologies to optimize their operations, streamline processes, and make data-driven decisions. The healthcare sector, for example, leverages machine learning for predictive analytics to improve patient outcomes, while the finance sector uses machine learning algorithms for fraud detection and risk assessment. The retail industry is also utilizing machine learning for personalized customer experiences and inventory management. The ability of machine learning to analyze vast amounts of data in real-time and provide actionable insights is fueling its adoption across various applications, thereby driving market growth.

Another significant growth driver is the increasing integration of machine learning with the Internet of Things (IoT). The convergence of these technologies enables the creation of smarter, more efficient systems that enhance operational performance and productivity. In manufacturing, for instance, IoT devices equipped with machine learning capabilities can predict equipment failures and optimize maintenance schedules, leading to reduced downtime and costs. Similarly, in the automotive industry, machine learning algorithms are employed in autonomous vehicles to process and analyze sensor data, improving navigation and safety. The synergistic relationship between machine learning and IoT is expected to further propel market expansion during the forecast period.

Moreover, the rising investments in AI research and development by both public and private sectors are accelerating the advancement and adoption of machine learning technologies. Governments worldwide are recognizing the potential of AI and machine learning to transform industries, leading to increased funding for research initiatives and innovation centers. Companies are also investing heavily in developing cutting-edge machine learning solutions to maintain a competitive edge. This robust investment landscape is fostering an environment conducive to technological breakthroughs, thereby contributing to the growth of the machine learning market.

Supervised Learning, a subset of machine learning, plays a crucial role in the advancement of AI-driven solutions. It involves training algorithms on a labeled dataset, allowing the model to learn and make predictions or decisions based on new, unseen data. This approach is particularly beneficial in applications where the desired output is known, such as in classification or regression tasks. For instance, in the healthcare sector, supervised learning algorithms are employed to analyze patient data and predict health outcomes, thereby enhancing diagnostic accuracy and treatment efficacy. Similarly, in finance, these algorithms are used for credit scoring and fraud detection, providing financial institutions with reliable tools for risk assessment. As the demand for precise and efficient AI applications grows, the significance of supervised learning in driving innovation and operational excellence across industries becomes increasingly evident.

From a regional perspective, North America holds a dominant position in the machine learning market due to the early adoption of advanced technologies and the presence of major technology companies. The region's strong focus on R&D and innovation, coupled with a well-established IT infrastructure, further supports market growth. In addition, Asia Pacific is emerging as a lucrative market for machine learning, driven by rapid industrialization, increasing digitalization, and government initiatives promoting AI adoption. The region is witnessing significant investments in AI technologies, particu

Facebook

Twitter

Click to copy link

Link copied

Cite

National Institute of Standards and Technology (2023). TREC 2022 Deep Learning test collection [Dataset]. https://catalog.data.gov/dataset/trec-2022-deep-learning-test-collection

TREC 2022 Deep Learning test collection

Explore at:

Dataset updated

May 9, 2023

Dataset provided by

National Institute of Standards and Technologyhttp://www.nist.gov/

Description

This is a test collection for passage and document retrieval, produced in the TREC 2023 Deep Learning track. The Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).Certain machine learning based methods, such as methods based on deep learning are known to require very large datasets for training. Lack of such large scale datasets has been a limitation for developing such methods for common information retrieval tasks, such as document ranking. The Deep Learning Track organized in the previous years aimed at providing large scale datasets to TREC, and create a focused research effort with a rigorous blind evaluation of ranker for the passage ranking and document ranking tasks.Similar to the previous years, one of the main goals of the track in 2022 is to study what methods work best when a large amount of training data is available. For example, do the same methods that work on small data also work on large data? How much do methods improve when given more training data? What external data and models can be brought in to bear in this scenario, and how useful is it to combine full supervision with other forms of supervision?The collection contains 12 million web pages, 138 million passages from those web pages, search queries, and relevance judgments for the queries.

Clear search

Close search

Google apps

Main menu

TREC 2022 Deep Learning test collection

Machine Learning Dataset

A Dataset for Machine Learning Algorithm Development

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

Machine Learning Software Market Report | Global Forecast From 2025 To 2033

Machine Learning Software Market Outlook

Component Analysis

SYNERGY - Open machine learning dataset on study selection in systematic...

US Deep Learning Market Analysis, Size, and Forecast 2025-2029

Snapshot img

Data sources used by companies for training AI models South Korea 2024

Machine Learning model data

Notable AI Models

Data Collection and Labelling Report

Dataset: An Open Combinatorial Diffraction Dataset Including Consensus Human...

Data from: Assessing predictive performance of supervised machine learning...

Metatasks for AutoGluon - ROC AUC and Balanced Accuracy

Replication data for: Big Data: New Tricks for Econometrics

8kHz Conversational Speech Data | 15,000 Hours | Audio Data | Speech...

Global Machine Learning Market Size By Component (Hardware, Software), By...

Data from: Advanced machine learning techniques for building performance...

Data from: Count-Based Morgan Fingerprint: A More Efficient and...

Machine Learning Market Report | Global Forecast From 2025 To 2033

Machine Learning Market Outlook

TREC 2022 Deep Learning test collection