100+ datasets found

Data from: A large-scale comparative analysis of Coding Standard conformance...
figshare.com
application/x-gzip
Updated Oct 4, 2021
Cite
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa (2021). A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects [Dataset]. http://doi.org/10.6084/m9.figshare.12377237.v3
Explore at:
Available download formats: application/x-gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.12377237.v3
Dataset updated
Oct 4, 2021
Dataset provided by
Figshare (http://figshare.com/)
Authors
Anj Simmons; Scott Barnett; Jessica Rivera-Villicana; Akshat Bajaj; Rajesh Vasa
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This study investigates the extent to which data science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ to traditional software projects? We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity.
results.tar.gz: Extracted data for each project, including raw logs of all detected code violations.
notebooks_out.tar.gz: Tables and figures generated by notebooks.
source_code_anonymized.tar.gz: Anonymized source code (at time of publication) to identify, clone, and analyse the projects. Also includes Jupyter notebooks used to produce figures in the paper.
The latest source code can be found at: https://github.com/a2i2/mining-data-science-repositories
Published in ESEM 2020: https://doi.org/10.1145/3382494.3410680
Preprint: https://arxiv.org/abs/2007.08978
Replication Data for: Scaling Data from Multiple Sources
search.dataone.org
dataverse.harvard.edu
Updated Nov 22, 2023
Cite
Enamorado, Ted; Lopez-Moctezuma, Gabriel; Ratkovic, Marc (2023). Replication Data for: Scaling Data from Multiple Sources [Dataset]. http://doi.org/10.7910/DVN/FOUVEL
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/FOUVEL
Dataset updated
Nov 22, 2023
Dataset provided by
Harvard Dataverse
Authors
Enamorado, Ted; Lopez-Moctezuma, Gabriel; Ratkovic, Marc
Description
We introduce a method for scaling two data sets from different sources. The proposed method estimates a latent factor common to both datasets as well as an idiosyncratic factor unique to each. In addition, it offers a flexible modeling strategy that permits the scaled locations to be a function of covariates, and efficient implementation allows for inference through resampling. A simulation study shows that our proposed method improves over existing alternatives in capturing the variation common to both datasets, as well as the latent factors specific to each. We apply our proposed method to vote and speech data from the 112th U.S. Senate. We recover a shared subspace that aligns with a standard ideological dimension running from liberals to conservatives while recovering the words most associated with each senator's location. In addition, we estimate a word-specific subspace that ranges from national security to budget concerns, and a vote-specific subspace with Tea Party senators on one extreme and senior committee leaders on the other.
Data from: SmaT-Scaling Data Collection Tools (SMILER)
data.worldagroforestry.org
Updated Jul 28, 2016
Cite
Arinloye, Ademonla (2016). SmaT-Scaling Data Collection Tools (SMILER) [Dataset]. http://doi.org/10.34725/DVN/VI9I6B
Explore at:
Unique identifier
https://doi.org/10.34725/DVN/VI9I6B
Dataset updated
Jul 28, 2016
Dataset provided by
Dataverse
Authors
Arinloye, Ademonla
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
SmaT-Scaling Data Collection Tools (SMILER)
Online Feature Selection and Its Applications
researchdata.smu.edu.sg
Updated May 31, 2023
Cite
HOI Steven; Jialei WANG; Peilin ZHAO; Rong JIN (2023). Online Feature Selection and Its Applications [Dataset]. http://doi.org/10.25440/smu.12062733.v1
Explore at:
Unique identifier
https://doi.org/10.25440/smu.12062733.v1
Dataset updated
May 31, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
HOI Steven; Jialei WANG; Peilin ZHAO; Rong JIN
License
https://www.gnu.org/licenses/gpl-3.0.html
Description
Feature selection is an important technique for data mining before a machine learning algorithm is applied. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require accessing all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications when data instances are of high dimensionality or it is expensive to acquire the full set of attributes/features. To address this limitation, we investigate the problem of Online Feature Selection (OFS), in which an online learner is only allowed to maintain a classifier involving a small and fixed number of features. The key challenge of Online Feature Selection is how to make accurate predictions using a small and fixed number of active features. This is in contrast to the classical setup of online learning, where all the features can be used for prediction. We attempt to tackle this challenge by studying sparsity regularization and truncation techniques. Specifically, this article addresses two different tasks of online feature selection: (1) learning with full input, where a learner is allowed to access all the features to decide the subset of active features, and (2) learning with partial input, where only a limited number of features is allowed to be accessed for each instance by the learner. We present novel algorithms to solve each of the two problems and give their performance analysis. We evaluate the performance of the proposed algorithms for online feature selection on several public datasets, and demonstrate their applications to real-world problems including image classification in computer vision and microarray gene expression analysis in bioinformatics. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
Related Publication:
Hoi, S. C., Wang, J., Zhao, P., & Jin, R. (2012). Online feature selection for mining big data. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 93-100). ACM. http://dx.doi.org/10.1145/2351316.2351329 Full text available in InK: http://ink.library.smu.edu.sg/sis_research/2402/
Wang, J., Zhao, P., Hoi, S. C., & Jin, R. (2014). Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3), 698-710. http://dx.doi.org/10.1109/TKDE.2013.32 Full text available in InK: http://ink.library.smu.edu.sg/sis_research/2277/
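The "learning with full input" setting described above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a plain perceptron-style update followed by hard truncation to a fixed budget of active features, and the function names, the learning rate `eta`, and the synthetic data stream are all hypothetical.

```python
import random

def truncate(weights, budget):
    """Keep the `budget` largest-magnitude weights and zero out the rest."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:budget])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def online_feature_selection(stream, n_features, budget, eta=0.2):
    """Perceptron-style online learner that never holds more than `budget` nonzero weights."""
    weights = [0.0] * n_features
    mistakes = 0
    for x, y in stream:
        margin = y * sum(w * xi for w, xi in zip(weights, x))
        if margin <= 0:  # update only when the current sparse classifier errs
            mistakes += 1
            updated = [w + eta * y * xi for w, xi in zip(weights, x)]
            weights = truncate(updated, budget)
    return weights, mistakes

# Synthetic stream (hypothetical data): only the first 3 of 20 features matter.
random.seed(0)
true_w = [2.0, -3.0, 1.5] + [0.0] * 17
stream = []
for _ in range(500):
    x = [random.gauss(0.0, 1.0) for _ in range(20)]
    score = sum(t * xi for t, xi in zip(true_w, x))
    stream.append((x, 1 if score >= 0 else -1))

weights, mistakes = online_feature_selection(stream, n_features=20, budget=3)
print(sum(1 for w in weights if w != 0.0), "active features;", mistakes, "mistakes")
```

The learner sees every feature of each instance but predicts with at most `budget` of them, which is the constraint the description calls a small and fixed number of active features.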
Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
Updated Feb 15, 2025
Cite
Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
Explore at:
Dataset updated
Feb 15, 2025
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
Global, Canada, United States
Description
Data Science Platform Market Size 2025-2029

The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.

The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.

What will be the Size of the Data Science Platform Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection. Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.

How is this Data Science Platform Industry segmented?

The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: On-premises; Cloud
Component: Platform; Services
End-user: BFSI; Retail and e-commerce; Manufacturing; Media and entertainment; Others
Sector: Large enterprises; SMEs
Application: Data Preparation; Data Visualization; Machine Learning; Predictive Analytics; Data Governance; Others
Geography: North America (US, Canada); Europe (France, Germany, UK); Middle East and Africa (UAE); APAC (China, India, Japan); South America (Brazil); Rest of World (ROW)

By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sen
Data from: Scaling and Citations
figshare.com
pdf
Updated Jun 4, 2023
Cite
Tim Evans (2023). Scaling and Citations [Dataset]. http://doi.org/10.6084/m9.figshare.96161.v1
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.96161.v1
Dataset updated
Jun 4, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Tim Evans
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Invited talk given by Tim Evans (Imperial College London) at the EPSRC Workshop on "Scaling in Social Systems" held at the Saïd Business School, Oxford on 1st December 2011. Abstract:

The pattern of innovation seen through citations of academic papers has long fascinated academics. It has been known for at least fifty years that the data shows various long tailed distributions. In this talk I will look at some of the features of the data and show how to extract some simple universal patterns. I will discuss some of the implications of the results and some of the further questions it raises.
• What is a citation?
• What does an individual citation mean?
• Is the data perfect?
• Why citation count?
• If not citation count, what else?
• What does this data say about me?
• Why h-index?
• What is a self-citation?
• How else can I use this data?
• How will things change?

Tim S. Evans – Mini Biography
Tim studied the mixture of quantum field theory and statistical physics in his PhD at Imperial College London. He was supervised by Prof. Ray Rivers, who also supervised another speaker, Prof. Luis Bettencourt. Tim then spent time as a researcher at the University of Alberta in Edmonton, Canada, before returning to research positions back here at Imperial, latterly as a Royal Society University Research Fellow. He was appointed to a lectureship at Imperial in 1997. Around 2003 he expanded his work on statistical physics to cover problems in complexity, with a particular interest in network methods. This has included participation in an EU collaboration with social scientists on innovation, ISCOM, run in part by Prof. Geoff West (another speaker today). This fuelled his interest in social science applications and started an ongoing collaboration with an archaeologist.
The Piraeus AIS Dataset for Large-scale Maritime Data Analytics - Dataset -...
datahub.digicirc.eu
Updated May 5, 2022
Cite
(2022). The Piraeus AIS Dataset for Large-scale Maritime Data Analytics - Dataset - CE data hub [Dataset]. https://datahub.digicirc.eu/dataset/the-piraeus-ais-dataset-for-large-scale-maritime-data-analytics
Explore at:
Dataset updated
May 5, 2022
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Piraeus
Description
Dataset that contains vessel position information transmitted by vessels of different types and collected via the Automatic Identification System (AIS). The AIS dataset comes along with spatially and temporally correlated data about the vessels and the area of interest, including weather information
Efficiency and optimal size of hospitals: Results of a systematic search
plos.figshare.com
docx
Updated Jun 1, 2023
Cite
Monica Giancotti; Annamaria Guglielmo; Marianna Mauro (2023). Efficiency and optimal size of hospitals: Results of a systematic search [Dataset]. http://doi.org/10.1371/journal.pone.0174533
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.1371/journal.pone.0174533
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Monica Giancotti; Annamaria Guglielmo; Marianna Mauro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background
National Health Systems managers have been subject in recent years to considerable pressure to increase concentration and allow mergers. This pressure has been justified by a belief that larger hospitals lead to lower average costs and better clinical outcomes through the exploitation of economies of scale. In this context, the opportunity to measure scale efficiency is crucial to address the question of optimal productive size and to manage a fair allocation of resources.
Methods and findings
This paper analyses the stance of existing research on scale efficiency and optimal size of the hospital sector. We performed a systematic search of 45 past years (1969–2014) of research published in peer-reviewed scientific journals recorded by the Social Sciences Citation Index concerning this topic. We classified articles by the journal's category, research topic, hospital setting, method and primary data analysis technique. Results showed that most of the studies were focussed on the analysis of technical and scale efficiency or on input / output ratio using Data Envelopment Analysis. We also find increasing interest concerning the effect of possible changes in hospital size on quality of care.
Conclusions
Studies analysed in this review showed that economies of scale are present for merging hospitals. Results supported the current policy of expanding larger hospitals and restructuring/closing smaller hospitals. In terms of beds, studies reported consistent evidence of economies of scale for hospitals with 200–300 beds. Diseconomies of scale can be expected to occur below 200 beds and above 600 beds.
Analysis data for location- and scale-invariant power transformations
zenodo.org
application/gzip, bin +2
Updated Mar 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alex Zwanenburg; Steffen Löck (2025). Analysis data for location- and scale-invariant power transformations [Dataset]. http://doi.org/10.5281/zenodo.14986689
Explore at:
Available download formats: application/gzip, bin, zip, csv
Unique identifier
https://doi.org/10.5281/zenodo.14986689
Dataset updated
Mar 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Alex Zwanenburg; Steffen Löck
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains various files and folders related to the machine learning experiments in a forthcoming manuscript on location- and scale-invariant power transformations.
Dataset for Towards Understanding Performance Bugs in Popular Data Science...
zenodo.org
zip
Updated Apr 20, 2025
+ more versions
Cite
Anonymous Anonymous; Anonymous Anonymous (2025). Dataset for Towards Understanding Performance Bugs in Popular Data Science Libraries [Dataset]. http://doi.org/10.5281/zenodo.15250092
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.15250092
Dataset updated
Apr 20, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anonymous Anonymous; Anonymous Anonymous
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

In our paper, we conducted a large-scale empirical study to characterize performance bugs in seven popular data science libraries.

We identified 202 performance bugs. By analyzing these bugs (including bug description, patches, and project development history), we analyzed their impacts, proposed a taxonomy of the root causes and summarized three challenges for locating root causes and four challenges for fixing these performance bugs.

We found that about 20% of fixes change no more than 10 LOC, which indicates they can be fixed through simple changes. We then manually checked the patches and found several fixing strategies with small LOC that can be automated. We believe that this study can facilitate future research and the development of data science ecosystems. Both developers and users of data science libraries can receive useful guidance from our study.

This dataset contains 202 performance bugs in data science core libraries, and their impacts, root causes, location and fixing challenge, and fixing strategy.

Our replication package consists of three main folders: RQ1&2_Impacts_and_Root_Causes, RQ3_Root_Causes_Locating_Fixing_Effort_Challenge and RQ4_Fixing_Strategy.

RQ1&2_Impacts_and_Root_Causes

In this folder we first placed the identified impact (Explicit and Implicit). Then we gave the identified symptoms and root cause taxonomy. In each file (corresponding to each iteration), we provided the repo name, issue number, and the label (symptom and root cause).

RQ3_Root_Causes_Locating_Fixing_Effort_Challenge

The challenges in locating and fixing these bugs in data science libraries are identified here.

RQ4_Fixing_Strategy

We provided the identified fixing strategy with small LOC. In the file, we provided the repo name, issue number, and the label (fixing strategy).
Data from: The Piraeus AIS Dataset for Large-scale Maritime Data Analytics
data.niaid.nih.gov
Updated Mar 2, 2022
Cite
Andreas Tritsarolis (2022). The Piraeus AIS Dataset for Large-scale Maritime Data Analytics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5562629
Explore at:
Dataset updated
Mar 2, 2022
Dataset provided by
Andreas Tritsarolis
Yannis Theodoridis
Yannis Kontoulis
Area covered
Piraeus
Description
AIS data collected from the receiver at the University of Piraeus
Mass Data Migration Service Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 12, 2025
Cite
Archive Market Research (2025). Mass Data Migration Service Report [Dataset]. https://www.archivemarketresearch.com/reports/mass-data-migration-service-56309
Explore at:
Available download formats: ppt, doc, pdf
Dataset updated
Mar 12, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Mass Data Migration Service market is experiencing robust growth, driven by the increasing volume of data generated across various industries and the rising need for efficient data management solutions. The market, estimated at $15 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 18% from 2025 to 2033, reaching an estimated value of $50 billion by 2033. This significant expansion is fueled by several key factors. Firstly, the proliferation of cloud computing and the associated need to migrate legacy on-premise systems to cloud environments is a major catalyst. Secondly, the growing adoption of data analytics and business intelligence initiatives necessitates efficient and reliable data migration capabilities. Thirdly, stringent data privacy regulations and compliance requirements are pushing organizations to adopt robust data migration solutions for better control and security. Finally, the rising demand for data-driven decision making across diverse sectors like healthcare, finance, and manufacturing is further bolstering market growth. Segment-wise, the cloud-based Mass Data Migration Service is expected to dominate the market due to its scalability, cost-effectiveness, and enhanced security features. Among application segments, healthcare & life sciences, manufacturing, and BFSI are leading the adoption, reflecting their substantial data volumes and the critical need for secure and efficient data handling. Geographically, North America and Europe currently hold significant market share, but the Asia-Pacific region is anticipated to experience substantial growth driven by increasing digitalization and investment in technological infrastructure. However, challenges such as data security concerns, integration complexities, and the lack of skilled professionals capable of handling large-scale data migrations represent potential restraints to market growth. 
Despite these challenges, the overall outlook for the Mass Data Migration Service market remains highly positive, promising substantial growth and opportunities for market players in the coming years.
Data from: On scaling of scientific knowledge production in U.S....
ssh.datastations.nl
csv, pdf, zip
Updated Sep 16, 2014
Cite
DANS Data Station Social Sciences and Humanities (2014). On scaling of scientific knowledge production in U.S. metropolitan areas [Dataset]. http://doi.org/10.17026/dans-x7g-egc2
Explore at:
Available download formats: csv(11318), pdf(531526), zip(14667)
Unique identifier
https://doi.org/10.17026/dans-x7g-egc2
Dataset updated
Sep 16, 2014
Dataset provided by
DANS Data Station Social Sciences and Humanities
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data underlying the findings described in the study "On scaling of scientific knowledge production in U.S. metropolitan areas" by Önder Nomaler (School of Innovation Sciences, Eindhoven University of Technology, The Netherlands), Koen Frenken and Gaston Heimeriks (both Copernicus Institute of Sustainable Development, Utrecht University, The Netherlands)
Data Science and Machine Learning Service Market Report | Global Forecast...
dataintelo.com
csv, pdf, pptx
Updated Jan 7, 2025
Cite
Dataintelo (2025). Data Science and Machine Learning Service Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-science-and-machine-learning-service-market
Explore at:
Available download formats: pptx, csv, pdf
Dataset updated
Jan 7, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Science and Machine Learning Service Market Outlook

The global data science and machine learning service market size was valued at approximately USD 23.2 billion in 2023 and is projected to reach USD 101.6 billion by 2032, growing at a compelling CAGR of 17.8% during the forecast period. This impressive growth is driven by a multitude of factors including technological advancements, increased adoption of artificial intelligence (AI) across industries, and the exponential rise in data generation. Organizations across the globe are increasingly leveraging data science and machine learning to extract actionable insights, enhance decision-making processes, and automate complex operational tasks. As businesses strive for digital transformation, the demand for data science and machine learning services is expected to soar, positioning these technologies at the core of innovative strategies across diverse sectors.
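The growth figures quoted above can be cross-checked with the standard compound annual growth rate formula. The sketch below is only an illustration: it assumes nine compounding periods between 2023 and 2032, and the function names are hypothetical rather than taken from the report.

```python
def project_value(start_value, cagr, years):
    """Compound `start_value` forward by `years` periods at annual rate `cagr`."""
    return start_value * (1.0 + cagr) ** years

def implied_cagr(start_value, end_value, years):
    """Solve end_value = start_value * (1 + r) ** years for r."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Figures quoted in the description (USD billion).
start_2023, end_2032 = 23.2, 101.6
years = 2032 - 2023  # assumption: nine compounding periods

projected_2032 = project_value(start_2023, 0.178, years)
rate = implied_cagr(start_2023, end_2032, years)
print(f"projected 2032 value at 17.8%: {projected_2032:.1f}")
print(f"implied CAGR from the two endpoints: {rate:.2%}")
```

Under that assumption the two endpoints and the stated rate agree to within rounding.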

One of the critical growth factors driving this market is the ever-increasing amount of data being generated globally. With the proliferation of IoT devices, social media platforms, and digital communication technologies, data is being produced at an unprecedented rate. This data, often unstructured and complex, necessitates sophisticated tools and methodologies for analysis. Data science and machine learning provide the essential frameworks for parsing through vast datasets to uncover trends, patterns, and correlations that traditional data analysis methods might miss. As organizations recognize the value of data as a strategic asset, the demand for services that can unlock the potential of this data will continue to rise, fostering substantial market growth.

Another catalyst for market growth is the progressive integration of AI and machine learning technologies in business operations. Machine learning algorithms enable predictive analytics, which allows businesses to forecast future trends and behaviors, thus enhancing their strategic planning and operational efficiency. In sectors such as healthcare, machine learning aids in predictive diagnostics and personalized medicine, leading to better patient outcomes. Similarly, in the financial sector, these technologies help in risk management and fraud detection. As various industries continue to realize the transformative potential of AI and machine learning, the market for related services is likely to expand significantly, tapping into untapped opportunities across new and existing sectors.

The growing need for automation in business processes is another factor propelling the data science and machine learning service market. Organizations are increasingly adopting automation to improve productivity, reduce costs, and minimize human error in repetitive tasks. Machine learning models can automate data-driven tasks such as customer segmentation, inventory management, and demand forecasting. This shift towards automation is particularly prominent in industries like manufacturing and retail, where efficiency and cost savings are paramount. As more businesses look to automate their operations, the demand for comprehensive data science and machine learning solutions is expected to grow, further driving the market forward.

As the demand for data science and machine learning services continues to rise, the role of Machine Learning Infrastructure as a Service becomes increasingly pivotal. This infrastructure provides the necessary computational power and storage solutions that enable organizations to efficiently manage and process vast amounts of data. By leveraging cloud-based infrastructure, businesses can scale their machine learning operations seamlessly, without the need for significant upfront investment in hardware. This flexibility allows companies to focus on developing and deploying machine learning models that drive innovation and competitive advantage. As more organizations recognize the benefits of a robust machine learning infrastructure, the market for these services is expected to grow substantially, supporting the broader adoption of AI-driven solutions across industries.

Regionally, North America is anticipated to hold a dominant position in the data science and machine learning service market. The region's early adoption of technology, coupled with significant investments in AI research and development, provides a robust ecosystem for market growth. Additionally, the presence of key technology players and a highly developed IT infrastructure further contribute to this growth. However, Asia Pacific is expected to exhibit the highest CA
MedMNIST: Standardized Biomedical Images
kaggle.com
Updated Feb 2, 2024
Cite
Möbius (2024). MedMNIST: Standardized Biomedical Images [Dataset]. https://www.kaggle.com/datasets/arashnic/standardized-biomedical-images-medmnist
Dataset updated
Feb 2, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Möbius
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification: https://www.nature.com/articles/s41597-022-01721-8

A large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. Providers benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools.

MedMNIST Landscape: [Figure: medmnistlandscape]

About the MedMNIST Landscape figure: The horizontal axis denotes the base-10 logarithm of the dataset scale, and the vertical axis denotes the base-10 logarithm of the imaging resolution. Upward and downward triangles distinguish 2D datasets from 3D datasets, and the 4 colors represent different tasks.

Key Features

Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as the VDD and MSD to fairly evaluate the generalizable performance of machine learning algorithms in different settings, but both 2D and 3D biomedical images are provided.

Standardized: Each sub-dataset is pre-processed into the same format, which requires no background knowledge for users. As an MNIST-like dataset collection to perform classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST, therefore algorithms could be easily compared.

User-Friendly: The small size of 28×28 (2D) or 28×28×28 (3D) is lightweight and ideal for evaluating machine learning algorithms. We also offer a larger-size version, MedMNIST+: 64x64 (2D), 128x128 (2D), 224x224 (2D), and 64x64x64 (3D). Serving as a complement to the 28-size MedMNIST, this could be a standardized resource for developing medical foundation models. All these datasets are accessible via the same API.

Educational: As an interdisciplinary research area, biomedical image analysis is difficult to get started with for researchers from other communities, as it requires background knowledge from computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under a Creative Commons (CC) License, is easy to use for educational purposes.
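
As a rough illustration of the sizes listed under "User-Friendly", the sketch below maps a dimensionality and size to the spatial shape of one sample. The names `SIZES_2D`, `SIZES_3D` and `expected_shape` are hypothetical and are not part of the MedMNIST package; only the size values come from the description above. Channel axes (for example RGB) are deliberately left out.

```python
# Sizes quoted in the description above. All names here are illustrative
# assumptions and do not belong to the MedMNIST API.
SIZES_2D = (28, 64, 128, 224)  # 28 plus the MedMNIST+ 2D sizes
SIZES_3D = (28, 64)            # 28 plus the MedMNIST+ 3D size


def expected_shape(ndim: int, size: int = 28) -> tuple:
    """Return the spatial shape of one sample for a 2D or 3D sub-dataset."""
    allowed = {2: SIZES_2D, 3: SIZES_3D}
    if ndim not in allowed:
        raise ValueError("ndim must be 2 or 3")
    if size not in allowed[ndim]:
        raise ValueError(f"size {size} is not listed for {ndim}D data")
    # A 2D sample is size x size; a 3D sample is size x size x size.
    return (size,) * ndim


print(expected_shape(2))      # default 28-size 2D sample
print(expected_shape(3, 64))  # larger 3D sample
```

A check like this can catch a mismatched size before any data is downloaded; the actual loading interface is documented on the GitHub page linked below.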

Refer to the paper to learn more about the data: https://www.nature.com/articles/s41597-022-01721-8

Starter Code: download more data and training

Github Page: https://github.com/MedMNIST/MedMNIST

My Kaggle Starter Notebook: https://www.kaggle.com/code/arashnic/medmnist-download-and-use-data?scriptVersionId=161421937

Acknowledgements

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni
Shanghai Jiao Tong University, Shanghai, China; Boston College, Chestnut Hill, MA; RWTH Aachen University, Aachen, Germany; Fudan Institute of Metabolic Diseases, Zhongshan Hospital, Fudan University, Shanghai, China; Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Harvard University, Cambridge, MA

License and Citation

The code is under Apache-2.0 License.

The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0)...
National-Scale Geophysical, Geologic, and Mineral Resource Data and Grids...
s.cnmilf.com
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). National-Scale Geophysical, Geologic, and Mineral Resource Data and Grids for the United States, Canada, and Australia: Data in Support of the Tri-National Critical Minerals Mapping Initiative [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/national-scale-geophysical-geologic-and-mineral-resource-data-and-grids-for-the-united-sta-651a6
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Canada, United States, Australia
Description
National-scale geologic, geophysical, and mineral resource raster and vector data covering the United States, Canada, and Australia are provided in this data release. The data were compiled as part of the tri-national Critical Minerals Mapping Initiative (CMMI). The CMMI, established in 2019, is an international science collaboration between the U.S. Geological Survey (USGS), Geoscience Australia (GA), and the Geological Survey of Canada (GSC). One aspect of the CMMI is to use national- to global-scale earth science data to map where critical mineral prospectivity may exist using advanced machine learning approaches (Kelley, 2020). The geoscience information presented in this report includes the training and evidential layers that cover all three countries and underpin the resultant prospectivity models for basin-hosted Pb-Zn mineralization described in Lawley and others (2021). It is expected that these data layers will be useful to many regional- to continental-scale studies related to a wide range of earth science research. Therefore, the data layers are organized using widely accepted GIS formats in the same map projection to increase efficiency and effectiveness of future studies. All datasets have a common geographic projection in decimal degrees using a WGS84 datum. Data for the various training and evidential layers were either derived for this study or were extracted from previous national- to global-scale compilations. Data from outside work are provided here as a courtesy for completeness of the model and should be cited as the original source. Original references are provided on each child page. Where possible, data for the United States were merged with data for Canada to provide composite data that allow for continuity and seamless analyses of the earth science data across the two countries. Earth science data provided in this report include training data for the models.
Training data include a mineral resource database of Pb-Zn deposits and occurrences related to either carbonate-hosted (Mississippi Valley-type, MVT) or clastic-dominated (also known as sedex) Pb-Zn mineralization. Evidential layers that were used as input to the models include GeoTIFF grid files consisting of ground, airborne, and satellite geophysical data (magnetic, gravity, tomography, seismic) and several related derivative products. Geologic layers incorporated into the models include shapefiles of modified lithology and faults for the United States, Canada and Australia. A global database of ancient and modern passive margins is provided here as well as a link to a database mapping the global distribution of black shale units from a previous USGS study. GeoTIFF grids of the final prospectivity models for MVT and for clastic-dominated Pb-Zn mineralization across the US, Canada, and Australia from Lawley and others (2021) are also included. Each child page describes the particular data layer and related derivative products if applicable.
Kelley, K.D., 2020, International geoscience collaboration to support critical mineral discovery: U.S. Geological Survey Fact Sheet 2020–3035, 2 p., https://doi.org/10.3133/fs20203035.
Lawley, C.J.M., McCafferty, A.E., Graham, G.E., Huston, D.L., Kelley, K.D., Czarnota, K., Paradis, S., Peter, J.M., Hayward, N., Barlow, M., Emsbo, P., Coyan, J., San Juan, C.A., and Gadd, M.G., 2022, Data-driven prospectivity modelling of sediment-hosted Zn-Pb mineral systems and their critical raw materials: Ore Geology Reviews, v. 141, no. 104635, https://doi.org/10.1016/j.oregeorev.2021.104635.
Hydro-social metabolism: data underlying observed scaling relationship...
dataverse.harvard.edu
tsv
Updated Jul 16, 2018
Cite
Harvard Dataverse (2018). Hydro-social metabolism: data underlying observed scaling relationship between birth rate and regional water use [Dataset]. http://doi.org/10.7910/DVN/N1FRI8
Explore at:
Available download formats: tsv(8947), tsv(3102), tsv(3607)
Unique identifier
https://doi.org/10.7910/DVN/N1FRI8
Dataset updated
Jul 16, 2018
Dataset provided by
Harvard Dataverse
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Data underlying Figures 1 to 6 in "Hydro-social Metabolism: Scaling of birth rate with regional water use."
Replication Data for: Ideological Scaling of Social Media Users. A Dynamic...
data.niaid.nih.gov
application/x-gzip +1
Updated Apr 9, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dufresne, Yannick (2018). Replication Data for: Ideological Scaling of Social Media Users. A Dynamic Lexicon Approach [Dataset]. http://doi.org/10.7910/DVN/0ZCBTB
Explore at:
Available download formats: application/x-gzip, txt
Unique identifier
https://doi.org/10.7910/DVN/0ZCBTB
Dataset updated
Apr 9, 2018
Dataset provided by
Hendrickx, Julien M.
van der Linden, Clifton
Dufresne, Yannick
Temporão, Mickael
Vande Kerckhove, Corentin
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
Words matter in politics. The rhetoric that political elites employ structures civic discourse. The emergence of social media platforms as a medium of politics has enabled ordinary citizens to express their ideological inclinations by adopting the lexicon of political elites. This gives researchers a rich new source of data in the study of political ideology. However, existing ideological text-scaling methods fail to produce meaningful inferences when applied to the short, informal style of textual content that is characteristic of social media platforms such as Twitter. This paper introduces the first viable approach to the estimation of individual-level ideological positions derived from social media content. This method allows us to position social media users, be they political elites, parties, or citizens, along a shared ideological dimension. We validate the proposed method by demonstrating correlation with existing measures of ideology across various political contexts and multiple languages. We further demonstrate the ability of ideological estimates to capture derivative signal by predicting out-of-sample, individual-level voting intentions. We posit that social media data can, when properly modeled, better capture derivative signal than discrete scales used in more traditional survey instruments.
Replication Data for: A Common-Space Scaling of the American Judiciary and...
dataverse.harvard.edu
search.dataone.org
Updated Aug 7, 2016
Cite
Adam Bonica (2016). Replication Data for: A Common-Space Scaling of the American Judiciary and Legal Profession [Dataset]. http://doi.org/10.7910/DVN/RPZLMY
Unique identifier
https://doi.org/10.7910/DVN/RPZLMY
Dataset updated
Aug 7, 2016
Dataset provided by
Harvard Dataverse
Authors
Adam Bonica
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
This replication archive contains all data and code to replicate the results in "A Common-Space Scaling of the American Judiciary and Legal Profession" by Maya Sen and Adam Bonica. Abstract: We extend the scaling methodology previously used in Bonica (2014) to jointly scale the American federal judiciary and legal profession in a common-space with other political actors. The end result is the first data set of consistently measured ideological scores across all tiers of the federal judiciary and the legal profession, including 840 federal judges and 380,307 attorneys. To illustrate these measures, we present two examples involving the U.S. Supreme Court. These data open up significant areas of scholarly inquiry.
Data Science Platform Market Size By Deployment (Cloud, On-premise), By...
verifiedmarketresearch.com
pdf,excel,csv,ppt
Updated Aug 17, 2024
Cite
Verified Market Research (2024). Data Science Platform Market Size By Deployment (Cloud, On-premise), By Enterprise Type (Large Enterprises, Small & Medium Enterprises), By Application (Customer Support, Business Operation, Marketing, Finance & Accounting, Logistics), By End-User Industry (BFSI, IT &Telecom, Healthcare, Retail), By Geographic Scope And Forecast [Dataset]. https://www.verifiedmarketresearch.com/product/data-science-platform-market/
Explore at:
Available download formats: pdf, excel, csv, ppt
Dataset updated
Aug 17, 2024
Dataset authored and provided by
Verified Market Research (https://www.verifiedmarketresearch.com/)
License
https://www.verifiedmarketresearch.com/privacy-policy/
Time period covered
2026 - 2032
Area covered
Global
Description
The Data Science Platform Market was valued at USD 101.34 Billion in 2024 and is projected to reach USD 739.07 Billion by 2032, growing at a CAGR of 31.10% from 2026 to 2032.
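
The growth figures above follow the standard compound annual growth rate (CAGR) formula. The sketch below uses only the three numbers quoted in the description; the helper names `cagr` and `implied_start` are hypothetical, and the 2026 base value is not stated in the listing, so the second calculation only shows what base the quoted rate would imply.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


def implied_start(end_value: float, rate: float, years: int) -> float:
    """Starting value that grows to end_value at the given annual rate."""
    return end_value / (1 + rate) ** years


# Rate implied by the quoted 2024 and 2032 values (8 years apart).
rate_2024_2032 = cagr(101.34, 739.07, 8)

# 2026 base implied by the quoted 31.10% rate over 2026-2032 (6 years).
base_2026 = implied_start(739.07, 0.3110, 6)

print(f"Implied CAGR 2024-2032: {rate_2024_2032:.1%}")
print(f"Implied 2026 base (USD Billion): {base_2026:.1f}")
```

Under these assumptions, the 2024-to-2032 endpoints imply roughly 28% per year, while the quoted 31.10% rate over 2026 to 2032 corresponds to a 2026 base of roughly USD 146 Billion.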

Global Data Science Platform Market Drivers

AI and Machine Learning Integration: As AI and machine learning technologies become more widely adopted, demand for data science platforms grows. The United States Bureau of Labor Statistics predicts a 36% increase in data scientist jobs between 2021 and 2031, underlining the growing need for advanced platforms to develop and scale intelligent applications.

Demand for Business Intelligence and Analytics: As firms rely more on data-driven decision-making, there is a greater need for advanced analytics and business intelligence capabilities. Data science platforms provide critical tools for these roles, resulting in market growth, as evidenced by a predicted CAGR of 27.6% from 2022 to 2027.
