100+ datasets found
  1. Top 2500 Kaggle Datasets

    • kaggle.com
    Updated Feb 16, 2024
    Cite
    Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Saket Kumar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

    Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

    Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

    Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

    Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

    Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

    Column Definitions:

    Dataset Name: Name of the dataset.
    Created By: Creator(s) of the dataset.
    Last Updated in number of days: Time elapsed since the last update.
    Usability Score: Score indicating the ease of use.
    Number of File: Quantity of files included.
    Type of file: Format of the files (e.g., CSV, JSON).
    Size: Size of the dataset.
    Total Votes: Number of votes received.
    Category: Categorization of the dataset's subject matter.
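    The column definitions above map directly onto a quick pandas exploration. A minimal sketch with invented toy rows (the real file has 2500 rows and more columns):

```python
import pandas as pd

# Toy rows using the column names listed above; values are illustrative.
df = pd.DataFrame({
    "Dataset Name": ["A", "B", "C", "D"],
    "Category": ["NLP", "NLP", "Vision", "Vision"],
    "Usability Score": [9.4, 7.1, 8.8, 6.2],
    "Total Votes": [1200, 300, 950, 80],
})

# Mean usability and votes per category, most-voted categories first.
summary = (
    df.groupby("Category")[["Usability Score", "Total Votes"]]
    .mean()
    .sort_values("Total Votes", ascending=False)
)
print(summary)
```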

  2. Top 1000 Kaggle Datasets

    • kaggle.com
    zip
    Updated Jan 3, 2022
    Cite
    Trrishan (2022). Top 1000 Kaggle Datasets [Dataset]. https://www.kaggle.com/datasets/notkrishna/top-1000-kaggle-datasets
    Explore at:
    zip (34269 bytes). Available download formats
    Dataset updated
    Jan 3, 2022
    Authors
    Trrishan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and artificial intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.[1][2]

    Source: Kaggle

  3. Multivariate Time Series Search - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Mar 31, 2025
    + more versions
    Cite
    nasa.gov (2025). Multivariate Time Series Search - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/multivariate-time-series-search
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem — (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.
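    The MBR pruning step behind RBS can be illustrated in a few lines. This is a sketch of the general idea only, not the paper's implementation; the array shapes and the single-point query form are assumptions:

```python
import numpy as np

def mbr(subseq):
    """Per-variable (min, max) bounding box of a multivariate subsequence;
    subseq has shape (time, variables)."""
    return subseq.min(axis=0), subseq.max(axis=0)

def can_prune(query_point, box, eps):
    """True if the box lies farther than eps from query_point in some
    variable, so every subsequence inside it can be skipped without
    touching the disk."""
    lo, hi = box
    # Per-variable distance from the query to the box (0 if inside).
    gap = np.maximum(lo - query_point, 0.0) + np.maximum(query_point - hi, 0.0)
    return bool((gap > eps).any())

sub = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0]])
box = mbr(sub)
print(can_prune(np.array([2.0, 11.0]), box, eps=0.5))  # query inside the box
print(can_prune(np.array([9.0, 11.0]), box, eps=0.5))  # far away in variable 0
```

A real R-tree groups many such boxes hierarchically, so one failed check prunes whole subtrees at once, which is how prune rates above 95% become possible.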

  4. Multivariate Time Series Search

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Nov 14, 2025
    + more versions
    Cite
    Dashlink (2025). Multivariate Time Series Search [Dataset]. https://catalog.data.gov/dataset/multivariate-time-series-search
    Explore at:
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Dashlink
    Description

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns from these MTS databases which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work only supports queries with the same length of data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases, that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem — (1) an R-tree Based Search (RBS) which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several millions of observations. Both these tests show that our algorithms have very high prune rates (>95%) thus needing actual disk access for only less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size we have used in this paper.

  5. Amazon Web Services Public Data Sets

    • neuinfo.org
    • dknet.org
    • +1more
    + more versions
    Cite
    Amazon Web Services Public Data Sets [Dataset]. http://identifiers.org/RRID:SCR_006318
    Explore at:
    Description

    A multidisciplinary repository of public data sets such as the Human Genome and US Census data that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community. Anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. If you have a public domain or non-proprietary data set that you think is useful and interesting to the AWS community, please submit a request and the AWS team will review your submission and get back to you. Typically the data sets in the repository are between 1 GB to 1 TB in size (based on the Amazon EBS volume limit), but they can work with you to host larger data sets as well. You must have the right to make the data freely available.

  6. Data generation volume worldwide 2010-2029

    • statista.com
    Updated Nov 19, 2025
    Cite
    Statista (2025). Data generation volume worldwide 2010-2029 [Dataset]. https://www.statista.com/statistics/871513/worldwide-data-created/
    Explore at:
    Dataset updated
    Nov 19, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly. While it was estimated at ***** zettabytes in 2025, the forecast for 2029 stands at ***** zettabytes. Thus, global data generation will triple between 2025 and 2029. Data creation has been expanding continuously over the past decade. In 2020, the growth was higher than previously expected, caused by the increased demand due to the coronavirus (COVID-19) pandemic, as more people worked and learned from home and used home entertainment options more often.

  7. Powerful Data for Power BI

    • kaggle.com
    zip
    Updated Aug 28, 2023
    Cite
    Shiv_D24Coder (2023). Powerful Data for Power BI [Dataset]. https://www.kaggle.com/datasets/shivd24coder/powerful-data-for-power-bi
    Explore at:
    zip (907404 bytes). Available download formats
    Dataset updated
    Aug 28, 2023
    Authors
    Shiv_D24Coder
    Description

    Explore the world of data visualization with this Power BI dataset containing HR Analytics and Sales Analytics datasets. Gain insights, create impactful reports, and craft engaging dashboards using real-world data from HR and sales domains. Sharpen your Power BI skills and uncover valuable data-driven insights with this powerful dataset. Happy analyzing!

  8. Forecast revenue big data market worldwide 2011-2027

    • statista.com
    Updated Mar 15, 2018
    Cite
    Statista (2018). Forecast revenue big data market worldwide 2011-2027 [Dataset]. https://www.statista.com/statistics/254266/global-big-data-market-forecast/
    Explore at:
    Dataset updated
    Mar 15, 2018
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2018. With a share of 45 percent, the software segment would become the largest big data market segment by 2027.

    What is big data?
    Big data is a term that refers to the kind of data sets that are too large or too complex for traditional data processing applications. It is defined as having one or more of the following characteristics: high volume, high velocity, or high variety. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets.

    Big data analytics
    Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate new business insights. The global big data and business analytics market was valued at 169 billion U.S. dollars in 2018 and is expected to grow to 274 billion U.S. dollars in 2022. As of November 2018, 45 percent of professionals in the market research industry reportedly used big data analytics as a research method.

  9. Data from: Chemical Topic Modeling: Exploring Molecular Data Sets Using a...

    • acs.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl (2023). Chemical Topic Modeling: Exploring Molecular Data Sets Using a Common Text-Mining Approach [Dataset]. http://doi.org/10.1021/acs.jcim.7b00249.s003
    Explore at:
    zip. Available download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    ACS Publications
    Authors
    Nadine Schneider; Nikolas Fechner; Gregory A. Landrum; Nikolaus Stiefl
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Big data is one of the key transformative factors which increasingly influences all aspects of modern life. Although this transformation brings vast opportunities, it also generates novel challenges, not the least of which is organizing and searching this data deluge. The field of medicinal chemistry is no different: more and more data are being generated, for instance, by technologies such as DNA-encoded libraries, peptide libraries, text mining of large literature corpora, and new in silico enumeration methods. Handling those huge sets of molecules effectively is quite challenging and requires compromises that often come at the expense of the interpretability of the results. In order to find an intuitive and meaningful approach to organizing large molecular data sets, we adopted a probabilistic framework called “topic modeling” from the text-mining field. Here we present the first chemistry-related implementation of this method, which allows large molecule sets to be assigned to “chemical topics” and the relationships between those to be investigated. In this first study, we thoroughly evaluate this novel method in different experiments and discuss both its disadvantages and advantages. We show very promising results in reproducing human-assigned concepts using the approach to identify and retrieve chemical series from sets of molecules. We have also created an intuitive visualization of the chemical topics output by the algorithm. This is a huge benefit compared to other unsupervised machine-learning methods, like clustering, which are commonly used to group sets of molecules. Finally, we applied the new method to the 1.6 million molecules of the ChEMBL22 data set to test its robustness and efficiency. In about 1 h we built a 100-topic model of this large data set in which we could identify interesting topics like “proteins”, “DNA”, or “steroids”.
Along with this publication we provide our data sets and an open-source implementation of the new method (CheTo) which will be part of an upcoming version of the open-source cheminformatics toolkit RDKit.
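    The underlying technique treats substructure fragments as words and molecules as documents, then fits a standard topic model. A sketch using a generic LDA implementation from scikit-learn with an invented molecule-by-fragment count matrix (the authors' actual tool is the open-source CheTo package built on RDKit):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy "molecule x fragment" count matrix: rows are molecules, columns are
# hypothetical substructure fragments playing the role of words.
counts = np.array([
    [5, 4, 0, 0],   # molecules dominated by fragments 0-1
    [6, 3, 1, 0],
    [0, 1, 5, 6],   # molecules dominated by fragments 2-3
    [0, 0, 4, 7],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # molecule-to-topic weights

# Each row sums to 1: a soft assignment of molecules to "chemical topics".
print(doc_topics.round(2))
```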

  10. Amount of data created, consumed, and stored 2010-2023, with forecasts to...

    • statista.com
    Updated Mar 31, 2025
    Cite
    Petroc Taylor (2025). Amount of data created, consumed, and stored 2010-2023, with forecasts to 2028 [Dataset]. https://www.statista.com/topics/1464/big-data/
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Petroc Taylor
    Description

    The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 149 zettabytes in 2024. Over the next five years up to 2028, global data creation is projected to grow to more than 394 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.

    Storage capacity also growing
    Only a small percentage of this newly created data is kept though, as just 2 percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.

  11. Leading countries by number of data centers 2025

    • statista.com
    Updated Mar 31, 2025
    + more versions
    Cite
    Petroc Taylor (2025). Leading countries by number of data centers 2025 [Dataset]. https://www.statista.com/topics/1464/big-data/
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Petroc Taylor
    Description

    As of March 2025, there were a reported 5,426 data centers in the United States, the most of any country worldwide. A further 529 were located in Germany, while 523 were located in the United Kingdom.

    What is a data center?
    A data center is a network of computing and storage resources that enables the delivery of shared software applications and data. These facilities can house large amounts of critical and important data, and therefore are vital to the daily functions of companies and consumers alike. As a result, whether it is a cloud, colocation, or managed service, data center real estate will have increasing importance worldwide.

    Hyperscale data centers
    In the past, data centers were highly controlled physical infrastructures, but the cloud has since changed that model. A cloud data service is a remote version of a data center, located somewhere away from a company's physical premises. Cloud IT infrastructure spending has grown and is forecast to rise further in the coming years. The evolution of technology, along with the rapid growth in demand for data across the globe, is largely driven by the leading hyperscale data center providers.

  12. Discord-Data

    • kaggle.com
    zip
    Updated Apr 16, 2021
    Cite
    Jess Fan (2021). Discord-Data [Dataset]. https://www.kaggle.com/datasets/jef1056/discord-data/code
    Explore at:
    zip (8155868013 bytes). Available download formats
    Dataset updated
    Apr 16, 2021
    Authors
    Jess Fan
    License

    http://www.gnu.org/licenses/agpl-3.0.html

    Description

    This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small.

    Want your server to be a part of the next release? Want access to the raw data? Contact me at contact@j-fan.ml

    Some statistics

    The raw data for this version contained 51,826,268 messages.
    [v1] 5103788 (regex) + 696161 (toxic) / 51826268, or 0.11% of the messages were removed.
    [v2] 6737000 (regex) + 946778 (toxic) / 90841631, or 0.08% of the messages were removed.
    The dataset's final size is 46,026,319 (v1) + 64,345,492 (v2) [110,371,811] messages across 456,810 (v1) + 750,416 (v2) [1,207,226] conversations, reduced from 89.6 GB of raw JSON data to just under 2 GB.

    Inspiration

    There is a wide variety of NLP datasets covering a huge number of different interactions between users that can be used for pretraining; Google's C4 covers web texts and an extremely diverse amount of data for the majority of language tasks, and Reddit crawls cover structured, forum-style text. However, despite this abundance of data, there is a lack of clean long-context data specifically for conversational purposes. In a search for potential sources of data, I discovered that Discord has a long-standing history of interesting and diverse conversations, and a relatively open API. With the collaboration of a large number of Discord moderators, server owners, and members of the community, this data was successfully downloaded and cleaned.

    Goal

    To create a diverse, structured dataset of turn-by-turn conversations that can be used to pretrain a model oriented specifically for conversational purposes.

    Content

    Files containing -detox are cleaned files that used an LSTM network to analyze each message and evaluate whether it is toxic, obscene, threatening, insulting, or identity hate.
    All files were cleaned using https://github.com/JEF1056/clean-discord, mostly with the default settings. The repo takes an automated, heuristic approach to removing unwanted, non-NLP, or toxic comments.
    context.txt contains all data that has been cleaned using basic regex and some text replacement.
    context-pairs.txt contains pairs of data using only Discord's recent replies feature. As the feature is so new, its yield is very low. It has also been cleaned using basic regex and some text replacement.
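    The basic regex cleaning mentioned for context.txt might look roughly like this; the patterns below are illustrative, not the actual clean-discord rules:

```python
import re

# Illustrative cleaning rules in the spirit of clean-discord: strip user
# mentions, custom emotes, and URLs, then normalize whitespace.
MENTION = re.compile(r"<@!?\d+>")      # user mentions like <@123456>
EMOTE = re.compile(r"<a?:\w+:\d+>")    # custom emotes like <:smile:789>
URL = re.compile(r"https?://\S+")

def clean_message(text):
    text = MENTION.sub("", text)
    text = EMOTE.sub("", text)
    text = URL.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_message("hey <@1234> check https://example.com <:wave:42> out"))
# -> hey check out
```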

    Acknowledgements

    A massive thanks to https://github.com/codemicro for working on multithreading code for the clean-discord repo!

    Cite this dataset:

    @misc{discord-data,
      author = {Jess Fan},
      title = {Discord Dataset},
      contact = {jeefan@ucsc.edu, contact@j-fan.ml},
      year = {2021},
      howpublished = {\url{https://www.kaggle.com/jef1056/discord-data}},
      note = {V5}
    }

  13. Kaggle Datasets - Summary, Topics, Classification

    • kaggle.com
    zip
    Updated Nov 16, 2020
    Cite
    Katherine Marsh (2020). Kaggle Datasets - Summary, Topics, Classification [Dataset]. https://www.kaggle.com/datasets/katherinemarsh/kaggle-datasets-summary-topics-classification
    Explore at:
    zip (273449 bytes). Available download formats
    Dataset updated
    Nov 16, 2020
    Authors
    Katherine Marsh
    Description

    Context

    Companies and individuals are storing increasingly more data digitally; however, much of it goes unused because it is unclassified. How many times have you opened your downloads folder, found a file you downloaded a year ago, and had no idea what its contents are? You can read through those files individually, but imagine doing that for thousands of files. All that raw data in storage facilities creates data lakes. As the amount of data grows and its complexity rises, data lakes become data swamps, and potentially valuable and interesting datasets will likely remain unused. Our tool addresses the need to classify these large pools of data in a visually effective and succinct manner by identifying keywords in datasets and classifying datasets into a consistent taxonomy.

    The files listed within kaggleDatasetSummaryTopicsClassification.csv have been processed with our tool to generate the keywords and taxonomic classification as seen below. The summaries are not generated by our system; instead, they were retrieved from user input as the files were uploaded to Kaggle. We planned to utilize these summaries to create an NLG model that generates summaries from any input file. Unfortunately, we were not able to collect enough data to build a good model. Hopefully the data within this set can help future users achieve that goal.

    Acknowledgements

    Developed with the Senior Design Center at NC State in collaboration with SAS.
    Senior Design Team: Tanya Chu, Katherine Marsh, Nikhil Milind, Anna Owens
    SAS Representatives: Nancy Rausch, Marty Warner, Brant Kay, Tyler Wendell, JP Trawinski

  14. Revenue of leading data center markets worldwide 2018-2029

    • statista.com
    Updated Mar 31, 2025
    Cite
    Petroc Taylor (2025). Revenue of leading data center markets worldwide 2018-2029 [Dataset]. https://www.statista.com/topics/1464/big-data/
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Petroc Taylor
    Description

    Revenue is forecast to grow significantly in all regions by 2029. Among the selected regions, the data center market ranking by revenue is forecast to be led by the United States with 212.06 billion U.S. dollars, while the United Kingdom trails with 23.76 billion U.S. dollars, a difference of 188.3 billion U.S. dollars from the United States. Find further statistics on other topics, such as a comparison of the revenue worldwide and a comparison of the revenue in the United States. The Statista Market Insights cover a broad range of additional markets.

  15. Reference list of 265 sources used for the discovery of relationships...

    • doi.pangaea.de
    • search.dataone.org
    Updated Jul 8, 2012
    Cite
    Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer (2012). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
    Explore at:
    Dataset updated
    Jul 8, 2012
    Dataset provided by
    PANGAEA
    Authors
    Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2006 - Dec 31, 2006
    Area covered
    Description

    Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of associated categorical, numerical, or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or, loosely speaking, correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata.
Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst discovering interesting and visually understandable relationships.
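    One plausible instance of such an interestingness score (an illustration, not necessarily one of the paper's measures) ranks clusters by how far the categorical metadata histogram of each cluster is from uniform:

```python
import math
from collections import Counter

def interestingness(categories):
    """Score a cluster's categorical metadata by 1 minus normalized entropy:
    1.0 when all elements share one category, near 0.0 for a uniform mix."""
    counts = Counter(categories)
    n = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 1.0
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - h / math.log(k)

# Hypothetical metadata categories for two clusters of research data.
pure = ["ice-core"] * 8
mixed = ["ice-core", "sediment", "tree-ring", "coral"] * 2
print(interestingness(pure), interestingness(mixed))
```

Ranking clusters by this score in descending order surfaces the clusters whose metadata distribution is most skewed, which is the kind of cluster-to-metadata relationship the approach highlights for the analyst.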

  16. Global IT spending 2005-2024

    • statista.com
    Updated Mar 31, 2025
    + more versions
    Cite
    Ahmed Sherif (2025). Global IT spending 2005-2024 [Dataset]. https://www.statista.com/topics/1464/big-data/
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Ahmed Sherif
    Description

    IT spending worldwide is projected to reach over 5.7 trillion U.S. dollars in 2025, over a nine percent increase on 2024 spending.

    Smaller companies spending a greater share on hardware
    According to the results of a survey, hardware projects account for a fifth of IT budgets across North America and Europe. Larger companies tend to allocate a smaller share of their budget to hardware projects. Companies employing between one and 99 people allocated 31 percent of the budget to hardware, compared with 29 percent in companies of five thousand people or more. This could be explained by the greater need to spend money on managed services in larger companies.

    Not all companies can reduce their spending
    While COVID-19 has the overall effect of reducing IT spending, not all companies will face the same experiences. Setting up employees to comfortably work from home can result in unexpected costs, as can adapting to new operational requirements. In a recent survey of IT buyers, 18 percent of the respondents said they expected their IT budgets to increase in 2020. For further information about the coronavirus (COVID-19) pandemic, please visit our dedicated Facts and Figures page.

  17. Data from: Estimating Heterogeneous Causal Mediation Effects with Bayesian...

    • tandf.figshare.com
    bin
    Updated May 12, 2025
    Cite
    Angela Ting; Antonio R. Linero (2025). Estimating Heterogeneous Causal Mediation Effects with Bayesian Decision Tree Ensembles [Dataset]. http://doi.org/10.6084/m9.figshare.29039267.v1
    Explore at:
    bin. Available download formats
    Dataset updated
    May 12, 2025
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Angela Ting; Antonio R. Linero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The causal inference literature has increasingly recognized that targeting treatment effect heterogeneity can lead to improved scientific understanding and policy recommendations. Similarly, studying the causal pathway connecting the treatment to the outcome can be useful. We address these problems in the context of causal mediation analysis. We introduce a varying coefficient model based on Bayesian additive regression trees to estimate and regularize heterogeneous causal mediation effects. Even on large datasets with few covariates, we show linear structural equation models (LSEMs) can produce highly unstable estimates of the conditional average direct and indirect effects, while our Bayesian causal mediation forests model produces stable estimates. We find that our approach is conservative, with effect estimates “shrunk towards homogeneity.” Using data from the Medical Expenditure Panel Survey and empirically-grounded simulated data, we examine the salient properties of our method. Finally, we show how our model can be combined with posterior summarization strategies to identify interesting subgroups and interpret the model fit.
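    For reference, the conditional average direct and indirect effects targeted here are conventionally defined through nested potential outcomes Y{t, M(t')}; the notation below follows the standard mediation literature and is not necessarily the authors' exact symbols:

```latex
\begin{aligned}
\zeta(x)  &= \mathbb{E}\bigl[\,Y\{1, M(0)\} - Y\{0, M(0)\} \mid X = x\,\bigr]
  &&\text{(conditional average direct effect)}\\
\delta(x) &= \mathbb{E}\bigl[\,Y\{1, M(1)\} - Y\{1, M(0)\} \mid X = x\,\bigr]
  &&\text{(conditional average indirect effect)}
\end{aligned}
```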

  18. Data from: Slovene instruction-following dataset for large language models...

    • live.european-language-grid.eu
    binary format
    Updated Aug 24, 2025
    + more versions
    Cite
    (2025). Slovene instruction-following dataset for large language models GaMS-Instruct-MED 2.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23882
    Explore at:
    Available download formats: binary format
    Dataset updated
    Aug 24, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GaMS-Instruct-MED is an instruction-following dataset designed to fine-tune Slovene large language models to follow instructions in the medical domain. It consists of units of prompts, instructions, and responses from the field of medicine, particularly those pertaining to the use of pharmaceutical drugs and medications.

    The dataset was generated in several steps (for a more detailed description, please refer to 00README.txt). After consulting with experts from the medical field, a series of prompts was manually compiled containing questions of interest in the context of drug and medication use. For each medication in the PoVeJMo-VeMo-Med 1.0 dataset (http://hdl.handle.net/11356/1983), approximately 10-15 questions were automatically generated using prompt tuning. In version 2.0, the dataset was extended with several similar English datasets that were translated into Slovene: MedQuAD, MeQSum, Medication QA, and LiveQA (references are available in 00README.txt). All translations were made automatically using GPT-4.1. Manual validation was carried out in two phases. In the preparation-evaluation phase, the quality of the machine translations was validated on a sample using different machine translation applications (DeepL, OpenAI) to determine the solution with optimal performance. In the second phase, a random sample of 20-40 examples from each translated subset was manually validated (a total of 240 examples). The manual validation was performed by two experts from the field of medicine and an expert in dataset compilation.

    Unlike version 1.0, where the dataset consisted of prompt-response pairs, version 2.0 contains units consisting of three elements (instruction-input-output). The conversion was made using OpenAI GPT-4.1. All final instructions were manually validated by an expert in dataset compilation. Two experts from the field of medicine participated in the design of clinically relevant categories of instructions, the compilation of examples of prompt-response pairs, and the manual validation of test results of the conversion process.

    Please note that the current version of the dataset (containing 25,046 instruction-input-output units) does not guarantee full clinical accuracy and may contain errors as a consequence of LLM hallucinations.
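    The instruction-input-output units described above follow a layout common in instruction-tuning corpora. A minimal sketch of loading and sanity-checking such units (the JSON field names and the example unit are illustrative assumptions, not the dataset's actual schema):

    ```python
    import json

    # A hypothetical unit in the instruction-input-output layout (illustrative only).
    raw = """[
      {
        "instruction": "Answer the patient's question about the medication.",
        "input": "Can I take this medication with food?",
        "output": "Please consult the package leaflet and your physician."
      }
    ]"""

    units = json.loads(raw)

    # Basic sanity checks one might run before fine-tuning: every unit has
    # exactly the three expected fields, and none of them is empty.
    REQUIRED = {"instruction", "input", "output"}
    valid = [
        u for u in units
        if set(u) == REQUIRED and all(u[k].strip() for k in REQUIRED)
    ]
    print(f"{len(valid)}/{len(units)} units pass validation")
    ```

    Checks like these catch only structural problems; as the note above says, clinical accuracy of the responses still requires expert review.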

  19. Digital transformation spending worldwide 2017-2027

    • statista.com
    Updated Mar 31, 2025
    Cite
    Petroc Taylor (2025). Digital transformation spending worldwide 2017-2027 [Dataset]. https://www.statista.com/topics/1464/big-data/
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    Statista: http://statista.com/
    Authors
    Petroc Taylor
    Description

    In 2024, spending on digital transformation (DX) is projected to reach 2.5 trillion U.S. dollars. By 2027, global digital transformation spending is forecast to reach 3.9 trillion U.S. dollars.

    What is digital transformation?

    Digital transformation refers to the adoption of digital technology to transform business processes and services from non-digital to digital. This encompasses, among other things, moving data to the cloud, using technological devices and tools for communication and collaboration, and automating processes.

    What is driving digital transformation?

    Several factors contribute to digital transformation growth. Among them was the COVID-19 pandemic, which considerably increased the tempo of digital transformation in organizations around the globe in 2020. Although the pandemic is over, working from home has not only remained but increased across organizations globally, further driving digital transformation. Other contributing causes include customer demand and the need to keep pace with competitors. Overall, the technologies used in digital transformation make organizations more agile in responding to changing markets and enhance innovation, thereby making them more resilient.

  20. Data from: Permutation-validated principal components analysis of microarray...

    • catalog.data.gov
    • healthdata.gov
    • +1more
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Permutation-validated principal components analysis of microarray data [Dataset]. https://catalog.data.gov/dataset/permutation-validated-principal-components-analysis-of-microarray-data
    Explore at:
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background

    In microarray data analysis, the comparison of gene-expression profiles with respect to different conditions and the selection of biologically interesting genes are crucial tasks. Multivariate statistical methods have been applied to analyze these large datasets. Less work has been published concerning the assessment of the reliability of gene-selection procedures. Here we describe a method to assess reliability in multivariate microarray data analysis using permutation-validated principal components analysis (PCA). The approach is designed for microarray data with a group structure.

    Results

    We used PCA to detect the major sources of variance underlying the hybridization conditions, followed by gene selection based on PCA-derived and permutation-based test statistics. We validated our method by applying it to well characterized yeast cell-cycle data and to two datasets from our laboratory. We could describe the major sources of variance, select informative genes and visualize the relationship of genes and arrays. We observed differences in the level of the explained variance and the interpretability of the selected genes.

    Conclusions

    Combining data visualization and permutation-based gene selection, permutation-validated PCA enables one to illustrate gene-expression variance between several conditions and to select genes by taking into account the relationship of between-group to within-group variance of genes. The method can be used to extract the leading sources of variance from microarray data, to visualize relationships between genes and hybridizations and to select informative genes in a statistically reliable manner. This selection accounts for the level of reproducibility of replicates or group structure as well as gene-specific scatter. Visualization of the data can support a straightforward biological interpretation.
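    The core idea described above, comparing the variance explained by a leading principal component against a null distribution obtained by permutation, can be sketched as follows. This is a generic illustration of permutation-validated PCA on simulated data, not the authors' exact test statistic:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated expression matrix: 100 genes x 12 arrays, with a block of
    # 20 genes that separates two groups of arrays (illustrative data).
    X = rng.normal(size=(100, 12))
    X[:20, :6] += 2.0   # first 20 genes up-regulated in the first 6 arrays

    def pc1_variance_ratio(M):
        """Fraction of total variance explained by the first principal component."""
        C = M - M.mean(axis=1, keepdims=True)    # center each gene
        s = np.linalg.svd(C, compute_uv=False)   # singular values
        return s[0] ** 2 / np.sum(s ** 2)

    observed = pc1_variance_ratio(X)

    # Null distribution: permute each gene's values across arrays independently,
    # destroying shared structure while keeping each gene's own distribution.
    n_perm = 200
    null = np.empty(n_perm)
    for i in range(n_perm):
        Xp = np.array([rng.permutation(row) for row in X])
        null[i] = pc1_variance_ratio(Xp)

    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    print(f"PC1 explains {observed:.1%} of variance, permutation p = {p_value:.3f}")
    ```

    In the paper an analogous permutation scheme also feeds gene selection, deciding which genes load reliably on the retained components; the sketch above validates only the component itself.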
