100+ datasets found
  1. Big data and business analytics revenue worldwide 2015-2022

    • statista.com
    Updated Aug 17, 2021
    Cite
    Statista (2021). Big data and business analytics revenue worldwide 2015-2022 [Dataset]. https://www.statista.com/statistics/551501/worldwide-big-data-business-analytics-revenue/
    Explore at:
    Dataset updated
    Aug 17, 2021
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data and business analytics (BDA) market was valued at ***** billion U.S. dollars in 2018 and is forecast to grow to ***** billion U.S. dollars by 2021. In 2021, more than half of BDA spending will go towards services: IT services are projected to make up around ** billion U.S. dollars, and business services will account for the remainder.

    Big data
    High volume, high velocity, and high variety: one or more of these characteristics is used to define big data, the kind of data sets that are too large or too complex for traditional data processing applications. Fast-growing mobile data traffic, cloud computing traffic, and the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets. For example, connected IoT devices are projected to generate **** ZBs of data in 2025.

    Business analytics
    Advanced analytics tools, such as predictive analytics and data mining, help to extract value from data and generate business insights. The business intelligence and analytics software application market is forecast to reach around **** billion U.S. dollars in 2022. Growth in this market is driven by a focus on digital transformation, demand for data visualization dashboards, and increased adoption of cloud computing.

  2. Forecast revenue big data market worldwide 2011-2027

    • statista.com
    Updated Mar 15, 2018
    Cite
    Statista (2018). Forecast revenue big data market worldwide 2011-2027 [Dataset]. https://www.statista.com/statistics/254266/global-big-data-market-forecast/
    Explore at:
    Dataset updated
    Mar 15, 2018
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2018. With a share of 45 percent, the software segment would become the largest big data market segment by 2027.

    What is big data?
    Big data is a term that refers to data sets that are too large or too complex for traditional data processing applications. It is defined as having one or more of the following characteristics: high volume, high velocity, or high variety. Fast-growing mobile data traffic, cloud computing traffic, and the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets.

    Big data analytics
    Advanced analytics tools, such as predictive analytics and data mining, help to extract value from data and generate new business insights. The global big data and business analytics market was valued at 169 billion U.S. dollars in 2018 and is expected to grow to 274 billion U.S. dollars in 2022. As of November 2018, 45 percent of professionals in the market research industry reportedly used big data analytics as a research method.

  3. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE)

    • technavio.com
    pdf
    Updated Feb 8, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market is forecast to grow by USD 763.9 million at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the market.

    Major Market Trends & Insights

    North America dominated the market and is expected to account for 48% of its growth during the forecast period.
    By Deployment - On-premises segment was valued at USD 38.70 million in 2023
    By Component - Platform segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 1.00 million
    Market Future Opportunities: USD 763.90 million
    CAGR: 40.2%
    North America: Largest market in 2023
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
    According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
    

    What will be the Size of the Data Science Platform Market during the forecast period?


    How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
      • On-premises
      • Cloud

    Component
      • Platform
      • Services

    End-user
      • BFSI
      • Retail and e-commerce
      • Manufacturing
      • Media and entertainment
      • Others

    Sector
      • Large enterprises
      • SMEs

    Application
      • Data Preparation
      • Data Visualization
      • Machine Learning
      • Predictive Analytics
      • Data Governance
      • Others

    Geography
      • North America (US, Canada)
      • Europe (France, Germany, UK)
      • Middle East and Africa (UAE)
      • APAC (China, India, Japan)
      • South America (Brazil)
      • Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling and machine learning drive business intelligence and decision-making.

    Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

    API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


    The On-premises segment was valued at USD 38.70 million in 2019 and showed

  4. Big Data use by companies by sector in France 2015

    • statista.com
    Updated Nov 28, 2025
    Cite
    Statista (2025). Big Data use by companies by sector in France 2015 [Dataset]. https://www.statista.com/statistics/770505/big-data-business-use-by-sector-la-france/
    Explore at:
    Dataset updated
    Nov 28, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2016
    Area covered
    France
    Description

    This chart shows the percentage of companies using big data in France in 2015, by sector of activity. In the transport sector, a quarter of the companies surveyed reported using big data. The concept of big data refers to large volumes of data related to the use of a good or a service, for example a social network. The ability to process large volumes of data is a significant business issue, as it allows companies to better understand how users behave within a service and thus better meet user expectations.

  5. 10 Million Number Dataset

    • kaggle.com
    zip
    Updated Apr 28, 2025
    Cite
    Mehedi Hasand1497 (2025). 10 Million Number Dataset [Dataset]. https://www.kaggle.com/datasets/mehedihasand1497/10-million-random-number-dataset-for-ml/data
    Explore at:
    Available download formats: zip (2285635720 bytes)
    Dataset updated
    Apr 28, 2025
    Authors
    Mehedi Hasand1497
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About the Dataset: Random Data with Hidden Structure

    This dataset consists of 10,000,000 samples with 50 numerical features. Each feature has been randomly generated using a uniform distribution between 0 and 1. To add complexity, a hidden structure has been introduced in some of the features. Specifically, Feature 2 is related to Feature 1, making it a good candidate for regression analysis tasks. The other features remain purely random, allowing for the exploration of feature engineering and random data generation techniques.

    Key Features and Structure

    • Feature 1: A random number drawn from a uniform distribution between 0 and 1.
    • Feature 2: A function of Feature 1, specifically Feature 2 ≈ 2 × Feature 1 + small Gaussian noise (N(0, 0.05)). This introduces a hidden linear relationship with a small amount of noise for added realism.
    • Features 3 to 50: Independent random numbers generated between 0 and 1, with no relationship to each other or any other features.

    This hidden structure allows you to test models on data where a simple pattern (between Feature 1 and Feature 2) exists, but with noise that can challenge more advanced models in finding the relationship.
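
    As a concrete illustration, here is a minimal Python sketch that reproduces the described generation scheme at a much smaller scale and recovers the hidden slope with ordinary least squares; the row count and seed are illustrative, not part of the dataset.

    ```python
    import numpy as np

    # Reproduce the described scheme at small scale: 50 uniform features,
    # with feature 2 = 2 * feature 1 + N(0, 0.05) noise.
    rng = np.random.default_rng(seed=0)
    n_rows, n_features = 1_000, 50
    X = rng.uniform(0.0, 1.0, size=(n_rows, n_features))
    X[:, 1] = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, size=n_rows)

    # Recover the hidden linear relationship with ordinary least squares.
    slope, intercept = np.polyfit(X[:, 0], X[:, 1], deg=1)
    print(f"slope ≈ {slope:.3f}, intercept ≈ {intercept:.3f}")  # ≈ 2 and ≈ 0
    ```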

    Dataset Overview

    Feature Name    Description
    feature_1       Random number (0–1, uniform)
    feature_2       2 × feature_1 + small noise (N(0, 0.05))
    feature_3–50    Independent random numbers (0–1)

    • Rows: 10,000,000
    • Columns: 50
    • Format: CSV
    • File Size: 5.32 GB

    Intended Uses

    This dataset is versatile and can be used for various machine learning tasks, including:

    • Testing and benchmarking machine learning models: Evaluate model performance on large, randomly generated datasets.
    • Regression analysis practice: The relationship between Feature 1 and Feature 2 makes it ideal for testing regression models.
    • Feature engineering experiments: Explore techniques for selecting, transforming, or creating new features.
    • Random data generation research: Investigate methods for generating synthetic data and its applications.
    • Large-scale data processing testing: Test frameworks such as Pandas, Dask, and Spark for processing large datasets.
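
    As a concrete example of the last point, here is a sketch of chunked processing with pandas so the ~5.32 GB file never has to fit in memory; the filename is an assumption and may differ in the actual download.

    ```python
    import pandas as pd

    # Stream the large CSV in one-million-row chunks and accumulate a running mean.
    # "10_million_numbers.csv" is a placeholder for the actual file name.
    total, count = 0.0, 0
    for chunk in pd.read_csv("10_million_numbers.csv", chunksize=1_000_000):
        total += chunk["feature_1"].sum()
        count += len(chunk)
    print("mean of feature_1:", total / count)  # ≈ 0.5 for U(0, 1)
    ```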

    Licensing

    This dataset is made available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. You are free to share and adapt the material for any purpose, even commercially, as long as proper attribution is given.

    Learn more about the license at https://creativecommons.org/licenses/by/4.0/

  6. DEVILS: a tool for the visualization of large datasets with a high dynamic range

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 19, 2024
    + more versions
    Cite
    Romain Guiet; Olivier Burri; Nicolas Chiaruttini; Olivier Hagens; Arne Seitz (2024). DEVILS: a tool for the visualization of large datasets with a high dynamic range [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4058413
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    EPFL - École Polytechnique Fédérale de Lausanne
    Authors
    Romain Guiet; Olivier Burri; Nicolas Chiaruttini; Olivier Hagens; Arne Seitz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanying the article “DEVILS: a tool for the visualization of large datasets with a high dynamic range” contains the following:

    Extended Material of the article

    An example raw dataset corresponding to the images shown in Fig. 3

    A workflow description that demonstrates the use of the DEVILS workflow with BigStitcher.

    Two scripts (“CLAHE_Parameters_test.ijm” and “DEVILS_Parallel_tests.groovy”) used for Figures S2, S3, and S4.

  7. fdata-02-00044_Parallel Processing Strategies for Big Geospatial Data.pdf

    • frontiersin.figshare.com
    pdf
    Updated Jun 3, 2023
    + more versions
    Cite
    Martin Werner (2023). fdata-02-00044_Parallel Processing Strategies for Big Geospatial Data.pdf [Dataset]. http://doi.org/10.3389/fdata.2019.00044.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Frontiers
    Authors
    Martin Werner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper provides an abstract analysis of parallel processing strategies for spatial and spatio-temporal data. It isolates aspects such as data locality and computational locality as well as redundancy and locally sequential access as central elements of parallel algorithm design for spatial data. Furthermore, the paper gives some examples from simple and advanced GIS and spatial data analysis highlighting both that big data systems have been around long before the current hype of big data and that they follow some design principles which are inevitable for spatial data, including distributed data structures and messaging, which are, however, incompatible with the popular MapReduce paradigm. Throughout this discussion, the need for a replacement or extension of the MapReduce paradigm for spatial data is derived. This paradigm should be able to deal with the imperfect data locality inherent to spatial data, which hinders full independence of non-trivial computational tasks. We conclude that more research is needed and that spatial big data systems should pick up more concepts like graphs, shortest paths, raster data, events, and streams at the same time, instead of solving exactly the set of spatially separable problems such as line simplifications or range queries in many different ways.

  8. Data from: Big Data versus a Survey

    • clevelandfed.org
    Updated Dec 31, 2014
    Cite
    Federal Reserve Bank of Cleveland (2014). Big Data versus a Survey [Dataset]. https://www.clevelandfed.org/publications/working-paper/2014/wp-1440-big-data-versus-a-survey
    Explore at:
    Dataset updated
    Dec 31, 2014
    Dataset authored and provided by
    Federal Reserve Bank of Cleveland (https://www.clevelandfed.org/)
    Description

    Economists are shifting attention and resources from work on survey data to work on “big data.” This analysis is an empirical exploration of the trade-offs this transition requires. Parallel models are estimated using the Federal Reserve Bank of New York Consumer Credit Panel/Equifax and the Survey of Consumer Finances. After adjustments to account for different variable definitions and sampled populations, it is possible to arrive at similar models of total household debt. However, the estimates are sensitive to the adjustments. Little similarity is observed in parallel models of nonmortgage debt. While surveys intentionally collect theoretically related variables, it may be necessary to merge external data into commercial big data. In this example, some education and income measures are successfully integrated with the big data, but other external aggregates fail to adequately substitute for survey responses. Big data offers sample sizes, frequencies, and details that surveys cannot match. However, this example illustrates why caution is appropriate when attempting to substitute big data for a carefully executed survey.

  9. Supply Chain Dataset

    • kaggle.com
    zip
    Updated Jul 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dhanushka Tharanga (2024). Supply Chain Dataset [Dataset]. https://www.kaggle.com/datasets/dhanushkatharanga/supply-chain-dataset
    Explore at:
    Available download formats: zip (18360491 bytes)
    Dataset updated
    Jul 8, 2024
    Authors
    Dhanushka Tharanga
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    In today’s data-driven world, multinational enterprises face the challenge of efficiently managing and processing vast amounts of diverse data. Optimizing big data processing capabilities is crucial for extracting meaningful insights, improving decision-making, and maintaining a competitive edge. This report focuses on designing and implementing a solution that leverages cloud computing technologies for efficient storage, processing, and analysis of big data. It then uses the Google Cloud Platform (GCP) for a practical implementation and provides an example of dataset extraction and analysis using DataCoSupplyChainDataset.csv (from https://data.mendeley.com/datasets/).

  10. Raw data outputs 1-18

    • bridges.monash.edu
    • researchdata.edu.au
    xlsx
    Updated May 30, 2023
    Cite
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie (2023). Raw data outputs 1-18 [Dataset]. http://doi.org/10.26180/21259491.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Abbas Salavaty Hosein Abadi; Sara Alaei; Mirana Ramialison; Peter Currie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data outputs 1-18

    Raw data output 1. Differentially expressed genes in AML CSCs compared with GTCs as well as in TCGA AML cancer samples compared with normal ones. This data was generated based on the results of AML microarray and TCGA data analysis.

    Raw data output 2. Commonly and uniquely differentially expressed genes in AML CSC/GTC microarray and TCGA bulk RNA-seq datasets. This data was generated based on the results of AML microarray and TCGA data analysis.

    Raw data output 3. Common differentially expressed genes between training and test set samples of the microarray dataset. This data was generated based on the results of AML microarray data analysis.

    Raw data output 4. Detailed information on the samples of the breast cancer microarray dataset (GSE52327) used in this study.

    Raw data output 5. Differentially expressed genes in breast CSCs compared with GTCs as well as in TCGA BRCA cancer samples compared with normal ones.

    Raw data output 6. Commonly and uniquely differentially expressed genes in breast cancer CSC/GTC microarray and TCGA BRCA bulk RNA-seq datasets. This data was generated based on the results of breast cancer microarray and TCGA BRCA data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.

    Raw data output 7. Differential and common co-expression and protein-protein interaction of genes between CSC and GTC samples. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis. CSC and GTC are abbreviations of cancer stem cell and general tumor cell, respectively.

    Raw data output 8. Differentially expressed genes between AML dormant and active CSCs. This data was generated based on the results of AML scRNA-seq data analysis.

    Raw data output 9. Uniquely expressed genes in dormant or active AML CSCs. This data was generated based on the results of AML scRNA-seq data analysis.

    Raw data output 10. Intersections between the targeting transcription factors of AML key CSC genes and differentially expressed genes between AML CSCs vs GTCs and between dormant and active AML CSCs, or the uniquely expressed genes in either class of CSCs.

    Raw data output 11. Targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.

    Raw data output 12. CSC-specific targeting desirableness score of AML key CSC genes and their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.

    Raw data output 13. The protein-protein interactions between AML key CSC genes with themselves and their targeting transcription factors. This data was generated based on the results of AML microarray and STRING database-based protein-protein interaction data analysis.

    Raw data output 14. The previously confirmed associations of genes having the highest targeting desirableness and CSC-specific targeting desirableness scores with AML or other cancers’ (stem) cells as well as hematopoietic stem cells. These data were generated based on PubMed database-based literature mining.

    Raw data output 15. Drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.

    Raw data output 16. CSC-specific drug score of available drugs and bioactive small molecules targeting AML key CSC genes and/or their targeting transcription factors. These scores were generated based on an in-house scoring function described in the Methods section.

    Raw data output 17. Candidate drugs for experimental validation. These drugs were selected based on their respective (CSC-specific) drug scores. CSC is the abbreviation of cancer stem cell.

    Raw data output 18. Detailed information on the samples of the AML microarray dataset GSE30375 used in this study.

  11. Data Processing and Hosting Services Market Size Report & Share | 2025

    • mordorintelligence.com
    pdf,excel,csv,ppt
    Updated Jul 1, 2025
    Cite
    Mordor Intelligence (2025). Data Processing and Hosting Services Market Size Report & Share | 2025 [Dataset]. https://www.mordorintelligence.com/industry-reports/data-processing-and-hosting-services-market
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Mordor Intelligence
    License

    https://www.mordorintelligence.com/privacy-policy

    Time period covered
    2019 - 2030
    Area covered
    Global
    Description

    The Data Processing and Hosting Services Market Report is segmented by Organisation (Large Enterprise and Small and Medium Enterprises [SME]), Offering (Data Processing Services and Hosting Services), Deployment Model (Public Cloud, Private Cloud, and Hybrid and Multi-Cloud), End-User Industry (IT and Telecommunication, BFSI, Retail and E-Commerce, Manufacturing, Healthcare and Life Sciences, and More), and Geography.

  12. Whistlerlib: a distributed computing library for exploratory data analysis on large social network datasets

    • repositorio.observatoriogeo.mx
    Updated Oct 21, 2025
    Cite
    (2025). Whistlerlib: a distributed computing library for exploratory data analysis on large social network datasets [Dataset]. http://repositorio.observatoriogeo.mx/dataset/1ee805b50082
    Explore at:
    Dataset updated
    Oct 21, 2025
    Description

    At least 350k posts are published on X, 510k comments are posted on Facebook, and 66k pictures and videos are shared on Instagram each minute. These large datasets require substantial processing power, even if only a percentage is collected for analysis and research. To face this challenge, data scientists can now use computer clusters deployed on various IaaS and PaaS services in the cloud. However, scientists still have to master the design of distributed algorithms and be familiar with using distributed computing programming frameworks. It is thus essential to generate tools that provide analysis methods to leverage the advantages of computer clusters for processing large amounts of social network text. This paper presents Whistlerlib, a new Python library for conducting exploratory analysis on large text datasets on social networks. Whistlerlib implements distributed versions of various social media, sentiment, and social network analysis methods that can run atop computer clusters. We experimentally demonstrate the scalability of the various Whistlerlib distributed methods when deployed on a public cloud platform. We also present a practical example of the analysis of posts on the social network X about the Mexico City subway to showcase the features of Whistlerlib in scenarios where social network analysis tools are needed to address issues with a social dimension.

  13. 1000 Empirical Time series

    • researchdata.edu.au
    • bridges.monash.edu
    • +1 more
    Updated May 5, 2022
    + more versions
    Cite
    Ben Fulcher (2022). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Ben Fulcher
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.


    The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.

    The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv.

    These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.
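
    For users working outside Matlab, here is a minimal pandas sketch for loading these exported files; the exact header layout of each file is an assumption and may need adjusting.

    ```python
    import pandas as pd

    # Filenames are those listed above; header layouts are assumptions.
    data = pd.read_csv("hctsa_datamatrix.csv", header=None)  # feature values
    ts_info = pd.read_csv("hctsa_timeseries-info.csv")       # row (time series) metadata
    features = pd.read_csv("hctsa_features.csv")             # column (feature) metadata

    # Rows of the data matrix are time series; columns are hctsa features.
    print(data.shape)  # expected: (1000, number of computed features)
    ```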

    The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as
    >> TS_Init('INP_Empirical1000.mat');

    Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.

    See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.

  14. Big Data Services Market Analysis, Size, and Forecast 2025-2029: North America (Mexico), Europe (France, Germany, Italy, and UK), Middle East and Africa (UAE), APAC (Australia, China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW)

    • technavio.com
    pdf
    Updated Feb 12, 2025
    Cite
    Technavio (2025). Big Data Services Market Analysis, Size, and Forecast 2025-2029: North America (Mexico), Europe (France, Germany, Italy, and UK), Middle East and Africa (UAE), APAC (Australia, China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/big-data-services-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Description


    Big Data Services Market Size 2025-2029

    The big data services market size is forecast to increase by USD 604.2 billion, at a CAGR of 54.4% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing adoption of big data in various industries, particularly in blockchain technology. The ability to process and analyze vast amounts of data in real-time is revolutionizing business operations and decision-making processes. However, this market is not without challenges. One of the most pressing issues is the need to cater to diverse client requirements, each with unique data needs and expectations. This necessitates customized solutions and a deep understanding of various industries and their data requirements. Additionally, ensuring data security and privacy in an increasingly interconnected world poses a significant challenge. Companies must navigate these obstacles while maintaining compliance with regulations and adhering to ethical data handling practices. To capitalize on the opportunities presented by the market, organizations must focus on developing innovative solutions that address these challenges while delivering value to their clients. By staying abreast of industry trends and investing in advanced technologies, they can effectively meet client demands and differentiate themselves in a competitive landscape.

    What will be the Size of the Big Data Services Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, driven by the ever-increasing volume, velocity, and variety of data being generated across various sectors. Data extraction is a crucial component of this dynamic landscape, enabling entities to derive valuable insights from their data. Human resource management, for instance, benefits from data-driven decision making, operational efficiency, and data enrichment. Batch processing and data integration are essential for data warehousing and data pipeline management. Data governance and data federation ensure data accessibility, quality, and security. Data lineage and data monetization facilitate data sharing and collaboration, while data discovery and data mining uncover hidden patterns and trends.

    Real-time analytics and risk management provide operational agility and help mitigate potential threats. Machine learning and deep learning algorithms enable predictive analytics, enhancing business intelligence and customer insights. Data visualization and data transformation facilitate data usability and data loading into NoSQL databases. Government analytics, financial services analytics, supply chain optimization, and manufacturing analytics are just a few applications of big data services. Cloud computing and data streaming further expand the market's reach and capabilities.

    Data literacy and data collaboration are essential for effective data usage and collaboration. Data security and data cleansing are ongoing concerns, with the market continuously evolving to address these challenges. The integration of natural language processing, computer vision, and fraud detection further enhances the value proposition of big data services. The market's continuous dynamism underscores the importance of data cataloging, metadata management, and data modeling for effective data management and optimization.

    How is this Big Data Services Industry segmented?

    The big data services industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Component
      • Solution
      • Services

    End-user
      • BFSI
      • Telecom
      • Retail
      • Others

    Type
      • Data storage and management
      • Data analytics and visualization
      • Consulting services
      • Implementation and integration services
      • Support and maintenance services

    Sector
      • Large enterprises
      • Small and medium enterprises (SMEs)

    Geography
      • North America (US, Mexico)
      • Europe (France, Germany, Italy, UK)
      • Middle East and Africa (UAE)
      • APAC (Australia, China, India, Japan, South Korea)
      • South America (Brazil)
      • Rest of World (ROW)

    By Component Insights

    The solution segment is estimated to witness significant growth during the forecast period.

    Big data services have become indispensable for businesses seeking operational efficiency and customer insight. The vast expanse of structured and unstructured data presents an opportunity for organizations to analyze consumer behaviors across multiple channels. Big data solutions facilitate the integration and processing of data from various sources, enabling businesses to gain a deeper understanding of customer sentiment towards their products or services. Data governance ensures data quality and security, while data federation and data lineage provide transparency and traceability. Artificial intelligence and machine learning algo

  15. Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia

    • zenodo.org
    bin, zip
    Updated Feb 12, 2025
    Cite
    Jan Göpfert; Jan Göpfert; Patrick Kuckertz; Patrick Kuckertz; Jann M. Weinand; Jann M. Weinand; Detlef Stolten; Detlef Stolten (2025). Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia [Dataset]. http://doi.org/10.5281/zenodo.14858280
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Feb 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Göpfert; Jan Göpfert; Patrick Kuckertz; Patrick Kuckertz; Jann M. Weinand; Jann M. Weinand; Detlef Stolten; Detlef Stolten
    Description

    The task of measurement extraction is typically approached in a pipeline manner, where 1) quantities are identified before 2) their individual measurement context is extracted (see our review paper). To support the development and evaluation of systems for measurement extraction, we present two large datasets that correspond to the two tasks:

    • Wiki-Quantities, a dataset for identifying quantities, and
    • Wiki-Measurements, a dataset for extracting measurement context for given quantities.

    The datasets are heuristically generated from Wikipedia articles and Wikidata facts. For a detailed description of the datasets, please refer to the upcoming corresponding paper:

    Wiki-Quantities and Wiki-Measurements: Datasets of Quantities and their Measurement Context from Wikipedia. 2025. Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, and Detlef Stolten.

    Versions

    The datasets are released in different versions:

    • Processing level: the pre-processed versions can be used directly for training and evaluating models, while the raw versions can be used to create custom pre-processed versions or for other purposes. Wiki-Quantities is pre-processed for IOB sequence labeling (see the sketch after this list), while Wiki-Measurements is pre-processed for SQuAD-style generative question answering.
    • Filtering level:
      • Wiki-Quantities is available in a raw, large, small, and tiny version: The raw version is the original version, which includes all the samples originally obtained. In the large version, all duplicates and near duplicates present in the raw version are removed. The small and tiny versions are subsets of the large version which are additionally filtered to balance the data with respect to units, properties, and topics.
      • Wiki-Measurements is available in a large, small, large_strict, small_strict, small_context, and large_strict_context version: The large version contains all examples minus a few duplicates. The small version is a subset of the large version with very similar examples removed. In the context versions, additional sentences are added around the annotated sentence. In the strict versions, the quantitative facts are more strictly aligned with the text.
    • Quality: all data has been automatically annotated using heuristics. In contrast to the silver data, the gold data has been manually curated.
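
    To make the IOB target format concrete, here is a minimal Python sketch; whitespace tokenization and the QUANT label name are assumptions for illustration, not necessarily what the pre-processed files use.

    ```python
    # Turn one annotated quantity span into IOB token labels.
    # Whitespace tokenization and the "QUANT" label name are assumed.
    def to_iob(tokens, span_tokens):
        labels = ["O"] * len(tokens)
        for i in range(len(tokens) - len(span_tokens) + 1):
            if tokens[i:i + len(span_tokens)] == span_tokens:
                labels[i] = "B-QUANT"
                for j in range(i + 1, i + len(span_tokens)):
                    labels[j] = "I-QUANT"
        return labels

    tokens = "This sail added another 0.5 kn .".split()
    print(to_iob(tokens, ["0.5", "kn"]))
    # ['O', 'O', 'O', 'O', 'B-QUANT', 'I-QUANT', 'O']
    ```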

    Format

    The datasets are stored in JSON format. The pre-processed versions are formatted for direct use in IOB sequence labeling or SQuAD-style generative question answering with NLP frameworks such as Huggingface Transformers. In the versions that are not pre-processed, annotations are visualized using emojis to facilitate curation. For example:

    • Wiki-Quantities (only quantities annotated):
      • "In a 🍏100-gram🍏 reference amount, almonds supply 🍏579 kilocalories🍏 of food energy."
      • "Extreme heat waves can raise readings to around and slightly above 🍏38 °C🍏, and arctic blasts can drop lows to 🍏−23 °C to 0 °F🍏."
      • "This sail added another 🍏0.5 kn🍏."
    • Wiki-Measurements (measurement context for a single quantity; qualifiers and quantity modifiers are only sparsely annotated):
      • "The 🔭French national census🔭 of 📆2018📆 estimated the 🍊population🍊 of 🌶️Metz🌶️ to be 🍐116,581🍐, while the population of Metz metropolitan area was about 368,000."
      • "The 🍊surface temperature🍊 of 🌶️Triton🌶️ was 🔭recorded by Voyager 2🔭 as 🍐-235🍐 🍓°C🍓 (-391 °F)."
      • "🙋The Babylonians🙋 were able to find that the 🍊value🍊 of 🌶️pi🌶️ was ☎️slightly greater than☎️ 🍐3🍐, by simply 🔭making a big circle and then sticking a piece of rope onto the circumference and the diameter, taking note of their distances, and then dividing the circumference by the diameter🔭."

    The mapping of annotation types to emojis is as follows:

    • Basic quantitative statement:
      • Entity: 🌶️
      • Property: 🍊
      • Quantity: 🍏
      • Value: 🍐
      • Unit: 🍓
      • Quantity modifier: ☎️
    • Qualifier:
      • Temporal scope: 📆
      • Start time: ⏱️
      • End time: ⏰️
      • Location: 📍
      • Reference: 🙋
      • Determination method: 🔭
      • Criterion used: 📏
      • Applies to part: 🦵
      • Scope: 🔎
      • Some qualifier: 🛁
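
    Because each annotation is a matched pair of identical emojis, spans can be recovered with a simple split. A minimal sketch, assuming well-formed pairs as in the examples above:

    ```python
    # Extract annotated spans from a raw-version string, assuming every
    # annotation is a matched pair of identical emoji markers.
    def extract_spans(text, marker="🍏"):
        parts = text.split(marker)
        return parts[1::2]  # odd-indexed segments lie between marker pairs

    sentence = ("In a 🍏100-gram🍏 reference amount, almonds supply "
                "🍏579 kilocalories🍏 of food energy.")
    print(extract_spans(sentence))  # ['100-gram', '579 kilocalories']
    ```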

    Note that for each version of Wiki-Measurements sample IDs are randomly assigned. Therefore, they are not consistent, e.g., between silver small and silver large. The proportions of train, dev, and test sets are unusual because Wiki-Quantities and Wiki-Measurements are intended to be used in conjunction with other non-heuristically generated data.

    Evaluation

    The evaluation directories contain the manually validated random samples used for evaluation. The evaluation is based on the large versions of the datasets. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements showed that 100% of the Wiki-Quantities samples and 94% (or 84% if strictly scored) of the Wiki-Measurements samples were correct.

    License

    In accordance with Wikipedia's and Wikidata's licensing terms, the datasets are released under the CC BY-SA 4.0 license, except for Wikidata facts in ./Wiki-Measurements/raw/additional_data.json, which are released under the CC0 1.0 license (the texts are still CC BY-SA 4.0).

    About Us

    We are the Institute of Climate and Energy Systems (ICE) - Jülich Systems Analysis, belonging to Forschungszentrum Jülich. Our interdisciplinary department's research focuses on energy-related process and systems analyses. Data searches and system simulations are used to determine energy and mass balances, as well as to evaluate the performance, emissions, and costs of energy systems. The results are used for performing comparative assessment studies between the various systems. Our current priorities include the development of energy strategies, in accordance with the German Federal Government’s greenhouse gas reduction targets, by designing new infrastructures for sustainable and secure energy supply chains and by conducting cost analysis studies for integrating new technologies into future energy market frameworks.

    Acknowledgements

    The authors would like to thank the German Federal Government, the German State Governments, and the Joint Science Conference (GWK) for their funding and support as part of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) – project number: 442146713. Furthermore, this work was supported by the Helmholtz Association under the program "Energy System Design".

  16. Earth Observation Big Data Service Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Earth Observation Big Data Service Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-earth-observation-big-data-service-market
    Explore at:
    Available download formats: csv, pptx, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Earth, Global
    Description

    Earth Observation Big Data Service Market Outlook

    As of 2023, the global market size for Earth Observation Big Data Services is estimated at approximately $8.5 billion, and it is projected to reach $18.7 billion by 2032, growing at a CAGR of 9.1% during the forecast period. This robust growth can be attributed to several factors, including advancements in satellite technology, increasing demand for real-time data analysis, and the growing application of big data analytics across various industries.
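
    A quick sanity check of the stated CAGR against those endpoints (2023 to 2032 is nine compounding years):

    ```python
    # Implied CAGR from the figures above: 8.5 -> 18.7 billion USD over 9 years.
    start, end, years = 8.5, 18.7, 9
    cagr = (end / start) ** (1 / years) - 1
    print(f"implied CAGR ≈ {cagr:.1%}")  # ≈ 9.2%, matching the stated 9.1% after rounding
    ```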

    The primary growth factor driving the Earth Observation Big Data Service market is the significant advancements in satellite technologies. The development of high-resolution imaging satellites and the launch of numerous small satellites (CubeSats) have revolutionized the way data is captured and utilized from space. These advancements have enhanced the accuracy and frequency of Earth observation data, making it more beneficial for diverse applications such as climate monitoring, agriculture, and disaster management. Additionally, the decreasing cost of launching satellites has made it more accessible for various sectors to leverage Earth observation data, thereby broadening the market's scope.

    Another crucial growth factor is the increasing demand for real-time data analysis. In today's data-driven world, organizations across various sectors require timely and accurate information to make informed decisions. Earth observation data, when combined with big data analytics, provides insightful and actionable information that can be used for immediate decision-making. For example, in agriculture, real-time data on weather conditions, soil moisture, and crop health can significantly enhance yield and efficiency. Similarly, in disaster management, real-time data on natural calamities can drastically improve response times and mitigate damage. This demand for real-time data analysis is expected to propel the market further.

    The growing application of big data analytics in various industries is also a significant driver of the Earth Observation Big Data Service market. Industries such as agriculture, forestry, urban planning, and defense are increasingly leveraging big data analytics to optimize operations, reduce costs, and improve decision-making. In the defense sector, for instance, big data analytics is used for surveillance, reconnaissance, and intelligence gathering, which are vital for national security. The integration of advanced analytics with Earth observation data has opened new frontiers for innovation and efficiency, thus driving market growth.

    The rise of Commercial Satellite Imaging has played a pivotal role in the evolution of Earth Observation Big Data Services. By providing high-resolution images of the Earth's surface, commercial satellites have enabled a more detailed and comprehensive understanding of various geographical and environmental phenomena. This capability is not only beneficial for scientific research but also for practical applications such as urban planning, agriculture, and disaster management. The accessibility of commercial satellite data has democratized the use of satellite imagery, allowing a wider range of industries to leverage this technology for enhanced decision-making and strategic planning.

    Regional outlook for the Earth Observation Big Data Service market indicates significant growth across all major regions, with North America and Europe leading the charge due to their advanced technological infrastructure and substantial investments in satellite technology. Asia Pacific is expected to witness the highest growth rate, driven by rapid industrialization and increasing governmental focus on space programs. Latin America and the Middle East & Africa are also anticipated to show considerable growth, albeit at a slower pace compared to other regions.

    Service Type Analysis

    The Earth Observation Big Data Service market is segmented by service type into Data Acquisition, Data Processing, Data Analysis, and Data Visualization. Data Acquisition involves the collection of raw data from various satellite sources. This segment is critical as it forms the foundation upon which other services build. The advancements in satellite technology and the proliferation of CubeSats have made data acquisition more efficient and frequent, enhancing the overall quality and quantity of data collected.

    Data Processing is the next crucial segment, involving the transformatio

  17. GlycCompSoft: Software for Automated Comparison of Low Molecular Weight Heparins Using Top-Down LC/MS Data

    • figshare.com
    • plos.figshare.com
    tiff
    Updated Dec 13, 2016
    Cite
    Xiaohua Wang; Xinyue Liu; Lingyun Li; Fuming Zhang; Min Hu; Fuji Ren; Lianli Chi; Robert J. Linhardt (2016). GlycCompSoft: Software for Automated Comparison of Low Molecular Weight Heparins Using Top-Down LC/MS Data [Dataset]. http://doi.org/10.1371/journal.pone.0167727
    Explore at:
    Available download formats: tiff
    Dataset updated
    Dec 13, 2016
    Dataset provided by
    PLOS ONE
    Authors
    Xiaohua Wang; Xinyue Liu; Lingyun Li; Fuming Zhang; Min Hu; Fuji Ren; Lianli Chi; Robert J. Linhardt
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Low molecular weight heparins are complex polycomponent drugs that have recently become amenable to top-down analysis using liquid chromatography-mass spectrometry. Even using open source deconvolution software (DeconTools) and automatic structural assignment software (GlycReSoft), the comparison of two or more low molecular weight heparins is extremely time-consuming, taking about a week for an expert analyst, and provides no guarantee of accuracy. Efficient data processing tools are required to improve analysis. This study uses Microsoft Excel™ Visual Basic for Applications to extend Excel's standard functionality with macro functions and specific mathematical modules for mass spectrometric data processing. The program developed enables the comparison of top-down analytical glycomics data on two or more low molecular weight heparins. The current study describes this new program, GlycCompSoft, which has a low error rate and good time efficiency in the automatic processing of large data sets. The experimental results, based on three lots of Lovenox®, Clexane®, and three generic enoxaparin samples, show that the run time of GlycCompSoft decreases from 11 to 2 seconds when the data processed decreases from 18000 to 1500 rows.

  18. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

    Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.

  19. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r

    the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active-duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    - download the fixed-width file containing household, family, and person records
    - import by separating this file into three tables, then merge 'em together at the person-level
    - download the fixed-width file containing the person-level replicate weights
    - merge the rectangular person-level file with the replicate weights, then store it in a sql database
    - create a new variable - one - in the data table

    2012 asec - analysis examples.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - perform a boatload of analysis examples

    replicate census estimates - 2011.R
    - connect to the sql database created by the 'download all microdata' program
    - create the complex sample survey object, using the replicate weights
    - match the sas output shown in the png file 2011 asec replicate weight sas output.png (statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document)

    click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    - the census bureau's current population survey page
    - the bureau of labor statistics' current population survey page
    - the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
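    a minimal sketch of that import-and-analyze workflow, assuming the nber sas importation script and a person-records-only asec extract have already been downloaded; the file names, weight variables, and replicate-weight settings below are assumptions - verify them against the census replicate weights usage instructions document.

    ```r
    library(SAScii)   # read.SAScii(): read a fixed-width file using SAS INPUT code
    library(survey)   # svrepdesign()/svymean(): replicate-weight survey analysis

    # import the person-level records with the nber sas layout (assumed file names)
    cps_person <- read.SAScii("asec2012_pubuse.dat", "cpsmar2012.sas")

    # build the complex sample survey object from the replicate weights
    # (marsupwt and pwwgt1-pwwgt160 are the asec weight variables; the
    # type/rho values are assumptions -- check the census documentation)
    cps_design <- svrepdesign(
      data       = cps_person,
      weights    = ~marsupwt,
      repweights = "pwwgt[1-9]",
      type       = "Fay",
      rho        = 0.5
    )

    # example estimate: mean total person income, with a design-correct standard error
    svymean(~ptotval, cps_design, na.rm = TRUE)
    ```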

  20. Massive Bank dataset ( 1 Million+ rows)

    • kaggle.com
    zip
    Updated Feb 21, 2023
    Cite
    K S ABISHEK (2023). Massive Bank dataset ( 1 Million+ rows) [Dataset]. https://www.kaggle.com/datasets/ksabishek/massive-bank-dataset-1-million-rows
    Explore at:
    zip (32471013 bytes)
    Available download formats
    Dataset updated
    Feb 21, 2023
    Authors
    K S ABISHEK
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Greetings, fellow analysts!

    (NOTE: This is a random dataset generated using Python. It bears no resemblance to any real entity in the corporate world; any resemblance is a matter of coincidence.)

    REC-SSEC Bank is a government-aided bank operating in the Indian Peninsula, with regional branches in 40+ regions of the country. You have been provided with a massive excel sheet containing the transaction details: the date, the business domain, the location, the total transaction value, and the total transaction count.

    The dataset is described as follows:

    1. Date - The date on which the transactions took place.
    2. Domain - The type of business entity that made the transaction.
    3. Location - Where the data was collected.
    4. Value - The total value of transactions.
    5. Count - The total count of transactions.

    For example, the very first row can be read as: "On the first of January 2022, 1,932 transactions summing up to INR 365,554 were reported from Bhuj." NOTE: There are about 2,750 transaction records every single day, and all of them have been given to you.

    The bank wants you to answer the following questions (a minimal R sketch of these aggregations follows the list):

    1. What is the average daily transaction value for each domain over the year?
    2. What is the average transaction value for every city/location over the year?
    3. The bank CEO, Mr. Hariharan, wants to promote the ease of transactions for the most active domain. If the domains were sorted into a priority list, what would that list be?
    4. What is the average transaction count for each city?
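    A minimal sketch of those aggregations, assuming the sheet has been exported to CSV with columns named Date, Domain, Location, Value, and Count (the file name and column names are assumptions based on the field list above):

    ```r
    library(readr)
    library(dplyr)

    bank <- read_csv("massive_bank.csv", show_col_types = FALSE)

    # Q1: average daily transaction value for each domain over the year
    avg_value_by_domain <- bank %>%
      group_by(Domain) %>%
      summarise(avg_daily_value = mean(Value), .groups = "drop")

    # Q2: average transaction value for every city/location over the year
    avg_value_by_city <- bank %>%
      group_by(Location) %>%
      summarise(avg_value = mean(Value), .groups = "drop")

    # Q3: priority list -- domains ranked by total transaction count
    domain_priority <- bank %>%
      group_by(Domain) %>%
      summarise(total_count = sum(Count), .groups = "drop") %>%
      arrange(desc(total_count))

    # Q4: average transaction count for each city
    avg_count_by_city <- bank %>%
      group_by(Location) %>%
      summarise(avg_count = mean(Count), .groups = "drop")
    ```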