100+ datasets found
  1. Data from: Teaching and Learning Data Visualization: Ideas and Assignments

    • tandf.figshare.com
    • figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Deborah Nolan; Jamis Perrett (2023). Teaching and Learning Data Visualization: Ideas and Assignments [Dataset]. http://doi.org/10.6084/m9.figshare.1627940.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Deborah Nolan; Jamis Perrett
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This article discusses how to make statistical graphics a more prominent element of the undergraduate statistics curricula. The focus is on several different types of assignments that exemplify how to incorporate graphics into a course in a pedagogically meaningful way. These assignments include having students deconstruct and reconstruct plots, copy masterful graphs, create one-minute visual revelations, convert tables into “pictures,” and develop interactive visualizations, for example, with the virtual earth as a plotting canvas. In addition to describing the goals and details of each assignment, we also discuss the broader topic of graphics and key concepts that we think warrant inclusion in the statistics curricula. We advocate that more attention needs to be paid to this fundamental field of statistics at all levels, from introductory undergraduate through graduate level courses. With the rapid rise of tools to visualize data, for example, Google trends, GapMinder, ManyEyes, and Tableau, and the increased use of graphics in the media, understanding the principles of good statistical graphics, and having the ability to create informative visualizations is an ever more important aspect of statistics education. Supplementary materials containing code and data for the assignments are available online.

  2. Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    Updated Jun 23, 2024
    Cite
    Technavio (2024). Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), Middle East and Africa (UAE), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-analytics-market-industry-analysis
    Explore at:
    Dataset updated
    Jun 23, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global
    Description


    Data Analytics Market Size 2025-2029

    The data analytics market size is forecast to increase by USD 288.7 billion, at a CAGR of 14.7% between 2024 and 2029.

    The market is driven by the extensive use of modern technology in company operations, enabling businesses to extract valuable insights from their data. The prevalence of the Internet and the increased use of linked and integrated technologies have facilitated the collection and analysis of vast amounts of data from various sources. This trend is expected to continue as companies seek to gain a competitive edge by making data-driven decisions. However, the integration of data from different sources poses significant challenges. Ensuring data accuracy, consistency, and security is crucial as companies deal with large volumes of data from various internal and external sources. Additionally, the complexity of data analytics tools and the need for specialized skills can hinder adoption, particularly for smaller organizations with limited resources. Companies must address these challenges by investing in robust data management systems, implementing rigorous data validation processes, and providing training and development opportunities for their employees. By doing so, they can effectively harness the power of data analytics to drive growth and improve operational efficiency.

    What will be the Size of the Data Analytics Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    In the dynamic and ever-evolving data analytics market, technologies such as explainable AI, time series analysis, data integration, data lakes, algorithm selection, feature engineering, marketing analytics, computer vision, data visualization, financial modeling, real-time analytics, data mining tools, and KPI dashboards continue to unfold and intertwine, shaping the industry's landscape. The application of these technologies spans various sectors, from risk management and fraud detection to conversion rate optimization and social media analytics.

    ETL processes, data warehousing, statistical software, data wrangling, and data storytelling are integral components of the data analytics ecosystem, enabling organizations to extract insights from their data. Cloud computing, deep learning, and data visualization tools further enhance the capabilities of data analytics platforms, allowing for advanced data-driven decision making and real-time analysis. Marketing analytics, clustering algorithms, and customer segmentation are essential for businesses seeking to optimize their marketing strategies and gain a competitive edge. Regression analysis, data visualization tools, and machine learning algorithms are instrumental in uncovering hidden patterns and trends, while predictive modeling and causal inference help organizations anticipate future outcomes and make informed decisions.

    Data governance, data quality, and bias detection are crucial aspects of the data analytics process, ensuring the accuracy, security, and ethical use of data. Supply chain analytics, healthcare analytics, and financial modeling are just a few examples of the diverse applications of data analytics, demonstrating the industry's far-reaching impact. Data pipelines, data mining, and model monitoring are essential for maintaining the continuous flow of data and ensuring the accuracy and reliability of analytics models. The integration of various data analytics tools and techniques continues to evolve as the industry adapts to the ever-changing needs of businesses and consumers alike.

    How is this Data Analytics Industry segmented?

    The data analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023, for the following segments:

    • Component: Services, Software, Hardware
    • Deployment: Cloud, On-premises
    • Type: Prescriptive Analytics, Predictive Analytics, Customer Analytics, Descriptive Analytics, Others
    • Application: Supply Chain Management, Enterprise Resource Planning, Database Management, Human Resource Management, Others
    • Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)

    By Component Insights

    The services segment is estimated to witness significant growth during the forecast period. The market is experiencing significant growth as businesses increasingly rely on advanced technologies to gain insights from their data. Natural language processing is a key component of this trend, enabling more sophisticated analysis of unstructured data. Fraud detection and data security solutions are also in high demand, as companies seek to protect against threats and maintain customer trust. Data analytics platforms, including cloud-based offeri

  3. Big data and business analytics revenue worldwide 2015-2022

    • statista.com
    Updated Nov 22, 2023
    Cite
    Statista (2023). Big data and business analytics revenue worldwide 2015-2022 [Dataset]. https://www.statista.com/statistics/551501/worldwide-big-data-business-analytics-revenue/
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data and business analytics (BDA) market was valued at 168.8 billion U.S. dollars in 2018 and is forecast to grow to 215.7 billion U.S. dollars by 2021. In 2021, more than half of BDA spending will go towards services. IT services is projected to make up around 85 billion U.S. dollars, and business services will account for the remainder.

    Big data: High volume, high velocity and high variety: one or more of these characteristics is used to define big data, the kind of data sets that are too large or too complex for traditional data processing applications. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets. For example, connected IoT devices are projected to generate 79.4 ZBs of data in 2025.

    Business analytics: Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate business insights. The size of the business intelligence and analytics software application market is forecast to reach around 16.5 billion U.S. dollars in 2022. Growth in this market is driven by a focus on digital transformation, a demand for data visualization dashboards, and an increased adoption of cloud services.

  4. Sample CVs Dataset for Analysis

    • kaggle.com
    Updated Aug 19, 2024
    Cite
    lone (2024). Sample CVs Dataset for Analysis [Dataset]. https://www.kaggle.com/datasets/hussnainmushtaq/sample-cvs-dataset-for-analysis
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    lone
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains a small collection of 6 randomly selected CVs (Curriculum Vitae), representing various professional backgrounds. The dataset is intended to serve as a resource for research in fields such as Human Resources (HR), data analysis, natural language processing (NLP), and machine learning. It can be used for tasks like resume parsing, skill extraction, job matching, and analyzing trends in professional qualifications and experiences. Potential Use Cases: This dataset can be used for various research and development purposes, including but not limited to:

    • Resume parsing: Developing algorithms to automatically extract and categorize information from resumes.
    • Skill extraction: Identifying key skills and competencies from text data within the CVs.
    • Job matching: Creating models to match candidates to job descriptions based on their qualifications and experience.
    • NLP research: Analyzing language patterns, sentence structure, and terminology used in professional resumes.
    • HR analytics: Studying trends in career paths, education, and skill development across different professions.
    • Training data for machine learning models: Using the dataset as a sample for training and testing machine learning models in HR-related applications.

    Dataset format: The dataset is available in a compressed file (ZIP) containing the 6 CVs in both PDF and DOCX formats. This allows for flexibility in how the data is processed and analyzed.
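A minimal sketch of the skill-extraction use case mentioned above, using simple keyword matching over CV text. The skill vocabulary and the sample sentence are illustrative assumptions, not content from the dataset itself; a real pipeline would use a much larger vocabulary or an NLP model.

```python
# Hedged sketch: keyword-based skill extraction from CV text.
# The SKILLS vocabulary and sample text are invented for illustration.
import re

SKILLS = {"python", "sql", "excel", "machine learning", "project management"}

def extract_skills(cv_text: str) -> set[str]:
    """Return the known skills that appear in the CV text (case-insensitive)."""
    text = cv_text.lower()
    return {s for s in SKILLS if re.search(r"\b" + re.escape(s) + r"\b", text)}

sample = "Experienced analyst skilled in Python, SQL, and project management."
print(sorted(extract_skills(sample)))  # ['project management', 'python', 'sql']
```

Matching against a fixed vocabulary keeps the example transparent; swapping in a trained named-entity model would be the natural next step for job-matching experiments.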

    Licensing: This dataset is shared under the CC BY-NC-SA 4.0 license. This means that you are free to:

    • Share: Copy and redistribute the material in any medium or format.
    • Adapt: Remix, transform, and build upon the material.

    Under the following terms:

    • Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
    • NonCommercial: You may not use the material for commercial purposes.
    • ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

    Citation: If you use this dataset in your research or projects, please cite it as follows:

    "Sample CVs Dataset for Analysis, Mushtaq et al., Kaggle, 2024."

    Limitations and Considerations:

    • Sample size: The dataset contains only 6 CVs, which is a very small sample size. It is intended for educational and prototyping purposes rather than large-scale analysis.
    • Anonymization: Personal details such as names, contact information, and specific locations may be anonymized or altered to protect privacy.
    • Bias: The dataset is not representative of the entire population and may contain biases related to profession, education, and experience.

    This dataset is a useful starting point for developing models or conducting small-scale experiments in HR-related fields. However, users should be aware of its limitations and consider supplementing it with additional data for more robust analysis.

  5. Data from: Replication package for the paper: "A Study on the Pythonic...

    • zenodo.org
    zip
    Updated Nov 10, 2023
    Cite
    Anonymous; Anonymous (2023). Replication package for the paper: "A Study on the Pythonic Functional Constructs' Understandability" [Dataset]. http://doi.org/10.5281/zenodo.10101383
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package for A Study on the Pythonic Functional Constructs' Understandability

    This package contains several folders and files with code and data used in the study.


    examples/
    Contains the code snippets used as objects of the study, named as reported in Table 1, summarizing the experiment design.

    RQ1-RQ2-files-for-statistical-analysis/
    Contains three .csv files used as input for conducting the statistical analysis and drawing the graphs for addressing the first two research questions of the study. Specifically:

    - ConstructUsage.csv contains the declared frequency usage of the three functional constructs object of the study. This file is used to draw Figure 4.
    - RQ1.csv contains the collected data used for the mixed-effect logistic regression relating the use of functional constructs with the correctness of the change task, and the logistic regression relating the use of map/reduce/filter functions with the correctness of the change task.
    - RQ1Paired-RQ2.csv contains the collected data used for the ordinal logistic regression of the relationship between the perceived ease of understanding of the functional constructs and (i) participants' usage frequency, and (ii) constructs' complexity (except for map/reduce/filter).
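The package's own analyses are reproduced by the R script and notebook it ships with; purely as an illustration of the simpler of the RQ1 models (a plain logistic regression relating construct use to task correctness, without the participant random effects of the mixed-effect model), here is a hedged sketch on synthetic data. All variable names and the simulated effect size are assumptions.

```python
# Hedged sketch: logistic regression of task correctness on construct use.
# Synthetic data only; the real analysis is in FuncConstructs-Statistics.r.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
uses_construct = rng.integers(0, 2, size=n)     # 1 = snippet uses the functional construct
latent = -0.5 + 1.2 * uses_construct            # simulated effect, assumption only
correct = (rng.random(n) < 1 / (1 + np.exp(-latent))).astype(int)

model = LogisticRegression().fit(uses_construct.reshape(-1, 1), correct)
odds_ratio = float(np.exp(model.coef_[0, 0]))   # exponentiated coefficient
print(f"odds ratio of a correct answer when the construct is used: {odds_ratio:.2f}")
```

The exponentiated coefficient is the quantity such a model reports: how the odds of a correct change task shift when the construct is present.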

    inter-rater-RQ3-files/
    Contains four .csv files used as input for computing the inter-rater agreement for the manual labeling used for addressing RQ3. Specifically, you will find one file for each functional construct, i.e., comprehension.csv, lambda.csv, and mrf.csv, and a different file used for highlighting the reasons why participants prefer to use the procedural paradigm, i.e., procedural.csv.
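To show how agreement on such manual labels is typically quantified, here is a small self-contained implementation of unweighted Cohen's kappa. The two raters' label sequences are illustrative stand-ins, not data from the .csv files above.

```python
# Hedged sketch: unweighted Cohen's kappa for two raters' category labels.
# The label sequences are invented examples, not the study's actual codes.
from collections import Counter

def cohen_kappa(r1, r2):
    """Observed agreement corrected for chance agreement."""
    assert len(r1) == len(r2)
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    expected = sum(c1[k] * c2[k] for k in c1) / n**2   # chance agreement
    return (observed - expected) / (1 - expected)

rater_a = ["readability", "conciseness", "habit", "readability", "habit"]
rater_b = ["readability", "conciseness", "habit", "conciseness", "habit"]
print(round(cohen_kappa(rater_a, rater_b), 2))  # → 0.71
```

Values near 1 indicate strong agreement beyond chance; disagreements on the labeling would be resolved by discussion, as is standard in open-coding studies.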

    Questionnaire-Example.pdf
    This file contains the questionnaire submitted to one of the ten experimental groups within our controlled experiment. Other questionnaires are similar, except for the code snippets used for the first section, i.e., change tasks, and the second section, i.e., comparison tasks.

    RQ2ManualValidation.csv
    This file contains the results of the manual validation performed to sanitize the answers provided by our participants for addressing RQ2. Specifically, we coded each behavior description using four different levels: (i) correct, (ii) somewhat correct, (iii) wrong, and (iv) automatically generated.

    RQ3ManualValidation.xlsx
    This file contains the results of the open coding applied to address our third research question. Specifically, you will find four sheets, one for each functional construct and one for the procedural paradigm. For each sheet, you will find the provided answers together with the categories assigned to them.

    Appendix.pdf
    This file contains the results of the logistic regression relating the use of map, filter, and reduce functions with the correctness of the change task, not shown in the paper.

    FuncConstructs-Statistics.r
    This file contains an R script that you can reuse to re-run all the analyses conducted and discussed in the paper.

    FuncConstructs-Statistics.ipynb
    This file contains a notebook for re-executing all the analyses conducted in the paper.

  6. Advancing Open and Reproducible Water Data Science by Integrating Data...

    • hydroshare.org
    • beta.hydroshare.org
    zip
    Updated Jan 9, 2024
    Cite
    Jeffery S. Horsburgh (2024). Advancing Open and Reproducible Water Data Science by Integrating Data Analytics with an Online Data Repository [Dataset]. https://www.hydroshare.org/resource/45d3427e794543cfbee129c604d7e865
    Explore at:
    Available download formats: zip (50.9 MB)
    Dataset updated
    Jan 9, 2024
    Dataset provided by
    HydroShare
    Authors
    Jeffery S. Horsburgh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scientific and related management challenges in the water domain require synthesis of data from multiple domains. Many data analysis tasks are difficult because datasets are large and complex; standard formats for data types are not always agreed upon nor mapped to an efficient structure for analysis; water scientists may lack training in methods needed to efficiently tackle large and complex datasets; and available tools can make it difficult to share, collaborate around, and reproduce scientific work. Overcoming these barriers to accessing, organizing, and preparing datasets for analyses will be an enabler for transforming scientific inquiries.

    Building on the HydroShare repository's established cyberinfrastructure, we have advanced two packages for the Python language that make data loading, organization, and curation for analysis easier, reducing time spent in choosing appropriate data structures and writing code to ingest data. These packages enable automated retrieval of data from HydroShare and the USGS's National Water Information System (NWIS), loading of data into performant structures keyed to specific scientific data types that integrate with existing visualization, analysis, and data science capabilities available in Python, and then writing analysis results back to HydroShare for sharing and eventual publication. These capabilities reduce the technical burden for scientists associated with creating a computational environment for executing analyses, because the packages are installed and maintained within CUAHSI's HydroShare-linked JupyterHub server. HydroShare users can leverage these tools to build, share, and publish more reproducible scientific workflows.

    The HydroShare Python Client and USGS NWIS Data Retrieval packages can be installed within a Python environment on any computer running Microsoft Windows, Apple macOS, or Linux from the Python Package Index using pip. They can also be used online via the CUAHSI JupyterHub server (https://jupyterhub.cuahsi.org/) or other Python notebook environments like Google Colaboratory (https://colab.research.google.com/). Source code, documentation, and examples for the software are freely available on GitHub at https://github.com/hydroshare/hsclient/ and https://github.com/USGS-python/dataretrieval.
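A hedged sketch of the retrieval workflow this resource describes, using the USGS dataretrieval package named above. The site number, date range, and exact call signature are assumptions; the fetch is wrapped so the sketch degrades gracefully when the package or network access is unavailable.

```python
# Hedged sketch: retrieving NWIS daily values via the dataretrieval package.
# Site number and dates are illustrative; consult the package docs for the API.

def nwis_request(site: str, start: str, end: str, service: str = "dv") -> dict:
    """Assemble keyword arguments for a daily-values NWIS query."""
    return {"sites": site, "start": start, "end": end, "service": service}

params = nwis_request("10109000", "2023-01-01", "2023-01-31")  # hypothetical gauge

try:
    from dataretrieval import nwis                # package named in the text
    daily_values = nwis.get_record(**params)      # returns a time-indexed DataFrame
    print(daily_values.head())
except Exception as exc:                          # package missing or no network
    print(f"skipping live NWIS fetch: {exc}")
```

Results loaded this way land in pandas structures, which is what lets them flow into the visualization and analysis tooling, and back into HydroShare, as described above.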

    This presentation was delivered as part of the Hawai'i Data Science Institute's regular seminar series: https://datascience.hawaii.edu/event/data-science-and-analytics-for-water/

  7. student data analysis

    • kaggle.com
    Updated Nov 17, 2023
    Cite
    maira javeed (2023). student data analysis [Dataset]. https://www.kaggle.com/datasets/mairajaveed/student-data-analysis
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    maira javeed
    Description

    In this project, we aim to analyze and gain insights into the performance of students based on various factors that influence their academic achievements. We have collected data related to students' demographic information, family background, and their exam scores in different subjects.

    Key Objectives:

    1. Performance Evaluation: Evaluate and understand the academic performance of students by analyzing their scores in various subjects.

    2. Identifying Underlying Factors: Investigate factors that might contribute to variations in student performance, such as parental education, family size, and student attendance.

    3. Visualizing Insights: Create data visualizations to present the findings effectively and intuitively.

    Dataset Details:

    • The dataset used in this analysis contains information about students, including their age, gender, parental education, lunch type, and test scores in subjects like mathematics, reading, and writing.

    Analysis Highlights:

    • We will perform a comprehensive analysis of the dataset, including data cleaning, exploration, and visualization to gain insights into various aspects of student performance.

    • By employing statistical methods and machine learning techniques, we will determine the significant factors that affect student performance.
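The exploration step described above can be sketched with pandas on synthetic records. The column names (parental_education, math_score, reading_score) are assumptions mirroring the dataset description, not its actual schema.

```python
# Hedged sketch: a first exploratory cut at student performance data.
# Records are synthetic; column names are assumed from the description above.
import pandas as pd

students = pd.DataFrame({
    "parental_education": ["high school", "bachelor", "bachelor",
                           "master", "high school", "master"],
    "math_score":    [62, 74, 71, 88, 58, 90],
    "reading_score": [65, 70, 75, 85, 60, 92],
})

# Average scores per parental-education level, a typical first summary.
summary = (students
           .groupby("parental_education")[["math_score", "reading_score"]]
           .mean()
           .round(1))
print(summary)
```

From a summary like this, the analysis would proceed to visualizations and statistical tests of which factors are significantly associated with the score differences.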

    Why This Matters:

    Understanding the factors that influence student performance is crucial for educators, policymakers, and parents. This analysis can help in making informed decisions to improve educational outcomes and provide support where it is most needed.

    Acknowledgments:

    We would like to express our gratitude to [mention any data sources or collaborators] for making this dataset available.

    Please Note:

    This project is meant for educational and analytical purposes. The dataset used is fictitious and does not represent any specific educational institution or individuals.

  8. Big Data Analytics for Clinical Research Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 30, 2025
    Cite
    Growth Market Reports (2025). Big Data Analytics for Clinical Research Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/big-data-analytics-for-clinical-research-market-global-industry-analysis
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jun 30, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Big Data Analytics for Clinical Research Market Outlook



    As per our latest research, the Big Data Analytics for Clinical Research market size reached USD 7.45 billion globally in 2024, reflecting a robust adoption pace driven by the increasing digitization of healthcare and clinical trial processes. The market is forecasted to grow at a CAGR of 17.2% from 2025 to 2033, reaching an estimated USD 25.54 billion by 2033. This significant growth is primarily attributed to the rising need for real-time data-driven decision-making, the proliferation of electronic health records (EHRs), and the growing emphasis on precision medicine and personalized healthcare solutions. The industry is experiencing rapid technological advancements, making big data analytics a cornerstone in transforming clinical research methodologies and outcomes.




    Several key growth factors are propelling the expansion of the Big Data Analytics for Clinical Research market. One of the primary drivers is the exponential increase in clinical data volumes from diverse sources, including EHRs, wearable devices, genomics, and imaging. Healthcare providers and research organizations are leveraging big data analytics to extract actionable insights from these massive datasets, accelerating drug discovery, optimizing clinical trial design, and improving patient outcomes. The integration of artificial intelligence (AI) and machine learning (ML) algorithms with big data platforms has further enhanced the ability to identify patterns, predict patient responses, and streamline the entire research process. These technological advancements are reducing the time and cost associated with clinical research, making it more efficient and effective.




    Another significant factor fueling market growth is the increasing collaboration between pharmaceutical & biotechnology companies and technology firms. These partnerships are fostering the development of advanced analytics solutions tailored specifically for clinical research applications. The demand for real-world evidence (RWE) and real-time patient monitoring is rising, particularly in the context of post-market surveillance and regulatory compliance. Big data analytics is enabling stakeholders to gain deeper insights into patient populations, treatment efficacy, and adverse event patterns, thereby supporting evidence-based decision-making. Furthermore, the shift towards decentralized and virtual clinical trials is creating new opportunities for leveraging big data to monitor patient engagement, adherence, and safety remotely.




    The regulatory landscape is also evolving to accommodate the growing use of big data analytics in clinical research. Regulatory agencies such as the FDA and EMA are increasingly recognizing the value of data-driven approaches for enhancing the reliability and transparency of clinical trials. This has led to the establishment of guidelines and frameworks that encourage the adoption of big data technologies while ensuring data privacy and security. However, the implementation of stringent data protection regulations, such as GDPR and HIPAA, poses challenges related to data integration, interoperability, and compliance. Despite these challenges, the overall outlook for the Big Data Analytics for Clinical Research market remains highly positive, with sustained investments in digital health infrastructure and analytics capabilities.




    From a regional perspective, North America currently dominates the Big Data Analytics for Clinical Research market, accounting for the largest share due to its advanced healthcare infrastructure, high adoption of digital technologies, and strong presence of leading pharmaceutical companies. Europe follows closely, driven by increasing government initiatives to promote health data interoperability and research collaborations. The Asia Pacific region is emerging as a high-growth market, supported by expanding healthcare IT investments, rising clinical trial activities, and growing awareness of data-driven healthcare solutions. Latin America and the Middle East & Africa are also witnessing gradual adoption, albeit at a slower pace, due to infrastructural and regulatory challenges. Overall, the global market is poised for substantial growth across all major regions over the forecast period.



  9. Matlab example for Local Enrichment Analysis (LEA) analysis with real data

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Aug 29, 2022
    Cite
    Berend Snijder; Yannik Severin (2022). Matlab example for Local Enrichment Analysis (LEA) analysis with real data [Dataset]. http://doi.org/10.5061/dryad.2jm63xssk
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Dryad
    Authors
    Berend Snijder; Yannik Severin
    Time period covered
    2022
    Description

    Code is compatible with Matlab v2020. The corresponding open-source alternative is Octave (https://octave.org/).

  10. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  11. Orange dataset table

    • figshare.com
    xlsx
    Updated Mar 4, 2022
    Cite
    Rui Simões (2022). Orange dataset table [Dataset]. http://doi.org/10.6084/m9.figshare.19146410.v1
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Mar 4, 2022
    Dataset provided by
    figshare
    Authors
    Rui Simões
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The complete dataset used in the analysis comprises 36 samples, each described by 11 numeric features and 1 target. The attributes considered were caspase 3/7 activity, Mitotracker red CMXRos area and intensity (3 h and 24 h incubations with both compounds), Mitosox oxidation (3 h incubation with the referred compounds) and oxidation rate, DCFDA fluorescence (3 h and 24 h incubations with either compound) and oxidation rate, and DQ BSA hydrolysis. The target of each instance corresponds to one of the 9 possible classes (4 samples per class): Control, 6.25, 12.5, 25 and 50 µM for 6-OHDA and 0.03, 0.06, 0.125 and 0.25 µM for rotenone. The dataset is balanced and contains no missing values, and the data were standardized across features. The small number of samples prevented a full and strong statistical analysis of the results. Nevertheless, it allowed the identification of relevant hidden patterns and trends.

    Exploratory data analysis, information gain, hierarchical clustering, and supervised predictive modeling were performed using Orange Data Mining version 3.25.1 [41]. Hierarchical clustering was performed using the Euclidean distance metric and weighted linkage. Cluster maps were plotted to relate the features with higher mutual information (in rows) with instances (in columns), with the color of each cell representing the normalized level of a particular feature in a specific instance. The information is grouped both in rows and in columns by a two-way hierarchical clustering method using the Euclidean distances and average linkage. Stratified cross-validation was used to train the supervised decision tree. A set of preliminary empirical experiments were performed to choose the best parameters for each algorithm, and we verified that, within moderate variations, there were no significant changes in the outcome. The following settings were adopted for the decision tree algorithm: minimum number of samples in leaves: 2; minimum number of samples required to split an internal node: 5; stop splitting when majority reaches: 95%; criterion: gain ratio. The performance of the supervised model was assessed using accuracy, precision, recall, F-measure and area under the ROC curve (AUC) metrics.
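The study's modeling was done in Orange 3.25.1; a comparable setup can be sketched in scikit-learn on synthetic data of the same shape (36 samples, 11 features, 9 classes). Note the approximations: sklearn's entropy criterion stands in for Orange's gain ratio, and the 95% majority stopping rule has no direct sklearn equivalent.

```python
# Hedged re-creation of the reported decision-tree settings in scikit-learn.
# Synthetic data only; the study used Orange Data Mining, whose API differs.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 36 samples, 11 features, 9 balanced classes (4 samples per class), as described.
X, y = make_classification(n_samples=36, n_features=11, n_informative=6,
                           n_classes=9, n_clusters_per_class=1, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy",      # gain-ratio stand-in
                              min_samples_leaf=2,       # settings reported above
                              min_samples_split=5,
                              random_state=0)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
print(f"stratified CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

With only four samples per class, stratified folds are essential so every fold sees every class, which is why the study used stratified cross-validation.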

  12. Sample data files for Python Course

    • figshare.com
    txt
    Updated Nov 4, 2022
    Peter Verhaar (2022). Sample data files for Python Course [Dataset]. http://doi.org/10.6084/m9.figshare.21501549.v1
    Available download formats: txt
    Dataset updated
    Nov 4, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Peter Verhaar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample data set used in an introductory course on Programming in Python

  13. Ecommerce Analytics Reports: Decision-Driven Data Analysis & Conventions That Mean More Than Benchmarks

    • thegood.com
    html
    Updated Dec 4, 2024
    The Good (2024). Ecommerce Analytics Reports: Decision-Driven Data Analysis & Conventions That Mean More Than Benchmarks [Dataset]. https://thegood.com/insights/ecommerce-google-analytics-reports/
    Available download formats: html
    Dataset updated
    Dec 4, 2024
    Dataset authored and provided by
    The Good
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The first step in any new digital experience optimization program is to build a strong understanding of the digital journey. The reason is pretty simple. Whether it’s a software registration experience or an ecommerce path to purchase, our goal is always to identify challenges and present a clear roadmap to address them. But we first […]

  14. Data from: Untargeted metabolomics workshop report: quality control...

    • data.niaid.nih.gov
    xml
    Updated Dec 17, 2020
    Prasad Phapale (2020). Untargeted metabolomics workshop report: quality control considerations from sample preparation to data analysis [Dataset]. https://data.niaid.nih.gov/resources?id=mtbls1301
    Available download formats: xml
    Dataset updated
    Dec 17, 2020
    Dataset provided by
    EMBL
    Authors
    Prasad Phapale
    Variables measured
    tumor, Metabolomics
    Description

    The Metabolomics workshop on experimental and data analysis training for untargeted metabolomics was hosted by the Proteomics Society of India in December 2019. The Workshop included six tutorial lectures and hands-on data analysis training sessions presented by seven speakers. The tutorials and hands-on data analysis sessions focused on workflows for liquid chromatography-mass spectrometry (LC-MS) based on untargeted metabolomics. We review here three main topics from the workshop which were uniquely identified as bottlenecks for new researchers: a) experimental design, b) quality controls during sample preparation and instrumental analysis and c) data quality evaluation. Our objective here is to present common challenges faced by novice researchers and present possible guidelines and resources to address them. We provide resources and good practices for researchers who are at the initial stage of setting up metabolomics workflows in their labs.

    Complete detailed metabolomics/lipidomics protocols are available online at EMBL-MCF protocol including video tutorials.

  15. Data_Sheet_1_Raw Data Visualization for Common Factorial Designs Using SPSS: A Syntax Collection and Tutorial.ZIP

    • frontiersin.figshare.com
    zip
    Updated Jun 2, 2023
    Florian Loffing (2023). Data_Sheet_1_Raw Data Visualization for Common Factorial Designs Using SPSS: A Syntax Collection and Tutorial.ZIP [Dataset]. http://doi.org/10.3389/fpsyg.2022.808469.s001
    Available download formats: zip
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Frontiers
    Authors
    Florian Loffing
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Transparency in data visualization is an essential ingredient for scientific communication. The traditional approach of visualizing continuous quantitative data solely in the form of summary statistics (i.e., measures of central tendency and dispersion) has repeatedly been criticized for not revealing the underlying raw data distribution. Remarkably, however, systematic and easy-to-use solutions for raw data visualization using the most commonly reported statistical software package for data analysis, IBM SPSS Statistics, are missing. Here, a comprehensive collection of more than 100 SPSS syntax files and an SPSS dataset template is presented and made freely available that allow the creation of transparent graphs for one-sample designs, for one- and two-factorial between-subject designs, for selected one- and two-factorial within-subject designs as well as for selected two-factorial mixed designs and, with some creativity, even beyond (e.g., three-factorial mixed-designs). Depending on graph type (e.g., pure dot plot, box plot, and line plot), raw data can be displayed along with standard measures of central tendency (arithmetic mean and median) and dispersion (95% CI and SD). The free-to-use syntax can also be modified to match with individual needs. A variety of example applications of syntax are illustrated in a tutorial-like fashion along with fictitious datasets accompanying this contribution. The syntax collection is hoped to provide researchers, students, teachers, and others working with SPSS a valuable tool to move towards more transparency in data visualization.

  16. Missing data in the analysis of multilevel and dependent data (Examples)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 20, 2023
    Alexander Robitzsch (2023). Missing data in the analysis of multilevel and dependent data (Examples) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7773613
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    Alexander Robitzsch
    Simon Grund
    Oliver Lüdtke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").

    The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:

    ID = group identifier (1-2000)
    x = numeric (Level 1)
    y = numeric (Level 1)
    w = binary (Level 2)

    In all data sets, missing values are coded as "NA".
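    For readers who prefer Python over R, the plain-text ".dat" files described above can be loaded while honoring the "NA" missing-value coding, for example with pandas. The excerpt below is invented for illustration; the real files ship with the book-chapter repository.

```python
import io
import pandas as pd

# Hypothetical excerpt of one of the plain-text ".dat" files, with the
# four documented variables and "NA" marking missing values.
sample = io.StringIO(
    "ID x y w\n"
    "1 0.52 NA 1\n"
    "1 NA 2.10 1\n"
    "2 1.33 0.87 NA\n"
)
df = pd.read_csv(sample, sep=r"\s+", na_values="NA")
print(df.isna().sum().sum())  # 3 missing cells in this excerpt
```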

  17. Synthetic Data for an Imaginary Country, Sample, 2023 - World

    • microdata.worldbank.org
    Updated Jul 7, 2023
    Development Data Group, Data Analytics Unit (2023). Synthetic Data for an Imaginary Country, Sample, 2023 - World [Dataset]. https://microdata.worldbank.org/index.php/catalog/5906
    Dataset updated
    Jul 7, 2023
    Dataset authored and provided by
    Development Data Group, Data Analytics Unit
    Time period covered
    2023
    Area covered
    World
    Description

    Abstract

    The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for training and simulation purposes and is not intended to be representative of any specific country.

    The full-population dataset (with about 10 million individuals) is also distributed as open data.

    Geographic coverage

    The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.

    Analysis unit

    Household, Individual

    Universe

    The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.

    Kind of data

    ssd

    Sampling procedure

    The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
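    The two-stage design described above was implemented in the R script distributed with the dataset; as an illustration only, the same logic might be sketched in Python as follows. The frame of 2,000 enumeration areas (EAs) and the two-stratum split below are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical EA frame: 2,000 enumeration areas in two strata.
eas = pd.DataFrame({
    "ea_id": np.arange(2000),
    "stratum": np.repeat(["urban", "rural"], [1200, 800]),
})
n_select = 320  # 8,000 households / 25 households per EA

# Stage 1: allocate EAs to strata proportionally to stratum size,
# then draw that many EAs at random within each stratum.
shares = eas["stratum"].value_counts(normalize=True)
alloc = (shares * n_select).round().astype(int)
chosen = pd.concat(
    eas[eas["stratum"] == s].sample(n, random_state=1)
    for s, n in alloc.items()
)

# Stage 2 would then draw 25 households at random within each chosen EA,
# e.g. households_in_ea.sample(25); the household frame is omitted here.
print(len(chosen))  # 320 selected EAs
```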

    Mode of data collection

    other

    Research instrument

    The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.

    Cleaning operations

    The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observation were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to result in the distributed data files.

    Response rate

    This is a synthetic dataset; the "response rate" is 100%.

  18. Big Data Market Analysis, Size, and Forecast 2025-2029: North America (US...

    • technavio.com
    Updated Jun 14, 2025
    Technavio (2025). Big Data Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), APAC (Australia, China, India, Japan, and South Korea), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/big-data-market-industry-analysis
    Dataset updated
    Jun 14, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Global
    Description

    Snapshot img

    Big Data Market Size 2025-2029

    The big data market size is forecast to increase by USD 193.2 billion at a CAGR of 13.3% between 2024 and 2029.

    The market is experiencing a significant rise due to the increasing volume of data being generated across industries. This data deluge is driving the need for advanced analytics and processing capabilities to gain valuable insights and make informed business decisions. A notable trend in this market is the rising adoption of blockchain solutions to enhance big data implementation. Blockchain's decentralized and secure nature offers an effective solution to address data security concerns, a growing challenge in the market. However, the increasing adoption of big data also brings forth new challenges. Data security issues persist as organizations grapple with protecting sensitive information from cyber threats and data breaches.
    Companies must navigate these challenges by investing in robust security measures and implementing best practices to mitigate risks and maintain trust with their customers. To capitalize on the market opportunities and stay competitive, businesses must focus on harnessing the power of big data while addressing these challenges effectively. Deep learning frameworks and machine learning algorithms are transforming data science, from data literacy assessments to computer vision models.
    

    What will be the Size of the Big Data Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.

    In today's data-driven business landscape, demand for advanced data management solutions continues to grow. Companies are investing in business intelligence dashboards and data analytics tools to extract insights from their data and make informed decisions; this growing reliance on data, in turn, requires robust data governance policies and regular data compliance audits. Data visualization software helps businesses communicate complex insights, while data engineering ensures data is accessible and processed in real time. Data-driven product development and sound data architecture underpin agile, responsive business strategies.
    Data management spans accessibility standards, privacy policies, quality metrics, and usability guidelines, while prescriptive and predictive modeling, data integrity checks, and data agility assessments are critical for deriving actionable insights. As data becomes an increasingly valuable asset, businesses must prioritize data security and privacy. Trends such as data-driven marketing, data culture surveys, and data storytelling are shaping the future of data-driven businesses.
    

    How is this Big Data Industry segmented?

    The big data industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment: On-premises, Cloud-based, Hybrid
    Type: Services, Software
    End-user: BFSI, Healthcare, Retail and e-commerce, IT and telecom, Others
    Geography: North America (US, Canada), Europe (France, Germany, UK), APAC (Australia, China, India, Japan, South Korea), Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In the realm of big data, on-premise and cloud-based deployment models cater to varying business needs. On-premise deployment allows for complete control over hardware and software, making it an attractive option for some organizations. However, this model comes with a significant upfront investment and ongoing maintenance costs. In contrast, cloud-based deployment offers flexibility and scalability, with service providers handling infrastructure and maintenance. Yet, it introduces potential security risks, as data is accessed through multiple points and stored on external servers. Data

  19. Sample data for analysis of demographic potential of the 15-minute city in...

    • zenodo.org
    bin, txt
    Updated Aug 29, 2024
    Joan Perez; Giovanni Fusco (2024). Sample data for analysis of demographic potential of the 15-minute city in northern and southern France [Dataset]. http://doi.org/10.5281/zenodo.13456826
    Available download formats: bin, txt
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joan Perez; Giovanni Fusco
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    Southern France, France
    Description
    This upload contains two Geopackage files of raw data used for urban analysis in the outskirts of Lille and Nice, France. 
    The data include building footprints (layer "building"), roads (layer "road"), and administrative boundaries (layer "adm_boundaries")
    extracted from version 3.3 of the French dataset BD TOPO®3 (IGN, 2023) for the municipalities of Santes, Hallennes-lez-Haubourdin,
    Haubourdin, and Emmerin in northern France (Geopackage "DPC_59.gpkg") and Drap, Cantaron and La Trinité in southern France
    (Geopackage "DPC_06.gpkg").
     
    Metadata for these layers is available here: https://geoservices.ign.fr/sites/default/files/2023-01/DC_BDTOPO_3-3.pdf
     
    Additionally, this upload contains the results of the following algorithms available in GitHub (https://github.com/perezjoan/emc2-WP2?tab=readme-ov-file)
     
    1. The identification of main streets using the QGIS plugin Morpheo (layers "road_morpheo" and "buffer_morpheo") 
    https://plugins.qgis.org/plugins/morpheo/
    2. The identification of main streets in local contexts – connectivity locally weighted (layer "road_LocRelCon")
    3. Basic morphometry of buildings (layer "building_morpho")
    4. Evaluation of the number of dwellings within inhabited buildings (layer "building_dwellings")
    5. Projecting population potential accessible from main streets (layer "road_pop_results")
     
    Project website: http://emc2-dut.org/
     
    Publications using this sample data: 
    Perez, J. and Fusco, G., 2024. Potential of the 15-Minute Peripheral City: Identifying Main Streets and Population Within Walking Distance. In: O. Gervasi, B. Murgante, C. Garau, D. Taniar, A.M.A.C. Rocha and M.N. Faginas Lago, eds. Computational Science and Its Applications – ICCSA 2024 Workshops. ICCSA 2024. Lecture Notes in Computer Science, vol 14817. Cham: Springer, pp.50-60. https://doi.org/10.1007/978-3-031-65238-7_4.

    Acknowledgement. This work is part of the emc2 project, which received the grant ANR-23-DUTP-0003-01 from the French National Research Agency (ANR) within the DUT Partnership.

  20. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jul 25, 2023
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Available download formats: zip
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
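    The binning step mentioned above (turning a continuous regression label into ordered class labels) can be sketched in a few lines; the cut points below are arbitrary placeholders, not the ones used by the extraction scripts.

```python
import numpy as np

# Continuous regression labels (invented values for illustration).
y_continuous = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.55])

# Three cut points define four ordered classes 0 < 1 < 2 < 3.
bin_edges = [0.25, 0.5, 0.75]
y_ordinal = np.digitize(y_continuous, bin_edges)
print(y_ordinal)  # [0 1 1 3 3 2]
```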

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
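    Replicating one evaluation sample from the index files named above amounts to selecting the listed data items by position. The sketch below illustrates this with a made-up data frame standing in for the extracted CSV data and two made-up index rows standing in for rows of, e.g., app_val_indices.csv.

```python
import pandas as pd

# Stand-in for the extracted data; "class_label" is the ordinal class,
# as described in the Outcome section of this record.
data = pd.DataFrame({
    "class_label": [0, 1, 2, 1, 0, 2, 1, 0],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# Stand-in for two rows of an index file: each row lists the positions
# of the data items that make up one evaluation sample.
index_rows = [[0, 2, 5], [1, 3, 4, 7]]
sample0 = data.iloc[index_rows[0]]
print(sample0["class_label"].tolist())  # [0, 2, 2]
```

    Drawing the items by these stored positions is what makes the published samples exactly reproducible.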

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
