Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeat, Inc., using a locally installed spreadsheet program, Excel, for both my data analysis and visualizations. I made this choice primarily because I live in a remote area with limited bandwidth and inconsistent internet access, so completing a capstone project with web-based programs such as R Studio, SQL Workbench, or Google Sheets was not feasible. My choice of project was further limited because the datasets for the ride-share option were larger than my version of Excel would accept.
In the scenario provided, I act as a Junior Data Analyst supporting the Bellabeat, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that findings from it will reveal insights that assist Bellabeat's marketing strategies for future growth. My task is to provide data-driven insights on the business tasks set by the Bellabeat, Inc. executive and data analysis teams.
To accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). For clarity and accountability, I will break each part of the process down into three sections: Guiding Questions, Key Tasks, and Deliverables. To save space and avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task, using an asterisk (*) as an identifier.
Section 1 - Ask:
A. Guiding Questions:
1. Who are the key stakeholders and what are their goals for the data analysis project?
2. What is the business task that this data analysis project is attempting to solve?
B. Key Tasks:
1. Identify key stakeholders and their goals for the data analysis project
*The key stakeholders for this project are as follows:
-Urška Sršen and Sando Mur, co-founders of Bellabeat, Inc.
-The Bellabeat marketing analytics team, of which I am a member.
Section 2 - Prepare:
A. Guiding Questions:
1. Where is the data stored and organized?
2. Are there any problems with the data?
3. How does the data help answer the business question?
B. Key Tasks:
1. Research and communicate the source of the data, and how it is stored/organized to stakeholders.
*The data source used for our case study is FitBit Fitness Tracker Data. This dataset is hosted on Kaggle and was made available by user Mobius in an open-source format. The data is therefore public and may be copied, modified, and distributed without asking the user for permission. The datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
*Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute-, hour-, and day-level totals and is stored in 18 CSV documents. I downloaded all 18 documents onto my local laptop and decided to use 2 of them for this project because they contain merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
-sleepDay_merged.csv
-dailyActivity_merged.csv
2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
*As will be presented in more detail in the Process section, the data seems to have credibility issues related to the reported time frame of collection. The metadata seems to indicate that the data covers roughly 2 months of FitBit tracking. However, my initial data processing found only 1 month of reported data.
*As will be presented in more detail in the Process section, the data also has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata states that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...
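The two credibility checks described above (the reporting window and the number of distinct users) can also be reproduced outside Excel. The following is a minimal sketch, not the method used in this project. It assumes that the activity file has an `Id` column and an `ActivityDate` column in month/day/year format; those column names, the helper `summarize_activity`, and the inline rows are illustrative placeholders rather than values taken from the real file.

```python
import csv
import io
from datetime import datetime


def summarize_activity(csv_file, id_col="Id", date_col="ActivityDate", date_fmt="%m/%d/%Y"):
    """Return (distinct user count, first date, last date) for a daily activity file.

    The column names and date format are assumptions about the file layout.
    """
    user_ids = set()
    dates = []
    for row in csv.DictReader(csv_file):
        user_ids.add(row[id_col])
        dates.append(datetime.strptime(row[date_col], date_fmt).date())
    return len(user_ids), min(dates), max(dates)


# Made-up placeholder rows, used only so the sketch runs on its own.
sample = io.StringIO(
    "Id,ActivityDate,TotalSteps\n"
    "1001,4/12/2016,13162\n"
    "1001,4/13/2016,10735\n"
    "1002,4/12/2016,8163\n"
    "1002,5/12/2016,7007\n"
)

n_users, first_day, last_day = summarize_activity(sample)
span_days = (last_day - first_day).days
print(f"{n_users} users, {first_day} to {last_day} ({span_days} days)")
```

To run the same check on the downloaded file, the `io.StringIO` sample could be swapped for `open("dailyActivity_merged.csv", newline="")`, and the resulting counts compared against the 30 users and two-month window stated in the metadata.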
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is one of the three example ScRNAseq datasets used to follow the guided example analyses within "A Guide to Single-Cell RNA Sequencing Analysis Using Web-based Tools for Non-Bioinformaticians" in the FEBS Journal. This dataset can be downloaded and imported into a variety of web-based tools and used as a learning device to gain more familiarity with the tools. As described in the paper, this dataset represents the negative control (carrier only).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The Online Shoppers Purchasing Intention dataset captures 12,330 distinct web-session records collected over a one-year span from an e-commerce site, with each session belonging to a different visitor to prevent user- or campaign-specific biases. Originally published in 2017 and licensed under CC BY 4.0, it was curated by Sakar et al. for benchmarking classifiers on independent and identically distributed tabular data.
Features
Numerical (10):
Categorical (7):
Target and Class Distribution
Intended Use
This dataset is ideal for developing and comparing binary classification models (ranging from multilayer perceptrons and LSTM networks to tree-based methods) to predict online purchasing intention in a controlled, time-invariant setting.
https://dataintelo.com/privacy-and-policy
The global healthcare cloud based analytics market size was valued at approximately USD 14.8 billion in 2023, and it is anticipated to reach around USD 54.3 billion by 2032, growing at a compound annual growth rate (CAGR) of 15.7% from 2024 to 2032. One of the primary growth factors influencing this market is the increasing demand for data-driven decision-making processes in healthcare settings to enhance patient outcomes and operational efficiency.
One significant growth factor for the healthcare cloud based analytics market is the rapid digital transformation within the healthcare sector. The transition from paper-based systems to electronic health records (EHRs) and the adoption of telehealth services are driving the need for sophisticated analytics solutions that can process vast amounts of healthcare data. The accessibility and scalability offered by cloud-based solutions make them particularly attractive for healthcare providers looking to leverage patient data for better diagnostic and treatment outcomes.
Moreover, the rising focus on personalized medicine and the need for population health management are propelling the demand for healthcare cloud based analytics. Personalized medicine requires the analysis of large datasets to understand individual patient profiles and predict responses to treatments. Similarly, population health management aims to improve health outcomes by analyzing data to identify trends and intervene proactively. Cloud-based analytics platforms provide the necessary computational power and flexibility to handle these complex data requirements efficiently.
The cost-efficiency of cloud based solutions compared to traditional on-premises systems is another crucial growth driver. Healthcare organizations are under constant pressure to reduce operational costs while improving patient care quality. Cloud-based analytics solutions eliminate the need for significant upfront investments in hardware and software while offering the benefits of scalable resources and reduced IT maintenance costs. This financial advantage is particularly appealing to small and medium-sized healthcare providers who may have limited budgets for technology investments.
The integration of Business Intelligence in Healthcare is transforming the way data is utilized to improve patient care and streamline operations. By employing BI tools, healthcare organizations can analyze vast datasets to uncover insights that drive better decision-making. These tools enable healthcare providers to track patient outcomes, optimize resource allocation, and enhance overall operational efficiency. The ability to visualize data through dashboards and reports allows for a deeper understanding of patient trends and organizational performance, ultimately leading to improved healthcare delivery and patient satisfaction.
From a regional perspective, North America currently holds the largest market share in the healthcare cloud based analytics market, driven by advanced healthcare infrastructure and high adoption rates of digital healthcare technologies. However, regions like Asia Pacific are expected to witness the highest growth rates during the forecast period. Factors such as increasing healthcare expenditures, growing awareness about the benefits of healthcare analytics, and supportive government initiatives are contributing to the market expansion in these regions.
The healthcare cloud based analytics market can be segmented by component into software and services. The software segment includes various analytics platforms and tools designed to process and analyze healthcare data. These software solutions are essential for enabling healthcare providers to harness the power of big data and derive actionable insights. As the volume of healthcare data continues to grow exponentially, the demand for robust and scalable analytics software solutions is expected to increase significantly. Innovations in artificial intelligence and machine learning are also enhancing the capabilities of these software solutions, making them more effective in predictive analytics and decision support.
Cloud Computing in Healthcare is revolutionizing the way healthcare data is stored, accessed, and analyzed. By leveraging cloud technology, healthcar
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the CLUE-LDS (CLoud-based User Entity behavior analytics Log Data Set). The data set contains log events from real users of a cloud storage service and is suitable for User Entity Behavior Analytics (UEBA). Events include logins, file accesses, link shares, config changes, etc. The data set contains around 50 million events generated by more than 5000 distinct users over more than five years (2017-07-07 to 2022-09-29, or 1910 days). The data set is complete except for 109 events missing on 2021-04-22, 2021-08-20, and 2021-09-05 due to database failure. The unpacked file size is around 14.5 GB. A detailed analysis of the data set is provided in [1].
The logs are provided in JSON format with the following attributes in the first level:
In the following data sample, the first object depicts a successful user login (see type: login_successful) and the second object depicts a file access (see type: file_accessed) from a remote location:
{"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", "time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", "id": 21567530, "uidType": "name"}
{"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer/doubtful-plum-ptarmigan-merchant/insufficient-amaranth-earthworm-qualitycontroller/curious-silver-galliform-tradingstandards/incredible-indigo-octopus-printfinisher/wicked-bronze-sloth-claimsmanager/frantic-aquamarine-horse-cleric"}, "type": "file_accessed", "time": "2019-11-14T11:26:51Z", "uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, "location": {"countryCode": "AT", "countryName": "Austria", "region": "4", "city": "Gmunden", "latitude": 47.915, "longitude": 13.7959, "timezone": "Europe/Vienna", "postalCode": "4810", "metroCode": null, "regionName": "Upper Austria", "isInEuropeanUnion": true, "continent": "Europe", "accuracyRadius": 50}, "uidType": "ipaddress"}
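As a rough illustration of how such events might be consumed, the sketch below parses the two sample objects with Python's standard `json` module. It assumes the log file holds one JSON object per line, which the description does not state explicitly. The second event is trimmed here (shorter `path`, fewer `location` fields) to keep the example compact, and `parse_event` is a hypothetical helper name.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# The two sample events from the description; the second one is trimmed for brevity.
raw_lines = [
    '{"params": {"user": "intact-gray-marlin-trademarkagent"}, "type": "login_successful", '
    '"time": "2019-11-14T11:26:43Z", "uid": "intact-gray-marlin-trademarkagent", '
    '"id": 21567530, "uidType": "name"}',
    '{"isLocalIP": false, "params": {"path": "/proud-copper-orangutan-artexer"}, '
    '"type": "file_accessed", "time": "2019-11-14T11:26:51Z", '
    '"uid": "graceful-olive-spoonbill-careersofficer", "id": 21567531, '
    '"location": {"countryCode": "AT", "city": "Gmunden"}, "uidType": "ipaddress"}',
]


def parse_event(line):
    """Decode one log line and turn its UTC timestamp into an aware datetime."""
    event = json.loads(line)
    parsed = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    event["time"] = parsed.replace(tzinfo=timezone.utc)
    return event


events = [parse_event(line) for line in raw_lines]

# Count events per type and collect the ids of events flagged as non-local.
type_counts = Counter(event["type"] for event in events)
remote_ids = [event["id"] for event in events if event.get("isLocalIP") is False]
gap_seconds = (events[1]["time"] - events[0]["time"]).total_seconds()

print(type_counts)
print(remote_ids, gap_seconds)
```

Because not every event carries every attribute (the login event has no `isLocalIP` or `location`), optional fields are read with `dict.get` rather than direct indexing.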
The data set was generated at the premises of Huemer Group, a midsize IT service provider located in Vienna, Austria. Huemer Group offers a range of Infrastructure-as-a-Service solutions for enterprises, including cloud computing and storage. In particular, their cloud storage solution called hBOX enables customers to upload their data, synchronize them with multiple devices, share files with others, create versions and backups of their documents, collaborate with team members in shared data spaces, and query the stored documents using search terms. The hBOX extends the open-source project Nextcloud with interfaces and functionalities tailored to the requirements of customers.
The data set comprises only normal user behavior, but can be used to evaluate anomaly detection approaches by simulating account hijacking. We provide an implementation for identifying similar users, switching pairs of users to simulate changes of behavior patterns, and a sample detection approach in our github repo.
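The idea of simulating account hijacking by switching pairs of users can be sketched as follows. This is not the authors' implementation, which is provided in their repository. It is a simplified illustration that assumes events are dictionaries with `uid` and `time` keys, and it swaps the identifiers of two users from a chosen point in time onward. The function name `swap_users`, the `injected` label, and the sample events are all hypothetical.

```python
from datetime import datetime, timezone


def swap_users(events, user_a, user_b, switch_time):
    """Return a copy of events in which user_a and user_b trade uids from switch_time on.

    Swapped events get an 'injected' flag so a detector can be scored against it.
    """
    mapping = {user_a: user_b, user_b: user_a}
    swapped = []
    for event in events:
        new_event = dict(event)  # shallow copy keeps the input list untouched
        if event["time"] >= switch_time and event["uid"] in mapping:
            new_event["uid"] = mapping[event["uid"]]
            new_event["injected"] = True
        swapped.append(new_event)
    return swapped


def day(n):
    """Build a UTC timestamp in January 2020 for the made-up events below."""
    return datetime(2020, 1, n, tzinfo=timezone.utc)


# Made-up events for three hypothetical users.
events = [
    {"id": 1, "uid": "alice", "time": day(1)},
    {"id": 2, "uid": "bob", "time": day(2)},
    {"id": 3, "uid": "alice", "time": day(10)},
    {"id": 4, "uid": "bob", "time": day(11)},
    {"id": 5, "uid": "carol", "time": day(12)},
]

simulated = swap_users(events, "alice", "bob", switch_time=day(5))
uids_after = [event["uid"] for event in simulated]
injected_ids = [event["id"] for event in simulated if event.get("injected")]
print(uids_after, injected_ids)
```

After the switch time, each of the two accounts shows the other user's behavior, which is the change of pattern an anomaly detector would be expected to flag. Users outside the pair are left unchanged.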
Acknowledgements: Partially funded by the FFG project DECEPT (873980). The authors thank Walter Huemer, Oskar Kruschitz, Kevin Truckenthanner, and Christian Aigner from Huemer Group for supporting the collection of the data set.
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, G. Höld, and M. Wurzenberger. "A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing". 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present a tool for multi-omics data analysis that enables simultaneous visualization of up to four types of omics data on organism-scale metabolic network diagrams. The tool’s interactive web-based metabolic charts depict the metabolic reactions, pathways, and metabolites of a single organism as described in a metabolic pathway database for that organism; the charts are constructed using automated graphical layout algorithms. The multi-omics visualization facility paints each individual omics dataset onto a different “visual channel” of the metabolic-network diagram. For example, a transcriptomics dataset might be displayed by coloring the reaction arrows within the metabolic chart, while a companion proteomics dataset is displayed as reaction arrow thicknesses, and a complementary metabolomics dataset is displayed as metabolite node colors. Once the network diagrams are painted with omics data, semantic zooming provides more details within the diagram as the user zooms in. Datasets containing multiple time points can be displayed in an animated fashion. The tool will also graph data values for individual reactions or metabolites designated by the user. The user can interactively adjust the mapping from data value ranges to the displayed colors and thicknesses to provide more informative diagrams.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global dataset versioning for analytics market size reached USD 527.4 million in 2024. The market is experiencing robust expansion with a remarkable CAGR of 18.2% during the forecast period. By 2033, the market is projected to achieve a value of USD 2,330.6 million. This growth is primarily driven by the escalating demand for efficient data management, regulatory compliance, and the proliferation of AI and machine learning applications across diverse industries.
The primary growth driver in the dataset versioning for analytics market is the exponential increase in data volume and complexity across organizations of all sizes. As enterprises continue to generate and utilize vast amounts of structured and unstructured data, the need for robust dataset versioning solutions has become imperative. These solutions enable organizations to track, manage, and analyze different versions of datasets, ensuring data integrity, reproducibility, and transparency throughout the analytics lifecycle. The surge in adoption of advanced analytics, machine learning, and artificial intelligence further amplifies the necessity for dataset versioning, as it facilitates the training, validation, and deployment of models with consistent and reliable data sources. In addition, the integration of dataset versioning tools with popular analytics platforms and cloud services has made these solutions more accessible and scalable, catering to the evolving needs of modern data-driven enterprises.
Another significant factor fueling market growth is the rising emphasis on data governance and regulatory compliance across industries such as BFSI, healthcare, and government. Stringent regulations like GDPR, HIPAA, and CCPA mandate organizations to maintain accurate records of data usage, lineage, and modifications. Dataset versioning solutions play a pivotal role in helping organizations meet these compliance requirements by providing comprehensive audit trails, access controls, and data lineage tracking. This not only mitigates the risk of non-compliance penalties but also enhances organizational trust and credibility. Furthermore, the growing awareness about the strategic importance of data governance in driving business value and mitigating operational risks has prompted enterprises to invest in sophisticated dataset versioning tools, thereby propelling market expansion.
The proliferation of cloud computing and the increasing adoption of hybrid and multi-cloud architectures are also contributing to the growth of the dataset versioning for analytics market. Cloud-based dataset versioning solutions offer unparalleled scalability, flexibility, and cost-efficiency, enabling organizations to manage and version datasets seamlessly across distributed environments. The shift towards cloud-native analytics and the integration of dataset versioning with cloud data lakes, warehouses, and analytics platforms have further accelerated market adoption. Additionally, advancements in automation, AI-driven data cataloging, and self-service analytics are enhancing the capabilities of dataset versioning tools, making them indispensable for organizations seeking to maximize the value of their data assets while minimizing operational complexities.
From a regional perspective, North America continues to dominate the dataset versioning for analytics market, accounting for the largest revenue share in 2024. This leadership is attributed to the presence of major technology vendors, high adoption rates of advanced analytics, and a mature regulatory landscape. However, the Asia Pacific region is witnessing the fastest growth, driven by rapid digital transformation, increasing investments in AI and analytics, and the emergence of data-centric industries. Europe also holds a significant market share, supported by stringent data protection regulations and growing awareness about data governance. The Middle East & Africa and Latin America are gradually catching up, with increasing adoption of cloud-based analytics and regulatory initiatives promoting data management best practices.
The dataset versioning for analytics market is segmented by component into software and services. The software segment holds the dominant share, driven by the widespread adoption of standalone and integrated dataset versioning platforms that cater to various data management and analytics requirements. These s
https://www.technavio.com/content/privacy-notice
Cloud Analytics Market Size 2024-2028
The cloud analytics market size is forecast to increase by USD 74.08 billion at a CAGR of 24.4% between 2023 and 2028.
The market is experiencing significant growth due to several key trends. The adoption of hybrid and multi-cloud setups is on the rise, as these configurations enhance data connectivity and flexibility. Another trend driving market growth is the increasing use of cloud security applications to safeguard sensitive data.
However, concerns regarding confidential data security and privacy remain a challenge for market growth. Organizations must ensure robust security measures are in place to mitigate risks and maintain trust with their customers. Overall, the market is poised for continued expansion as businesses seek to leverage the benefits of cloud technologies for data processing and data analytics.
What will be the Size of the Cloud Analytics Market During the Forecast Period?
The market is experiencing significant growth due to the increasing volume of data generated by businesses and the demand for advanced analytics solutions. Cloud-based analytics enables organizations to process and analyze large datasets from various data sources, including unstructured data, in real-time. This is crucial for businesses looking to make data-driven decisions and gain valuable insights to optimize their operations and meet customer requirements. Key industries such as sales and marketing, customer service, and finance are adopting cloud analytics to improve key performance indicators and gain a competitive edge. Both Small and Medium-sized Enterprises (SMEs) and large enterprises are embracing cloud analytics, with solutions available on private, public, and multi-cloud platforms.
Big data technologies, such as machine learning and artificial intelligence, are integral to cloud analytics, enabling advanced data analytics and business intelligence. Cloud analytics gives businesses the flexibility to store and process data in the cloud, reducing the need for expensive on-premises data storage and computation. Hybrid environments are also gaining popularity, allowing businesses to leverage the benefits of both private and public clouds. Overall, the market is poised for continued growth as businesses increasingly rely on data-driven insights to inform their decision-making processes.
How is this Cloud Analytics Industry segmented and which is the largest segment?
The cloud analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2017-2022 for the following segments.
Solution
  Hosted data warehouse solutions
  Cloud BI tools
  Complex event processing
  Others
Deployment
  Public cloud
  Hybrid cloud
  Private cloud
Geography
  North America
    US
  Europe
    Germany
    UK
  APAC
    China
    Japan
  Middle East and Africa
  South America
By Solution Insights
The hosted data warehouse solutions segment is estimated to witness significant growth during the forecast period.
Hosted data warehouses enable organizations to centralize and analyze large datasets from multiple sources, facilitating advanced analytics solutions and real-time insights. By utilizing cloud-based infrastructure, businesses can reduce operational costs through eliminating licensing expenses, hardware investments, and maintenance fees. Additionally, cloud solutions offer network security measures, such as Software Defined Networking and Network integration, ensuring data protection. Cloud analytics caters to diverse industries, including SMEs and large enterprises, addressing requirements for sales and marketing, customer service, and key performance indicators. Advanced analytics capabilities, including predictive analytics, automated decision making, and fraud prevention, are essential for data-driven decision making and business optimization.
Furthermore, cloud platforms provide access to specialized talent, big data technology, and AI, enhancing customer experiences and digital business opportunities. Data connectivity and data processing in real-time are crucial for network agility and application performance. Hosted data warehouses offer computational power and storage capabilities, ensuring efficient data utilization and enterprise information management. Cloud service providers offer various cloud environments, including private, public, multi-cloud, and hybrid, catering to diverse business needs. Compliance and security concerns are addressed through cybersecurity frameworks and data security measures, ensuring data breaches and thefts are minimized.
The Hosted data warehouse solutions s
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To determine the effectiveness of any defense mechanism, there is a need for comprehensive real-time network data that solely references various attack scenarios based on older software versions or unprotected ports, and so on. This presented dataset has entire network data at the time of several cyber attacks to enable experimentation on challenges based on implementing defense mechanisms on a larger scale. For collecting the data, we captured the network traffic of configured virtual machines using Wireshark and tcpdump. To analyze the impact of several cyber attack scenarios, this dataset presents a set of ten computers connected to Router1 on VLAN1 in a Docker Bridge network, that try and exploit each other. It includes browsing the web and downloading foreign packages including malicious ones. Also, services like FTP and SSH were exploited using several attack mechanisms. The presented dataset shows the importance of updating and patching systems to protect themselves to a greater extent, by following attack tactics on older versions of packages as compared to the newer and updated ones. This dataset also includes an Apache Server hosted on the different subset on VLAN2 which is connected to the VLAN1 to demonstrate isolation and cross-VLAN communication. The services on this web server were also exploited by the previously stated ten computers. The attack types include: Distributed Denial of Service, SQL Injection, Account Takeover, Service Exploitation (SSH, FTP), DNS and ARP Spoofing, Scanning and Firewall Searching and Indexing (using Nmap), Hammering the services to brute-force passwords and usernames, Malware attack, Spoofing and Man-in-the-Middle Attack. The attack scenarios also show various scanning mechanisms and the impact of Insider Threats on the entire network.
https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 2.29(USD Billion) |
| MARKET SIZE 2025 | 2.49(USD Billion) |
| MARKET SIZE 2035 | 5.8(USD Billion) |
| SEGMENTS COVERED | End Use, Deployment Type, Database Type, Application, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | growing demand for real-time analytics, increasing data volume and variety, rising cloud adoption trends, need for enhanced decision-making, regulatory compliance and data governance |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | Nasdaq, Fitch Ratings, Tickdata, Thomson Reuters, MSCI, St. Louis Federal Reserve, FTSE Russell, Bloomberg, Morningstar, IHS Markit, S&P Dow Jones Indices, FactSet, S&P Global, Refinitiv |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Cloud-based solutions integration, Enhanced data analytics capabilities, Adoption in fintech applications, Real-time data accessibility demands, Rising importance of accurate indexing. |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 8.8% (2025 - 2035) |
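The growth rate in the table can be cross-checked against the listed market sizes using the standard compound annual growth rate formula, (end / start)^(1 / years) - 1. The sketch below assumes the 2025 market size is the base value and that the 2025 - 2035 forecast period counts as ten compounding years; the function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values separated by `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1


# Values from the table: USD 2.49 billion (2025) and USD 5.8 billion (2035).
rate = cagr(2.49, 5.8, 2035 - 2025)
print(f"{rate:.1%}")  # prints 8.8%
```

Under these assumptions the computed rate agrees with the 8.8% reported in the table.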
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights:
The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Key Features:
Use Cases:
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
https://creativecommons.org/publicdomain/zero/1.0/
🛒 E-Commerce Customer Behavior and Sales Dataset
📊 Dataset Overview
This comprehensive dataset contains 5,000 e-commerce transactions from a Turkish online retail platform, spanning from January 2023 to March 2024. The dataset provides detailed insights into customer demographics, purchasing behavior, product preferences, and engagement metrics.
🎯 Use Cases
This dataset is perfect for:
- Customer Segmentation Analysis: Identify distinct customer groups based on behavior
- Sales Forecasting: Predict future sales trends and patterns
- Recommendation Systems: Build product recommendation engines
- Customer Lifetime Value (CLV) Prediction: Estimate customer value
- Churn Analysis: Identify customers at risk of leaving
- Marketing Campaign Optimization: Target customers effectively
- Price Optimization: Analyze price sensitivity across categories
- Delivery Performance Analysis: Optimize logistics and shipping
📁 Dataset Structure
The dataset contains 18 columns with the following features:
Order Information
- Order_ID: Unique identifier for each order (ORD_XXXXXX format)
- Date: Transaction date (2023-01-01 to 2024-03-26)
Customer Demographics
- Customer_ID: Unique customer identifier (CUST_XXXXX format)
- Age: Customer age (18-75 years)
- Gender: Customer gender (Male, Female, Other)
- City: Customer city (10 major Turkish cities)
Product Information
- Product_Category: 8 categories (Electronics, Fashion, Home & Garden, Sports, Books, Beauty, Toys, Food)
- Unit_Price: Price per unit (in TRY/Turkish Lira)
- Quantity: Number of units purchased (1-5)
Transaction Details
- Discount_Amount: Discount applied (if any)
- Total_Amount: Final transaction amount after discount
- Payment_Method: Payment method used (5 types)
Customer Behavior Metrics
- Device_Type: Device used for purchase (Mobile, Desktop, Tablet)
- Session_Duration_Minutes: Time spent on website (1-120 minutes)
- Pages_Viewed: Number of pages viewed during session (1-50)
- Is_Returning_Customer: Whether customer has purchased before (True/False)
Post-Purchase Metrics
- Delivery_Time_Days: Delivery duration (1-30 days)
- Customer_Rating: Customer satisfaction rating (1-5 stars)
📈 Key Statistics
- Total Records: 5,000 transactions
- Date Range: January 2023 - March 2024 (15 months)
- Average Transaction Value: ~450 TRY
- Customer Satisfaction: 3.9/5.0 average rating
- Returning Customer Rate: 60%
- Mobile Usage: 55% of transactions
🔍 Data Quality
✅ No missing values
✅ Consistent formatting across all fields
✅ Realistic data distributions
✅ Proper data types for all columns
✅ Logical relationships between features
💡 Sample Analysis Ideas
- Customer Segmentation with K-Means Clustering: Segment customers based on spending, frequency, and recency
- Sales Trend Analysis: Identify seasonal patterns and peak shopping periods
- Product Category Performance: Compare revenue, ratings, and return rates across categories
- Device-Based Behavior Analysis: Understand how device choice affects purchasing patterns
- Predictive Modeling: Build models to predict customer ratings or purchase amounts
- City-Level Market Analysis: Compare market performance across different cities
🛠️ Technical Details
- File Format: CSV (Comma-Separated Values)
- Encoding: UTF-8
- File Size: ~500 KB
- Delimiter: Comma (,)
📚 Column Descriptions
| Column Name | Data Type | Description | Example |
| --- | --- | --- | --- |
| Order_ID | String | Unique order identifier | ORD_001337 |
| Customer_ID | String | Unique customer identifier | CUST_01337 |
| Date | DateTime | Transaction date | 2023-06-15 |
| Age | Integer | Customer age | 35 |
| Gender | String | Customer gender | Female |
| City | String | Customer city | Istanbul |
| Product_Category | String | Product category | Electronics |
| Unit_Price | Float | Price per unit | 1299.99 |
| Quantity | Integer | Units purchased | 2 |
| Discount_Amount | Float | Discount applied | 129.99 |
| Total_Amount | Float | Final amount paid | 2469.99 |
| Payment_Method | String | Payment method | Credit Card |
| Device_Type | String | Device used | Mobile |
| Session_Duration_Minutes | Integer | Session time | 15 |
| Pages_Viewed | Integer | Pages viewed | 8 |
| Is_Returning_Customer | Boolean | Returning customer | True |
| Delivery_Time_Days | Integer | Delivery duration | 3 |
| Customer_Rating | Integer | Satisfaction rating | 5 |
🎓 Learning Outcomes
By working with this dataset, you can learn:
- Data cleaning and preprocessing techniques
- Exploratory Data Analysis (EDA) with Python/R
- Statistical analysis and hypothesis testing
- Machine learning model development
- Data visualization best practices
- Business intelligence and reporting
📝 Citation
If you use this dataset in your research or project, please cite:
E-Commerce Customer Behavior and Sales Dataset (2024)
Turkish Online Retail Platform Data (2023-2024)
Available on Kaggle
⚖️ License
This dataset is released under the CC0: Public Domain license. You are free to use it for any purpose.
🤝 Contribution Found any issues or have suggestions? Feel free to provide feedback!
📞 Contact For questions or collaborations, please reach out through Kaggle.
Happy Analyzing! 🚀
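The "logical relationships between features" mentioned under Data Quality can be spot-checked. The sketch below assumes (this is not stated in the description, only suggested by the example values in the column table) that Total_Amount equals Unit_Price times Quantity minus Discount_Amount. The function name `expected_total` is hypothetical.

```python
from decimal import Decimal

def expected_total(unit_price, quantity, discount_amount):
    """Assumed relationship: final amount = unit price * quantity - discount."""
    return unit_price * quantity - discount_amount

# Example values copied from the column description table.
example_row = {
    "Unit_Price": Decimal("1299.99"),
    "Quantity": 2,
    "Discount_Amount": Decimal("129.99"),
    "Total_Amount": Decimal("2469.99"),
}

computed = expected_total(
    example_row["Unit_Price"],
    example_row["Quantity"],
    example_row["Discount_Amount"],
)
# Decimal avoids binary floating-point rounding when comparing currency values.
is_consistent = computed == example_row["Total_Amount"]
print(is_consistent)
```

Running the same comparison over every row of the CSV would show whether the assumed relationship holds for the full dataset or only for the documented example.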
Keywords: e-c...
This EnviroAtlas dataset includes analysis by NatureServe of species that are Imperiled (G1/G2) or Listed under the U.S. Endangered Species Act (ESA) by 12-digit Hydrologic Units (HUCs). The analysis results are for use and publication by both the LandScope America website and by the EnviroAtlas. Results are provided for the total number of Aquatic Associated G1-G2/ESA species, the total number of Wetland Associated G1-G2/ESA species, the total number of Terrestrial Associated G1-G2/ESA species, and the total number of Unknown Habitat Association G1-G2/ESA species in each HUC12. NatureServe is a non-profit organization dedicated to developing and providing information about the world's plants, animals, and ecological communities. NatureServe works in partnership with 82 independent Natural Heritage programs and Conservation Data Centers that gather scientific information on rare species and ecosystems in the United States, Latin America, and Canada (the Natural Heritage Network). NatureServe is a leading source for biodiversity information that is essential for effective conservation action. This dataset was produced by NatureServe to support research and online mapping activities related to EnviroAtlas. EnviroAtlas (https://www.epa.gov/enviroatlas) allows the user to interact with a web-based, easy-to-use, mapping application to view and analyze multiple ecosystem services for the contiguous United States. The dataset is available as downloadable data (https://edg.epa.gov/data/Public/ORD/EnviroAtlas) or as an EnviroAtlas map service. Additional descriptive information about each attribute in this dataset can be found in its associated EnviroAtlas Fact Sheet (https://www.epa.gov/enviroatlas/enviroatlas-fact-sheets).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Datasets are critical for emotion analysis in the machine learning field. This study aims to explore emotion analysis datasets and related benchmarks in online learning, since very few studies currently do so. We have labeled the topic and nine-category emotion of 4715 comment texts from online learning platforms using the “three-person voting label method”, based on the “sentence-level” and multi-category labeling dimensions, with our self-developed system. After testing the consistency of the labeling results using the Fleiss Kappa method, we found that the consistency of the dataset was about 0.51, representing a moderate strength of agreement. Based on the dataset, the prediction accuracy of the Long Short-Term Memory (LSTM) method is about 0.68. This dataset provides a benchmark for multi-category emotion datasets in the Chinese online learning field. It can provide a basis for subsequent work on emotion analysis, monitoring, and intervention in the education field. It can also provide a reference for constructing subsequent datasets in the education field. Please note that this is a Chinese dataset. If you want to use this dataset, please contact the author and request the dataset below.
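For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' code; the function name and the toy counts (three raters, three categories rather than the dataset's nine) are assumptions for illustration only.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    category_share = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    item_agreement = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(item_agreement) / n_items
    # Agreement expected by chance.
    expected = sum(p * p for p in category_share)
    return (observed - expected) / (1 - expected)

# Hypothetical toy table: 4 comments, 3 raters, 3 emotion categories.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 1, 2],
]
toy_kappa = fleiss_kappa(toy_counts)
print(round(toy_kappa, 3))
```

Values between roughly 0.41 and 0.60 are conventionally read as moderate agreement, which is the band the reported 0.51 falls into.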
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction
This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:
* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation
Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be used to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of those datasets, please make sure you cite both the dataset and the paper introducing the dataset.
* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1–29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140–158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries - JCDL ’13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science March 22-23, 2016 - Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib magazine : the magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib magazine : The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information to get PubMed/MEDLINE, and NLM's data Terms and Conditions. Additional data-related updates can be found at Torvik Research Group.

## Acknowledgments
This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License
Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
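Because the data files listed in the Introduction are large (about 1 GB each) and ship with a separate header file, one way to read them is to stream rows and attach the column names on the fly. The sketch below is an illustration, not the authors' loader. It assumes the header file holds a single tab-separated line of column names, which the description does not confirm, and the function names are hypothetical.

```python
import csv

def read_header(header_path):
    """Return column names, assuming one tab-separated line in the header file."""
    with open(header_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))

def iter_rows(data_path, columns):
    """Yield one dict per data row without loading the whole file into memory."""
    with open(data_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for values in reader:
            yield dict(zip(columns, values))
```

A typical use would pass `Training_data_2002_2005_pmc_pair_txt.header.txt` to `read_header` and one of the three data files to `iter_rows`. COLUMNS_DESC.txt remains the authoritative reference for what each column means.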
Annotated tutorials and example code are provided describing the use of these data.
https://www.wiseguyreports.com/pages/privacy-policy
| BASE YEAR | 2024 |
| HISTORICAL DATA | 2019 - 2023 |
| REGIONS COVERED | North America, Europe, APAC, South America, MEA |
| REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
| MARKET SIZE 2024 | 7.18 (USD Billion) |
| MARKET SIZE 2025 | 7.89 (USD Billion) |
| MARKET SIZE 2035 | 20.0 (USD Billion) |
| SEGMENTS COVERED | Database Type, Deployment Type, End User Industry, Application, Regional |
| COUNTRIES COVERED | US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA |
| KEY MARKET DYNAMICS | Scalability and Flexibility, Real-time Data Processing, Increased Cloud Adoption, Big Data Integration, Cost-effective Solutions |
| MARKET FORECAST UNITS | USD Billion |
| KEY COMPANIES PROFILED | DataStax, Microsoft, Amazon Web Services, Teradata, Aerospike, MongoDB, Berkeley DB, Google, MarkLogic, IBM, Redis Labs, Couchbase, Cassandra, CouchDB, Oracle |
| MARKET FORECAST PERIOD | 2025 - 2035 |
| KEY MARKET OPPORTUNITIES | Cloud-based database solutions, Increasing demand for big data analytics, Integration with AI and machine learning, Growing adoption in IoT applications, Enhanced scalability for multi-cloud environments |
| COMPOUND ANNUAL GROWTH RATE (CAGR) | 9.8% (2025 - 2035) |
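The growth rate in the table can be sanity-checked against the market-size rows using the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1. The sketch below is a generic calculation, not the report's methodology. The function name is hypothetical, and the 10-year span is taken from the stated 2025 - 2035 forecast period.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1

# Market sizes in USD billion, taken from the table rows for 2025 and 2035.
implied_rate = cagr(7.89, 20.0, 2035 - 2025)
print(f"{implied_rate:.2%}")
```

The implied rate comes out slightly under the tabulated 9.8%, which is consistent with the 2035 figure having been rounded to 20.0.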
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The work involved in developing the dataset and benchmarking its use of machine learning is set out in the article ‘IoMT-TrafficData: Dataset and Tools for Benchmarking Intrusion Detection in Internet of Medical Things’. DOI: 10.1109/ACCESS.2024.3437214.
Please do cite the aforementioned article when using this dataset.
The increasing importance of securing the Internet of Medical Things (IoMT) due to its vulnerabilities to cyber-attacks highlights the need for an effective intrusion detection system (IDS). In this study, our main objective was to develop a machine learning model for the IoMT to enhance the security of medical devices and protect patients’ private data. To address this issue, we built a scenario that utilised Internet of Things (IoT) and IoMT devices to simulate real-world attacks. We collected and cleaned data, pre-processed it, and fed it into our machine-learning model to detect intrusions in the network. Our results revealed significant improvements in all performance metrics, indicating robustness and reproducibility in real-world scenarios. This research has implications in the context of IoMT and cybersecurity, as it helps mitigate vulnerabilities and lowers the number of breaches occurring with the rapid growth of IoMT devices. The use of machine learning algorithms for intrusion detection systems is essential, and our study provides valuable insights and a road map for future research and the deployment of such systems in live environments. By implementing our findings, we can contribute to a safer and more secure IoMT ecosystem, safeguarding patient privacy and ensuring the integrity of medical data.
The ZIP folder comprises two main components: Captures and Datasets. The captures folder includes all the captures used in this project, organized into separate folders corresponding to the type of network analysis: BLE or IP-Based. The datasets folder is organized the same way and contains datasets categorized by type: BLE, IP-Based Packet, and IP-Based Flows.
To cater to diverse analytical needs, the datasets are provided in two formats: CSV (Comma-Separated Values) and pickle. The CSV format facilitates seamless integration with various data analysis tools, while the pickle format preserves the intricate structures and relationships within the dataset.
This organization enables researchers to easily locate and utilize the specific captures and datasets they require, based on their preferred network analysis type or dataset type. The availability of different formats further enhances the flexibility and usability of the provided data.
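The practical difference between the two formats can be illustrated with a small round trip. This is a sketch using only the Python standard library. The field names are taken from the feature tables in this description, but the values and the function name `round_trip` are made up for illustration, and the actual layout of the provided pickle files is not specified here.

```python
import csv
import io
import pickle

def round_trip(record):
    """Send one record through CSV text and through pickle bytes."""
    # CSV round trip: every value comes back as a string.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    buffer.seek(0)
    csv_row = next(csv.DictReader(buffer))

    # Pickle round trip: Python types are preserved.
    pickled_row = pickle.loads(pickle.dumps(record))
    return csv_row, pickled_row

# Hypothetical values for two fields named in the Bluetooth feature table.
sample = {"frame.len": 60, "nordic_ble.crcok": True}
csv_row, pickled_row = round_trip(sample)
print(type(csv_row["frame.len"]).__name__, type(pickled_row["frame.len"]).__name__)
```

CSV is the safer choice for exchange between tools. Pickle files can execute arbitrary code when loaded, so they should only be opened when they come from a trusted source.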
Within this dataset, three sub-datasets are available, namely BLE, IP-Based Packet, and IP-Based Flows. Below is a table of the features selected for each dataset and consequently used in the evaluation model within the provided work.
Identified Key Features Within Bluetooth Dataset
| Feature | Meaning |
| --- | --- |
| btle.advertising_header | BLE Advertising Packet Header |
| btle.advertising_header.ch_sel | BLE Advertising Channel Selection Algorithm |
| btle.advertising_header.length | BLE Advertising Length |
| btle.advertising_header.pdu_type | BLE Advertising PDU Type |
| btle.advertising_header.randomized_rx | BLE Advertising Rx Address |
| btle.advertising_header.randomized_tx | BLE Advertising Tx Address |
| btle.advertising_header.rfu.1 | Reserved For Future 1 |
| btle.advertising_header.rfu.2 | Reserved For Future 2 |
| btle.advertising_header.rfu.3 | Reserved For Future 3 |
| btle.advertising_header.rfu.4 | Reserved For Future 4 |
| btle.control.instant | Instant Value Within a BLE Control Packet |
| btle.crc.incorrect | Incorrect CRC |
| btle.extended_advertising | Advertiser Data Information |
| btle.extended_advertising.did | Advertiser Data Identifier |
| btle.extended_advertising.sid | Advertiser Set Identifier |
| btle.length | BLE Length |
| frame.cap_len | Frame Length Stored Into the Capture File |
| frame.interface_id | Interface ID |
| frame.len | Frame Length Wire |
| nordic_ble.board_id | Board ID |
| nordic_ble.channel | Channel Index |
| nordic_ble.crcok | Indicates if CRC is Correct |
| nordic_ble.flags | Flags |
| nordic_ble.packet_counter | Packet Counter |
| nordic_ble.packet_time | Packet time (start to end) |
| nordic_ble.phy | PHY |
| nordic_ble.protover | Protocol Version |
Identified Key Features Within IP-Based Packets Dataset
| Feature | Meaning |
| --- | --- |
| http.content_length | Length of content in an HTTP response |
| http.request | HTTP request being made |
| http.response.code | Status code of an HTTP response |
| http.response_number | Sequential number of an HTTP response |
| http.time | Time taken for an HTTP transaction |
| tcp.analysis.initial_rtt | Initial round-trip time for TCP connection |
| tcp.connection.fin | TCP connection termination with a FIN flag |
| tcp.connection.syn | TCP connection initiation with SYN flag |
| tcp.connection.synack | TCP connection establishment with SYN-ACK flags |
| tcp.flags.cwr | Congestion Window Reduced flag in TCP |
| tcp.flags.ecn | Explicit Congestion Notification flag in TCP |
| tcp.flags.fin | FIN flag in TCP |
| tcp.flags.ns | Nonce Sum flag in TCP |
| tcp.flags.res | Reserved flags in TCP |
| tcp.flags.syn | SYN flag in TCP |
| tcp.flags.urg | Urgent flag in TCP |
| tcp.urgent_pointer | Pointer to urgent data in TCP |
| ip.frag_offset | Fragment offset in IP packets |
| eth.dst.ig | Individual/group (I/G) bit of the Ethernet destination address (unicast vs. multicast/broadcast) |
| eth.src.ig | Individual/group (I/G) bit of the Ethernet source address |
| eth.src.lg | Locally administered/globally unique (L/G) bit of the Ethernet source address |
| eth.src_not_group | Flags an Ethernet source address that is a group address |
| arp.isannouncement | Indicates if an ARP message is an announcement |
Identified Key Features Within IP-Based Flows Dataset
| Feature | Meaning |
| --- | --- |
| proto | Transport layer protocol of the connection |
| service | Identification of an application protocol |
| orig_bytes | Originator payload bytes |
| resp_bytes | Responder payload bytes |
| history | Connection state history |
| orig_pkts | Originator sent packets |
| resp_pkts | Responder sent packets |
| flow_duration | Length of the flow in seconds |
| fwd_pkts_tot | Forward packets total |
| bwd_pkts_tot | Backward packets total |
| fwd_data_pkts_tot | Forward data packets total |
| bwd_data_pkts_tot | Backward data packets total |
| fwd_pkts_per_sec | Forward packets per second |
| bwd_pkts_per_sec | Backward packets per second |
| flow_pkts_per_sec | Flow packets per second |
| fwd_header_size | Forward header bytes |
| bwd_header_size | Backward header bytes |
| fwd_pkts_payload | Forward payload bytes |
| bwd_pkts_payload | Backward payload bytes |
| flow_pkts_payload | Flow payload bytes |
| fwd_iat | Forward inter-arrival time |
| bwd_iat | Backward inter-arrival time |
| flow_iat | Flow inter-arrival time |
| active | Flow active duration |
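Several of the flow features above are rates that relate to the packet totals and the flow duration. The sketch below shows one plausible way such rates could be derived. It is an assumption made for illustration: the description does not state how the extraction tool computes these columns, and the function name and the zero-duration handling are hypothetical.

```python
def packet_rates(fwd_pkts_tot, bwd_pkts_tot, flow_duration):
    """Derive per-second packet rates from totals and a duration in seconds.

    Assumed definitions (not confirmed by the dataset description):
    rate = packet total / flow_duration, and the flow rate counts both directions.
    """
    if flow_duration <= 0:
        # Hypothetical convention: zero-length flows get a rate of 0.0.
        return {"fwd_pkts_per_sec": 0.0, "bwd_pkts_per_sec": 0.0, "flow_pkts_per_sec": 0.0}
    return {
        "fwd_pkts_per_sec": fwd_pkts_tot / flow_duration,
        "bwd_pkts_per_sec": bwd_pkts_tot / flow_duration,
        "flow_pkts_per_sec": (fwd_pkts_tot + bwd_pkts_tot) / flow_duration,
    }

# Made-up flow: 30 forward and 10 backward packets over 4 seconds.
rates = packet_rates(30, 10, 4.0)
print(rates)
```

Comparing recomputed rates with the provided columns is a quick way to confirm or reject these assumed definitions before relying on them.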
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The field of metabolomics has witnessed the development of hundreds of computational tools, but only a few have become cornerstones of this field. While MetaboLights and Metabolomics Workbench are two well-established data repositories for metabolomics data sets, Workflows4Metabolomics and MetaboAnalyst are two well-established web-based data analysis platforms for metabolomics. Yet, the raw data stored in the aforementioned repositories lack standardization in terms of the file system format used to store the associated acquisition files. Consequently, it is not straightforward to reuse available data sets as input data in the above-mentioned data analysis resources, especially for non-expert users. This paper presents CloMet, a novel open-source modular software platform that contributes to standardization, reusability, and reproducibility in the metabolomics field. CloMet, which is available through a Docker file, converts raw and NMR-based metabolomics data from MetaboLights and Metabolomics Workbench to a file format that can be used directly either in MetaboAnalyst or in Workflows4Metabolomics. We validated both CloMet and the output data using data sets from these repositories. Overall, CloMet fills the gap between well-established data repositories and web-based statistical platforms and contributes to the consolidation of a data-driven perspective of the metabolomics field by leveraging and connecting existing data and resources.
https://www.marketresearchintellect.com/privacy-policy
Learn more about the Cloud-based Database Market Report by Market Research Intellect. The market stood at USD 10.5 billion in 2024 and is forecast to expand to USD 25.0 billion by 2033, growing at a CAGR of 10.5%. Discover how new strategies, rising investments, and top players are shaping the future.
Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted database program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as R Studio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose, as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeats' marketing strategies for future growth. My task is to provide data-driven insights on the business tasks set by the Bellabeats, Inc. executive and data analysis teams. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task using an asterisk (*) as an identifier.
Section 1 - Ask:
A. Guiding Questions:
1. Who are the key stakeholders and what are their goals for the data analysis project?
2. What is the business task that this data analysis project is attempting to solve?
B. Key Tasks:
1. Identify key stakeholders and their goals for the data analysis project
*The key stakeholders for this project are as follows:
-Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc.
-Bellabeats marketing analytics team. I am a member of this team.
Section 2 - Prepare:
A. Guiding Questions:
1. Where is the data stored and organized?
2. Are there any problems with the data?
3. How does the data help answer the business question?
B. Key Tasks:
1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders.
*The data source used for our case study is FitBit Fitness Tracker Data. This dataset is hosted on Kaggle and was made available by user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were reportedly (see credibility section directly below) generated by respondents to a survey distributed via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.
*Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents to my laptop and decided to use 2 of them for this project, as they were the files that merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
-sleepDay_merged.csv
-dailyActivity_merged.csv
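Although this project was carried out in Excel, the reported participant count and collection period can also be checked with a short script. The sketch below is one possible approach, not the method used in this analysis. The column names `Id` and `ActivityDate`, the month/day/year date format, and the function name are assumptions about the layout of dailyActivity_merged.csv.

```python
from datetime import datetime

def summarize_activity(rows, id_column="Id", date_column="ActivityDate",
                       date_format="%m/%d/%Y"):
    """Return (distinct user count, first date, last date) for an iterable of row dicts."""
    user_ids = set()
    dates = []
    for row in rows:
        user_ids.add(row[id_column])
        dates.append(datetime.strptime(row[date_column], date_format).date())
    return len(user_ids), min(dates), max(dates)

# Made-up rows standing in for the real file contents.
sample_rows = [
    {"Id": "1001", "ActivityDate": "4/12/2016"},
    {"Id": "1001", "ActivityDate": "4/13/2016"},
    {"Id": "1002", "ActivityDate": "5/12/2016"},
]
user_count, first_date, last_date = summarize_activity(sample_rows)
print(user_count, first_date, last_date)
```

For the real file, the rows could come from `csv.DictReader` opened on dailyActivity_merged.csv. The resulting user count and date span can then be compared with the 30 users and two-month period stated in the metadata.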
2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
*As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported.
*As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...