11 datasets found
  1. Stock Market Simulation Dataset

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    Samay Ashar (2025). Stock Market Simulation Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/11010423
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samay Ashar
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset provides realistic stock market data generated using Geometric Brownian Motion for price movements and Markov Chains for trend prediction. It is designed for time-series forecasting, financial modeling, and algorithmic trading simulations.

    Key Features

    • 1000 days of synthetic stock market data (from January 1, 2022, onwards).
    • Multiple companies from diverse industries (Technology, Finance, Healthcare, Energy, Consumer Goods, Automotive, Aerospace, etc.).
    • Stock price details: Open, High, Low, Close prices.
    • Trading volume and market capitalization.
    • Financial metrics: P/E Ratio, Dividend Yield, Volatility.
    • Sentiment Score: A measure of market sentiment (-1 to 1 scale).
    • Trend Labeling: Bullish, Bearish, or Stable, based on Markov Chain modeling.
    Column Name      Description
    Date             Trading date
    Company          Stock name (e.g., Apple, Tesla, JPMorgan)
    Sector           Industry classification
    Open             Opening price of the stock
    High             Highest price of the stock for the day
    Low              Lowest price of the stock for the day
    Close            Closing price of the stock
    Volume           Number of shares traded
    Market_Cap       Market capitalization (in USD)
    PE_Ratio         Price-to-Earnings ratio
    Dividend_Yield   Percentage of dividends relative to stock price
    Volatility       Measure of stock price fluctuation
    Sentiment_Score  Market sentiment (-1 to 1 scale)
    Trend            Stock market trend (Bullish, Bearish, or Stable)
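    The page names the two generating mechanisms (Geometric Brownian Motion for prices, a Markov chain for trend labels) but not their parameters. A minimal sketch of how such data could be produced; the drift `mu`, volatility `sigma`, and transition matrix `P` below are illustrative assumptions, not the dataset's actual values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Geometric Brownian Motion: S_{t+1} = S_t * exp((mu - sigma^2/2)*dt + sigma*sqrt(dt)*Z)
def gbm_path(s0, mu, sigma, n_days, dt=1.0 / 252):
    z = rng.standard_normal(n_days)
    log_returns = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(log_returns))

# Markov chain over trend states; the transition matrix is an assumption
states = ["Bullish", "Bearish", "Stable"]
P = np.array([[0.60, 0.10, 0.30],
              [0.10, 0.60, 0.30],
              [0.25, 0.25, 0.50]])  # each row sums to 1

def trend_sequence(n_days, start_state=2):
    seq, state = [], start_state
    for _ in range(n_days):
        seq.append(states[state])
        state = rng.choice(3, p=P[state])  # sample next state from row of P
    return seq

prices = gbm_path(100.0, mu=0.08, sigma=0.20, n_days=1000)
trends = trend_sequence(1000)
```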

    Usage Scenarios

    🔹 Time-Series Forecasting: Train models like LSTMs, Transformers, or ARIMA for stock price prediction.
    🔹 Algorithmic Trading: Develop trading strategies based on trends and sentiment.
    🔹 Feature Engineering: Explore correlations between financial metrics and stock movements.
    🔹 Quantitative Finance Research: Analyze market trends using simulated yet realistic data.

    PS: If you find this dataset helpful, please consider upvoting :)

  2. S2 Data -

    • plos.figshare.com
    txt
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). S2 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0295803.s002
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data scarcity and discontinuity are common in healthcare and epidemiological datasets, which are often needed to make informed decisions and forecast upcoming scenarios. To avoid these problems, such data are often processed as monthly or yearly aggregates, on which prevalent forecasting tools like the Autoregressive Integrated Moving Average (ARIMA), Seasonal Autoregressive Integrated Moving Average (SARIMA), and TBATS often fail to provide satisfactory results. Artificial data synthesis methods have proven to be a powerful tool for tackling these challenges. The paper proposes a novel algorithm, the Stochastic Bayesian Downscaling (SBD) algorithm, based on the Bayesian approach, which can regenerate downscaled time series of varying lengths from aggregated data, preserving most of the statistical characteristics and the aggregated sum of the original data. The paper presents two epidemiological time-series case studies from Bangladesh (dengue, COVID-19) to showcase the workflow of the algorithm. The case studies illustrate that the synthesized data agree with the original data in statistical properties, trend, seasonality, and residuals. Regarding forecasting performance, using the last 12 years of dengue infection data in Bangladesh, the authors were able to decrease error terms by up to 72.76% using synthetic data over the actual aggregated data.
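    The full SBD algorithm is Bayesian and defined in the paper; its core constraint, that downscaled values must sum back to the reported aggregate, can be illustrated with a toy Dirichlet-weighted split (this sketches only the sum-preservation property, not the authors' procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def downscale(aggregate, n_points):
    """Split one aggregated value into n_points sub-period values whose
    sum equals the aggregate exactly (random Dirichlet weights)."""
    weights = rng.dirichlet(np.ones(n_points))  # non-negative, sums to 1
    return aggregate * weights

monthly_total = 310.0            # e.g. one month's aggregated case count
daily = downscale(monthly_total, 31)
```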

  3. Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale...

    • figshare.com
    zip
    Updated Apr 27, 2025
    Cite
    Chandranil Chakraborttii (2025). Synthetic Data Generation for Hard Drive Failure Prediction in Large-scale Systems [Dataset]. http://doi.org/10.6084/m9.figshare.28878830.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Chandranil Chakraborttii
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Accurate failure prediction is critical for the reliability of HPC facilities' and data centers' storage systems. This study addresses data scarcity, privacy concerns, and class imbalance in HDD failure datasets by leveraging synthetic data generation. We propose an end-to-end framework for generating synthetic storage data using Generative Adversarial Networks and diffusion models. We implement a data segmentation approach that accounts for the temporal variation of disk access, generating high-fidelity synthetic data that replicates the nuanced temporal and feature-specific patterns of disk failures. Experimental results show that the synthetic data achieves similarity scores of 0.81–0.89 and enhances failure prediction performance, with up to a 3% improvement in accuracy and 2% in ROC-AUC. With only minor performance drops versus real-data training, synthetically trained models prove viable for predictive maintenance.

  4. Heat pump COP drop - synthetic faults

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Mathieu Vallee (2023). Heat pump COP drop - synthetic faults [Dataset]. https://www.kaggle.com/datasets/mathieuvallee/ai-dhc-heatpump-cop
    Explore at:
    Available download formats: zip (68,378,018 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Mathieu Vallee
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was generated in the AI DHC project. It contains synthetic fault data representing a drop in the coefficient of performance (COP) of a heat pump.

    The IEA DHC Annex XIII project β€œArtificial Intelligence for Failure Detection and Forecasting of Heat Production and Heat demand in District Heating Networks” is developing Artificial Intelligence (AI) methods for forecasting heat demand and heat production and is evaluating algorithms for detecting faults which can be used by interested stakeholders (operators, suppliers of DHC components and manufacturers of control devices).

    See https://github.com/mathieu-vallee/ai-dhc for the models and Python scripts used to generate the dataset.
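    The actual generation models live in the linked repository. As a hedged illustration of what a synthetic COP-drop fault can look like, the toy sketch below injects a gradual 20% COP degradation into a nominal hourly series and trips a naive rolling-mean alarm; all magnitudes (nominal COP, noise level, drop size, threshold) are assumptions, not AI-DHC values:

```python
import numpy as np

rng = np.random.default_rng(5)

# Nominal hourly COP time series, then a synthetic degradation fault
hours = 24 * 30
cop = rng.normal(3.5, 0.1, hours)            # assumed nominal COP ~3.5
fault_start = 24 * 20
cop[fault_start:] *= np.linspace(1.0, 0.8, hours - fault_start)  # gradual 20% drop

# Naive detector: flag when a 24 h rolling mean falls below a threshold
window = 24
rolling = np.convolve(cop, np.ones(window) / window, mode="valid")
alarm = np.argmax(rolling < 3.2) + window - 1  # first hour the alarm trips
```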

    Please cite this dataset as: Vallee, M., Wissocq T., Gaoua Y., Lamaison N., Generation and Evaluation of a Synthetic Dataset to improve Fault Detection in District Heating and Cooling Systems, 2023 (under review at the Energy journal)

    Disclaimer notice (IEA DHC): This project has been independently funded by the International Energy Agency Technology Collaboration Programme on District Heating and Cooling including Combined Heat and Power (IEA DHC).

    Any views expressed in this publication are not necessarily those of IEA DHC.

    IEA DHC can take no responsibility for the use of the information within this publication, nor for any errors or omissions it may contain.

    The information contained herein has been compiled from sources believed to be reliable. Nevertheless, the authors and their organizations do not accept liability for any loss or damage arising from its use. Using the given information is strictly your own responsibility.

    Disclaimer Notice (Authors):

    This publication has been compiled with reasonable skill and care. However, neither the authors nor the DHC Contracting Parties (of the International Energy Agency Technology Collaboration Programme on District Heating & Cooling) make any representation as to the adequacy or accuracy of the information contained herein, or as to its suitability for any particular application, and accept no responsibility or liability arising out of the use of this publication. The information contained herein does not supersede the requirements given in any national codes, regulations or standards, and should not be regarded as a substitute

    Copyright:

    All property rights, including copyright, are vested in IEA DHC. In particular, all parts of this publication may be reproduced, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise only by crediting IEA DHC as the original source. Republishing of this report in another format or storing the report in a public retrieval system is prohibited unless explicitly permitted by the IEA DHC Operating Agent in writing.

  5. Delhi Power Load with Weather & Development

    • kaggle.com
    Updated Jan 12, 2025
    Cite
    Pratik Chougule (2025). Delhi Power Load with Weather & Development [Dataset]. https://www.kaggle.com/datasets/pratikyuvrajchougule/delhi-datset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Kaggle
    Authors
    Pratik Chougule
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    Delhi
    Description

    This dataset provides synthetic data designed to analyze and predict power load (in MW) in Delhi, incorporating a variety of influencing factors such as weather, holidays, festivals, and real estate development levels. With over a year of hourly data, this dataset is ideal for researchers, students, and practitioners working on energy systems, urban planning, and time-series forecasting.

    Key Features:

    • Weather Data: Temperature, humidity, wind speed, and rainfall measurements for each hour.
    • Socio-Economic Indicators: Information on public holidays, weekly holidays, and festival days.
    • Urban Development: Classification of areas into low, medium, and high development zones with respective percentages.
    • Power Load (MW): Target variable representing hourly electricity consumption in megawatts.

    Purpose:

    This dataset is intended for the following use cases:

    1. Power Load Forecasting: Build machine learning models to predict future electricity demand.
    2. Weather Impact Studies: Analyze how weather conditions influence power consumption patterns.
    3. Urban Development Insights: Explore the correlation between area development levels and energy usage.
    4. Policy Planning: Assist policymakers in understanding energy demand trends during holidays, festivals, and extreme weather.
    5. Time Series Analysis: Practice and research advanced time-series forecasting techniques.
    6. Renewable Energy Integration: Develop models to optimize energy distribution and reduce reliance on non-renewable sources.
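    Before reaching for the machine learning models above, a naive hour-of-day baseline is a useful yardstick. Since the dataset's exact column names are not given on this page, the sketch below uses synthetic stand-in values (all magnitudes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for two weeks of hourly load in MW (illustrative values)
hours = np.arange(24 * 14) % 24
load_mw = 3000 + 800 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 50, hours.size)

# Naive seasonal baseline: predict each hour with the mean load at that hour of day
hourly_mean = np.array([load_mw[hours == h].mean() for h in range(24)])
pred = hourly_mean[hours]
mae = np.abs(load_mw - pred).mean()  # any real model should beat this error
```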

    Potential Applications:

    • Building intelligent power grid systems.
    • Analyzing the impact of climate change on energy demand.
    • Supporting smart city initiatives with energy-efficient planning.
    • Creating educational tools for data science and machine learning learners.
  6. Coefficients of ARIMA(7,0,7).

    • plos.figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). Coefficients of ARIMA(7,0,7). [Dataset]. http://doi.org/10.1371/journal.pone.0295803.t010
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary table from the same PLOS ONE article as dataset 2 (Stochastic Bayesian Downscaling); the shared abstract is reproduced under that entry.

  7. Spacecraft Thruster Firing Test Dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv, zip
    Updated Jul 16, 2024
    Cite
    Patrick Fleith; Patrick Fleith (2024). Spacecraft Thruster Firing Test Dataset [Dataset]. http://doi.org/10.5281/zenodo.7137930
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Fleith; Patrick Fleith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    WARNING

    This version of the dataset is not recommended for anomaly-detection use cases. We discovered discrepancies in the anomalous sequences; a new version will be released. In the meantime, please ignore all sequences marked as anomalous.

    CONTEXT

    Testing hardware to qualify it for spaceflight is critical for modelling and verifying performance. Hot-fire tests (also known as life-tests) are typically run during the qualification campaigns of satellite thrusters, but the results remain proprietary, making it difficult for the machine learning community to develop suitable data-driven predictive models. This synthetic dataset was generated partially based on the real-world physics of monopropellant chemical thrusters, to foster the development and benchmarking of new data-driven analytical methods (machine learning, deep learning, etc.).

    The PDF document "STFT Dataset Description" describes in detail the structure, context, use cases, and domain knowledge about the thruster, so that ML practitioners can use the dataset.

    PROPOSED TASKS

    Supervised:

    • Performance Modelling: Prediction of thruster performance (the target can be thrust, mass flow rate, and/or average specific impulse)
    • Acceptance Test for Individualised Performance Model refinement: Taking each thruster's acceptance test into account may help generate individualised predictive models
    • Uncertainty Quantification for thruster-to-thruster reproducibility verification, i.e., evaluating the prediction variability between several thrusters in order to construct uncertainty bounds (predictive intervals) around the predicted thrust and mass flow rate of future thrusters that may be used during an actual space mission

    Unsupervised / Anomaly Detection

    • Anomaly Detection: Anomalies can be detected in an unsupervised setting (outlier detection) or a semi-supervised setting (novelty detection). The dataset includes a total of 270 anomalies. A simple approach is to predict whether a firing test sequence is anomalous or nominal; a more advanced approach is to predict which portion of a time series is anomalous. The dataset also provides detailed information about whether each time point is anomalous or nominal. In case of an anomaly, a code is provided that allows diagnosing the detection system's performance on the different types of anomalies contained in the dataset.
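    A minimal unsupervised baseline for the sequence-level task can be sketched with a robust z-score over per-sequence summary features. The feature definitions and threshold below are illustrative assumptions; the dataset's real features are described in the STFT description PDF:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy per-sequence feature vectors (e.g. mean thrust, thrust variance);
# values and anomaly shift are assumptions for illustration only.
nominal = rng.normal([1.0, 0.05], [0.02, 0.005], size=(300, 2))
anomalous = rng.normal([0.85, 0.15], [0.02, 0.005], size=(5, 2))
X = np.vstack([nominal, anomalous])

# Robust outlier score: MAD-based z-score distance from the median
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0) * 1.4826  # MAD -> sigma estimate
score = np.abs((X - med) / mad).max(axis=1)
flagged = score > 5.0  # threshold is an assumption
```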

  8. Global Synthetic Data Tool Market Research Report: By Type (Image...

    • wiseguyreports.com
    Updated Aug 10, 2024
    Cite
    Wiseguy Research Consultants Pvt Ltd (2024). Global Synthetic Data Tool Market Research Report: By Type (Image Generation, Text Generation, Audio Generation, Time-Series Generation, User-Generated Data Marketplace), By Application (Computer Vision, Natural Language Processing, Predictive Analytics, Healthcare, Retail), By Deployment Mode (Cloud-Based, On-Premise), By Organization Size (Small and Medium Enterprises (SMEs), Large Enterprises) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2032. [Dataset]. https://www.wiseguyreports.com/cn/reports/synthetic-data-tool-market
    Explore at:
    Dataset updated
    Aug 10, 2024
    Dataset authored and provided by
    Wiseguy Research Consultants Pvt Ltd
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Jan 8, 2024
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2024
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2023: 7.98 (USD Billion)
    MARKET SIZE 2024: 9.55 (USD Billion)
    MARKET SIZE 2032: 40.0 (USD Billion)
    SEGMENTS COVERED: Type, Application, Deployment Mode, Organization Size, Regional
    COUNTRIES COVERED: North America, Europe, APAC, South America, MEA
    KEY MARKET DYNAMICS: Growing demand for data privacy and security; advancement in Artificial Intelligence (AI) and Machine Learning (ML); increasing need for faster and more efficient data generation; growing adoption of synthetic data in various industries; government regulations and compliance
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: MostlyAI, Gretel.ai, H2O.ai, Scale AI, UNchart, Anomali, Replica, Big Syntho, Owkin, DataGenix, Synthesized, Verisart, Datumize, Deci, Datasaur
    MARKET FORECAST PERIOD: 2025 - 2032
    KEY MARKET OPPORTUNITIES: Data privacy compliance; improved data availability; enhanced data quality; reduced data bias; cost-effective
    COMPOUND ANNUAL GROWTH RATE (CAGR): 19.61% (2025 - 2032)
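    The reported CAGR can be sanity-checked against the table's own market-size figures, assuming it compounds from the 2024 base to the 2032 forecast over 8 years:

```python
# Check the report's CAGR against its stated market sizes
# (assumption: compounding from the 2024 base to the 2032 forecast).
size_2024 = 9.55   # USD billion
size_2032 = 40.0   # USD billion
years = 2032 - 2024
cagr = (size_2032 / size_2024) ** (1 / years) - 1
print(f"{cagr:.2%}")  # ~19.61%, matching the reported figure
```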
  9. Selection of best model based on criteria.

    • plos.figshare.com
    xls
    Updated Dec 14, 2023
    Cite
    Mahadee Al Mobin; Md. Kamrujjaman (2023). Selection of best model based on criteria. [Dataset]. http://doi.org/10.1371/journal.pone.0295803.t009
    Explore at:
    Available download formats: xls
    Dataset updated
    Dec 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Mahadee Al Mobin; Md. Kamrujjaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary table from the same PLOS ONE article as dataset 2 (Stochastic Bayesian Downscaling); the shared abstract is reproduced under that entry.

  10. 💳 Financial Transactions Dataset: Analytics

    • kaggle.com
    Updated Oct 31, 2024
    Cite
    ComputingVictor (2024). 💳 Financial Transactions Dataset: Analytics [Dataset]. https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ComputingVictor
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    This comprehensive synthetic financial dataset combines transaction records, customer information, and card data from a banking institution, spanning the 2010s. The dataset is designed for multiple analytical purposes, including fraud detection, customer behavior analysis, and expense forecasting.

    Dataset Components

    1. Transaction Data (transactions_data.csv)

    • Detailed transaction records including amounts, timestamps, and merchant details
    • Covers transactions throughout the 2010s
    • Features transaction types, amounts, and merchant information
    • Perfect for analyzing spending patterns and building fraud detection models

    2. Card Information (cards_dat.csv)

    • Credit and debit card details
    • Includes card limits, types, and activation dates
    • Links to customer accounts via card_id
    • Essential for understanding customer financial profiles

    3. Merchant Category Codes (mcc_codes.json)

    • Standard classification codes for business types
    • Enables transaction categorization and spending analysis
    • Industry-standard MCC codes with descriptions

    4. Fraud Labels (train_fraud_labels.json)

    • Binary classification labels for transactions
    • Indicates fraudulent vs. legitimate transactions
    • Ideal for training supervised fraud detection models

    5. User Data (users_data)

    • Demographic information about customers
    • Account-related details
    • Enables customer segmentation and personalized analysis
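    A typical first step with these components is joining the transaction table to the fraud labels. Since the exact schemas are not documented on this page, the field names and values below are placeholders:

```python
import json
import pandas as pd

# Toy stand-ins; the real files are transactions_data.csv and
# train_fraud_labels.json, whose exact schemas are not shown here.
transactions = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t3"],
    "amount": [12.50, 980.00, 43.10],
})
labels = json.loads('{"t1": "No", "t2": "Yes", "t3": "No"}')  # id -> fraud flag

# Map each transaction to its label and derive a boolean target column
transactions["is_fraud"] = transactions["transaction_id"].map(labels).eq("Yes")
fraud_rate = transactions["is_fraud"].mean()
```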

    Use Cases and Applications

    1. Fraud Detection and Security

    • Build real-time fraud detection systems
    • Develop anomaly detection algorithms
    • Create risk scoring models
    • Implement transaction monitoring systems
    • Design security alert systems

    2. Customer Analytics

    • Analyze customer lifetime value
    • Create customer segmentation models
    • Develop churn prediction systems
    • Build recommendation engines
    • Study customer acquisition patterns

    3. Financial Planning and Forecasting

    • Develop expense forecasting models
    • Create budget planning tools
    • Build cash flow prediction systems
    • Design financial health indicators
    • Implement savings recommendation systems

    4. Business Intelligence

    • Analyze merchant performance
    • Study market trends
    • Create sales forecasting models
    • Develop competitive analysis tools
    • Build market segmentation models

    5. Machine Learning Projects

    • Practice supervised learning with fraud detection
    • Implement time series forecasting
    • Develop clustering algorithms for customer segmentation
    • Create deep learning models for pattern recognition
    • Build reinforcement learning systems for automated decision making

    Technical Details

    • Format: CSV, JSON
    • Time Period: 2010s decade

    Citation

    Dataset created by Caixabank Tech for the 2024 AI Hackathon

  11. Cost of Living in Nairobi

    • kaggle.com
    Updated Feb 15, 2025
    Cite
    Yacooti (2025). Cost of Living in Nairobi [Dataset]. https://www.kaggle.com/datasets/yacooti/cost-of-living-in-nairobi/code
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yacooti
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Nairobi
    Description

    🏑 Cost of Living in Nairobi, Kenya

    📌 Overview

    This dataset provides a detailed time-series estimate of the monthly cost of living across 20 different areas in Nairobi, Kenya from 2019 to 2024. It covers essential expenses such as rent, food, transport, utilities, and miscellaneous costs, allowing for comprehensive cost-of-living analysis.

    This dataset is useful for:
    ✅ Individuals planning to move to Nairobi
    ✅ Researchers analyzing long-term cost trends
    ✅ Businesses assessing salary benchmarks based on inflation
    ✅ Data scientists developing predictive models for cost forecasting

    📊 Data Summary

    • Total Records: 60,000 (5 years of monthly data)
    • Columns:
      • 🏠 Area: The residential area in Nairobi
      • 💰 Rent: Estimated monthly rent (KES)
      • 🍽️ Food: Grocery and dining expenses (KES)
      • 🚕 Transport: Public and private transport costs (KES)
      • ⚡ Utilities: Water, electricity, and internet bills (KES)
      • 🎭 Misc: Entertainment, personal care, and leisure expenses (KES)
      • 🏷️ Total: Sum of all expenses
      • 📆 Date: Monthly timestamp from January 2019 to December 2024

    πŸ“ Areas Covered

    This dataset provides cost estimates for 20+ residential areas, including:
    - High-End Areas 🏑: Kileleshwa, Westlands, Karen
    - Mid-Range Areas 🏙️: South B, Langata, Ruaka
    - Affordable Areas 🏠: Embakasi, Kasarani, Githurai, Ruiru, Umoja
    - Satellite Towns 🌿: Ngong, Rongai, Thika, Kitengela, Kikuyu

    πŸ› οΈ How the Data Was Generated

    This dataset was synthetically generated using Python, incorporating realistic market variations. The process includes:

    ✔ Inflation Modeling 📈 – A 2% annual increase in costs over time.
    ✔ Seasonal Effects 📅 – Higher food and transport costs in December & January (holiday season), rent spikes in June & July.
    ✔ Economic Shocks ⚠️ – A 5% chance per record of external economic effects (e.g., fuel price hikes, supply chain issues).
    ✔ Random Fluctuations 🔄 – Expenses vary slightly month-to-month to simulate real-world spending behavior.
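    The four mechanisms above can be sketched as a toy generator. The 2% annual inflation and 5% shock probability come from the list; the baseline amounts and the seasonal, shock, and noise magnitudes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

base = {"Rent": 30000, "Food": 15000, "Transport": 5000,
        "Utilities": 4000, "Misc": 6000}  # illustrative KES baselines

def monthly_cost(year, month, category, base_value):
    value = base_value * (1.02 ** (year - 2019))      # 2% annual inflation
    if category in ("Food", "Transport") and month in (12, 1):
        value *= 1.10                                  # holiday bump (assumed 10%)
    if category == "Rent" and month in (6, 7):
        value *= 1.05                                  # mid-year rent spike (assumed 5%)
    if rng.random() < 0.05:
        value *= 1.15                                  # 5% shock chance (assumed 15% size)
    return value * rng.normal(1.0, 0.02)               # small month-to-month noise

row = {cat: monthly_cost(2024, 12, cat, v) for cat, v in base.items()}
row["Total"] = sum(row.values())
```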

    πŸ” Potential Use Cases

    • 📊 Cost of Living Analysis – Compare affordability across different Nairobi areas.
    • 💵 Salary & Real Estate Benchmarking – Businesses can analyze salary expectations by location.
    • 📉 Time-Series Forecasting – Train predictive models (ARIMA, Prophet, LSTM) to estimate future living costs.
    • 📈 Inflation Impact Studies – Measure how economic conditions influence cost variations over time.

    ⚠️ Limitations

    • Synthetic Data – The dataset is not based on real survey data but follows market trends.
    • No Lifestyle Adjustments – Differences in household size or spending habits are not factored in.
    • Inflation Approximation – While inflation is simulated at 2% annually, actual inflation rates may differ.

    πŸ“ File Format & Access

    • nairobi_cost_of_living_time_series.csv – 60,000 records in CSV format (time-series structured).

    📒 Acknowledgments

    This dataset was generated for research and educational purposes. If you find it useful, consider citing it in your work. 🚀

    📥 Download and Explore the Data Now!


