99 datasets found
  1. Supply Chain Dataset

    • kaggle.com
    zip
    Updated May 22, 2025
    Cite
    Ziya (2025). Supply Chain Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/bdt-mba-supply-chain-dataset
    Explore at:
    zip (20611 bytes)
    Dataset updated
    May 22, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to simulate supply chain operations in large-scale engineering projects. It integrates realistic data from IoT sensors, digital twins, and blockchain-enabled monitoring systems over the years 2023 to 2024.

    It aims to support research in predictive maintenance, resource optimization, secure data exchange, and supply chain transparency through advanced analytics and machine learning.

    ⭐ Key Features

    Time-bound IoT Sensor Data: Includes real-time-like sensor outputs such as temperature and vibration across multiple locations and assets.

    Digital Twin Sync Fields: Tracks Condition_Score and Last_Maintenance to simulate digital twin feedback loops.

    Operational KPIs: Features supply chain metrics like Resource_Utilization, Delivery_Efficiency, and Downtime_Hours.

    Blockchain Contextual Fit: Designed to be compatible with blockchain audit trails and smart contract triggers (e.g., anomaly response, automated logistics payments).

    Labeled Targets: SupplyChain_Efficiency_Label classifies overall efficiency into 3 tiers (0: Low, 1: Medium, 2: High) based on predefined KPI thresholds (a labeling sketch follows this feature list).

    Location-aware Simulation: Assets and operations are tagged by realistic geographic locations.

    Supply Chain Economics: Captures Inventory_Level and Logistics_Cost for resource allocation analysis.

    Year-specific Scope: Covers the period from 2023 to 2024, aligning with recent and ongoing digital transformation trends.
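
    As an illustration of how such a three-tier label can be derived from the documented KPI columns, here is a minimal pandas sketch; the file name and the threshold values are assumptions, since the author's actual KPI cutoffs are not published in this description.

    import pandas as pd

    df = pd.read_csv("supply_chain.csv")  # hypothetical file name

    def efficiency_tier(row):
        # Assumed cutoffs for illustration only; the dataset's real,
        # predefined KPI thresholds are not listed in the description.
        if row["Delivery_Efficiency"] > 0.9 and row["Downtime_Hours"] < 5:
            return 2  # High
        if row["Delivery_Efficiency"] > 0.7:
            return 1  # Medium
        return 0      # Low

    df["SupplyChain_Efficiency_Label"] = df.apply(efficiency_tier, axis=1)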

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been few attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
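
    Once NICHE.csv is downloaded, a quick way to check the 441/131 label split is shown below; the label column name is an assumption, since the description does not spell out the CSV header.

    import pandas as pd

    niche = pd.read_csv("NICHE.csv")
    print(niche.columns.tolist())         # inspect the real header first
    print(niche["label"].value_counts())  # assumed column name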

  3. Data Engineer Salary in 2024

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Kshitij (2024). Data Engineer Salary in 2024 [Dataset]. https://www.kaggle.com/datasets/chopper53/data-engineer-salary-in-2024
    Explore at:
    zip (110281 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Kshitij
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides insights into data engineer salaries and employment attributes for the year 2024. It includes information such as salary, job title, experience level, employment type, employee residence, remote work ratio, company location, and company size.

    The dataset allows for analysis of salary trends, employment patterns, and geographic variations in data engineering roles. It can be used by researchers, analysts, and organizations to understand the evolving landscape of data engineering employment and compensation.

    Feature Description:

    • work_year: The year in which the data was collected (2024).
    • experience_level: The experience level of the employee, categorized as SE (Senior Engineer), MI (Mid-Level Engineer), or EL (Entry-Level Engineer).
    • employment_type: The type of employment, such as full-time (FT), part-time (PT), contract (C), or freelance (F).
    • job_title: The title or role of the employee within the company, for example, AI Engineer.
    • salary: The salary of the employee in the local currency (e.g., 202,730 USD).
    • salary_currency: The currency in which the salary is denominated (e.g., USD).
    • salary_in_usd: The salary converted to US dollars for standardization purposes.
    • employee_residence: The country of residence of the employee.
    • remote_ratio: The ratio indicating the extent of remote work allowed in the position (0 for no remote work, 1 for fully remote).
    • company_location: The location of the company where the employee is employed.
    • company_size: The size of the company, often categorized by the number of employees (S for small, M for medium, L for large).
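
    As an example of the salary-trend analysis the description suggests, here is a minimal pandas sketch using the documented column names (the CSV file name is an assumption):

    import pandas as pd

    salaries = pd.read_csv("data_engineer_salary_2024.csv")  # hypothetical file name
    print(
        salaries.groupby("experience_level")["salary_in_usd"]
                .median()
                .sort_values(ascending=False)
    )
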
  4. Data Engineering Jobs in the USA Glassdoor

    • kaggle.com
    zip
    Updated Oct 17, 2023
    Cite
    Hamza El Belghiti (2023). Data Engineering Jobs in the USA Glassdoor [Dataset]. https://www.kaggle.com/datasets/hamzaelbelghiti/data-engineering-jobs-in-the-usa-glassdoor/versions/5
    Explore at:
    zip (2612489 bytes)
    Dataset updated
    Oct 17, 2023
    Authors
    Hamza El Belghiti
    License

    https://cdla.io/sharing-1-0/

    Description

    This dataset contains a list of data engineering job postings scraped from Glassdoor in the USA (March 2023). It includes details such as the company name, location, job title, job description, estimated salary, company size, company type, company sector, company industry, the year the company was founded, and company revenue. The dataset can be used for exploring data engineering job trends in the USA, analyzing salaries, and identifying the most in-demand skills and qualifications.

    You can see the whole project on GitHub.

    How to use

    • Identify the education and experience required for jobs
    • Find the top recruiting companies and the industry they work in
    • Explore data engineering skills required in job descriptions (a keyword-count sketch follows this list)
    • Predict salary based on location, industry, company rating, etc.
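
    A minimal sketch of that skills exploration; the column name job_description and the skill list are assumptions made for illustration, since the exact headers are not given here.

    import pandas as pd

    jobs = pd.read_csv("glassdoor_data_engineering_jobs.csv")  # hypothetical file name
    skills = ["python", "sql", "spark", "aws", "airflow"]      # assumed skill keywords
    for skill in skills:
        count = jobs["job_description"].str.contains(skill, case=False, na=False).sum()
        print(f"{skill}: {count} postings")
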
  5. Materials and their Mechanical Properties

    • kaggle.com
    zip
    Updated Apr 15, 2023
    Cite
    Purushottam Nawale (2023). Materials and their Mechanical Properties [Dataset]. https://www.kaggle.com/datasets/purushottamnawale/materials
    Explore at:
    zip (145487 bytes)
    Dataset updated
    Apr 15, 2023
    Authors
    Purushottam Nawale
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning

    To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used Weka to quickly visualize the data and understand the relationships between the features without requiring any programming expertise.

    My problem statement is material selection for an EV chassis. If you have any specific ideas, implement them and share your code on Kaggle.

    A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014

  6. BIM-AI Integrated Dataset

    • kaggle.com
    zip
    Updated Feb 28, 2025
    Cite
    Ziya (2025). BIM-AI Integrated Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/bim-ai-integrated-dataset
    Explore at:
    zip (162775 bytes)
    Dataset updated
    Feb 28, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for whole-life-cycle management of civil engineering projects, integrating Building Information Modeling (BIM) and Artificial Intelligence (AI). It includes comprehensive project data covering cost, schedule, structural health, environmental conditions, resource allocation, safety risks, and drone-based monitoring.

    Key Features

    • Project Metadata: ID, type (bridge, road, building, etc.), location, and timeline.
    • Financial Data: Planned vs. actual cost, cost overruns.
    • Scheduling Data: Planned vs. actual duration, schedule deviation.
    • Structural Health Monitoring: Vibration levels, crack width, load-bearing capacity.
    • Environmental Factors: Temperature, humidity, air quality, weather conditions.
    • Resource & Safety Management: Material usage, labor hours, equipment utilization, accident records.
    • Drone-Based Monitoring: Image analysis scores, anomaly detection, completion percentage.
    • Target Variable: Risk Level (Low, Medium, High) based on cost, schedule, safety, and structural health.

    Use Cases

    • Predictive Modeling: Train AI models to forecast project risks and optimize decision-making (a minimal sketch follows this section).
    • BIM & AI Integration: Leverage real-time IoT and drone data for smart construction management.
    • Risk Assessment: Identify early signs of cost overruns, delays, and structural failures.
    • Automation & Efficiency: Develop automated maintenance and safety monitoring frameworks.
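
    A minimal supervised-learning sketch against the documented Risk Level target; the file name, the exact target column name, and the preprocessing are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("bim_ai_projects.csv")              # hypothetical file name
    y = df["Risk_Level"]                                 # assumed target column name
    X = pd.get_dummies(df.drop(columns=["Risk_Level"]))  # one-hot encode categoricals

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print("Held-out accuracy:", clf.score(X_te, y_te))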

  7. Electronics Project(2600+ projects)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    NICK-2908 (2025). Electronics Project(2600+ projects) [Dataset]. https://www.kaggle.com/datasets/nick2908/electronics-project2600-projects
    Explore at:
    zip (274002 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    NICK-2908
    Description

    Summary

    This dataset contains over 2,600 circuit projects scraped from Instructables, focusing on the "Circuits" category. It includes project titles, authors, engagement metrics (views, likes), and the primary component used (Instruments).

    How This Data Was Collected

    I built a web scraper using Python and Selenium to gather all project links (over 2,600 of them) by handling the "Load All" button. The full page source was saved, and I then used BeautifulSoup to parse the HTML and extract the raw data for each project.

    Data Cleaning (The Important Part!)

    The raw data was very messy. I performed a full data cleaning pipeline in a Colab notebook using Pandas.

    • Converted Text to Numbers: Views and Likes arrived as text fields (object) and were converted to numeric types.
    • Handled "K" Values: Found and converted "K" values (e.g., "2.2K") into proper numbers (2200).
    • Handled Missing Data: Replaced all "N/A" strings with null values.
    • Mean Imputation: To keep the dataset complete, I filled all missing Likes and Views with the mean (average) of the respective column (a sketch of these steps follows this list).
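
    A minimal pandas sketch of these cleaning steps, assuming a raw CSV with the columns described (the file name is hypothetical); the log columns match the log_Views and log_Likes fields mentioned below.

    import numpy as np
    import pandas as pd

    raw = pd.read_csv("instructables_circuits_raw.csv")  # hypothetical file name

    def parse_count(value):
        # Convert strings like "2.2K" to 2200; treat "N/A" as missing.
        s = str(value).strip()
        if s.upper() == "N/A":
            return np.nan
        if s.upper().endswith("K"):
            return float(s[:-1]) * 1000
        return float(s)

    for col in ["Views", "Likes"]:
        raw[col] = raw[col].map(parse_count)
        raw[col] = raw[col].fillna(raw[col].mean())  # mean imputation
        raw[f"log_{col}"] = np.log1p(raw[col])       # tames the heavy right skew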

    Key Insights & Analysis

    1. "Viral" Effect (High Skew): The Views and Likes data is highly right-skewed (skewness of ~9.5). This shows a "viral" effect where a tiny number of superstar projects get the vast majority of all views and likes.

    [](url)

    1. Log-Transformation: Because of the skew, I created log_Views and log_Likes columns. A 2D density plot of these log-transformed columns shows a strong positive correlation (as likes increase, views increase) and that the most "typical" project gets around 30-40 likes and 4,000-5,000 views. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F29431778%2Fd90e2039f1be11b53308ab7191b10954%2Fdownload%20(1).png?generation=1763013545903998&alt=media" alt="">

    2. Top Instruments: I've also analyzed the most popular instruments to see which ones get the most engagement. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F29431778%2F19fca1ce142ddddc1e16a5319a1f4fc5%2Fdownload%20(2).png?generation=1763013562400830&alt=media" alt="">

    Column Descriptions

    • Title: The name of the project.
    • Project_Admin: The author/creator of the project.
    • Image_URL: The URL for the project's cover image.
    • Views: The total number of views (cleaned and imputed).
    • Likes: The total number of likes/favorites (cleaned and imputed).
    • Instruments: The main component or category tag (e.g., "Arduino", "Raspberry Pi").
  8. Supply Chain DataSet

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
    Explore at:
    zip (9340 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Amir Motefaker
    Description

    Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.

  9. Feature Engineering Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Harikant Shukla (2023). Feature Engineering Dataset [Dataset]. https://www.kaggle.com/datasets/harikantshukla/feature-engineering-dataset/discussion
    Explore at:
    zip (95245 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Harikant Shukla
    Description

    While searching for their dream house, buyers look at various factors, not just the height of the basement ceiling or the proximity to an east-west railroad.

    Using the dataset, find the factors that influence price negotiations while buying a house.

    There are 79 explanatory variables describing every aspect of residential homes in Ames, Iowa.

    Task to be Performed:

    1) Download "PEP1.csv" using the link given in the Feature Engineering project problem statement.
    2) For a detailed description of the dataset, download and refer to data_description.txt using the link given in the Feature Engineering project problem statement.

    Tasks to Perform:

    1) Import the necessary libraries.
       1.1 Pandas is a Python library for data manipulation and analysis.
       1.2 NumPy is a package that contains a multidimensional array object and several derivative ones.
       1.3 Matplotlib is a Python visualization package for 2D array plots.
       1.4 Seaborn is built on top of Matplotlib; it is used for exploratory data analysis and data visualization.
    2) Read the dataset.
       2.1 Understand the dataset.
       2.2 Print the names of the columns.
       2.3 Print the shape of the dataframe.
       2.4 Check for null values.
       2.5 Print the unique values.
       2.6 Select the numerical and categorical variables.
    3) Descriptive stats and EDA.
       3.1 EDA of numerical variables.
       3.2 Missing value treatment.
       3.3 Identify the skewness and distribution.
       3.4 Identify significant variables using a correlation matrix.
       3.5 Pair plot for distribution and density.

    Project Outcome

    • The aim of the project is to help understand working with the dataset and performing analysis.
    • The project assesses the data and prepares a fresh dataset for training and prediction.
    • A box plot is created to identify the variables with outliers.
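
    A condensed sketch of steps 1-3 above with the named libraries (PEP1.csv is the file named in the tasks; the heatmap choice is an assumption):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("PEP1.csv")
    print(df.shape)
    print(df.columns.tolist())
    print(df.isnull().sum().sort_values(ascending=False).head(10))

    numeric = df.select_dtypes(include=np.number)
    sns.heatmap(numeric.corr(), cmap="coolwarm")  # scan for significant variables
    plt.show()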

  10. Google Data Analytics Bellabeat Capstone Project

    • kaggle.com
    Updated Aug 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeff Burkle (2025). Google Data Analytics Bellabeat Capstone Project [Dataset]. https://www.kaggle.com/datasets/jeffburkle/google-data-analytics-bellabeat-capstone-project/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeff Burkle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of a capstone project for the Google Data Analytics Certificate. It contains cleaned, merged, and feature-engineered Fitbit data from 35 participants, originally sourced from publicly available Fitabase exports. The goal is to explore user behavior and engagement patterns to inform marketing strategies for Bellabeat, a wellness tech company.

  11. Data from: Online Retail Dataset

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Minal Choudhary (2025). Online Retail Dataset [Dataset]. https://www.kaggle.com/datasets/minalchoudhary/online-retail-dataset
    Explore at:
    zip (7572122 bytes)
    Dataset updated
    Nov 19, 2025
    Authors
    Minal Choudhary
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description: Online Retail II

    This dataset contains 525,461 transaction-level records from an online retail store based in the United Kingdom. It captures detailed information about customer purchases, products, pricing, and order timestamps, making it suitable for sales analytics, customer behavior analysis, product performance evaluation, and SQL data engineering projects.

    Key Features

    • Invoice: A unique identifier for each order. Some invoices may represent returns or cancellations depending on business rules.
    • StockCode: Product-level unique code identifying each item sold.
    • Description: Text description of the product purchased.
    • Quantity: Number of units bought. Negative values typically indicate returns.
    • InvoiceDate: Timestamp indicating the exact date and time of the transaction.
    • Price: Unit price of the product in the transaction currency.
    • Customer ID: Unique identifier assigned to each registered customer. Missing values may indicate guest or unregistered buyers.
    • Country: The country where the customer is located, enabling regional and international sales analysis.
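
    A short pandas sketch of the kind of sales analysis these columns support (the file name is an assumption; note the space in the "Customer ID" header):

    import pandas as pd

    retail = pd.read_csv("online_retail_II.csv", parse_dates=["InvoiceDate"])  # hypothetical name
    sales = retail[retail["Quantity"] > 0].copy()         # negative quantities are returns
    sales["Revenue"] = sales["Quantity"] * sales["Price"]
    print(sales.groupby("Country")["Revenue"].sum().nlargest(10))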

  12. Synthetic E-Commerce Relational Datasets

    • kaggle.com
    Updated Aug 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nael Aqel (2025). Synthetic E-Commerce Relational Datasets [Dataset]. https://www.kaggle.com/datasets/naelaqel/synthetic-e-commerce-relational-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nael Aqel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic E-Commerce Relational Dataset

    This dataset consists of synthetically generated data designed to simulate a realistic e-commerce environment; it does not describe real customers or transactions.

    Purpose

    To provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. Ideal for benchmarking, educational projects, and data engineering experiments.

    Entity Relationship Diagram (ERD) - Tables Overview

    1. Customers

    • customer_id (int): Unique identifier for each customer
    • name (string): Customer full name
    • email (string): Customer email address
    • gender (string): Customer gender ('Male', 'Female', 'Other')
    • signup_date (date): Date customer signed up
    • country (string): Customer country of residence

    2. Products

    • product_id (int): Unique identifier for each product
    • product_name (string): Name of the product
    • category (string): Product category (e.g., Electronics, Books)
    • price (float): Price per unit
    • stock_quantity (int): Available stock count
    • brand (string): Product brand name

    3. Orders

    • order_id (int): Unique identifier for each order
    • customer_id (int): ID of the customer who placed the order (foreign key to Customers)
    • order_date (date): Date when order was placed
    • total_amount (float): Total amount for the order
    • payment_method (string): Payment method used (Credit Card, PayPal, etc.)
    • shipping_country (string): Country where the order is shipped

    4. Order Items

    • order_item_id (int): Unique identifier for each order item
    • order_id (int): ID of the order this item belongs to (foreign key to Orders)
    • product_id (int): ID of the product ordered (foreign key to Products)
    • quantity (int): Number of units ordered
    • unit_price (float): Price per unit at order time

    5. Product Reviews

    • review_id (int): Unique identifier for each review
    • product_id (int): ID of the reviewed product (foreign key to Products)
    • customer_id (int): ID of the customer who wrote the review (foreign key to Customers)
    • rating (int): Rating score (1 to 5)
    • review_text (string): Text content of the review
    • review_date (date): Date the review was written

    Visual ERD

    (ERD image of the tables above.)

    Notes

    • All data is randomly generated using Python’s Faker library, so it does not reflect any real individuals or companies.
    • The data is provided in both CSV and Parquet formats.
    • The generator script is available in the accompanying GitHub repository for reproducibility and customization.

    Output

    The script saves two folders inside the specified output path:

    csv/    # CSV files
    parquet/  # Parquet files
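
    As one example of the practice queries this layout enables, here is a hedged DuckDB sketch joining the documented tables straight from the Parquet files (the per-table file names are assumptions):

    import duckdb  # pip install duckdb

    top_countries = duckdb.sql("""
        SELECT c.country, ROUND(SUM(oi.quantity * oi.unit_price), 2) AS revenue
        FROM 'parquet/order_items.parquet' AS oi
        JOIN 'parquet/orders.parquet'    AS o ON oi.order_id   = o.order_id
        JOIN 'parquet/customers.parquet' AS c ON o.customer_id = c.customer_id
        GROUP BY c.country
        ORDER BY revenue DESC
        LIMIT 5
    """).df()
    print(top_countries)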
    

    License

    MIT License


  13. UCI Mechanical Analysis Data Set

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Heitor Nunes (2022). UCI Mechanical Analysis Data Set [Dataset]. https://www.kaggle.com/datasets/heitornunes/mechanical-analysis
    Explore at:
    zip (120333 bytes)
    Dataset updated
    Apr 16, 2022
    Authors
    Heitor Nunes
    Description

    Context

    Please read the description file of the dataset. My work consisted of adjusting the data into a file format acceptable by Kaggle standards.

    Content

    1 - instance - instance indicator

    1 - component - component number (integer)

    2 - sup - support in the machine where measure was taken (1..4)

    3 - cpm - frequency of the measure (integer)

    4 - mis - measure (real)

    5 - misr - earlier measure (real)

    6 - dir - filter, type of the measure, and direction: vo = no filter, velocity, horizontal; va = no filter, velocity, axial; vv = no filter, velocity, vertical; ao = no filter, amplitude, horizontal; aa = no filter, amplitude, axial; av = no filter, amplitude, vertical; io = filter, velocity, horizontal; ia = filter, velocity, axial; iv = filter, velocity, vertical

    7 - omega - rpm of the machine (integer, the same for components of one example)

    8 - class - classification (1..6, the same for components of one example)

    9 - comb. class - combined faults

    10 - other class - other faults occurring

    Acknowledgements

    Data Source: https://archive.ics.uci.edu/ml/datasets/Mechanical+Analysis

  14. week3class1review

    • kaggle.com
    Updated Sep 2, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Kreiger (2021). week3class1review [Dataset]. https://www.kaggle.com/benkreiger/week3class1review/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ben Kreiger
    Description

    Robotics for All Data Science With Python Week 3 Class 1 Review Projects

  15. Cloud Carbon Emissions Dataset

    • kaggle.com
    zip
    Updated Sep 23, 2025
    Cite
    Nidhi Suryavanshi (2025). Cloud Carbon Emissions Dataset [Dataset]. https://www.kaggle.com/datasets/nidhis4444/cloud-carbon-emissions-dataset
    Explore at:
    zip (36611 bytes)
    Dataset updated
    Sep 23, 2025
    Authors
    Nidhi Suryavanshi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a synthetic simulation of cloud resource usage and carbon emissions, designed for experimentation, analysis, and forecasting in sustainability and data engineering projects.

    Included Tables:

    • projects: Metadata about projects/teams.
    • services: Metadata about cloud services (Compute, Storage, AI, etc.).
    • emission_factors: Regional grid carbon intensity (gCO₂ per kWh).
    • service_energy_coefficients: Conversion rates from usage units to kWh.
    • daily_usage: Raw service usage (per project × service × region × day).
    • daily_emissions: Carbon emissions derived from usage × regional emission factors (see the sketch after this list).
    • service_cost_coefficients: Conversion rates from usage units to cost (USD per unit).
    • daily_cost_emissions: Integrated fact table combining usage, energy, cost, and emissions for analysis.
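
    A minimal pandas sketch of that usage-to-emissions derivation; all file and column names here are assumptions, since the exact schema is not spelled out in this description.

    import pandas as pd

    usage = pd.read_csv("daily_usage.csv")                    # hypothetical file names
    coeff = pd.read_csv("service_energy_coefficients.csv")
    factors = pd.read_csv("emission_factors.csv")

    df = usage.merge(coeff, on="service_id").merge(factors, on="region")
    df["kwh"] = df["usage_units"] * df["kwh_per_unit"]        # usage units -> kWh
    df["emissions_gco2"] = df["kwh"] * df["gco2_per_kwh"]     # kWh x grid intensity
    print(df.groupby("region")["emissions_gco2"].sum())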

    Features:

    • Simulated seasonality (weekend dips/spikes, holiday surges, quarter-end growth).
    • Regional variations in carbon intensity (e.g., coal-heavy vs. renewable grids).
    • Multiple projects and services for multi-dimensional analysis.
    • Directly importable into BigQuery for analytics and forecasting.

    Use Cases:

    • Explore sustainability analytics at scale.
    • Build carbon footprint dashboards.
    • Run AI/ML forecasting on emissions data.
    • Practice SQL, data modeling, and visualization.

    ⚠️ Note: All data is synthetic and created for educational/demo purposes. It does not represent actual cloud provider emissions.

  16. Drug Labels & Side Effects Dataset | 1400+ Records

    • kaggle.com
    zip
    Updated Aug 2, 2025
    Cite
    Pratyush Puri (2025). Drug Labels & Side Effects Dataset | 1400+ Records [Dataset]. https://www.kaggle.com/datasets/pratyushpuri/drug-labels-and-side-effects-dataset-1400-records
    Explore at:
    zip (51886 bytes)
    Dataset updated
    Aug 2, 2025
    Authors
    Pratyush Puri
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Drug Labels and Side Effects Dataset

    Dataset Overview

    This synthetic pharmaceutical dataset contains 1,393 records of drug information with 15 columns, designed for data science projects focusing on healthcare analytics, drug safety analysis, and pharmaceutical research. It simulates real-world pharmaceutical data with appropriate variety and realistic constraints for machine learning applications.

    Dataset Specifications

    • Total Records: 1,393
    • Total Columns: 15
    • File Format: CSV
    • Data Types: Mixed (intentional, for data cleaning practice)
    • Domain: Pharmaceutical/Healthcare
    • Use Case: ML training, data analysis, healthcare research

    Column Specifications

    Categorical Features

    • drug_name (Object, 1,283 unique): Pharmaceutical drug names with realistic naming patterns, e.g., "Loxozepam32", "Amoxparin43", "Virazepam10".
    • manufacturer (Object, 10 unique): Major pharmaceutical companies, e.g., Pfizer Inc., AstraZeneca, Johnson & Johnson.
    • drug_class (Object, 10 unique): Therapeutic drug classifications, e.g., Antibiotic, Analgesic, Antidepressant, Vaccine.
    • indications (Object, 10 unique): Medical conditions the drug treats, e.g., "Pain relief", "Bacterial infections", "Depression treatment".
    • side_effects (Object, 434 unique): Combination of side effects (1-3 per drug), e.g., "Nausea, Dizziness", "Headache, Fatigue, Rash".
    • administration_route (Object, 7 unique): Method of drug delivery, e.g., Oral, Intravenous, Topical, Inhalation, Sublingual.
    • contraindications (Object, 10 unique): Medical warnings for drug usage, e.g., "Pregnancy", "Heart disease", "Liver disease".
    • warnings (Object, 10 unique): Safety instructions and precautions, e.g., "Take with food", "Avoid alcohol", "Monitor blood pressure".
    • batch_number (Object, 1,393 unique): Manufacturing batch identifiers, e.g., "xr691zv", "Ye266vU", "Rm082yX".
    • expiry_date (Object, 782 unique): Drug expiration dates (YYYY-MM-DD), e.g., "2025-12-13", "2027-03-09", "2026-10-06".
    • side_effect_severity (Object, 3 unique): Severity classification: Mild, Moderate, Severe.
    • approval_status (Object, 3 unique): Regulatory approval status: Approved, Pending, Rejected.

    Numerical Features

    • approval_year (Float/String*): 1990-2024; mean 2006.7, std 10.0. FDA/regulatory approval year.
    • dosage_mg (Float/String*): 10-990 mg; mean 499.7, std 290.0. Medication strength in milligrams.
    • price_usd (Float/String*): $2.32-$499.24; mean $251.12, std $144.81. Drug price in US dollars.

    *Intentionally stored as mixed types for data cleaning practice
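
    Since these three columns arrive with mixed types, a first cleaning pass might coerce them to numerics; a minimal sketch, with the file name assumed:

    import pandas as pd

    drugs = pd.read_csv("drug_labels_side_effects.csv")  # hypothetical file name
    for col in ["approval_year", "dosage_mg", "price_usd"]:
        # Strip "$" and "," then coerce; unparseable entries become NaN for review.
        drugs[col] = pd.to_numeric(
            drugs[col].astype(str).str.replace(r"[$,]", "", regex=True),
            errors="coerce",
        )
    print(drugs[["approval_year", "dosage_mg", "price_usd"]].describe())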

    Key Statistics

    Manufacturer Distribution

    • Pfizer Inc.: 170 (12.2%)
    • AstraZeneca: ~140 (~10.0%)
    • Merck & Co.: ~140 (~10.0%)
    • Johnson & Johnson: ~140 (~10.0%)
    • GlaxoSmithKline: ~140 (~10.0%)
    • Others: ~623 (~44.8%)

    Drug Class Distribution

    • Anti-inflammatory: 154 (most common)
    • Antibiotic: ~140
    • Antidepressant: ~140
    • Antiviral: ~140
    • Vaccine: ~140
    • Others: ~679

    Side Effect Severity

    • Severe: 488 (35.0%)
    • Moderate: ~453 (~32.5%)
    • Mild: ~452 (~32.5%)

    Potential Use Cases

    1. Machine Learning Applications

    • Drug Approval Prediction: Predict approval likelihood based on drug characteristics
    • Price Prediction: Estimate drug pricing using features like class, manufacturer, dosage
    • Side Effect Classification: Classify severity based on drug properties
    • Market Success Analysis: Analyze factors contributing to drug market performance

    2. Data Engineering Projects

    • ETL Pipeline Development: Practice data cleaning and transformation
    • Data Quality Assessment: Implement data validation and quality checks
    • Database Design: Create normalized pharmaceutical database schema
    • Real-time Processing: Stream processing for drug monitoring systems

    3. Business Intelligence

    • Pharmaceutical Market Analysis: Manufacturer market share and competitive analysis
    • Drug Safety Analytics: Side effect patterns and safety profile analysis
    • Regulatory Compliance: Approval trends and regulatory timeline analysis
    • Pricing Strategy: Competitive pricing analysis across drug classes

    Recommended Next Steps

    1. Data Cleaning Pipeline: Implement comprehe...
  17. Tunnel Risk Dataset

    • kaggle.com
    zip
    Updated Dec 17, 2024
    Cite
    Ziya (2024). Tunnel Risk Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/tunnel-risk-dataset
    Explore at:
    zip (12738 bytes)
    Dataset updated
    Dec 17, 2024
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to facilitate the development of deep learning-based models for real-time risk assessment in tunnel engineering projects. The data contains critical engineering parameters, geotechnical properties, and sensor-based monitoring data collected or simulated under various tunneling conditions. Each record corresponds to specific tunneling conditions and is labeled with a risk level to indicate the likelihood of structural failure or hazardous events.

    Dataset Content The dataset contains 1000 samples (modifiable based on requirements), with each row representing a unique tunneling scenario. The key features include:

    1. Tunnel Parameters
       • Tunnel_ID: Unique identifier for each tunnel record.
       • Length (m): Length of the tunnel section (in meters).
       • Depth (m): Depth at which the tunnel is located (in meters).
    2. Geotechnical and Environmental Features
       • Rock_Type: Type of geological material surrounding the tunnel (e.g., Clay, Sandstone, Mixed, Shale).
       • Water_Level: Groundwater level conditions, categorized as Low, Medium, or High.
    3. Monitoring Data
       • Displacement (mm): Real-time tunnel deformation or displacement, measured in millimeters.
       • Settlement (mm): Vertical surface settlement above the tunnel, measured in millimeters.
    4. Risk Level (Target Variable): each record is labeled with a risk assessment for its tunneling condition:
       • 0 = Low Risk (safe conditions)
       • 1 = Medium Risk (moderate risk, monitoring required)
       • 2 = High Risk (high likelihood of structural stress)
       • 3 = Critical Risk (failure scenario or hazardous condition)

    The risk levels are assigned based on threshold values for tunnel displacement and settlement, which are essential indicators of tunnel stability.
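
    A hedged sketch of that threshold-based labeling on the two documented indicators; the cutoff values and the file name are assumptions, as the dataset's actual thresholds are not given here.

    import pandas as pd

    tunnels = pd.read_csv("tunnel_risk.csv")  # hypothetical file name

    def risk_level(row):
        disp, sett = row["Displacement (mm)"], row["Settlement (mm)"]
        if disp > 50 or sett > 40:   # assumed critical cutoffs
            return 3
        if disp > 30 or sett > 25:   # assumed high-risk cutoffs
            return 2
        if disp > 15 or sett > 10:   # assumed medium-risk cutoffs
            return 1
        return 0

    tunnels["Risk_Level"] = tunnels.apply(risk_level, axis=1)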

  18. Voice Search AI Conversational Queries 2025

    • kaggle.com
    zip
    Updated Jul 30, 2025
    Cite
    Pratyush Puri (2025). Voice Search AI Conversational Queries 2025 [Dataset]. https://www.kaggle.com/datasets/pratyushpuri/voice-search-1500-conversational-queries-2025/data
    Explore at:
    zip (72384 bytes)
    Dataset updated
    Jul 30, 2025
    Authors
    Pratyush Puri
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Voice Search Query Captures Dataset

    Overview

    This synthetically created dataset contains 1,555 conversational voice search queries captured across multiple devices, languages, and user intents. The dataset simulates realistic voice command interactions for machine learning and analytics projects in the conversational AI domain.

    Dataset Specifications

    • Total Rows: 1,555
    • Total Columns: 13
    • File Format: CSV
    • Data Types: Mixed (String, Integer, Float, DateTime, Boolean)
    • Null Values: Strategically distributed (~5-10% across select columns)

    Column Details

    • query_id (Integer): Unique identifier for each voice search query. Example: 1, 2, 3, ..., 1555. Nulls: 0%.
    • user_id (String, UUID): Unique user identifier. Example: bdd640fb-0667-4ad1-9c80-317fa3b1799d. Nulls: 0%.
    • timestamp (DateTime): When the query was made. Example: 2025-04-17 19:27:32. Nulls: 0%.
    • device_type (String): Device used for voice search. Example: smartphone, smart speaker, smartwatch, tablet, car assistant. Nulls: 0%.
    • query_text (String): The actual voice search text. Example: "What's the weather like today?", "Call Mom". Nulls: 0%.
    • language (String): Language of the query. Example: English, Spanish, Mandarin, Hindi, French. Nulls: 0%.
    • intent (String): Query category/purpose. Example: information, navigation, command, entertainment, shopping. Nulls: 0%.
    • location (String): User's geographical location. Example: New York, Los Angeles, London, Delhi, Shanghai, Paris, Tokyo. Nulls: 0%.
    • query_duration_sec (Float): Duration of the voice query in seconds (1.05 to 12.71). Nulls: 0%.
    • num_words (Float*): Number of words in the query (2.0 to 7.0). Nulls: 0%.
    • is_successful (Object): Whether the query returned results. Values: True, False, None. Nulls: ~15%.
    • confidence_score (String*): Speech recognition confidence (0.5-1.0). Example: "0.87", "0.61", "1.0". Nulls: 0%.
    • device_os_version (String): Operating system version. Example: iOS 14, iOS 15, Android 10, Android 11, None. Nulls: ~20%.

    Intent Categories Distribution

    • Information: Knowledge/fact-seeking queries, e.g., "How tall is the Eiffel Tower?", "What's the weather like today?"
    • Navigation: Location/direction requests, e.g., "Directions to nearest gas station", "Find nearest coffee shop"
    • Command: Device/app control instructions, e.g., "Set an alarm for 7 AM", "Turn off the lights", "Call Mom"
    • Entertainment: Media/content requests, e.g., "Play latest movie trailers", "Show me comedy shows"
    • Shopping: Purchase/commerce-related queries, e.g., "Order me a pizza", "Buy new headphones", "Track my Amazon order"

    Device Distribution

    • Smartphone: Mobile, on-the-go queries
    • Smart Speaker: Home-based voice commands
    • Smartwatch: Quick, hands-free interactions
    • Tablet: Casual browsing and queries
    • Car Assistant: In-vehicle voice commands

    Language & Location Coverage

    • English: New York, Los Angeles, London (global communication)
    • Spanish: Los Angeles, New York (Hispanic markets)
    • Mandarin: Shanghai, global cities (Chinese user base)
    • Hindi: Delhi, global cities (Indian diaspora)
    • French: Paris, global cities (European markets)

    Data Quality Features

    Realistic Patterns

    • Query Duration: Normal distribution around 5 seconds (1-12 sec range)
    • Word Count: Aligned with actual query complexity (2-7 words)
    • Intent Matching: Query text semantically matches intent categories
    • Temporal Distribution: Queries spread across 2025 timeframe

    Data Challenges (Intentional)

    • Mixed Data Types: num_words stored as float instead of int
    • String Numerics: confidence_score stored as string instead of float
    • Strategic Nulls: Missing values in is_successful and device_os_version (a cleanup sketch follows this list)
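
    A minimal pandas sketch addressing these three intentional issues (the file name is an assumption; the documented column names are used as-is):

    import pandas as pd

    queries = pd.read_csv("voice_search_queries_2025.csv")  # hypothetical file name

    queries["num_words"] = queries["num_words"].astype("Int64")      # float -> nullable int
    queries["confidence_score"] = pd.to_numeric(
        queries["confidence_score"], errors="coerce")                # string -> float
    queries["is_successful"] = queries["is_successful"].map(
        {"True": True, "False": False})                              # None stays missing
    print(queries.isna().mean().round(3))                            # null share per column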

    Use Cases

    Analytics Applications

    • Voice search trend analysis
    • Device usage pattern identification
    • Multi-language query processing
    • Intent classification modeling
    • User behavior segmentation

    Machine Learning Projects

    • Classification: Intent prediction from query text
    • NLP: Multi-language text analysis
    • Time Series: Usage pattern analysis
    • Clustering: User behavior grouping
    • Recommendation: Query suggestion systems

    Data Engineering Practice

    • Data cleaning and type conversion
    • Handling missing values
    • Multi-language data processing
    • Real-time analytics pipeline development

    Technical Notes

    • Encoding: UTF-8 for multi-language support
    • Timestamp Format: YYYY-MM-DD HH:MM:SS
    • UUID Format: Standard UUID4 format
    • Geographic ...
  19. SAP FI Anomaly Detection - Prepared Data & Models

    • kaggle.com
    zip
    Updated Apr 30, 2025
    Cite
    aidsmlProjects (2025). SAP FI Anomaly Detection - Prepared Data & Models [Dataset]. https://www.kaggle.com/datasets/aidsmlprojects/sap-fi-anomaly-detection-prepared-data-and-models
    Explore at:
    zip (9285 bytes)
    Dataset updated
    Apr 30, 2025
    Authors
    aidsmlProjects
    Description

    Intelligent SAP Financial Integrity Monitor

    Project Status: Proof-of-Concept (POC) - Capstone Project

    Overview

    This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.

    The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.

    This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.

    Author: Anitha R (https://www.linkedin.com/in/anithaswamy)

    Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no description available.

    Motivation

    Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:

    • Inaccurate financial reporting
    • Significant reconciliation efforts
    • Potential audit failures or compliance issues
    • Masking of operational errors or fraud

    Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.

    Key Features

    • Data Cleansing & Preparation: Rigorous process to handle common SAP data extract issues (duplicates, financial imbalance), prioritizing FAGLFLEXA for reliability.
    • Exploratory Data Analysis (EDA): Uncovered baseline patterns in posting times, user activity, amounts, and process context.
    • Feature Engineering: Created 16 context-aware features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
    • Hybrid Anomaly Detection:
      • Ensemble ML: Utilized unsupervised models: Isolation Forest (IF), Local Outlier Factor (LOF) (via Scikit-learn), and an Autoencoder (AE) (via TensorFlow/Keras). A minimal sketch follows this feature list.
      • Expert Rules (HRFs): Implemented highly customizable High-Risk Flags based on percentile thresholds and SAP logic (e.g., weekend posting, missing cost center).
    • Multi-Faceted Prioritization: Combined ML model consensus (Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
    • Contextual Anomaly Reason: Generated a Review_Focus text description summarizing why an item was flagged.
    • Interactive Dashboard (Streamlit):
      • File upload for anomaly/feature data.
      • Overview KPIs (including multi-currency "Value at Risk by CoCode").
      • Comprehensive filtering capabilities.
      • Dynamic visualizations (User/Doc Type/HRF frequency, Time Trends).
      • Interactive AgGrid table for anomaly list investigation.
      • Detailed drill-down view for selected anomalies.
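
    A minimal sketch of the ensemble step above, using only the Scikit-learn pieces named in this description (sap_engineered_features.csv and the FE_ prefix come from Phase 2 below; the contamination settings are assumptions, and the Autoencoder is omitted for brevity):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    # Engineered features from Phase 2; columns are prefixed FE_.
    df = pd.read_csv("sap_engineered_features.csv")
    X = StandardScaler().fit_transform(df.filter(like="FE_"))

    # Each model flags the items it considers anomalous (prediction of -1).
    if_flags = IsolationForest(contamination=0.01, random_state=42).fit_predict(X) == -1
    lof_flags = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X) == -1

    # Model consensus feeds the Priority_Tier logic described above.
    df["Model_Anomaly_Count"] = if_flags.astype(int) + lof_flags.astype(int)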

    Methodology Overview

    The project followed a structured approach:

    1. Phase 1: Data Quality Assessment & Preparation: Cleaned and validated raw BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
    2. Phase 2: Exploratory Data Analysis & Feature Engineering: Analyzed cleaned data patterns and engineered 16 features quantifying anomaly indicators. Resulted in sap_engineered_features.csv.
    3. Phase 3: Baseline Anomaly Detection & Evaluation: Scaled features, applied IF and LOF models, evaluated initial results.
    4. Phase 4: Advanced Modeling & Prioritization: Trained Autoencoder model, combined all model outputs and HRFs, implemented prioritization logic, generated context, and created the final anomaly list.
    5. Phase 5: UI Development: Built the Streamlit dashboard for interactive analysis and investigation.

    (For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder - if you include it).

    Technology Stack

    • Core Language: Python 3.x
    • Data Manipulation & Analysis: Pandas, NumPy
    • Machine Learning: Scikit-learn (IsolationForest, LocalOutlierFactor, StandardScaler), TensorFlow/Keras (Autoencoder)
    • Visualization: Matplotlib, Seaborn, Plotly Express
    • Dashboard: Streamlit, streamlit-aggrid
    • Utilities: Joblib (for saving scaler)

    Libraries:

    # Model/Scaler Saving
    joblib==1.4.2

    # Data I/O Efficiency (optional, but good practice if used)
    pyarrow==19.0.1

    # Machine L...

  20. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Explore at:
    zip (7548686 bytes)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.
