99 datasets found
  1. Supply Chain Dataset

    • kaggle.com
    zip
    Updated May 22, 2025
    Cite
    Ziya (2025). Supply Chain Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/bdt-mba-supply-chain-dataset
    Explore at:
    zip (20611 bytes)
    Dataset updated
    May 22, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to simulate supply chain operations in large-scale engineering projects. It integrates realistic data from IoT sensors, digital twins, and blockchain-enabled monitoring systems over the years 2023 to 2024.

    It aims to support research in predictive maintenance, resource optimization, secure data exchange, and supply chain transparency through advanced analytics and machine learning.

    ⭐ Key Features

    Time-bound IoT Sensor Data: Includes real-time-like sensor outputs such as temperature and vibration across multiple locations and assets.

    Digital Twin Sync Fields: Tracks Condition_Score and Last_Maintenance to simulate digital twin feedback loops.

    Operational KPIs: Features supply chain metrics like Resource_Utilization, Delivery_Efficiency, and Downtime_Hours.

    Blockchain Contextual Fit: Designed to be compatible with blockchain audit trails and smart contract triggers (e.g., anomaly response, automated logistics payments).

    Labeled Targets: SupplyChain_Efficiency_Label classifies overall efficiency into 3 tiers (0: Low, 1: Medium, 2: High) based on predefined KPI thresholds (a labeling sketch follows this feature list).

    Location-aware Simulation: Assets and operations are tagged by realistic geographic locations.

    Supply Chain Economics: Captures Inventory_Level and Logistics_Cost for resource allocation analysis.

    Year-specific Scope: Covers the period from 2023 to 2024, aligning with recent and ongoing digital transformation trends.
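
    As an illustration of how such a three-tier label can be derived from the documented KPI columns, here is a minimal pandas sketch; the file name and the threshold values are assumptions, since the author's actual KPI cutoffs are not published in this description.

    import pandas as pd

    df = pd.read_csv("supply_chain.csv")  # hypothetical file name

    def efficiency_tier(row):
        # Assumed cutoffs for illustration only; the dataset's real,
        # predefined KPI thresholds are not listed in the description.
        if row["Delivery_Efficiency"] > 0.9 and row["Downtime_Hours"] < 5:
            return 2  # High
        if row["Delivery_Efficiency"] > 0.7:
            return 1  # Medium
        return 0      # Low

    df["SupplyChain_Efficiency_Label"] = df.apply(efficiency_tier, axis=1)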

  2. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been few attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
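
    Once NICHE.csv is downloaded, a quick way to check the 441/131 label split is shown below; the label column name is an assumption, since the description does not spell out the CSV header.

    import pandas as pd

    niche = pd.read_csv("NICHE.csv")
    print(niche.columns.tolist())         # inspect the real header first
    print(niche["label"].value_counts())  # assumed column name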

  3. Data Engineer Salary in 2024

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Kshitij (2024). Data Engineer Salary in 2024 [Dataset]. https://www.kaggle.com/datasets/chopper53/data-engineer-salary-in-2024
    Explore at:
    zip (110281 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Kshitij
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides insights into data engineer salaries and employment attributes for the year 2024. It includes information such as salary, job title, experience level, employment type, employee residence, remote work ratio, company location, and company size.

    The dataset allows for analysis of salary trends, employment patterns, and geographic variations in data engineering roles. It can be used by researchers, analysts, and organizations to understand the evolving landscape of data engineering employment and compensation.

    Feature Description:

    • work_year: The year in which the data was collected (2024).
    • experience_level: The experience level of the employee, categorized as SE (Senior Engineer), MI (Mid-Level Engineer), or EL (Entry-Level Engineer).
    • employment_type: The type of employment, such as full-time (FT), part-time (PT), contract (C), or freelance (F).
    • job_title: The title or role of the employee within the company, for example, AI Engineer.
    • salary: The salary of the employee in the local currency (e.g., 202,730 USD).
    • salary_currency: The currency in which the salary is denominated (e.g., USD).
    • salary_in_usd: The salary converted to US dollars for standardization purposes.
    • employee_residence: The country of residence of the employee.
    • remote_ratio: The ratio indicating the extent of remote work allowed in the position (0 for no remote work, 1 for fully remote).
    • company_location: The location of the company where the employee is employed.
    • company_size: The size of the company, often categorized by the number of employees (S for small, M for medium, L for large).
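
    As an example of the salary-trend analysis the description suggests, here is a minimal pandas sketch using the documented column names (the CSV file name is an assumption):

    import pandas as pd

    salaries = pd.read_csv("data_engineer_salary_2024.csv")  # hypothetical file name
    print(
        salaries.groupby("experience_level")["salary_in_usd"]
                .median()
                .sort_values(ascending=False)
    )
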
  4. Data Engineering Jobs in the USA Glassdoor

    • kaggle.com
    zip
    Updated Oct 17, 2023
    Cite
    Hamza El Belghiti (2023). Data Engineering Jobs in the USA Glassdoor [Dataset]. https://www.kaggle.com/datasets/hamzaelbelghiti/data-engineering-jobs-in-the-usa-glassdoor/versions/5
    Explore at:
    zip (2612489 bytes)
    Dataset updated
    Oct 17, 2023
    Authors
    Hamza El Belghiti
    License

    https://cdla.io/sharing-1-0/

    Description

    This dataset contains a list of data engineering job postings scraped from Glassdoor in the USA (March 2023). It includes details such as the company name, location, job title, job description, estimated salary, company size, company type, company sector, company industry, the year the company was founded, and company revenue. The dataset can be used for exploring data engineering job trends in the USA, analyzing salaries, and identifying the most in-demand skills and qualifications.

    You can see the whole project on GitHub.

    How to use

    • Identify the education and experience required for jobs
    • Find the top recruiting companies and the industry they work in
    • Explore data engineering skills required in job descriptions (a keyword-count sketch follows this list)
    • Predict salary based on location, industry, company rating, etc.
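
    A minimal sketch of that skills exploration; the column name job_description and the skill list are assumptions made for illustration, since the exact headers are not given here.

    import pandas as pd

    jobs = pd.read_csv("glassdoor_data_engineering_jobs.csv")  # hypothetical file name
    skills = ["python", "sql", "spark", "aws", "airflow"]      # assumed skill keywords
    for skill in skills:
        count = jobs["job_description"].str.contains(skill, case=False, na=False).sum()
        print(f"{skill}: {count} postings")
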
  5. Materials and their Mechanical Properties

    • kaggle.com
    zip
    Updated Apr 15, 2023
    Cite
    Purushottam Nawale (2023). Materials and their Mechanical Properties [Dataset]. https://www.kaggle.com/datasets/purushottamnawale/materials
    Explore at:
    zip (145487 bytes)
    Dataset updated
    Apr 15, 2023
    Authors
    Purushottam Nawale
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    We utilized a dataset of Machine Design materials, which includes information on their mechanical properties. The dataset was obtained from the Autodesk Material Library and comprises 15 columns, also referred to as features/attributes. This dataset is a real-world dataset, and it does not contain any random values. However, due to missing values, we only utilized seven of these columns for our ML model. You can access the related GitHub Repository here: https://github.com/purushottamnawale/material-selection-using-machine-learning

    To develop an ML model, we employed several Python libraries, including NumPy, pandas, scikit-learn, and graphviz, in addition to other technologies such as Weka, MS Excel, VS Code, Kaggle, Jupyter Notebook, and GitHub. We used Weka to quickly visualize the data and understand the relationships between the features without requiring any programming expertise.

    My problem statement is material selection for an EV chassis. If you have any specific ideas, implement them and share your code on Kaggle.

    A detailed research paper is available at https://iopscience.iop.org/article/10.1088/1742-6596/2601/1/012014

  6. BIM-AI Integrated Dataset

    • kaggle.com
    zip
    Updated Feb 28, 2025
    Cite
    Ziya (2025). BIM-AI Integrated Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/bim-ai-integrated-dataset
    Explore at:
    zip (162775 bytes)
    Dataset updated
    Feb 28, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for whole-life-cycle management of civil engineering projects, integrating Building Information Modeling (BIM) and Artificial Intelligence (AI). It includes comprehensive project data covering cost, schedule, structural health, environmental conditions, resource allocation, safety risks, and drone-based monitoring.

    Key Features

    • Project Metadata: ID, type (bridge, road, building, etc.), location, and timeline.
    • Financial Data: Planned vs. actual cost, cost overruns.
    • Scheduling Data: Planned vs. actual duration, schedule deviation.
    • Structural Health Monitoring: Vibration levels, crack width, load-bearing capacity.
    • Environmental Factors: Temperature, humidity, air quality, weather conditions.
    • Resource & Safety Management: Material usage, labor hours, equipment utilization, accident records.
    • Drone-Based Monitoring: Image analysis scores, anomaly detection, completion percentage.
    • Target Variable: Risk Level (Low, Medium, High) based on cost, schedule, safety, and structural health.

    Use Cases

    • Predictive Modeling: Train AI models to forecast project risks and optimize decision-making (a minimal sketch follows this section).
    • BIM & AI Integration: Leverage real-time IoT and drone data for smart construction management.
    • Risk Assessment: Identify early signs of cost overruns, delays, and structural failures.
    • Automation & Efficiency: Develop automated maintenance and safety monitoring frameworks.
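
    A minimal supervised-learning sketch against the documented Risk Level target; the file name, the exact target column name, and the preprocessing are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("bim_ai_projects.csv")              # hypothetical file name
    y = df["Risk_Level"]                                 # assumed target column name
    X = pd.get_dummies(df.drop(columns=["Risk_Level"]))  # one-hot encode categoricals

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print("Held-out accuracy:", clf.score(X_te, y_te))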

  7. Electronics Project(2600+ projects)

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    NICK-2908 (2025). Electronics Project(2600+ projects) [Dataset]. https://www.kaggle.com/datasets/nick2908/electronics-project2600-projects
    Explore at:
    zip (274002 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    NICK-2908
    Description

    Summary

    This dataset contains over 2,600 circuit projects scraped from Instructables, focusing on the "Circuits" category. It includes project titles, authors, engagement metrics (views, likes), and the primary component used (Instruments).

    How This Data Was Collected

    I built a web scraper using Python and Selenium to gather all project links (over 2,600 of them) by handling the "Load All" button. The full page source was saved, and I then used BeautifulSoup to parse the HTML and extract the raw data for each project.

    Data Cleaning (The Important Part!)

    The raw data was very messy. I performed a full data cleaning pipeline in a Colab notebook using Pandas.

    • Converted Text to Numbers: Views and Likes arrived as text fields (object) and were converted to numeric types.
    • Handled "K" Values: Found and converted "K" values (e.g., "2.2K") into proper numbers (2200).
    • Handled Missing Data: Replaced all "N/A" strings with null values.
    • Mean Imputation: To keep the dataset complete, I filled all missing Likes and Views with the mean (average) of the respective column (a sketch of these steps follows this list).
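
    A minimal pandas sketch of these cleaning steps, assuming a raw CSV with the columns described (the file name is hypothetical); the log columns match the log_Views and log_Likes fields mentioned below.

    import numpy as np
    import pandas as pd

    raw = pd.read_csv("instructables_circuits_raw.csv")  # hypothetical file name

    def parse_count(value):
        # Convert strings like "2.2K" to 2200; treat "N/A" as missing.
        s = str(value).strip()
        if s.upper() == "N/A":
            return np.nan
        if s.upper().endswith("K"):
            return float(s[:-1]) * 1000
        return float(s)

    for col in ["Views", "Likes"]:
        raw[col] = raw[col].map(parse_count)
        raw[col] = raw[col].fillna(raw[col].mean())  # mean imputation
        raw[f"log_{col}"] = np.log1p(raw[col])       # tames the heavy right skew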

    Key Insights & Analysis

    1. "Viral" Effect (High Skew): The Views and Likes data is highly right-skewed (skewness of ~9.5). This shows a "viral" effect where a tiny number of superstar projects get the vast majority of all views and likes.

    [](url)

    1. Log-Transformation: Because of the skew, I created log_Views and log_Likes columns. A 2D density plot of these log-transformed columns shows a strong positive correlation (as likes increase, views increase) and that the most "typical" project gets around 30-40 likes and 4,000-5,000 views. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F29431778%2Fd90e2039f1be11b53308ab7191b10954%2Fdownload%20(1).png?generation=1763013545903998&alt=media" alt="">

    2. Top Instruments: I've also analyzed the most popular instruments to see which ones get the most engagement. https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F29431778%2F19fca1ce142ddddc1e16a5319a1f4fc5%2Fdownload%20(2).png?generation=1763013562400830&alt=media" alt="">

    Column Descriptions

    • Title: The name of the project.
    • Project_Admin: The author/creator of the project.
    • Image_URL: The URL for the project's cover image.
    • Views: The total number of views (cleaned and imputed).
    • Likes: The total number of likes/favorites (cleaned and imputed).
    • Instruments: The main component or category tag (e.g., "Arduino", "Raspberry Pi").
  8. Supply Chain DataSet

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
    Explore at:
    zip (9340 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Amir Motefaker
    Description

    Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.

  9. Feature Engineering Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Harikant Shukla (2023). Feature Engineering Dataset [Dataset]. https://www.kaggle.com/datasets/harikantshukla/feature-engineering-dataset/discussion
    Explore at:
    zip (95245 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Harikant Shukla
    Description

    While searching for their dream house, buyers look at various factors, not just the height of the basement ceiling or the proximity to an east-west railroad.

    Using the dataset, find the factors that influence price negotiations while buying a house.

    There are 79 explanatory variables describing every aspect of residential homes in Ames, Iowa.

    Task to be Performed:

    1) Download "PEP1.csv" using the link given in the Feature Engineering project problem statement.
    2) For a detailed description of the dataset, download and refer to data_description.txt using the link given in the Feature Engineering project problem statement.

    Tasks to Perform:

    1) Import the necessary libraries.
       1.1 Pandas is a Python library for data manipulation and analysis.
       1.2 NumPy is a package that contains a multidimensional array object and several derivative ones.
       1.3 Matplotlib is a Python visualization package for 2D array plots.
       1.4 Seaborn is built on top of Matplotlib; it is used for exploratory data analysis and data visualization.
    2) Read the dataset.
       2.1 Understand the dataset.
       2.2 Print the names of the columns.
       2.3 Print the shape of the dataframe.
       2.4 Check for null values.
       2.5 Print the unique values.
       2.6 Select the numerical and categorical variables.
    3) Descriptive stats and EDA.
       3.1 EDA of numerical variables.
       3.2 Missing value treatment.
       3.3 Identify the skewness and distribution.
       3.4 Identify significant variables using a correlation matrix.
       3.5 Pair plot for distribution and density.

    Project Outcome

    • The aim of the project is to help understand working with the dataset and performing analysis.
    • The project assesses the data and prepares a fresh dataset for training and prediction.
    • A box plot is created to identify the variables with outliers.
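
    A condensed sketch of steps 1-3 above with the named libraries (PEP1.csv is the file named in the tasks; the heatmap choice is an assumption):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    df = pd.read_csv("PEP1.csv")
    print(df.shape)
    print(df.columns.tolist())
    print(df.isnull().sum().sort_values(ascending=False).head(10))

    numeric = df.select_dtypes(include=np.number)
    sns.heatmap(numeric.corr(), cmap="coolwarm")  # scan for significant variables
    plt.show()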

  10. Google Data Analytics Bellabeat Capstone Project

    • kaggle.com
    Updated Aug 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jeff Burkle (2025). Google Data Analytics Bellabeat Capstone Project [Dataset]. https://www.kaggle.com/datasets/jeffburkle/google-data-analytics-bellabeat-capstone-project/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jeff Burkle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is part of a capstone project for the Google Data Analytics Certificate. It contains cleaned, merged, and feature-engineered Fitbit data from 35 participants, originally sourced from publicly available Fitabase exports. The goal is to explore user behavior and engagement patterns to inform marketing strategies for Bellabeat, a wellness tech company.

  11. Data from: Online Retail Dataset

    • kaggle.com
    zip
    Updated Nov 19, 2025
    Cite
    Minal Choudhary (2025). Online Retail Dataset [Dataset]. https://www.kaggle.com/datasets/minalchoudhary/online-retail-dataset
    Explore at:
    zip (7572122 bytes)
    Dataset updated
    Nov 19, 2025
    Authors
    Minal Choudhary
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description: Online Retail II

    This dataset contains 525,461 transaction-level records from an online retail store based in the United Kingdom. It captures detailed information about customer purchases, products, pricing, and order timestamps, making it suitable for sales analytics, customer behavior analysis, product performance evaluation, and SQL data engineering projects.

    Key Features

    • Invoice: A unique identifier for each order. Some invoices may represent returns or cancellations depending on business rules.
    • StockCode: Product-level unique code identifying each item sold.
    • Description: Text description of the product purchased.
    • Quantity: Number of units bought. Negative values typically indicate returns.
    • InvoiceDate: Timestamp indicating the exact date and time of the transaction.
    • Price: Unit price of the product in the transaction currency.
    • Customer ID: Unique identifier assigned to each registered customer. Missing values may indicate guest or unregistered buyers.
    • Country: The country where the customer is located, enabling regional and international sales analysis.
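
    A short pandas sketch of the kind of sales analysis these columns support (the file name is an assumption; note the space in the "Customer ID" header):

    import pandas as pd

    retail = pd.read_csv("online_retail_II.csv", parse_dates=["InvoiceDate"])  # hypothetical name
    sales = retail[retail["Quantity"] > 0].copy()         # negative quantities are returns
    sales["Revenue"] = sales["Quantity"] * sales["Price"]
    print(sales.groupby("Country")["Revenue"].sum().nlargest(10))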

  12. Synthetic E-Commerce Relational Datasets

    • kaggle.com
    Updated Aug 31, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Nael Aqel (2025). Synthetic E-Commerce Relational Datasets [Dataset]. https://www.kaggle.com/datasets/naelaqel/synthetic-e-commerce-relational-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nael Aqel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic E-Commerce Relational Dataset

    This dataset consists of synthetically generated data designed to simulate a realistic e-commerce environment; it does not describe real customers or transactions.

    Purpose

    To provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. Ideal for benchmarking, educational projects, and data engineering experiments.

    Entity Relationship Diagram (ERD) - Tables Overview

    1. Customers

    • customer_id (int): Unique identifier for each customer
    • name (string): Customer full name
    • email (string): Customer email address
    • gender (string): Customer gender ('Male', 'Female', 'Other')
    • signup_date (date): Date customer signed up
    • country (string): Customer country of residence

    2. Products

    • product_id (int): Unique identifier for each product
    • product_name (string): Name of the product
    • category (string): Product category (e.g., Electronics, Books)
    • price (float): Price per unit
    • stock_quantity (int): Available stock count
    • brand (string): Product brand name

    3. Orders

    • order_id (int): Unique identifier for each order
    • customer_id (int): ID of the customer who placed the order (foreign key to Customers)
    • order_date (date): Date when order was placed
    • total_amount (float): Total amount for the order
    • payment_method (string): Payment method used (Credit Card, PayPal, etc.)
    • shipping_country (string): Country where the order is shipped

    4. Order Items

    • order_item_id (int): Unique identifier for each order item
    • order_id (int): ID of the order this item belongs to (foreign key to Orders)
    • product_id (int): ID of the product ordered (foreign key to Products)
    • quantity (int): Number of units ordered
    • unit_price (float): Price per unit at order time

    5. Product Reviews

    • review_id (int): Unique identifier for each review
    • product_id (int): ID of the reviewed product (foreign key to Products)
    • customer_id (int): ID of the customer who wrote the review (foreign key to Customers)
    • rating (int): Rating score (1 to 5)
    • review_text (string): Text content of the review
    • review_date (date): Date the review was written

    Visual ERD

    (ERD image of the tables above.)

    Notes

    • All data is randomly generated using Python’s Faker library, so it does not reflect any real individuals or companies.
    • The data is provided in both CSV and Parquet formats.
    • The generator script is available in the accompanying GitHub repository for reproducibility and customization.

    Output

    The script saves two folders inside the specified output path:

    csv/    # CSV files
    parquet/  # Parquet files
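
    As one example of the practice queries this layout enables, here is a hedged DuckDB sketch joining the documented tables straight from the Parquet files (the per-table file names are assumptions):

    import duckdb  # pip install duckdb

    top_countries = duckdb.sql("""
        SELECT c.country, ROUND(SUM(oi.quantity * oi.unit_price), 2) AS revenue
        FROM 'parquet/order_items.parquet' AS oi
        JOIN 'parquet/orders.parquet'    AS o ON oi.order_id   = o.order_id
        JOIN 'parquet/customers.parquet' AS c ON o.customer_id = c.customer_id
        GROUP BY c.country
        ORDER BY revenue DESC
        LIMIT 5
    """).df()
    print(top_countries)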
    

    License

    MIT License


  13. UCI Mechanical Analysis Data Set

    • kaggle.com
    zip
    Updated Apr 16, 2022
    Cite
    Heitor Nunes (2022). UCI Mechanical Analysis Data Set [Dataset]. https://www.kaggle.com/datasets/heitornunes/mechanical-analysis
    Explore at:
    zip (120333 bytes)
    Dataset updated
    Apr 16, 2022
    Authors
    Heitor Nunes
    Description

    Context

    Please read the description file of the dataset. My work consisted of adjusting the data into a file format acceptable by Kaggle standards.

    Content

    1 - instance - instance indicator

    1 - component - component number (integer)

    2 - sup - support in the machine where measure was taken (1..4)

    3 - cpm - frequency of the measure (integer)

    4 - mis - measure (real)

    5 - misr - earlier measure (real)

    6 - dir - filter, type of the measure, and direction: vo = no filter, velocity, horizontal; va = no filter, velocity, axial; vv = no filter, velocity, vertical; ao = no filter, amplitude, horizontal; aa = no filter, amplitude, axial; av = no filter, amplitude, vertical; io = filter, velocity, horizontal; ia = filter, velocity, axial; iv = filter, velocity, vertical

    7 - omega - rpm of the machine (integer, the same for components of one example)

    8 - class - classification (1..6, the same for components of one example)

    9 - comb. class - combined faults

    10 - other class - other faults occurring

    Acknowledgements

    Data Source: https://archive.ics.uci.edu/ml/datasets/Mechanical+Analysis

  14. week3class1review

    • kaggle.com
    Updated Sep 2, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Kreiger (2021). week3class1review [Dataset]. https://www.kaggle.com/benkreiger/week3class1review/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ben Kreiger
    Description

    Robotics for All Data Science With Python Week 3 Class 1 Review Projects

  15. Cloud Carbon Emissions Dataset

    • kaggle.com
    zip
    Updated Sep 23, 2025
    Cite
    Nidhi Suryavanshi (2025). Cloud Carbon Emissions Dataset [Dataset]. https://www.kaggle.com/datasets/nidhis4444/cloud-carbon-emissions-dataset
    Explore at:
    zip (36611 bytes)
    Dataset updated
    Sep 23, 2025
    Authors
    Nidhi Suryavanshi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a synthetic simulation of cloud resource usage and carbon emissions, designed for experimentation, analysis, and forecasting in sustainability and data engineering projects.

    Included Tables:

    • projects: Metadata about projects/teams.
    • services: Metadata about cloud services (Compute, Storage, AI, etc.).
    • emission_factors: Regional grid carbon intensity (gCO₂ per kWh).
    • service_energy_coefficients: Conversion rates from usage units to kWh.
    • daily_usage: Raw service usage (per project × service × region × day).
    • daily_emissions: Carbon emissions derived from usage × regional emission factors (see the sketch after this list).
    • service_cost_coefficients: Conversion rates from usage units to cost (USD per unit).
    • daily_cost_emissions: Integrated fact table combining usage, energy, cost, and emissions for analysis.
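
    A minimal pandas sketch of that usage-to-emissions derivation; all file and column names here are assumptions, since the exact schema is not spelled out in this description.

    import pandas as pd

    usage = pd.read_csv("daily_usage.csv")                    # hypothetical file names
    coeff = pd.read_csv("service_energy_coefficients.csv")
    factors = pd.read_csv("emission_factors.csv")

    df = usage.merge(coeff, on="service_id").merge(factors, on="region")
    df["kwh"] = df["usage_units"] * df["kwh_per_unit"]        # usage units -> kWh
    df["emissions_gco2"] = df["kwh"] * df["gco2_per_kwh"]     # kWh x grid intensity
    print(df.groupby("region")["emissions_gco2"].sum())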

    Features:

    • Simulated seasonality (weekend dips/spikes, holiday surges, quarter-end growth).
    • Regional variations in carbon intensity (e.g., coal-heavy vs. renewable grids).
    • Multiple projects and services for multi-dimensional analysis.
    • Directly importable into BigQuery for analytics and forecasting.

    Use Cases:

    • Explore sustainability analytics at scale.
    • Build carbon footprint dashboards.
    • Run AI/ML forecasting on emissions data.
    • Practice SQL, data modeling, and visualization.

    ⚠️ Note: All data is synthetic and created for educational/demo purposes. It does not represent actual cloud provider emissions.

  16. Drug Labels & Side Effects Dataset | 1400+ Records

    • kaggle.com
    zip
    Updated Aug 2, 2025
    Cite
    Pratyush Puri (2025). Drug Labels & Side Effects Dataset | 1400+ Records [Dataset]. https://www.kaggle.com/datasets/pratyushpuri/drug-labels-and-side-effects-dataset-1400-records
    Explore at:
    zip (51886 bytes)
    Dataset updated
    Aug 2, 2025
    Authors
    Pratyush Puri
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Drug Labels and Side Effects Dataset

    Dataset Overview

    This synthetic pharmaceutical dataset contains 1,393 records of drug information with 15 columns, designed for data science projects focusing on healthcare analytics, drug safety analysis, and pharmaceutical research. It simulates real-world pharmaceutical data with appropriate variety and realistic constraints for machine learning applications.

    Dataset Specifications

    • Total Records: 1,393
    • Total Columns: 15
    • File Format: CSV
    • Data Types: Mixed (intentional, for data cleaning practice)
    • Domain: Pharmaceutical/Healthcare
    • Use Case: ML training, data analysis, healthcare research

    Column Specifications

    Categorical Features

    • drug_name (Object, 1,283 unique): Pharmaceutical drug names with realistic naming patterns, e.g., "Loxozepam32", "Amoxparin43", "Virazepam10".
    • manufacturer (Object, 10 unique): Major pharmaceutical companies, e.g., Pfizer Inc., AstraZeneca, Johnson & Johnson.
    • drug_class (Object, 10 unique): Therapeutic drug classifications, e.g., Antibiotic, Analgesic, Antidepressant, Vaccine.
    • indications (Object, 10 unique): Medical conditions the drug treats, e.g., "Pain relief", "Bacterial infections", "Depression treatment".
    • side_effects (Object, 434 unique): Combination of side effects (1-3 per drug), e.g., "Nausea, Dizziness", "Headache, Fatigue, Rash".
    • administration_route (Object, 7 unique): Method of drug delivery, e.g., Oral, Intravenous, Topical, Inhalation, Sublingual.
    • contraindications (Object, 10 unique): Medical warnings for drug usage, e.g., "Pregnancy", "Heart disease", "Liver disease".
    • warnings (Object, 10 unique): Safety instructions and precautions, e.g., "Take with food", "Avoid alcohol", "Monitor blood pressure".
    • batch_number (Object, 1,393 unique): Manufacturing batch identifiers, e.g., "xr691zv", "Ye266vU", "Rm082yX".
    • expiry_date (Object, 782 unique): Drug expiration dates (YYYY-MM-DD), e.g., "2025-12-13", "2027-03-09", "2026-10-06".
    • side_effect_severity (Object, 3 unique): Severity classification: Mild, Moderate, Severe.
    • approval_status (Object, 3 unique): Regulatory approval status: Approved, Pending, Rejected.

    Numerical Features

    • approval_year (Float/String*): 1990-2024; mean 2006.7, std 10.0. FDA/regulatory approval year.
    • dosage_mg (Float/String*): 10-990 mg; mean 499.7, std 290.0. Medication strength in milligrams.
    • price_usd (Float/String*): $2.32-$499.24; mean $251.12, std $144.81. Drug price in US dollars.

    *Intentionally stored as mixed types for data cleaning practice
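
    Since these three columns arrive with mixed types, a first cleaning pass might coerce them to numerics; a minimal sketch, with the file name assumed:

    import pandas as pd

    drugs = pd.read_csv("drug_labels_side_effects.csv")  # hypothetical file name
    for col in ["approval_year", "dosage_mg", "price_usd"]:
        # Strip "$" and "," then coerce; unparseable entries become NaN for review.
        drugs[col] = pd.to_numeric(
            drugs[col].astype(str).str.replace(r"[$,]", "", regex=True),
            errors="coerce",
        )
    print(drugs[["approval_year", "dosage_mg", "price_usd"]].describe())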

    Key Statistics

    Manufacturer Distribution

    • Pfizer Inc.: 170 (12.2%)
    • AstraZeneca: ~140 (~10.0%)
    • Merck & Co.: ~140 (~10.0%)
    • Johnson & Johnson: ~140 (~10.0%)
    • GlaxoSmithKline: ~140 (~10.0%)
    • Others: ~623 (~44.8%)

    Drug Class Distribution

    • Anti-inflammatory: 154 (most common)
    • Antibiotic: ~140
    • Antidepressant: ~140
    • Antiviral: ~140
    • Vaccine: ~140
    • Others: ~679

    Side Effect Severity

    • Severe: 488 (35.0%)
    • Moderate: ~453 (~32.5%)
    • Mild: ~452 (~32.5%)

    Potential Use Cases

    1. Machine Learning Applications

    • Drug Approval Prediction: Predict approval likelihood based on drug characteristics
    • Price Prediction: Estimate drug pricing using features like class, manufacturer, dosage
    • Side Effect Classification: Classify severity based on drug properties
    • Market Success Analysis: Analyze factors contributing to drug market performance

    2. Data Engineering Projects

    • ETL Pipeline Development: Practice data cleaning and transformation
    • Data Quality Assessment: Implement data validation and quality checks
    • Database Design: Create normalized pharmaceutical database schema
    • Real-time Processing: Stream processing for drug monitoring systems

    3. Business Intelligence

    • Pharmaceutical Market Analysis: Manufacturer market share and competitive analysis
    • Drug Safety Analytics: Side effect patterns and safety profile analysis
    • Regulatory Compliance: Approval trends and regulatory timeline analysis
    • Pricing Strategy: Competitive pricing analysis across drug classes

    Recommended Next Steps

    1. Data Cleaning Pipeline: Implement comprehe...
  17. Tunnel Risk Dataset

    • kaggle.com
    zip
    Updated Dec 17, 2024
    Cite
    Ziya (2024). Tunnel Risk Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/tunnel-risk-dataset
    Explore at:
    zip (12738 bytes)
    Dataset updated
    Dec 17, 2024
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed to facilitate the development of deep learning-based models for real-time risk assessment in tunnel engineering projects. The data contains critical engineering parameters, geotechnical properties, and sensor-based monitoring data collected or simulated under various tunneling conditions. Each record corresponds to specific tunneling conditions and is labeled with a risk level to indicate the likelihood of structural failure or hazardous events.

    Dataset Content The dataset contains 1000 samples (modifiable based on requirements), with each row representing a unique tunneling scenario. The key features include:

    1. Tunnel Parameters
       • Tunnel_ID: Unique identifier for each tunnel record.
       • Length (m): Length of the tunnel section (in meters).
       • Depth (m): Depth at which the tunnel is located (in meters).
    2. Geotechnical and Environmental Features
       • Rock_Type: Type of geological material surrounding the tunnel (e.g., Clay, Sandstone, Mixed, Shale).
       • Water_Level: Groundwater level conditions, categorized as Low, Medium, or High.
    3. Monitoring Data
       • Displacement (mm): Real-time tunnel deformation or displacement, measured in millimeters.
       • Settlement (mm): Vertical surface settlement above the tunnel, measured in millimeters.
    4. Risk Level (Target Variable): each record is labeled with a risk assessment for its tunneling condition:
       • 0 = Low Risk (safe conditions)
       • 1 = Medium Risk (moderate risk, monitoring required)
       • 2 = High Risk (high likelihood of structural stress)
       • 3 = Critical Risk (failure scenario or hazardous condition)

    The risk levels are assigned based on threshold values for tunnel displacement and settlement, which are essential indicators of tunnel stability.
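
    A hedged sketch of that threshold-based labeling on the two documented indicators; the cutoff values and the file name are assumptions, as the dataset's actual thresholds are not given here.

    import pandas as pd

    tunnels = pd.read_csv("tunnel_risk.csv")  # hypothetical file name

    def risk_level(row):
        disp, sett = row["Displacement (mm)"], row["Settlement (mm)"]
        if disp > 50 or sett > 40:   # assumed critical cutoffs
            return 3
        if disp > 30 or sett > 25:   # assumed high-risk cutoffs
            return 2
        if disp > 15 or sett > 10:   # assumed medium-risk cutoffs
            return 1
        return 0

    tunnels["Risk_Level"] = tunnels.apply(risk_level, axis=1)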

  18. Voice Search AI Conversational Queries 2025

    • kaggle.com
    zip
    Updated Jul 30, 2025
    Cite
    Pratyush Puri (2025). Voice Search AI Conversational Queries 2025 [Dataset]. https://www.kaggle.com/datasets/pratyushpuri/voice-search-1500-conversational-queries-2025/data
    Explore at:
    zip (72384 bytes)
    Dataset updated
    Jul 30, 2025
    Authors
    Pratyush Puri
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Voice Search Query Captures Dataset

    Overview

    This synthetically created dataset contains 1,555 conversational voice search queries captured across multiple devices, languages, and user intents. The dataset simulates realistic voice command interactions for machine learning and analytics projects in the conversational AI domain.

    Dataset Specifications

    • Total Rows: 1,555
    • Total Columns: 13
    • File Format: CSV
    • Data Types: Mixed (String, Integer, Float, DateTime, Boolean)
    • Null Values: Strategically distributed (~5-10% across select columns)

    Column Details

    • query_id (Integer): Unique identifier for each voice search query. Example: 1, 2, 3, ..., 1555. Nulls: 0%.
    • user_id (String, UUID): Unique user identifier. Example: bdd640fb-0667-4ad1-9c80-317fa3b1799d. Nulls: 0%.
    • timestamp (DateTime): When the query was made. Example: 2025-04-17 19:27:32. Nulls: 0%.
    • device_type (String): Device used for voice search. Example: smartphone, smart speaker, smartwatch, tablet, car assistant. Nulls: 0%.
    • query_text (String): The actual voice search text. Example: "What's the weather like today?", "Call Mom". Nulls: 0%.
    • language (String): Language of the query. Example: English, Spanish, Mandarin, Hindi, French. Nulls: 0%.
    • intent (String): Query category/purpose. Example: information, navigation, command, entertainment, shopping. Nulls: 0%.
    • location (String): User's geographical location. Example: New York, Los Angeles, London, Delhi, Shanghai, Paris, Tokyo. Nulls: 0%.
    • query_duration_sec (Float): Duration of the voice query in seconds (1.05 to 12.71). Nulls: 0%.
    • num_words (Float*): Number of words in the query (2.0 to 7.0). Nulls: 0%.
    • is_successful (Object): Whether the query returned results. Values: True, False, None. Nulls: ~15%.
    • confidence_score (String*): Speech recognition confidence (0.5-1.0). Example: "0.87", "0.61", "1.0". Nulls: 0%.
    • device_os_version (String): Operating system version. Example: iOS 14, iOS 15, Android 10, Android 11, None. Nulls: ~20%.

    Intent Categories Distribution

    • Information: Knowledge/fact-seeking queries, e.g., "How tall is the Eiffel Tower?", "What's the weather like today?"
    • Navigation: Location/direction requests, e.g., "Directions to nearest gas station", "Find nearest coffee shop"
    • Command: Device/app control instructions, e.g., "Set an alarm for 7 AM", "Turn off the lights", "Call Mom"
    • Entertainment: Media/content requests, e.g., "Play latest movie trailers", "Show me comedy shows"
    • Shopping: Purchase/commerce-related queries, e.g., "Order me a pizza", "Buy new headphones", "Track my Amazon order"

    Device Distribution

    • Smartphone: Mobile, on-the-go queries
    • Smart Speaker: Home-based voice commands
    • Smartwatch: Quick, hands-free interactions
    • Tablet: Casual browsing and queries
    • Car Assistant: In-vehicle voice commands

    Language & Location Coverage

    • English: New York, Los Angeles, London (global communication)
    • Spanish: Los Angeles, New York (Hispanic markets)
    • Mandarin: Shanghai, global cities (Chinese user base)
    • Hindi: Delhi, global cities (Indian diaspora)
    • French: Paris, global cities (European markets)

    Data Quality Features

    Realistic Patterns

    • Query Duration: Normal distribution around 5 seconds (1-12 sec range)
    • Word Count: Aligned with actual query complexity (2-7 words)
    • Intent Matching: Query text semantically matches intent categories
    • Temporal Distribution: Queries spread across 2025 timeframe

    Data Challenges (Intentional)

    • Mixed Data Types: num_words stored as float instead of int
    • String Numerics: confidence_score stored as string instead of float
    • Strategic Nulls: Missing values in is_successful and device_os_version (a cleanup sketch follows this list)
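
    A minimal pandas sketch addressing these three intentional issues (the file name is an assumption; the documented column names are used as-is):

    import pandas as pd

    queries = pd.read_csv("voice_search_queries_2025.csv")  # hypothetical file name

    queries["num_words"] = queries["num_words"].astype("Int64")      # float -> nullable int
    queries["confidence_score"] = pd.to_numeric(
        queries["confidence_score"], errors="coerce")                # string -> float
    queries["is_successful"] = queries["is_successful"].map(
        {"True": True, "False": False})                              # None stays missing
    print(queries.isna().mean().round(3))                            # null share per column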

    Use Cases

    Analytics Applications

    • Voice search trend analysis
    • Device usage pattern identification
    • Multi-language query processing
    • Intent classification modeling
    • User behavior segmentation

    Machine Learning Projects

    • Classification: Intent prediction from query text
    • NLP: Multi-language text analysis
    • Time Series: Usage pattern analysis
    • Clustering: User behavior grouping
    • Recommendation: Query suggestion systems

    Data Engineering Practice

    • Data cleaning and type conversion
    • Handling missing values
    • Multi-language data processing
    • Real-time analytics pipeline development

    Technical Notes

    • Encoding: UTF-8 for multi-language support
    • Timestamp Format: YYYY-MM-DD HH:MM:SS
    • UUID Format: Standard UUID4 format
    • Geographic ...
  19. SAP FI Anomaly Detection - Prepared Data & Models

    • kaggle.com
    zip
    Updated Apr 30, 2025
    Cite
    aidsmlProjects (2025). SAP FI Anomaly Detection - Prepared Data & Models [Dataset]. https://www.kaggle.com/datasets/aidsmlprojects/sap-fi-anomaly-detection-prepared-data-and-models
    Explore at:
    zip (9285 bytes)
    Dataset updated
    Apr 30, 2025
    Authors
    aidsmlProjects
    Description

    Intelligent SAP Financial Integrity Monitor

    Project Status: Proof-of-Concept (POC) - Capstone Project

    Overview

    This project demonstrates a proof-of-concept system for detecting financial document anomalies within core SAP FI/CO data, specifically leveraging the New General Ledger table (FAGLFLEXA) and document headers (BKPF). It addresses the challenge that standard SAP reporting and rule-based checks often struggle to identify subtle, complex, or novel irregularities in high-volume financial postings.

    The solution employs a Hybrid Anomaly Detection strategy, combining unsupervised Machine Learning models with expert-defined SAP business rules. Findings are prioritized using a multi-faceted scoring system and presented via an interactive dashboard built with Streamlit for efficient investigation.

    This project was developed as a capstone, showcasing the application of AI/ML techniques to enhance financial controls within an SAP context, bridging deep SAP domain knowledge with modern data science practices.

    Author: Anitha R (https://www.linkedin.com/in/anithaswamy)

    Dataset Origin: Kaggle SAP Dataset by Sunitha Siva. License: Other (specified in description); no description available.

    Motivation

    Financial integrity is critical. Undetected anomalies in SAP FI/CO postings can lead to:

    • Inaccurate financial reporting
    • Significant reconciliation efforts
    • Potential audit failures or compliance issues
    • Masking of operational errors or fraud

    Standard SAP tools may not catch all types of anomalies, especially complex or novel patterns. This project explores how AI/ML can augment traditional methods to provide more robust and efficient financial monitoring.

    Key Features

    • Data Cleansing & Preparation: Rigorous process to handle common SAP data extract issues (duplicates, financial imbalance), prioritizing FAGLFLEXA for reliability.
    • Exploratory Data Analysis (EDA): Uncovered baseline patterns in posting times, user activity, amounts, and process context.
    • Feature Engineering: Created 16 context-aware features (FE_...) to quantify potential deviations from normalcy based on EDA and SAP knowledge.
    • Hybrid Anomaly Detection:
      • Ensemble ML: Utilized unsupervised models: Isolation Forest (IF), Local Outlier Factor (LOF) (via Scikit-learn), and an Autoencoder (AE) (via TensorFlow/Keras). A minimal sketch follows this feature list.
      • Expert Rules (HRFs): Implemented highly customizable High-Risk Flags based on percentile thresholds and SAP logic (e.g., weekend posting, missing cost center).
    • Multi-Faceted Prioritization: Combined ML model consensus (Model_Anomaly_Count) and HRF counts (HRF_Count) into a Priority_Tier for focusing investigation efforts.
    • Contextual Anomaly Reason: Generated a Review_Focus text description summarizing why an item was flagged.
    • Interactive Dashboard (Streamlit):
      • File upload for anomaly/feature data.
      • Overview KPIs (including multi-currency "Value at Risk by CoCode").
      • Comprehensive filtering capabilities.
      • Dynamic visualizations (User/Doc Type/HRF frequency, Time Trends).
      • Interactive AgGrid table for anomaly list investigation.
      • Detailed drill-down view for selected anomalies.
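
    A minimal sketch of the ensemble step above, using only the Scikit-learn pieces named in this description (sap_engineered_features.csv and the FE_ prefix come from Phase 2 below; the contamination settings are assumptions, and the Autoencoder is omitted for brevity):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor

    # Engineered features from Phase 2; columns are prefixed FE_.
    df = pd.read_csv("sap_engineered_features.csv")
    X = StandardScaler().fit_transform(df.filter(like="FE_"))

    # Each model flags the items it considers anomalous (prediction of -1).
    if_flags = IsolationForest(contamination=0.01, random_state=42).fit_predict(X) == -1
    lof_flags = LocalOutlierFactor(n_neighbors=20, contamination=0.01).fit_predict(X) == -1

    # Model consensus feeds the Priority_Tier logic described above.
    df["Model_Anomaly_Count"] = if_flags.astype(int) + lof_flags.astype(int)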

    Methodology Overview

    The project followed a structured approach:

    1. Phase 1: Data Quality Assessment & Preparation: Cleaned and validated raw BKPF and FAGLFLEXA data extracts. Discarded BSEG due to imbalances. Removed duplicates.
    2. Phase 2: Exploratory Data Analysis & Feature Engineering: Analyzed cleaned data patterns and engineered 16 features quantifying anomaly indicators. Resulted in sap_engineered_features.csv.
    3. Phase 3: Baseline Anomaly Detection & Evaluation: Scaled features, applied IF and LOF models, evaluated initial results.
    4. Phase 4: Advanced Modeling & Prioritization: Trained Autoencoder model, combined all model outputs and HRFs, implemented prioritization logic, generated context, and created the final anomaly list.
    5. Phase 5: UI Development: Built the Streamlit dashboard for interactive analysis and investigation.

    (For detailed methodology, please refer to the Comprehensive_Project_Report.pdf in the /docs folder - if you include it).

    Technology Stack

    • Core Language: Python 3.x
    • Data Manipulation & Analysis: Pandas, NumPy
    • Machine Learning: Scikit-learn (IsolationForest, LocalOutlierFactor, StandardScaler), TensorFlow/Keras (Autoencoder)
    • Visualization: Matplotlib, Seaborn, Plotly Express
    • Dashboard: Streamlit, streamlit-aggrid
    • Utilities: Joblib (for saving scaler)

    Libraries:

    # Model/Scaler Saving
    joblib==1.4.2

    # Data I/O Efficiency (optional, but good practice if used)
    pyarrow==19.0.1

    # Machine L...

  20. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Explore at:
    zip (7548686 bytes)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.
