21 datasets found
  1. Data_Sheet_1_ExGUtils: A Python Package for Statistical Analysis With the...

    • frontiersin.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Carmen Moret-Tatay; Daniel Gamermann; Esperanza Navarro-Pardo; Pedro Fernández de Córdoba Castellá (2023). Data_Sheet_1_ExGUtils: A Python Package for Statistical Analysis With the ex-Gaussian Probability Density.zip [Dataset]. http://doi.org/10.3389/fpsyg.2018.00612.s001
    Explore at:
    zip (available download format)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Carmen Moret-Tatay; Daniel Gamermann; Esperanza Navarro-Pardo; Pedro Fernández de Córdoba Castellá
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study of reaction times and their underlying cognitive processes is an important field in Psychology. Reaction times are often modeled with the ex-Gaussian distribution because it provides a good fit to many empirical datasets. The complexity of this distribution makes computational tools essential, so there is a strong need for efficient and versatile software in this area of research. In this manuscript we discuss some mathematical details of the ex-Gaussian distribution and apply the ExGUtils package, a set of functions and numerical tools programmed in Python for the numerical analysis of data involving the ex-Gaussian probability density. To validate the package, we present an extensive analysis of fits obtained with it, discuss the advantages of and differences between the least squares and maximum likelihood methods, and quantitatively evaluate the goodness of the obtained fits (a point that is usually overlooked in the literature). The analysis allows one to identify outliers in the empirical datasets and to determine, on clear criteria, whether data trimming is needed and at which points it should be applied.
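
    A minimal sketch (using SciPy rather than the ExGUtils API itself) of a maximum-likelihood ex-Gaussian fit and a simple goodness-of-fit check; scipy.stats.exponnorm parametrises the distribution with shape K = tau / sigma, and the data below are simulated:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Simulated reaction times: Gaussian component (mu, sigma) plus exponential tail (tau)
    mu, sigma, tau = 450.0, 60.0, 150.0
    rt = rng.normal(mu, sigma, 2000) + rng.exponential(tau, 2000)

    # Maximum-likelihood fit; exponnorm.fit returns (K, loc, scale) = (tau/sigma, mu, sigma)
    K, loc, scale = stats.exponnorm.fit(rt)
    print(f"mu ~ {loc:.1f}, sigma ~ {scale:.1f}, tau ~ {K * scale:.1f}")

    # A simple goodness-of-fit check via a Kolmogorov-Smirnov test against the fitted density
    ks_stat, p_value = stats.kstest(rt, "exponnorm", args=(K, loc, scale))
    print(f"KS statistic = {ks_stat:.3f}, p = {p_value:.3f}")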

  2. Engine Ratng Prediction

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Ved Prakash (2023). Engine Ratng Prediction [Dataset]. https://www.kaggle.com/datasets/ved1104/engine-ratng-prediction
    Explore at:
    zip (3540393 bytes)
    Dataset updated
    Feb 28, 2023
    Authors
    Ved Prakash
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Your task is to write a small Python or R script that predicts the engine rating based on the inspection parameters using only the provided dataset. You need to find all the cases/outliers where the rating has been given incorrectly as compared to the current condition of the engine.

    This task is designed to test your Python or R ability, your knowledge of data science techniques, your ability to find trends and outliers and to assess the relative importance of variables with respect to deviations in the target variable, and your ability to work effectively, efficiently, and independently within a commercial setting.

    This task is also designed to test your hyperparameter-tuning abilities and lateral thinking. Deliverables:

    • One Python or R script
    • One requirements text file including an exhaustive list of packages and version numbers used in your solution
    • A summary of your insights
    • A list of cases that are outliers/incorrectly rated as high or low, backed with analysis/reasons
    • Model object files for reproducibility

    Your solution should, at a minimum, do the following:

    • Load the data into memory
    • Prepare the data for modeling
    • EDA of the variables
    • Build a model on training data
    • Test the model on testing data
    • Provide some measure of performance
    • Outlier analysis and detection
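
    A minimal sketch of one possible approach (file and column names are assumptions, not part of the provided description): fit a regressor on the inspection parameters and flag engines whose given rating deviates strongly from the model's prediction.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("engine_data.csv")                         # hypothetical file name
    y = df["engine_rating"]                                     # hypothetical target column
    X = pd.get_dummies(df.drop(columns=["engine_rating"]))      # naive encoding of categoricals

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
    print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))

    # Flag potentially mis-rated engines: large gap between given and predicted rating
    residuals = y - model.predict(X)
    outliers = df[residuals.abs() > 3 * residuals.std()]
    print(f"{len(outliers)} suspected incorrectly rated engines")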

  3. Medical Clean Dataset

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
    Explore at:
    zip (1262 bytes)
    Dataset updated
    Jul 6, 2025
    Authors
    Aamir Shahzad
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

    • Handling missing values using statistical techniques such as median imputation and mode replacement
    • Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)
    • Removing duplicate entries to ensure data accuracy
    • Parsing and standardizing date fields
    • Creating new derived features such as age groups
    • Detecting and reviewing outliers based on IQR
    • Removing irrelevant or redundant columns
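
    A minimal pandas sketch of the kind of steps listed above; the file and column names are illustrative assumptions, not the actual dataset schema:

    import pandas as pd

    df = pd.read_csv("medical_raw.csv")                         # hypothetical input file
    df["age"] = df["age"].fillna(df["age"].median())            # median imputation
    df["gender"] = df["gender"].str.strip().str.lower()         # standardise categorical formatting
    df = df.drop_duplicates()
    df["admission_date"] = pd.to_datetime(df["admission_date"], errors="coerce")
    df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 65, 120],
                             labels=["child", "young adult", "adult", "senior"])

    # IQR-based outlier review for a numeric column
    q1, q3 = df["blood_pressure"].quantile([0.25, 0.75])
    iqr = q3 - q1
    inside = df["blood_pressure"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    print("Rows flagged as outliers:", (~inside).sum())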

    The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

    This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.

  4. Bathymetry of the Main Pool of Lake Calumet, Cook County, Illinois, July...

    • gimi9.com
    Updated Jul 18, 2023
    + more versions
    Cite
    (2023). Bathymetry of the Main Pool of Lake Calumet, Cook County, Illinois, July 2023 | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_bathymetry-of-the-main-pool-of-lake-calumet-cook-county-illinois-july-2023/
    Explore at:
    Dataset updated
    Jul 18, 2023
    Area covered
    Lake Calumet, Cook County, Illinois
    Description

    These data are single-beam bathymetry points compiled in comma separated values (CSV) file format, generated from a hydrographic survey of the northern portion of Lake Calumet in Cook County, Illinois. Hydrographic data were collected July 18-19, 2023, using a single-beam echosounder (SBES) integrated with a Global Navigation Satellite System (GNSS) mounted on a marine survey vessel. Surface water elevation data were collected July 18 utilizing a single-base real-time kinematic (RTK)/GNSS unit. Bathymetric data points were collected as the vessel traversed the northern portions of the lake along overlapping survey lines. The SBES internally collected and stored the depth data from the echosounder and the horizontal and vertical position data of the vessel from the GNSS in real time. Data processing required specialized computer software to export bathymetry data from the raw data files. A Python script was written to calculate the lakebed elevations and identify outliers in the dataset. These data are provided in comma separated values (CSV) format as LakeCalumet_SBES_20230718.csv. Data points are stored as a series of x (longitude), y (latitude), and z (elevation or depth) points along with variable length records specific to the data transects.
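
    A minimal sketch of the kind of post-processing described (computing lakebed elevations from depths and screening for statistical outliers); the column names and the water-surface elevation are assumptions, not values from this record:

    import pandas as pd

    points = pd.read_csv("LakeCalumet_SBES_20230718.csv")       # columns assumed: lon, lat, depth
    water_surface_elev = 176.0                                  # placeholder RTK/GNSS surface elevation (m)

    points["lakebed_elev"] = water_surface_elev - points["depth"]

    # Simple z-score screen for suspect soundings
    z = (points["lakebed_elev"] - points["lakebed_elev"].mean()) / points["lakebed_elev"].std()
    print(f"{(z.abs() > 3).sum()} points flagged for review")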

  5. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    zip (available download format)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
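
    As an illustration of the Mahalanobis step (not necessarily identical to ESM_6.py), a minimal sketch computing distances from a firms x ratios matrix of Z-scores such as ESM_3.xlsx:

    import numpy as np
    import pandas as pd

    z = pd.read_excel("ESM_3.xlsx", index_col=0)                 # firms x financial-ratio Z-scores
    centered = z.values - z.values.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(z.values, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)   # squared Mahalanobis distances
    distances = pd.Series(np.sqrt(d2), index=z.index, name="mahalanobis")
    print(distances.sort_values(ascending=False).head())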

  6. Metabolomics Data Preprocessing PQN PCA

    • kaggle.com
    zip
    Updated Nov 29, 2025
    Cite
    Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
    Explore at:
    zip (22763 bytes)
    Dataset updated
    Nov 29, 2025
    Authors
    Dr. Nagendra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

    The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

    Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

    Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

    Includes data visualization techniques to interpret PCA results effectively.

    Suitable for metabolomics researchers and data scientists working on omics data.

    Enables better reproducibility of preprocessing workflows for metabolomics studies.

    Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

    Provides a Python-based notebook that is easy to adapt to new datasets.

    Includes example datasets and code snippets for immediate application.

    Helps users understand the impact of normalization on downstream statistical analyses.

    Supports integration with other metabolomics pipelines or machine learning workflows.
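
    A minimal sketch of PQN followed by PCA, assuming a samples x metabolites intensity matrix with no missing values (the file name is illustrative, not part of this record):

    import pandas as pd
    from sklearn.decomposition import PCA

    X = pd.read_csv("metabolite_intensities.csv", index_col=0)   # samples x metabolites

    # Probabilistic Quotient Normalization: divide each sample by the median of its
    # feature-wise quotients against a reference (median) spectrum
    reference = X.median(axis=0)
    quotients = X.div(reference, axis=1)
    dilution = quotients.median(axis=1)
    X_pqn = X.div(dilution, axis=0)

    # PCA on autoscaled data for exploratory analysis and outlier screening
    X_scaled = (X_pqn - X_pqn.mean()) / X_pqn.std()
    scores = PCA(n_components=2).fit_transform(X_scaled)
    print(scores[:5])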

  7. Lipidomics LC-MS analysis support tools for outlier detection

    • data.niaid.nih.gov
    Updated Mar 28, 2024
    Cite
    Spick, Matt (2024). Lipidomics LC-MS analysis support tools for outlier detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10889320
    Explore at:
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    University of Surrey
    Authors
    Spick, Matt
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC-MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in bioinformatics work and highlights the importance of data-driven outlier detection in assessing spectral outputs – demonstrated here using a machine learning approach based on support vector machine regression combined with leave-one-out cross-validation – as well as manual curation, in order to identify software-driven errors caused by closely related lipids and by co-elution issues.
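
    A minimal sketch of the general idea (support vector machine regression with leave-one-out cross-validation used to flag observations that the rest of the data cannot predict); the actual notebook in this record may differ in detail, and the data below are synthetic:

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import LeaveOneOut

    rng = np.random.default_rng(0)
    # Synthetic stand-in: predict one lipid's intensities from five related lipids
    X = rng.random((40, 5))
    y = X @ np.array([1.0, 0.5, 0.2, 0.0, 0.0]) + 0.05 * rng.standard_normal(40)

    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
        errors.append(abs(y[test_idx][0] - model.predict(X[test_idx])[0]))

    errors = np.array(errors)
    flagged = np.where(errors > errors.mean() + 3 * errors.std())[0]
    print("Observations flagged as potential outliers:", flagged)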

    The lipidomics case study dataset used in this work is a lipid extraction of a human pancreatic adenocarcinoma cell line (PANC-1, Merck, UK, cat. no. 87092802) analysed using an Acquity M-Class UPLC system (Waters, UK) coupled to a ZenoToF 7600 mass spectrometer (Sciex, UK). Raw output files are included alongside data processed with MS-DIAL (v4.9.221218) and Lipostar (v2.1.4), together with a Jupyter notebook containing Python code to analyse the outputs for outlier detection.

  8. Additional file 2 - datasets and scripts for metabolome analysis

    • figshare.com
    xlsx
    Updated Apr 29, 2024
    Cite
    Roberta Ruggeri; Giuseppe Bee; Paolo Trevisi; Catherine Ollagnier; Federico Correa (2024). Additional file 2 - datasets and scripts for metabolome analysis [Dataset]. http://doi.org/10.6084/m9.figshare.25684509.v1
    Explore at:
    xlsx (available download format)
    Dataset updated
    Apr 29, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Roberta Ruggeri; Giuseppe Bee; Paolo Trevisi; Catherine Ollagnier; Federico Correa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For the metabolome data, all calculations and statistical analyses were performed using Python. The Shapiro-Wilk test was performed to identify the metabolites whose concentrations in the blood showed a normal distribution, and Student's t-test was used to compare their concentrations in blood samples for the IUGR and NORM groups. Metabolites whose concentrations did not show a normal distribution were compared between the two groups using the non-parametric Mann–Whitney test. The Benjamini–Hochberg correction was applied in both cases to account for the type I risk inflation associated with multiple comparisons. Before being subjected to unsupervised and supervised algorithms, the concentration of each metabolite was normalised and centred. Principal component analysis (PCA) and orthogonal projection to latent structures-discriminant analysis (OPLS-DA) were employed as the unsupervised and supervised methods in the multivariate analysis, respectively. PCA was used for the identification of outliers (Mahalanobis distance metric) as well as the spontaneous clustering of similar samples in the scatter plot of the two principal components. In the OPLS-DA analysis, the X matrix consisted of metabolite concentrations, while the Y vector contained information regarding the group (IUGR or NORM). The goodness of fit of the OPLS-DA model (R2Y) was reported, and predictive performance was assessed through cross-validation. Metrics such as the predictive ability of the model (Q2Y) and the predictive ability of permuted models (Q2Y-perm) were calculated for evaluation. OPLS-DA loading plots were used to illustrate the metabolites that contributed the most to the separation between the IUGR and NORM groups. The identification of metabolites of interest was made through the combination of the variable importance in the projection (VIP) and the loading between the metabolite in the X matrix and the predictive latent variable (pLV) of the model. Metabolites with VIP > 1.0 and high absolute loading values were considered important in the metabolomics signature (De la Barca et al., 2022).

    References:
    Chao de la Barca JM, Chabrun F, Lefebvre T, Roche O, Huetz N, Blanchet O, Legendre G, Simard G, Reynier P, Gascoin G: A Metabolomic Profiling of Intra-Uterine Growth Restriction in Placenta and Cord Blood Points to an Impairment of Lipid and Energetic Metabolism. Biomedicines 2022, 10:1411.
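
    A minimal sketch of the univariate part of this workflow (Shapiro-Wilk, then t-test or Mann-Whitney per metabolite, with Benjamini-Hochberg correction); the file and column names are illustrative assumptions:

    import pandas as pd
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    df = pd.read_csv("metabolome.csv")            # hypothetical: a 'group' column plus metabolite columns
    metabolites = [c for c in df.columns if c != "group"]
    iugr = df[df["group"] == "IUGR"]
    norm = df[df["group"] == "NORM"]

    pvals = []
    for m in metabolites:
        gaussian = stats.shapiro(iugr[m]).pvalue > 0.05 and stats.shapiro(norm[m]).pvalue > 0.05
        if gaussian:
            pvals.append(stats.ttest_ind(iugr[m], norm[m]).pvalue)       # Student's t-test
        else:
            pvals.append(stats.mannwhitneyu(iugr[m], norm[m]).pvalue)    # Mann-Whitney U test

    # Benjamini-Hochberg correction for multiple comparisons
    reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
    results = pd.DataFrame({"metabolite": metabolites, "p_adj": p_adj, "significant": reject})
    print(results.sort_values("p_adj").head())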

  9. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip (65751 bytes)
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract
    This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame.
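
    A minimal sketch of such a reusable loader (the connection string and credentials are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string; replace credentials with your own
    engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/classicmodels")

    def load_table(table_name: str) -> pd.DataFrame:
        """Read an entire PostgreSQL table into a Pandas DataFrame."""
        return pd.read_sql_table(table_name, con=engine)

    orders = load_table("orders")
    order_details = load_table("orderdetails")
    customers = load_table("customers")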

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results
    - Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    - Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    - Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions
    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used
    - Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    - Database: PostgreSQL
    - Tools: Jupyter Notebook
    - Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  10. High-high cluster and high-low outlier road intersections for road traffic...

    • zivahub.uct.ac.za
    docx
    Updated Jun 6, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). High-high cluster and high-low outlier road intersections for road traffic crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25966402.v1
    Explore at:
    docx (available download format)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset offers a detailed inventory of road intersections and their corresponding suburbs within Cape Town, meticulously curated to highlight instances of high crash counts observed in "high-high" cluster and "high-low" outlier fishnet grid cells across the years 2017, 2018, 2019, and 2021. To enhance its utility, the dataset meticulously colour-codes each month associated with elevated crash occurrences, providing a nuanced perspective. Furthermore, the dataset categorises road intersections based on their placement within "high-high" clusters (marked with pink tabs) or "high-low" outlier cells (indicated by red tabs). For ease of navigation, the intersections are further organised alphabetically by suburb name, ensuring accessibility and clarity.

    Data Specifics
    Data Type: Geospatial-temporal categorical data with numeric attributes
    File Format: Word document (.docx)
    Size: 602 KB
    Number of Files: The dataset contains a total of 625 road intersection records (606 "high-high" cluster and 19 "high-low" outliers)
    Date Created: 21st May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Open Refine, Python, SQL
    Processing Steps: The raw road traffic crash data underwent a comprehensive refining process using Python software. Following this, duplicate crash records were eliminated to retain only one entry per crash. Subsequently, the data underwent further refinement with Open Refine software, focusing specifically on isolating unique crash descriptions for subsequent geocoding in ArcGIS Pro. Notably, during this process, only the road intersection crashes were retained, as they were the only crashes that were able to be spatially defined. Once geocoded, the road traffic crash data underwent rigorous spatio-temporal analyses, encompassing spatial autocorrelation, hotspot analysis, and cluster and outlier analysis. Leveraging these methods, road intersections identified as either "high-high" clusters or "high-low" outliers were extracted for inclusion in the dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2021 (2020 data omitted)

  11. High-high cluster and high-low outlier road intersections for road traffic...

    • zivahub.uct.ac.za
    docx
    Updated Jun 6, 2024
    + more versions
    Cite
    Simone Vieira; Simon Hull; Roger Behrens (2024). High-high cluster and high-low outlier road intersections for road traffic crashes involving severely injured pedestrians within the CoCT in 2017, 2018 and 2019 [Dataset]. http://doi.org/10.25375/uct.25974964.v1
    Explore at:
    docx (available download format)
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    University of Cape Town
    Authors
    Simone Vieira; Simon Hull; Roger Behrens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    City of Cape Town
    Description

    This dataset offers a detailed inventory of road intersections and their corresponding suburbs within Cape Town, meticulously curated to highlight instances of high pedestrian crash counts resulting in serious injuries observed in "high-high" cluster and "high-low" outlier fishnet grid cells across the years 2017, 2018 and 2019. To enhance its utility, the dataset meticulously colour-codes each month associated with elevated crash occurrences, providing a nuanced perspective. Furthermore, the dataset categorises road intersections based on their placement within "high-high" clusters (marked with pink tabs) or "high-low" outlier cells (indicated by red tabs). For ease of navigation, the intersections are further organised alphabetically by suburb name, ensuring accessibility and clarity.

    Data Specifics
    Data Type: Geospatial-temporal categorical data with numeric attributes
    File Format: Word document (.docx)
    Size: 231 KB
    Number of Files: The dataset contains a total of 245 road intersection records (7 "high-high" clusters and 238 "high-low" outliers)
    Date Created: 21st May 2024

    Methodology
    Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
    Software: ArcGIS Pro, Open Refine, Python, SQL
    Processing Steps: The raw road traffic crash data underwent a comprehensive refining process using Python software to ensure its accuracy and consistency. Following this, duplicates were eliminated to retain only one entry per crash incident. Subsequently, the data underwent further refinement with Open Refine software, focusing specifically on isolating unique crash descriptions for subsequent geocoding in ArcGIS Pro. Notably, during this process, only the road intersection crashes were retained, as they were the only incidents with spatial definitions. Once geocoded, road intersection crashes that involved a pedestrian with a severe or fatal injury type were extracted so that subsequent spatio-temporal analyses would focus on these crashes only. The spatio-temporal analysis methods by which these pedestrian crashes were analysed included spatial autocorrelation, hotspot analysis, and cluster and outlier analysis. Leveraging these methods, road intersections with pedestrian crashes that resulted in a severe injury identified as either "high-high" clusters or "high-low" outliers were extracted for inclusion in the dataset.

    Geospatial Information
    Spatial Coverage:
    West Bounding Coordinate: 18°20'E
    East Bounding Coordinate: 19°05'E
    North Bounding Coordinate: 33°25'S
    South Bounding Coordinate: 34°25'S
    Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

    Temporal Information
    Temporal Coverage:
    Start Date: 01/01/2017
    End Date: 31/12/2019

  12. Bank Loan Case Study Dataset

    • kaggle.com
    zip
    Updated May 4, 2023
    + more versions
    Cite
    Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
    Explore at:
    zip (117814223 bytes)
    Dataset updated
    May 4, 2023
    Authors
    Shreshth Vashisht
    Description

    This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

    Business Understanding: Loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history, and some consumers take advantage of this by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.

    When the company receives a loan application, the company has to decide on loan approval based on the applicant's profile. Two types of risk are associated with the bank's decision:

    • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

    The data given below contains information about the loan application at the time of applying for the loan. It covers two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: all other cases, when the payment is paid on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

    • Approved: the company has approved the loan application.
    • Cancelled: the client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher risk profile led to worse pricing which the client did not want.
    • Refused: the company rejected the loan (because the client does not meet their requirements, etc.).
    • Unused Offer: the loan was cancelled by the client, but at a different stage of the process.

    In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

    Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

    In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

    To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

    Data Understanding: Download the Dataset using the link given under dataset section on the right.

    • application_data.csv contains all the information about the client at the time of application, including whether the client has payment difficulties.
    • previous_application.csv contains information about the client's previous loan applications, including whether each previous application was Approved, Cancelled, Refused or an Unused offer.
    • columns_descrption.csv is the data dictionary which describes the meaning of the variables.

    You are required to provide a detailed report for the data above, answering the questions that follow:

    • Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly.
    • Identify the missing data and use an appropriate method to deal with it (remove columns or replace values with an appropriate value). Hint: in EDA it is not necessary to replace missing values, but if you do replace them, clearly state your approach.
    • Identify if there are outliers in the dataset, and explain why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
    • Identify if there is data imbalance in the data, and find the ratio of the imbalance. Hint: since there are a lot of columns, you can run your analysis in loops over the appropriate columns and extract the insights.
    • Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
    • Find the top 10 c...
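
    A minimal sketch of two of the checks asked for above (missing-data share per column and the class-imbalance ratio); the TARGET column name is an assumption based on the typical schema used in this case study:

    import pandas as pd

    app = pd.read_csv("application_data.csv")

    # Missing data: percentage of nulls per column, worst first
    missing_pct = (app.isna().mean() * 100).sort_values(ascending=False)
    print(missing_pct.head(10))

    # Data imbalance: ratio of on-time clients to clients with payment difficulties
    counts = app["TARGET"].value_counts()
    print("Imbalance ratio (0:1) =", round(counts[0] / counts[1], 2))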

  13. Socio-demographic and economic characteristics of respondents.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Oct 17, 2023
    Cite
    Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle (2023). Socio-demographic and economic characteristics of respondents. [Dataset]. http://doi.org/10.1371/journal.pdig.0000345.t001
    Explore at:
    xls (available download format)
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Socio-demographic and economic characteristics of respondents.

  14. Insurance_claims

    • kaggle.com
    • data.mendeley.com
    zip
    Updated Oct 19, 2025
    Cite
    Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
    Explore at:
    zip (68984 bytes)
    Dataset updated
    Oct 19, 2025
    Authors
    Miannotti
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

    https://data.mendeley.com/datasets/992mh7dk9y/2

    Latest version Version 2 Published: 22 Aug 2023 DOI: 10.17632/992mh7dk9y.2

    Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

    Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. Python code used:

    import pandas as pd

    # Load the dataset file
    insurance_df = pd.read_csv('insurance_claims.csv')

    • Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

    Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

    Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

    Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

    Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).
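
    A minimal sketch of the modeling step described above (SMOTE applied to the training split only, followed by a RandomForest); the Y/N encoding of fraud_reported and the naive dummy encoding are assumptions:

    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    insurance_df = pd.read_csv("insurance_claims.csv")
    y = (insurance_df["fraud_reported"] == "Y").astype(int)         # assumed Y/N labels
    X = pd.get_dummies(insurance_df.drop(columns=["fraud_reported"]))

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)    # oversample training split only

    clf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_res, y_res)
    print(classification_report(y_test, clf.predict(X_test)))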

    Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

    Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

    Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

    Software & Tools: - Programming Language: Python (version: GoogleColab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.

  15. Bay Area All Commute Points (2018 Data)

    • kaggle.com
    zip
    Updated Nov 30, 2022
    Cite
    Thomas Nguyen (2022). Bay Area All Commute Points (2018 Data) [Dataset]. https://www.kaggle.com/datasets/thomasnguyen01/bay-area-all-commute-points-2018-data
    Explore at:
    zip (1694976 bytes)
    Dataset updated
    Nov 30, 2022
    Authors
    Thomas Nguyen
    Area covered
    San Francisco Bay Area
    Description

    Context

    The San Francisco Bay Area (nine-county) is one of the largest urban areas in the US by population and GDP. It is home to over 7.5 million people and has a GDP of $995 billion (third highest by GDP output and first highest by GDP per capita). Home to Silicon Valley (a global center for high technology and innovation) and San Francisco (the second largest financial center in the US after New York), the Bay Area contains some of the most profitable industries and sophisticated workforces in the world. This dataset describes where these workers live and commute to work in 2018.

    Content

    This data file includes all needed information as a means to find out more about the different commute patterns, geographical locations, and necessary metrics to make predictions and draw conclusions.

    Inspiration

    • What can we learn about the different residence and workplace locations? What is the average distance between these locations?
    • Which counties contain a high/low concentration of residence and workplace locations? Why?
    • What counties do most of the commuters usually commute to work? What counties do most of the commuters call home?
    • Are most of these commute patterns county-by-county or within a single county?
    • Are there any noticeable outliers (e.g., long commute patterns) in this dataset? What counties contain a high concentration of these outliers?
  16. Feature Engineering Dataset

    • kaggle.com
    zip
    Updated Apr 18, 2023
    Cite
    Harikant Shukla (2023). Feature Engineering Dataset [Dataset]. https://www.kaggle.com/datasets/harikantshukla/feature-engineering-dataset/discussion
    Explore at:
    zip (95245 bytes)
    Dataset updated
    Apr 18, 2023
    Authors
    Harikant Shukla
    Description

    While searching for the dream house, the buyer looks at various factors, not just at the height of the basement ceiling or the proximity to an east-west railroad.

    Using the dataset, find the factors that influence price negotiations while buying a house.

    There are 79 explanatory variables describing every aspect of residential homes in Ames, Iowa.

    Task to be Performed:

    1) Download the "PEP1.csv" file using the link given in the Feature Engineering project problem statement.
    2) For a detailed description of the dataset, you can download and refer to data_description.txt using the link given in the Feature Engineering project problem statement.

    Tasks to Perform

    1) Import the necessary libraries
       1.1 Pandas is a Python library for data manipulation and analysis.
       1.2 NumPy is a package that contains a multidimensional array object and several derivative ones.
       1.3 Matplotlib is a Python visualization package for 2D array plots.
       1.4 Seaborn is built on top of Matplotlib. It is used for exploratory data analysis and data visualization.
    2) Read the dataset
       2.1 Understand the dataset
       2.2 Print the names of the columns
       2.3 Print the shape of the dataframe
       2.4 Check for null values
       2.5 Print the unique values
       2.6 Select the numerical and categorical variables
    3) Descriptive stats and EDA
       3.1 EDA of numerical variables
       3.2 Missing value treatment
       3.3 Identify the skewness and distribution
       3.4 Identify significant variables using a correlation matrix
       3.5 Pair plot for distribution and density

    Project Outcome
    • The aim of the project is to help understand working with the dataset and performing analysis.
    • This project will assess the data and prepare a fresh dataset for training and prediction.
    • Create a box plot to identify the variables with outliers.
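
    A minimal sketch of task 3.4 and the box-plot outcome above (a correlation heatmap of the numeric variables and box plots to spot outliers):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("PEP1.csv")
    numeric = df.select_dtypes(include="number")

    # 3.4 Identify significant variables using a correlation matrix
    sns.heatmap(numeric.corr(), cmap="coolwarm", center=0)
    plt.show()

    # Box plots to identify the variables with outliers
    numeric.boxplot(rot=90, figsize=(14, 6))
    plt.tight_layout()
    plt.show()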

  17. Long-term Thermal Drift Dataset

    • kaggle.com
    zip
    Updated Aug 2, 2021
    Cite
    Ivan Nikolov (2021). Long-term Thermal Drift Dataset [Dataset]. https://www.kaggle.com/ivannikolov/longterm-thermal-drift-dataset
    Explore at:
    zip (18976473381 bytes)
    Dataset updated
    Aug 2, 2021
    Authors
    Ivan Nikolov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Long-term datasets for detecting concept drift

    Once a model goes from laboratory conditions to being deployed in the real world, the problem of changing environmental conditions causes performance degradation. This phenomenon can be explained by the presence of concept drift. This drift can be gradual, recurrent, or rapid. To train more robust methods this drift of the data needs to be taken into consideration. To be able to do this, large-scale single location datasets need to be used.

    The dataset is used to test the influence of concept drift on six different deep learning models

    • Two autoencoders - a shallow convolutional autoencoder (CAE) and an implementation of the VQVAE2 autoencoder
    • Two anomaly detectors - two versions of the MNAD model, one using reconstructions and one using predictions
    • Two object detectors - YOLOv5 and Faster R-CNN

    No changes have been made to the architecture of the models, except changing their input channels from 3 to 1, corresponding to the change from RGB to grayscale thermal images. The implementations of the training code, dataloaders and config files are given in the Methods folder of the GitHub repository.

    This is the largest publicly available single location thermal dataset. It consists of 8 months of video data between January and August. The video clips are in a 2-minute format every 30 minutes throughout the day. The dataset is captured at the harbor front of Aalborg, Denmark at coordinates (9.9217, 57.0488). The dataset contains many changing weather conditions like rain, snow, fog, as well as people-related changes like large groups of people, parked and moving cars, trucks, bicycles, etc.

    The captured videos are in .mp4 format, with a resolution of 288 x 384 and were captured using a Hikvision DS-2TD2235D-25/50 thermal camera. The data is 8-bit grayscale clips.

    The folder structure of dataset is as follows:

    Dataset
    │ 
    └───Day Folder with naming convention - {YYYYMMDD}
    │  │  2-minute clip with a naming convention - clip_{number}_{HHMM}
    │ 
    │  Metadata.csv containing the weather and timestamp data
    │  Code for extracting image frames from the video clips
    │  Code for a video clip dataloader using the metadata to select clips depending on their timestamps, weather condition, etc.
    

    Content

    The dataset contains 298 hours of video data, captured in 2020 and 2021.

    A subset of the dataset frames used for pedestrian detection is given in the folder Data_Annotated_Subset_Object_Detectors. Frames from January, February, March, April and August are annotated using the labelImg software. Amount of files:
    - Testing data: January - 100 images, April - 100 images, August - 100 images
    - Training data: February - 400 images, March - 100 images

    The annotation is presented as bounding boxes. The annotation files contain the class of the annotated object - in this case, only pedestrians (with class 0), followed by the X,Y coordinates of the corner of the bounding box, the width and height of the box. The coordinates, width and height are normalized based on the resolution of the images.
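
    A minimal sketch of reading one annotation file in the format described above (class, corner x, corner y, width, height, all normalized) and converting it back to pixel coordinates for the 288 x 384 frames; the file name is hypothetical:

    IMG_W, IMG_H = 384, 288          # thermal frame resolution stated above

    def read_boxes(path):
        """Parse one annotation file: class, corner x, corner y, width, height (normalized)."""
        boxes = []
        with open(path) as f:
            for line in f:
                cls, x, y, w, h = line.split()
                boxes.append({
                    "class": int(cls),               # 0 = pedestrian
                    "x_px": float(x) * IMG_W,
                    "y_px": float(y) * IMG_H,
                    "w_px": float(w) * IMG_W,
                    "h_px": float(h) * IMG_H,
                })
        return boxes

    print(read_boxes("example_annotation.txt"))      # hypothetical file name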

    A second subset of frames is given in the folder Data_Subset_Autoencoders_Anomaly_Detectors. No annotations are given for this subset - it is only used as a division of training and testing data. To select data from this subset, please use the provided dataset.py script. Amount of files:
    - Testing data: January - 100 images, April - 100 images, August - 100 images
    - Training data: February - 20000 images, March - 5000 images

    Together with the video data, the dataset contains CSV metadata with the following column structure:
    - Folder name: the name of the folder the video clip is in
    - Clip name: the name of the 2-minute clip
    - DataTime: the timestamp of the start of the clip
    - Temperature: the temperature in Celsius
    - Humidity: relative humidity percentage measured 2 m over terrain in %
    - Precipitation: accumulated precipitation in [kg/m^2]
    - Dew Point: dew point temperature in Celsius measured 2 m over terrain
    - Wind Direction: wind direction in degrees orientation measured 10 m over terrain
    - Wind Speed: wind speed in [m/s] measured 10 m over terrain
    - Sun Radiation: mean sun radiation in [W/m^2]
    - Min Sunshine: minutes of sunshine in the measured interval

    The weather data is captured in 10 min intervals using the open-source Danish Meteorological Institute (DMI) weather API - https://confluence.govcloud.dk/display/FDAPI

    The dataset contains two Python scripts - the first is a simple dataloader for gathe...

  18. INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Bimal Kumar Saini (2025). INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT [Dataset]. https://www.kaggle.com/datasets/bimalkumarsaini/india-electricity-and-energy-analysis-project
    Explore at:
    zip (4986654 bytes)
    Dataset updated
    Nov 23, 2025
    Authors
    Bimal Kumar Saini
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    ⚡ INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    This repository presents an extensive data engineering, cleaning, and analytical study on India’s electricity ecosystem using Python. The project covers coal stock status, thermal power generation, renewable energy trends, energy requirements & availability, and installed capacity across states.

    The goal is to identify operational bottlenecks, resource deficits, energy trends, and support data-driven decisions in the power sector.

    📊 Electricity Data Insights & System Analysis

    The project leverages five government datasets:

    🔹 Daily Coal Stock Data

    🔹 Daily Power Generation

    🔹 Renewable Energy Production

    🔹 State-wise Energy Requirement vs Availability

    🔹 Installed Capacity Across Fuel Types

    The final analysis includes EDA, heatmaps, trend analysis, outlier detection, data-cleaning automation, and visual summaries.

    🔹 Key Features

    ✅ 1. Comprehensive Data Cleaning Pipeline

    Null value treatment using median/mode strategies

    Standardizing categorical inconsistencies

    Filling missing regions, states, and production values

    Date format standardization

    Removing duplicates across all datasets

    Large-scale outlier detection using custom 5×IQR logic (to preserve real-world operational variance)
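
    A minimal sketch of a 5×IQR fence of the kind described (wider than the usual 1.5×IQR so genuine operational swings are not discarded); the column name is an assumption:

    import pandas as pd

    def flag_outliers_5iqr(series: pd.Series) -> pd.Series:
        """Return a boolean mask of values outside the 5x IQR fences."""
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        return (series < q1 - 5 * iqr) | (series > q3 + 5 * iqr)

    coal = pd.read_csv("coal_stock.csv")                       # file listed in this repository
    print(flag_outliers_5iqr(coal["Stock Days"]).sum())        # column name is an assumption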

    ✅ 2. Exploratory Data Analysis (EDA)

    Includes:

    Coal stock trends over years

    Daily power generation patterns

    Solar, wind, and renewable growth

    State-wise energy shortage & surplus

    Installed capacity distribution across India

    Correlation maps for all major datasets

    ✅ 3. Trend Visualizations

    📈 Coal Stock Time-Series

    🔥 Thermal Power Daily Output

    🌞 Solar & Wind Contribution Over Time

    🇮🇳 State-wise Energy Deficit Bar Chart

    🗺️ MOM Energy Requirement Heatmap

    ⚙️ Installed Capacity Share of Each State

    📌 Dashboard & Analysis Components

    Section - Description
    🔹 Coal Stock Dashboard - Daily stock, consumption, transport mode, critical plants
    🔹 Power Generation - Capacity, planned vs actual generation
    🔹 Renewable Mix - Solar, wind, hydro & total RE contributions
    🔹 Energy Shortfall - Requirement vs availability across states
    🔹 Installed Capacity - Coal, Gas, Hydro, Nuclear & RES capacity stacks

    🧠 Insights & Findings

    🔥 Coal Stock

    Critical coal stock days observed for multiple stations

    Seasonal dips in stock days & indigenous supply shocks

    Import dependency minimal but volatile

    ⚡ Power Generation

    Thermal stations show fluctuating PLF (Plant Load Factor)

    Many states underperform planned generation

    🌞 Renewable Energy

    Solar shows continuous year-over-year growth

    Wind output peaks around monsoon months

    🔌 Energy Requirement vs Availability

    States like Delhi, Bihar, Jharkhand show intermittent deficits

    MOM heatmap highlights major seasonal spikes

    ⚙️ Installed Capacity

    Southern & Western regions dominate national capacity

    Coal remains the largest but renewable share rising rapidly

    📁 Files in This Repository

    File - Description
    coal_stock.csv - Cleaned coal stock dataset
    power_gen.csv - Daily power generation data
    renewable_engy.csv - State-wise renewable energy dataset
    engy_reqmt.csv - Monthly requirement & availability dataset
    install_cpty.csv - Installed capacity across fuel types
    electricity.ipynb - Full Python EDA notebook
    electricity.pdf - Export of full Colab notebook (code + visuals)
    README.md - GitHub project summary

    🛠️ Technologies Used 📊 Data Analysis

    Python (Pandas, NumPy, Matplotlib, Seaborn)

    🧹 Data Cleaning

    Null Imputation

    Outlier Detection (5×IQR)

    Standardization & Encoding

    Handling Large Multi-year Datasets

    🔧 System Concepts

    Modular Python Code

    Data Pipelines & Feature Engineering

    Version Control (Git/GitHub)

    Cloud Concepts (Google Colab + Drive Integration)

    📈 Core Metrics & KPIs

    Total Stock Days

    PLF% (Plant Load Factor)

    Renewable Energy Contribution

    Energy Deficit (%)

    National Installed Capacity Share

    📚 Future Enhancements

    Build a Power BI dashboard for visual storytelling

    Integrate forecasting models (ARIMA / Prophet)

    Automate coal shortage alerts

    Add state-level energy prediction for seasonality

    Deploy the analysis as a web dashboard (Streamlit)

  19. Red and White Wine Quality Analysis

    • kaggle.com
    zip
    Updated Dec 6, 2021
    Cite
    Sai Geetha Chandrashekar (2021). Red and White Wine Quality Analysis [Dataset]. https://www.kaggle.com/datasets/saigeethac/red-and-white-wine-quality-datasets
    Explore at:
    zip (97750 bytes)
    Dataset updated
    Dec 6, 2021
    Authors
    Sai Geetha Chandrashekar
    Description

    Wine Quality Data Set

    This data set is available in UCI at https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

    Abstract: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.

    Data Set Information:

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

    Attribute Information:

    Input variables (based on physicochemical tests):

    1. fixed acidity
    2. volatile acidity
    3. citric acid
    4. residual sugar
    5. chlorides
    6. free sulfur dioxide
    7. total sulfur dioxide
    8. density
    9. pH
    10. sulphates
    11. alcohol

    Output variable (based on sensory data):

    1. quality (score between 0 and 10)

    These columns have been described in the Kaggle Data Explorer.

    Context

    The authors state "we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods." We have briefly explored this aspect and see that Red wine quality prediction on the test and training datasets is almost the same (~88%) with just three features. Likewise White wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors.

    Content

    Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. Both these datasets are analyzed and linear regression models are developed in Python 3. The github link provided for the source code also includes a Flask web application for deployment on the local machine or on Heroku.
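
    A minimal sketch of the kind of model described above (a linear regression on the red-wine physicochemical features), assuming the standard UCI file winequality-red.csv, which uses ';' as the separator:

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    red = pd.read_csv("winequality-red.csv", sep=";")          # UCI file uses ';' as separator
    X, y = red.drop(columns=["quality"]), red["quality"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LinearRegression().fit(X_train, y_train)
    print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))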

    Acknowledgements

    Datasets: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Banner Image: Photo by Roberta Sorge on Unsplash

    Github Link

    Complete code has been uploaded onto github at https://github.com/saigeethachandrashekar/wine_quality.

    Please clone the repo - this contains both the datasets, the code required for building and saving the model on to your local system. Code for a Flask app is provided for deploying the models on your local machine. The app can also be deployed on Heroku - the requirements.txt and Procfile are also provided for this.

    Next Steps

    1. White wine quality prediction appears to depend on just one feature. This may be due to the privacy and logistics issues mentioned by the dataset authors (e.g. there is no data about grape types, wine brand, wine selling price, etc.) or it may be due to other factors that are not clear. This is an area that might be worth exploring further.

    2. Other ML techniques may be applied to improve the accuracy.

  20. zomato order data

    • kaggle.com
    Updated Jul 14, 2025
    Cite
    NayakGanesh007 (2025). zomato order data [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/zomato-order-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 14, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    NayakGanesh007
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Zomato Food Orders – Data Analysis Project 📌 Description: This dataset contains food order data from Zomato, one of India’s leading food delivery platforms. It includes information on customer orders, order status, restaurants, delivery times, and more. The goal of this project is to explore and analyze key insights around customer behavior, delivery patterns, restaurant performance, and order trends.

    🔍 Project Objectives: 📊 Perform Exploratory Data Analysis (EDA)

    📦 Analyze most frequently ordered cuisines and items

    ⏱️ Understand average delivery times and delays

    🧾 Identify top restaurants and order volumes

    📈 Uncover order trends by time (hour/day/week)

    💬 Visualize data using Matplotlib & Seaborn

    🧹 Clean and preprocess data (missing values, outliers, etc.)

    📁 Dataset Features (Example Columns):

    Order ID - Unique ID for each order
    Customer ID - Unique customer identifier
    Restaurant - Name of the restaurant
    Cuisine - Type of cuisine ordered
    Order Time - Timestamp when the order was placed
    Delivery Time - Timestamp when the order was delivered
    Order Status - Status of the order (Delivered, Cancelled)
    Payment Method - Mode of payment (Cash, Card, UPI, etc.)
    Order Amount - Total price of the order
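
    A minimal sketch of the delivery-time analysis mentioned above, assuming the example columns listed and a hypothetical file name:

    import pandas as pd

    orders = pd.read_csv("zomato_orders.csv")                  # hypothetical file name
    orders["Order Time"] = pd.to_datetime(orders["Order Time"])
    orders["Delivery Time"] = pd.to_datetime(orders["Delivery Time"])
    orders["delivery_minutes"] = (orders["Delivery Time"] - orders["Order Time"]).dt.total_seconds() / 60

    print("Average delivery time (min):", round(orders["delivery_minutes"].mean(), 1))
    print(orders.groupby("Restaurant")["delivery_minutes"].mean().sort_values().head())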

    🛠 Tools & Libraries Used: Python

    Pandas, NumPy for data manipulation

    Matplotlib, Seaborn for visualization

    Excel (for raw dataset preview and checks)

    ✅ Outcomes: Customer ordering trends by cuisine and location

    Time-of-day and day-of-week analysis for peak delivery times

    Delivery efficiency evaluation

    Business recommendations for improving customer experience
