25 datasets found

Customer Sales Analysis
kaggle.com
zip
Updated Nov 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nibedita Sahu (2024). Customer Sales Analysis [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/customer-sales-analysis
Explore at:
zip(1410791 bytes)Available download formats
Dataset updated
Nov 2, 2024
Authors
Nibedita Sahu
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Developed the Customer Sales Analysis project using Python libraries like Pandas, NumPy, Matplotlib, and Seaborn. This project involves advanced data cleaning, outlier detection, and seasonal trends analysis. Key insights include identifying outliers, analyzing seasonal trends, and performing customer segmentation using RFM analysis. The project features various visualizations to communicate insights effectively, allowing for deeper insights into sales performance and customer behavior.
mumpcepy: A Python implementation of the Method of Uncertainty Minimization...
datasets.ai
catalog.data.gov
0
Updated Mar 11, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2021). mumpcepy: A Python implementation of the Method of Uncertainty Minimization using Polynomial Chaos Expansions [Dataset]. https://datasets.ai/datasets/mumpcepy-a-python-implementation-of-the-method-of-uncertainty-minimization-using-polynomia-c2fc3
Explore at:
0Available download formats
Dataset updated
Mar 11, 2021
Dataset authored and provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
The Method of Uncertainty Minimization using Polynomial Chaos Expansions (MUM-PCE) was developed as a software tool to constrain physical models against experimental measurements. These models contain parameters that cannot be easily determined from first principles and so must be measured, and some which cannot even be easily measured. In such cases, the models are validated and tuned against a set of global experiments which may depend on the underlying physical parameters in a complex way. The measurement uncertainty will affect the uncertainty in the parameter values.
Metabolomics Data Preprocessing PQN PCA
kaggle.com
zip
Updated Nov 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dr. Nagendra (2025). Metabolomics Data Preprocessing PQN PCA [Dataset]. https://www.kaggle.com/datasets/mannekuntanagendra/metabolomics-data-preprocessing-pqn-pca
Explore at:
zip(22763 bytes)Available download formats
Dataset updated
Nov 29, 2025
Authors
Dr. Nagendra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset provides a step-by-step pipeline for preprocessing metabolomics data.

The pipeline implements Probabilistic Quotient Normalization (PQN) to correct dilution effects in metabolomics measurements.

Includes guidance on handling raw metabolomics datasets obtained from LC-MS or NMR experiments.

Demonstrates Principal Component Analysis (PCA) for dimensionality reduction and exploratory data analysis.

Includes data visualization techniques to interpret PCA results effectively.

Suitable for metabolomics researchers and data scientists working on omics data.

Enables better reproducibility of preprocessing workflows for metabolomics studies.

Can be used to normalize data, detect outliers, and identify major patterns in metabolomics datasets.

Provides a Python-based notebook that is easy to adapt to new datasets.

Includes example datasets and code snippets for immediate application.

Helps users understand the impact of normalization on downstream statistical analyses.

Supports integration with other metabolomics pipelines or machine learning workflows.
Engine Ratng Prediction
kaggle.com
zip
Updated Feb 28, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ved Prakash (2023). Engine Ratng Prediction [Dataset]. https://www.kaggle.com/datasets/ved1104/engine-ratng-prediction
Explore at:
zip(3540393 bytes)Available download formats
Dataset updated
Feb 28, 2023
Authors
Ved Prakash
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Your task is to write a small Python or R script that predicts the engine rating based on the inspection parameters using only the provided dataset. You need to find all the cases/outliers where the rating has been given incorrectly as compared to the current condition of the engine.

This task is designed to test your Python or R ability, your knowledge of Data Science techniques, your ability to find trends, and outliers, the relative importance of variables with deviation in target variable, and your ability to work effectively, efficiently, and independently within a commercial setting.

This task is designed as well to test your hyper-tuning abilities or lateral thinking. Deliverables: · One Python or R script · One requirement text file including an exhaustive list of packages and version numbers used in your solution · Summary of your insights · List of cases that are outliers/incorrectly rated as high or low and it should be backed with analysis/reasons. · model object files for reproducibility.

Your solution should at a minimum do the following: · Load the data into memory · Prepare the data for modeling · EDA of the variables · Build a model on training data · Test the model on testing data · Provide some measure of performance · Outlier analysis and detection
f
Data_Sheet_1_ExGUtils: A Python Package for Statistical Analysis With the...
frontiersin.figshare.com
zip
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carmen Moret-Tatay; Daniel Gamermann; Esperanza Navarro-Pardo; Pedro Fernández de Córdoba Castellá (2023). Data_Sheet_1_ExGUtils: A Python Package for Statistical Analysis With the ex-Gaussian Probability Density.zip [Dataset]. http://doi.org/10.3389/fpsyg.2018.00612.s001
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.3389/fpsyg.2018.00612.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Carmen Moret-Tatay; Daniel Gamermann; Esperanza Navarro-Pardo; Pedro Fernández de Córdoba Castellá
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The study of reaction times and their underlying cognitive processes is an important field in Psychology. Reaction times are often modeled through the ex-Gaussian distribution, because it provides a good fit to multiple empirical data. The complexity of this distribution makes the use of computational tools an essential element. Therefore, there is a strong need for efficient and versatile computational tools for the research in this area. In this manuscript we discuss some mathematical details of the ex-Gaussian distribution and apply the ExGUtils package, a set of functions and numerical tools, programmed for python, developed for numerical analysis of data involving the ex-Gaussian probability density. In order to validate the package, we present an extensive analysis of fits obtained with it, discuss advantages and differences between the least squares and maximum likelihood methods and quantitatively evaluate the goodness of the obtained fits (which is usually an overlooked point in most literature in the area). The analysis done allows one to identify outliers in the empirical datasets and criteriously determine if there is a need for data trimming and at which points it should be done.
H
CNN-based Colon Cancer Detection - Python Code
dataverse.harvard.edu
Updated Sep 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jie Li (2025). CNN-based Colon Cancer Detection - Python Code [Dataset]. http://doi.org/10.7910/DVN/B7OXBS
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/B7OXBS
Dataset updated
Sep 18, 2025
Dataset provided by
Harvard Dataverse
Authors
Jie Li
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This is the code for CNN-based colon cancer detection. The algorithm and code including parametric data cleaning method using Gaussian 99% rule to remove outliers from the benchmark dataset - LC25000.
Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...
zenodo.org
zip
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.15337959
Dataset updated
May 7, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
o; o
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 4, 2025
Description
Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

ESM_2.py – Python script to calculate Z-scores from raw financial ratios

ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

ESM_5.xlsx – Mahalanobis distance values for each firm

ESM_6.py – Python script to compute Mahalanobis distances

ESM_7.py – Python script to visualize Mahalanobis distances

ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

ESM_9.py – Python script to compute mean Z-scores

ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

ESM_11.py – Python script to re-standardize mean Z-scores

ESM_12.py – Python script to generate the hierarchical clustering dendrogram

All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
Z
Lipidomics LC-MS analysis support tools for outlier detection
data.niaid.nih.gov
Updated Mar 28, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Spick, Matt (2024). Lipidomics LC-MS analysis support tools for outlier detection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10889320
Explore at:
Dataset updated
Mar 28, 2024
Dataset provided by
University of Surrey
Authors
Spick, Matt
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Identification of features with high levels of confidence in liquid chromatography-mass spectrometry (LC MS) lipidomics research is an essential part of biomarker discovery, but existing software platforms can give inconsistent results, even from identical spectral data. This poses a clear challenge for reproducibility in bioinformatics work, and highlights the importance of data-driven outlier detection in assessing spectral outputs – here demonstrated using a machine learning approach based on support vector machine regression combined with leave-one-out cross validation – as well as manual curation, in order to identify software-driven errors driven by closely related lipids and by co-elution issues.

The lipidomics case study dataset used in this work analysed a lipid extraction of a human pancreatic adenocarcinoma cell line (PANC-1, Merck, UK, cat no. 87092802) analysed using an Acquity M-Class UPLC system (Waters, UK) coupled to a ZenoToF 7600 mass spectrometer (Sciex, UK). Raw output files are included alongside processed data using MS DIAL (v4.9.221218) and Lipostar (v2.1.4) and a Jupyter notebook with Python code to analyse the outputs for outlier detection.
Medical Clean Dataset
kaggle.com
zip
Updated Jul 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
Explore at:
zip(1262 bytes)Available download formats
Dataset updated
Jul 6, 2025
Authors
Aamir Shahzad
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

Handling missing values using statistical techniques such as median imputation and mode replacement

Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)

Removing duplicate entries to ensure data accuracy

Parsing and standardizing date fields

Creating new derived features such as age groups

Detecting and reviewing outliers based on IQR

Removing irrelevant or redundant columns

The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Data from: tableone: An open source Python package for producing summary...
zenodo.org
search.dataone.org
+1more
csv, txt
Updated May 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark; Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark (2022). Data from: tableone: An open source Python package for producing summary statistics for research papers [Dataset]. http://doi.org/10.5061/dryad.26c4s35
Explore at:
csv, txtAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.26c4s35
Dataset updated
May 30, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark; Tom J. Pollard; Alistair E. W. Johnson; Jesse D. Raffa; Roger G. Mark
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Objectives: In quantitative research, understanding basic parameters of the study population is key for interpretation of the results. As a result, it is typical for the first table ("Table 1") of a research paper to include summary statistics for the study data. Our objectives are 2-fold. First, we seek to provide a simple, reproducible method for providing summary statistics for research papers in the Python programming language. Second, we seek to use the package to improve the quality of summary statistics reported in research papers.

Materials and Methods: The tableone package is developed following good practice guidelines for scientific computing and all code is made available under a permissive MIT License. A testing framework runs on a continuous integration server, helping to maintain code stability. Issues are tracked openly and public contributions are encouraged.

Results: The tableone software package automatically compiles summary statistics into publishable formats such as CSV, HTML, and LaTeX. An executable Jupyter Notebook demonstrates application of the package to a subset of data from the MIMIC-III database. Tests such as Tukey's rule for outlier detection and Hartigan's Dip Test for modality are computed to highlight potential issues in summarizing the data.

Discussion and Conclusion: We present open source software for researchers to facilitate carrying out reproducible studies in Python, an increasingly popular language in scientific research. The toolkit is intended to mature over time with community feedback and input. Development of a common tool for summarizing data may help to promote good practice when used as a supplement to existing guidelines and recommendations. We encourage use of tableone alongside other methods of descriptive statistics and, in particular, visualization to ensure appropriate data handling. We also suggest seeking guidance from a statistician when using tableone for a research study, especially prior to submitting the study for publication.

Datasets and source code for a pipeline architecture for feature-based...

data.mendeley.com

Updated Dec 14, 2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Jonatan Enes (2022). Datasets and source code for a pipeline architecture for feature-based unsupervised clustering using multivariate time series from HPC jobs [Dataset]. http://doi.org/10.17632/hgkv9cpnmn.2

Explore at:

Unique identifier

https://doi.org/10.17632/hgkv9cpnmn.2

Dataset updated

Dec 14, 2022

Authors

Jonatan Enes

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This repository is composed of 2 compressed files, with the contents as next described.

--- code.tar.gz --- The source code that implements the pipeline, as well as code and scripts needed to retrieve time series, create the plots or run the experiments. More specifically:

+ prepare.py and main.py ⇨ 
  The Python programs that implement the pipeline, both the auxiliary and the main pipeline 
  stages, respectively. 

+ 'anomaly' and 'config' folders ⇨ 
  Scripts and Python files containing the configuration and some basic functions that are 
  used to retrieve the information needed to process the data, like the actual resource 
  time series from OpenTSDB, or the job metadata from Slurm.

+ 'functions' folder ⇨ 
  Several folders with the Python programs that implement all the stages of the pipeline, 
  either for the Machine Learning processing (e.g., extractors, aggregators, models), or 
  the technical aspect of the pipeline (e.g., pipelines, transformer).

+ plotDF.py ⇨ 
  A Python program used to create the different plots presented, from the resource time 
  series to the evaluation plots.

+ several bash scripts ⇨ 
  Used to run the experiments using a specific configuration, whether regarding which 
  transformers are chosen and how they are parametrized, or more technical aspects 
  involving how the pipeline is executed.

--- data.tar.gz --- The actual data and results, organized as follows:

+ jobs ⇨ 
  All the jobs' resource time series plots for all the experiments, with a folder used 
  for each experiment. Inside each folder all the jobs are separated according to their 
  id, containing the plots for the different system resources (e.g., User CPU, Cached memory).

+ plots ⇨ 
  All the predictions' plots for all the experiments in separated folders, mainly used for 
  evaluation purposes (e.g., scatter plot, heatmaps, Andrews curves, dendrograms). These 
  plots are available for all the predictors resulting from the pipeline execution. In 
  addition, for each predictor it is also possible to visualize the resource time series 
  grouped by clusters. Finally, the projections as generated by the dimension reduction 
  models, and the outliers detected, are also available for each experiment.

+ datasets ⇨ 
  The datasets used for the experiments, which include the lists of job IDs to be processed 
  (CSV files) and the results of each stage of the pipeline (e.g., features, predictions), 
  and the output text files as generated by several pipeline stages. Among these latter 
  files it is worth to note the evaluation ones, that include all the predictions scores.

d
Bathymetry of the Main Pool of Lake Calumet, Cook County, Illinois, July...
catalog.data.gov
data.usgs.gov
Updated Oct 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). Bathymetry of the Main Pool of Lake Calumet, Cook County, Illinois, July 2023 [Dataset]. https://catalog.data.gov/dataset/bathymetry-of-the-main-pool-of-lake-calumet-cook-county-illinois-july-2023
Explore at:
Dataset updated
Oct 22, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Lake Calumet, Cook County, Illinois
Description
These data are single-beam bathymetry points compiled in comma separated values (CSV) file format, generated from a hydrographic survey of the northern portion of Lake Calumet in Cook County, Illinois. Hydrographic data were collected July 18-19, 2023, using a single-beam echosounder (SBES) integrated with a Global Navigation Satellite System (GNSS) mounted on a marine survey vessel. Surface water elevation data were collected July 18 utilizing a single-base real-time kinematic (RTK)/GNSS unit. Bathymetric data points were collected as the vessel traversed the northern portions of the lake along overlapping survey lines. The SBES internally collected and stored the depth data from the echosounder and the horizontal and vertical position data of the vessel from the GNSS in real time. Data processing required specialized computer software to export bathymetry data from the raw data files. A Python script was written to calculate the lakebed elevations and identify outliers in the dataset. These data are provided in comma separated values (CSV) format as LakeCalumet_SBES_20230718.csv. Data points are stored as a series of x (longitude), y (latitude), and z (elevation or depth) points along with variable length records specific to the data transects.
Bank Loan Case Study Dataset
kaggle.com
zip
Updated May 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
Explore at:
zip(117814223 bytes)Available download formats
Dataset updated
May 4, 2023
Authors
Shreshth Vashisht
Description
This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

Business Understanding: The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it as their advantage by becoming a defaulter. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.

When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company. If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company. The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:

The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample All other cases: All other cases when the payment is paid on time. When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

Approved: The company has approved loan application Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client he received worse pricing which he did not want. Refused: The company had rejected the loan (because the client does not meet their requirements etc.). Unused Offer: Loan has been cancelled by the client but on different stages of the process. In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency of default.

Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

To develop your understanding of the domain, you are advised to independently research a little about risk analytics – understanding the types of variables and their significance should be enough).

Data Understanding: Download the Dataset using the link given under dataset section on the right.

application_data.csv contains all the information of the client at the time of application. The data is about wheather a client has payment difficulties. previous_application.csv contains information about the client’s previous loan data. It contains the data whether the previous application had been Approved, Cancelled, Refused or Unused offer. columns_descrption.csv is data dictionary which describes the meaning of the variables. You are required to provide a detailed report for the below data record mentioning the answer to the questions that follows:

Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly Indentify the missing data and use appropriate method to deal with it. (Remove columns/or replace it with an appropriate value) Hint: Note that in EDA, since it is not necessary to replace the missing value, but if you have to replace the missing value, what should be the approach. Clearly mention the approach. Identify if there are outliers in the dataset. Also, mention why do you think it is an outlier. Again, remember that for this exercise, it is not necessary to remove any data points. Identify if there is data imbalance in the data. Find the ratio of data imbalance. Hint: Since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights. Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms. Find the top 10 c...
T
Data from: flic
tensorflow.org
Updated Jun 1, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). flic [Dataset]. https://www.tensorflow.org/datasets/catalog/flic
Explore at:
Dataset updated
Jun 1, 2024
Description
From the paper: We collected a 5003 image dataset automatically from popular Hollywood movies. The images were obtained by running a state-of-the-art person detector on every tenth frame of 30 movies. People detected with high confidence (roughly 20K candidates) were then sent to the crowdsourcing marketplace Amazon Mechanical Turk to obtain groundtruthlabeling. Each image was annotated by five Turkers for $0.01 each to label 10 upperbody joints. The median-of-five labeling was taken in each image to be robust to outlier annotation. Finally, images were rejected manually by us if the person was occluded or severely non-frontal. We set aside 20% (1016 images) of the data for testing.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('flic', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.

https://storage.googleapis.com/tfds-data/visualization/fig/flic-small-2.0.0.png" alt="Visualization" width="500px">
f
Socio-demographic and economic characteristics of respondents.
plos.figshare.com
figshare.com
xls
Updated Oct 17, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle (2023). Socio-demographic and economic characteristics of respondents. [Dataset]. http://doi.org/10.1371/journal.pdig.0000345.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pdig.0000345.t001
Dataset updated
Oct 17, 2023
Dataset provided by
PLOS Digital Health
Authors
Shimels Derso Kebede; Daniel Niguse Mamo; Jibril Bashir Adem; Birhan Ewunu Semagn; Agmasie Damtew Walle
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Socio-demographic and economic characteristics of respondents.
Classicmodels
kaggle.com
zip
Updated Dec 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
Explore at:
zip(65751 bytes)Available download formats
Dataset updated
Dec 15, 2024
Authors
Javier Landaeta
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

Methodology 1. Data Extraction:

A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.

A reusable function is created to read each table and load it into a Pandas DataFrame.

2. Data Cleansing and Transformation:

An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.

New variables are calculated, such as the total value of each sale, cost, and profit.

Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

3. Exploratory Data Analysis (EDA):

Key metrics such as total sales, number of unique customers, and average order value are calculated.

Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.

Results are visualized using relevant graphics (histograms, bar charts, etc.).

4. Modeling and Prediction:

Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

5. Report Generation:

Detailed reports are created in Pandas DataFrames format that answer specific business questions.

These reports are stored in new PostgreSQL tables for further analysis and visualization.

Results - Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified. - Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified. - Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

Conclusions This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

Technologies Used - Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn - Database: PostgreSQL - Tools: Jupyter Notebook - Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
SKAB - Skoltech Anomaly Benchmark
kaggle.com
zip
Updated Nov 28, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Iurii Katser (2020). SKAB - Skoltech Anomaly Benchmark [Dataset]. https://www.kaggle.com/datasets/yuriykatser/skoltech-anomaly-benchmark-skab/code
Explore at:
zip(1300142 bytes)Available download formats
Dataset updated
Nov 28, 2020
Authors
Iurii Katser
License
http://www.gnu.org/licenses/agpl-3.0.htmlhttp://www.gnu.org/licenses/agpl-3.0.html
Description
❗️❗️❗️**The current version of SKAB (v0.9) contains 34 datasets with collective anomalies. But the upcoming update to v1.0 (probably up to the summer of 2021) will contain 300+ additional files with point and collective anomalies. It will make SKAB one of the largest changepoint-containing benchmarks, especially in the technical field.**

About SKAB

We propose the Skoltech Anomaly Benchmark (SKAB) designed for evaluating the anomaly detection algorithms. SKAB allows working with two main problems (there are two markups for anomalies): * Outlier detection (anomalies considered and marked up as single-point anomalies) * Changepoint detection (anomalies considered and marked up as collective anomalies)

SKAB consists of the following artifacts: * Datasets. * Leaderboard (scoreboard). * Python modules for algorithms’ evaluation. * Notebooks: python notebooks with anomaly detection algorithms.

The IIot testbed system is located in the Skolkovo Institute of Science and Technology (Skoltech). All the details regarding the testbed and the experimenting process are presented in the following artifacts: - Position paper (currently submitted for publication) - Slides about the project

Datasets

The SKAB v0.9 corpus contains 35 individual data files in .csv format. Each file represents a single experiment and contains a single anomaly. The dataset represents a multivariate time series collected from the sensors installed on the testbed. The data folder contains datasets from the benchmark. The structure of the data folder is presented in the structure file. Columns in each data file are following: - datetime - Represents dates and times of the moment when the value is written to the database (YYYY-MM-DD hh:mm:ss) - Accelerometer1RMS - Shows a vibration acceleration (Amount of g units) - Accelerometer2RMS - Shows a vibration acceleration (Amount of g units) - Current - Shows the amperage on the electric motor (Ampere) - Pressure - Represents the pressure in the loop after the water pump (Bar) - Temperature - Shows the temperature of the engine body (The degree Celsius) - Thermocouple - Represents the temperature of the fluid in the circulation loop (The degree Celsius) - Voltage - Shows the voltage on the electric motor (Volt) - RateRMS - Represents the circulation flow rate of the fluid inside the loop (Liter per minute) - anomaly - Shows if the point is anomalous (0 or 1) - changepoint - Shows if the point is a changepoint for collective anomalies (0 or 1)

Leaderboard (Scoreboard)

Here we propose the leaderboard for SKAB v0.9 both for outlier and changepoint detection problems. You can also present and evaluate your algorithm using SKAB on kaggle. The results in the tables are calculated in the python notebooks from the notebooks folder.

Outlier detection problem

Sorted by F1; for F1 bigger is better; both for FAR and MAR less is better
| Algorithm | F1 | FAR, % | MAR, % |---|---|---|---| Perfect detector | 1 | 0 | 0 T-squared+Q (PCA) | 0.67 | 13.95 | 36.32 LSTM | 0.64 | 15.4 | 39.93 MSCRED | 0.64 | 13.56 | 41.16 T-squared | 0.56 | 12.14 | 52.56 Autoencoder | 0.45 | 7.56 | 66.57 Isolation forest | 0.4 | 6.86 | 72.09 Null detector | 0 | 0 | 100

Changepoint detection problem

Sorted by NAB (standart); for all metrics bigger is better
| Algorithm | NAB (standart) | NAB (lowFP) | NAB (LowFN) | |---|---|---|---| Perfect detector | 100 | 100 | 100 Isolation forest | 37.53 | 17.09 | 45.02 MSCRED | 28.74 | 23.43 | 31.21 LSTM | 27.09 | 11.06 | 32.68 T-squared+Q (PCA) | 26.71 | 22.42 | 28.32 T-squared | 17.87 | 3.44 | 23.2 ArimaFD | 16.06 | 14.03 | 17.12 Autoencoder | 15.59 | 0.78 | 20.91 Null detector | 0 | 0 | 0

Notebooks

The notebooks folder contains python notebooks with the code for the proposed leaderboard results reproducing.

We have calculated the results for five quite common anomaly detection algorithms: - Hotelling's T-squared statistics; - Hotelling's T-squared statistics + Q statistics based on PCA; - Isolation forest; - LSTM-based NN; - Feed-Forward Autoencoder.

Additionaly to the repository were added the results of the following algorithms: - ArimaFD; - MSCRED.

Citat...
Insurance_claims
kaggle.com
data.mendeley.com
zip
Updated Oct 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Miannotti (2025). Insurance_claims [Dataset]. https://www.kaggle.com/datasets/mian91218/insurance-claims
Explore at:
zip(68984 bytes)Available download formats
Dataset updated
Oct 19, 2025
Authors
Miannotti
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
AQQAD, ABDELRAHIM (2023), “insurance_claims ”, Mendeley Data, V2, doi: 10.17632/992mh7dk9y.2

https://data.mendeley.com/datasets/992mh7dk9y/2

Latest version Version 2 Published: 22 Aug 2023 DOI: 10.17632/992mh7dk9y.2

Data Acquisition: - Obtain the dataset titled "Insurance_claims" from the following Mendeley repository: https://https://data.mendeley.com/drafts/992mh7dk9y - Download and store the dataset locally for easy access during subsequent steps.

Data Loading & Initial Exploration: - Use Python's Pandas library to load the dataset into a DataFrame. python Code used:

Load the Dataset File

insurance_df = pd.read_csv('insurance_claims.csv')

Inspect the initial rows, data types, and summary statistics to get an understanding of the dataset's structure.

Data Cleaning & Pre-processing: - Handle missing values, if any. Strategies may include imputation or deletion based on the nature of the missing data. - Identify and handle outliers. In this research, particularly, outliers in the 'umbrella_limit' column were addressed. - Normalize or standardize features if necessary.

Exploratory Data Analysis (EDA): - Utilize visualization libraries such as Matplotlib and Seaborn in Python for graphical exploration. - Examine distributions, correlations, and patterns in the data, especially between features and the target variable 'fraud_reported'. - Identify features that exhibit distinct patterns for fraudulent and non-fraudulent claims.

Feature Engineering & Selection: - Create or transform existing features to improve model performance. - Use techniques like Recursive Feature Elimination (RFECV) to identify and retain only the most informative features.

Modeling: - Split the dataset into training and test sets to ensure the model's generalizability. - Implement machine learning algorithms such as Support Vector Machine, RandomForest, and Voting Classifier using libraries like Scikit-learn. - Handle class imbalance issues using methods like Synthetic Minority Over-sampling Technique (SMOTE).

Model Evaluation: - Evaluate the performance of each model using metrics like precision, recall, F1-score, ROC-AUC score, and confusion matrix. - Fine-tune the models based on the results. Hyperparameter tuning can be performed using techniques like Grid Search or Random Search.

Model Interpretation: - Use methods like SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to interpret and understand the predictions made by the model.

Deployment & Prediction: - Utilize the best-performing model to make predictions on unseen data. - If the intention is to deploy the model in a real-world scenario, convert the trained model into a format suitable for deployment (e.g., using libraries like joblib or pickle).

Software & Tools: - Programming Language: Python (version: GoogleColab) - Libraries: Pandas, Numpy, Matplotlib, Seaborn, Scikit-learn, Imbalanced-learn, LIME, and SHAP. - Environment: Jupyter Notebook or any Python IDE.
INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT
kaggle.com
zip
Updated Nov 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bimal Kumar Saini (2025). INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT [Dataset]. https://www.kaggle.com/datasets/bimalkumarsaini/india-electricity-and-energy-analysis-project
Explore at:
zip(4986654 bytes)Available download formats
Dataset updated
Nov 23, 2025
Authors
Bimal Kumar Saini
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
India
Description
⚡ INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

This repository presents an extensive data engineering, cleaning, and analytical study on India’s electricity ecosystem using Python. The project covers coal stock status, thermal power generation, renewable energy trends, energy requirements & availability, and installed capacity across states.

The goal is to identify operational bottlenecks, resource deficits, energy trends, and support data-driven decisions in the power sector.

📊 Electricity Data Insights & System Analysis

The project leverages five government datasets:

🔹 Daily Coal Stock Data

🔹 Daily Power Generation

🔹 Renewable Energy Production

🔹 State-wise Energy Requirement vs Availability

🔹 Installed Capacity Across Fuel Types

The final analysis includes EDA, heatmaps, trend analysis, outlier detection, data-cleaning automation, and visual summaries.

🔹 Key Features ✅ 1. Comprehensive Data Cleaning Pipeline

Null value treatment using median/mode strategies

Standardizing categorical inconsistencies

Filling missing regions, states, and production values

Date format standardization

Removing duplicates across all datasets

Large-scale outlier detection using custom 5×IQR logic (to preserve real-world operational variance)

✅ 2. Exploratory Data Analysis (EDA)

Includes:

Coal stock trends over years

Daily power generation patterns

Solar, wind, and renewable growth

State-wise energy shortage & surplus

Installed capacity distribution across India

Correlation maps for all major datasets

✅ 3. Trend Visualizations

📈 Coal Stock Time-Series

🔥 Thermal Power Daily Output

🌞 Solar & Wind Contribution Over Time

🇮🇳 State-wise Energy Deficit Bar Chart

🗺️ MOM Energy Requirement Heatmap

⚙️ Installed Capacity Share of Each State

📌 Dashboard & Analysis Components Section Description 🔹 Coal Stock Dashboard Daily stock, consumption, transport mode, critical plants 🔹 Power Generation Capacity, planned vs actual generation 🔹 Renewable Mix Solar, wind, hydro & total RE contributions 🔹 Energy Shortfall Requirement vs availability across states 🔹 Installed Capacity Coal, Gas, Hydro, Nuclear & RES capacity stacks 🧠 Insights & Findings 🔥 Coal Stock

Critical coal stock days observed for multiple stations

Seasonal dips in stock days & indigenous supply shocks

Import dependency minimal but volatile

⚡ Power Generation

Thermal stations show fluctuating PLF (Plant Load Factor)

Many states underperform planned generation

🌞 Renewable Energy

Solar shows continuous year-over-year growth

Wind output peaks around monsoon months

🔌 Energy Requirement vs Availability

States like Delhi, Bihar, Jharkhand show intermittent deficits

MOM heatmap highlights major seasonal spikes

⚙️ Installed Capacity

Southern & Western regions dominate national capacity

Coal remains the largest but renewable share rising rapidly

📁 Files in This Repository File Description coal_stock.csv Cleaned coal stock dataset power_gen.csv Daily power generation data renewable_engy.csv State-wise renewable energy dataset engy_reqmt.csv Monthly requirement & availability dataset install_cpty.csv Installed capacity across fuel types electricity.ipynb Full Python EDA notebook electricity.pdf Export of full Colab notebook (code + visuals) README.md GitHub project summary

🛠️ Technologies Used 📊 Data Analysis

Python (Pandas, NumPy, Matplotlib, Seaborn)

🧹 Data Cleaning

Null Imputation

Outlier Detection (5×IQR)

Standardization & Encoding

Handling Large Multi-year Datasets

🔧 System Concepts

Modular Python Code

Data Pipelines & Feature Engineering

Version Control (Git/GitHub)

Cloud Concepts (Google Colab + Drive Integration)

📈 Core Metrics & KPIs

Total Stock Days

PLF% (Plant Load Factor)

Renewable Energy Contribution

Energy Deficit (%)

National Installed Capacity Share

📚 Future Enhancements

Build a Power BI dashboard for visual storytelling

Integrate forecasting models (ARIMA / Prophet)

Automate coal shortage alerts

Add state-level energy prediction for seasonality

Deploy the analysis as a web dashboard (Streamlit)
Bay Area All Commute Points (2018 Data)
kaggle.com
zip
Updated Nov 30, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas Nguyen (2022). Bay Area All Commute Points (2018 Data) [Dataset]. https://www.kaggle.com/datasets/thomasnguyen01/bay-area-all-commute-points-2018-data
Explore at:
zip(1694976 bytes)Available download formats
Dataset updated
Nov 30, 2022
Authors
Thomas Nguyen
Area covered
San Francisco Bay Area
Description
Context

The San Francisco Bay Area (nine-county) is one of the largest urban areas in the US by population and GDP. It is home to over 7.5 million people and has a GDP of $995 billion (third highest by GDP output and first highest by GDP per capita). Home to Silicon Valley (a global center for high technology and innovation) and San Francisco (the second largest financial center in the US after New York), the Bay Area contains some of the most profitable industries and sophisticated workforces in the world. This dataset describes where these workers live and commute to work in 2018.

Content

This data file includes all needed information as a means to find out more about the different commute patterns, geographical locations, and necessary metrics to make predictions and draw conclusions.

Inspiration

What can we learn about the different residence and workplace locations? What is the average distance between these locations?

Which counties contain a high/low concentration of residence and workplace locations? Why?

What counties do most of the commuters usually commute to work? What counties do most of the commuters call home?

Are most of these commute patterns county-by-county or within a single county?

Are there any noticeable outliers (e.g., long commute patterns) in this dataset? What counties contain a high concentration of these outliers?

Facebook

Twitter

Click to copy link

Link copied

Cite

Nibedita Sahu (2024). Customer Sales Analysis [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/customer-sales-analysis

Customer Sales Analysis

Enhanced Customer Sales Insights: Outlier Detection, Seasonal Trends, and RFM An

Explore at:

zip(1410791 bytes)Available download formats

Dataset updated

Nov 2, 2024

Authors

Nibedita Sahu

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Developed the Customer Sales Analysis project using Python libraries like Pandas, NumPy, Matplotlib, and Seaborn. This project involves advanced data cleaning, outlier detection, and seasonal trends analysis. Key insights include identifying outliers, analyzing seasonal trends, and performing customer segmentation using RFM analysis. The project features various visualizations to communicate insights effectively, allowing for deeper insights into sales performance and customer behavior.

Clear search

Close search

Google apps

Main menu

Customer Sales Analysis

mumpcepy: A Python implementation of the Method of Uncertainty Minimization...

Metabolomics Data Preprocessing PQN PCA

Engine Ratng Prediction

Data_Sheet_1_ExGUtils: A Python Package for Statistical Analysis With the...

CNN-based Colon Cancer Detection - Python Code

Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

Lipidomics LC-MS analysis support tools for outlier detection

Medical Clean Dataset

Data from: tableone: An open source Python package for producing summary...

Datasets and source code for a pipeline architecture for feature-based...

Bathymetry of the Main Pool of Lake Calumet, Cook County, Illinois, July...

Bank Loan Case Study Dataset

Data from: flic

Socio-demographic and economic characteristics of respondents.

Classicmodels

SKAB - Skoltech Anomaly Benchmark

About SKAB

Datasets

Leaderboard (Scoreboard)

Outlier detection problem

Changepoint detection problem

Notebooks

Citat...

Insurance_claims

Load the Dataset File

INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

Bay Area All Commute Points (2018 Data)

Context

Content

Inspiration

Customer Sales Analysis

Enhanced Customer Sales Insights: Outlier Detection, Seasonal Trends, and RFM An