19 datasets found
  1. Data-Science-Book

    • kaggle.com
    zip
    Updated Aug 20, 2022
    Cite
    Md Waquar Azam (2022). Data-Science-Book [Dataset]. https://www.kaggle.com/datasets/mdwaquarazam/datasciencebook
    Explore at:
    Available download formats: zip (9376 bytes)
    Dataset updated
    Aug 20, 2022
    Authors
    Md Waquar Azam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context: This dataset holds a list of approximately 200+ books on data science and related topics. The list was compiled from Amazon, a popular website that provides book ratings and the other details given below.

    There are 6 columns:

    1. Book_name: the book title

    2. Publisher: name of the publisher or writer

    3. Buyers: number of customers who purchased the book

    4. Cover_type: type of cover used to protect the book

    5. stars: rating out of 5 stars

    6. Price

    Inspiration: I’d like to call on my fellow Kagglers to use machine learning and data science to help explore these ideas:

    • What is the best-selling book?

    • Find any hidden patterns if you can

    • EDA of the dataset
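As a starting point, the "best-selling book" question can be sketched in pandas. The rows below are invented stand-ins; only the six column names come from the description above, and even those may differ slightly in the actual CSV.

```python
import pandas as pd

# Synthetic stand-in for the Kaggle CSV; real rows and exact column names may differ.
df = pd.DataFrame({
    "Book_name": ["Python for Data Analysis", "Hands-On ML", "Data Science from Scratch"],
    "Publisher": ["O'Reilly", "O'Reilly", "O'Reilly"],
    "Buyers": [1200, 3400, 800],
    "Cover_type": ["Paperback", "Paperback", "Hardcover"],
    "stars": [4.6, 4.8, 4.3],
    "Price": [39.99, 54.99, 29.99],
})

# "Best-selling" read as the title with the most buyers.
best_seller = df.loc[df["Buyers"].idxmax(), "Book_name"]
print(best_seller)
```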

  2. IntroDS

    • kaggle.com
    zip
    Updated Sep 7, 2023
    Cite
    Dayche (2023). IntroDS [Dataset]. https://www.kaggle.com/datasets/rouzbeh/introds
    Explore at:
    Available download formats: zip (2564 bytes)
    Dataset updated
    Sep 7, 2023
    Authors
    Dayche
    Description

    A dataset for beginners starting out with the data science process. It contains simple clinical data for problem definition and solving, and supports a range of data science tasks such as classification, clustering, EDA, and statistical analysis.

    The columns in the dataset are:

    • Age: Numerical (age of the patient)
    • Sex: Binary (gender of the patient)
    • BP: Nominal (blood pressure of the patient, with values Low, Normal, and High)
    • Cholesterol: Nominal (cholesterol of the patient, with values Normal and High)
    • Na: Numerical (sodium level of the patient)
    • K: Numerical (potassium level of the patient)
    • Drug: Nominal (type of drug prescribed by the doctor, with values A, B, C, X, and Y)
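A minimal classification sketch on the columns described above, assuming scikit-learn; the rows are invented and the real CSV's value encodings may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic sample with the columns described above (values illustrative only).
df = pd.DataFrame({
    "Age": [23, 47, 51, 34, 60, 29],
    "Sex": ["F", "M", "F", "M", "F", "M"],
    "BP": ["High", "Low", "Normal", "High", "Normal", "Low"],
    "Cholesterol": ["High", "Normal", "High", "Normal", "High", "Normal"],
    "Na": [0.79, 0.73, 0.66, 0.70, 0.68, 0.75],
    "K": [0.03, 0.06, 0.04, 0.05, 0.07, 0.02],
    "Drug": ["Y", "X", "A", "Y", "B", "X"],
})

# Encode the nominal columns, leave numeric ones alone.
X = df.drop(columns="Drug").apply(
    lambda col: LabelEncoder().fit_transform(col) if col.dtype == object else col
)
y = df["Drug"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy on this toy sample
```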

  3. Human Cognitive Performance Analysis

    • kaggle.com
    zip
    Updated Apr 2, 2025
    Cite
    Samharison (2025). Human Cognitive Performance Analysis [Dataset]. https://www.kaggle.com/datasets/samxsam/human-cognitive-performance-analysis
    Explore at:
    Available download formats: zip (1784012 bytes)
    Dataset updated
    Apr 2, 2025
    Authors
    Samharison
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Human Cognitive Performance Analysis: Lifestyle & AI Predictions

    This dataset provides a comprehensive analysis of human cognitive performance based on various lifestyle factors, including sleep duration, stress levels, diet type, screen time, exercise frequency, and caffeine intake. Additionally, it includes a cognitive score computed using a weighted formula and an AI-predicted score, making it suitable for machine learning and AI-based predictive modeling.

    The dataset contains 80,000 samples with diverse demographic attributes, making it an excellent resource for data science, AI, and human behavior analysis.

    The Cognitive Score is calculated using a formula that considers multiple factors:

    • Faster reaction time ⬆️ increases the score
    • Higher memory test scores ⬆️ increase the score
    • More sleep ⬆️ improves cognition
    • Higher stress levels ⬇️ decrease the score
    • More screen time ⬇️ reduces cognitive ability
    • Regular exercise ⬆️ improves cognitive performance
    • Higher caffeine intake ⬇️ negatively affects cognition

    Potential Use Cases

    This dataset is suitable for various machine learning and AI applications:

    Regression Tasks

    • Predict cognitive performance based on lifestyle habits
    • Forecast cognitive decline based on stress and screen time

    Classification Tasks

    • Classify individuals into low, medium, or high cognitive performance
    • Identify people at risk of cognitive decline

    Clustering & Pattern Discovery

    • Group individuals based on lifestyle patterns and cognitive ability
    • Find correlations between diet, exercise, and mental performance

    AI Model Benchmarking

    • Train AI models to predict human cognitive scores
    • Evaluate different deep learning and machine learning models

    Data Exploration Ideas

    Exploratory Data Analysis (EDA)

    • Visualize age vs. cognitive score using scatter plots
    • Analyze the impact of stress levels on cognitive performance
    • Study how sleep duration affects reaction time

    Machine Learning Experiments

    • Train a Random Forest Regressor to predict cognitive scores
    • Use Neural Networks for AI-powered cognitive predictions
    • Apply clustering (K-Means, DBSCAN) to segment users by lifestyle
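One of the experiments listed above, training a Random Forest Regressor to predict cognitive scores, might be sketched like this on synthetic data that mimics the described weighting. The column names and weights here are assumptions for illustration, not the dataset's actual formula.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic lifestyle features mimicking the dataset's columns (names assumed).
sleep = rng.uniform(4, 9, n)
stress = rng.uniform(0, 10, n)
screen = rng.uniform(1, 12, n)
exercise = rng.integers(0, 7, n)
# Toy target echoing the weighting described above: sleep/exercise up, stress/screen down.
score = 50 + 4 * sleep + 2 * exercise - 3 * stress - 1.5 * screen + rng.normal(0, 2, n)

X = np.column_stack([sleep, stress, screen, exercise])
X_tr, X_te, y_tr, y_te = train_test_split(X, score, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(round(model.score(X_te, y_te), 3))  # held-out R^2
```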
  4. Bank Loan Case Study Dataset

    • kaggle.com
    zip
    Updated May 4, 2023
    + more versions
    Cite
    Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
    Explore at:
    Available download formats: zip (117814223 bytes)
    Dataset updated
    May 4, 2023
    Authors
    Shreshth Vashisht
    Description

    This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

    Business Understanding: Loan providers find it hard to lend to people with insufficient or non-existent credit history, and some consumers take advantage of this by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that applicants capable of repaying the loan are not rejected.

    When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:

    • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

    The data given below contains information about the loan application at the time of applying for the loan. It covers two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: the payment was made on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

    • Approved: The company has approved the loan application.
    • Cancelled: The client cancelled the application sometime during approval, either because the client changed his/her mind or, in some cases, because a higher-risk client received worse pricing which he did not want.
    • Refused: The company rejected the loan (because the client does not meet their requirements, etc.).
    • Unused Offer: The loan was cancelled by the client, but at a different stage of the process.

    In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.

    Business Objectives: It aims to identify patterns which indicate if a client has difficulty paying their installments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.

    In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

    To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough).

    Data Understanding: Download the Dataset using the link given under dataset section on the right.

    • application_data.csv contains all the information about the client at the time of application. The data indicates whether a client has payment difficulties.
    • previous_application.csv contains information about the client’s previous loan data, including whether the previous application was Approved, Cancelled, Refused, or Unused Offer.
    • columns_descrption.csv is a data dictionary that describes the meaning of the variables.

    You are required to provide a detailed report for the data above, answering the questions that follow:

    1. Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly.
    2. Identify the missing data and use an appropriate method to deal with it (remove columns or replace values with an appropriate value). Hint: in EDA it is not necessary to replace missing values, but if you have to, clearly state what the approach should be.
    3. Identify if there are outliers in the dataset, and explain why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
    4. Identify if there is data imbalance in the data, and find the ratio of data imbalance. Hint: since there are a lot of columns, you can run your analysis in loops for the appropriate columns and find the insights.
    5. Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
    6. Find the top 10 c...
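The data-imbalance check asked for above can be sketched as follows. The toy frame stands in for application_data.csv, and the TARGET column name is an assumption borrowed from similar loan datasets.

```python
import pandas as pd

# Toy stand-in for application_data.csv; TARGET = 1 means payment difficulties
# (column name assumed, not confirmed by this dataset's schema).
df = pd.DataFrame({"TARGET": [0] * 92 + [1] * 8})

counts = df["TARGET"].value_counts()
imbalance_ratio = counts.loc[0] / counts.loc[1]
print(f"imbalance ratio (on-time : difficulties) = {imbalance_ratio:.1f} : 1")
```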

  5. Black Friday Sales Data

    • kaggle.com
    zip
    Updated Jan 20, 2023
    + more versions
    Cite
    PrepInsta Technologies (2023). Black Friday Sales Data [Dataset]. https://www.kaggle.com/datasets/prepinstaprime/black-friday-sales-data/code
    Explore at:
    Available download formats: zip (5744184 bytes)
    Dataset updated
    Jan 20, 2023
    Authors
    PrepInsta Technologies
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset History

    A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

    Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

    Tasks to perform

    The Purchase column is the target variable; perform univariate analysis and bivariate analysis with respect to Purchase.

    “Masked” in the column description means the values have already been converted from categorical to numerical.

    The points below are given just to get you started with the dataset; it is not mandatory to follow the same sequence.

    DATA PREPROCESSING

    • Check the basic statistics of the dataset

    • Check for missing values in the data

    • Check for unique values in data

    • Perform EDA

    • Purchase Distribution

    • Check for outliers

    • Analysis by gender, marital status, occupation, occupation vs. purchase, purchase by city, purchase by age group, etc.

    • Drop unnecessary fields

    • Convert categorical data into integers using the map function (e.g. the 'Gender' column)

    • Missing value treatment

    • Rename columns

    • Fill nan values

    • Map range variables into integers (e.g. the 'Age' column)

    Data Visualisation

    • visualize individual column
    • Age vs Purchased
    • Occupation vs Purchased
    • Productcategory1 vs Purchased
    • Productcategory2 vs Purchased
    • Productcategory3 vs Purchased
    • City category pie chart
    • check for more possible plots
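The "convert categorical data into integers using map" steps above can be sketched like this. The category labels and codes are illustrative assumptions; the real Black Friday columns may use different values.

```python
import pandas as pd

# Minimal illustration of the mapping step (rows and codes invented).
df = pd.DataFrame({
    "Gender": ["F", "M", "M", "F"],
    "Age": ["0-17", "26-35", "55+", "26-35"],
    "Purchase": [8370, 15200, 7969, 15227],
})

# Categorical -> integer via map, as suggested for the 'Gender' column.
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})

# Range variable -> ordered integer codes, as suggested for the 'Age' column.
age_order = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_order)

print(df.dtypes.to_dict())  # all columns are now numeric
```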

    All the Best!!

  6. Retail Sales Forecasting Using SARIMAX

    • kaggle.com
    zip
    Updated Jun 12, 2025
    Cite
    Muhammad Zamin (2025). Retail Sales Forecasting Using SARIMAX [Dataset]. https://www.kaggle.com/datasets/muhammadzamin1/retail-sales-forecasting-using-sarimax
    Explore at:
    Available download formats: zip (261786 bytes)
    Dataset updated
    Jun 12, 2025
    Authors
    Muhammad Zamin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🔮 Retail Sales Forecasting Using SARIMAX

    A Time Series Forecasting Project from a Data Science Internship

    This project showcases a real-world approach to forecasting monthly retail sales using Python and SARIMAX, completed as part of a 25-day Data Science Internship.

    📌 Objective

    To forecast the next 6 months of retail sales based on 4 years of historical monthly data, with the goal of improving inventory management, marketing strategy, and financial planning.

    📊 Data Overview

    Synthetic sales dataset (Jan 2020 – Dec 2023)

    🔢 Column Descriptions

    • Date: First day of each month (format: YYYY-MM-DD), representing the sales period
    • SalesAmount: Total monthly sales amount (includes trend, seasonality, and promotion effects)
    • Promotion: Binary flag; 1 = promotional campaign active, 0 = no promotion
    • HolidayMonth: Binary flag; 1 = holiday month (e.g., December), 0 = non-holiday month

    Features include:

    SalesAmount, Promotion flag, HolidayMonth flag

    🔍 What You'll Learn

    • How to perform time series EDA (decomposition, ACF/PACF, stationarity tests)
    • Feature engineering for time series (lags, rolling means, exogenous variables)
    • Training and tuning SARIMAX models
    • Making 6-month forecasts with confidence intervals
    • Translating insights into business recommendations

    📈 Model Performance

    Model Used: SARIMAX(1,1,1)(0,1,1,12)

    Validation Metrics:

    MAE: 814.37
    RMSE: 1012.38

    Promotion events were shown to significantly increase sales.

    📅 6-Month Forecast Preview

    Month: Forecast
    Jan 2024: 15,387
    Feb 2024: 18,653
    Mar 2024: 14,954
    Apr 2024: 13,468
    May 2024: 11,059
    Jun 2024: 10,514

    📁 What's Inside

    • Full EDA and modeling steps with markdown explanations
    • Forecasting charts and insights
    • Clean, beginner-friendly code structure

    ✅ Ideal For

    • Data science learners looking for an internship project idea
    • Anyone learning time series forecasting
    • Retail businesses wanting a forecasting template

    📌 Want to explore the full code and report? 🔗 Also available on GitHub: https://github.com/muhammad-zamin/retail-sales-forecasting

  7. Raw Jobs Data from Head Hunters Russia

    • kaggle.com
    zip
    Updated Jul 5, 2023
    Cite
    Etietop Abraham (2023). Raw Jobs Data from Head Hunters Russia [Dataset]. https://www.kaggle.com/datasets/etietopabraham/jobs-raw-data
    Explore at:
    Available download formats: zip (422020638 bytes)
    Dataset updated
    Jul 5, 2023
    Authors
    Etietop Abraham
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    🔥 This dataset provides a comprehensive snapshot of Russia's job market, derived from over 560,000 job listings. The diverse set of attributes, which include job titles, salaries, job types, descriptions, key skills required, and more, offers an extensive overview of the employment landscape in Russia.

    Source:

    The dataset is sourced from hh.ru, a prominent online employment portal in Russia. Leveraging their API, the data was meticulously gathered and compiled to create this rich repository of job market insights.

    Inspiration:

    The primary motivation behind creating and sharing this dataset was to build a job recommendation model utilizing graph-based models. With a significant portion of the data in Russian, it poses a fascinating challenge in data preprocessing and feature engineering. Some potential new features could be extracted from existing ones, such as min/max experience, min/max salary, job type split, and others.

    Moreover, the complexity and richness of the dataset make it a suitable and intriguing field for the data science community to explore and analyze. It's not just about the translation of the data, but also about understanding the trends, identifying patterns, and even predicting future trajectories in Russia's job market. The dataset could lead to an array of innovative applications, models, and analyses.

    In sharing this dataset, the hope is to inspire the Kaggle community to bring their diverse skills to bear in exploring this unique data, unveiling new insights, and building transformative models. The results can then be used to advance the field and build better job recommendation systems for diverse and multilingual job markets.

  8. University Salaries

    • kaggle.com
    zip
    Updated Mar 13, 2021
    Cite
    Tyson (2021). University Salaries [Dataset]. https://www.kaggle.com/tysonpo/university-salaries
    Explore at:
    Available download formats: zip (525393 bytes)
    Dataset updated
    Mar 13, 2021
    Authors
    Tyson
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This data contains salaries of University of Vermont (UVM) faculty from 2009 to 2021. We present two datasets. The second dataset is richer because it contains information on faculty departments/colleges; however, it contains fewer rows due to how we chose to join the data.

    1. salaries_without_dept.csv contains all of the data we extracted from the PDFs. The four columns are: Year, Faculty Name, Primary Job Title, and Base Pay. There are 47,479 rows.
    2. salaries_final.csv contains the same columns as [1], but also joins with data about the faculty's "Department" and "College" (for a total of six columns). There are only 14,470 rows in this dataset because we removed rows for which we could not identify the Department/College of the faculty.

    Data collection

    All data is publicly available on the University of Vermont website. I downloaded all PDFs from https://www.uvm.edu/oir/faculty-and-staff. Then I used a Python package (Camelot) to parse the tabular PDFs and used regex matching to ensure data was correctly parsed. I performed some initial cleaning (removed dollar signs from monetary values, etc.). At this stage, I saved the data to salaries_without_dept.csv.

    I also wanted to know what department and college each faculty belonged to. I used http://catalogue.uvm.edu/undergraduate/faculty/fulltime (plus Python's lxml package to parse the HTML) to determine "Department" and then manually built an encoding to map "Department" to "College". Note that this link provides faculty information for 2020, thus after joining we end up only with faculty that are still employed as of 2020 (this should be taken into consideration). Secondly, this link does not include UVM administration (and possibly some other personnel) so they are not present in this dataset. Thirdly, there were several different ways names were reported (sometimes even the same person has their name reported differently in different years). We tried joining first on LastName+FirstName and then on LastName+FirstInitial but did not bother using middle name. To handle ambiguity, we removed duplicates (e.g. we removed Martin, Jacob and Martin, Jacob William as they were not distinguishable by our criteria). The joined data is available in salaries_final.csv.

    Note: perhaps "College" was not the best naming, since faculty of UVM Libraries and other miscellaneous fields are included.

    Data dictionary

    The column definitions are self-explanatory, but the "College" abbreviation meanings are unclear to a non-UVM-affiliate. We've included data_dictionary.csv to explain what each "College" abbreviation means. You can use this dictionary to filter out miscellaneous "colleges" (e.g. UVM Libraries) and only include colleges within the undergraduate program (e.g. filter out College of Medicine).

    Uses

    Despite there only being a few (six) columns, I think this is quite a rich dataset that could be paired with other UVM data or combined with data from other universities. This dataset is mainly for data analytics and exploratory data analysis (EDA), but perhaps it could also be used for forecasting (however, there are only 12 time values, so you'd probably want to make use of "College" or "Primary Job Title"). Interesting EDA questions could be:

    1. "Are the faculty in arts & humanities departments being paid less?" This news article -- UVM to eliminate 23 programs in the College of Arts and Sciences -- suggests so. Give a quantitative answer.
    2. "Are lecturers declining in quantity and pay?" This news article -- ‘I’m going to miss this:’ Three cut lecturers reflect on time at UVM -- suggests so. Give a quantitative answer.
    3. "How does the College of Medicine compare to the undergraduate colleges in terms of number of faculty and pay?" See data_dictionary.csv for which colleges are in the undergraduate program.
    4. "How long does it take for a faculty member to become a full professor?" Yes, this is also answerable from the data, because Primary Job Title updates when a faculty member is promoted.
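Aggregations of the kind these questions call for reduce to a pandas groupby. A minimal sketch on invented rows shaped like salaries_final.csv (names and pay values are made up):

```python
import pandas as pd

# Toy rows shaped like salaries_final.csv (all values invented).
df = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021],
    "Faculty Name": ["Doe, Jane", "Smith, Al", "Lee, Bo", "Doe, Jane"],
    "Primary Job Title": ["Professor", "Lecturer", "Professor", "Professor"],
    "Base Pay": [120_000, 60_000, 95_000, 123_000],
    "College": ["CAS", "CAS", "COM", "CAS"],
})

# e.g. question 3: compare colleges by median Base Pay per year.
pay = df.groupby(["College", "Year"])["Base Pay"].median()
print(pay)
```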

    Future updates

    I do not plan to maintain this dataset. If I get the chance, I may update it with future year salaries.

  9. Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05

    • kaggle.com
    zip
    Updated Apr 14, 2019
    Cite
    Matt Yates (2019). Jigsaw Bias Toxicity EDA NLP aug16 alpha0.05 [Dataset]. https://www.kaggle.com/datasets/yeayates21/jigsaw-bias-toxicity-eda-nlp-aug16-alpha005
    Explore at:
    Available download formats: zip (1593446024 bytes)
    Dataset updated
    Apr 14, 2019
    Authors
    Matt Yates
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Text Augmentation on Jigsaw Unintended Bias in Toxicity Classification competition training data using EDA_NLP.

    Context

    Code from https://github.com/jasonwei20/eda_nlp was run on the training dataset for the Jigsaw Unintended Bias in Toxicity Classification competition to create an augmented training dataset. The number of augmentations was set to 16 and the alpha value was set to 0.05.
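For intuition, one of EDA's four operations (random swap) can be sketched in plain Python. This is an illustrative toy, not the eda_nlp code itself; the real repo also performs synonym replacement, random insertion, and random deletion.

```python
import random

def random_swap(words, n_swaps):
    """Swap two randomly chosen word positions, n_swaps times (EDA-style toy)."""
    words = words.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

random.seed(0)
sentence = "this comment is not toxic at all".split()
alpha, num_aug = 0.05, 16           # the settings reported above
n_swaps = max(1, int(alpha * len(sentence)))  # alpha scales edits to sentence length
augmented = [" ".join(random_swap(sentence, n_swaps)) for _ in range(num_aug)]
print(len(augmented), "augmented variants")
```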

    Content

    train_augmented1605.zip - augmented training dataset for Jigsaw Unintended Bias in Toxicity Classification competition.

    Acknowledgements

    Code provided by: https://github.com/jasonwei20/eda_nlp

    Code for the paper: Easy data augmentation techniques for boosting performance on text classification tasks. https://arxiv.org/abs/1901.11196

    Special thanks to ErvTong / @papasmurfff for sharing the eda_nlp repo with me. https://www.kaggle.com/papasmurfff

    Inspiration

    https://mlwhiz.com/blog/2019/02/19/siver_medal_kaggle_learnings/

    The above article talks about how the 1st place competitors for the Quora Insincere Question competition stated they:

    "We do not pad sequences to the same length based on the whole data, but just on a batch level. That means we conduct padding and truncation on the data generator level for each batch separately, so that length of the sentences in a batch can vary in size. Additionally, we further improved this by not truncating based on the length of the longest sequence in the batch but based on the 95% percentile of lengths within the sequence. This improved runtime heavily and kept accuracy quite robust on single model level, and improved it by being able to average more models."

    This got @papasmurfff and I thinking about text augmentation and from there @papasmurfff found the eda_nlp repo.

  10. Toy Dataset

    • kaggle.com
    zip
    Updated Dec 10, 2018
    Cite
    Carlo Lepelaars (2018). Toy Dataset [Dataset]. https://www.kaggle.com/datasets/carlolepelaars/toy-dataset
    Explore at:
    Available download formats: zip (1184308 bytes)
    Dataset updated
    Dec 10, 2018
    Authors
    Carlo Lepelaars
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.

    This toy dataset features 150000 rows and 6 columns.

    Columns

    Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.

    Number: A simple index number for each row

    City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)

    Gender: Gender of a person (Male or Female)

    Age: The age of a person (Ranging from 25 to 65 years)

    Income: Annual income of a person (Ranging from -674 to 177175)

    Illness: Is the person Ill? (Yes or No)
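A quick EDA sketch over the columns listed above; the rows are invented stand-ins, not values from the actual file.

```python
import pandas as pd

# Tiny frame with the columns listed above (rows invented).
df = pd.DataFrame({
    "Number": [1, 2, 3, 4],
    "City": ["Dallas", "Boston", "Dallas", "Austin"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [25, 41, 33, 65],
    "Income": [40_000, 72_000, 51_000, 38_000],
    "Illness": ["No", "No", "Yes", "No"],
})

# Example EDA question: mean income by city.
print(df.groupby("City")["Income"].mean())
```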

    Acknowledgements

    Stock photo by Mika Baumeister on Unsplash.

  11. 160k Spotify songs from 1921 to 2020 (Sorted)

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    FCPercival (2022). 160k Spotify songs from 1921 to 2020 (Sorted) [Dataset]. https://www.kaggle.com/datasets/fcpercival/160k-spotify-songs-sorted
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    FCPercival
    Description

    This is an analysis of the data on Spotify tracks from 1921-2020 with Jupyter Notebook and Python Data Science tools.

    About the Dataset

    The Spotify dataset (titled data.csv) consists of 160,000+ tracks sorted by name, from 1921-2020, found on Spotify as of June 2020. Collected by Kaggle user and Turkish data scientist Yamaç Eren Ay, the data was retrieved and tabulated from the Spotify Web API. Each row in the dataset corresponds to a track, with variables such as the title, artist, and year located in their respective columns. Aside from the fundamental variables, musical elements of each track, such as the tempo, danceability, and key, were likewise extracted; these values were generated by Spotify's algorithm based on a range of technical parameters.

    Exploratory Data Analysis (EDA)

    1. Studying the correlations between the variables in the Spotify data.
    2. The evolution of different musical elements through the years.
    3. The divide between explicit and non-explicit songs through the years.

    Further Investigation and Inference (FII)

    1. Determining if there is a significant difference in popularity between explicit and non-explicit songs.
    2. Finding the most frequent emotions in Spotify tracks and analyzing their musical elements based on the track's mode and key.
    3. Determining the classifications of the Spotify tracks through K-Means Clustering.
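The K-Means step in item 3 might look like this on synthetic stand-in audio features; the feature choice and k=4 are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for Spotify audio features (danceability, energy, tempo, valence).
features = np.column_stack([
    rng.uniform(0, 1, 300),    # danceability
    rng.uniform(0, 1, 300),    # energy
    rng.uniform(60, 200, 300), # tempo (BPM), on a very different scale
    rng.uniform(0, 1, 300),    # valence
])

# Scale first so tempo's larger range doesn't dominate the distance metric.
X = StandardScaler().fit_transform(features)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))  # tracks per cluster
```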

    Project Directory Guide

    1. Spotify Data.ipynb is the main notebook where the data is imported for EDA and FII.
    2. data.csv is the dataset downloaded from Kaggle.
    3. spotify_eda.html is the HTML file for the comprehensive EDA done using the Pandas Profiling module.

    Project Notes

    1. This is in partial fulfillment of the course Statistical Modelling and Simulation (CSMODEL).

    Credits to gabminamedez for the original dataset.

  12. Machine Learning 😃 ❤️😃

    • kaggle.com
    zip
    Updated Mar 8, 2022
    Cite
    Qusay AL-Btoush (2022). Machine Learning 😃 ❤️😃 [Dataset]. https://www.kaggle.com/qusaybtoush1990/machine-learning
    Explore at:
    Available download formats: zip (533584 bytes)
    Dataset updated
    Mar 8, 2022
    Authors
    Qusay AL-Btoush
    Description

    Machine Learning 😃 ❤️😃

    Predicting Bank Loan Defaults 🙄 😃🙄 ❤️😃🙄 😃

    DESCRIPTION❤️❤️

    A data science approach to predict and understand the applicant’s profile to minimize the risk of future loan defaults.

    About the project

    The dataset contains information about credit applicants. Banks, globally, use this kind of dataset and type of informative data to create models to help in deciding on who to accept/refuse for a loan. After all the exploratory data analysis, cleansing and dealing with all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed to be learned by machine learning models.

    • Machine Learning issue and objectives: We’re dealing with a supervised binary classification problem. The goal is to train the best machine learning model to maximize predictive capability, deeply understanding past customers’ profiles to minimize the risk of future loan defaults.

    • Performance Metric: The metric used for model evaluation is ROC AUC, given that we’re dealing with highly unbalanced data.

    • Project structure: The project is divided into three parts: EDA (exploratory data analysis), Data Wrangling (cleansing and feature selection), and Machine Learning (predictive modelling).

    • The dataset: You can download the data set here.

    Feature description:

    • id: Unique ID of the loan application.

    • grade: LC assigned loan grade.

    • annual_inc: The self-reported annual income provided by the borrower during registration.

    • short_emp: 1 when employed for 1 year or less.

    • emp_length_num: Employment length in years. Possible values are - between 0 and 10 where 0 means less than one year and 10 means ten or more years.

    • home_ownership: Type of home ownership.

    • dti (Debt-To-Income Ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.

    • purpose: A category provided by the borrower for the loan request.

    • term: The number of payments on the loan. Values are in months and can be either 36 or 60.

    • last_delinq_none: 1 when the borrower had at least one event of delinquency.

    • last_major_derog_none: 1 when the borrower had at least 90 days of a bad rating.

    • revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.

    • total_rec_late_fee: Late fees received to date.

    • od_ratio: Overdraft ratio.

    • bad_loan: 1 when a loan was not paid.
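    Since ROC AUC is the chosen metric, it may help to see what it actually computes: the probability that a randomly chosen bad loan receives a higher score than a randomly chosen repaid one. A minimal rank-based sketch with made-up labels and scores (not the actual dataset; in practice you would likely call `sklearn.metrics.roc_auc_score`):

    ```python
    def roc_auc(labels, scores):
        """Probability that a random positive outranks a random negative.

        Equivalent to the area under the ROC curve (ties ignored for brevity).
        """
        ranked = sorted(zip(scores, labels))  # ascending by score
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        # sum of 1-based ranks of the positive (bad_loan = 1) examples
        pos_rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
        return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    # toy example: two repaid loans, two defaults
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
    ```

    A score of 0.5 means the model ranks defaults no better than chance; 1.0 means every default outranks every repaid loan.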

    Note 😃 This data is for practicing data analysis 🤝🎉

    Please appreciate the effort with an upvote 👍 😃😃

    Thank You ❤️❤️❤️

  13. Ecommerce Customer Dataset

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Mirza Yasir Abdullah Baig (2025). Ecommerce Customer Dataset [Dataset]. https://www.kaggle.com/datasets/mirzayasirabdullah07/ecommerce-customer-dataset/data
    Explore at:
    zip(2707 bytes)Available download formats
    Dataset updated
    Aug 21, 2025
    Authors
    Mirza Yasir Abdullah Baig
    Description

    The Ecommerce Customer Dataset contains customer-related information from an online retail platform. It is often used for data analysis, customer segmentation, predictive modeling, and business intelligence tasks.

    The dataset provides key details about customers, their demographics, and their purchasing behavior, which can help businesses understand their audience better and optimize decision-making.

    📂 Features

    Typical columns in an ecommerce dataset may include (depending on the version of the dataset you have):

    CustomerID – Unique identifier for each customer

    Gender – Male/Female/Other

    Age – Customer’s age

    Annual Income – Customer’s yearly income

    Spending Score – A score assigned based on customer spending behavior

    Purchase History – Past transactions or order details

    Product Categories – Types of products bought

    Date of Purchase – Timestamp of transactions

    🎯 Use Cases

    This dataset is widely used for:

    Customer Segmentation (e.g., using K-Means clustering or RFM analysis)

    Recommendation Systems (suggesting products to customers)

    Churn Prediction (identifying customers likely to stop buying)

    Sales Forecasting (predicting future purchases or revenue trends)

    Marketing Analytics (targeting specific customer groups with campaigns)
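    For the customer-segmentation use case above, a minimal K-Means sketch on toy income/spending values (the numbers are made up, not from the dataset; a real project would likely use `sklearn.cluster.KMeans`):

    ```python
    import numpy as np

    # hypothetical customers: [annual_income_k, spending_score]
    X = np.array([[15, 80], [16, 75], [17, 82],    # low income, high spenders
                  [90, 20], [95, 15], [88, 18]],   # high income, low spenders
                 dtype=float)

    def kmeans(X, k, iters=10, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            # assign each customer to the nearest center
            labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=-1), axis=1)
            # move each center to the mean of its assigned customers
            centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                                else centers[j] for j in range(k)])
        return labels, centers

    labels, centers = kmeans(X, k=2)
    print(labels)  # the two spending profiles fall into two clusters
    ```

    The resulting cluster labels are what a marketing team would then profile (e.g. "low income, high spend" vs "high income, low spend").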

    🛠️ Suitable For

    Data Science and Machine Learning projects

    Beginner to intermediate learners exploring EDA, clustering, regression, or classification

    Business analysts focusing on customer insights and ecommerce growth

  14. Stock Market: Historical Data of Top 10 Companies

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    Khushi Pitroda (2023). Stock Market: Historical Data of Top 10 Companies [Dataset]. https://www.kaggle.com/datasets/khushipitroda/stock-market-historical-data-of-top-10-companies
    Explore at:
    zip(486977 bytes)Available download formats
    Dataset updated
    Jul 18, 2023
    Authors
    Khushi Pitroda
    Description

    The dataset contains a total of 25,161 rows, each row representing the stock market data for a specific company on a given date. The information collected through web scraping from www.nasdaq.com includes the stock prices and trading volumes for the companies listed, such as Apple, Starbucks, Microsoft, Cisco Systems, Qualcomm, Meta, Amazon.com, Tesla, Advanced Micro Devices, and Netflix.

    Data Analysis Tasks:

    1) Exploratory Data Analysis (EDA): Analyze the distribution of stock prices and volumes for each company over time. Visualize trends, seasonality, and patterns in the stock market data using line charts, bar plots, and heatmaps.

    2) Correlation Analysis: Investigate the correlations between the closing prices of different companies to identify potential relationships. Calculate correlation coefficients and visualize correlation matrices.

    3) Top Performers Identification: Identify the top-performing companies based on their stock price growth and trading volumes over a specific time period.

    4) Market Sentiment Analysis: Perform sentiment analysis using Natural Language Processing (NLP) techniques on news headlines related to each company. Determine whether positive or negative news impacts the stock prices and volumes.

    5) Volatility Analysis: Calculate the volatility of each company's stock prices using metrics like Standard Deviation or Bollinger Bands. Analyze how volatile stocks are in comparison to others.
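    For the volatility task, a rolling standard deviation of daily returns is a common starting point. A sketch with made-up closing prices (the real dataset's column names may differ):

    ```python
    import pandas as pd

    # hypothetical closing prices for a single ticker
    close = pd.Series([100, 102, 101, 105, 107, 104, 110, 108], name="close")

    returns = close.pct_change()                  # day-over-day returns
    volatility = returns.rolling(window=3).std()  # 3-day rolling volatility
    print(volatility.round(4))
    ```

    Bollinger Bands follow the same idea: a rolling mean of the price plus/minus a multiple of this rolling standard deviation.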

    Machine Learning Tasks:

    1) Stock Price Prediction: Use time-series forecasting models like ARIMA, SARIMA, or Prophet to predict future stock prices for a particular company. Evaluate the models' performance using metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).

    2) Classification of Stock Movements: Create a binary classification model to predict whether a stock will rise or fall on the next trading day. Utilize features like historical price changes, volumes, and technical indicators for the predictions. Implement classifiers such as Logistic Regression, Random Forest, or Support Vector Machines (SVM).

    3) Clustering Analysis: Cluster companies based on their historical stock performance using unsupervised learning algorithms like K-means clustering. Explore if companies with similar stock price patterns belong to specific industry sectors.

    4) Anomaly Detection: Detect anomalies in stock prices or trading volumes that deviate significantly from the historical trends. Use techniques like Isolation Forest or One-Class SVM for anomaly detection.

    5) Reinforcement Learning for Portfolio Optimization: Formulate the stock market data as a reinforcement learning problem to optimize a portfolio's performance. Apply algorithms like Q-Learning or Deep Q-Networks (DQN) to learn the optimal trading strategy.
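    For the up/down classification task, most of the preparatory work is building the label and features from the raw prices. A sketch with made-up prices (the column names are assumptions, not the dataset's actual headers):

    ```python
    import pandas as pd

    df = pd.DataFrame({"close":  [100, 102, 101, 105, 107, 104],
                       "volume": [ 10,  12,   9,  14,  11,  13]})

    # label: 1 if tomorrow's close is higher than today's, else 0
    df["target"] = (df["close"].shift(-1) > df["close"]).astype(int)
    df["ret_1d"] = df["close"].pct_change()   # simple momentum feature

    # drop the first row (no return) and the last row (no tomorrow to label)
    df = df.dropna().iloc[:-1]
    print(df[["ret_1d", "target"]])
    ```

    From here any classifier (Logistic Regression, Random Forest, SVM) can be fit on the feature columns, ideally with a time-ordered train/test split rather than a random one.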

    The dataset provided on Kaggle, titled "Stock Market Stars: Historical Data of Top 10 Companies," is intended for learning purposes only. The data has been gathered from public sources, specifically from web scraping www.nasdaq.com, and is presented in good faith to facilitate educational and research endeavors related to stock market analysis and data science.

    It is essential to acknowledge that while we have taken reasonable measures to ensure the accuracy and reliability of the data, we do not guarantee its completeness or correctness. The information provided in this dataset may contain errors, inaccuracies, or omissions. Users are advised to use this dataset at their own risk and are responsible for verifying the data's integrity for their specific applications.

    This dataset is not intended for any commercial or legal use, and any reliance on the data for financial or investment decisions is not recommended. We disclaim any responsibility or liability for any damages, losses, or consequences arising from the use of this dataset.

    By accessing and utilizing this dataset on Kaggle, you agree to abide by these terms and conditions and understand that it is solely intended for educational and research purposes.

    Please note that the dataset's contents, including the stock market data and company names, are subject to copyright and other proprietary rights of the respective sources. Users are advised to adhere to all applicable laws and regulations related to data usage, intellectual property, and any other relevant legal obligations.

    In summary, this dataset is provided "as is" for learning purposes, without any warranties or guarantees, and users should exercise due diligence and judgment when using the data for any purpose.

  15. Kaggle Survey Analytics Helper : Analysis a Breeze

    • kaggle.com
    zip
    Updated Nov 25, 2022
    Cite
    Pranav941 (2022). Kaggle Survey Analytics Helper : Analysis a Breeze [Dataset]. https://www.kaggle.com/datasets/pranav941/kaggle-survey-analytics-helper
    Explore at:
    zip(2751966 bytes)Available download formats
    Dataset updated
    Nov 25, 2022
    Authors
    Pranav941
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    When I first came across the competition a few weeks ago, I knew I just had to participate. I quickly started going through the data provided, and man oh man... was it not large? Most of us haven't dealt with this type of information before, at least not if you're a student like me rather than a working professional.

    I went over the EDA provided by Paul_Mooney and noticed its way of slicing the dataframe and fetching values from it. I wanted a simpler solution that would be easily understood by many.

    I reviewed all previous datasets from 2018 to 2021 and found there are common questions, plus a few added over the years. We will call these questions the "Look up Questions"

    I manually made an Excel sheet, aka the "Look-up Table", listing these questions row-by-row for all 5 years. Most importantly, I added their question tag (Q1, Q3, Q26_A, Q33_B, etc.) for every year.

    Now what we have is:

    A. The Look-up Table
    B. Unique questions listed row-by-row
    C. For every question, its column name for every year

    In more detail:

    A. All questions asked in a single list

    https://imgur.com/fddPb94.jpg

    B. Column ID/Index of that question for every year

    Note: A blank space / empty field means that particular question was not asked in that specific year.

    https://imgur.com/3BQLZUS.jpg

    C. Final Gem, The Look Up Table

    https://imgur.com/aQrumcx.jpg

    But why would you need this type of solution over the one in the Kaggle EDA?

    A. Quick referencing: spend more time analyzing, and less on fiddling.
    B. With a few custom functions (added below), a single line of code will get you any sort of data, filtered and categorized based on ANY other column.
    C. Works with previous years as well as future Kaggle survey analytics (given that the question format doesn't change; it hasn't changed for the past 5 years).
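    The author's actual helper functions live in the demo notebook; purely as an illustration of the idea (the question texts, tags, and the function name here are made up), a look-up could be as simple as:

    ```python
    import pandas as pd

    # hypothetical slice of the Look-up Table: one row per question,
    # one column of question tags per survey year (blank = not asked)
    lookup = pd.DataFrame({
        "question": ["What is your age?", "What is your current role?"],
        "2021": ["Q1", "Q5"],
        "2022": ["Q2", "Q23"],
    })

    def column_for(question, year):
        """Return the tag (column name) a question had in a given year's survey."""
        row = lookup.loc[lookup["question"] == question, str(year)]
        if row.empty or not row.iloc[0]:
            return None  # question missing, or not asked that year
        return row.iloc[0]

    print(column_for("What is your age?", 2022))  # → Q2
    ```

    With the tag in hand, a single line like `survey_2022[column_for(q, 2022)]` pulls the right column from that year's raw survey file.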

    How to use these 1 Line functions?

    Here's a demo notebook -> https://www.kaggle.com/code/pranav941/kaggle-analytics-helper-functions-2017-2022

  16. Risk Analytics in Banking

    • kaggle.com
    zip
    Updated Mar 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Saba (2023). Risk Analytics in Banking [Dataset]. https://www.kaggle.com/sabarostami/risk-analytics-in-banking
    Explore at:
    zip(117814223 bytes)Available download formats
    Dataset updated
    Mar 13, 2023
    Authors
    Saba
    Description

    Business Understanding

    Loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history. Because of that, some consumers use this to their advantage by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data, to help ensure that applicants who are capable of repaying their loans are not rejected.

    When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risk are associated with the company’s decision:

    • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.

    The data given below contains information about the loan applications at the time of applying. It covers two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: the payment was made on time.

    When a client applies for a loan, there are four types of decisions that could be taken by the client/company:

    • Approved: The company has approved the loan application.
    • Cancelled: The client cancelled the application sometime during approval, either because the client changed her/his mind about the loan or, in some cases, because a higher-risk client received worse pricing which he did not want.
    • Refused: The company rejected the loan (because the client does not meet their requirements, etc.).
    • Unused offer: The loan was cancelled by the client, but at different stages of the process.

    Business Objectives

    The case study aims to identify patterns which indicate whether a client will have difficulty paying their instalments, which may be used for taking actions such as denying the loan, reducing the amount of the loan, or lending to riskier applicants at a higher interest rate. This will ensure that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study. In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e., the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.

    Data Understanding

    1. application_data.csv: Contains all the information about the client at the time of application, including whether the client has payment difficulties.
    2. previous_application.csv: Contains information about the client’s previous loan applications, including whether each previous application was Approved, Cancelled, Refused, or an Unused offer.
    3. columns_description.csv: A data dictionary which describes the meaning of the variables.

    The solution is split across 2 different ipynb files. The first file contains a detailed analysis (EDA) of application_data to identify the important features which help us identify defaulters. The second file works on the records of application_data and previous_application inner-joined on the same SK_ID_CURR.
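    The inner join described above can be sketched with pandas (toy rows; SK_ID_CURR is the dataset's real key column, the other column names here are illustrative):

    ```python
    import pandas as pd

    application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                                "TARGET":     [0, 1, 0]})
    previous = pd.DataFrame({"SK_ID_CURR": [1, 1, 3],
                             "STATUS":     ["Approved", "Refused", "Canceled"]})

    # inner join keeps only clients present in both files;
    # a client with several previous loans appears once per previous loan
    merged = application.merge(previous, on="SK_ID_CURR", how="inner")
    print(merged)
    ```

    Note that the join duplicates application rows for clients with multiple previous loans, and silently drops clients with no previous application (client 2 above), which matters when aggregating.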

  17. NASA Mars Rover

    • kaggle.com
    zip
    Updated Oct 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kush Tripathi (2023). NASA Mars Rover [Dataset]. https://www.kaggle.com/datasets/kushtripathi/nasa-mars-rover-captured-images-and-its-details
    Explore at:
    zip(101585155 bytes)Available download formats
    Dataset updated
    Oct 8, 2023
    Authors
    Kush Tripathi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Title: Exploring Mars: A Comprehensive Dataset of Rover Photos and Metadata

    Description

    This dataset provides an extensive collection of Mars rover images paired with in-depth metadata. Sourced from various Mars missions, this dataset is a treasure trove for anyone interested in space exploration, planetary science, or computer vision.

    Components:

    • Photos: A curated set of high-definition images taken by different cameras onboard Mars rovers. These images capture a variety of terrains, weather conditions, and other Martian phenomena.
    • Details: A detailed CSV file accompanies these images, containing rich metadata like the type of camera used, the corresponding Martian sol, Earth date, and the rover responsible for each image.

    Dataset Origin

    The dataset was compiled from various Mars missions conducted over the years. Special care has been taken to include a diverse set of images to enable a wide range of analyses and applications.

    Objective

    As a learner delving into the field of Computer Vision, my objectives for this project are multi-fold:

    • Data Analysis: To perform exploratory data analysis (EDA) to understand the distribution of images based on attributes like camera type, date, and rover.
    • Color Analysis: To identify and visualize dominant colors across different sets of images. This could provide insights into Martian geology.
    • Texture and Pattern Recognition: To classify Martian terrains using texture and pattern recognition techniques.
    • Machine Learning: To potentially develop a predictive model that could classify images into predefined categories based on their features.
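    For the color-analysis objective, one simple way to find dominant colors without any ML library is to quantize RGB values into coarse bins and count them. The tiny array below stands in for a real rover photo, which would be loaded with OpenCV or PIL:

    ```python
    import numpy as np
    from collections import Counter

    # toy 2x3 RGB "image": four rusty-red pixels and two gray ones
    img = np.array([[[200, 60, 30], [205, 58, 28], [198, 62, 25]],
                    [[ 90, 90, 90], [202, 61, 20], [ 88, 92, 89]]], dtype=np.uint8)

    # quantize each channel into 32-unit bins so near-identical shades group together
    bins = (img // 32 * 32).reshape(-1, 3)
    counts = Counter(map(tuple, bins))
    dominant = counts.most_common(1)[0][0]
    print(dominant)  # the rusty-red bin wins, 4 pixels to 2
    ```

    K-means over the raw pixel values is the usual refinement of this idea when smoother palettes are needed.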

    Research Questions

    1. Which camera types have contributed the most to the dataset?
    2. What can the dominant colors in the images tell us about Mars?
    3. Can we classify Martian terrains into categories like rocky, sandy, and icy?
    4. Is there a correlation between the type of terrain and other variables like camera type or date?

    Tools and Technologies

    I plan to utilize Python for this project, particularly libraries like OpenCV for image processing, Pandas for data manipulation, and Matplotlib/Seaborn for data visualization. For machine learning tasks, I will likely use scikit-learn or TensorFlow.

    Learning and Development

    This project serves as both a learning exercise and a stepping stone toward more complex computer vision projects. I aim to document my learning journey, challenges, and milestones in a series of Kaggle notebooks.

    Collaboration and Feedback

    I warmly invite the Kaggle community to offer suggestions, critiques, or even collaborate on this venture. Your insights could be invaluable in enhancing the depth and breadth of this project.

  18. Amazon Cell Phones

    • kaggle.com
    zip
    Updated May 21, 2025
    Cite
    Michael Matta (2025). Amazon Cell Phones [Dataset]. https://www.kaggle.com/datasets/michaelmatta0/amazon-cell-phones-cleaned-scraped-data/code
    Explore at:
    zip(4486223 bytes)Available download formats
    Dataset updated
    May 21, 2025
    Authors
    Michael Matta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🆕 Update History

    • v2 (May 2025): Uploaded amazon_cell_phones_updated.csv — this new version includes updated features and additional data fields, while preserving the core structure and main features of the original dataset.
    • v1 (September 2024): Original upload of amazon_cell_phones_original_old.csv. The current dataset description refers to this version.

    📌 Note: The description below is still based on the original dataset (amazon_cell_phones_original_old.csv). For the latest structure and feature details, please refer directly to amazon_cell_phones_updated.csv.

    Various details about cell phones listed on Amazon USA, such as product name, price, rating, number of ratings, and technical specifications like RAM, storage, screen size, and more.

    Dataset Overview

    This dataset contains detailed information about cellphones listed on Amazon, scraped using Selenium and BeautifulSoup. It includes product details such as the name, price, ratings, specifications (RAM, storage, screen size, etc.), and additional metadata like the number of ratings and discount percentage. The dataset was designed to provide insights into cellphone features, pricing trends, and customer feedback on one of the world's largest e-commerce platforms.

    Source

    The data was scraped from Amazon's cellphone category pages over multiple pages (up to 250 pages). Given Amazon's structure, the dataset includes a wide variety of cellphone brands and models, including older and newer releases.

    Data Fields

    • ID: Unique identifier for each product.
    • Product Name: The name of the cellphone.
    • Product Link: URL link to the Amazon product page.
    • Image Link: URL link to the product image on Amazon.
    • Price (Dollar): The price of the cellphone in USD.
    • Discount Percentage: Discount percentage, if applicable, calculated as the difference between the original price and the current price.
    • Price Before Discount: Original price of the cellphone, if available.
    • Rating (out of 5): Customer rating, extracted from the product page.
    • Number of Ratings: The total number of customer ratings for the product.
    • Brand: Brand of the cellphone.
    • Operating System: The operating system of the cellphone (e.g., Android, iOS).
    • RAM (GB): Amount of RAM in GB.
    • Storage (GB): Internal storage capacity in GB.
    • Screen Size (Inches): Screen size in inches.
    • Cellular Technology: Cellular technology (e.g., 4G, 5G).
    • CPU: CPU Speed.
    • CPU Model: CPU model used in the cellphone.
    • Available Colors: Available colors for the cellphone model.
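    As a sanity check of the Discount Percentage field, it can be recomputed from the two price columns. A sketch with toy prices (the short column names here are stand-ins for the dataset's actual headers):

    ```python
    import pandas as pd

    phones = pd.DataFrame({
        "price":        [199.0, 450.0],  # current price in USD
        "price_before": [249.0, 500.0],  # price before discount
    })

    # discount expressed as a percentage of the original price
    phones["discount_pct"] = ((phones["price_before"] - phones["price"])
                              / phones["price_before"] * 100).round(2)
    print(phones["discount_pct"].tolist())
    ```

    Rows where Price Before Discount is missing would produce NaN here, which matches the field description's "if applicable" caveat.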

    Potential Uses

    • Product Comparison: This dataset can be used to compare various cellphone brands and models based on price, RAM, storage, and other features.
    • Trend Analysis: Analyze price trends, discount patterns, and customer preferences based on ratings and reviews.
    • Machine Learning: Build machine learning models to predict price trends, ratings, or sales volume.
    • Exploratory Data Analysis (EDA): Perform EDA to discover patterns, outliers, and insights into the cellphone market.

    Data Cleaning

    The dataset has been cleaned to remove duplicates and standardize data entries. Missing values were handled where possible, and units of measurement (e.g., RAM, storage) have been converted for consistency.

    Limitations

    • Dynamic Content: Prices, ratings, and availability may change on Amazon over time, meaning this dataset represents a snapshot of the listings at the time of scraping.
  19. IMDB top 250 French movies

    • kaggle.com
    zip
    Updated Aug 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Khushi Pitroda (2023). IMDB top 250 French movies [Dataset]. https://www.kaggle.com/datasets/khushipitroda/imdb-top-250-french-movies/code
    Explore at:
    zip(36031 bytes)Available download formats
    Dataset updated
    Aug 3, 2023
    Authors
    Khushi Pitroda
    Area covered
    French
    Description

    Important Note: The "Top 250 French Movies" dataset comprises information on the highest-rated French movies according to user ratings on various platforms. This dataset contains 250 unique French movies that have garnered critical acclaim and popularity among viewers. Each movie is associated with essential details, including its rank, title, release year, duration, genre, IMDb rating, image source link, and a brief description.

    This dataset is intended for learning, research, and analysis purposes. The movie ratings and details provided in the dataset are based on publicly available information at the time of scraping. As IMDb ratings and movie information may change over time, it is essential to verify and update the data for the latest information.

    By using this dataset, you acknowledge that the accuracy and completeness of the information cannot be guaranteed, and you assume responsibility for any analysis or decision-making based on the data. Additionally, please adhere to IMDb's terms of use and copyright policies when using the data for any public dissemination or commercial purposes.

    Data Analysis Tasks:

    1. Exploratory Data Analysis (EDA): Explore the distribution of movies by genres, release years, and IMDb ratings. Visualize the top-rated French movies and their IMDb ratings using bar charts or histograms.

    2. Year-wise Trends: Observe trends in French movie production over the years using line charts or area plots. Analyze if there's any correlation between release year and IMDb ratings.

    3. Word Cloud Analysis: Create word clouds from movie descriptions to visualize the most common words and themes among the top-rated French movies. This can provide insights into popular topics and genres.

    4. Network Analysis: Build a network graph connecting French movies that share common actors or directors. Analyze the interconnectedness of movies based on their production teams.

    Machine Learning Tasks:

    1. Movie Recommendation System: Implement a content-based recommendation system that suggests French movies based on similarities in genre, release year, and IMDb ratings. Use techniques like cosine similarity or Jaccard similarity to measure movie similarities.

    2. Movie Genre Classification: Build a multi-class classification model to predict the genre of a French movie based on its description. Utilize Natural Language Processing (NLP) techniques like text preprocessing, TF-IDF, or word embeddings. Use classifiers like Logistic Regression, Naive Bayes, or Support Vector Machines.

    3. Movie Sentiment Analysis: Perform sentiment analysis on movie descriptions to determine the overall sentiment (positive, negative, neutral) of each movie. Use sentiment lexicons or pre-trained sentiment analysis models.

    4. Movie Rating Prediction: Develop a regression model to predict the IMDb rating of a French movie based on features like genre, release year, and description sentiment. Employ regression algorithms like Linear Regression, Decision Trees, or Random Forests.

    5. Movie Clustering: Apply unsupervised clustering algorithms to group French movies with similar attributes. Use features like genre, IMDb rating, and release year to identify movie clusters. Experiment with algorithms like K-means clustering or hierarchical clustering.
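    For the content-based recommendation task, cosine similarity over genre vectors is the core step. A sketch with three hypothetical films and hand-made binary genre features (drama, comedy, crime; not taken from the dataset):

    ```python
    import numpy as np

    films = {
        "Amélie":       np.array([0, 1, 0]),  # comedy
        "Intouchables": np.array([1, 1, 0]),  # drama, comedy
        "Un prophète":  np.array([1, 0, 1]),  # drama, crime
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    query = "Amélie"
    scores = {title: cosine(films[query], vec)
              for title, vec in films.items() if title != query}
    best = max(scores, key=scores.get)
    print(best)  # → Intouchables (shares the comedy tag)
    ```

    A real system would extend the vectors with normalized release year and IMDb rating, or with TF-IDF weights over the descriptions.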

    Important Note: Ensure that the data is appropriately preprocessed and encoded for machine learning tasks. Handle any missing values, perform feature engineering, and split the dataset into training and testing sets. Evaluate the performance of each machine learning model using appropriate metrics such as accuracy, precision, recall, or Mean Squared Error (MSE) depending on the task.

    It is crucial to remember that the performance of machine learning models may vary based on the dataset's size and quality. Interpret the results carefully and consider using cross-validation techniques to assess model generalization.

    Lastly, please adhere to IMDb's terms of use and any applicable data usage policies while conducting data analysis and implementing machine learning models with this dataset.
