84 datasets found
  1. Capstone Project TikTok - EDA

    • kaggle.com
    zip
    Updated Nov 15, 2023
    Cite
    Sohail K. Nikouzad (2023). Capstone Project TikTok - EDA [Dataset]. https://www.kaggle.com/datasets/sohailnikouzad/capstone-pr0ject-tiktok-eda
    Explore at:
    zip(52324 bytes)Available download formats
    Dataset updated
    Nov 15, 2023
    Authors
    Sohail K. Nikouzad
    Description

    Dataset

    This dataset was created by Sohail K. Nikouzad

    Contents

  2. Electronics Store Sales Dataset for EDA

    • kaggle.com
    zip
    Updated Feb 13, 2021
    Cite
    Sinjoy Saha (2021). Electronics Store Sales Dataset for EDA [Dataset]. https://www.kaggle.com/sinjoysaha/sales-analysis-dataset
    Explore at:
    zip(2505035 bytes)Available download formats
    Dataset updated
    Feb 13, 2021
    Authors
    Sinjoy Saha
    Description

    Content

    This is transaction data from an electronics store chain in the US. The data consists of 12 CSV files, one for each month of 2019, named Sales_[MONTH_NAME]_2019. Each file contains roughly 9,000 to 26,000 rows and 6 columns: Order ID, Product, Quantity Ordered, Price Each, Order Date, and Purchase Address. Combined, the 12 monthly files hold around 186,851 rows. Some rows may contain null values.
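    As a minimal sketch of how the twelve monthly files could be combined, the helper below globs for the Sales_[MONTH_NAME]_2019 naming pattern and concatenates the results. The function name `load_sales`, the `.csv` extension, and the choice to drop only fully empty rows are assumptions made for illustration, not part of the dataset.

```python
from pathlib import Path

import pandas as pd


def load_sales(folder: str) -> pd.DataFrame:
    """Concatenate every Sales_[MONTH_NAME]_2019.csv file found in `folder`.

    `load_sales` is a hypothetical helper; the file extension is assumed.
    """
    files = sorted(Path(folder).glob("Sales_*_2019.csv"))
    if not files:
        raise FileNotFoundError(f"no Sales_*_2019.csv files in {folder}")

    # Read each monthly file and stack them into one frame.
    frames = [pd.read_csv(path) for path in files]
    sales = pd.concat(frames, ignore_index=True)

    # The description warns about null values; here only rows that are
    # empty in every column are dropped (an assumption, adjust as needed).
    return sales.dropna(how="all").reset_index(drop=True)
```

    Calling `load_sales("path/to/folder")` would return one DataFrame with the six columns listed above; partially null rows are kept so they can be inspected separately.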

    Inspiration

    Keith Galli

    Acknowledgements

  3. Pandas Practice Dataset

    • kaggle.com
    zip
    Updated Jan 27, 2023
    Cite
    Mrityunjay Pathak (2023). Pandas Practice Dataset [Dataset]. https://www.kaggle.com/datasets/themrityunjaypathak/pandas-practice-dataset/discussion
    Explore at:
    zip(493 bytes)Available download formats
    Dataset updated
    Jan 27, 2023
    Authors
    Mrityunjay Pathak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    What is Pandas?

    Pandas is a Python library used for working with data sets.

    It has functions for analyzing, cleaning, exploring, and manipulating data.

    The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

    Why Use Pandas?

    Pandas allows us to analyze big data and draw conclusions based on statistical theory.

    Pandas can clean messy data sets, and make them readable and relevant.

    Relevant data is very important in data science.

    What Can Pandas Do?

    Pandas gives you answers about the data, such as:

    Is there a correlation between two or more columns?

    What is the average value?

    Max value?

    Min value?
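    The questions above map onto a handful of pandas calls. The sketch below uses a tiny made-up table (the column names and numbers are invented for illustration) to show the correlation, average, maximum, and minimum.

```python
import pandas as pd

# Made-up numbers, used only to illustrate the calls.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
})

# Is there a correlation between two columns? (Pearson by default)
correlation = df["hours_studied"].corr(df["exam_score"])

# What is the average, max, and min value of a column?
average = df["exam_score"].mean()
maximum = df["exam_score"].max()
minimum = df["exam_score"].min()

print(f"correlation={correlation:.3f} mean={average} max={maximum} min={minimum}")
```

    `df.describe()` returns the same summary statistics for every numeric column at once, and `df.corr()` gives the full correlation matrix.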

  4. watches

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    gil (2025). watches [Dataset]. https://huggingface.co/datasets/yotam22/watches
    Explore at:
    Dataset updated
    Nov 17, 2025
    Authors
    gil
    Description

    🕰️ Exploratory Data Analysis of Luxury Watch Prices

      Overview
    

    This project analyzes a large dataset of luxury watches to understand which factors influence price. We focus on brand, movement type, case material, size, gender, and production year. All work was done in Python (Pandas, NumPy, Matplotlib/Seaborn) on Google Colab.

      Dataset
    

    Rows: ~172,000
    Columns: 14
    Unit of observation: one watch listing

    Main columns

    name – watch/listing title
    price – listed… See the full description on the dataset page: https://huggingface.co/datasets/yotam22/watches.

  5. Cleaned Netflix Dataset for EDA

    • kaggle.com
    zip
    Updated Jul 7, 2025
    Cite
    Nikhil raman K (2025). Cleaned Netflix Dataset for EDA [Dataset]. https://www.kaggle.com/datasets/nikhilramank/cleaned-netflix-dataset-for-eda
    Explore at:
    zip(750797 bytes)Available download formats
    Dataset updated
    Jul 7, 2025
    Authors
    Nikhil raman K
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is a cleaned version of a Netflix movies dataset prepared for exploratory data analysis (EDA). Missing values have been handled, invalid rows removed, and numerical + categorical columns cleaned for analysis using Python and Pandas.

  6. Keith Galli's Sales Analysis Exercise

    • kaggle.com
    zip
    Updated Jan 28, 2022
    Cite
    Zulkhairee Sulaiman (2022). Keith Galli's Sales Analysis Exercise [Dataset]. https://www.kaggle.com/datasets/zulkhaireesulaiman/sales-analysis-2019-excercise/discussion
    Explore at:
    zip(2505083 bytes)Available download formats
    Dataset updated
    Jan 28, 2022
    Authors
    Zulkhairee Sulaiman
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This is the dataset required for Keith Galli's 'Solving real world data science tasks with Python Pandas!' video, in which he analyzes 12 months' worth of business data and answers business questions about it. The data contains hundreds of thousands of electronics store purchases broken down by month, product type, cost, purchase address, etc.

    I decided to upload the data here so that I can carry out the exercise directly in Kaggle Notebooks, making it ready for viewing as a portfolio project.

    Content

    12 .csv files containing sales data for each month of 2019.

    Acknowledgements

    Of course, all thanks go to Keith Galli and the great work he does with his tutorials. He has several other amazing tutorials that you can follow on his channel, which you can also subscribe to.

  7. singapore

    • kaggle.com
    zip
    Updated Jul 30, 2020
    Cite
    saibharath (2020). singapore [Dataset]. https://www.kaggle.com/saibharath12/singapore
    Explore at:
    zip(116322 bytes)Available download formats
    Dataset updated
    Jul 30, 2020
    Authors
    saibharath
    Area covered
    Singapore
    Description

    This dataset contains the total population of Singapore broken down by ethnicity and gender. It is raw data with mixed entities in its columns, and it covers the years 1957 to 2018. The main aim in uploading this data is to build skill in Python pandas for exploratory data analysis.

  8. Play Store Data Analysis By Vaishnavi

    • kaggle.com
    zip
    Updated Apr 30, 2021
    Cite
    Vaishnavi Sahu (2021). Play Store Data Analysis By Vaishnavi [Dataset]. https://www.kaggle.com/vaishnavisahu/play-store-data-analysis-by-vaishnavi
    Explore at:
    zip(597350 bytes)Available download formats
    Dataset updated
    Apr 30, 2021
    Authors
    Vaishnavi Sahu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    EDA using NumPy and pandas

    Content

    In this task I have to predict which factors make an app perform well: its size, price, category, or several factors together. What makes an app rank at the top of the Google Play Store?

    Column description:
    App: name of the application
    Category: category of the application
    Rating: rating of the application
    Reviews: reviews of the application
    Size: size of the application
    Installs: how many users installed the application
    Type: type of application
    Price: price of the application
    Content Rating: rating of the content of the application

  9. Startup_India_EDA

    • kaggle.com
    zip
    Updated Apr 30, 2022
    Cite
    Aryan Mahabhoi (2022). Startup_India_EDA [Dataset]. https://www.kaggle.com/datasets/aryanmahabhoi/startup-india-eda
    Explore at:
    zip(97006 bytes)Available download formats
    Dataset updated
    Apr 30, 2022
    Authors
    Aryan Mahabhoi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Startup India - Exploratory Data Analysis

    1- The dataset contains an updated record of all startups from 1963 to 2021.
    2- An exploratory data analysis is performed on the record with different types of data visualizations.

    Technologies Used: Python, NumPy, Pandas, Matplotlib, Seaborn

  10. Aviation EDA - on plane accidents

    • kaggle.com
    zip
    Updated Nov 27, 2024
    Cite
    victor munyaradzi (2024). Aviation EDA - on plane accidents [Dataset]. https://www.kaggle.com/datasets/victormunyaradzi/aviation-eda-on-plane-accidents
    Explore at:
    zip(628563 bytes)Available download formats
    Dataset updated
    Nov 27, 2024
    Authors
    victor munyaradzi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my first EDA. I took the data from Kaggle, drew a sample of all accidents since 1919, and analyzed it using Matplotlib, Python, Pandas, and NumPy.

    As an aspiring data analyst/scientist I am not yet very familiar with Git or Kaggle, so please forgive any GitHub errors.

  11. ZOMATO BANGALORE EDA

    • kaggle.com
    zip
    Updated Sep 15, 2025
    Cite
    Anshika Srivastava (2025). ZOMATO BANGALORE EDA [Dataset]. https://www.kaggle.com/datasets/anshikasri62/zomato-banglore-eda
    Explore at:
    zip(1246927 bytes)Available download formats
    Dataset updated
    Sep 15, 2025
    Authors
    Anshika Srivastava
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Bengaluru
    Description

    Exploratory Data Analysis (EDA) of the ZOMATO BANGALORE DATASET using Python and its libraries (Pandas, Matplotlib, and Seaborn). Analyzed restaurant distribution, top cuisines, rating distribution, cost for two, and other interesting insights.

    Included files:
    - NOTEBOOK: ZOMATO_EDA.ipynb
    - IMAGES: Visualizations of key insights
    - requirement.txt: Python dependencies

  12. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    zip(5890828 bytes)Available download formats
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: Data cleaned and standardized, 15,079 unique entries with attributes including - Customer ID, age, gender, annual income, and spending score. Can be used by marketing analysts to produce a better strategy for mall specific marketing.

    Challenges Faced:
    1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
    2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
    3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics:
    1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
    2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion:
    1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
    2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
    3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.

    This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References
    OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt
    Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
    Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
    Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/

  13. Cyclistic Bike - Data Analysis (Python)

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    Amirthavarshini (2023). Cyclistic Bike - Data Analysis (Python) [Dataset]. https://www.kaggle.com/datasets/amirthavarshini12/cyclistic-bike-data-analysis-python/code
    Explore at:
    zip(211278092 bytes)Available download formats
    Dataset updated
    Jun 19, 2023
    Authors
    Amirthavarshini
    Description

    Conducted an in-depth analysis of Cyclistic bike-share data to uncover customer usage patterns and trends. Cleaned and processed raw data using Python libraries such as pandas and NumPy to ensure data quality. Performed exploratory data analysis (EDA) to identify insights, including peak usage times, customer demographics, and trip duration patterns. Created visualizations using Matplotlib and Seaborn to effectively communicate findings. Delivered actionable recommendations to enhance customer engagement and optimize operational efficiency.
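    A minimal sketch of two of the analyses mentioned above, trip duration and peak usage times, is shown below. The column names `started_at`, `ended_at`, and `rider_type`, and the three sample trips, are assumptions made up for illustration; the actual files may be laid out differently.

```python
import pandas as pd

# Invented sample trips; real column names may differ.
trips = pd.DataFrame({
    "started_at": ["2023-05-01 08:05:00", "2023-05-01 08:40:00", "2023-05-01 17:15:00"],
    "ended_at": ["2023-05-01 08:25:00", "2023-05-01 09:10:00", "2023-05-01 17:30:00"],
    "rider_type": ["member", "casual", "member"],
})

# Parse the timestamps so that arithmetic and .dt accessors work.
trips["started_at"] = pd.to_datetime(trips["started_at"])
trips["ended_at"] = pd.to_datetime(trips["ended_at"])

# Trip duration in minutes.
trips["duration_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60

# Peak usage: number of trips started in each hour of the day.
trips_per_hour = trips.groupby(trips["started_at"].dt.hour).size()
peak_hour = trips_per_hour.idxmax()

# Average trip duration per rider type.
mean_duration = trips.groupby("rider_type")["duration_min"].mean()

print(peak_hour)
print(mean_duration)
```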

  14. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Explore at:
    zip(65751 bytes)Available download formats
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Abstract

    This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame.

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.
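    Steps 1 to 3 of the methodology can be sketched as follows. This is not the project's code: an in-memory SQLite database stands in for PostgreSQL so the example runs without a server, the rows are invented, and the column names (`productCode`, `quantityOrdered`, `priceEach`, `buyPrice`) are assumed to follow the usual classicmodels schema. `read_table` is a hypothetical name for the reusable reader.

```python
import sqlite3

import pandas as pd


def read_table(conn, table: str) -> pd.DataFrame:
    """Load one whole table into a DataFrame (hypothetical reusable reader)."""
    return pd.read_sql_query(f'SELECT * FROM "{table}"', conn)


# Stand-in database with two tiny, invented tables (SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (productCode TEXT PRIMARY KEY, productName TEXT, buyPrice REAL);
CREATE TABLE orderdetails (orderNumber INTEGER, productCode TEXT,
                           quantityOrdered INTEGER, priceEach REAL);
INSERT INTO products VALUES ('P1', 'Model A', 50.0), ('P2', 'Model B', 20.0);
INSERT INTO orderdetails VALUES (100, 'P1', 2, 80.0), (100, 'P2', 5, 30.0),
                                (101, 'P1', 1, 90.0);
""")

# Step 1: extract each table into its own DataFrame.
orderdetails = read_table(conn, "orderdetails")
products = read_table(conn, "products")

# Step 2: join on the shared key, then derive sale value, cost, and profit.
sales = orderdetails.merge(products, on="productCode", how="left")
sales["sale_value"] = sales["quantityOrdered"] * sales["priceEach"]
sales["cost"] = sales["quantityOrdered"] * sales["buyPrice"]
sales["profit"] = sales["sale_value"] - sales["cost"]

# Step 3: group by product to rank products by total profit.
profit_by_product = (
    sales.groupby("productName")["profit"].sum().sort_values(ascending=False)
)
print(profit_by_product)
```

    With PostgreSQL, the same `read_table` helper would typically receive a SQLAlchemy engine instead of a `sqlite3` connection, and a report such as `profit_by_product` could be written back with `DataFrame.to_sql`.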

    Results
    - Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    - Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    - Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions

    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used
    - Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    - Database: PostgreSQL
    - Tools: Jupyter Notebook
    - Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  15. 💥 Data-cleaning-for-beginner-using-pandas💢💥

    • kaggle.com
    zip
    Updated Oct 16, 2022
    Cite
    Pavan Tanniru (2022). 💥 Data-cleaning-for-beginner-using-pandas💢💥 [Dataset]. https://www.kaggle.com/datasets/pavantanniru/-datacleaningforbeginnerusingpandas/code
    Explore at:
    zip(654 bytes)Available download formats
    Dataset updated
    Oct 16, 2022
    Authors
    Pavan Tanniru
    Description

    This dataset helps you practice the data-cleaning process using only the Python pandas library.

    Indicators

    1. Age
    2. Salary
    3. Rating
    4. Location
    5. Established
    6. Easy Apply
  16. Customer Sale Dataset for Data Visualization

    • kaggle.com
    Updated Jun 6, 2025
    Cite
    Atul (2025). Customer Sale Dataset for Data Visualization [Dataset]. https://www.kaggle.com/datasets/atulkgoyl/customer-sale-dataset-for-visualization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atul
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.

    Unlike most public datasets, this one includes a diverse mix of column types:

    📅 Date columns (for time series and trend plots)
    🔢 Numerical columns (for histograms, boxplots, scatter plots)
    🏷️ Categorical columns (for bar charts, group analysis)

    Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.

    Feel free to:

    Create EDA notebooks
    Practice plotting techniques
    Experiment with filtering, grouping, and aggregations

    🛠️ No missing values, no data cleaning needed: just download and start exploring!

    Hope you find this helpful. Looking forward to hearing from you all.

  17. IMDb Top 4070: Explore the Cinema Data

    • kaggle.com
    zip
    Updated Aug 13, 2023
    Cite
    K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data/discussion
    Explore at:
    zip(1449581 bytes)Available download formats
    Dataset updated
    Aug 13, 2023
    Authors
    K.T.S. Prabhu
    Description

    Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
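    The two selection criteria described above (rating above 7 and more than 10,000 votes) amount to a boolean filter in pandas. In the sketch below, the column names `rating` and `votes` and the four sample rows are assumptions for illustration only.

```python
import pandas as pd

# Invented sample rows; the real dataset's column names may differ.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "rating": [8.1, 7.0, 7.4, 9.0],
    "votes": [250_000, 50_000, 9_500, 12_000],
})

# Keep only movies that pass both thresholds.
selected = movies[(movies["rating"] > 7) & (movies["votes"] > 10_000)]
print(selected)
```

    Note that both comparisons are strict, so a movie rated exactly 7 is excluded.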

    What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

    Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, making it possible to uncover hidden patterns, trends, and themes within the realm of cinema.

    Note: The data was collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.

  18. Convert Text to Pandas

    • kaggle.com
    zip
    Updated Sep 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
    Explore at:
    zip(4333134 bytes)Available download formats
    Dataset updated
    Sep 22, 2024
    Authors
    Zeyad Usf
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    kaggle notebook
    Github Repo

    I found two datasets about converting text with context to pandas code on Hugging Face, but the challenge is in the context. The context in both datasets is different which reduces the results of the model. First let's mention the data I found and then show examples, solution and some other problems.

    • Rahima411/text-to-pandas:

      • The data is divided into Train with 57.5k and Test with 19.2k.

      • The data has two columns as you can see in the example:

        • "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
        • "Pandas Query": Pandas code

    Input                                                    | Pandas Query
    ---------------------------------------------------------|-------------------------------------------
    Table Name: head (age (object), head_id (object))        | result = management['head.age'].unique()
    Table Name: management (head_id (object),                |
    temporary_acting (object))                               |
    What are the distinct ages of the heads who are acting?  |

    • hiltch/pandas-create-context:

      • It contains 17k rows with three columns:
        • question : text .
        • context : Code to create a data frame with column names, unlike the first data set which contains the name of the data frame, column names and data type.
        • answer : Pandas code.

    question                               | context                                                | answer
    ---------------------------------------|--------------------------------------------------------|---------------------------------------
    What was the lowest # of total votes?  | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()
    

    As you can see, the problem with this data is that the inputs are not similar and the structure of the context is different. My solution to this problem was:

    - Convert the first data set to become like the second in the context. I chose this because it is difficult to get the data type for the columns in the second data set. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through this code that I wrote.
    - Then separate the question from the context. This was easy because if you look at the data, you will find that the context always ends with ")" and then a blank and then the question. You will find all of this in this code.
    - You will also notice that more than one table definition can appear in the context, and the code handles this.

    ```py
    import re

    def extract_table_creation(text: str) -> tuple[str, str]:
        """
        Extracts DataFrame creation statements and questions from the given text.

        Args:
          text (str): The input text containing table definitions and questions.

        Returns:
          tuple: A tuple containing a concatenated DataFrame creation string and a question.
        """
        # Define patterns
        table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
        column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

        # Find all table names and column definitions
        matches = re.findall(table_pattern, text)

        # Initialize a list to hold DataFrame creation statements
        df_creations = []

        for table_name, columns_str in matches:
            # Extract column names (the dtype captured by the pattern is discarded)
            columns = re.findall(column_pattern, columns_str)
            column_names = [col[0] for col in columns]

            # Format DataFrame creation statement
            df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
            df_creations.append(df_creation)

        # Concatenate all DataFrame creation statements, one per line
        df_creation_concat = '\n'.join(df_creations)

        # Extract and clean the question: everything after the last ')'
        question = text[text.rindex(')')+1:].strip()

        return df_creation_concat, question
    ```

    After both datasets were similar in structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset and you can see it all through the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as
    > - `Answer` : `df['Id'].count()` has been repeated, but this is possible, so we do not need to dispense with these rows.
    > - `Context` : We see that it contains `147` rows that do not contain any text. We will see through the experiment whether this affects the results negatively or positively.
    > - `Question` : It is ...
    
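    The first two checks listed above, repeated answers and rows whose context is empty, could be reproduced along these lines. This is a sketch on a three-row stand-in table, not the merged dataset itself; the column names `question`, `context`, and `answer` follow the second dataset's layout, and the rows are invented.

```python
import pandas as pd

# Tiny invented stand-in for the merged question/context/answer table.
merged = pd.DataFrame({
    "question": [
        "How many ids are there?",
        "Count the ids.",
        "What is the lowest vote total?",
    ],
    "context": [
        "df = pd.DataFrame(columns=['Id'])",
        "",
        "df = pd.DataFrame(columns=['votes'])",
    ],
    "answer": ["df['Id'].count()", "df['Id'].count()", "df['votes'].min()"],
})

# Answers that occur more than once.
answer_counts = merged["answer"].value_counts()
repeated_answers = answer_counts[answer_counts > 1]

# Rows whose context is missing or contains only whitespace.
empty_context = merged["context"].fillna("").str.strip().eq("")

print(repeated_answers)
print("rows with empty context:", int(empty_context.sum()))
```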
  19. DataScience for Work - Human Resources

    • kaggle.com
    zip
    Updated Apr 28, 2024
    Cite
    Beytullah Soylev (2024). DataScience for Work - Human Resources [Dataset]. https://www.kaggle.com/datasets/soylevbeytullah/ds4work-human-resources
    Explore at:
    zip(51278 bytes)Available download formats
    Dataset updated
    Apr 28, 2024
    Authors
    Beytullah Soylev
    Description

    Case Study: Improving Human Resources with Data Science

    Objective: Utilize data science to predict employee turnover and enhance the Human Resources department.

    Key Learnings:

    Leveraging Data Science for HR Transformation: Understand how data science can reduce employee turnover and revolutionize HR.

    Logistic Regression and Random Forest Classifiers: Grasp the theory behind these classifiers and implement them using scikit-learn.

    Sigmoid Functions and Pandas DataFrames: Extract probability values using sigmoid functions and manipulate datasets with Pandas.

    Python Functions and Pandas Dataframe Applications: Develop and apply Python functions to Pandas dataframes.

    Exploratory Data Analysis with Matplotlib and Seaborn: Perform EDA using Matplotlib and Seaborn, generating KDE plots, box plots, and count plots.

    Categorical Variable Transformation and Data Set Division: Convert categorical variables into dummy variables and divide datasets into training and testing sets using scikit-learn.

    Artificial Neural Networks for Classification: Understand the theory and application of artificial neural networks in classification tasks.

    Classification Model Evaluation and Result Interpretation: Evaluate classification models using confusion matrices and classification reports, distinguishing between precision, recall, and F1 scores.
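    Two of the learnings above, dummy-variable conversion and extracting probabilities with a sigmoid, can be illustrated with pandas and NumPy alone. The sketch below is not the course's code: the employee rows are invented, and the weights and bias are hand-picked placeholders rather than coefficients fitted by scikit-learn.

```python
import numpy as np
import pandas as pd


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Invented sample data.
employees = pd.DataFrame({
    "department": ["sales", "hr", "sales", "it"],
    "satisfaction": [0.2, 0.9, 0.5, 0.7],
})

# Convert the categorical column into dummy (one-hot) columns.
features = pd.get_dummies(employees, columns=["department"], dtype=float)

# Hypothetical hand-picked weights, NOT fitted to any data.
weights = pd.Series({
    "satisfaction": -4.0,
    "department_hr": 0.0,
    "department_it": -0.5,
    "department_sales": 0.5,
})
bias = 2.0

# Linear score per employee, then squash it into a probability.
scores = features[weights.index].to_numpy() @ weights.to_numpy() + bias
employees["leave_probability"] = sigmoid(scores)
print(employees)
```

    In a real workflow the weights would come from a fitted logistic regression, whose predicted probabilities are computed with this same sigmoid transformation of the linear score.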

    Embark on this data-driven journey to transform Human Resources!

  20. Road Accident Severity in India

    • kaggle.com
    zip
    Updated Jan 5, 2024
    Cite
    SHRIYANSHMESSI (2024). Road Accident Severity in India [Dataset]. https://www.kaggle.com/datasets/shriyanshmessi/road-accident-severity-in-india/code
    Explore at:
    zip(317927 bytes)Available download formats
    Dataset updated
    Jan 5, 2024
    Authors
    SHRIYANSHMESSI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    India
    Description

    The dataset offers data on a number of variables related to Road Accident Severity in India, such as the time of day, the day of the week, the age range of drivers, gender, educational attainment, car attributes, driving history, road conditions, and the seriousness of accidents. We can learn more about the trends, connections, and possible risk factors associated with auto accidents by examining this dataset. The dataset offers valuable insights into the dynamics of road accidents, enabling authorities, policymakers, and researchers to make informed decisions regarding road safety measures and interventions.

Capstone Project TikTok - EDA

Using the Pandas package in Python for exploratory data analysis (EDA)
