93 datasets found
  1. MICRO CREDIT PROJECT

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    Akshatha Aravind (2023). MICRO CREDIT PROJECT [Dataset]. https://www.kaggle.com/datasets/akshathaaravind/micro-credit-project
    Available download formats: zip (1625692 bytes)
    Dataset updated
    Oct 21, 2023
    Authors
    Akshatha Aravind
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Akshatha Aravind

    Released under Apache 2.0

    Contents

  2. Ken Jee YouTube Data

    • kaggle.com
    zip
    Updated Jan 22, 2022
    Cite
    Ken Jee (2022). Ken Jee YouTube Data [Dataset]. https://www.kaggle.com/datasets/kenjee/ken-jee-youtube-data
    Available download formats: zip (6556461 bytes)
    Dataset updated
    Jan 22, 2022
    Authors
    Ken Jee
    Area covered
    YouTube
    Description

    Context

    I've been creating videos on YouTube since November of 2017 (https://www.youtube.com/c/KenJee1) with the mission of making data science accessible to more people. One of the best ways to do this is to tell stories and work on projects. This is my attempt at my first community project. I am making my YouTube data available for everyone to help better understand the growth of my YouTube community and think about ways that it could be improved! I would love for everyone in the community to feel like they had some hand in contributing to the channel.

    Announcement Video: https://youtu.be/YPph59-rTxA

    I will be sharing my favorite projects in a few of my videos (with permission of course), and would also like to give away a few small prizes to the top featured notebooks. I hope you have fun with the analysis; I'm interested in seeing what you find in the data!

    For those looking for a place to start, some things I'm thinking about are:

    • What are the themes of the comment data?
    • What types of video titles and thumbnails drive the most traffic?
    • Who is my core audience and what are they interested in?
    • What types of videos have led to the most growth?
    • What type of content are people engaging with the most or watching the longest?

    Some advanced projects could be:

    • Creating a chat bot to respond to common comments with videos where I have addressed a topic
    • Pulling sentiment from thumbnails and titles and comparing that with performance

    Data I would like to add over time:

    • Video descriptions
    • Video subtitles
    • Actual video data

    Content

    There are four files in this repo. The relevant data included in most of them is from Nov 2017 - Jan 2022. I gathered some of this data via the YouTube API and the rest from my specific analytics.

    1) Aggregated Metrics By Video - This has all the topline metrics from my channel from its start (around 2015 to Jan 22 2022). I didn't post my first video until around
    2) Aggregated Metrics By Video with Country and Subscriber Status - This has the same data as aggregated metrics by video, but it includes dimensions for which country people are viewing from and if the viewers are subscribed to the channel or not.
    3) Video Performance Over Time - This has the daily data from each of my videos.
    4) All Comments - This is all of my comment data gathered from the YouTube API. I have anonymized the users so don't worry about your name showing up!

    Acknowledgements

    This obviously wouldn't be possible without all of the wonderful people who watch and interact with my videos! I'm incredibly grateful for you all and I'm so happy I can share this project with you!

    License

    I collected this data from the YouTube API and through my own google analytics. Thus, use of it must uphold the YouTube API's terms of service: https://developers.google.com/youtube/terms/api-services-terms-of-service

  3. Housing Prices Dataset

    • kaggle.com
    zip
    Updated Jan 12, 2022
    Cite
    M Yasser H (2022). Housing Prices Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
    Available download formats: zip (4740 bytes)
    Dataset updated
    Jan 12, 2022
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    A simple yet challenging project: predict the housing price based on factors such as house area, bedrooms, furnishing, nearness to the main road, etc. The dataset is small, yet its complexity arises from strong multicollinearity among the features. Can you overcome these obstacles and build a decent predictive model?

    Acknowledgement:

    Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
    Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build Regression models to predict the sales w.r.t. single & multiple features.
    • Also evaluate the models & compare their respective scores like R2, RMSE, etc.
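
    The objectives above (fit a regression model, then compare scores such as R2 and RMSE) can be sketched in plain Python. This is a minimal illustration, not the dataset author's method: it fits a single-feature least-squares line, and the area and price values are made up for the example rather than taken from the dataset.

```python
import math


def fit_simple_linear(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_and_rmse(ys, preds):
    """Return (R2, RMSE) for observed values ys and predictions preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))


# Made-up illustrative values (area, price); NOT taken from the dataset.
areas = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200.0, 290.0, 410.0, 500.0]

slope, intercept = fit_simple_linear(areas, prices)
preds = [slope * a + intercept for a in areas]
r2, rmse = r2_and_rmse(prices, preds)
print(f"slope={slope:.4f} intercept={intercept:.2f} R2={r2:.4f} RMSE={rmse:.2f}")
```

    With several correlated features, the same two metrics still apply, but coefficient estimates become unstable, which is the multicollinearity issue the description mentions.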
  4. Power BI dataset

    • kaggle.com
    zip
    Updated Oct 31, 2023
    Cite
    Ahmadali Jamali (2023). Power BI dataset [Dataset]. https://www.kaggle.com/datasets/ahmadalijamali/dataset
    Available download formats: zip (1642 bytes)
    Dataset updated
    Oct 31, 2023
    Authors
    Ahmadali Jamali
    License

    https://www.licenses.ai/ai-licenses

    Description

    Tabular dataset for data analysis and machine learning practice. The dataset is about the market and is usable for Power BI practice and data science.

  5. Toy Dataset

    • kaggle.com
    zip
    Updated Dec 10, 2018
    Cite
    Carlo Lepelaars (2018). Toy Dataset [Dataset]. https://www.kaggle.com/datasets/carlolepelaars/toy-dataset
    Available download formats: zip (1184308 bytes)
    Dataset updated
    Dec 10, 2018
    Authors
    Carlo Lepelaars
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.

    This toy dataset features 150000 rows and 6 columns.

    Columns

    Note: All data is fictional. The data has been generated so that its distributions are convenient for statistical analysis.

    Number: A simple index number for each row

    City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)

    Gender: Gender of a person (Male or Female)

    Age: The age of a person (Ranging from 25 to 65 years)

    Income: Annual income of a person (Ranging from -674 to 177175)

    Illness: Is the person ill? (Yes or No)
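
    As a starting point for EDA on the columns listed above, the following is a small sketch using only the standard library. The sample rows are invented to match the described layout (they are not real dataset values), and the header names are assumed to match the column names given here. It averages income per city and flags negative incomes, which the stated Income range (-674 to 177175) suggests can occur.

```python
import csv
import io
from collections import defaultdict

# Made-up rows following the column layout described above; not real dataset values.
SAMPLE = """Number,City,Gender,Age,Income,Illness
1,Dallas,Male,41,40367.0,No
2,Dallas,Female,54,45084.0,No
3,Boston,Male,42,-120.0,Yes
4,Boston,Female,30,60000.0,No
"""


def mean_income_by_city(rows):
    """Average the Income column for each City."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum, count]
    for row in rows:
        acc = totals[row["City"]]
        acc[0] += float(row["Income"])
        acc[1] += 1
    return {city: total / count for city, (total, count) in totals.items()}


def negative_income_rows(rows):
    """Return the Number index of every row whose Income is below zero."""
    return [int(row["Number"]) for row in rows if float(row["Income"]) < 0]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(mean_income_by_city(rows))
print(negative_income_rows(rows))
```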

    Acknowledgements

    Stock photo by Mika Baumeister on Unsplash.

  6. Analysis of small businesses in Michigan

    • kaggle.com
    zip
    Updated Oct 12, 2024
    Cite
    Maooz Abdullah (2024). Analysis of small businesses in Michigan [Dataset]. https://www.kaggle.com/datasets/maoozabdullah/analysis-of-small-businesses-in-michigan
    Available download formats: zip (334456 bytes)
    Dataset updated
    Oct 12, 2024
    Authors
    Maooz Abdullah
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Michigan
    Description

    The objective of this report is to analyze the role of small businesses in the Michigan job market using the provided dataset. We aim to understand the impact of small businesses on employment, sales, and other economic factors. This analysis will help in identifying trends and patterns that can inform policy decisions and support for small businesses.

  7. Supply Chain DataSet

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Amir Motefaker (2023). Supply Chain DataSet [Dataset]. https://www.kaggle.com/datasets/amirmotefaker/supply-chain-dataset
    Available download formats: zip (9340 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Amir Motefaker
    Description

    Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.

  8. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Available download formats: zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done on Jupyter Notebook -

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    NumPy
    Matplotlib
    Seaborn
    scikit-learn
    
  9. (CO2 Emissions Project)

    • kaggle.com
    zip
    Updated Aug 4, 2024
    + more versions
    Cite
    ömeryldz (2024). (CO2 Emissions Project) [Dataset]. https://www.kaggle.com/datasets/omeryldz4034/co2-emissions-project
    Available download formats: zip (677350 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    ömeryldz
    Description
    • 1. About Dataset
    • This dataset contains information about various vehicles' carbon dioxide (CO2) emissions and fuel consumption.
    • In the context of Machine Learning (ML), this dataset is often used to predict CO2 emissions based on vehicle characteristics or to analyze fuel efficiency of vehicles.
    • The goal could be to predict CO2 emissions or fuel consumption based on the features of the vehicles.
    • There are 7385 rows and 12 columns in total.

    The columns in the dataset can be described as follows:

    1. Make: The brand of the vehicle.
    2. Model: The model of the vehicle.
    3. Vehicle Class: The class of the vehicle (e.g., compact, SUV).
    4. Engine Size(L): The engine size in liters.
    5. Cylinders: The number of cylinders in the engine.
    6. Transmission: The type of transmission (e.g., automatic, manual).
    7. Fuel Type: The type of fuel used (e.g., gasoline, diesel).
    8. Fuel Consumption City (L/100 km): Fuel consumption in the city (liters per 100 kilometers).
    9. Fuel Consumption Hwy (L/100 km): Highway (out-of-city) fuel consumption.
    10. Fuel Consumption Comb (L/100 km): Combined (city and highway) fuel consumption.
    11. Fuel Consumption Comb (mpg): Combined fuel consumption in miles per gallon.
    12. CO2 Emissions(g/km): CO2 emissions in grams per kilometer.
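
    Columns 10 and 11 express the same combined consumption in two units. The conversion can be sketched as below. The listing does not say whether the mpg column uses US or imperial gallons, so the gallon size is a parameter here and the default (US) is only an assumption.

```python
# Exact unit definitions.
LITERS_PER_US_GALLON = 3.785411784
LITERS_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344


def l_per_100km_to_mpg(l_per_100km, liters_per_gallon=LITERS_PER_US_GALLON):
    """Convert fuel consumption in L/100 km to miles per gallon.

    Which gallon the dataset's mpg column uses is not stated in the
    description, so the US default is an assumption.
    """
    if l_per_100km <= 0:
        raise ValueError("consumption must be positive")
    km_per_liter = 100.0 / l_per_100km
    miles_per_liter = km_per_liter / KM_PER_MILE
    return miles_per_liter * liters_per_gallon


print(round(l_per_100km_to_mpg(10.0), 2))
print(round(l_per_100km_to_mpg(10.0, LITERS_PER_IMPERIAL_GALLON), 2))
```

    Comparing both results against the dataset's own mpg column for a few rows is a quick way to find out which gallon was used.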

    Model

    The "Model" column includes terms that identify specific features or configurations of vehicles:

    • 4WD/4X4: Four-wheel drive. A drive system where all four wheels receive power.
    • AWD: All-wheel drive. Similar to 4WD but often with more complex mechanisms for power distribution.
    • FFV: Flexible-fuel vehicle. Vehicles that can use multiple types of fuel, such as both gasoline and ethanol blends.
    • SWB: Short wheelbase.
    • LWB: Long wheelbase.
    • EWB: Extended wheelbase.

    Transmission

    The "Transmission" column indicates the type of transmission system in the vehicle:

    • A: Automatic. A transmission type that operates without the need for the driver to manually change gears.
    • AM: Automated manual. A version of a manual transmission that is automated.
    • AS: Automatic with select shift. An automatic transmission that allows for manual intervention.
    • AV: Continuously variable. A transmission that uses continuously varying ratios instead of fixed gear ratios.
    • M: Manual. A transmission type that requires the driver to manually change gears.
    • 3 - 10: Number of gears in the transmission.
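
    The transmission codes above can be split into a type and a gear count before modelling. The sketch below assumes the two parts are concatenated in one value (for example "AS6" or "AV"); that exact encoding is an assumption, since the description lists the letter codes and the gear range separately.

```python
import re

TRANSMISSION_TYPES = {
    "A": "Automatic",
    "AM": "Automated manual",
    "AS": "Automatic with select shift",
    "AV": "Continuously variable",
    "M": "Manual",
}

# Two-letter prefixes are tried before the single-letter "A".
_CODE = re.compile(r"^(AM|AS|AV|A|M)(\d{1,2})?$")


def parse_transmission(code):
    """Split a code such as 'AS6' into (type description, gear count or None)."""
    match = _CODE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized transmission code: {code!r}")
    prefix, gears = match.groups()
    return TRANSMISSION_TYPES[prefix], (int(gears) if gears else None)


print(parse_transmission("AS6"))
print(parse_transmission("AV"))
```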

    Fuel Type

    The "Fuel Type" column specifies the type of fuel used by the vehicle:

    • X: Regular gasoline.
    • Z: Premium gasoline.
    • D: Diesel.
    • E: Ethanol (E85).
    • N: Natural gas.

    Vehicle Class

    The "Vehicle Class" column categorizes vehicles by size and type:

    • COMPACT: Smaller-sized vehicles.
    • SUV - SMALL: Smaller-sized sports utility vehicles.
    • MID-SIZE: Medium-sized vehicles.
    • TWO-SEATER: Vehicles with two seats.
    • MINICOMPACT: Very small-sized vehicles.
    • SUBCOMPACT: Smaller than compact-sized vehicles.
    • FULL-SIZE: Larger-sized vehicles.
    • STATION WAGON - SMALL: Smaller-sized station wagons.
    • SUV - STANDARD: Standard-sized sports utility vehicles.
    • VAN - CARGO: Vans designed for cargo.
    • VAN - PASSENGER: Vans designed for passenger transportation.
    • PICKUP TRUCK - STANDARD: Standard-sized pickup trucks.
    • MINIVAN: Smaller-sized vans.
    • SPECIAL PURPOSE VEHICLE: Vehicles designed for special purposes.
    • STATION WAGON - MID-SIZE: Mid-sized station wagons.
    • PICKUP TRUCK - SMALL: Smaller-sized pickup trucks.

    This dataset can be used to understand the fuel efficiency and environmental impact of vehicles. Machine learning models can use these features to predict CO2 emissions or perform analyses comparing the fuel consumption of different vehicles.

  10. Insurance Dataset Based on Real-World Statistics

    • kaggle.com
    zip
    Updated Jan 19, 2025
    Cite
    SamiAlyasin (2025). Insurance Dataset Based on Real-World Statistics [Dataset]. https://www.kaggle.com/datasets/samialyasin/insurance-data-personal-auto-line-of-business
    Available download formats: zip (157388 bytes)
    Dataset updated
    Jan 19, 2025
    Authors
    SamiAlyasin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    World
    Description

    This dataset is a synthetic yet realistic representation of personal auto insurance data, crafted using real-world statistics. While actual insurance data is sensitive and unavailable for public use, this dataset bridges the gap by offering a safe and practical alternative for building robust data science projects.

    Why This Dataset?

    • Realistic Foundation: Synthetic data generated from real-world statistical patterns ensures practical relevance.
    • Safe for Use: No personal or sensitive information; completely anonymized and compliant with data privacy standards.
    • Flexible Applications: Ideal for testing models, developing prototypes, and showcasing portfolio projects.

    How You Can Use It:

    • Build machine learning models for predicting customer conversion and retention.
    • Design risk assessment tools or premium optimization algorithms.
    • Create dashboards to visualize trends in customer segmentation and policy data.
    • Explore innovative solutions for the insurance industry using a realistic data foundation.

    This dataset empowers you to work on real-world insurance scenarios without compromising on data sensitivity.

  11. Student Startup Success Dataset

    • kaggle.com
    zip
    Updated Jul 14, 2025
    Cite
    Ziya (2025). Student Startup Success Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/student-startup-success-dataset
    Available download formats: zip (45574 bytes)
    Dataset updated
    Jul 14, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset includes 2,100 entries of student-led entrepreneurship projects sourced from 40 academic institutions between 2019 and 2023. It was developed to aid in predictive modeling of the success rate of college startup initiatives using deep learning and machine learning approaches.

    The dataset captures both structural and strategic elements that influence startup outcomes — such as funding, innovation, team dynamics, and support systems like mentorship and incubation. Each project is labeled as either successful (1) or not successful (0) based on a calculated success metric derived from multiple weighted inputs.

    This data can be used to train classification models, perform feature analysis, and build intelligent recommendation systems to support innovation incubators, educational policymakers, and student entrepreneurs.

    The current dataset primarily captures internal project-specific factors such as team experience, innovation score, funding, mentorship, and incubation support. However, it does not include broader environmental variables, such as macroeconomic indicators (e.g., industry growth rates, regional investment trends) or regional factors (e.g., resource availability in large vs. small cities). These external factors can significantly influence startup success. To enhance the dataset’s robustness, future work can integrate supplementary environmental variables using publicly available data sources, such as regional economic indicators, startup density, proximity to innovation hubs, and local infrastructure quality. Incorporating these variables will enable the predictive model to account for both internal and external determinants of success, thereby improving its accuracy, generalizability, and practical applicability for diverse institutional and regional contexts.

    Key Features

    Feature name: Description

    project_id: Unique identifier for each project
    institution_name: Name of the college or university
    institution_type: Type of institution (Public, Private, Technical, Non-technical)
    project_domain: Startup domain (e.g., HealthTech, EdTech, AgriTech)
    team_size: Number of students in the team
    avg_team_experience: Average prior experience of the team members (in years)
    innovation_score: Normalized score reflecting novelty and originality of the project
    funding_amount_usd: Initial funding received by the project in USD
    mentorship_support: Whether the team received mentorship (1 = Yes, 0 = No)
    incubation_support: Whether the project was incubated (1 = Yes, 0 = No)
    market_readiness_level: Readiness scale from idea (1) to market-ready (5)
    competition_awards: Number of awards won in competitions
    business_model_score: Score representing clarity and scalability of the business model (0 to 1)
    technology_maturity: Maturity level of the tech used (1 = prototype, 5 = production ready)
    year: Year the project was submitted
    success_label: Target variable (1 = Successful, 0 = Not successful)
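
    Before training a classifier on these features, it can help to check each record against the ranges stated in the description. The sketch below encodes only those stated ranges (binary flags, 1 to 5 scales, a 0 to 1 score, years 2019 to 2023). The function name and the sample record are hypothetical and not part of the dataset.

```python
def validate_project(row):
    """Return a list of problems for one record; an empty list means it looks valid."""
    problems = []
    # Binary columns described as 1 = Yes / 0 = No (or successful / not successful).
    for flag in ("mentorship_support", "incubation_support", "success_label"):
        if row[flag] not in (0, 1):
            problems.append(f"{flag} must be 0 or 1")
    # Ordinal scales described as running from 1 to 5.
    for scale in ("market_readiness_level", "technology_maturity"):
        if not 1 <= row[scale] <= 5:
            problems.append(f"{scale} must be between 1 and 5")
    if not 0.0 <= row["business_model_score"] <= 1.0:
        problems.append("business_model_score must be between 0 and 1")
    if not 2019 <= row["year"] <= 2023:
        problems.append("year must be between 2019 and 2023")
    return problems


# Hypothetical record used only to exercise the checks.
example = {
    "mentorship_support": 1,
    "incubation_support": 0,
    "success_label": 1,
    "market_readiness_level": 3,
    "technology_maturity": 6,  # deliberately out of range
    "business_model_score": 0.72,
    "year": 2021,
}
print(validate_project(example))
```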

  12. Top 980 Starred Open Source Projects on GitHub

    • kaggle.com
    zip
    Updated Jun 24, 2017
    Cite
    Chase Willden (2017). Top 980 Starred Open Source Projects on GitHub [Dataset]. https://www.kaggle.com/datasets/chasewillden/topstarredopensourceprojects/code
    Available download formats: zip (64636 bytes)
    Dataset updated
    Jun 24, 2017
    Authors
    Chase Willden
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    GitHub is the leader in hosting open source projects. For those who are not familiar with open source projects, a group of developers share and contribute to common code to develop software. Examples of open source projects include Chromium (on which Google Chrome is built), WordPress, and Hadoop. Open source projects are said to have disrupted the software industry (2008 Kansas Keynote).

    Content

    For this study, I crawled GitHub.com, the leader in hosting open source projects, and extracted a list of the top starred open source projects. On GitHub, a user may choose to star a repository to show that they “like” the project. For each project, I gathered the repository username or organization the project resided in, the repository name, a description, the last updated date, the language of the project, the number of stars, any tags, and finally the URL of the project.

    Acknowledgements

    This data wouldn't be available if it weren't for GitHub. An example micro-study can be found at The Concept Center

  13. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Available download formats: zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original version, which can be found here. The data consists of content added to Netflix from 2008 to 2021. The oldest content dates from 1925 and the newest from 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:

    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns

    Extra steps and more explanation of the process are given in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, so we check it for duplicates
    --(only show_id values that appear more than once are returned)

    SELECT show_id, COUNT(*)
    FROM netflix
    GROUP BY show_id
    HAVING COUNT(*) > 1
    ORDER BY show_id DESC;

    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    --We can see that there are NULLs:
    --director_nulls = 2634
    --movie_cast_nulls = 825
    --country_nulls = 831
    --date_added_nulls = 10
    --rating_nulls = 4
    --duration_nulls = 3
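
    The per-column NULL count above can be tried locally without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in and counts NULLs as COUNT(*) - COUNT(column), a portable alternative to the FILTER clause. The table contents are placeholder rows, not Netflix data, and null_counts is a hypothetical helper name.

```python
import sqlite3


def null_counts(conn, table, columns):
    """Return {"<column>_nulls": n} for each column.

    COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the number of NULLs.
    Table and column names are interpolated into the SQL text, so they must
    come from trusted code, never from user input.
    """
    select_list = ", ".join(
        f'COUNT(*) - COUNT("{c}") AS "{c}_nulls"' for c in columns
    )
    row = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip((f"{c}_nulls" for c in columns), row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE netflix (show_id TEXT, director TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO netflix VALUES (?, ?, ?)",
    [
        # Placeholder rows for illustration only.
        ("s1", "Director A", "Country X"),
        ("s2", None, "Country X"),
        ("s3", "Director A", None),
        ("s4", None, None),
    ],
)
counts = null_counts(conn, "netflix", ["show_id", "director", "country"])
print(counts)
```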
    

    Nulls make up about 30% of the director column, so I will not delete those rows. Instead, I will use another column to populate it. To do that, we first check whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    --With this, we can now populate NULL rows in director
    --using their record with movie_cast
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL ;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to the director and the movie, we are going to populate missing countries from other titles by the same director.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Check whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULL in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
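
    The country backfill above can be reproduced on a toy table to see its effect. This sketch uses Python's sqlite3 module rather than PostgreSQL, and swaps the UPDATE ... FROM self-join for a correlated subquery that works on older SQLite versions. Taking MIN(country) when a director has several countries is an assumption added here to make the result deterministic; the original query does not specify which match wins. The rows are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE netflix (show_id TEXT PRIMARY KEY, director TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO netflix VALUES (?, ?, ?)",
    [
        # Placeholder rows for illustration only.
        ("s1", "Director A", "Country X"),
        ("s2", "Director A", None),  # can be filled from s1
        ("s3", "Director B", None),  # no other title by this director
    ],
)


def fill_country_from_director(conn):
    """Copy a country from another title by the same director, then label the rest."""
    conn.execute(
        """
        UPDATE netflix
        SET country = (
            SELECT MIN(nt2.country)
            FROM netflix AS nt2
            WHERE nt2.director = netflix.director
              AND nt2.show_id <> netflix.show_id
              AND nt2.country IS NOT NULL
        )
        WHERE country IS NULL
        """
    )
    # Rows with no usable match are still NULL at this point.
    conn.execute("UPDATE netflix SET country = 'Not Given' WHERE country IS NULL")


fill_country_from_director(conn)
result = dict(conn.execute("SELECT show_id, country FROM netflix"))
print(result)
```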
    

    Only 10 of the more than 8000 rows have a NULL date_added, so deleting them will not materially affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix_clean
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  14. Manga Dataset (title/genre/rating)

    • kaggle.com
    zip
    Updated Jul 18, 2022
    Cite
    Clyde Melton (2022). Manga Dataset (title/genre/rating) [Dataset]. https://www.kaggle.com/datasets/clydemelton/manga-dataset
    Available download formats: zip (10027 bytes)
    Dataset updated
    Jul 18, 2022
    Authors
    Clyde Melton
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Data scraped from Mangakalot. I originally decided to create this dataset for use in a recommendation system for manga titles. Other datasets that I had found were either missing information that I wanted to use to build this system or contained too small a sample size to build what I deemed a useful product. This is also my first attempt at web scraping (I'm also fairly new to Python and data science), so I wanted to do a simple project at first to learn the basics. I hope it proves useful to someone.

  15. SW MNIST Small

    • kaggle.com
    zip
    Updated Sep 27, 2024
    + more versions
    Cite
    Hilary Abraham (2024). SW MNIST Small [Dataset]. https://www.kaggle.com/datasets/hilaryabraham/sw-mnist-small
    Available download formats: zip (274437 bytes)
    Dataset updated
    Sep 27, 2024
    Authors
    Hilary Abraham
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MNIST dataset in csv format

    From Joseph Redmon https://pjreddie.com/projects/mnist-in-csv/

    Adapted for Science World lesson

    Small version; for quick training & testing and low internet speeds

    Training dataset 100 or 1000 option

    Testing dataset 10 or 100 option
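
    For a quick look at files like these, one row of an MNIST CSV can be unpacked as sketched below. This assumes the usual "MNIST in CSV" layout of a label followed by 784 pixel values (0 to 255) per line, with no header row. Whether the Science World files keep exactly that layout is an assumption, and the sample line is synthetic.

```python
def parse_mnist_row(line):
    """Split one CSV line into (label, 28x28 list of pixel rows)."""
    values = [int(v) for v in line.strip().split(",")]
    if len(values) != 1 + 28 * 28:
        raise ValueError(f"expected 785 values, got {len(values)}")
    label, pixels = values[0], values[1:]
    image = [pixels[r * 28:(r + 1) * 28] for r in range(28)]
    return label, image


def normalize(image):
    """Scale 0..255 pixel values to the 0.0..1.0 range."""
    return [[p / 255.0 for p in row] for row in image]


# Synthetic line: label 7, all-black image except the last pixel.
sample_line = "7," + ",".join(["0"] * 783 + ["255"])
label, image = parse_mnist_row(sample_line)
print(label, len(image), len(image[0]), normalize(image)[27][27])
```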

  16. Chess data used to train (WCRA-project) CNN

    • kaggle.com
    zip
    Updated Jan 9, 2024
    Cite
    Mohamed Moataz Moustafa (2024). Chess data used to train (WCRA-project) CNN [Dataset]. https://www.kaggle.com/datasets/mohamedmoataz99/wcra-ai
    Available download formats: zip (584424096 bytes)
    Dataset updated
    Jan 9, 2024
    Authors
    Mohamed Moataz Moustafa
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset was used to train the CNN in my graduation project. The Wireless Chess Robotic Arm (WCRA-AI for short) is a robotic arm capable of playing chess (it can be controlled over the network as well). The CNN was used to read human moves from the physical board: it takes a picture of a single square as input and gives one of three outputs (empty square, white piece, black piece). We can use that to detect the human chess move, since the initial board state is known.
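
    The idea of recovering a move from per-square labels can be sketched as a diff between two occupancy grids. This is an illustration under stated assumptions, not the project's actual code: the grids are 8x8 lists of the three class labels, row 0 is taken to be Black's back rank, and only simple moves and captures are handled.

```python
EMPTY, WHITE, BLACK = "empty", "white", "black"


def start_board():
    """Occupancy grid for the initial position (row 0 = Black's back rank, by assumption)."""
    board = [[EMPTY] * 8 for _ in range(8)]
    for col in range(8):
        board[0][col] = board[1][col] = BLACK
        board[6][col] = board[7][col] = WHITE
    return board


def detect_move(before, after, mover):
    """Return ((from_row, from_col), (to_row, to_col)) for the side `mover`."""
    vacated, arrived = [], []
    for row in range(8):
        for col in range(8):
            old, new = before[row][col], after[row][col]
            if old == new:
                continue
            if old == mover and new == EMPTY:
                vacated.append((row, col))  # the mover's piece left this square
            elif new == mover:
                arrived.append((row, col))  # the mover's piece now stands here
    if len(vacated) != 1 or len(arrived) != 1:
        raise ValueError("not a simple move or capture (castling needs extra handling)")
    return vacated[0], arrived[0]


before = start_board()
after = [row[:] for row in before]
after[6][4], after[4][4] = EMPTY, WHITE  # a white piece advances two squares
move = detect_move(before, after, WHITE)
print(move)
```

    Because the grid only records colour, the piece type has to come from the tracked board state, which is why a known initial position matters.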

    All pictures were taken with a 5 MP Raspberry Pi camera with the legacy camera settings turned off.


  17. UFC Complete Dataset (All events 1996-2024)

    • kaggle.com
    zip
    Updated Mar 28, 2024
    Cite
    MaksBasher (2024). UFC Complete Dataset (All events 1996-2024) [Dataset]. https://www.kaggle.com/datasets/maksbasher/ufc-complete-dataset-all-events-1996-2024
    Explore at:
    Available download formats: zip (2149419 bytes)
    Dataset updated
    Mar 28, 2024
    Authors
    MaksBasher
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is my first public project; upvotes and suggestions are appreciated 😎🖤

    Project description

    The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company, considered the biggest promotion in the MMA world. Soon it will host an anniversary event, UFC 300, so it is interesting to see what path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. For that reason I decided to gather a new dataset that includes most of the stats useful for various data analysis tasks, and to put my theoretical skills into practice. I created a Python script to parse the ufcstats website and gather the available data.

    Currently 4 datasets are available

    Large dataset

    The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:

    • ML model for betting predictions
    • Data analysis to compare different years, weight classes, fighters, etc.
    • In-depth analysis of a specific fight or all fights of a selected fighter
    • Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Medium dataset

    Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it to get a basic understanding of UFC historical data and to produce different visualisations.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Small dataset

    Contains information about completed or upcoming events, with only 683 rows and 3 columns.

    Source code for the scraper that was used to create this dataset can be found in this notebook

    Fighter stats

    A dataset with the stats for every fighter who has fought at a UFC event.

  18. E-Commerce Data

    • kaggle.com
    zip
    Updated Aug 17, 2017
    Cite
    Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
    Explore at:
    Available download formats: zip (7548686 bytes)
    Dataset updated
    Aug 17, 2017
    Authors
    Carrie
    Description

    Context

    Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

    Content

    "This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

    Acknowledgements

    Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

    Image from stocksnap.io.

    Inspiration

    Analyses for this dataset could include time series, clustering, classification and more.

  19. small_movie_datasets

    • kaggle.com
    zip
    Updated Sep 5, 2025
    Cite
    Azram Afaq (2025). small_movie_datasets [Dataset]. https://www.kaggle.com/datasets/azramm/small-movie-datasets
    Explore at:
    Available download formats: zip (17000 bytes)
    Dataset updated
    Sep 5, 2025
    Authors
    Azram Afaq
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    IMDb Movies Dataset with Details

    This dataset contains detailed information about movies retrieved using the IMDb RapidAPI. It is designed for research, analysis, and machine learning applications such as recommendation systems, sentiment analysis, and revenue prediction.

    📂 Dataset Columns

    url → IMDb page link

    originalTitle → Original movie title

    type → Type of media (movie, short, TV, etc.)

    description → Short summary of the movie

    trailer → Official trailer link (if available)

    startYear → Year when the movie was first released or started production

    releaseDate → Official release date

    interests → Popularity metrics (e.g., how many people are following/interested)

    countriesOfOrigin → Country/countries where the movie was produced

    spokenLanguages → Languages spoken in the movie

    filmingLocations → Locations where the movie was filmed

    budget → Estimated budget (in USD)

    grossWorldwide → Worldwide gross revenue (in USD)

    genres → List of genres (Drama, Action, Romance, etc.)

    isAdult → 0 = not adult, 1 = adult content

    runtimeMinutes → Duration of the movie in minutes

    averageRating → IMDb rating (0–10)

    numVotes → Number of user votes on IMDb

    metascore → Metacritic score (0–100, if available)

    📊 Possible Use Cases

    Movie recommendation systems

    Predicting box office success

    Sentiment analysis on movie descriptions

    Analyzing trends in genres, budgets, and ratings over time
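
    As a small illustration of the genre and rating analysis, the sketch below averages `averageRating` per entry in `genres`, using the column names listed above. It assumes each row has already been loaded as a dict with `genres` as a list of strings; how the file actually encodes that list is not stated here, and the helper name `mean_rating_by_genre` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def mean_rating_by_genre(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Average IMDb rating per genre; rows without a rating are skipped."""
    totals = defaultdict(lambda: [0.0, 0])  # genre -> [rating sum, count]
    for row in rows:
        rating = row.get("averageRating")
        if rating is None:
            continue
        # A movie with several genres contributes to each of them.
        for genre in row.get("genres") or []:
            totals[genre][0] += float(rating)
            totals[genre][1] += 1
    return {genre: total / count for genre, (total, count) in totals.items()}


# Made-up rows for illustration only, not taken from the dataset.
example_rows = [
    {"originalTitle": "A", "genres": ["Drama"], "averageRating": 8.0},
    {"originalTitle": "B", "genres": ["Drama", "Action"], "averageRating": 6.0},
    {"originalTitle": "C", "genres": ["Action"], "averageRating": None},
]
example_means = mean_rating_by_genre(example_rows)
```

    The same grouping pattern extends to `budget` or `grossWorldwide` by swapping the aggregated column.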

    📌 Source

    Data collected using the IMDb RapidAPI

  20. Curated docks, stingers and plantain weeds

    • kaggle.com
    zip
    Updated Aug 18, 2020
    Cite
    Peter Windridge (2020). Curated docks, stingers and plantain weeds [Dataset]. https://www.kaggle.com/datasets/ptd006/curated-docks-stingers-and-plantain-weeds
    Explore at:
    Available download formats: zip (44509320 bytes)
    Dataset updated
    Aug 18, 2020
    Authors
    Peter Windridge
    Description

    Context

    Deep learning is already bringing massive benefits to farmers around the world. It has huge potential to cut monetary and environmental costs. However, as noted in the motivation for the OpenSprayer dock dataset (https://www.kaggle.com/gavinarmstrong/open-sprayer-images), there is a risk that large corporations and private equity run away with it.

    This data set is meant to be a playground for distinguishing dock weeds from greater plantain (which has a waxier texture and different ribbing in the leaves). There are also stingers, which should be easy to distinguish; buttercups will be added soon to increase the challenge :D. All photos are sifted for an aerial view covering around 5 cm to 50 cm square.

    A MobileNet with a dense classification head and dropout, using transfer learning over the last 60 layers from ImageNet weights, sensible augmentation, LR decrease on plateau, and early stopping, gets ~96.5% accuracy (with the supplied stratified test/train split).

    **Please show your custom architectures tuned to plants, fancy augmentations, LR schedules, relevant datasets for pretraining etc. :)**

    • What is a minimalist fast network that can hit 96% accuracy, and which layers/activations discriminate weed leaves well?
    • On the theme of AutoML - can the bigger plant datasets be used to design a better architecture?

    Sample construction

    A Python script similar to https://github.com/ptd006/WeedML/blob/master/label_tool.py splits photos into 224 px square boxes with a small overlap. These are presented to a human. Aerial view tiles showing leaves are kept and unsuitable tiles are skipped. An Xbox controller is used for speed.
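
    The tiling step can be sketched as follows. This is not the linked `label_tool.py`; it is a minimal illustration that computes the pixel boxes for 224 px tiles, with the 16 px overlap and the function name `tile_boxes` chosen here as assumptions (the description only says the overlap is small).

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int = 224, overlap: int = 16) -> List[Box]:
    """Boxes for square tiles covering an image, row by row.

    Neighbouring tiles share `overlap` pixels. Any strip at the right or
    bottom edge that cannot hold a full tile is left out.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if width < tile or height < tile:
        return []
    stride = tile - overlap
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y, x + tile, y + tile) for y in ys for x in xs]
```

    Each box could then be passed to an image library's crop routine and shown to the labeller, who keeps or skips the tile.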

    This Kaggle dataset features 1000 tiles selected at random from classes: docks, stingers and plantain weeds.

    Acknowledgements

    The photos from which the tiles are produced are originally from iNaturalist.org- thanks to everyone who contributed!

    Inspiration

    Various weed deep learning projects, e.g. https://www.kaggle.com/gavinarmstrong/open-sprayer-images https://github.com/AlexOlsen/DeepWeeds

    Concluding remarks

    I intend to release new datasets related to Agtech and also put more details on my website http://www.agrovate.co.uk/

    **If you add new images or do further labelling etc., please contribute back to the community!**
