Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Akshatha Aravind
Released under Apache 2.0
I've been creating videos on YouTube since November 2017 (https://www.youtube.com/c/KenJee1) with the mission of making data science accessible to more people. One of the best ways to do this is telling stories and working on projects. This is my attempt at my first community project. I am making my YouTube data available to everyone to help better understand the growth of my YouTube community and to think about ways it could be improved! I would love for everyone in the community to feel like they had some hand in contributing to the channel.
Announcement Video: https://youtu.be/YPph59-rTxA
I will be sharing my favorite projects in a few of my videos (with permission, of course), and would also like to give away a few small prizes to the top featured notebooks. I hope you have fun with the analysis; I'm interested in seeing what you find in the data!
For those looking for a place to start, some things I'm thinking about are:
- What are the themes of the comment data?
- What types of video titles and thumbnails drive the most traffic?
- Who is my core audience and what are they interested in?
- What types of videos have led to the most growth?
- What type of content are people engaging with the most or watching the longest?
Some advanced projects could be:
- Creating a chat bot to respond to common comments with videos where I have addressed a topic
- Pulling sentiment from thumbnails and titles and comparing that with performance
Data I would like to add over time:
- Video descriptions
- Video subtitles
- Actual video data
There are four files in this repo. The relevant data included in most of them spans Nov 2017 to Jan 2022. I gathered some of this data via the YouTube API and the rest from my channel analytics.
1) Aggregated Metrics By Video: This has all the topline metrics from my channel from its start (around 2015) to Jan 22, 2022. I didn't post my first video until around November 2017.
2) Aggregated Metrics By Video with Country and Subscriber Status: This has the same data as Aggregated Metrics By Video, but it includes dimensions for which country people are viewing from and whether the viewers are subscribed to the channel or not.
3) Video Performance Over Time: This has the daily data for each of my videos.
4) All Comments: This is all of my comment data gathered from the YouTube API. I have anonymized the users, so don't worry about your name showing up!
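As a starting point, here is a minimal sketch for loading and inspecting these files with pandas (the file names are placeholders, not the dataset's exact names):

```python
import pandas as pd

# File names are placeholders; match them to the actual CSVs in the dataset.
videos = pd.read_csv("Aggregated_Metrics_By_Video.csv")
daily = pd.read_csv("Video_Performance_Over_Time.csv")
comments = pd.read_csv("All_Comments.csv")

print(videos.shape, daily.shape, comments.shape)
print(videos.columns.tolist())  # inspect the available topline metrics
```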
This obviously wouldn't be possible without all of the wonderful people who watch and interact with my videos! I'm incredibly grateful for you all and I'm so happy I can share this project with you!
I collected this data from the YouTube API and through my own Google analytics. Thus, use of it must uphold the YouTube API Terms of Service: https://developers.google.com/youtube/terms/api-services-terms-of-service
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
A simple yet challenging project: predict the housing price based on factors like house area, number of bedrooms, furnishing status, proximity to the main road, etc. The dataset is small, yet its complexity arises from the fact that it has strong multicollinearity. Can you overcome these obstacles & build a decent predictive model?
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
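Since the stated challenge is multicollinearity, a hedged first step is to quantify it with variance inflation factors before modeling. In the sketch below, the file name and column names are assumptions about this dataset:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# File and column names are assumptions; substitute the dataset's actual numeric features.
df = pd.read_csv("Housing.csv")
X = df[["area", "bedrooms", "bathrooms", "stories"]].astype(float)

# A VIF well above ~5-10 signals strong multicollinearity for that feature.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```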
License: https://www.licenses.ai/ai-licenses
Tabular dataset for data analysis and machine learning practice. The dataset describes a market and is suitable for Power BI practice and data science work.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150,000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
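A minimal EDA sketch using the columns above (the file name is an assumption):

```python
import pandas as pd

# File name is an assumption; column names come from the description above.
df = pd.read_csv("toy_dataset.csv")

# Median income and illness rate per city.
print(df.groupby("City")["Income"].median())
print(df.groupby("City")["Illness"].apply(lambda s: (s == "Yes").mean()))
```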
Stock photo by Mika Baumeister on Unsplash.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The objective of this report is to analyze the role of small businesses in the Michigan job market using the provided dataset. We aim to understand the impact of small businesses on employment, sales, and other economic factors. This analysis will help in identifying trends and patterns that can inform policy decisions and support for small businesses.
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The Google Colab link to the project (a Jupyter notebook):
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
The columns in the dataset can be described as follows:
Model
The "Model" column includes terms that identify specific features or configurations of vehicles:
- 4WD/4X4: Four-wheel drive. A drive system where all four wheels receive power.
- AWD: All-wheel drive. Similar to 4WD but often with more complex mechanisms for power distribution.
- FFV: Flexible-fuel vehicle. Vehicles that can use multiple types of fuel, such as both gasoline and ethanol blends.
- SWB: Short wheelbase.
- LWB: Long wheelbase.
- EWB: Extended wheelbase.
Transmission
The "Transmission" column indicates the type of transmission system in the vehicle:
- A: Automatic. A transmission type that operates without the need for the driver to manually change gears.
- AM: Automated manual. A version of a manual transmission that is automated.
- AS: Automatic with select shift. An automatic transmission that allows for manual intervention.
- AV: Continuously variable. A transmission that uses continuously varying ratios instead of fixed gear ratios.
- M: Manual. A transmission type that requires the driver to manually change gears.
- 3-10: Number of gears in the transmission.
Fuel Type
The "Fuel Type" column specifies the type of fuel used by the vehicle:
- X: Regular gasoline.
- Z: Premium gasoline.
- D: Diesel.
- E: Ethanol (E85).
- N: Natural gas.
Vehicle Class
The "Vehicle Class" column categorizes vehicles by size and type:
- COMPACT: Smaller-sized vehicles.
- SUV - SMALL: Smaller-sized sports utility vehicles.
- MID-SIZE: Medium-sized vehicles.
- TWO-SEATER: Vehicles with two seats.
- MINICOMPACT: Very small-sized vehicles.
- SUBCOMPACT: Smaller than compact-sized vehicles.
- FULL-SIZE: Larger-sized vehicles.
- STATION WAGON - SMALL: Smaller-sized station wagons.
- SUV - STANDARD: Standard-sized sports utility vehicles.
- VAN - CARGO: Vans designed for cargo.
- VAN - PASSENGER: Vans designed for passenger transportation.
- PICKUP TRUCK - STANDARD: Standard-sized pickup trucks.
- MINIVAN: Smaller-sized vans.
- SPECIAL PURPOSE VEHICLE: Vehicles designed for special purposes.
- STATION WAGON - MID-SIZE: Mid-sized station wagons.
- PICKUP TRUCK - SMALL: Smaller-sized pickup trucks.
This dataset can be used to understand the fuel efficiency and environmental impact of vehicles. Machine learning models can use these features to predict CO2 emissions or perform analyses comparing the fuel consumption of different vehicles.
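As a hedged sketch of the CO2-emissions idea above, one could one-hot encode the coded categorical columns and fit a simple regressor. The file name and column names (including the target) below are assumptions about this dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# File and column names (including the target) are assumptions; adjust to the actual CSV.
df = pd.read_csv("fuel_consumption.csv")
X = df[["Vehicle Class", "Transmission", "Fuel Type"]]
y = df["CO2 Emissions (g/km)"]

# One-hot encode the coded categorical columns, then fit a linear baseline.
pre = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), list(X.columns))])
model = make_pipeline(pre, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```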
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is a synthetic yet realistic representation of personal auto insurance data, crafted using real-world statistics. While actual insurance data is sensitive and unavailable for public use, this dataset bridges the gap by offering a safe and practical alternative for building robust data science projects.
Why This Dataset?
- Realistic Foundation: Synthetic data generated from real-world statistical patterns ensures practical relevance.
- Safe for Use: No personal or sensitive information—completely anonymized and compliant with data privacy standards.
- Flexible Applications: Ideal for testing models, developing prototypes, and showcasing portfolio projects.
How You Can Use It:
- Build machine learning models for predicting customer conversion and retention.
- Design risk assessment tools or premium optimization algorithms.
- Create dashboards to visualize trends in customer segmentation and policy data.
- Explore innovative solutions for the insurance industry using a realistic data foundation.
This dataset empowers you to work on real-world insurance scenarios without compromising on data sensitivity.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset includes 2,100 entries of student-led entrepreneurship projects sourced from 40 academic institutions between 2019 and 2023. It was developed to aid in predictive modeling of the success rate of college startup initiatives using deep learning and machine learning approaches.
The dataset captures both structural and strategic elements that influence startup outcomes — such as funding, innovation, team dynamics, and support systems like mentorship and incubation. Each project is labeled as either successful (1) or not successful (0) based on a calculated success metric derived from multiple weighted inputs.
This data can be used to train classification models, perform feature analysis, and build intelligent recommendation systems to support innovation incubators, educational policymakers, and student entrepreneurs.
The current dataset primarily captures internal project-specific factors such as team experience, innovation score, funding, mentorship, and incubation support. However, it does not include broader environmental variables, such as macroeconomic indicators (e.g., industry growth rates, regional investment trends) or regional factors (e.g., resource availability in large vs. small cities). These external factors can significantly influence startup success. To enhance the dataset’s robustness, future work can integrate supplementary environmental variables using publicly available data sources, such as regional economic indicators, startup density, proximity to innovation hubs, and local infrastructure quality. Incorporating these variables will enable the predictive model to account for both internal and external determinants of success, thereby improving its accuracy, generalizability, and practical applicability for diverse institutional and regional contexts.
Key Features:
- project_id: Unique identifier for each project
- institution_name: Name of the college or university
- institution_type: Type of institution (Public, Private, Technical, Non-technical)
- project_domain: Startup domain (e.g., HealthTech, EdTech, AgriTech)
- team_size: Number of students in the team
- avg_team_experience: Average prior experience of the team members (in years)
- innovation_score: Normalized score reflecting novelty and originality of the project
- funding_amount_usd: Initial funding received by the project in USD
- mentorship_support: Whether the team received mentorship (1 = Yes, 0 = No)
- incubation_support: Whether the project was incubated (1 = Yes, 0 = No)
- market_readiness_level: Readiness scale from idea (1) to market-ready (5)
- competition_awards: Number of awards won in competitions
- business_model_score: Score representing clarity and scalability of the business model (0 to 1)
- technology_maturity: Maturity level of the tech used (1 = prototype, 5 = production ready)
- year: Year the project was submitted
- success_label: Target variable: 1 = Successful, 0 = Not successful
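A hedged baseline for the classification task, using the feature names from the list above (the file name is an assumption):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# File name is an assumption; feature and target names come from the list above.
df = pd.read_csv("student_startups.csv")
features = ["team_size", "avg_team_experience", "innovation_score",
            "funding_amount_usd", "mentorship_support", "incubation_support",
            "market_readiness_level", "competition_awards",
            "business_model_score", "technology_maturity"]
X, y = df[features], df["success_label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
print(sorted(zip(clf.feature_importances_, features), reverse=True)[:5])  # top features
```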
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
GitHub is the leader in hosting open source projects. For those who are not familiar: in an open source project, a group of developers shares and contributes to common code to develop software. Example open source projects include Chromium (which makes Google Chrome), WordPress, and Hadoop. Open source projects are said to have disrupted the software industry (2008 Kansas Keynote).
For this study, I crawled GitHub.com, the leader in hosting open source projects, and extracted a list of the top-starred open source projects. On GitHub, a user may choose to star a repository, indicating that they “like” the project. For each project, I gathered the user or organization the repository resides in, the repository name, a description, the last updated date, the language of the project, the number of stars, any tags, and finally the URL of the project.
This data wouldn't be available if it weren't for GitHub. An example micro-study can be found at The Concept Center
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns
Extra steps and more explanation on the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
       COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
       COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
       COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
       COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
       COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
       COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
       COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
       COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
       COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
       COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
       COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The nulls in the director column are about 30% of the whole column, so I will not delete them; I will instead find another column to populate it from. To populate the director column, we want to find out if there is a relationship between the movie_cast and director columns.
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records:
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to the director column, we are going to populate the nulls in country using the director column.
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--To confirm if there are still directors linked to country that refuse to update
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL in country as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
The date_added column has only 10 nulls out of over 8000 rows, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE date_added IS NULL;
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Data scraped from Mangakalot. I had originally decided to create this dataset for use in a recommendation system for manga titles. Other datasets I had found were either missing information that I wanted to use to build this system or contained too small a sample size to build what I deemed a useful product. This is also my first attempt at web scraping (I'm fairly new to Python and data science), so I wanted to start with a simple project to learn the basics. I hope it proves useful to someone.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
MNIST dataset in CSV format, from Joseph Redmon: https://pjreddie.com/projects/mnist-in-csv/
Adapted for a Science World lesson.
Small version, for quick training & testing and low internet speeds:
- Training dataset: 100 or 1000 row options
- Testing dataset: 10 or 100 row options
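In Redmon's CSV format, each row is a label followed by the 784 pixel values of a 28x28 image. A minimal loading sketch (the file name is an assumption; adjust to the files in this dataset):

```python
import numpy as np

# File name is an assumption; each row is: label, then 784 pixel values (0-255).
data = np.loadtxt("mnist_train_100.csv", delimiter=",")
labels = data[:, 0].astype(int)
images = data[:, 1:].reshape(-1, 28, 28) / 255.0  # normalize pixels to [0, 1]
print(labels[:10], images.shape)
```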
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset was used to train the CNN used in my graduation project. Wireless Chess Robotic Arm (WCRA-AI for short) is a robotic arm capable of playing chess (it can be controlled over the network as well). The CNN was used to get human moves from the physical board: it takes a picture of a single square as input and gives one of three outputs (empty square, white piece, black piece). We can use that to detect the human chess move, since the initial board state is known.
All pictures were taken with a 5mp Raspberry Pi camera with the legacy camera settings turned off.
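A minimal sketch of a three-class CNN of the kind described above; the directory layout, image size, and architecture are assumptions, not the project's exact setup:

```python
import tensorflow as tf

# Hypothetical layout: squares/{empty,white,black}/*.jpg; image size is an assumption.
train = tf.keras.utils.image_dataset_from_directory(
    "squares", image_size=(64, 64), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # empty / white / black
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train, epochs=5)
```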
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This is my first public project; upvotes and suggestions are appreciated 😎🖤
The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company, considered the biggest promotion in the MMA world. Soon they will host an anniversary event, UFC 300. It is interesting to see what path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. For that reason I've decided to gather a new dataset, which includes most of the useful stats for various data analysis tasks, and to put my theoretical skills into practice. I've created a Python script to parse the ufcstats website and gather the available data.
Currently, 4 datasets are available:
The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:
- ML model for betting predictions;
- Data analysis to compare different years, weight classes, fighters, etc.;
- In-depth analysis of a specific fight or all fights of a selected fighter;
- Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc.
Source code for the scraper that was used to create this dataset can be found in this notebook
Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it to get a basic understanding of UFC historical data and perform different visualisations.
Source code for the scraper that was used to create this dataset can be found in this notebook
Contains information about completed or upcoming events, with only 683 rows and 3 columns.
Source code for the scraper that was used to create this dataset can be found in this notebook
A dataset with the stats for every fighter who has fought at a UFC event.
Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Image from stocksnap.io.
Analyses for this dataset could include time series, clustering, classification and more.
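For example, a time-series starting point: monthly revenue from the standard Online Retail columns (InvoiceDate, Quantity, UnitPrice). The file name below matches the UCI distribution but may differ here:

```python
import pandas as pd

# File name may differ; InvoiceDate, Quantity, and UnitPrice are standard
# columns in the UCI "Online Retail" data.
df = pd.read_excel("Online Retail.xlsx", parse_dates=["InvoiceDate"])
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# Monthly revenue time series, Dec 2010 - Dec 2011.
monthly = df.set_index("InvoiceDate")["Revenue"].resample("MS").sum()
print(monthly)
```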
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
IMDb Movies Dataset with Details
This dataset contains detailed information about movies retrieved using the IMDb RapidAPI. It is designed for research, analysis, and machine learning applications such as recommendation systems, sentiment analysis, and revenue prediction.
📂 Dataset Columns
url → IMDb page link
originalTitle → Original movie title
type → Type of media (movie, short, TV, etc.)
description → Short summary of the movie
trailer → Official trailer link (if available)
startYear → Year when the movie was first released or started production
releaseDate → Official release date
interests → Popularity metrics (e.g., how many people are following/interested)
countriesOfOrigin → Country/countries where the movie was produced
spokenLanguages → Languages spoken in the movie
filmingLocations → Locations where the movie was filmed
budget → Estimated budget (in USD)
grossWorldwide → Worldwide gross revenue (in USD)
genres → List of genres (Drama, Action, Romance, etc.)
isAdult → 0 = not adult, 1 = adult content
runtimeMinutes → Duration of the movie in minutes
averageRating → IMDb rating (0–10)
numVotes → Number of user votes on IMDb
metascore → Metacritic score (0–100, if available)
📊 Possible Use Cases
Movie recommendation systems
Predicting box office success
Sentiment analysis on movie descriptions
Analyzing trends in genres, budgets, and ratings over time
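A hedged sketch of the trends idea above, using columns from the list in this description (the CSV name is an assumption):

```python
import pandas as pd

# CSV name is an assumption; column names are taken from the list above.
df = pd.read_csv("imdb_movies.csv")

# Average rating and median budget by start year.
by_year = df.groupby("startYear").agg(
    avg_rating=("averageRating", "mean"),
    median_budget=("budget", "median"),
)
print(by_year.tail(10))
```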
📌 Source
Data collected using the IMDb RapidAPI
Deep learning is already bringing massive benefits to farmers around the world. It has huge potential to cut monetary and environmental costs. However, as in the motivation for the OpenSprayer dock dataset (https://www.kaggle.com/gavinarmstrong/open-sprayer-images), there is a risk that large corporations and private equity run away with it.
This data set is intended as a playground for dock weeds vs greater plantain (which has a waxier texture and different ribbing in the leaves). There are also stingers (which should be easy to distinguish; buttercups will be added soon to increase the challenge :D). All photos are sifted for aerial view, covering around 5 cm to 50 cm square.
MobileNet with a dense classification head and dropout, transfer learning the last 60 layers from ImageNet, sensible augmentation, LR decrease on plateau, and early stopping gets ~96.5% accuracy (with the supplied stratified test/train split).
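A rough sketch of that baseline; the paths, image size, and hyperparameters below are assumptions, not the exact setup:

```python
import tensorflow as tf

# Rough sketch of the MobileNet baseline; hyperparameters are assumptions.
base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")
for layer in base.layers[:-60]:  # fine-tune only the last 60 layers
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),  # docks / stingers / plantain
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(patience=3),  # LR decrease on plateau
    tf.keras.callbacks.EarlyStopping(patience=6, restore_best_weights=True),
]
# Assumed directory layout with one folder per class:
# train = tf.keras.utils.image_dataset_from_directory("train", image_size=(224, 224))
# model.fit(train, epochs=50, callbacks=callbacks)
```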
**Please show your custom architectures tuned to plants, fancy augmentations, LR schedules, relevant datasets for pretraining, etc. :)**
A Python script similar to https://github.com/ptd006/WeedML/blob/master/label_tool.py splits photos into 224 px square boxes with small overlap. These are presented to a human; aerial-view tiles showing leaves are kept and unsuitable tiles are skipped. An Xbox controller is used for speed.
This Kaggle dataset features 1000 tiles selected at random from the classes docks, stingers and plantain weeds.
The photos from which the tiles are produced originally come from iNaturalist.org. Thanks to everyone who contributed!
Various weed deep learning projects, e.g. https://www.kaggle.com/gavinarmstrong/open-sprayer-images https://github.com/AlexOlsen/DeepWeeds
I intend to release new datasets related to Agtech and also put more details on my website http://www.agrovate.co.uk/
** If you add new images or do further labelling etc please contribute back to the community! **