Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset was created by Akshatha Aravind
Released under Apache 2.0
I've been creating videos on YouTube since November 2017 (https://www.youtube.com/c/KenJee1) with the mission of making data science accessible to more people. One of the best ways to do this is telling stories and working on projects. This is my attempt at my first community project. I am making my YouTube data available to everyone to help better understand the growth of my YouTube community and to think about ways it could be improved! I would love for everyone in the community to feel like they had some hand in contributing to the channel.
Announcement Video: https://youtu.be/YPph59-rTxA
I will be sharing my favorite projects in a few of my videos (with permission, of course), and would also like to give away a few small prizes to the top featured notebooks. I hope you have fun with the analysis; I'm interested in seeing what you find in the data!
For those looking for a place to start, some things I'm thinking about are:
- What are the themes of the comment data?
- What types of video titles and thumbnails drive the most traffic?
- Who is my core audience and what are they interested in?
- What types of videos have led to the most growth?
- What type of content are people engaging with the most or watching the longest?
Some advanced projects could be:
- Creating a chat bot to respond to common comments with videos where I have addressed a topic
- Pulling sentiment from thumbnails and titles and comparing that with performance
Data I would like to add over time:
- Video descriptions
- Video subtitles
- Actual video data
There are four files in this repo. The relevant data included in most of them spans Nov 2017 to Jan 2022. I gathered some of this data via the YouTube API and the rest from my channel analytics.
1) Aggregated Metrics By Video: This has all the topline metrics from my channel from its start (around 2015) to Jan 22, 2022. I didn't post my first video until around November 2017.
2) Aggregated Metrics By Video with Country and Subscriber Status: This has the same data as Aggregated Metrics By Video, but it includes dimensions for which country people are viewing from and whether the viewers are subscribed to the channel or not.
3) Video Performance Over Time: This has the daily data for each of my videos.
4) All Comments: This is all of my comment data gathered from the YouTube API. I have anonymized the users, so don't worry about your name showing up!
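As a starting point, here is a minimal sketch for loading and inspecting these files with pandas (the file names are placeholders, not the dataset's exact names):

```python
import pandas as pd

# File names are placeholders; match them to the actual CSVs in the dataset.
videos = pd.read_csv("Aggregated_Metrics_By_Video.csv")
daily = pd.read_csv("Video_Performance_Over_Time.csv")
comments = pd.read_csv("All_Comments.csv")

print(videos.shape, daily.shape, comments.shape)
print(videos.columns.tolist())  # inspect the available topline metrics
```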
This obviously wouldn't be possible without all of the wonderful people who watch and interact with my videos! I'm incredibly grateful for you all and I'm so happy I can share this project with you!
I collected this data from the YouTube API and through my own Google analytics. Thus, use of it must uphold the YouTube API Terms of Service: https://developers.google.com/youtube/terms/api-services-terms-of-service
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
A simple yet challenging project: predict the housing price based on factors like house area, number of bedrooms, furnishing status, proximity to the main road, etc. The dataset is small, yet its complexity arises from the fact that it has strong multicollinearity. Can you overcome these obstacles & build a decent predictive model?
Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
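Since the stated challenge is multicollinearity, a hedged first step is to quantify it with variance inflation factors before modeling. In the sketch below, the file name and column names are assumptions about this dataset:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# File and column names are assumptions; substitute the dataset's actual numeric features.
df = pd.read_csv("Housing.csv")
X = df[["area", "bedrooms", "bathrooms", "stories"]].astype(float)

# A VIF well above ~5-10 signals strong multicollinearity for that feature.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```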
License: https://www.licenses.ai/ai-licenses
Tabular dataset for data analysis and machine learning practice. The dataset describes a market and is suitable for Power BI practice and data science work.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150,000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
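A minimal EDA sketch using the columns above (the file name is an assumption):

```python
import pandas as pd

# File name is an assumption; column names come from the description above.
df = pd.read_csv("toy_dataset.csv")

# Median income and illness rate per city.
print(df.groupby("City")["Income"].median())
print(df.groupby("City")["Illness"].apply(lambda s: (s == "Yes").mean()))
```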
Stock photo by Mika Baumeister on Unsplash.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
The objective of this report is to analyze the role of small businesses in the Michigan job market using the provided dataset. We aim to understand the impact of small businesses on employment, sales, and other economic factors. This analysis will help in identifying trends and patterns that can inform policy decisions and support for small businesses.
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The Google Colab link to the project (a Jupyter notebook):
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The GitHub repository of the project:
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the project:
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn
The columns in the dataset can be described as follows:
Model
The "Model" column includes terms that identify specific features or configurations of vehicles:
- 4WD/4X4: Four-wheel drive. A drive system where all four wheels receive power.
- AWD: All-wheel drive. Similar to 4WD but often with more complex mechanisms for power distribution.
- FFV: Flexible-fuel vehicle. Vehicles that can use multiple types of fuel, such as both gasoline and ethanol blends.
- SWB: Short wheelbase.
- LWB: Long wheelbase.
- EWB: Extended wheelbase.
Transmission
The "Transmission" column indicates the type of transmission system in the vehicle:
- A: Automatic. A transmission type that operates without the need for the driver to manually change gears.
- AM: Automated manual. A version of a manual transmission that is automated.
- AS: Automatic with select shift. An automatic transmission that allows for manual intervention.
- AV: Continuously variable. A transmission that uses continuously varying ratios instead of fixed gear ratios.
- M: Manual. A transmission type that requires the driver to manually change gears.
- 3-10: Number of gears in the transmission.
Fuel Type
The "Fuel Type" column specifies the type of fuel used by the vehicle:
- X: Regular gasoline.
- Z: Premium gasoline.
- D: Diesel.
- E: Ethanol (E85).
- N: Natural gas.
Vehicle Class
The "Vehicle Class" column categorizes vehicles by size and type:
- COMPACT: Smaller-sized vehicles.
- SUV - SMALL: Smaller-sized sports utility vehicles.
- MID-SIZE: Medium-sized vehicles.
- TWO-SEATER: Vehicles with two seats.
- MINICOMPACT: Very small-sized vehicles.
- SUBCOMPACT: Smaller than compact-sized vehicles.
- FULL-SIZE: Larger-sized vehicles.
- STATION WAGON - SMALL: Smaller-sized station wagons.
- SUV - STANDARD: Standard-sized sports utility vehicles.
- VAN - CARGO: Vans designed for cargo.
- VAN - PASSENGER: Vans designed for passenger transportation.
- PICKUP TRUCK - STANDARD: Standard-sized pickup trucks.
- MINIVAN: Smaller-sized vans.
- SPECIAL PURPOSE VEHICLE: Vehicles designed for special purposes.
- STATION WAGON - MID-SIZE: Mid-sized station wagons.
- PICKUP TRUCK - SMALL: Smaller-sized pickup trucks.
This dataset can be used to understand the fuel efficiency and environmental impact of vehicles. Machine learning models can use these features to predict CO2 emissions or perform analyses comparing the fuel consumption of different vehicles.
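As a hedged sketch of the CO2-emissions idea above, one could one-hot encode the coded categorical columns and fit a simple regressor. The file name and column names (including the target) below are assumptions about this dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# File and column names (including the target) are assumptions; adjust to the actual CSV.
df = pd.read_csv("fuel_consumption.csv")
X = df[["Vehicle Class", "Transmission", "Fuel Type"]]
y = df["CO2 Emissions (g/km)"]

# One-hot encode the coded categorical columns, then fit a linear baseline.
pre = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), list(X.columns))])
model = make_pipeline(pre, LinearRegression())

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```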
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is a synthetic yet realistic representation of personal auto insurance data, crafted using real-world statistics. While actual insurance data is sensitive and unavailable for public use, this dataset bridges the gap by offering a safe and practical alternative for building robust data science projects.
Why This Dataset?
- Realistic Foundation: Synthetic data generated from real-world statistical patterns ensures practical relevance.
- Safe for Use: No personal or sensitive information—completely anonymized and compliant with data privacy standards.
- Flexible Applications: Ideal for testing models, developing prototypes, and showcasing portfolio projects.
How You Can Use It:
- Build machine learning models for predicting customer conversion and retention.
- Design risk assessment tools or premium optimization algorithms.
- Create dashboards to visualize trends in customer segmentation and policy data.
- Explore innovative solutions for the insurance industry using a realistic data foundation.
This dataset empowers you to work on real-world insurance scenarios without compromising on data sensitivity.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset includes 2,100 entries of student-led entrepreneurship projects sourced from 40 academic institutions between 2019 and 2023. It was developed to aid in predictive modeling of the success rate of college startup initiatives using deep learning and machine learning approaches.
The dataset captures both structural and strategic elements that influence startup outcomes — such as funding, innovation, team dynamics, and support systems like mentorship and incubation. Each project is labeled as either successful (1) or not successful (0) based on a calculated success metric derived from multiple weighted inputs.
This data can be used to train classification models, perform feature analysis, and build intelligent recommendation systems to support innovation incubators, educational policymakers, and student entrepreneurs.
The current dataset primarily captures internal project-specific factors such as team experience, innovation score, funding, mentorship, and incubation support. However, it does not include broader environmental variables, such as macroeconomic indicators (e.g., industry growth rates, regional investment trends) or regional factors (e.g., resource availability in large vs. small cities). These external factors can significantly influence startup success. To enhance the dataset’s robustness, future work can integrate supplementary environmental variables using publicly available data sources, such as regional economic indicators, startup density, proximity to innovation hubs, and local infrastructure quality. Incorporating these variables will enable the predictive model to account for both internal and external determinants of success, thereby improving its accuracy, generalizability, and practical applicability for diverse institutional and regional contexts.
Key Features:
- project_id: Unique identifier for each project
- institution_name: Name of the college or university
- institution_type: Type of institution (Public, Private, Technical, Non-technical)
- project_domain: Startup domain (e.g., HealthTech, EdTech, AgriTech)
- team_size: Number of students in the team
- avg_team_experience: Average prior experience of the team members (in years)
- innovation_score: Normalized score reflecting novelty and originality of the project
- funding_amount_usd: Initial funding received by the project in USD
- mentorship_support: Whether the team received mentorship (1 = Yes, 0 = No)
- incubation_support: Whether the project was incubated (1 = Yes, 0 = No)
- market_readiness_level: Readiness scale from idea (1) to market-ready (5)
- competition_awards: Number of awards won in competitions
- business_model_score: Score representing clarity and scalability of the business model (0 to 1)
- technology_maturity: Maturity level of the tech used (1 = prototype, 5 = production ready)
- year: Year the project was submitted
- success_label: Target variable: 1 = Successful, 0 = Not successful
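A hedged baseline for the classification task, using the feature names from the list above (the file name is an assumption):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# File name is an assumption; feature and target names come from the list above.
df = pd.read_csv("student_startups.csv")
features = ["team_size", "avg_team_experience", "innovation_score",
            "funding_amount_usd", "mentorship_support", "incubation_support",
            "market_readiness_level", "competition_awards",
            "business_model_score", "technology_maturity"]
X, y = df[features], df["success_label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("Accuracy:", clf.score(X_test, y_test))
print(sorted(zip(clf.feature_importances_, features), reverse=True)[:5])  # top features
```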
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
GitHub is the leader in hosting open source projects. For those who are not familiar: in an open source project, a group of developers shares and contributes to common code to develop software. Example open source projects include Chromium (which makes Google Chrome), WordPress, and Hadoop. Open source projects are said to have disrupted the software industry (2008 Kansas Keynote).
For this study, I crawled GitHub.com, the leader in hosting open source projects, and extracted a list of the top-starred open source projects. On GitHub, a user may choose to star a repository, indicating that they “like” the project. For each project, I gathered the user or organization the repository resides in, the repository name, a description, the last updated date, the language of the project, the number of stars, any tags, and finally the URL of the project.
This data wouldn't be available if it weren't for GitHub. An example micro-study can be found at The Concept Center
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns
Extra steps and more explanation on the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
       COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
       COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
       COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
       COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
       COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
       COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
       COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
       COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
       COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
       COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
       COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The nulls in the director column are about 30% of the whole column, so I will not delete them; I will instead find another column to populate it from. To populate the director column, we want to find out if there is a relationship between the movie_cast and director columns.
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records:
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to the director column, we are going to populate the nulls in country using the director column.
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--To confirm if there are still directors linked to country that refuse to update
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL in country as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
The date_added column has only 10 nulls out of over 8000 rows, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE date_added IS NULL;
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Data scraped from Mangakalot. I had originally decided to create this dataset for use in a recommendation system for manga titles. Other datasets I had found were either missing information that I wanted to use to build this system or contained too small a sample size to build what I deemed a useful product. This is also my first attempt at web scraping (I'm fairly new to Python and data science), so I wanted to start with a simple project to learn the basics. I hope it proves useful to someone.
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
MNIST dataset in CSV format, from Joseph Redmon: https://pjreddie.com/projects/mnist-in-csv/
Adapted for a Science World lesson.
Small version, for quick training & testing and low internet speeds:
- Training dataset: 100 or 1000 row options
- Testing dataset: 10 or 100 row options
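In Redmon's CSV format, each row is a label followed by the 784 pixel values of a 28x28 image. A minimal loading sketch (the file name is an assumption; adjust to the files in this dataset):

```python
import numpy as np

# File name is an assumption; each row is: label, then 784 pixel values (0-255).
data = np.loadtxt("mnist_train_100.csv", delimiter=",")
labels = data[:, 0].astype(int)
images = data[:, 1:].reshape(-1, 28, 28) / 255.0  # normalize pixels to [0, 1]
print(labels[:10], images.shape)
```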
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset was used to train the CNN used in my graduation project. Wireless Chess Robotic Arm (WCRA-AI for short) is a robotic arm capable of playing chess (it can be controlled over the network as well). The CNN was used to get human moves from the physical board: it takes a picture of a single square as input and gives one of three outputs (empty square, white piece, black piece). We can use that to detect the human chess move, since the initial board state is known.
All pictures were taken with a 5mp Raspberry Pi camera with the legacy camera settings turned off.
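A minimal sketch of a three-class CNN of the kind described above; the directory layout, image size, and architecture are assumptions, not the project's exact setup:

```python
import tensorflow as tf

# Hypothetical layout: squares/{empty,white,black}/*.jpg; image size is an assumption.
train = tf.keras.utils.image_dataset_from_directory(
    "squares", image_size=(64, 64), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # empty / white / black
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train, epochs=5)
```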
CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
This is my first public project; upvotes and suggestions are appreciated 😎🖤
The UFC (Ultimate Fighting Championship) is an American mixed martial arts promotion company, considered the biggest promotion in the MMA world. Soon they will host an anniversary event, UFC 300. It is interesting to see what path the promotion has taken from 1996 to this day. There are UFC datasets available on Kaggle, but all of them are outdated. For that reason I've decided to gather a new dataset, which includes most of the useful stats for various data analysis tasks, and to put my theoretical skills into practice. I've created a Python script to parse the ufcstats website and gather the available data.
Currently, 4 datasets are available:
The biggest dataset yet, with over 7000 rows and 95 different features to explore. Some ideas for projects with this dataset:
- ML model for betting predictions;
- Data analysis to compare different years, weight classes, fighters, etc.;
- In-depth analysis of a specific fight or all fights of a selected fighter;
- Visualisation of average stats (strikes, takedowns, subs) per weight class, gender, year, etc.
Source code for the scraper that was used to create this dataset can be found in this notebook
Medium dataset for some basic tasks (contains 7582 rows and 19 columns). You can use it to get a basic understanding of UFC historical data and perform different visualisations.
Source code for the scraper that was used to create this dataset can be found in this notebook
Contains information about completed or upcoming events, with only 683 rows and 3 columns.
Source code for the scraper that was used to create this dataset can be found in this notebook
A dataset with the stats for every fighter who has fought at a UFC event.
Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Image from stocksnap.io.
Analyses for this dataset could include time series, clustering, classification and more.
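For example, a time-series starting point: monthly revenue from the standard Online Retail columns (InvoiceDate, Quantity, UnitPrice). The file name below matches the UCI distribution but may differ here:

```python
import pandas as pd

# File name may differ; InvoiceDate, Quantity, and UnitPrice are standard
# columns in the UCI "Online Retail" data.
df = pd.read_excel("Online Retail.xlsx", parse_dates=["InvoiceDate"])
df["Revenue"] = df["Quantity"] * df["UnitPrice"]

# Monthly revenue time series, Dec 2010 - Dec 2011.
monthly = df.set_index("InvoiceDate")["Revenue"].resample("MS").sum()
print(monthly)
```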
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
IMDb Movies Dataset with Details
This dataset contains detailed information about movies retrieved using the IMDb RapidAPI. It is designed for research, analysis, and machine learning applications such as recommendation systems, sentiment analysis, and revenue prediction.
📂 Dataset Columns
url → IMDb page link
originalTitle → Original movie title
type → Type of media (movie, short, TV, etc.)
description → Short summary of the movie
trailer → Official trailer link (if available)
startYear → Year when the movie was first released or started production
releaseDate → Official release date
interests → Popularity metrics (e.g., how many people are following/interested)
countriesOfOrigin → Country/countries where the movie was produced
spokenLanguages → Languages spoken in the movie
filmingLocations → Locations where the movie was filmed
budget → Estimated budget (in USD)
grossWorldwide → Worldwide gross revenue (in USD)
genres → List of genres (Drama, Action, Romance, etc.)
isAdult → 0 = not adult, 1 = adult content
runtimeMinutes → Duration of the movie in minutes
averageRating → IMDb rating (0–10)
numVotes → Number of user votes on IMDb
metascore → Metacritic score (0–100, if available)
📊 Possible Use Cases
Movie recommendation systems
Predicting box office success
Sentiment analysis on movie descriptions
Analyzing trends in genres, budgets, and ratings over time
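A hedged sketch of the trends idea above, using columns from the list in this description (the CSV name is an assumption):

```python
import pandas as pd

# CSV name is an assumption; column names are taken from the list above.
df = pd.read_csv("imdb_movies.csv")

# Average rating and median budget by start year.
by_year = df.groupby("startYear").agg(
    avg_rating=("averageRating", "mean"),
    median_budget=("budget", "median"),
)
print(by_year.tail(10))
```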
📌 Source
Data collected using the IMDb RapidAPI
Deep learning is already bringing massive benefits to farmers around the world. It has huge potential to cut monetary and environmental costs. However, as in the motivation for the OpenSprayer dock dataset (https://www.kaggle.com/gavinarmstrong/open-sprayer-images), there is a risk that large corporations and private equity run away with it.
This data set is intended as a playground for dock weeds vs greater plantain (which has a waxier texture and different ribbing in the leaves). There are also stingers (which should be easy to distinguish; buttercups will be added soon to increase the challenge :D). All photos are sifted for aerial view, covering around 5 cm to 50 cm square.
MobileNet with a dense classification head and dropout, transfer learning the last 60 layers from ImageNet, sensible augmentation, LR decrease on plateau, and early stopping gets ~96.5% accuracy (with the supplied stratified test/train split).
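A rough sketch of that baseline; the paths, image size, and hyperparameters below are assumptions, not the exact setup:

```python
import tensorflow as tf

# Rough sketch of the MobileNet baseline; hyperparameters are assumptions.
base = tf.keras.applications.MobileNet(include_top=False, weights="imagenet",
                                       input_shape=(224, 224, 3), pooling="avg")
for layer in base.layers[:-60]:  # fine-tune only the last 60 layers
    layer.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),  # docks / stingers / plantain
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(patience=3),  # LR decrease on plateau
    tf.keras.callbacks.EarlyStopping(patience=6, restore_best_weights=True),
]
# Assumed directory layout with one folder per class:
# train = tf.keras.utils.image_dataset_from_directory("train", image_size=(224, 224))
# model.fit(train, epochs=50, callbacks=callbacks)
```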
**Please show your custom architectures tuned to plants, fancy augmentations, LR schedules, relevant datasets for pretraining, etc. :)**
A Python script similar to https://github.com/ptd006/WeedML/blob/master/label_tool.py splits photos into 224 px square boxes with small overlap. These are presented to a human; aerial-view tiles showing leaves are kept and unsuitable tiles are skipped. An Xbox controller is used for speed.
This Kaggle dataset features 1000 tiles selected at random from the classes docks, stingers and plantain weeds.
The photos from which the tiles are produced originally come from iNaturalist.org. Thanks to everyone who contributed!
Various weed deep learning projects, e.g. https://www.kaggle.com/gavinarmstrong/open-sprayer-images https://github.com/AlexOlsen/DeepWeeds
I intend to release new datasets related to Agtech and also put more details on my website http://www.agrovate.co.uk/
** If you add new images or do further labelling etc please contribute back to the community! **