Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dive into the world of data manipulation with pandas, the powerful Python library for data analysis. This series of exercises is designed for beginners who are eager to learn how to use pandas for data wrangling tasks. Each exercise will cover a different aspect of pandas, from loading and exploring datasets to manipulating data and performing basic analysis. Whether you're new to programming or just getting started with pandas, these exercises will help you build a solid foundation in data wrangling skills. Join us on this exciting journey and unleash the power of pandas!
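A small warm-up in that spirit: load a dataset, explore it, and do a first pass of wrangling. The file name is just a placeholder; substitute any CSV you have at hand.

```python
import pandas as pd

# Load a CSV (placeholder file name -- use your own dataset).
df = pd.read_csv("your_dataset.csv")

print(df.head())        # first rows
df.info()               # column types and missing values (prints directly)
print(df.describe())    # basic numeric summary

# A first taste of wrangling: drop duplicates and fill numeric gaps with column means.
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```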
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038.
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037.
Number of Instances: 398. Number of Attributes: 9, including the class attribute.
Attribute Information:
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is riskier (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts:
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
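A minimal sketch of these steps in pandas/SciPy, assuming the standard 26 imports-85 column names from the UCI documentation (the raw file has no header row):

```python
import pandas as pd
from scipy import stats

# Column names follow the UCI imports-85 documentation; the raw file has no header row.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
cols = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
        "num-of-doors", "body-style", "drive-wheels", "engine-location",
        "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
        "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
        "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, names=cols, na_values="?")

# Descriptive statistics
print(df.describe())

# Groupby: mean price per body style
print(df.groupby("body-style")["price"].mean())

# Analysis of variance: does price differ across drive-wheel types?
groups = [g["price"].dropna() for _, g in df.groupby("drive-wheels")]
print(stats.f_oneway(*groups))

# Correlation of the numeric features with price
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False))
```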
Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I decided to open-source the final enriched data set in order to give others a head start in their COVID-19-related analytic, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and in totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (with the fips code column).
See the column descriptions for more details on the dataset
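A hedged sketch of the fips-based join mentioned above; the file names and column labels are assumptions for illustration, not the published schema.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual dataset and the table you join in.
covid = pd.read_csv("us_county_covid19_enriched.csv", dtype={"fips": str})
other = pd.read_csv("county_unemployment.csv", dtype={"fips": str})

# Join any other county-level table on the fips code column.
merged = covid.merge(other, on="fips", how="left")

# Example per-capita style calculation once population is available in the joined frame.
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000
```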
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-2019 has been recognized as a global threat, and several studies are being conducted in order to contribute to the fight and prevention of this pandemic. This work presents a scholarly production dataset focused on COVID-19, providing an overview of scientific research activities and making it possible to identify the countries, scientists, and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of articles' metadata collected from the Scopus, PubMed, arXiv and bioRxiv databases from January 2019 to July 2020. The data were extracted using Python web scraping techniques and preprocessed with pandas data wrangling.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consist of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.
Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
Allard_Analysis_APPAM SBIR project: Forthcoming.
Allard_Spatial Analysis: Forthcoming.
Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. It consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget performance in the SBIR/STTR program. Data were collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the award websites.
Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.
Primary sources:
Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.
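If you prefer to work in Python rather than R, the .Rdata files described above can be opened with the pyreadr package. This is a hedged sketch; the R object names stored inside each file are assumptions.

```python
# pip install pyreadr
import pyreadr

# read_r returns a dict-like mapping of R object names to pandas DataFrames.
result = pyreadr.read_r("Awards_SBIR_df.Rdata")
print(result.keys())                      # names of the objects stored in the file

awards = next(iter(result.values()))      # take the first (likely only) data frame
print(awards.shape)                       # expected to be roughly (89330, n_columns)
print(awards.head())
```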
Community Data License Agreement, Sharing 1.0: https://cdla.io/sharing-1-0/
Hello all,
This dataset is my humble attempt to allow myself and others to upgrade essential Python packages to their latest versions. This dataset contains the .whl files of the below packages, to be used across general kernels and especially in internet-off code challenges (see the install sketch after the table):
| Package | Version | Functionality |
|---|---|---|
| AutoGluon | 1.0.0 | AutoML models |
| Catboost | 1.2.2 1.2.3 | ML models |
| Iterative-Stratification | 0.1.7 | Iterative stratification for multi-label classifiers |
| Joblib | 1.3.2 | File dumping and retrieval |
| LAMA | 0.3.8b1 | AutoML models |
| LightGBM | 4.3.0 4.2.0 4.1.0 | ML models |
| MAPIE | 0.8.2 | Quantile regression |
| Numpy | 1.26.3 | Data wrangling |
| Pandas | 2.1.4 | Data wrangling |
| Polars | 0.20.3 0.20.4 | Data wrangling |
| PyTorch | 2.0.1 | Neural networks |
| PyTorch-TabNet | 4.1.0 | Neural networks |
| PyTorch-Forecast | 0.7.0 | Neural networks |
| Pygwalker | 0.3.20 | Data wrangling and visualization |
| Scikit-learn | 1.3.2 1.4.0 | ML Models/ Pipelines/ Data wrangling |
| Scipy | 1.11.4 | Data wrangling/ Statistics |
| TabPFN | 10.1.9 | ML models |
| Torch-Frame | 1.7.5 | Neural Networks |
| TorchVision | 0.15.2 | Neural Networks |
| XGBoost | 2.0.2 2.0.1 2.0.3 | ML models |
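A hedged sketch of the offline install referenced above; the dataset input path is an assumption and should be replaced with the actual mount point of this dataset in your kernel.

```python
import subprocess
import sys
from pathlib import Path

# Install a package from the local wheels instead of PyPI (works with the internet turned off).
# The directory below is a placeholder -- point it at this dataset's input folder.
WHEEL_DIR = Path("/kaggle/input/python-package-wheels")

subprocess.run(
    [sys.executable, "-m", "pip", "install", "lightgbm==4.3.0",
     "--no-index", "--find-links", str(WHEEL_DIR)],
    check=True,
)
```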
I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
Best regards and happy learning and coding!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collected by E. Hunting et al. comprising video footage and electric field recordings from a video camera and a field mill, respectively. Data wrangling was done by K. Manser, the author of the Python script.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Bike Share Data
Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles on a very short-term basis for a price. This allows people to borrow a bike from point A and return it at point B, though they can also return it to the same location if they'd like to just go for a ride. Regardless, each bike can serve several users per day.
Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.
In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare the system usage between three large cities: Chicago, New York City, and Washington, DC.
The Datasets
Randomly selected data for the first six months of 2017 are provided for all three cities. All three of the data files contain the same core six (6) columns:
- Start Time (e.g., 2017-01-01 00:07:57)
- End Time (e.g., 2017-01-01 00:20:53)
- Trip Duration (in seconds, e.g., 776)
- Start Station (e.g., Broadway & Barry Ave)
- End Station (e.g., Sedgwick St & North Ave)
- User Type (Subscriber or Customer)
The Chicago and New York City files also have the following two columns:
- Gender
- Birth Year
Data for the first 10 rides in the new_york_city.csv file
The original files are much larger and messier, and you don't need to download them, but they can be accessed here if you'd like to see them (Chicago, New York City, Washington). These files had more columns and they differed in format in many cases. Some data wrangling has been performed to condense these files to the above core six columns to make your analysis and the evaluation of your Python skills more straightforward. In the Data Wrangling course that comes later in the Data Analyst Nanodegree program, students learn how to wrangle the dirtiest, messiest datasets, so don't worry, you won't miss out on learning this important skill!
Statistics Computed
You will learn about bike share use in Chicago, New York City, and Washington by computing a variety of descriptive statistics. In this project, you'll write code to provide the following information:
- most common month
- most common day of week
- most common hour of day
- most common start station
- most common end station
- most common trip from start to end (i.e., most frequent combination of start station and end station)
- total travel time
- average travel time
- counts of each user type
- counts of each gender (only available for NYC and Chicago)
- earliest, most recent, most common year of birth (only available for NYC and Chicago)

The Files
To answer these questions using Python, you will need to write a Python script. To help guide your work in this project, a template with helper code and comments is provided in a bikeshare.py file, and you will do your scripting in there also. You will need the three city dataset files too:
- chicago.csv
- new_york_city.csv
- washington.csv
All four of these files are zipped up in the Bikeshare file in the resource tab in the sidebar on the left side of this page. You may download and open up that zip file to do your project work on your local machine.
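A minimal sketch of how some of these statistics could be computed with pandas, assuming the column names match the descriptions above.

```python
import pandas as pd

# Load one city's file; the column names are assumed to match the list above.
df = pd.read_csv("chicago.csv", parse_dates=["Start Time", "End Time"])

# Popular times of travel
print("Most common month:", df["Start Time"].dt.month_name().mode()[0])
print("Most common day of week:", df["Start Time"].dt.day_name().mode()[0])
print("Most common start hour:", df["Start Time"].dt.hour.mode()[0])

# Popular stations and trips
print("Most common start station:", df["Start Station"].mode()[0])
print("Most common end station:", df["End Station"].mode()[0])
print("Most common trip:",
      (df["Start Station"] + " -> " + df["End Station"]).mode()[0])

# Trip duration and user breakdown
print("Total travel time (s):", df["Trip Duration"].sum())
print("Average travel time (s):", df["Trip Duration"].mean())
print(df["User Type"].value_counts())
```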
Netflix Dataset Exploration and Visualization
This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.
Objectives:
Analyze the distribution of content types (Movies vs. TV Shows)
Identify the most prolific countries producing Netflix content
Study the ratings and duration of shows
Handle missing values using techniques like interpolation, forward-fill, and custom replacements
Enhance readability with bar charts, horizontal plots, and annotated visuals
Key Visualizations:
Bar charts for type distribution and country-wise contributions
Handling missing data in rating, duration, and date_added
Annotated plots showing values for clarity
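A minimal sketch of the missing-value handling mentioned above; the file name and columns (rating, duration, date_added) follow the common Kaggle Netflix titles schema and are assumptions here.

```python
import pandas as pd

df = pd.read_csv("netflix_titles.csv")

# Custom replacement for missing ratings.
df["rating"] = df["rating"].fillna("Not Rated")

# Forward-fill missing added dates after parsing them.
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
df["date_added"] = df["date_added"].ffill()

# duration looks like "90 min" or "2 Seasons": extract the number, interpolate remaining gaps.
df["duration_num"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(float)
df["duration_num"] = df["duration_num"].interpolate()
```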
Tools Used:
Python 3
Pandas for data wrangling
Matplotlib for visualizations
Jupyter Notebook for hands-on analysis
Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.
Feel free to fork, adapt, and extend the work.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A Case Study
In this case study we are going to use the automobile dataset, which contains specifications for plenty of car manufacturers, in order to build a predictive model to estimate the car price. This dataset has 26 columns, including categorical and quantitative attributes.
The given_automobile.csv contains records from the above-mentioned dataset.
You need to write descriptive answers to the questions under each task, and also use a proper program written in Python and execute the code.
1. The missing values are presented as '?' in the dataset. Apply data wrangling techniques using the Python programming language to resolve the missing values in all the attributes.
2. Check the data types of the columns with missing values, and convert the data type if needed.
3. Find all the features correlated with 'Price'.
4. Build a predictive model to predict the car price based on one of the independent correlated variables.
5. Continue with the same model built in No. 4, but choose different independent variables and discuss the result.
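A hedged sketch of tasks 1 to 4, assuming imports-85-style column names (e.g. normalized-losses, horsepower, engine-size, price); adjust them to the actual headers in given_automobile.csv.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Task 1: load the file, treating '?' as missing.
df = pd.read_csv("given_automobile.csv", na_values="?")

# Task 2: columns where '?' forced an object dtype; convert the ones that should be numeric.
# (Column names are assumptions -- adjust to the real headers.)
for col in ["normalized-losses", "bore", "stroke", "horsepower", "peak-rpm", "price"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Simple imputation: numeric columns with the mean, the rest with the mode.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df = df.fillna(df.mode().iloc[0])

# Task 3: correlation of every numeric feature with price.
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False))

# Task 4: one-variable linear model, e.g. engine-size -> price.
model = LinearRegression().fit(df[["engine-size"]], df["price"])
print("R^2:", model.score(df[["engine-size"]], df["price"]))
```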
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 300 publicly available matrimony profiles scraped from Vivaah.com using Python and Selenium.
Each profile includes:
- Profile ID
- Age & Height
- Religion
- Caste
- Mother Tongue
- Profession
- Education
- Location
🧠 Ideal for:
- Exploratory Data Analysis (EDA)
- Filtering & segmentation
- Recommender system prototypes
- Practice with web scraping & data wrangling
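A quick EDA sketch along those lines; the file name and exact column labels are assumptions based on the field list above, and the city filter is just an illustrative example.

```python
import pandas as pd

# Placeholder file and column names -- adjust to the actual CSV in this dataset.
df = pd.read_csv("vivaah_profiles.csv")

print(df["Religion"].value_counts())
print(df["Profession"].value_counts().head(10))
print(df.groupby("Religion")["Age"].describe())

# Simple segmentation example: profiles aged 25-30 in a given city (example value only).
segment = df[df["Age"].between(25, 30) & df["Location"].str.contains("Chennai", na=False)]
print(len(segment), "profiles in the segment")
```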
⚠️ This dataset is shared only for educational and research use. It includes no personal contact or private info.
Design a web portal to automate the various operations performed in machine learning projects to solve specific problems related to supervised or unsupervised use cases. The web portal must have the capabilities to perform the below-mentioned tasks:
1. Extract, Transform, Load:
a. Extract: The portal should provide the capability to configure any data source, e.g. cloud storage (AWS, Azure, GCP), databases (RDBMS, NoSQL), and real-time streaming data, to extract data into the portal. (Allow users to write a custom script if required to connect to any data source and extract data.)
b. Transform: The portal should provide various inbuilt functions/components to apply a rich set of transformations that convert the extracted data into the desired format.
c. Load: The portal should be able to save data into any of the cloud storage services after the extracted data is transformed into the desired format.
d. Allow the user to write a custom script in Python if some functionality is not present in the portal.
2. Exploratory Data Analysis: The portal should allow users to perform exploratory data analysis.
3. Data Preparation: Data wrangling, feature extraction, and feature selection should be automated with minimal user intervention.
4. The application must suggest the machine learning algorithm best suited to the use case and perform a best-model search to automate model development.
5. The application should provide a feature to deploy the model in any of the clouds, and it should create a prediction API to predict new instances.
6. The application should log every detail so that each activity can be audited in the future to investigate any event.
7. A detailed report should be generated for ETL, EDA, data preparation, and model development and deployment.
8. Create a dashboard to monitor model performance and create various alert mechanisms to notify the appropriate user to take necessary precautions.
9. Create functionality to retrain an existing model if necessary.
10. The portal must be designed so that it can be used by multiple organizations/users, where each organization/user is isolated from the others.
11. The portal should provide functionality to manage users, similar to the RBAC concept used in the cloud. (It is not necessary to build many roles, but design it so that roles can be added in the future and newly created roles can also be applied to users.) An organization can have multiple users, and each user will have a specific role.
12. The portal should have a scheduler to schedule training or prediction tasks, and appropriate alerts regarding the scheduled jobs should be sent to the subscriber/configured email ID.
13. Implement watcher functionality to perform prediction as soon as a file arrives at the input location.
You have to build a solution that should summarize the various news articles from different reading categories.
Code: You are supposed to write the code in a modular fashion.
- Safe: It can be used without causing harm.
- Testable: It can be tested at the code level.
- Maintainable: It can be maintained, even as your codebase grows.
- Portable: It works the same in every environment (operating system).
You have to maintain your code on GitHub. You have to keep your GitHub repo public so that anyone can check your code. You have to maintain a proper readme file for any project development. You should include the basic workflow and execution of the entire project in the readme file on GitHub. Follow the coding standards: https://www.python.org/dev/peps/pep-0008/
NoSQL) or use multiple database.
You can use any cloud platform for this entire solution hosting like AWS, Azure or GCP.
Logging is a must for every action performed by your code; use the Python logging library for this (a minimal sketch follows after this list).
Use a source version control tool to implement a CI/CD pipeline, e.g. Azure DevOps, GitHub, Circle CI.
You can host your application on a cloud platform using an automated CI/CD pipeline.
You have to submit complete solution design strate...
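As a small illustration of the logging requirement above, here is a minimal sketch using the standard library; the log file name, format string, and the summarize_article helper are assumptions for illustration only.

```python
import logging

# Minimal setup: every action is written to a log file with a timestamp and level.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(module)s | %(message)s",
)
logger = logging.getLogger(__name__)

def summarize_article(url: str) -> str:
    """Hypothetical summarization entry point, shown only to demonstrate logging."""
    logger.info("Summarization requested for %s", url)
    try:
        summary = "..."  # placeholder for the actual summarization step
        logger.info("Summarization finished for %s", url)
        return summary
    except Exception:
        logger.exception("Summarization failed for %s", url)
        raise
```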
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
After getting mixed results from the news sources, I thought to analyze the Vice Presidential and Presidential debates using data science. The idea is to use YouTube comments as a medium to gauge the sentiment regarding the debates and to get insights from the data. In this analysis, we plot common phrases and common words, we analyze sentiment, and, in the end, for all my data science practitioners, I present a full-fledged dataset containing YouTube comments on the VP and Presidential debates.
Why: After getting mixed results from the news sources about the outcome of the debate, I decided to use data science to help me judge the outcome myself. With the elections around the corner, technology, or to be precise analytics, plays a key role in shaping our thoughts and supporting our hypotheses. How: To analyze YouTube comments we use Python and various other NLP libraries, followed by some data visualization tools. We will use the wonders of the awesome data wrangling library known as Pandas, and we hope to find some interesting insights.
The dataset contains comments (scraped YouTube comments) and a sentiment score calculated using the TextBlob library.
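A minimal sketch of that sentiment step with TextBlob; the example comments are invented.

```python
from textblob import TextBlob

# TextBlob's polarity score lies in [-1, 1]; negative, zero, and positive map to sentiment labels.
comments = [
    "That closing statement was brilliant",
    "Worst debate performance I have ever seen",
]
for text in comments:
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f}  {label}  {text}")
```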
YouTube data API
---- WEB SCRAPING PROJECT ----
DESCRIPTION: This is a self-guided project where I have tasked myself with extracting the top 100 companies in the United States by Revenue according to the Fortune 500 list, performing EDA, and then narrowing it down to the Technology companies within that list.
Dataset URL: https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue
PROBLEM STATEMENT: How much Revenue have the largest U.S. Tech companies generated within the year 2023 so far?
PROJECT TYPE: Web Scraping, EDA (Exploratory Data Analysis), Data Wrangling/Cleaning, Data Visualization
SOFTWARE TOOLS USED: Python 3.1.0
DATE: 20th September, 2023
Author: Oluwabori Abiodun-Johnson
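As a hedged illustration of the scraping step, pandas can read the Wikipedia tables directly; the table index and the column labels used below ("Name", "Industry", "Revenue (USD millions)") are assumptions that may change as the page gets edited.

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"

# read_html returns every table on the page; the main ranked table is assumed to be the first.
fortune = pd.read_html(url)[0].head(100)          # top 100 companies by revenue

# Narrow the list down to the technology companies (column names are assumptions).
tech = fortune[fortune["Industry"].str.contains("tech", case=False, na=False)]
print(tech[["Name", "Industry", "Revenue (USD millions)"]])
```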
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains a collection of notable deaths from around the world, as recorded on Wikipedia throughout the year 2024. The data was scraped from the Wikipedia pages for each month of 2024, specifically from the "Deaths in [Month] 2024" articles.
Columns in the dataset:
1. Name: The name of the deceased individual.
2. Age: The age at which the individual passed away.
3. Location/Profession: The geographical location or professional background of the individual.
4. Cause of Death: The reported cause of death (if available).
5. Month: The month of the year in which the individual passed away.
Data Collection Methodology: Data was collected using a custom Python script that utilized the requests and BeautifulSoup libraries to scrape and parse the data from Wikipedia.
Information was extracted from the list of deaths provided on Wikipedia pages for each month, and the data was cleaned and organized into a structured CSV file.
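A hedged sketch of that collection step; the CSS selector and entry format are assumptions about how the "Deaths in ..." pages are typically laid out, so they may need adjusting.

```python
import requests
from bs4 import BeautifulSoup

# One monthly page; the other months follow the same URL pattern.
url = "https://en.wikipedia.org/wiki/Deaths_in_January_2024"
resp = requests.get(url, headers={"User-Agent": "research-script/0.1"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

entries = []
# Assumption: each day's deaths are list items inside the main article body.
for li in soup.select("div.mw-parser-output > ul > li"):
    text = li.get_text(" ", strip=True)
    # Entries typically look like "Name, 83, nationality profession, cause of death."
    entries.append(text)

print(len(entries), "raw entries scraped")
print(entries[:3])
```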
This dataset is ideal for educational purposes and provides an opportunity for practicing data cleaning, data wrangling, and NLP techniques like text classification, Named Entity Recognition (NER), and summarization tasks.
Use Cases:
1. Data cleaning and preprocessing exercises.
2. Natural Language Processing (NLP) and text analysis tasks.
3. Time-based analysis or trend analysis of notable deaths over the course of a year.
4. Practicing Named Entity Recognition (NER) for identifying names, locations, and professions.
The dataset is available for educational purposes only and can be used to practice various data science and machine learning techniques. The data was collected under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Code for this data set: https://github.com/ghulamhaider65/Web_scraping
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
- New case
- New case (7 day rolling average)
- Recovered
- Active case
- Local cases
- Imported case
- ICU
- Death
- Cumulative deaths
- People tested
- Cumulative people tested
- Positivity rate
- Positivity rate (7 day rolling average)
Columns 1 to 22 are Twitter data; the Tweets are retrieved from Health DG @DGHisham's timeline with the Twitter API. A typical covid situation update Tweet is written in a relatively fixed format. Data wrangling is done in Python/Pandas, with numerical values extracted using Regular Expressions (RegEx). Missing data are added manually from the Desk of DG (kpkesihatan).
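As an illustration of that extraction step, here is a hedged sketch; the tweet below is an invented example in the same spirit, not a verbatim @DGHisham tweet, and the label patterns are assumptions that would need adjusting to the real fixed format.

```python
import re
import pandas as pd

# Invented example text for illustration only.
tweet = "Status terkini COVID-19: Kes baharu: 1,234 Kes import: 12 Kes tempatan: 1,222 Kematian: 8"

def grab(label: str, text: str):
    # \D skips the non-digit separators between the label and its number.
    m = re.search(label + r"\D*([\d,]+)", text)
    return int(m.group(1).replace(",", "")) if m else None

row = {
    "new_cases": grab("Kes baharu", tweet),
    "imported_cases": grab("Kes import", tweet),
    "local_cases": grab("Kes tempatan", tweet),
    "deaths": grab("Kematian", tweet),
}
print(pd.DataFrame([row]))
```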
Column 23 ['remark'] is my own written remark regarding the Tweet status/content.
Column 24 ['Cumulative people tested'] data is transcribed from an image on MOH COVID-19 website. Specifically, the first image under TABURAN KES section in each Situasi Terkini daily webpage of http://covid-19.moh.gov.my/terkini. If missing, the image from CPRC KKM Telegram or KKM Facebook Live video is used. Data in this column, dated from 1 March 2020 to 11 Feb 2021, are from Our World in Data, their data collection method as stated here.
MOH does not publish any covid data in csv/excel format as of today; they provide the data as is, along with infographics that are hardly informative. In an undisclosed email, MOH did not seem to understand my request for them to release the covid public health data so that anyone can download it and do their own analysis if they wish.
A simple visualization dashboard is now published on Tableau Public. It is updated daily. Do check it out! More charts will be added in the near future.
Create better visualizations to help fellow Malaysians understand the Covid-19 situation. Empower the data science community.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The repository contains data on party strength for each state as shown on each state's corresponding party strength Wikipedia page (for example, here is Virginia).
Each state has a Wikipedia table giving a detailed summary of the state of its governing and representing bodies, but there is no data set that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in congress
4. electoral votes
Data in the clean version has been cleaned and processed substantially. Namely:
- all columns now contain homogenous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been left
The notebook that created this file is here.
The data contained herein have not been altered from their Wikipedia tables except in two instances:
- Forced column names to be in accord across states
- Any needed data modifications (i.e., concatenated string columns) to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not 'utf-8', though in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
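A minimal loading sketch, assuming the CSV names above; the pickle file name is not spelled out in the description, so it is a placeholder.

```python
import pandas as pd

# The cleaned CSV uses ISO-8859-1 rather than utf-8 (see the note above).
df = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")
print(df.head())

# The raw scrape is a pickled dict of state name -> DataFrame (file name is a placeholder).
raw = pd.read_pickle("state_party_strength_raw.pickle")
print(list(raw.keys())[:5])
print(raw["Virginia"].head())
```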
Hope it proves useful to you in analyzing/using political patterns at the state level in the US for political and policy research.
oslo-city-bike
License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because there is also similar data for Trondheim and Bergen.
Schema from oslobysykkel.no:

| Variable | Format | Description |
|---|---|---|
| started_at | Timestamp | Timestamp of when the trip started |
| ended_at | Timestamp | Timestamp of when the trip ended |
| duration | Integer | Duration of trip in seconds |
| start_station_id | String | Unique ID for start station |
| start_station_name | String | Name of start station |
| start_station_description | String | Description of where start station is located |
| start_station_latitude | Decimal degrees in WGS84 | Latitude of start station |
| start_station_longitude | Decimal degrees in WGS84 | Longitude of start station |
| end_station_id | String | Unique ID for end station |
| end_station_name | String | Name of end station |
| end_station_description | String | Description of where end station is located |
| end_station_latitude | Decimal degrees in WGS84 | Latitude of end station |
| end_station_longitude | Decimal degrees in WGS84 | Longitude of end station |
Please note: this data and my analysis focuses on the new data format, but historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern.
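A minimal loading sketch, assuming the monthly file naming pattern and the schema documented above.

```python
import pandas as pd

# Load one monthly file (the 2024-06 name follows the oslobysykkel-YYYY-MM.csv pattern).
df = pd.read_csv("oslobysykkel/oslobysykkel-2024-06.csv",
                 parse_dates=["started_at", "ended_at"])

# Trip length distribution (duration is in seconds).
print(df["duration"].describe())

# Busiest start stations with their median trip length.
print(df.groupby("start_station_name")["duration"]
        .agg(trips="count", median_sec="median")
        .sort_values("trips", ascending=False)
        .head(10))
```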
I myself was extremely fascinated by this open data from Oslo City Bike and, in the process of deep analysis, saw broad prospects. This interest turned into an idea to create a data-analytical problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), as well as so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis**
https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks
Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
Exploring Open Data from Oslo City Bike Jon Olave — visualization of popular routes and seasonality analysis.
Oslo City Bike Data Wrangling Karl Tryggvason — predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
Helsinki City Bikes: Exploratory Data Analysis Analysis of a similar system in Helsinki — useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did it for weather data: integrating temperature, precipitation, and wind speed to explain variations in daily demand. https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that's where I took division into neighbourhoods) https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
oslo bike-sharing eda feature-engineering geospatial time-series
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Derived from the LEGO Database (https://www.kaggle.com/rtatman/lego-database), flattened into a single CSV file for use in a Kernel I plan to build with the Pandas Python library.
The original data files were imported into MS SQL Server, and then exported after performing a huge join across all of the tables. A bunch of data was excluded from my export query to simplify the data set for my research goals. Certain part categories such as minifigs, as well as some themes like Duplo, have been excluded.
This is derived from the LEGO database https://www.kaggle.com/rtatman/lego-database which is courtesy of https://www.kaggle.com/rtatman and in turn originated from Rebrickable.
This is a fun data set to play with for learning data wrangling. I personally identify with it as a LEGO fan!
U.S. Government Works: https://www.usa.gov/government-works/
Originally, I was planning to use the Python Quandl API to get the data from here because it is already conveniently in time-series format. However, the data is split by reporting agency, which makes it difficult to get an accurate picture of the true short ratio because of missing data and difficulty in aggregation. So I clicked on the source link, which turned out to be a gold mine because of its consolidated data. The only downside was that it was all in .txt format, so I had to use regex to parse it and web scraping to get the information from the website, but that was a good refresher 😄.
For better understanding of what the values in the text file mean, you can read this pdf from FINRA: https://www.finra.org/sites/default/files/2020-12/short-sale-volume-user-guide.pdf
I condensed all the individual text files into a single .txt file so that writing code is much faster and less complex than iterating through each individual .txt file. I created several functions for this dataset, so please check out my workbook "FINRA Short Ratio functions", where I describe step by step how I gathered and formatted the data so that you can understand and modify the functions to fit your needs. Note that the data only covers 1st April 2020 onwards (20200401 to 20210312 as of gathering the data), and the contents are separated by | delimiters, so I used \D (non-digit) in the regex to avoid confusion with the (a|b) pattern syntax.
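A hedged parsing sketch: the column names below follow FINRA's documented short sale volume file layout (see the user guide linked above), and the combined file name is an assumption.

```python
import pandas as pd

# Assumed layout: Date|Symbol|ShortVolume|ShortExemptVolume|TotalVolume|Market
df = pd.read_csv("consolidated_short_volume.txt", sep="|")

df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df["short_ratio"] = df["ShortVolume"] / df["TotalVolume"]

# Daily short ratio for a single ticker.
print(df[df["Symbol"] == "AAPL"].set_index("Date")["short_ratio"].tail())
```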
If you need historical data before April 2020, you can use the quandl database but it has non-consolidated information and you have to make a reference call for each individual stock for each agency so you would need to manually input tickers or get a list of all tickers through regex of the txt files or something like that 😅.
An excellent task to combine regular expressions (regex), web scraping, plotting, and data wrangling... see my notebook for an example with annotated workflow. Please comment and feel free to fork and modify my workbook to change the functionality. Possibly the short volumes can be combined with p/b ratios or price data to see the correlation --> can use seaborn pairgrid to visualise this for multiple stocks?