26 datasets found
  1. UCI Automobile Dataset

    • kaggle.com
    Updated Feb 12, 2023
    Cite
    Otrivedi (2023). UCI Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/otrivedi/automobile-data/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Otrivedi
    Description

    In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

    This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:

    1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook. 2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038 3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037

    Number of Instances: 398. Number of Attributes: 9, including the class attribute.

    Attribute Information:

    mpg: continuous
    cylinders: multi-valued discrete
    displacement: continuous
    horsepower: continuous
    weight: continuous
    acceleration: continuous
    model year: multi-valued discrete
    origin: multi-valued discrete
    car name: string (unique for each instance)

    This data set consists of three types of entities:

    I - The specification of an auto in terms of various characteristics

    II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price. Then, if the car is riskier (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".

    III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.

    The analysis is divided into two parts:

    Data Wrangling

    1. Pre-processing data in Python
    2. Dealing with missing values
    3. Data formatting
    4. Data normalization
    5. Binning

    Exploratory Data Analysis

    1. Descriptive statistics
    2. Groupby
    3. Analysis of variance
    4. Correlation
    5. Correlation statistics
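    As a rough illustration of the wrangling steps above, here is a minimal pandas sketch (the raw imports-85 file has no header row and uses '?' for missing values; the bin labels and the generic column handling are illustrative assumptions, not the notebook's actual code):

    ```python
    import pandas as pd

    # 1. Pre-processing: read the raw file, treating '?' as missing
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
    df = pd.read_csv(url, header=None, na_values="?")

    # 2. Dealing with missing values: fill numeric gaps with the column mean
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    # 3./4. Data formatting and normalization: min-max scale the numeric columns
    df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / (
        df[numeric_cols].max() - df[numeric_cols].min()
    )

    # 5. Binning: cut the first numeric column into three equal-width bins
    first = numeric_cols[0]
    df[f"{first}_binned"] = pd.cut(df[first], bins=3, labels=["low", "medium", "high"])

    # Exploratory data analysis: descriptive statistics and correlations
    print(df.describe())
    print(df[numeric_cols].corr())
    ```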

    Acknowledgment Dataset: UCI Machine Learning Repository Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

  2. Enriched NYTimes COVID19 U.S. County Dataset

    • kaggle.com
    zip
    Updated Jun 14, 2020
    Cite
    ringhilterra17 (2020). Enriched NYTimes COVID19 U.S. County Dataset [Dataset]. https://www.kaggle.com/ringhilterra17/enrichednytimescovid19
    Explore at:
    Available download formats: zip (11291611 bytes)
    Dataset updated
    Jun 14, 2020
    Authors
    ringhilterra17
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Overview and Inspiration

    I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.

    I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.

    After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.

    This dataset is enriched with county shapes, county center-point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.

    UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.

    How this data can be used

    Geospatial analysis and visualization: Which counties are currently getting hit the hardest (per capita and in totals)? What patterns are there in the spread of the virus across counties (e.g., network-based spread simulations using county center lat/lons)? Do county population densities play a role in how quickly the virus spreads? How do a specific county's or state's cases and deaths compare to other counties/states? The dataset also joins easily with other county-level datasets (via the FIPS code column).

    Content Details

    See the column descriptions for more details on the dataset

    Visualizations and Analysis Examples

    COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)

    Time-lapse GIF: https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true

    Other Data Notes

    • Please review nytimes README for detailed notes on Covid-19 data - https://github.com/nytimes/covid-19-data/
    • The only update I made in regard to 'Geographic Exceptions' is that I took the 'New York City' county provided in the Covid-19 data, which holds all cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties), and replaced the missing FIPS for those rows with the 'New York County' FIPS code 36061. That way I could join to a geometry, and I then used the sum of those five boroughs' population estimates as the 'New York City' estimate, which allowed me to calculate 'per capita' metrics for the 'New York City' entries in the Covid-19 dataset.
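    For readers reproducing the join, here is an illustrative pandas sketch of the 'New York City' FIPS fix and the per-capita step described above; the file path and column names (county, fips, cases, population) are assumptions, not the dataset's documented schema:

    ```python
    import pandas as pd

    df = pd.read_csv("us-counties.csv")  # hypothetical path to the NYTimes-style file

    # Give the aggregated 'New York City' rows the New York County FIPS so they
    # can be joined to a county geometry.
    nyc = df["county"] == "New York City"
    df.loc[nyc, "fips"] = 36061

    # Per-capita metric once a population column has been joined in.
    df["cases_per_100k"] = df["cases"] / df["population"] * 100_000
    ```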

    Acknowledgements

  3. A Beginner's Journey into Data Wrangling

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Maramsa (2024). A Beginner's Journey into Data Wrangling [Dataset]. https://www.kaggle.com/datasets/maramsa/a-beginners-journey-into-data-wrangling/discussion
    Explore at:
    Available download formats: zip (40762 bytes)
    Dataset updated
    Feb 25, 2024
    Authors
    Maramsa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dive into the world of data manipulation with pandas, the powerful Python library for data analysis. This series of exercises is designed for beginners who are eager to learn how to use pandas for data wrangling tasks. Each exercise will cover a different aspect of pandas, from loading and exploring datasets to manipulating data and performing basic analysis. Whether you're new to programming or just getting started with pandas, these exercises will help you build a solid foundation in data wrangling skills. Join us on this exciting journey and unleash the power of pandas!
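    As a taste of the kind of exercise described, here is a minimal pandas sketch covering loading, exploring, and a simple group-by; the file and column names are placeholders, not the actual exercise files:

    ```python
    import pandas as pd

    df = pd.read_csv("exercise_data.csv")   # hypothetical exercise file

    print(df.head())      # first rows
    df.info()             # column types and missing values
    print(df.describe())  # basic statistics

    # Example manipulation: group by one column and average another
    # (replace 'category' and 'value' with columns from the actual exercise file).
    print(df.groupby("category")["value"].mean())
    ```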

  4. Data from: SBIR - STTR Data and Code for Collecting Wrangling and Using It

    • dataverse.harvard.edu
    Updated Nov 5, 2018
    Cite
    Grant Allard (2018). SBIR - STTR Data and Code for Collecting Wrangling and Using It [Dataset]. http://doi.org/10.7910/DVN/CKTAZX
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Grant Allard
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data set consisting of data joined for analyzing the SBIR/STTR program. The data consist of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.

    Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.

    Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.

    Allard_Analysis_APPAM SBIR project: Forthcoming

    Allard_Spatial Analysis: Forthcoming

    Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. It consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.

    Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget performance in the SBIR/STTR program. Data was collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the award websites.

    Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. It was collected from the SBA Solicitations API.

    Primary Sources: Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports. Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api. Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.
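    For anyone working in Python rather than R, a minimal sketch for opening one of the .Rdata files with the pyreadr package is shown below; the object name stored inside the file is not documented here, so the code simply inspects the keys:

    ```python
    import pyreadr

    result = pyreadr.read_r("Awards_SBIR_df.Rdata")  # dict-like: {object name: DataFrame}
    print(list(result.keys()))

    awards_df = next(iter(result.values()))
    print(awards_df.shape)
    ```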

  5. COVID-19 Scholarly Production Dataset

    • data.mendeley.com
    Updated Jul 7, 2020
    + more versions
    Cite
    Gisliany Alves (2020). COVID-19 Scholarly Production Dataset [Dataset]. http://doi.org/10.17632/kx7wwc8dzp.5
    Explore at:
    Dataset updated
    Jul 7, 2020
    Authors
    Gisliany Alves
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COVID-19 has been recognized as a global threat, and several studies are being conducted in order to contribute to the fight against and prevention of this pandemic. This work presents a scholarly production dataset focused on COVID-19, providing an overview of scientific research activities and making it possible to identify the countries, scientists, and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of articles' metadata collected from the Scopus, PubMed, arXiv, and bioRxiv databases from January 2019 to July 2020. The data were extracted using Python web scraping and preprocessed with pandas.

  6. Indian Matrimony Profiles Dataset (Vivaah.com)

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Nitish Jain (2025). Indian Matrimony Profiles Dataset (Vivaah.com) [Dataset]. https://www.kaggle.com/datasets/njnj41019/indian-matrimony-profiles-dataset-vivaah-com
    Explore at:
    Available download formats: zip (1055 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Nitish Jain
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    India
    Description

    This dataset contains 300 publicly available matrimony profiles scraped from Vivaah.com using Python and Selenium.

    Each profile includes: - Profile ID - Age & Height - Religion - Caste - Mother Tongue - Profession - Education - Location

    🧠 Ideal for: - Exploratory Data Analysis (EDA) - Filtering & segmentation - Recommender system prototypes - Practice with web scraping & data wrangling

    ⚠️ This dataset is shared only for educational and research use. It includes no personal contact or private info.
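    As a rough sketch of the Selenium-based scraping described above (the URL and CSS selectors are placeholders, not the actual structure of Vivaah.com):

    ```python
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.vivaah.com/")  # hypothetical listing page

    profiles = []
    for card in driver.find_elements(By.CSS_SELECTOR, ".profile-card"):  # assumed selector
        profiles.append({
            "profile_id": card.get_attribute("data-id"),  # assumed attribute
            "summary": card.text,
        })

    driver.quit()
    print(len(profiles), "profiles scraped")
    ```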

  7. Automobile_Price_prediction

    • kaggle.com
    zip
    Updated Oct 22, 2023
    Cite
    Ayush Triapthi (2023). Automobile_Price_prediction [Dataset]. https://www.kaggle.com/datasets/ayusht18dec/case-study-dataset
    Explore at:
    Available download formats: zip (10947 bytes)
    Dataset updated
    Oct 22, 2023
    Authors
    Ayush Triapthi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    A Case Study

    In this case study we are going to use the automobile dataset, which contains plenty of car manufacturers with their specifications, in order to build a predictive model to estimate the approximate car price. This dataset has 26 columns, including categorical and quantitative attributes.

    The given_automobile.csv contains records from the above-mentioned dataset.

    You need to write descriptive answers to the questions under each task and also use a proper program written in Python and execute the code.
    1. The missing values are represented as '?' in the dataset. Apply data wrangling techniques using the Python programming language to resolve the missing values in all the attributes.
    2. Check the data types of the columns with missing values, and convert the data types if needed.
    3. Find all the features correlated with 'Price'.
    4. Build a predictive model to predict the car price using one of the correlated independent variables.
    5. Continue with the same model built in No. 4, but choose a different independent variable and discuss the result.
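    A hedged sketch of tasks 1 to 4 is shown below; given_automobile.csv is the file named in the case study, but the column names used here (price, horsepower) are assumptions about its schema:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # 1./2. Read the file, treating '?' as missing, and coerce the target to numeric
    df = pd.read_csv("given_automobile.csv", na_values="?")
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df = df.dropna(subset=["price"])

    # 3. Correlation of numeric features with price
    numeric = df.select_dtypes(include="number")
    print(numeric.corr()["price"].sort_values(ascending=False))

    # 4. Simple one-variable model (horsepower is an assumed correlated feature)
    X = numeric[["horsepower"]].fillna(numeric["horsepower"].mean())
    model = LinearRegression().fit(X, df["price"])
    print("R^2:", model.score(X, df["price"]))
    ```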

  8. Netflix

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Prasanna@82 (2025). Netflix [Dataset]. https://www.kaggle.com/datasets/prasanna82/netflix/code
    Explore at:
    Available download formats: zip (1400865 bytes)
    Dataset updated
    Jul 29, 2025
    Authors
    Prasanna@82
    Description

    Netflix Dataset Exploration and Visualization

    This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.

    Objectives:

    Analyze the distribution of content types (Movies vs. TV Shows)

    Identify the most prolific countries producing Netflix content

    Study the ratings and duration of shows

    Handle missing values using techniques like interpolation, forward-fill, and custom replacements

    Enhance readability with bar charts, horizontal plots, and annotated visuals

    Key Visualizations:

    Bar charts for type distribution and country-wise contributions

    Handling missing data in rating, duration, and date_added

    Annotated plots showing values for clarity

    Tools Used:

    Python 3

    Pandas for data wrangling

    Matplotlib for visualizations

    Jupyter Notebook for hands-on analysis

    Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.
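    A small sketch of two of the steps described above, missing-value handling and an annotated bar chart of Movies vs. TV Shows, is given below; the file name and column names (type, rating, date_added) follow the common Kaggle Netflix titles schema and are assumptions here:

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("netflix_titles.csv")  # hypothetical file name

    # Missing-value handling: custom replacement and forward-fill
    df["rating"] = df["rating"].fillna("Unknown")
    df["date_added"] = df["date_added"].ffill()

    # Bar chart of content types with value annotations
    counts = df["type"].value_counts()
    ax = counts.plot(kind="bar", title="Movies vs TV Shows")
    for i, v in enumerate(counts):
        ax.annotate(str(v), (i, v), ha="center", va="bottom")
    plt.tight_layout()
    plt.show()
    ```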

    Feel free to fork, adapt, and extend the work.

  9. Explore Bike Share Data

    • kaggle.com
    zip
    Updated Jun 3, 2021
    Cite
    Shaltout (2021). Explore Bike Share Data [Dataset]. https://www.kaggle.com/shaltout/explore-bike-share-data
    Explore at:
    Available download formats: zip (26232124 bytes)
    Dataset updated
    Jun 3, 2021
    Authors
    Shaltout
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Bike Share Data Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles on a very short-term basis for a price. This allows people to borrow a bike from point A and return it at point B, though they can also return it to the same location if they'd like to just go for a ride. Regardless, each bike can serve several users per day.

    Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.

    In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare the system usage between three large cities: Chicago, New York City, and Washington, DC.

    The Datasets Randomly selected data for the first six months of 2017 are provided for all three cities. All three of the data files contain the same core six (6) columns:

    Start Time (e.g., 2017-01-01 00:07:57)
    End Time (e.g., 2017-01-01 00:20:53)
    Trip Duration (in seconds - e.g., 776)
    Start Station (e.g., Broadway & Barry Ave)
    End Station (e.g., Sedgwick St & North Ave)
    User Type (Subscriber or Customer)

    The Chicago and New York City files also have the following two columns:

    Gender
    Birth Year

    Data for the first 10 rides in the new_york_city.csv file

    The original files are much larger and messier, and you don't need to download them, but they can be accessed here if you'd like to see them (Chicago, New York City, Washington). These files had more columns and they differed in format in many cases. Some data wrangling has been performed to condense these files to the above core six columns to make your analysis and the evaluation of your Python skills more straightforward. In the Data Wrangling course that comes later in the Data Analyst Nanodegree program, students learn how to wrangle the dirtiest, messiest datasets, so don't worry, you won't miss out on learning this important skill!

    Statistics Computed You will learn about bike share use in Chicago, New York City, and Washington by computing a variety of descriptive statistics. In this project, you'll write code to provide the following information:

    1. Popular times of travel (i.e., occurs most often in the start time)

    most common month
    most common day of week
    most common hour of day

    2. Popular stations and trip

    most common start station
    most common end station
    most common trip from start to end (i.e., most frequent combination of start station and end station)

    3. Trip duration

    total travel time
    average travel time

    4. User info

    counts of each user type
    counts of each gender (only available for NYC and Chicago)
    earliest, most recent, most common year of birth (only available for NYC and Chicago)

    The Files

    To answer these questions using Python, you will need to write a Python script. To help guide your work in this project, a template with helper code and comments is provided in a bikeshare.py file, and you will do your scripting in there also. You will need the three city dataset files too:

    chicago.csv new_york_city.csv washington.csv

    All four of these files are zipped up in the Bikeshare file in the resource tab in the sidebar on the left side of this page. You may download and open up that zip file to do your project work on your local machine.
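    For orientation, here is a minimal pandas sketch of a few of the statistics listed above, assuming the condensed CSV layout described (Start Time, End Time, Trip Duration, and station columns):

    ```python
    import pandas as pd

    df = pd.read_csv("chicago.csv")
    df["Start Time"] = pd.to_datetime(df["Start Time"])

    # 1. Popular times of travel
    print("Most common month:", df["Start Time"].dt.month.mode()[0])
    print("Most common day of week:", df["Start Time"].dt.day_name().mode()[0])
    print("Most common hour:", df["Start Time"].dt.hour.mode()[0])

    # 2. Popular stations and trip
    print("Most common start station:", df["Start Station"].mode()[0])
    trip = df["Start Station"] + " -> " + df["End Station"]
    print("Most common trip:", trip.mode()[0])

    # 3. Trip duration
    print("Total travel time (s):", df["Trip Duration"].sum())
    print("Average travel time (s):", df["Trip Duration"].mean())
    ```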

  10. EDA on Car Sales Dataset in Ukraine

    • kaggle.com
    zip
    Updated Jan 13, 2023
    Cite
    Swati Khedekar (2023). EDA on Car Sales Dataset in Ukraine [Dataset]. https://www.kaggle.com/datasets/swatikhedekar/eda-on-car-sales-dataset-in-ukraine
    Explore at:
    Available download formats: zip (508971 bytes)
    Dataset updated
    Jan 13, 2023
    Authors
    Swati Khedekar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Ukraine
    Description

    1. Problem statement:

    This dataset contains data on more than 9.5k car sales in Ukraine. Most of them are used cars, which opens up the possibility of analyzing features related to car operation. This is a subset of all car data in Ukraine. Using it, we will analyze the various parameters of used car sales in Ukraine.

    1.1 Introduction: This exploratory data analysis is a chance to practice Python skills on a structured dataset, including loading, inspecting, wrangling, exploring, and drawing conclusions from the data. The notebook records observations at each step in order to explain thoroughly how to approach the dataset. Based on the observations, some questions are also answered in the notebook for reference, though not all of them are explored in the analysis.

    1.2 Data Source and Dataset: a. How was it collected?

    Name: Car Sales. Sponsoring organization: unknown. Year: 2019. Description: This is a case study of more than 9.5k car sales in Ukraine. b. Is it a sample? If yes, was it properly sampled?

    Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume that it is not representative.

  11. iNeuron Projectathon Oct-Nov'21

    • kaggle.com
    zip
    Updated Oct 22, 2021
    Cite
    Aman Anand (2021). iNeuron Projectathon Oct-Nov'21 [Dataset]. https://www.kaggle.com/yekahaaagayeham/ineuron-projectathon-octnov21
    Explore at:
    Available download formats: zip (3335989 bytes)
    Dataset updated
    Oct 22, 2021
    Authors
    Aman Anand
    Description

    iNeuron-Projectathon-Oct-Nov-21

    Problem Statement:

    Design a web portal to automate the various operations performed in machine learning projects to solve specific problems related to supervised or unsupervised use cases. The web portal must have the capability to perform the below-mentioned tasks:
    1. Extract, Transform, Load:
    a. Extract: The portal should provide the capability to configure any data source, e.g. cloud storage (AWS, Azure, GCP), databases (RDBMS, NoSQL), and real-time streaming data, to extract data into the portal. (Allow the flexibility to write a custom script if required to connect to any data source to extract data.)
    b. Transform: The portal should provide various inbuilt functions/components to apply a rich set of transformations to convert the extracted data into the desired format.
    c. Load: The portal should be able to save data into any of the cloud storage services after the extracted data has been transformed into the desired format.
    d. Allow the user to write a custom script in Python if some functionality is not present in the portal.
    2. Exploratory Data Analysis: The portal should allow users to perform exploratory data analysis.
    3. Data Preparation: Data wrangling, feature extraction, and feature selection should be automated with minimal user intervention.
    4. The application must suggest the machine learning algorithm best suited to the use case and perform a best-model search to automate model development.
    5. The application should provide a feature to deploy the model in any of the clouds, and the application should create a prediction API to predict new instances.
    6. The application should log each and every detail so that every activity can be audited in the future to investigate any event.
    7. A detailed report should be generated for ETL, EDA, data preparation, and model development and deployment.
    8. Create a dashboard to monitor model performance and create various alert mechanisms to notify the appropriate user to take necessary precautions.
    9. Create functionality to implement retraining for an existing model if necessary.
    10. The portal must be designed in such a way that it can be used by multiple organizations/users, where each organization/user is isolated from the others.
    11. The portal should provide functionality to manage users, similar to the RBAC concept used in the cloud. (It is not necessary to build many roles, but design it in such a way that roles can be added in the future and newly created roles can also be applied to users.) An organization/user can have multiple users, and each user will have a specific role.
    12. The portal should have a scheduler to schedule training or prediction tasks, and appropriate alerts regarding scheduled jobs should be sent to the subscriber/configured email id.
    13. Implement watcher functionality to perform prediction as soon as a file arrives at the input location.

    Approach:

    1. Follow standard guidelines to write a quality solution for the web portal.
    2. Follow OOP principles to design the solution.
    3. Implement REST APIs wherever possible.
    4. Implement a CI/CD pipeline with automated testing and dockerization. (Use containers or Kubernetes to deploy your dockerized application.)
    5. The CI/CD pipeline should have different environments, for example ST, SST, and Production. Note: Feel free to use any technology to design your solution.

    Results:

    You have to build a solution that should summarize the various news articles from different reading categories.

    Project Evaluation metrics:

    Code:
    - You are supposed to write your code in a modular fashion.
    - Safe: It can be used without causing harm.
    - Testable: It can be tested at the code level.
    - Maintainable: It can be maintained, even as your codebase grows.
    - Portable: It works the same in every environment (operating system).
    - You have to maintain your code on GitHub.
    - You have to keep your GitHub repo public so that anyone can check your code.
    - A proper readme file has to be maintained for any project development.
    - You should include the basic workflow and execution of the entire project in the readme file on GitHub.
    - Follow the coding standards: https://www.python.org/dev/peps/pep-0008/

    Database: Based on development requirements, feel free to choose any database (SQL, NoSQL) or use multiple databases.

    Cloud:

    You can use any cloud platform for hosting this entire solution, such as AWS, Azure, or GCP.

    API Details or User Interface:

    1. Web portal should be designed like any cloud platform.
    2. A model developed using the web portal should have functionality to expose an API for testing predictions.

    Logging:

    Logging is a must for every action performed by your code; use the Python logging library for this.
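    A minimal sketch of this requirement with Python's standard logging library is shown below; the logger name and log file path are illustrative choices, not part of the problem statement:

    ```python
    import logging

    logging.basicConfig(
        filename="portal.log",  # assumed log destination
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s - %(message)s",
    )
    logger = logging.getLogger("ml_portal")  # hypothetical logger name

    logger.info("ETL job started")
    logger.error("Failed to connect to data source")
    ```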

    DevOps Pipeline:

    Use a source version control tool to implement the CI/CD pipeline, e.g. Azure DevOps, GitHub, Circle CI.

    Deployment:

    You can host your application on a cloud platform using an automated CI/CD pipeline.

    Solutions Design:

    You have to submit complete solution design strate...

  12. School data

    • kaggle.com
    zip
    Updated Nov 20, 2024
    Cite
    Omokhefe Ogbodo (2024). School data [Dataset]. https://www.kaggle.com/datasets/victorogbodo/school-data
    Explore at:
    Available download formats: zip (7811 bytes)
    Dataset updated
    Nov 20, 2024
    Authors
    Omokhefe Ogbodo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    This dataset simulates the academic and extracurricular records of students in a Nigerian primary school. It contains three tables designed to capture key aspects of the student lifecycle, including demographic information, academic scores, and their affiliations with sport houses. The dataset can be used for educational purposes, research, and exploratory data analysis.

    Context and Inspiration

    This dataset is inspired by the structure of Nigerian primary schools, where students are grouped into sport houses for extracurricular activities and assessed on academic performance. It is a useful resource for: exploring relationships between demographics, academic performance, and extracurricular activities; analyzing patterns in hobbies and character traits; creating visualizations for school or student performance analytics.

    Usage

    This dataset is synthetic but can be used for: data science practice, including cleaning, wrangling, and visualization; developing machine learning models to predict academic outcomes or classify students; creating dashboards and reports for educational analytics.

    License

    This dataset is synthetic and open for public use. Feel free to use it for learning, research, and creative projects.

    Acknowledgments

    The dataset was generated using Python libraries, including Faker for generating realistic student data and Pandas for organizing and exporting the dataset.
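    As a rough illustration of how a synthetic table like this can be produced with Faker and pandas, here is a minimal sketch; the column set and value ranges are assumptions, not the dataset's exact schema:

    ```python
    import random
    import pandas as pd
    from faker import Faker

    fake = Faker()
    students = [
        {
            "name": fake.name(),
            "age": random.randint(5, 12),
            "sport_house": random.choice(["Red", "Blue", "Green", "Yellow"]),
            "math_score": random.randint(0, 100),
        }
        for _ in range(50)
    ]
    df = pd.DataFrame(students)
    df.to_csv("school_data_sample.csv", index=False)
    ```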

    Example Questions to Explore

    Which sport house has the best average performance in academics? Is there a correlation between hobbies and academic scores? Are there performance differences between male and female students? What is the distribution of student ages across sport houses?

  13. PythonLibraries|WheelFiles

    • kaggle.com
    zip
    Updated Mar 25, 2024
    Cite
    Ravi Ramakrishnan (2024). PythonLibraries|WheelFiles [Dataset]. https://www.kaggle.com/datasets/ravi20076/pythonlibrarieswheelfiles/code
    Explore at:
    Available download formats: zip (1556654809 bytes)
    Dataset updated
    Mar 25, 2024
    Authors
    Ravi Ramakrishnan
    License

    https://cdla.io/sharing-1-0/

    Description

    Hello all,
    This dataset is my humble attempt to allow myself and others to upgrade essential python packages to their latest versions. This dataset contains the .whl files of the below packages to be used across general kernels and especially in internet-off code challenges-

    Package | Version(s) | Functionality
    AutoGluon | 1.0.0 | AutoML models
    Catboost | 1.2.2, 1.2.3 | ML models
    Iterative-Stratification | 0.1.7 | Iterative stratification for multi-label classifiers
    Joblib | 1.3.2 | File dumping and retrieval
    LAMA | 0.3.8b1 | AutoML models
    LightGBM | 4.3.0, 4.2.0, 4.1.0 | ML models
    MAPIE | 0.8.2 | Quantile regression
    Numpy | 1.26.3 | Data wrangling
    Pandas | 2.1.4 | Data wrangling
    Polars | 0.20.3, 0.20.4 | Data wrangling
    PyTorch | 2.0.1 | Neural networks
    PyTorch-TabNet | 4.1.0 | Neural networks
    PyTorch-Forecast | 0.7.0 | Neural networks
    Pygwalker | 0.3.20 | Data wrangling and visualization
    Scikit-learn | 1.3.2, 1.4.0 | ML models / pipelines / data wrangling
    Scipy | 1.11.4 | Data wrangling / statistics
    TabPFN | 10.1.9 | ML models
    Torch-Frame | 1.7.5 | Neural networks
    TorchVision | 0.15.2 | Neural networks
    XGBoost | 2.0.2, 2.0.1, 2.0.3 | ML models


    I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
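    As a usage note, wheel files like these are typically installed in an internet-off kernel by pointing pip at the local folder; a minimal sketch is shown below, where the dataset path is a placeholder for wherever this dataset is attached:

    ```python
    import subprocess
    import sys

    subprocess.run(
        [
            sys.executable, "-m", "pip", "install",
            "--no-index",  # never reach out to PyPI
            "--find-links", "/kaggle/input/pythonlibrarieswheelfiles",  # local wheel folder (assumed path)
            "lightgbm==4.3.0",
        ],
        check=True,
    )
    ```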

    Recent updates based on user feedback-

    1. lightgbm 4.1.0 and 4.3.0
    2. Older XGBoost versions (2.0.1 and 2.0.2)
    3. Torch-Frame, TabNet, PyTorch-Forecasting, TorchVision
    4. MAPIE
    5. LAMA 0.3.8b1
    6. Iterative-Stratification
    7. Catboost 1.2.3

    Best regards and happy learning and coding!

  14. Deaths in 2024 - Messy Data for Practice(NLP)

    • kaggle.com
    zip
    Updated Apr 26, 2025
    Cite
    Ghulam Haider (2025). Deaths in 2024 - Messy Data for Practice(NLP) [Dataset]. https://www.kaggle.com/datasets/ghulamhiader/deaths-in-2024-messy-data-for-practicenlp
    Explore at:
    Available download formats: zip (362326 bytes)
    Dataset updated
    Apr 26, 2025
    Authors
    Ghulam Haider
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of notable deaths from around the world, as recorded on Wikipedia throughout the year 2024. The data was scraped from the Wikipedia pages for each month of 2024, specifically from the "Deaths in [Month] 2024" articles.

    Columns in the dataset: 1. Name: The name of the deceased individual. 2. Age: The age at which the individual passed away. 3. Location/Profession: The geographical location or professional background of the individual. 4. Cause of Death: The reported cause of death (if available). 5. Month: The month of the year in which the individual passed away.

    Data Collection Methodology: Data was collected using a custom Python script that utilized the requests and BeautifulSoup libraries to scrape and parse the data from Wikipedia.

    Information was extracted from the list of deaths provided on Wikipedia pages for each month, and the data was cleaned and organized into a structured CSV file.
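    A rough sketch of the requests + BeautifulSoup approach described above is shown below; the CSS selector and the filtering of list items are simplifying assumptions, not the author's actual script:

    ```python
    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/Deaths_in_January_2024"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    entries = []
    for li in soup.select("div.mw-parser-output ul > li"):  # assumed selector
        text = li.get_text(" ", strip=True)
        if "," in text:  # crude filter for "Name, age, description" entries
            entries.append(text)

    print(len(entries), "raw entries scraped")
    print(entries[:3])
    ```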

    This dataset is ideal for educational purposes and provides an opportunity for practicing data cleaning, data wrangling, and NLP techniques like text classification, Named Entity Recognition (NER), and summarization tasks.

    Use Cases: 1. Data cleaning and preprocessing exercises. 2. Natural Language Processing (NLP) and text analysis tasks. 3. Time-based analysis or trend analysis of notable deaths over the course of a year. 4. Practicing Named Entity Recognition (NER) for identifying names, locations, and professions.

    The dataset is available for educational purposes only and can be used to practice various data science and machine learning techniques. The data was collected under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

    Code of about this data set: https://github.com/ghulamhaider65/Web_scraping

  15. YouTube VP and Presidential Debate Comments

    • kaggle.com
    zip
    Updated Oct 24, 2020
    Cite
    Aadit Kapoor (2020). YouTube VP and Presidential Debate Comments [Dataset]. https://www.kaggle.com/aaditkapoor1201/youtube-vp-and-presidential-debate-comments
    Explore at:
    Available download formats: zip (32482 bytes)
    Dataset updated
    Oct 24, 2020
    Authors
    Aadit Kapoor
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    Context

    After getting mixed results from the news sources, I decided to analyze the Vice Presidential and Presidential debates using data science. The idea is to use YouTube comments as a medium to gauge sentiment regarding the debates and to extract insights from the data. In this analysis, we plot common phrases and common words and analyze sentiment, and in the end, for all the data science practitioners, I present a full-fledged dataset containing YouTube comments on the VP and Presidential debates.

    Why: After getting mixed results from the news sources about the outcome of the debate, I decided to use data science to help me see the outcome for myself. With the elections around the corner, technology, or to be precise analytics, plays a key role in shaping our thoughts and supporting our hypotheses.

    How: To analyze the YouTube comments we use Python and various other NLP libraries, followed by some data visualization tools. We will use the wonders of the awesome data wrangling library known as Pandas, and we hope to find some interesting insights.

    Content

    The dataset contains comments (YT comment scraped) and a sentiment calculated using the TextBlob library.
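    For reference, here is a minimal sketch of how a TextBlob sentiment score like the one in this dataset is typically computed; the example comment is made up:

    ```python
    from textblob import TextBlob

    comment = "That was a surprisingly strong performance in the debate."
    blob = TextBlob(comment)

    # polarity ranges from -1 (negative) to 1 (positive); subjectivity from 0 to 1
    print(blob.sentiment.polarity, blob.sentiment.subjectivity)
    ```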

    Acknowledgements

    YouTube data API

  16. Lego Sets and Parts Flattened

    • kaggle.com
    zip
    Updated Dec 1, 2017
    Cite
    rickvenadata (2017). Lego Sets and Parts Flattened [Dataset]. https://www.kaggle.com/rickvenadata/lego-sets-and-parts-flattened
    Explore at:
    Available download formats: zip (11286839 bytes)
    Dataset updated
    Dec 1, 2017
    Authors
    rickvenadata
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Derived from the LEGO Database https://www.kaggle.com/rtatman/lego-database , flattened into a single CSV file for use in a Kernel I plan to build with the Pandas Python library.

    Content

    The original data files were imported into MS SQL Server and then exported after performing a huge join across all of the tables. A bunch of data was excluded from my export query to simplify the data set for my research goals. Certain part categories, such as minifigs, as well as some themes like Duplo, have been excluded.

    Acknowledgements

    This is derived from the LEGO database https://www.kaggle.com/rtatman/lego-database which is courtesy of https://www.kaggle.com/rtatman and in turn originated from Rebrickable.

    Inspiration

    This is a fun data set to play with for learning data wrangling. I personally identify with it as a LEGO fan!

  17. Oslo City Bike Open Data

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    stanislav_o27 (2025). Oslo City Bike Open Data [Dataset]. https://www.kaggle.com/datasets/stanislavo27/oslo-city-bike-open-data
    Explore at:
    Available download formats: zip (251012812 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    stanislav_o27
    Area covered
    Oslo
    Description

    Source: https://oslobysykkel.no/en/open-data/historical

    I am not the author of the data; I only compiled and structured it from the source above using a Python script.

    oslo-city-bike license: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).

    Dataset structure

    The folder oslobysykkel contains all available data from 2019 to 2025. Format: oslobysykkel-YYYY-MM.csv. Why does 'oslo' still appear in the file names? Because there is also similar data for Trondheim and Bergen.

    Variables

    From oslobysykkel.no:

    Variable | Format | Description
    started_at | Timestamp | Timestamp of when the trip started
    ended_at | Timestamp | Timestamp of when the trip ended
    duration | Integer | Duration of trip in seconds
    start_station_id | String | Unique ID for start station
    start_station_name | String | Name of start station
    start_station_description | String | Description of where start station is located
    start_station_latitude | Decimal degrees in WGS84 | Latitude of start station
    start_station_longitude | Decimal degrees in WGS84 | Longitude of start station
    end_station_id | String | Unique ID for end station
    end_station_name | String | Name of end station
    end_station_description | String | Description of where end station is located
    end_station_latitude | Decimal degrees in WGS84 | Latitude of end station
    end_station_longitude | Decimal degrees in WGS84 | Longitude of end station

    Please note: this data and my analysis focuses on the new data format, but historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern.
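    A quick sketch for loading the monthly CSVs into one DataFrame is shown below; the folder path is a placeholder for wherever the dataset is stored:

    ```python
    import glob
    import pandas as pd

    files = sorted(glob.glob("oslobysykkel/oslobysykkel-*.csv"))
    trips = pd.concat(
        [pd.read_csv(f, parse_dates=["started_at", "ended_at"]) for f in files],
        ignore_index=True,
    )
    print(len(files), "files,", len(trips), "trips")
    ```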

    Motivation

    I myself was extremely fascinated by this open data from Oslo City Bike, and in the process of deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytical problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (clustering, forecasting), and so that anyone can participate in analysis and modeling based on this exciting data.

    Autumn's remake of Oslo bike sharing data analysis: https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing

    https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view

    Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.

    Index of my notebooks Phase 1: Cleaned Data & Core Insights Time-Space Dynamics Exploratory

    Challenge Ideas

    Clustering and Segmentation
    Demand Forecasting (Time Series)
    Geospatial Analysis (Network Analysis)

    Resources & Related Work

    Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis

    links to works I have found or that have inspired me

    Exploring Open Data from Oslo City Bike Jon Olave — visualization of popular routes and seasonality analysis.

    Oslo City Bike Data Wrangling Karl Tryggvason — predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).

    Helsinki City Bikes: Exploratory Data Analysis Analysis of a similar system in Helsinki — useful for comparative studies and methodological ideas.

    External Data Sources

    The idea is to connect this with other data. For example, I did it for weather data: integrating temperature, precipitation, and wind speed to explain variations in daily demand. https://meteostat.net/en/place/no/oslo

    I also used data from Airbnb (that's where I took division into neighbourhoods) https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv

    Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series

  18. US stocks short volumes (FINRA)

    • kaggle.com
    zip
    Updated Mar 16, 2021
    Cite
    DenzilG (2021). US stocks short volumes (FINRA) [Dataset]. https://www.kaggle.com/denzilg/finra-short-volumes-us-equities
    Explore at:
    Available download formats: zip (46837645 bytes)
    Dataset updated
    Mar 16, 2021
    Authors
    DenzilG
    License

    https://www.usa.gov/government-works/

    Description

    Inspiration

    Originally, I was planning to use the Python Quandl API to get the data from here because it is already conveniently in time-series format. However, the data is split by reporting agency, which makes it difficult to get an accurate picture of the true short ratio because of missing data and difficulty in aggregation. So, I clicked on the source link, which turned out to be a gold mine because of their consolidated data. The only downside was that it was all in .txt format, so I had to use regex to parse through it and data scraping to get the information from the website, but that was a good refresher 😄.

    For better understanding of what the values in the text file mean, you can read this pdf from FINRA: https://www.finra.org/sites/default/files/2020-12/short-sale-volume-user-guide.pdf

    Functionality

    I condensed all the individual text files into a single .txt file such that it's much faster and less complex to write code compared to having to iterate through each individual .txt file. I created several functions for this dataset so please check out my workbook "FINRA Short Ratio functions" where I have described step by step on how I gathered the data and formatted it so that you can understand and modify them to fit your needs. Note that the data is only for the range of 1st April 2020 onwards (20200401 to 20210312 as of gathering the data) and the contents are separated by | delimiters so I used \D (non-digit) in regex to avoid confusion with the (a|b) pattern syntax.
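    As a hedged alternative to the regex approach, pandas can also read the pipe-delimited consolidated file directly; the file name is a placeholder and the FINRA column layout (Date|Symbol|ShortVolume|TotalVolume) should be verified against the user guide linked above:

    ```python
    import pandas as pd

    df = pd.read_csv("finra_short_volumes.txt", sep="|")  # assumed file name
    df.columns = [c.strip().lower() for c in df.columns]

    # Example: daily short ratio for one ticker (column names assume the
    # Date|Symbol|ShortVolume|TotalVolume layout described in the FINRA guide).
    aapl = df[df["symbol"] == "AAPL"].copy()
    aapl["short_ratio"] = aapl["shortvolume"] / aapl["totalvolume"]
    print(aapl[["date", "short_ratio"]].head())
    ```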

    If you need historical data before April 2020, you can use the quandl database but it has non-consolidated information and you have to make a reference call for each individual stock for each agency so you would need to manually input tickers or get a list of all tickers through regex of the txt files or something like that 😅.

    Thoughts

    An excellent task to combine regular expressions (regex), web scraping, plotting, and data wrangling... see my notebook for an example with annotated workflow. Please comment and feel free to fork and modify my workbook to change the functionality. Possibly the short volumes can be combined with p/b ratios or price data to see the correlation --> can use seaborn pairgrid to visualise this for multiple stocks?

  19. Malaysia Covid-19 Dataset

    • kaggle.com
    zip
    Updated Jul 20, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TanKY (2021). Malaysia Covid-19 Dataset [Dataset]. https://www.kaggle.com/datasets/yeanzc/malaysia-covid19-dataset/discussion
    Explore at:
    Available download formats: zip (32611 bytes)
    Dataset updated
    Jul 20, 2021
    Authors
    TanKY
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Malaysia
    Description

    A free, publicly available Malaysia Covid-19 dataset.

    Data Descriptions

    28 variables, including:

    New case New case (7 day rolling average) Recovered Active case Local cases Imported case ICU Death Cumulative deaths

    People tested Cumulative people tested Positivity rate Positivity rate (7 day rolling average)

    Data Sources

    Columns 1 to 22 are Twitter data: the Tweets are retrieved from Health DG @DGHisham's timeline with the Twitter API. A typical covid situation update Tweet is written in a relatively fixed format. Data wrangling is done in Python/Pandas, with numerical values extracted using Regular Expressions (RegEx). Missing data are added manually from the Desk of DG (kpkesihatan).
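    A toy sketch of the kind of RegEx extraction described above; the tweet text and pattern are made-up examples, not the actual tweet format:

    ```python
    import re

    tweet = "Situasi terkini: 1,234 kes baharu, 987 kes sembuh, 5 kematian."

    # Pull out all numbers, dropping thousands separators
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", tweet)]
    new_cases, recovered, deaths = numbers
    print(new_cases, recovered, deaths)
    ```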

    Column 23 ['remark'] is my own written remark regarding the Tweet status/content.

    Column 24 ['Cumulative people tested'] data is transcribed from an image on MOH COVID-19 website. Specifically, the first image under TABURAN KES section in each Situasi Terkini daily webpage of http://covid-19.moh.gov.my/terkini. If missing, the image from CPRC KKM Telegram or KKM Facebook Live video is used. Data in this column, dated from 1 March 2020 to 11 Feb 2021, are from Our World in Data, their data collection method as stated here.

    Why does this dataset exist?

    MOH does not publish any covid data in csv/excel format as of today; they provide the data as is, along with infographics that are hardly informative. In an undisclosed email exchange, MOH did not seem to understand my request for them to release the covid public health data for anyone to download and analyze as they wish.

    To be updated periodically

    A simple visualization dashboard is now published on Tableau Public. It is updated daily. Do check it out! More charts will be added in the near future.

    Inspiration

    Create better visualizations to help fellow Malaysians understand the Covid-19 situation. Empower the data science community.

  20. Party strength in each US state

    • kaggle.com
    zip
    Updated Jan 13, 2017
    Cite
    GeneBurin (2017). Party strength in each US state [Dataset]. https://www.kaggle.com/datasets/kiwiphrases/partystrengthbystate
    Explore at:
    Available download formats: zip (16377 bytes)
    Dataset updated
    Jan 13, 2017
    Authors
    GeneBurin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    United States
    Description

    Data on party strength in each US state

    The repository contains data on party strength for each state as shown on each state's corresponding party strength Wikipedia page (for example, here is Virginia )

    Each state has a Wikipedia table giving a detailed summary of its governing and representative bodies, but there is no dataset that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.

    Data contents:

    The data contain information from 1980 on each state's: 1. governor and party 2. state house and senate composition 3. state representative composition in congress 4. electoral votes

    Clean Version

    Data in the clean version has been cleaned and processed substantially. Namely: all columns now contain homogeneous data within the column; names and Wiki-citations have been removed; only the party counts and party identification have been left. The notebook that created this file is here.

    Uncleaned Data Version

    The data contained herein have not been altered from their Wikipedia tables except in two instances: column names were forced to be consistent across states, and any needed data modifications (i.e., concatenated string columns) were made to retain information when combining columns.

    To use the data:

    Please note that the right encoding for the dataset is "ISO-8859-1", not 'utf-8', though in future versions I will try to fix that to make it more accessible.

    This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
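    A minimal loading sketch for the encoding note above (the file name matches the one mentioned in the description):

    ```python
    import pandas as pd

    df = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")
    print(df.head())
    ```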

    Raw scraped data

    The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
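    And a sketch for the raw pickle, assuming state-name keys as described (the file name is a placeholder):

    ```python
    import pandas as pd

    tables = pd.read_pickle("state_party_strength_raw.pkl")  # assumed file name
    print(list(tables)[:5])           # first few state names
    print(tables["Virginia"].head())  # one state's raw Wikipedia table
    ```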

    I hope it proves useful to you in analyzing political patterns at the state level in the US for political and policy research.
