Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dive into the world of data manipulation with pandas, the powerful Python library for data analysis. This series of exercises is designed for beginners who are eager to learn how to use pandas for data wrangling tasks. Each exercise will cover a different aspect of pandas, from loading and exploring datasets to manipulating data and performing basic analysis. Whether you're new to programming or just getting started with pandas, these exercises will help you build a solid foundation in data wrangling skills. Join us on this exciting journey and unleash the power of pandas!
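A small warm-up in that spirit: load a dataset, explore it, and do a first pass of wrangling. The file name is just a placeholder; substitute any CSV you have at hand.

```python
import pandas as pd

# Load a CSV (placeholder file name -- use your own dataset).
df = pd.read_csv("your_dataset.csv")

print(df.head())        # first rows
df.info()               # column types and missing values (prints directly)
print(df.describe())    # basic numeric summary

# A first taste of wrangling: drop duplicates and fill numeric gaps with column means.
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
```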
In this project, I have done exploratory data analysis on the UCI Automobile dataset available at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
This dataset consists of data from the 1985 Ward's Automotive Yearbook. Here are the sources:
1) 1985 Model Import Car and Truck Specifications, 1985 Ward's Automotive Yearbook.
2) Personal Auto Manuals, Insurance Services Office, 160 Water Street, New York, NY 10038.
3) Insurance Collision Report, Insurance Institute for Highway Safety, Watergate 600, Washington, DC 20037.
Number of Instances: 398. Number of Attributes: 9, including the class attribute.
Attribute Information:
- mpg: continuous
- cylinders: multi-valued discrete
- displacement: continuous
- horsepower: continuous
- weight: continuous
- acceleration: continuous
- model year: multi-valued discrete
- origin: multi-valued discrete
- car name: string (unique for each instance)
This data set consists of three types of entities:
I - The specification of an auto in terms of various characteristics
II - Its assigned insurance risk rating. This corresponds to the degree to which the auto is riskier than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is riskier (or less risky), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling".
III - Its normalized losses in use as compared to other cars. This is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc...), and represents the average loss per car per year.
The analysis is divided into two parts:
Data Wrangling
Exploratory Data Analysis
Descriptive statistics
Groupby
Analysis of variance
Correlation
Correlation stats
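A minimal sketch of these steps in pandas/SciPy, assuming the standard 26 imports-85 column names from the UCI documentation (the raw file has no header row):

```python
import pandas as pd
from scipy import stats

# Column names follow the UCI imports-85 documentation; the raw file has no header row.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
cols = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
        "num-of-doors", "body-style", "drive-wheels", "engine-location",
        "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
        "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
        "compression-ratio", "horsepower", "peak-rpm", "city-mpg", "highway-mpg", "price"]
df = pd.read_csv(url, names=cols, na_values="?")

# Descriptive statistics
print(df.describe())

# Groupby: mean price per body style
print(df.groupby("body-style")["price"].mean())

# Analysis of variance: does price differ across drive-wheel types?
groups = [g["price"].dropna() for _, g in df.groupby("drive-wheels")]
print(stats.f_oneway(*groups))

# Correlation of the numeric features with price
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False))
```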
Acknowledgment: UCI Machine Learning Repository. Data link: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
I wanted to make some geospatial visualizations to convey the current severity of COVID-19 in different parts of the U.S.
I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases/deaths per day, per capita calculations, and county demographics.
After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I decided to open-source the final enriched data set in order to give others a head start in their COVID-19-related analytic, modeling, and visualization efforts.
This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per-day cases/deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.
Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and in totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat/lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to other counties/states?
- Join with other county-level datasets easily (with the fips code column).
See the column descriptions for more details on the dataset
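A hedged sketch of the fips-based join mentioned above; the file names and column labels are assumptions for illustration, not the published schema.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to the actual dataset and the table you join in.
covid = pd.read_csv("us_county_covid19_enriched.csv", dtype={"fips": str})
other = pd.read_csv("county_unemployment.csv", dtype={"fips": str})

# Join any other county-level table on the fips code column.
merged = covid.merge(other, on="fips", how="left")

# Example per-capita style calculation once population is available in the joined frame.
merged["cases_per_100k"] = merged["cases"] / merged["population"] * 100_000
```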
COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)
https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
COVID-2019 has been recognized as a global threat, and several studies are being conducted in order to contribute to the fight and prevention of this pandemic. This work presents a scholarly production dataset focused on COVID-19, providing an overview of scientific research activities and making it possible to identify the countries, scientists, and research groups most active in this task force to combat the coronavirus disease. The dataset is composed of 40,212 records of articles' metadata collected from the Scopus, PubMed, arXiv and bioRxiv databases from January 2019 to July 2020. The data were extracted using Python web scraping techniques and preprocessed with pandas data wrangling.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data set consisting of data joined for analyzing the SBIR/STTR program. Data consist of individual awards and agency-level observations. The R and Python code required for pulling, cleaning, and creating useful data sets has been included.
Allard_Get and Clean Data.R: This file provides the code for getting, cleaning, and joining the numerous data sets that this project combined. This code is written in the R language and can be used in any R environment running R 3.5.1 or higher. If the other files in this Dataverse are downloaded to the working directory, then this R code will be able to replicate the original study without needing the user to update any file paths.
Allard SBIR STTR WebScraper.py: This is the code I deployed to multiple Amazon EC2 instances to scrape data on each individual award in my data set, including the contact info and DUNS data.
Allard_Analysis_APPAM SBIR project: Forthcoming.
Allard_Spatial Analysis: Forthcoming.
Awards_SBIR_df.Rdata: This unique data set consists of 89,330 observations spanning the years 1983 - 2018 and accounting for all eleven SBIR/STTR agencies. It consists of data collected from the Small Business Administration's Awards API and also unique data collected through web scraping by the author.
Budget_SBIR_df.Rdata: 246 observations for 20 agencies across 25 years of their budget performance in the SBIR/STTR program. Data were collected from the Small Business Administration using the Annual Reports Dashboard, the Awards API, and an author-designed web crawler of the award websites.
Solicit_SBIR-df.Rdata: This data consists of observations of solicitations published by agencies for the SBIR program. This data was collected from the SBA Solicitations API.
Primary sources:
Small Business Administration. "Annual Reports Dashboard," 2018. https://www.sbir.gov/awards/annual-reports.
Small Business Administration. "SBIR Awards Data," 2018. https://www.sbir.gov/api.
Small Business Administration. "SBIR Solicit Data," 2018. https://www.sbir.gov/api.
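If you prefer to work in Python rather than R, the .Rdata files described above can be opened with the pyreadr package. This is a hedged sketch; the R object names stored inside each file are assumptions.

```python
# pip install pyreadr
import pyreadr

# read_r returns a dict-like mapping of R object names to pandas DataFrames.
result = pyreadr.read_r("Awards_SBIR_df.Rdata")
print(result.keys())                      # names of the objects stored in the file

awards = next(iter(result.values()))      # take the first (likely only) data frame
print(awards.shape)                       # expected to be roughly (89330, n_columns)
print(awards.head())
```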
Community Data License Agreement, Sharing 1.0: https://cdla.io/sharing-1-0/
Hello all,
This dataset is my humble attempt to allow myself and others to upgrade essential Python packages to their latest versions. This dataset contains the .whl files of the below packages, to be used across general kernels and especially in internet-off code challenges (see the install sketch after the table):
| Package | Version | Functionality |
|---|---|---|
| AutoGluon | 1.0.0 | AutoML models |
| Catboost | 1.2.2 1.2.3 | ML models |
| Iterative-Stratification | 0.1.7 | Iterative stratification for multi-label classifiers |
| Joblib | 1.3.2 | File dumping and retrieval |
| LAMA | 0.3.8b1 | AutoML models |
| LightGBM | 4.3.0 4.2.0 4.1.0 | ML models |
| MAPIE | 0.8.2 | Quantile regression |
| Numpy | 1.26.3 | Data wrangling |
| Pandas | 2.1.4 | Data wrangling |
| Polars | 0.20.3 0.20.4 | Data wrangling |
| PyTorch | 2.0.1 | Neural networks |
| PyTorch-TabNet | 4.1.0 | Neural networks |
| PyTorch-Forecast | 0.7.0 | Neural networks |
| Pygwalker | 0.3.20 | Data wrangling and visualization |
| Scikit-learn | 1.3.2 1.4.0 | ML Models/ Pipelines/ Data wrangling |
| Scipy | 1.11.4 | Data wrangling/ Statistics |
| TabPFN | 10.1.9 | ML models |
| Torch-Frame | 1.7.5 | Neural Networks |
| TorchVision | 0.15.2 | Neural Networks |
| XGBoost | 2.0.2 2.0.1 2.0.3 | ML models |
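A hedged sketch of the offline install referenced above; the dataset input path is an assumption and should be replaced with the actual mount point of this dataset in your kernel.

```python
import subprocess
import sys
from pathlib import Path

# Install a package from the local wheels instead of PyPI (works with the internet turned off).
# The directory below is a placeholder -- point it at this dataset's input folder.
WHEEL_DIR = Path("/kaggle/input/python-package-wheels")

subprocess.run(
    [sys.executable, "-m", "pip", "install", "lightgbm==4.3.0",
     "--no-index", "--find-links", str(WHEEL_DIR)],
    check=True,
)
```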
I plan to update this dataset with more libraries and later versions as they get upgraded in due course. I hope these wheel files are useful to one and all.
Best regards and happy learning and coding!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data collected by E. Hunting et al. comprising video footage and electric field recordings from a video camera and a field mill, respectively. Data wrangling was done by K. Manser, the author of the Python script.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Bike Share Data
Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles on a very short-term basis for a price. This allows people to borrow a bike from point A and return it at point B, though they can also return it to the same location if they'd like to just go for a ride. Regardless, each bike can serve several users per day.
Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.
In this project, you will use data provided by Motivate, a bike share system provider for many major cities in the United States, to uncover bike share usage patterns. You will compare the system usage between three large cities: Chicago, New York City, and Washington, DC.
The Datasets
Randomly selected data for the first six months of 2017 are provided for all three cities. All three of the data files contain the same core six (6) columns:
- Start Time (e.g., 2017-01-01 00:07:57)
- End Time (e.g., 2017-01-01 00:20:53)
- Trip Duration (in seconds, e.g., 776)
- Start Station (e.g., Broadway & Barry Ave)
- End Station (e.g., Sedgwick St & North Ave)
- User Type (Subscriber or Customer)
The Chicago and New York City files also have the following two columns:
- Gender
- Birth Year
Data for the first 10 rides in the new_york_city.csv file
The original files are much larger and messier, and you don't need to download them, but they can be accessed here if you'd like to see them (Chicago, New York City, Washington). These files had more columns and they differed in format in many cases. Some data wrangling has been performed to condense these files to the above core six columns to make your analysis and the evaluation of your Python skills more straightforward. In the Data Wrangling course that comes later in the Data Analyst Nanodegree program, students learn how to wrangle the dirtiest, messiest datasets, so don't worry, you won't miss out on learning this important skill!
Statistics Computed
You will learn about bike share use in Chicago, New York City, and Washington by computing a variety of descriptive statistics. In this project, you'll write code to provide the following information:
- most common month
- most common day of week
- most common hour of day
- most common start station
- most common end station
- most common trip from start to end (i.e., most frequent combination of start station and end station)
- total travel time
- average travel time
- counts of each user type
- counts of each gender (only available for NYC and Chicago)
- earliest, most recent, most common year of birth (only available for NYC and Chicago)

The Files
To answer these questions using Python, you will need to write a Python script. To help guide your work in this project, a template with helper code and comments is provided in a bikeshare.py file, and you will do your scripting in there also. You will need the three city dataset files too:
- chicago.csv
- new_york_city.csv
- washington.csv
All four of these files are zipped up in the Bikeshare file in the resource tab in the sidebar on the left side of this page. You may download and open up that zip file to do your project work on your local machine.
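A minimal sketch of how some of these statistics could be computed with pandas, assuming the column names match the descriptions above.

```python
import pandas as pd

# Load one city's file; the column names are assumed to match the list above.
df = pd.read_csv("chicago.csv", parse_dates=["Start Time", "End Time"])

# Popular times of travel
print("Most common month:", df["Start Time"].dt.month_name().mode()[0])
print("Most common day of week:", df["Start Time"].dt.day_name().mode()[0])
print("Most common start hour:", df["Start Time"].dt.hour.mode()[0])

# Popular stations and trips
print("Most common start station:", df["Start Station"].mode()[0])
print("Most common end station:", df["End Station"].mode()[0])
print("Most common trip:",
      (df["Start Station"] + " -> " + df["End Station"]).mode()[0])

# Trip duration and user breakdown
print("Total travel time (s):", df["Trip Duration"].sum())
print("Average travel time (s):", df["Trip Duration"].mean())
print(df["User Type"].value_counts())
```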
Netflix Dataset Exploration and Visualization
This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.
Objectives:
Analyze the distribution of content types (Movies vs. TV Shows)
Identify the most prolific countries producing Netflix content
Study the ratings and duration of shows
Handle missing values using techniques like interpolation, forward-fill, and custom replacements
Enhance readability with bar charts, horizontal plots, and annotated visuals
Key Visualizations:
Bar charts for type distribution and country-wise contributions
Handling missing data in rating, duration, and date_added
Annotated plots showing values for clarity
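A minimal sketch of the missing-value handling mentioned above; the file name and columns (rating, duration, date_added) follow the common Kaggle Netflix titles schema and are assumptions here.

```python
import pandas as pd

df = pd.read_csv("netflix_titles.csv")

# Custom replacement for missing ratings.
df["rating"] = df["rating"].fillna("Not Rated")

# Forward-fill missing added dates after parsing them.
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")
df["date_added"] = df["date_added"].ffill()

# duration looks like "90 min" or "2 Seasons": extract the number, interpolate remaining gaps.
df["duration_num"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(float)
df["duration_num"] = df["duration_num"].interpolate()
```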
Tools Used:
Python 3
Pandas for data wrangling
Matplotlib for visualizations
Jupyter Notebook for hands-on analysis
Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.
Feel free to fork, adapt, and extend the work.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A Case Study
In this case study we are going to use the automobile dataset, which contains specifications for plenty of car manufacturers, in order to build a predictive model to estimate the car price. This dataset has 26 columns, including categorical and quantitative attributes.
The given_automobile.csv contains records from the above-mentioned dataset.
You need to write descriptive answers to the questions under each task, and also use a proper program written in Python and execute the code.
1. The missing values are presented as '?' in the dataset. Apply data wrangling techniques using the Python programming language to resolve the missing values in all the attributes.
2. Check the data types of the columns with missing values, and convert the data type if needed.
3. Find all the features correlated with 'Price'.
4. Build a predictive model to predict the car price based on one of the independent correlated variables.
5. Continue with the same model built in No. 4, but choose different independent variables and discuss the result.
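A hedged sketch of tasks 1 to 4, assuming imports-85-style column names (e.g. normalized-losses, horsepower, engine-size, price); adjust them to the actual headers in given_automobile.csv.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Task 1: load the file, treating '?' as missing.
df = pd.read_csv("given_automobile.csv", na_values="?")

# Task 2: columns where '?' forced an object dtype; convert the ones that should be numeric.
# (Column names are assumptions -- adjust to the real headers.)
for col in ["normalized-losses", "bore", "stroke", "horsepower", "peak-rpm", "price"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

# Simple imputation: numeric columns with the mean, the rest with the mode.
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df = df.fillna(df.mode().iloc[0])

# Task 3: correlation of every numeric feature with price.
print(df.corr(numeric_only=True)["price"].sort_values(ascending=False))

# Task 4: one-variable linear model, e.g. engine-size -> price.
model = LinearRegression().fit(df[["engine-size"]], df["price"])
print("R^2:", model.score(df[["engine-size"]], df["price"]))
```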
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 300 publicly available matrimony profiles scraped from Vivaah.com using Python and Selenium.
Each profile includes:
- Profile ID
- Age & Height
- Religion
- Caste
- Mother Tongue
- Profession
- Education
- Location
🧠 Ideal for:
- Exploratory Data Analysis (EDA)
- Filtering & segmentation
- Recommender system prototypes
- Practice with web scraping & data wrangling
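A quick EDA sketch along those lines; the file name and exact column labels are assumptions based on the field list above, and the city filter is just an illustrative example.

```python
import pandas as pd

# Placeholder file and column names -- adjust to the actual CSV in this dataset.
df = pd.read_csv("vivaah_profiles.csv")

print(df["Religion"].value_counts())
print(df["Profession"].value_counts().head(10))
print(df.groupby("Religion")["Age"].describe())

# Simple segmentation example: profiles aged 25-30 in a given city (example value only).
segment = df[df["Age"].between(25, 30) & df["Location"].str.contains("Chennai", na=False)]
print(len(segment), "profiles in the segment")
```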
⚠️ This dataset is shared only for educational and research use. It includes no personal contact or private info.
Design a web portal to automate the various operations performed in machine learning projects to solve specific problems related to supervised or unsupervised use cases. The web portal must have the capabilities to perform the below-mentioned tasks:
1. Extract, Transform, Load:
a. Extract: The portal should provide the capability to configure any data source, e.g. cloud storage (AWS, Azure, GCP), databases (RDBMS, NoSQL), and real-time streaming data, to extract data into the portal. (Allow users to write a custom script if required to connect to any data source and extract data.)
b. Transform: The portal should provide various inbuilt functions/components to apply a rich set of transformations that convert the extracted data into the desired format.
c. Load: The portal should be able to save data into any of the cloud storage services after the extracted data is transformed into the desired format.
d. Allow the user to write a custom script in Python if some functionality is not present in the portal.
2. Exploratory Data Analysis: The portal should allow users to perform exploratory data analysis.
3. Data Preparation: Data wrangling, feature extraction, and feature selection should be automated with minimal user intervention.
4. The application must suggest the machine learning algorithm best suited to the use case and perform a best-model search to automate model development.
5. The application should provide a feature to deploy the model in any of the clouds, and it should create a prediction API to predict new instances.
6. The application should log every detail so that each activity can be audited in the future to investigate any event.
7. A detailed report should be generated for ETL, EDA, data preparation, and model development and deployment.
8. Create a dashboard to monitor model performance and create various alert mechanisms to notify the appropriate user to take necessary precautions.
9. Create functionality to retrain an existing model if necessary.
10. The portal must be designed so that it can be used by multiple organizations/users, where each organization/user is isolated from the others.
11. The portal should provide functionality to manage users, similar to the RBAC concept used in the cloud. (It is not necessary to build many roles, but design it so that roles can be added in the future and newly created roles can also be applied to users.) An organization can have multiple users, and each user will have a specific role.
12. The portal should have a scheduler to schedule training or prediction tasks, and appropriate alerts regarding the scheduled jobs should be sent to the subscriber/configured email ID.
13. Implement watcher functionality to perform prediction as soon as a file arrives at the input location.
You have to build a solution that should summarize the various news articles from different reading categories.
Code: You are supposed to write the code in a modular fashion.
- Safe: It can be used without causing harm.
- Testable: It can be tested at the code level.
- Maintainable: It can be maintained, even as your codebase grows.
- Portable: It works the same in every environment (operating system).
You have to maintain your code on GitHub. You have to keep your GitHub repo public so that anyone can check your code. You have to maintain a proper readme file for any project development. You should include the basic workflow and execution of the entire project in the readme file on GitHub. Follow the coding standards: https://www.python.org/dev/peps/pep-0008/
NoSQL) or use multiple database.
You can use any cloud platform for this entire solution hosting like AWS, Azure or GCP.
Logging is a must for every action performed by your code; use the Python logging library for this (a minimal sketch follows after this list).
Use a source version control tool to implement a CI/CD pipeline, e.g. Azure DevOps, GitHub, Circle CI.
You can host your application on a cloud platform using an automated CI/CD pipeline.
You have to submit complete solution design strate...
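As a small illustration of the logging requirement above, here is a minimal sketch using the standard library; the log file name, format string, and the summarize_article helper are assumptions for illustration only.

```python
import logging

# Minimal setup: every action is written to a log file with a timestamp and level.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(module)s | %(message)s",
)
logger = logging.getLogger(__name__)

def summarize_article(url: str) -> str:
    """Hypothetical summarization entry point, shown only to demonstrate logging."""
    logger.info("Summarization requested for %s", url)
    try:
        summary = "..."  # placeholder for the actual summarization step
        logger.info("Summarization finished for %s", url)
        return summary
    except Exception:
        logger.exception("Summarization failed for %s", url)
        raise
```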
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
After getting mixed results from the news sources, I thought to analyze the Vice Presidential and Presidential debates using data science. The idea is to use YouTube comments as a medium to gauge the sentiment regarding the debates and to get insights from the data. In this analysis, we plot common phrases and common words, we analyze sentiment, and, in the end, for all my data science practitioners, I present a full-fledged dataset containing YouTube comments on the VP and Presidential debates.
Why: After getting mixed results from the news sources about the outcome of the debate, I decided to use data science to help me judge the outcome myself. With the elections around the corner, technology, or to be precise analytics, plays a key role in shaping our thoughts and supporting our hypotheses. How: To analyze YouTube comments we use Python and various other NLP libraries, followed by some data visualization tools. We will use the wonders of the awesome data wrangling library known as Pandas, and we hope to find some interesting insights.
The dataset contains comments (scraped YouTube comments) and a sentiment score calculated using the TextBlob library.
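A minimal sketch of that sentiment step with TextBlob; the example comments are invented.

```python
from textblob import TextBlob

# TextBlob's polarity score lies in [-1, 1]; negative, zero, and positive map to sentiment labels.
comments = [
    "That closing statement was brilliant",
    "Worst debate performance I have ever seen",
]
for text in comments:
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f}  {label}  {text}")
```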
YouTube data API
---- WEB SCRAPING PROJECT ----
DESCRIPTION: This is a self-guided project where I have tasked myself with extracting the top 100 companies in the United States by Revenue according to the Fortune 500 list, performing EDA, and then narrowing it down to the Technology companies within that list.
Dataset URL: https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue
PROBLEM STATEMENT: How much Revenue have the largest U.S. Tech companies generated within the year 2023 so far?
PROJECT TYPE: Web Scraping, EDA (Exploratory Data Analysis), Data Wrangling/Cleaning, Data Visualization
SOFTWARE TOOLS USED: Python 3.1.0
DATE: 20th September, 2023
Author: Oluwabori Abiodun-Johnson
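As a hedged illustration of the scraping step, pandas can read the Wikipedia tables directly; the table index and the column labels used below ("Name", "Industry", "Revenue (USD millions)") are assumptions that may change as the page gets edited.

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"

# read_html returns every table on the page; the main ranked table is assumed to be the first.
fortune = pd.read_html(url)[0].head(100)          # top 100 companies by revenue

# Narrow the list down to the technology companies (column names are assumptions).
tech = fortune[fortune["Industry"].str.contains("tech", case=False, na=False)]
print(tech[["Name", "Industry", "Revenue (USD millions)"]])
```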
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains a collection of notable deaths from around the world, as recorded on Wikipedia throughout the year 2024. The data was scraped from the Wikipedia pages for each month of 2024, specifically from the "Deaths in [Month] 2024" articles.
Columns in the dataset:
1. Name: The name of the deceased individual.
2. Age: The age at which the individual passed away.
3. Location/Profession: The geographical location or professional background of the individual.
4. Cause of Death: The reported cause of death (if available).
5. Month: The month of the year in which the individual passed away.
Data Collection Methodology: Data was collected using a custom Python script that utilized the requests and BeautifulSoup libraries to scrape and parse the data from Wikipedia.
Information was extracted from the list of deaths provided on Wikipedia pages for each month, and the data was cleaned and organized into a structured CSV file.
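A hedged sketch of that collection step; the CSS selector and entry format are assumptions about how the "Deaths in ..." pages are typically laid out, so they may need adjusting.

```python
import requests
from bs4 import BeautifulSoup

# One monthly page; the other months follow the same URL pattern.
url = "https://en.wikipedia.org/wiki/Deaths_in_January_2024"
resp = requests.get(url, headers={"User-Agent": "research-script/0.1"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

entries = []
# Assumption: each day's deaths are list items inside the main article body.
for li in soup.select("div.mw-parser-output > ul > li"):
    text = li.get_text(" ", strip=True)
    # Entries typically look like "Name, 83, nationality profession, cause of death."
    entries.append(text)

print(len(entries), "raw entries scraped")
print(entries[:3])
```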
This dataset is ideal for educational purposes and provides an opportunity for practicing data cleaning, data wrangling, and NLP techniques like text classification, Named Entity Recognition (NER), and summarization tasks.
Use Cases:
1. Data cleaning and preprocessing exercises.
2. Natural Language Processing (NLP) and text analysis tasks.
3. Time-based analysis or trend analysis of notable deaths over the course of a year.
4. Practicing Named Entity Recognition (NER) for identifying names, locations, and professions.
The dataset is available for educational purposes only and can be used to practice various data science and machine learning techniques. The data was collected under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Code for this data set: https://github.com/ghulamhaider65/Web_scraping
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
- New case
- New case (7 day rolling average)
- Recovered
- Active case
- Local cases
- Imported case
- ICU
- Death
- Cumulative deaths
- People tested
- Cumulative people tested
- Positivity rate
- Positivity rate (7 day rolling average)
Columns 1 to 22 are Twitter data; the Tweets are retrieved from Health DG @DGHisham's timeline with the Twitter API. A typical covid situation update Tweet is written in a relatively fixed format. Data wrangling is done in Python/Pandas, with numerical values extracted using Regular Expressions (RegEx). Missing data are added manually from the Desk of DG (kpkesihatan).
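As an illustration of that extraction step, here is a hedged sketch; the tweet below is an invented example in the same spirit, not a verbatim @DGHisham tweet, and the label patterns are assumptions that would need adjusting to the real fixed format.

```python
import re
import pandas as pd

# Invented example text for illustration only.
tweet = "Status terkini COVID-19: Kes baharu: 1,234 Kes import: 12 Kes tempatan: 1,222 Kematian: 8"

def grab(label: str, text: str):
    # \D skips the non-digit separators between the label and its number.
    m = re.search(label + r"\D*([\d,]+)", text)
    return int(m.group(1).replace(",", "")) if m else None

row = {
    "new_cases": grab("Kes baharu", tweet),
    "imported_cases": grab("Kes import", tweet),
    "local_cases": grab("Kes tempatan", tweet),
    "deaths": grab("Kematian", tweet),
}
print(pd.DataFrame([row]))
```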
Column 23 ['remark'] is my own written remark regarding the Tweet status/content.
Column 24 ['Cumulative people tested'] data is transcribed from an image on MOH COVID-19 website. Specifically, the first image under TABURAN KES section in each Situasi Terkini daily webpage of http://covid-19.moh.gov.my/terkini. If missing, the image from CPRC KKM Telegram or KKM Facebook Live video is used. Data in this column, dated from 1 March 2020 to 11 Feb 2021, are from Our World in Data, their data collection method as stated here.
MOH does not publish any covid data in csv/excel format as of today; they provide the data as is, along with infographics that are hardly informative. In an undisclosed email, MOH did not seem to understand my request for them to release the covid public health data so that anyone can download it and do their own analysis if they wish.
A simple visualization dashboard is now published on Tableau Public. It is updated daily. Do check it out! More charts will be added in the near future.
Create better visualizations to help fellow Malaysians understand the Covid-19 situation. Empower the data science community.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The repository contains data on party strength for each state as shown on each state's corresponding party strength Wikipedia page (for example, here is Virginia).
Each state has a Wikipedia table giving a detailed summary of the state of its governing and representing bodies, but there is no data set that collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv. The code that generated these files can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in congress
4. electoral votes
Data in the clean version has been cleaned and processed substantially. Namely:
- all columns now contain homogenous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been left
The notebook that created this file is here.
The data contained herein have not been altered from their Wikipedia tables except in two instances:
- Forced column names to be in accord across states
- Any needed data modifications (i.e., concatenated string columns) to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not 'utf-8', though in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
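A minimal loading sketch, assuming the CSV names above; the pickle file name is not spelled out in the description, so it is a placeholder.

```python
import pandas as pd

# The cleaned CSV uses ISO-8859-1 rather than utf-8 (see the note above).
df = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")
print(df.head())

# The raw scrape is a pickled dict of state name -> DataFrame (file name is a placeholder).
raw = pd.read_pickle("state_party_strength_raw.pickle")
print(list(raw.keys())[:5])
print(raw["Virginia"].head())
```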
Hope it proves useful to you in analyzing/using political patterns at the state level in the US for political and policy research.
oslo-city-bike
License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because there is also similar data for Trondheim and Bergen.
Schema from oslobysykkel.no:

| Variable | Format | Description |
|---|---|---|
| started_at | Timestamp | Timestamp of when the trip started |
| ended_at | Timestamp | Timestamp of when the trip ended |
| duration | Integer | Duration of trip in seconds |
| start_station_id | String | Unique ID for start station |
| start_station_name | String | Name of start station |
| start_station_description | String | Description of where start station is located |
| start_station_latitude | Decimal degrees in WGS84 | Latitude of start station |
| start_station_longitude | Decimal degrees in WGS84 | Longitude of start station |
| end_station_id | String | Unique ID for end station |
| end_station_name | String | Name of end station |
| end_station_description | String | Description of where end station is located |
| end_station_latitude | Decimal degrees in WGS84 | Latitude of end station |
| end_station_longitude | Decimal degrees in WGS84 | Longitude of end station |
Please note: this data and my analysis focuses on the new data format, but historical data for the period April 2016 - December 2018 (Legacy Trip Data) has a different pattern.
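A minimal loading sketch, assuming the monthly file naming pattern and the schema documented above.

```python
import pandas as pd

# Load one monthly file (the 2024-06 name follows the oslobysykkel-YYYY-MM.csv pattern).
df = pd.read_csv("oslobysykkel/oslobysykkel-2024-06.csv",
                 parse_dates=["started_at", "ended_at"])

# Trip length distribution (duration is in seconds).
print(df["duration"].describe())

# Busiest start stations with their median trip length.
print(df.groupby("start_station_name")["duration"]
        .agg(trips="count", median_sec="median")
        .sort_values("trips", ascending=False)
        .head(10))
```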
I myself was extremely fascinated by this open data from Oslo City Bike and, in the process of deep analysis, saw broad prospects. This interest turned into an idea to create a data-analytical problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), as well as so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis**
https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks
Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
Exploring Open Data from Oslo City Bike Jon Olave — visualization of popular routes and seasonality analysis.
Oslo City Bike Data Wrangling Karl Tryggvason — predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
Helsinki City Bikes: Exploratory Data Analysis Analysis of a similar system in Helsinki — useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did it for weather data: integrating temperature, precipitation, and wind speed to explain variations in daily demand. https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that's where I took division into neighbourhoods) https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
oslo bike-sharing eda feature-engineering geospatial time-series
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Derived from the LEGO Database (https://www.kaggle.com/rtatman/lego-database), flattened into a single CSV file for use in a Kernel I plan to build with the Pandas Python library.
The original data files were imported into MS SQL Server, and then exported after performing a huge join across all of the tables. A bunch of data was excluded from my export query to simplify the data set for my research goals. Certain part categories such as minifigs, as well as some themes like Duplo, have been excluded.
This is derived from the LEGO database https://www.kaggle.com/rtatman/lego-database which is courtesy of https://www.kaggle.com/rtatman and in turn originated from Rebrickable.
This is a fun data set to play with for learning data wrangling. I personally identify with it as a LEGO fan!
U.S. Government Works: https://www.usa.gov/government-works/
Originally, I was planning to use the Python Quandl API to get the data from here because it is already conveniently in time-series format. However, the data is split by reporting agency, which makes it difficult to get an accurate picture of the true short ratio because of missing data and difficulty in aggregation. So I clicked on the source link, which turned out to be a gold mine because of its consolidated data. The only downside was that it was all in .txt format, so I had to use regex to parse it and web scraping to get the information from the website, but that was a good refresher 😄.
For better understanding of what the values in the text file mean, you can read this pdf from FINRA: https://www.finra.org/sites/default/files/2020-12/short-sale-volume-user-guide.pdf
I condensed all the individual text files into a single .txt file so that writing code is much faster and less complex than iterating through each individual .txt file. I created several functions for this dataset, so please check out my workbook "FINRA Short Ratio functions", where I describe step by step how I gathered and formatted the data so that you can understand and modify the functions to fit your needs. Note that the data only covers 1st April 2020 onwards (20200401 to 20210312 as of gathering the data), and the contents are separated by | delimiters, so I used \D (non-digit) in the regex to avoid confusion with the (a|b) pattern syntax.
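A hedged parsing sketch: the column names below follow FINRA's documented short sale volume file layout (see the user guide linked above), and the combined file name is an assumption.

```python
import pandas as pd

# Assumed layout: Date|Symbol|ShortVolume|ShortExemptVolume|TotalVolume|Market
df = pd.read_csv("consolidated_short_volume.txt", sep="|")

df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")
df["short_ratio"] = df["ShortVolume"] / df["TotalVolume"]

# Daily short ratio for a single ticker.
print(df[df["Symbol"] == "AAPL"].set_index("Date")["short_ratio"].tail())
```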
If you need historical data before April 2020, you can use the quandl database but it has non-consolidated information and you have to make a reference call for each individual stock for each agency so you would need to manually input tickers or get a list of all tickers through regex of the txt files or something like that 😅.
An excellent task to combine regular expressions (regex), web scraping, plotting, and data wrangling... see my notebook for an example with annotated workflow. Please comment and feel free to fork and modify my workbook to change the functionality. Possibly the short volumes can be combined with p/b ratios or price data to see the correlation --> can use seaborn pairgrid to visualise this for multiple stocks?