License: https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes 12 files of monthly data from January 2022 to December 2022. The data is reliable because it is primary-source data from the company, Cyclistic Bike Share. All the information needed to conduct the analysis is included, so the data is comprehensive, and it was evaluated against the ROCCC criteria. RStudio 2022.12.0+353 "Elsbeth Geranium" was used to evaluate the data. Although there are some missing values, data cleaning ensured the results were not affected with respect to my main area of interest.
The main aim of my study is to differentiate how annual members and casual riders use Cyclistic bikes. My dataset and notebook include a clear statement of the business task as well as a clear description of all the data sources I used. A summary of my analysis is also included in the notebook. To make the analysis easier for readers to follow, I supported it with visualisations and key findings.
At the end of my notebook, you can find the recommendations I made based on the analysis. I would be more than happy to receive any feedback, advice, and comments.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Thorough knowledge of the structure of analyzed data makes it possible to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Given the multitude of available methods, selecting those that work well together and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case-study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward's algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients who participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex-hormone attributes. Further analysis was carried out separately for male and female patients. The optimal partitioning of the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset; no evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for the identification and visualization of biologically meaningful patient subgroups.
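To make the four-step procedure concrete, here is a compact Python sketch assembled from standard libraries; the paper's actual implementation is not specified, and the file name and parameter choices below are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.covariance import EmpiricalCovariance, MinCovDet
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("patients.csv")        # hypothetical table of biochemical attributes
X = RobustScaler().fit_transform(df)    # 1) robust normalization (median/IQR based)

# 2) outlier screening: classical (MD) vs. robust (rMD) squared Mahalanobis distances
md = EmpiricalCovariance().fit(X).mahalanobis(X)
rmd = MinCovDet(random_state=0).fit(X).mahalanobis(X)

# 3) hierarchical clustering with Ward's algorithm, cut into five subgroups
labels = fcluster(linkage(X, method="ward"), t=5, criterion="maxclust")

# 4) PCA; scaled loadings act as biplot vectors
pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)
biplot_vectors = pca.components_.T * np.sqrt(pca.explained_variance_)
```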
This dataset was created by Rahul Sharma.
License: https://creativecommons.org/publicdomain/zero/1.0/
Problem Statement-
Bike-sharing systems allow users to rent a bicycle at one location and return it at another; this dataset covers the bike-sharing system in Washington, D.C.
You are provided with rental data spanning two years. Your task is to predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
This bike rental dataset is intended for practicing pandas profiling; it contains numerical values.
Tasks to perform:
1. Perform exploratory data analysis.
2. Use pandas profiling.
3. Compare the pandas profiling report with the exploratory data analysis.
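A minimal sketch of the profiling task, assuming the current ydata-profiling package (the successor to pandas-profiling) and a hypothetical file name:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # successor to pandas_profiling

df = pd.read_csv("bike_rentals.csv")  # hypothetical file name
profile = ProfileReport(df, title="Bike Rental EDA", explorative=True)
profile.to_file("bike_rental_report.html")  # compare this report with your manual EDA
```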
License: https://choosealicense.com/licenses/other/
Assignment 1: EDA - US Company Bankruptcy Prediction
Student Name: Reef Zehavi
Date: November 10, 2025
📹 Project Presentation Video
https://www.loom.com/share/6920e493e8654ef3bb4f67a10eb9b03d
1. Overview and Project Goal
The goal of this project is to perform Exploratory Data Analysis (EDA) on a fundamental dataset of American companies. The analysis focuses on understanding the financial characteristics that differentiate between companies that survived… See the full description on the dataset page: https://huggingface.co/datasets/reefzehavi/EDA-US-Bankruptcy-Prediction.
Customer Personality Analysis – EDA Results
1. Project Goal
The goal of this project is to use numeric-focused Exploratory Data Analysis (EDA) on the Customer Personality Analysis dataset to understand:
- Which customer characteristics are associated with higher spending.
- How these characteristics differ between customers who responded to the last marketing campaign and those who did not.
The main outcome variable is:
Response (0 = no, 1 = yes) – did the customer respond… See the full description on the dataset page: https://huggingface.co/datasets/maigurski/maigurski-customer-personality-assignment1.
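A small pandas sketch of the comparison described above, assuming the column names commonly seen in the Kaggle release of this dataset (tab-separated file, Mnt* spending columns, Income, Response); treat these as assumptions:

```python
import pandas as pd

# assumed file name and separator from the common Kaggle release
df = pd.read_csv("marketing_campaign.csv", sep="\t")
spend_cols = [c for c in df.columns if c.startswith("Mnt")]
df["TotalSpend"] = df[spend_cols].sum(axis=1)

# mean spending and income for responders (1) vs. non-responders (0)
print(df.groupby("Response")[["TotalSpend", "Income"]].mean())
```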
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Diabetes Dataset — Exploratory Data Analysis (EDA)
This repository contains a diabetes-related tabular dataset and a complete Exploratory Data Analysis (EDA). The main objective of this project was to learn how to conduct a structured EDA, apply best practices, and extract meaningful insights from real-world health data.
The analysis includes correlations, distributions, group comparisons, class balance exploration, and statistical interpretations that illustrate how different… See the full description on the dataset page: https://huggingface.co/datasets/guyshilo12/diabetes_eda_analysis.
License: https://creativecommons.org/publicdomain/zero/1.0/
Please upvote this dataset if it helps you... glad to see any forks here!
BACKGROUND
DQLab Telco is a telecommunications company with numerous locations all over the world. In order to ensure that customers are not left behind, DQLab Telco has consistently paid attention to the customer experience since its establishment in 2019.
Even though DQLab Telco is only a little over a year old, many of its customers have already changed their subscriptions to rival companies. By using machine learning, management hopes to lower the number of customers who leave.
Having cleaned the data yesterday, it is now time for us to build the best model to forecast customer churn.
TASKS & STEPS
Yesterday, we completed "Cleansing Data" as part 1 of the project. As a data scientist, you are now expected to develop the appropriate model.
You will perform "Machine Learning Modeling" in this assignment using data from the previous month, specifically June 2020.
The actions that must be taken are:
1. Perform exploratory data analysis.
2. Carry out pre-processing of the data.
3. Build machine learning models.
4. Pick the best model.
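As a hedged sketch of steps 2-4, a baseline churn model might look like the following; the file and column names are assumptions, not part of the DQLab materials.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# assumed cleaned June 2020 extract with a binary "Churn" column
df = pd.read_csv("dqlab_telco_clean_jun2020.csv")
X = pd.get_dummies(df.drop(columns=["Churn"]), drop_first=True)  # encode categoricals
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```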
License: https://choosealicense.com/licenses/other/
Stroke Prediction Dataset — Exploratory Data Analysis (EDA) By Yuval Malka
Project Overview
This project explores the Stroke Prediction Dataset from Kaggle, containing 5,110 rows and 12 features related to demographics, health indicators, and lifestyle factors. The goal is to understand which factors may be associated with the likelihood of having a stroke by performing a full Exploratory Data Analysis (EDA). The target variable is stroke (0 = No Stroke, 1 = Stroke). This README summarizes… See the full description on the dataset page: https://huggingface.co/datasets/Yuvalos/stroke-prediction-eda-yuval-malka.
This collection of files is part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of 4 key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represents a distinct project task, with the fifth task encompassing the combination of the 4 risk factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).
This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.
Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.
Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.
UPDATE: A newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (also linked below). A newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (also linked below). This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains an ArcGIS toolbox with the following ArcGIS models: WellClipsToWormSections, BufferedRasterToClippedRaster, ExtractThermalPropertiesToCrossSection, AddExtraInfoToCrossSection, and CrossSectionExtraction.
The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and toolbox of ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave one out cross validation results are provided for each thermal resource.
Several models were built in order to process the well database with outliers removed. ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see series of 14 Worm Based Interpolation Boundaries .zip files) along with the wells in those regions were loaded into R using the rgdal package. Then, a stratified kriging algorithm implemented in the R gstat package was used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is called StratifiedKrigingInterpolation.R Details about the interpolation, and exploratory data analysis on the well data is provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.
The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.
A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.
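The project's cross-validation is implemented in R (LeaveOneOutCrossValidation.R, built on gstat). Purely to illustrate the idea, a Python analogue using the pykrige package might look like this; the coordinates and values below are placeholders, not project data.

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 50)        # placeholder well coordinates
y = rng.uniform(0, 100, 50)
z = rng.normal(30, 5, 50)          # placeholder thermal attribute values

errors = []
for i in range(len(z)):
    keep = np.arange(len(z)) != i  # leave well i out
    ok = OrdinaryKriging(x[keep], y[keep], z[keep], variogram_model="spherical")
    z_hat, _ = ok.execute("points", np.array([x[i]]), np.array([y[i]]))
    errors.append(z[i] - z_hat[0])

print("LOOCV RMSE:", np.sqrt(np.mean(np.square(errors))))
```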
Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two functions can be combined as a single function, as provided by the CrossSectionExtraction model.
License: http://opendatacommons.org/licenses/dbcl/1.0/
In this notebook, we will walk through solving a complete machine learning problem using a real-world dataset. This was a "homework" assignment given to me for a job application over summer 2018. The entire assignment can be viewed here, and the one-sentence summary is:
Use the provided building energy data to develop a model that can predict a building's Energy Star score, and then interpret the results to find the variables that are most predictive of the score.
This is a supervised, regression machine learning task: given a set of data with targets (in this case the score) included, we want to train a model that can learn to map the features (also known as the explanatory variables) to the target.
- Supervised problem: we are given both the features and the target.
- Regression problem: the target is a continuous variable, in this case ranging from 0 to 100.

During training, we want the model to learn the relationship between the features and the score, so we give it both the features and the answer. Then, to test how well the model has learned, we evaluate it on a testing set where it has never seen the answers!
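To make the setup concrete, here is a minimal sketch of this supervised regression task; the schema and model choice below are illustrative, not the assignment's actual ones.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data = pd.read_csv("building_energy.csv")                 # hypothetical cleaned file
X = data.drop(columns=["score"]).select_dtypes("number")  # numeric features only
y = data["score"]                                         # continuous target in [0, 100]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```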
Machine Learning Workflow
Although the exact implementation details can vary, the general structure of a machine learning project stays relatively constant:

1. Data cleaning and formatting
2. Exploratory data analysis
3. Feature engineering and selection
4. Establish a baseline and compare several machine learning models on a performance metric
5. Perform hyperparameter tuning on the best model to optimize it for the problem
6. Evaluate the best model on the testing set
7. Interpret the model results to the extent possible
8. Draw conclusions and write a well-documented report

Setting up the structure of the pipeline ahead of time lets us see how one step flows into the other. However, the machine learning pipeline is an iterative procedure, so we don't always follow these steps in a linear fashion. We may revisit a previous step based on results from further down the pipeline. For example, while we may perform feature selection before building any models, we may use the modeling results to go back and select a different set of features. Or, the modeling may turn up unexpected results that mean we want to explore our data from another angle. Generally, you have to complete one step before moving on to the next, but don't feel like once you have finished one step the first time, you cannot go back and make improvements!
This notebook will cover the first three (and a half) steps of the pipeline, with the other parts discussed in two additional notebooks. Throughout this series, the objective is to show how all the different data science practices come together to form a complete project. I try to focus more on the implementations of the methods rather than explaining them at a low level, but I have provided resources for those who want to go deeper. For the single best book (in my opinion) for learning the basics and implementing machine learning practices in Python, check out Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.
With this outline in place to guide us, let's get started!
🎓 Student Performance Factors — EDA & Insights
Michael Ozon — Assignment #1 (EDA & Dataset), Reichman University – Data Science Course
🎥 Presentation Video: https://drive.google.com/drive/folders/1cAXLzcZflMgv12EDlVTeQoKxzVumOjbd?usp=drive_link
📌 Project Overview
This project explores the Student Performance Factors dataset, containing 6,607 student records and 20 academic, behavioral, lifestyle, and demographic features. The goal of this Exploratory Data Analysis (EDA) is to understand which… See the full description on the dataset page: https://huggingface.co/datasets/michaelozon/student-performance-factors-analysis-michael-ozon.
The Engineered Geothermal System (EGS) Exploration Methodology Project is developing an exploration approach for EGS through the integration of geoscientific data. The overall project area is 2500 km², with the Calibration Area (Dixie Valley Geothermal Wellfield) being about 170 km². The Final Scientific Report (FSR) is submitted in two parts (I and II). FSR Part I presents (1) an assessment of the readily available public domain data and some proprietary data provided by Terra-Gen Power, LLC, (2) a re-interpretation of these data as required, (3) an exploratory geostatistical data analysis, (4) the baseline geothermal conceptual model, and (5) the EGS favorability/trust mapping. The conceptual model presented applies to both the hydrothermal system and EGS in the Dixie Valley region. FSR Part II presents (1) 278 new gravity stations; (2) enhanced gravity-magnetic modeling; (3) 42 new ambient seismic noise survey stations; (4) an integration of the new seismic noise data with a regional seismic network; (5) a new methodology and approach to interpret these data; (6) a novel method to predict rock type and temperature based on the newly interpreted data; (7) 70 new magnetotelluric (MT) stations; (8) an integrated interpretation of the enhanced MT data set; (9) the results of a 308-station soil CO2 gas survey; (10) new conductive thermal modeling in the project area; (11) new convective modeling in the Calibration Area; (12) pseudo-convective modeling in the Calibration Area; (13) enhanced data implications and qualitative geoscience correlations at three scales: (a) regional, (b) project, and (c) Calibration Area; (14) quantitative geostatistical exploratory data analysis; and (15) responses to nine questions posed in the proposal for this investigation. Enhanced favorability/trust maps were not generated because there was not a sufficient amount of new, fully vetted (see below) rock type, temperature, and stress data. The enhanced seismic data did generate a new method to infer rock type and temperature. (However, in the opinion of the Principal Investigator for this project, this new methodology needs to be tested and evaluated at other sites in the Basin and Range before it is used to generate the referenced maps.) As with the baseline conceptual model, the enhanced findings can be applied to both the hydrothermal system and EGS in the Dixie Valley region.
🏙️ NYC Airbnb Price Analysis
📘 Overview
This project analyzes the Airbnb NYC Listings Dataset to explore which property attributes have the greatest influence on an apartment's nightly rental price. The analysis includes: data loading, data cleaning, handling missing values, outlier detection, feature preparation, exploratory data analysis (EDA), visualizations, and insights & conclusions.
🗂️ 1. Data Loading
The dataset was downloaded from Kaggle and contains thousands of NYC Airbnb listings with 40+… See the full description on the dataset page: https://huggingface.co/datasets/meirnm13/meirneeman.
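As one example of the outlier-detection step listed above, a standard IQR filter on nightly price might look like this; the file and column names are assumed from the common Kaggle release.

```python
import pandas as pd

listings = pd.read_csv("AB_NYC_2019.csv")  # assumed Kaggle file name
q1, q3 = listings["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = listings["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = listings[mask]
print(f"kept {mask.mean():.1%} of listings after IQR filtering")
```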
The Engineered Geothermal System (EGS) Exploration Methodology Project is developing an exploration approach for EGS through the integration of geoscientific data. The Project chose the Dixie Valley Geothermal System in Nevada as a field laboratory site for methodology calibration purposes because, in the public domain, it is one of the most highly characterized geothermal systems in the Basin and Range, with a considerable amount of geoscience data and, most importantly, well data. This Baseline Conceptual Model report summarizes the results of the first three project tasks: (1) collect and assess the existing public domain geoscience data, (2) design and populate a GIS database, and (3) develop a baseline (existing data) geothermal conceptual model, evaluate geostatistical relationships, and generate baseline, coupled EGS favorability/trust maps from +1 km above sea level (asl) to -4 km asl for the Calibration Area (Dixie Valley Geothermal Wellfield) to identify EGS drilling targets at a scale of 5 km x 5 km. It presents (1) an assessment of the readily available public domain data and some proprietary data provided by Terra-Gen Power, LLC, (2) a re-interpretation of these data as required, (3) an exploratory geostatistical data analysis, (4) the baseline geothermal conceptual model, and (5) the EGS favorability/trust mapping. The conceptual model presented applies to both the hydrothermal system and EGS in the Dixie Valley region.
License: https://creativecommons.org/publicdomain/zero/1.0/
An exploratory data analysis project using Excel to understand what influences Instagram post reach and engagement.
This project uses an Instagram dataset imported from Kaggle to explore how different factors like hashtags, saves, shares, and caption length influence impressions and engagement.
Data cleaning steps:
- Removed unnecessary spaces using TRIM
- Deleted 17 duplicate rows, leaving 103 unique rows
- Standardized formatting: freeze top row, wrap text, center align
Shorter captions and higher save counts contribute more to reach than repeated hashtags (e.g., #Thecleverprogrammer, #Amankharwal, #Python). Profile visits are often linked to new followers.
Inspired by content from TheCleverProgrammer, Aman Kharwal, and Kaggle datasets.
Feel free to open an issue or share suggestions!
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
Daily Machine Learning Practice – 1 Commit per Day
Author: Astrid Villalobos
Location: Montréal, QC
LinkedIn: https://www.linkedin.com/in/astridcvr/
Objective
The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.
Dataset
Source: Kaggle – Sample Sales Data
File: data/sales_data_sample.csv
Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc.
Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.
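A first-commit-sized sketch of loading this file and starting on the country-level performance goal (the Latin-1 encoding is an assumption about the Kaggle file):

```python
import pandas as pd

sales = pd.read_csv("data/sales_data_sample.csv", encoding="latin-1")
# revenue by country: a first step toward the segmentation goal
print(sales.groupby("COUNTRY")["SALES"].sum().sort_values(ascending=False).head())
```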
Project Rules

| Rule | Description |
| --- | --- |
| 🟩 1 Commit per Day | Minimum one line of code daily to ensure consistency and discipline |
| 🌍 Bilingual Comments | Code and documentation in English and French |
| 📈 Visible Progress | Daily green squares = daily learning |

🧰 Tech Stack
Languages: Python
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Tools: Jupyter Notebook, GitHub, Kaggle
Learning Outcomes
By the end of this challenge:
- Develop a stronger understanding of data preprocessing, modeling, and evaluation.
- Build consistent coding habits through daily practice.
- Apply ML techniques to real-world sales data scenarios.
This dataset pulls the projects posted by clients on PeoplePerHour. Data collection started on January 20th, 2023, and approximately 40 new projects are added to this dataset every hour.
Inspiration:
I have been a freelance Python developer since my graduation (2019), and I recently completed the Google Data Analytics Professional Certificate from Coursera.
Last week I saw a cool video from Luke Barousse on YouTube (here's the link). He created a pipeline to scrape Data Analyst jobs in the US on a daily basis and updates the dataset daily on Kaggle. Lately I have also not been winning many jobs as a freelancer, and I have started looking for a job in Data Analytics. So I thought a lot about it and concluded that this analysis would be a great project to add to my resume.
I hope this dataset proves to be useful to you.
License: http://opendatacommons.org/licenses/dbcl/1.0/
In this project, we used HR analytics data to build a Power BI dashboard that helps an organization improve employee performance and retention (reduce attrition).
You can complete the Power BI project using this dataset. The dashboard covers the following topics:
- Dashboard Overview
- Raw HR Analytics Data
- Dashboard Setup
- Data Cleaning and Processing in Power BI
- Importing Data into Power BI
- Power BI Dashboard: KPIs
- Power BI Dashboard: Charts & Tables
- Exporting or Sharing the Power BI Dashboard
- Insights from the Dashboard
- Measures and Calculations in Power BI
License: https://creativecommons.org/publicdomain/zero/1.0/
The dataset used in this project contains features extracted from various applications, aiming to detect malware using machine learning techniques. Malware detection is a critical task in cybersecurity, as it helps protect users and organizations from potential threats.
Source: The dataset was sourced from [insert dataset source here]. It consists of features extracted from a large number of Android applications, including permissions, API calls, and other attributes. The original dataset was collected for research purposes and is publicly available for download.
Inspiration: The inspiration behind this project came from the increasing prevalence of malware attacks on mobile devices and the need for effective detection methods. By leveraging machine learning algorithms, we aim to develop a model that can accurately classify applications as benign or malicious based on their features. This project is motivated by a desire to contribute to cybersecurity research and develop practical solutions for malware detection.
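As a hedged illustration of the modeling direction described above, a simple baseline classifier over the extracted features might look like this; the file and column names are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("malware_features.csv")  # assumed feature table
X = df.drop(columns=["label"])            # permissions, API calls, etc.
y = df["label"]                           # 0 = benign, 1 = malicious

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```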