Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by yvonne gatwiri
Released under Apache 2.0
This dataset was created by Nandish Jani
Portobello Tech is an app innovator that has devised an intelligent way of predicting employee turnover within the company. It periodically evaluates employees' work details, including the number of projects they worked on, average monthly working hours, time spent in the company, promotions in the last 5 years, and salary level. Data from prior evaluations show each employee’s satisfaction at the workplace. The data could be used to identify patterns in work style and in employees' interest in continuing to work at the company. The HR Department owns the data and uses it to predict employee turnover. Employee turnover refers to the total number of workers who leave a company over a certain time period.
You work for Spark Funds, an asset management company. Spark Funds wants to invest in a few companies. The CEO of Spark Funds wants to understand global investment trends so that she can make investment decisions effectively.
It wants to invest between 5 and 15 million USD per round of investment.
It wants to invest only in English-speaking countries because of the ease of communication with the companies it would invest in.
For your analysis, consider a country to be English-speaking only if English is one of its official languages.
You may use this link for a list of countries where English is an official language: https://en.wikipedia.org/wiki/List_of_territorial_entities_where_English_is_an_official_language
These conditions will give you sufficient information for your initial analysis. Before getting to specific questions, let’s understand the problem and the data first.
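The two conditions above can be expressed as a simple filter. The following is a minimal sketch, not the assignment's reference solution: the field names `country_code` and `raised_amount_usd`, the sample rounds, and the small set of country codes are all assumptions made for illustration, and the bounds are assumed to be inclusive.

```python
# Illustrative, deliberately incomplete set of ISO-3 codes for countries where
# English is an official language; build the real set from the Wikipedia list.
ENGLISH_OFFICIAL = {"CAN", "IND", "IRL", "SGP"}

MIN_USD = 5_000_000   # lower bound per funding round (assumed inclusive)
MAX_USD = 15_000_000  # upper bound per funding round (assumed inclusive)


def is_candidate(funding_round):
    """Return True if a funding round satisfies both investment conditions."""
    amount = funding_round.get("raised_amount_usd")
    if amount is None:  # rounds with an unknown amount cannot be checked
        return False
    in_range = MIN_USD <= amount <= MAX_USD
    english = funding_round.get("country_code") in ENGLISH_OFFICIAL
    return in_range and english


# Hypothetical rounds; the field names are assumptions, not the real schema.
rounds = [
    {"company": "A", "country_code": "IND", "raised_amount_usd": 10_000_000},
    {"company": "B", "country_code": "CHN", "raised_amount_usd": 8_000_000},
    {"company": "C", "country_code": "CAN", "raised_amount_usd": 20_000_000},
    {"company": "D", "country_code": "IRL", "raised_amount_usd": None},
]

candidates = [r["company"] for r in rounds if is_candidate(r)]
print(candidates)
```

Only company A passes both checks here: B is outside the country set, C exceeds the upper bound, and D has no recorded amount.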
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Census County Economically Distressed Areas 2018’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/0b289b5e-0507-424d-9f07-f8d2b11b9580 on 27 January 2022.
--- Dataset description provided by original source is as follows ---
This is a copy of the statewide Census County GIS Tiger file. It is used to determine whether a county is an Economically Distressed Area (EDA) by adding ACS (American Community Survey) Median Household Income (MHI) and Population Density data at the county level. The IRWM web-based DAC mapping tool uses this GIS layer. Every year this table gets updated after ACS publishes their updated estimates. It was created by joining the 2016 EDA table to the 2010 block groups feature class. The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line File is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. Block Groups (BGs) are defined before tabulation block delineation and numbering, but are clusters of blocks within the same census tract that have the same first digit of their 4-digit census block number from the same decennial census. For example, Census 2000 tabulation blocks 3001, 3002, 3003, ..., 3999 within Census 2000 tract 1210.02 are also within BG 3 within that census tract. Census 2000 BGs generally contained between 600 and 3,000 people, with an optimum size of 1,500 people. Most BGs were delineated by local participants in the Census Bureau's Participant Statistical Areas Program (PSAP). The Census Bureau delineated BGs only where the PSAP participant declined to delineate BGs or where the Census Bureau could not identify any local PSAP participant. A BG usually covers a contiguous area. Each census tract contains at least one BG, and BGs are uniquely numbered within census tract.
Within the standard census geographic hierarchy, BGs never cross county or census tract boundaries, but may cross the boundaries of other geographic entities like county subdivisions, places, urban areas, voting districts, congressional districts, and American Indian / Alaska Native / Native Hawaiian areas. BGs have a valid code range of 0 through 9. BGs coded 0 were intended to only include water area, no land area, and they are generally in territorial seas, coastal water, and Great Lakes water areas. For Census 2000, rather than extending a census tract boundary into the Great Lakes or out to the U.S. nautical three-mile limit, the Census Bureau delineated some census tract boundaries along the shoreline or just offshore. The Census Bureau assigned a default census tract number of 0 and BG of 0 to these offshore, water-only areas not included in regularly numbered census tract areas.
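The first-digit rule described above (blocks 3001 through 3999 in a tract all belong to BG 3) can be illustrated with a short sketch. The function and variable names below are hypothetical and are not part of any Census Bureau tooling; block numbers are assumed to arrive as 4-character strings.

```python
from collections import defaultdict


def block_group_of(block_number: str) -> int:
    """Return the block group implied by a 4-digit census block number.

    The block group is the first digit of the block number, so the result
    is always in the valid code range 0 through 9.
    """
    if len(block_number) != 4 or not (block_number.isascii() and block_number.isdigit()):
        raise ValueError(f"expected a 4-digit block number, got {block_number!r}")
    return int(block_number[0])


def group_blocks(tract: str, block_numbers):
    """Cluster the blocks of one census tract by (tract, block group)."""
    groups = defaultdict(list)
    for number in block_numbers:
        groups[(tract, block_group_of(number))].append(number)
    return dict(groups)


# Blocks from the example in the text, plus one block from another group.
groups = group_blocks("1210.02", ["3001", "3002", "3999", "1000"])
print(groups)
```

Keying the result on the tract as well as the digit reflects the statement that BGs are numbered uniquely only within a census tract.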
--- Original source retains full ownership of the source dataset ---
This dataset is cleaned and ready to deploy for model building.
This dataset is for learning purposes and is therefore simplified, with no null values or major skewness.
I learned much from Kaggle and the data community, and this is my contribution so that the flow of knowledge never stops.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Univariate and multivariate Cox regression models testing associations between baseline characteristics and risk of EDA during RTX treatment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Census Block Group Economically Distressed Areas 2018’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/ac57065c-1179-421b-968f-e8010700189c on 12 February 2022.
--- Dataset description provided by original source is as follows ---
This is a copy of the statewide Census Block Group GIS Tiger file. It is used to determine whether a block group (BG) is an Economically Distressed Area (EDA) by adding ACS (American Community Survey) Median Household Income (MHI) and Population Density data at the BG level. The IRWM web-based DAC mapping tool uses this GIS layer. Every year this table gets updated after ACS publishes their updated estimates. It was created by joining the 2016 EDA table to the 2010 block groups feature class. The TIGER/Line Files are shapefiles and related database files (.dbf) that are an extract of selected geographic and cartographic information from the U.S. Census Bureau's Master Address File / Topologically Integrated Geographic Encoding and Referencing (MAF/TIGER) Database (MTDB). The MTDB represents a seamless national file with no overlaps or gaps between parts; however, each TIGER/Line File is designed to stand alone as an independent data set, or they can be combined to cover the entire nation. Block Groups (BGs) are defined before tabulation block delineation and numbering, but are clusters of blocks within the same census tract that have the same first digit of their 4-digit census block number from the same decennial census. For example, Census 2000 tabulation blocks 3001, 3002, 3003, ..., 3999 within Census 2000 tract 1210.02 are also within BG 3 within that census tract. Census 2000 BGs generally contained between 600 and 3,000 people, with an optimum size of 1,500 people. Most BGs were delineated by local participants in the Census Bureau's Participant Statistical Areas Program (PSAP). The Census Bureau delineated BGs only where the PSAP participant declined to delineate BGs or where the Census Bureau could not identify any local PSAP participant. A BG usually covers a contiguous area. Each census tract contains at least one BG, and BGs are uniquely numbered within census tract.
Within the standard census geographic hierarchy, BGs never cross county or census tract boundaries, but may cross the boundaries of other geographic entities like county subdivisions, places, urban areas, voting districts, congressional districts, and American Indian / Alaska Native / Native Hawaiian areas. BGs have a valid code range of 0 through 9. BGs coded 0 were intended to only include water area, no land area, and they are generally in territorial seas, coastal water, and Great Lakes water areas. For Census 2000, rather than extending a census tract boundary into the Great Lakes or out to the U.S. nautical three-mile limit, the Census Bureau delineated some census tract boundaries along the shoreline or just offshore. The Census Bureau assigned a default census tract number of 0 and BG of 0 to these offshore, water-only areas not included in regularly numbered census tract areas.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19 dataset in Japan’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lisphilar/covid19-dataset-in-japan on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This is a COVID-19 dataset in Japan. This does not include the cases in Diamond Princess cruise ship (Yokohama city, Kanagawa prefecture) and Costa Atlantica cruise ship (Nagasaki city, Nagasaki prefecture).
- Total number of cases in Japan
- The number of vaccinated people (New/experimental)
- The number of cases at prefecture level
- Metadata of each prefecture
Note: Lisphilar (author) uploads the same files to https://github.com/lisphilar/covid19-sir/tree/master/data
This dataset can be retrieved with CovsirPhy (Python library).
pip install covsirphy --upgrade
import covsirphy as cs
data_loader = cs.DataLoader()
japan_data = data_loader.japan()
# The number of cases (total / each prefecture)
clean_df = japan_data.cleaned()
# Metadata
meta_df = japan_data.meta()
Please refer to CovsirPhy Documentation: Japan-specific dataset.
Note: Before analysing the data, please refer to Kaggle notebook: EDA of Japan dataset and COVID-19: Government/JHU data in Japan. The detailed explanation of the build process is discussed in Steps to build the dataset in Japan. If you find errors or have any questions, feel free to create a discussion topic.
covid_jpn_total.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- with symptoms (to 08May2020) / without symptoms (to 08May2020) / unknown (to 08May2020)
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with mild symptoms (to 08May2020) / severe symptoms / unknown (to 08May2020)
- requiring hospitalization, but waiting in hotels or at home (to 08May2020)
In the primary source, some variables were removed on 09May2020. Values are NA in this dataset from 09May2020.
Manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
The number of vaccinated people:
- Vaccinated_1st: the number of persons vaccinated for the first time on the date
- Vaccinated_2nd: the number of persons vaccinated with the second dose on the date
- Vaccinated_3rd: the number of persons vaccinated with the third dose on the date
Data sources for vaccination:
- To 09Apr2021: 厚生労働省 HP 新型コロナワクチンの接種実績(in Japanese)
- 首相官邸 新型コロナワクチンについて
- From 10Apr2021: Twitter: 首相官邸(新型コロナワクチン情報)
covid_jpn_prefecture.csv
Cumulative number of cases:
- PCR-tested / PCR-tested and positive
- discharged
- fatal
The number of cases:
- requiring hospitalization (from 09May2020)
- hospitalized with severe symptoms (from 09May2020)
Using pdf-excel converter, manually collected the data from Ministry of Health, Labour and Welfare HP:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
Note: covid_jpn_prefecture.groupby("Date").sum() does not match covid_jpn_total.
When you analyse total data in Japan, please use covid_jpn_total data.
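One way to see the mismatch described in this note is to sum the prefecture rows per date and compare the result with the national row. The sketch below uses only the standard library and mirrors what `groupby("Date").sum()` would compute; the column names `Date`, `Prefecture`, and `Positive` are assumptions about the CSV headers, and the numbers are invented purely for illustration.

```python
from collections import defaultdict

# Invented rows standing in for covid_jpn_prefecture.csv (not real figures).
prefecture_rows = [
    {"Date": "2020-04-01", "Prefecture": "Tokyo", "Positive": 100},
    {"Date": "2020-04-01", "Prefecture": "Osaka", "Positive": 60},
    {"Date": "2020-04-02", "Prefecture": "Tokyo", "Positive": 120},
    {"Date": "2020-04-02", "Prefecture": "Osaka", "Positive": 70},
]
# Invented rows standing in for covid_jpn_total.csv (not real figures).
total_rows = [
    {"Date": "2020-04-01", "Positive": 165},
    {"Date": "2020-04-02", "Positive": 190},
]


def sum_by_date(rows, column):
    """Sum one numeric column over all rows sharing the same date."""
    sums = defaultdict(int)
    for row in rows:
        sums[row["Date"]] += row[column]
    return dict(sums)


def find_mismatches(prefecture_rows, total_rows, column):
    """Map each date to (prefecture sum, national value) where they differ."""
    sums = sum_by_date(prefecture_rows, column)
    return {
        row["Date"]: (sums.get(row["Date"], 0), row[column])
        for row in total_rows
        if sums.get(row["Date"], 0) != row[column]
    }


mismatches = find_mismatches(prefecture_rows, total_rows, "Positive")
print(mismatches)
```

Any date that appears in the result is one where the prefecture-level sum should not be used as a substitute for the national figure.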
covid_jpn_metadata.csv
- Population (Total, Male, Female): 厚生労働省 厚生統計要覧(2017年度)第1-5表
- Area (Total, Habitable): Wikipedia 都道府県の面積一覧 (2015)
- Hospital_bed: Based on primary data from 厚生労働省 感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 第二種感染症指定医療機関の指定状況(平成31年4月1日現在), 厚生労働省 医療施設動態調査(令和2年1月末概数), and 厚生労働省 感染症指定医療機関について, and on secondary data from COVID-19 Japan 都道府県別 感染症病床数.
- Clinic_bed: Based on primary data from 医療施設動態調査(令和2年1月末概数).
- Location: Data is from LinkData 都道府県庁所在地 (Public Domain) (secondary data).
Admin
To create this dataset, data from the following sites was edited and transformed.
厚生労働省 Ministry of Health, Labour and Welfare, Japan:
厚生労働省 HP (in Japanese)
Ministry of Health, Labour and Welfare HP (in English)
厚生労働省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
国土交通省 Ministry of Land, Infrastructure, Transport and Tourism, Japan:
国土交通省 HP (in Japanese)
国土交通省 HP (in English)
国土交通省 HP 利用規約・リンク・著作権等 CC BY 4.0 (in Japanese)
Code for Japan / COVID-19 Japan:
Code for Japan COVID-19 Japan Dashboard (CC BY 4.0)
COVID-19 Japan 都道府県別 感染症病床数 (CC BY)
Wikipedia: Wikipedia
LinkData: LinkData (Public Domain)
Kindly cite this dataset under the CC BY 4.0 license as follows:
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, GitHub repository, https://github.com/lisphilar/covid19-sir/data/japan, or
- Hirokazu Takaya (2020-2022), COVID-19 dataset in Japan, Kaggle Dataset, https://www.kaggle.com/lisphilar/covid19-dataset-in-japan
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Baseline characteristics at first RTX infusion in all MS, RRMS and PMS patients.
http://opendatacommons.org/licenses/dbcl/1.0/
This data set consists of placement data for students at an XYZ campus. It includes secondary and higher secondary school percentages and specialisations. It also includes degree specialisation, degree type, work experience, and the salary offers made to placed students. We will analyse which factors play a major role in selecting a candidate for job recruitment.
This dataset was created by Pradyumna Reddy
https://creativecommons.org/publicdomain/zero/1.0/
A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.
This toy dataset features 150000 rows and 6 columns.
Note: All data is fictional. The data has been generated so that their distributions are convenient for statistical analysis.
Number: A simple index number for each row
City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
Gender: Gender of a person (Male or Female)
Age: The age of a person (Ranging from 25 to 65 years)
Income: Annual income of a person (Ranging from -674 to 177175)
Illness: Is the person Ill? (Yes or No)
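A first EDA step on a table with these six columns might be a per-city summary. The sketch below is a minimal illustration using only the standard library; the four sample rows are invented and only follow the column layout listed above, so the printed means say nothing about the actual dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample in the documented column layout (not rows from the dataset).
SAMPLE = """Number,City,Gender,Age,Income,Illness
1,Dallas,Male,41,40000,No
2,Dallas,Female,54,50000,No
3,Boston,Male,42,60000,Yes
4,Boston,Female,30,80000,No
"""


def mean_income_by_city(csv_text):
    """Return the mean of the Income column for each City value."""
    incomes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        incomes[row["City"]].append(float(row["Income"]))
    return {city: mean(values) for city, values in incomes.items()}


def illness_rate(csv_text):
    """Return the fraction of rows whose Illness column is 'Yes'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sum(row["Illness"] == "Yes" for row in rows) / len(rows)


city_means = mean_income_by_city(SAMPLE)
print(city_means)
print(illness_rate(SAMPLE))
```

For the full file, the same functions could be pointed at the file contents instead of `SAMPLE`, assuming the header names match the column list.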
Stock photo by Mika Baumeister on Unsplash.
https://creativecommons.org/publicdomain/zero/1.0/
Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the over 118,000 cases of the coronavirus illness in over 110 countries and territories around the world at the time.
This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:
- confirmed tested cases of Coronavirus infection
- the number of people who have reportedly died while sick with Coronavirus
- the number of people who have reportedly recovered from it
Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been doing a great public service from an early point by collating data from around the world.
We have cleaned and normalized that data, for example by tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data.
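To make the kind of normalization described here concrete, the sketch below reshapes a wide table (one column per date, headed in `M/D/YY` form as in the upstream time series files) into long rows with ISO dates. It is an illustration only, not the maintainers' actual pipeline; the sample rows, the output field names, and the function name `to_long` are assumptions.

```python
import csv
import io
from datetime import datetime

# Invented sample in the wide upstream layout: id columns, then one column per day.
WIDE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Exampleland,0.0,0.0,1,3
North,Otherland,1.0,1.0,0,2
"""

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def to_long(wide_text):
    """Turn one-column-per-date rows into one row per (place, date)."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        for column, value in record.items():
            if column in ID_COLUMNS:
                continue
            # Tidy the date: "1/22/20" becomes "2020-01-22".
            date = datetime.strptime(column, "%m/%d/%y").date().isoformat()
            rows.append({
                "Date": date,
                "Country": record["Country/Region"],
                "Province": record["Province/State"] or None,
                "Confirmed": int(value),
            })
    rows.sort(key=lambda r: (r["Country"], r["Province"] or "", r["Date"]))
    return rows


long_rows = to_long(WIDE_CSV)
for row in long_rows:
    print(row)
```

The long layout makes it straightforward to append the deaths and recoveries files as extra columns keyed on the same (country, province, date) triple.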
https://creativecommons.org/publicdomain/zero/1.0/
We all love movies! I remember watching my first movie with my family when I was 5 and 3 years later, I still love movies. But have you ever wondered how some people rate movies as good or bad, awesome or mehh! That's correct. Different people have different perspectives on how they like or dislike movies. To help us select from the plethora of movie options out there, the IMDB platform provides us with honest reviews by the people, for the people.
Long story short, this assignment will take you through different aspects of how a movie is reviewed by different people from across the globe based on their star cast, genre, story length and many more aspects.
So here is what you need to do! A few points:
1. Download the dataset and the dictionary that will help you learn the different columns in the dataset.
2. Start exploring the data by performing EDA (wiki what’s EDA, if you are a dummy like I was initially).
3. Get back to this notebook to check everything I did to explore the data, and then follow the subtasks and checkpoints!
Simple, isn’t it? Do complete the exercise and let me know in the comments if you found it helpful. There’s always scope for improvement. Tell me what more could have been added to this notebook! Hope you’ll have a good time exploring data.
This dataset was created by Muhammad sardar daud khan
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description:
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Objective:
Survival Prediction: To build a logistic regression model that accurately predicts the survival of passengers based on features such as age, gender, passenger class, and number of siblings/spouses aboard.
Data Cleaning and Preprocessing: To perform data cleaning by handling missing values, removing unnecessary columns, and encoding categorical variables to prepare the dataset for analysis.
Exploratory Data Analysis (EDA): To conduct a thorough exploratory data analysis to visualize survival rates and identify patterns based on various factors like gender, passenger class, and embarked location.
Feature Importance Analysis: To analyze the correlation between different features and their impact on survival rates, identifying which factors are the most significant predictors of survival.
Model Evaluation: To evaluate the performance of the logistic regression model using accuracy scores and classification reports, ensuring that the model generalizes well to unseen data.
ROC Curve Analysis: To create a ROC curve to assess the trade-off between the true positive rate and false positive rate, providing insights into the model's ability to distinguish between survivors and non-survivors.
Insights and Recommendations: To derive insights from the analysis that could inform future safety measures or policies related to passenger safety in maritime travel.
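The model evaluation and ROC curve objectives above can be made concrete with a small, dependency-free sketch. The labels and scores below are invented and are not the output of any Titanic model; in practice the scores would be the predicted survival probabilities from the fitted logistic regression, and a library routine would normally replace these hand-written helpers.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases where (score >= threshold) agrees with the label."""
    hits = sum((score >= threshold) == bool(label) for label, score in zip(labels, scores))
    return hits / len(labels)


def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold is swept from the highest score to the lowest; tied scores
    are handled together so that each distinct threshold yields one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented example: 1 = survived, 0 = did not survive.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
print(accuracy(labels, scores), auc(points))
```

An area close to 1 means the scores rank most survivors above most non-survivors, while 0.5 corresponds to random ranking; accuracy, by contrast, depends on the single threshold chosen.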
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A realistic, large-scale synthetic dataset of 10,000 students designed to analyze factors affecting college placements.
This dataset simulates the academic and professional profiles of 10,000 college students, focusing on factors that influence placement outcomes. It includes features like IQ, academic performance, CGPA, internships, communication skills, and more.
The dataset is ideal for:
Column Name | Description |
---|---|
College_ID | Unique ID of the college (e.g., CLG0001 to CLG0100) |
IQ | Student’s IQ score (normally distributed around 100) |
Prev_Sem_Result | GPA from the previous semester (range: 5.0 to 10.0) |
CGPA | Cumulative Grade Point Average (range: ~5.0 to 10.0) |
Academic_Performance | Annual academic rating (scale: 1 to 10) |
Internship_Experience | Whether the student has completed any internship (Yes/No) |
Extra_Curricular_Score | Involvement in extracurriculars (score from 0 to 10) |
Communication_Skills | Soft skill rating (scale: 1 to 10) |
Projects_Completed | Number of academic/technical projects completed (0 to 5) |
Placement | Final placement result (Yes = Placed, No = Not Placed) |
This dataset was generated to resemble real-world data in academic institutions for research and machine learning use. While it is synthetic, the variables and relationships are crafted to mimic authentic trends observed in student placements.
MIT
Created using Python (NumPy, Pandas) with data logic designed for educational and ML experimentation purposes.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains information about various restaurants, including their ratings, cuisine types, pricing, and availability of services like online ordering and table booking.
Column Name | Description |
---|---|
restaurant name | Name of the restaurant. |
restaurant type | Type of restaurant (e.g., Quick Bites, Cafe, Casual Dining). |
rate (out of 5) | Average rating of the restaurant (out of 5). |
num of ratings | Number of people who have rated the restaurant. |
avg cost (two people) | Average cost for two people in local currency. |
online_order | Whether online ordering is available (Yes/No). |
table booking | Whether table booking is available (Yes/No). |
cuisines type | Types of cuisines served at the restaurant (e.g., Fast Food, Chinese, BBQ). |
area | Location area of the restaurant. |
local address | Specific address of the restaurant. |
Note: This dataset may contain missing values or inconsistencies that require preprocessing before analysis.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by yvonne gatwiri
Released under Apache 2.0