48 datasets found

AI Impact on Job Market: (2024–2030)
kaggle.com
Updated Jun 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Islam007 (2025). AI Impact on Job Market: (2024–2030) [Dataset]. https://www.kaggle.com/datasets/sahilislam007/ai-impact-on-job-market-20242030
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 28, 2025
Dataset provided by
Kaggle
Authors
Sahil Islam007
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
📂 Dataset Title:

AI Impact on Job Market: Increasing vs Decreasing Jobs (2024–2030)

📝 Dataset Description:

This dataset explores how Artificial Intelligence (AI) is transforming the global job market. With a focus on identifying which jobs are increasing or decreasing due to AI adoption, this dataset provides insights into job trends, automation risks, education requirements, gender diversity, and other workforce-related factors across industries and countries.

The dataset contains 30,000 rows and 13 valuable columns, generated to reflect realistic labor market patterns based on ongoing research and public data insights. It can be used for data analysis, predictive modeling, AI policy planning, job recommendation systems, and economic forecasting.

📊 Columns Description:

Column Name Description

Job Title Name of the job/role (e.g., Data Analyst, Cashier, etc.) Industry Industry sector in which the job is categorized (e.g., IT, Healthcare, Manufacturing) Job Status Indicates whether the job is Increasing or Decreasing due to AI adoption AI Impact Level Estimated level of AI impact on the job: Low, Moderate, or High Median Salary (USD) Median annual salary for the job in USD Required Education Typical minimum education level required for the job Experience Required (Years) Average number of years of experience required Job Openings (2024) Number of current job openings in 2024 Projected Openings (2030) Projected job openings by the year 2030 Remote Work Ratio (%) Estimated percentage of jobs that can be done remotely Automation Risk (%) Probability of the job being automated or replaced by AI Location Country where the job data is based (e.g., USA, India, UK, etc.) Gender Diversity (%) Approximate percentage representation of non-male genders in the job

🔍 Potential Use Cases:

Predict which jobs are most at risk due to automation.

Compare AI impact across industries and countries.

Build dashboards on workforce diversity and trends.

Forecast job market shifts by 2030.

Train ML models to predict job growth or decline.

📚 Source:

This is a synthetic dataset generated using realistic modeling, public job data patterns (U.S. BLS, OECD, McKinsey, WEF reports), and AI simulation to reflect plausible scenarios from 2024 to 2030. Ideal for educational, research, and AI project purposes.

📌 License: MIT
PASTA Data
kaggle.com
Updated Dec 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google Research (2024). PASTA Data [Dataset]. https://www.kaggle.com/datasets/googleai/pasta-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 10, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Google Research
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains human rater trajectories used in paper: "Preference Adaptive and Sequential Text-to-Image Generation".

We use human raters to gather sequential user preferences data for personalized T2I generation. Participants are tasked with interacting with an LMM agent for five turns. Throughout our rater study we use a Gemini 1.5 Flash Model as our base LMM, which acts as an agent. At each turn, the system presents 16 images, arranged in four columns, each representing a different prompt expansion derived from the user's initial prompt and prior interactions. Raters are shown only the generated images, not the prompt expansions themselves.

At session start, raters are instructed to provide an initial prompt of at most 12 words, encapsulating a specific visual concept. They are encouraged to provide descriptive prompts that avoid generic terms (e.g., "an ancient Egyptian temple with hieroglyphs" 'instead of "a temple"). At each turn, raters then select the column of images preferred most; they are instructed to select a column based on the quality of the best image in that column w.r.t. their original intent. Raters may optionally provide a free-text critique (up to 12 words) to guide subsequent prompt expansions, though most raters did not use this facility.

See our paper for a comprehensive description of the rater study.

Citation

Please cite our paper if you use it in your work.
R
Custom Yolov7 On Kaggle On Custom Dataset
universe.roboflow.com
zip
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
Explore at:
zipAvailable download formats
Dataset updated
Jan 29, 2023
Dataset authored and provided by
Owais Ahmad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Person Car Bounding Boxes
Description
Custom Training with YOLOv7 🔥

Some Important links

Model Inference🤖

🚀Training Yolov7 on Kaggle

Weight and Biases 🐝

HuggingFace 🤗 Model Repo

Contact Information

Name - Owais Ahmad

Phone - +91-9515884381

Email - owaiskhan9654@gmail.com

Portfolio - https://owaiskhan9654.github.io/

Objective

To Showcase custom Object Detection on the Given Dataset to train and Infer the Model using newly launched YoloV7.

Data Acquisition

The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

Link to the Downloadable Dataset

from IPython.display import Markdown, display display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))

Custom Training with YOLOv7 🔥

In this Notebook, I have processed the images with RoboFlow because in COCO formatted dataset was having different dimensions of image and Also data set was not splitted into different Format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

Export the dataset to YOLOv7

Train YOLOv7 to recognize the objects in our dataset

Evaluate our YOLOv7 model's performance

Run test inference to view performance of YOLOv7 model at work

📦 YOLOv7

https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG" width=800>

Image Credit - jinfagang

Step 1: Install Requirements

!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements %cd yolov7 !pip install -qr requirements.txt !pip install -q roboflow

Downloading YOLOV7 starting checkpoint

!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"

import os import glob import wandb import torch from roboflow import Roboflow from kaggle_secrets import UserSecretsClient from IPython.display import Image, clear_output, display # to display images print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

YOLOv7-Car-Person-Custom

try: user_secrets = UserSecretsClient() wandb_api_key = user_secrets.get_secret("wandb_api") wandb.login(key=wandb_api_key) anonymous = None except: wandb.login(anonymous='must') print('To use your W&B account, Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. Get your W&B access token from here: https://wandb.ai/authorize') wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")

Step 2: Assemble Our Dataset

https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png" alt="">

In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

In Roboflow, We can choose between two paths:

Convert an existing Coco dataset to YOLOv7 format. In Roboflow it supports over 30 formats object detection formats for conversion.

Uploading only these raw images and annotate them in Roboflow with Roboflow Annotate.

Version v2 Aug 12, 2022 Looks like this.

https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG" alt="">

user_secrets = UserSecretsClient() roboflow_api_key = user_secrets.get_secret("roboflow_api")

rf = Roboflow(api_key=roboflow_api_key) project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq") dataset = project.version(2).download("yolov7")

Step 3: Training Custom pretrained YOLOv7 model

Here, I am able to pass a number of arguments: - img: define input image size - batch: determine
Employment Of India CLeaned and Messy Data
kaggle.com
Updated Apr 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SONIA SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
SONIA SHINDE
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Area covered
India
Description
This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

🔹 Dataset Composition:

It includes two parallel datasets: 1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries. 2. Cleaned Dataset – This version demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

Each record captures multiple attributes related to individuals in the Indian job market, including: - Age Group
- Employment Status (Employed/Unemployed)
- Monthly Salary (INR)
- Education Level
- Industry Sector
- Years of Experience
- Location
- Perceived AI Risk
- Date of Data Recording

Transformations & Cleaning Applied:

The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form: - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques. - Duplicate Records: Identified using row comparison and removed to prevent analytical skew. - Inconsistent Formatting: Unified inconsistent naming in columns (like 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing. - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis. - Outliers: Detected and handled based on domain logic and distribution analysis. - Categorization: Converted numeric ages into grouped age categories for comparative analysis. - Standardization: Uniform labels for employment status, industry names, education, and AI risk levels were applied for visualization clarity.

Purpose & Utility:

This dataset is ideal for learners and professionals who want to understand: - The impact of messy data on visualization and insights - How transformation steps can dramatically improve data interpretation - Practical examples of preprocessing techniques before feeding into ML models or BI tools

It's also useful for: - Training ML models with clean inputs
- Data storytelling with visual clarity
- Demonstrating reproducibility in data cleaning pipelines

By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.
A
‘Bike Sharing in Washington D.C. Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Bike Sharing in Washington D.C. Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-bike-sharing-in-washington-d-c-dataset-78a6/a61f230e/?iid=047-641&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Washington
Description
Analysis of ‘Bike Sharing in Washington D.C. Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/marklvl/bike-sharing-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back to another position. Currently, there are about over 500 bike-sharing programs around the world which are composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real-world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system in Washington, DC with the corresponding weather and seasonal information.

Content

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

instant: Record index

dteday: Date

season: Season (1:springer, 2:summer, 3:fall, 4:winter)

yr: Year (0: 2011, 1:2012)

mnth: Month (1 to 12)

hr: Hour (0 to 23)

holiday: weather day is holiday or not (extracted from Holiday Schedule)

weekday: Day of the week

workingday: If day is neither weekend nor holiday is 1, otherwise is 0.

weathersit: (extracted from Freemeteo)

1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

temp: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

hum: Normalized humidity. The values are divided to 100 (max)

windspeed: Normalized wind speed. The values are divided to 67 (max)

casual: count of casual users

registered: count of registered users

cnt: count of total rental bikes including both casual and registered

Acknowledgements

Hadi Fanaee-T Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of Porto INESC Porto, Campus da FEUP Rua Dr. Roberto Frias, 378 4200 - 465 Porto, Portugal

Original Source: http://capitalbikeshare.com/system-data

Weather Information: http://www.freemeteo.com

Holiday Schedule: http://dchr.dc.gov/page/holiday-schedule

--- Original source retains full ownership of the source dataset ---
A
‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 13, 2014
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2014). ‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-bike-sharing-dataset-20d4/6ac341fa/?iid=032-878&v=presentation
Explore at:
Dataset updated
Jan 13, 2014
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘ 🚴 Bike Sharing Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/bike-sharing-datasete on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Source:

Hadi Fanaee-T
Laboratory of Artificial Intelligence and Decision Support (LIAAD), University of PortoINESC Porto, Campus da FEUPRua Dr. Roberto Frias, 3784200 - 465 Porto, Portugal
Original Source:
http://capitalbikeshare.com/system-data
Weather Information:
http://www.freemeteo.com
Holiday Schedule:
http://dchr.dc.gov/page/holiday-schedule

Data Set Information:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv
- instant: record index - dteday : date - season : season (1:springer, 2:summer, 3:fall, 4:winter) - yr : year (0: 2011, 1:2012) - mnth : month ( 1 to 12) - hr : hour (0 to 23) - holiday : weather day is holiday or not (extracted from ) - weekday : day of the week - workingday : if day is neither weekend nor holiday is 1, otherwise is 0. + weathersit :
- 1: Clear, Few clouds, Partly cloudy, Partly cloudy - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog - temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale) - atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale) - hum: Normalized humidity. The values are divided to 100 (max) - windspeed: Normalized wind speed. The values are divided to 67 (max) - casual: count of casual users - registered: count of registered users - cnt: count of total rental bikes including both casual and registered

Relevant Papers:

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .

Citation Request:

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, .
@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={ }, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, pages={1-15}}

Source: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

This dataset was created by UCI and contains around 20000 samples along with Dteday, Windspeed, technical information and other features such as: - Registered - Cnt - and more.

How to use this dataset

Analyze Weekday in relation to Casual

Study the influence of Season on Holiday

More datasets

Acknowledgements

If you use this dataset in your research, please credit UCI

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
A
‘Are Your Employees Burning Out?’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Are Your Employees Burning Out?’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-are-your-employees-burning-out-166e/544e528d/?iid=010-795&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Are Your Employees Burning Out?’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/blurredmachine/are-your-employees-burning-out on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

Understanding what will be the Burn Rate for the employee working in an organization based on the current pandemic situation where work from home is a boon and a bane. How are employees' Burn Rate affected based on various conditions provided?

Content

Globally, World Mental Health Day is celebrated on October 10 each year. The objective of this day is to raise awareness about mental health issues around the world and mobilize efforts in support of mental health. According to an anonymous survey, about 450 million people live with mental disorders that can be one of the primary causes of poor health and disability worldwide. These days when the world is suffering from a pandemic situation, it becomes really hard to maintain mental fitness.

Employee ID: The unique ID allocated for each employee (example: fffe390032003000)

Date of Joining: The date-time when the employee has joined the organization (example: 2008-12-30)

Gender: The gender of the employee (Male/Female)

Company Type: The type of company where the employee is working (Service/Product)

WFH Setup Available: Is the work from home facility available for the employee (Yes/No)

Designation: The designation of the employee of work in the organization.

In the range of [0.0, 5.0] bigger is higher designation.

Resource Allocation: The amount of resource allocated to the employee to work, ie. number of working hours.

In the range of [1.0, 10.0] (higher means more resource)

Mental Fatigue Score: The level of fatigue mentally the employee is facing.

In the range of [0.0, 10.0] where 0.0 means no fatigue and 10.0 means completely fatigue.

Burn Rate: The value we need to predict for each employee telling the rate of Bur out while working.

In the range of [0.0, 1.0] where the higher the value is more is the burn out.

Acknowledgements

A special thanks to the HackerEarth Competition on the topic "HackerEarth Machine Learning Challenge: Are your employees burning out? which can be accessed here for this data collection.

Inspiration

Try to build some really amazing predictions keeping in mind that happy and healthy employees are indisputably more productive at work, and in turn, help the business flourish profoundly.

--- Original source retains full ownership of the source dataset ---
A
‘Bike Sharing Dataset’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 4, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Bike Sharing Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-bike-sharing-dataset-ae2a/782b6811/?iid=066-429&v=presentation
Explore at:
Dataset updated
Aug 4, 2020
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Bike Sharing Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/lakshmi25npathi/bike-sharing-dataset on 12 November 2021.

--- Dataset description provided by original source is as follows ---

This dataset contains the hourly and daily count of rental bikes between the years 2011 and 2012 in the Capital bike share system with the corresponding weather and seasonal information.

Data Set Information:

Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.

Apart from interesting real world applications of bike sharing systems, the characteristics of data being generated by these systems make them attractive for the research. Opposed to other transport services such as bus or subway, the duration of travel, departure and arrival position is explicitly recorded in these systems. This feature turns bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of important events in the city could be detected via monitoring these data.

Attribute Information:

Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv

instant: record index

dteday : date

season : season (1:springer, 2:summer, 3:fall, 4:winter)

yr : year (0: 2011, 1:2012)

mnth : month ( 1 to 12)

hr : hour (0 to 23)

holiday : weather day is holiday or not (extracted from [Web Link])

weekday : day of the week

workingday : if day is neither weekend nor holiday is 1, otherwise is 0.

weathersit :

1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)

atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)

hum: Normalized humidity. The values are divided to 100 (max)

windspeed: Normalized wind speed. The values are divided to 67 (max)

casual: count of casual users

registered: count of registered users

cnt: count of total rental bikes including both casual and registered

for further more information please go through the following link, http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

--- Original source retains full ownership of the source dataset ---
A
‘Detroit Daily Temperatures with Artificial Warming’ analyzed by Analyst-2
analyst-2.ai
Updated Oct 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘Detroit Daily Temperatures with Artificial Warming’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-detroit-daily-temperatures-with-artificial-warming-c8ae/6a66bd3d/?iid=000-953&v=presentation
Explore at:
Dataset updated
Oct 5, 2019
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Detroit
Description
Analysis of ‘Detroit Daily Temperatures with Artificial Warming’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/agajorte/detroit-daily-temperatures-with-artificial-warming on 14 February 2022.

--- Dataset description provided by original source is as follows ---

Context

Who among us doesn't talk a little about the weather now and then? Will it rain tomorrow and get so cold to shake your chin or will it make that cracking sun? Does global warming exist?

With this dataset, you can apply machine learning tools to predict the average temperature of Detroit city based on historical data collected over 5 years.

Content

The given data set was produced from the Historical Hourly Weather Data [https://www.kaggle.com/selfishgene/historical-hourly-weather-data], which consists of about 5 years of hourly measurements of various weather attributes (eg. temperature, humidity, air pressure) from 30 US and Canadian cities.

From this rich database, a cutout was made by selecting only the city of Detroit (USA), highlighting only the temperature, converting it to Celsius degrees and keeping only one value for each date (corresponding to the average daytime temperature - from 9am to 5pm).

In addition, temperature values were artificially and gradually increased by a few Celsius degrees over the available period. This will simulate a small global warming (or is it local?)...

In summary, the available dataset contains the average daily temperatures (collected during the day), artificially increased by a certain value, for the city of Detroit from October 2012 to November 2017.

The purpose of this dataset is to apply forecasting models in order to predict the value of the artificially warmed average daily temperature of Detroit.

See graph in the following image: black dots refer to the actual data and the blue line represents the predictive model (including a confidence area).

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3089313%2Faf9614514242dfb6164a08c013bf6e35%2Fplot-ts2.png?generation=1567827710930876&alt=media" alt="">

Acknowledgements

This dataset wouldn't be possible without the previous work in Historical Hourly Weather Data.

Inspiration

What are the best forecasting models to address this particular problem? TBATS, ARIMA, Prophet? You tell me!

--- Original source retains full ownership of the source dataset ---
30 Short Tips for Your Data Scientist Interview
kaggle.com
Updated Oct 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 12, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Skillslash17
Description
If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy, it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However these roles come with a high level of expectation, requiring applicants to possess a comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

Technical Preparation

Qualifying for a job as a data scientist needs a comprehensive level of technical preparation. Job seekers are often required to demonstrate their technical skills in order to show their ability to effectively fulfill the duties of the role. Here are a selection of key tips for technical proficiency:

1 Master the Basics

Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

2 Understand Machine Learning

Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

3 Data Manipulation

Make sure you're good with data tools like Pandas and Matplotlib, as well as data visualization tools like Seaborn.

4 SQL Skills

Gain proficiency in the use of SQL language to extract and process data from databases.

5 Feature Engineering

Understand and know the importance of feature engineering and how to create meaningful features from raw data.

6 Model Evaluation

Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.

7 Big Data Technologies

If the job requires it, become familiar with big data technologies like Hadoop and Spark.

8 Coding Challenges

Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.

Portfolio and Projects

9 Build a Portfolio

Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

10 Kaggle Competitions

Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

11 Open Source Contributions

Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

12 GitHub Profile

Maintain a well-organized GitHub profile with clean code and clear project documentation.

Domain Knowledge

13 Understand the Industry

Research the industry you’re applying to and understand its specific data challenges and opportunities.

14 Company Research

Study the company you’re interviewing with to tailor your responses and show your genuine interest.

Soft Skills

15 Communication

Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

16 Problem-Solving

Focus on your problem-solving abilities and how you approach complex challenges.

17 Adaptability

Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

Interview Etiquette

18 Professional Appearance

Dress and present yourself in a professional manner, whether the interview is in person or remote.

19 Punctuality

Be on time for the interview, whether it’s virtual or in person.

20 Body Language

Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

21 Active Listening

Pay close attention to the interviewer's questions and answer them directly.

Behavioral Questions

22 STAR Method

Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

23 Conflict Resolution

Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

24 Teamwork

Highlight instances where you’ve worked effectively in cross-functional teams...
A
‘Ratings of the most popular anime’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Ratings of the most popular anime’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-ratings-of-the-most-popular-anime-62c6/3e12c26a/
Explore at:
Dataset updated
Feb 14, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Ratings of the most popular anime’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/michau96/ratings-of-the-most-popular-anime on 14 February 2022.

--- Dataset description provided by original source is as follows ---

Context

myanimelist.net is the most popular site where fans of Japanese animated films and series share their opinions on various productions. The portal works in a similar way to IMDb and allows users to rate various positions and then create various types of rankings based on them. In this database you will find a distribution of votes on a scale from 1 to 10 for the 100 most popular (most often cast) anime votes from a specific point in time.

Content

The data was obtained using webscraping. The Python language with the "BeautifulSoup", "requests", "re", "pandas" and "numpy" packages was used for this process and "SelectorGadet" add-on, which made the work with the site easier. For each movie or series, we have 10 lines in turn with each rating and number of votes assigned to it, and some other information related to the series / movie.

Inspiration

The data can be used to check the distribution of ratings between individual series / movies. We can check whether the final average results from a large, relatively high score or maybe from the bi-larity of the distribution. Dana can be an addition to other, often already old, datasets for subordinate topics (e.g. Anime recommendations database and Anime dataset).

Features

Votes (Number of votes)

Score (Score: 10 - the best, 1 - the worst)

English_name (Title in english)

Favourites_count (Number of users who add series to favourites)

Popularity_ranking (Position in popularity ranking - the lowest the more users votes)

Photo by Dex Ezekiel on Unsplash

--- Original source retains full ownership of the source dataset ---
Ask A Manager 2023 Salary Survey
kaggle.com
Updated Feb 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Lexie DeGrandchamp (2024). Ask A Manager 2023 Salary Survey [Dataset]. https://www.kaggle.com/datasets/lexiedegrandchamp/ask-a-manager-2023-salary-survey
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 11, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Lexie DeGrandchamp
Description
Popular US workplace blog AskAManager (askamanager.org) sponsors an annual salary survey of blog readers. The 2023 survey collected data about industry, job function, title, annual salary, additional compensation, race, gender, remote/on-site requirements, education, location, and years' experience.

The dataset here features responses collected between April 11 and 28, 2023, and has some 16,000 responses. This version of the data set has employed several feature engineering techniques to group and cleanse data, convert the currency to USD values as of April 1, 2023, and add clarity to location data. In particular, US respondents were paired when possible with a metropolitan area.
A
‘Spotify Recommendation’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Spotify Recommendation’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-spotify-recommendation-3903/3a5b5131/?iid=006-678&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Spotify Recommendation’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bricevergnou/spotify-recommendation on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Spotify Recommandation

( You can check how I used this dataset on my github repository )

I am basically a HUGE fan of music ( mostly French rap though with some exceptions but I love music ). And someday , while browsing stuff on Internet , I found the Spotify's API . I knew I had to use it when I found out you could get information like danceability about your favorite songs just with their id's.

https://user-images.githubusercontent.com/86613710/127216769-745ac143-7456-4464-bbe3-adc53872c133.png" alt="image">

Once I saw that , my machine learning instincts forced me to work on this project.

1. Data Collection

1.1 Playlist creation

I collected 100 liked songs and 95 disliked songs

For those I like , I made a playlist of my favorite 100 songs. It is mainly French Rap , sometimes American rap , rock or electro music.

For those I dislike , I collected songs from various kind of music so the model will have a broader view of what I don't like

There is : - 25 metal songs ( Cannibal Corps ) - 20 " I don't like " rap songs ( PNL ) - 25 classical songs - 25 Disco songs

I didn't include any Pop song because I'm kinda neutral about it

1.2 Getting the ID's

From the Spotify's API "Get a playlist's Items" , I turned the playlists into json formatted data which cointains the ID and the name of each track ( ids/yes.py and ids/no.py ). NB : on the website , specify "items(track(id,name))" in the fields format , to avoid being overwhelmed by useless data.

With a script ( ids/ids_to_data.py ) , I turned the json data into a long string with each ID separated with a comma.

1.3 Getting the statistics

Now I just had to enter the strings into the Spotify API "Get Audio Features from several tracks" and get my data files ( data/good.json and data/dislike.json )

2. Data features

From Spotify's API documentation :

acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

duration_ms : The duration of the track in milliseconds.

energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

key : The key the track is in. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.

liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.

loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.

mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.

tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

And the variable that has to be predicted :

liked : 1 for liked songs , 0 for disliked songs

--- Original source retains full ownership of the source dataset ---
A
‘German Credit Data’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘German Credit Data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-german-credit-data-2158/latest
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘German Credit Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/varunchawla30/german-credit-data on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

The original dataset contains 1000 entries with 20 categorial/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit by a bank. Each person is classified as good or bad credit risks according to the set of attributes. The link to the original dataset can be found below.

Content

It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. The column names were also given in German originally. So, they are replaced by English names while processing. The attributes and their details in English are given below:

Status - Categorical (Ordinal)

Duration - Numerical

Credit History - Categorical (Nominal)

Purpose - Categorical (Nominal)

Amount - Numerical

Savings - Categorical (Ordinal)

Employment Duration - Categorical (Ordinal)

Installment Rate - Categorical (Ordinal)

Personal Status Sex - Categorical (Nominal)

Other Debtors - Categorical (Nominal)

Present Residence - Categorical (Ordinal)

Property - Categorical (Nominal)

Age - Numerical

Other Installment Plans - Categorical (Nominal)

Housing - Categorical (Nominal)

Number Credits - Categorical (Ordinal)

Job - Categorical (Nominal)

People Liable - Categorical (Ordinal)

Telephone - Categorical (Nominal)

Foreign Worker - Categorical (Nominal)

Credit Risk - Binary Target Variable

Acknowledgements

Source : UCI

--- Original source retains full ownership of the source dataset ---
Cycling Case Study
kaggle.com
Updated Mar 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
NichAnderson (2024). Cycling Case Study [Dataset]. https://www.kaggle.com/datasets/nichanderson/cycling-case-study-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 11, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
NichAnderson
Description
This data is based on the divvy cycling data for 2023, see https://divvy-tripdata.s3.amazonaws.com/index.html for the raw data, which we've compiled into a single dataset and and cleaned. The data contains information on over 5.5 million bike-share trips, including the following information on each trip: trip starting and ending time, trip start and end location, and the membership status of the rider. This analysis is my own project work and is not promoted by or affiliated with Divvy itself. This data is reproduced under the following license https://divvybikes.com/data-license-agreement.
Supermarket Groceries Image Dataset
kaggle.com
Updated Mar 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ash_knight (2023). Supermarket Groceries Image Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/5115598
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/5115598
Dataset updated
Mar 6, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ash_knight
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
So my professor wanted me to do something unique instead of using datasets already available on internet. We planned on doing a retail product recognition and classification and to collect data, we noted down all the big super markets in my city. But my team mates chickened out at the last minute and I had to get the job done. DMart wasn't kind to me when I asked for permission to take pictures of their products(they literally laughed as soon as I turned my back towards exit...how mortifying!) so I went to Modern super market(I'm not kidding, that's the name) and captured all these images with my Mi A2 phone with a lens glass that is broken after I dropped it perhaps fifty times or is it hundred?

And the cherry on the top is that we didn't even use this dataset for our project.
JOB-A-THON- Analytics Vidhya- Health Insurance
kaggle.com
Updated Feb 26, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sajid Hussain (2021). JOB-A-THON- Analytics Vidhya- Health Insurance [Dataset]. https://www.kaggle.com/sajidhussain3/jobathon-analytics-vidhya-health-insurance/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 26, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sajid Hussain
Description
Health Insurance Lead Prediction: Your Client FinMan is a financial services company that provides various financial services like loan, investment funds, insurance etc. to its customers. FinMan wishes to cross-sell health insurance to the existing customers who may or may not hold insurance policies with the company. The company recommend health insurance to it' customers based on their profile once these customers land on the website. Customers might browse the recommended health insurance policy and consequently fill up a form to apply. When these customers fill-up the form, their Response towards the policy is considered positive, and they are classified as a lead.

Once these leads are acquired, the sales advisors approach them to convert and thus the company can sell proposed health insurance to these leads in a more efficient manner.

Now the company needs your help in building a model to predict whether the person will be interested in their proposed Health plan/policy given the information about:

Demographics (city, age, region etc.) Information regarding holding policies of the customer Recommended Policy Information

Evaluation: The evaluation metric for this competition is roc_auc_score across all entries in the test set.

Credits: This is dataset is released as a part of hackathon conducted by Analytics vidhya Visit this link for more information:-https://datahack.analyticsvidhya.com/contest/job-a-thon/#ProblemStatement
case study 1 bike share
kaggle.com
Updated Oct 8, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
mohamed osama (2022). case study 1 bike share [Dataset]. https://www.kaggle.com/ososmm/case-study-1-bike-share/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 8, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
mohamed osama
Description
Cyclistic: Google Data Analytics Capstone Project

Cyclistic - Google Data Analytics Certification Capstone Project Moirangthem Arup Singh How Does a Bike-Share Navigate Speedy Success? Background: This project is for the Google Data Analytics Certification capstone project. I am wearing the hat of a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore,my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations. This project will be completed by using the 6 Data Analytics stages: Ask: Identify the business task and determine the key stakeholders. Prepare: Collect the data, identify how it’s organized, determine the credibility of the data. Process: Select the tool for data cleaning, check for errors and document the cleaning process. Analyze: Organize and format the data, aggregate the data so that it’s useful, perform calculations and identify trends and relationships. Share: Use design thinking principles and data-driven storytelling approach, present the findings with effective visualization. Ensure the analysis has answered the business task. Act: Share the final conclusion and the recommendations. Ask: Business Task: Recommend marketing strategies aimed at converting casual riders into annual members by better understanding how annual members and casual riders use Cyclistic bikes differently. Stakeholders: Lily Moreno: The director of marketing and my manager. Cyclistic executive team: A detail-oriented executive team who will decide whether to approve the recommended marketing program. Cyclistic marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Cyclistic’s marketing strategy. Prepare: For this project, I will use the public data of Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under the license. I downloaded the ZIP files containing the csv files from the above link but while uploading the files in kaggle (as I am using kaggle notebook), it gave me a warning that the dataset is already available in kaggle. So I will be using the dataset cyclictic-bike-share dataset from kaggle. The dataset has 13 csv files from April 2020 to April 2021. For the purpose of my analysis I will use the csv files from April 2020 to March 2021. The source csv files are in Kaggle so I can rely on it's integrity. I am using Microsoft Excel to get a glimpse of the data. There is one csv file for each month and has information about the bike ride which contain details of the ride id, rideable type, start and end time, start and end station, latitude and longitude of the start and end stations. Process: I will use R as language in kaggle to import the dataset to check how it’s organized, whether all the columns have appropriate data type, find outliers and if any of these data have sampling bias. I will be using below R libraries

Load the tidyverse, lubridate, ggplot2, sqldf and psych libraries

library(tidyverse) library(lubridate) library(ggplot2) library(plotrix) ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5 ✔ purrr 0.3.4 ✔ tibble 3.1.4 ✔ dplyr 1.0.7 ✔ tidyr 1.1.3 ✔ stringr 1.4.0 ✔ readr 2.0.1 ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

date, intersect, setdiff, union

Set the working directory

setwd("/kaggle/input/cyclistic-bike-share")

Import the csv files

r_202004 <- read.csv("202004-divvy-tripdata.csv") r_202005 <- read.csv("20...
Health Insurance Lead Prediction
kaggle.com
zip
Updated Mar 2, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sathishkumar (2021). Health Insurance Lead Prediction [Dataset]. https://www.kaggle.com/klmsathishkumar/health-insurance-lead-prediction
Explore at:
zip(1177806 bytes)Available download formats
Dataset updated
Mar 2, 2021
Authors
Sathishkumar
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Context Your Client FinMan is a financial services company that provides various financial services like loan, investment funds, insurance etc. to its customers. FinMan wishes to cross-sell health insurance to the existing customers who may or may not hold insurance policies with the company. The company recommend health insurance to it's customers based on their profile once these customers land on the website. Customers might browse the recommended health insurance policy and consequently fill up a form to apply. When these customers fill-up the form, their Response towards the policy is considered positive and they are classified as a lead.

Once these leads are acquired, the sales advisors approach them to convert and thus the company can sell proposed health insurance to these leads in a more efficient manner.

Content Demographics (city, age, region etc.) Information regarding holding policies of the customer Recommended Policy Information

Acknowledgements This is dataset is released as part of a hackathon conducted by Analytics Vidhya. Visit https://datahack.analyticsvidhya.com/contest/job-a-thon/#ProblemStatement for more information.
Named Entity Recognition (NER) Corpus
kaggle.com
Updated Jan 14, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 14, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Naser Al-qaydeh
License
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Description
Task

Named Entity Recognition(NER) is a task of categorizing the entities in a text into categories like names of persons, locations, organizations, etc.

Dataset

Each row in the CSV file is a complete sentence, list of POS tags for each word in the sentence, and list of NER tags for each word in the sentence

You can use Pandas Dataframe to read and manipulate this dataset.

Since each row in the CSV file contain lists, if we read the file with pandas.read_csv() and try to get tag lists by indexing the list will be a string. ```

data['tag'][0] "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']" type(data['tag'][0]) string You can use the following to convert it back to list type: from ast import literal_eval literal_eval(data['tag'][0] ) ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O'] type(literal_eval(data['tag'][0] )) list ```

Acknowledgements

This dataset is taken from Annotated Corpus for Named Entity Recognition by Abhinav Walia dataset and then processed.

Annotated Corpus for Named Entity Recognition is annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set.

Essential info about entities:

geo = Geographical Entity

org = Organization

per = Person

gpe = Geopolitical Entity

tim = Time indicator

art = Artifact

eve = Event

nat = Natural Phenomenon

Facebook

Twitter

Click to copy link

Link copied

Cite

Sahil Islam007 (2025). AI Impact on Job Market: (2024–2030) [Dataset]. https://www.kaggle.com/datasets/sahilislam007/ai-impact-on-job-market-20242030

AI Impact on Job Market: (2024–2030)

AI Impact on Job Market: Increasing vs Decreasing Jobs (2024–2030)

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jun 28, 2025

Dataset provided by

Kaggle

Authors

Sahil Islam007

License

MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically

Description

📂 Dataset Title:

AI Impact on Job Market: Increasing vs Decreasing Jobs (2024–2030)

📝 Dataset Description:

This dataset explores how Artificial Intelligence (AI) is transforming the global job market. With a focus on identifying which jobs are increasing or decreasing due to AI adoption, this dataset provides insights into job trends, automation risks, education requirements, gender diversity, and other workforce-related factors across industries and countries.

The dataset contains 30,000 rows and 13 valuable columns, generated to reflect realistic labor market patterns based on ongoing research and public data insights. It can be used for data analysis, predictive modeling, AI policy planning, job recommendation systems, and economic forecasting.

📊 Columns Description:

Column Name Description

Job Title Name of the job/role (e.g., Data Analyst, Cashier, etc.) Industry Industry sector in which the job is categorized (e.g., IT, Healthcare, Manufacturing) Job Status Indicates whether the job is Increasing or Decreasing due to AI adoption AI Impact Level Estimated level of AI impact on the job: Low, Moderate, or High Median Salary (USD) Median annual salary for the job in USD Required Education Typical minimum education level required for the job Experience Required (Years) Average number of years of experience required Job Openings (2024) Number of current job openings in 2024 Projected Openings (2030) Projected job openings by the year 2030 Remote Work Ratio (%) Estimated percentage of jobs that can be done remotely Automation Risk (%) Probability of the job being automated or replaced by AI Location Country where the job data is based (e.g., USA, India, UK, etc.) Gender Diversity (%) Approximate percentage representation of non-male genders in the job

🔍 Potential Use Cases:

Predict which jobs are most at risk due to automation.

Compare AI impact across industries and countries.

Build dashboards on workforce diversity and trends.

Forecast job market shifts by 2030.

Train ML models to predict job growth or decline.

📚 Source:

This is a synthetic dataset generated using realistic modeling, public job data patterns (U.S. BLS, OECD, McKinsey, WEF reports), and AI simulation to reflect plausible scenarios from 2024 to 2030. Ideal for educational, research, and AI project purposes.

📌 License: MIT

Clear search

Close search

Google apps

Main menu

AI Impact on Job Market: (2024–2030)

PASTA Data

Citation

Custom Yolov7 On Kaggle On Custom Dataset

Custom Training with YOLOv7 🔥

Some Important links

Contact Information

Objective

To Showcase custom Object Detection on the Given Dataset to train and Infer the Model using newly launched YoloV7.

Data Acquisition

Custom Training with YOLOv7 🔥

📦 YOLOv7

Step 1: Install Requirements

Downloading YOLOV7 starting checkpoint

Step 2: Assemble Our Dataset

Version v2 Aug 12, 2022 Looks like this.

Step 3: Training Custom pretrained YOLOv7 model

Employment Of India CLeaned and Messy Data

🔹 Dataset Composition:

Transformations & Cleaning Applied:

Purpose & Utility:

‘Bike Sharing in Washington D.C. Dataset’ analyzed by Analyst-2

Context

Content

Acknowledgements

‘ 🚴 Bike Sharing Dataset’ analyzed by Analyst-2

About this dataset

Source:

Data Set Information:

Attribute Information:

Relevant Papers:

Citation Request:

How to use this dataset

Acknowledgements

Start A New Notebook!

‘Are Your Employees Burning Out?’ analyzed by Analyst-2

Context

Content

Acknowledgements

Inspiration

‘Bike Sharing Dataset’ analyzed by Analyst-2

‘Detroit Daily Temperatures with Artificial Warming’ analyzed by Analyst-2

Context

Content

Acknowledgements

Inspiration

30 Short Tips for Your Data Scientist Interview

1 Master the Basics

2 Understand Machine Learning

3 Data Manipulation

4 SQL Skills

5 Feature Engineering

6 Model Evaluation

7 Big Data Technologies

8 Coding Challenges

9 Build a Portfolio

10 Kaggle Competitions

11 Open Source Contributions

12 GitHub Profile

13 Understand the Industry

14 Company Research

15 Communication

16 Problem-Solving

17 Adaptability

18 Professional Appearance

19 Punctuality

20 Body Language

21 Active Listening

22 STAR Method

23 Conflict Resolution

24 Teamwork

‘Ratings of the most popular anime’ analyzed by Analyst-2

Context

Content

Inspiration

Features

Ask A Manager 2023 Salary Survey

‘Spotify Recommendation’ analyzed by Analyst-2

Spotify Recommandation

1. Data Collection