This dataset provides salary data based on years of experience, education level, and job role. It can be used for salary prediction models, regression analysis, and workforce analytics. The dataset includes realistic salary variations based on industry trends.
The dataset was synthetically generated using a linear regression-based formula with added randomness and scaling factors based on job roles and education levels. While not real-world data, it closely mimics actual salary distributions in the tech and business industries.
This dataset is designed for research, learning, and data science practice. It is not collected from real-world surveys but follows statistical patterns observed in salary data.
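As an illustration of the generation approach described above, here is a minimal sketch that builds a synthetic salary table from a linear base formula with education and role scaling factors plus Gaussian noise. The multipliers, column names, and ranges are assumptions chosen for illustration, not the formula actually used to produce this dataset.

```python
# Illustrative only: synthetic salary generation with a linear base formula,
# role/education scaling factors, and added randomness (all values assumed).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

edu_factor = {"High School": 1.0, "Bachelor": 1.2, "Master": 1.35, "PhD": 1.5}
role_factor = {"Analyst": 1.0, "Engineer": 1.25, "Manager": 1.4}

experience = rng.uniform(0, 20, n)                       # years of experience
education = rng.choice(list(edu_factor), n)
role = rng.choice(list(role_factor), n)

base = 40_000 + 2_500 * experience                       # linear trend in experience
scale = (np.vectorize(edu_factor.get)(education)
         * np.vectorize(role_factor.get)(role))
salary = base * scale + rng.normal(0, 5_000, n)          # added randomness

df = pd.DataFrame({"YearsExperience": experience.round(1),
                   "Education": education,
                   "JobRole": role,
                   "Salary": salary.round(2)})
print(df.head())
```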
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
Column | Description |
code_blocks_index | Global index linking code blocks to markup_data.csv. |
kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
code_block_id | Position of the code block within the notebook. |
code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
Column | Description |
kernel_id | Identifier for the Kaggle Jupyter notebook. |
kaggle_score | Performance metric of the notebook. |
kaggle_comments | Number of comments on the notebook. |
kaggle_upvotes | Number of upvotes the notebook received. |
kernel_link | URL to the notebook. |
comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
Column | Description |
comp_name | Name of the Kaggle competition. |
description | Overview of the competition task. |
data_type | Type of data used in the competition. |
comp_type | Classification of the competition. |
subtitle | Short description of the task. |
EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
data_sources | Links to datasets used. |
metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
Column | Description |
code_block | Machine learning code block. |
too_long | Flag indicating whether the block spans multiple semantic types. |
marks | Confidence level of the annotation. |
graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code_blocks.csv can be joined to kernels_meta.csv via the kernel_id column, and kernels_meta.csv to competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
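As a minimal sketch of that mapping (assuming the CSV files have been downloaded to the working directory), the tables can be joined with pandas:

```python
# Join code blocks to notebook metadata (via kernel_id) and then to
# competition metadata (via comp_name).
import pandas as pd

code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

blocks_with_meta = (code_blocks
                    .merge(kernels_meta, on="kernel_id", how="left")
                    .merge(competitions_meta, on="comp_name", how="left"))

print(blocks_with_meta[["code_blocks_index", "kernel_id",
                        "comp_name", "kaggle_score"]].head())
```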
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
The Code4ML 2.0 corpus is a versatile resource, enabling the training and evaluation of models across a range of tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Customer log dataset is a 12.5 GB JSON file and it contains 18 columns and 26,259,199 records. There are 12 string columns and 6 numeric columns, which may also contain null or NaN values. The columns include userId, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song,status, ts and userAgent. As evident from the column names, the dataset contains various user-related information, such as user identifiers, demographic details (firstName, lastName, gender), interaction details (artist, song, length, itemInSession, sessionId, registration, lastinteraction) and technical details (userAgent, method, page, location, status, level, auth).
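Given its size, the log is easiest to process in chunks. The sketch below assumes the file is newline-delimited JSON and is named customer_log.json (both assumptions); it counts events per user without loading the whole file into memory.

```python
# Stream the large JSON log in chunks and aggregate per-user event counts.
import pandas as pd
from collections import Counter

user_counts = Counter()
reader = pd.read_json("customer_log.json", lines=True, chunksize=500_000)
for chunk in reader:
    valid = chunk.dropna(subset=["userId"])      # skip rows with a null userId
    user_counts.update(valid["userId"].value_counts().to_dict())

print("Distinct users:", len(user_counts))
```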
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘COVID-19's Impact on Educational Stress’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bsoyka3/educational-stress-due-to-the-coronavirus-pandemic on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The survey collecting this information is still open for responses here.
I just made this public survey because I want someone to be able to do something fun or insightful with the data that's been gathered. You can fill it out too!
Each row represents a response to the survey. A few things have been done to sanitize the raw responses: - Column names and options have been renamed to make them easier to work with without much loss of meaning. - Responses from non-students have been removed. - Responses with ages greater than or equal to 22 have been removed.
Take a look at the column description for each column to see what exactly it represents.
This dataset wouldn't exist without the help of others. I'd like to thank the following people for their contributions: - Every student who responded to the survey with valid responses - @radcliff on GitHub for providing the list of countries and abbreviations used in the survey and dataset - Giovanna de Vincenzo for providing the list of US states used in the survey and dataset - Simon Migaj for providing the image used for the survey and this dataset
--- Original source retains full ownership of the source dataset ---
About the MNAD Dataset The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Dataset Fields
Title: The title of the article.
Body: The body of the article.
Category: The category of the article.
Source: The electronic newspaper source of the article.
About Version 1 of the Dataset (MNAD.v1) Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been used as a benchmark in the research paper "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization", presented at the 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
About Version 2 of the Dataset (MNAD.v2) Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
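The sketch below illustrates that kind of cleaning on MNADv2.csv; the length bounds and the Arabic-script heuristic are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative cleaning pass over MNADv2.csv (thresholds are assumptions).
import re
import pandas as pd

df = pd.read_csv("MNADv2.csv")

df = df.drop_duplicates().dropna()                              # duplicates, NaN rows
df["Body"] = (df["Body"]
              .str.replace(r"\n+", " ", regex=True)             # new lines -> space
              .str.replace(r"\s+", " ", regex=True)             # collapse multiple spaces
              .str.strip())

lengths = df["Body"].str.split().str.len()
df = df[lengths.between(20, 2000)]                              # drop very short/long articles

arabic_ratio = df["Body"].apply(
    lambda t: len(re.findall(r"[\u0600-\u06FF]", t)) / max(len(t), 1))
df = df[arabic_ratio > 0.5]                                     # keep mostly-Arabic articles
print(len(df), "articles after cleaning")
```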
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
Citation If you use our data, please cite the following paper:
```bibtex
@inproceedings{MNAD2021,
  author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
  title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
  year      = {2021},
  publisher = {{IEEE}},
  booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
  doi       = {10.1109/dasa53625.2021.9682402},
  url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Kaggle Datasets Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-datasets-ranking on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains Kaggle ranking of datasets.
More than 800 rows and 8 columns. Column descriptions are listed below.
Data from Kaggle. Image from The Guardian.
If you're reading this, please upvote.
--- Original source retains full ownership of the source dataset ---
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.
This dataset contains information on reported crimes, with each row representing a specific crime incident. The meaning and typical content of each column are described below:
DR_NO: This column likely represents a unique identifier or reference number for each reported crime incident. It helps in tracking and referencing individual cases.
Date Rptd: This column stores the date when the crime was reported to law enforcement authorities. It marks the date when the incident came to their attention.
DATE OCC: This column indicates the date when the crime actually occurred or took place. It represents the day when the incident happened.
TIME OCC: This column records the time of day when the crime occurred. It provides a timestamp for the incident.
AREA: This column may represent a specific geographical area or jurisdiction within a larger region where the crime took place. It categorizes the incident's location.
AREA NAME: This column likely contains the name or label of the larger area or district that encompasses the specific area where the crime occurred.
Rpt Dist No: This column might represent a reporting district number or code within the specified area. It provides additional location details.
Part 1-2: This column could be related to the type or category of crime reported. "Part 1" crimes typically include serious offenses like homicide, robbery, etc., while "Part 2" crimes may include less serious offenses.
Crm Cd: This column may contain a numerical code representing the specific type of crime that was committed. Each code corresponds to a distinct category of criminal activity.
Crm Cd Desc: This column likely contains a textual description or label for the crime type identified by the "Crm Cd."
Mocodes: This column might store additional information or details related to the modus operandi (MO) of the crime, providing insights into how the crime was committed.
Vict Age: This column records the age of the victim involved in the crime.
Vict Sex: This column indicates the gender or sex of the victim.
Vict Descent: This column might represent the ethnic or racial background of the victim.
Premis Cd: This column could contain a numerical code representing the type of premises where the crime occurred, such as a residence, commercial establishment, or public place.
Premis Desc: This column likely contains a textual description or label for the type of premises identified by the "Premis Cd."
Weapon Used Cd: This column may indicate whether a weapon was used in the commission of the crime and, if so, it could provide a numerical code for the type of weapon.
Weapon Desc: This column likely contains a textual description or label for the type of weapon identified by the "Weapon Used Cd."
Status: This column could represent the current status or disposition of the reported crime, such as "open," "closed," "under investigation," etc.
Status Desc: This column likely contains a textual description or label for the status of the reported crime.
Crm Cd 1, Crm Cd 2, Crm Cd 3, Crm Cd 4: These columns might provide additional numerical codes for multiple crime categories associated with a single incident.
LOCATION: This column likely describes the specific location or address where the crime occurred, providing detailed location information.
Cross Street: This column might include the name of a cross street or intersection near the crime location, offering additional context.
LAT: This column stores the latitude coordinate of the crime location, allowing for precise geospatial mapping.
LON: This column contains the longitude coordinate of the crime location, complementing the latitude for accurate geolocation.
Overall, this dataset appears to be a comprehensive record of reported crimes, providing valuable information about the nature of each incident, the location, and various details related to the victims, perpetrators, and circumstances surrounding the crimes. It can be a valuable resource for crime analysis, law enforcement, and public safety research.
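A minimal preparation sketch, assuming the data is exported as a CSV with the columns described above (the file name is an assumption): it parses the two date columns and masks the (0°, 0°) placeholder coordinates.

```python
# Parse dates and treat the (0, 0) placeholder coordinates as missing values.
import numpy as np
import pandas as pd

crime = pd.read_csv("crime_data_2020_to_present.csv")

crime["Date Rptd"] = pd.to_datetime(crime["Date Rptd"], errors="coerce")
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"], errors="coerce")

zero_loc = (crime["LAT"] == 0) & (crime["LON"] == 0)
crime.loc[zero_loc, ["LAT", "LON"]] = np.nan

print(crime[["DR_NO", "DATE OCC", "LAT", "LON"]].head())
```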
https://creativecommons.org/publicdomain/zero/1.0/
YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”.
Note that this dataset is a structurally improved version of the original Trending YouTube Video Statistics dataset.
This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the IN, US, GB, DE, CA, FR, RU, BR, MX, KR, and JP regions (India, USA, Great Britain, Germany, Canada, France, Russia, Brazil, Mexico, South Korea, and Japan, respectively), with up to 200 listed trending videos per day.
Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the 11 regions in the dataset.
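A minimal lookup sketch, assuming the US files are named US_youtube_trending_data.csv and US_category_id.json and that the column is called category_id (the exact names may differ between versions):

```python
# Map the numeric category_id to a readable category name using the region's JSON file.
import json
import pandas as pd

videos = pd.read_csv("US_youtube_trending_data.csv")

with open("US_category_id.json", encoding="utf-8") as f:
    categories = json.load(f)

id_to_name = {int(item["id"]): item["snippet"]["title"]
              for item in categories["items"]}

videos["category_name"] = videos["category_id"].map(id_to_name)
print(videos[["title", "category_id", "category_name"]].head())
```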
For more information on specific columns in the dataset refer to the column metadata.
This dataset was collected using the YouTube API. This dataset is the updated version of Trending YouTube Video Statistics.
Possible uses for this dataset could include: - Sentiment analysis in a variety of forms - Categorizing YouTube videos based on their comments and statistics. - Training ML algorithms like RNNs to generate their own YouTube comments. - Analyzing what factors affect how popular a YouTube video will be. - Statistical analysis over time.
For further inspiration, see the kernels on this dataset!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Phishing website Detector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/eswarchandt/phishing-website-detector on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The data set is provided both as a text file and a CSV file, which provide the following resources that can be used as inputs for model building:
A collection of website URLs for 11000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).
A code template containing the following code blocks: (a) import modules (Part 1); (b) a load-data function plus input/output field descriptions.
The data set also serves as an input for project scoping, helping to specify the functional and non-functional requirements for the project.
You are expected to write the code for a binary classification model (phishing website or not) using Python Scikit-Learn that trains on the data and calculates the accuracy score on the test data. You have to use one or more of the classification algorithms to train a model on the phishing website data set.
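A minimal sketch of such a model, assuming the CSV file is named phishing.csv and that the last column holds the 1/-1 class label while the first 30 columns hold the website parameters:

```python
# Train a simple classifier on the phishing data and report test accuracy.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("phishing.csv")
X = data.iloc[:, :-1]          # 30 website parameters
y = data.iloc[:, -1]           # class label: phishing or not (1 or -1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```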
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers for H1B visa applications. Because locations, employers, job titles, and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret H1B visa trends and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and latitude and longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.
Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js
Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm
Running the code
Open Index.html.
Data Processing
- Do some data preprocessing to transform the raw data into an understandable format.
- Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
- Develop the variables needed for the visualizations and compile them into the visualization programs.
- Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
- Extract some aspects and analyze the changes in employers' preferences as well as forecasts for future trends.
Visualizations
- Combo chart: shows the overall volume of receipts and the approval rate.
- Scatter plot: shows the beneficiary country of birth.
- Geo map: shows H1B petitions filed for all states.
- Line chart: shows the top 10 states for H1B petitions filed.
- Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
- Tree map: shows the top employers who submit the greatest number of applications overall.
- Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
- Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
- Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.
Related Research
- The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
- Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
- Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
- H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
- H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa
Key Findings
- From the analysis, the government cut down the number of H1B approvals in 2017.
- In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
- Technical jobs such as Computer Systems Analyst and Software Developer make up the majority of the top 10 jobs among foreign workers.
- Employers located in metro areas strive to find a foreign workforce to fill the technical positions in their organizations.
- States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and provide many job opportunities.
- Top companies such as Infosys, Tata, and IBM India, which submit the most H1B visa applications, are companies based in India associated with software and IT services.
- The Data Scientist position has experienced exponential growth in H1B visa applications, and these jobs are clustered most heavily in the West region.
Visualization programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
Description:
This dataset is sourced from the "Roast-PC by Gemini" website, a platform that provides AI-powered roasting (critical feedback) on custom PC builds. Users input the components of their PC build, including CPU, GPU, motherboard, RAM, PSU, disk, and intended use case. The dataset captures the logs of these submissions, along with the roasting comments generated by Gemini AI, Google's AI model.
Dataset Overview:
Column Names and Descriptions:
Time: Date and time of the request.
cpu: The CPU model specified by the user (e.g., "AMD Ryzen 5 5500", "Intel i7 1200K").
gpu: The GPU model specified by the user (e.g., "NVIDIA RTX 3080", "AMD Radeon RX 6800").
motherboard: The motherboard model specified by the user (e.g., "ASUS ROG Strix B550-F", "MSI B450 TOMAHAWK").
ram: The RAM configuration specified by the user, including size and speed (e.g., "16GB DDR4 3200MHz").
psu: The PSU (power supply unit) model specified by the user, including wattage (e.g., "Corsair RM750x 750W").
disk: The storage devices specified by the user, including type and capacity (e.g., "1TB NVMe SSD", "500GB SATA HDD").
use_case: The intended use of the PC as specified by the user (e.g., "gaming", "video editing", "general use").
roast_comments: The AI-generated feedback or roasting comments provided by Gemini AI, critiquing the PC build based on the components and use case (in Indonesian).
Functionality:
This dataset serves multiple purposes:
This dataset is ideal for those interested in PC building, hardware analysis, AI-generated content, or anyone curious about trends in custom PC configurations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘🗳 Pollster Ratings’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/pollster-ratingse on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains the data behind FiveThirtyEight's pollster ratings.
- FiveThirtyEight's Pollster Ratings
- The State Of The Polls, 2019
- The Polls Are All Right
- The State Of The Polls, 2016
- How FiveThirtyEight Calculates Pollster Ratings
pollster-stats-full contains a spreadsheet with all of the summary data and calculations involved in determining the pollster ratings, as well as descriptions for each column.
pollster-ratings has ratings and calculations for each pollster. A copy of this data and descriptions for each column can also be found in pollster-stats-full.
raw-polls contains all of the polls analyzed to give each pollster a grade.
Source: https://github.com/fivethirtyeight/data
License: The data is available under the Creative Commons Attribution 4.0 International License. If you find it useful, please let us know.
Updated: Pollster-ratings and raw-polls synced from source weekly.
This dataset was created by FiveThirtyEight and contains around 10000 samples along with Cand2 Id, Pollster, technical information and other features such as: - Samplesize - Partisan - and more.
- Analyze Cand2 Party in relation to Race Id
- Study the influence of Margin Poll on Cand1 Actual
- More datasets
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
This dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction. It has been used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/
===================================================
Full description:
This dataset contains 2 files, train.txt and test.txt, corresponding to the training and test parts of the data.
====================================================
Dataset construction:
The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo, and the first column indicates whether this ad has been clicked or not. The positive (clicked) and negative (non-clicked) examples have both been subsampled (at different rates) in order to reduce the dataset size.
There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes. The semantics of these features are undisclosed. Some features may have missing values.
The rows are chronologically ordered.
The test set is computed in the same way as the training set but it corresponds to events on the day following the training period. The first column (label) has been removed.
====================================================
Format:
The columns are tab-separated with the following schema: <label> <integer feature 1> … <integer feature 13> <categorical feature 1> … <categorical feature 26>
When a value is missing, the field is just empty. There is no label field in the test set.
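A minimal reading sketch with pandas, assuming the layout above (label, 13 integer features, 26 categorical features) and loading only a slice of train.txt:

```python
# Read a slice of the tab-separated training file and impute missing values.
import pandas as pd

int_cols = [f"int_{i}" for i in range(1, 14)]
cat_cols = [f"cat_{i}" for i in range(1, 27)]
cols = ["label"] + int_cols + cat_cols

train = pd.read_csv("train.txt", sep="\t", names=cols, nrows=100_000)

train[int_cols] = train[int_cols].fillna(0)            # empty integer fields -> 0
train[cat_cols] = train[cat_cols].fillna("missing")    # empty categorical fields -> token

print("CTR in subsample:", train["label"].mean())
```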
====================================================
Dataset assembled by Olivier Chapelle (o.chapelle@criteo.com)
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.
Citation
This dataset was generated and used in our paper:
Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.
Please cite this paper if you use the timestamps.csv file in your work.
Generation
The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:
A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006
More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.
Description
The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.
The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:
file_name,cough_number,start_time,end_time
Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
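A minimal extraction sketch, assuming the coughs directory sits next to timestamps.csv and using the soundfile library to cut each annotated segment out of its source file:

```python
# Cut each annotated cough segment out of its source wav file.
import os
import pandas as pd
import soundfile as sf

timestamps = pd.read_csv("timestamps.csv")
os.makedirs("segments", exist_ok=True)

for row in timestamps.itertuples(index=False):
    audio, sr = sf.read(os.path.join("coughs", row.file_name))
    segment = audio[int(row.start_time * sr):int(row.end_time * sr)]
    out_name = f"{os.path.splitext(row.file_name)[0]}_cough{row.cough_number}.wav"
    sf.write(os.path.join("segments", out_name), segment, sr)
```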
Licensing
The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.
The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.
The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset offers an extensive assortment of job postings, designed to support investigations and examinations within the realms of job market patterns, natural language processing (NLP), and machine learning. Developed for educational and research objectives, this dataset presents a varied array of job advertisements spanning diverse industries and job categories.
Category: The category of the job.
Workplace: Whether the job is remote, on-site, or hybrid.
Location: Location of the job posting.
Department: The department for which the job has been posted.
Type: Whether the job is full-time, part-time, or contractual in nature.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset provides the Young's Modulus values (in GPa) for 50 metals, covering a wide range of categories such as alkali metals, alkaline earth metals, transition metals, and rare earth elements. Young's Modulus is a fundamental mechanical property that measures a material's stiffness under tensile or compressive stress. It is critical for applications in materials science, physics, and engineering.
The dataset includes: - 50 metals with their chemical symbols and Young's Modulus values. - A wide range of stiffness values, from soft metals like cesium (1.7 GPa) to very stiff metals like ruthenium (447 GPa). - Clean and complete data with no missing or duplicate entries.
This dataset can be utilized in various data science and engineering applications, such as: 1. Material Property Prediction: Train machine learning models to predict mechanical properties based on elemental features. 2. Cluster Analysis: Group metals based on their mechanical properties or periodic trends. 3. Correlation Studies: Explore relationships between Young's Modulus and other physical/chemical properties (e.g., density, atomic radius). 4. Engineering Simulations: Use the data for simulations in structural analysis or material selection for design purposes. 5. Visualization and Education: Create visualizations to teach periodic trends and material property variations.
Column Name | Description |
---|---|
Metal | Name of the metal (e.g., Lithium, Beryllium). |
Symbol | Chemical symbol of the metal (e.g., Li, Be). |
Young's Modulus (GPa) | Young's Modulus value in gigapascals (GPa), indicating stiffness under stress. |
The dataset was ethically sourced from publicly available scientific references and academic resources. The data was verified for accuracy using multiple authoritative sources, ensuring reliability for research and educational purposes. No proprietary or sensitive information was included.
Key checks performed: - No missing values: The dataset contains complete entries for all 50 metals. - No duplicates: Each metal appears only once in the dataset. - Statistical analysis: The mean Young's Modulus is ~98.93 GPa, with a wide range from 1.7 GPa to 447 GPa.
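A minimal sketch reproducing these checks, assuming the file is named youngs_modulus.csv and uses the column names from the table above:

```python
# Verify completeness, uniqueness, and basic statistics of the dataset.
import pandas as pd

df = pd.read_csv("youngs_modulus.csv")

assert df.isna().sum().sum() == 0, "unexpected missing values"
assert not df["Metal"].duplicated().any(), "unexpected duplicate metals"

ym = df["Young's Modulus (GPa)"]
print(f"n = {len(df)}, mean = {ym.mean():.2f} GPa, "
      f"min = {ym.min()} GPa, max = {ym.max()} GPa")
```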
We would like to thank the following sources for their contributions to this dataset: - Academic references such as WebElements, Byju's Chemistry Resources, and Wikipedia for cross-verifying the data. - Scientific databases like MatWeb and ASM International for providing accurate material property data. - Special thanks to DALL·E 3 for generating the accompanying dataset image.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Description
This dataset captures agricultural data for Karnataka, specifically focusing on crop yields in Mangalore. Key features include the year of production, geographic details, and environmental conditions such as rainfall (measured in mm), temperature (in degrees Celsius), and humidity (as a percentage). Soil type, irrigation method, and crop type are also recorded, along with crop yields, market price, and season of growth (e.g., Kharif).
The dataset includes several columns related to crop production conditions and outcomes. For example, coconut crop data reveals a pattern of yields over different area sizes, showing how factors like rainfall, temperature, and irrigation influence production. Prices also vary, offering insights into the economic aspects of agriculture in the region. This information could be used to study the impact of environmental conditions and farming techniques on crop productivity, assisting in the development of optimized agricultural practices tailored for specific soil types, climates, and crop needs.
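As an illustration of such a study, the sketch below fits a simple model that predicts yield from rainfall, temperature, and humidity; the file name and lower-case column names are assumptions based on the description above.

```python
# Fit a basic regression model relating environmental conditions to crop yield.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

crops = pd.read_csv("karnataka_crop_yield.csv")

X = crops[["rainfall", "temperature", "humidity"]]
y = crops["yield"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```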
Column Description
yield: The amount of crop produced per unit area of land.
In the season column:
Kharif Season: This is the monsoon crop season, where crops are sown at the beginning of the monsoon season (around June) and harvested at the end of the monsoon season (around October). Examples of Kharif crops include rice, maize, and pulses.
Rabi Season: This is the winter crop season, where crops are sown after the monsoon season (around November) and harvested in the spring (around April). Examples of Rabi crops include wheat, barley, and mustard.
Zaid Season: This is the summer crop season, which falls between the Kharif and Rabi seasons (around March to June). Zaid crops are usually short-duration crops and include vegetables, watermelons, and cucumbers.
Authors
rajesh naik
Area covered
Karnataka
Unique identifier
https://creativecommons.org/publicdomain/zero/1.0/
Dataset History
A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.
Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.
Tasks to perform
The Purchase column is the target variable; perform univariate analysis and bivariate analysis with respect to Purchase.
"Masked" in the column description means the values have already been converted from categorical to numerical.
The points below are provided to get you started with the dataset; it is not mandatory to follow the same sequence.
DATA PREPROCESSING
Check the basic statistics of the dataset
Check for missing values in the data
Check for unique values in data
Perform EDA
Purchase Distribution
Check for outliers
Analysis by gender, marital status, occupation, occupation vs. purchase, purchase by city, purchase by age group, etc.
Drop unnecessary fields
Convert categorical data into integers using the map function (e.g., the 'Gender' column); a short illustrative sketch follows this list.
Missing value treatment
Rename columns
Fill nan values
Map range variables into integers (e.g., the 'Age' column)
Data Visualisation
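A minimal sketch of the mapping and imputation steps above, assuming the usual column names for this dataset (the file name, Gender, Age, Purchase, and the product-category fields are assumptions):

```python
# Map categorical and range columns to integers and fill missing values.
import pandas as pd

df = pd.read_csv("train.csv")

df["Gender"] = df["Gender"].map({"F": 0, "M": 1})          # categorical -> integer

age_map = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
           "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_map)                         # range variable -> integer

df = df.fillna(0)                                          # fill NaN values

print(df[["Gender", "Age", "Purchase"]].describe())
```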
All the Best!!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘prediction of facebook comment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kiranraje/prediction-facebook-comment on 28 January 2022.
--- Dataset description provided by original source is as follows ---
The dataset is uploaded in ZIP format and contains 5 variants of the dataset; for details about the variants and a detailed analysis, read and cite the research paper titled 'Comment Volume Prediction'.
This dataset contains 28 columns, including:
1] Describes the popularity of or support for the source.
2] Describes how many people have visited this place so far.
3] Defines the daily interest of individuals towards the source of the document/post.
4] Defines the daily interest of individuals towards the source of the document/post.
--- Original source retains full ownership of the source dataset ---
Excel spreadsheets by species (the 4-letter code is an abbreviation for the genus and species used in the study, the year 2010 or 2011 is the year the data were collected, SH indicates data for Science Hub, and the date is the date of file preparation). The data in a file are described in a read-me file, which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read-me description of the columns in the data set for chemical analysis; in this file, one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk, D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee, and M. Plocher. Plant reproduction is altered by simulated herbicide drift to constructed plant communities. Environmental Toxicology and Chemistry, Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).