64 datasets found
  1. Salary Prediction: Based on years of experience

    • kaggle.com
    Updated Jan 31, 2025
    Cite
    Adil Shamim (2025). Salary Prediction: Based on years of experience [Dataset]. http://doi.org/10.34740/kaggle/dsv/10626597
    Explore at:
    Croissant – Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adil Shamim
    Description

    About the Dataset

    Dataset Name: Salary Prediction Dataset

    File Format: CSV

    Rows: 100,000

    Columns: 4

    Overview

    This dataset provides salary data based on years of experience, education level, and job role. It can be used for salary prediction models, regression analysis, and workforce analytics. The dataset includes realistic salary variations based on industry trends.

    Columns Description

    1. YearsExperience (float) – Number of years of experience (0 to 40 years).
    2. Education Level (string) – The highest level of education attained. Categories include:
      • High School
      • Associate Degree
      • Bachelor's
      • Master's
      • PhD
    3. Job Role (string) – Common job titles in the industry:
      • Software Engineer
      • Data Scientist
      • Product Manager
      • Marketing Specialist
      • Business Analyst
    4. Salary (float) – The estimated annual salary in USD. The salary is influenced by experience, education level, and job role.

    Potential Use Cases

    • Salary Prediction Models – Train regression models to predict salaries based on experience and qualifications.
    • Data Science & Machine Learning – Use this dataset for exploratory data analysis and feature engineering.
    • Workforce Analysis – Analyze salary trends across job roles and experience levels.

    How the Data Was Generated

    The dataset was synthetically generated using a linear regression-based formula with added randomness and scaling factors based on job roles and education levels. While not real-world data, it closely mimics actual salary distributions in the tech and business industries.
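The exact generation formula is not published; as a rough illustration of the approach described above (a linear base on years of experience, scaled by hypothetical education and job-role factors, with added randomness — all multipliers below are assumptions, not the dataset's actual values), a minimal pure-Python sketch might look like:

```python
import csv
import random

# Hypothetical factors -- the dataset's real scaling values are not published.
EDU_FACTOR = {"High School": 0.8, "Associate Degree": 0.9,
              "Bachelor's": 1.0, "Master's": 1.15, "PhD": 1.3}
ROLE_BASE = {"Software Engineer": 60000, "Data Scientist": 65000,
             "Product Manager": 70000, "Marketing Specialist": 50000,
             "Business Analyst": 55000}

def make_row(rng):
    """Generate one synthetic (YearsExperience, Education, Role, Salary) row."""
    years = round(rng.uniform(0, 40), 1)
    edu = rng.choice(list(EDU_FACTOR))
    role = rng.choice(list(ROLE_BASE))
    # Linear base plus per-year growth, scaled by education, with noise.
    salary = (ROLE_BASE[role] + 2500 * years) * EDU_FACTOR[edu]
    salary *= rng.uniform(0.9, 1.1)  # added randomness
    return [years, edu, role, round(salary, 2)]

def write_dataset(path, n, seed=42):
    """Write n synthetic rows to a CSV with the dataset's four columns."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["YearsExperience", "Education Level", "Job Role", "Salary"])
        for _ in range(n):
            w.writerow(make_row(rng))
```

A fixed seed keeps the output reproducible, which matters if the data is used as a regression benchmark.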

    Acknowledgments

    This dataset is designed for research, learning, and data science practice. It is not collected from real-world surveys but follows statistical patterns observed in salary data.

  2. Code4ML 2.0: a Large-scale Dataset of annotated Machine Learning Code

    • zenodo.org
    csv, txt
    Updated Oct 23, 2024
    Cite
    Anonymous; Anonymous (2024). Code4ML 2.0: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.13918465
    Explore at:
    csv, txt – available download formats
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    • code_blocks_index – Global index linking code blocks to markup_data.csv.
    • kernel_id – Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    • code_block_id – Position of the code block within the notebook.
    • code_block – The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    • kernel_id – Identifier for the Kaggle Jupyter notebook.
    • kaggle_score – Performance metric of the notebook.
    • kaggle_comments – Number of comments on the notebook.
    • kaggle_upvotes – Number of upvotes the notebook received.
    • kernel_link – URL to the notebook.
    • comp_name – Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    • comp_name – Name of the Kaggle competition.
    • description – Overview of the competition task.
    • data_type – Type of data used in the competition.
    • comp_type – Classification of the competition.
    • subtitle – Short description of the task.
    • EvaluationAlgorithmAbbreviation – Metric used for assessing competition submissions.
    • data_sources – Links to datasets used.
    • metric type – Class label for the assessment metric.

    Table 4. markup_data.csv structure

    • code_block – Machine learning code block.
    • too_long – Flag indicating whether the block spans multiple semantic types.
    • marks – Confidence level of the annotation.
    • graph_vertex_id – ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
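The table linkage described above can be sketched with a generic inner join over dict records. The column names follow the tables documented here; `join_on` and the two sample rows are illustrative, not part of the dataset:

```python
def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dict records on a shared key column."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key], []).append(r)
    joined = []
    for left in left_rows:
        for r in index.get(left[key], []):
            # Left-hand fields win on any name collision.
            joined.append({**r, **left})
    return joined

# Toy rows mimicking code_blocks.csv and kernels_meta.csv.
code_blocks = [{"kernel_id": "k1", "code_block": "import pandas as pd"}]
kernels_meta = [{"kernel_id": "k1", "comp_name": "titanic", "kaggle_score": "0.78"}]

rows = join_on(code_blocks, kernels_meta, "kernel_id")
```

The same helper links kernels_meta.csv to competitions_meta.csv by passing `"comp_name"` as the key.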

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions were retrieved with the help of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  3. Data from: Customer Churn Dataset

    • ieee-dataport.org
    • kaggle.com
    Updated Jun 4, 2024
    + more versions
    Cite
    Usman JOY (2024). Customer Churn Dataset [Dataset]. http://doi.org/10.21227/wc9d-b672
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    IEEE Dataport
    Authors
    Usman JOY
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The customer log dataset is a 12.5 GB JSON file containing 18 columns and 26,259,199 records. There are 12 string columns and 6 numeric columns, which may also contain null or NaN values. The columns are userId, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts and userAgent. As evident from the column names, the dataset contains various user-related information, such as user identifiers (userId), demographic details (firstName, lastName, gender), interaction details (artist, song, length, itemInSession, sessionId, registration, ts) and technical details (userAgent, method, page, location, status, level, auth).
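A 12.5 GB file should be streamed rather than loaded whole. Assuming the log is stored as newline-delimited JSON records (one event per line, a common layout for event logs — adjust if it is a single JSON array), a minimal stdlib sketch:

```python
import json

def iter_events(lines):
    """Parse one JSON record per line, skipping blanks, so a multi-GB log
    never has to be held in memory at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def count_missing_users(lines):
    # userId may be null/NaN per the dataset description.
    return sum(1 for e in iter_events(lines) if not e.get("userId"))

# Usage with the real file:
# with open("customer_log.json", encoding="utf-8") as f:
#     print(count_missing_users(f))
```

Accepting any iterable of lines (file handle, list, generator) keeps the parser easy to test and compose.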

  4. ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘COVID-19's Impact on Educational Stress’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-covid-19-s-impact-on-educational-stress-49b5/4f12e21a/?iid=019-227&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘COVID-19's Impact on Educational Stress’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/bsoyka3/educational-stress-due-to-the-coronavirus-pandemic on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Made by Statistry

    The survey collecting this information is still open for responses here.

    Context

    I just made this public survey because I want someone to be able to do something fun or insightful with the data that's been gathered. You can fill it out too!

    Content

    Each row represents a response to the survey. A few things have been done to sanitize the raw responses:

    • Column names and options have been renamed to make them easier to work with without much loss of meaning.
    • Responses from non-students have been removed.
    • Responses with ages greater than or equal to 22 have been removed.

    Take a look at the column description for each column to see what exactly it represents.

    Acknowledgements

    This dataset wouldn't exist without the help of others. I'd like to thank the following people for their contributions:

    • Every student who responded to the survey with valid responses
    • @radcliff on GitHub for providing the list of countries and abbreviations used in the survey and dataset
    • Giovanna de Vincenzo for providing the list of US states used in the survey and dataset
    • Simon Migaj for providing the image used for the survey and this dataset

    --- Original source retains full ownership of the source dataset ---

  5. MNAD Dataset

    • paperswithcode.com
    Updated May 16, 2023
    + more versions
    Cite
    (2023). MNAD Dataset [Dataset]. https://paperswithcode.com/dataset/mnad
    Explore at:
    Dataset updated
    May 16, 2023
    Description

    About the MNAD Dataset

    The MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

    Dataset Fields

    • Title: The title of the article
    • Body: The body of the article
    • Category: The category of the article
    • Source: The electronic newspaper source of the article

    About Version 1 of the Dataset (MNAD.v1)

    Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

    The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv1
    • Huggingface Datasets: MNADv1

    About Version 2 of the Dataset (MNAD.v2)

    Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

    The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

    Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv2
    • Huggingface Datasets: MNADv2

    Citation

    If you use our data, please cite the following paper:

    @inproceedings{MNAD2021,
      author    = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
      title     = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
      booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
      publisher = {{IEEE}},
      year      = {2021},
      doi       = {10.1109/dasa53625.2021.9682402},
      url       = {https://doi.org/10.1109/dasa53625.2021.9682402},
    }

  6. ‘Kaggle Datasets Ranking’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 18, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Kaggle Datasets Ranking’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-kaggle-datasets-ranking-2744/64eafea2/?iid=003-656&v=presentation
    Explore at:
    Dataset updated
    Nov 18, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Kaggle Datasets Ranking’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/vivovinco/kaggle-datasets-ranking on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset contains Kaggle ranking of datasets.

    Content

    800+ rows and 8 columns. The column descriptions are listed below.

    • Rank : Rank of the user
    • Tier : Grandmaster, Master or Expert
    • Username : Name of the user
    • Join Date : Year of join
    • Gold Medals : Number of gold medals
    • Silver Medals : Number of silver medals
    • Bronze Medals : Number of bronze medals
    • Points : Total points

    Acknowledgements

    Data from Kaggle. Image from The Guardian.

    If you're reading this, please upvote.

    --- Original source retains full ownership of the source dataset ---

  7. CrimeDataset From 2000 to present

    • kaggle.com
    Updated Sep 16, 2023
    + more versions
    Cite
    Tushar Bhalerao (2023). CrimeDataset From 2000 to present [Dataset]. https://www.kaggle.com/datasets/tush32/crimedataset-from-2000-to-present
    Explore at:
    Croissant – Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 16, 2023
    Dataset provided by
    Kaggle
    Authors
    Tushar Bhalerao
    Description

    This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. This data is transcribed from original crime reports that are typed on paper and therefore there may be some inaccuracies within the data. Some location fields with missing data are noted as (0°, 0°). Address fields are only provided to the nearest hundred block in order to maintain privacy. This data is as accurate as the data in the database. Please note questions or concerns in the comments.

    The dataset contains information related to reported crimes, with each row representing a specific crime incident. The meaning and potential content of each column is described below:

    1. DR_NO: This column likely represents a unique identifier or reference number for each reported crime incident. It helps in tracking and referencing individual cases.

    2. Date Rptd: This column stores the date when the crime was reported to law enforcement authorities. It marks the date when the incident came to their attention.

    3. DATE OCC: This column indicates the date when the crime actually occurred or took place. It represents the day when the incident happened.

    4. TIME OCC: This column records the time of day when the crime occurred. It provides a timestamp for the incident.

    5. AREA: This column may represent a specific geographical area or jurisdiction within a larger region where the crime took place. It categorizes the incident's location.

    6. AREA NAME: This column likely contains the name or label of the larger area or district that encompasses the specific area where the crime occurred.

    7. Rpt Dist No: This column might represent a reporting district number or code within the specified area. It provides additional location details.

    8. Part 1-2: This column could be related to the type or category of crime reported. "Part 1" crimes typically include serious offenses like homicide, robbery, etc., while "Part 2" crimes may include less serious offenses.

    9. Crm Cd: This column may contain a numerical code representing the specific type of crime that was committed. Each code corresponds to a distinct category of criminal activity.

    10. Crm Cd Desc: This column likely contains a textual description or label for the crime type identified by the "Crm Cd."

    11. Mocodes: This column might store additional information or details related to the modus operandi (MO) of the crime, providing insights into how the crime was committed.

    12. Vict Age: This column records the age of the victim involved in the crime.

    13. Vict Sex: This column indicates the gender or sex of the victim.

    14. Vict Descent: This column might represent the ethnic or racial background of the victim.

    15. Premis Cd: This column could contain a numerical code representing the type of premises where the crime occurred, such as a residence, commercial establishment, or public place.

    16. Premis Desc: This column likely contains a textual description or label for the type of premises identified by the "Premis Cd."

    17. Weapon Used Cd: This column may indicate whether a weapon was used in the commission of the crime and, if so, it could provide a numerical code for the type of weapon.

    18. Weapon Desc: This column likely contains a textual description or label for the type of weapon identified by the "Weapon Used Cd."

    19. Status: This column could represent the current status or disposition of the reported crime, such as "open," "closed," "under investigation," etc.

    20. Status Desc: This column likely contains a textual description or label for the status of the reported crime.

    21. Crm Cd 1, Crm Cd 2, Crm Cd 3, Crm Cd 4: These columns might provide additional numerical codes for multiple crime categories associated with a single incident.

    22. LOCATION: This column likely describes the specific location or address where the crime occurred, providing detailed location information.

    23. Cross Street: This column might include the name of a cross street or intersection near the crime location, offering additional context.

    24. LAT: This column stores the latitude coordinate of the crime location, allowing for precise geospatial mapping.

    25. LON: This column contains the longitude coordinate of the crime location, complementing the latitude for accurate geolocation.

    Overall, this dataset appears to be a comprehensive record of reported crimes, providing valuable information about the nature of each incident, the location, and various details related to the victims, perpetrators, and circumstances surrounding the crimes. It can be a valuable resource for crime analysis, law enforcement, and public safety research.
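As a small illustration of working with the location fields described above, the sketch below drops rows whose LAT/LON carry the (0°, 0°) missing-data placeholder mentioned earlier. The column names follow the list above; the helper itself is hypothetical:

```python
import csv
import io

def valid_locations(csv_text):
    """Yield crime rows whose LAT/LON are present and are not the
    (0, 0) placeholder this dataset uses for missing locations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        try:
            lat, lon = float(row["LAT"]), float(row["LON"])
        except (ValueError, KeyError, TypeError):
            continue  # missing or unparsable coordinates
        if (lat, lon) != (0.0, 0.0):
            yield row

# Tiny inline sample; with the real file, pass the file's text instead.
sample = "DR_NO,LAT,LON\n1,34.05,-118.24\n2,0,0\n"
rows = list(valid_locations(sample))  # keeps only the DR_NO 1 row
```

Filtering placeholders before mapping avoids a misleading cluster of points at the (0°, 0°) origin.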

  8. YouTube Trending Video Dataset (updated daily)

    • kaggle.com
    zip
    Updated Apr 15, 2024
    Cite
    Rishav Sharma (2024). YouTube Trending Video Dataset (updated daily) [Dataset]. https://www.kaggle.com/rsrishav/YouTube-trending-video-dataset
    Explore at:
    zip (0 bytes) – available download formats
    Dataset updated
    Apr 15, 2024
    Authors
    Rishav Sharma
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    This dataset is a daily record of the top trending YouTube videos and it will be updated daily.

    Context

    YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year”.

    Note that this dataset is a structurally improved version of this dataset.

    Content

    This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the IN, US, GB, DE, CA, FR, RU, BR, MX, KR, and JP regions (India, USA, Great Britain, Germany, Canada, France, Russia, Brazil, Mexico, South Korea, and Japan, respectively), with up to 200 listed trending videos per day.

    Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

    The data also includes a category_id field, which varies between regions. To retrieve the categories for a specific video, find it in the associated JSON. One such file is included for each of the 11 regions in the dataset.
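Assuming each region's JSON file follows the YouTube API videoCategories response shape (an "items" list where each entry has an "id" and a "snippet" with a "title" — check this against the actual files), the category_id lookup might be built like this:

```python
import json

def load_categories(json_text):
    """Map category_id -> category title from a region's category JSON,
    assuming the YouTube API videoCategories response layout."""
    data = json.loads(json_text)
    return {item["id"]: item["snippet"]["title"] for item in data["items"]}

# Minimal example of the assumed layout:
sample = '{"items": [{"id": "10", "snippet": {"title": "Music"}}]}'
cats = load_categories(sample)  # {"10": "Music"}
```

Because category_id varies between regions, build one lookup per region file rather than sharing a single mapping.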

    For more information on specific columns in the dataset refer to the column metadata.

    Acknowledgements

    This dataset was collected using the YouTube API. This dataset is the updated version of Trending YouTube Video Statistics.

    Inspiration

    Possible uses for this dataset could include:

    • Sentiment analysis in a variety of forms
    • Categorizing YouTube videos based on their comments and statistics
    • Training ML algorithms like RNNs to generate their own YouTube comments
    • Analyzing what factors affect how popular a YouTube video will be
    • Statistical analysis over time

    For further inspiration, see the kernels on this dataset!

  9. ‘Phishing website Detector’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Mar 2, 2020
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2020). ‘Phishing website Detector’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-phishing-website-detector-d919/latest
    Explore at:
    Dataset updated
    Mar 2, 2020
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Phishing website Detector’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/eswarchandt/phishing-website-detector on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Description

    The data set is provided as both a text file and a CSV file, supplying the following resources that can be used as inputs for model building:

    1. A collection of website URLs for 11,000+ websites. Each sample has 30 website parameters and a class label identifying it as a phishing website or not (1 or -1).

    2. A code template containing these code blocks:
      a. Import modules (Part 1)
      b. Load data function + input/output field descriptions

    The data set also serves as an input for project scoping, helping to specify the project's functional and non-functional requirements.

    Background of Problem Statement :

    You are expected to write the code for a binary classification model (phishing website or not) using Python Scikit-Learn that trains on the data and calculates the accuracy score on the test data. You have to use one or more of the classification algorithms to train a model on the phishing website data set.

    Dataset Description:

    1. The “.txt” file has no headers and contains only the column values.
    2. If you are using the “.txt” file, you can add the header manually from the list below; the “.csv” file already includes the column names.
    3. The header list (column names) is as follows: [ 'UsingIP', 'LongURL', 'ShortURL', 'Symbol@', 'Redirecting//', 'PrefixSuffix-', 'SubDomains', 'HTTPS', 'DomainRegLen', 'Favicon', 'NonStdPort', 'HTTPSDomainURL', 'RequestURL', 'AnchorURL', 'LinksInScriptTags', 'ServerFormHandler', 'InfoEmail', 'AbnormalURL', 'WebsiteForwarding', 'StatusBarCust', 'DisableRightClick', 'UsingPopupWindow', 'IframeRedirection', 'AgeofDomain', 'DNSRecording', 'WebsiteTraffic', 'PageRank', 'GoogleIndex', 'LinksPointingToPage', 'StatsReport', 'class' ]

    Brief Description of the features in the data set

    • UsingIP (categorical – signed numeric): { -1, 1 }
    • LongURL (categorical – signed numeric): { 1, 0, -1 }
    • ShortURL (categorical – signed numeric): { 1, -1 }
    • Symbol@ (categorical – signed numeric): { 1, -1 }
    • Redirecting// (categorical – signed numeric): { -1, 1 }
    • PrefixSuffix- (categorical – signed numeric): { -1, 1 }
    • SubDomains (categorical – signed numeric): { -1, 0, 1 }
    • HTTPS (categorical – signed numeric): { -1, 1, 0 }
    • DomainRegLen (categorical – signed numeric): { -1, 1 }
    • Favicon (categorical – signed numeric): { 1, -1 }
    • NonStdPort (categorical – signed numeric): { 1, -1 }
    • HTTPSDomainURL (categorical – signed numeric): { -1, 1 }
    • RequestURL (categorical – signed numeric): { 1, -1 }
    • AnchorURL (categorical – signed numeric):
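The assignment above calls for a scikit-learn classifier. As a library-free sketch of the evaluation loop it describes (shuffle/split the data, predict, score accuracy), with a majority-class baseline standing in for a real model — all helper names here are illustrative:

```python
import random

def train_test_split(rows, labels, test_frac=0.2, seed=0):
    """Shuffle indices and split rows/labels into train and test portions."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_class(train_labels):
    # Trivial baseline: always predict the most frequent class (1 or -1).
    # Any real classifier trained on the 30 features should beat this.
    return max(set(train_labels), key=train_labels.count)
```

Swapping `majority_class` for a fitted scikit-learn estimator (e.g. a decision tree or logistic regression, as the problem statement suggests) keeps the same split-and-score structure.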

    --- Original source retains full ownership of the source dataset ---

  10. DATS 6401 - Final Project - Yon ho Cheong.zip

    • figshare.com
    zip
    Updated Dec 15, 2018
    Cite
    Yon ho Cheong (2018). DATS 6401 - Final Project - Yon ho Cheong.zip [Dataset]. http://doi.org/10.6084/m9.figshare.7471007.v1
    Explore at:
    zip – available download formats
    Dataset updated
    Dec 15, 2018
    Dataset provided by
    figshare
    Authors
    Yon ho Cheong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied to international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and foreign applicant and employer trends for H1B visa applications. Since locations, employers, job titles and salary range make up most of the H1B petitions, different visualization tools are used to analyze and interpret H1B visa trends and provide a recommendation to the applicant. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.

    Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

    Dataset

    The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns in the dataset include case status, employer name, SOC name, job title, full time position, prevailing wage, year, worksite, and latitude and longitude information.

    Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
    Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

    Running the code

    Open Index.html

    Data Processing

    • Do some data preprocessing to transform the raw data into an understandable format.
    • Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
    • Develop the variables needed for appropriate visualizations and compile them into the visualization programs.
    • Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
    • Extract some aspects and analyze the changes in employers' preferences as well as forecasts for future trends.

    Visualizations

    • Combo chart: shows the overall volume of receipts and the approval rate.
    • Scatter plot: shows the beneficiary country of birth.
    • Geo map: shows all states of H1B petitions filed.
    • Line chart: shows the top 10 states of H1B petitions filed.
    • Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
    • Tree map: shows the overall top employers who submit the greatest number of applications.
    • Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
    • Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
    • Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.

    Related Research

    • The H-1B Visa Debate, Explained – Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
    • Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
    • Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
    • H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
    • H-1B visa – Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa

    Key Findings

    • From the analysis, the government cut down the number of H1B approvals in 2017.
    • In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
    • Technical jobs such as Computer Systems Analyst and Software Developer fill up the majority of the top 10 jobs among foreign workers.
    • Employers located in metro areas strive to find a foreign workforce who can fill the technical positions in their organizations.
    • States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and provide many job opportunities.
    • Top companies such as Infosys, Tata, and IBM India that submit the most H1B visa applications are companies based in India associated with software and IT services.
    • The Data Scientist position has experienced exponential growth in terms of H1B visa applications, and these jobs are clustered most heavily in the West region.

    Visualization utilizing programs

    HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau

  11. AI-Generated Computer Build Reviews (indoneisan)

    • kaggle.com
    zip
    Updated Aug 31, 2024
    yaemico (2024). AI-Generated Computer Build Reviews (indoneisan) [Dataset]. https://www.kaggle.com/datasets/yaemico/ai-generated-computer-build-reviews-indoneisan
    Explore at:
    zip (0 bytes)
    Available download formats
    Dataset updated
    Aug 31, 2024
    Authors
    yaemico
    Description

    Roast-PC Dataset: AI-Generated PC Build Reviews

    Description:

    This dataset is sourced from the "Roast-PC by Gemini" website, a platform that provides AI-powered roasting (critical feedback) on custom PC builds. Users input the components of their PC build, including CPU, GPU, motherboard, RAM, PSU, disk, and intended use case. The dataset captures the logs of these submissions, along with the roasting comments generated by Gemini AI, Google's AI model.

    Dataset Overview:

    • Number of Columns: 9
    • Number of Rows: 1285

    Column Names and Descriptions:

    1. Time: Date and Time of request.
    2. cpu: The CPU model specified by the user (e.g., "AMD Ryzen 5 5500", "Intel i7 1200K").
    3. gpu: The GPU model specified by the user (e.g., "NVIDIA RTX 3080", "AMD Radeon RX 6800").
    4. motherboard: The motherboard model specified by the user (e.g., "ASUS ROG Strix B550-F", "MSI B450 TOMAHAWK").
    5. ram: The RAM configuration specified by the user, including size and speed (e.g., "16GB DDR4 3200MHz").
    6. psu: The PSU (Power Supply Unit) model specified by the user, including wattage (e.g., "Corsair RM750x 750W").
    7. disk: The storage devices specified by the user, including type and capacity (e.g., "1TB NVMe SSD", "500GB SATA HDD").
    8. use_case: The intended use of the PC as specified by the user (e.g., "gaming", "video editing", "general use").
    9. roast_comments: The AI-generated feedback or roasting comments provided by Gemini AI, critiquing the PC build based on the components and use case (in Indonesian).

    Functionality:

    This dataset serves multiple purposes:

    • Component Analysis: Allows for analysis of popular PC component choices and configurations.
    • AI Feedback Insights: Provides insights into how AI evaluates and critiques different PC builds.
    • Data Mining: Can be used for exploring trends in PC building preferences, identifying common mistakes, and understanding user behavior in custom PC setups.
    • Machine Learning Applications: Useful for training models in natural language processing (NLP), particularly in generating or understanding feedback for hardware configurations.

    This dataset is ideal for those interested in PC building, hardware analysis, AI-generated content, or anyone curious about trends in custom PC configurations.
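For instance, the component-analysis use case might start like this in pandas (the rows are invented to match the schema above):

```python
import pandas as pd

# Illustrative rows mimicking the dataset's schema (values are made up).
df = pd.DataFrame({
    "cpu": ["AMD Ryzen 5 5500", "Intel i7 12700K", "AMD Ryzen 5 5500"],
    "gpu": ["NVIDIA RTX 3080", "AMD Radeon RX 6800", "NVIDIA RTX 3060"],
    "use_case": ["gaming", "video editing", "gaming"],
})

# Component analysis: the most common CPU choice per use case.
popular = df.groupby("use_case")["cpu"].agg(lambda s: s.mode().iloc[0])
print(popular["gaming"])  # AMD Ryzen 5 5500
```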

  12. ‘🗳 Pollster Ratings’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘🗳 Pollster Ratings’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-pollster-ratings-3cf7/4aa1e9a4/?iid=017-459&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘🗳 Pollster Ratings’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/pollster-ratingse on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    This dataset contains the data behind FiveThirtyEight's pollster ratings.

    • pollster-stats-full contains a spreadsheet with all of the summary data and calculations involved in determining the pollster ratings, as well as descriptions for each column.
    • pollster-ratings has ratings and calculations for each pollster. A copy of this data and descriptions for each column can also be found in pollster-stats-full.
    • raw-polls contains all of the polls analyzed to give each pollster a grade.

    Source: https://github.com/fivethirtyeight/data

    License: The data is available under the Creative Commons Attribution 4.0 International License. If you find it useful, please let us know.

    Updated: Pollster-ratings and raw-polls synced from source weekly.

    This dataset was created by FiveThirtyEight and contains around 10,000 samples along with Cand2 Id, Pollster, technical information, and other features such as Samplesize, Partisan, and more.

    How to use this dataset

    • Analyze Cand2 Party in relation to Race Id
    • Study the influence of Margin Poll on Cand1 Actual
    • More datasets
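Analyses like those suggested above can be sketched with pandas; the column names below are guesses based on the feature list in the description, and the sample values are invented:

```python
import pandas as pd

# Invented sample mimicking raw-polls style columns (names are guesses).
polls = pd.DataFrame({
    "pollster":     ["A", "A", "B", "B"],
    "margin_poll":  [4.0, 2.0, -1.0, 3.0],
    "cand1_actual": [3.0, 1.0, -2.0, 5.0],
})

# Poll error: how far the polled margin was from the actual result,
# averaged per pollster -- the core idea behind grading pollsters.
polls["error"] = (polls["margin_poll"] - polls["cand1_actual"]).abs()
mean_error = polls.groupby("pollster")["error"].mean()
print(mean_error["A"])  # 1.0
```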

    Acknowledgements

    If you use this dataset in your research, please credit FiveThirtyEight

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  13. Criteo Display Advertising Challenge Dataset

    • paperswithcode.com
    Updated Apr 29, 2019
    Criteo Display Advertising Challenge Dataset [Dataset]. https://paperswithcode.com/dataset/criteo-display-advertising-challenge
    Explore at:
    Dataset updated
    Apr 29, 2019
    Description

    This dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction. It has been used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/

    ===================================================

    Full description:

    This dataset contains two files, train.txt and test.txt, corresponding to the training and test parts of the data.

    ====================================================

    Dataset construction:

    The training dataset consists of a portion of Criteo's traffic over a period of 7 days. Each row corresponds to a display ad served by Criteo, and the first column indicates whether the ad was clicked or not. Both the positive (clicked) and negative (non-clicked) examples have been subsampled (at different rates) in order to reduce the dataset size.

    There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes. The semantics of these features are undisclosed. Some features may have missing values.

    The rows are chronologically ordered.

    The test set is computed in the same way as the training set but it corresponds to events on the day following the training period. The first column (label) has been removed.

    ====================================================

    Format:

    The columns are tab-separated with the following schema:

    When a value is missing, the field is just empty. There is no label field in the test set.
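Based on the construction described above (a label column, 13 integer features, 26 hashed categorical features, tab-separated, empty fields for missing values), a single line of train.txt might be parsed like this; the sample line is invented:

```python
# Invented sample line: label, 13 integer features (some empty/missing),
# then 26 hex-hashed categorical features.
sample = "\t".join(["1"] + ["5", "", "12"] + [""] * 10 + ["68fd1e64"] * 26)

fields = sample.split("\t")
label = int(fields[0])
ints = [int(v) if v else None for v in fields[1:14]]   # I1..I13
cats = [v or None for v in fields[14:40]]              # C1..C26

print(label, ints[0], ints[1])  # 1 5 None
```

Empty fields map to None, matching the "field is just empty" convention; for the test set, the label column is simply absent.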

    ====================================================

    Dataset assembled by Olivier Chapelle (o.chapelle@criteo.com)

  14. Coughs: ESC-50 and FSDKaggle2018

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 27, 2021
    Michelle Hernandez (2021). Coughs: ESC-50 and FSDKaggle2018 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5136591
    Explore at:
    Dataset updated
    Jul 27, 2021
    Dataset provided by
    Michelle Hernandez
    Jinyi Qiu
    Edgar Lobaton
    Alper Bozkurt
    Mahmoud Abdelkhalek
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset consists of timestamps for coughs contained in files extracted from the ESC-50 and FSDKaggle2018 datasets.

    Citation

    This dataset was generated and used in our paper:

    Mahmoud Abdelkhalek, Jinyi Qiu, Michelle Hernandez, Alper Bozkurt, Edgar Lobaton, “Investigating the Relationship between Cough Detection and Sampling Frequency for Wearable Devices,” in the 43rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021.

    Please cite this paper if you use the timestamps.csv file in your work.

    Generation

    The cough timestamps given in the timestamps.csv file were generated using the cough templates given in figures 3 and 4 in the paper:

    A. H. Morice, G. A. Fontana, M. G. Belvisi, S. S. Birring, K. F. Chung, P. V. Dicpinigaitis, J. A. Kastelik, L. P. McGarvey, J. A. Smith, M. Tatar, J. Widdicombe, "ERS guidelines on the assessment of cough", European Respiratory Journal 2007 29: 1256-1276; DOI: 10.1183/09031936.00101006

    More precisely, 40 files labelled as "coughing" in the ESC-50 dataset and 273 files labelled as "Cough" in the FSDKaggle2018 dataset were manually searched using Audacity for segments of audio that closely matched the aforementioned templates, both visually and auditorily. Some files did not contain any coughs at all, while other files contained several coughs. Therefore, only the files that contained at least one cough are included in the coughs directory. In total, the timestamps of 768 cough segments with lengths ranging from 0.2 seconds to 0.9 seconds were extracted.

    Description

    The audio files are presented in wav format in the coughs directory. Files named in the general format of "*-*-*-24.wav" were extracted from the ESC-50 dataset, while all other files were extracted from the FSDKaggle2018 dataset.

    The timestamps.csv file contains the timestamps for the coughs and it consists of four columns:

    file_name,cough_number,start_time,end_time

    Files in the file_name column can be found in the coughs directory. cough_number refers to the index of the cough in the corresponding file. For example, if the file X.wav contains 5 coughs, then X.wav will be repeated 5 times under the file_name column, and for each row, the cough_number will range from 1 to 5. start_time refers to the starting time of a cough segment measured in seconds, while end_time refers to the end time of a cough segment measured in seconds.
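A short sketch of working with timestamps.csv as described, counting coughs per file and measuring segment durations; the rows below are invented stand-ins:

```python
import csv
import io

# Invented rows in the four-column timestamps.csv format described above.
text = """file_name,cough_number,start_time,end_time
X.wav,1,0.50,0.95
X.wav,2,1.40,2.10
Y.wav,1,0.00,0.60
"""

rows = list(csv.DictReader(io.StringIO(text)))
# Segment durations in seconds (the paper reports 0.2 s to 0.9 s).
durations = [float(r["end_time"]) - float(r["start_time"]) for r in rows]
coughs_in_x = sum(1 for r in rows if r["file_name"] == "X.wav")
print(coughs_in_x, round(max(durations), 2))  # 2 0.7
```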

    Licensing

    The ESC-50 dataset as a whole is licensed under the Creative Commons Attribution-NonCommercial license. Individual files in the ESC-50 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see LICENSE. The ESC-50 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the ESC-50 dataset.

    The FSDKaggle2018 dataset as a whole is licensed under the Creative Commons Attribution 4.0 International license. Individual files in the FSDKaggle2018 dataset are licensed under different Creative Commons licenses. For a list of these licenses, see the License section in FSDKaggle2018. The FSDKaggle2018 files in the cough directory are given for convenience only, and have not been modified from their original versions. To download the original files, see the FSDKaggle2018 dataset.

    The timestamps.csv file is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license.

  15. Job Postings Dataset

    • kaggle.com
    Updated Feb 3, 2024
    + more versions
    Moyukh Biswas (2024). Job Postings Dataset [Dataset]. https://www.kaggle.com/datasets/moyukhbiswas/job-postings-dataset/versions/1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 3, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Moyukh Biswas
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Job Postings Dataset

    This dataset offers an extensive assortment of job postings, designed to support investigations and examinations within the realms of job market patterns, natural language processing (NLP), and machine learning. Developed for educational and research objectives, this dataset presents a varied array of job advertisements spanning diverse industries and job categories.

    Description of each column:

    Category: The category of the job.
    Workplace: Whether the job is remote, on-site, or hybrid.
    Location: Location of the job posting.
    Department: The department for which the job has been posted.
    Type: Whether the job is full-time, part-time, or contractual in nature.

    Potential use cases:

    1. Optimizing workforce planning and talent acquisition strategies.
    2. Developing NLP models for resume parsing and job matching.
    3. Building predictive models to forecast job market trends.
    4. Exploring salary prediction models for various job roles.
    5. Analyzing regional job market disparities and opportunities.
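A hedged sketch of the workforce-planning use case, using the five columns described above with invented rows:

```python
import pandas as pd

# Invented rows following the column description above.
jobs = pd.DataFrame({
    "Category":   ["Engineering", "Marketing", "Engineering"],
    "Workplace":  ["Remote", "On-site", "Hybrid"],
    "Location":   ["Berlin", "London", "Berlin"],
    "Department": ["Platform", "Growth", "Platform"],
    "Type":       ["Full-time", "Part-time", "Full-time"],
})

# Workforce-planning style cut: openings per category and the remote share.
per_category = jobs["Category"].value_counts()
remote_share = (jobs["Workplace"] == "Remote").mean()
print(per_category["Engineering"], round(remote_share, 2))  # 2 0.33
```
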
  16. Young's Modulus of Metals

    • kaggle.com
    Updated Dec 31, 2024
    + more versions
    Kanchana1990 (2024). Young's Modulus of Metals [Dataset]. http://doi.org/10.34740/kaggle/dsv/10344425
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset provides the Young's Modulus values (in GPa) for 50 metals, covering a wide range of categories such as alkali metals, alkaline earth metals, transition metals, and rare earth elements. Young's Modulus is a fundamental mechanical property that measures a material's stiffness under tensile or compressive stress. It is critical for applications in materials science, physics, and engineering.

    The dataset includes:
    • 50 metals with their chemical symbols and Young's Modulus values.
    • A wide range of stiffness values, from soft metals like cesium (1.7 GPa) to very stiff metals like ruthenium (447 GPa).
    • Clean and complete data with no missing or duplicate entries.

    Data Science Applications

    This dataset can be utilized in various data science and engineering applications, such as:
    1. Material Property Prediction: Train machine learning models to predict mechanical properties based on elemental features.
    2. Cluster Analysis: Group metals based on their mechanical properties or periodic trends.
    3. Correlation Studies: Explore relationships between Young's Modulus and other physical/chemical properties (e.g., density, atomic radius).
    4. Engineering Simulations: Use the data for simulations in structural analysis or material selection for design purposes.
    5. Visualization and Education: Create visualizations to teach periodic trends and material property variations.

    Column Descriptors

    Metal: Name of the metal (e.g., Lithium, Beryllium).
    Symbol: Chemical symbol of the metal (e.g., Li, Be).
    Young's Modulus (GPa): Young's Modulus value in gigapascals (GPa), indicating stiffness under stress.

    Ethically Mined Data

    The dataset was ethically sourced from publicly available scientific references and academic resources. The data was verified for accuracy using multiple authoritative sources, ensuring reliability for research and educational purposes. No proprietary or sensitive information was included.

    Key checks performed:
    • No missing values: The dataset contains complete entries for all 50 metals.
    • No duplicates: Each metal appears only once in the dataset.
    • Statistical analysis: The mean Young's Modulus is ~98.93 GPa, with a wide range from 1.7 GPa to 447 GPa.
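The key checks can be reproduced in a few lines of pandas once the CSV is loaded; here the frame is stubbed with the two extremes quoted above plus iron (its ~211 GPa value is included only for illustration; the real file has 50 rows):

```python
import pandas as pd

# Stub mimicking the dataset's three columns (the real file has 50 rows).
df = pd.DataFrame({
    "Metal": ["Cesium", "Iron", "Ruthenium"],
    "Symbol": ["Cs", "Fe", "Ru"],
    "Young's Modulus (GPa)": [1.7, 211.0, 447.0],
})

col = "Young's Modulus (GPa)"
assert df[col].isna().sum() == 0           # no missing values
assert not df["Metal"].duplicated().any()  # no duplicate metals
print(df[col].min(), df[col].max())        # 1.7 447.0
```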

    Acknowledgments

    We would like to thank the following sources for their contributions to this dataset:
    • Academic references such as WebElements, Byju's Chemistry Resources, and Wikipedia for cross-verifying the data.
    • Scientific databases like MatWeb and ASM International for providing accurate material property data.
    • Special thanks to DALL·E 3 for generating the accompanying dataset image.

  17. Agriculture dataset | Karnataka

    • kaggle.com
    • data.mendeley.com
    Updated Dec 16, 2024
    Mohamadreza Momeni (2024). Agriculture dataset | Karnataka [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/agriculture-dataset-karnataka
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Mohamadreza Momeni
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Karnataka
    Description

    Data Description

    The dataset captures agricultural data for Karnataka, specifically focusing on crop yields in Mangalore. Key features include the year of production, geographic details, and environmental conditions such as rainfall (measured in mm), temperature (in degrees Celsius), and humidity (as a percentage). Soil type, irrigation method, and crop type are also recorded, along with crop yields, market price, and season of growth (e.g., Kharif).

    The dataset includes several columns related to crop production conditions and outcomes. For example, coconut crop data reveals a pattern of yields over different area sizes, showing how factors like rainfall, temperature, and irrigation influence production. Prices also vary, offering insights into the economic aspects of agriculture in the region. This information could be used to study the impact of environmental conditions and farming techniques on crop productivity, assisting in the development of optimized agricultural practices tailored for specific soil types, climates, and crop needs.

    Column Description

    yield: yield typically refers to the amount of crop produced per unit area of land

    In season column:

    Kharif Season: This is the monsoon crop season, where crops are sown at the beginning of the monsoon season (around June) and harvested at the end of the monsoon season (around October). Examples of Kharif crops include rice, maize, and pulses.

    Rabi Season: This is the winter crop season, where crops are sown after the monsoon season (around November) and harvested in the spring (around April). Examples of Rabi crops include wheat, barley, and mustard.

    Zaid Season: This is the summer crop season, which falls between the Kharif and Rabi seasons (around March to June). Zaid crops are usually short-duration crops and include vegetables, watermelons, and cucumbers.
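The three seasons can be turned into a small lookup table for feature engineering; the months are taken from the description above (Zaid's sowing and harvest months are approximate):

```python
# Sowing/harvest months per crop season, as described above
# (Zaid runs roughly March to June, so its months are approximate).
seasons = {
    "Kharif": {"sown": "June", "harvested": "October"},
    "Rabi": {"sown": "November", "harvested": "April"},
    "Zaid": {"sown": "March", "harvested": "June"},
}

def season_of(crop_season: str) -> dict:
    """Return sowing/harvest months for a season name from the dataset."""
    return seasons[crop_season.capitalize()]

print(season_of("kharif")["harvested"])  # October
```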

    Authors

    rajesh naik


  18. Black Friday Sales EDA

    • kaggle.com
    Updated Oct 29, 2022
    + more versions
    Rushikesh Konapure (2022). Black Friday Sales EDA [Dataset]. https://www.kaggle.com/datasets/rishikeshkonapure/black-friday-sales-eda
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 29, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Rushikesh Konapure
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset History

    A retail company “ABC Private Limited” wants to understand the customer purchase behaviour (specifically, purchase amount) against various products of different categories. They have shared purchase summaries of various customers for selected high-volume products from last month. The data set also contains customer demographics (age, gender, marital status, city type, stay in the current city), product details (productid and product category) and Total purchase amount from last month.

    Now, they want to build a model to predict the purchase amount of customers against various products which will help them to create a personalized offer for customers against different products.

    Tasks to perform

    The Purchase column is the target variable; perform univariate and bivariate analysis with respect to Purchase.

    "Masked" in the column description means the values have already been converted from categorical to numerical.

    Below mentioned points are just given to get you started with the dataset, not mandatory to follow the same sequence.

    DATA PREPROCESSING

    • Check the basic statistics of the dataset

    • Check for missing values in the data

    • Check for unique values in data

    • Perform EDA

    • Purchase Distribution

    • Check for outliers

    • Analysis by Gender, Marital Status, occupation, occupation vs purchase, purchase by city, purchase by age group, etc

    • Drop unnecessary fields

    • Convert categorical data into integers using the map function (e.g., the 'Gender' column)

    • Missing value treatment

    • Rename columns

    • Fill nan values

    • Map range variables into integers (e.g., the 'Age' column)
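The two map steps in the list above might look like this in pandas; the sample rows are invented, and the Age bins follow the encoding commonly seen in the Black Friday dataset (an assumption):

```python
import pandas as pd

# Invented sample rows; the Gender/Age categories follow the usual
# Black Friday dataset encoding (an assumption, not confirmed here).
df = pd.DataFrame({"Gender": ["F", "M", "M"],
                   "Age": ["0-17", "26-35", "55+"]})

# Convert categorical data to integers with map, as the task list suggests.
df["Gender"] = df["Gender"].map({"F": 0, "M": 1})
age_codes = {"0-17": 0, "18-25": 1, "26-35": 2, "36-45": 3,
             "46-50": 4, "51-55": 5, "55+": 6}
df["Age"] = df["Age"].map(age_codes)

print(df["Gender"].tolist(), df["Age"].tolist())  # [0, 1, 1] [0, 2, 6]
```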

    Data Visualisation

    • visualize individual column
    • Age vs Purchased
    • Occupation vs Purchased
    • Productcategory1 vs Purchased
    • Productcategory2 vs Purchased
    • Productcategory3 vs Purchased
    • City category pie chart
    • check for more possible plots

    All the Best!!

  19. ‘prediction of facebook comment’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Oct 6, 2019
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2019). ‘prediction of facebook comment’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-prediction-of-facebook-comment-1c17/latest
    Explore at:
    Dataset updated
    Oct 6, 2019
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘prediction of facebook comment’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kiranraje/prediction-facebook-comment on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    The dataset is uploaded in ZIP format and contains 5 variants of the data. For details about the variants and a detailed analysis, read and cite the research paper titled "Comment Volume Prediction".

    Content

    This dataset contains 28 columns, including:
    1. Describing popularity or support for the source.
    2. Describing how many people have visited this place so far.
    3. Defining the daily interest of individuals towards the source of the document/post.
    4. Defining the daily interest of individuals towards the source of the document/post.

    --- Original source retains full ownership of the source dataset ---

  20. 18 excel spreadsheets by species and year giving reproduction and growth...

    • catalog.data.gov
    • data.wu.ac.at
    Updated Aug 17, 2024
    + more versions
    U.S. EPA Office of Research and Development (ORD) (2024). 18 excel spreadsheets by species and year giving reproduction and growth data. One excel spreadsheet of herbicide treatment chemistry. [Dataset]. https://catalog.data.gov/dataset/18-excel-spreadsheets-by-species-and-year-giving-reproduction-and-growth-data-one-excel-sp
    Explore at:
    Dataset updated
    Aug 17, 2024
    Dataset provided by
    United States Environmental Protection Agencyhttp://www.epa.gov/
    Description

    Excel spreadsheets by species (the 4-letter code is an abbreviation for the genus and species used in the study; the year, 2010 or 2011, is the year the data were collected; SH indicates data for Science Hub; the date is the date of file preparation). The data in each file are described in a read-me file, which is the first worksheet in the file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read-me description of the columns in the dataset for chemical analysis; in this file, one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk, D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee, and M. Plocher. Plant reproduction is altered by simulated herbicide drift to constructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).

