https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains job postings related to Data Science roles in 2025, collected from publicly available sources. It includes essential details such as job titles, seniority levels, company information, locations, salaries, industries, company size, and required skills. The dataset has been cleaned and structured to ensure accuracy and consistency, with duplicates and irrelevant entries removed.
It is designed to help researchers, students, and professionals analyze hiring trends, salary ranges, and in-demand skills in the Data Science job market. This dataset can also support projects in machine learning, career prediction, salary forecasting, and workforce analytics.

https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides insights into data science job salaries from 2020 to 2025, including information on experience levels, employment types, job titles, and company characteristics. It serves as a valuable resource for understanding salary trends and factors influencing compensation in the data science field.
| Feature | Description |
|---|---|
| work_year | The year of the data related to the job salary. |
| experience_level | The level of experience of the employee (e.g., entry-level, mid-level, senior-level). |
| employment_type | The type of employment (e.g., full-time, part-time, contract). |
| job_title | The title or role of the employee within the data science field. |
| salary | The salary of the employee. |
| salary_currency | The currency in which the salary is denoted. |
| salary_in_usd | The salary converted to US dollars for standardization. |
| employee_residence | The residence location of the employee. |
| remote_ratio | The ratio of remote work allowed for the position. |
| company_location | The location of the company. |
| company_size | The size of the company based on employee count or revenue. |
This data set is made available by ai-jobs.net Salaries. Thank you for aggregating this information!
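As a rough illustration of how the columns in the table above might be used, the sketch below computes the median `salary_in_usd` per `experience_level` with the Python standard library. The sample rows, the category spellings (for example `senior-level`, `full-time`, `medium`) and the `remote_ratio` values are invented for the example and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical rows that reuse the column names from the table above.
# Every value here is invented for illustration only.
SAMPLE_CSV = """work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2023,senior-level,full-time,Data Scientist,150000,USD,150000,US,0,US,medium
2023,senior-level,full-time,Data Engineer,130000,USD,130000,US,100,US,large
2024,mid-level,full-time,Data Analyst,70000,EUR,76000,DE,50,DE,medium
2024,mid-level,full-time,Data Scientist,90000,USD,90000,US,0,US,small
2025,entry-level,full-time,Data Analyst,60000,USD,60000,US,100,US,medium
"""


def median_salary_by_level(csv_text):
    """Return {experience_level: median salary_in_usd} for the given CSV text."""
    by_level = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_level[row["experience_level"]].append(float(row["salary_in_usd"]))
    return {level: median(values) for level, values in by_level.items()}


result = median_salary_by_level(SAMPLE_CSV)
for level, value in sorted(result.items()):
    print(f"{level}: {value:,.0f} USD")
```

Grouping on `salary_in_usd` rather than `salary` keeps the comparison in a single currency, which is the purpose the table gives for that column.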

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stack Exchange is a network of question-and-answer websites on topics in diverse fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating.
The dataset here covers one site in that network, Data Science Stack Exchange, and is distributed over multiple files. It contains the site's posts on data science, which can be used for language processing, along with data on which posts users liked most. It supports a wide range of analyses.

Description: This dataset contains detailed information about videos from various YouTube channels that specialize in data science and analytics. It includes metrics such as views, likes, comments, and publication dates. The dataset consists of 22,862 rows, providing a robust sample for analyzing trends in content engagement, popularity of topics over time, and comparison of channels' performance.
Column Descriptors:
Channel_Name: The name of the YouTube channel.
Title: The title of the video.
Published_date: The date when the video was published.
Views: The number of views the video has received.
Like_count: The number of likes the video has received.
Comment_Count: The number of comments on the video.
This dataset contains information from the following YouTube channels:
['sentdex', 'freeCodeCamp.org', 'CampusX', 'Darshil Parmar', 'Keith Galli', 'Alex The Analyst', 'Socratica', 'Krish Naik', 'StatQuest with Josh Starmer', 'Nicholas Renotte', 'Leila Gharani', 'Rob Mulla', 'Ryan Nolan Data', 'techTFQ', 'Dataquest', 'WsCube Tech', 'Chandoo', 'Luke Barousse', 'Andrej Karpathy', 'Thu Vu data analytics', 'Guy in a Cube', 'Tableau Tim', 'codebasics', 'DeepLearningAI', 'Rishabh Mishra', 'ExcelIsFun', 'Kevin Stratvert', 'Ken Jee', 'Kaggle', 'Tina Huang']
This dataset can be used for various analyses, including but not limited to:
Identifying the most popular videos and channels in the data science field.
Understanding viewer engagement trends over time.
Comparing the performance of different types of content across multiple channels.
Performing a comparison between different channels to find the best-performing ones.
Identifying the best videos to watch for specific topics in data science and analytics.
Conducting a detailed analysis of your favorite YouTube channel to understand its content strategy and performance.
Note: The data is current as of the date of extraction and may not reflect real-time changes on YouTube. For any analysis, take into account the date when the data was last updated to maintain accuracy and relevance.
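One way to compare channels, as suggested in the list of analyses above, is an engagement rate: likes plus comments divided by views, aggregated per channel. The sketch below is only an illustration. The column names follow the descriptors above, while the channel names, video titles and numbers are invented, and the engagement formula is an assumption rather than a metric defined by the dataset.

```python
from collections import defaultdict

# Invented rows shaped like the columns described above (not real data).
videos = [
    {"Channel_Name": "Channel A", "Title": "Intro video", "Published_date": "2023-01-10",
     "Views": 1000, "Like_count": 50, "Comment_Count": 10},
    {"Channel_Name": "Channel A", "Title": "Follow-up video", "Published_date": "2023-02-14",
     "Views": 3000, "Like_count": 100, "Comment_Count": 40},
    {"Channel_Name": "Channel B", "Title": "Tutorial", "Published_date": "2023-03-01",
     "Views": 2000, "Like_count": 180, "Comment_Count": 20},
]


def channel_engagement(rows):
    """Return {channel: (likes + comments) / views}, skipping channels with no views."""
    totals = defaultdict(lambda: {"views": 0, "interactions": 0})
    for row in rows:
        entry = totals[row["Channel_Name"]]
        entry["views"] += row["Views"]
        entry["interactions"] += row["Like_count"] + row["Comment_Count"]
    return {
        name: entry["interactions"] / entry["views"]
        for name, entry in totals.items()
        if entry["views"] > 0
    }


rates = channel_engagement(videos)
# Rank channels from highest to lowest engagement rate.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for name, rate in ranked:
    print(f"{name}: {rate:.1%}")
```

Summing views and interactions before dividing weights each channel by its total audience, so a single low-view video cannot dominate the channel's rate.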

HabibAhmed/Data-Science-Instruct-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Divyang Mandal
Released under CC0: Public Domain

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MODIS Water Lake Powell Toy Dataset
Dataset Summary
Tabular dataset comprised of MODIS surface reflectance bands along with calculated indices and a label (water/not-water)
Dataset Structure
Data Fields
water: Label, water or not-water (binary)
sur_refl_b01_1: MODIS surface reflection band 1 (-100, 16000)
sur_refl_b02_1: MODIS surface reflection band 2 (-100, 16000)
sur_refl_b03_1: MODIS surface reflection band 3 (-100, 16000)
sur_refl_b04_1: MODIS…
See the full description on the dataset page: https://huggingface.co/datasets/nasa-cisto-data-science-group/modis-lake-powell-toy-dataset.
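The summary mentions calculated indices without listing them in this excerpt. As a hedged illustration of what such an index looks like, the sketch below computes NDVI, a common normalized-difference index, from `sur_refl_b01_1` (MODIS band 1, red) and `sur_refl_b02_1` (MODIS band 2, near infrared). Whether the dataset includes NDVI under any particular column name is an assumption, and the sample reflectance values are invented.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denominator = a + b
    if denominator == 0:
        return None
    return (a - b) / denominator


def ndvi(row):
    """NDVI = (NIR - red) / (NIR + red), using MODIS band 2 (NIR) and band 1 (red)."""
    return normalized_difference(row["sur_refl_b02_1"], row["sur_refl_b01_1"])


# Invented sample rows; real values fall in the (-100, 16000) range noted above.
water_like = {"water": 1, "sur_refl_b01_1": 600, "sur_refl_b02_1": 200}
land_like = {"water": 0, "sur_refl_b01_1": 500, "sur_refl_b02_1": 1500}

print(ndvi(water_like))  # negative: NIR reflectance is lower than red
print(ndvi(land_like))   # positive: NIR reflectance is higher than red
```

Open water usually reflects less near-infrared than red light, which is why a normalized difference of these two bands can help separate water from not-water pixels.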

https://creativecommons.org/publicdomain/zero/1.0/
Scraped Data on AI, ML, DS & Big Data Jobs is a comprehensive dataset that includes valuable information about job opportunities in the fields of Artificial Intelligence (AI), Machine Learning (ML), Data Science (DS), and Big Data. The dataset covers various aspects, including company names, job titles, locations, job types (full-time, part-time, remote), experience levels, salary ranges, job requirements, and available facilities.
This dataset offers a wealth of insights for job seekers, researchers, and organizations interested in the rapidly evolving fields of AI, ML, DS, and Big Data. By analyzing the data, users can gain a better understanding of the job market trends, geographical distribution of opportunities, popular job titles, required skills and qualifications, salary expectations, and the types of facilities provided by companies in these domains.
Whether you are exploring career prospects, conducting market research, or building predictive models, this dataset serves as a valuable resource to extract meaningful insights and make informed decisions in the exciting world of AI, ML, DS, and Big Data jobs.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The city of Austin administered a community survey in 2015, 2016, 2017, 2018 and 2019 (https://data.austintexas.gov/City-Government/Community-Survey/s2py-ceb7) to “assess satisfaction with the delivery of the major City Services and to help determine priorities for the community as part of the City’s ongoing planning process.” To access this dataset directly from the city of Austin’s website, follow this link: https://cutt.ly/VNqq5Kd. Although we downloaded the dataset analyzed in this study from the former link, the city of Austin is interested in continuing to administer this survey, so the data we used for this analysis and the data hosted on the city of Austin’s website may differ in the coming years. Accordingly, to ensure that our findings can be replicated, we recommend that researchers download and analyze the dataset we employed in our analyses, which can be accessed at https://github.com/democratizing-data-science/MDCOR/blob/main/Community_Survey.csv.
Replication Features or Variables
The community survey data has 10,684 rows and 251 columns. Of these columns, our analyses rely on the following three indicators, taken verbatim from the survey: “ID”, “Q25 - If there was one thing you could share with the Mayor regarding the City of Austin (any comment, suggestion, etc.), what would it be?”, and “Do you own or rent your home?”
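A minimal sketch of narrowing the 251 columns down to the three indicators named above, using only the standard library. The header strings are copied from the description and are assumed to match the CSV headers character for character, which should be checked against the actual file. The two data rows and the extra column are invented stand-ins.

```python
import csv
import io

ID_COL = "ID"
COMMENT_COL = (
    "Q25 - If there was one thing you could share with the Mayor "
    "regarding the City of Austin (any comment, suggestion, etc.), "
    "what would it be?"
)
TENURE_COL = "Do you own or rent your home?"
KEEP = [ID_COL, COMMENT_COL, TENURE_COL]


def select_columns(csv_file, columns):
    """Read a CSV file object and keep only the requested columns."""
    reader = csv.DictReader(csv_file)
    return [{column: row[column] for column in columns} for row in reader]


# Tiny in-memory stand-in for Community_Survey.csv (invented rows).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(KEEP + ["Some other survey column"])
writer.writerow(["1", "More bike lanes, please.", "Rent", "x"])
writer.writerow(["2", "", "Own", "y"])
buffer.seek(0)

rows = select_columns(buffer, KEEP)
# Keep only respondents who actually wrote a comment for Q25.
with_comment = [row for row in rows if row[COMMENT_COL].strip()]
print(len(rows), len(with_comment))
```

For the real file, the same function could be called on `open("Community_Survey.csv", newline="", encoding="utf-8")`, where the encoding is also an assumption.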

https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for Data Science Job Salaries
Dataset Summary
Content
| Column | Description |
|---|---|
| work_year | The year the salary was paid. |
| experience_level | The experience level in the job during the year, with the following possible values: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director). |
| employment_type | The type of employment for the role: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance). |

job_title… See the full description on the dataset page: https://huggingface.co/datasets/hugginglearners/data-science-job-salaries.
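The two-letter codes listed above are compact but hard to read in charts and reports. The sketch below shows one way to expand them into their labels. The mappings come directly from the column descriptions, while the function name and the sample row are illustrative assumptions.

```python
# Code-to-label mappings taken from the column descriptions above.
EXPERIENCE_LEVELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_TYPES = {
    "PT": "Part-time",
    "FT": "Full-time",
    "CT": "Contract",
    "FL": "Freelance",
}


def decode_row(row):
    """Return a copy of the row with coded columns replaced by readable labels.

    Unknown codes are left unchanged instead of raising an error.
    """
    decoded = dict(row)
    level = row["experience_level"]
    kind = row["employment_type"]
    decoded["experience_level"] = EXPERIENCE_LEVELS.get(level, level)
    decoded["employment_type"] = EMPLOYMENT_TYPES.get(kind, kind)
    return decoded


# Invented sample row for illustration.
sample = {"work_year": 2022, "experience_level": "SE", "employment_type": "FT"}
print(decode_row(sample))
```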

Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Data science is a rapidly growing field in the tech industry, and LinkedIn is a popular platform for finding job opportunities in this domain.
This dataset provides valuable insights into data science job postings, including the required skills and software proficiency sought by employers.

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is used to perform exploratory data analysis (EDA) on a simple, small dataset, because it helps to solve smaller problems first.

https://creativecommons.org/publicdomain/zero/1.0/
This dataset is taken from Google Trends. It shows the trend of the search term "Data Science" on Google Search and YouTube from 2004 to April 2022. There will be an update soon.

This dataset was created by Syed Siddique Mridul

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The AI/ML industry is rapidly evolving, and companies worldwide are actively hiring Data Scientists, ML Engineers, AI Researchers, and Quant Analysts. This dataset provides 2,000+ synthetic but realistic job listings that capture important details like:
Company information
Industry domain
Job titles & experience levels
Required skills & tools
Salary ranges (USD)
Location & employment type
Posting dates (2023–2025)
This dataset is designed to help researchers, students, and practitioners analyze trends in the AI job market and build real-world projects such as salary prediction, skill-demand analysis, and workforce analytics.

Data scientist is the sexiest job in the world. How many times have you heard that? The Analytics India Annual Salary Study, which aims to understand a wide range of trends in data science, says that the median analytics salary in India for the year 2017 is INR 12.7 Lakhs across all experience levels and skill sets. So, given the job description and other key information, can you predict the salary range of a job posting? What kinds of factors influence the salary of a data scientist? The study also says that in the world of analytics, Mumbai is the highest paymaster at almost 13.3 Lakhs per annum, followed by Bengaluru at 12.5 Lakhs. The industry a data scientist works in can also influence the salary: the telecom industry pays the highest median salaries to its analytics professionals, at 18.6 Lakhs. What are you waiting for? Solve the problem by predicting how much a data scientist or analytics professional will be paid by analysing the data given. Bonus Tip: You can analyse the data and get key insights for your career as well. The best data scientists and machine learning engineers will be given awesome prizes at the end of the hackathon. Share this hackathon with a colleague who may be interested in mining the dataset for insights and making great predictions.
Data
The dataset is based on salaries and job postings in India from across the internet. The train and test data consist of the attributes listed below. The rows of the train dataset contain rich information about each job posting, such as the name of the designation and the key skills required for the job. The training and test data comprise 19,802 and 6,601 samples, respectively. The dataset was collected over time to gather relevant analytics job postings over the years.
Features
Name of the company (Encoded)
Years of experience
Job description
Job designation
Job Type
Key skills
Location
Salary in Rupees Lakhs (To be predicted)
Problem Statement
Based on the given attributes and salary information, build a robust machine learning model that predicts the salary range of the job posting.
Event Duration: 10 Dec 2018 to 20 Jan 2030
Level: Beginner

Arcade is a collection of natural language to code problems on interactive data science notebooks. Each problem features an NL intent as problem specification, a reference code solution, and preceding notebook context (Markdown or code cells). Arcade can be used to evaluate the accuracy of code large language models in generating data science programs given natural language instructions. Please read our paper for more details.
Note👉 This Kaggle dataset only contains the dataset files of Arcade. Refer to our main Github repository for detailed instructions to use this dataset.
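As a rough sketch of how the problems described above could be read from a dataset file, the code below walks a JSON-serialized list of episodes and yields one record per problem. It assumes the field names from the episode structure documented further down in this entry (`notebook_name`, `turns`, `input`, and `turn` with nested `intent.value` and `code.value`). The sample episode is invented, and the helper name `iter_problems` is hypothetical, not part of the Arcade repository.

```python
import json


def iter_problems(episodes):
    """Yield one flat record per NL-to-code problem across all episodes."""
    for episode in episodes:
        for turn in episode["turns"]:
            yield {
                "notebook_name": episode["notebook_name"],
                "prompt": turn["input"],
                "intent": turn["turn"]["intent"]["value"],
                "reference_code": turn["turn"]["code"]["value"],
            }


# Invented stand-in for the contents of a dataset '*.json' file.
sample_json = json.dumps([
    {
        "notebook_name": "example.ipynb",
        "work_dir": "artifacts/example",
        "annotator": "annotator_0",
        "turns": [
            {
                "input": "# In[ ]:\nimport pandas as pd\n",
                "turn": {
                    "intent": {"value": "Show the first five rows."},
                    "code": {"value": "df.head()"},
                },
            }
        ],
    }
])

episodes = json.loads(sample_json)
problems = list(iter_problems(episodes))
for problem in problems:
    print(problem["notebook_name"], "->", problem["intent"])
```

For a real file, `json.load` on an opened dataset file would replace `json.loads(sample_json)`. The main GitHub repository mentioned above remains the reference for actual usage.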
Below is the structure of its content:
└── ./
├── existing_tasks/ # Problems derived from existing data science notebooks on Github
│ ├── metadata.json # Metadata by `build_existing_tasks_split.py` to reproduce this split.
│ ├── artifacts/ # Folder that stores dependent ML datasets to execute the problems, created by running `build_existing_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
├── new_tasks/
│ ├── dataset.json # Original, unpreprocessed dataset
│ ├── kaggle_dataset_provenance.csv # Metadata of the Kaggle datasets used to build this split.
│ ├── artifacts/ # Folder that stores dependent ML Kaggle datasets to execute the problems, created by running `build_new_tasks_split.py`
│ └── derived_datasets/ # Folder for preprocessed datasets used for prompting experiments.
└── checksums.txt # Table of MD5 checksums of dataset files.
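The tree above lists `checksums.txt` as a table of MD5 checksums. A minimal sketch of checking one downloaded file against an expected digest is shown below. The exact layout of `checksums.txt` is not described here, so the sketch only compares a file with a digest string supplied by the caller. The demo file and helper names are illustrative assumptions.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's MD5 digest with an expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demo on a throwaway file; a real check would point at a dataset file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_digest = md5_of_file(demo_path)
    demo_ok = matches_checksum(demo_path, "5D41402ABC4B2A76B9719D911017C592\n")

print(demo_digest, demo_ok)
```

Reading in chunks keeps memory use flat even for large artifact files.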
All the dataset '*.json' files follow the same structure. Each dataset file is a Json-serialized list of Episodes. Each episode corresponds to a notebook annotated with NL-to-code problems. The structure of an episode is documented below:
{
"notebook_name": "Name of the notebook.",
"work_dir": "Path to the dependent data artifacts (e.g., ML datasets) to execute the notebook.",
"annotator": "Anonymized annotator Id.",
"turns": [
# A list of natural language to code examples using the current notebook context.
{
"input": "Prompt to a code generation model.",
"turn": {
"intent": {
"value": "Annotated NL intent for the current turn.",
"is_cell_intent": "Metadata used for the existing tasks split to indicate if the code solution is only part of an existing code cell.",
"cell_idx": "Index of the intent Markdown cell.",
"line_span": "Line span of the intent.",
"not_sure": "Annotation confidence.",
"output_variables": "List of variable names denoting the output. If None, use the output of the last line of code as the output of the problem.",
},
"code": {
"value": "Reference code solution.",
"cell_idx": "Cell index of the code cell containing the solution.",
"num_lines": "Number of lines in the reference solution.",
"line_span": "Line span.",
},
"code_context": "Context code (all code cells before this problem) that need to be executed before executing the reference/predicted programs.",
"delta_code_context": "Delta context code between the last problem in this notebook and the current problem, useful for incremental execution.",
"metadata": {
"annotator_id": "Annotator Id",
"num_code_lines": "Metadata, please ignore.",
"utterance_without_output_spec": "Annotated NL intent without output specification. Refer to the paper for details.",
},
},
"notebook": "Field intended to store the Json-serialized Jupyter notebook. Not used for now since the notebook can be reconstructed from other metadata in this file.",
"metadata": {
# A dict of metadata of this turn.
"context_cells": [ # A list of context cells before the problem.
{
"cell_type": "code|markdown",
"source": "Cell content."
},
],
"delta_cell_num": "Number of preceding context cells between the prior turn and the current turn.",
# The following fields only occur in datasets inlined with schema descriptions.
"context_cell_num": "Number of context cells in the prompt after inserting schema descriptions and left-truncation.",
"inten...

Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was scraped from Indeed during the summer of 2024, focusing on the search term 'data scientist.' The data encompasses job listings from every state in the USA, including remote positions, providing a comprehensive snapshot of the data science job market during this period.
Working with this dataset involves a variety of skills that can help students gain valuable experience in data analysis, visualization, and interpretation. Some skills that could be practiced using this data:

Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Archis Save
Released under Apache 2.0

Open Database License (ODbL) v1.0 https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the Project -
Pandas
NumPy
Matplotlib
Seaborn
scikit-learn