MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Overview This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.
Dataset Details Columns: Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums. Answer: The corresponding answer to the generated question. Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers. Subtitle: Subtitle or additional information related to the Kaggle competition or topic. Title: Title of the Kaggle competition or topic. Sources and Inspiration
Sources:
Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more. Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers. Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset. Inspiration:
The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users. Dataset Specifics Total Records: [Specify the total number of question-answer pairs in the dataset] Format: CSV (Comma Separated Values) Size: [Specify the size of the dataset in MB or GB] License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0] Download Link: [Provide a link to download the dataset] Acknowledgments We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This Dataset containes the details of the AI, ML, Data Science Salary (2020- 2025). Salary data is in USD and recalculated at its average fx rate during the year for salaries entered in other currencies.
The data is processed and updated on a weekly basis so the rankings may change over time during the year.
Attribute Information
Acknowledgements
Photo by Anastassia Anufrieva on Unsplash
In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process letting LLM themselves generate and annotate the benchmarks with ``human in the loop''. A novel language (i.e., DSEAL) has been proposed and the derived four benchmarks have significantly improved the benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of the current works, providing useful insights and shedding light on future research on LLM-based data science agents.
This is one of DSEval benchmarks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]
Source: Kaggle
--- Original source retains full ownership of the source dataset ---
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebooks versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions
in Meta Kaggle. The file names match the ids in the KernelVersions
csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads
. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Raghav Khandelwal
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘HR Analytics: Job Change of Data Scientists’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists on 28 January 2022.
--- Dataset description provided by original source is as follows ---
A company which is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which conduct by the company. Many people signup for their training. Company wants to know which of these candidates are really wants to work for the company after training or looking for a new employment because it helps to reduce the cost and time as well as the quality of training or planning the courses and categorization of candidates. Information related to demographics, education, experience are in hands from candidates signup and enrollment.
This dataset designed to understand the factors that lead a person to leave current job for HR researches too. By model(s) that uses the current credentials,demographics,experience data you will predict the probability of a candidate to look for a new job or will work for the company, as well as interpreting affected factors on employee decision.
The whole data divided to train and test . Target isn't included in test but the test target values data file is in hands for related tasks. A sample submission correspond to enrollee_id of test set provided too with columns : enrollee _id , target
Note: - The dataset is imbalanced. - Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. - Missing imputation can be a part of your pipeline as well.
#
Features
#
- enrollee_id : Unique ID for candidate
city: City code
city_ development _index : Developement index of the city (scaled)
gender: Gender of candidate
relevent_experience: Relevant experience of candidate
enrolled_university: Type of University course enrolled if any
education_level: Education level of candidate
major_discipline :Education major discipline of candidate
experience: Candidate total experience in years
company_size: No of employees in current employer's company
company_type : Type of current employer
last_new_job: Difference in years between previous job and current job
training_hours: training hours completed
target: 0 – Not looking for job change, 1 – Looking for a job change
--- Original source retains full ownership of the source dataset ---
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for [LLM Science Exam Kaggle Competition]
Dataset Summary
https://www.kaggle.com/competitions/kaggle-llm-science-exam/data
Languages
[en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]
Dataset Structure
Columns prompt - the text of the question being asked A - option A; if this option is correct, then answer will be A B - option B; if this option is correct, then answer will be B C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.
This dataset contains information on salaries for data science jobs in Karachi, Pakistan. This dataset can be used to gain insights into the salaries offered for data science jobs in Karachi and can be helpful for professionals who are looking to explore career opportunities in this field.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Time Series starter dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/podsyp/time-series-starter-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Machine learning can be applied to time series datasets.
A problem when getting started in time series forecasting with machine learning is finding good quality standard datasets on which to practice.
Almost every data scientist will encounter time series in their work and being able to effectively deal with such data is an important skill in the data science toolbox.Almost every data scientist will encounter time series in their work and being able to effectively deal with such data is an important skill in the data science toolbox.
Let’s begin from basics.
--- Original source retains full ownership of the source dataset ---
Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.
The data used for analysis can come from many different sources and be presented in various formats. Data science is an essential part of many industries today, given the massive amounts of data that are produced, and is one of the most debated topics in IT circles.
Kaggle Notebooks LLM Filtered
Model: meta-llama/Meta-Llama-3.1-70B-Instruct Sample: 12,400 Source dataset: data-agents/kaggle-notebooks Prompt:
Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.
The notebooks are formatted with the following tokens:
START
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Content This dataset contains 20647 amazon reviews for 836 data-science related books. Every review consists of review text and score (number of stars from 1 to 5).
Acknowledgements Thanks to all the people who write reviews.
CC0
Original Data Source: Amazon Data Science Book Reviews
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘HR data, Predict changing jobs (competition form)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kukuroo3/hr-data-predict-change-jobscompetition-form on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Context This dataset was taken from link and separated into competition format. The label for the test data is provided in the form of a function.
--- Original source retains full ownership of the source dataset ---
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Can you tell geographical stories about the world using data science?
World countries with their corresponding continents , official english names, official french names, Dial,ITU,Languages and so on.
This data was gotten from https://old.datahub.io/
Exploration of the world countries: - Can we graphically visualize countries that speak a particular language? - We can also integrate this dataset into others to enhance our exploration. - The dataset has now been updated to include longitude and latitudes of countries in the world.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Data gained from GlassDoor
It contains information regarding the Data Science/ML/DL put by company in GlassDoor
Data gained from Glassdoor
Analyze data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Heart Attack Analysis & Prediction Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset on 13 February 2022.
--- Dataset description provided by original source is as follows ---
Data Science Job Posting on Glassdoor
Groceries dataset for Market Basket Analysis(MBA)
Dataset for Facial recognition using ML approach
Covid_w/wo_Pneumonia Chest Xray
Disney Movies 1937-2016 Gross Income
Bollywood Movie data from 2000 to 2019
17.7K English song data from 2008-2017
Age : Age of the patient
Sex : Sex of the patient
exang: exercise induced angina (1 = yes; 0 = no)
ca: number of major vessels (0-3)
cp : Chest Pain type chest pain type
trtbps : resting blood pressure (in mm Hg)
chol : cholestoral in mg/dl fetched via BMI sensor
fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
rest_ecg : resting electrocardiographic results
thalach : maximum heart rate achieved
target : 0= less chance of heart attack 1= more chance of heart attack
n
--- Original source retains full ownership of the source dataset ---
A collection of cheat sheets for various data-science related languages and topics.
Taken from https://github.com/abhat222/Data-Science--Cheat-Sheet
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Data Science Job Salaries Dataset contains 11 columns, each are:
work_year: The year the salary was paid. experience_level: The experience level in the job during the year employment_type: The type of employment for the role job_title: The role worked in during the year. salary: The total gross salary amount paid. salary_currency: The currency of the salary paid as an ISO 4217 currency code. salaryinusd: The salary in USD employee_residence: Employee's primary country of residence in during the work year as an ISO 3166 country code. remote_ratio: The overall amount of work done remotely company_location: The country of the employer's main office or contracting branch company_size: The median number of people that worked for the company during the year
If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy, it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.
Data Science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can utilize the power of data to generate new ideas and expand their operations. However these roles come with a high level of expectation, requiring applicants to possess a comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.
With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.
Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.
Technical Preparation
Qualifying for a job as a data scientist needs a comprehensive level of technical preparation. Job seekers are often required to demonstrate their technical skills in order to show their ability to effectively fulfill the duties of the role. Here are a selection of key tips for technical proficiency:
Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.
Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.
Make sure you're good with data tools like Pandas and Matplotlib, as well as data visualization tools like Seaborn.
Gain proficiency in the use of SQL language to extract and process data from databases.
Understand and know the importance of feature engineering and how to create meaningful features from raw data.
Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.
If the job requires it, become familiar with big data technologies like Hadoop and Spark.
Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.
Portfolio and Projects
Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.
Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.
Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.
Maintain a well-organized GitHub profile with clean code and clear project documentation.
Domain Knowledge
Research the industry you’re applying to and understand its specific data challenges and opportunities.
Study the company you’re interviewing with to tailor your responses and show your genuine interest.
Soft Skills
Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.
Focus on your problem-solving abilities and how you approach complex challenges.
Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.
Interview Etiquette
Dress and present yourself in a professional manner, whether the interview is in person or remote.
Be on time for the interview, whether it’s virtual or in person.
Maintain good posture and eye contact during the interview. Smile and exhibit confidence.
Pay close attention to the interviewer's questions and answer them directly.
Behavioral Questions
Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.
Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.
Highlight instances where you’ve worked effectively in cross-functional teams...
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Overview This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.
Dataset Details Columns: Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums. Answer: The corresponding answer to the generated question. Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers. Subtitle: Subtitle or additional information related to the Kaggle competition or topic. Title: Title of the Kaggle competition or topic. Sources and Inspiration
Sources:
Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more. Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers. Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset. Inspiration:
The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users. Dataset Specifics Total Records: [Specify the total number of question-answer pairs in the dataset] Format: CSV (Comma Separated Values) Size: [Specify the size of the dataset in MB or GB] License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0] Download Link: [Provide a link to download the dataset] Acknowledgments We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.