100+ datasets found
  1. Gemma-Data Science Agent- Instruct- Dataset

    • kaggle.com
    Updated Apr 2, 2024
    Cite
    ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ian cecil akoto
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

    Dataset Details

    Columns:
    • Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
    • Answer: The corresponding answer to the generated question.
    • Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
    • Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
    • Title: Title of the Kaggle competition or topic.

    Sources and Inspiration

    Sources:

    • Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
    • Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
    • Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

    Inspiration:

    The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

    Dataset Specifics
    • Total Records: [Specify the total number of question-answer pairs in the dataset]
    • Format: CSV (Comma Separated Values)
    • Size: [Specify the size of the dataset in MB or GB]
    • License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
    • Download Link: [Provide a link to download the dataset]

    Acknowledgments

    We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
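
    As a rough illustration of how the Question, Answer, Context, and Title columns described for this dataset might be turned into fine-tuning examples, here is a minimal sketch. The exact CSV header spelling, the sample row, and the prompt layout are all assumptions made for illustration; this is a generic template, not Gemma's official chat format or the author's pipeline.

```python
import csv
import io


def to_instruction_example(row):
    """Build a prompt/response pair from one CSV row.

    The prompt layout is a generic, hypothetical template.
    """
    prompt = (
        f"Title: {row['Title']}\n"
        f"Context: {row['Context']}\n"
        f"Question: {row['Question']}"
    )
    return {"prompt": prompt, "response": row["Answer"]}


# Hypothetical one-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Question,Answer,Context,Subtitle,Title\n"
    "What model won?,A gradient-boosted ensemble.,"
    "The winning write-up used GBMs.,Tabular,Example Competition\n"
)

examples = [to_instruction_example(r) for r in csv.DictReader(sample_csv)]
print(examples[0]["prompt"])
```

    Keeping the context in the prompt rather than the response means the model is trained to answer from the supplied write-up text, which matches how the dataset description says the questions were generated.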

  2. The AI, ML, Data Science Salary (2020- 2025)

    • kaggle.com
    Updated Feb 25, 2025
    Cite
    Samith Chimminiyan (2025). The AI, ML, Data Science Salary (2020- 2025) [Dataset]. https://www.kaggle.com/datasets/samithsachidanandan/the-global-ai-ml-data-science-salary-for-2025
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samith Chimminiyan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains AI, ML, and data science salary details for 2020-2025. Salaries are in USD; salaries entered in other currencies are recalculated at the average FX rate for the year.

    The data is processed and updated on a weekly basis so the rankings may change over time during the year.

    Attribute Information

    • work_year: The year the salary was paid.
    • experience_level: The experience level in the job during the year, with the following possible values:
      • EN: Entry-level / Junior
      • MI: Mid-level / Intermediate
      • SE: Senior-level / Expert
      • EX: Executive-level / Director
    • employment_type: The type of employment for the role:
      • PT: Part-time
      • FT: Full-time
      • CT: Contract
      • FL: Freelance
    • job_title: The role worked in during the year.
    • salary: The total gross salary amount paid.
    • salary_currency: The currency of the salary paid as an ISO 4217 currency code.
    • salary_in_usd: The salary in USD (FX rate divided by avg. USD rate of respective year) via statistical data from the BIS and central banks.
    • employee_residence: Employee's primary country of residence during the work year as an ISO 3166 country code.
    • remote_ratio: The overall amount of work done remotely, with the following possible values:
      • 0: No remote work (less than 20%)
      • 50: Partially remote/hybrid
      • 100: Fully remote (more than 80%)
    • company_location: The country of the employer's main office or contracting branch as an ISO 3166 country code.
    • company_size: The average number of people that worked for the company during the year:
      • S: less than 50 employees (small)
      • M: 50 to 250 employees (medium)
      • L: more than 250 employees (large)
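
    The coded attributes above can be expanded into readable labels with plain lookup tables. The sketch below assumes rows arrive as dictionaries of strings (as `csv.DictReader` would produce); the sample row and the `decode_row` helper are hypothetical and only illustrate the code-to-label mapping listed in the attribute information.

```python
# Lookup tables copied from the attribute descriptions above.
EXPERIENCE_LEVELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_TYPES = {
    "PT": "Part-time",
    "FT": "Full-time",
    "CT": "Contract",
    "FL": "Freelance",
}
REMOTE_RATIOS = {
    0: "No remote work (less than 20%)",
    50: "Partially remote/hybrid",
    100: "Fully remote (more than 80%)",
}
COMPANY_SIZES = {
    "S": "less than 50 employees (small)",
    "M": "50 to 250 employees (medium)",
    "L": "more than 250 employees (large)",
}


def decode_row(row):
    """Return a copy of the row with coded fields replaced by labels."""
    decoded = dict(row)
    decoded["experience_level"] = EXPERIENCE_LEVELS[row["experience_level"]]
    decoded["employment_type"] = EMPLOYMENT_TYPES[row["employment_type"]]
    decoded["remote_ratio"] = REMOTE_RATIOS[int(row["remote_ratio"])]
    decoded["company_size"] = COMPANY_SIZES[row["company_size"]]
    return decoded


# Hypothetical row, not taken from the dataset.
sample = {
    "work_year": "2024",
    "experience_level": "SE",
    "employment_type": "FT",
    "remote_ratio": "100",
    "company_size": "M",
}
decoded = decode_row(sample)
print(decoded["experience_level"], "|", decoded["remote_ratio"])
```

    An unknown code raises `KeyError` here, which is a deliberate choice in this sketch: it surfaces unexpected values instead of silently passing them through.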

    Acknowledgements

    https://aijobs.net/

    Photo by Anastassia Anufrieva on Unsplash

  3. DSEval-Kaggle Dataset

    • paperswithcode.com
    Updated Feb 26, 2024
    + more versions
    Cite
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren (2024). DSEval-Kaggle Dataset [Dataset]. https://paperswithcode.com/dataset/dseval
    Dataset updated
    Feb 26, 2024
    Authors
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren
    Description

    In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process letting LLMs themselves generate and annotate the benchmarks with "human in the loop". A novel language (i.e., DSEAL) has been proposed and the derived four benchmarks have significantly improved the benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of the current works, providing useful insights and shedding light on future research on LLM-based data science agents.

    This is one of DSEval benchmarks.

  4. ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-457&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was founding chair succeeded by Max Levchin. Equity was raised in 2011 valuing the company at $25 million. On 8 March 2017, Google announced that they were acquiring Kaggle.[1][2]

    Source: Kaggle

    --- Original source retains full ownership of the source dataset ---

  5. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 3, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Available download formats: zip (147568851439 bytes)
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
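
    The two-level layout described above can be computed directly from a KernelVersions id. The following is a minimal sketch; the function names are hypothetical, and it assumes folder names are written without zero-padding, which the description does not confirm.

```python
def version_folders(kernel_version_id: int) -> tuple[int, int]:
    """Return (top-level folder, sub folder) for a kernel version id.

    The top-level folder groups ids by millions and the sub folder
    groups them by thousands within that million.
    """
    top = kernel_version_id // 1_000_000
    sub = (kernel_version_id // 1_000) % 1_000
    return top, sub


def version_directory(kernel_version_id: int) -> str:
    """Relative directory for an id (assumes no zero-padding in names)."""
    top, sub = version_folders(kernel_version_id)
    return f"{top}/{sub}"


# Version 123,456,789 falls in folder 123, sub folder 456.
print(version_directory(123_456_789))
```

    Because the file names match the ids in the KernelVersions csv file, a helper like this can be used to locate the source file for a row in that table once the file extension is known.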

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  6. Analyzing Data Science Salaries

    • kaggle.com
    Updated Aug 2, 2024
    Cite
    Raghav Khandelwal (2024). Analyzing Data Science Salaries [Dataset]. https://www.kaggle.com/datasets/raghavkhandelwal65/analyzing-data-science-salaries
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raghav Khandelwal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Raghav Khandelwal

    Released under Apache 2.0

    Contents

  7. ‘HR Analytics: Job Change of Data Scientists’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘HR Analytics: Job Change of Data Scientists’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-hr-analytics-job-change-of-data-scientists-db67/latest
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘HR Analytics: Job Change of Data Scientists’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context and Content

    A company active in Big Data and Data Science wants to hire data scientists from among people who successfully pass courses conducted by the company. Many people sign up for the training. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps reduce cost and time and improves the quality of training, course planning, and candidate categorization. Information on demographics, education, and experience is available from candidates' signup and enrollment.

    This dataset is also designed for HR research into the factors that lead a person to leave their current job. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors affecting the employee's decision.

    The data is divided into train and test sets. The target isn't included in the test set, but a file of test target values is available for related tasks. A sample submission corresponding to the enrollee_id values of the test set is also provided, with columns: enrollee_id, target

    Note:
    • The dataset is imbalanced.
    • Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality.
    • Missing imputation can be a part of your pipeline as well.

    Features

    • enrollee_id: Unique ID for candidate
    • city: City code
    • city_development_index: Development index of the city (scaled)
    • gender: Gender of candidate
    • relevent_experience: Relevant experience of candidate
    • enrolled_university: Type of University course enrolled if any
    • education_level: Education level of candidate
    • major_discipline: Education major discipline of candidate
    • experience: Candidate total experience in years
    • company_size: No of employees in current employer's company
    • company_type: Type of current employer
    • last_new_job: Difference in years between previous job and current job
    • training_hours: training hours completed
    • target: 0 – Not looking for job change, 1 – Looking for a job change
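
    The notes above mention class imbalance and missing-value imputation. As one possible starting point, the sketch below fills missing values in a categorical column with its most frequent value and measures the share of positive targets. The sample rows and their category values are hypothetical, and mode imputation is only one simple strategy, not a method prescribed by the dataset.

```python
from collections import Counter


def is_missing(value):
    """Treat None and empty strings as missing."""
    return value is None or value == ""


def impute_mode(rows, column):
    """Return new rows with missing values in `column` set to the mode."""
    observed = [r[column] for r in rows if not is_missing(r.get(column))]
    if not observed:
        return [dict(r) for r in rows]  # nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [
        {**r, column: mode if is_missing(r.get(column)) else r[column]}
        for r in rows
    ]


def positive_rate(rows):
    """Share of rows with target == 1; a quick check for imbalance."""
    return sum(int(r["target"]) for r in rows) / len(rows)


# Hypothetical rows, not taken from the dataset.
rows = [
    {"enrollee_id": 1, "gender": "Male", "target": 0},
    {"enrollee_id": 2, "gender": "", "target": 0},
    {"enrollee_id": 3, "gender": "Male", "target": 1},
    {"enrollee_id": 4, "gender": "Female", "target": 0},
]
filled = impute_mode(rows, "gender")
rate = positive_rate(rows)
print(filled[1]["gender"], rate)
```

    In a real pipeline the mode should be computed on the training split only and then reused for the test split, so that no information leaks from test to train.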

    Inspiration

    --- Original source retains full ownership of the source dataset ---

  8. Kaggle-LLM-Science-Exam

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Sangeetha Venkatesan (2023). Kaggle-LLM-Science-Exam [Dataset]. https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam
    Dataset updated
    Aug 8, 2023
    Authors
    Sangeetha Venkatesan
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for [LLM Science Exam Kaggle Competition]

      Dataset Summary
    

    https://www.kaggle.com/competitions/kaggle-llm-science-exam/data

      Languages
    

    [en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]

      Dataset Structure
    

    Columns
    • prompt - the text of the question being asked
    • A - option A; if this option is correct, then answer will be A
    • B - option B; if this option is correct, then answer will be B
    • C - option C; if this…

    See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.

  9. Data Science Jobs Salaries Dataset

    • kaggle.com
    Updated Apr 20, 2023
    + more versions
    Cite
    Wahaj Raza (2023). Data Science Jobs Salaries Dataset [Dataset]. https://www.kaggle.com/datasets/swahajraza/data-science-jobs-salaries-dataset
    Dataset updated
    Apr 20, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wahaj Raza
    Description

    This dataset contains information on salaries for data science jobs in Karachi, Pakistan. This dataset can be used to gain insights into the salaries offered for data science jobs in Karachi and can be helpful for professionals who are looking to explore career opportunities in this field.

  10. ‘Time Series starter dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Time Series starter dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-time-series-starter-dataset-19e9/latest
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Time Series starter dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/podsyp/time-series-starter-dataset on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    Machine learning can be applied to time series datasets.

    Content

    A problem when getting started in time series forecasting with machine learning is finding good quality standard datasets on which to practice.

    Acknowledgements

    Almost every data scientist will encounter time series in their work and being able to effectively deal with such data is an important skill in the data science toolbox.

    Inspiration

    Let’s begin from basics.

    --- Original source retains full ownership of the source dataset ---

  11. Data Science Jobs Analysis

    • kaggle.com
    Updated Feb 8, 2023
    Cite
    Niyal Thakkar (2023). Data Science Jobs Analysis [Dataset]. https://www.kaggle.com/datasets/niyalthakkar/data-science-jobs-analysis
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Niyal Thakkar
    Description

    Data science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions. Data science uses complex machine learning algorithms to build predictive models.

    The data used for analysis can come from many different sources and be presented in various formats. Data science is an essential part of many industries today, given the massive amounts of data that are produced, and is one of the most debated topics in IT circles.

  12. kaggle-notebooks-edu-v0

    • huggingface.co
    Updated May 31, 2025
    Cite
    Data Agents (2025). kaggle-notebooks-edu-v0 [Dataset]. https://huggingface.co/datasets/data-agents/kaggle-notebooks-edu-v0
    Dataset updated
    May 31, 2025
    Dataset authored and provided by
    Data Agents
    Description

    Kaggle Notebooks LLM Filtered

    Model: meta-llama/Meta-Llama-3.1-70B-Instruct
    Sample: 12,400
    Source dataset: data-agents/kaggle-notebooks
    Prompt:

    Below is an extract from a Jupyter notebook. Evaluate whether it has a high analysis value and could help a data scientist.

    The notebooks are formatted with the following tokens:

    START

    Here comes markdown content

    Here comes python code

    Here comes code output

    More… See the full description on the dataset page: https://huggingface.co/datasets/data-agents/kaggle-notebooks-edu-v0.

  13. Amazon Data Science Book Reviews

    • opendatabay.com
    • kaggle.com
    Updated Jun 16, 2025
    + more versions
    Cite
    Datasimple (2025). Amazon Data Science Book Reviews [Dataset]. https://www.opendatabay.com/data/ai-ml/fa468f38-c13a-4388-9e15-6e7acdc99d98
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Reviews & Ratings
    Description

    Content

    This dataset contains 20647 Amazon reviews for 836 data-science related books. Every review consists of review text and score (number of stars from 1 to 5).

    Acknowledgements

    Thanks to all the people who write reviews.

    License

    CC0

    Original Data Source: Amazon Data Science Book Reviews

  14. ‘HR data, Predict changing jobs (competition form)’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘HR data, Predict changing jobs (competition form)’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-hr-data-predict-changing-jobs-competition-form-1d9b/a230c863/?iid=013-955&v=presentation
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘HR data, Predict changing jobs (competition form)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kukuroo3/hr-data-predict-change-jobscompetition-form on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    This dataset was taken from link and separated into competition format. The label for the test data is provided in the form of a function.

    --- Original source retains full ownership of the source dataset ---

  15. World Countries and Continents Details

    • kaggle.com
    zip
    Updated Oct 5, 2017
    Cite
    folaraz (2017). World Countries and Continents Details [Dataset]. https://www.kaggle.com/folaraz/world-countries-and-continents-details
    Available download formats: zip (24400 bytes)
    Dataset updated
    Oct 5, 2017
    Authors
    folaraz
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Area covered
    World
    Description

    Context

    Can you tell geographical stories about the world using data science?

    Content

    World countries with their corresponding continents, official English names, official French names, Dial, ITU, Languages, and so on.

    Acknowledgements

    This data was obtained from https://old.datahub.io/

    Inspiration

    Exploration of the world countries:
    • Can we graphically visualize countries that speak a particular language?
    • We can also integrate this dataset into others to enhance our exploration.
    • The dataset has now been updated to include longitude and latitudes of countries in the world.

  16. GlassDoor(Data Scientist)

    • kaggle.com
    zip
    Updated Aug 1, 2020
    Cite
    milan (2020). GlassDoor(Data Scientist) [Dataset]. https://www.kaggle.com/milan400/glassdoordata-scientist
    Available download formats: zip (1040514 bytes)
    Dataset updated
    Aug 1, 2020
    Authors
    milan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Data gained from GlassDoor

    Content

    It contains information on Data Science/ML/DL listings posted by companies on Glassdoor

    Acknowledgements

    Data gained from Glassdoor

    Inspiration

    Analyze data

  17. ‘Heart Attack Analysis & Prediction Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Heart Attack Analysis & Prediction Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-heart-attack-analysis-prediction-dataset-51b9/de5fe27e/?iid=015-932&v=presentation
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Analysis of ‘Heart Attack Analysis & Prediction Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Hone your analytical and ML skills by participating in tasks of my other datasets, given below.

    Data Science Job Posting on Glassdoor

    Groceries dataset for Market Basket Analysis(MBA)

    Dataset for Facial recognition using ML approach

    Covid_w/wo_Pneumonia Chest Xray

    Disney Movies 1937-2016 Gross Income

    Bollywood Movie data from 2000 to 2019

    17.7K English song data from 2008-2017

    About this dataset

    • Age : Age of the patient
    • Sex : Sex of the patient
    • exang: exercise induced angina (1 = yes; 0 = no)
    • ca: number of major vessels (0-3)
    • cp : chest pain type
      • Value 1: typical angina
      • Value 2: atypical angina
      • Value 3: non-anginal pain
      • Value 4: asymptomatic
    • trtbps : resting blood pressure (in mm Hg)
    • chol : cholesterol in mg/dl fetched via BMI sensor
    • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
    • rest_ecg : resting electrocardiographic results
      • Value 0: normal
      • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
      • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    • thalach : maximum heart rate achieved
    • target : 0 = less chance of heart attack, 1 = more chance of heart attack

    --- Original source retains full ownership of the source dataset ---

  18. Data Science Cheat Sheets

    • kaggle.com
    zip
    Updated Feb 4, 2020
    Cite
    Timo Bozsolik (2020). Data Science Cheat Sheets [Dataset]. https://www.kaggle.com/timoboz/data-science-cheat-sheets
    Available download formats: zip (625256639 bytes)
    Dataset updated
    Feb 4, 2020
    Authors
    Timo Bozsolik
    Description

    A collection of cheat sheets for various data-science related languages and topics.

    Taken from https://github.com/abhat222/Data-Science--Cheat-Sheet

  19. Data Science Salaries 2025 💸

    • kaggle.com
    Updated Apr 11, 2025
    Cite
    randomarnab (2025). Data Science Salaries 2025 💸 [Dataset]. https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2025
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    randomarnab
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    The Data Science Job Salaries Dataset contains 11 columns:

    • work_year: The year the salary was paid.
    • experience_level: The experience level in the job during the year.
    • employment_type: The type of employment for the role.
    • job_title: The role worked in during the year.
    • salary: The total gross salary amount paid.
    • salary_currency: The currency of the salary paid, as an ISO 4217 currency code.
    • salary_in_usd: The salary in USD.
    • employee_residence: The employee's primary country of residence during the work year, as an ISO 3166 country code.
    • remote_ratio: The overall amount of work done remotely.
    • company_location: The country of the employer's main office or contracting branch.
    • company_size: The median number of people that worked for the company during the year.
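As a rough illustration of how these columns might be used, the sketch below checks that a CSV carries all 11 columns and computes the median USD salary per experience level. It rests on several assumptions: the column that appears as `salaryinusd` in the listing is taken to be `salary_in_usd`, the function name `median_usd_by_level` is hypothetical, and the sample rows (including the category values) are invented for the example rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Column names from the description; "salary_in_usd" is an assumed spelling.
COLUMNS = [
    "work_year", "experience_level", "employment_type", "job_title",
    "salary", "salary_currency", "salary_in_usd", "employee_residence",
    "remote_ratio", "company_location", "company_size",
]


def median_usd_by_level(csv_text):
    """Return {experience_level: median salary_in_usd} for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    by_level = defaultdict(list)
    for row in reader:
        by_level[row["experience_level"]].append(float(row["salary_in_usd"]))
    return {level: median(values) for level, values in by_level.items()}


# Invented rows for illustration only; values are not from the dataset.
SAMPLE_CSV = """\
work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
2025,senior,full_time,Data Scientist,150000,USD,150000,US,0,US,250
2025,senior,full_time,ML Engineer,170000,USD,170000,US,100,US,1000
2025,junior,full_time,Data Analyst,60000,EUR,65000,DE,50,DE,50
"""

medians = median_usd_by_level(SAMPLE_CSV)
```

The median is used instead of the mean because salary distributions tend to be skewed by a few very high values.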

  20. 30 Short Tips for Your Data Scientist Interview

    • kaggle.com
    Updated Oct 12, 2023
    Cite
    Skillslash17 (2023). 30 Short Tips for Your Data Scientist Interview [Dataset]. https://www.kaggle.com/datasets/skillslash17/30-short-tips-for-your-data-scientist-interview
    Dataset updated
    Oct 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Skillslash17
    Description

    If you’re a data scientist looking to get ahead in the ever-changing world of data science, you know that job interviews are a crucial part of your career. But getting a job as a data scientist is not just about being tech-savvy; it’s also about having the right skillset, being able to solve problems, and having good communication skills. With competition heating up, it’s important to stand out and make a good impression on potential employers.

    Data science has become an essential part of the contemporary business environment, enabling decision-making in a variety of industries. Consequently, organizations are increasingly looking for individuals who can use data to generate new ideas and expand their operations. However, these roles come with high expectations: applicants need a comprehensive knowledge of data analytics and machine learning, as well as the capacity to turn their discoveries into practical solutions.

    With so many job seekers out there, it’s super important to be prepared and confident for your interview as a data scientist.

    Here are 30 tips to help you get the most out of your interview and land the job you want. No matter if you’re just starting out or have been in the field for a while, these tips will help you make the most of your interview and set you up for success.

    Technical Preparation

    Qualifying for a job as a data scientist requires thorough technical preparation. Job seekers are often asked to demonstrate their technical skills to show that they can fulfill the duties of the role. Here is a selection of key tips for technical proficiency:

    1 Master the Basics

    Make sure you have a good understanding of statistics, math, and programming languages such as Python and R.

    2 Understand Machine Learning

    Gain an in-depth understanding of commonly used machine learning techniques, including linear regression and decision trees, as well as neural networks.

    3 Data Manipulation

    Make sure you're comfortable with data manipulation tools like Pandas, as well as data visualization tools like Matplotlib and Seaborn.

    4 SQL Skills

    Gain proficiency in using SQL to extract and process data from databases.

    5 Feature Engineering

    Understand the importance of feature engineering and how to create meaningful features from raw data.

    6 Model Evaluation

    Learn to assess and compare machine learning models using metrics like accuracy, precision, recall, and F1-score.

    7 Big Data Technologies

    If the job requires it, become familiar with big data technologies like Hadoop and Spark.

    8 Coding Challenges

    Practice coding challenges related to data manipulation and machine learning on platforms like LeetCode and Kaggle.
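For tip 6 above, it can help to be able to derive the four metrics by hand from the confusion counts. The following is a minimal sketch for binary labels; the function name `classification_metrics` and the sample labels are made up for illustration, and in practice a library such as scikit-learn would usually be used instead.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Confusion counts with respect to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

With these sample labels all four metrics come out to 0.75. On imbalanced data they typically diverge, which is why precision, recall, and F1 are reported alongside accuracy.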

    Portfolio and Projects

    9 Build a Portfolio

    Develop a portfolio of your data science projects that outlines your methodology, the resources you have employed, and the results achieved.

    10 Kaggle Competitions

    Participate in Kaggle competitions to gain real-world experience and showcase your problem-solving skills.

    11 Open Source Contributions

    Contribute to open-source data science projects to demonstrate your collaboration and coding abilities.

    12 GitHub Profile

    Maintain a well-organized GitHub profile with clean code and clear project documentation.

    Domain Knowledge

    13 Understand the Industry

    Research the industry you’re applying to and understand its specific data challenges and opportunities.

    14 Company Research

    Study the company you’re interviewing with to tailor your responses and show your genuine interest.

    Soft Skills

    15 Communication

    Practice explaining complex concepts in simple terms. Data Scientists often need to communicate findings to non-technical stakeholders.

    16 Problem-Solving

    Focus on your problem-solving abilities and how you approach complex challenges.

    17 Adaptability

    Highlight your ability to adapt to new technologies and techniques as the field of data science evolves.

    Interview Etiquette

    18 Professional Appearance

    Dress and present yourself in a professional manner, whether the interview is in person or remote.

    19 Punctuality

    Be on time for the interview, whether it’s virtual or in person.

    20 Body Language

    Maintain good posture and eye contact during the interview. Smile and exhibit confidence.

    21 Active Listening

    Pay close attention to the interviewer's questions and answer them directly.

    Behavioral Questions

    22 STAR Method

    Use the STAR (Situation, Task, Action, Result) method to structure your responses to behavioral questions.

    23 Conflict Resolution

    Be prepared to discuss how you have handled conflicts or challenging situations in previous roles.

    24 Teamwork

    Highlight instances where you’ve worked effectively in cross-functional teams...

Cite
ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset

Gemma-Data Science Agent- Instruct- Dataset

Data Science Assistance with Gemma Fine-tuned on Kaggle Solutions Writeup

Dataset updated
Apr 2, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ian cecil akoto
License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Overview

This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

Dataset Details

Columns:

• Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
• Answer: The corresponding answer to the generated question.
• Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
• Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
• Title: Title of the Kaggle competition or topic.

Sources and Inspiration

Sources:

• Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
• Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
• Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

Inspiration:

The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

Dataset Specifics

• Total Records: [Specify the total number of question-answer pairs in the dataset]
• Format: CSV (Comma Separated Values)
• Size: [Specify the size of the dataset in MB or GB]
• License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
• Download Link: [Provide a link to download the dataset]

Acknowledgments

We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
