41 datasets found
  1. Gemma-Data Science Agent- Instruct- Dataset

    • kaggle.com
    Updated Apr 2, 2024
    Cite
    ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    ian cecil akoto
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

    Dataset Details

    Columns:
    • Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
    • Answer: The corresponding answer to the generated question.
    • Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
    • Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
    • Title: Title of the Kaggle competition or topic.

    Sources and Inspiration

    Sources:

    • Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
    • Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
    • Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

    Inspiration:

    The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

    Dataset Specifics
    • Total Records: [Specify the total number of question-answer pairs in the dataset]
    • Format: CSV (Comma Separated Values)
    • Size: [Specify the size of the dataset in MB or GB]
    • License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
    • Download Link: [Provide a link to download the dataset]

    Acknowledgments

    We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
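The columns listed above can be read with Python's standard csv module; a minimal sketch, using an invented in-memory sample row rather than the real file (the actual filename and contents are assumptions):

```python
import csv
import io

# Invented sample mimicking the dataset's columns:
# Question, Answer, Context, Subtitle, Title.
sample = io.StringIO(
    "Question,Answer,Context,Subtitle,Title\n"
    '"What CV scheme was used?","5-fold stratified CV.",'
    '"We used 5-fold stratified cross-validation.","1st place","Some Competition"\n'
)

# Parse rows into dicts keyed by the header names.
rows = list(csv.DictReader(sample))

# Extract (question, answer) pairs, e.g. for instruction fine-tuning.
pairs = [(r["Question"], r["Answer"]) for r in rows]
print(pairs[0][0])  # → What CV scheme was used?
```

With the real dataset, `io.StringIO(...)` would be replaced by an open file handle on the downloaded CSV.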

  2. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (145784025210 bytes)
    Dataset updated
    Jun 19, 2025
    Dataset authored and provided by
    Kaggle: http://kaggle.com/
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
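Under this layout, the directory holding a given KernelVersions id can be derived arithmetically; a minimal sketch (the file's extension varies by notebook language, so it is omitted here):

```python
def kernel_version_dir(version_id: int) -> str:
    """Derive the two-level folder for a KernelVersions id:
    top folder = millions part of the id, sub folder = thousands part."""
    top = version_id // 1_000_000       # e.g. 123 for id 123,456,789
    sub = (version_id // 1_000) % 1_000  # e.g. 456
    return f"{top}/{sub}"

print(kernel_version_dir(123_456_789))  # → 123/456
```

Joining the resulting file back to the KernelVersions table in Meta Kaggle (on the id) then recovers the competition, author tier, votes, and comments described above.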

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  3. Kaggle-LLM-Science-Exam

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Sangeetha Venkatesan (2023). Kaggle-LLM-Science-Exam [Dataset]. https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam
    Explore at:
    Croissant
    Dataset updated
    Aug 8, 2023
    Authors
    Sangeetha Venkatesan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for [LLM Science Exam Kaggle Competition]

      Dataset Summary
    

    https://www.kaggle.com/competitions/kaggle-llm-science-exam/data

      Languages
    

    [en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]

      Dataset Structure
    

    Columns:
    • prompt - the text of the question being asked
    • A - option A; if this option is correct, then answer will be A
    • B - option B; if this option is correct, then answer will be B
    • C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.

  4. ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-457&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.[1][2]

    Source: Kaggle

    --- Original source retains full ownership of the source dataset ---

  5. How to Win Data Science Competition

    • kaggle.com
    zip
    Updated Jan 30, 2018
    Cite
    Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition
    Explore at:
    Available download formats: zip (15845091 bytes)
    Dataset updated
    Jan 30, 2018
    Authors
    Budi Ryan
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

    Contents

  6. Kaggle

    • registry.identifiers.org
    • bioregistry.io
    Updated Aug 23, 2019
    Cite
    (2019). Kaggle [Dataset]. https://registry.identifiers.org/registry/kaggle
    Explore at:
    Dataset updated
    Aug 23, 2019
    Description

    Kaggle is a platform for sharing data, performing reproducible analyses, interactive data analysis tutorials, and machine learning competitions.

  7. Titanic Dataset - cleaned

    • kaggle.com
    Updated Aug 9, 2019
    Cite
    WinstonSDodson (2019). Titanic Dataset - cleaned [Dataset]. https://www.kaggle.com/winstonsdodson/titanic-dataset-cleaned/kernels
    Explore at:
    Croissant
    Dataset updated
    Aug 9, 2019
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    WinstonSDodson
    Description

    This is the classic Titanic Dataset provided in the Kaggle Titanic competition and then cleaned in one of the most popular Kernels there. Please see the Kernel titled "A Data Science Framework: To Achieve 99% Accuracy" for a great lesson in data science. That Kernel gives a great explanation of the thinking behind this data cleaning, as well as a very professional demonstration of the technologies and skills to do so. It then continues with an overview of many ML techniques and is copiously and meticulously documented with many useful citations.

    Of course, data cleaning is an essential skill in data science, but I wanted to use this data for a study of other machine learning techniques. So I found and used this well-known version of the data, cleaned to a benchmark accepted by many.

  8. Data from: PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton...

    • datadiscoverystudio.org
    • s.cnmilf.com
    • +3more
    html
    Updated Feb 8, 2018
    + more versions
    Cite
    (2018). PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton Smith in Straits of Florida from 2014-06-03 to 2014-06-06 and used in the 2015 National Data Science Bowl (NCEI Accession 0127422). [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/f5a2c6072c47451192a114d51f902e14/html
    Explore at:
    Available download formats: html
    Dataset updated
    Feb 8, 2018
    Description

    Data presented here are a subset of a larger plankton imagery data set collected in the subtropical Straits of Florida from 2014-05-28 to 2014-06-14. Imagery data were collected using the In Situ Ichthyoplankton Imaging System (ISIIS-2) as part of an NSF-funded project to assess the biophysical drivers affecting fine-scale interactions between larval fish, their prey, and predators. This subset of images was used in the inaugural National Data Science Bowl (www.datasciencebowl.com) hosted by Kaggle and sponsored by Booz Allen Hamilton. Data were originally collected to examine the biophysical drivers affecting fine-scale (spatial) interactions between larval fish, their prey, and predators in a subtropical pelagic marine ecosystem. Image segments extracted from the raw data were sorted into 121 plankton classes, split 50:50 into train and test data sets, and provided for a machine learning competition (the National Data Science Bowl). There were no hierarchical relationships explicit in the 121 plankton classes, though the class naming convention and a tree-like diagram (see file "Plankton Relationships.pdf") indicated relationships between classes, whether taxonomic or structural (size and shape). We intend for this dataset to be available to the machine learning and computer vision community as a standard machine learning benchmark. This "Plankton 1.0" dataset is a medium-size dataset with a fair amount of complexity where image classification improvements can still be made.

  9. Competition Dataset: Center of Policing Equity

    • kaggle.com
    zip
    Updated Nov 26, 2018
    Cite
    Shivam Bansal (2018). Competition Dataset: Center of Policing Equity [Dataset]. https://www.kaggle.com/shivamb/external-datasets-cpe
    Explore at:
    Available download formats: zip (198124223 bytes)
    Dataset updated
    Nov 26, 2018
    Authors
    Shivam Bansal
    Description

    Dataset

    This dataset was created by Shivam Bansal

    Contents

    It contains the following files:

  10. LLM Science Dataset

    • kaggle.com
    Updated Aug 7, 2023
    Cite
    Zhecheng Li (2023). LLM Science Dataset [Dataset]. https://www.kaggle.com/datasets/lizhecheng/llm-science-dataset/suggestions
    Explore at:
    Croissant
    Dataset updated
    Aug 7, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Zhecheng Li
    License

    CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/

    Description

    Version 3 contains 6 datasets.

    1 - The original training dataset in LLM Science Exam

    2 - 6.0k train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    3 - 500 train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    4 - 600 train examples collected by Zhecheng LI using ChatGPT-3.5 here

    5 - wikipedia-stem-1k dataset collected by LEONID KULYK, the dataset link is here

    6 - MMLU Dataset: I chose about 3,600 examples that are suitable for fine-tuning for this competition; the original dataset I have published here

    Thanks for their contribution to this competition and many NLP projects.

  11. The Quest Dataset

    • kaggle.com
    Updated Nov 26, 2024
    Cite
    Jules King (2024). The Quest Dataset [Dataset]. https://www.kaggle.com/datasets/julesking/the-quest-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Jules King
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Learning Agency Lab’s data science competition, “The Quest for Quality Questions: Improving Reading Comprehension through Automated Question Generation,” was designed to build AI algorithms that can automatically generate questions that test young learners’ reading comprehension.

    As many educators and researchers know, questions are key in teaching and evaluating narrative comprehension skills in young learners. However, generating high-quality reading comprehension queries is time consuming, which limits the number of texts that young readers can engage with in this way. Datasets can help by informing quality question automation.

    The Quest challenge dataset can be accessed on this page and was aided by foundational data from the Lab’s FairytaleQA dataset of 10,580 questions. Those queries were created to address gaps in similar datasets, which often overlooked fine-grained reading skills that showcase an understanding of varying narrative elements.

    The Quest was made possible by The Learning Agency Lab, Mark Warschauer at UC Irvine, and Ying Xu at The University of Michigan School of Education. More can be found about the creators here.

    Quest dataset © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

    Competition - https://www.thequestchallenge.org/

    Publications - Xu, Y., Wang, D., Yu, M., Ritchie, D., Yao, B., Wu, T., ... & Warschauer, M. (2022). Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension. arXiv preprint arXiv:2203.13947.

  12. Data Science Bowl 2018 Competition - Merged Mask

    • kaggle.com
    Updated Jan 5, 2024
    Cite
    Zenitsu157 (2024). Data Science Bowl 2018 Competition - Merged Mask [Dataset]. https://www.kaggle.com/datasets/mahmudulhasantasin/data-science-bowl-2018-competition-merged-mask/discussion
    Explore at:
    Croissant
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Zenitsu157
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (brightfield vs. fluorescence). The dataset is designed to challenge an algorithm's ability to generalize across these variations.

    Each image is represented by an associated ImageId. Files belonging to an image are contained in a folder with this ImageId. Within this folder are two subfolders:

    • images contains the image file.
    • masks contains the segmented masks of each nucleus. This folder is only included in the training set. Each mask contains one nucleus. Masks are not allowed to overlap (no pixel belongs to two masks).

    The second stage dataset will contain images from unseen experimental conditions. To deter hand labeling, it will also contain images that are ignored in scoring. The metric used to score this competition requires that your submissions are in run-length encoded format. Please see the evaluation page for details.
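The run-length encoded submission format can be sketched as follows; this is a minimal illustration that encodes an already-flattened binary mask into (start, length) pairs with 1-based pixel indices, leaving the flattening order (the competition numbers pixels top-to-bottom, then left-to-right) to the caller:

```python
def rle_encode(flat_mask):
    """Run-length encode a flattened binary mask as (start, length) pairs.
    Pixel indices are 1-based, matching the submission format described above;
    the caller is responsible for flattening in the expected column-major order."""
    runs = []
    start = None
    for i, v in enumerate(flat_mask, start=1):
        if v and start is None:
            start = i                       # a run of mask pixels begins
        elif not v and start is not None:
            runs.append((start, i - start))  # the run just ended
            start = None
    if start is not None:                    # mask extends to the last pixel
        runs.append((start, len(flat_mask) - start + 1))
    return runs

print(rle_encode([0, 1, 1, 0, 1]))  # → [(2, 2), (5, 1)]
```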

    As with any human-annotated dataset, you may find various forms of errors in the data. You may manually correct errors you find in the training set. The dataset will not be updated/re-released unless it is determined that there are a large number of systematic errors.

  13. Football Analytics (Event data)

    • kaggle.com
    Updated Aug 25, 2020
    + more versions
    Cite
    HARDIK AGARWAL (2020). Football Analytics (Event data) [Dataset]. https://www.kaggle.com/datasets/hardikagarwal1/football-analytics-event-data-statsbomb/data
    Explore at:
    Croissant
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    HARDIK AGARWAL
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly.

    A football game generates hundreds of events and it is very important and interesting to take into account the context in which those events were generated. This incredibly rich data set should keep football analytics enthusiasts awake for long hours as the size of the data set and number of questions that can be asked is huge.

    Content

    There are 4 main files containing the data:

    1) Competition data: Contains information regarding competition id, competition name, season id, season name, country and gender.

    2) Match data: Match information for each match including competition and season information, stadium and referee information, home and away team information as well as the data version the match was collected under.

    3) Lineup data: Records the lineup information for the players, managers and referees involved with each match. The following variables are collected in the lineups of each match: team id, team name and lineup. The lineup array is a nested data frame inside of the lineup object; it contains the following information for each team: player id, player name, player nickname, jersey number and country.

    4) Event data: Event data comprises general attributes and event-specific attributes. General attributes are recorded for most event types, depending only on applicability. Event-specific attributes help describe the event type in more detail as well as describe the outcome of the event type.
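As an illustration of working with the nested lineup records described above, a minimal sketch of flattening one team's lineup into per-player rows; the dictionary keys and values here are assumptions modeled on the description, not the exact StatsBomb schema:

```python
# Invented lineup record following the fields described above
# (team id, team name, nested lineup array of players).
lineup = {
    "team_id": 217,
    "team_name": "Barcelona",
    "lineup": [
        {"player_id": 5503, "player_name": "Lionel Messi",
         "jersey_number": 10, "country": "Argentina"},
        {"player_id": 5211, "player_name": "Jordi Alba",
         "jersey_number": 18, "country": "Spain"},
    ],
}

# Flatten: one row per player, carrying the team name alongside.
rows = [{"team": lineup["team_name"], **player} for player in lineup["lineup"]]
print(rows[0]["player_name"])  # → Lionel Messi
```

The same flattening pattern applies per match once the real lineup JSON files are loaded; consult the open data specification document mentioned below for the authoritative field names.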

    The open data specification document in the doc folder describes the structure of the data along with all attributes in great detail. Take a look at this file for deeper understanding of the data.

    Acknowledgements

    This data is from the StatsBomb Open Data repository. StatsBomb are committed to sharing new data and research publicly to enhance understanding of the game of Football. They want to actively encourage new research and analysis at all levels. Therefore they have made certain leagues of StatsBomb Data freely available for public use for research projects and genuine interest in football analytics.

    Inspiration

    There are many, many questions we can ask with such detailed event data. Here are just a few examples:
    • What is the value of a shot? What is the probability of a shot being a goal given its location, shooter, league, assist method, game state, number of players on the pitch, and time? (These are known as expected goals, or xG, models.)
    • When are teams more likely to score?
    • Which teams are the best or sloppiest at holding a lead?
    • Which teams or players make the best use of set pieces?
    • How do players compare when they shoot with their weak foot versus their strong foot? Which players are ambidextrous?
    • Can we identify different styles of play (shooting from long range vs. shooting from the box, crossing the ball vs. passing the ball, use of headers)?
    • Which teams have a bias for attacking on a particular flank?

  14. Life Expectancy 1960 to present (Global)

    • kaggle.com
    Updated Mar 13, 2025
    Cite
    Frederick Salazar Sanchez (2025). Life Expectancy 1960 to present (Global) [Dataset]. https://www.kaggle.com/datasets/fredericksalazar/life-expectancy-1960-to-present-global
    Explore at:
    Croissant
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Frederick Salazar Sanchez
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PLEASE if you use or like this dataset UPVOTE 👁️

    This dataset offers a detailed historical record of global life expectancy, covering data from 1960 to the present. It is meticulously curated to enable deep analysis of trends and gender disparities in life expectancy worldwide.

    Dataset Structure & Key Columns:

    Country Code (🔤): Unique identifier for each country.

    Country Name (🌍): Official name of the country.

    Region (🌐): Broad geographical area (e.g., Asia, Europe, Africa).

    Sub-Region (🗺️): More specific regional classification within the broader region.

    Intermediate Region (🔍): Additional granular geographical grouping when applicable.

    Year (📅): The specific year to which the data pertains.

    Life Expectancy for Women (👩‍⚕️): Average years a woman is expected to live in that country and year.

    Life Expectancy for Men (👨‍⚕️): Average years a man is expected to live in that country and year.

    Context & Use Cases:

    This dataset is a rich resource for exploring long-term trends in global health and demography. By comparing life expectancy data over decades, researchers can:

    Analyze Time Series Trends: Forecast future changes in life expectancy and evaluate the impact of health interventions over time.

    Study Gender Disparities: Investigate the differences between life expectancy for women and men, providing insights into social, economic, and healthcare factors influencing these trends.

    Regional & Sub-Regional Analysis: Compare and contrast life expectancy across various regions and sub-regions to understand geographical disparities and their underlying causes.

    Support Public Policy Research: Inform policymakers by linking life expectancy trends with public health policies, socioeconomic developments, and other key indicators.

    Educational & Data Science Applications: Serve as a comprehensive teaching tool for courses on public health, global development, and data analysis, as well as for Kaggle competitions and projects.

    With its detailed, structured format and broad temporal coverage, this dataset is ideal for anyone looking to gain a nuanced understanding of global health trends and to drive impactful analyses in public health, social sciences, and beyond.
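For instance, the gender-disparity analysis described above starts from a per-record gap between the two life-expectancy columns; a minimal sketch using the column names listed earlier (the exact CSV headers are an assumption, and the record is invented for illustration):

```python
# One invented record shaped like the columns described above.
records = [
    {"Country Name": "Japan", "Year": 2020,
     "Life Expectancy for Women": 87.7, "Life Expectancy for Men": 81.6},
]

# Gender gap = women's minus men's life expectancy, per country-year.
for r in records:
    gap = r["Life Expectancy for Women"] - r["Life Expectancy for Men"]
    print(f'{r["Country Name"]} {r["Year"]}: gap = {gap:.1f} years')
# → Japan 2020: gap = 6.1 years
```

Aggregating these gaps by Region or Sub-Region, or tracking them across Year, covers the regional and time-series use cases listed above.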

    Feel free to ask for further customizations or additional details as needed!

  15. Tweet Sentiment Extraction JSON

    • kaggle.com
    zip
    Updated Mar 31, 2020
    Cite
    Vaishvik (2020). Tweet Sentiment Extraction JSON [Dataset]. https://www.kaggle.com/vaishvik25/tweet-sentiment-extraction-json
    Explore at:
    Available download formats: zip (1452508 bytes)
    Dataset updated
    Mar 31, 2020
    Authors
    Vaishvik
    Description

    Dataset

    This dataset was created by Vaishvik

    Contents

    It contains the following files:

  16. Champions League 23/24

    • kaggle.com
    Updated May 24, 2024
    Cite
    Sharvagya (2024). Champions League 23/24 [Dataset]. http://doi.org/10.34740/kaggle/ds/5071658
    Explore at:
    Croissant
    Dataset updated
    May 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Sharvagya
    Description

    Champions League 2023/2024 Dataset

    Overview

    This dataset provides detailed statistics for the UEFA Champions League 2023/2024 season, focusing on team performance across various metrics. The data is sourced from FBref, a comprehensive platform for football statistics. This single-table dataset includes metrics such as matches played, wins, losses, goals scored, expected goals (xG), and more for each team participating in the Champions League.

    Dataset Content

    The dataset is structured as a single CSV file with the following headers:

    • Rk: Rank of the team based on the stage of the competition reached.
    • Country: The country of the club.
    • Squad: The name of the club.
    • MP: Matches played.
    • W: Matches won.
    • D: Matches drawn.
    • L: Matches lost.
    • GF: Goals for - total goals scored by the team.
    • GA: Goals against - total goals conceded by the team.
    • GD: Goal difference (GF - GA).
    • Pts: Total points accumulated by the team
    • xG: Expected goals - a metric that estimates the number of goals a team should have scored based on the quality of their chances.
    • xGA: Expected goals against - a metric that estimates the number of goals a team should have conceded based on the quality of chances they allowed.
    • xGD: Expected goal difference (xG - xGA).
    • xGD/90: Expected goal difference per 90 minutes.
    • Last 5: Results of the last 5 matches (e.g., WWDWL for 3 wins, 1 draw, and 1 loss).
    • Attendance: Average attendance for home matches.
    • Top Team Scorer: The name of the top scorer for the team.
    • Goalkeeper: The name of the main goalkeeper for the team.
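    The derived columns above can be sanity-checked directly from their definitions. A minimal sketch using only the standard library; the sample row is illustrative, not taken from the actual file:

```python
import csv
import io

# Minimal sketch of reading the table with the standard library; the
# sample row below is illustrative, not taken from the actual file.
sample = io.StringIO(
    "Rk,Country,Squad,MP,W,D,L,GF,GA,GD,Pts,xG,xGA,xGD\n"
    "1,es,Real Madrid,13,9,3,1,28,15,13,30,25.1,16.3,8.8\n"
)
rows = list(csv.DictReader(sample))

for row in rows:
    # The derived columns follow their definitions:
    # GD = GF - GA and xGD = xG - xGA.
    assert int(row["GD"]) == int(row["GF"]) - int(row["GA"])
    assert abs(float(row["xGD"]) - (float(row["xG"]) - float(row["xGA"]))) < 1e-9
print(rows[0]["Squad"], rows[0]["GD"])  # Real Madrid 13
```

    The same row-by-row checks apply when reading the full CSV from disk.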

    Data Source

    The data has been scraped from FBref, a well-known source for football statistics. FBref provides detailed and historical data for various football competitions worldwide, including the UEFA Champions League.

    Acknowledgements

    • FBref: For providing the comprehensive data used to compile this dataset.
    • Kaggle: For hosting and facilitating data science competitions and datasets.
  17. 📊 Meta Kaggle| Kaggle Users' Stats

    • kaggle.com
    zip
    Updated Jun 4, 2025
    BwandoWando (2025). 📊 Meta Kaggle| Kaggle Users' Stats [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-users-stats/suggestions
    Explore at:
    zip (0 bytes)
    Dataset updated
    Jun 4, 2025
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    History

    • 03Mar2025 - When determining last content shared, I now use the latest version of each Model, Dataset, and Notebook rather than the creation date of the very first version. I also added reaction counts, a new csv in the Meta Kaggle dataset (the discussion can be found here), and versions created for Model, Notebook, and Dataset to properly track users who update their datasets.
    • 04Feb2025 - Fixed the issue of ModelUpvotesGiven and ModelUpvotesReceived values being identical

    Context

    Aggregated user stats and data derived from the official Meta Kaggle dataset

    Note

    Expect some discrepancies from the counts seen in your profile: aside from a lag of one to two days before new data is published, some information, such as Kaggle staff upvotes and private competitions, is not included. For almost all members, however, the figures should reconcile.

    Notebook updater

    📊 (Scheduled) Meta Kaggle Users' Stats

    Image: generated with the Bing image generator

  18. Translated Dataset Augmentation

    • kaggle.com
    Updated Aug 7, 2020
    Aditya Mishra (2020). Translated Dataset Augmentation [Dataset]. https://www.kaggle.com/aditya08/contradictory-my-dear-watson-translated-dataset/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aditya Mishra
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset contains translated training and test data for the Contradictory, My Dear Watson competition. It was created using the data augmentation trick from @jpmiller's Augmenting Data With Translation kernel. Since the notebook run-time per session is capped at 2 hours for this competition and I wished to train a K-Fold XLM-RoBERTa model, I augmented the data as a preprocessing step to save time. As you can see in my kernel, the entire process took about an hour to complete. The XLM-Roberta | K-Fold kernel demonstrates the use of this dataset. Kindly upvote my kernel above and John's kernel if you find this dataset useful.

    Content

    • train_augmented.csv: This file contains the ID, premise, hypothesis, and label, as well as the language of the text and its two-letter abbreviation. The original competition data had 12120 entries, whereas this file has 24240 rows.
    • test.csv: This file contains the ID, premise, hypothesis, language, and language abbreviation, without labels.

    Acknowledgements

    @jpmiller

    Inspiration

    If you use this dataset for the competition then please share your experiences. Cheers!!

  19. Netflix Prize data

    • kaggle.com
    zip
    Updated Jul 19, 2017
    Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Explore at:
    zip (0 bytes)
    Dataset updated
    Jul 19, 2017
    Dataset authored and provided by
    Netflix (http://netflix.com/)
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
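    The per-movie layout described above is straightforward to parse. A minimal sketch; the sample ids, ratings, and dates are illustrative:

```python
import csv
import io

def parse_movie_file(lines):
    """Parse one per-movie training file: the first line is 'MovieID:',
    each following line is 'CustomerID,Rating,Date'."""
    it = iter(lines)
    movie_id = int(next(it).strip().rstrip(":"))
    ratings = [(int(cust), int(rating), date)
               for cust, rating, date in csv.reader(it)]
    return movie_id, ratings

# Illustrative sample in the documented format.
sample = io.StringIO("8:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")
movie_id, ratings = parse_movie_file(sample)
print(movie_id, len(ratings))  # 8 2
```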

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:

    CustomerID11,Date11

    CustomerID12,Date12

    ...

    MovieID2:

    CustomerID21,Date21

    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:

    3245,2005-12-19

    5666,2005-12-23

    6789,2005-03-14

    225:

    1234,2005-05-26

    3456,2005-11-07

    then a prediction file should look something like:

    111:

    3.0

    3.4

    4.0

    225:

    1.0

    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.
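    A sketch of producing a prediction file in this format from qualifying-style lines; the constant 3.0 below is a placeholder standing in for a real predictor:

```python
# Sketch: turn qualifying-format lines into prediction-file lines,
# replacing each "CustomerID,Date" line with a predicted rating.
def write_predictions(qualifying_lines, predict):
    """predict(movie_id, customer_id) -> float rating."""
    out = []
    movie_id = None
    for line in qualifying_lines:
        line = line.strip()
        if line.endswith(":"):       # movie header line, copied verbatim
            movie_id = int(line[:-1])
            out.append(line)
        elif line:                   # "CustomerID,Date" -> predicted rating
            customer_id = int(line.split(",")[0])
            out.append(f"{predict(movie_id, customer_id):.1f}")
    return out

# Placeholder predictor: always 3.0 stars (stands in for a real model).
qualifying = ["111:", "3245,2005-12-19", "5666,2005-12-23", "225:", "1234,2005-05-26"]
print(write_predictions(qualifying, lambda m, c: 3.0))
# ['111:', '3.0', '3.0', '225:', '3.0']
```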

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:

    CustomerID11

    CustomerID12

    ...

    MovieID2:

    CustomerID21

    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
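    RMSE here is the square root of the mean squared difference between predicted and actual ratings; a small self-contained sketch:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between paired rating lists."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(round(rmse([3.0, 3.4, 4.0], [3, 4, 5]), 4))  # 0.6733
```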

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  20. Random_set

    • kaggle.com
    Updated Feb 6, 2025
    Akshiu (2025). Random_set [Dataset]. https://www.kaggle.com/datasets/akshiu/random-set/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshiu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Random Set Dataset

    Overview

    The Random_Set dataset contains a collection of randomly generated numerical and categorical values. This dataset is ideal for testing machine learning models, statistical analysis, and data preprocessing techniques. It includes a mix of integer, float, categorical, and boolean features, making it versatile for exploratory data analysis (EDA), feature engineering, and algorithm benchmarking.

    Why Use This Dataset?

    ✅ Pre-cleaned & Structured – No missing values, well-organized data.
    ✅ Ideal for ML & Data Science Practice – Test different models and preprocessing techniques.
    ✅ Great for Feature Engineering – Work with different data types (categorical, numerical, boolean).
    ✅ Useful for Statistical & Algorithm Testing – Validate sorting, searching, clustering, and regression methods.

    Potential Use Cases

    📊 Machine Learning Pipeline Testing: Evaluate ML models on random structured data.
    🧪 Feature Engineering Practice: Experiment with feature encoding, scaling, and transformations.
    🎲 Algorithm Benchmarking: Test sorting, clustering, and classification algorithms.
    📈 Data Visualization: Practice creating charts, graphs, and statistical summaries.
    🛠️ Training for Data Science Competitions: Sharpen your skills with synthetic but structured data.

    Source & Acknowledgment

    This dataset is randomly generated using statistical distributions and structured for usability. It is designed for practice, experimentation, and algorithm evaluation rather than real-world analysis.
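    A dataset of this shape is easy to regenerate. The sketch below uses only the standard library; the column names and distributions are illustrative assumptions, not the dataset's actual schema:

```python
import random

random.seed(42)  # reproducible runs

# Illustrative generator for a mixed-type random dataset; the column
# names and distributions are assumptions, not the dataset's real schema.
rows = [
    {
        "int_feature": random.randint(0, 100),          # integer column
        "float_feature": round(random.gauss(0, 1), 4),  # float column
        "category": random.choice(["A", "B", "C"]),     # categorical column
        "flag": random.random() < 0.5,                  # boolean column
    }
    for _ in range(5)
]
for row in rows:
    print(row)
```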

Gemma-Data Science Agent- Instruct- Dataset

Data Science Assistance with Gemma Fine-tuned on Kaggle Solutions Writeup


The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

Dataset Specifics

Total Records: [Specify the total number of question-answer pairs in the dataset]
Format: CSV (Comma Separated Values)
Size: [Specify the size of the dataset in MB or GB]
License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
Download Link: [Provide a link to download the dataset]

Acknowledgments

We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
