41 datasets found
  1. Gemma-Data Science Agent- Instruct- Dataset

    • kaggle.com
    Updated Apr 2, 2024
    Cite
    ian cecil akoto (2024). Gemma-Data Science Agent- Instruct- Dataset [Dataset]. https://www.kaggle.com/datasets/ianakoto/gemma-data-science-agent-instruct-dataset
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 2, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    ian cecil akoto
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains question-answer pairs with context extracted from Kaggle solution write-ups and discussion forums. The dataset was created to facilitate fine-tuning Gemma, an AI model, for data scientist assistant tasks such as question answering and providing data science assistance.

    Dataset Details

    Columns:
    • Question: The question generated based on the context extracted from Kaggle solution write-ups and discussion forums.
    • Answer: The corresponding answer to the generated question.
    • Context: The context extracted from Kaggle solution write-ups and discussion forums, which serves as the basis for generating questions and answers.
    • Subtitle: Subtitle or additional information related to the Kaggle competition or topic.
    • Title: Title of the Kaggle competition or topic.

    Sources and Inspiration

    Sources:

    • Meta Kaggle: The dataset was sourced from Meta Kaggle, an official Kaggle platform where users discuss competitions, kernels, datasets, and more.
    • Kaggle Solution Write-ups: Solution write-ups submitted by Kaggle users were utilized as a primary source of context for generating questions and answers.
    • Discussion Forums: Discussion threads on Kaggle forums were used to gather additional insights and context for the dataset.

    Inspiration:

    The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

    Dataset Specifics
    • Total Records: [Specify the total number of question-answer pairs in the dataset]
    • Format: CSV (Comma Separated Values)
    • Size: [Specify the size of the dataset in MB or GB]
    • License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
    • Download Link: [Provide a link to download the dataset]

    Acknowledgments

    We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
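The columns listed above can be read with Python's standard csv module; a minimal sketch, using an invented in-memory sample row rather than the real file (the actual filename and contents are assumptions):

```python
import csv
import io

# Invented sample mimicking the dataset's columns:
# Question, Answer, Context, Subtitle, Title.
sample = io.StringIO(
    "Question,Answer,Context,Subtitle,Title\n"
    '"What CV scheme was used?","5-fold stratified CV.",'
    '"We used 5-fold stratified cross-validation.","1st place","Some Competition"\n'
)

# Parse rows into dicts keyed by the header names.
rows = list(csv.DictReader(sample))

# Extract (question, answer) pairs, e.g. for instruction fine-tuning.
pairs = [(r["Question"], r["Answer"]) for r in rows]
print(pairs[0][0])  # → What CV scheme was used?
```

With the real dataset, `io.StringIO(...)` would be replaced by an open file handle on the downloaded CSV.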

  2. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (145784025210 bytes)
    Dataset updated
    Jun 19, 2025
    Dataset authored and provided by
    Kaggle: http://kaggle.com/
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files, e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub-folder contains up to 1 thousand files, e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
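Under this layout, the directory holding a given KernelVersions id can be derived arithmetically; a minimal sketch (the file's extension varies by notebook language, so it is omitted here):

```python
def kernel_version_dir(version_id: int) -> str:
    """Derive the two-level folder for a KernelVersions id:
    top folder = millions part of the id, sub folder = thousands part."""
    top = version_id // 1_000_000       # e.g. 123 for id 123,456,789
    sub = (version_id // 1_000) % 1_000  # e.g. 456
    return f"{top}/{sub}"

print(kernel_version_dir(123_456_789))  # → 123/456
```

Joining the resulting file back to the KernelVersions table in Meta Kaggle (on the id) then recovers the competition, author tier, votes, and comments described above.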

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  3. Kaggle-LLM-Science-Exam

    • huggingface.co
    Updated Aug 8, 2023
    Cite
    Sangeetha Venkatesan (2023). Kaggle-LLM-Science-Exam [Dataset]. https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam
    Explore at:
    Croissant
    Dataset updated
    Aug 8, 2023
    Authors
    Sangeetha Venkatesan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for [LLM Science Exam Kaggle Competition]

      Dataset Summary
    

    https://www.kaggle.com/competitions/kaggle-llm-science-exam/data

      Languages
    

    [en, de, tl, it, es, fr, pt, id, pl, ro, so, ca, da, sw, hu, no, nl, et, af, hr, lv, sl]

      Dataset Structure
    

    Columns:
    • prompt - the text of the question being asked
    • A - option A; if this option is correct, then answer will be A
    • B - option B; if this option is correct, then answer will be B
    • C - option C; if this… See the full description on the dataset page: https://huggingface.co/datasets/Sangeetha/Kaggle-LLM-Science-Exam.

  4. ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Top 1000 Kaggle Datasets’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-top-1000-kaggle-datasets-658b/b992f64b/?iid=004-457&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Top 1000 Kaggle Datasets’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/notkrishna/top-1000-kaggle-datasets on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    From wiki

    Kaggle, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

    Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Its key personnel were Anthony Goldbloom and Jeremy Howard. Nicholas Gruen was the founding chair, succeeded by Max Levchin. Equity was raised in 2011, valuing the company at $25 million. On 8 March 2017, Google announced that it was acquiring Kaggle.[1][2]

    Source: Kaggle

    --- Original source retains full ownership of the source dataset ---

  5. How to Win Data Science Competition

    • kaggle.com
    zip
    Updated Jan 30, 2018
    Cite
    Budi Ryan (2018). How to Win Data Science Competition [Dataset]. https://www.kaggle.com/budiryan/how-to-win-data-science-competition
    Explore at:
    Available download formats: zip (15845091 bytes)
    Dataset updated
    Jan 30, 2018
    Authors
    Budi Ryan
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Budi Ryan

    Released under CC0: Public Domain

    Contents

  6. Kaggle

    • registry.identifiers.org
    • bioregistry.io
    Updated Aug 23, 2019
    Cite
    (2019). Kaggle [Dataset]. https://registry.identifiers.org/registry/kaggle
    Explore at:
    Dataset updated
    Aug 23, 2019
    Description

    Kaggle is a platform for sharing data, performing reproducible analyses, interactive data analysis tutorials, and machine learning competitions.

  7. Titanic Dataset - cleaned

    • kaggle.com
    Updated Aug 9, 2019
    Cite
    WinstonSDodson (2019). Titanic Dataset - cleaned [Dataset]. https://www.kaggle.com/winstonsdodson/titanic-dataset-cleaned/kernels
    Explore at:
    Croissant
    Dataset updated
    Aug 9, 2019
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    WinstonSDodson
    Description

    This is the classic Titanic Dataset provided in the Kaggle Titanic competition and then cleaned in one of the most popular Kernels there. Please see the Kernel titled "A Data Science Framework: To Achieve 99% Accuracy" for a great lesson in data science. That Kernel gives a great explanation of the thinking behind this data cleaning, as well as a very professional demonstration of the technologies and skills to do so. It then continues with an overview of many ML techniques and is copiously and meticulously documented with many useful citations.

    Of course, data cleaning is an essential skill in data science, but I wanted to use this data for a study of other machine learning techniques. So I found and used this well-known version of the data, cleaned to a benchmark accepted by many.

  8. Data from: PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton...

    • datadiscoverystudio.org
    • s.cnmilf.com
    • +3more
    html
    Updated Feb 8, 2018
    + more versions
    Cite
    (2018). PlanktonSet 1.0: Plankton imagery data collected from F.G. Walton Smith in Straits of Florida from 2014-06-03 to 2014-06-06 and used in the 2015 National Data Science Bowl (NCEI Accession 0127422). [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/f5a2c6072c47451192a114d51f902e14/html
    Explore at:
    Available download formats: html
    Dataset updated
    Feb 8, 2018
    Description

    Data presented here are a subset of a larger plankton imagery data set collected in the subtropical Straits of Florida from 2014-05-28 to 2014-06-14. Imagery data were collected using the In Situ Ichthyoplankton Imaging System (ISIIS-2) as part of an NSF-funded project to assess the biophysical drivers affecting fine-scale interactions between larval fish, their prey, and predators. This subset of images was used in the inaugural National Data Science Bowl (www.datasciencebowl.com) hosted by Kaggle and sponsored by Booz Allen Hamilton. Data were originally collected to examine the biophysical drivers affecting fine-scale (spatial) interactions between larval fish, their prey, and predators in a subtropical pelagic marine ecosystem. Image segments extracted from the raw data were sorted into 121 plankton classes, split 50:50 into train and test data sets, and provided for a machine learning competition (the National Data Science Bowl). There were no hierarchical relationships explicit in the 121 plankton classes, though the class naming convention and a tree-like diagram (see file "Plankton Relationships.pdf") indicated relationships between classes, whether taxonomic or structural (size and shape). We intend for this dataset to be available to the machine learning and computer vision community as a standard machine learning benchmark. This "Plankton 1.0" dataset is a medium-size dataset with a fair amount of complexity where image classification improvements can still be made.

  9. Competition Dataset: Center of Policing Equity

    • kaggle.com
    zip
    Updated Nov 26, 2018
    Cite
    Shivam Bansal (2018). Competition Dataset: Center of Policing Equity [Dataset]. https://www.kaggle.com/shivamb/external-datasets-cpe
    Explore at:
    Available download formats: zip (198124223 bytes)
    Dataset updated
    Nov 26, 2018
    Authors
    Shivam Bansal
    Description

    Dataset

    This dataset was created by Shivam Bansal

    Contents

    It contains the following files:

  10. LLM Science Dataset

    • kaggle.com
    Updated Aug 7, 2023
    Cite
    Zhecheng Li (2023). LLM Science Dataset [Dataset]. https://www.kaggle.com/datasets/lizhecheng/llm-science-dataset/suggestions
    Explore at:
    Croissant
    Dataset updated
    Aug 7, 2023
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Zhecheng Li
    License

    CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/

    Description

    Version 3 contains 6 datasets.

    1 - The original training dataset in LLM Science Exam

    2 - 6.0k train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    3 - 500 train examples for LLM Science Exam from RADEK OSMULSKI, the dataset link is here

    4 - 600 train examples collected by Zhecheng LI using ChatGPT-3.5 here

    5 - wikipedia-stem-1k dataset collected by LEONID KULYK, the dataset link is here

    6 - MMLU Dataset: I chose about 3,600 examples that are suitable for fine-tuning for this competition; the original dataset I have published here

    Thanks for their contribution to this competition and many NLP projects.

  11. The Quest Dataset

    • kaggle.com
    Updated Nov 26, 2024
    Cite
    Jules King (2024). The Quest Dataset [Dataset]. https://www.kaggle.com/datasets/julesking/the-quest-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Jules King
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Learning Agency Lab’s data science competition, “The Quest for Quality Questions: Improving Reading Comprehension through Automated Question Generation,” was designed to build AI algorithms that can automatically generate questions that test young learners’ reading comprehension.

    As many educators and researchers know, questions are key in teaching and evaluating narrative comprehension skills in young learners. However, generating high-quality reading comprehension queries is time consuming, which limits the number of texts that young readers can engage with in this way. Datasets can help by informing quality question automation.

    The Quest challenge dataset can be accessed on this page and was aided by foundational data from the Lab’s FairytaleQA dataset of 10,580 questions. Those queries were created to address gaps in similar datasets, which often overlooked fine-grained reading skills that showcase an understanding of varying narrative elements.

    The Quest was made possible by The Learning Agency Lab, Mark Warschauer at UC Irvine, and Ying Xu at The University of Michigan School of Education. More can be found about the creators here.

    Quest dataset © 2024 by The Learning Agency Lab is licensed under CC BY 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/

    Competition - https://www.thequestchallenge.org/

    Publications - Xu, Y., Wang, D., Yu, M., Ritchie, D., Yao, B., Wu, T., ... & Warschauer, M. (2022). Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension. arXiv preprint arXiv:2203.13947.

  12. Data Science Bowl 2018 Competition - Merged Mask

    • kaggle.com
    Updated Jan 5, 2024
    Cite
    Zenitsu157 (2024). Data Science Bowl 2018 Competition - Merged Mask [Dataset]. https://www.kaggle.com/datasets/mahmudulhasantasin/data-science-bowl-2018-competition-merged-mask/discussion
    Explore at:
    Croissant
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Zenitsu157
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a large number of segmented nuclei images. The images were acquired under a variety of conditions and vary in the cell type, magnification, and imaging modality (brightfield vs. fluorescence). The dataset is designed to challenge an algorithm's ability to generalize across these variations.

    Each image is represented by an associated ImageId. Files belonging to an image are contained in a folder with this ImageId. Within this folder are two subfolders:

    • images contains the image file.
    • masks contains the segmented masks of each nucleus. This folder is only included in the training set. Each mask contains one nucleus. Masks are not allowed to overlap (no pixel belongs to two masks).

    The second stage dataset will contain images from unseen experimental conditions. To deter hand labeling, it will also contain images that are ignored in scoring. The metric used to score this competition requires that your submissions are in run-length encoded format. Please see the evaluation page for details.
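The run-length encoded submission format can be sketched as follows; this is a minimal illustration that encodes an already-flattened binary mask into (start, length) pairs with 1-based pixel indices, leaving the flattening order (the competition numbers pixels top-to-bottom, then left-to-right) to the caller:

```python
def rle_encode(flat_mask):
    """Run-length encode a flattened binary mask as (start, length) pairs.
    Pixel indices are 1-based, matching the submission format described above;
    the caller is responsible for flattening in the expected column-major order."""
    runs = []
    start = None
    for i, v in enumerate(flat_mask, start=1):
        if v and start is None:
            start = i                       # a run of mask pixels begins
        elif not v and start is not None:
            runs.append((start, i - start))  # the run just ended
            start = None
    if start is not None:                    # mask extends to the last pixel
        runs.append((start, len(flat_mask) - start + 1))
    return runs

print(rle_encode([0, 1, 1, 0, 1]))  # → [(2, 2), (5, 1)]
```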

    As with any human-annotated dataset, you may find various forms of errors in the data. You may manually correct errors you find in the training set. The dataset will not be updated/re-released unless it is determined that there are a large number of systematic errors.

  13. Football Analytics (Event data)

    • kaggle.com
    Updated Aug 25, 2020
    + more versions
    Cite
    HARDIK AGARWAL (2020). Football Analytics (Event data) [Dataset]. https://www.kaggle.com/datasets/hardikagarwal1/football-analytics-event-data-statsbomb/data
    Explore at:
    Croissant
    Dataset updated
    Aug 25, 2020
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    HARDIK AGARWAL
    License

    CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Most publicly available football (soccer) statistics are limited to aggregated data such as Goals, Shots, Fouls, Cards. When assessing performance or building predictive models, this simple aggregation, without any context, can be misleading. For example, a team that produced 10 shots on target from long range has a lower chance of scoring than a club that produced the same number of shots from inside the box. However, metrics derived from this simple count of shots will assess the two teams similarly.

    A football game generates hundreds of events and it is very important and interesting to take into account the context in which those events were generated. This incredibly rich data set should keep football analytics enthusiasts awake for long hours as the size of the data set and number of questions that can be asked is huge.

    Content

    There are 4 main files containing the data:

    1) Competition data: Contains information regarding competition id, competition name, season id, season name, country and gender.

    2) Match data: Match information for each match including competition and season information, stadium and referee information, home and away team information as well as the data version the match was collected under.

    3) Lineup data: Records the lineup information for the players, managers and referees involved with each match. The following variables are collected in the lineups of each match: team id, team name and lineup. The lineup array is a nested data frame inside of the lineup object; it contains the following information for each team: player id, player name, player nickname, jersey number and country.

    4) Event data: Event data comprises general attributes and event-specific attributes. General attributes are recorded for most event types, depending only on applicability. Event-specific attributes help describe the event type in more detail as well as describe the outcome of the event type.
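As an illustration of working with the nested lineup records described above, a minimal sketch of flattening one team's lineup into per-player rows; the dictionary keys and values here are assumptions modeled on the description, not the exact StatsBomb schema:

```python
# Invented lineup record following the fields described above
# (team id, team name, nested lineup array of players).
lineup = {
    "team_id": 217,
    "team_name": "Barcelona",
    "lineup": [
        {"player_id": 5503, "player_name": "Lionel Messi",
         "jersey_number": 10, "country": "Argentina"},
        {"player_id": 5211, "player_name": "Jordi Alba",
         "jersey_number": 18, "country": "Spain"},
    ],
}

# Flatten: one row per player, carrying the team name alongside.
rows = [{"team": lineup["team_name"], **player} for player in lineup["lineup"]]
print(rows[0]["player_name"])  # → Lionel Messi
```

The same flattening pattern applies per match once the real lineup JSON files are loaded; consult the open data specification document mentioned below for the authoritative field names.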

    The open data specification document in the doc folder describes the structure of the data along with all attributes in great detail. Take a look at this file for deeper understanding of the data.

    Acknowledgements

    This data is from the StatsBomb Open Data repository. StatsBomb are committed to sharing new data and research publicly to enhance understanding of the game of Football. They want to actively encourage new research and analysis at all levels. Therefore they have made certain leagues of StatsBomb Data freely available for public use for research projects and genuine interest in football analytics.

    Inspiration

    There are many, many questions we can ask with such detailed event data. Here are just a few examples:
    • What is the value of a shot? What is the probability of a shot being a goal given its location, shooter, league, assist method, game state, number of players on the pitch, and time? (These are known as expected goals, or xG, models.)
    • When are teams more likely to score?
    • Which teams are the best or sloppiest at holding a lead?
    • Which teams or players make the best use of set pieces?
    • How do players compare when they shoot with their weak foot versus their strong foot? Which players are ambidextrous?
    • Can we identify different styles of play (shooting from long range vs. shooting from the box, crossing the ball vs. passing the ball, use of headers)?
    • Which teams have a bias for attacking on a particular flank?

  14. Life Expectancy 1960 to present (Global)

    • kaggle.com
    Updated Mar 13, 2025
    Cite
    Frederick Salazar Sanchez (2025). Life Expectancy 1960 to present (Global) [Dataset]. https://www.kaggle.com/datasets/fredericksalazar/life-expectancy-1960-to-present-global
    Explore at:
    Croissant
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Frederick Salazar Sanchez
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PLEASE if you use or like this dataset UPVOTE 👁️

    This dataset offers a detailed historical record of global life expectancy, covering data from 1960 to the present. It is meticulously curated to enable deep analysis of trends and gender disparities in life expectancy worldwide.

    Dataset Structure & Key Columns:

    Country Code (🔤): Unique identifier for each country.

    Country Name (🌍): Official name of the country.

    Region (🌐): Broad geographical area (e.g., Asia, Europe, Africa).

    Sub-Region (🗺️): More specific regional classification within the broader region.

    Intermediate Region (🔍): Additional granular geographical grouping when applicable.

    Year (📅): The specific year to which the data pertains.

    Life Expectancy for Women (👩‍⚕️): Average years a woman is expected to live in that country and year.

    Life Expectancy for Men (👨‍⚕️): Average years a man is expected to live in that country and year.

    Context & Use Cases:

    This dataset is a rich resource for exploring long-term trends in global health and demography. By comparing life expectancy data over decades, researchers can:

    Analyze Time Series Trends: Forecast future changes in life expectancy and evaluate the impact of health interventions over time.

    Study Gender Disparities: Investigate the differences between life expectancy for women and men, providing insights into social, economic, and healthcare factors influencing these trends.

    Regional & Sub-Regional Analysis: Compare and contrast life expectancy across various regions and sub-regions to understand geographical disparities and their underlying causes.

    Support Public Policy Research: Inform policymakers by linking life expectancy trends with public health policies, socioeconomic developments, and other key indicators.

    Educational & Data Science Applications: Serve as a comprehensive teaching tool for courses on public health, global development, and data analysis, as well as for Kaggle competitions and projects.

    With its detailed, structured format and broad temporal coverage, this dataset is ideal for anyone looking to gain a nuanced understanding of global health trends and to drive impactful analyses in public health, social sciences, and beyond.
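For instance, the gender-disparity analysis described above starts from a per-record gap between the two life-expectancy columns; a minimal sketch using the column names listed earlier (the exact CSV headers are an assumption, and the record is invented for illustration):

```python
# One invented record shaped like the columns described above.
records = [
    {"Country Name": "Japan", "Year": 2020,
     "Life Expectancy for Women": 87.7, "Life Expectancy for Men": 81.6},
]

# Gender gap = women's minus men's life expectancy, per country-year.
for r in records:
    gap = r["Life Expectancy for Women"] - r["Life Expectancy for Men"]
    print(f'{r["Country Name"]} {r["Year"]}: gap = {gap:.1f} years')
# → Japan 2020: gap = 6.1 years
```

Aggregating these gaps by Region or Sub-Region, or tracking them across Year, covers the regional and time-series use cases listed above.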

    Feel free to ask for further customizations or additional details as needed!

  15. Tweet Sentiment Extraction JSON

    • kaggle.com
    zip
    Updated Mar 31, 2020
    Cite
    Vaishvik (2020). Tweet Sentiment Extraction JSON [Dataset]. https://www.kaggle.com/vaishvik25/tweet-sentiment-extraction-json
    Explore at:
    Available download formats: zip (1452508 bytes)
    Dataset updated
    Mar 31, 2020
    Authors
    Vaishvik
    Description

    Dataset

    This dataset was created by Vaishvik

    Contents

    It contains the following files:

  16. Champions League 23/24

    • kaggle.com
    Updated May 24, 2024
    Cite
    Sharvagya (2024). Champions League 23/24 [Dataset]. http://doi.org/10.34740/kaggle/ds/5071658
    Explore at:
    Croissant
    Dataset updated
    May 24, 2024
    Dataset provided by
    Kaggle
    Authors
    Sharvagya
    Description

    Champions League 2023/2024 Dataset

    Overview

    This dataset provides detailed statistics for the UEFA Champions League 2023/2024 season, focusing on team performance across various metrics. The data is sourced from FBref, a comprehensive platform for football statistics. This single-table dataset includes metrics such as matches played, wins, losses, goals scored, expected goals (xG), and more for each team participating in the Champions League.

    Dataset Content

    The dataset is structured as a single CSV file with the following headers:

    • Rk: Rank of the team based on the stage of the competition reached.
    • Country: The country of the club.
    • Squad: The name of the club.
    • MP: Matches played.
    • W: Matches won.
    • D: Matches drawn.
    • L: Matches lost.
    • GF: Goals for - total goals scored by the team.
    • GA: Goals against - total goals conceded by the team.
    • GD: Goal difference (GF - GA).
    • Pts: Total points accumulated by the team
    • xG: Expected goals - a metric that estimates the number of goals a team should have scored based on the quality of their chances.
    • xGA: Expected goals against - a metric that estimates the number of goals a team should have conceded based on the quality of chances they allowed.
    • xGD: Expected goal difference (xG - xGA).
    • xGD/90: Expected goal difference per 90 minutes.
    • Last 5: Results of the last 5 matches (e.g., WWDWL for 3 wins, 1 draw, and 1 loss).
    • Attendance: Average attendance for home matches.
    • Top Team Scorer: The name of the top scorer for the team.
    • Goalkeeper: The name of the main goalkeeper for the team.
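    The derived columns above can be sanity-checked directly from their definitions. A minimal sketch using only the standard library; the sample row is illustrative, not taken from the actual file:

```python
import csv
import io

# Minimal sketch of reading the table with the standard library; the
# sample row below is illustrative, not taken from the actual file.
sample = io.StringIO(
    "Rk,Country,Squad,MP,W,D,L,GF,GA,GD,Pts,xG,xGA,xGD\n"
    "1,es,Real Madrid,13,9,3,1,28,15,13,30,25.1,16.3,8.8\n"
)
rows = list(csv.DictReader(sample))

for row in rows:
    # The derived columns follow their definitions:
    # GD = GF - GA and xGD = xG - xGA.
    assert int(row["GD"]) == int(row["GF"]) - int(row["GA"])
    assert abs(float(row["xGD"]) - (float(row["xG"]) - float(row["xGA"]))) < 1e-9
print(rows[0]["Squad"], rows[0]["GD"])  # Real Madrid 13
```

    The same row-by-row checks apply when reading the full CSV from disk.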

    Data Source

    The data has been scraped from FBref, a well-known source for football statistics. FBref provides detailed and historical data for various football competitions worldwide, including the UEFA Champions League.

    Acknowledgements

    • FBref: For providing the comprehensive data used to compile this dataset.
    • Kaggle: For hosting and facilitating data science competitions and datasets.
  17. 📊 Meta Kaggle| Kaggle Users' Stats

    • kaggle.com
    zip
    Updated Jun 4, 2025
    BwandoWando (2025). 📊 Meta Kaggle| Kaggle Users' Stats [Dataset]. https://www.kaggle.com/datasets/bwandowando/meta-kaggle-users-stats/suggestions
    Explore at:
    zip (0 bytes)
    Dataset updated
    Jun 4, 2025
    Authors
    BwandoWando
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    History

    • 03Mar2025 - When determining last content shared, I now use the latest version of each Model, Dataset, and Notebook rather than the creation date of the very first version. I also added reaction counts, a new csv in the Meta Kaggle dataset (the discussion can be found here), and versions created for Model, Notebook, and Dataset to properly track users who update their datasets.
    • 04Feb2025 - Fixed the issue of ModelUpvotesGiven and ModelUpvotesReceived values being identical

    Context

    Aggregated user stats and data derived from the official Meta Kaggle dataset

    Note

    Expect some discrepancies from the counts seen in your profile: aside from a lag of one to two days before new data is published, some information, such as Kaggle staff upvotes and private competitions, is not included. For almost all members, however, the figures should reconcile.

    Notebook updater

    📊 (Scheduled) Meta Kaggle Users' Stats

    Image: generated with the Bing image generator

  18. Translated Dataset Augmentation

    • kaggle.com
    Updated Aug 7, 2020
    Aditya Mishra (2020). Translated Dataset Augmentation [Dataset]. https://www.kaggle.com/aditya08/contradictory-my-dear-watson-translated-dataset/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aditya Mishra
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The dataset contains translated training and test data for the Contradictory, My Dear Watson competition. It was created using the data augmentation trick from @jpmiller's Augmenting Data With Translation kernel. Since the notebook run-time per session is capped at 2 hours for this competition and I wished to train a K-Fold XLM-RoBERTa model, I augmented the data as a preprocessing step to save time. As you can see in my kernel, the entire process took about an hour to complete. The XLM-Roberta | K-Fold kernel demonstrates the use of this dataset. Kindly upvote my kernel above and John's kernel if you find this dataset useful.

    Content

    • train_augmented.csv: This file contains the ID, premise, hypothesis, and label, as well as the language of the text and its two-letter abbreviation. The original competition data had 12120 entries, whereas this file has 24240 rows.
    • test.csv: This file contains the ID, premise, hypothesis, language, and language abbreviation, without labels.

    Acknowledgements

    @jpmiller

    Inspiration

    If you use this dataset for the competition then please share your experiences. Cheers!!

  19. Netflix Prize data

    • kaggle.com
    zip
    Updated Jul 19, 2017
    Netflix (2017). Netflix Prize data [Dataset]. https://www.kaggle.com/netflix-inc/netflix-prize-data
    Explore at:
    zip (0 bytes)
    Dataset updated
    Jul 19, 2017
    Dataset authored and provided by
    Netflix (http://netflix.com/)
    Description

    Context

    Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.

    Content

    This comes directly from the README:

    TRAINING DATASET FILE DESCRIPTION

    The file "training_set.tar" is a tar of a directory containing 17770 files, one per movie. The first line of each file contains the movie id followed by a colon. Each subsequent line in the file corresponds to a rating from a customer and its date in the following format:

    CustomerID,Rating,Date

    • MovieIDs range from 1 to 17770 sequentially.
    • CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
    • Ratings are on a five star (integral) scale from 1 to 5.
    • Dates have the format YYYY-MM-DD.
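    The per-movie layout described above is straightforward to parse. A minimal sketch; the sample ids, ratings, and dates are illustrative:

```python
import csv
import io

def parse_movie_file(lines):
    """Parse one per-movie training file: the first line is 'MovieID:',
    each following line is 'CustomerID,Rating,Date'."""
    it = iter(lines)
    movie_id = int(next(it).strip().rstrip(":"))
    ratings = [(int(cust), int(rating), date)
               for cust, rating, date in csv.reader(it)]
    return movie_id, ratings

# Illustrative sample in the documented format.
sample = io.StringIO("8:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")
movie_id, ratings = parse_movie_file(sample)
print(movie_id, len(ratings))  # 8 2
```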

    MOVIES FILE DESCRIPTION

    Movie information in "movie_titles.txt" is in the following format:

    MovieID,YearOfRelease,Title

    • MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
    • YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
    • Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.

    QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION

    The qualifying dataset for the Netflix Prize is contained in the text file "qualifying.txt". It consists of lines indicating a movie id, followed by a colon, and then customer ids and rating dates, one per line for that movie id. The movie and customer ids are contained in the training set. Of course the ratings are withheld. There are no empty lines in the file.

    MovieID1:

    CustomerID11,Date11

    CustomerID12,Date12

    ...

    MovieID2:

    CustomerID21,Date21

    CustomerID22,Date22

    For the Netflix Prize, your program must predict all the ratings the customers gave the movies in the qualifying dataset, based on the information in the training dataset.

    The format of your submitted prediction file follows the movie and customer id, date order of the qualifying dataset. However, your predicted rating takes the place of the corresponding customer id (and date), one per line.

    For example, if the qualifying dataset looked like:

    111:

    3245,2005-12-19

    5666,2005-12-23

    6789,2005-03-14

    225:

    1234,2005-05-26

    3456,2005-11-07

    then a prediction file should look something like:

    111:

    3.0

    3.4

    4.0

    225:

    1.0

    2.0

    which predicts that customer 3245 would have rated movie 111 3.0 stars on the 19th of December, 2005, that customer 5666 would have rated it slightly higher at 3.4 stars on the 23rd of December, 2005, etc.

    You must make predictions for all customers for all movies in the qualifying dataset.
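    A sketch of producing a prediction file in this format from qualifying-style lines; the constant 3.0 below is a placeholder standing in for a real predictor:

```python
# Sketch: turn qualifying-format lines into prediction-file lines,
# replacing each "CustomerID,Date" line with a predicted rating.
def write_predictions(qualifying_lines, predict):
    """predict(movie_id, customer_id) -> float rating."""
    out = []
    movie_id = None
    for line in qualifying_lines:
        line = line.strip()
        if line.endswith(":"):       # movie header line, copied verbatim
            movie_id = int(line[:-1])
            out.append(line)
        elif line:                   # "CustomerID,Date" -> predicted rating
            customer_id = int(line.split(",")[0])
            out.append(f"{predict(movie_id, customer_id):.1f}")
    return out

# Placeholder predictor: always 3.0 stars (stands in for a real model).
qualifying = ["111:", "3245,2005-12-19", "5666,2005-12-23", "225:", "1234,2005-05-26"]
print(write_predictions(qualifying, lambda m, c: 3.0))
# ['111:', '3.0', '3.0', '225:', '3.0']
```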

    THE PROBE DATASET FILE DESCRIPTION

    To allow you to test your system before you submit a prediction set based on the qualifying dataset, we have provided a probe dataset in the file "probe.txt". This text file contains lines indicating a movie id, followed by a colon, and then customer ids, one per line for that movie id.

    MovieID1:

    CustomerID11

    CustomerID12

    ...

    MovieID2:

    CustomerID21

    CustomerID22

    Like the qualifying dataset, the movie and customer id pairs are contained in the training set. However, unlike the qualifying dataset, the ratings (and dates) for each pair are contained in the training dataset.

    If you wish, you may calculate the RMSE of your predictions against those ratings and compare your RMSE against the Cinematch RMSE on the same data. See http://www.netflixprize.com/faq#probe for that value.
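    RMSE here is the square root of the mean squared difference between predicted and actual ratings; a small self-contained sketch:

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between paired rating lists."""
    assert len(predicted) == len(actual) and actual
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(round(rmse([3.0, 3.4, 4.0], [3, 4, 5]), 4))  # 0.6733
```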

    Acknowledgements

    The training data came in 17,000+ files. In the interest of keeping files together and file sizes as low as possible, I combined them into four text files: combined_data_(1,2,3,4).txt

    The contest was originally hosted at http://netflixprize.com/index.html

    The dataset was downloaded from https://archive.org/download/nf_prize_dataset.tar

    Inspiration

    This is a fun dataset to work with. You can read about the winning algorithm by BellKor's Pragmatic Chaos here

  20. Random_set

    • kaggle.com
    Updated Feb 6, 2025
    Akshiu (2025). Random_set [Dataset]. https://www.kaggle.com/datasets/akshiu/random-set/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 6, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshiu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Random Set Dataset

    Overview

    The Random_Set dataset contains a collection of randomly generated numerical and categorical values. This dataset is ideal for testing machine learning models, statistical analysis, and data preprocessing techniques. It includes a mix of integer, float, categorical, and boolean features, making it versatile for exploratory data analysis (EDA), feature engineering, and algorithm benchmarking.

    Why Use This Dataset?

    ✅ Pre-cleaned & Structured – No missing values, well-organized data.
    ✅ Ideal for ML & Data Science Practice – Test different models and preprocessing techniques.
    ✅ Great for Feature Engineering – Work with different data types (categorical, numerical, boolean).
    ✅ Useful for Statistical & Algorithm Testing – Validate sorting, searching, clustering, and regression methods.

    Potential Use Cases

    📊 Machine Learning Pipeline Testing: Evaluate ML models on random structured data.
    🧪 Feature Engineering Practice: Experiment with feature encoding, scaling, and transformations.
    🎲 Algorithm Benchmarking: Test sorting, clustering, and classification algorithms.
    📈 Data Visualization: Practice creating charts, graphs, and statistical summaries.
    🛠️ Training for Data Science Competitions: Sharpen your skills with synthetic but structured data.

    Source & Acknowledgment

    This dataset is randomly generated using statistical distributions and structured for usability. It is designed for practice, experimentation, and algorithm evaluation rather than real-world analysis.
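    A dataset of this shape is easy to regenerate. The sketch below uses only the standard library; the column names and distributions are illustrative assumptions, not the dataset's actual schema:

```python
import random

random.seed(42)  # reproducible runs

# Illustrative generator for a mixed-type random dataset; the column
# names and distributions are assumptions, not the dataset's real schema.
rows = [
    {
        "int_feature": random.randint(0, 100),          # integer column
        "float_feature": round(random.gauss(0, 1), 4),  # float column
        "category": random.choice(["A", "B", "C"]),     # categorical column
        "flag": random.random() < 0.5,                  # boolean column
    }
    for _ in range(5)
]
for row in rows:
    print(row)
```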

Gemma-Data Science Agent- Instruct- Dataset

Data Science Assistance with Gemma Fine-tuned on Kaggle Solutions Writeup


The dataset was inspired by the need for a specialized dataset tailored for fine-tuning Gemma, an AI model designed for data scientist assistant tasks. The goal was to create a dataset that captures the essence of real-world data science problems discussed on Kaggle, enabling Gemma to provide accurate and relevant assistance to data scientists and Kaggle users.

Dataset Specifics

Total Records: [Specify the total number of question-answer pairs in the dataset]
Format: CSV (Comma Separated Values)
Size: [Specify the size of the dataset in MB or GB]
License: [Specify the license under which the dataset is distributed, e.g., CC BY-SA 4.0]
Download Link: [Provide a link to download the dataset]

Acknowledgments

We acknowledge Kaggle and its community for providing valuable data science resources and discussions that contributed to the creation of this dataset. We appreciate the efforts of Gemma and Langchain in fine-tuning AI models for data scientist assistant tasks, enabling enhanced productivity and efficiency in the field of data science.
