Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This comprehensive CSV dataset compiles historical features of NCAA basketball teams participating in March Madness tournaments from 2012 to 2023. The dataset includes a rich array of performance metrics aimed at analyzing team dynamics and competitiveness. Key features encompass win-loss percentage, advanced metrics like Simple Rating System (SRS), Strength of Schedule (SOS), field goal percentage (FG%), three-point percentage (3P%), free throw percentage (FT%), home and away win rates, conference win rates, and point differential percentage.
Additionally, advanced statistical insights are provided, such as adjusted efficiency margin (AdjEM), adjusted offensive efficiency (AdjO), adjusted defensive efficiency (AdjD), adjusted tempo (AdjT), luck factor, adjusted strength of schedule (SOS AdjEM), average adjusted offensive efficiency of opposing teams (OppO), average adjusted defensive efficiency of opposing teams (OppD), and non-conference adjusted strength of schedule (NCSOS AdjEM). This dataset serves as a valuable resource for researchers, analysts, and enthusiasts seeking to delve into the intricate performance dynamics of collegiate basketball teams during the March Madness era.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. The dashboard is featured in a Domo blog post: Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard
This dataset offers one of the most robust resources you will find for discovering key insights through data science and data analytics using historical NCAA Division 1 men's basketball data. The data, sourced from KenPom, goes as far back as 2002 and is updated with the latest 2025 data. It is meticulously structured to provide every piece of information that I could pull from the site, as an open-source tool for March Madness analysis.
Key features of the dataset include:
- Historical Data: Provides all historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints from KenPom's website. Please note that the Height/Experience data only goes as far back as 2007, but every other source contains data from 2002 onward.
- Data Granularity: This dataset features an individual line item for every NCAA Division 1 men's basketball team in every season that contains every KenPom metric that you can possibly think of. This dataset has the ability to serve as a single source of truth for your March Madness analysis and provide you with the granularity necessary to perform any type of analysis you can think of.
- 2025 Tournament Insights: Contains all seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually update this dataset with the seed and region information for previous tournaments as I continue to work on this dataset.
These datasets were created by downloading the raw CSV files for each season from the various sections of KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of these raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, the column headers for each previous season are standardized to the current 2025 naming structure so that all of the historical data can be viewed under the same field names. The cleaned datasets are then appended together, and some additional cleanup takes place before the intermediate (INT) datasets that are uploaded to this Kaggle dataset are created.
Once all of the INT datasets were created, I joined the tables together on team name and season so that all of these metrics can be viewed in a single view. From there, I joined an NCAAM Conference & ESPN Team Name Mapping table to add each conference's full name and acronym, as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all the conferences a team has been part of since 2002 and the team names that KenPom has used historically, so it is necessary for mapping all of the teams properly and for differentiating historical conferences from current ones.
I then joined a reference table that includes all current NCAAM coaches and their active coaching lengths, because active coaching length typically correlates with a team's success in the March Madness tournament. I also joined a reference table of the historical postseason teams in the March Madness, NIT, CBI, and CIT tournaments, and another that identifies the teams ranked in the top 12 of the AP Top 25 during week 6 of the respective NCAA season.
After some additional data clean-up, all of this cleaned data exports into the "DEV _ March Madness" file that contains the consolidated view of all of this data.
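The header standardization and the join on team name and season described above were done in Domo's Magic ETL. As a rough illustration of the same two steps outside Domo, here is a minimal Python sketch. The column names in `HEADER_MAP`, the sample rows, and the helper names `standardize` and `left_join` are all hypothetical and are not taken from KenPom's files or from the author's dataflow.

```python
# Hypothetical header variants from older seasons mapped to the current name.
# These column names are illustrative, not KenPom's actual headers.
HEADER_MAP = {"TeamName": "Team", "AdjOE": "AdjO", "AdjDE": "AdjD"}


def standardize(rows, header_map=HEADER_MAP):
    """Rename each row's keys to the current naming structure."""
    return [{header_map.get(key, key): value for key, value in row.items()}
            for row in rows]


def left_join(left, right, keys=("Team", "Season")):
    """Left-join two lists of row dicts on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        joined.append({**match, **row})
    return joined


# Made-up values, for illustration only.
season_2003 = [{"TeamName": "Example U", "Season": 2003, "AdjOE": 110.0}]
season_2025 = [{"Team": "Example U", "Season": 2025, "AdjO": 120.0}]
four_factors = [{"Team": "Example U", "Season": 2025, "eFG": 55.0}]

# Append the standardized seasons, then join the second table on team and season.
efficiency = standardize(season_2003) + standardize(season_2025)
combined = left_join(efficiency, four_factors)
for row in combined:
    print(row)
```

A left join keeps every team-season row even when a source has no match, which mirrors the note that Height/Experience data only starts in 2007.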
This dataset provides users with the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. This dataset is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies that result in the excitement that Division 1 men's college basketball provides every year in March. Whether you are using this dataset for academic research, personal interest, or professional interest, I hope this dataset serves as a foundational tool for exploring the vast landscape of college basketball's most riveting and anticipated event of its season.
This folder contains data behind the 2014 NCAA Tournament Predictions.
This dataset was scraped from FiveThirtyEight - march-madness-predictions ...
Data taken from https://www.kaggle.com/datasets/andrewsundberg/college-basketball-dataset and updated with data from https://barttorvik.com/
TEAM: The Division I college basketball school
CONF: The Athletic Conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (Chance of beating an average Division I team)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (How often the given team shoots Free Throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champion = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
YEAR: Season
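To show how the columns above might be read together, here is a small Python sketch that parses one row and derives a win percentage, an efficiency margin, and a readable tournament finish. The sample row is made up, the `adj_margin` field is a quantity derived here for illustration rather than a column in the dataset, and the helper name `summarize` is an assumption.

```python
import csv
import io

# Made-up row using a subset of the columns described above.
SAMPLE = """TEAM,G,W,ADJOE,ADJDE,POSTSEASON
Example U,35,28,118.0,94.0,E8
"""

# Round codes as listed in the POSTSEASON description.
ROUND_NAMES = {
    "R68": "First Four", "R64": "Round of 64", "R32": "Round of 32",
    "S16": "Sweet Sixteen", "E8": "Elite Eight", "F4": "Final Four",
    "2ND": "Runner-up", "Champion": "Champion",
}


def summarize(row):
    """Derive a few simple quantities from one team-season row."""
    games, wins = int(row["G"]), int(row["W"])
    return {
        "team": row["TEAM"],
        "win_pct": wins / games,
        # Derived here for illustration; not a column in the dataset.
        "adj_margin": float(row["ADJOE"]) - float(row["ADJDE"]),
        "finish": ROUND_NAMES.get(row["POSTSEASON"], "No tournament result"),
    }


summaries = [summarize(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(summaries[0])
```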
During the 2025 edition of the NCAA Division I Men's Basketball Championship, the average TV viewership in the United States stood at **** million viewers. This represented an increase of ***** percent from the previous year.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Analysis of ‘March Madness 2018’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/march-madness-2018e on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This file contains links to the data behind our 2018 March Madness Predictions.
fivethirtyeight_ncaa_forecasts.csv contains power ratings for each team and the chance of each team reaching every round of the tournament. It includes men's and women's forecasts, with one forecast for each day of the tournament.
Source: https://github.com/fivethirtyeight/data/tree/master/march-madness-predictions-2018
This dataset was created by FiveThirtyEight and contains around 600 samples along with Rd1 Win, Rd7 Win, technical information and other features such as: - Team Id - Playin Flag - and more.
- Analyze Team Region in relation to Team Name
- Study the influence of Gender on Rd5 Win
If you use this dataset in your research, please credit FiveThirtyEight
--- Original source retains full ownership of the source dataset ---
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Sam Pochyly
Released under CC0: Public Domain
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2016 NCAA basketball tournament. This dataset contains the 1070 selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.
How can this data be used? You can pivot it to look at both Kaggle and NCAA teams alike. You can look at who will win games, which games will be close, which games are hardest to forecast, or which Kaggle teams are gambling vs. sticking to the data.
The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.
Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder and named according to:
TeamName_TeamId_SubmissionId.csv
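As a sketch of how a file name in this pattern could be split into its parts, the Python below splits from the right so that underscores inside the team name are preserved. The function name and the example file name are hypothetical, and the ids are kept as strings because the source does not state their format.

```python
def parse_prediction_filename(filename):
    """Split 'TeamName_TeamId_SubmissionId.csv' into its three parts."""
    stem = filename[:-len(".csv")] if filename.endswith(".csv") else filename
    # Split from the right: the team name itself may contain underscores.
    team_name, team_id, submission_id = stem.rsplit("_", 2)
    return team_name, team_id, submission_id


# Hypothetical file name, for illustration only.
print(parse_prediction_filename("Some_Team_12345_678901.csv"))
```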
The file format contains a probability prediction for every possible game between the 68 teams. This is necessary to cover every possible tournament outcome. Each team has a unique numerical Id (given in Teams.csv). Each game has a unique Id column created by concatenating the year and the two team Ids. The format is the following:
Id,Pred
2016_1112_1114,0.6
2016_1112_1122,0
...
The team with the lower numerical Id is always listed first. “Pred” represents the probability that the team with the lower Id beats the team with the higher Id. For example, "2016_1112_1114,0.6" indicates team 1112 has a 0.6 probability of beating team 1114.
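The Id convention above can be sketched in Python as follows. The sketch parses the two sample rows shown earlier and looks up a win probability from either team's point of view, flipping the probability when the team asked about has the higher Id. The function names are assumptions, and the caller is expected to skip the `Id,Pred` header line.

```python
def parse_predictions(lines):
    """Map (year, low_id, high_id) to the probability that low_id wins."""
    preds = {}
    for line in lines:
        game_id, pred = line.strip().split(",")
        year, low_id, high_id = (int(part) for part in game_id.split("_"))
        preds[(year, low_id, high_id)] = float(pred)
    return preds


def win_probability(preds, year, team, opponent):
    """Probability that `team` beats `opponent`, whichever Id is lower."""
    low_id, high_id = sorted((team, opponent))
    p_low = preds[(year, low_id, high_id)]
    return p_low if team == low_id else 1.0 - p_low


# The two sample rows from the format description (header line omitted).
preds = parse_predictions(["2016_1112_1114,0.6", "2016_1112_1122,0"])
print(win_probability(preds, 2016, 1112, 1114))
print(win_probability(preds, 2016, 1114, 1112))
```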
For convenience, we have included the data files from the 2016 March Mania competition dataset in the Scripts environment (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups, see the documentation). However, the focus of this dataset is on Kagglers' predictions.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Analysis of ‘College Basketball Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/andrewsundberg/college-basketball-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Data from the 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, and 2021 Division I college basketball seasons.
cbb.csv has seasons 2013-2019 combined
The 2020 season's data set is kept separate from the other seasons, because there was no postseason due to the Coronavirus.
The 2021 data is from 3/15/2021 and will be updated and added to cbb.csv after the tournament
RK (Only in cbb20): The ranking of the team at the end of the regular season according to barttorvik
TEAM: The Division I college basketball school
CONF: The Athletic Conference in which the school participates (A10 = Atlantic 10, ACC = Atlantic Coast Conference, AE = America East, Amer = American, ASun = ASUN, B10 = Big Ten, B12 = Big 12, BE = Big East, BSky = Big Sky, BSth = Big South, BW = Big West, CAA = Colonial Athletic Association, CUSA = Conference USA, Horz = Horizon League, Ivy = Ivy League, MAAC = Metro Atlantic Athletic Conference, MAC = Mid-American Conference, MEAC = Mid-Eastern Athletic Conference, MVC = Missouri Valley Conference, MWC = Mountain West, NEC = Northeast Conference, OVC = Ohio Valley Conference, P12 = Pac-12, Pat = Patriot League, SB = Sun Belt, SC = Southern Conference, SEC = Southeastern Conference, Slnd = Southland Conference, Sum = Summit League, SWAC = Southwestern Athletic Conference, WAC = Western Athletic Conference, WCC = West Coast Conference)
G: Number of games played
W: Number of games won
ADJOE: Adjusted Offensive Efficiency (An estimate of the offensive efficiency (points scored per 100 possessions) a team would have against the average Division I defense)
ADJDE: Adjusted Defensive Efficiency (An estimate of the defensive efficiency (points allowed per 100 possessions) a team would have against the average Division I offense)
BARTHAG: Power Rating (Chance of beating an average Division I team)
EFG_O: Effective Field Goal Percentage Shot
EFG_D: Effective Field Goal Percentage Allowed
TOR: Turnover Percentage Allowed (Turnover Rate)
TORD: Turnover Percentage Committed (Steal Rate)
ORB: Offensive Rebound Rate
DRB: Offensive Rebound Rate Allowed
FTR: Free Throw Rate (How often the given team shoots Free Throws)
FTRD: Free Throw Rate Allowed
2P_O: Two-Point Shooting Percentage
2P_D: Two-Point Shooting Percentage Allowed
3P_O: Three-Point Shooting Percentage
3P_D: Three-Point Shooting Percentage Allowed
ADJ_T: Adjusted Tempo (An estimate of the tempo (possessions per 40 minutes) a team would have against the team that wants to play at an average Division I tempo)
WAB: Wins Above Bubble (The bubble refers to the cut off between making the NCAA March Madness Tournament and not making it)
POSTSEASON: Round where the given team was eliminated or where their season ended (R68 = First Four, R64 = Round of 64, R32 = Round of 32, S16 = Sweet Sixteen, E8 = Elite Eight, F4 = Final Four, 2ND = Runner-up, Champion = Winner of the NCAA March Madness Tournament for that given year)
SEED: Seed in the NCAA March Madness Tournament
YEAR: Season
This data was scraped from http://barttorvik.com/trank.php#. I cleaned the data set and added the POSTSEASON, SEED, and YEAR columns.
--- Original source retains full ownership of the source dataset ---
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The unbiased data has been updated through Selection Sunday 2022.
This dataset contains two CSV files. One is guaranteed to have no leakage, but its data only starts after 2010. The other goes back to 2001 but contains some leakage.
The leaky data was acquired from Ken Pom's official website; the leak-free version was acquired from time machine services.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Summary
The repo provides answer, title and sentence expansions for the Natural Questions corpus with gar-T5.
Dataset Structure
There are dev and test folds. An example data entry of the dev split looks as follows: { "id": "1", "predicted_answers": ["312"], "predicted_titles": ["Invisible Man"], "predicted_sentences": ["The Invisible Man First edition Author Ralph Ellison Cover artist M."] }
An example data entry of the test split looks as follows: {… See the full description on the dataset page: https://huggingface.co/datasets/castorini/nq_gar-t5_expansions.
A team's mean season statistics can be used as predictors for their performance in future games. However, these statistics gain additional meaning when placed in the context of their opponents' (and opponents' opponents') performance. This dataset provides this context for each team. Furthermore, predicting games based on post-season stats causes data leakage, which from experience can be significant in this context (15-20% loss in accuracy). Thus, this dataset provides each of these statistics prior to each game of the regular season, preventing any source of data leakage.
All data is derived from the March Madness competition data. Each original column was renamed to "A" and "B" instead of "W" and "L," and then mirrored to represent both orderings of opponents. Each team's mean stats are computed (both their stats, and the mean "allowed" or "forced" statistics by their opponents). To compute the mean opponents' stats, we analyze the games played by each opponent (excluding games played against the team in question), and compute the mean statistics for those games. We then compute the mean of these mean statistics, weighted by the number of times the team in question played each opponent. The opponents' opponents' stats are computed as a weighted average of the opponents' averages. This results in statistics similar to those used to compute strength of schedule or RPI, except that they go beyond win percentages (see: https://en.wikipedia.org/wiki/Rating_percentage_index).
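The mirroring and the weighted opponents' mean described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the author's code: games are hypothetical `(team_a, team_b, points_a, points_b)` tuples with made-up scores, and points scored stands in for any per-game statistic.

```python
def mirrored(games):
    """Return each game in both orderings, as (team, opponent, pts, opp_pts)."""
    rows = []
    for a, b, points_a, points_b in games:
        rows.append((a, b, points_a, points_b))
        rows.append((b, a, points_b, points_a))
    return rows


def opponents_mean_points(games, team):
    """Mean of each opponent's mean points, excluding games against `team`.

    Iterating once per game played weights each opponent by the number of
    times `team` faced it.
    """
    rows = mirrored(games)
    per_game_opponent_means = []
    for a, opponent, _, _ in rows:
        if a != team:
            continue
        others = [pts for x, y, pts, _ in rows if x == opponent and y != team]
        if others:
            per_game_opponent_means.append(sum(others) / len(others))
    if not per_game_opponent_means:
        return None
    return sum(per_game_opponent_means) / len(per_game_opponent_means)


# Made-up scores, for illustration only.
games = [
    ("A", "B", 70, 60),
    ("B", "C", 80, 75),
    ("B", "D", 60, 65),
    ("A", "C", 66, 64),
    ("C", "D", 90, 70),
]
print(opponents_mean_points(games, "A"))
```

In this sample, B averages 70 points and C averages 82.5 points in games not involving A, so the opponents' mean for A is 76.25.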
The per game statistics are computed by pretending we don't have any of the data on or after the day in question.
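A minimal Python sketch of this cutoff rule follows. It assumes hypothetical `(day, team_a, team_b, points_a, points_b)` tuples with made-up values and uses points scored as the example statistic; only games strictly before the given day contribute.

```python
def mean_points_before(games, team, day):
    """Mean points scored by `team` using only games played before `day`."""
    points = []
    for game_day, a, b, points_a, points_b in games:
        if game_day >= day:
            continue  # pretend data on or after the day does not exist
        if a == team:
            points.append(points_a)
        elif b == team:
            points.append(points_b)
    return sum(points) / len(points) if points else None


# Made-up games, for illustration only.
games = [
    (10, "A", "B", 70, 60),
    (20, "A", "C", 80, 75),
    (30, "D", "A", 65, 60),
]
print(mean_points_before(games, "A", 30))
```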
Currently, the data isn't computed particularly efficiently. Computing the per game averages for every day of the season is necessary to compute fully accurate opponents' opponents' average, but takes about 90 minutes to obtain. It is probably possible to parallelize this, and the per-game averages involve a lot of repeated computation (basically computing the final averages over and over again for each day). Speeding this up will make it more convenient to make changes to the dataset.
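One possible way to avoid recomputing averages from scratch for every day is to keep running totals and record each team's average just before a day's games are added. The Python sketch below illustrates that idea only for a team's own per-game mean; it is an assumption about how the repeated work could be reduced, not the author's implementation, and it does not cover the opponents' opponents' averages. Game tuples and values are hypothetical.

```python
from collections import defaultdict
from itertools import groupby


def running_pregame_means(games):
    """Map (day, team) to the team's mean points before that day's games."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    pregame = {}
    ordered = sorted(games, key=lambda g: g[0])
    for day, day_games in groupby(ordered, key=lambda g: g[0]):
        day_games = list(day_games)
        # Snapshot first, so same-day games never see each other.
        for _, a, b, _, _ in day_games:
            for team in (a, b):
                pregame[(day, team)] = (
                    totals[team] / counts[team] if counts[team] else None
                )
        # Then fold the day's results into the running totals.
        for _, a, b, points_a, points_b in day_games:
            totals[a] += points_a
            counts[a] += 1
            totals[b] += points_b
            counts[b] += 1
    return pregame


# Made-up games, for illustration only.
games = [
    (10, "A", "B", 70, 60),
    (20, "A", "C", 80, 75),
    (30, "D", "A", 65, 60),
]
pregame = running_pregame_means(games)
print(pregame[(30, "A")])
```

Each game is visited a constant number of times after sorting, instead of once per day of the season.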
I would like to transform these statistics to be per-possession, add shooting percentages, pace, and number of games played (to give an idea of the amount of uncertainty that exists in the per-game averages). Some of these can be approximated with the given data (but the results won't be exact), while others will need to be computed from scratch.
This is feather format data of the competition Google Cloud & NCAA® ML Competition 2020-NCAAM. Please refer to the kernel 2020 NCAAM: Fast data loading with feather for usage.
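As an illustration of what a per-possession transformation could look like, the Python sketch below estimates possessions from box-score totals and converts points to points per 100 possessions. The possession formula and its 0.475 free-throw coefficient are a commonly used approximation and are an assumption here, not something the dataset description specifies; the argument names and values are hypothetical.

```python
def estimate_possessions(fga, orb, turnovers, fta, ft_weight=0.475):
    """Approximate possessions from box-score totals.

    Uses the common estimate FGA - ORB + TO + ft_weight * FTA. The 0.475
    weight is a conventional choice and an assumption in this sketch.
    """
    return fga - orb + turnovers + ft_weight * fta


def points_per_100(points, possessions):
    """Convert a points total to points per 100 possessions."""
    return 100.0 * points / possessions


# Made-up box-score totals, for illustration only.
possessions = estimate_possessions(fga=60, orb=10, turnovers=12, fta=20)
print(possessions)
print(points_per_100(75, possessions))
```

Because possessions are estimated rather than counted, the resulting efficiencies are approximate, consistent with the note that results from the given data won't be exact.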
Cover photo from pexels.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Kaggle’s March Machine Learning Mania competition challenged data scientists to predict winners and losers of the men's 2017 NCAA basketball tournament. This dataset contains the selected predictions of all Kaggle participants. These predictions were collected and locked in prior to the start of the tournament.
The NCAA tournament is a single-elimination tournament that begins with 68 teams. There are four games, usually called the “play-in round,” before the traditional bracket action starts. Due to competition timing, these games are included in the prediction files but should not be used in analysis, as it’s possible that the prediction was submitted after the play-in round games were over.
Each Kaggle team could submit up to two prediction files. The prediction files in the dataset are in the 'predictions' folder. You can map the files to the teams by team_submission_key.csv.
The submission format contains a probability prediction for every possible game between the 68 teams. Refer to the competition documentation for data details. For convenience, we have included the data files from the competition dataset in the dataset (you may find TourneySlots.csv and TourneySeeds.csv useful for determining matchups). However, the focus of this dataset is on Kagglers' predictions.
This is the first live data stream on Kaggle providing a simple yet rich source of all soccer matches around the world 24/7 in real-time.
What makes it unique compared to other datasets?
Simply train your algorithm on the first version of the training dataset of approximately 11.5k matches and predict the data provided in the following data feed.
The CSV file is updated every 30 minutes, at minutes 20 and 50 of every hour. Please do not download it more than twice per hour, as each download incurs additional cost.
You can download the CSV data file from the Amazon S3 server at the following link, replacing FOLDER_NAME as described below:
https://s3.amazonaws.com/FOLDER_NAME/amasters.csv
Substitute FOLDER_NAME with "analyst-masters".
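The substitution and the update schedule can be sketched in Python as below. The sketch only builds the URL and computes the next refresh time at minute 20 or 50; it deliberately makes no network request, in keeping with the download limit. The function names are assumptions.

```python
from datetime import datetime, timedelta


def build_url(folder_name):
    """Fill FOLDER_NAME into the download link given in the description."""
    return f"https://s3.amazonaws.com/{folder_name}/amasters.csv"


def next_update(now):
    """Return the first refresh time strictly after `now` (minute 20 or 50)."""
    base = now.replace(second=0, microsecond=0)
    for minute in (20, 50):
        candidate = base.replace(minute=minute)
        if candidate > now:
            return candidate
    # Past minute 50: the next refresh is at minute 20 of the following hour.
    return base.replace(minute=20) + timedelta(hours=1)


print(build_url("analyst-masters"))
print(next_update(datetime(2019, 1, 6, 14, 5)))
```

Scheduling a download shortly after each refresh time keeps usage at two downloads per hour.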
Our goal is to identify the outcome of a match as Home, Draw or Away. The variety of sources and nature of information provided in this data stream makes it a unique database. Currently, FIVE servers are collecting data from soccer matches around the world, communicating with each other and finally aggregating the data based on the dominant features learned from 400,000 matches over 7 years. I describe every column and the data collection below in two categories, Category I – Current situation and Category II – Head-to-Head History. Hence, we divide the type of data we have from each team to 4 modes,
Below you can find a full illustration of each category.
I. Current situation
Col 1 to 3:
Votes_for_Home Votes_for_Draw Votes_for_Away
The most distinctive parts of the database are these 3 columns. We are releasing the opinions of over 100 professional soccer analysts predicting the outcome of a match. Their votes are the result of every piece of information they receive on players, team line-ups, injuries, and a team's urgency to win a match to stay in the league. They are spread around the world in various time zones and are experts on soccer teams from various regions. Our servers aggregate their opinions to update the CSV file until kickoff. Therefore, even if 40 users predict Real Madrid wins against Real Sociedad at the Santiago Bernabeu on January 6th, 2019 but 5 users predict Real Sociedad (the away team) will be the winner, you should doubt the home win. Here, the “majority of votes” works in conjunction with other features.
Col 4 to 9:
Weekday Day Month Year Hour Minute
There are over 60,000 matches during a year, and approximately 400 are usually held per day on weekends. More critical and exciting matches, which are usually less predictable, are held toward the evening in Europe. We are currently providing time in Central European Time (CET), equivalent to GMT+01:00.
*. Please note that the 2nd row of the CSV file represents the time at which data values from all servers were saved to the file.
Col 10 to 13:
Total_Bettors Bet_Perc_on_Home Bet_Perc_on_Draw Bet_Perc_on_Away
This data is recorded a few hours before the match as people place bets emotionally when kickoff approaches. The percentage of the overall number of people denoted as “Total_Bettors” is indicated in each column for “Home,” “Draw” and “Away” outcomes.
Col 14 to 15:
Team_1 Team_2
The team playing “Home” is “Team_1” and the opponent playing “Away” is “Team_2”.
Col 16 to 36:
League_Rank_1 League_Rank_2 Total_teams Points_1 Points_2 Max_points Min_points Won_1 Draw_1 Lost_1 Won_2 Draw_2 Lost_2 Goals_Scored_1 Goals_Scored_2 Goals_Rec_1 Goal_Rec_2 Goals_Diff_1 Goals_Diff_2
If the match is betw...
The app allows you to upload a submission and analyze how well you would have done in previous years’ competitions.
* The Public leaderboard is usually full of leaky submissions, making it hard to determine the quality of a submission. The Public leaderboard is included here for comparison. It is updated every time the app is run.
* The Average leaderboard shows the average score of the nth place teams. For example, if your submission places 10th on the Average leaderboard, then your score is slightly better than the average of the 10th place teams in the previous competitions and slightly worse than the average of the 9th place teams in the previous competitions.
* The 2018 - 2019 leaderboards are exact copies from previous competitions. You can use them to view where your submission would have placed in those competitions.
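The Average leaderboard idea can be sketched in Python as follows. This is an illustration, not the app's code: it assumes a lower score is better (as with log loss), uses made-up scores, and the function names are hypothetical.

```python
from bisect import bisect_left


def average_leaderboard(leaderboards):
    """Average score of the nth place teams across past competitions.

    Assumes a lower score is better. Leaderboards of different lengths are
    truncated to the shortest one.
    """
    ranked = [sorted(scores) for scores in leaderboards.values()]
    return [sum(place) / len(place) for place in zip(*ranked)]


def placement(score, averages):
    """1-based place: one more than the number of strictly better averages."""
    return bisect_left(averages, score) + 1


# Made-up scores, for illustration only.
past = {2018: [0.40, 0.42, 0.45], 2019: [0.44, 0.46, 0.47]}
averages = average_leaderboard(past)
print(averages)
print(placement(0.43, averages))
```

With these sample values a score of 0.43 places 2nd: it is worse than the average 1st place score and better than the average 2nd place score.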
Fork and edit the Women's March Madness 2021 Leaderboard Analyzer on Kaggle. Run all cells of the notebook and view the app in a separate tab using the url generated by ngrok.
Important: The app needs a backend to run. You must fork and edit the notebook. You won't be able to view the app from a static Kaggle notebook.
Follow the instructions here to download and run the latest Wave Server, a requirement for apps. Note: If you have a version of Wave older than or equal to 0.12.0, you will need to reinstall Wave with a newer version.
Download the app code from Kaggle. Open a terminal in the downloaded womens_leaderboard directory and create a tmp folder for uploaded files.
bash
$ mkdir tmp
$ make setup
$ source venv/bin/activate
$ wave run leaderboard.app
Point your favorite web browser to localhost:10101