Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This version of the Baseball Databank was downloaded from Sean Lahman's website.
Note that as of v1, this dataset is missing a few tables because of a restriction on the number of individual files that can be added. This is in the process of being fixed. The missing tables are Parks, HomeGames, CollegePlaying, Schools, Appearances, and FieldingPost.
The design follows these general principles. Each player is assigned a unique number (playerID). All of the information relating to that player is tagged with his playerID. The playerIDs are linked to names and birthdates in the MASTER table.
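As a minimal sketch of how the playerID linkage works in practice, the snippet below builds two tiny in-memory tables and joins them on playerID. The column names (nameFirst, nameLast, yearID, HR) and the single made-up player are assumptions for illustration only; check them against the table descriptions attached to the actual files.

```python
import sqlite3

# In-memory database with placeholder rows (not real Databank records).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Master  (playerID TEXT PRIMARY KEY, nameFirst TEXT, nameLast TEXT);
CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER);
INSERT INTO Master  VALUES ('exampl01', 'Example', 'Player');
INSERT INTO Batting VALUES ('exampl01', 1901, 3), ('exampl01', 1902, 5);
""")

# Every statistics row carries a playerID, so a join against Master
# recovers the player's name for any aggregate.
career_hr = conn.execute("""
SELECT m.nameFirst, m.nameLast, SUM(b.HR) AS careerHR
FROM Batting AS b
JOIN Master AS m ON m.playerID = b.playerID
GROUP BY m.playerID
""").fetchall()

print(career_hr)
```

The same pattern should apply to any other table that is keyed by playerID.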
The database is comprised of the following main tables:
It is supplemented by these tables:
Descriptions of each of these tables can be found attached to their associated files, below.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/
Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.
Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.
The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
Baffled why your team traded for that 34-year-old pitcher? Convinced you can create a new and improved version of WAR? Wondering what made the 1907 Cubs great and whether they can do it again?
The History of Baseball is a reformatted version of the famous Lahman’s Baseball Database. It contains Major League Baseball’s complete batting and pitching statistics from 1871 to 2015, plus fielding statistics, standings, team stats, park stats, player demographics, managerial records, awards, post-season data, and more.
Scripts, Kaggle’s free, in-browser analytics tool, makes it easy to share detailed sabermetrics, predict the next Hall of Fame inductee, illustrate how speed scores runs, or publish a definitive analysis on why the Los Angeles Dodgers will never win another World Series.
We have more ideas for analysis than games in a season, but here are a few we’d really love to see:
See the full SQLite schema.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
This dataset was created to evaluate MLB salaries adjusted for inflation.
It is a join of two datasets: baseball salaries and the Consumer Price Index. I took the Consumer Price Index data, transposed it so that each year is a row, and used SQL to join the two tables:
SELECT * FROM Salaries LEFT JOIN USIndex ON Salaries.yearID = USIndex.Year
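To show how the joined table could be used for the inflation adjustment, here is a rough sketch using Python's built-in sqlite3 module. The CPI column name, the base year, and all of the values are placeholders invented for the example, not figures from the dataset; the adjustment formula (salary times base-year CPI divided by that year's CPI) is one common convention and is assumed here.

```python
import sqlite3

# Placeholder tables; the numbers are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Salaries (yearID INTEGER, playerID TEXT, salary REAL);
CREATE TABLE USIndex  (Year INTEGER, CPI REAL);
INSERT INTO Salaries VALUES
    (2000, 'exampl01', 1000000),
    (2010, 'exampl01', 2000000),
    (2011, 'exampl02', 500000);
INSERT INTO USIndex VALUES (2000, 80.0), (2010, 100.0);
""")

BASE_YEAR = 2010  # hypothetical reference year for "real" dollars

# Same LEFT JOIN as above, plus a second lookup of the base-year index.
# Salary years with no index row keep their row but get a NULL adjustment.
rows = conn.execute("""
SELECT s.yearID, s.playerID, s.salary,
       s.salary * base.CPI / i.CPI AS adjustedSalary
FROM Salaries AS s
LEFT JOIN USIndex AS i ON s.yearID = i.Year
CROSS JOIN USIndex AS base
WHERE base.Year = ?
ORDER BY s.yearID
""", (BASE_YEAR,)).fetchall()

for row in rows:
    print(row)
```

Because the join is a LEFT JOIN, a salary year that is missing from the index is kept with an empty adjusted value rather than silently dropped, which makes gaps in the index easy to spot.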
The baseball tables are from Lahman's Baseball Database. The Consumer Price Index is from World Bank Data.
I'm practicing SQL and R using baseball data.