Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This version of the Baseball databank was downloaded from Sean Lahman's website.
Note that as of v1, this dataset is missing a few tables because of a restriction on the number of individual files that can be added. This is in the process of being fixed. The missing tables are Parks, HomeGames, CollegePlaying, Schools, Appearances, and FieldingPost.
The design follows these general principles. Each player is assigned a unique number (playerID). All of the information relating to that player is tagged with his playerID. The playerIDs are linked to names and birthdates in the MASTER table.
The database is comprised of the following main tables:
It is supplemented by these tables:
Descriptions of each of these tables can be found attached to their associated files, below.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/
Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.
Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.
The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.
A baseball batting data frame with 107429 observations on the following 22 variables.
playerID Player ID code
yearID Year
stint player's stint (order of appearances within a season)
teamID Team; a factor
lgID League; a factor with levels AA AL FL NL PL UA
G Games: number of games in which a player played
AB At Bats
R Runs
H Hits: times reached base because of a batted, fair ball without error by the defense
X2B Doubles: hits on which the batter reached second base safely
X3B Triples: hits on which the batter reached third base safely
HR Homeruns
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base on Balls
SO Strikeouts
IBB Intentional walks
HBP Hit by pitch
SH Sacrifice hits
SF Sacrifice flies
GIDP Grounded into double play
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Baffled why your team traded for that 34-year-old pitcher? Convinced you can create a new and improved version of WAR? Wondering what made the 1907 Cubs great and if can they do it again?
The History of Baseball is a reformatted version of the famous Lahman’s Baseball Database. It contains Major League Baseball’s complete batting and pitching statistics from 1871 to 2015, plus fielding statistics, standings, team stats, park stats, player demographics, managerial records, awards, post-season data, and more.
Scripts, Kaggle’s free, in-browser analytics tool, makes it easy to share detailed sabermetrics, predict the next hall of fame inductee, illustrate how speed scores runs, or publish a definitive analysis on why the Los Angeles Dodgers will never win another World Series.
We have more ideas for analysis than games in a season, but here are a few we’d really love to see:
See the full SQLite schema.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.
This version of the Baseball databank was downloaded from Sean Lahman's website.
Note that as of v1, this dataset is missing a few tables because of a restriction on the number of individual files that can be added. This is in the process of being fixed. The missing tables are Parks, HomeGames, CollegePlaying, Schools, Appearances, and FieldingPost.
The design follows these general principles. Each player is assigned a unique number (playerID). All of the information relating to that player is tagged with his playerID. The playerIDs are linked to names and birthdates in the MASTER table.
The database is comprised of the following main tables:
It is supplemented by these tables:
Descriptions of each of these tables can be found attached to their associated files, below.
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/
Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.
Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.
The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.