2025 UPDATE: Retrosheet now directly provides the contextualized data that I had created for this project. I'm starting to use the Retrosheet-provided data instead.
What happens in a baseball game? Pitchers pitch, batters hit, baserunners run the bases and score, and the score increases. All these events can be easily recorded, but they provide an incomplete picture of the game unless the context in which each event occurred is also known. For example, a home run with the bases loaded is more valuable to a team than a home run with no one on base.
This dataset takes the amazing Major League Baseball events dataset built by retrosheet.org and adds contextual information about the state of the game at the time the event occurred. This additional data should enable research into deeper and more complex questions about what happens in baseball, and why.
The most important files are the "{year}rs.csv" files, which contain regular season event data in context. Most studies will use only those files.
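As a starting point, here is a minimal sketch of reading one regular-season file in Python. It assumes the files have been downloaded into a local folder (the `data_dir` argument is hypothetical), that each file is UTF-8 text with a header row, and that the "{year}rs.csv" pattern above is the full file name. No column names are assumed.

```python
import csv
from pathlib import Path


def regular_season_filename(year: int) -> str:
    """Build the '{year}rs.csv' file name for one season."""
    return f"{year}rs.csv"


def load_events(data_dir: str, year: int) -> list[dict[str, str]]:
    """Read one season of events into a list of row dictionaries.

    data_dir is a hypothetical local folder holding the downloaded files.
    Assumes a header row; every value is returned as a string.
    """
    path = Path(data_dir) / regular_season_filename(year)
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling `load_events("data", 1998)` would then read `data/1998rs.csv`, provided that file exists locally.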
Recipients of Retrosheet data are free to make any desired use of the information, including (but not limited to) selling it, giving it away, or producing a commercial product based upon the data. Retrosheet has one requirement for any such transfer of data or product development, which is that the following statement must appear prominently:
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at www.retrosheet.org.
Retrosheet makes no guarantees of accuracy for the information that is supplied. Much effort is expended to make our website as correct as possible, but Retrosheet shall not be held responsible for any consequences arising from the use of the material presented here. All information is subject to corrections as additional data are received. We are grateful to anyone who discovers discrepancies and we appreciate learning of the details.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Comprehensive dataset tracking 934 youth baseball bats used in Little League World Series regional tournaments from 2019 to 2025. Includes brand distribution, model popularity, and year-over-year trends for USA Baseball certified bats.
Please take a moment to carefully read through this description and metadata to better understand the dataset and its nuances before proceeding to the Suggestions and Discussions section.
This dataset focuses on a wide range of sabermetric metrics for analyzing batter performance in baseball. It provides a comprehensive view of a player's abilities in power, plate discipline, speed, and overall efficiency.
NOTE that only qualified players are included in this data, meaning players who reached the minimum number of plate appearances required to qualify for season-long leaderboards and rate stats. The data was retrieved on October 18th, 2024.
AB (At-Bats): The total number of times a batter has a turn at the plate, excluding walks, hit-by-pitches, and sacrifices.
PA (Plate Appearances): The total number of times a batter completes a turn at the plate, including all outcomes.
Home Run: The number of times the batter hits the ball out of the field, allowing them to round all bases and score.
K% (Strikeout Percentage): The percentage of plate appearances that result in a strikeout.
BB% (Walk Percentage): The percentage of plate appearances that result in the batter receiving a walk.
SLG% (Slugging Percentage): A measure of the batter's power, calculated as total bases per at-bat.
OBP (On-Base Percentage): The percentage of times the batter reaches base via hits, walks, or hit-by-pitch events.
OPS (On-Base Plus Slugging): The sum of OBP and SLG, providing a combined measure of a batter's ability to get on base and hit for power.
Isolated Power (ISO): A measure of a batter's raw power, calculated by subtracting batting average from slugging percentage to focus on extra-base hits.
BABIP (Batting Average on Balls in Play): The batting average on balls hit into play, excluding home runs and strikeouts.
Total Stolen Bases: The total number of bases a player has stolen successfully.
xwOBA (Expected Weighted On-Base Average): A predictive version of wOBA (Weighted On-base Average) based on the quality of contact, such as exit velocity and launch angle.
wOBAdiff (wOBA Differential): The difference between a batter's actual wOBA and expected wOBA (xwOBA), indicating performance versus expectations.
Exit Velocity Avg (Average Exit Velocity): The average speed of the ball off the bat, providing insight into the quality of contact.
Sweet Spot Percentage: The percentage of batted balls hit with a launch angle between 8 and 32 degrees, which typically leads to better offensive results.
Barrel Batted Rate: The percentage of batted balls hit with ideal exit velocity and launch angle, maximizing chances for extra-base hits.
Hard-Hit Percentage: The percentage of batted balls hit with an exit velocity of 95 mph or higher, reflecting the strength of contact.
Average Hyper Speed: The batter's adjusted average exit velocity (Statcast's "Adjusted EV"), in which each batted ball counts as the greater of 88 mph and its actual exit velocity.
Whiff Percentage: The percentage of swings in which the batter misses the ball entirely.
Swing Percentage: The percentage of pitches at which the batter swings, regardless of whether they make contact.
HP to 1B (Home Plate to First Base Speed): The time it takes for a batter to sprint from home plate to first base after hitting the ball.
Sprint Speed: The player's top running speed, usually measured during base running or fielding.
WAR (Wins Above Replacement): Measures a player's value in all facets of the game by estimating how many more wins he is worth than a replacement-level player at the same position.
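Several of the rate stats defined above can be derived from counting stats. The sketch below is illustrative rather than the dataset's own computation: the function name, argument names, and sample numbers are hypothetical. ISO is computed directly from extra-base hits, which is algebraically the same as SLG minus batting average, and BABIP uses the common formula (H - HR) / (AB - SO - HR + SF).

```python
def rate_stats(pa, ab, h, doubles, triples, hr, so, bb, sf):
    """Derive K%, BB%, ISO and BABIP from season counting stats.

    All argument names are illustrative, not the dataset's column names.
    """
    if pa <= 0 or ab <= 0:
        raise ValueError("pa and ab must be positive")
    balls_in_play = ab - so - hr + sf
    return {
        "k_pct": 100 * so / pa,          # strikeouts per plate appearance
        "bb_pct": 100 * bb / pa,         # walks per plate appearance
        # Extra bases per at-bat; equal to SLG minus batting average.
        "iso": (doubles + 2 * triples + 3 * hr) / ab,
        # Hits on balls in play, excluding home runs and strikeouts.
        "babip": (h - hr) / balls_in_play if balls_in_play > 0 else None,
    }


# Made-up season line, for illustration only.
sample = rate_stats(pa=600, ab=540, h=150, doubles=30, triples=3,
                    hr=25, so=120, bb=50, sf=5)
```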
This dataset is designed to simplify the process of analyzing batter performance using advanced sabermetric principles, providing key insights into offensive effectiveness and expected outcomes.
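The contact-quality percentages defined above (Sweet Spot Percentage, Hard-Hit Percentage, Exit Velocity Avg) are aggregates over individual batted balls. The following is a minimal sketch under the assumption that each batted ball is available as an (exit velocity in mph, launch angle in degrees) pair; the dataset itself ships only the aggregated values, so the input format and function name here are hypothetical.

```python
from statistics import mean


def contact_quality(batted_balls):
    """Summarize a list of (exit_velocity_mph, launch_angle_deg) pairs."""
    if not batted_balls:
        raise ValueError("need at least one batted ball")
    n = len(batted_balls)
    # Sweet spot: launch angle between 8 and 32 degrees, inclusive.
    sweet_spot = sum(1 for _, angle in batted_balls if 8 <= angle <= 32)
    # Hard hit: exit velocity of 95 mph or higher.
    hard_hit = sum(1 for velo, _ in batted_balls if velo >= 95)
    return {
        "exit_velocity_avg": mean(velo for velo, _ in batted_balls),
        "sweet_spot_pct": 100 * sweet_spot / n,
        "hard_hit_pct": 100 * hard_hit / n,
    }


# Made-up batted balls, for illustration only.
summary = contact_quality([(101.2, 25), (88.0, 5), (95.0, 32), (79.5, 45)])
```

Barrel rate is left out because its definition depends on a velocity-dependent range of launch angles that the description above does not spell out.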
The dataset was retrieved from the respective sources listed in the Provenance section. Users are urged to use this data responsibly and to respect the rights and guidelines specified by the original data providers. When utilizing or sharing insights derived from this dataset, ensure proper attribution to the sources.
Batting:
PYID: Player ID
Team: The team the player played for
TeamYear: The year the player played for that team
Pos: The position the player played
Player: The name of the player
Age: The age of the player
Games: The number of games the player appeared in during that season
PA: Plate appearances
AB: At-bats
Runs: Runs scored (each time the player reaches home plate)
Hits: Times the batter reached base by hitting the ball
Doubles: Hits on which the batter reached second base
Triples: Hits on which the batter reached third base
HR: Home runs
RBI: Runs batted in
SB: Stolen bases
CS: Caught stealing
BB: Bases on balls (walks)
SO: Strikeouts
BA: Batting average, H / AB
OBP: On-base percentage, (H + BB + HBP) / (AB + BB + HBP + SF)
SLG: Slugging percentage, TB / AB, or (1B + 2*2B + 3*3B + 4*HR) / AB
OPS: On-base plus slugging, OBP + SLG
OPSP: OPS+, 100 * [OBP/lgOBP + SLG/lgSLG - 1], adjusted to the player's ballpark(s)
TB: Total bases
GDP: Double plays grounded into
HBP: Times hit by pitch
SH: Sacrifice bunts
SF: Sacrifice flies
IBB: Intentional bases on balls
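The batting formulas above (BA, OBP, SLG, OPS, OPSP) can be sketched as follows. The function and argument names are illustrative, not the dataset's column names, and the OPS+ helper deliberately omits the ballpark adjustment because the description does not say how that adjustment is made; the league averages are inputs the caller must supply.

```python
def batting_line(h, doubles, triples, hr, ab, bb, hbp, sf):
    """Compute TB, BA, OBP, SLG and OPS from counting stats."""
    if ab <= 0:
        raise ValueError("ab must be positive")
    singles = h - doubles - triples - hr
    tb = singles + 2 * doubles + 3 * triples + 4 * hr
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = tb / ab
    return {"tb": tb, "ba": h / ab, "obp": obp, "slg": slg, "ops": obp + slg}


def ops_plus_unadjusted(obp, slg, lg_obp, lg_slg):
    """OPS+ relative to league averages, WITHOUT any ballpark adjustment."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)


# Made-up season line and league averages, for illustration only.
line = batting_line(h=150, doubles=30, triples=3, hr=25,
                    ab=540, bb=50, hbp=5, sf=5)
relative = ops_plus_unadjusted(line["obp"], line["slg"],
                               lg_obp=0.320, lg_slg=0.400)
```

By construction, a hitter whose OBP and SLG both match the league averages gets a value of 100 from this helper.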
Pitching:
PYID: Player ID in the table
Team: The team the player played for that year
TeamYear: The year the player played for the team
Pos: The position of the player (may be null)
Player: The name of the player
Age: The age of the player that year
Wins: The number of wins the pitcher had that year
Losses: The number of losses the pitcher had that year
WL: The ratio between wins and losses
ERA: Earned run average, 9 * ER / IP
Games: Games played or pitched
GS: Games started
GF: Games finished
CG: Complete games
SHO: Shutouts (a complete game with no runs allowed)
SV: Saves
IP: Innings pitched
Hits: Hits allowed
Runs: Runs allowed
ER: Earned runs allowed
HR: Home runs allowed
BB: Bases on balls (walks)
IBB: Intentional bases on balls
SO: Strikeouts
HBP: Batters hit by pitch
BK: Balks
WP: Wild pitches
BF: Batters faced
ERAP: ERA+, 100 * [lgERA / ERA], adjusted to the player's ballpark
FIP: Fielding independent pitching, which measures a pitcher's effectiveness at preventing HR, BB, and HBP and at causing SO: (13*HR + 3*(BB + HBP) - 2*SO) / IP + Constant_lg
Whip: (BB + H) / IP
H9: 9 * H / IP
HR9: 9 * HR / IP
BB9: 9 * BB / IP
SO9: 9 * SO / IP
SOW: SO / W, or SO / BB
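The pitching formulas above can be sketched the same way. This example assumes `ip` is a true decimal number of innings (for example 180.333 for 180 and one third), not the ".1/.2" box-score notation, and it treats the league FIP constant as a caller-supplied input because the description does not give its value. The names and sample numbers are hypothetical.

```python
def pitching_rates(ip, h, er, hr, bb, hbp, so, fip_constant):
    """Compute ERA, WHIP, FIP, per-nine rates and SO/BB.

    ip is assumed to be true decimal innings; fip_constant is the league
    constant (Constant_lg), which must be supplied by the caller.
    """
    if ip <= 0:
        raise ValueError("ip must be positive")
    return {
        "era": 9 * er / ip,
        "whip": (bb + h) / ip,
        "fip": (13 * hr + 3 * (bb + hbp) - 2 * so) / ip + fip_constant,
        "h9": 9 * h / ip,
        "hr9": 9 * hr / ip,
        "bb9": 9 * bb / ip,
        "so9": 9 * so / ip,
        # SO/BB is undefined for a pitcher with no walks.
        "so_bb": so / bb if bb > 0 else None,
    }


# Made-up season line; 3.10 is a placeholder, not a real league constant.
rates = pitching_rates(ip=180, h=150, er=60, hr=20, bb=45, hbp=5,
                       so=200, fip_constant=3.10)
```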