This dataset utilized raw data from Advanced Sports Analytics (https://www.advancedsportsanalytics.com/).
This is a great website that provides raw MLB data for every game. The data is quite messy and requires quite a bit of cleaning, but it is worth it! Batting, pitching, and play-by-play data were exported into CSV files for the 2017-2020 seasons. An R script is provided.
Key Column information:
- Batting Order = Where the player batted in the lineup for that given day
- Position = The position they played for that game
- Pit = Total number of pitches they saw over the course of the game
- Str = Total number of strikes they saw over the course of the game
- Team.R = Total runs scored by the batter's team in the game
- Team.H = Total hits by the batter's team in the game
- Opponent.R = Total runs scored by the opposing team in the game
- Opponent.H = Total hits by the opposing team in the game
- X1b.Ump = First base umpire for the game
- X2b.Ump = Second base umpire for the game
- X3b.Ump = Third base umpire for the game
- HP.Ump = Home plate umpire for the game
- Date = Date of the game
- Game.Time = Game time
- H.A = Home or Away
- Precipitation = yes/no
- Sky = Whether it was sunny, cloudy, overcast, rain, drizzle, night, or in dome
- Stadium = Stadium played in
- Temperature = Temperature at game time
- Weather = Character string combining temperature, wind speed, wind direction, and stadium/sky
- Wind.Direction = Direction of the wind
- Wind.Speed = Wind speed in mph
- Starting.Pitcher = Starting pitcher
- Over.Under = Over/Under of the game
- Moneyline = The moneyline for the batter's team
- Wagers = Amount of wagers placed on the game
Unfortunately, it seems they no longer make this raw data available on their website, so I will be uploading the raw data along with the cleaned files so that others can manipulate the data any way they like!
finnnnnnnnnnnn/mlb-play-by-plays-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Pete Rose has played the most games in Major League Baseball history, taking to the field in 3,562 games between 1963 and 1986. Second in the ranking is Carl Yastrzemski, who played in 3,308 MLB games.
https://creativecommons.org/publicdomain/zero/1.0/
This public data includes pitch-by-pitch data for Major League Baseball (MLB) games in 2016. With this data you can effectively replay a game and rebuild basic statistics for players and teams.
games_wide - Every pitch, steal, or lineup event for each at-bat in the 2016 regular season.
games_post_wide - Every pitch, steal, or lineup event for each at-bat in the 2016 postseason.
schedules - The schedule for every team in the regular season.
*The schemas for the games_wide and games_post_wide tables are identical.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.baseball.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analyzing large BigQuery datasets.
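As a rough illustration of querying these tables with a safety cap on scanned bytes, here is a minimal sketch. It assumes the tables live under `bigquery-public-data.baseball`, that the `google-cloud-bigquery` package is installed, and that credentials are already configured; the helper names `build_query` and `run_with_byte_cap` are hypothetical and not part of any library.

```python
ALLOWED_TABLES = {"games_wide", "games_post_wide", "schedules"}


def build_query(table: str, limit: int = 10) -> str:
    """Build a small SELECT against one table (dataset path is an assumption)."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    return f"SELECT * FROM `bigquery-public-data.baseball.{table}` LIMIT {int(limit)}"


def run_with_byte_cap(sql: str, max_bytes: int = 10**9):
    """Dry-run the query, then execute it only if the scan stays under max_bytes."""
    # Imported lazily so build_query works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client()

    # A dry run reports how many bytes the query would scan without running it.
    dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(sql, job_config=dry_cfg)
    if dry_job.total_bytes_processed > max_bytes:
        raise RuntimeError(
            f"query would scan {dry_job.total_bytes_processed} bytes, cap is {max_bytes}"
        )

    # maximum_bytes_billed makes BigQuery itself reject the job if the cap is exceeded.
    run_cfg = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    job = client.query(sql, job_config=run_cfg)
    return [dict(row.items()) for row in job.result()]
```

Note that `LIMIT` only trims the returned rows; BigQuery may still scan the whole table, which is why the dry run and the byte cap are useful here. A call such as `run_with_byte_cap(build_query("schedules", 5))` would return up to five rows as dictionaries.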
Dataset Source: Sportradar LLC
Use: Copyright Sportradar LLC. Access to data is intended solely for internal research and testing purposes, and is not to be used for any business or commercial purpose. Data are not to be exploited in any manner without express approval from Sportradar. Display of data must include the phrase, “Data provided by Sportradar LLC,” and be hyperlinked to www.sportradar.com.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In recent years, there has been increased attention and focus from the public on the environmental impact of professional sports organizations. Significant opportunities exist for Major League Baseball (MLB) teams to both reduce their own environmental footprint, and that of their fans, through sustainability initiatives. Despite stadiums using upwards of ten million gallons of water per year and having the same energy needs as a small city, no MLB team has completed a public-facing quantification of their total environmental footprint. This project calculated the carbon footprint and water consumption of the Tampa Bay Rays for the 2019 regular season. We analyzed Scope 1, 2, and 3 GHG emissions to identify hotspots within the Rays' operations, supply chains, and transportation. Fan transportation was found to be the largest source of GHGs, followed by food production for concessions. The cooling tower and restrooms were identified as the largest sources of onsite water usage. We created a repository of best practices as a resource for stadium managers that includes strategies to reduce GHGs and water use coupled with scenario analyses estimating potential reductions. The following recommendations are highlighted as the largest reduction opportunities: (1) prioritizing fan engagement to switch to more sustainable modes of transportation, and (2) offering and highlighting more vegetarian options at concessions. To further reduce emissions and water usage, MLB teams should prioritize sub-metering electricity and water lines and installing more efficient equipment.
Ahead of the 2023 Major League Baseball season, a pitch clock was introduced to speed up the pace of the game. As a result, an average game during the 2024 MLB season lasted * hours and ** minutes. This was more than ** minutes shorter than an average game during the 2022 season, when the pitch clock had not yet been introduced.
By Devi Ramanan
This dataset features a comprehensive look into the performance of 311 professional Major League Baseball players. It comprises key statistics including name, team, age, plate appearances (PA), batting average (AVG), on-base percentage minus batting average (OBP-AVG), isolated power (ISO), stolen bases (SB), and ultimate zone rating per 150 games (UZR/150). Additionally, the dataset contains more detailed metrics for each player, such as weighted values for singles (1Bw), doubles (2Bw), triples (3Bw), home runs (HRw), unintentional walks (uBBw), hit by pitches (HBPw), and stolen bases attempted/successful (SBW/CSW), as well as weighted on-base average (wOBA). Together these data points offer an insightful and objective way to measure offensive performance. Jeff Long's Spira Award winning article analyzed this same data to identify MLB players whose skill sets are more similar than would otherwise be expected.
This dataset can be used to analyse the wOBA stats of MLB players with at least 250 plate appearances (PA). This dataset has data on 31 baseball players. The data includes the player's name, team, age, PA, batting average (AVG), on-base percentage minus batting average (OBP-AVG), isolated power (ISO), stolen bases (SB), Ultimate Zone Rating per 150 games (UZR/150), weighted value of singles (1Bw), weighted value of doubles (2Bw), weighted value of triples (3Bw), weighted value of home runs (HRw), weighted value of unintentional walks (uBBw), weighted value of hit by pitches (HBPw), and stolen base attempt success rate (CSW). Using this dataset, you can compare different MLB players' stats in the same year.
- Analyzing and predicting batting performance. With this dataset, researchers could create models to observe correlations between batting metrics such as strikeouts, walks, home runs, stolen bases etc and overall wOBA scores for the players. This could be used to generate insights into the most important batting factors that contribute the greatest benefit for a team's success.
- Comparing players from different teams in terms of their batting performance. By comparing two players with similar stats (for example two offensive power hitters) across different teams it would be possible to analyze whether certain teams consistently have better offensive players or if they just have higher quantity in particular positions of play.
- Creating a predictive model for MLB draft prospects or free-agent signings. Modeling players' stats and year-over-year changes in OBP-AVG or UZR/150 could indicate which emerging talents are likely to improve substantially over their careers, compared with aging stars who may gradually decline due to age-related factors such as injury and fatigue.
If you use this dataset in your research, please credit the original authors.
Unknown License - Please check the dataset description for more information.
File: Batting Key Stats2.csv

| Column name | Description |
|:------------|:--------------------------------------------------|
| Name | Name of the player. (String) |
| Team | Team the player is on. (String) |
| Age | Age of the player. (Integer) |
| PA | Plate Appearances. (Integer) |
| AVG | Batting Average. (Float) |
| OBP-AVG | On-Base Percentage minus Batting Average. (Float) |
| ISO | Isolated Power. (Float) |
| SB | Stolen Bases. (Integer) |
| UZR/150 | Ultimate Zone Rating per 150 games. (Float) |
File: 2014 wOBA Stats 3.csv

| Column name | Description |
|:------------|:-----------------------------------|
| Name | Name of the player. (String) |
| Team | Team the player is on. (String) |
| PA | Plate Appearances. (Integer) |
| 1Bw | Weighted value of singles. (Float) |
| 2Bw | Weighted value of doubles. (Float) |

...
Yogi Berra played in a record 75 MLB World Series games in a career spanning from 1946 to 1965. Berra spent his whole career in New York, first playing for the Yankees, before playing a single season for the Mets in 1965. The catcher won the World Series 10 times as a player, before claiming three more rings as a coach and manager.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
During the early days of professional baseball, the dominant major leagues imposed a “reserve clause” designed to limit player wages by restricting competition for labor. Entry into the market by rival leagues challenged the incumbent monopsony cartel’s ability to restrict compensation. Using a sample of player salaries from the first 40 years of the reserve clause (1880-1919), this study examines the impact of inter-league competition on player wages. This study finds a positive salary effect associated with rival league entry that is consistent with monopsony wage suppression, but the effect is stronger during the 20th century than the 19th century. Changes in levels of market saturation and minor-league competition may explain differences in the effects between the two eras.
As of the first quarter of 2021, 10.4 percent of Xbox console owners in the United States said that they would consider playing the baseball game MLB The Show, in comparison to the 5.6 percent of PlayStation console owners in the same time period. The sports game was released in April 2021 and available on Xbox for the first time since 2006.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer shows the locations of Major League Baseball (MLB) stadiums in the United States and Canada. The layer includes a popup with info on the stadium, including the city where it is located, the name of the home team, and its seating capacity. This layer was originally sourced from the Major Sports Venues layer from the Homeland Infrastructure Foundation - Level Data (HIFLD) database (https://gii.dhs.gov/HIFLD). This layer includes a subset of Major Sports Venues, which has been updated with some additional info on the MLB teams that play in these stadiums and re-published from the Esri organization in ArcGIS Online to provide access. Minor updates have been made to the data to add new stadiums and update existing stadium names.
In 2023, the average age of players on each MLB team was between roughly 26 and 30 years. This is considered to be the prime of a player's career, as players are typically at their peak physical and athletic ability at this age.
Who is the oldest player in the MLB?
In 2023, the average age of the players on the New York Yankees' roster was 28.3 years. Out of all the teams in MLB, the Los Angeles Dodgers had the highest average player age. In the same year, the Toronto Blue Jays' average player age was 29.6 years.
What is Major League Baseball?
Major League Baseball (MLB) is the highest level of professional baseball in the United States and Canada. It comprises 30 teams, 29 of which are located in the United States and one in Canada. The teams are divided into two leagues: the American League (AL) and the National League (NL), and each league is further divided into three divisions: East, Central, and West. The teams play a 162-game regular season schedule, with the goal of earning a spot in the postseason, which consists of the AL and NL Championship Series, and the World Series. The team that wins the World Series is declared the champion of the MLB.
Fans watch at home and live in the stadiums
There are many ways to enjoy MLB games, whether you are a die-hard fan, a casual viewer, or a player yourself. You can watch games on TV, or stream them live online. In 2022, the average TV viewership of MLB World Series games stood at 11.8 million. Additionally, many teams have their own websites, social media accounts, and mobile apps that allow fans to stay up-to-date with the latest news, scores, and player stats. It is also possible to purchase tickets to games and watch the action live at the stadium. In 2022, the average attendance at the games in the MLB was 26,808.
Competition page: https://www.kaggle.com/c/mlb-player-digital-engagement-forecasting/data
The JSON-like columns of "train.csv" are unpacked with json.loads and then stored in CSV format. All CSV files have a date_ column, which corresponds to the date column of "train.csv".
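The unpacking step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the uploader's actual script; the nested column name `games` and the fields inside it are assumptions made up for the example, and `unpack_json_column` is a hypothetical helper.

```python
import json


def unpack_json_column(rows, column, date_key="date"):
    """Flatten one JSON-like column into a list of records tagged with date_.

    rows     : iterable of dicts, one per row of train.csv
    column   : name of the column holding a JSON array of objects (assumed layout)
    date_key : name of the date column in train.csv
    """
    records = []
    for row in rows:
        raw = row.get(column)
        if not raw:  # skip empty cells (days with no data for this column)
            continue
        for item in json.loads(raw):
            flat = dict(item)
            flat["date_"] = row[date_key]  # carry the parent row's date along
            records.append(flat)
    return records


# Example with made-up rows; real column and field names may differ.
sample = [
    {"date": "20210401", "games": '[{"gamePk": 1, "homeId": 10}]'},
    {"date": "20210402", "games": ""},
]
unpacked = unpack_json_column(sample, "games")
```

The resulting list of flat dictionaries can then be written out with `csv.DictWriter`, one file per unpacked column.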
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data included in this replication package include Major League Baseball player performance and contract data. The study looks at how temporary/permanent employment status of MLB players impacts injury management. The study's abstract is as follows: When employees are employed in a temporary capacity, employers should be less willing to invest in their human capital relative to permanent employees. This study uses the context of injury management by Major League Baseball teams to test for differential investment in the protection of player human capital. Injury management is inherently uncertain as medical professionals can give differing opinions, so teams may be able to influence recovery times. Using a panel dataset and estimating player fixed-effects regressions, players are found to miss significantly fewer games to injury when employed on a temporary basis.
MLB 2021 schedule, formatted to fit the format of the MLB Player Digital Engagement competition. The schedule is available from numerous sources, but this data set was created using: https://www.baseball-reference.com/leagues/MLB-schedule.shtml It was obtained on June 18th, and has not been updated. There surely have been changes to the schedule since June 18th, and these will not be reflected.
Each team's schedule is available twice: once as the primary teamId, and again as the opponentId, listed with the opposite frame of reference.
Double-headers may not be accounted for properly, both forward/future and backward/history (as of June 18th).
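To make the "listed twice" layout concrete, here is a minimal sketch of how one game could be expanded into two rows, one from each team's frame of reference. Only `teamId` and `opponentId` come from the description; the input keys `homeId` and `awayId`, the `isHome` flag, and the function name `mirror_schedule` are hypothetical.

```python
def mirror_schedule(games):
    """Expand each game into two rows, one per team's point of view."""
    rows = []
    for game in games:
        # Home team's frame of reference.
        rows.append({
            "date": game["date"],
            "teamId": game["homeId"],
            "opponentId": game["awayId"],
            "isHome": True,
        })
        # Away team's frame of reference: same game, ids swapped.
        rows.append({
            "date": game["date"],
            "teamId": game["awayId"],
            "opponentId": game["homeId"],
            "isHome": False,
        })
    return rows


# Made-up team ids, for illustration only.
example_rows = mirror_schedule([{"date": "2021-06-18", "homeId": 1, "awayId": 2}])
```

With this layout, filtering on a single `teamId` returns that team's full schedule. A double-header would simply appear as two input games on the same date, so rows keyed only by date and team would not be unique.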
baseball-reference.com
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This layer shows the locations of Major League Baseball (MLB) stadiums in the United States and Canada, with an added Placekey to enable joining with other datasets. The layer includes a popup with info on the stadium, including the city where it is located, the name of the home team, and its seating capacity. This layer was originally sourced from the Major Sports Venues layer from the Homeland Infrastructure Foundation - Level Data (HIFLD) database (https://gii.dhs.gov/HIFLD). This layer includes a subset of Major Sports Venues, which has been updated with some additional info on the MLB teams that play in these stadiums and re-published from the Esri organization in ArcGIS Online to provide access. Minor updates have been made to the data to add new stadiums and update existing stadium names.
The Sports Leagues Dataset (SLD) contains statistical data on the major professional sports leagues in the United States: NFL (National Football League), NBA (National Basketball Association), NHL (National Hockey League) and MLB (Major League Baseball). It collects five topics (Player Expenses, Player Salaries, Players Performance, Team Salaries, Team Valuation) along two dimensions (Finance and Performance) across different seasons (2000-2007) from three data sources (Forbes, Spotrac and Sports Reference). Please consider citing https://doi.org/10.5281/zenodo.3256432 if you find this dataset useful: [1] André Albino Bastos, Matheus de Oliveira Salim, Wladmir Cardoso Brandão. (2019). SLD: The Sports Leagues Dataset (Version 1.0) [Data set]. Zenodo.
The statistic shows the average player salary of the teams in Major League Baseball in 2019. The New York Yankees had an average player salary of 7.69 million U.S. dollars for the 2019 season.
The top ranked 144 hitters in MLB during 2017's regular season.
The data for each player includes:
Team - MLB Team
Pos - Field Position
G - Games Played
AB - At Bats
R - Runs Scored
H - Hits
2B - Doubles
3B - Triples
HR - Home Runs
RBI - Runs Batted In
BB - Walks
SO - Strike Outs
SB - Stolen Bases
CS - Caught Stealing (times thrown out while trying to steal)
AVG - Batting Average (hits/At Bats)
OBP - On Base Percentage (H+BB+HBP)/(AB+BB+HBP+SF)
SLG - Slugging Percentage (TB/AB) Total bases divided by at bats
OPS - On base percentage plus slugging (OBP + SLG)
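The rate statistics above can be recomputed from the counting columns. The sketch below follows the formulas as listed; note that hit-by-pitch (HBP) and sacrifice flies (SF) are not among the columns described here, so they default to 0 as an assumption, which makes the recomputed OBP only approximate. The function name `batting_rates` is hypothetical.

```python
def batting_rates(ab, h, doubles, triples, hr, bb, hbp=0, sf=0):
    """Return AVG, OBP, SLG and OPS from counting stats.

    hbp and sf default to 0 because the dataset does not list them (assumption).
    """
    # Total bases: every hit counts one base, plus extra bases for 2B, 3B and HR.
    tb = h + doubles + 2 * triples + 3 * hr
    avg = h / ab if ab else 0.0
    on_base_denominator = ab + bb + hbp + sf
    obp = (h + bb + hbp) / on_base_denominator if on_base_denominator else 0.0
    slg = tb / ab if ab else 0.0
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Made-up season line, for illustration only.
line = batting_rates(ab=500, h=150, doubles=30, triples=5, hr=20, bb=50, hbp=5, sf=5)
```

For this made-up line, total bases are 150 + 30 + 10 + 60 = 250, so SLG is 250/500 and AVG is 150/500.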
Major League Baseball makes a lot of statistics available for you at: http://mlb.mlb.com/stats
This is just a fun data set to play with for newbies. Inspired by the fun of baseball.
We examine whether social data can be used to predict how members of Major League Baseball (MLB) and members of the National Basketball Association (NBA) transition between teams during their career. We find that incorporating social data into various machine learning algorithms substantially improves the algorithms' ability to correctly determine these transitions in the NBA but only marginally in MLB. We also measure the extent to which player performance and team fitness data can be used to predict transitions between teams. This data, however, only slightly improves our predictions for both basketball and baseball players. We also consider whether social, performance, and team fitness data can be used to infer past transitions. Here we find that social data significantly improves our inference accuracy in both the NBA and MLB but player performance and team fitness data again do little to improve this score.