Nexus Voices

EP01-12_Review: Web Scraping MLB Statistics to Predict Player Salaries Based on Performance



Review: Web Scraping MLB Statistics to Predict Player Salaries Based on Performance
Author: Alexander J. Schoessler
Publication Information: Swarthmore College Senior Theses, Projects, and Awards, Spring 2023
Abstract
This study investigates the relationship between player performance and salaries in Major League Baseball (MLB) and uses machine learning models to assess whether those salaries are fair. The research uses Python web scraping to collect player performance statistics, personal details, and salary data from ESPN and Spotrac. Separate datasets were built for batters, starting pitchers, and relief pitchers. A linear regression model was then applied to predict salaries and to analyze how well salaries align with performance.
Results indicate that high-salary players (e.g., top batters and starting pitchers) are often overvalued, while seasoned players with strong performance but lower salaries are undervalued. The models achieved moderate accuracy, with R² scores between 0.5 and 0.6.
Key Contributions

  • Data Construction and Web Scraping Techniques: The study demonstrates a complete web scraping workflow for building high-quality datasets from multiple sources, integrating player performance statistics, salaries, and contract details.
  • Analysis of Salary-Performance Matching: The model reveals the overvaluation of high-salary players and quantifies the undervaluation of seasoned low-salary players.
  • Practical Implications for Decision-Making: The study provides data-driven support for salary decisions in team management, especially in contract negotiations and player value assessment.

Data Sources
The data came primarily from ESPN and Spotrac and comprises player performance statistics, salaries, and contract information.
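To make the scraping step concrete, here is a minimal sketch of turning an HTML statistics table into records using only Python's standard library. The markup in `SAMPLE_HTML`, the column names, and the helper `parse_table` are hypothetical and chosen for illustration; they are not the real page structure of ESPN or Spotrac, and the thesis may well have used a different parsing library.

```python
from html.parser import HTMLParser

# Hypothetical markup: NOT the real structure of any ESPN or Spotrac page.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>HR</th><th>AVG</th></tr>
  <tr><td>Player A</td><td>31</td><td>.287</td></tr>
  <tr><td>Player B</td><td>12</td><td>.251</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = parse_table(SAMPLE_HTML)
for record in records:
    print(record)
```

In a real pipeline the HTML string would come from an HTTP request to the source site, and the values would still need converting from strings to numbers before modeling.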
Methodology

  1. Dataset Construction: Data was scraped with Python, and performance metrics were merged with contract details to form comprehensive player datasets.
  2. Data Preprocessing: The data was cleaned, variables were standardized, and players were grouped so that batters, starting pitchers, and relief pitchers could be modeled separately.
  3. Salary Prediction Model: Linear regression was used to predict salaries, and model accuracy was evaluated with mean absolute error (MAE) and R² scores.
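The three steps above can be sketched end to end in a few lines of standard-library Python. This is a simplified illustration, not the thesis code: it uses a single invented performance metric, fits ordinary least squares in closed form, and scores the fit on the same rows it was trained on. All player keys and numbers are made up, and the helper names `fit_line`, `mae`, and `r2` are assumptions introduced here.

```python
from statistics import mean, pstdev

# Toy records: invented values for illustration, not data from the thesis.
performance = {"A": 5.1, "B": 2.0, "C": 3.4, "D": 0.8}            # one performance metric
salaries = {"A": 21.0, "B": 9.5, "C": 12.0, "D": 3.0, "E": 7.0}   # millions of USD

# 1. Dataset construction: keep only players present in both sources.
players = sorted(performance.keys() & salaries.keys())
x_raw = [performance[p] for p in players]
y = [salaries[p] for p in players]

# 2. Preprocessing: standardize the feature to zero mean and unit variance.
mu, sigma = mean(x_raw), pstdev(x_raw)
x = [(value - mu) / sigma for value in x_raw]


# 3. Salary prediction: one-feature ordinary least squares in closed form.
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(x, y)
predicted = [slope * xi + intercept for xi in x]
model_mae = mae(y, predicted)
model_r2 = r2(y, predicted)
print(f"MAE: {model_mae:.2f} million USD, R^2: {model_r2:.2f}")
```

A fuller version would use several features per player group and evaluate on held-out players, since in-sample scores tend to overstate accuracy.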

Results

  • Model Accuracy: The mean absolute error (MAE) was $4.28 million for batters, $3.66 million for starting pitchers, and $1.28 million for relief pitchers. R² scores ranged between 0.5 and 0.6.
  • Salary Mismatch: High-salary players (e.g., some batters and starting pitchers) are often overvalued, while low-salary, high-performing players are undervalued.
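One way to read the mismatch finding is as a residual check: a player whose actual salary sits well above the model's prediction looks overvalued, and one well below looks undervalued. The sketch below illustrates that idea under stated assumptions. The player names, salaries, predictions, and the `tolerance` threshold are all invented, and the function `valuation_gaps` is hypothetical; the summary does not say which cutoff, if any, the thesis used.

```python
def valuation_gaps(actual, predicted, tolerance):
    """Label each player by the gap between actual and predicted salary.

    actual, predicted: dicts mapping player -> salary in millions of USD.
    tolerance: gaps within +/- tolerance count as "in line" (assumed cutoff).
    """
    labels = {}
    for player, salary in actual.items():
        gap = salary - predicted[player]
        if gap > tolerance:
            labels[player] = ("overvalued", gap)
        elif gap < -tolerance:
            labels[player] = ("undervalued", gap)
        else:
            labels[player] = ("in line", gap)
    return labels


# Invented numbers for illustration only.
actual = {"Veteran X": 4.0, "Star Y": 30.0, "Regular Z": 10.0}
predicted = {"Veteran X": 9.5, "Star Y": 22.0, "Regular Z": 11.0}

gaps = valuation_gaps(actual, predicted, tolerance=4.0)
for player, (label, gap) in gaps.items():
    print(f"{player}: {label} ({gap:+.1f} million USD)")
```

Setting the tolerance near the model's MAE is one plausible choice, since smaller gaps are hard to distinguish from ordinary prediction error.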

Conclusions and Recommendations
The study highlights significant structural issues in MLB salary distribution and recommends adopting data-driven methods for salary decision-making, using player performance metrics to enhance salary evaluations. Future research could improve prediction accuracy and utility by incorporating more advanced metrics and exploring other machine learning models.







Nexus Voices, by C.Y. LU