
Predicting Margin of Victory in NFL Games:

Machine Learning vs. the Las Vegas Line

Jim Warner

December 17, 2010

Abstract

In this study we describe efforts to use machine learning to out-perform the expert Las Vegas line-makers at predicting the outcome of NFL football games. The statistical model we employ for inference is the Gaussian process, a powerful tool for supervised learning applications. With predictions for the margin of victory and associated confidence intervals from the Gaussian process model, we propose a simple framework which recommends a bet on a given game when it is deemed statistically favorable. The training dataset we consider in this study includes a wide variety of offensive and defensive NFL statistics from about 2,000 games between 2000 and 2009. We also explore the impact of including two novel features previously unstudied: the temperature difference between competing teams' cities and a team's computed strength according to the ranking scheme of [10]. We show that our predictions for margin of victory result in an error just 2% higher than that of the Las Vegas line and that we can successfully pick the game winner over 64% of the time. The bet-recommendation scheme we propose achieves a win rate just under 51%, falling short of the 52.4% needed to break even in the NFL gambling system.

1 Introduction

NFL football is arguably the most popular sport to bet on in the United States. It is said that gamblers bet nearly $1B per year on football games in Nevada alone [9]. Critical to the NFL gambling system is what is known as the (Las Vegas) line or point spread. The point spread is a handicap assigned to one of the teams, for betting purposes only, that is designed to give each team an equal chance of winning in the eyes of the bettor. For example, if team A is the clear favorite over team B, the bookmakers (those in charge of the betting process) will set the point spread for the game to reflect this; say, team A is the ten-point favorite. A gambler then wins a bet on team A not simply in the event that team A is victorious, but only if it wins by a margin of victory larger than ten. Likewise, a bet on team B pays off not only if team B wins but also if it loses by nine points or fewer.

The purpose of this study is to explore the use of a data-driven machine learning framework to predict the margin of victory in a matchup between any two given NFL football teams. With an accurate prediction of the game winner and margin of victory, one could hypothetically compare this predicted value with the point spread assigned to the game and proceed to make a statistically favorable bet. The potential for success stems from the fact that the prediction supplied by a machine learning algorithm is based solely on data and outcomes from previous games, whereas the point spread is not necessarily an unbiased predictor of the game outcome. As stated by Vergin and Sosik, "... the line can be viewed as the best forecast of bettor behavior, rather than the best forecast of game outcome. Given this objective, it is conceivable that there might be biases in the line of sufficient size to make some technical betting strategies profitable" [8]. So although the point spread is designed by the Las Vegas bookmakers to make consistently winning bets maximally difficult, the possibility of exploiting an existing bias in the line motivates the development of a predictive model for NFL games.

Despite being largely overshadowed by forecasting efforts in financial markets, a modest amount of work has been done in the statistics and machine learning communities on predictions for sports markets such as the NFL. A substantial portion of the work relating to the NFL football betting market is theoretical in nature, debating the so-called efficiency of the market. In financial terms, an efficient market is one whose movements are random rather than predictable, such that no planned approach to betting or investing can be successful in the long term. These theoretical works report conflicting conclusions: the researchers in [2], [4], and [5] suggest the NFL betting market is indeed an efficient one, while [1], [3], [6], and [7] demonstrate degrees of inefficiency in the market.

Although there is a lack of agreement about the possible existence of successful betting strategies for NFL games, several statistical models have been developed for this and similar purposes. One class of approaches deals with ranking or assigning relative strengths to each team. In [10], a number of general ranking schemes are introduced and subsequently applied to the problem of ranking the best college football teams. Rating methods are developed in [12] for soccer teams and in [13] for NFL teams, but predictions for game winners based on these ratings are not described.

Several statistical models have also been developed to make predictions on NFL games directly, varying in both sophistication and success. In [15], the authors make predictions using simple probit regressions based on power scores published in The New York Times, but report that the official point spreads are in general more accurate. Mixed linear models based on home-field advantage and team performance are used in [16], resulting in an error that is only slightly higher than that of the bookmaker's predictions they are compared with. In [11], a state-space model is created in a fully Bayesian context to model team strengths and make subsequent predictions of final scores. The accuracy of this model is reported to be about as high as that of the official point spread. Finally, a more successful approach is employed in [14], where a logistic regression classifier that takes the point spread into account is reported to predict the winner of NFL games with an accuracy of up to 54%.





The related work in this area confirms the prowess of the bookmakers in Las Vegas, in that it is hard to find evidence of a technical betting approach that consistently outperforms these expert line-makers. In this light, a baseline measure of success in this study is to provide predictions of the margin of victory in NFL games that are, on average, closer to the true score differential than the official point spread. However, the bettor is at a further disadvantage to the line-makers in the NFL betting system due to what is known as the eleven for ten rule; that is, one must put down $11 to win $10 on any given bet, providing the bookmakers with a commission known as the vigorish. Due to the vigorish, a bettor with a 50% success rate will actually lose money and instead needs to win 52.4% of their bets in order to break even. Therefore, a more genuine goal for this study is the development of a predictive framework that, when used to make informed decisions in the NFL betting market, results in a win rate of 52.4% or better.
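The 52.4% figure follows directly from the eleven for ten rule (a quick derivation, not spelled out above): a winning bet returns $10 on an $11 stake, so the break-even win rate p must satisfy 10p - 11(1 - p) = 0, which gives p = 11/21 ≈ 0.524.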

In an effort to reach the accuracy goals mentioned above, this approach utilizes the Gaussian process model, which has emerged as a serious competitor for real supervised learning applications in the past decade [17]. Gaussian processes provide a powerful tool for inference that is computationally tractable and offers a principled manner in which to quantify uncertainty. The ability to generate confidence measures along with predictions lends itself naturally to a betting scenario, where one looks to balance the risk of placing a bet on a game against one's relative certainty in its outcome. Indeed, we shall seek a scheme in which bets are placed on a game only when a specific confidence level is met, in order to achieve an adequately high win rate. To the author's knowledge, this study is the first to utilize the Gaussian process model in the area of sports forecasting.
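To make the idea concrete, a minimal sketch of such a confidence-based rule is given below, assuming the Gaussian process supplies a predictive mean mu and standard deviation sigma for the home team's margin of victory and spread is the line from the home team's perspective; the names and the threshold are illustrative assumptions, not the scheme developed later in this paper.

    % Sketch: recommend a bet only when the model disagrees with the line
    % by a confident margin. mu, sigma, spread, and zmin are assumed names.
    zmin = 0.5;                    % confidence threshold (tunable)
    z = (mu - spread) / sigma;     % standardized model-vs-line gap
    if z > zmin
        fprintf('Bet the home team to cover.\n');
    elseif z < -zmin
        fprintf('Bet the away team to cover.\n');
    else
        fprintf('No bet: confidence threshold not met.\n');
    end

Raising zmin trades bet volume for greater confidence in each individual bet.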

The training dataset considered in this study includes a wide variety of offensive and defensive statistics for over 1,000 games from the 2000-2007 NFL seasons, while the 2008 and 2009 seasonal data is reserved for final testing. We also look to investigate the benefit of including novel features not considered in previous works. To this end, we explore the impact of home-field advantage on game outcomes by factoring in the temperature difference between the home and away cities, with the belief that a visiting team's performance can be negatively affected by playing in a climate significantly different from its own [9]. Additionally, we seek improved accuracy in our approach by coupling it with a ranking system for sports teams. We compute the strength of the home and away teams going into each game according to [10] and supply it as an additional feature to the learning algorithms. It is expected that this data will be more beneficial than standard winning percentages in making predictions, since a team's rank takes into account both the strength of its previous opponents and the outcomes of its previous games.

The remainder of the paper is laid out as follows: Section 2 describes the process of data collection, provides a complete list of the features considered for training, and gives the formulation of the ranking system considered. Section 3 provides an overview of learning with Gaussian processes. Section 4 describes the process of feature selection, shows the impact of the novel features considered in this study, and compares results for each algorithm against the Las Vegas lines for predictions on games from the 2008 and 2009 NFL seasons. Section 5 concludes the paper by discussing the effectiveness of this approach and suggestions for future work in this area.

2 Data Acquisition

In this section we provide an overview of the dataset utilized for margin of victory predictions and the process for collecting this data. NFL games from the eight seasons spanning 2000-2007 are used as training examples, while games from the 2008 and 2009 seasons are reserved for final testing. We assume that individual seasons are largely independent from one another, so a prediction for a particular game is based solely on data from the current season. Accordingly, games from the first four weeks of each season are excluded from training and testing due to lack of data. Note also that the dataset includes only matchups from the regular season (preseason and playoff games are excluded). For each game, we consider a variety of standard NFL statistics as well as two additional novel features described in the following subsections. In total, there are 1,544 games in the training set and 390 games for final testing, with 47 features considered. The full set of features is listed in Figure 1.

Figure 1: Full list of the features considered for training in this study. "H" and "A" indicate that there is data for both the home and away teams for that feature. "S" indicates that there is streak (4-game moving average) data for that feature. Taking these variations into account, there are a total of 47 features.

2.1 NFL Seasonal Statistics

Although plentiful sources of NFL statistics exist online, the availability of nicely formatted, downloadable data files is limited. Given the large amount of data needed for this project (scores and statistics for 32 teams playing 16 games per season over 10 seasons), custom MATLAB programs are employed to collect this data from the web efficiently and accurately. The website mainly utilized for collecting seasonal statistics is www.pro-football-reference.com, which was found to be the most extensive source of data.

Two main types of functions are used to obtain the NFL statistics for training: the first type collects raw NFL data online, while the second processes the raw data into individual files, each containing a particular statistic from every game in a given year. The MATLAB functions urlread and urlwrite, which retrieve the HTML source code of a given URL (returning it as a large character array or saving it to a file, respectively), form the basis of the raw data collection code. Additional code was written to parse the HTML, extract the useful data, and print it to a neatly CSV-formatted text file. The second type of function, which refines this raw data, is necessary to simplify the process of loading input features into the machine learning algorithm. It identifies the home and away teams for every game in the raw data files, locates and organizes a particular statistic into two columns (one per team), and stores the columns individually in text files.
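A minimal sketch of the first stage is shown below; the URL, the regular expression, and the output file are illustrative assumptions rather than the author's actual code.

    % Stage 1 (sketch): fetch a season page and dump extracted table cells
    % to a text file. The URL and pattern are hypothetical.
    html = urlread('http://www.pro-football-reference.com/years/2006/games.htm');
    cells = regexp(html, '<td>([^<]*)</td>', 'tokens');  % extract cell contents
    fid = fopen('raw_2006.txt', 'w');
    for k = 1:numel(cells)
        fprintf(fid, '%s\n', cells{k}{1});
    end
    fclose(fid);

The second stage would then scan such files game by game, matching the home and away team names and writing each statistic into its own two-column text file.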

A total of eleven NFL statistics from each game are collected (see Figure 1) for both the home and away teams, producing 22 features on which to train the Gaussian process. While these features in general represent an average value over the course of the season up to that game, an additional moving average is computed for every statistic using only the four previous games. This provides another 22 features that can capture the well-known notion of "hot" or "cold" streaks in football, where a team performs well above or below its average in a particular stretch of games. Of course, a streak of any length could be considered, but a four-game streak is used here as it represents about a month of game play, or exactly one quarter of an NFL season, which is deemed adequately long. An average computed over a longer stretch would also require omitting more games from the start of each season in the dataset.
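Such a trailing four-game average can be computed in a line or two; the sketch below assumes stat is a column vector of one team's per-game values in season order (the variable names are illustrative).

    % Trailing 4-game moving average ("streak" feature). streak(k) is the
    % mean of games k-3 through k, i.e., the form entering game k+1; the
    % first three entries are undefined, consistent with excluding the
    % first four weeks of each season.
    streak = filter(ones(4,1)/4, 1, stat);
    streak(1:3) = NaN;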


