Tennis Prediction


Being passionate about sports and having played tennis myself, I could not think of a more interesting theme for a Big Data study. We tried to answer one question: can tennis outcomes be predicted from data available on the internet?

I did this project in my third year, as part of my studies, with a long-time friend, Thomas CARIN. We had 3 weeks to do the analysis and build a website dashboard that shows the latest tweets about tennis, displays match results, and links our predictions to them.

We reached an accuracy of 65% by using the players’ Elo ratings in a Logistic Regression. That is already a good score, but we then tried a deeper analysis.
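To make the baseline concrete, here is a minimal sketch of a Logistic Regression on the Elo difference between two players. It is not our project code: the ratings are synthetic, the helper names (`elo_expected_score`, `fit_logistic`) are made up for this example, and the model is fitted with a plain gradient descent loop rather than a library.

```python
import math
import random


def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit p(y = 1) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Synthetic matches (assumption: made-up ratings, not the real dataset).
rng = random.Random(0)
diffs, labels = [], []
for _ in range(2000):
    r1, r2 = rng.gauss(1800, 150), rng.gauss(1800, 150)
    diffs.append((r1 - r2) / 400.0)  # scaled Elo difference, player 1 minus player 2
    # Label is 1 when player 1 wins; the outcome is drawn from the Elo probability.
    labels.append(1 if rng.random() < elo_expected_score(r1, r2) else 0)

w, b = fit_logistic(diffs, labels)

# Predict a win for player 1 whenever the fitted probability exceeds 0.5.
accuracy = sum((w * x + b > 0) == (y == 1) for x, y in zip(diffs, labels)) / len(labels)
print(f"w={w:.2f}, b={b:.2f}, accuracy={accuracy:.2f}")
```

With a single feature, the model simply learns how steeply the win probability grows with the rating gap, which is why it makes a natural baseline.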

We then computed more features to see whether we could beat this accuracy.

We performed the following steps to reach our goal:

  • Data collection: we got our data from Ultimate Tennis Statistics, a website that freely provides a pre-populated database of tennis results and a lot of information on players.

  • Data cleaning: we checked all features to see whether important data were missing. We found duplicate matches, so we dropped them, then reversed the player order in half of the matches to get a balanced sample. We also dropped features we weren’t interested in, such as match_id, player_rank (we preferred elo_rating, which is more accurate) and surface. We handled NA values by dropping records whose missing values were essential for later steps. We also saw that missing values mostly came from atypical tournaments like the Davis Cup or the ATP Finals.

  • Feature scaling: we then scaled our features for the Logistic Regression.

  • Feature engineering: we computed new statistics such as first serve success percentage, percentage of points won on first serve, aces, percentage of matches won, and head-to-head statistics between players. We then took feature differences, so that each statistic is represented by one feature: the gap between the 2 players on that statistic.

  • Feature selection: We performed recursive feature elimination to keep only the most important features.

  • Modeling: we got a 64% accuracy. This model didn’t outperform our previous one, so our new features don’t seem to bring more information than the Elo rating alone.
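The cleaning, differencing and scaling steps above can be sketched as follows. This is only an illustration under stated assumptions: the column names (`winner_elo`, `loser_aces`, ...), the helper functions and the sample matches are hypothetical, and the recursive feature elimination step is left out.

```python
import random
import statistics


def to_difference_row(match, stats, swap):
    """Build one row of (player 1 - player 2) differences.

    swap=False: player 1 is the winner, label 1.
    swap=True:  player 1 is the loser,  label 0.
    """
    p1, p2 = ("loser", "winner") if swap else ("winner", "loser")
    row = {f"{s}_diff": match[f"{p1}_{s}"] - match[f"{p2}_{s}"] for s in stats}
    row["label"] = 0 if swap else 1
    return row


def build_dataset(matches, stats, seed=0):
    """Drop exact duplicates, then reverse a random half of the matches."""
    seen, unique = set(), []
    for m in matches:
        key = tuple(sorted(m.items()))
        if key not in seen:
            seen.add(key)
            unique.append(m)
    order = list(range(len(unique)))
    random.Random(seed).shuffle(order)
    swapped = set(order[: len(unique) // 2])
    return [to_difference_row(m, stats, i in swapped) for i, m in enumerate(unique)]


def standardize(rows, columns):
    """Scale each column in place to zero mean and unit variance."""
    for c in columns:
        values = [r[c] for r in rows]
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values) or 1.0  # avoid dividing by zero
        for r in rows:
            r[c] = (r[c] - mean) / sd


# Hypothetical sample matches; the third one duplicates the first.
matches = [
    {"winner_elo": 2100, "loser_elo": 1950, "winner_aces": 12, "loser_aces": 5},
    {"winner_elo": 1800, "loser_elo": 1850, "winner_aces": 7, "loser_aces": 9},
    {"winner_elo": 2100, "loser_elo": 1950, "winner_aces": 12, "loser_aces": 5},
    {"winner_elo": 1900, "loser_elo": 1700, "winner_aces": 3, "loser_aces": 4},
    {"winner_elo": 2000, "loser_elo": 2050, "winner_aces": 10, "loser_aces": 10},
]

rows = build_dataset(matches, stats=["elo", "aces"])
standardize(rows, ["elo_diff", "aces_diff"])
```

Reversing half of the matches matters because the raw data always lists the winner first: without it, every label would be 1 and the classifier would have nothing to learn.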

In conclusion, with more time to research the features that really matter, and to run exploratory analysis and statistical tests to evaluate them, we could expect to improve our model. We were a little bit “disappointed”, not by our work but by the results. We will definitely come back to this study with more knowledge (we are starting a major in Data Science) and more time, to build a better model and study its return on investment.

Link to the GitHub repository