Using data from baseball-reference.com to predict baseball game attendance


One of the important abilities
to do in sports analytics is prediction. In particular, prediction
of future events that affect your
team or your league. What we’re going to try
now is coming up with ways to predict the attendance
of future games. So the objective is to
look at three games that are coming up in the next
week, in particular, Thursday, June 8, Friday, June
9, and Sunday the 11 against these various teams. Now what I’ve done is selected
these games because they all are slightly different,
but they all take place at Chase Field in Phoenix. So let’s think about how we
would come up with predictions for these games,
and in particular, how we can learn about
the Diamondbacks’ schedule, and the unique
characteristics of Chase Field to try to come up with
better informed predictions. Instead of just guessing
what the attendance would be. So the way we’re going
to do this is we’re going to pull some data
from a website called baseballreference.com. And I’m going to show you
some very basic steps that can be done to sort this data
and be able to analyze it in a way to be able to answer
this question of attendance at future games. So a little bit
about Chase Field. It’s been around
for almost 20 years. Originally was called
Bank One Ballpark. And at some point,
Chase Bank came in as the corporate sponsor. The interesting thing is this
is one of the largest capacity stadiums in Major
League Baseball. Almost 50,000 seats
is the capacity. And you’ll see that in the
very high nosebleed seats on the top left of this
photo and on the top right of this photo, it’s
quite tough to get the people to sit up there. I think it’s an
oversized stadium. But that gives you an idea of
what the maximum capacity would theoretically be if this
was a sold out game. So let’s talk about
how we would make the forecast for attendance. There’s a couple of ways. One way to do it is to
estimate by analogy. In other words, to look at
very similar games that took place on a Thursday, a
Friday, and a Sunday, and try to see what the
attendance was for those games. And then try to
compare roughly what these future games, how
similar or different they are from those past games. So that’s one way. I don’t think that’s
very accurate, and it relies a lot on guessing. So we’re going to use method
two, which is actually looking at historical data, that
is, past attendance at Chase Field, and then use a little
bit of our expert judgment to try to figure out what
additional factors may influence the attendance
of these particular games that we’re interested in. These three games on
June 8, 8, and 11. So let’s talk about briefly
what factors could affect the attendance of a game. So I think weekday
games are less popular than weekend games. If you have a game that’s during
the day during work hours, that happens once
in a while, that’s not going to be as well attended
as a game that’s an evening. Of course, the weather– not in the particular example
of Chase Field, because– some of you may
have been there– but you should be aware that the
roof is actually retractable. So if there is rain
outside or extreme heat, they go ahead and
close this roof and make the inside a very
comfortable 70 or 75 degrees. But the weather could affect the
commute to getting somewhere. That’s harder to predict. Free giveaways are
something that I think would increase the attendance. Who the opponent is
and how the opponent is doing, how the
home team is doing, and then maybe even some
possible personal or team records that are
about to be broken. The harder thing to account
for is other competing sporting events. So in particular,
let me point out that of these three games
that I want you to consider, one of those– I won’t tell you which one,
you’re have to look this up– one of these games conflicts
with the NBA finals. So if people have
to make a choice between going to a regular
season game at Chase Field or watching the
NBA finals on TV, depending on what
sport you like better, you might choose
one over the other. So these are other
things to consider. Not just other sporting
events that may conflict, but also, other events
in and around Phoenix that may conflict. For instance, if there’s a
big conference at the Phoenix Convention Center, that
actually may help attendance at Chase Field, because when
folks go to this conference, the conference ends, and they
just walk to their hotel, they get together for
dinner, and talk about, hey, what are we
going to do tonight? Oh, there’s a home
game at Chase Field. It’s a couple blocks away. Let’s walk over there and do it. So think about all
these considerations that may affect the actual
attendance of the game. So first item for you is to be
able to look up the information on these three games. So we’re going to go to this
website, mlb.com/dbacks, and look at these particular
games that I’m referring to. So here we are on the
website, mlb.com/dbacks, and we’re going
to go to schedule. And when I hover– I haven’t clicked on anything– I’ve just hovered my
mouse over this menu, and allows me to select
the 2017 schedule. So because we’re already
in the month of June, it automatically pulls up
the calendar for the month. So the first game that
we are interested in is the Thursday, June 8
game against the Padres. So when I hover
my mouse over it, it tells me some
information about that game, not just that it’s at 12:40,
but also, that there’s a promotion that day
called Throwback Thursday. Now this is a difficult
one to compare because it’s a day game during the week. And you’re going to
have to look this up, but I think the Padres are
in last place in the National League West. So that’s probably not going to
be a very well attended game. You’re going to have to
make a judgment about that. The very next game
however, on Friday, June 9, is an evening game against
the Milwaukee Brewers. I don’t know how the
Brewers are doing this year, but Friday nights are usually
a little bit more popular. There are no promotions
that day, though. No promotions that game. And the third and final
game I want you to consider is also against the Brewers, but
this is a Sunday afternoon game or early afternoon game, 1
o’clock, 1:10 PM first pitch. But look at all these
promotions going on. Let me expand this link. And it looks like there
are multiple promotions. We’re looking at this one here. Not only do the first
10,000 kids get a free– this must be a plastic
Wiffle ball bat, but there’s also something
going on with Baxter’s birthday. And Baxter is the mascot
for the Diamondbacks. And then there’s a military
and first responder discount. OK, so a lot. Three promotions in one game. So probably, this is going
to influence attendance on the Sunday early
afternoon game. OK, so the second
thing we’re going to do is we’re going to look
at historical information on attendance at
Diamondbacks games. In order to do this,
we’ll go to this website called baseballreference.com. And first thing I
want to point out is that there are some
sibling websites up here– football, basketball,
hockey, soccer– that all contain
the same format. So all the processes that you’re
going to see me go through, you can replicate these things
in these other websites. So for now, we’re interested in
just the Arizona Diamondbacks. There are several ways
to get to this team. We can either click on
this link called teams, or here from the
home page, you’ll see that it’s
available, because they list the current standings. So here you go,
Padres are actually in fourth place in the
National League West. And the Brewers, who
are the other opponents, are in first place in the
National League Central. So two first place teams
playing each other. So that might be interesting. Let’s look at Arizona only. So Arizona currently
is tied for first place with the Colorado Rockies. So there is a lot of information
about this team on this page. And you can see that
the roster is available. The team’s statistics, the
pitching, all the fielding totals. But the only thing
we’re interested in are the schedule and results
available at the very, very top of this page. So by clicking on this
page, we get the information that we are interested
in is here, attendance. So when you hover over
any of these columns, you get a definition
of that particular data under that column. Now you’ll see in this
case, they actually define what attendance means. It’s the number of
tickets sold at that game. That’s very different than
the number of people attended. So I would say this is a little
bit of an inflated number because people who
actually attend the game is slightly lower than
the number of tickets sold. So this is the number
we’re going to use. Not a big deal. It’s still very helpful. They list all the attendance
for all the games so far. So you’ll notice a few things. Up until present time,
the most recent game was played Thursday, June 1. I’m recording this
video on Saturday. So you’ll see that the
Friday information has not been uploaded. So that already
tells you the data is already a day old or
a day behind in terms of updating this website. That’s pretty good. Pretty good. OK, so how do we take
all this information and actually use
it to understand the typical attendance
for Diamondbacks games? So what we’re going
to do is we’re going to first modify this table. So I’m going to click
on Share and More, and select the first option,
which is modify and share. So what it does is it
converts this table into an editable format. So because I’m only interested
in home games and attendance at those home games,
there are a lot of room The columns on here that I’m
going to actually eliminate. So the nice thing about
baseballreference.com is it has these little circle
buttons so that if I click on, it eliminates that column. So I’m going to
eliminate streak. I’m going to eliminate
the opponent, whether it was a win or a loss. All the other
information that is irrelevant to this particular
question that I want to answer. So I am leaving
this column, which is showing me whether the
home the game is home or away. I’m going to eliminate
the team column, because by default, all
the data in this table is for the Arizona Diamondbacks. I’m also going to
eliminate the box score. I don’t I don’t care about
the details of the game. I only care about whether
the game was at home and what the attendance
was at that game. Now the season is still
going on in the data here is displayed in
chronological order, meaning that we only have
information available up until the present time
or delayed by a day. So what I’m going to do is,
because all this information is blank, I’m actually going to
click on the little button that looks like a home plate. And it’s going to eliminate
all the rows underneath. So now I’m only left
with attendance for games up to the present time. Now I’m ready to extract the
data and put it into a program that I can do some
manipulation on. So I’m going to go back to this
Share and More pull down menu, and I’m going to get
the table as a CSV. Now CSV stands for
commerce separated value. I’m going to click on that. You’re going to see
how the data now changed from a very nice
graphical summary and a table to sort of very raw
looking piece of data which, I’m going to copy. I’m going to highlight it,
right click, and select Copy. So now I’ve got all this
data on home attendance. And actually, on
total attendance. I’m going to do some additional
sorting in just a moment. But I’ve got all this
data on attendance to date for this season. So in Microsoft Excel, I’m
going to paste this data. There it is. Now, you’ll see that
it’s sort of all crunched only into
column A. So we have to do some
additional steps in order to separate the data
into multiple columns. So I’m going to go
to the data column. And there’s an option there
called text to columns. And what it does is
it starts to separate the information that you have. As long as it’s
separated by commas– in this case, it’s
delimited by commas– I’m going to hit Next. And then I’m going
to say no, comma is the actual separator
for this information. Hit Next. Yeah, that’s great. Finish. Whoops, sorry. Let me do that again. I should have highlighted the
entire sheet before I did that. So I highlighted
the entire sheet. Or I can actually just
get away with highlighting column A, because all the
data is in column A. Again, go to Data menu. Text to columns. Delimited. Correct. Comma delimited. And you’ll see down
here it gives you a preview of what the data is
going to look like before we actually select Finish. Look at that. That’s great. That’s what I want. You notice that the data
in column B sort of creeps into column C. So
what I’m going to do is I’m going to get right
in between those two columns and I’m going to double click. And what it does is it
automatically readjusts the width of that column. Now this column C is
not labeled anything, because when we pulled the data
from baseballreference.com, it didn’t have a label
back then, anyway. So I do want to label it now. And I’m just going to call it
H or A, as in home or away. Now because the question
that I want you to answer has to do with whether
the attendance at home gives us any hints,
we want to get rid of all the games that are away. And those are marked
by this @ symbol. So you’ll see that
the Diamondbacks played a bunch of away games. Then came home, played a
bunch of games at home, and then played a bunch of
games away, and so forth. So what I’m going to do
is I can either manually come in and remove each
row, which would take me a few seconds to do. Or I can be a little
smarter and say OK let’s go to Data and Sort. So what I’m going to
do is now sort the data by whether the games
were home or away. Before we even do
this though, you have to select this button
that says my data has headers. And what this tells
Excel is that row number 1 are just labels. And everything
underneath is data. So we click that, and you’ll
see how it sort of left out row number 1. And then I’m going to
say Sort by H or A. So what it’s done is it’s put
all the away games on the top, and all the home
games on the bottom. So now I can just select
the first few rows now I’m going to right
click on this section, and I’m going to say, delete. Goodbye. Now I’m only left with
data for home attendance. So there’s only
been 29 home games. Now let’s do some
analysis on this data. This column is no
longer useful to me because I’ve already
sorted the data. So I could right click on column
C and say goodbye as well. Now I’m only left with the
information that I care about. I’m going to double click on
this spot between columns C and D to readjust the size. I’m going to do one more
thing, because in order to look at these numbers,
it’s much easier for me to appreciate how large
these values are when there’s a comma in between. So I’m going to go to Home
and click the comma button. And then it’s going
to convert the number to a different format. And I don’t really need
the decimal points, so I’m going to
click on this button and say, I’m not interested
in fractions or decimals. I just want the whole number. So here’s the attendance
for the home games thus far at Diamondbacks
stadium, at Chase Field. Now what I’m going to do is try
to get some summary statistics for this data only. So I’m going to add a
couple of rows underneath. One is going to
be called average. The other one is going
to be called minimum. The next one maximum. And then the last one
is standard deviation. And I’ll explain
each of these, how it’s going to be helpful to us. So let’s use formulas in Excel
to calculate these four things. So with our mouse here
on this particular cell, I’m going to go to Formulas. And there’s already some
frequently used formulas in this menu called AutoSum. So I want Excel to
calculate the average. And look how it automatically
highlighted the information above. Now if it didn’t
automatically do that, you can actually put the mouse
here and highlight it yourself. Or you could go into the formula
itself, say this was blank, and manually type in, I want you
to analyze the data from cell C2 all the way to C30. And then you get
the same result. So the outcome
you get is 22,649. So that is the average number
of tickets sold for the first 29 games this season. Now that gives me a good sense
of what the typical number is, but I also want to understand
the range of numbers. So what’s the minimum
number of tickets and what’s the maximum
number of tickets. Let’s use other formulas
for this as well. So I’m going to go
to Formulas again. Sorry. Formulas again. My functions are
already in here, but if they don’t show
up on here for you, I’m going to go
to more functions. Formulas. More functions. So here in this
function menu, it’s asking me to tell it something
about the type of formula that I want. So I’m going to type
in min for minimum. And it down here,
it’s going to give me some recommendations
of formulas that relate to this word, this term. So fortunately,
there’s one called min. And that gives me the smallest
number in a set of values. That’s exactly what I want. OK, so what’s the range. Well, I can either
type it in manually, C2 all the way to C30. Or I could have
highlighted it in order to achieve the same outcome. So this the least
attended game is 12,215. Now let’s go to max. There is a formula that I
can access through this menu. More functions called max. Enter. And here it is. Returns the largest
value in a set of values. Bingo, that’s what I want. Hit OK. So I want you to analyze from
cell C2 all the way to C30. Or– I’m going to remove that– I can also just
click on this button. And it’ll send me to
the data, and give me an opportunity to select. Then I’m going to hit Enter. C2 to C30, yes. Great, go. So the maximum number is 49,016. And I bet you that
was opening day. Game one, 49,000. Yes, opening day is
always very popular. So we know the average. We know the minimum. We know the. Maximum Another
descriptive statistic is standard deviation. So there is a formula
for that as well. I go to formula,
AutoSum, more functions. And then I’m going to type
in standard deviation. So there are a lot of
versions of this formula. Let’s just use a
very basic one called STDEV for standard deviation. And it estimates a standard
deviation based on a sample. Yeah, that’s what I want. Give me the standard
deviation of C2 to C30. Either I’m typing it in manually
or I go to the actual data, highlight it, and
then hit Enter. All right, beautiful. So what this
information tells me is that one standard deviation
from the mean, that is, 8,000 fans, either
above or below the mean, above the $22,000 number or
below the $22,000 number, the majority of the data
sits within that range. So to put it in different
terms, about 60% of the time, the attendance at Chase
Field is between 30,000– that is 8,000 plus 22– and 22,000 minus 8,
so that’s 14,000. So 14,000 to 30,000,
66% of the time, that is the attendance
at the Chase Field. OK, so now your
job is to replicate what I’ve done in
Microsoft Excel by using the data on
baseballreference.com. But notice that the only
thing that I’ve done is I’ve compared data from to 2017. What you could do is look
at data from the 2016 season as well. Or look at data for
an entire decade, and try to utilize
that information to figure out how the
various promotions, how the various opponents influence
this column right here, which is attendance sorted by this
column here, home games. OK, I hope that’s enough
information for you to replicate this. And then eventually,
your objective is to submit your
predictions on this survey. So let me go to this
survey and show you where I want you to
enter your predictions for this particular exercise. It’s a Google survey
that I’ve created. And what I want you to do
is enter your prediction– this is a numerical input
for the Thursday game. And then I want you
to justify how did you come up with this number. Just give me a few data
points on how you actually obtained that. Then you estimate the
attendance on the Friday game and justify that. And finally, the attendance
at the Sunday game and justify that. So that is the extent of
this particular exercise. There are more components
of homework nuber one in the first inning. But this just will get you
started with utilizing the data from baseballreference.com and
making some inferences and some descriptive statistics– in particular, average,
minimum, maximum, and standard deviation. All right, play ball.

Leave a Reply

Your email address will not be published. Required fields are marked *