Fitting a Probability Distribution

The goal is to find a probability distribution that closely matches real data. The data come from the "Football scores and point spreads" example in Bayesian Data Analysis.

using CSV, DataFrames, StatsPlots
df = CSV.File("_assets/datasets/football.csv", delim = ' ', ignorerepeated = true, header=true) |> DataFrame
first(df, 5)

5×7 DataFrame
 Row │ home   favorite  underdog  spread   favorite.name  underdog.name  week
     │ Int64  Int64     Int64     Float64  String3        String3        Int64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │     1        21        13      2.0  TB             MIN                1
   2 │     1        27         0      9.5  ATL            NO                 1
   3 │     1        31         0      4.0  BUF            NYJ                1
   4 │     1         9        16      4.0  CHI            GB                 1
   5 │     1        27        21      4.5  CIN            SEA                1

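Here, favorite is the actual score of the team expected to win, underdog is the actual score of the team expected to lose, and spread is the predicted point difference (the point spread) for that game.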

outcomes = df.favorite .- df.underdog
first(outcomes, 5)

5-element Vector{Int64}:
  8
 27
 31
 -7
  6

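outcomes holds the actual point difference (favorite minus underdog) for each game. Next, plot a histogram of the actual point difference minus the predicted point difference.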

histogram(outcomes - df.spread, label="outcome - point spread", yrotation=90)
savefig(joinpath(@OUTPUT, "football.svg"))
nothing

Now fit a normal distribution to the data shown in this histogram.

using Distributions
fitted = fit(Normal, outcomes - df.spread)

Distributions.Normal{Float64}(μ=0.22589285714285715, σ=13.687140377113344)

The fitted normal distribution has mean 0.23 and standard deviation 13.69.
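For reference, fit(Normal, x) in Distributions.jl performs maximum likelihood estimation, so the fitted μ should equal the sample mean and the fitted σ should equal the standard deviation computed with divisor n rather than n - 1. The following is a minimal sketch of that check. The vector x and the seed are illustrative assumptions (synthetic data standing in for outcomes - df.spread), not the football data itself.

```julia
using Distributions, Random, Statistics

Random.seed!(1)                    # arbitrary seed, only for reproducibility
x = 13.7 .* randn(500) .+ 0.2      # synthetic stand-in for outcomes - df.spread

d = fit(Normal, x)                 # maximum likelihood fit

# MLE of a normal: sample mean and the uncorrected (divide-by-n) standard deviation
mu_hat = mean(x)
sigma_hat = std(x; corrected=false)

println("fitted μ = ", d.μ, ", sample mean = ", mu_hat)
println("fitted σ = ", d.σ, ", uncorrected std = ", sigma_hat)
```

With the default std(x), which divides by n - 1, the value would be slightly larger than the fitted σ. For a sample of a few hundred games the difference is small.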

histogram(outcomes - df.spread, normalize=:pdf, label="outcome - point spread", yrotation=90)
plot!(fitted, label="Normal PDF")
savefig(joinpath(@OUTPUT, "football_fitted.svg"))
nothing

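Here, the histogram is normalized to a probability density with normalize=:pdf so that it is on the same scale as the fitted PDF. A standard deviation of 13.69 means that, to the extent the data follow the fitted normal distribution, about 68% of the data lie within ±13.69 of the mean 0.23.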
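The 68% figure is a property of the normal model, so it is worth comparing against the share of observations that actually fall inside the interval. The following is a rough sketch of such a check. As an assumption for illustration, it uses synthetic data in x rather than the football data; with the real data, x would be outcomes - df.spread.

```julia
using Distributions, Random, Statistics

Random.seed!(1)                    # arbitrary seed, only for reproducibility
x = 13.7 .* randn(500) .+ 0.2      # synthetic stand-in for outcomes - df.spread
d = fit(Normal, x)

# Probability the fitted normal assigns to [μ - σ, μ + σ]
model_share = cdf(d, d.μ + d.σ) - cdf(d, d.μ - d.σ)

# Fraction of the observations that actually fall in that interval
empirical_share = mean(abs.(x .- d.μ) .<= d.σ)

println("model share:     ", model_share)
println("empirical share: ", empirical_share)
```

If the empirical share is far from the model share, that suggests the normal distribution may not describe the data well, even when the fitted curve looks plausible on the histogram.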