<p>Aaron Niskin’s Blog: Not just another mathy blog! Now with more... Aaron!</p>
<p>The sequential shooters (2019-07-07), http://aaron.niskin.org/blog/2019/07/07/sequential_shooters_riddle</p>
<p>There are $N$ people indexed by $1,\dots,N$ (because down with the zero-indexers! Long live <a href="https://en.wikipedia.org/wiki/Peano_axioms">Peano!</a>) standing in a circle. They’re standing such that $i$ is between $i-1$ and $i+1$ (unless $i \in \{1,N\}$, for obvious reasons).</p>
<p>Then starting with $1$ and going around the circle in turn, each person shoots the person directly to their left. This stops when there’s only one person left alive. For example: if $N=5$, it would go:</p>
<!-- more -->
<ol>
<li>1 shoots 2</li>
<li>3 shoots 4</li>
<li>5 shoots 1</li>
<li>3 shoots 5</li>
</ol>
<p>and only 3 remains.</p>
<h2 id="the-task">The task</h2>
<p>Given $N$, what is the index (1-indexed, unfortunately) of the last shooter standing? For example: if $N=5$, the answer is $3$.</p>
<h2 id="solution">Solution:</h2>
<div class="hint">
<p>Every positive integer $N$ can be written uniquely as $2^i + \lambda$, where $i,\lambda\in\mathbb{N}_{\geq 0}$ and $\lambda$ is minimal (hence $i$ is maximal, so $2^i > \lambda$).</p>
<p>The answer to the riddle is $1 + 2\lambda$ for this particular $\lambda$.</p>
<p>So full disclosure, I kinda cheated on this one. We’ll get to that in a bit; but first let’s go through the motions and let me walk you through my thought process (as if that’s super valuable).</p>
<p>Firstly, we can immediately tell that every time there is an even number of people in the circle, that round will end up at the same person it started with (as in, if there are 6 people in the circle, then 2, 4, and 6 are killed in the first round and we start round two on 1 again).</p>
<p>From there we can conclude that any power of two should end on 1. e.g.</p>
<p>if $N = 2^1 = 2$ we have:</p>
<ol>
<li>1 shoots 2</li>
</ol>
<p>if $N = 2^2 = 4$</p>
<ol>
<li>1 shoots 2</li>
<li>3 shoots 4</li>
</ol>
<p>and we’re left in pretty much the same scenario as $N=2$.</p>
<p>If $N=8$,</p>
<ol>
<li>1 shoots 2</li>
<li>3 shoots 4</li>
<li>5 shoots 6</li>
<li>7 shoots 8</li>
</ol>
<p>and we’re left in pretty much the same scenario as $N=4$. In fact, for $2^{i+1}$, the first iteration will wipe out half of the people and leave us starting on 1 again – thereby leaving us in a scenario analogous to $2^i$.</p>
<p>What about the other cases? Well, naively we can see that every shot moves the head forward by two slots (when 1 shoots 2, the round robin starts again at 3 but with one fewer player). So for $N = 2^i + \lambda$, after $\lambda$ shots there are $2^i$ people left and it’s $2\lambda + 1$’s turn, which by the power-of-two argument should make $2\lambda + 1$ the survivor. A proper proof would have to show that the head count reaches $2^i$ at exactly that point and not beforehand. But I’m feeling lazy and I’m pretty confident that it’s correct, so I’ll just show that it’s true for the first few programmatically…</p>
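<p>Before simulating anything, here is a minimal sketch of the conjectured formula on its own. The name <code>last_shooter</code> is made up for this example, and it finds the largest power of two with integer bit tricks rather than with logarithms; treat it as one possible way to write $1 + 2\lambda$, not as the code used in the rest of the post.</p>

```python
def last_shooter(n: int) -> int:
    """Conjectured survivor: write n = 2**i + lam with i maximal, return 2*lam + 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    highest_power = 1 << (n.bit_length() - 1)  # largest power of two <= n
    lam = n - highest_power
    return 2 * lam + 1


# The worked example from the riddle: with 5 people, person 3 survives.
print(last_shooter(5))  # 3
print([last_shooter(n) for n in range(1, 9)])  # [1, 1, 3, 1, 3, 5, 7, 1]
```

<p>Sticking to integers sidesteps any floating point rounding for very large $N$, which only matters well beyond the range checked here.</p>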
</div>
<h3 id="the-programmatic-solution">The Programmatic Solution</h3>
<div class="hint">
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="k">def</span> <span class="nf">rotate</span><span class="p">(</span><span class="n">lst</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="s">"rotate a list n times sending the first elems to the back"</span>
    <span class="k">return</span> <span class="n">lst</span><span class="p">[</span><span class="n">n</span><span class="p">:]</span> <span class="o">+</span> <span class="n">lst</span><span class="p">[:</span><span class="n">n</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">shoot</span><span class="p">(</span><span class="n">lst</span><span class="p">):</span>
    <span class="n">lst</span> <span class="o">=</span> <span class="n">rotate</span><span class="p">(</span><span class="n">lst</span><span class="p">)</span>
    <span class="n">lst</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">lst</span>

<span class="k">def</span> <span class="nf">play</span><span class="p">(</span><span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">players</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span> <span class="c"># N-1 shots leave one standing</span>
        <span class="n">players</span> <span class="o">=</span> <span class="n">shoot</span><span class="p">(</span><span class="n">players</span><span class="p">)</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">players</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Wrong number of players left"</span>
    <span class="k">return</span> <span class="n">players</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">get_lambda</span><span class="p">(</span><span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">i</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log2</span><span class="p">(</span><span class="n">N</span><span class="p">)))</span>
    <span class="k">return</span> <span class="n">N</span> <span class="o">-</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">i</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">check_theory</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">play</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">==</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">get_lambda</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
    <span class="k">assert</span> <span class="n">check_theory</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span>
</code></pre></div> </div>
<p>And we can confirm that the relationship holds, at least for every $N$ below 1000. To do more, we’ll have to account for Python’s speed issues.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">time</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">numba</span> <span class="kn">import</span> <span class="n">jit</span>

<span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_rotate</span><span class="p">(</span><span class="n">lst</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
    <span class="s">"rotate a list n times sending the first elems to the back"</span>
    <span class="k">return</span> <span class="n">lst</span><span class="p">[</span><span class="n">n</span><span class="p">:]</span> <span class="o">+</span> <span class="n">lst</span><span class="p">[:</span><span class="n">n</span><span class="p">]</span>

<span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_shoot</span><span class="p">(</span><span class="n">lst</span><span class="p">):</span>
    <span class="n">lst</span> <span class="o">=</span> <span class="n">_rotate</span><span class="p">(</span><span class="n">lst</span><span class="p">)</span>
    <span class="n">lst</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">lst</span>

<span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_play</span><span class="p">(</span><span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">players</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="o">+</span><span class="mi">1</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span> <span class="c"># N-1 shots leave one standing</span>
        <span class="n">players</span> <span class="o">=</span> <span class="n">_shoot</span><span class="p">(</span><span class="n">players</span><span class="p">)</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">players</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Wrong number of players left"</span>
    <span class="k">return</span> <span class="n">players</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_get_lambda</span><span class="p">(</span><span class="n">N</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
    <span class="n">i</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">log2</span><span class="p">(</span><span class="n">N</span><span class="p">)))</span>
    <span class="k">return</span> <span class="n">N</span> <span class="o">-</span> <span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">i</span><span class="p">)</span>

<span class="nd">@jit</span><span class="p">(</span><span class="n">nopython</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_check_theory</span><span class="p">(</span><span class="n">i</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">_play</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">==</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">_get_lambda</span><span class="p">(</span><span class="n">i</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>

<span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1000</span><span class="p">):</span>
    <span class="k">assert</span> <span class="n">_check_theory</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span>
</code></pre></div> </div>
<p>And now we run our horse race!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">race</span><span class="p">(</span><span class="n">func</span><span class="p">,</span> <span class="n">max_n</span><span class="p">,</span> <span class="n">num_races</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_races</span><span class="p">):</span>
        <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
            <span class="k">assert</span> <span class="n">func</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">'no numba'</span><span class="p">)</span>
<span class="n">race</span><span class="p">(</span><span class="n">check_theory</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
</code></pre></div> </div>
<p>On my machine, that prints something on the order of 1 second for each of the five runs. Now let’s try numba!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">'yes numba'</span><span class="p">)</span>
<span class="n">race</span><span class="p">(</span><span class="n">_check_theory</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
</code></pre></div> </div>
<p>This one takes about 1.5 seconds for the first run (which includes numba compiling the functions), then consistently half a second for the rest.</p>
<p>The point being, if you’re going to run a function a bunch of times, it’s probably worthwhile letting numba have a go at it – it’s super cheap to do after all.</p>
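<p>A different angle on the speed issue, and not the one this post takes: the list version copies the whole list on every shot, so a single game costs on the order of $N^2$ operations. Below is a sketch that swaps in <code>collections.deque</code> from the standard library, where each shot is a constant-time rotate and pop. The name <code>play_deque</code> is made up for this example, and I haven’t timed it against the numba version.</p>

```python
from collections import deque


def play_deque(n: int) -> int:
    """Simulate the circle; whoever shoots next sits at the left end of the deque."""
    players = deque(range(1, n + 1))
    while len(players) > 1:
        players.rotate(-1)  # the shooter moves to the back of the line
        players.popleft()   # the person who is now in front gets shot
    return players[0]


# Same check as before: the survivor should be 2 * lambda + 1.
for n in range(1, 2000):
    lam = n - (1 << (n.bit_length() - 1))
    assert play_deque(n) == 2 * lam + 1
print(play_deque(5))  # 3
```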
</div>
<p>The Sequential Doors and The Data Science Analogue (2019-07-06), http://aaron.niskin.org/blog/2019/07/06/sequential_doors_riddle</p>
<p>This post is going to be something of a combination riddle and blog post.</p>
<h1 id="the-setup">The setup</h1>
<p>There are $N$ doors indexed by $1,\dots,N$ along a wall and they’re all closed. Then for each $i\in\{1,2,\dots,N\}$, you toggle every $i$th door. So at first ($i = 1$) you toggle every door. Then at $i=2$ you toggle every second door – the even doors ($2, 4, 6, \dots$).</p>
<!-- more -->
<h2 id="the-task">The task</h2>
<p>Given $N$, describe the “door is closed” function. By that I mean, which doors are open and which are closed at the end?</p>
<h1 id="solution">Solution:</h1>
<div class="hint">
<p>Only the perfect squares are open at the end. So $1,4,9,16,\dots$ are open, and $2,3,5,6,7,8,\dots$ are closed.</p>
</div>
<h1 id="explanation">Explanation</h1>
<p><strong>NOTE: Spoilers below!</strong></p>
<p>So there are really two philosophies to solving this riddle, and they each have their merits.</p>
<h2 id="theory">Theory</h2>
<p>The first philosophy I’m talking about is what I’ll call the <em>theory based</em> approach. For this approach we’ll investigate what’s driving the system. By that we mean, given a number $m\leq N$, how many times will that $m$ be toggled? Note that we’re not investigating the task directly, but rather something a lot deeper and more specific.</p>
<p>Well, a number $m$ is toggled exactly once per divisor. For instance, the number $6$ is toggled 4 times – once at each of $1,2,3,6$ – and hence is closed at the end. $9$, on the other hand, only gets toggled $3$ times – at $1,3,9$ – and hence is open at the end.</p>
<p>So how many divisors does a given number have? For that, we’ll use some beautiful number theory (the number theory we’re using isn’t deep or anything, but any number theory is beautiful really). Any counting number $m$ can be written as a product of distinct primes raised to non-negative integer exponents. So, mathematically, what I’m saying is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
m =& \prod p_i^{\alpha_i}
\end{align*} %]]></script>
<p>Since every number can be written that way, all of $m$’s divisors can be written that way too. By definition, any divisor of $m$ must be $\prod p_i^{\lambda_i}$ such that for each $i$, $0 \leq \lambda_i \leq \alpha_i$ – and each such collection of $\lambda_i$ determines a unique divisor. So how many such numbers can we create? Clearly: $\prod\limits_i(\alpha_i + 1)$ because we can have up to $\alpha_i$ powers of $p_i$, but we could also have 0 of them. So there are $\alpha_i + 1$ choices for each exponent.</p>
<p>So now we can ask: which numbers have an odd number of divisors? Those will be the ones left open and the others will be closed. Well, an odd number of divisors means that $\gamma = \prod\limits_i(\alpha_i + 1)$ is odd. But since $\gamma$ is odd, none of its divisors can be even, and hence $\alpha_i + 1$ must be odd for each $i$. But that means $\alpha_i$ must be even for each $i$. And that’s the exact description of a perfect square.</p>
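<p>To make the counting argument concrete, here is a small sketch that counts divisors as $\prod_i(\alpha_i + 1)$ and compares the result against a brute-force count. The helper names (<code>prime_exponents</code>, <code>count_divisors</code>, <code>count_divisors_brute</code>) are made up for this example, and the factorization is plain trial division, which is fine for small $m$ only.</p>

```python
def prime_exponents(m: int) -> dict:
    """Trial-division factorization: returns {prime: exponent} for m >= 1."""
    exponents = {}
    p = 2
    while p * p <= m:
        while m % p == 0:
            exponents[p] = exponents.get(p, 0) + 1
            m //= p
        p += 1
    if m > 1:  # whatever is left over is a single prime factor
        exponents[m] = exponents.get(m, 0) + 1
    return exponents


def count_divisors(m: int) -> int:
    """Number of divisors of m as the product of (alpha_i + 1)."""
    total = 1
    for alpha in prime_exponents(m).values():
        total *= alpha + 1
    return total


def count_divisors_brute(m: int) -> int:
    """Count divisors by trying every candidate from 1 to m."""
    return sum(1 for d in range(1, m + 1) if m % d == 0)


assert all(count_divisors(m) == count_divisors_brute(m) for m in range(1, 500))
print(count_divisors(6), count_divisors(9))  # 4 3
```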
<p>Another way to use this understanding is that for every divisor $d$ there’s a pair, a buddy: $m/d$. But only if $m$ is a perfect square will there be a divisor whose buddy is itself, thereby making an odd number of divisors. But this is still based on our theoretical understanding of the scenario.</p>
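<p>The buddy argument can be sketched in a few lines too. The names <code>divisor_pairs</code> and <code>ends_open</code> are made up for this example, and <code>math.isqrt</code> needs Python 3.8 or newer.</p>

```python
from math import isqrt


def divisor_pairs(m: int) -> list:
    """Pair each divisor d with its buddy m // d, listing each pair once (d <= m // d)."""
    return [(d, m // d) for d in range(1, isqrt(m) + 1) if m % d == 0]


def ends_open(m: int) -> bool:
    """Door m ends open exactly when some divisor is its own buddy, i.e. m is a perfect square."""
    return isqrt(m) ** 2 == m


print(divisor_pairs(6))  # [(1, 6), (2, 3)]
print(divisor_pairs(9))  # [(1, 9), (3, 3)]
print([m for m in range(1, 40) if ends_open(m)])  # [1, 4, 9, 16, 25, 36]
```

<p>For $6$ every divisor has a distinct buddy, so the count is even; for $9$ the pair $(3, 3)$ contributes only one divisor, which is what makes the total odd.</p>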
<h2 id="data-science-approach">Data Science Approach</h2>
<p>The second approach is one that’s particularly useful if the theory is very costly to investigate (for whatever reason). For this approach we’ll just collect a bunch of data, and see what fits. For this example, we’ll collect only the data that directly answers the question we’re asking. Let’s do up to 12 (just to pick a random number), letting 0 denote closed and 1 open.</p>
<table>
<thead>
<tr>
<th style="text-align: center">iteration</th>
<th style="text-align: center">1</th>
<th style="text-align: center">2</th>
<th style="text-align: center">3</th>
<th style="text-align: center">4</th>
<th style="text-align: center">5</th>
<th style="text-align: center">6</th>
<th style="text-align: center">7</th>
<th style="text-align: center">8</th>
<th style="text-align: center">9</th>
<th style="text-align: center">10</th>
<th style="text-align: center">11</th>
<th style="text-align: center">12</th>
<th style="text-align: left">NOTES</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">5</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: left">transcription error</td>
</tr>
<tr>
<td style="text-align: center">6</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">7</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">8</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">9</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">10</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">11</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: left"> </td>
</tr>
<tr>
<td style="text-align: center">12</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: center">1</td>
<td style="text-align: center">0</td>
<td style="text-align: center">0</td>
<td style="text-align: left"> </td>
</tr>
</tbody>
</table>
<p>At this point we might already have noticed a pattern. The first is 1, then $1 + 3 = 4$. Then $4 + 2\cdot 3 = 10$. So the next one should be $10 + 3 \cdot 3 = 19$, which we know from the first approach is wrong, so we collect more data and reject this hypothesis. We can imagine that with enough data we’d start to notice that 9 and 10 are outliers and even probably notice the transcription error at iteration 5 and end up with something like this.</p>
<table>
<thead>
<tr>
<th style="text-align: left">index</th>
<th style="text-align: left">closed?</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">1</td>
<td style="text-align: left">0</td>
</tr>
<tr>
<td style="text-align: left">2</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">3</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">4</td>
<td style="text-align: left">0</td>
</tr>
<tr>
<td style="text-align: left">5</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">6</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">7</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">8</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">9</td>
<td style="text-align: left">0</td>
</tr>
<tr>
<td style="text-align: left">10</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">11</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">12</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">13</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">14</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">15</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">16</td>
<td style="text-align: left">0</td>
</tr>
</tbody>
</table>
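<p>A table like the one above can also be generated instead of transcribed, which takes transcription errors out of the picture. Here is a minimal sketch; the name <code>closed_flags</code> is made up for this example, and it reports 1 for a door that ends up closed and 0 for one that ends up open, matching the 0s at 1, 4, 9 and 16 above.</p>

```python
def closed_flags(n_doors: int) -> list:
    """Run every toggle pass; entry k-1 is 1 if door k ends closed, 0 if it ends open."""
    is_open = [False] * (n_doors + 1)  # index 0 is unused so doors stay 1-indexed
    for step in range(1, n_doors + 1):
        for door in range(step, n_doors + 1, step):
            is_open[door] = not is_open[door]
    return [0 if is_open[door] else 1 for door in range(1, n_doors + 1)]


for door, flag in enumerate(closed_flags(16), start=1):
    print(door, flag)
```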
<p>And we conclude that the pattern is 1, then skip 3 (to 4), then skip 5 (to 9), then skip 7 (to 16), etc. and confirm it with some out-of-sample:</p>
<table>
<thead>
<tr>
<th style="text-align: left">index</th>
<th style="text-align: left">closed?</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">17</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">18</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">19</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">20</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">21</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">22</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">23</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">24</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">25</td>
<td style="text-align: left">0</td>
</tr>
<tr>
<td style="text-align: left">26</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">27</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">28</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">29</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">30</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">31</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">32</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">33</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">34</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">35</td>
<td style="text-align: left">1</td>
</tr>
<tr>
<td style="text-align: left">36</td>
<td style="text-align: left">0</td>
</tr>
<tr>
<td style="text-align: left">37</td>
<td style="text-align: left">1</td>
</tr>
</tbody>
</table>
<p>and I’m satisfied. But that pattern isn’t very illuminating, and it’s only valid in the statistical sense. But either way, we get an answer<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
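<p>The out-of-sample check can be sketched as code as well: fit the “gaps of 3, 5, 7, …” pattern on the first 16 doors, then compare its predictions with a simulation on doors the pattern never saw. The function names and the limit of 1000 are choices made for this example only.</p>

```python
def predicted_open(limit: int) -> list:
    """Doors the gap pattern predicts open: start at 1, then add 3, 5, 7, ..."""
    doors, door, gap = [], 1, 3
    while door <= limit:
        doors.append(door)
        door += gap
        gap += 2
    return doors


def simulated_open(limit: int) -> list:
    """Doors that actually end open after all toggle passes over `limit` doors."""
    is_open = [False] * (limit + 1)  # index 0 is unused so doors stay 1-indexed
    for step in range(1, limit + 1):
        for door in range(step, limit + 1, step):
            is_open[door] = not is_open[door]
    return [door for door in range(1, limit + 1) if is_open[door]]


in_sample, limit = 16, 1000
prediction = [d for d in predicted_open(limit) if d > in_sample]
observation = [d for d in simulated_open(limit) if d > in_sample]
print(prediction == observation)  # True
```

<p>Agreement on 984 unseen doors is still only evidence in the statistical sense; it says nothing about door 1001.</p>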
<h1 id="the-problem">The Problem</h1>
<p>What if we work in a field where data accrual is very slow (think Finance, Economics, or any such field)? Clearly this approach has some pretty deep flaws for us. Our out-of-sample data grows too slowly to be able to make conclusions like we made at the end there, and we’re stuck in the first realm where all data has errors and there are a plethora of theories that look great (in sample). And even if we hold out some data as “out-of-sample” and try to do everything right, we can still over-fit to that sample.</p>
<p>Luckily for us, there’s a lot of autocorrelation in these things. So if you’ve done everything properly and found something that works on the hold-out, it shouldn’t be too surprising if it works well for at least a short period on true out-of-sample, even if the thing is over-fit, but that’s not particularly satisfying.</p>
<p>So how else should we go about this? The economist’s approach, perhaps? It seems sensible at first glance; you get to build off of the tremendous work of the mathematical community and potentially get conclusive answers. We can certainly model things with increasing complexity and beauty (unrelated), but at the end of the day the only way to assess the validity of these models is through out-of-sample predictions. You know, the way Physicists do it. But we can’t really do that here, so we do like we said we would and have a hold out, etc. and we’re already back to the data-science approach but with more story-telling. And naturally, since we want to pretend like we’re doing something more rigorous than data-science, we judge this largely on the sensibility of the stories. But all that means is that we’re unwilling to accept theories that are counter-intuitive – sorry Einstein, your theory of relativity isn’t wanted here.</p>
<p>So either we do the data science approach, or we do the “sensibility” based approach, which is itself based on itself. Why do people prefer the sensibility based approach? If you’re reading this, please really do <a href="mailto:aaron@niskin.org?Subject=Economics%20data-science%20blog-post">email</a> me your thoughts on this – I’d love to hear other opinions here.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We can show that these two solutions are equivalent by induction. Clearly it’s true for 1 (1 is a perfect square and is how we defined the start of the sequence). Now imagine a square of side-length $n$ and let’s see how many unit squares we have to add to get to a square with side-length $n+1$. Well, we have to add $n$ along each of two adjacent sides, then 1 for the corner. So we add $2n + 1$, i.e. $(n+1)^2 = n^2 + (2n+1)$. So for instance, if $n=1$ (the base case) then the next perfect square would be $n^2 + (2n+1) = 1 + 3 = 4$. Indeed! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Aaron NiskinThis post is going to be something of a combination riddle and blog post. The setup There are $N$ doors indexed by $1,..,N$ along a wall and they’re all closed. Then for each $i\in{1,2,\dots,N}$, you toggle every $i$ door. So at first ($i = 1$) you toggle every door. Then $i=2$ you toggle every second door – the even doors ($2, 4, 6, \dots,$).Quantile Regression and Quantiles of Loss2019-06-04T00:00:00-07:002019-06-04T00:00:00-07:00http://aaron.niskin.org/blog/2019/06/04/quantile_regrssion<h1 id="background">Background</h1>
<p>If you’ve ever read any papers referring to quantile regression (QR), you undoubtedly have read “MSE yields means, MAE yields medians, and QR yields quantiles”. But what the hell does that even mean? Like not just hand-wavy bullshit, but really, what does that mean? After all, these are just loss functions and therefore don’t <em>yield</em> anything at all. Even the regressions yield coefficients, or, more generally, models; so whence cometh the mean, median talk? It turns out they’re talking about the distribution of residuals. In this post, we’ll show what they mean, how that’s the case, and we’ll investigate a bit more about QR.</p>
<!-- more -->
<h2 id="what-is-it">What is it?</h2>
<p>Given a vector of residuals $r\in\mathbb{R}^n$, and a scalar $q\in[0,1]$, we define the <strong>Quantile Loss</strong> with quantile $q$ ($L_q$) as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
L_q : \mathbb{R}^n& \to \mathbb{R} \\
L_q(r) =& \sum_{i=1}^n \begin{cases} qr_i & r_i > 0 \\ (1-q)(-r_i) & r_i \leq 0\end{cases} \\
=& \sum_{i=1}^n \begin{cases} q|r_i| & r_i > 0 \\ (1-q)|r_i| & r_i \leq 0\end{cases}
\end{align*} %]]></script>
<h3 id="in-words">In words</h3>
<p>Basically, the loss is very much like MAE (A.K.A. $L_1$ loss) but we weight the absolute errors based on whether they are positive or negative.</p>
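<p>As a rough sketch of that definition in code (assuming the loss of a whole vector is just the sum of the per-residual losses; the function name is made up for illustration):</p>

```python
def quantile_loss(residuals, q):
    """Sum of q*|r| over positive residuals and (1-q)*|r| over the rest."""
    return sum(q * r if r > 0 else (1 - q) * (-r) for r in residuals)


residuals = [2.0, -1.0, 0.5, -3.0]
print(quantile_loss(residuals, 0.5))  # 3.25, i.e. half the sum of absolute errors
print(quantile_loss(residuals, 0.9))  # positive residuals now cost 9x more than negative ones
```

<p>With $q = 0.5$ both signs get the same weight, which is why that case is just a rescaled MAE.</p>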
<h2 id="relation-to-quantiles">Relation to Quantiles</h2>
<p>A natural question to ask at this point is, when are you getting to your point? Er, I mean, what does this have to do with quantiles?</p>
<p>To answer this, let’s investigate when the derivative will equal zero (a necessary condition for a minimum). Throughout, take the residual to be $r = y - \hat y$ and differentiate with respect to the prediction $\hat y$ (a common shift of every residual). To make our lives easier, let $\omega_0$ be the set of residuals that are greater than zero, and $\omega_1$ be those that are less than or equal to zero.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& \frac{\partial L_q}{\partial \hat y} \\\\
=& -\sum_{r\in\omega_0}q + \sum_{r \in\omega_1}(1-q) \\\\
=& -q\sum_{r\in\omega_0}1 + (1-q)\sum_{r \in\omega_1}1 \\\\
=& -q|\omega_0| + (1-q)|\omega_1| \iff \\\\
|\omega_1| =& q(|\omega_0| + |\omega_1|) \iff \\\\
q =& \frac{|\omega_1|}{|\omega_0| + |\omega_1|}
\end{align*} %]]></script>
<p>So we can see that the Quantile Regression optimization problem attempts to over-estimate a fraction $q$ of the observations (those with $r \leq 0$, i.e. $\hat y \geq y$). In particular, if $q = 0.5$, then the QR loss attempts to ensure that there are just as many over-estimations as under-estimations.</p>
<p>But if $q=0.9$, for instance, then this loss will attempt to ensure your model over-estimates about 90% of the observations and under-estimates about 10% of them.</p>
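<p>Here is a small brute-force check of that claim. It is only a sketch under stated assumptions: residuals are taken as $r = y - \hat y$, the “model” is a single constant, and the toy data and helper names are invented for the example.</p>

```python
def quantile_loss(residuals, q):
    return sum(q * r if r > 0 else (1 - q) * (-r) for r in residuals)


def best_constant(ys, q):
    """Brute-force the constant prediction (chosen among the data points) with the lowest loss."""
    return min(ys, key=lambda c: quantile_loss([y - c for y in ys], q))


ys = list(range(1, 100))  # toy data: 1, 2, ..., 99
c = best_constant(ys, 0.9)
over_estimated = sum(1 for y in ys if y <= c) / len(ys)
print(c, round(over_estimated, 3))  # 90 0.909
```

<p>The best constant sits at (roughly) the 0.9 quantile of the data, so about 90% of the observations are at or below the prediction.</p>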
<h3 id="practice">Practice</h3>
<p>In practice it seems to converge somewhere around the quantile, but depending on the outlier situation you have, it might be somewhat off. That might be an effect of the number of iterations I’ve allowed, but either way, it’s not far enough from the designated quantile to raise any alarms; especially not since whatever you’re doing, it’s probably using Stochastic Gradient Descent anyway.</p>
<h2 id="further-investigation">Further Investigation</h2>
<p>I lied. I’m tired and I don’t think I’ll be following through on my desires for further investigation. “Why don’t you just delete the part where you said you’d have further investigation”? Well, frankly, I don’t think you’re reading this anyway. Email me if you actually read this. I’d be super interested to find out.</p>
<h2 id="summary">Summary</h2>
<p>Quantile regression is... interesting.</p>Aaron NiskinBackground If you’ve ever read any papers referring to quantile regression (QR), you undoubtedly have read “MSE yields means, MAE yields medians, and QR yields quantiles”. But what the hell does that even mean? Like not just hand-wavy bullshit, but really, what does that mean? After all, these are just loss functions and therefore don’t yield anything at all. Even the regressions yield coefficients, or, more generally, models; so whence cometh the mean, median talk? It turns out they’re talking about the distribution of residuals. In this post, we’ll show what they mean, how that’s the case, and we’ll investigate a bit more about QR.Gradient Boosting Seen as Gradient Descent2019-04-25T00:00:00-07:002019-04-25T00:00:00-07:00http://aaron.niskin.org/blog/2019/04/25/gbm-basic<h1 id="tldr">tldr</h1>
<p>The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models. Typically gradient descent involves parameters and a closed form solution to the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update<sup id="fnref:tree_parameters"><a href="#fn:tree_parameters" class="footnote">1</a></sup>, and somewhat more importantly they aren’t even continuous predictors, much less differentiable!</p>
<p>But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model.</p>
<h1 id="gbm">GBM</h1>
<h2 id="the-naive-algorithm">The naive algorithm</h2>
<p>A description of the algorithm I’m talking about can be found on <a href="https://en.wikipedia.org/wiki/Gradient_boosting#Algorithm">wikipedia</a>, but I’ll go over the algorithm somewhat quickly here just to add another phrasing of the thing.</p>
<p>Let’s assume we have a feature-set $X$, a response variable $y$, a loss function $L(y, \hat y)$, and a learning rate $\lambda\in\mathbb{R}_+$ such that $\lambda \leq 1$. The algorithm is as follows:</p>
<p>First we define $f_0(X) = 0$.</p>
<p>Now if $i \geq 1$, we let:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
y_i =& -\frac{\partial{L(y, f_{i-1}(X))}}{\partial f_{i-1}(X)}
\end{align*} %]]></script>
<p>We then train a new set of trees (or whatever learner we want) such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
g_i(X) \sim& y_i
\end{align*} %]]></script>
<p>We then find the real-number scalar that minimizes our loss in the following equation (AKA a line-search):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\gamma_i =& \text{argmin}_\gamma L\left(y, f_{i-1}(X) + \gamma g_i(X)\right)
\end{align*} %]]></script>
<p>Finally we define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f_i(X) =& f_{i-1}(X) + \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
<p>So that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(X) =& f_N(X) = \sum\limits_{i=1}^N \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
<p>In the special case where our loss function is mean squared error (or $L_2$), our negative gradients $y_i$ are just the residuals (up to a constant factor).</p>
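<p>To make the steps concrete, here is a minimal pure-Python sketch of the naive algorithm for the squared-error case $L = \frac{1}{2}(y - f(X))^2$. Everything in it is an assumption made for illustration: a single feature, a one-split regression stump standing in for “a new set of trees”, the closed-form line search that squared error allows, and the names <code class="highlighter-rouge">fit_stump</code> and <code class="highlighter-rouge">gradient_boost</code>.</p>

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on one feature (a stand-in for 'a new set of trees')."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=100, learning_rate=0.5):
    """Naive boosting for L = 1/2 * (y - f)^2; returns the additive model f."""
    preds = [0.0] * len(xs)  # f_0(X) = 0
    stages = []              # pairs of (lambda * gamma_i, g_i)
    for _ in range(n_rounds):
        pseudo = [y - p for y, p in zip(ys, preds)]  # y_i: negative gradient = residuals
        g = fit_stump(xs, pseudo)                    # g_i(X) ~ y_i
        g_vals = [g(x) for x in xs]
        denom = sum(v * v for v in g_vals)
        if denom == 0:                               # nothing left to fit
            break
        # Exact line search for squared loss: argmin over gamma of sum (r - gamma * g)^2.
        gamma = sum(r * v for r, v in zip(pseudo, g_vals)) / denom
        stages.append((learning_rate * gamma, g))
        preds = [p + learning_rate * gamma * v for p, v in zip(preds, g_vals)]
    return lambda x: sum(weight * g(x) for weight, g in stages)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(mse)  # training error of the boosted model
```

<p>Swapping in a different loss only changes how <code class="highlighter-rouge">pseudo</code> is computed and how the line search is solved; the outer loop stays the same.</p>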
<h1 id="gradient-descent">Gradient Descent</h1>
<p>The typical scenario we generally talk about in gradient descent is fitting linear weights in some type of model (whether that be a linear model, logistic regression, neural network, whatever). So let’s stick to this paradigm and consider the simplest case (a linear model), and we’ll show how the algorithm above is actually the same algorithm, just slightly more general.</p>
<p>So we’re modeling $y = X\beta + \epsilon$; in the parlance of GBM: $f = \beta$ or $f(X) = X\beta$.</p>
<p>To solve this, we define $\beta_0 = 0$.</p>
<p>Then for $i \in \mathbb{N}$ with $i \geq 1$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\beta_i =& \beta_{i-1} - \lambda\gamma_i\frac{\partial L}{\partial \beta_{i-1}}
\end{align*} %]]></script>
<p>Where $L, \lambda, \gamma_i$ are exactly the same as in the gradient boosting algorithm<sup id="fnref:line_search"><a href="#fn:line_search" class="footnote">2</a></sup>. So that was the standard gradient descent algorithm. Now let’s show how these two are really the same beast.</p>
<h1 id="similarities">Similarities</h1>
<p>We’ll show that the update formula is the same (mutatis mutandis) in both, and then as a consequence the final models will be constructed in the same way (because they’re both aggregates of their updates)<sup id="fnref:updates"><a href="#fn:updates" class="footnote">3</a></sup>.</p>
<p>To make the notation easier, we define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\alpha_i =& \frac{\partial L}{\partial \beta_{i-1}}
\end{align*} %]]></script>
<h2 id="update-formula">Update Formula</h2>
<p>For the update formula, we first need to note the following:</p>
<ol>
<li>Gradients are linear by definition, so $g_i(X) := X\frac{\partial L}{\partial \beta_{i-1}}$ can be seen as the result of modeling the gradient ($\frac{\partial L}{\partial f_{i-1}(X)}$) by the application of a linear function (although it is normally unwise to do so).</li>
</ol>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f_i(X) =& X\beta_i \\\\
=& X\left(\beta_{i-1} - \lambda\gamma_i\frac{\partial L}{\partial \beta_{i-1}}\right) \\\\
=& X\beta_{i-1} - \lambda\gamma_iX\frac{\partial L}{\partial \beta_{i-1}} \\\\
=& X\beta_{i-1} - \lambda\gamma_i g_i(X) \\\\
=& f_{i-1}(X) - \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
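<p>Which is the same formula we have in gradient boosting, up to a sign: there $g_i$ is fit to the <em>negative</em> gradient, so the minus sign is absorbed into $g_i$.</p>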
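<p>The derivation above can be checked numerically. The sketch below is an illustration under assumptions of my choosing: squared-error loss, a tiny made-up design matrix, and a fixed step standing in for $\lambda\gamma_i$. It runs the parameter-space update and the function-space update side by side.</p>

```python
def matvec(X, b):
    return [sum(x_j * b_j for x_j, b_j in zip(row, b)) for row in X]


def grad_beta(X, y, beta):
    """Gradient of L = 1/2 * sum (y - X beta)^2 with respect to beta."""
    resid = [yi - pi for yi, pi in zip(y, matvec(X, beta))]
    return [-sum(row[j] * r for row, r in zip(X, resid)) for j in range(len(beta))]


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # intercept column plus one feature
y = [1.0, 3.0, 5.0]                       # exactly y = 1 + 2x
step = 0.1                                # plays the role of lambda * gamma_i, held fixed

beta = [0.0, 0.0]          # beta_0 = 0
preds = [0.0, 0.0, 0.0]    # f_0(X) = 0
for _ in range(500):
    alpha = grad_beta(X, y, beta)                        # alpha_i = dL / d beta_{i-1}
    beta = [b - step * a for b, a in zip(beta, alpha)]   # update in parameter space
    g = matvec(X, alpha)                                 # g_i(X) = X alpha_i
    preds = [p - step * gi for p, gi in zip(preds, g)]   # update in function space

print(beta)                    # close to [1.0, 2.0]
print(preds, matvec(X, beta))  # the two sets of predictions agree
```

<p>Tracking $\beta$ or tracking the predictions directly gives the same model, which is the whole point of the comparison.</p>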
<h1 id="observations">Observations</h1>
<p>Since we’re modeling the gradient with an arbitrary (potentially black-box) learner, we don’t have the option to find the gradient with respect to the parameters, so the scale might not decrease as desired. To exemplify this, let’s consider an $L_1$ objective (Mean-Absolute-Error), and a black-box learner. The gradient at each point is either 1, -1, or <code class="highlighter-rouge">np.nan</code> (because the absolute value function is $f(x) = \pm x$ depending on $x$). The magnitude of the gradients will never change. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$ which adds scale to our gradient, but in trees we have no such thing. So in order to add scale, we tend to fit the line search and then add a learning rate to avoid over-fitting.</p>
<p>One can also sub-sample (as is a parameter in popular packages like <code class="highlighter-rouge">LightGBM</code>). Sub-sampling is the black-box model version of the familiar Stochastic Gradient Descent.</p>
<h1 id="summary">Summary</h1>
<p>I realize this might be obvious to some, but it was pretty cool when I first realized this. I hope you found something useful and/or interesting here.</p>
<div class="footnotes">
<ol>
<li id="fn:tree_parameters">
<p>Technically the split leaves in a tree define an indicator function on your data and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree, but this is kind of ridiculous because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). <a href="#fnref:tree_parameters" class="reversefootnote">↩</a></p>
</li>
<li id="fn:line_search">
<p>In practice we generally don’t include the line search and just have a decreasing $\lambda$ – and sometimes we don’t even do that. We can get away with these shortcuts because the magnitude of the gradients will decrease as you get closer to the optimum, and the derivative with respect to $\beta$ is always continuous. The same cannot be said about the gradient boosting algorithm. <a href="#fnref:line_search" class="reversefootnote">↩</a></p>
</li>
<li id="fn:updates">
<p>If we call our final linear model: $X\hat\beta$, then (since $\beta_0 = 0$) $X\hat\beta = X\left(-\sum\limits_{i=1}^N\lambda\gamma_i\alpha_i\right) = -\sum\limits_{i=1}^N\lambda\gamma_iX\alpha_i $. So we can see that linear regression has always been constructed as a sum. <a href="#fnref:updates" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Aaron NiskinGradient Boosting Seen as Gradient DescentA little geometric walk down risk model lane2019-03-28T00:00:00-07:002019-03-28T00:00:00-07:00http://aaron.niskin.org/blog/2019/03/28/risk_models<h1 id="prerequisites">Prerequisites</h1>
<h2 id="inner-products">inner products</h2>
<p>So the first thing to understand is that as long as we have an inner-product defined, we have a geometry. And in case you’re unfamiliar with the term inner-product, you can think of it like a slightly more general dot-product.</p>
<p>Furthermore, an inner product defines a norm and it defines angles. Specifically,</p>
<p><script type="math/tex">\begin{align*}
\left |\left| u\right|\right| = \sqrt{\langle u, u\rangle} \\\\
\cos(\theta) = \frac{\langle u, v\rangle}{||u||\cdot||v||}
\end{align*}</script></p>
<h2 id="standard-deviation">Standard deviation</h2>
<p>The next thing to understand is that standard deviation is a norm (of the de-meaned vector). It’s easily seen from the formula, where $||\cdot||$ is the norm coming from the usual dot product:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sqrt{\frac{1}{n-1}\sum(x_i - \mu)^2} =& \frac{1}{\sqrt{n-1}}||x - \mu||\\\\
=& \frac{1}{\sqrt{n-1}}\sqrt{\langle x-\mu, x-\mu\rangle}
\end{align*} %]]></script>
<p>So standard deviation is a norm. The two are equivalent. Sweet. Let’s move on.</p>
<h2 id="covariance-is-an-inner-product">Covariance is an inner product</h2>
<p>If we have a linear transformation $X:\mathbb{R}^m\to\mathbb{R}^n$ and a covariance matrix $V$ on $\mathbb{R}^n$, we have an inner product defined on the image. Specifically, if $u,v\in\mathbb{R}^n$ are two portfolios already mapped to risk space, the inner product is (the estimated covariance):</p>
<script type="math/tex; mode=display">\langle u, v\rangle := u'Vv</script>
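<p>As a quick sketch of what this buys us, the snippet below computes the norm and the angle that come out of $\langle u, v\rangle = u'Vv$. The covariance numbers are made up for illustration, as are the helper names.</p>

```python
import math


def inner(u, v, V):
    """Covariance inner product <u, v> = u' V v."""
    return sum(u[i] * V[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def norm(u, V):
    return math.sqrt(inner(u, u, V))


def cos_angle(u, v, V):
    return inner(u, v, V) / (norm(u, V) * norm(v, V))


# Hypothetical 2-factor covariance matrix: variances 0.04 and 0.09, covariance 0.01.
V = [[0.04, 0.01], [0.01, 0.09]]
u, v = [1.0, 0.0], [0.0, 1.0]
print(norm(u, V))          # ~0.2, the standard deviation of the first factor
print(cos_angle(u, v, V))  # ~0.1667, the correlation between the two factors
```

<p>In this geometry, lengths are standard deviations and cosines are correlations.</p>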
<h2 id="projections">Projections</h2>
<p>Let $u,v\in\mathbb{R}^n$. Given an inner product $\langle\cdot,\cdot\rangle$ (and the associated norm), the <strong>projection</strong> (sometimes called <strong>vector projection</strong>) of $u$ onto $v$ (denoted as $\text{proj}_vu$) is defined as:</p>
<script type="math/tex; mode=display">\text{proj}_vu := \frac{\langle u, v\rangle}{||v||} \frac{v}{||v||}</script>
<p>Another way to see the definition, given our definition of $\cos$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{proj}_vu =& \frac{\langle u, v\rangle}{||v||} \frac{v}{||v||} \\\\
=& \frac{||u||\langle u, v\rangle}{||u||\cdot||v||} \frac{v}{||v||} \\\\
=& ||u||\cos(\theta) \frac{v}{||v||}
\end{align*} %]]></script>
<p>Geometrically, this gives us the component of $u$ in the $v$ direction. See figure below.</p>
<p><img src="/assets/pics/2019/03/28_projection.png" alt="visual of a vector projection" title="Visual of a vector projection" /></p>
<p>If you’re wondering why I specified that it’s the “<strong>vector</strong> projection” rather than just “projection”, it’s because there is a notion of a <em>scalar projection</em> too: The <strong>scalar projection</strong> of $u$ onto $v$ is simply the coefficient part of the projection:</p>
<script type="math/tex; mode=display">\text{scalar-proj}_vu = \frac{\langle u, v\rangle}{||v||}</script>
<p>But since they’re so similar and we really only care about the latter, I’ll abuse notation here and generally use $\text{proj}_vu$ to mean “scalar projection” unless otherwise noted.</p>
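<p>A minimal sketch of both projections, using the plain dot product as the inner product (an assumption made to keep the numbers easy to check by hand; the function names are illustrative):</p>

```python
import math


def inner(u, v):
    """Plain dot product; swap in any other inner product to change the geometry."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(inner(u, u))


def scalar_proj(u, v):
    return inner(u, v) / norm(v)


def vector_proj(u, v):
    scale = scalar_proj(u, v) / norm(v)
    return [scale * component for component in v]


u, v = [3.0, 4.0], [2.0, 0.0]
print(scalar_proj(u, v))  # 3.0
print(vector_proj(u, v))  # [3.0, 0.0]
```

<p>Note that the length of $v$ drops out: only its direction matters.</p>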
<h2 id="risk">Risk</h2>
<p>When we talk about how <em>risky</em> a portfolio is, intuitively you probably understand that as “what’s the probability that I lose all of my money”, and you’re not wrong. That would be what we call <strong>downside risk</strong>. Unfortunately, that’s not so easy to compute (at least not on paper), and there isn’t much in the way of mathematical research built up on the notion of downside risk. Maybe in a later post, I’ll go over how downside risk relates to the Information Ratio, but not today. There is, however, a lot of research built up on <em>standard deviation</em>, and that’s somewhat related (“how irregular are your returns” is not too bad of a proxy). So we define a portfolio’s <strong>risk</strong> as the standard deviation of the portfolio’s returns.</p>
<h1 id="the-problem">The problem</h1>
<p>Now comes the wonderful task of predicting future risk. This involves estimating a covariance matrix of asset-level returns. But if you’re in the equities world (and even more-so if you’re in the quantitative equities world), the covariance matrix is very large (often on the order of 5,000 by 5,000) and somewhat non-stationary (meaning the data we have deteriorates in value over time). So these covariance matrices take quite a lot of data to fit, and they will inherently leave out newer stocks. Clearly, there is a place for a lower dimensional estimator of risk. We call such an estimator a <em>risk model</em>. There are several versions of a risk model, but this is meant to be a short post, so we’ll only go over the Barra style risk model…</p>
<h2 id="barra">Barra</h2>
<p>The Barra risk model takes the following form:</p>
<script type="math/tex; mode=display">(Xh)'V(Xh) + D\cdot h</script>
<p>Where $h \in \mathbb{R}^m$ is our <strong>holdings</strong> (of $m$ many assets) vector, $X \in \mathbb{R}^{n\times m}$ is called our <strong>factor exposure matrix</strong> (usually shortened to simply, “exposure matrix”), $V \in \mathbb{R}^{n\times n}$ is called our <strong>factor covariance matrix</strong>, and $D \in \mathbb{R}^{m}$ is our <strong>specific risk</strong> vector. The exposure matrix is computed based off of real-world things like the stock’s N-day momentum, the stock’s market cap, volatility (yet another word for standard deviation) of returns, etc.</p>
<p>If $n \ll m$ the covariance matrix can be computed with a lot more efficacy and the exposures are computed from external things like official SEC filings, etc. so as to reduce the amount of over-fitting here. So Barra generally regresses the exposures against returns (with some universe filtering and time weighting, etc.) to get “factor returns”. We can then compute the factor covariance matrix, and we’re done!</p>
<h2 id="the-geometric-view">The geometric view</h2>
<p>We can view this style of risk model in a geometric way for some added clarity. Let’s focus entirely on the first part ($(Xh)'V(Xh)$) for now, and furthermore let’s assume that we’ve fit the model already – meaning we’ve already computed $X$ and $V$. It’s clear to see that we’re using a lower dimensional embedding of our portfolios and computing the risk in the embedded space, then projecting back.
Formally, we have an inner product defined in $\mathbb{R}^n$ given by:</p>
<script type="math/tex; mode=display">\langle u, v\rangle = u'Vv</script>
<p>So our factor risk is the norm of our vector in this space. But now we want to see how seeing things the geometric way can benefit us.</p>
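<p>In code, the factor part of the model is just “embed, then take the norm”. The sketch below uses invented numbers ($n = 2$ factors, $m = 3$ assets) and a hypothetical helper name, and it deliberately leaves out the specific-risk term.</p>

```python
import math


def factor_risk(X, V, h):
    """sqrt((Xh)' V (Xh)): embed holdings into factor space, then take the norm there."""
    u = [sum(x * w for x, w in zip(row, h)) for row in X]  # u = Xh, one entry per factor
    variance = sum(u[i] * V[i][j] * u[j] for i in range(len(u)) for j in range(len(u)))
    return math.sqrt(variance)


# Hypothetical numbers: n = 2 factors (rows of X), m = 3 assets (columns of X).
X = [[1.0, 0.5, -0.5],
     [0.0, 1.0, 1.0]]
V = [[0.04, 0.01],
     [0.01, 0.09]]
h = [0.5, 0.25, 0.25]
print(factor_risk(X, V, h))  # norm of Xh = [0.5, 0.5] in factor space
```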
<h2 id="marginal-contribution-to-risk">Marginal contribution to risk</h2>
<p>One common task is to attribute some amount of the total risk to a particular factor. By that I mean that we have a portfolio and we would like to express the total risk of this portfolio in terms of the risk factors from our risk model. The theory suggests that we want the <strong>marginal contribution to risk</strong> ($MCAR$) which is defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
MCAR(i) :=& \frac{\partial \sigma}{\partial u_i}
\end{align*} %]]></script>
<p>It’s the partial derivative of the total risk with respect to our factor in question (denoted above as $u_i$; below we abuse notation and also write $u_i$ for the corresponding basis vector). And we can compute this somewhat easily:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
MCAR(i) :=& \frac{\partial \sigma}{\partial u_i} \\\\
=& \frac{\partial \sqrt{\langle u, u\rangle}}{\partial u_i} \text{ -- risk is the norm}\\\\
=& \frac{1}{2\sqrt{\langle u, u\rangle}}\cdot \frac{\partial \langle u, u\rangle}{\partial u_i} \text{ -- chain rule}\\\\
=& \frac{1}{2\sqrt{\langle u, u\rangle}}\cdot \frac{\partial u'Vu}{\partial u_i} \text{ -- definition of our inner product}\\\\
=& \frac{(u'V + u'V')u_i}{2\sqrt{\langle u, u\rangle}} \text{ -- computing the derivative}\\\\
=& \frac{u'Vu_i}{\sqrt{\langle u, u\rangle}} \text{ -- $V$ is symmetric}\\\\
=& \frac{\langle u, u_i\rangle}{\left|\left|u\right|\right|} \text{ -- Replacing with the geometric notion}\\\\
\end{align*} %]]></script>
<p>But man, is this not illuminating. Why would we even want the derivative to begin with? Are we really concerned with such an extremely local property here or are we trying to find some more global thing? It seems to me that what we want is more about explaining our current position rather than how our position could change with infinitesimal changes in other values. You may have noticed that the end formula above is the same as the projection formula from the <em>prerequisites</em> section. That’s no coincidence.</p>
<p>If we instead look at MCAR as trying to explain how much of our total portfolio risk can be explained by the portion of our portfolio in the direction of a chosen basis vector (factor), we end up with the same formula. In order to see how much the portion of $u$ in the direction of a particular basis explains the total risk, we compute the projection of $u_i$ onto $u$, which would look like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{proj}_uu_i =& \frac{\langle u, u_i\rangle}{\left|\left|u\right|\right|}
\end{align*} %]]></script>
<p>And it comes with all the geometric intuition we all know and love.</p>
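<p>For the skeptical, here is a small numerical sanity check that the derivative and the projection formula agree. It is a sketch with made-up numbers; $e_i$ denotes the $i$-th basis vector, and the function names are illustrative only.</p>

```python
import math

V = [[0.04, 0.01], [0.01, 0.09]]  # hypothetical factor covariance matrix
u = [1.0, 2.0]                    # hypothetical factor exposures of the portfolio


def risk(u):
    return math.sqrt(sum(u[i] * V[i][j] * u[j] for i in range(2) for j in range(2)))


def mcar_formula(u, i):
    """<u, e_i> / ||u||, where e_i is the i-th basis vector."""
    return sum(u[k] * V[k][i] for k in range(2)) / risk(u)


def mcar_numeric(u, i, eps=1e-6):
    """Central finite difference of the risk with respect to u_i."""
    up = [x + eps * (k == i) for k, x in enumerate(u)]
    down = [x - eps * (k == i) for k, x in enumerate(u)]
    return (risk(up) - risk(down)) / (2 * eps)


for i in range(2):
    print(mcar_formula(u, i), mcar_numeric(u, i))  # the two columns agree to several decimals

# The contributions u_i * MCAR(i) add back up to the total risk.
print(sum(u[i] * mcar_formula(u, i) for i in range(2)), risk(u))
```

<p>The last line is the usual risk decomposition: weighting each marginal contribution by its exposure recovers the total risk.</p>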
<h2 id="ufps">U.F.P.’s</h2>
<p>[edit]
See note below.
[end edit]</p>
<p>What the hell are they? The standard definition of a <strong>UFP</strong> is: the characteristic portfolio with unit exposure to a particular factor, and zero exposure to all the others. What does this mean? How are we guaranteed that such a thing always exists, and just, what? Well, yet again, our geometric interpretation can come to our rescue…</p>
<p>Let’s pick a particular factor $i$. Now let’s construct an ortho-normal basis (perhaps using Gram-Schmidt) in factor space starting with every other factor <em>but</em> $i$, leaving $i$ for last. That last basis vector would be, by construction, orthogonal to every other factor (hence zero exposure to them) and have some exposure to our chosen factor. We can now take that last basis vector, rescale it appropriately, and we have ourselves a sweet UFP.</p>
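<p>The Gram-Schmidt construction just described can be sketched as follows. The covariance matrix is invented, “orthogonal” here means orthogonal under the covariance inner product, and the final vector is left at unit norm rather than rescaled.</p>

```python
import math

V = [[0.04, 0.01, 0.00],
     [0.01, 0.09, 0.02],
     [0.00, 0.02, 0.16]]  # hypothetical, positive-definite factor covariance


def inner(u, v):
    return sum(u[a] * V[a][b] * v[b] for a in range(3) for b in range(3))


def gram_schmidt(vectors):
    """Orthonormalise `vectors` in order, with respect to the covariance inner product."""
    basis = []
    for vec in vectors:
        w = list(vec)
        for b in basis:
            coeff = inner(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        length = math.sqrt(inner(w, w))
        basis.append([wi / length for wi in w])
    return basis


e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one basis vector per factor
i = 0                                                    # the factor we care about
ordered = [e[k] for k in range(3) if k != i] + [e[i]]    # chosen factor goes last
last = gram_schmidt(ordered)[-1]
print([inner(last, e[k]) for k in range(3) if k != i])  # both ~0
print(inner(last, e[i]))                                # non-zero
```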
<p>[edit]</p>
<h3 id="note">NOTE</h3>
<p>The information above about <em>UFP</em> is actually incorrect. A <strong>UFP</strong> is actually just the returns of the momentum vector. If you’re more comfortable thinking in portfolio space, it’s the returns of a fictitious portfolio that, when projected to factor space, is one of our basis vectors. What I wrote above is a bit more complicated, but provides an interesting and potentially useful insight into factor performance. I personally couldn’t really care too much less about factors, but if that’s what you’re into, looking at both the “real” UFPs and what I was calling UFPs can give you a much more complete view of things.</p>
<p>Let me explain:</p>
<p>The risk of a “real” UFP has a portion explained by other factors. So if momentum is on a tear, it could just be a coalescence of other factors ripping it but the part that makes momentum unique is actually quite negative. One way to see that is to plot the performance of the “real” UFPs along with my UFPs. If you see the “real” UFP is positive but my UFP is negative (for a given factor), you can conclude we’re in a situation like I described above. And so on.
[end edit]</p>
<h2 id="specific-risk">Specific risk</h2>
<p>So how do we interpret specific risk (what we called $D$) in the geometric way? Can we simply add a column to our risk model for specific risk – even if it is mostly zeros? It turns out, no. For one, the specific risk is stock-specific (hence the name), but also, the thing we want to measure – variance – is not linear. So what is specific risk?</p>
<p>Well, one way to interpret it is as the norm of the projection of our vector in holdings space onto the kernel of our embedding. Let me explain what I mean there…</p>
<p>We have a surjective map $X:\mathbb{R}^m\to\mathbb{R}^n$ where $n\lt m$. Since the dimension of the domain is larger than the dimension of the range we know that several dimensions have to be in the kernel. In particular, since the map is surjective we know exactly (by rank-nullity) that the dimension of the kernel is $m-n$. So there is an $n$ dimensional substructure of our holdings space that matters – as far as factors are concerned – and $m-n$ dimensions that don’t. We project the parts that do matter into factor space and compute their magnitude there where we know the geometry, and compute the magnitude in ambient space of the part that doesn’t map to factor-space. It’s that latter piece that we call <strong>specific risk</strong>.</p>
<h2 id="summary">Summary</h2>
<p>So, in particular, if $h_1$ is a particular risk factor, we get the typical risk-decomposition formula we all know and love, but without messing with any of the unmotivated calculus. What’s nice is it also illuminates the place for non-linear manifold approximations: at the end of the day, they’re all just defining the geometry we’re going to use to assess our portfolio.</p>Aaron NiskinLet's talk about the geometry of risk modelsLet’s Cover The Fundamental Group (pt 1)2018-08-12T00:00:00-07:002018-08-12T00:00:00-07:00http://aaron.niskin.org/blog/2018/08/12/fundamental_group_pt1<h2 id="jumping-right-in">Jumping right in</h2>
<h3 id="prerequisites">Prerequisites:</h3>
<p>Firstly, this post builds off of my <a href="/blog/2018/08/05/homotopy_revisited.html">previous post</a>, so if you’re learning these things for the first time and find yourself kinda lost, start there.</p>
<p>In addition to the previous post, the following are mathematical definitions you should probably know before reading this post. If you don’t know any of these, trust google.</p>
<ul>
<li><strong>Group</strong></li>
<li><strong>Homomorphism</strong></li>
<li><strong>Kernel</strong></li>
<li><strong>Subgroup</strong></li>
<li><strong>Normal subgroup</strong></li>
<li><strong>Homotopy</strong></li>
<li><strong>Retraction</strong></li>
<li><strong>Star Convex</strong></li>
</ul>
<h3 id="moving-onward">Moving onward</h3>
<p><strong>Definition</strong>:</p>
<p>Given a path $\alpha$ from $x_0$ to $x_1$ in $X$, we define $\hat\alpha:\pi_1(X,x_0)\to\pi_1(X,x_1)$ as follows:</p>
<script type="math/tex; mode=display">\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]</script>
<p>And it’s a theorem that $\hat\alpha$ is a group isomorphism.</p>
<p><strong>Definition</strong>:</p>
<p>Given a continuous map $h:(X,x_0)\to(Y,y_0)$, we define:</p>
<p>$h_\star:\pi_1(X,x_0)\to\pi_1(Y,y_0)$ as follows:</p>
<p>$h_\star([f]) = [h\circ f]$.</p>
<p>And with that…</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-asubseteqmathbbrn-is-star-convex-then-a-is-simply-connected">Show that if $A\subseteq\mathbb{R}^n$ is star convex then $A$ is simply connected.</h3>
<div class="hint">
<p>Let $a\in A$ be one of the points that make the set star convex. Our task now is to show:</p>
<ol>
<li>For any $x,y\in A$, there is a path entirely within $A$ connecting $x$ and $y$, and</li>
<li>$\pi_1(A,a)$ is trivial.</li>
</ol>
<p>For the first: just let $x,y\in A$ and let $f_x,f_y$ be the paths connecting them to $a$. Then note that $f_x \star \bar f_y$ is a path connecting $x$ to $y$.</p>
<p>For the second we can use the same straight-line homotopy that we used for the convex version: let $f$ be a path starting and ending at $a$ and let $H(s,t) = t\cdot a + (1-t)\cdot f(s)$.</p>
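<p>To spell out why that $H$ does the job (a quick check, using only the definition of star convexity and the fact that $f(0) = f(1) = a$):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
H(s,0) =& f(s) \\
H(s,1) =& a \\
H(0,t) =& t\cdot a + (1-t)\cdot f(0) = a \\
H(1,t) =& t\cdot a + (1-t)\cdot f(1) = a
\end{align*} %]]></script>
<p>For each fixed $s$, the map $t\mapsto H(s,t)$ traces the straight segment from $f(s)\in A$ to $a$, which lies in $A$ because $A$ is star convex about $a$. So $H$ is a path homotopy inside $A$ from $f$ to the constant loop at $a$.</p>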
<p>Lastly, because part <em>a</em> (that I didn’t mention) is to find a set that’s star convex but not convex, I’ll leave the question with The Star of David.</p>
</div>
<h3 id="show-that-if-alphabeta-are-paths-from-x_0to-x_1to-x_2-all-in-x-and-gamma--alphastarbeta-that-hatgamma--hatbetacirchatalpha">Show that if $\alpha,\beta$ are paths from $x_0\to x_1\to x_2$ all in $X$ and $\gamma = \alpha\star\beta$ that $\hat\gamma = \hat\beta\circ\hat\alpha$.</h3>
<div class="hint">
<p>Since these are functions between equivalence classes, our job is now to show that the outputs for a given input are homotopic.</p>
<p>So let $f$ be a path in $X$ starting and stopping at $x_0$. Then:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\gamma([f]) =& [\bar\gamma]\star[f]\star[\gamma] \\
=& [\overline{\alpha\star\beta}]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta\star\bar\alpha]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta]\star[\bar\alpha]\star[f]\star[\alpha]\star[\beta]\\
=& [\bar\beta]\star([\bar\alpha]\star[f]\star[\alpha])\star[\beta]\\
=& [\bar\beta]\star(\hat\alpha([f]))\star[\beta]\\
=& \hat\beta(\hat\alpha([f]))
\end{align*} %]]></script></p>
</div>
<h3 id="show-that-if-x_0x_1-are-points-in-a-path-connected-space-x-pi_1xx_0-is-abelian-if-and-only-if-for-every-pair-of-paths-from-alphabeta-from-x_0-to-x_1-hatalpha--hatbeta">Show that if $x_0,x_1$ are points in a path-connected space $X$, $\pi_1(X,x_0)$ is abelian if and only if for every pair of paths from $\alpha,\beta$ from $x_0$ to $x_1$, $\hat\alpha = \hat\beta$.</h3>
<div class="hint">
<p>If $\pi_1(X,x_0)$ is abelian, we have:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\alpha([f]) =& [\bar\alpha]\star[f]\star[\alpha]\\
=& [\bar\alpha]\star[\bar{(\beta\star\bar\alpha)} \star(\beta\star\bar\alpha)\star f]\star[\alpha]\\
=& [\bar\alpha]\star[(\alpha\star\bar\beta)\star f\star(\beta\star\bar\alpha)]\star[\alpha]\\
=& [\bar\alpha]\star[\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha]\star[\alpha]\\
=& [\bar\alpha\star\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha\star\alpha]\\
=&[\bar\beta]\star[f]\star[\beta] \\
=&\hat\beta([f])
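\end{align*} %]]></script></p>
<p>(Commutativity is what let us slide $[f]$ past $[\beta\star\bar\alpha]$ between the second and third lines.)</p>
<p>Conversely, suppose $\hat\alpha = \hat\beta$ for every such pair and let $f,g$ be loops at $x_0$. Pick any path $\alpha$ from $x_0$ to $x_1$ and set $\beta = g\star\alpha$, which is another path from $x_0$ to $x_1$. Then $\hat\beta([f]) = [\bar\alpha]\star[\bar g]\star[f]\star[g]\star[\alpha]$, and setting that equal to $\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]$ gives $[\bar g]\star[f]\star[g] = [f]$, i.e. $[f]\star[g] = [g]\star[f]$.</p>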
</div>
<h3 id="let-asubset-x-and-let-rxto-a-be-a-retraction-show-that-for-a_0in-a-r_starpi_1xa_0topi_1aa_0-is-surjective">Let $A\subset X$ and let $r:X\to A$ be a retraction. Show that for $a_0\in A$, $r_\star:\pi_1(X,a_0)\to\pi_1(A,a_0)$ is surjective.</h3>
<div class="hint">
<p>Well, any loop $\alpha$ in $A$ starting and stopping at $a_0$ will also be a loop in $X$ (because $A\subset X$). Since a retraction restricts to the identity on $A$, we have $r\circ\alpha = \alpha$, so $r_\star([\alpha]_X) = [r\circ\alpha]_A = [\alpha]_A$. Every class in $\pi_1(A,a_0)$ gets hit this way, so $r_\star$ is surjective.</p>
</div>
<h3 id="let-asubsetmathbbrn-and-haa_0toyy_0-show-that-if-h-is-extendable-to-a-continuous-map-tilde-hmathbbrnto-y-then-h_star-is-trivial-ie-sends-everybody-to-the-class-of-the-constant-loop">Let $A\subset\mathbb{R}^n$ and $h:(A,a_0)\to(Y,y_0)$. Show that if $h$ is extendable to a continuous map $\tilde h:\mathbb{R}^n\to Y$, then $h_\star$ is trivial (i.e. sends everybody to the class of the constant loop).</h3>
<div class="hint">
<p>Let $G=\pi_1(A,a_0), H=\pi_1(Y,y_0)$. Then, as a reminder, $h_\star:G\to H$ such that $h_\star([\alpha]) = [h\circ\alpha]$.</p>
<p>So let $\alpha,\beta$ be loops in $A$ based at $a_0$. Since $\alpha,\beta$ are arbitrary, it is sufficient to show that $h\circ\alpha$ is path homotopic to $h\circ\beta$, i.e. that $h_\star([\alpha]) = h_\star([\beta])$.</p>
<p>Consider $F:I\times I\to \mathbb{R}^n$ given by $F(s,t) = t\alpha(s) + (1-t)\beta(s)$ (the straight line homotopy between the two loops). It fixes the end points, since $F(0,t) = F(1,t) = a_0$. And since $h$ is extendable to $\tilde h:\mathbb{R}^n\to Y$, we know that even if $F$ leaves $A$ (into some other part of $\mathbb{R}^n$), $\tilde h\circ F$ is a path homotopy between $h\circ\alpha$ and $h\circ\beta$ that stays entirely in $Y$. Hence $[h\circ \alpha] = [h\circ \beta]$, and taking $\beta$ to be the constant loop shows that $h_\star$ sends everybody to the class of the constant loop.</p>
</div>
<!--
### Let $X$ be path connected, $h:X\to Y$ be continuous with $h(x_0) = y_0$ and $h(x_1)=y_1$; let $\alpha$ be a path in $X$ from $x_0$ to $x_1$, and $\beta = h\circ\alpha$. Show that
$$\hat\beta\circ(h_{x_0})_\star = (h_{x_1})_\star\circ\hat\alpha$$
As in, show that the following diagram commutes:
$$
\newcommand{\ra}[1]{\xrightarrow{\quad#1\quad}}
\newcommand{\da}[1]{\left\downarrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
\newcommand{\sea}[1]{\left\searrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
$$
$$
\begin{array}{ccc}
\pi_1(X,x_0) & \ra{(h_{x_0})_\star} & \pi_1(Y,y_0) \\
\da{\hat\alpha} & & \da{\hat\beta} \\
\pi_1(X,x_1) & \ra{(h_{x_1})_\star} & \pi_1(Y,y_1) \\
\end{array}
$$
<div class="hint" markdown="1">
Let
$$\begin{align*}
f=&\hat\beta\circ (h_{x_0})_\star & \text{and}& & g=&(h_{x_1})_\star\circ\hat\alpha
\end{align*}$$
The claim is now that
$$\forall a \in \pi_1(X,x_0), f(a) = g(a)$$
_Proof_:
Let $b_0 \in f(a) = \hat\beta\circ(h_{x_0})_\star (c)$. That means $b$ is a particular path in $(Y,y_1)$ such that $b_0 = \bar\beta\star(h(c))\star\beta$, whereas $g(a) \ni b_1 = h(\bar\alpha\star d \star\alpha)$ with $c,d$ homotopic.
Let $\alpha$ be a path in $(X,x_0)$ and $\beta = h\circ\alpha$ -- hence a path in $(Y,y_1)$.
</div>
-->Aaron NiskinMunkres saves the day and explains the fundamental secrets behind the most invariant of groups!Let’s Cover Homotopy (an apologetic revisionary tale)2018-08-05T00:00:00-07:002018-08-05T00:00:00-07:00http://aaron.niskin.org/blog/2018/08/05/homotopy_revisited<h2 id="what-am-i-doing">What am I doing?</h2>
<h3 id="the-structure-of-these-posts">The structure of these posts</h3>
<p>This post is not really building off of the theory from my previous post (<a href="/blog/2018/04/12/homology.html">homology pt 1</a>), but it’s about the same subject. It turns out we were using a book I didn’t really like too much, so I switched to something with a bit more of a familiar style for me. So I’ll be using <a href="https://www.amazon.com/Topology-Classic-Classics-Advanced-Mathematics/dp/0134689518/ref=sr_1_2?ie=UTF8&qid=1533510847&sr=8-2&keywords=topology+munkres">Topology by Munkres</a> for these posts (at least for now).</p>
<p>I’ll not be covering the book in much detail because after all, Munkres did a great job at doing that himself. But I will summarize some things that I feel like summarizing, and do some of the exercises (just the ones I feel like doing because, well, I can – I know, this is sounding more useful by the second).</p>
<p>So let’s dig in!</p>
<h1 id="part-ii">Part II</h1>
<p>But where did part I go? That’s not Algebraic Topology, so we’re not gonna cover that part (at least not until I feel like I need a review).</p>
<h2 id="chapter-9-the-fundamental-group">Chapter 9: The Fundamental Group</h2>
<p>Basically, we say $f_0,f_1:X\to Y$ are <strong>homotopic</strong> if there is a continuous function $F:X\times I\to Y$ (where $I=[0,1]$) with the following two properties:</p>
<ol>
<li>$\forall x\in X, F(x,0) = f_0(x)$</li>
<li>$\forall x\in X, F(x,1) = f_1(x)$</li>
</ol>
<p>Basically, at one end of the interval $I$ $F$ is $f_0$ and on the other end $F$ is $f_1$. Since the whole map is continuous, this must be a continuous deformation. And there’s a bunch of (helpful) book-keeping to prove that homotopy defines an equivalence relation, etc. but we won’t be covering that right now. But before we move on, let’s talk about path homotopy…</p>
<p>We say the paths $f_0,f_1:I\to Y$ are <strong>path homotopic</strong> if they have the same start and end points ($x_0$ and $x_1$) and there is a homotopy $F$ from $f_0$ to $f_1$ with the following two additional requirements:</p>
<ol>
<li>$\forall t\in I, F(0,t) = x_0$</li>
<li>$\forall t\in I, F(1,t) = x_1$</li>
</ol>
<p>One thing to note: if you consider a torus (because who doesn’t love donuts), any two paths on it will be homotopic, but two paths with the same end points are not necessarily path homotopic because they might go around different sides of the donut hole, for instance. The requirement that the end-points are fixed is no trivial thing.</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-hhxto-y-are-homotopic-and-kkyto-z-are-also-homotopic-then-kcirc-h-and-kcirc-h-are-homotopic">Show that if $h,h':X\to Y$ are homotopic and $k,k':Y\to Z$ are also homotopic, then $k\circ h$ and $k'\circ h'$ are homotopic.</h3>
<div class="hint">
<p>Let $H,K$ be homotopies from $h$ to $h'$ and $k$ to $k'$ respectively. Then consider the map $G:X\times I\to Z$ defined by $G(x,t) = K(H(x,t),t)$.</p>
<p><em>Claim</em>: $G$ is a homotopy between $k\circ h$ and $k'\circ h'$.</p>
<p><em>Proof</em>: $G$ is continuous because it is a composition of continuous maps ($K$ after $(x,t)\mapsto(H(x,t),t)$), so we only have to check the two ends: $G(x,0) = K(h(x),0) = k(h(x))$ and $G(x,1) = K(h'(x),1) = k'(h'(x))$. Both follow directly from the definition of $G$.</p>
</div>
<h3 id="given-spaces-x-and-y-let-xy-be-the-set-of-homotopy-classes-of-maps-from-x-to-y">Given spaces $X$ and $Y$, let $[X,Y]$ be the set of homotopy classes of maps from $X$ to $Y$.</h3>
<p>a) Show that for any $X$, $[X,I]$ is a singleton.</p>
<div class="hint">
<p>Let $f,g:X\to I$ be continuous maps. Then since $\forall x \in X$, $f(x),g(x)\in[0,1]$, we know that the following map $H:X\times I\to I$ will always be well defined: $H(x,t) = t\cdot f(x) + (1-t)\cdot g(x)$. $H$ is a homotopy.</p>
</div>
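<p>If you want to poke at the straight-line homotopy numerically, here is a minimal sketch. The two maps <code>f</code> and <code>g</code> are made-up examples (with $X=\mathbb{R}$), not anything from Munkres; the point is only that the ends of $H$ are $g$ and $f$ and that a convex combination of two points of $[0,1]$ never leaves $[0,1]$.</p>

```python
import math

def straight_line_homotopy(f, g):
    """Return H with H(x, 0) = g(x) and H(x, 1) = f(x)."""
    def H(x, t):
        return t * f(x) + (1 - t) * g(x)
    return H

# Two hypothetical continuous maps from the real line into [0, 1].
f = lambda x: 0.5 * (1 + math.sin(x))
g = lambda x: 1 / (1 + x * x)

H = straight_line_homotopy(f, g)
xs = [k / 10 for k in range(-50, 51)]
ts = [k / 20 for k in range(21)]

# The two ends of the deformation agree with g and f.
assert all(abs(H(x, 0) - g(x)) < 1e-12 for x in xs)
assert all(abs(H(x, 1) - f(x)) < 1e-12 for x in xs)
# A convex combination of two points of [0, 1] stays in [0, 1]
# (the tiny tolerance only guards against floating point rounding).
assert all(-1e-12 <= H(x, t) <= 1 + 1e-12 for x in xs for t in ts)
```

<p>A grid check like this proves nothing, of course, but it is a cheap way to catch a flipped $t$ and $1-t$.</p>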
<p>b) Show that if $Y$ is path connected, the set $[I,Y]$ is a singleton.</p>
<div class="hint">
<p>Let $Y$ be path connected and let $f:I\to Y$ be continuous. Pick a point $y_0\in Y$ and define $g:I\to Y$ such that $\forall t\in I$, $g(t)=y_0$. We will prove that $f$ is homotopic to $g$ (and hence all maps are), and conclude that $[I,Y]$ is a singleton. We’ll prove this in two steps: first we’ll show that $f$ is homotopic to the constant map that maps all of $I$ to one of the end-points of $f$; then we’ll show that the aforementioned constant map is homotopic to $g$.</p>
<p>To prove the first part, let $h:I\to Y$ be such that $\forall t\in I$, $h(t)=f(0)$. Let $H:I\times I\to Y$ be such that $H(s,t) = f(s\cdot (1-t))$. Then clearly $H$ is a homotopy between $f$ and $h$.</p>
<p>To prove the second part, since $Y$ is path connected, let $p:I\to Y$ be a path from $f(0)$ to $y_0$. Then let $G:I\times I\to Y$ be such that $G(s,t) = p(t)$. Aaaannnnndddd, that’s our homotopy between the two constant maps.</p>
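<p>So our desired homotopy from $f$ to $g$ comes from running $H$ and then $G$ (homotopies don’t compose as functions, they get glued end to end): $F(s,t) = H(s,2t)$ for $t\leq\frac12$ and $F(s,t) = G(s,2t-1)$ for $t\geq\frac12$. The two pieces agree at $t=\frac12$ because $H(s,1) = f(0) = G(s,0)$.</p>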
</div>
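<p>Here is a small sketch of the two steps glued end to end, taking $Y$ to be the plane. The path <code>f</code>, the base point <code>y0</code> and the straight path <code>p</code> are all hypothetical choices made for illustration; any path connected $Y$ would supply its own $p$.</p>

```python
import math

def f(s):
    """A hypothetical path in the plane (a quarter circle)."""
    return (math.cos(s * math.pi / 2), math.sin(s * math.pi / 2))

y0 = (2.0, 2.0)  # hypothetical point that the constant map g sends everything to

def p(t):
    """A path from f(0) to y0 (a straight segment, since we are in the plane)."""
    a = f(0)
    return ((1 - t) * a[0] + t * y0[0], (1 - t) * a[1] + t * y0[1])

def H(s, t):
    return f(s * (1 - t))   # shrinks f onto its start point f(0)

def G(s, t):
    return p(t)             # slides the constant map at f(0) over to y0

def F(s, t):
    """Run H during the first half of the time interval and G during the second."""
    return H(s, 2 * t) if t <= 0.5 else G(s, 2 * t - 1)

grid = [k / 10 for k in range(11)]
assert all(F(s, 0.0) == f(s) for s in grid)    # starts at f
assert all(F(s, 1.0) == y0 for s in grid)      # ends at the constant map g
assert all(H(s, 1.0) == G(s, 0.0) for s in grid)  # the two halves match up
```

<p>The last assertion is the whole reason the gluing is continuous: $H$ finishes exactly where $G$ starts.</p>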
<h3 id="a-space-x-is-contractible-if-the-identity-map-i_xxto-x-is-homotopic-to-a-constant-map-nullhomotopic">A space $X$ is <strong>contractible</strong> if the identity map $i_X:X\to X$ is homotopic to a constant map (<strong>nullhomotopic</strong>).</h3>
<p>a) Show that $I,\mathbb{R}$ are contractible.</p>
<div class="hint">
<p>To show that $I$ is contractible, we can just consider it a corollary of the first part of the previous answer. So we’ll prove the second part (that $\mathbb{R}$ is contractible).</p>
<p>Let $i:\mathbb{R}\to\mathbb{R}$ be the identity and let $x_0\in\mathbb{R}$, and $f:\mathbb{R}\to\mathbb{R}$ with $f(x) = x_0$. Our job is to show that there is a homotopy $F:\mathbb{R}\times I\to\mathbb{R}$ between $f$ and $i$.</p>
<p>Consider the map $F(x,t)= (1-t)\cdot x_0 + t\cdot x$. It is continuous, $F(x,0) = x_0 = f(x)$, and $F(x,1) = x = i(x)$. That is our homotopy.</p>
</div>
<p>b) Show that a contractible space is path connected.</p>
<div class="hint">
<p>Let $Y$ be contractible and let $y_0,y_1$ be points. Then let $F:Y\times I\to Y$ be a homotopy contracting $i_Y$ to some constant map $f(y) = c$, so $F(y,0) = y$ and $F(y,1) = c$. For $j\in\{0,1\}$ let $p_j(t) = F(y_j,t)$; then $p_j$ is a path from $y_j$ to $c$. So $p_0\star\bar p_1$ is a path connecting $y_0$ and $y_1$. We’re done here.</p>
</div>
<p>c) Show that if $Y$ is contractible, then for any $X$ the set $[X,Y]$ is a singleton.</p>
<div class="hint">
<p>The proof is very similar to the last two, so I’m not gonna do it. Ah, the pleasures of self-learning. But basically, contractions have inverses (because they’re homotopies too!).</p>
</div>
<p>d) Show that if $X$ is contractible and $Y$ is path connected then $[X,Y]$ is a singleton.</p>
<div class="hint">
<p>The proof that $[I,Y]$ is a singleton is general enough to accommodate this (mutatis mutandis, obvs). Laziness ENGAGE!</p>
</div>
<h2 id="summary">Summary</h2>
<p>We’ve done so many exercises I don’t even feel like I need to work out anymore! (These jokes aren’t really free if you consider the toll they take on you).</p>Aaron NiskinMunkres will surely save us our trespasses this once, but NEVER AGAIN!Let’s talk Homotopy and Algebra2018-06-29T00:00:00-07:002018-06-29T00:00:00-07:00http://aaron.niskin.org/blog/2018/06/29/homotopy<p>So this post is going to be a bit more terse than most. In fact, the objective of this post will be to develop the theory of Homotopy very briefly, with the goal of proving the Fundamental Theorem of Algebra.</p>
<h1 id="homotopy">Homotopy</h1>
<p>Given two continuous maps $f,g : \mathbb{R}^n\to\mathbb{R}^m$, a homotopy between $f$ and $g$ is a continuous map $h: \mathbb{I}\times\mathbb{R}^n\to\mathbb{R}^m$ (where $\mathbb{I}=[0,1]$) such that $h(0,x)=f(x)$ and $h(1,x) = g(x)$. Basically it’s a continuous deformation of one map into the other. It’s a different sort of thing from homeomorphism of topological spaces. A rough picture: two interlocked rings are homeomorphic to two non-interlocked rings, but you can’t continuously deform one arrangement into the other without passing one ring through the other (you’d have to split one of the rings).</p>
<p>Enough of all that! Let’s get to the proof already!</p>
<h1 id="the-fundamental-theorem-of-algebra">The Fundamental Theorem of Algebra</h1>
<h3 id="the-statement">The statement</h3>
<p><em>If $p\in\mathbb{C}[x]$ is a non-constant polynomial, then it has at least one root in $\mathbb{C}$.</em></p>
<p>A consequence of this theorem is that any non-constant polynomial over $\mathbb{C}$ has all of its roots in $\mathbb{C}$. Or equivalently, any polynomial of degree $n\geq 1$ over $\mathbb{C}$ has $n$ roots in $\mathbb{C}$, counted with multiplicity.</p>
<h3 id="the-proof">The proof</h3>
<p>Let $p(x) = x^n + \sum\limits_{i=0}^{n-1} a_i x^i \in\mathbb{C}[x]$ be a polynomial without any roots in $\mathbb{C}$. For simplicity, we have assumed that the leading coefficient is 1. There is no loss of generality in doing so (just divide through by it). For each real $r\geq 0$ we then define a function</p>
<script type="math/tex; mode=display">f_r(s) = \frac{ p(re^{2\pi is}) / p(r)}{ |p(re^{2\pi is}) / p(r)| }</script>
<p>Since $p$ has no roots, nothing here divides by zero, so for each $r\geq 0$, $f_r:I\to S^1$ is a loop in $S^1$ starting and stopping at $1$. Furthermore, given $r_0,r_1\geq 0$, we can define a homotopy from $f_{r_0}$ to $f_{r_1}$ as $g(t,s) = f_{r_0+t(r_1-r_0)}(s)$. So for all $r\geq 0$, $f_r$ is homotopic to $f_0$. Since $f_0(s) = 1$ for all $s\in I$, we have that the $f_r$ are all in the same homotopy class as the constant loop on $S^1$.</p>
<p>We now pick $z,r$ such that $|z| = r \gt \max(1,\sum |a_i| )$. We then note that the triangle inequality (in the last step) gives us the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
|z|^n =& |z|\cdot |z|^{n-1}\\
>& \left(\sum|a_i|\right)|z|^{n-1}\\
=& \sum|a_i||z|^{n-1}\\
\geq& \sum|a_i||z|^{i}\\
=& \sum|a_iz^i|\\
\geq& \left|\sum a_iz^i\right|
\end{align} %]]></script>
<p>Which means $z^n + t\sum a_i z^i \neq 0$ whenever $|z| = r$ and $t\in[0,1]$. So, we can define yet another function $q(t,z) = z^n +t(\sum\limits_{i=0}^{n-1}a_iz^i)$. Then $q$ is a homotopy between $z^n$ (at $t=0$) and $p(z)$ (at $t=1$), and none of the polynomials along the way has a root on the circle $|z|=r$. We now note that if we make a similar construction to $f_r$ but with this parameterized polynomial instead of $p$, we get $f_r$ itself when $t=1$ and $s\mapsto e^{2\pi i n s}$ when $t=0$; that is, a homotopy between $f_r$ and just going around the circle $n$ times. But since $f_r$ is homotopic to a point on $S^1$, and going around the circle any non-zero number of times is not homotopic to a point, we know that $n=0$.</p>
<p>So the only polynomials without roots in $\mathbb{C}$ are constant.</p>
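<p>To see the mechanism from the other side, here is a rough numerical sketch using NumPy. It counts how many times the loop $s\mapsto p(re^{2\pi is})$ goes around $0$ for a polynomial that <em>does</em> have roots. The polynomial and the helper name <code>winding_number</code> are my own choices for illustration. The loop is trivial for small $r$ and winds $n$ times for large $r$, so it has to break somewhere in between, and it breaks exactly when the circle runs into a root.</p>

```python
import numpy as np

def winding_number(coeffs, r, samples=20000):
    """How many times s -> p(r e^{2 pi i s}) winds around 0 for s in [0, 1].

    coeffs lists the coefficients of p from the highest degree down,
    which is the order numpy.polyval expects.
    """
    s = np.linspace(0.0, 1.0, samples)
    values = np.polyval(coeffs, r * np.exp(2j * np.pi * s))
    angle = np.unwrap(np.angle(values))   # continuous version of the argument
    return int(round((angle[-1] - angle[0]) / (2 * np.pi)))

# Hypothetical example: p(z) = (z - 1)(z - 2)(z + 3) = z^3 - 7z + 6,
# whose roots have absolute values 1, 2 and 3.
p = [1, 0, -7, 6]

for r in (0.5, 1.5, 2.5, 14.0):
    print(r, winding_number(p, r))
# The count goes 0, 1, 2, 3: it jumps each time the circle crosses a root,
# and for r > max(1, sum |a_i|) = 13 it equals the degree.
```

<p>A rootless polynomial would have nowhere to make those jumps, which is exactly why the proof forces $n=0$.</p>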
<p>Kinda cool, huh?</p>
<p>This proof was courtesy of <em>Algebraic Topology</em> by Allen Hatcher.</p>Aaron NiskinJust a little dip into homotopy via the Fundamental Theorem of AlgebraLet’s Cover Homology (pt 1)2018-04-12T00:00:00-07:002018-04-12T00:00:00-07:00http://aaron.niskin.org/blog/2018/04/12/homology<h1 id="topology">Topology</h1>
<h2 id="what-is-it">What is it?</h2>
<p>Topology is essentially the study of proximity, which is a close analogue for shape. The specifics of how that’s done are not really the point of this post. That’s all for this section.</p>
<h2 id="how-is-it-related-to-analyzing-data">How is it related to analyzing data?</h2>
<p>Most data analysis techniques are already about quantifying some geometric attributes. For instance, when you fit a linear model, you’re really just imposing an affine model on the data, and computing the appropriate parameters (slope and intercept). But these sorts of things work best if you have an idea for what an appropriate shape would be. That’s one area where topology might be able to help. But to hear more about that, you’ll have to wait for my next post.</p>
<p>There is a notion of equivalence of topological spaces (usually called homeomorphism; I’ll keep saying isomorphism) given by bi-continuous bijections. Basically, we consider two topological spaces to be equivalent if one can be continuously transformed into the other and back. That’s actually a very powerful description that captures a lot of non-linearities and things like that. For instance, an egg shape, a sphere, a pyramid, and a tetrahedron are all isomorphic topological spaces, but not a doughnut (torus). And those kinda make sense. But there are some counter-intuitive examples. For instance, if you were to take one of those long clown balloons and tie it up into a poodle, it would also be isomorphic to the bunch above. And, as the joke goes, donuts are isomorphic to coffee cups. I know what you’re thinking. This topology stuff sounds super cool.</p>
<p>So topology would make a wonderful (admittedly incomplete) description of data. But there are some problems: mainly that this whole story breaks down in higher dimensions. Luckily, there is an analogue that can be computed.</p>
<h1 id="homology">Homology</h1>
<h2 id="what-is-it-1">What is it?</h2>
<p>Homology is (in an overly-simplified sentence) counting the number of holes. That’s actually a lot more information than you might initially think. It’s actually sufficient to completely classify the space in lower dimensions, and a good approximation in higher ones. If we knew that info about our data, we’d capture a lot of the non-linear relationships. Either way, it’ll be fun, so let’s get into it.</p>
<h1 id="how-to-computes">How to computes…</h1>
<h2 id="triangulation">Triangulation</h2>
<h3 id="simplices">Simplices</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Simplex">simplex</a> is a collection of $n$ points with the property that no one point lies in the affine subspace (point, line, plane, and so on) spanned by the other $n-1$ points. The idea is that it’s the minimal number of points with an $n-1$ dimensional convex hull – or, said differently, the minimal number of points to make an $n-1$ dimensional shape. For instance, if we consider $\mathbb{R}^2$ as our ambient space, then the only simplices that exist are 0, 1, or 2 simplices: the 0 simplices being points, 1 simplices being line segments and 2 simplices being triangles (tetrahedra are 3 simplices, and need $\mathbb{R}^3$). If we look at these simplices like building blocks, we can now make something isomorphic to any shape just by glueing a bunch of these simplices together. But we’ll get to that in a bit.</p>
<p>Let $\{v_0,v_1,v_2\}$ be a simplex. Firstly, since it’s a simplex with 3 points, it must be a 2-simplex. And since it’s a simplex, we know that $v_0-v_1\notin span(\{v_2-v_1\})$ (that is, $v_0$ is not on the line through $v_1$ and $v_2$) and the same relationship holds for all the rest of the points. This simplex is supposed to represent any 2 dimensional contiguous ball-like thing. For instance, since a filled-in circle can be continuously transformed into a filled-in triangle (which is, in essence what a 2-simplex is), our simplex becomes a proxy for such circles, and squares, and rhombi, and octagons, but not the Apple logo. But if you take that stem thing and throw it away so you’re only left with the main apple part of the logo, then yeah, that too.</p>
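<p>The “no point lies in the span of the others” condition is easy to test numerically: subtract one point from all the rest and check that the differences are linearly independent. A minimal sketch with NumPy follows; the function name <code>is_simplex</code> and the sample points are mine.</p>

```python
import numpy as np

def is_simplex(points):
    """True if the given points are affinely independent."""
    pts = np.asarray(points, dtype=float)
    if len(pts) == 1:
        return True                      # a single point is a 0-simplex
    diffs = pts[1:] - pts[0]             # v_1 - v_0, v_2 - v_0, ...
    return bool(np.linalg.matrix_rank(diffs) == len(pts) - 1)

print(is_simplex([(0, 0), (1, 0), (0, 1)]))          # a genuine triangle
print(is_simplex([(0, 0), (1, 1), (2, 2)]))          # three collinear points
print(is_simplex([(0, 0), (1, 0), (0, 1), (1, 1)]))  # four points in the plane
```

<p>The first call reports a simplex and the other two do not, which matches the claim that the plane has room for at most 2-simplices.</p>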
<p>So now if we have any single contiguous $N$-D object without any holes, we can represent it topologically by an $N$-simplex (a triangle when $N=2$). So simplices make the building blocks for general shapes, as we’ll see in a bit.</p>
<h3 id="simplicial-complexes">Simplicial Complexes</h3>
<p>So let’s build a simplicial complex! But wait! I haven’t even defined what that is yet! Meh, let’s just roll with it. So given a bunch of things we can represent by simplices (like the full apple logo, for instance), we can construct a simplicial complex to represent it. Some more examples would be: an air-filled balloon (which we’ll consider to be hollow) on a string, a filled water balloon (which we’ll not consider to be hollow) on a string, or a doughnut. The list goes on, obviously, but I think you get the point. We can represent pretty much anything this way.</p>
<p>The specifics really only come in when we try to construct the simplicial complex itself. In order to make the thing have the nice characteristics we’ll use later on, we’ll need to make sure our complexes have some properties. But first, we need to know what the face of a simplex is:</p>
<p>Given a simplex $\{v_0,v_1,\dots,v_n\}$, a face is the simplex defined by a nonempty subset (e.g. $\{v_1,v_2,\dots,v_n\}$ is a face, and so is the original simplex). If you think of a tetrahedron, for instance, the faces are the four triangular faces we’re used to calling faces as well as the six lines on the edges, the four individual points, and the tetrahedron itself. So the faces are kinda similar to our intuitive notion of a face. Now we can define a simplicial complex. A simplicial complex $K$ is a collection of simplices with the following properties (see <a href="http://mathworld.wolfram.com/SimplicialComplex.html">mathworld</a>):</p>
<ol>
<li>Each face of a simplex in $K$ is itself a simplex in $K$</li>
<li>The intersection of two simplices in $K$ is either empty or a face of both of them (hence in $K$)</li>
</ol>
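<p>If we only keep track of vertex labels (an assumption: this forgets the geometry and treats a simplex as a set of labels), the first property is easy to play with in code. The sketch below builds all faces of a simplex and the face-closure of a list of simplices; the helper names are mine.</p>

```python
from itertools import combinations

def faces(simplex):
    """All nonempty subsets of the vertex set, each as a frozenset."""
    verts = sorted(simplex)
    return {frozenset(c)
            for k in range(1, len(verts) + 1)
            for c in combinations(verts, k)}

def closure(simplices):
    """Smallest collection containing the given simplices and all their faces."""
    out = set()
    for s in simplices:
        out |= faces(s)
    return out

def is_face_closed(collection):
    """Property 1: every face of every simplex is in the collection."""
    coll = {frozenset(s) for s in collection}
    return all(faces(s) <= coll for s in coll)

tetra = closure([{0, 1, 2, 3}])
# 4 points + 6 edges + 4 triangles + the tetrahedron itself
assert len(tetra) == 15
assert is_face_closed(tetra)

# Dropping a vertex from the collection breaks property 1.
assert not is_face_closed(tetra - {frozenset({0})})
```

<p>In this label-only setting the second property comes for free once the first holds, because the common vertices of two simplices form a face of each. The geometric version has more to check, which is where the trouble with associations comes from later.</p>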
<p>It’s the second rule that’s really the crux of everything, as we’ll see. But let’s go through some examples…</p>
<p>The Apple logo can be triangulated by two disjoint 2-simplices (along with all the appropriate faces of the two). So we basically forced the first property to hold, and since the two simplices are disjoint the second property trivially holds.</p>
<p>The filled water-balloon on a string can be triangulated by a 3-simplex attached to a 1-simplex along with all the appropriate faces. The filled water-balloon part is clearly the 3-simplex, and the string is the 1-simplex. So let’s prove that the two properties hold. The first one holds trivially, since we defined it as the union of all the faces. The second property holds because any two faces of the 3-simplex meet in a shared face (or not at all), and anything from the balloon meets anything from the string either at the point where they’re attached (a 0-simplex) or not at all! (Pardon the hand waving proof there).</p>
<p>The (hollow) air-filled balloon, on the other hand, is a bit harder to triangulate, but we’ll barrel through it. We can glue four 2-simplices together to be like a hollowed out 3-simplex, then attach a 1-simplex as the string to one of the vertices of the hollow tetrahedron, and we’ll be done. All that’s left is to show the two properties hold. That’s left as an exercise to the reader (ah, the good life).</p>
<p>The doughnut, on the other hand, is a lot more complicated to triangulate. To do that we’d have to specify if we’re talking about the shell of a doughnut or a filled in one. Although this is a fun exercise, the traditional way to triangulate the torus doesn’t have much to do with what we’re going to be using this stuff for, so I’ll take a more data-science type approach to this. If we have a bunch of data arranged in the shape of a torus, we can just take the union of the simplices defined by taking a point and the 3 nearest points to it, and creating the convex hull of those points (or considering them as a 3-simplex).</p>
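<p>Here is a rough sketch of that data-science flavoured construction with NumPy: sample points on a torus, then form one candidate 3-simplex per point from the point and its 3 nearest neighbours. The radii, the sample size and the function names are all assumptions for illustration, and nothing here checks that the resulting pile of simplices satisfies the two simplicial complex properties.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_torus(n, R=2.0, r=0.5):
    """n random points on a torus with major radius R and minor radius r."""
    u = rng.uniform(0, 2 * np.pi, n)
    v = rng.uniform(0, 2 * np.pi, n)
    x = (R + r * np.cos(v)) * np.cos(u)
    y = (R + r * np.cos(v)) * np.sin(u)
    z = r * np.sin(v)
    return np.column_stack([x, y, z])

def knn_simplices(points, k=3):
    """One candidate simplex per point: the point plus its k nearest neighbours."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    return {frozenset([i, *map(int, row)]) for i, row in enumerate(nearest)}

points = sample_torus(200)
simplices = knn_simplices(points)
print(len(simplices), "candidate 3-simplices on", len(points), "points")
```

<p>Each simplex is stored as a set of point indices, so it can be fed straight into any bookkeeping that works on vertex labels.</p>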
<h3 id="associations">Associations</h3>
<p>But it would be useful to triangulate the torus using the more traditional approach. To get you started on the right track, you’ll need another tool: association. An association is basically a way of considering things to be equal. It’s kinda like defining specific teleportation or something. I think it’s best to just work out an example. So let’s triangulate some things (like a sphere – the air-filled balloon, for example) but without using the fact that we’re in 3-D space and using associations instead. Let’s start with two 2-simplices joined at one edge (to make something that looks like a square).</p>
<p><img src="/assets/pics/2018/04/12_triangulate_base.png" alt="sphere" /></p>
<p>We then make the following association: we say anything along the line $(\langle1\rangle,\langle2\rangle)$ is associated with things on the line $(\langle4\rangle,\langle3\rangle)$ – where order is important. If we want to make it more formal, we could parameterize the line connecting $\langle1\rangle$ to $\langle2\rangle$ by a parameter $t$ such that $f_0(0) = \langle1\rangle$ and $f_0(1) = \langle2\rangle$, do the same for the line connecting $\langle4\rangle$ to $\langle3\rangle$ with $f_1(0) = \langle4\rangle$ and $f_1(1) = \langle3\rangle$, and then declare $f_0(t)$ and $f_1(t)$ to be the same point for every $t$. We describe this visually like so:</p>
<p><img src="/assets/pics/2018/04/12_tube.png" alt="tube" /></p>
<p>We’re considering all the points on the top line to be associated with (or equal to) their corresponding points on the bottom line. You can imagine this to be kinda like pacman: when you go off the top edge, you come back from the bottom edge in the same direction (because of the association that we’ve made). So, we might as well fold the paper over and tape the two sides together so as to better visualize the association we’ve made. But when we do that, we notice we have a tube!</p>
<p>So the association shown above can be described as a tube. Unfortunately, it’s not a simplicial complex though. This can be seen because if we call the top-left triangle $A$, and the bottom-right one $B$, the intersection of $A$ and $B$ is the diagonal line $(\langle1\rangle,\langle3\rangle)$ and the associated line (it’s only one because the two are associated – hence the same line) $(\langle1\rangle,\langle2\rangle) = (\langle4\rangle,\langle3\rangle)$. So the intersection is two lines, which is a simplicial complex, but not a simplex! So this triangulation doesn’t have the second property that we require all simplicial complexes to have.</p>
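<p>Another way to watch the same failure happen is to track only vertex labels. The sketch below assumes the square’s corners are labelled $1,2,3,4$ as in the text, so that the diagonal $(1,3)$ cuts it into the triangles $\{1,2,3\}$ and $\{1,3,4\}$. The helper <code>glue</code> is hypothetical; it just relabels every vertex by its representative.</p>

```python
def glue(simplices, identify):
    """Relabel each vertex by its representative and return the new vertex sets."""
    def rep(v):
        return identify.get(v, v)
    return [frozenset(rep(v) for v in s) for s in simplices]

# The square, split along the diagonal (1, 3).
triangles = [(1, 2, 3), (1, 3, 4)]

# Associating the line (1, 2) with the line (4, 3), in that order,
# identifies corner 4 with corner 1 and corner 3 with corner 2.
identify = {4: 1, 3: 2}

glued = glue(triangles, identify)
print(glued)

# A 2-simplex needs three distinct vertices, but after the gluing both
# triangles are left with only the two labels 1 and 2.
assert all(len(s) == 2 for s in glued)
assert glued[0] == glued[1]
```

<p>So with only two triangles there simply aren’t enough distinct vertices left to tell the simplices apart, which is why subdividing further helps.</p>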
<p>We could make it a simplicial complex by dividing the thing into a 9 square grid (so 18 triangles) and then doing the same thing again with the top being associated with the bottom, but it’s a lot to write out. Try it out though. See if you can make it a simplicial complex. If you run into any issues, feel free to email me.</p>
<p>Moving on… Now if we just flipped the association, we get a möbius strip!</p>
<p><img src="/assets/pics/2018/04/12_mobius.png" alt="möbius" /></p>
<p>It might be useful to actually break out a piece of paper and do this if you’ve never done it before.</p>
<p>If we take the tube and add this one association it becomes a torus (a hollow doughnut)!</p>
<p><img src="/assets/pics/2018/04/12_torus.png" alt="torus" /></p>
<p>Notice that we identify which sides are connected by the fact that they share the same symbol (the double arrow is connected to the other double arrow, and the single arrow does likewise).</p>
<p>If you’re trying it out on paper, make sure you label the points so you can see them! If you reverse the direction of one side of the second association, you get something very different – a Klein bottle!</p>
<p><img src="/assets/pics/2018/04/12_klein.png" alt="klein bottle" /></p>
<p>I feel like it’s probably a bit cruel of me to have put this after the picture, in case you’ve been trying to make the last one with your paper… It turns out, you can’t make the last one in 3D space. You need to either use an extra dimension or tear the paper, unfortunately. But, it’s interesting to see that flipping the association completely changes the thing – kinda like chirality in chemistry.</p>
<p>Like if we take the möbius strip and make the following additional association, you get something called the Projective Plane (\(\mathbb{P}^2\) – also not embeddable in 3D space).</p>
<p><img src="/assets/pics/2018/04/12_projective.png" alt="projective plane" /></p>
<p>And finally, the following association leads us to a sphere, or, more precisely, the surface of a sphere.</p>
<p><img src="/assets/pics/2018/04/12_sphere.png" alt="sphere" /></p>
<p>Then, since all the points on the top line (and the bottom, left, and right lines) are all associated, imagine contracting them while keeping the inside of the bag relatively unchanged (as far as area is concerned). So necessarily the bag will puff up like a bubble as the sides squish down to a smaller and smaller hole. Eventually, the edge becomes just a point, and what do we have? Yup, we’ve got a sphere. So that’s one way we can triangulate a sphere.</p>
<p>The problem is that none of these are simplicial complexes. Why not? Can you fix these so that they become simplicial complexes?</p>
<h2 id="summary">Summary</h2>
<p>So we’ve triangulated some things, defined what simplices and simplicial complexes are, but we haven’t yet found their utility. For that, just stay tuned! In the next episode we’ll go over Betti Numbers and how to compute them using these simplicial complexes!</p>Aaron NiskinWhat does topology have to do with data?! And what is this topology thing anyway?Decomposition of Autoregressive Models2017-10-25T13:52:39-07:002017-10-25T13:52:39-07:00http://aaron.niskin.org/blog/2017/10/25/Autoregression<h1 id="autoregression">Autoregression</h1>
<h2 id="background">Background</h2>
<p>This post will be a formal introduction to some of the theory of Autoregressive models. Specifically, we’ll tackle how to decompose an AR(p) model into a bunch of AR(1) models. We’ll also discuss the interpretability of these models and how to go about interpreting them. This post will develop the subject using what seems to be an atypical approach, but one that I find to be very elegant.</p>
<!-- more -->
<h1 id="the-traditional-way">The traditional way</h1>
<p>Let $x_t$ be an AR(p) process. So $x_t = \sum\limits_{i=1}^pa_ix_{t-i} + w_t$. We can express $x_t$ thusly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& \sum\limits_{i=1}^pa_ix_{t-i} + w_t \\
x_t - \sum\limits_{i=1}^pa_ix_{t-i} =& w_t \\
\left(1 - \sum\limits_{i=1}^pa_iL^i\right)x_t =& w_t
\end{align*} %]]></script>
<p>Where $L$ is the <strong>lag operator</strong>. We define the AR polynomial $\Phi$ as $\Phi(L) := \left(1 - \sum\limits_{i=1}^pa_iL^i\right)$. Then as long as we’re considering this polynomial to be over an algebraically closed field (like $\mathbb{C}$), we can factor this polynomial. So</p>
<script type="math/tex; mode=display">\Phi(L) = c\prod\limits_{i=1}^p(L-\varphi_i)</script>
<p>Where $\varphi_i$ are the roots of $\Phi$ and $c = -a_p$ is its leading coefficient. This polynomial will be the star of our show today. In fact, if we want to understand $\Phi$ (the AR process we’re considering), we need only understand each $x_t = \varphi_i^{-1}x_{t-1} + w_t$ model (the factor $L-\varphi_i$ is just $-\varphi_i(1-\varphi_i^{-1}L)$). We’ll actually prove this later.</p>
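<p>A quick numerical sanity check of the factorisation, as a sketch with NumPy. The AR(2) coefficients are made up. Up to a constant, each factor $L-\varphi_i$ is $1-L/\varphi_i$, so the AR(1) coefficient that goes with the root $\varphi_i$ is $1/\varphi_i$.</p>

```python
import numpy as np

a = np.array([0.5, 0.3])            # hypothetical AR(2) coefficients a_1, a_2

# Phi(L) = 1 - a_1 L - a_2 L^2, coefficients listed from the constant term up.
phi_low_to_high = np.concatenate([[1.0], -a])

# numpy.roots wants the highest degree first.
roots = np.roots(phi_low_to_high[::-1])

# Rebuild Phi from its roots: c * prod(L - root), where c is the leading
# coefficient of Phi, namely -a_p.
c = -a[-1]
rebuilt = c * np.poly(roots)        # highest degree first
assert np.allclose(rebuilt[::-1], phi_low_to_high)

# AR(1) coefficients attached to the individual factors.
ar1_coeffs = 1 / roots
print("roots:", roots)
print("AR(1) coefficients:", ar1_coeffs)
```

<p>For these particular numbers both roots land outside the unit circle, so both AR(1) coefficients have absolute value below 1.</p>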
<p>We’ll be investigating several of its properties but first (just for fun) let’s investigate the conditions under which it’s invertible.</p>
<p>We can reduce the problem of determining whether or not the AR polynomial is invertible to the problem of inverting each factor. After all, the polynomial is invertible if and only if each factor is. So let’s restrict our considerations to just a single factor for the time being. When will $x-\lambda$ be invertible?</p>
<p>Now, we recall the Taylor Series Expansion formula from our trusty calculus class.</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(a)}{i!}(x-a)^i</script>
<p>In general, we tend to use $a=0$ as a good arbitrary choice (the Maclaurin Series), resulting in:</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(0)}{i!}x^i</script>
<p>Well, if we consider the Taylor expansion of $\frac{1}{x-\lambda}$, we find</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{1}{x-\lambda} =& \frac{-1}{\lambda} + \frac{-1}{\lambda^2}x + \frac{-1}{\lambda^3}x^2 + ... \\
=& -\sum\limits_{i=0}^\infty \frac{1}{\lambda^{i+1}}x^{i}
\end{align*} %]]></script>
<p>The coefficients of this series are square summable if and only if $|\lambda| > 1$. So we see that if the AR roots are all outside the unit circle, then the AR polynomial will be invertible (in such a case, the AR process is called <strong>causal</strong>).</p>
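<p>As a sanity check on the expansion of $\frac{1}{x-\lambda}$ about $0$, whose $i$-th coefficient is $-\lambda^{-(i+1)}$, here is a tiny sketch in plain Python. The values of <code>x</code> and <code>lam</code> are arbitrary choices with $|x| \lt |\lambda|$ and $|\lambda| \gt 1$.</p>

```python
def partial_sum(x, lam, terms):
    """Partial sum of -sum_{i>=0} x^i / lam^(i+1), the series for 1/(x - lam)."""
    return -sum(x**i / lam**(i + 1) for i in range(terms))

x, lam = 0.9, 1.5                   # hypothetical values with |x| < |lam|
exact = 1 / (x - lam)
assert abs(partial_sum(x, lam, 200) - exact) < 1e-12

# Squared coefficients form a geometric series with ratio 1 / lam^2, so for
# |lam| > 1 they sum to 1 / (lam^2 - 1).
square_sum = sum((1 / lam**(i + 1)) ** 2 for i in range(200))
assert abs(square_sum - 1 / (lam**2 - 1)) < 1e-12
```

<p>With $|\lambda| \leq 1$ the squared coefficients do not shrink, so the second sum would grow without bound as more terms are added.</p>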
<h1 id="the-new-approach">The new(?) approach</h1>
<h2 id="definition">Definition</h2>
<p>Let $X_t\in \mathbb{R}^n, A_i\in\mathbb{R}^{n\times n}, B\in\mathbb{R}^{n\times k_0}$, and $p\in\mathbb{N}_{>0}$. Also, let $W_t$ be $k_0$ dimensional white noise. Then, $X_t$ is called a VAR(p) process if,</p>
<script type="math/tex; mode=display">X_t = \sum\limits_{i=1}^p A_iX_{t-i} + BW_t</script>
<p>Basically a VAR(p) model is an analogue of AR(p) models where the scalars are matrices and the variable itself is a vector. The noise in such a model need not be independent per coordinate, nor restricted to only one. In fact, we can have any number of driving noise processes that get mixed together linearly into $X_t$. Hence cometh $B$ – the linear transformation from the driving noise space into the $X_t$ space.</p>
<p>Furthermore, let’s say that we don’t observe $X_t$ directly, but rather we observe some $Y_t$ that’s a linear function of $X_t$ (plus some noise). Formally, we observe $Y_t$ s.t.:</p>
<script type="math/tex; mode=display">Y_t = C X_t + D U_t</script>
<p>Where $Y_t \in\mathbb{R}^m, C\in\mathbb{R}^{m\times n}, D\in \mathbb{R}^{m\times k_1}$. And again, $U_t$ is $k_1$ dimensional white noise.</p>
<p>Here $D$ serves the same purpose as $B$ did above, but for the observations themselves.</p>
<p>This is called a “State Space” model.</p>
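<p>To make the shapes concrete, here is a rough simulation sketch (assuming NumPy, Gaussian white noise, and zero initial states; none of those choices come from the definition itself). It takes $B$ with shape $(n, k_0)$ and $D$ with shape $(m, k_1)$ so that $BW_t$ and $DU_t$ land in the state and observation spaces respectively. The function name and the toy matrices are hypothetical.</p>

```python
import numpy as np

def simulate_state_space(A_list, B, C, D, n_steps, rng):
    """Simulate X_t = sum_i A_i X_{t-i} + B W_t and Y_t = C X_t + D U_t.

    Assumptions for this sketch: Gaussian white noise, all-zero initial states.
    """
    n, k0 = B.shape
    k1 = D.shape[1]
    p = len(A_list)
    past = [np.zeros(n) for _ in range(p)]  # X_{t-1}, ..., X_{t-p}
    states, observations = [], []
    for _ in range(n_steps):
        W_t = rng.standard_normal(k0)  # driving noise
        U_t = rng.standard_normal(k1)  # observation noise
        X_t = sum(A_i @ X_prev for A_i, X_prev in zip(A_list, past)) + B @ W_t
        Y_t = C @ X_t + D @ U_t
        past = [X_t] + past[:-1]       # shift the lag window
        states.append(X_t)
        observations.append(Y_t)
    return np.array(states), np.array(observations)

# Toy example: VAR(2) with n = 2 states, k0 = 3 driving noises, m = 1 observation.
rng = np.random.default_rng(0)
A_list = [0.5 * np.eye(2), 0.2 * np.eye(2)]
B = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])
C = np.array([[1.0, 1.0]])
D = np.array([[0.1]])
states, observations = simulate_state_space(A_list, B, C, D, 50, rng)
```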
<h2 id="examplificate">Examplificate</h2>
<p>How does this relate to our star (the AR polynomial $\Phi$)? Let’s try to express an AR(p) process as a VAR(1) process…</p>
<p>Let $x_t$ be an AR(p) process. That is to say, $x_t = \sum\limits_{i=1}^p a_ix_{t-i} + w_t$ where $w_t$ is some white noise process.</p>
<p>Let $W_t$ be a 1 dimensional white-noise process such that $W_t = w_t$. We define</p>
<script type="math/tex; mode=display">X_t := \left[\begin{matrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(p-1)} \end{matrix}\right]</script>
<p>Then, we can see that the first coordinate of $X_t$ is always $x_t$. Our notation for this will be $X_t[0] = x_t$.</p>
<p>Furthermore, let</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \left[\begin{matrix}
a_1 & a_2 & \dots & a_{p-1} & a_p \\
1 & 0 & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & 0 \\
\end{matrix} \right] %]]></script>
<p>We can see that for $0 < i < p$, we have $(AX_{t-1})[i] = x_{t-i}$, and $(AX_{t-1})[0] = \sum\limits_{i=1}^p a_i x_{t-i} = x_t - w_t$. As it stands, $AX_{t-1}$ is almost $X_t$, just without the added noise in the first coordinate. So let’s add noise to only the first coordinate. To this end, let</p>
<script type="math/tex; mode=display">B := \left[\begin{matrix} 1 \\ 0 \\ \vdots \\ 0 \end{matrix}\right]</script>
<p>so that</p>
<script type="math/tex; mode=display">X_t = AX_{t-1} + BW_t</script>
<p>and</p>
<script type="math/tex; mode=display">\forall t \in \mathbb{N}_{>p} (X_t[0] = x_t)</script>
<p>Now we want $Y_t = x_t$. So we set</p>
<script type="math/tex; mode=display">% <![CDATA[
C = \left[\begin{matrix}1 & 0 & 0 & \dots & 0\end{matrix}\right] %]]></script>
<p>That way, $Y_t = CX_t = x_t$ and all is right with the world.</p>
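<p>Here is a small sketch of the construction (assuming NumPy, zero initial conditions, and an arbitrary stable choice of coefficients $a_i$; the function names are made up). It simulates the scalar AR(p) recursion and the companion-form VAR(1) recursion on the same noise and lets us compare the two.</p>

```python
import numpy as np

def companion_matrix(a):
    """Companion matrix A for AR coefficients a = [a_1, ..., a_p]."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                  # first row carries the AR coefficients
    A[1:, :-1] = np.eye(p - 1)   # sub-diagonal of ones shifts the lags down
    return A

def simulate_ar(a, w, x_init):
    """Scalar recursion x_t = sum_i a_i x_{t-i} + w_t (x_init is newest-first)."""
    p = len(a)
    history = list(x_init)
    out = []
    for w_t in w:
        x_t = float(np.dot(a, history[:p])) + w_t
        history.insert(0, x_t)
        out.append(x_t)
    return np.array(out)

def simulate_var1(a, w, x_init):
    """Same process via X_t = A X_{t-1} + B W_t, Y_t = C X_t."""
    p = len(a)
    A = companion_matrix(a)
    B = np.zeros(p)
    B[0] = 1.0                   # noise enters the first coordinate only
    C = np.zeros(p)
    C[0] = 1.0                   # observe the first coordinate only
    X = np.array(x_init, dtype=float)
    out = []
    for w_t in w:
        X = A @ X + B * w_t
        out.append(C @ X)
    return np.array(out)

# Illustrative coefficients (an assumption, chosen so the process is stable).
rng = np.random.default_rng(0)
a = np.array([0.5, -0.3, 0.1])
w = rng.standard_normal(200)
x_init = np.zeros(3)
x_scalar = simulate_ar(a, w, x_init)
x_state = simulate_var1(a, w, x_init)
```

<p>Comparing the two with <code>np.allclose(x_scalar, x_state)</code> checks that both recursions produce the same series.</p>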
<p>So yay! We’ve expressed this AR(p) process as a $p$-dimensional VAR(1) process! So… Why?</p>
<h2 id="why-do">Why do?</h2>
<p>Well, one natural question to ask at this point is, what are the eigenvalues of $A$?</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& \det(A - \lambda I) \\
=& \left|\begin{matrix}
a_1 - \lambda & a_2 & \dots & a_{p-1} & a_p \\
1 & -\lambda & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & -\lambda \\
\end{matrix} \right| \\
=& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p
\end{align*} %]]></script>
<p>This polynomial is the characteristic polynomial of $A$, so the roots of this polynomial are the eigenvalues of $A$. Let’s see how this relates to $\Phi$…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p \\
(-\lambda)^{-p}(0) =& 1 - a_1\lambda^{-1} - a_2\lambda^{-2} - \dots - a_p\lambda^{-p} \\
\text{Define } L :=& \lambda^{-1} \\
0 =& 1 - a_1L - a_2L^2 - \dots - a_pL^p \\
=& 1 - \sum\limits_{i=1}^pa_iL^i \\
=& \Phi(L)
\end{align*} %]]></script>
<p>So the characteristic polynomial is really just $\Phi(L)$ revisited! Furthermore, check this sweetness out…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\Phi(L) =& c\prod\limits_{i=1}^p\left(L-\varphi_i\right) \\
=& c\prod\limits_{i=1}^p\left(\lambda^{-1}-\varphi_i\right) \Rightarrow \\
\lambda_i =& \frac{1}{\varphi_i}
\end{align*} %]]></script>
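<p>The reciprocal relationship is easy to check numerically. The following is a minimal sketch, assuming NumPy and an arbitrary illustrative coefficient vector; it compares the eigenvalues of the companion matrix with the reciprocals of the roots of $\Phi$.</p>

```python
import numpy as np

# Illustrative AR(3) coefficients a_1, a_2, a_3 (an assumption for this sketch).
a = np.array([0.5, -0.3, 0.1])
p = len(a)

# Companion matrix: coefficients in the first row, ones on the sub-diagonal.
A = np.zeros((p, p))
A[0, :] = a
A[1:, :-1] = np.eye(p - 1)
eigenvalues = np.linalg.eigvals(A)

# Phi(L) = 1 - a_1 L - ... - a_p L^p. np.roots expects the highest degree first,
# so the coefficient list is [-a_p, ..., -a_1, 1].
phi_coeffs = np.concatenate(([1.0], -a))[::-1]
ar_roots = np.roots(phi_coeffs)

# For every eigenvalue, find the distance to the closest reciprocal root.
reciprocals = 1.0 / ar_roots
max_gap = max(np.min(np.abs(reciprocals - lam)) for lam in eigenvalues)
```

<p>A tiny <code>max_gap</code> means each eigenvalue matches some $1/\varphi_i$ up to floating point error. Eigenvalues inside the unit circle correspond to AR roots outside it.</p>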
<p>So we can see that the eigenvalues are actually intimately related with the roots (in fact they’re reciprocals of each other)! With that in mind, let’s see what happens when we try to diagonalize this matrix $A$. Firstly, let’s assume that the matrix $A$ is diagonalizable. I.e., let</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
H^{-1}\bar AH =& A \Leftrightarrow \\
\bar A =& HAH^{-1}
\end{align*} %]]></script>
<p>where $H$ is a coordinate transformation (an invertible map) and $\bar A$ is diagonal.</p>
<p>Then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
X_t =& H^{-1}\bar AH X_{t-1} + BW_t \\
=& H^{-1}\left(\bar AHX_{t-1} + HBW_t\right)
\end{align*} %]]></script>
<p>So if we let $\tilde X_t = HX_t$, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde X_t =& HX_t \\
=& HH^{-1}\left(\bar AHX_{t-1} + HBW_t\right) \\
=& \bar AHX_{t-1} + HBW_t \\
=& \bar A\tilde X_{t-1} + HBW_t
\end{align*} %]]></script>
<p>So $\tilde X_t$ is a bunch of AR(1) processes (because the matrix $\bar A$ is diagonal), and,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
Y_t =& CX_t \text{ because $U_t=0$ for our scenario}\\
=& CH^{-1}\tilde X_t \\
=& \left(CH^{-1}\right)\tilde X_t
\end{align*} %]]></script>
<p>Which is to say that $Y_t$ is a linear combination of AR(1) processes! So we started out with a general AR(p) process, and found that expressing it as a VAR(1) model allows us to see that this is just a linear combination of AR(1) processes. I think that’s pretty cool. But let’s investigate this a bit further…</p>
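<p>Here is a sketch of that decomposition in code, assuming NumPy, zero initial conditions, and an illustrative coefficient vector whose companion matrix has distinct eigenvalues (so it is diagonalizable). Since <code>np.linalg.eig</code> returns $V$ with $A = V\bar AV^{-1}$, we take $H = V^{-1}$ to match the convention above. The eigenvalues may be complex, so the AR(1) components are stored as complex numbers.</p>

```python
import numpy as np

# Illustrative AR(3) coefficients (an assumption for this sketch).
a = np.array([0.5, -0.3, 0.1])
p = len(a)
A = np.zeros((p, p))
A[0, :] = a
A[1:, :-1] = np.eye(p - 1)
B = np.zeros(p)
B[0] = 1.0
C = np.zeros(p)
C[0] = 1.0

lam, V = np.linalg.eig(A)   # A = V diag(lam) V^{-1}
H = np.linalg.inv(V)        # so A = H^{-1} diag(lam) H
F = C @ V                   # F = C H^{-1}
HB = H @ B                  # how the single noise feeds each component

rng = np.random.default_rng(1)
w = rng.standard_normal(300)

X = np.zeros(p)                        # companion-form state
X_tilde = np.zeros(p, dtype=complex)   # p decoupled AR(1) components
direct, recombined = [], []
for w_t in w:
    X = A @ X + B * w_t
    X_tilde = lam * X_tilde + HB * w_t  # elementwise: each coordinate is AR(1)
    direct.append(C @ X)
    recombined.append(F @ X_tilde)

direct = np.array(direct)
recombined = np.array(recombined)
max_error = np.max(np.abs(recombined - direct))
```

<p>Note that every component is driven by the same $w_t$, just scaled by its entry of $HB$, so the AR(1) noise terms are perfectly correlated in this example.</p>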
<h1 id="what-more">What more?!</h1>
<p>As per our previous section, we have that $Y_t$, which was really just $x_t$ – our original AR(p) variable, is a linear combination of AR(1) models. More specifically, if we let $F := CH^{-1}$, then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& Y_t \\
=& F\tilde X_t
\end{align*} %]]></script>
<p>So indeed, $x_t$ is a linear combination of $p$ many AR(1) processes. In fact, since $\tilde X_t = \bar A \tilde X_{t-1} + \left(HB\right)W_t$, we know the AR(1) models explicitly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde x_{t,i} =& \lambda_i\tilde x_{t-1,i} + w_{t,i} \\
\tilde x_{t,i} - \lambda_i\tilde x_{t-1,i} =& w_{t,i} \\
\left(1-\lambda_iL\right)\tilde x_{t,i} =& w_{t,i} \\
-\lambda_i\left(L-\frac{1}{\lambda_i}\right)\tilde x_{t,i} =& w_{t,i}
\end{align*} %]]></script>
<p>Where $\lambda_i$ is the $i$th eigenvalue of $A$ (hence the $(i,i)$th element of $\bar A$), $w_{t,i}$ is the $i$th coordinate of the transformed noise $HBW_t$, and $L$ here denotes the lag operator ($L\tilde x_{t,i} = \tilde x_{t-1,i}$). The $i$th AR(1) polynomial has the single root $\frac{1}{\lambda_i} = \varphi_i$, so we see that the AR(1) processes each have their respective roots from the roots of the original AR polynomial $\Phi$. Pretty cool, right?</p>
<h2 id="summary">Summary</h2>
<p>So we can express any AR(p) model whose companion matrix $A$ is diagonalizable as a linear combination of AR(1) processes (albeit with correlated noise terms), where each AR(1) process is determined by one root of the AR polynomial (equivalently, one eigenvalue of $A$).</p>
<p>We’ve essentially reduced the problem of studying AR(p) models to studying their eigenvalues (or AR roots).</p>
Aaron Niskin
We discuss the decomposition of AR models into components, and how eigenvalues are involved (because they always are).