Jekyll2018-12-23T13:14:45-08:00http://aaron.niskin.org/feed.xmlAaron Niskin’s BlogNot just another mathy blog! Now with more... Aaron!amniskinLet’s Cover The Fundamental Group (pt 1)2018-08-12T00:00:00-07:002018-08-12T00:00:00-07:00http://aaron.niskin.org/blog/2018/08/12/fundamental_group_pt1<h2 id="jumping-right-in">Jumping right in</h2>
<h3 id="prerequisites">Prerequisites:</h3>
<p>Firstly, this post builds off of my <a href="/blog/2018/08/05/homotopy_revisited.html">previous post</a>, so if you’re learning these things for the first time and find yourself kinda lost, start there.</p>
<p>In addition to the previous post, the following are mathematical definitions you should probably know before reading this post. If you don’t know any of these, trust google.</p>
<ul>
<li><strong>Group</strong></li>
<li><strong>Homomorphism</strong></li>
<li><strong>Kernel</strong></li>
<li><strong>Subgroup</strong></li>
<li><strong>Normal subgroup</strong></li>
<li><strong>Homotopy</strong></li>
<li><strong>Retraction</strong></li>
<li><strong>Star Convex</strong></li>
</ul>
<h3 id="moving-onward">Moving onward</h3>
<p><strong>Definition</strong>:</p>
<p>Given a path $\alpha$ from $x_0$ to $x_1$ in $X$, we define $\hat\alpha:\pi_1(X,x_0)\to\pi_1(X,x_1)$ as follows:</p>
<script type="math/tex; mode=display">\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]</script>
<p>And it’s a theorem that $\hat\alpha$ is a group isomorphism.</p>
<p><strong>Definition</strong>:</p>
<p>Given a continuous map $h:(X,x_0)\to(Y,y_0)$, we define:</p>
<p>$h_\star:\pi_1(X,x_0)\to\pi_1(Y,y_0)$ as follows:</p>
<p>$h_\star([f]) = [h\circ f]$.</p>
<p>And with that…</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-asubseteqmathbbrn-is-star-convex-then-a-is-simply-connected">Show that if $A\subseteq\mathbb{R}^n$ is star convex then $A$ is simply connected.</h3>
<div class="hint">
<p>Let $a\in A$ be one of the points that make the set star convex. Our task now is to show:</p>
<ol>
<li>For any $x,y\in A$, there is a path entirely within $A$ connecting $x$ and $y$, and</li>
<li>$\pi_1(A,a)$ is trivial.</li>
</ol>
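<p>For the first: just let $x,y\in A$ and let $f_x,f_y$ be the straight-line paths connecting them to $a$ (these stay inside $A$ precisely because $A$ is star convex about $a$). Then note that $f_x \star \bar f_y$ is a path connecting $x$ to $y$.</p>
<p>For the second we can use the same straight-line homotopy that we used for the convex version: let $f$ be a path starting and ending at $a$ and let $H(s,t) = t\cdot a + (1-t)\cdot f(s)$. Each segment from $f(s)$ to $a$ lies in $A$ by star convexity, and $H(0,t)=H(1,t)=a$ because $f(0)=f(1)=a$, so $H$ is a path homotopy from $f$ to the constant loop at $a$.</p>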
<p>Lastly, because part <em>a</em> (that I didn’t mention) is to find a set that’s star convex but not convex, I’ll leave the question with The Star of David.</p>
</div>
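<p>To make the straight-line homotopy concrete, here is a small numerical sketch. The set $A$ (the union of the two coordinate axes in $\mathbb{R}^2$), the base point, and the loop <code>f</code> are all illustrative assumptions of mine, not part of the exercise. It only spot-checks the claims on a grid; it is not a proof.</p>

```python
# Sketch: A = union of the two coordinate axes in R^2 (an assumed example).
# A is star convex about a = (0, 0), but it is not convex.

def in_A(p, tol=1e-12):
    """True when p lies on the x-axis or on the y-axis."""
    x, y = p
    return abs(x) <= tol or abs(y) <= tol

a = (0.0, 0.0)

def f(s):
    """A loop at a: out along the x-axis and back, then out along the y-axis and back."""
    if s <= 0.5:
        return (1 - abs(4 * s - 1), 0.0)
    return (0.0, 1 - abs(4 * s - 3))

def H(s, t):
    """Straight-line homotopy H(s, t) = t*a + (1 - t)*f(s)."""
    x, y = f(s)
    return (t * a[0] + (1 - t) * x, t * a[1] + (1 - t) * y)

grid = [k / 16 for k in range(17)]
stays_in_A = all(in_A(H(s, t)) for s in grid for t in grid)
fixes_base_point = all(H(0, t) == a and H(1, t) == a for t in grid)
starts_at_f = all(H(s, 0) == f(s) for s in grid)
ends_at_constant = all(H(s, 1) == a for s in grid)

# A is not convex: the midpoint of (1, 0) and (0, 1) is not in A.
midpoint_in_A = in_A((0.5, 0.5))
```

<p>The last line is the "star convex but not convex" half of the question in miniature: the segment between two points of $A$ can leave $A$, while every segment to $a$ stays inside.</p>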
<h3 id="show-that-if-alphabeta-are-paths-from-x_0to-x_1to-x_2-all-in-x-and-gamma--alphastarbeta-that-hatgamma--hatbetacirchatalpha">Show that if $\alpha$ is a path from $x_0$ to $x_1$ and $\beta$ is a path from $x_1$ to $x_2$, both in $X$, and $\gamma = \alpha\star\beta$, then $\hat\gamma = \hat\beta\circ\hat\alpha$.</h3>
<div class="hint">
<p>Since these are functions between equivalence classes, our job is now to show that the outputs for a given input are homotopic.</p>
<p>So let $f$ be a path in $X$ starting and stopping at $x_0$. Then:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\gamma([f]) =& [\bar\gamma]\star[f]\star[\gamma] \\
=& [\overline{\alpha\star\beta}]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta\star\bar\alpha]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta]\star[\bar\alpha]\star[f]\star[\alpha]\star[\beta]\\
=& [\bar\beta]\star([\bar\alpha]\star[f]\star[\alpha])\star[\beta]\\
=& [\bar\beta]\star(\hat\alpha([f]))\star[\beta]\\
=& \hat\beta(\hat\alpha([f]))
\end{align*} %]]></script></p>
</div>
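<p>The one step in that chain that isn’t pure bookkeeping is that reversing a concatenation gives $\bar\beta\star\bar\alpha$. A minimal sketch, assuming the usual "double speed" convention for $\star$ and two made-up paths in the plane, shows that this identity even holds pointwise (not just up to homotopy):</p>

```python
def concat(f, g):
    """f * g: run f at double speed on [0, 1/2], then g at double speed on [1/2, 1]."""
    return lambda s: f(2 * s) if s <= 0.5 else g(2 * s - 1)

def reverse(f):
    """The reverse path: s -> f(1 - s)."""
    return lambda s: f(1 - s)

# Hypothetical example paths: alpha runs from x0 = (0, 0) to x1 = (1, 0),
# and beta runs from x1 = (1, 0) to x2 = (1, 1).
alpha = lambda s: (s, 0.0)
beta = lambda s: (1.0, s)

gamma = concat(alpha, beta)
lhs = reverse(gamma)                           # the reverse of alpha * beta
rhs = concat(reverse(beta), reverse(alpha))    # reverse(beta) * reverse(alpha)

samples = [k / 16 for k in range(17)]
agree = all(lhs(s) == rhs(s) for s in samples)
```

<p>The sample points are dyadic fractions so that the floating point comparisons are exact.</p>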
<h3 id="show-that-if-x_0x_1-are-points-in-a-path-connected-space-x-pi_1xx_0-is-abelian-if-and-only-if-for-every-pair-of-paths-from-alphabeta-from-x_0-to-x_1-hatalpha--hatbeta">Show that if $x_0,x_1$ are points in a path-connected space $X$, $\pi_1(X,x_0)$ is abelian if and only if for every pair of paths $\alpha,\beta$ from $x_0$ to $x_1$, $\hat\alpha = \hat\beta$.</h3>
<div class="hint">
<p>If $\pi_1(X,x_0)$ is abelian, then since $\beta\star\bar\alpha$ is a loop at $x_0$, its class commutes with $[f]$, and we have:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\alpha([f]) =& [\bar\alpha]\star[f]\star[\alpha]\\
=& [\bar\alpha]\star[\overline{(\beta\star\bar\alpha)} \star(\beta\star\bar\alpha)\star f]\star[\alpha]\\
=& [\bar\alpha]\star[(\alpha\star\bar\beta)\star f\star(\beta\star\bar\alpha)]\star[\alpha]\\
=& [\bar\alpha]\star[\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha]\star[\alpha]\\
=& [\bar\alpha\star\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha\star\alpha]\\
=&[\bar\beta]\star[f]\star[\beta] \\
=&\hat\beta([f])
\end{align*} %]]></script></p>
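<p>For the other direction, here is a sketch of the standard argument (the choice $\beta = g\star\alpha$ is the only trick). Suppose $\hat\alpha=\hat\beta$ for every pair of paths from $x_0$ to $x_1$, and let $[f],[g]\in\pi_1(X,x_0)$. Pick any path $\alpha$ from $x_0$ to $x_1$ (there is one because $X$ is path connected) and set $\beta = g\star\alpha$, which is another path from $x_0$ to $x_1$. Then:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\alpha([f]) = \hat\beta([f]) =& [\overline{g\star\alpha}]\star[f]\star[g\star\alpha]\\
=& [\bar\alpha]\star[\bar g]\star[f]\star[g]\star[\alpha]\\
=& \hat\alpha([\bar g]\star[f]\star[g])
\end{align*} %]]></script></p>
<p>Since $\hat\alpha$ is an isomorphism (in particular injective), $[f] = [\bar g]\star[f]\star[g]$, which is the same as $[g]\star[f] = [f]\star[g]$.</p>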
</div>
<h3 id="let-asubset-x-and-let-rxto-a-be-a-retraction-show-that-for-a_0in-a-r_starpi_1xa_0topi_1aa_0-is-surjective">Let $A\subset X$ and let $r:X\to A$ be a retraction. Show that for $a_0\in A$, $r_\star:\pi_1(X,a_0)\to\pi_1(A,a_0)$ is surjective.</h3>
<div class="hint">
<p>Well, any loop $\alpha$ in $A$ starting and stopping at $a_0$ will also be a loop in $X$ (because $A\subset X$). Since a retraction fixes every point of $A$, $r\circ\alpha = \alpha$, so $r_\star([\alpha]_X) = [r\circ\alpha]_A = [\alpha]_A$. It’s surjective because it’s the identity map when we restrict paths to $A$.</p>
</div>
<h3 id="let-asubsetmathbbrn-and-haa_0toyy_0-show-that-if-h-is-extendable-to-a-continuous-map-tilde-hmathbbrnto-y-then-h_star-is-trivial-ie-sends-everybody-to-the-class-of-the-constant-loop">Let $A\subset\mathbb{R}^n$ and $h:(A,a_0)\to(Y,y_0)$. Show that if $h$ is extendable to a continuous map $\tilde h:\mathbb{R}^n\to Y$, then $h_\star$ is trivial (i.e. sends everybody to the class of the constant loop).</h3>
<div class="hint">
<p>Let $G=\pi_1(A,a_0), H=\pi_1(Y,y_0)$. Then, as a reminder, $h_\star:G\to H$ such that $h_\star([\alpha]) = [h\circ\alpha]$.</p>
<p>So let $\alpha,\beta$ be loops in $A$ based at $a_0$. Since $\alpha,\beta$ are arbitrary (and $\beta$ could be the constant loop), it is sufficient to show that $h\circ\alpha$ is path homotopic to $h\circ\beta$, i.e. that $h_\star([\alpha]) = h_\star([\beta])$.</p>
<p>Consider $F:I\times I\to \mathbb{R}^n$ given by $F(s,t) = t\alpha(s) + (1-t)\beta(s)$ (the straight line homotopy between the two loops; it fixes the base point because $F(0,t)=F(1,t)=a_0$). Since $h$ is extendable to $\tilde h:\mathbb{R}^n\to Y$, we know that even if $F$ is a homotopy that leaves $A$ (into some other part of $\mathbb{R}^n$), $\tilde h\circ F$ is a path homotopy between $h\circ\beta$ and $h\circ\alpha$ that stays entirely in $Y$. Hence $[h\circ \alpha] = [\tilde h\circ \alpha] = [\tilde h\circ \beta] = [h\circ\beta]$.</p>
</div>
<!--
### Let $X$ be path connected, $h:X\to Y$ be continuous with $h(x_0) = y_0$ and $h(x_1)=y_1$; let $\alpha$ be a path in $X$ from $x_0$ to $x_1$, and $\beta = h\circ\alpha$. Show that
$$\hat\beta\circ(h_{x_0})_\star = (h_{x_1})_\star\circ\hat\alpha$$
As in, show that the following diagram commutes:
$$
\newcommand{\ra}[1]{\xrightarrow{\quad#1\quad}}
\newcommand{\da}[1]{\left\downarrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
\newcommand{\sea}[1]{\left\searrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
$$
$$
\begin{array}{ccc}
\pi_1(X,x_0) & \ra{(h_{x_0})_\star} & \pi_1(Y,y_0) \\
\da{\hat\alpha} & & \da{\hat\beta} \\
\pi_1(X,x_1) & \ra{(h_{x_1})_\star} & \pi_1(Y,y_1) \\
\end{array}
$$
<div class="hint" markdown="1">
Let
$$\begin{align*}
f=&\hat\beta\circ (h_{x_0})_\star & \text{and}& & g=&(h_{x_1})_\star\circ\hat\alpha
\end{align*}$$
The claim is now that
$$\forall a \in \pi_1(X,x_0), f(a) = g(a)$$
_Proof_:
Let $b_0 \in f(a) = \hat\beta\circ(h_{x_0})_\star (c)$. That means $b$ is a particular path in $(Y,y_1)$ such that $b_0 = \bar\beta\star(h(c))\star\beta$, whereas $g(a) \ni b_1 = h(\bar\alpha\star d \star\alpha)$ with $c,d$ homotopic.
Let $\alpha$ be a path in $(X,x_0)$ and $\beta = h\circ\alpha$ -- hence a path in $(Y,y_1)$.
</div>
-->Aaron NiskinMunkres saves the day and explains the fundamental secrets behind the most invariant of groups!Let’s Cover Homotopy (an apologetic revisionary tale)2018-08-05T00:00:00-07:002018-08-05T00:00:00-07:00http://aaron.niskin.org/blog/2018/08/05/homotopy_revisited<h2 id="what-am-i-doing">What am I doing?</h2>
<h3 id="the-structure-of-these-posts">The structure of these posts</h3>
<p>This post is not really building off of the theory from my previous post (<a href="/blog/2018/04/12/homology.html">homology pt 1</a>), but it’s about the same subject. It turns out we were using a book I didn’t really like too much, so I switched to something with a bit more of a familiar style for me. So I’ll be using <a href="https://www.amazon.com/Topology-Classic-Classics-Advanced-Mathematics/dp/0134689518/ref=sr_1_2?ie=UTF8&qid=1533510847&sr=8-2&keywords=topology+munkres">Topology by Munkres</a> for these posts (at least for now).</p>
<p>I’ll not be covering the book in much detail because after all, Munkres did a great job at doing that himself. But I will summarize some things that I feel like summarizing, and do some of the exercises (just the ones I feel like doing because, well, I can – I know, this is sounding more useful by the second).</p>
<p>So let’s dig in!</p>
<h1 id="part-ii">Part II</h1>
<p>But where did part I go? That’s not Algebraic Topology, so we’re not gonna cover that part (at least not until I feel like I need a review).</p>
<h2 id="chapter-9-the-fundamental-group">Chapter 9: The Fundamental Group</h2>
<p>Basically, we say $f_0,f_1:X\to Y$ are <strong>homotopic</strong> if there is a continuous function $F:X\times I\to Y$ (where $I=[0,1]$) with the following two properties:</p>
<ol>
<li>$\forall x\in X, F(x,0) = f_0(x)$</li>
<li>$\forall x\in X, F(x,1) = f_1(x)$</li>
</ol>
<p>Basically, at one end of the interval $I$, $F$ is $f_0$, and at the other end $F$ is $f_1$. Since the whole map is continuous, this must be a continuous deformation of $f_0$ into $f_1$. And there’s a bunch of (helpful) book-keeping to prove that homotopy defines an equivalence relation, etc. but we won’t be covering that right now. But before we move on, let’s talk about path homotopy…</p>
<p>We say the paths $f_0,f_1:I\to Y$ are <strong>path homotopic</strong> if they have the same start and end points ($x_0$ and $x_1$) and there is a homotopy $F$ from $f_0$ to $f_1$ with the following two additional requirements:</p>
<ol>
<li>$\forall t\in I, F(0,t) = x_0$</li>
<li>$\forall t\in I, F(1,t) = x_1$</li>
</ol>
<p>One thing to note: if you consider a torus (because who doesn’t love donuts), any two paths on it will be homotopic (we’ll see in the exercises that all paths in a path-connected space are), but two paths with the same end-points are not necessarily path homotopic, because they might go around different sides of the donut hole, for instance. The requirement that the end-points are fixed is no trivial thing.</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-hhxto-y-are-homotopic-and-kkyto-z-are-also-homotopic-then-kcirc-h-and-kcirc-h-are-homotopic">Show that if $h,h':X\to Y$ are homotopic and $k,k':Y\to Z$ are also homotopic, then $k\circ h$ and $k'\circ h'$ are homotopic.</h3>
<div class="hint">
<p>Let $H,K$ be homotopies from $h$ to $h'$ and $k$ to $k'$ respectively. Then consider the map $G:X\times I\to Z$ defined by $G(x,t) = K(H(x,t),t)$.</p>
<p><em>Claim</em>: $G$ is a homotopy between $k\circ h$ and $k'\circ h'$.</p>
<p><em>Proof</em>: $G$ is continuous because it’s built by composing the continuous maps $H$ and $K$, so we only have to prove that $G(x,0) = (k\circ h)(x)$ and $G(x,1)= (k'\circ h')(x)$. But those follow directly from the definition of $G$: $G(x,0) = K(h(x),0) = k(h(x))$ and $G(x,1) = K(h'(x),1) = k'(h'(x))$.</p>
</div>
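<p>Here is a quick sanity check of the formula $G(x,t) = K(H(x,t),t)$, as a sketch with $X=Y=Z=\mathbb{R}$. The four maps and the straight-line homotopies between them are assumptions I picked for illustration (straight lines work because the target is $\mathbb{R}$).</p>

```python
# Hypothetical maps R -> R.
h = lambda x: x
h_prime = lambda x: x * x
k = lambda y: y + 1
k_prime = lambda y: 2 * y

def H(x, t):
    """Straight-line homotopy from h (t = 0) to h_prime (t = 1)."""
    return (1 - t) * h(x) + t * h_prime(x)

def K(y, t):
    """Straight-line homotopy from k (t = 0) to k_prime (t = 1)."""
    return (1 - t) * k(y) + t * k_prime(y)

def G(x, t):
    """The combined homotopy from k∘h to k_prime∘h_prime."""
    return K(H(x, t), t)

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
starts_ok = all(G(x, 0) == k(h(x)) for x in xs)
ends_ok = all(G(x, 1) == k_prime(h_prime(x)) for x in xs)
```

<p>Note that both homotopies are driven by the same time parameter $t$; that’s what lets a single $G$ move both maps at once.</p>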
<h3 id="given-spaces-x-and-y-let-xy-be-the-set-of-homotopy-classes-of-maps-from-x-to-y">Given spaces $X$ and $Y$, let $[X,Y]$ be the set of homotopy classes of maps from $X$ to $Y$.</h3>
<p>a) Show that for any $X$, $[X,I]$ is a singleton.</p>
<div class="hint">
<p>Let $f,g:X\to I$ be continuous maps. Then since $\forall x \in X$, $f(x),g(x)\in[0,1]$, we know that the following map $H:X\times I\to I$ will always be well defined: $H(x,t) = t\cdot f(x) + (1-t)\cdot g(x)$. $H$ is a homotopy.</p>
</div>
<p>b) Show that if $Y$ is path connected, the set $[I,Y]$ is a singleton.</p>
<div class="hint">
<p>Let $Y$ be path connected and let $f:I\to Y$ be continuous. Pick a point $y_0\in Y$ and define $g:I\to Y$ such that $\forall t\in I$, $g(t)=y_0$. We will prove that $f$ is homotopic to $g$ (and hence all maps are), and conclude that $[I,Y]$ is a singleton. We’ll prove this in two steps: first we’ll show that $f$ is homotopic to the constant map that maps all of $I$ to one of the end-points of $f$; then we’ll show that the aforementioned constant map is homotopic to $g$.</p>
<p>To prove the first part, let $h:I\to Y$ be such that $\forall t\in I$, $h(t)=f(0)$. Let $H:I\times I\to Y$ be such that $H(s,t) = f(s\cdot (1-t))$. Then clearly $H$ is a homotopy between $f$ and $h$.</p>
<p>To prove the second part, since $Y$ is path connected, let $p:I\to Y$ be a path from $f(0)$ to $y_0$. Then let $G:I\times I\to Y$ be such that $G(s,t) = p(t)$. Aaaannnnndddd, that’s our homotopy between the two constant maps.</p>
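<p>So our desired homotopy from $f$ to $g$ is $H$ followed by $G$: run $H(s,2t)$ for $t\in[0,\frac{1}{2}]$ and then $G(s,2t-1)$ for $t\in[\frac{1}{2},1]$. The two halves agree at $t=\frac{1}{2}$ because $H(s,1) = f(0) = G(s,0)$.</p>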
</div>
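<p>The two-step argument can be sketched numerically by gluing the two homotopies end to end in the time variable. The path <code>f</code>, the point <code>y0</code>, and the straight path <code>p</code> below are assumed examples in $Y=\mathbb{R}^2$, chosen only so the checks are easy to read.</p>

```python
import math

def f(s):
    """An example path in the plane: the upper half of the unit circle."""
    return (math.cos(math.pi * s), math.sin(math.pi * s))

y0 = (0.0, 0.0)

def p(t):
    """A straight path from f(0) to y0."""
    x, y = f(0)
    return ((1 - t) * x, (1 - t) * y)

def H(s, t):
    """Shrinks f down to the constant map at f(0)."""
    return f(s * (1 - t))

def G(s, t):
    """Slides the constant map at f(0) over to the constant map at y0."""
    return p(t)

def combined(s, t):
    """H on the first half of the time interval, then G on the second half."""
    return H(s, 2 * t) if t <= 0.5 else G(s, 2 * t - 1)

grid = [k / 8 for k in range(9)]
starts_at_f = all(combined(s, 0) == f(s) for s in grid)
ends_at_y0 = all(combined(s, 1) == y0 for s in grid)
halves_match = all(H(s, 1) == G(s, 0) for s in grid)
```

<p>The <code>halves_match</code> check is the reason the glued map is continuous at $t=\frac{1}{2}$.</p>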
<h3 id="a-space-x-is-contractible-if-the-identity-map-i_xxto-x-is-homotopic-to-a-constant-map-nullhomotopic">A space $X$ is <strong>contractible</strong> if the identity map $i_X:X\to X$ is homotopic to a constant map (<strong>nullhomotopic</strong>).</h3>
<p>a) Show that $I,\mathbb{R}$ are contractible.</p>
<div class="hint">
<p>To show that $I$ is contractible, we can just consider it a corollary of the first part of the previous answer. So we’ll prove the second part (that $\mathbb{R}$ is contractible).</p>
<p>Let $i:\mathbb{R}\to\mathbb{R}$ be the identity and let $x_0\in\mathbb{R}$, and $f:\mathbb{R}\to\mathbb{R}$ with $f(x) = x_0$. Our job is to show that there is a homotopy $F:\mathbb{R}\times I\to\mathbb{R}$ between $f$ and $i$.</p>
<p>Consider the straight-line map $F(x,t) = (1-t)\cdot x_0 + t\cdot x$. It’s continuous, $F(x,0) = x_0 = f(x)$, and $F(x,1) = x = i(x)$. That is our homotopy.</p>
</div>
<p>b) Show that a contractible space is path connected.</p>
<div class="hint">
<p>Let $Y$ be contractible and let $y_0,y_1$ be points. Then there is a constant map $f:Y\to Y$, say $f(y) = c$, and a homotopy $F:Y\times I\to Y$ contracting $i_Y$ to $f$, so $F(y,0)=y$ and $F(y,1)=c$. For each $y\in Y$ let $p_y(t) = F(y,t)$; this is a path from $y$ to $c$. Then $p_{y_0}\star\bar p_{y_1}$ is a path connecting $y_0$ and $y_1$. We’re done here.</p>
</div>
<p>c) Show that if $Y$ is contractible, then for any $X$ the set $[X,Y]$ is a singleton.</p>
<div class="hint">
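<p>The proof is very similar to the last two, so I’m not gonna do it in full. Ah, the pleasures of self-learning. But basically: if $F:Y\times I\to Y$ contracts $Y$ to a point $c$, then for any $f:X\to Y$ the map $(x,t)\mapsto F(f(x),t)$ is a homotopy from $f$ to the constant map at $c$. So every map $X\to Y$ is homotopic to the same constant map.</p>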
</div>
<p>d) Show that if $X$ is contractible and $Y$ is path connected then $[X,Y]$ is a singleton.</p>
<div class="hint">
<p>The proof that $[I,Y]$ is a singleton is general enough to accommodate this (mutatis mutandis, obvs). Laziness ENGAGE!</p>
</div>
<h2 id="summary">Summary</h2>
<p>We’ve done so many exercises I don’t even feel like I need to work out anymore! (These jokes aren’t really free if you consider the toll they take on you).</p>Aaron NiskinMunkres will surely save us our trespasses this once, but NEVER AGAIN!Let’s talk Homotopy and Algebra2018-06-29T00:00:00-07:002018-06-29T00:00:00-07:00http://aaron.niskin.org/blog/2018/06/29/homotopy<p>So this post is going to be a bit more terse than most. In fact, the objective of this post will be to develop the theory of Homotopy very briefly, with the goal of proving the Fundamental Theorem of Algebra.</p>
<h1 id="homotopy">Homotopy</h1>
<p>Given two continuous maps $f,g : \mathbb{R}^n\to\mathbb{R}^m$, a homotopy between $f$ and $g$ is a continuous map $h: I\times\mathbb{R}^n\to\mathbb{R}^m$ (where $I=[0,1]$) such that $h(0,x)=f(x)$ and $h(1,x) = g(x)$. Basically it’s a continuous deformation of one map into the other. It’s a different kind of thing from a homeomorphism of topological spaces: a homeomorphism compares two spaces, while a homotopy compares two maps. A related picture: two interlocked rings are homeomorphic to two non-interlocked rings, but you can’t deform one arrangement into the other inside $\mathbb{R}^3$ without passing one ring through the other (strictly speaking, a deformation that forbids that is called an isotopy).</p>
<p>Enough of all that! Let’s get to the proof already!</p>
<h1 id="the-fundamental-theorem-of-algebra">The Fundamental Theorem of Algebra</h1>
<h3 id="the-statement">The statement</h3>
<p><em>If $p\in\mathbb{C}[x]$ is a non-constant polynomial, then it has at least one root in $\mathbb{C}$.</em></p>
<p>A consequence of this theorem is that any non-constant polynomial over $\mathbb{C}$ has all of its roots in $\mathbb{C}$. Or equivalently, any polynomial of degree $n\geq 1$ over $\mathbb{C}$ has $n$ roots in $\mathbb{C}$ (counted with multiplicity).</p>
<h3 id="the-proof">The proof</h3>
<p>Let $p(x) = x^n + \sum\limits_{i=0}^{n-1} a_i x^i \in\mathbb{C}[x]$ be a polynomial without any roots in $\mathbb{C}$. For simplicity, we have taken the leading coefficient to be 1. There is no loss of generality in doing so. We then define a function</p>
<script type="math/tex; mode=display">f_r(s) = \frac{ p(re^{2\pi is}) / p(r)}{ \|p(re^{2\pi is}) / p(r)\| }</script>
<p>Then for each real $r\geq 0$, $f_r:I\to S^1$ is a loop in $S^1\subset\mathbb{C}$ based at $1$ (it’s well defined because $p$ has no roots, so we never divide by zero). Furthermore, given $r_0,r_1\geq 0$, we can define a homotopy from $f_{r_0}$ to $f_{r_1}$ as $g(t,s) = f_{r_0+t(r_1-r_0)}(s)$. So for all $r\geq 0$, $f_r$ is homotopic to $f_0$. Since $f_0(s) = 1$ for all $s\in I$, the $f_r$ are all in the same homotopy class as the constant loop on $S^1$.</p>
<p>We now pick $z,r$ such that $|z| = r \gt \max(1,\sum |a_i| )$. We then note that the triangle inequality (together with $|z|>1$) gives us the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\|z\|^n =& \|z\|\cdot \|z\|^{n-1}\\
>& (\sum\|a_i\|)\|z\|^{n-1}\\
=& \sum\|a_i\|\|z\|^{n-1}\\
=& \sum\|a_iz^{n-1}\|\\
\geq& \sum\|a_iz^i\|\\
\geq& \|\sum a_iz^i\|
\end{align} %]]></script>
<p>Which means that for $|z|=r$ and $0\leq t\leq 1$, $|z^n + t\sum a_iz^i| \geq |z|^n - t|\sum a_iz^i| > 0$. So, we can define yet another function $q(t,z) = z^n +t(\sum\limits_{i=0}^{n-1}a_iz^i)$, and for each such $t$ the polynomial $q(t,\cdot)$ has no roots on the circle $|z|=r$. Then $q$ is a homotopy between $z^n$ (at $t=0$) and $p(z)$ (at $t=1$). If we make the same construction as $f_r$ but with this parameterized polynomial instead of $p$, we get a homotopy of loops in $S^1$ from $s\mapsto e^{2\pi i n s}$, which just goes around the circle $n$ times (at $t=0$), to $f_r$ (at $t=1$). But since $f_r$ is homotopic to the constant loop on $S^1$, and going around the circle any non-zero number of times is not homotopic to a constant loop, we know that $n=0$.</p>
<p>So the only polynomials without roots in $\mathbb{C}$ are constant.</p>
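<p>To get a feel for "going around the circle $n$ times", here is a rough numerical sketch that counts how often $s\mapsto p(re^{2\pi is})$ winds around $0$. The function <code>winding_number</code> and the example polynomial are my own illustrative choices, and the count is only trustworthy when $p$ has no roots on the circle and the sampling is fine enough.</p>

```python
import cmath
import math

def winding_number(p, r, samples=4000):
    """Times the loop s -> p(r * e^{2 pi i s}) winds around 0, by summing small angle changes."""
    total = 0.0
    prev = p(complex(r, 0.0))
    for k in range(1, samples + 1):
        z = r * cmath.exp(2j * math.pi * k / samples)
        cur = p(z)
        total += cmath.phase(cur / prev)  # angle change over one small step
        prev = cur
    return round(total / (2 * math.pi))

# Hypothetical example: p(z) = z^3 + 2z + 1, so n = 3 and sum |a_i| = 3.
p = lambda z: z**3 + 2 * z + 1

big = winding_number(p, 4.0)    # r = 4 > max(1, 3): the loop winds n = 3 times
small = winding_number(p, 0.0)  # r = 0: the constant loop, winding 0 times
```

<p>This example polynomial does have roots, and that’s exactly what the mismatch between <code>big</code> and <code>small</code> detects: if it had none, the loops for $r=0$ and $r=4$ would be homotopic and the two counts would have to agree.</p>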
<p>Kinda cool, huh?</p>
<p>This proof was courtesy of <em>Algebraic Topology</em> by Allen Hatcher.</p>Aaron NiskinJust a little dip into homotopy via the Fundamental Theorem of AlgebraLet’s Cover Homology (pt 1)2018-04-12T00:00:00-07:002018-04-12T00:00:00-07:00http://aaron.niskin.org/blog/2018/04/12/homology<h1 id="topology">Topology</h1>
<h2 id="what-is-it">What is it?</h2>
<p>Topology is essentially the study of proximity, which is a close analogue for shape. The specifics of how that’s done are not really the point of this post. That’s all for this section.</p>
<h2 id="how-is-it-related-to-analyzing-data">How is it related to analyzing data?</h2>
<p>Most data analysis techniques are already about quantifying some geometric attributes. For instance, when you fit a linear model, you’re really just imposing an affine model on the data, and computing the appropriate parameters (slope and intercept). But these sorts of things work best if you have an idea for what an appropriate shape would be. That’s one area where topology might be able to help. But to hear more about that, you’ll have to wait for my next post.</p>
<p>There is a notion of equivalence of topological spaces, given by bi-continuous bijections (I’ll call these isomorphisms; the usual name is homeomorphisms). Basically, we consider two topological spaces to be equivalent if one can be continuously transformed into the other and back. That’s actually a very powerful description that captures a lot of non-linearities and things like that. For instance, an egg shape, a sphere, a pyramid, and a tetrahedron are all isomorphic topological spaces, but not a doughnut (torus). And those kinda make sense. But there are some counter-intuitive examples. For instance, if you were to take one of those long clown balloons and tie it up into a poodle, it would also be isomorphic to the bunch above. And, as the joke goes, donuts are isomorphic to coffee cups. I know what you’re thinking. This topology stuff sounds super cool.</p>
<p>So topology would make a wonderful (admittedly incomplete) description of data. But there are some problems: mainly that this whole story breaks down in higher dimensions. Luckily, there is an analogue that can be computed.</p>
<h1 id="homology">Homology</h1>
<h2 id="what-is-it-1">What is it?</h2>
<p>Homology is (in an overly-simplified sentence) counting the number of holes. That’s a lot more information than you might initially think. In low dimensions it goes a long way toward classifying a space, and it’s still a useful summary in higher ones. If we knew that info about our data, we’d capture a lot of the non-linear relationships. Either way, it’ll be fun, so let’s get into it.</p>
<h1 id="how-to-computes">How to computes…</h1>
<h2 id="triangulation">Triangulation</h2>
<h3 id="simplices">Simplices</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Simplex">simplex</a> is a collection of $n$ points with the property that no one point lies in the affine subspace (the point, line, plane, and so on) spanned by the other $n-1$ points. The idea is that it’s the minimal number of points with an $n-1$ dimensional convex hull – or, said differently, the minimal number of points to make an $n-1$ dimensional shape. For instance, if we consider $\mathbb{R}^2$ as our ambient space, then the only simplices that exist are 0, 1, or 2 simplices: the 0 simplices being points, 1 simplices being line segments and 2 simplices being triangles. If we look at these simplices like building blocks, we can now make something isomorphic to any shape just by glueing a bunch of these simplices together. But we’ll get to that in a bit.</p>
<p>Let $\{v_0,v_1,v_2\}$ be a simplex. Firstly, since it’s a simplex with 3 points, it must be a 2-simplex. And since it’s a simplex, we know that $v_1-v_0\notin span(\{v_2-v_0\})$ (the three points don’t all lie on one line), and the same relationship holds if we shuffle the roles of the points. This simplex is supposed to represent any 2 dimensional contiguous ball-like thing. For instance, since a filled-in circle can be continuously transformed into a filled-in triangle (which is, in essence what a 2-simplex is), our simplex becomes a proxy for such circles, and squares, and rhombi, and octagons, but not the Apple logo. But if you take that stem thing and throw it away so you’re only left with the main apple part of the logo, then yeah, that too.</p>
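<p>For three points in the plane, the simplex condition can be tested with a determinant. This is a minimal sketch, assuming the condition is read as "$v_1-v_0$ and $v_2-v_0$ are linearly independent"; the function name and tolerance are mine.</p>

```python
def is_2_simplex(v0, v1, v2, tol=1e-12):
    """True when three points of R^2 are affinely independent (they span a real triangle)."""
    ax, ay = v1[0] - v0[0], v1[1] - v0[1]
    bx, by = v2[0] - v0[0], v2[1] - v0[1]
    # The determinant of the two edge vectors is twice the signed area of the triangle.
    return abs(ax * by - ay * bx) > tol

triangle = is_2_simplex((0, 0), (1, 0), (0, 1))
collinear = is_2_simplex((0, 0), (1, 1), (2, 2))
```

<p>Three collinear points fail the test because their "triangle" has zero area; they only span a 1-dimensional shape.</p>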
<p>So now if we have any single contiguous 2-D object without any holes, we can represent it topologically by a triangle (and an $N$-D one by an $N$-simplex). So simplices make the building blocks for general shapes, as we’ll see in a bit.</p>
<h3 id="simplicial-complexes">Simplicial Complexes</h3>
<p>So let’s build a simplicial complex! But wait! I haven’t even defined what that is yet! Meh, let’s just roll with it. So given a bunch of things we can represent by simplices (like the full apple logo, for instance), we can construct a simplicial complex to represent it. Some more examples would be: an air-filled balloon (which we’ll consider to be hollow) on a string, a filled water balloon (which we’ll not consider to be hollow) on a string, or a doughnut. The list goes on, obviously, but I think you get the point. We can represent pretty much anything this way.</p>
<p>The specifics really only come in when we try to construct the simplicial complex itself. In order to make the thing have the nice characteristics we’ll use later on, we’ll need to make sure our complexes have some properties. But first, we need to know what the face of a simplex is:</p>
<p>Given a simplex $\{v_0,v_1,\ldots,v_n\}$, a face is the simplex defined by a non-empty subset of its points (e.g. $\{v_1,v_2,\ldots,v_n\}$ is a face, and so is the original simplex). If you think of a tetrahedron, for instance, the faces are the four triangular faces we’re used to calling faces as well as the six lines on the edges, the four individual points, and the tetrahedron itself. So the faces are kinda similar to our intuitive notion of a face. Now we can define a simplicial complex. A simplicial complex $K$ is a collection of simplices with the following properties (see <a href="http://mathworld.wolfram.com/SimplicialComplex.html">mathworld</a>):</p>
<ol>
<li>Each face of a simplex in $K$ is itself a simplex in $K$</li>
<li>The intersection of two simplices in $K$ is either empty or a face of both of them (hence itself a simplex in $K$)</li>
</ol>
<p>It’s the second rule that’s really the crux of everything, as we’ll see. But let’s go through some examples…</p>
<p>The Apple logo can be triangulated by two disjoint 2-simplices (along with all the appropriate faces of the two). So we basically forced the first property to hold, and since the two simplices are disjoint the second property trivially holds.</p>
<p>The filled water-balloon on a string can be triangulated by a 3-simplex attached to a 1-simplex along with all the appropriate faces. The filled water-balloon part is clearly the 3-simplex, and the string is the 1-simplex. So let’s prove that the two properties hold. The first one holds trivially, since we defined it as the union of all the faces. The second property holds because if you take any two simplices in this complex, either they are both faces of the balloon (and then they meet in a common face or not at all), or they intersect at the point on top where the string is attached (a 0-simplex), or they don’t intersect at all! (Pardon the hand waving proof there).</p>
<p>The (hollow) air-filled balloon, on the other hand, is a bit harder to triangulate, but we’ll barrel through it. We can glue four 2-simplices together to be like a hollowed out 3-simplex, then attach a 1-simplex as the string to one of the vertices of the hollow tetrahedron, and we’ll be done. All that’s left is to show the two properties hold. That’s left as an exercise to the reader (ah, the good life).</p>
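<p>If you’d like to check your answer to that exercise, here is a sketch that tests the two properties combinatorially. It makes a simplifying assumption: each simplex is stored as a set of vertex labels (an "abstract" simplicial complex), so intersections are computed on labels and not on actual points in space. The vertex numbering (0 to 3 for the balloon, 4 for the loose end of the string) is mine.</p>

```python
from itertools import combinations

def closure(maximal_simplices):
    """All non-empty faces of the given simplices, as frozensets of vertex labels."""
    K = set()
    for simplex in maximal_simplices:
        for size in range(1, len(simplex) + 1):
            K.update(frozenset(c) for c in combinations(simplex, size))
    return K

def is_simplicial_complex(K):
    """Check both properties: closed under faces, and intersections are empty or in K."""
    faces_ok = all(
        frozenset(c) in K
        for s in K
        for size in range(1, len(s) + 1)
        for c in combinations(s, size)
    )
    meets_ok = all((not (s & t)) or (s & t) in K for s in K for t in K)
    return faces_ok and meets_ok

# Hollow balloon: the four triangles of a tetrahedron on vertices 0..3,
# plus a string from vertex 3 to a new vertex 4.
balloon = closure([(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3), (3, 4)])
balloon_ok = is_simplicial_complex(balloon)

# A lone triangle without its edges and vertices fails the first property.
lone_triangle_ok = is_simplicial_complex({frozenset({0, 1, 2})})
```

<p>The balloon has 5 vertices, 7 edges and 4 triangles, so <code>balloon</code> contains 16 simplices in total.</p>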
<p>The doughnut, on the other hand, is a lot more complicated to triangulate. To do that we’d have to specify if we’re talking about the shell of a doughnut or a filled in one. Although this is a fun exercise, the traditional way to triangulate the torus doesn’t have much to do with what we’re going to be using this stuff for, so I’ll take a more data-science type approach to this. If we have a bunch of data arranged in the shape of a torus, we can just take the union of the simplices defined by taking a point and the 3 nearest points to it, and creating the convex hull of those points (or considering them as a 3-simplex).</p>
<h3 id="associations">Associations</h3>
<p>But it would be useful to triangulate the torus using the more traditional approach. To get you started on the right track, you’ll need another tool: association. An association is basically a way of considering things to be equal. It’s kinda like defining specific teleportation or something. I think it’s best to just work out an example. So let’s triangulate some things (like a sphere – the air-filled balloon, for example) but without using the fact that we’re in 3-D space and using associations instead. Let’s start with two 2-simplices joined at one edge (to make something that looks like a square).</p>
<p><img src="/assets/pics/2018/04/12_triangulate_base.png" alt="sphere" /></p>
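<p>We then make the following association: we say anything along the line $(\langle1\rangle,\langle2\rangle)$ is associated with things on the line $(\langle4\rangle,\langle3\rangle)$ – where order is important. If we want to make it more formal, we could parameterize the line connecting $\langle1\rangle$ to $\langle2\rangle$ by a parameter $t$ such that $f_0(0) = \langle1\rangle$ and $f_0(1) = \langle2\rangle$, parameterize the line connecting $\langle4\rangle$ to $\langle3\rangle$ the same way with $f_1(0) = \langle4\rangle$ and $f_1(1) = \langle3\rangle$, and then declare $f_0(t)$ and $f_1(t)$ to be the same point for every $t\in[0,1]$. We describe this visually like so:</p>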
<p><img src="/assets/pics/2018/04/12_tube.png" alt="tube" /></p>
<p>We’re considering all the points on the top line to be associated with (or equal to) their corresponding points on the bottom line. You can imagine this to be kinda like pacman: when you go off the top edge, you come back from the bottom edge in the same direction (because of the association that we’ve made). So, we might as well fold the paper over and tape the two sides together so as to better visualize the association we’ve made. But when we do that, we notice we have a tube!</p>
<p>So the association shown above can be described as a tube. Unfortunately, it’s not a simplicial complex though. This can be seen because if we call the top-left triangle $A$, and the bottom-right one $B$, the intersection of $A$ and $B$ is the diagonal line $(\langle1\rangle,\langle3\rangle)$ and the associated line (it’s only one because the two are associated – hence the same line) $(\langle1\rangle,\langle2\rangle) = (\langle4\rangle,\langle3\rangle)$. So the intersection is two lines, which is a simplicial complex, but not a simplex! So this triangulation doesn’t have the second property that we require all simplicial complexes to have.</p>
<p>We could make it a simplicial complex by dividing the thing into a 9 square grid (so 18 triangles) and then doing the same thing again with the top being associated with the bottom, but it’s a lot to write out. Try it out though. See if you can make it a simplicial complex. If you run into any issues, feel free to email me.</p>
<p>Moving on… Now if we just flip the association, we get a Möbius strip!</p>
<p><img src="/assets/pics/2018/04/12_mobius.png" alt="möbius" /></p>
<p>It might be useful to actually break out a piece of paper and do this if you’ve never done it before.</p>
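<p>If you don’t have paper handy, the pacman picture can be sketched in a few lines. The coordinates are an assumption of mine: the square is $[0,1]\times[0,1]$, and the top edge is glued to the bottom edge either straight across (the tube) or with a flip (the Möbius strip). Both functions assume you move up by less than one full square at a time.</p>

```python
def step_tube(point, dy):
    """Move up by dy; leaving through the top edge re-enters at the bottom, same x."""
    x, y = point
    y += dy
    if y >= 1.0:
        y -= 1.0
    return (x, y)

def step_mobius(point, dy):
    """Same move, but the gluing is flipped: x becomes 1 - x when crossing the edge."""
    x, y = point
    y += dy
    if y >= 1.0:
        x, y = 1.0 - x, y - 1.0
    return (x, y)

start = (0.25, 0.875)
on_tube = step_tube(start, 0.25)      # comes back in at x = 0.25
on_mobius = step_mobius(start, 0.25)  # comes back in at x = 0.75
```

<p>The only difference between the two surfaces is that one flipped coordinate, which is the whole content of "flipping the association".</p>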
<p>If we take the tube and add this one association it becomes a torus (a hollow doughnut)!</p>
<p><img src="/assets/pics/2018/04/12_torus.png" alt="torus" /></p>
<p>Notice that we identify which sides are connected by the fact that they share the same symbol (the double arrow is connected to the other double arrow, and the single arrow does likewise).</p>
<p>If you’re trying it out on paper, make sure you label the points so you can see them! If you reverse the direction of one side of the second association, you get something very different – a Klein bottle!</p>
<p><img src="/assets/pics/2018/04/12_klein.png" alt="klein bottle" /></p>
<p>I feel like it’s probably a bit cruel of me to have put this after the picture, in case you’ve been trying to make the last one with your paper… It turns out, you can’t make the last one in 3D space. You need to either use an extra dimension or tear the paper, unfortunately. But, it’s interesting to see that flipping the association completely changes the thing – kinda like chirality in chemistry.</p>
<p>Similarly, if we take the Möbius strip and make the following additional association, we get something called the Projective Plane ($\mathbb{P}^2$ – also not embeddable in 3D space).</p>
<p><img src="/assets/pics/2018/04/12_projective.png" alt="projective plane" /></p>
<p>And finally, the following association leads us to a sphere, or, more precisely, the surface of a sphere.</p>
<p><img src="/assets/pics/2018/04/12_sphere.png" alt="sphere" /></p>
<p>Then, since all the points on the top line (and the bottom, left, and right lines) are all associated, imagine contracting them while keeping the inside of the bag relatively unchanged (as far as area is concerned). So necessarily the bag will puff up like a bubble as the sides squish down to a smaller and smaller hole. Eventually, the edge becomes just a point, and what do we have? Yup, we’ve got a sphere. So that’s one way we can triangulate a sphere.</p>
<p>The problem is that none of these are simplicial complexes. Why not? Can you fix these so that they become simplicial complexes?</p>
<h2 id="summary">Summary</h2>
<p>So we’ve triangulated some things, defined what simplices and simplicial complexes are, but we haven’t yet found their utility. For that, just stay tuned! In the next episode we’ll go over Betti Numbers and how to compute them using these simplicial complexes!</p>Aaron NiskinWhat does topology have to do with data?! And what is this topology thing anyway?Decomposition of Autoregressive Models2017-10-25T13:52:39-07:002017-10-25T13:52:39-07:00http://aaron.niskin.org/blog/2017/10/25/Autoregression<h1 id="autoregression">Autoregression</h1>
<h2 id="background">Background</h2>
<p>This post is a formal introduction to some of the theory of autoregressive models. Specifically, we’ll tackle how to decompose an AR(p) model into a bunch of AR(1) models, and discuss how that decomposition helps us interpret the original model. This post develops the subject using what seems to be an atypical approach, but one that I find very elegant.</p>
<!-- more -->
<h1 id="the-traditional-way">The traditional way</h1>
<p>Let $x_t$ be an AR(p) process. So $x_t = \sum\limits_{i=1}^pa_ix_{t-i} + w_t$. We can express $x_t$ thusly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& \sum\limits_{i=1}^pa_ix_{t-i} + w_t \\
x_t - \sum\limits_{i=1}^pa_ix_{t-i} =& w_t \\
\left(1 - \sum\limits_{i=1}^pa_iL^i\right)x_t =& w_t
\end{align*} %]]></script>
<p>Where $L$ is the <strong>lag operator</strong>, i.e. $Lx_t = x_{t-1}$. We define the AR polynomial $\Phi$ as $\Phi(L) := \left(1 - \sum\limits_{i=1}^pa_iL^i\right)$. Then as long as we’re considering this polynomial to be over an algebraically closed field (like $\mathbb{C}$), we can factor this polynomial. So</p>
<script type="math/tex; mode=display">\Phi(L) = c\prod\limits_{i=1}^p(L-\varphi_i)</script>
<p>Where $\varphi_i$ are the roots of $\Phi$. This polynomial will be the star of our show today. In fact, if we want to understand $\Phi$ (the AR process we’re considering), we need only understand each $x_t = \varphi_i^{-1}x_{t-1} + w_t$ model. We’ll actually prove this later.</p>
<p>We’ll be investigating several of its properties but first (just for fun) let’s investigate the conditions under which it’s invertible.</p>
<p>We can reduce the problem of determining whether or not the AR polynomial is invertible to the problem of inverting each factor. After all, the polynomial is invertible if and only if each factor is. So let’s restrict our considerations to just a single factor for the time being. When will $x-\lambda$ be invertible?</p>
<p>Now, we recall the Taylor Series Expansion formula from our trusty calculus class.</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(a)}{i!}(x-a)^i</script>
<p>In general, we tend to use $a=0$ as a good arbitrary choice (the Maclaurin Series), resulting in:</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(0)}{i!}x^i</script>
<p>Well, if we consider the Taylor expansion of $\frac{1}{x-\lambda}$, we find</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{1}{x-\lambda} =& -\frac{1}{\lambda} - \frac{1}{\lambda^2}x - \frac{1}{\lambda^3}x^2 - \dots \\
=& -\sum\limits_{i=0}^\infty \frac{1}{\lambda^{i+1}}x^{i}
\end{align*} %]]></script>
<p>The coefficients of this series are square summable if and only if $|\lambda| > 1$. So we see that if the AR roots are all outside the unit circle, then the AR polynomial will be invertible (in such a case, the AR process is called <strong>causal</strong>).</p>
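<p>If you’d like to see that series behave, here’s a minimal numerical sketch. The root <code>lam</code> and the evaluation point <code>x</code> are arbitrary values picked for illustration, and the closed form being checked is the geometric series $\frac{1}{x-\lambda} = -\sum_{i\geq 0} x^i/\lambda^{i+1}$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def series_inverse(x, lam, n_terms=200):
    """Partial sum of -sum_i x**i / lam**(i+1) for i = 0 .. n_terms-1."""
    i = np.arange(n_terms)
    return -np.sum(x**i / lam**(i + 1))

lam = 2.0 + 1.0j          # a hypothetical root outside the unit circle
x = 0.9                   # an arbitrary evaluation point
approx = series_inverse(x, lam)
exact = 1.0 / (x - lam)
print(abs(approx - exact))

# The coefficients 1/lam**(i+1) shrink geometrically because abs(lam) exceeds 1,
# so their squares have a finite sum.
coeffs = 1.0 / np.abs(lam) ** (np.arange(200) + 1)
print(np.sum(coeffs**2))
</code></pre></div></div>
<p>Try a <code>lam</code> inside the unit circle and the coefficients blow up instead of dying off.</p>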
<h1 id="the-new-approach">The new(?) approach</h1>
<h2 id="definition">Definition</h2>
<p>Let $X_t\in \mathbb{R}^n, A_i\in\mathbb{R}^{n\times n}, B\in\mathbb{R}^{n\times k_0}$, and $p\in\mathbb{N}_{>0}$. Also, let $W_t$ be $k_0$ dimensional white noise. Then, $X_t$ is called a VAR(p) process if,</p>
<script type="math/tex; mode=display">X_t = \sum\limits_{i=1}^p A_iX_{t-i} + BW_t</script>
<p>Basically a VAR(p) model is an analogue of AR(p) models where the scalars are matrices and the variable itself is a vector. The noise in such a model need not be independent per coordinate, nor restricted to only one. In fact, we can have any number of driving noise processes that get mixed together linearly into $X_t$. Hence cometh $B$ – the linear transformation from the driving noise space into the $X_t$ space.</p>
<p>Furthermore, let’s say that we don’t observe $X_t$ directly, but rather we observe some $Y_t$ that’s a linear function of $X_t$ (plus some noise). Formally, we observe $Y_t$ s.t.:</p>
<script type="math/tex; mode=display">Y_t = C X_t + D U_t</script>
<p>Where $Y_t \in\mathbb{R}^m, C\in\mathbb{R}^{m\times n}, D\in \mathbb{R}^{m\times k_1}$. And again, $U_t$ is $k_1$ dimensional white noise.</p>
<p>Here $D$ serves the same purpose as $B$ did above, but for the observations themselves.</p>
<p>This is called a “State Space” model.</p>
<h2 id="examplificate">Examplificate</h2>
<p>How does this relate to our star (the AR polynomial $\Phi$)? Let’s try to express an AR(p) process as a VAR(1) process…</p>
<p>Let $x_t$ be an AR(p) process. That is to say, $x_t = \sum\limits_{i=1}^p a_ix_{t-i} + w_t$ where $w_t$ is some white noise process.</p>
<p>Let $W_t$ be a 1 dimensional white-noise process such that $W_t = w_t$. We define</p>
<script type="math/tex; mode=display">X_t := \left[\begin{matrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(p-1)} \end{matrix}\right]</script>
<p>Then, we can see that the first coordinate of $X_t$ is always $x_t$. Our notation for this will be $X_t[0] = x_t$.</p>
<p>Furthermore, let</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \left[\begin{matrix}
a_1 & a_2 & \dots & a_{p-1} & a_p \\
1 & 0 & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & 0 \\
\end{matrix} \right] %]]></script>
<p>We can see that for $0 < i < p$, we have $(AX_{t-1})[i] = x_{t-i}$, and $(AX_{t-1})[0] = \sum\limits_{i=1}^p a_i x_{t-i} = x_t - w_t$. As it stands, $AX_{t-1}$ is almost $X_t$, just without the added noise in the first coordinate. So let’s add noise to only the first coordinate. To this end, let</p>
<script type="math/tex; mode=display">B := \left[\begin{matrix} 1 \\ 0 \\ \vdots \\ 0 \end{matrix}\right]</script>
<p>Then</p>
<script type="math/tex; mode=display">X_t = AX_{t-1} + BW_t</script>
<p>and</p>
<script type="math/tex; mode=display">\forall t \in \mathbb{N}_{>p} (X_t[0] = x_t)</script>
<p>Now we want $Y_t = x_t$. So we set</p>
<script type="math/tex; mode=display">% <![CDATA[
C = \left[\begin{matrix}1 & 0 & 0 & \dots & 0\end{matrix}\right] %]]></script>
<p>That way, $Y_t = CX_t = x_t$ and all is right with the world.</p>
<p>So yay! We’ve expressed this AR(p) process as a $p$ dimensional $VAR(1)$ process! So… Why?</p>
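<p>Here’s a small sketch of that construction, in case seeing it run helps. The AR(3) coefficients are made up for illustration, and <code>companion</code> is just a helper name I’m introducing here. It simulates the process once with the plain AR recursion and once through the VAR(1) form, and checks that the two agree.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def companion(a):
    """Companion matrix A for AR coefficients a = [a_1, ..., a_p]."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                  # first row carries the AR coefficients
    A[1:, :-1] = np.eye(p - 1)   # sub-diagonal shifts the history down one slot
    return A

rng = np.random.default_rng(0)
a = np.array([0.5, -0.3, 0.1])   # hypothetical AR(3) coefficients
p, T = len(a), 200
w = rng.normal(size=T)

# Direct AR(p) recursion, starting from zeros
x = np.zeros(T)
for t in range(p, T):
    x[t] = a @ x[t - p:t][::-1] + w[t]

# Same process as a VAR(1): X_t = A X_{t-1} + B w_t, Y_t = C X_t
A = companion(a)
B = np.eye(p)[:, 0]
C = np.eye(p)[0, :]
X = x[:p][::-1].copy()           # state at time p-1: (x_{p-1}, ..., x_0)
y = x.copy()
for t in range(p, T):
    X = A @ X + B * w[t]
    y[t] = C @ X

print(np.allclose(x, y))
</code></pre></div></div>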
<h2 id="why-do">Why do?</h2>
<p>Well, one natural question to ask at this point is, what are the eigenvalues of $A$?</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& \det(A - \lambda I) \\
=& \left|\begin{matrix}
a_1 - \lambda & a_2 & \dots & a_{p-1} & a_p \\
1 & -\lambda & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & -\lambda \\
\end{matrix} \right| \\
=& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p
\end{align*} %]]></script>
<p>This polynomial is the characteristic polynomial of $A$, so the roots of this polynomial are the eigenvalues of $A$. Let’s see how this relates to $\Phi$…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p \\
(-\lambda)^{-p}(0) =& 1 - a_1\lambda^{-1} - a_2\lambda^{-2} - \dots - a_p\lambda^{-p} \\
\text{Define } L :=& \lambda^{-1} \\
0 =& 1 - a_1L - a_2L^2 - \dots - a_pL^p \\
=& 1 - \sum\limits_{i=1}^pa_iL^i \\
=& \Phi(L)
\end{align*} %]]></script>
<p>So the characteristic polynomial is really just $\Phi(L)$ revisited! Furthermore, check this sweetness out…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\Phi(L) =& c\prod\limits_{i=1}^p\left(L-\varphi_i\right) \\
=& c\prod\limits_{i=1}^p\left(\lambda^{-1}-\varphi_i\right) \Rightarrow \\
\lambda_i =& \frac{1}{\varphi_i}
\end{align*} %]]></script>
<p>So we can see that the eigenvalues are actually intimately related with the roots (in fact they’re reciprocals of each other)! With that in mind, let’s see what happens when we try to diagonalize this matrix $A$. Firstly, let’s assume that the matrix $A$ is diagonalizable. I.e., let</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
H^{-1}\bar AH =& A \Leftrightarrow \\
\bar A =& HAH^{-1}
\end{align*} %]]></script>
<p>where $H$ is a coordinate transformation (an invertible map) and $\bar A$ is diagonal.</p>
<p>Then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
X_t =& H^{-1}\bar AH X_{t-1} + BW_t \\
=& H^{-1}\left(\bar AHX_{t-1} + HBW_t\right)
\end{align*} %]]></script>
<p>So if we let $\tilde X_t = HX_t$, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde X_t =& HX_t \\
=& HH^{-1}\left(\bar AHX_{t-1} + HBW_t\right) \\
=& \bar AHX_{t-1} + HBW_t \\
=& \bar A\tilde X_{t-1} + HBW_t
\end{align*} %]]></script>
<p>So $\tilde X_t$ is a bunch of AR(1) processes (because the matrix $\bar A$ is diagonal), and,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
Y_t =& CX_t \text{ because $U_t=0$ for our scenario}\\
=& CH^{-1}\tilde X_t \\
=& \left(CH^{-1}\right)\tilde X_t
\end{align*} %]]></script>
<p>Which is to say that $Y_t$ is a linear combination of AR(1) processes! So we started out with a general AR(p) process, and found that expressing it as a VAR(1) model allows us to see that this is just a linear combination of AR(1) processes. I think that’s pretty cool. But let’s investigate this a bit further…</p>
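<p>A numerical sanity check of this section, as a sketch under some assumptions: the AR(3) coefficients are hypothetical, the eigenvalues happen to be distinct (so $A$ really is diagonalizable), and I take $H$ to be the inverse of the eigenvector matrix that <code>np.linalg.eig</code> returns. The decoupled AR(1) recursions are complex valued in general, but recombining them with $CH^{-1}$ should give back the real $x_t$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

a = np.array([0.5, -0.3, 0.1])        # hypothetical AR(3) coefficients
p = len(a)
A = np.zeros((p, p))
A[0, :] = a
A[1:, :-1] = np.eye(p - 1)

# Roots of Phi(L) = 1 - a_1 L - ... - a_p L^p (np.roots wants the highest degree first)
phi_roots = np.roots(np.concatenate((-a[::-1], [1.0])))
eigvals, V = np.linalg.eig(A)         # A V = V diag(eigvals)

# Every eigenvalue should sit on top of the reciprocal of some AR root
gap = np.abs(eigvals[:, None] - (1.0 / phi_roots)[None, :]).min(axis=1).max()
print(gap)

H = np.linalg.inv(V)                  # so that H A H^{-1} is diagonal
A_bar = H @ A @ V
print(np.allclose(A_bar, np.diag(eigvals)))

# Run the p decoupled AR(1) recursions and recombine them with F = C H^{-1}
rng = np.random.default_rng(1)
T = 100
w = rng.normal(size=T)
B = np.eye(p)[:, 0]
HB = H @ B
F = V[0, :]                           # C H^{-1} with C = (1, 0, ..., 0)
X = np.zeros(p)                       # original coordinates
X_tilde = np.zeros(p, dtype=complex)  # diagonal coordinates
max_gap = 0.0
for t in range(T):
    X = A @ X + B * w[t]
    X_tilde = eigvals * X_tilde + HB * w[t]
    max_gap = max(max_gap, abs(F @ X_tilde - X[0]))
print(max_gap)
</code></pre></div></div>
<p>Both gaps should come out at the level of floating point noise.</p>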
<h1 id="what-more">What more?!</h1>
<p>As per our previous section, we have that $Y_t$, which was really just $x_t$ – our original AR(p) variable, is a linear combination of AR(1) models. More specifically, if we let $F := CH^{-1}$, then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& Y_t \\
=& F\tilde X_t
\end{align*} %]]></script>
<p>So indeed, $x_t$ is a linear combination of $p$ many AR(1) processes. In fact, since $\tilde X_t = \bar A \tilde X_{t-1} + \left(HB\right)W_t$, we know the AR(1) models explicitly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde x_{t,i} =& \lambda_i\tilde x_{t-1,i} + w_{t,i} \\
\tilde x_{t,i} - \lambda_i\tilde x_{t-1,i} =& w_{t,i} \\
\left(1-\lambda_iL\right)\tilde x_{t,i} =& w_{t,i} \\
-\lambda_i\left(L-\frac{1}{\lambda_i}\right)\tilde x_{t,i} =& w_{t,i} \\
-\lambda_i\left(L-\varphi_i\right)\tilde x_{t,i} =& w_{t,i}
\end{align*} %]]></script>
<p>Where $\lambda_i$ is the $i$th eigenvalue of $A$ (hence the $(i,i)$th element of $\bar A$), $\varphi_i = 1/\lambda_i$ is the corresponding root of $\Phi$, and $w_{t,i} := (HB)_iw_t$ is the $i$th coordinate of the transformed noise. So we see that each AR(1) process takes its root from the roots of the original AR polynomial $\Phi$. Pretty cool, right?</p>
<h2 id="summary">Summary</h2>
<p>So we can express any AR(p) model as a linear combination of AR(1) processes (albeit with correlated noise terms), where each AR(1) process is determined by the roots of the AR polynomial (or the characteristic polynomial).</p>
<p>We’ve essentially reduced the problem of studying AR(p) models to studying their eigenvalues (or AR roots).</p>Aaron NiskinWe discuss the decomposition of AR models into components, and how eigenvalues are involved (because they always are).Markov Chain Monte Carlo2017-04-19T13:52:39-07:002017-04-19T13:52:39-07:00http://aaron.niskin.org/blog/2017/04/19/MCMC<h1 id="text--mc2"><script type="math/tex">\text{???} = (MC)^2</script></h1>
<h2 id="background">Background</h2>
<p>What if we know the relative likelihood, but want the probability distribution?</p>
<script type="math/tex; mode=display">\mathbb{P}(X=x) = \frac{f(x)}{\int_{-\infty}^\infty f(x)dx}</script>
<p>But what if \( \int f(x)dx \) is hard, or you can’t sample from \( f \) directly?</p>
<p>This is the problem we will be trying to solve.</p>
<!-- more -->
<h2 id="first-approach">First approach</h2>
<p>If the space is bounded (the integral is between \( a \) and \( b \)), we can use Monte Carlo to estimate \( \int\limits_a^b f(x)dx \):</p>
<ol>
<li>Pick \( \alpha \) uniformly at random from \( (a,b) \)</li>
<li>Compute \( f(\alpha)\cdot(b-a) \)</li>
<li>Repeat as necessary</li>
<li>Average the computed values</li>
</ol>
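<p>As a rough sketch of that recipe: <code>mc_integral</code> is a name I’m making up here, and the integrand and interval are just an example whose exact answer we happen to know. Each uniform draw is scaled by the interval length, and the estimate is the average.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def mc_integral(f, a, b, n_samples=100_000, seed=0):
    """Estimate the integral of f over (a, b) from uniform draws."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(a, b, size=n_samples)   # uniform points in (a, b)
    return np.mean(f(alpha) * (b - a))          # scale by the interval length, then average

# Relative likelihood 1/(1+x^2) on (-1, 1); the exact integral is pi/2
estimate = mc_integral(lambda x: 1.0 / (1.0 + x**2), -1.0, 1.0)
print(estimate, np.pi / 2)
</code></pre></div></div>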
<h2 id="what-if-its-a-big-ol-bag-of-nope">What if it’s a big ol’ bag of nope?</h2>
<p>What if we can’t sample from \( f(x) \) but can only determine likelihood ratios?</p>
<script type="math/tex; mode=display">\frac{f(x)}{f(y)}</script>
<p>Enter Markov Chain Monte Carlo</p>
<h1 id="the-obligatory-basics">The obligatory basics</h1>
<p>A <strong>Markov Chain</strong> is a stochastic process (a collection of indexed random variables) such that \( \mathbb{P}(X_n=x|X_0,X_1,\dots,X_{n-1}) = \mathbb{P}(X_n=x|X_{n-1}) \).</p>
<p>In other words, the conditional probabilities only depend on the last state, not on any deeper history.</p>
<p>We call the set of all possible values of \( X_i \) the <strong>state space</strong> and denote it by \( \chi \).</p>
<h3 id="transition-matrix">Transition Matrix</h3>
<p>Let \( p_{ij} = \mathbb{P}(X_1 = j | X_0 = i) \). We call the matrix \( (p_{ij}) \) the <strong>Transition Matrix</strong> of \( X \) and denote it \( P \).</p>
<p>Let \( \mu_n = \left(\mathbb{P}(X_n=1), \mathbb{P}(X_n=2),\dots, \mathbb{P}(X_n=l)\right) \) be the row vector corresponding to the “probabilities” of being at each state at the \( n \)th point in time (iteration).</p>
<p>Claim: \( \mu_{i+1} = \mu_0P^{i+1} \)</p>
<p>Proof:</p>
<p>if \( i = 0 \):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
(\mu_0 P)_j =& \sum\limits_{k=1}^l\mu_0(k)p_{kj}\\
=&\sum\limits_{k=1}^l\mathbb{P}(X_0=k)\mathbb{P}(X_1=j | X_0=k)\\
=&\mathbb{P}(X_1 = j)\\
=&\mu_1(j)
\end{align*} %]]></script>
<p>So: \( \mu_1 = \mu_0P \)</p>
<p>if \( i > 0 \):</p>
<p>Then, by the same one-step argument and the induction hypothesis \( \mu_i = \mu_0P^i \), we get \( \mu_{i+1} = \mu_iP = (\mu_0P^i)P = \mu_0P^{i+1} \)</p>
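<p>The claim is easy to poke at numerically. Below is a minimal sketch with a made-up 3-state transition matrix: it steps the distribution forward one multiplication at a time and compares against the matrix power.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])
mu0 = np.array([1.0, 0.0, 0.0])    # start in the first state

mu = mu0.copy()
for n in range(1, 6):
    mu = mu @ P                     # one step: mu_n = mu_{n-1} P
    assert np.allclose(mu, mu0 @ np.linalg.matrix_power(P, n))
print(mu)
</code></pre></div></div>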
<h3 id="stationary-distributions">Stationary Distributions</h3>
<p>We say that a distribution \( \pi \) is <strong>stationary</strong> if</p>
<script type="math/tex; mode=display">\begin{gather*}\pi P = \pi\end{gather*}</script>
<h3 id="main-theorem">Main Theorem:</h3>
<p>An irreducible, ergodic Markov Chain \( \{X_n\} \) has a unique stationary distribution, \( \pi \). The limiting distribution exists and is equal to \( \pi \). Furthermore, if \( g \) is any bounded function, then with probability 1:</p>
<script type="math/tex; mode=display">\lim\limits_{N\to\infty}\frac{1}{N}\sum\limits_{n=1}^Ng(X_n) = E_\pi(g)</script>
<p>This really just means that if our Markov Chain has certain properties (basically just that any state can be gotten to from any other state at any time), then we can sample from this Markov Chain, and it’ll have the same distribution as a generic sample from our desired distribution.</p>
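<p>To make the theorem a bit more concrete, here’s a sketch on a small made-up chain (the transition matrix and the function <code>g</code> are arbitrary choices of mine). It finds \( \pi \) as the left eigenvector of \( P \) for eigenvalue 1, then compares the average of \( g \) along one simulated path with \( E_\pi(g) \).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

# Hypothetical irreducible, aperiodic 3-state chain
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
print(np.allclose(pi @ P, pi))

# Average of g along one long simulated path
rng = np.random.default_rng(0)
g = np.array([1.0, 5.0, -2.0])      # g(state), an arbitrary bounded function
N, state, total = 20_000, 0, 0.0
for _ in range(N):
    state = rng.choice(3, p=P[state])
    total += g[state]
print(total / N, pi @ g)
</code></pre></div></div>
<p>The two printed numbers should be close, and they get closer as <code>N</code> grows.</p>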
<h2 id="random-walk-metropolis---hastings">Random-Walk-Metropolis - Hastings:</h2>
<p>Let \( f(x) \) be the relative likelihood function of our desired distribution.</p>
<p>And let \( q(y|x_i) \) be a known distribution that is easy to sample from (generally taken to be \( N(x_i,b^2) \) ).</p>
<p>1) Given \( X_0,X_1,\dots,X_i \), pick \( Y \sim q(y|X_i) \)</p>
<p>2) Compute \( r(X_i,Y) = \min\left(\frac{f(Y)q(X_i|Y)}{f(X_i)q(Y|X_i)}, 1\right) \)</p>
<p>3) Pick \( a \sim U(0,1) \)</p>
<p>4) Set</p>
<script type="math/tex; mode=display">% <![CDATA[
X_{i+1} = \begin{cases} Y & \text{if } a < r \\ X_i & \text{otherwise}\end{cases} %]]></script>
<h2 id="the-confidence-builder">The Confidence Builder:</h2>
<p>We would like to sample from and obtain a histogram of the Cauchy distribution:</p>
<script type="math/tex; mode=display">f(x) = \frac{1}{\pi}\frac{1}{1+x^2}</script>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
q_{01} \sim& N(x,0.1)\\
q_1 \sim& N(x,1)\\
q_{10} \sim& N(x,10)\end{align*} %]]></script>
<p>Note: Since \( q \) is symmetric, \( q(x|y) = q(y|x) \).</p>
<h2 id="pseudocode">Pseudocode:</h2>
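<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># Example setup: Cauchy relative likelihood, N(x, 1) random-walk proposal</span>
<span class="n">f</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">x</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span>
<span class="n">q</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">initial_state</span><span class="p">,</span> <span class="n">num_iters</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">10000</span>

<span class="n">MC</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">X_i</span> <span class="o">=</span> <span class="n">initial_state</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">num_iters</span><span class="p">)):</span>
    <span class="n">Y</span> <span class="o">=</span> <span class="n">q</span><span class="p">(</span><span class="n">X_i</span><span class="p">)</span>  <span class="c1"># propose a move</span>
    <span class="n">r</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span><span class="o">/</span><span class="n">f</span><span class="p">(</span><span class="n">X_i</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>  <span class="c1"># q is symmetric, so it cancels</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">a</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
        <span class="n">X_i</span> <span class="o">=</span> <span class="n">Y</span>  <span class="c1"># accept; otherwise stay put</span>
    <span class="n">MC</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_i</span><span class="p">)</span>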
</code></pre></div></div>
<h2 id="estimation">Estimation:</h2>
<script type="math/tex; mode=display">q(y|x) \sim N(x,b^2)</script>
<p><img src="/assets/pics/2017/04/18-MCMC-Cauchy-Estimation.png" alt="The many distributions" /></p>
<p><img src="/assets/pics/2017/04/18-MCMC-Cauchy-Estimation_TS.png" alt="png" /></p>
<p>So you can clearly see that the different values of \( b \) lead to wildly different distributions, and therefore wildly different approximations of the end distribution. For example, if the standard deviation on our sampling distribution is very small, our prospective transitions won’t venture out very far, and hence the tails of our distribution will be woefully underrepresented. If, on the other hand, the standard deviation is too large, we will often venture out much farther than we should, and we’ll end up with something like a multi-modal distribution. Basically, the prospective transition states will often be too far out to accept, in which case the current state persists, or they will be just close enough to be accepted. What you end up with is far too many samples from the tails and not enough samples from the beefier parts of the distribution.</p>
<p>Choosing the “best” value of \( b \) is more of an art than a science, really. This issue will rear its ugly head a little later. But even with that issue, we see clearly that something amazing is happening here.</p>
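<p>One cheap diagnostic when fiddling with \( b \) is the fraction of proposals that get accepted. The sketch below is my own illustration, not the code that made the plots above: <code>rw_metropolis</code> is a hypothetical helper, and the chain length and seed are arbitrary.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def rw_metropolis(f, b, num_iters=20_000, x0=0.0, seed=0):
    """Random-walk Metropolis with an N(x, b^2) proposal.

    Returns the chain and the fraction of accepted proposals.
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(num_iters)
    x, accepted = x0, 0
    for i in range(num_iters):
        y = rng.normal(x, b)
        # symmetric proposal, so the q terms cancel in the ratio
        if rng.uniform() < min(1.0, f(y) / f(x)):
            x, accepted = y, accepted + 1
        chain[i] = x
    return chain, accepted / num_iters

cauchy = lambda x: 1.0 / (1.0 + x**2)
rates = {}
for b in (0.1, 1.0, 10.0):
    chain, rates[b] = rw_metropolis(cauchy, b)
    print(b, round(rates[b], 2))
</code></pre></div></div>
<p>A tiny \( b \) accepts almost everything but barely moves, while a huge \( b \) gets rejected most of the time. Neither extreme explores the distribution well.</p>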
<h2 id="how-is-this-happening">How is this happening?</h2>
<p>A property called <strong>detailed balance</strong>, which means,</p>
<script type="math/tex; mode=display">\pi_ip_{ij} = p_{ji}\pi_j</script>
<p>or in the continuous case:</p>
<script type="math/tex; mode=display">f(x)P_{xy} = f(y)P_{yx}</script>
<h3 id="the-intuition">The intuition</h3>
<p>Since what we’re really after is the distribution of the samples and not the order in which they were picked, it’s reasonable to require a notion of (what I’m going to call) equal hopportunity:</p>
<script type="math/tex; mode=display">f(x)p(y|x) = f(y)p(x|y)</script>
<p>In other words, the probability of “hopping” from \( x \) to \( y \) should be the same as the reverse, because at the end of the day, both of these two scenarios only mean that \( x \) and \( y \) are in our sample.</p>
<p>If you work out the math therein (as we will do below), you find that this ratio guarantees equal hopportunity.</p>
<script type="math/tex; mode=display">r(x,y) = \min\left(\frac{f(y)q(x|y)}{f(x)q(y|x)},1\right)</script>
<h3 id="lets-prove-it">Let’s prove it!</h3>
<p>To reiterate, the claim was that this detailed balance property ( \( f(x)P_{xy} = P_{yx}f(y) \) ) guarantees that our likelihood function \( f \) is the stationary distribution for our Markov Chain.</p>
<p>Let \( f \) be the desired distribution (in our example, it was the Cauchy Distribution), and let \( q(y|x) \) be the distribution we draw from.</p>
<p>First we’ll show that detailed balance implies \( \pi \) (or \( f \) if the distribution is continuous) is the stationary distribution!</p>
<p>Suppose \( \pi_ip_{ij} = \pi_jp_{ji} \) in the discrete case, or \( f(x)p(y|x) = f(y)p(x|y) \) in the continuous case. Then:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
(\pi P)_i =& \sum\limits_j\pi_jP_{ji} & (fP)(x)=&\int f(y)p(x|y)dy\\
=&\sum\limits_j\pi_iP_{ij} & =& \int f(x)p(y|x)dy\\
=&\pi_i\sum\limits_jP_{ij} & =& f(x)\int p(y|x)dy\\
=&\pi_i & =& f(x)
\end{align*} %]]></script>
<p>Now we’ll show that the Markov Chain defined by the Metropolis-Hastings algorithm has the detailed balance property (and hence \( f \) is the stationary distribution).</p>
<p>Without any loss of generality, we’ll assume \( f(x)q(y|x) > f(y)q(x|y) \) (if this is not the case, the proof will follow the same with just symbolic changes).</p>
<p>Note: Since \( f(x)q(y|x) > f(y)q(x|y) \), we know, \( r(x,y) = \frac{f(y)q(x|y)}{f(x)q(y|x)} \), and \( r(y,x) = 1 \).</p>
<p>Then,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\pi_xp_{xy} =& f(x)\mathbb{P}(X_1=y|X_0=x) & \pi_yp_{yx} =& f(y)\mathbb{P}(X_1=x|X_0=y)\\
=& f(x)\left[q(y|x)\cdot r(x,y)\right] & =&f(y)\left[q(x|y)\cdot r(y,x)\right]\\
=& f(x)\left[q(y|x)\cdot \frac{f(y)q(x|y)}{f(x)q(y|x)}\right] & =&f(y)\left[q(x|y)\cdot 1\right]\\
=& f(y)q(x|y) & =&f(y)q(x|y)
\end{align*} %]]></script>
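<p>If you’d rather see it than prove it, here’s a sketch that builds the Metropolis-Hastings transition matrix on a small discrete state space and checks both properties. The target weights <code>f</code> and the (deliberately asymmetric) proposal matrix <code>Q</code> are random stand-ins, not anything from the example above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)
n = 5
f = rng.uniform(0.5, 2.0, size=n)          # unnormalized target on n states
Q = rng.uniform(size=(n, n))
Q /= Q.sum(axis=1, keepdims=True)          # proposal: q(y|x) = Q[x, y]

# Acceptance ratio r(x, y) = min(f(y) q(x|y) / (f(x) q(y|x)), 1)
R = np.minimum(1.0, (f[None, :] * Q.T) / (f[:, None] * Q))
P = Q * R                                  # propose, then accept
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))   # rejected proposals stay put

flow = f[:, None] * P                      # flow[x, y] = f(x) P[x, y]
print(np.allclose(flow, flow.T))           # detailed balance
pi = f / f.sum()
print(np.allclose(pi @ P, pi))             # so the normalized f is stationary
</code></pre></div></div>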
<h2 id="modeling-change-point-models-in-astrostatistics">Modeling Change Point Models in Astrostatistics.</h2>
<p>This was taken from a Penn State statistics summer program website located at <a href="http://sites.stat.psu.edu/~mharan/MCMCtut/COUP551_rates.dat">stat.psu.edu</a>. A detailed walk-through of the process PSU took with this project can be found at that site. The data is a time series of light emission recorded and aggregated over periods of 10000 seconds (just under 3 hours). A plot of the time series can be seen below for your convenience.</p>
<p><img src="/assets/pics/2017/04/18_psu_ts.png" alt="png" /></p>
<p>The idea (with this data) is that the emissions can be modeled by two Poisson distributions and a change point. The data will be modeled by the first Poisson until the change point, and then it switches to the second Poisson.</p>
<p>So we start out with a multi-parameter model, and through some Bayesian inference you arrive at a pretty nasty looking posterior distribution. You’d like to sample this distribution to obtain, say, the mean \( k \) (change point) value along with a 95% confidence interval. Since you need that confidence interval, standard optimization methods won’t suffice: we’d need the distribution of \( k \)s. For this, we can use MCMC!</p>
<p>The posterior distribution we would like to sample from is this:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(k,\theta,\lambda,b_1,b_2 | Y) \propto& \prod\limits_{i=1}^k\frac{\theta^{Y_i}e^{-\theta}}{Y_i!} \prod\limits_{i=k+1}^n\frac{\lambda^{Y_i}e^{-\lambda}}{Y_i!} \\
&\times\frac{1}{\Gamma(0.5)b_1^{0.5}}\theta^{-0.5}e^{-\theta/b_1} \times\frac{1}{\Gamma(0.5)b_2^{0.5}}\lambda^{-0.5}e^{-\lambda/b_2}\\
&\times\frac{e^{-1/b_1}}{b_1}\frac{e^{-1/b_2}}{b_2}\times \frac{1}{n}
\end{align*} %]]></script>
<p>This distribution is pretty gnarly and since I don’t really know how to sample from this (beyond a blind search), we would like to use MCMC. Metropolis-Hastings (at least how we’ve discussed it) clearly doesn’t work here because this is multidimensional. For that we introduce another MCMC algorithm: <strong>Gibbs Sampling</strong>.</p>
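<p>Products like this underflow quickly, so in practice you’d work with the log of the posterior. Here’s one way that might look, as a sketch: <code>log_posterior</code> is a name I’m introducing, it assumes the second Gamma prior is the one on \( \lambda \), it drops the constant \( \Gamma(0.5) \) factors, and the counts in the usage lines are invented.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
from math import lgamma, log

def log_posterior(k, theta, lam, b1, b2, Y):
    """Unnormalized log posterior of the change point model (Gamma(0.5) constants dropped)."""
    Y = np.asarray(Y)
    n = len(Y)
    if not (1 <= k <= n) or min(theta, lam, b1, b2) <= 0:
        return -np.inf
    log_fact = sum(lgamma(y + 1) for y in Y)            # log of the Y_i! terms
    ll = (Y[:k].sum() * log(theta) - k * theta          # first k bins at rate theta
          + Y[k:].sum() * log(lam) - (n - k) * lam      # remaining bins at rate lam
          - log_fact)
    prior_theta = -0.5 * log(b1) - 0.5 * log(theta) - theta / b1
    prior_lam = -0.5 * log(b2) - 0.5 * log(lam) - lam / b2
    prior_b = -1.0 / b1 - log(b1) - 1.0 / b2 - log(b2)
    return ll + prior_theta + prior_lam + prior_b - log(n)

Y = [4, 5, 6, 9, 11, 10]            # hypothetical counts with a jump after the third bin
print(log_posterior(3, 5.0, 10.0, 1.0, 1.0, Y))
print(log_posterior(3, 10.0, 5.0, 1.0, 1.0, Y))
</code></pre></div></div>
<p>The first call, with the rates the right way around, should score higher than the second.</p>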
<h3 id="gibbs-sampling">Gibbs Sampling</h3>
<p>The basic pseudocode is this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">initial_X</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">initial_Y</span>
<span class="o">...</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">initial_Z</span>
<span class="n">MC</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iters</span><span class="p">):</span>
    <span class="c1"># for each variable we compute</span>
    <span class="n">X</span> <span class="o">~</span> <span class="n">q_X</span><span class="p">(</span><span class="n">x</span><span class="o">|</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)</span>
    <span class="c1"># We do this for X, Y, ..., Z in turn</span>
</code></pre></div></div>
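<p>For something you can actually run, here’s a toy sketch where each conditional distribution is known exactly: a standard bivariate normal with correlation <code>rho</code>, where \( X|Y=y \sim N(\rho y, 1-\rho^2) \) and vice versa. The target and the function name are my own choices for illustration, not part of the PSU example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def gibbs_bivariate_normal(rho, num_iters=20_000, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    x, y = 0.0, 0.0
    MC = np.empty((num_iters, 2))
    for i in range(num_iters):
        x = rng.normal(rho * y, sd)   # draw X from its conditional given Y
        y = rng.normal(rho * x, sd)   # draw Y from its conditional given the new X
        MC[i] = x, y
    return MC

samples = gibbs_bivariate_normal(rho=0.8)
print(np.corrcoef(samples.T)[0, 1])
</code></pre></div></div>
<p>The sample correlation should land near the <code>rho</code> we asked for, even though we only ever drew one coordinate at a time.</p>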
<p>We can combine Gibbs with Metropolis quite easily:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">initial_X</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">initial_Y</span>
<span class="o">...</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">initial_Z</span>
<span class="n">MC</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iters</span><span class="p">):</span>
    <span class="c1"># for each variable we compute</span>
    <span class="n">prospect</span> <span class="o">~</span> <span class="n">q_X</span><span class="p">(</span><span class="n">x</span><span class="o">|</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">prospect</span><span class="p">)</span><span class="o">/</span><span class="n">f</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">~</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">a</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">prospect</span>
    <span class="c1"># We do this for X, Y, ..., Z in turn</span>
</code></pre></div></div>
<p>So, basically, you update one at a time.</p>
<p>Using this algorithm, we can analyze the data from the PSU website we mentioned above. In fact we did that very thing. Below you’ll find some graphs depicting the distribution of \( k \) values.</p>
<!--
```python
def psu_mcmc(X, q, numIters=10000):
theta, lambd, k, b1, b2 = 1, 1, 20, 1, 1
thetas, lambds, ks, b1s, b2s = [], [], [], [], []
n = len(X)
def f_k(theta, lambd, k, b1, b2):
if 0 <= k and k <= n:
return theta**sum(X[:k])*lambd**sum(X[k:])*np.exp(-k*theta-(n-k)*lambd)
elif k < 0:
return lambd**sum(X)*np.exp(-n*lambd)
elif k > n:
return theta**sum(X)*np.exp(-n*theta)
def f_t(theta, k, b1):
return theta**(sum(X[:k])+0.5)*np.exp(-theta*(k+1.0)/b1)
def f_l(lambd, k, b2):
return lambd**(sum(X[k:])+0.5)*np.exp(-lambd*((n-k)+1.0)/b2)
def f_b(b, par):
return np.exp(-(1 + par) / b) / (b*np.sqrt(b))
for i in range(numIters):
tmp = q(theta)
if tmp < np.infty:
r = min(1, f_t(tmp,k,b1)/f_t(theta,k,b1))
if np.random.uniform(0,1) < r:
theta = tmp
tmp = q(lambd)
if tmp < np.infty:
r = min(1, f_l(tmp,k,b2)/f_l(lambd,k,b2))
if np.random.uniform(0,1) < r:
lambd = tmp
tmp = q(b1)
if tmp < np.infty:
r = min(1, f_b(tmp, theta)/f_b(b1, theta))
if np.random.uniform(0,1) < r:
b1 = tmp
tmp = q(b2)
if tmp < np.infty:
r = min(1, f_b(tmp, lambd)/f_b(b2, lambd))
if np.random.uniform(0,1) < r:
b2 = tmp
tmp = q(k)
if tmp < np.infty:
r = min(1, f_k(theta, lambd, tmp, b1, b2) /
f_k(theta, lambd, k, b1,b2))
if np.random.uniform(0,1) < r:
k = tmp
thetas.append(theta)
lambds.append(lambd)
b1s.append(b1)
b2s.append(b2)
ks.append(k)
return np.array([thetas,lambds,ks,b1s,b2s])
```
```python
%%bash
if [ ! -f tmp/psu_data.tsv ]
then
wget http://sites.stat.psu.edu/~mharan/MCMCtut/COUP551_rates.dat -O tmp/psu_data.tsv
fi
```
```python
psu_data = []
with open("tmp/psu_data.tsv", "r") as f:
title = f.readline()
for line in f:
tmpArr = [x.strip() for x in line.split(" ")]
psu_data.append([int(x) for x in tmpArr if x != ""][1])
psu_data = np.array(psu_data)
psu_data
```
array([11, 3, 5, 9, 3, 4, 5, 5, 5, 5, 13, 18, 27, 8, 4, 10, 8,
3, 12, 10, 10, 3, 9, 8, 5, 9, 4, 6, 1, 5, 14, 7, 9, 10,
8, 13, 8, 11, 11, 10, 11, 13, 10, 3, 8, 5])
```python
mcmc2 = psu_mcmc(psu_data, q(1), 1000)
fig = plt.figure()
fig.suptitle("MCMC values for Change Point")
plt.subplot(2,1,1)
plt.hist(mcmc2[2] % len(psu_data), normed=True)
plt.subplot(2,1,2)
plt.plot(mcmc2[2])
plb.savefig("tmp/psu_graphs1.png")
plt.show()
```
-->
<p><img src="/assets/pics/2017/04/18_psu_graphs1.png" alt="png" /></p>
<!--
```python
plt.plot(psu_data)
plt.title("PSU Data")
plb.savefig("tmp/psu_ts.png")
plt.show()
```
-->
<h2 id="summary">Summary</h2>
<ul>
<li>Random-Walk-Metropolis-Hastings
<ul>
<li>Can be used to sample difficult univariate distributions relatively easily
<ul>
<li>Have to tune the sampling parameter</li>
<li>Curse of dimensionality in tuning parameters</li>
</ul>
</li>
<li>Requires \( f \) to be defined on all of \( \mathbb{R} \)
<ul>
<li>Transform as needed</li>
</ul>
</li>
</ul>
</li>
<li>Gibbs Sampling
<ul>
<li>Turn high dimensional sampling into iterative one-dimensional sampling</li>
</ul>
</li>
<li>Gibbs with Metropolis-Hastings
<ul>
<li>Lovely</li>
</ul>
</li>
</ul>
<h2 id="bibliography">Bibliography</h2>
<p><a href="http://sites.stat.psu.edu/~mharan/MCMCtut/MCMC.html">Summer School in Astrostatistics</a></p>Aaron NiskinMarkov Chain Monte Carlo is a tool for sampling otherwise intractable distributions. In this post, we'll go through some descriptions, reasoning, and examples therein.Linear Regression – The Basics2017-02-20T23:24:43-08:002017-02-20T23:24:43-08:00http://aaron.niskin.org/blog/2017/02/20/What-Is-Linear-Regression<h3 id="the-basics">The basics</h3>
<p>Yeah. It’s not a good sign if I’m starting out already repeating myself. But
that’s how things seem to be with linear regression, so I guess it’s fitting.
It seems like every day one of my professors will talk about linear regression,
and it’s not due to laziness or lack of coordination. Indeed, it’s an
intentional part of the curriculum here at New College of Florida because of
how ubiquitous linear regression is. Not only is it an extremely simple yet
expressive formulation, it’s also the theoretical basis of a whole slew of
other tactics. Let’s just get right into it, shall we?</p>
<!-- more -->
<h3 id="linear-regression">Linear Regression:</h3>
<p>Let’s say you have some data from the real world (and hence riddled with
real-world error). A basic example for us to start with is this one:</p>
<p><img src="/assets/pics/2017/02/21-linear-regression_scatterPoints.png" alt="Data suitable for Linear Regression" title="Some Example Data" /></p>
<p>There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data-point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as:</p>
<script type="math/tex; mode=display">\frac{1}{N}\sum\limits_{i=1}^N\left(\hat Y_i - Y_i\right)^2</script>
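<p>where \( \hat Y_i\) is our predicted value of \(Y_i\) for a given \(X_i\). Since we want a linear model (for simplicity and extensibility), we can substitute \( \hat Y_i = \alpha + \beta X_i \). Dropping the constant factor \( \frac{1}{N} \), which doesn’t change where the minimum occurs, the quantity to minimize becomes</p>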
<script type="math/tex; mode=display">\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2</script>
<p>for some \( \alpha, \beta \) that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for \( \alpha, \beta \)! Let’s go ahead and do that! We want to minimize</p>
<script type="math/tex; mode=display">\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2</script>
<p>We can start by finding the \( \hat\alpha \) such that, \( \frac{d}{d\alpha}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0 \). And as long as we don’t forget the chain rule, we’ll be alright…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^N2\left(\alpha + \beta X_i - Y_i\right) =& 0\Longrightarrow\\
\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right) =& 0 \Longrightarrow\\
N\alpha + N\beta\bar X - N\bar Y =& 0\Longrightarrow\\
\alpha =& \bar Y - \beta\bar X \end{align*} %]]></script>
<p>and we’ll find the \( \beta \) such that \( \frac{d}{d\beta}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0 \)</p>
<p>And following a similar pattern (and substituting \( \alpha = \bar Y - \beta\bar X \) in the third line), we find:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^N2X_i\left(\alpha + \beta X_i - Y_i\right) =& 0\Longrightarrow\\
\alpha\sum\limits_{i=1}^NX_i + \beta\sum\limits_{i=1}^N X_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\Longrightarrow\\
(\bar Y -\beta\bar X)N\bar X + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\\
N\bar Y\bar X -N\beta(\bar X)^2 + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\\
\beta\left(\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2\right) =& \sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X\\
\beta =& \frac{\sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X}{\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2}
\end{align*} %]]></script>
<p>But note:</p>
<script type="math/tex; mode=display">\text{VAR}(X) = \frac{1}{N}\sum\limits_{i=1}^NX_i^2 - (\bar X)^2</script>
<p>So,</p>
<script type="math/tex; mode=display">N\cdot\text{VAR}(X) = \sum\limits_{i=1}^NX_i^2 - N(\bar X)^2</script>
<p>And, using the fact that \( \sum\limits_{i=1}^N\left(X_i - \bar X\right) = 0 \) in the last step:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X =& \sum\limits_{i=1}^NY_iX_i - N(\frac{1}{N}\sum\limits_{i=1}^NY_i)\bar X \\
=& \sum\limits_{i=1}^NY_iX_i - \sum\limits_{i=1}^N(Y_i\bar X)\\
=& \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right) \\
=& \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right) - \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right) + \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right)\\
=& \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) + \bar Y\sum\limits_{i=1}^N\left(X_i - \bar X\right)\\
=& \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) = N\cdot \text{COV}(X, Y) \end{align*} %]]></script>
<p>So,</p>
<script type="math/tex; mode=display">\beta = \frac{\text{COV}(X,Y)}{\text{VAR}(X)}</script>
<p>And then we can find \( \alpha \) by substituting our estimate of \( \beta \) into \( \alpha = \bar Y - \beta\bar X \). Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation.</p>
<p><img src="/assets/pics/2017/02/21-linear-regression_scatterPoints_withLine.png" alt="Data With Linear Regression Line" title="Some Example Data With Regression Line" /></p>
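<p>To make the two formulas concrete, here is a minimal sketch in Python. The data is made up for illustration (it is not the data in the plots above), and the variable names are my own. It computes \( \beta \) as covariance over variance, recovers \( \alpha \) from the means, and cross-checks the result against NumPy’s built-in degree-1 fit.</p>

```python
import numpy as np

# Hypothetical noisy data scattered around the line y = 3 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

# beta = COV(X, Y) / VAR(X), both with the 1/N normalisation used in the derivation.
x_bar, y_bar = x.mean(), y.mean()
cov_xy = np.mean((x - x_bar) * (y - y_bar))
var_x = np.mean((x - x_bar) ** 2)
beta = cov_xy / var_x

# alpha = Y_bar - beta * X_bar
alpha = y_bar - beta * x_bar

# Mean squared error of the fitted line on the data it was fitted to.
mse = np.mean((alpha + beta * x - y) ** 2)
print("alpha:", alpha, "beta:", beta, "MSE:", mse)

# Cross-check: np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, 1)
print("polyfit slope:", slope, "polyfit intercept:", intercept)
```

<p>The two sets of coefficients should agree up to floating-point error, since both minimize the same sum of squared errors.</p>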
<h3 id="now-we-have-it">Now we have it</h3>
<p>Okay, so now we have our line of “best fit”, but what does it mean? Well, it
means that this line predicts the data we gave it with the least error. That’s
really all it means. And sometimes, as we’ll see later, reading too much into
that can really get you into trouble.</p>
<p>But using this model we can now predict values for points that weren’t in the
data we fit it to. So, for instance, in the model pictured above, if we were to
try and predict \( Y \) when \( X=2 \), we wouldn’t do too badly by picking
something around 10 for \( Y \).</p>
<h3 id="an-example-perhaps">An example, perhaps?</h3>
<p>So I feel at this point, it’s probably best to give an example. Let’s say we’re
trying to predict a stock’s return given the return of the market as a whole. Well, in practice
this model is used to assess volatility, but that’s neither here nor there.
Right now, we’re really only interested in the model itself. But without
further ado, I present you with the CAPM (Capital Asset Pricing Model):</p>
<p><script type="math/tex">r = \alpha + \beta r_m + \epsilon</script> (where \( r \) is the stock’s return, \( r_m \) is the market’s return, and \( \epsilon \) is the error in
our predictions).</p>
<p>And you can fit this using historical data or what-have-you. There are a bunch
of downsides to fitting it with historical data though, like the fact that data
from 3 days ago really doesn’t have much to say about the future anymore. There
are plenty of cool things you can do therein, but sadly, those are out of the
scope of this post.</p>
<p>For now, we move on to</p>
<h2 id="multiple-regression">Multiple Regression</h2>
<h3 id="what-is-this-anyway">What is this anyway?</h3>
<p>Well, multiple regression is really just a new name for the same thing: how do
we fit a linear model to our data given some set of predictors and a single
response variable? The only difference is that this time our linear model
doesn’t have to be one dimensional. Let’s get right into it, shall we?</p>
<p>So let’s say you have \( k \) many predictors arranged in a vector (in other
words, our predictor is a vector in \( \mathbb{R}^k \)). Well, I wonder if a
similar formula would work… Let’s figure it out…</p>
<p>Firstly, we need to know what a derivative is in \( \mathbb{R}^n \). Well, if
\( f:\mathbb{R}^n\to\mathbb{R}^m \) is a differentiable function, then for
any \( x \) in the domain, \( f'(x) \) is the linear map \(
A:\mathbb{R}^n\to\mathbb{R}^m \) such that \( \lim_{h\to 0}\frac{||f(x+h) - f(x) - Ah||}{||h||} = 0 \). Basically, \( f'(x) \) is the best linear approximation to \( f \) near \( x \): it is the linear map that describes the tangent plane.</p>
<p>So, now that we got that out of the way, let’s use it! We want to find the
linear function that minimizes the Euclidean norm of the error terms (just like
before). But note: the error term is \( \epsilon = Y - \hat Y = Y - \alpha
- X\beta \), for some constant offset \( \alpha \) and some vector of coefficients \( \beta \), where \( X \) is the matrix whose rows are our observations.
Now, since it’s easier and it’ll give us the same answer, we’re going to
minimize the squared norm of the error instead of the norm itself (like we did in
the one dimensional version). We’re also going to make one more simplification:
that \( \alpha=0 \). We can do this safely by simply appending (or
prepending) a 1 to the rows of our data (thereby creating a constant term). So
for the following, assume we’ve done that.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}\langle\epsilon, \epsilon\rangle =& (Y - X\beta)^T(Y - X\beta) \\
\langle\epsilon, \epsilon\rangle =& (Y^T - \beta^TX^T)(Y - X\beta) \\
\langle\epsilon, \epsilon\rangle =& Y^TY - 2Y^TX\beta + \beta^TX^TX\beta \end{align*} %]]></script>
<p>So, let’s find the \( \beta \) that minimizes that.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0=&\lim\limits_{h\to0}\frac{||(Y^TY - 2Y^TX(\beta+h) + (\beta+h)^TX^TX(\beta+h)) - (Y^TY - 2Y^TX\beta + \beta^TX^TX\beta) - Ah||}{||h||} \\
0=&\lim\limits_{h\to0}\frac{||- 2Y^TXh + 2\beta^TX^TXh + h^TX^TXh - Ah||}{||h||}\\
0=&\lim\limits_{h\to0}||- 2Y^TX + 2\beta^TX^TX + h^TX^TX - A||\frac{||h||}{||h||}\\
0=&\lim\limits_{h\to0}||- 2Y^TX + 2\beta^TX^TX - A|| \end{align*} %]]></script>
<p>So, now we see that the derivative is \( -2Y^TX + 2\beta^TX^TX \) and we want to find where our error is minimized, so we want to set that derivative to zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0=&- 2Y^TX + 2\beta^TX^TX \\
X^TX\beta =& X^TY \\
\beta =& (X^TX)^{-1}X^TY \end{align*} %]]></script>
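<p>And there we have it (provided \( X^TX \) is invertible, so that the last step makes sense). The equation \( X^TX\beta = X^TY \) is called the <strong>normal equation</strong> for linear regression.</p>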
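<p>As a rough sketch of how this looks in practice, the Python below builds some synthetic data (the sizes, coefficients, and names are all assumptions made up for the example), prepends a column of ones to absorb the constant term, and solves \( X^TX\beta = X^TY \). It solves the linear system directly rather than computing the inverse, which is the usual numerical advice, and compares the answer with NumPy’s least-squares routine.</p>

```python
import numpy as np

# Synthetic data: 200 observations of 3 predictors (all values are hypothetical).
rng = np.random.default_rng(1)
n, k = 200, 3
features = rng.normal(size=(n, k))
true_beta = np.array([1.5, -2.0, 0.5, 4.0])  # first entry plays the role of alpha

# Prepend a column of ones so the constant term is absorbed into beta.
X = np.column_stack([np.ones(n), features])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve X^T X beta = X^T Y without forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same minimization handed to NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("normal equation:", beta_hat)
print("lstsq:          ", beta_lstsq)

# At the minimizer, the gradient 2 X^T (X beta - Y) should be (numerically) zero.
gradient = 2.0 * X.T @ (X @ beta_hat - Y)
print("gradient norm:", np.linalg.norm(gradient))
```

<p>When \( X^TX \) is close to singular (for example, when two predictors are nearly copies of each other), the least-squares solver is generally the safer choice of the two.</p>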
<p>Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof.</p>
<p>Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests.</p>Aaron NiskinLinear Regression is the bedrock of Machine Learning. We cover the basics therein, some theory involved, and give some relevant examples.Probability – A Measure Theoretic Approach2017-02-18T06:42:55-08:002017-02-18T06:42:55-08:00http://aaron.niskin.org/blog/2017/02/18/probability-a-measure-theoretic-approach<h2 id="probability-using-measure-theory">Probability using Measure Theory</h2>
<p>A mathematically rigorous definition of probability, and some examples therein.</p>
<!-- more -->
<h3 id="the-traditional-definition">The Traditional Definition:</h3>
<p>Consider a set \( \Omega \) (called the <strong>sample space</strong>), and a function \( X:\Omega\rightarrow\mathbb{R} \) (called a <strong>random variable</strong>).</p>
<p>If \( \Omega \) is countable (or finite), a function \( \mathbb{P}:\Omega\rightarrow\mathbb{R} \) is called a <strong>probability distribution</strong>
if it satisfies the following 2 conditions:</p>
<ul>
<li>
<p>For each \( x \in \Omega \), \( \mathbb{P}(x) \geq 0 \)</p>
</li>
<li>
<p>If \( A_i\cap A_j = \emptyset \) whenever \( i\neq j \), then \( \mathbb{P}(\bigcup\limits_0^\infty A_i) = \sum\limits_0^\infty\mathbb{P}(A_i) \)</p>
</li>
</ul>
<p>And if \( \Omega \) is uncountable, a function \( F:\mathbb{R}\rightarrow\mathbb{R} \) is called a <strong>probability distribution</strong>
or a <strong>cumulative distribution function</strong>
if it satisfies the following 3 conditions:</p>
<ul>
<li>
<p>For each \( a,b\in\mathbb{R} \), \( a < b \rightarrow F(a)\leq F(b) \)</p>
</li>
<li>
<p>\( \lim\limits_{x\to -\infty}F(x) = 0 \)</p>
</li>
<li>
<p>\( \lim\limits_{x\to\infty}F(x) = 1 \)</p>
</li>
</ul>
<h3 id="the-intuition">The Intuition:</h3>
<p>What idea are we even trying to capture with these seemingly disparate definitions for the same thing? Well, with the two cases taken separately it's somewhat obvious, but they don't seem to marry very well. The discrete case is giving us a pointwise estimation of something akin to the proportion of observations that should correspond to a value (in a perfect world). The continuous case is the same thing, but instead of corresponding to that particular value (which doesn't really even make sense in this case), the proportion corresponds to the point in question and everything less than it. The shaded region in the top picture below and the curve in the picture directly below it denote the cumulative distribution function of a standard normal distribution (don't worry too much about what that means for this post, but if you're doing anything with statistics, you should probably know a bit about that).</p>
<p><img src="/assets/pics/2017/02/18-probability-a-measure-theoretic-approach_pdf-cdf1.png" alt="pdf vs cdf for Normal distribution" title="PDF vs CDF for Normal Distribution" /></p>
<p>Another way to define a continuous probability distribution is through something called a probability density function, which is closer to the discrete case definition of a probability distribution (or <strong>probability mass function</strong>
). A <strong>probability density function</strong>
is a function \( f:\mathbb{R}\rightarrow\mathbb{R}_+ \) such that \( \int_{-\infty}^xf(t)dt = F(x) \). In other words, \( \frac{dF}{dx} = f \). This new function has some properties of our discrete case probability function, but lacks some others. On the one hand, they’re both defined pointwise, but on the other, this one can be greater than one in some places – meaning the value of the probability density function isn’t really the probability of an event, but rather (as the name “suggests”) the density therein.</p>
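<p>Both claims can be checked numerically. The sketch below uses the normal distribution as the example (my choice, with helper names that are not from anywhere in particular): it compares a finite-difference slope of the CDF against the density, and then evaluates the density of a narrow normal distribution at its peak to show a value above one.</p>

```python
import math

def normal_pdf(x, sigma=1.0):
    """Density of a normal distribution with mean 0 and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, sigma=1.0):
    """Cumulative distribution function of the same distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def central_difference(func, x, h=1e-5):
    """Numerical approximation to the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)

# dF/dx should match f at every point we try.
for x in (-1.0, 0.0, 0.5, 2.0):
    print(x, central_difference(normal_cdf, x), normal_pdf(x))

# A density is not a probability: with sigma = 0.1 the peak is well above 1.
narrow_peak = normal_pdf(0.0, sigma=0.1)
print("peak density for sigma = 0.1:", narrow_peak)
```

<p>The area under the narrow curve is still one; the density is tall only because the mass is squeezed into a small interval.</p>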
<h3 id="does-it-measure-up">Does it measure up?</h3>
<p>Now let’s check out the measure theoretic approach…</p>
<p>Let \( \Omega \) be our sample space, \( S \) be the \( \sigma \)-algebra on \( \Omega \) (so \( S \) is the collection of measurable subsets of \( \Omega \)), and \( \mu:S\to\mathbb{R} \) a measure on that measure space. Let \( X:\Omega\rightarrow\mathbb{E} \) be a random variable (\( \mathbb{E} \) is generally taken to be \( \mathbb{R} \) or \( \mathbb{R}^n \)). We define the function \( \mathbb{P}:\mathcal{P}(\mathbb{E})\rightarrow\mathbb{R} \) (where \( \mathcal{P}(\mathbb{E}) \) is the powerset of \( \mathbb{E} \) – the set of all subsets) such that if \( A\subseteq\mathbb{E} \), we have \( \mathbb{P}(A)=\mu(X^{-1}(A)) \). We call \( \mathbb{P} \) a <strong>probability distribution</strong>
if the following conditions hold:</p>
<ul>
<li>
<p>\( \mu(\Omega) = 1 \)</p>
</li>
<li>
<p>for each \( A\subseteq\mathbb{E} \) we have \( X^{-1}(A)\in S \).</p>
</li>
</ul>
<h3 id="why-do-this">Why do this?</h3>
<p>Well, right off the bat we have a serious benefit: we no longer have two disparate definitions of our probability distributions. Furthermore, there is the added benefit of having a natural separation of concerns: the measure \( \mu \) determines what we might intuitively consider to be the probability distribution while the random variable is used to encode the aspects of the events that we care about.</p>
<p>To further illustrate this, let’s turn to some examples.</p>
<h3 id="the-examples">The Examples</h3>
<h4 id="a-fair-die">A fair die</h4>
<h5 id="all-even">All even</h5>
<p>Let’s consider a fair die. Our sample space will be \( \{1,2,3,4,5,6\} \). Since our die is fair, we’ll define our measure fairly: for any \( x \) in our sample space, \( \mu(\{x\}) = \frac{1}{6} \). If we want to know, for instance, what the probability of getting each number is, we could use a very intuitive random variable \( X(a) = a \) (so \( X(1)=1 \), etc.). Then we see that \( \mathbb{P}(\{1\}) = \mu(\{1\}) = \frac{1}{6} \), and the rest are found similarly.</p>
<h5 id="odds-and-evens">Odds and Evens?</h5>
<p>What if we want to consider the fair die of yester-paragraph, but we only care if the face of the die shows an odd or an even number? Well, since the actual distribution of the die hasn’t changed, we won’t have to change our measure. Instead we’ll change our random variable to capture just those aspects we care about. In particular, \( X(a) = 0 \) if \( a \) is even, and \( X(a) = 1 \) if \( a \) is odd. We then see \( \mathbb{P}(\{1\}) = \mu(\{1,3,5\}) = \frac{1}{2} \) and \( \mathbb{P}(\{0\}) = \mu(\{2,4,6\}) = \frac{1}{2} \).</p>
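<p>The fair-die examples are small enough to compute directly. Here is a minimal sketch (function names are my own) of \( \mathbb{P}(A)=\mu(X^{-1}(A)) \): one measure, two random variables. Exact fractions are used so the results are not blurred by floating point.</p>

```python
from fractions import Fraction

# Sample space for a single die roll.
omega = {1, 2, 3, 4, 5, 6}

def mu(event):
    """Fair-die measure: every outcome in the sample space carries mass 1/6."""
    return Fraction(len(set(event) & omega), 6)

def preimage(random_variable, values):
    """X^{-1}(A): the outcomes that the random variable sends into A."""
    return {w for w in omega if random_variable(w) in values}

def probability(random_variable, values):
    """P(A) = mu(X^{-1}(A))."""
    return mu(preimage(random_variable, values))

def identity(a):
    """Random variable that reports the face shown."""
    return a

def parity(a):
    """Random variable that reports 1 for odd faces and 0 for even faces."""
    return a % 2

print(probability(identity, {1}))  # probability of rolling a 1
print(probability(parity, {1}))    # probability of an odd face
print(probability(parity, {0}))    # probability of an even face
```

<p>Notice that <code>mu</code> never changes between the two questions; only the random variable passed to <code>probability</code> does. Swapping in a different <code>mu</code> is all it would take to model a loaded die.</p>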
<h4 id="getting-loaded">Getting loaded</h4>
<h5 id="all-even-1">All even</h5>
<p>Now let’s consider the same scenario of wanting to know the probability of getting each number, but now our die is loaded. Since we’re changing the distribution itself and not just the aspects we’re choosing to care about, we’re going to want to change the measure this time. For simplicity, let’s consider a kind of degenerate case scenario. Let our measure be: \( \mu(A) = 1 \) if \( 1\in A \) and \( \mu(A)=0 \) if \( 1\notin A \). Basically, we’re defining our probability to be such that the only possible outcome is a roll of 1. So since we are concerned with the same things we were concerned with last time, we can take that same random variable. We note \( \mathbb{P}(\{1\}) = 1 \) and \( \mathbb{P}(\{a\}) = 0 \) for any \( a \neq 1 \).</p>
<h5 id="odds-or-evens">Odds or evens</h5>
<p>Try to do this one yourself. I’m going to go get some sleep now. Please feel free to contact me with any questions. I love doing this stuff, so don’t be shy!</p>Aaron NiskinWe cover probability from a measure theoretic approachA dirty little ditty on Finite Automata2015-12-03T00:00:00-08:002015-12-03T00:00:00-08:00http://aaron.niskin.org/blog/2015/12/03/a-dirty-little-ditty-on-finite-automata<p>This post builds on the previous post about <a href="/mathematics/2015/11/30/whats-in-a-language/">Formal Languages</a>.</p>
<h3 id="some-formal-definitions">Some Formal Definitions</h3>
<h4 id="a-deterministic-finite-automata-dfa-is">A Deterministic Finite Automaton (DFA) is</h4>
<ul>
<li>A set \( \mathcal{Q} \) called “states”</li>
<li>A set \( \Sigma \) called “symbols” or “alphabet”</li>
<li>A function \( \delta_F:\mathcal{Q}\times\Sigma \to \mathcal{Q} \)</li>
<li>A designated state \( q_0\in\mathcal{Q} \) called the start point</li>
<li>A subset \( F\subseteq\mathcal{Q} \) called the “accepting states”</li>
</ul>
<p>The DFA is then often referred to as the ordered quintuple \( A=(\mathcal{Q},\Sigma,\delta_F,q_0,F) \).</p>
<!-- more -->
<h4 id="defining-how-strings-act-on-dfas">Defining how strings act on DFAs.</h4>
<p>Given a DFA, \( A=(\mathcal{Q}, \Sigma, \delta, q_0, F) \), a state \( q_i\in\mathcal{Q} \), and a string \( w\in\Sigma^* \), we can define \( \delta(q_i,w) \) like so:</p>
<ul>
<li>If \( w \) consists of a single symbol \( x\in\Sigma \), then \( \delta(q_i,w) \) is just \( \delta(q_i,x) \), the ordinary transition on that symbol</li>
<li>If \( w=xv \), where \( x\in\Sigma \) and \( v\in\Sigma^* \), then \( \delta(q_i, w)=\delta(\delta(q_i,x),v) \)</li>
</ul>
<p>And in this way, we have defined how DFAs can interpret strings of symbols rather than just single symbols.</p>
<h4 id="the-language-of-a-dfa">The language of a DFA</h4>
<p>Given a DFA, \( A=(\mathcal{Q}, \Sigma, \delta, q_0, F) \), we can define “the language of \( A \)”, denoted \( L(A) \), as \( \{w\in\Sigma^* \mid \delta(q_0,w)\in F\} \).</p>
<h3 id="some-examples-maybe">Some Examples, Maybe?</h3>
<h4 id="example-1">Example 1:</h4>
<p>Let’s construct a DFA that accepts only strings beginning with a 1 that, when interpreted as binary numbers, are multiples of 5. So some examples of strings that would be in \( L(A) \) are 101, 1010, and 1111 (that is, 5, 10, and 15).</p>
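<p>One possible construction (a sketch of my own, not necessarily the one originally intended here) tracks the remainder modulo 5 of the number read so far. Reading a bit \( b \) turns a value \( n \) into \( 2n+b \), so the remainder \( r \) becomes \( (2r+b) \bmod 5 \). On top of the five remainder states, the sketch assumes a separate start state and a dead state, which together enforce the “begins with a 1” requirement. All the names below are hypothetical, and the empty string is assumed to leave the state unchanged.</p>

```python
# States: a start state, a dead (rejecting, absorbing) state, and remainders 0..4.
START, DEAD = "start", "dead"
STATES = {START, DEAD, 0, 1, 2, 3, 4}
ALPHABET = {"0", "1"}
ACCEPTING = {0}  # remainder 0 means the number read so far is a multiple of 5

def delta(state, symbol):
    """Single-symbol transition function of the DFA."""
    if symbol not in ALPHABET:
        raise ValueError("symbol not in alphabet: " + repr(symbol))
    bit = int(symbol)
    if state == DEAD:
        return DEAD
    if state == START:
        # The string must begin with a 1; a leading 0 can never be repaired.
        return 1 if bit == 1 else DEAD
    # Appending a bit doubles the value and adds the bit.
    return (2 * state + bit) % 5

def delta_star(state, word):
    """Extend delta to strings: delta(q, xv) = delta(delta(q, x), v)."""
    for symbol in word:
        state = delta(state, symbol)
    return state

def accepts(word):
    """Membership in L(A): does the word drive the start state into F?"""
    return delta_star(START, word) in ACCEPTING

for word in ("101", "1010", "1111", "11", "0101"):
    print(word, accepts(word))
```

<p>The loop in <code>delta_star</code> is just the recursive definition from the previous section unrolled: it peels off one symbol at a time from the front of the string.</p>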
<h3 id="some-more-formal-definitions">Some More Formal Definitions</h3>
<h4 id="a-nondeterministic-finite-automata-nfa-is">A Nondeterministic Finite Automaton (NFA) is</h4>
<ul>
<li>A set \( \mathcal{Q} \) called “states”</li>
<li>A set \( \Sigma \) called “symbols”</li>
<li>A function \( \delta_N:\mathcal{Q}\times\Sigma \to \mathcal{P}\left(\mathcal{Q}\right) \)</li>
<li>A designated state \( q_0\in\mathcal{Q} \) called the start point</li>
<li>A subset \( F\subseteq\mathcal{Q} \) called the “accepting states”</li>
</ul>
<p>The NFA is then often referred to as the ordered quintuple \( A=(\mathcal{Q},\Sigma,\delta_N,q_0,F) \).</p>
<h4 id="defining-how-strings-act-on-nfas">Defining how strings act on NFAs.</h4>
<p>Given an NFA, \( N=(\mathcal{Q}, \Sigma, \delta, q_0, F) \), a collection of states \( \mathcal{S}\subseteq\mathcal{Q} \), and a string \( w\in\Sigma^* \), we can define \( \delta(\mathcal{S},w) \) like so:</p>
<ul>
<li>If \( w \) only has one symbol, then we can consider \( w \) to be the symbol and define \( \delta(\mathcal{S},w):=\bigcup\limits_{s\in\mathcal{S}}\delta(s,w) \)</li>
<li>If \( w=xv \), where \( x\in\Sigma \) and \( v\in\Sigma^* \), then \( \delta(\mathcal{S}, w)=\bigcup\limits_{s\in\mathcal{S}}\delta(\delta(s,x),v) \)</li>
</ul>
<p>And in this way, we have defined how NFAs can interpret strings of symbols rather than just single symbols.</p>
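<p>To see the set-valued version in action, here is a small sketch. The NFA itself is a made-up example (it guesses when it has reached the last two symbols of a binary string ending in 01), and the acceptance rule used, that a string is accepted when at least one reachable state is accepting, is the standard convention rather than something defined above.</p>

```python
# Hypothetical NFA over {"0", "1"}; missing entries mean "no transition" (empty set).
NFA_DELTA = {
    ("q0", "0"): {"q0", "q1"},  # either stay put or guess that this 0 starts the final "01"
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
NFA_START = "q0"
NFA_ACCEPTING = {"q2"}

def step(states, symbol):
    """delta(S, x): the union of delta(s, x) over every s in S."""
    result = set()
    for s in states:
        result |= NFA_DELTA.get((s, symbol), set())
    return result

def delta_star(states, word):
    """Extend delta to strings by feeding the symbols in one at a time."""
    current = set(states)
    for symbol in word:
        current = step(current, symbol)
    return current

def nfa_accepts(word):
    """Accept when at least one state reachable on the word is accepting."""
    return bool(delta_star({NFA_START}, word) & NFA_ACCEPTING)

print(delta_star({"q0"}, "0"))
print(nfa_accepts("1101"), nfa_accepts("110"))
```

<p>Because <code>step</code> always returns a set of states, the whole computation is deterministic on sets of states, even though the NFA itself is not.</p>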
<h3 id="maybe-in-another-post-well-get-past-the-definitions--">Maybe in another post, we’ll get past the definitions! :-/</h3>Aaron NiskinA basic yet formal introduction to Finite Automata and how strings act on them.What’s in a language?2015-11-29T22:36:29-08:002015-11-29T22:36:29-08:00http://aaron.niskin.org/blog/2015/11/29/whats-in-a-language<h2 id="languages-in-abstraction">Languages in abstraction</h2>
<p>This post is about Languages from a mathematical and abstract linguistics point of view. Not much more to say about that, so let’s get right to it!</p>
<!-- more -->
<h2 id="the-rigorous-definition">The Rigorous definition:</h2>
<p>Let \( \Sigma \) be an alphabet and let \( \Sigma^k \) be the set of all strings of length k over that alphabet. Then, we define \( \Sigma^* \) to be \( \bigcup\limits_{k\in\mathbb{N}}\Sigma^k \) (the union of Σ<sup>k</sup> over all natural numbers k). If \( L\subseteq\Sigma^* \), we call \( L \) a language.</p>
<h2 id="the-intuition-behind-the-definition">The Intuition Behind the Definition</h2>
<p>Consider an alphabet (some finite set of characters). For example, we can consider the letters of the English language, the ASCII symbols, the symbols \( \{0, 1\} \) (otherwise known as binary), or the symbols \( \{1, 2, 3, 4, 5, 6, 7, 8, 9, 0, +, \times , =\} \). We can then construct the infinite list of all the different ways we can arrange those characters (e.g. \( 1001011 \) or \( 0011011011 \), etc. if we’re using binary). We call these arrangements “strings”. Once we have all that machinery built up, a language is just some subset of that infinite collection of strings. The language may itself be infinite.</p>
<h2 id="some-examples">Some Examples</h2>
<ul>
<li>The alphabet: \( \Sigma=\{0, 1\} \) <br />
The language: \( \{x\in\Sigma^* \mid x \text{ is prime}\} \) (all prime numbers in binary)<br />
Some strings from the language: \( 10, 11, 101, \ldots \)</li>
<li>The alphabet: ASCII characters<br />
The language: All syntactically correct Clojure programs (the source code)</li>
<li>The alphabet: All Clojure functions, operators, etc, and list \( \{x, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0\} \) <br />
The language: All syntactically correct Clojure programs (the source code)</li>
</ul>
<p>You see that we need to have an alphabet before we can have a Formal Language. Also, different alphabets may result in equivalent languages – by equivalent, we mean that both languages contain the same strings.</p>
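<p>The first example can be explored by brute force. The sketch below enumerates \( \Sigma^k \) for small \( k \) and keeps the strings that belong to the language of binary primes. One detail is an assumption on my part: strings with leading zeros are left out, so each prime has exactly one representation.</p>

```python
from itertools import product

SIGMA = ("0", "1")

def strings_of_length(k):
    """Sigma^k: every string of length k over the alphabet."""
    return ["".join(chars) for chars in product(SIGMA, repeat=k)]

def strings_up_to(max_length):
    """A finite slice of Sigma^*: all strings of length 0 through max_length."""
    result = []
    for k in range(max_length + 1):
        result.extend(strings_of_length(k))
    return result

def is_prime(n):
    """Trial division; fine for the tiny numbers used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def in_language(word):
    """Assumed membership rule: non-empty, no leading zero, and prime as a binary number."""
    return word != "" and word[0] == "1" and is_prime(int(word, 2))

language_sample = [w for w in strings_up_to(4) if in_language(w)]
print(language_sample)
```

<p>The language itself is infinite, so the code can only ever list a finite piece of it; what defines the language is the membership test, not the list.</p>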
<h2 id="what-are-we-getting-at">What are we getting at?</h2>
<p>Well, there are two ways to look at this. On the first hand, linguists would like to study language in its abstract essence. For this, Formal Languages may come in handy (if endowed with a Grammar and possibly more). That is not the reason I will be studying Formal Languages. I’m learning about formal languages to find their applications to computing.
Apparently, with the help of a little Mathematical thinking, we can assign semantics to the strings in a language and somehow correlate them with real world problems – such as computability, the P vs. NP problem, cryptography, and more!</p>Aaron NiskinThe basics of the theory of formal languages