<p>Aaron Niskin’s Blog: Not just another mathy blog! Now with more... Aaron! (http://aaron.niskin.org/feed.xml, 2019-06-04)</p>
<p><strong>Quantile Regression and Quantiles of Loss</strong> (2019-06-04, http://aaron.niskin.org/blog/2019/06/04/quantile_regrssion)</p>
<h1 id="background">Background</h1>
<p>If you’ve ever read any papers referring to quantile regression (QR), you undoubtedly have read “MSE yields means, MAE yields medians, and QR yields quantiles”. But what the hell does that even mean? Like not just hand-wavy bullshit, but really, what does that mean? After all, these are just loss functions and therefore don’t <em>yield</em> anything at all. Even the regressions yield coefficients, or, more generally, models; so whence cometh the mean, median talk? It turns out they’re talking about the distribution of residuals. In this post, we’ll show what they mean, how that’s the case, and we’ll investigate a bit more about QR.</p>
<!-- more -->
<h2 id="what-is-it">What is it?</h2>
<p>Given a vector of residuals $r\in\mathbb{R}^n$, and a scalar $q\in[0,1]$, we define the <strong>Quantile Loss</strong> with quantile $q$ ($L_q$) as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
L_q : \mathbb{R}^n& \to \mathbb{R} \\
L_q(r) =& \begin{cases} qr & r > 0 \\ (1-q)(-r) & r \leq 0\end{cases} \\
=& \begin{cases} q|r| & r > 0 \\ (1-q)|r| & r \leq 0\end{cases}
\end{align*} %]]></script>
<h3 id="in-words">In words</h3>
<p>Basically, the loss is very much like MAE (A.K.A. $L_1$ loss) but we weight the absolute errors based on whether they are positive or negative.</p>
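<p>To make that concrete, here is a minimal sketch of the loss in plain Python. Summing the per-residual terms to get one number is an assumption on my part (it matches the sums in the derivation further down); the function name <code class="highlighter-rouge">quantile_loss</code> and the sample residuals are made up for illustration.</p>

```python
def quantile_loss(residuals, q):
    """Total quantile loss of a list of residuals for quantile q."""
    total = 0.0
    for r in residuals:
        if r > 0:
            total += q * r            # positive residual, weighted by q
        else:
            total += (1 - q) * (-r)   # non-positive residual, weighted by 1 - q
    return total


residuals = [2.0, -1.0, 0.5, -4.0]
loss_median = quantile_loss(residuals, 0.5)  # half the sum of absolute errors
loss_high = quantile_loss(residuals, 0.9)    # positive residuals now cost 9x more
```

<p>With $q = 0.5$ both branches get the same weight, which is why the result is just a rescaled MAE.</p>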
<h2 id="relation-to-quantiles">Relation to Quantiles</h2>
<p>A natural question to ask at this point is, when are you getting to your point? Er, I mean, what does this have to do with quantiles?</p>
<p>To answer this, let’s investigate when the derivative will equal zero (a necessary condition for a minimum). To make our lives easier, let $\omega_0$ be the set of residuals that are greater than zero, and $\omega_1$ be those that are less than or equal to zero.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& \frac{\partial L_q}{\partial r}(r) \\\\
=& \sum_{r\in\omega_0}q - \sum_{r \in\omega_1}(1-q) \\\\
=& q\sum_{r\in\omega_0}1 - (1-q)\sum_{r \in\omega_1}1 \\\\
=& q|\omega_0| - (1-q)|\omega_1| \iff \\\\
|\omega_1| =& q(|\omega_0| + |\omega_1|) \iff \\\\
q =& \frac{|\omega_1|}{|\omega_0| + |\omega_1|}
\end{align*} %]]></script>
<p>So we can see that the Quantile Regression optimization problem attempts to make a fraction $q$ of the residuals non-positive. Taking the residual to be the observation minus the prediction, that means it attempts to over-estimate a fraction $q$ of the observations. In particular, if $q = 0.5$, then the QR loss attempts to ensure that there are just as many over-estimations as under-estimations.</p>
<p>But if $q=0.9$, for instance, then this loss will attempt to ensure your model over-estimates about 90% of the observations and under-estimates about 10% of them.</p>
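<p>A quick numerical sanity check of that claim, as a sketch rather than anything rigorous. It assumes the residual is the observation minus the prediction, uses a constant prediction as the "model", and brute-forces the best constant over the observed values. The data and names are made up.</p>

```python
import random


def quantile_loss(residuals, q):
    """Total quantile loss: q * r for r > 0, (1 - q) * (-r) otherwise."""
    return sum(q * r if r > 0 else (1 - q) * (-r) for r in residuals)


random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(1001)]
q = 0.9

# Try every observed value as the constant prediction; residual = y - prediction.
best = min(y, key=lambda c: quantile_loss([yi - c for yi in y], q))

# Fraction of observations that the best constant over-estimates (residual <= 0).
frac_over = sum(1 for yi in y if yi - best <= 0) / len(y)
```

<p>The loss is piecewise linear in the constant, with kinks at the observations, so searching over the observed values is enough to find a minimizer. <code class="highlighter-rouge">frac_over</code> should land very close to $q$.</p>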
<h3 id="practice">Practice</h3>
<p>In practice it seems to converge somewhere around the quantile, but depending on the outlier situation you have, it might be somewhat off. That might be an effect of the number of iterations I’ve allowed, but either way, it’s not far enough from the designated quantile to raise any alarms; especially not since whatever you’re doing, it’s probably using Stochastic Gradient Descent anyway.</p>
<h2 id="further-investigation">Further Investigation</h2>
<p>I lied. I’m tired and I don’t think I’ll be following through on my desires for further investigation. “Why don’t you just delete the part where you said you’d have further investigation”? Well, frankly, I don’t think you’re reading this anyway. Email me if you actually read this. I’d be super interested to find out.</p>
<h2 id="summary">Summary</h2>
<p>Quantile regression is... interesting.</p>
<p>Aaron Niskin</p>
<p><strong>Gradient Boosting Seen as Gradient Descent</strong> (2019-04-25, http://aaron.niskin.org/blog/2019/04/25/gbm-basic)</p>
<h1 id="tldr">tldr</h1>
<p>The short version of this whole thing is that we can see the gradient boosting algorithm as just gradient descent on black-box models. Typically gradient descent involves parameters and a closed form solution to the gradient of your learner as a way to update those parameters. In the case of trees, for instance, there are no parameters you want to update<sup id="fnref:tree_parameters"><a href="#fn:tree_parameters" class="footnote">1</a></sup>, and somewhat more importantly they aren’t even continuous predictors, much less differentiable!</p>
<p>But gradient boosting can be seen as gradient descent on black box models resulting in an additive black box model.</p>
<h1 id="gbm">GBM</h1>
<h2 id="the-naive-algorithm">The naive algorithm</h2>
<p>A description of the algorithm I’m talking about can be found on <a href="https://en.wikipedia.org/wiki/Gradient_boosting#Algorithm">wikipedia</a>, but I’ll go over the algorithm somewhat quickly here just to add another phrasing of the thing.</p>
<p>Let’s assume we have a feature-set $X$, a response variable $y$, a loss function $L(y, \hat y)$, and a learning rate $\lambda\in\mathbb{R}_+$ such that $\lambda \leq 1$. The algorithm is as follows:</p>
<p>First we define $f_0(X) = 0$.</p>
<p>Now if $i \geq 1$, we let:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
y_i =& -\frac{\partial{L(y, f_{i-1}(X))}}{\partial f_{i-1}(X)}
\end{align*} %]]></script>
<p>We then train a new set of trees (or whatever learner we want) such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
g_i(X) \sim& y_i
\end{align*} %]]></script>
<p>We then find the real number scalar that minimizes our loss in the following equation (AKA a line-search):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\gamma_i =& \text{argmin}_\gamma L(y, f_{i-1}(X) + \gamma g_i(X))
\end{align*} %]]></script>
<p>Finally we define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f_i(X) =& f_{i-1}(X) + \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
<p>So that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(X) =& \sum\limits_{i=1}^N \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
<p>In the special case where our loss function is mean squared error (or $L_2$), our negative gradients are just the residuals (up to a constant factor).</p>
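<p>Here is a rough sketch of the steps above in plain Python, under some loud assumptions: the loss is $\frac{1}{2}(y - f)^2$ so the negative gradient is exactly the residual, the learner is a one-split regression stump on a single feature, and the line search uses the closed form that squared error allows. The helper names (<code class="highlighter-rouge">fit_stump</code>, <code class="highlighter-rouge">boost</code>) and the toy data are hypothetical; this is not how any particular package does it.</p>

```python
def fit_stump(x, target):
    """Least-squares regression stump on one feature; returns a predict function."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [t for xi, t in zip(x, target) if xi <= threshold]
        right = [t for xi, t in zip(x, target) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=50, learning_rate=0.5):
    """Gradient boosting for L = (y - f)^2 / 2, whose negative gradient is y - f."""
    f = [0.0] * len(y)                                  # f_0(X) = 0
    stages = []
    for _ in range(n_rounds):
        pseudo = [yi - fi for yi, fi in zip(y, f)]      # y_i: the negative gradient
        g = fit_stump(x, pseudo)                        # g_i(X) ~ y_i
        gx = [g(xi) for xi in x]
        denom = sum(v * v for v in gx)
        # Closed-form line search for squared error: gamma = <pseudo, g> / <g, g>.
        gamma = sum(p * v for p, v in zip(pseudo, gx)) / denom if denom > 0 else 0.0
        f = [fi + learning_rate * gamma * v for fi, v in zip(f, gx)]
        stages.append((gamma, g))

    def predict(xi):
        # The final model is the sum of the scaled stage learners.
        return sum(learning_rate * gamma * g(xi) for gamma, g in stages)

    return predict, f


x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
predict, fitted = boost(x, y)
mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)
```

<p>Note that <code class="highlighter-rouge">predict</code> never touches a parameter vector; the model really is just the running sum of whatever the learner returned at each round.</p>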
<h1 id="gradient-descent">Gradient Descent</h1>
<p>The typical scenario we generally talk about in gradient descent is fitting linear weights in some type of model (whether that be a linear model, logistic regression, neural network, whatever). So let’s stick to this paradigm and consider the simplest case (a linear model), and we’ll show how the algorithm above is actually the same algorithm, just slightly more general.</p>
<p>So we’re modeling $y = X\beta + \epsilon$; in the parlance of GBM: $f = \beta$ or $f(X) = X\beta$.</p>
<p>To solve this, we define $\beta_0 = 0$.</p>
<p>Then for $i \in \mathbb{N}$ with $i \geq 1$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\beta_i =& \beta_{i-1} - \lambda\gamma_i\frac{\partial L}{\partial \beta_{i-1}}
\end{align*} %]]></script>
<p>Where $L, \lambda, \gamma_i$ are exactly the same as in the gradient boosting algorithm<sup id="fnref:line_search"><a href="#fn:line_search" class="footnote">2</a></sup>. So that was the standard gradient descent algorithm. Now let’s show how these two are really the same beast.</p>
<h1 id="similarities">Similarities</h1>
<p>We’ll show that the update formula is the same (mutatis mutandis) in both, and then as a consequence the final models will be constructed in the same way (because they’re both aggregates of their updates)<sup id="fnref:updates"><a href="#fn:updates" class="footnote">3</a></sup>.</p>
<p>To make the notation easier, we define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\alpha_i =& \frac{\partial L}{\partial \beta_{i-1}}
\end{align*} %]]></script>
<h2 id="update-formula">Update Formula</h2>
<p>For the update formula, we need to first note the following:</p>
<ol>
<li>Gradients are linear by definition, so $X\frac{\partial L}{\partial \beta_{i-1}}$ can be seen as the result of modeling ($\frac{\partial L}{\partial \beta_{i-1}(X)}$) by the application of a linear function (although it is normally unwise to do so).</li>
</ol>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f_i(X) =& X\beta_i \\\\
=& X\left(\beta_{i-1} - \lambda\gamma_i\frac{\partial L}{\partial \beta_{i-1}}\right) \\\\
=& X\beta_{i-1} - \lambda\gamma_iX\frac{\partial L}{\partial \beta_{i-1}} \\\\
=& X\beta_{i-1} - \lambda\gamma_i g_i(X) \\\\
=& f_{i-1}(X) - \lambda\gamma_i g_i(X)
\end{align*} %]]></script>
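<p>Which is the same formula we have in gradient boosting, up to a sign: there $g_i$ is fit to the <em>negative</em> gradient, so the minus sign is absorbed into $g_i$.</p>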
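<p>If you’d rather see that numerically, the sketch below runs plain gradient descent on a tiny linear model twice: once by updating $\beta$ and once by updating the predictions directly with $g_i(X) = X\frac{\partial L}{\partial \beta_{i-1}}$. It assumes the loss $\frac{1}{2}\sum(y - X\beta)^2$ and a fixed step standing in for $\lambda\gamma_i$; the data and helper names are made up.</p>

```python
def matvec(X, b):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(xij * bj for xij, bj in zip(row, b)) for row in X]


def grad_beta(X, y, beta):
    """Gradient of sum((y - X beta)^2) / 2 with respect to beta: -X'(y - X beta)."""
    resid = [yi - fi for yi, fi in zip(y, matvec(X, beta))]
    return [-sum(X[r][j] * resid[r] for r in range(len(X))) for j in range(len(beta))]


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x
step = 0.05                        # plays the role of lambda * gamma_i, held fixed
beta = [0.0, 0.0]                  # beta_0 = 0
f = matvec(X, beta)                # f_0(X) = 0
max_gap = 0.0
for _ in range(500):
    alpha = grad_beta(X, y, beta)                        # dL/dbeta at beta_{i-1}
    g = matvec(X, alpha)                                 # g_i(X) = X dL/dbeta
    beta = [b - step * a for b, a in zip(beta, alpha)]   # parameter-space update
    f = [fi - step * gi for fi, gi in zip(f, g)]         # function-space update
    gap = max(abs(u - v) for u, v in zip(matvec(X, beta), f))
    max_gap = max(max_gap, gap)
```

<p>The two tracks agree at every step (up to floating point), which is the whole point: the function-space track never needed to know there was a $\beta$.</p>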
<h1 id="observations">Observations</h1>
<p>Since we’re modeling the gradient with an arbitrary (potentially black-box) learner, we don’t have the option to find the gradient with respect to the parameters, so the scale might not decrease as desired. To exemplify this, let’s consider an $L_1$ objective (Mean-Absolute-Error), and a black-box learner. The gradient at each point is either 1, -1, or <code class="highlighter-rouge">np.nan</code> (because the absolute value function is $f(x) = \pm x$ depending on $x$). The magnitude of the gradients will never change. In a linear model we have that extra $\frac{\partial f}{\partial \beta}$ which adds scale to our gradient, but in trees we have no such thing. So in order to add scale, we tend to fit the line search and then add a learning rate to avoid over-fitting.</p>
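<p>A tiny illustration of the scale problem, as a sketch: for MAE the thing the learner gets to fit is just a sign, no matter how far off the prediction is, whereas for squared error (taken here as $\frac{1}{2}(y - f)^2$, my assumption) it shrinks as you get closer. The function names are hypothetical.</p>

```python
import math


def mae_pseudo_residual(y_i, f_i):
    """Negative gradient of |y_i - f_i| with respect to f_i."""
    if y_i == f_i:
        return math.nan            # the absolute value has no derivative at zero
    return 1.0 if y_i > f_i else -1.0


def mse_pseudo_residual(y_i, f_i):
    """Negative gradient of (y_i - f_i)^2 / 2 with respect to f_i."""
    return y_i - f_i


y = [10.0, 10.0, 10.0]
predictions = [0.0, 5.0, 9.9]      # very different distances from the target
mae_steps = [mae_pseudo_residual(a, b) for a, b in zip(y, predictions)]
mse_steps = [mse_pseudo_residual(a, b) for a, b in zip(y, predictions)]
```

<p>Every entry of <code class="highlighter-rouge">mae_steps</code> has magnitude one, so without a line search the update has no idea whether it is 10 away or 0.1 away.</p>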
<p>One can also sub-sample (as is a parameter in popular packages like <code class="highlighter-rouge">LightGBM</code>). Sub-sampling is the black-box model version of the familiar Stochastic Gradient Descent.</p>
<h1 id="summary">Summary</h1>
<p>I realize this might be obvious to some, but it was pretty cool when I first realized this. I hope you found something useful and/or interesting here.</p>
<div class="footnotes">
<ol>
<li id="fn:tree_parameters">
<p>Technically the split leaves in a tree define an indicator function on your data and the average value within a leaf (the prediction for that leaf) can be seen as the parameters of a tree, but this is kind of ridiculous because these are not tuned in the learning of the tree and there’s really no reason to do so (as far as I can tell). <a href="#fnref:tree_parameters" class="reversefootnote">↩</a></p>
</li>
<li id="fn:line_search">
<p>In practice we generally don’t include the line search and just have a decreasing $\lambda$ – and sometimes we don’t even do that. We can get away with these shortcuts because the magnitude of the gradients will decrease as you get closer to the optimum, and the derivative with respect to $\beta$ is always continuous. The same cannot be said about the gradient boosting algorithm. <a href="#fnref:line_search" class="reversefootnote">↩</a></p>
</li>
<li id="fn:updates">
<p>If we call our final linear model: $X\hat\beta$, then $X\hat\beta = X\left(-\sum\limits_{i=1}^N\lambda\gamma_i\alpha_i\right) = -\sum\limits_{i=1}^N\lambda\gamma_iX\alpha_i$. So we can see that linear regression (when fit by gradient descent) has always been constructed as a sum. <a href="#fnref:updates" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Aaron Niskin</p>
<p><strong>A little geometric walk down risk model lane</strong> (2019-03-28, http://aaron.niskin.org/blog/2019/03/28/risk_models)</p>
<h1 id="prerequisites">Prerequisites</h1>
<h2 id="inner-products">inner products</h2>
<p>So the first thing to understand is that as long as we have an inner-product defined, we have a geometry. And in case you’re unfamiliar with the term inner-product, you can think of it like a slightly more general dot-product.</p>
<p>Furthermore, an inner product defines a norm and it defines angles. Specifically,</p>
<p><script type="math/tex">\begin{align*}
\left |\left| u\right|\right| = \sqrt{\langle u, u\rangle} \\\\
\cos(\theta) = \frac{\langle u, v\rangle}{||u||\cdot||v||}
\end{align*}</script></p>
<h2 id="standard-deviation">Standard deviation</h2>
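<p>The next thing to understand is that standard deviation is a norm (of the de-meaned vector). It’s easily seen from the formula, taking $||\cdot||$ to be the norm induced by the averaged dot product $\langle u, v\rangle = \frac{1}{n}\sum u_iv_i$:</p>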
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sqrt{\frac{1}{n-1}\sum(x_i - \mu)^2} =& \sqrt{\frac{n}{n-1}}||x - \mu||\\\\
=& \sqrt{\frac{n}{n-1}}\sqrt{\langle x-\mu, x-\mu\rangle}
\end{align*} %]]></script>
<p>So standard deviation is a norm. The two are equivalent. Sweet. Let’s move on.</p>
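<p>A quick check of that, as a sketch. Which constant shows up in front depends on how you normalize the norm: with the plain dot product it is $\frac{1}{\sqrt{n-1}}$, and with the dot product divided by $n$ it is $\sqrt{\frac{n}{n-1}}$. Either way it’s a norm of the de-meaned vector times a constant. The data below is made up.</p>

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
mu = statistics.mean(x)
centered = [xi - mu for xi in x]

euclidean_norm = math.sqrt(sum(c * c for c in centered))      # plain dot product
averaged_norm = math.sqrt(sum(c * c for c in centered) / n)   # dot product divided by n

sample_std = statistics.stdev(x)                              # uses the n - 1 denominator
from_euclidean = euclidean_norm / math.sqrt(n - 1)
from_averaged = math.sqrt(n / (n - 1)) * averaged_norm
```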
<h2 id="covariance-is-an-inner-product">Covariance is an inner product</h2>
<p>If we have a linear transformation $X:\mathbb{R}^m\to\mathbb{R}^n$, then together with our covariance matrix $V$ we have an inner product defined on the image. Specifically, if $u,v\in\mathbb{R}^n$ are two portfolios already mapped to risk space, the inner product is (the estimated covariance):</p>
<script type="math/tex; mode=display">\langle u, v\rangle := u'Vv</script>
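<p>To see what this geometry buys us, here is a small sketch with a made-up two-factor covariance matrix. Under this inner product the norm of a basis vector is that factor’s volatility and the cosine of the angle between two basis vectors is their correlation. The matrix and names are hypothetical.</p>

```python
import math


def inner(u, v, V):
    """<u, v> = u' V v for a symmetric positive-definite matrix V."""
    return sum(u[i] * V[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def norm(u, V):
    """Norm induced by the inner product."""
    return math.sqrt(inner(u, u, V))


def cos_angle(u, v, V):
    """Cosine of the angle between u and v under the inner product."""
    return inner(u, v, V) / (norm(u, V) * norm(v, V))


# Hypothetical covariance matrix: variances 0.04 and 0.09, correlation 0.5.
V = [[0.04, 0.03],
     [0.03, 0.09]]
e1, e2 = [1.0, 0.0], [0.0, 1.0]
vol_1 = norm(e1, V)
corr_12 = cos_angle(e1, e2, V)
```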
<h2 id="projections">Projections</h2>
<p>Let $u,v\in\mathbb{R}^n$. Given an inner product $\langle\cdot,\cdot\rangle$ (and the associated norm), the <strong>projection</strong> (sometimes called <strong>vector projection</strong>) of $u$ onto $v$ (denoted as $\text{proj}_vu$) is defined as:</p>
<script type="math/tex; mode=display">\text{proj}_vu := \frac{\langle u, v\rangle}{||v||} \frac{v}{||v||}</script>
<p>Another way to see the definition, given our definition of $\cos$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{proj}_vu =& \frac{\langle u, v\rangle}{||v||} \frac{v}{||v||} \\\\
=& \frac{||u||\langle u, v\rangle}{||u||\cdot||v||} \frac{v}{||v||} \\\\
=& ||u||\cos(\theta) \frac{v}{||v||}
\end{align*} %]]></script>
<p>Geometrically, this gives us the component of $u$ in the $v$ direction. See figure below.</p>
<p><img src="/assets/pics/2019/03/28_projection.png" alt="visual of a vector projection" title="Visual of a vector projection" /></p>
<p>If you’re wondering why I specified that it’s the “<strong>vector</strong> projection” rather than just “projection”, it’s because there is a notion of a <em>scalar projection</em> too: The <strong>scalar projection</strong> of $u$ onto $v$ is simply the coefficient part of the projection:</p>
<script type="math/tex; mode=display">\text{scalar-proj}_vu = \frac{\langle u, v\rangle}{||v||}</script>
<p>But since they’re so similar and we really only care about the latter, I’ll abuse notation here and generally use $\text{proj}_vu$ to mean “scalar-projection” unless otherwise noted.</p>
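<p>Both projections are a few lines of code once the inner product is in hand. The sketch below uses the same kind of made-up covariance matrix as the inner product and checks that what’s left of $u$ after removing its projection is orthogonal to $v$ in this geometry. All names and numbers are for illustration only.</p>

```python
import math


def inner(u, v, V):
    """<u, v> = u' V v."""
    return sum(u[i] * V[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def scalar_proj(u, v, V):
    """Length of the component of u in the v direction: <u, v> / ||v||."""
    return inner(u, v, V) / math.sqrt(inner(v, v, V))


def vector_proj(u, v, V):
    """Component of u in the v direction: (<u, v> / ||v||) * (v / ||v||)."""
    norm_v = math.sqrt(inner(v, v, V))
    return [scalar_proj(u, v, V) * vi / norm_v for vi in v]


V = [[0.04, 0.03],
     [0.03, 0.09]]                 # hypothetical covariance matrix
u, v = [1.0, 2.0], [1.0, 0.0]
s = scalar_proj(u, v, V)
p = vector_proj(u, v, V)
remainder = [ui - pi for ui, pi in zip(u, p)]
leftover = inner(remainder, v, V)  # should be zero: the remainder is orthogonal to v
```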
<h2 id="risk">Risk</h2>
<p>When we talk about how <em>risky</em> a portfolio is, intuitively you probably understand that as “what’s the probability that I lose all of my money”, and you’re not wrong. That would be what we call <strong>downside risk</strong>. Unfortunately, that’s not so easy to compute (at least not on paper), and there isn’t much in the way of mathematical research built up on the notion of downside risk. Maybe in a later post, I’ll go over how downside risk relates to the Information Ratio, but not today. There is, however, a lot of research built up on <em>standard deviation</em>, and that’s somewhat related (“how irregular are your returns” is not too bad of a proxy). So we define a portfolio’s <strong>risk</strong> as the standard deviation of the portfolio’s returns.</p>
<h1 id="the-problem">The problem</h1>
<p>Now comes the wonderful task of predicting future risk. This involves estimating a covariance matrix of asset-level returns. But if you’re in the equities world (and even more-so if you’re in the quantitative equities world), the covariance matrix is very large (often on the order of 5,000 by 5,000) and somewhat non-stationary (meaning the data we have deteriorates in value over time). So these covariance matrices take quite a lot of data to fit, and they will inherently leave out newer stocks. Clearly, there is a place for a lower dimensional estimator of risk. We call such an estimator a <em>risk model</em>. There are several versions of a risk model, but this is meant to be a short post, so we’ll only go over the Barra style risk model…</p>
<h2 id="barra">Barra</h2>
<p>The Barra risk model takes the following form:</p>
<script type="math/tex; mode=display">(Xh)'V(Xh) + D\cdot h</script>
<p>Where $h \in \mathbb{R}^m$ is our <strong>holdings</strong> (of $m$ many assets) vector, $X \in \mathbb{R}^{n\times m}$ is called our <strong>factor exposure matrix</strong> (usually shortened to simply, “exposure matrix”), $V \in \mathbb{R}^{n\times n}$ is called our <strong>factor covariance matrix</strong>, and $D \in \mathbb{R}^{m}$ is our <strong>specific risk</strong> vector. The exposure matrix is computed based off of real-world things like the stock’s N-day momentum, the stock’s market cap, volatility (yet another word for standard deviation) of returns, etc.</p>
<p>If $n \ll m$ the covariance matrix can be computed with a lot more efficacy and the exposures are computed from external things like official SEC filings, etc. so as to reduce the amount of over-fitting here. So Barra generally regresses returns against the exposures (with some universe filtering and time weighting, etc.) to get “factor returns”. We can then compute the factor covariance matrix, and we’re done!</p>
<h2 id="the-geometric-view">The geometric view</h2>
<p>We can view this style of risk model in a geometric way for some added clarity. Let’s focus entirely on the first part ($(Xh)'V(Xh)$) for now, and furthermore let’s assume that we’ve fit the model already – meaning we’ve already computed $X$ and $V$. It’s clear to see that we’re using a lower dimensional embedding of our portfolios and computing the risk in the embedded space, then projecting back.
Formally, we have an inner product defined in $\mathbb{R}^n$ given by:</p>
<script type="math/tex; mode=display">\langle u, v\rangle = u'Vv</script>
<p>So our factor risk is the norm of our vector in this space. But now we want to see how seeing things the geometric way can benefit us.</p>
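<p>As a worked example of the factor part only, here is a sketch with two factors and three assets. The exposures, covariance matrix and holdings are all invented, and the specific-risk term is deliberately left out.</p>

```python
import math


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def inner(u, v, V):
    """<u, v> = u' V v."""
    return sum(u[i] * V[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


X = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]              # hypothetical exposures: n = 2 factors, m = 3 assets
V = [[0.04, 0.03],
     [0.03, 0.09]]                 # hypothetical factor covariance matrix
h = [0.2, 0.3, 0.5]                # hypothetical holdings

u = matvec(X, h)                   # the portfolio mapped into factor space
factor_variance = inner(u, u, V)   # (Xh)' V (Xh)
factor_risk = math.sqrt(factor_variance)   # the norm of u in this geometry
```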
<h2 id="marginal-contribution-to-risk">Marginal contribution to risk</h2>
<p>One common task is to attribute some amount of the total risk to a particular factor. By that I mean that we have a portfolio and we would like to express the total risk of this portfolio in terms of the risk factors from our risk model. The theory suggests that we want the <strong>marginal contribution to risk</strong> ($MCAR$) which is defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
MCAR(i) :=& \frac{\partial \sigma}{\partial u_i}
\end{align*} %]]></script>
<p>It’s the partial derivative of the total risk with respect to our factor in question (denoted above as $u_i$). And we can compute this somewhat easily:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
MCAR(i) :=& \frac{\partial \sigma}{\partial u_i} \\\\
=& \frac{\partial \sqrt{\langle u, u\rangle}}{\partial u_i} \text{ -- risk is the norm}\\\\
=& \frac{1}{2\sqrt{\langle u, u\rangle}}\cdot \frac{\partial \langle u, u\rangle}{\partial u_i} \text{ -- chain rule}\\\\
=& \frac{1}{2\sqrt{\langle u, u\rangle}}\cdot \frac{\partial u'Vu}{\partial u_i} \text{ -- definition of our inner product}\\\\
=& \frac{(u'V + u'V')u_i}{2\sqrt{\langle u, u\rangle}} \text{ -- computing the derivative}\\\\
=& \frac{u'Vu_i}{\sqrt{\langle u, u\rangle}} \text{ -- $V$ is symmetric}\\\\
=& \frac{\langle u, u_i\rangle}{\left|\left|u\right|\right|} \text{ -- Replacing with the geometric notion}\\\\
\end{align*} %]]></script>
<p>But man, is this not illuminating. Why would we even want the derivative to begin with? Are we really concerned with such an extremely local property here or are we trying to find some more global thing? It seems to me that what we want is more about explaining our current position rather than how our position could change with infinitesimal changes in other values. You may have noticed that the end formula above is the same as the projection formula from the <em>prerequisites</em> section. That’s no coincidence.</p>
<p>If we instead look at MCAR as trying to explain how much of our total portfolio risk can be explained by the portion of our portfolio in the direction of a chosen basis vector (factor), we end up with the same formula. In order to see how much the portion of $u$ in the direction of a particular basis explains the total risk, we compute the projection of $u_i$ onto $u$, which would look like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{proj}_uu_i =& \frac{\langle u, u_i\rangle}{\left|\left|u\right|\right|}
\end{align*} %]]></script>
<p>And it comes with all the geometric intuition we all know and love.</p>
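<p>For the skeptical, a sketch that checks the closed form against a finite-difference derivative. It reads the $i$-th marginal contribution as $\frac{(Vu)_i}{||u||}$, which is the formula above with the $i$-th basis vector plugged in (that reading is my assumption), and it also checks that the contributions weighted by the exposures add back up to the total risk. The numbers are made up.</p>

```python
import math


def risk(u, V):
    """Norm of u under the inner product u' V u."""
    return math.sqrt(sum(u[i] * V[i][j] * u[j]
                         for i in range(len(u)) for j in range(len(u))))


def mcar(u, V):
    """Closed-form marginal contributions: (V u)_i / ||u||."""
    sigma = risk(u, V)
    return [sum(V[i][j] * u[j] for j in range(len(u))) / sigma for i in range(len(u))]


def mcar_numeric(u, V, eps=1e-6):
    """Central finite-difference estimate of d(risk)/d(u_i)."""
    grads = []
    for i in range(len(u)):
        up = list(u)
        up[i] += eps
        down = list(u)
        down[i] -= eps
        grads.append((risk(up, V) - risk(down, V)) / (2 * eps))
    return grads


V = [[0.04, 0.03],
     [0.03, 0.09]]                 # hypothetical factor covariance matrix
u = [0.35, 0.65]                   # hypothetical factor exposures of a portfolio
analytic = mcar(u, V)
numeric = mcar_numeric(u, V)
recombined = sum(ui * gi for ui, gi in zip(u, analytic))   # should equal risk(u, V)
```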
<h2 id="ufps">U.F.P.’s</h2>
<p>[edit]
See note below.
[end edit]</p>
<p>What the hell are they? The standard definition of a <strong>UFP</strong> is: the characteristic portfolio with unit exposure to a particular factor, and zero exposure to all the others. What does this mean? How are we guaranteed that such a thing always exists, and just, what? Well, yet again, our geometric interpretation can come to our rescue…</p>
<p>Let’s pick a particular factor $i$. Now let’s construct an ortho-normal basis (perhaps using Gram-Schmidt) in factor space starting with every other factor <em>but</em> $i$, leaving $i$ for last. That last basis vector would be, by construction, orthogonal to every other factor (hence zero exposure to them) and have some exposure to our chosen factor. We can now take that last basis vector, rescale it appropriately and we have ourselves a sweet UFP.</p>
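<p>The construction just described can be sketched as follows: Gram-Schmidt with respect to the covariance inner product, with the chosen factor processed last. The three-factor covariance matrix is invented, the factors are taken to be the standard basis vectors of factor space, and the final rescaling step is left out.</p>

```python
import math


def inner(u, v, V):
    """<u, v> = u' V v."""
    return sum(u[i] * V[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def gram_schmidt(vectors, V):
    """Orthonormalize vectors, in order, with respect to the inner product u' V v."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = inner(w, b, V)                        # component of w along b
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        length = math.sqrt(inner(w, w, V))
        basis.append([wi / length for wi in w])
    return basis


# Hypothetical 3-factor covariance matrix (symmetric, positive definite).
V = [[0.04, 0.01, 0.00],
     [0.01, 0.09, 0.02],
     [0.00, 0.02, 0.16]]
factors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

chosen = 0                                                # the factor we leave for last
others = [f for k, f in enumerate(factors) if k != chosen]
last = gram_schmidt(others + [factors[chosen]], V)[-1]

overlaps = [inner(last, f, V) for f in others]            # orthogonal to every other factor
own_exposure = inner(last, factors[chosen], V)            # but not to the chosen one
```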
<p>[edit]</p>
<h3 id="note">NOTE</h3>
<p>The information above about <em>UFP</em> is actually incorrect. A <strong>UFP</strong> is actually just the returns of the momentum vector. If you’re more comfortable thinking in portfolio space, it’s the returns of a fictitious portfolio that, when projected to factor space, is one of our basis vectors. What I wrote above is a bit more complicated, but provides an interesting and potentially useful insight into factor performance. I personally couldn’t really care too much less about factors, but if that’s what you’re into, looking at both the “real” UFPs and what I was calling UFPs can give you a much more complete view of things.</p>
<p>Let me explain:</p>
<p>The risk of a “real” UFP has a portion explained by other factors. So if momentum is on a tear, it could just be a coalescence of other factors ripping it but the part that makes momentum unique is actually quite negative. One way to see that is to plot the performance of the “real” UFPs along with my UFPs. If you see the “real” UFP is positive but my UFP is negative (for a given factor), you can conclude we’re in a situation like I described above. And so on.
[end edit]</p>
<h2 id="specific-risk">Specific risk</h2>
<p>So how do we interpret specific risk (what we called $D$) in the geometric way? Can we simply add a column to our risk model for specific risk – even if it is mostly zeros? It turns out, no. For one, the specific risk is stock-specific (hence the name), but also, the thing we want to measure – variance – is not linear. So what is specific risk?</p>
<p>Well, one way to interpret it is as the norm of the projection of our vector in holdings space onto the kernel of our embedding. Let me explain what I mean there…</p>
<p>We have a surjective map $X:\mathbb{R}^m\to\mathbb{R}^n$ where $n\lt m$. Since the dimension of the domain is larger than the dimension of the range we know that several dimensions have to be in the kernel. In particular, since the map is surjective we know exactly that the dimension of the kernel is $m-n$. So there is an $n$ dimensional substructure of our holdings space that matters – as far as factors are concerned – and $m-n$ dimensions that don’t. We project the parts that do matter into factor space and compute their magnitude there where we know the geometry, and compute the magnitude in ambient space of the part that doesn’t map to factor-space. It’s that latter piece that we call <strong>specific risk</strong>.</p>
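<p>Here is a sketch of that split for a toy case with $n = 2$ factors and $m = 3$ assets, so the kernel is one dimensional. It splits the holdings into the piece the factors can see and the piece the exposure matrix sends to zero, using the ordinary dot product in holdings space (my assumption for "ambient space"). The exposures and holdings are made up, and the $2\times 2$ system is solved by hand to keep it dependency-free.</p>

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


X = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]              # hypothetical exposures: n = 2 factors, m = 3 assets
h = [0.2, 0.3, 0.5]                # hypothetical holdings

# Solve (X X') c = X h; X X' is only 2 x 2 here, so invert it by hand.
G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
u = matvec(X, h)
c = [(G[1][1] * u[0] - G[0][1] * u[1]) / det,
     (G[0][0] * u[1] - G[1][0] * u[0]) / det]

factor_part = matvec(transpose(X), c)                       # the piece of h the factors see
kernel_part = [hi - fi for hi, fi in zip(h, factor_part)]   # the piece X sends to zero
image_of_kernel_part = matvec(X, kernel_part)
```

<p>Both pieces map to the same point in factor space as $h$ itself would if you dropped the kernel piece, which is exactly why the factor model can’t say anything about it.</p>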
<h2 id="summary">Summary</h2>
<p>So, in particular, if $h_1$ is a particular risk factor, we get the typical risk-decomposition formula we all know and love, but without messing with any of the unmotivated calculus. What’s nice is it also illuminates the place for non-linear manifold approximations: at the end of the day, they’re all just defining the geometry we’re going to use to assess our portfolio.</p>
<p>Aaron Niskin. Let’s talk about the geometry of risk models.</p>
<p><strong>Let’s Cover The Fundamental Group (pt 1)</strong> (2018-08-12, http://aaron.niskin.org/blog/2018/08/12/fundamental_group_pt1)</p>
<h2 id="jumping-right-in">Jumping right in</h2>
<h3 id="prerequisites">Prerequisites:</h3>
<p>Firstly, this post builds off of my <a href="/blog/2018/08/05/homotopy_revisited.html">previous post</a>, so if you’re learning these things for the first time and find yourself kinda lost, start there.</p>
<p>In addition to the previous post, the following are mathematical definitions you should probably know before reading this post. If you don’t know any of these, trust google.</p>
<ul>
<li><strong>Group</strong></li>
<li><strong>Homomorphism</strong></li>
<li><strong>Kernel</strong></li>
<li><strong>Subgroup</strong></li>
<li><strong>Normal subgroup</strong></li>
<li><strong>Homotopy</strong></li>
<li><strong>Retraction</strong></li>
<li><strong>Star Convex</strong></li>
</ul>
<h3 id="moving-onward">Moving onward</h3>
<p><strong>Definition</strong>:</p>
<p>Given a path $\alpha$ from $x_0$ to $x_1$ in $X$, we define $\hat\alpha:\pi_1(X,x_0)\to\pi_1(X,x_1)$ as follows:</p>
<script type="math/tex; mode=display">\hat\alpha([f]) = [\bar\alpha]\star[f]\star[\alpha]</script>
<p>And it’s a theorem that $\hat\alpha$ is a group isomorphism.</p>
<p><strong>Definition</strong>:</p>
<p>Given a continuous map $h:(X,x_0)\to(Y,y_0)$, we define:</p>
<p>$h_\star:\pi_1(X,x_0)\to\pi_1(Y,y_0)$ as follows:</p>
<p>$h_\star([f]) = [h\circ f]$.</p>
<p>And with that…</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-asubseteqmathbbrn-is-star-convex-then-a-is-simply-connected">Show that if $A\subseteq\mathbb{R}^n$ is star convex then $A$ is simply connected.</h3>
<div class="hint">
<p>Let $a\in A$ be one of the points that make the set star convex. Our task now is to show:</p>
<ol>
<li>For any $x,y\in A$, there is a path entirely within $A$ connecting $x$ and $y$, and</li>
<li>$\pi_1(A,a)$ is trivial.</li>
</ol>
<p>For the first: just let $x,y\in A$ and let $f_x,f_y$ be the paths connecting them to $a$. Then note that $f_x \star \bar f_y$ is a path connecting $x$ to $y$.</p>
<p>For the second we can use the same straight-line homotopy that we used for the convex version: let $f$ be a path starting and ending at $a$ and let $H(s,t) = t\cdot a + (1-t)\cdot f(s)$.</p>
<p>Lastly, because part <em>a</em> (that I didn’t mention) is to find a set that’s star convex but not convex, I’ll leave the question with The Star of David.</p>
</div>
<h3 id="show-that-if-alphabeta-are-paths-from-x_0to-x_1to-x_2-all-in-x-and-gamma--alphastarbeta-that-hatgamma--hatbetacirchatalpha">Show that if $\alpha,\beta$ are paths from $x_0\to x_1\to x_2$ all in $X$ and $\gamma = \alpha\star\beta$ that $\hat\gamma = \hat\beta\circ\hat\alpha$.</h3>
<div class="hint">
<p>Since these are functions between equivalence classes, our job is now to show that the outputs for a given input are homotopic.</p>
<p>So let $f$ be a path in $X$ starting and stopping at $x_0$. Then:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\gamma([f]) =& [\bar\gamma]\star[f]\star[\gamma] \\
=& [\overline{\alpha\star\beta}]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta\star\bar\alpha]\star[f]\star[\alpha\star\beta]\\
=& [\bar\beta]\star[\bar\alpha]\star[f]\star[\alpha]\star[\beta]\\
=& [\bar\beta]\star([\bar\alpha]\star[f]\star[\alpha])\star[\beta]\\
=& [\bar\beta]\star(\hat\alpha([f]))\star[\beta]\\
=& \hat\beta(\hat\alpha([f]))
\end{align*} %]]></script></p>
</div>
<h3 id="show-that-if-x_0x_1-are-points-in-a-path-connected-space-x-pi_1xx_0-is-abelian-if-and-only-if-for-every-pair-of-paths-from-alphabeta-from-x_0-to-x_1-hatalpha--hatbeta">Show that if $x_0,x_1$ are points in a path-connected space $X$, $\pi_1(X,x_0)$ is abelian if and only if for every pair of paths $\alpha,\beta$ from $x_0$ to $x_1$, $\hat\alpha = \hat\beta$.</h3>
<div class="hint">
<p>If $\pi_1(X,x_0)$ is abelian, we have:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\alpha([f]) =& [\bar\alpha]\star[f]\star[\alpha]\\
=& [\bar\alpha]\star[\bar{(\beta\star\bar\alpha)} \star(\beta\star\bar\alpha)\star f]\star[\alpha]\\
=& [\bar\alpha]\star[(\alpha\star\bar\beta)\star f\star(\beta\star\bar\alpha)]\star[\alpha]\\
=& [\bar\alpha]\star[\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha]\star[\alpha]\\
=& [\bar\alpha\star\alpha\star\bar\beta]\star[f]\star[\beta\star\bar\alpha\star\alpha]\\
=&[\bar\beta]\star[f]\star[\beta] \\
=&\hat\beta([f])
\end{align*} %]]></script></p>
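<p>The computation above only covers one direction. Here is a sketch of the converse, assuming $\hat\alpha = \hat\beta$ for every pair of paths from $x_0$ to $x_1$. Pick loops $f,g$ at $x_0$, fix any path $\alpha$ from $x_0$ to $x_1$ (one exists because $X$ is path-connected), and set $\beta = g\star\alpha$; the names $g$ and this particular $\beta$ are just for this sketch. Then:</p>
<p><script type="math/tex">% <![CDATA[
\begin{align*}
\hat\beta([f]) =& [\overline{g\star\alpha}]\star[f]\star[g\star\alpha]\\
=& [\bar\alpha]\star[\bar g]\star[f]\star[g]\star[\alpha]\\
=& \hat\alpha([\bar g]\star[f]\star[g])
\end{align*} %]]></script></p>
<p>Since $\hat\alpha([f]) = \hat\beta([f])$ and $\hat\alpha$ is an isomorphism (so in particular injective), we get $[f] = [\bar g]\star[f]\star[g]$, which is the same as $[g]\star[f] = [f]\star[g]$. So $\pi_1(X,x_0)$ is abelian.</p>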
</div>
<h3 id="let-asubset-x-and-let-rxto-a-be-a-retraction-show-that-for-a_0in-a-r_starpi_1xa_0topi_1aa_0-is-surjective">Let $A\subset X$ and let $r:X\to A$ be a retraction. Show that for $a_0\in A$, $r_\star:\pi_1(X,a_0)\to\pi_1(A,a_0)$ is surjective.</h3>
<div class="hint">
<p>Well, any path $\alpha$ in $A$ starting and stopping at $a_0$ will also be a path in $X$ (because $A\subset X$). So $r_\star([\alpha]_X) = [\alpha]_A$. It’s surjective because it’s the identity map when we restrict paths to $A$.</p>
</div>
<h3 id="let-asubsetmathbbrn-and-haa_0toyy_0-show-that-if-h-is-extendable-to-a-continuous-map-tilde-hmathbbrnto-y-then-h_star-is-trivial-ie-sends-everybody-to-the-class-of-the-constant-loop">Let $A\subset\mathbb{R}^n$ and $h:(A,a_0)\to(Y,y_0)$. Show that if $h$ is extendable to a continuous map $\tilde h:\mathbb{R}^n\to Y$, then $h_\star$ is trivial (i.e. sends everybody to the class of the constant loop).</h3>
<div class="hint">
<p>Let $G=\pi_1(A,a_0), H=\pi_1(Y,y_0)$. Then, as a reminder, $h_\star:G\to H$ such that $h_\star([\alpha]) = [h\circ\alpha]$.</p>
<p>So let $\alpha,\beta$ be loops in $(A,a_0)$. Since $\alpha,\beta$ are arbitrary, it is sufficient to show that $h_\star([\alpha]) = h_\star([\beta])$, i.e. that $h\circ\alpha$ is homotopic to $h\circ\beta$.</p>
<p>Consider $F:I\times I\to \mathbb{R}^n$ given by $F(s,t) = t\alpha(s) + (1-t)\beta(s)$ (the straight line homotopy between the two loops). Since $h$ is extendable to $\tilde h:\mathbb{R}^n\to Y$, we know that even if $F$ is a homotopy that leaves $A$ (into some other part of $\mathbb{R}^n$), $\tilde h\circ F$ is a homotopy between $h\circ\alpha$ and $h\circ\beta$ that stays entirely in $Y$. Hence $[\tilde h\circ \alpha] = [\tilde h\circ \beta]$.</p>
</div>
<!--
### Let $X$ be path connected, $h:X\to Y$ be continuous with $h(x_0) = y_0$ and $h(x_1)=y_1$; let $\alpha$ be a path in $X$ from $x_0$ to $x_1$, and $\beta = h\circ\alpha$. Show that
$$\hat\beta\circ(h_{x_0})_\star = (h_{x_1})_\star\circ\hat\alpha$$
As in, show that the following diagram commutes:
$$
\newcommand{\ra}[1]{\xrightarrow{\quad#1\quad}}
\newcommand{\da}[1]{\left\downarrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
\newcommand{\sea}[1]{\left\searrow{\scriptstyle#1}\vphantom{\displaystyle\int_0^1}\right.}
$$
$$
\begin{array}{ccc}
\pi_1(X,x_0) & \ra{(h_{x_0})_\star} & \pi_1(Y,y_0) \\
\da{\hat\alpha} & & \da{\hat\beta} \\
\pi_1(X,x_1) & \ra{(h_{x_1})_\star} & \pi_1(Y,y_1) \\
\end{array}
$$
<div class="hint" markdown="1">
Let
$$\begin{align*}
f=&\hat\beta\circ (h_{x_0})_\star & \text{and}& & g=&(h_{x_1})_\star\circ\hat\alpha
\end{align*}$$
The claim is now that
$$\forall a \in \pi_1(X,x_0), f(a) = g(a)$$
_Proof_:
Let $b_0 \in f(a) = \hat\beta\circ(h_{x_0})_\star (c)$. That means $b$ is a particular path in $(Y,y_1)$ such that $b_0 = \bar\beta\star(h(c))\star\beta$, whereas $g(a) \ni b_1 = h(\bar\alpha\star d \star\alpha)$ with $c,d$ homotopic.
Let $\alpha$ be a path in $(X,x_0)$ and $\beta = h\circ\alpha$ -- hence a path in $(Y,y_1)$.
</div>
-->Aaron NiskinMunkres saves the day and explains the fundamental secrets behind the most invariant of groups!Let’s Cover Homotopy (an apologetic revisionary tale)2018-08-05T00:00:00-07:002018-08-05T00:00:00-07:00http://aaron.niskin.org/blog/2018/08/05/homotopy_revisited<h2 id="what-am-i-doing">What am I doing?</h2>
<h3 id="the-structure-of-these-posts">The structure of these posts</h3>
<p>This post is not really building off of the theory from my previous post (<a href="/blog/2018/04/12/homology.html">homology pt 1</a>), but it’s about the same subject. It turns out we were using a book I didn’t really like too much, so I switched to something with a bit more of a familiar style for me. So I’ll be using <a href="https://www.amazon.com/Topology-Classic-Classics-Advanced-Mathematics/dp/0134689518/ref=sr_1_2?ie=UTF8&qid=1533510847&sr=8-2&keywords=topology+munkres">Topology by Munkres</a> for these posts (at least for now).</p>
<p>I’ll not be covering the book in much detail because after all, Munkres did a great job at doing that himself. But I will summarize some things that I feel like summarizing, and do some of the exercises (just the ones I feel like doing because, well, I can – I know, this is sounding more useful by the second).</p>
<p>So let’s dig in!</p>
<h1 id="part-ii">Part II</h1>
<p>But where did part I go? That’s not Algebraic Topology, so we’re not gonna cover that part (at least not until I feel like I need a review).</p>
<h2 id="chapter-9-the-fundamental-group">Chapter 9: The Fundamental Group</h2>
<p>Basically, we say $f_0,f_1:X\to Y$ are <strong>homotopic</strong> if there is a continuous function $F:X\times I\to Y$ (where $I=[0,1]$) with the following two properties:</p>
<ol>
<li>$\forall x\in X, F(x,0) = f_0(x)$</li>
<li>$\forall x\in X, F(x,1) = f_1(x)$</li>
</ol>
<p>Basically, at one end of the interval $I$, $F$ is $f_0$, and at the other end $F$ is $f_1$. Since the whole map is continuous, this must be a continuous deformation. And there’s a bunch of (helpful) book-keeping to prove that homotopy defines an equivalence relation, etc. but we won’t be covering that right now. But before we move on, let’s talk about path homotopy…</p>
<p>We say the paths $f_0,f_1:I\to Y$ are <strong>path homotopic</strong> if they have the same start and end points ($x_0$ and $x_1$) and there is a homotopy $F$ from $f_0$ to $f_1$ with the following two additional requirements:</p>
<ol>
<li>$\forall s\in I, F(0,s) = x_0$</li>
<li>$\forall s\in I, F(1,s) = x_1$</li>
</ol>
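<p>To make the two extra requirements concrete, here is a minimal sketch (not from Munkres) of a straight-line path homotopy in the plane. The paths <code>flat</code> and <code>bump</code> and the helper name are made up for illustration; the only thing the sketch relies on is that both paths share their end points.</p>

```python
def straight_line_path_homotopy(f0, f1):
    """Return F(x, t) = (1 - t) * f0(x) + t * f1(x), computed coordinate-wise."""
    def F(x, t):
        return tuple((1 - t) * a + t * b for a, b in zip(f0(x), f1(x)))
    return F


# Two hypothetical paths in the plane, both running from (0, 0) to (1, 0).
def flat(x):
    return (x, 0.0)


def bump(x):
    return (x, x * (1.0 - x))


F = straight_line_path_homotopy(flat, bump)

# The end points never move, whatever the deformation parameter t is.
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, F(0.0, t), F(1.0, t))
```

<p>This only works because the plane is convex; on the donut there is no such straight line to walk along.</p>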
<p>One thing to note: if you consider a torus (because who doesn’t love donuts), any two non-intersecting paths will be homotopic, but not necessarily path homotopic because they might go around different sides of the donut hole, for instance. The requirement that the end-points are fixed is no trivial thing.</p>
<p>Let’s get into the questions!</p>
<h2 id="exercises">Exercises</h2>
<h3 id="show-that-if-hhxto-y-are-homotopic-and-kkyto-z-are-also-homotopic-then-kcirc-h-and-kcirc-h-are-homotopic">Show that if $h,h’:X\to Y$ are homotopic and $k,k’:Y\to Z$ are also homotopic, then $k\circ h$ and $k’\circ h’$ are homotopic.</h3>
<div class="hint">
<p>Let $H,K$ be homotopies from $h$ to $h’$ and $k$ to $k’$ respectively. Then consider the map $G:X\times I\to Z$ defined by $G(x,i) = K(H(x,i),i)$.</p>
<p><em>Claim</em>: $G$ is a homotopy between $k\circ h$ and $k’\circ h’$.</p>
<p><em>Proof</em>: $G$ is obviously continuous (because $H$ and $K$ are), so we only have to prove that $G(x,0) = k(h(x))$ and $G(x,1)= k’(h’(x))$. But those follow directly from the definition of $G$: $G(x,0) = K(H(x,0),0) = K(h(x),0) = k(h(x))$, and similarly at $1$.</p>
</div>
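<p>If you’d like to poke at this, here is a small sketch of the construction $G(x,t) = K(H(x,t),t)$. The concrete maps on the real line (<code>h</code>, <code>h_prime</code>, <code>k</code>, <code>k_prime</code>) are arbitrary choices of mine, and the homotopies are just straight lines between them.</p>

```python
def compose_homotopies(H, K):
    """Given H: X x I -> Y and K: Y x I -> Z, return G(x, t) = K(H(x, t), t)."""
    def G(x, t):
        return K(H(x, t), t)
    return G


# Hypothetical maps on the real line, chosen only for illustration.
def h(x):
    return x


def h_prime(x):
    return x * x


def k(y):
    return y + 1


def k_prime(y):
    return 2 * y


def H(x, t):
    # Straight-line homotopy from h to h_prime.
    return (1 - t) * h(x) + t * h_prime(x)


def K(y, t):
    # Straight-line homotopy from k to k_prime.
    return (1 - t) * k(y) + t * k_prime(y)


G = compose_homotopies(H, K)

for x in (-2.0, 0.0, 1.5):
    # At t = 0 we should see k(h(x)); at t = 1 we should see k_prime(h_prime(x)).
    print(x, G(x, 0.0), k(h(x)), G(x, 1.0), k_prime(h_prime(x)))
```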
<h3 id="given-spaces-x-and-y-let-xy-be-the-set-of-homotopy-classes-of-maps-from-x-to-y">Given spaces $X$ and $Y$, let $[X,Y]$ be the set of homotopy classes of maps from $X$ to $Y$.</h3>
<p>a) Show that for any $X$, $[X,I]$ is a singleton.</p>
<div class="hint">
<p>Let $f,g:X\to I$ be continuous maps. Then since $\forall x \in X$, $f(x),g(x)\in[0,1]$, we know that the following map $H:X\times I\to I$ will always be well defined: $H(x,t) = t\cdot f(x) + (1-t)\cdot g(x)$. $H$ is a homotopy.</p>
</div>
<p>b) Show that if $Y$ is path connected, the set $[I,Y]$ is a singleton.</p>
<div class="hint">
<p>Let $Y$ be path connected and let $f:I\to Y$ be continuous. Pick a point $y_0\in Y$ and define $g:I\to Y$ such that $\forall t\in I$, $g(t)=y_0$. We will prove that $f$ is homotopic to $g$ (and hence all maps are), and conclude that $[I,Y]$ is a singleton. We’ll prove this in two steps: first we’ll show that $f$ is homotopic to the constant map that maps all of $I$ to one of the end-points of $f$; then we’ll show that the aforementioned constant map is homotopic to $g$.</p>
<p>To prove the first part, let $h:I\to Y$ be such that $\forall t\in I$, $h(t)=f(0)$. Let $H:I\times I\to Y$ be such that $H(s,t) = f(s\cdot (1-t))$. Then clearly $H$ is a homotopy between $f$ and $h$.</p>
<p>To prove the second part, since $Y$ is path connected, let $p:I\to Y$ be a path from $f(0)$ to $y_0$. Then let $G:I\times I\to Y$ be such that $G(s,t) = p(t)$. Aaaannnnndddd, that’s our homotopy between the two constant maps.</p>
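<p>So our desired homotopy from $f$ to $g$ is $H$ followed by $G$: run $H$ at double speed for $t\in[0,\frac12]$, then $G$ at double speed for $t\in[\frac12,1]$. The two agree at $t=\frac12$ (both give the constant map at $f(0)$), so the result is continuous.</p>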
</div>
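<p>Here is a rough sketch of gluing the two homotopies together end to end, one after the other. I’m taking $Y=\mathbb{R}$ for convenience, and the path <code>f</code>, the target point <code>y0</code> and the connecting path <code>p</code> are all made-up examples.</p>

```python
def concatenate(H, G):
    """Run H for t in [0, 1/2], then G for t in [1/2, 1].

    Assumes H(s, 1) == G(s, 0) for every s, so the pieces match up.
    """
    def F(s, t):
        if t <= 0.5:
            return H(s, 2 * t)
        return G(s, 2 * t - 1)
    return F


# Hypothetical data in Y = R.
def f(s):
    return s * s


y0 = 3.0


def H(s, t):
    # Shrinks f down to the constant map at f(0).
    return f(s * (1 - t))


def p(t):
    # A path in Y from f(0) to y0.
    return (1 - t) * f(0.0) + t * y0


def G(s, t):
    # Slides the constant map at f(0) along p to the constant map at y0.
    return p(t)


F = concatenate(H, G)

for s in (0.0, 0.5, 1.0):
    print(s, F(s, 0.0), F(s, 0.5), F(s, 1.0))
```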
<h3 id="a-space-x-is-contractible-if-the-identity-map-i_xxto-x-is-homotopic-to-a-constant-map-nullhomotopic">A space $X$ is <strong>contractible</strong> if the identity map $i_X:X\to X$ is homotopic to a constant map (<strong>nullhomotopic</strong>).</h3>
<p>a) Show that $I,\mathbb{R}$ are contractible.</p>
<div class="hint">
<p>To show that $I$ is contractible, we can just consider it a corollary of the first part of the previous answer. So we’ll prove the second part (that $\mathbb{R}$ is contractible).</p>
<p>Let $i:\mathbb{R}\to\mathbb{R}$ be the identity and let $x_0\in\mathbb{R}$, and $f:\mathbb{R}\to\mathbb{R}$ with $f(x) = x_0$. Our job is to show that there is a homotopy $F:\mathbb{R}\times I\to\mathbb{R}$ between $f$ and $i$.</p>
<p>Consider the map $F(x,t) = (1-t)x_0 + tx$. Then $F$ is continuous, $F(x,0) = x_0 = f(x)$, and $F(x,1) = x = i(x)$. That is our homotopy.</p>
</div>
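<p>A straight-line homotopy is one standard way to contract $\mathbb{R}$ to a point. The sketch below just evaluates it at the two ends; the function name <code>contraction</code> and the sample point are mine.</p>

```python
def contraction(x0):
    """Return F(x, t) = (1 - t) * x0 + t * x, a homotopy from the constant map at x0 to the identity."""
    def F(x, t):
        return (1 - t) * x0 + t * x
    return F


F = contraction(x0=2.0)

for x in (-5.0, 0.0, 7.5):
    # t = 0 gives the constant map, t = 1 gives the identity.
    print(x, F(x, 0.0), F(x, 1.0))
```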
<p>b) Show that a contractible space is path connected.</p>
<div class="hint">
<p>Let $Y$ be contractible and let $y_0,y_1$ be points. Then let $f:Y\to Y$ be a constant map $f(y) = c$, and let $F:Y\times I\to Y$ be the homotopy contracting $i_Y$ to $f$, so $F(y,0) = y$ and $F(y,1) = c$. Then for each $y\in Y$, $p_y(t) = F(y,t)$ is a path from $y$ to $c$. Following $p_{y_0}$ and then $p_{y_1}$ backwards gives a path connecting $y_0$ and $y_1$. We’re done here.</p>
</div>
<p>c) Show that if $Y$ is contractible, then for any $X$ the set $[X,Y]$ is a singleton.</p>
<div class="hint">
<p>The proof is very similar to the last two, so I’m not gonna do it. Ah, the pleasures of self-learning. But basically, contractions have inverses (because they’re homotopies too!).</p>
</div>
<p>d) Show that if $X$ is contractible and $Y$ is path connected then $[X,Y]$ is a singleton.</p>
<div class="hint">
<p>The proof that $[I,Y]$ is a singleton is general enough to accommodate this (mutatis mutandis, obvs). Laziness ENGAGE!</p>
</div>
<h2 id="summary">Summary</h2>
<p>We’ve done so many exercises I don’t even feel like I need to work out anymore! (These jokes aren’t really free if you consider the toll they take on you).</p>Aaron NiskinMunkres will surely save us our trespasses this once, but NEVER AGAIN!Let’s talk Homotopy and Algebra2018-06-29T00:00:00-07:002018-06-29T00:00:00-07:00http://aaron.niskin.org/blog/2018/06/29/homotopy<p>So this post is going to be a bit more terse than most. In fact, the objective of this post will be to develop the theory of Homotopy very briefly, with the goal of proving the Fundamental Theorem of Algebra.</p>
<h1 id="homotopy">Homotopy</h1>
<p>Given two continuous maps \(f,g : \mathbb{R}^n\to\mathbb{R}^m\), a homotopy between $f$ and $g$ is a continuous map $h: \mathbb{I}\times\mathbb{R}^n\to\mathbb{R}^m$ such that $h(0,x)=f(x)$ and $h(1,x) = g(x)$. Basically it’s a continuous deformation of one map into the other. It’s a bit stronger than just homeomorphism of topological spaces. You can see this because two interlocked rings are topologically homeomorphic to non-interlocked rings, but they’re not homotopic because you’d have to split one of the rings.</p>
<p>Enough of all that! Let’s get to the proof already!</p>
<h1 id="the-fundamental-theorem-of-algebra">The Fundamental Theorem of Algebra</h1>
<h3 id="the-statement">The statement</h3>
<p><em>If $p\in\mathbb{C}[x]$ is a non-constant polynomial, then it has at least one root in $\mathbb{C}$.</em></p>
<p>A consequence of this theorem is that any non-constant polynomial over $\mathbb{C}$ has all of its roots in $\mathbb{C}$. Or equivalently, any polynomial of degree $n$ over $\mathbb{C}$ has $n$ roots in $\mathbb{C}$ (counted with multiplicity).</p>
<h3 id="the-proof">The proof</h3>
<p>Let $p(x) = \sum\limits_{i=0}^n a_i x^i \in\mathbb{C}[x]$ be a polynomial without any roots in $\mathbb{C}$. For simplicity, we assume that the leading coefficient $a_n$ is 1. There is no loss of generality in doing so. We then define a function</p>
<script type="math/tex; mode=display">f_r(s) = \frac{ p(re^{2\pi is}) / p(r)}{ \|p(re^{2\pi is}) / p(r)\| }</script>
<p>Then for each real $r\geq 0$, $f_r:I\to S^1$ is a loop in $S^1$ based at $1$ (it is well defined because $p$ never vanishes). Furthermore, given $r_0,r_1\geq 0$, we can define a homotopy from $f_{r_0}$ to $f_{r_1}$ as $g(t,s) = f_{r_0+t(r_1-r_0)}(s)$. So for all $r\geq 0$, $f_r$ is homotopic to $f_0$. Since $f_0(s) = 1$ for all $s\in I$, we have that the $f_r$ are all in the same homotopy class as the constant loop on $S^1$.</p>
<p>We now pick $z,r$ such that $|z| = r \gt \max(1,\sum\limits_{i=0}^{n-1} |a_i| )$. We then note that this choice of $r$, together with the triangle inequality in the last step, gives us the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\|z\|^n =& \|z\|\cdot \|z\|^{n-1}\\
>& \left(\sum\limits_{i=0}^{n-1}\|a_i\|\right)\|z\|^{n-1}\\
=& \sum\limits_{i=0}^{n-1}\|a_i\|\|z\|^{n-1}\\
\geq& \sum\limits_{i=0}^{n-1}\|a_iz^i\|\\
\geq& \left\|\sum\limits_{i=0}^{n-1}a_iz^i\right\|
\end{align} %]]></script>
<p>Which means $|z^n| \gt |\sum a_i z^i|$ on the circle $|z|=r$, so $z^n + t\sum a_i z^i \neq 0$ there for every $t\in[0,1]$. So, we can define yet another function $q(t,z) = z^n +t(\sum\limits_{i=0}^{n-1}a_iz^i)$. Then $q$ is a homotopy between $z^n$ (at $t=0$) and $p(z)$ (at $t=1$). If we make a similar construction to $f_r$ but with this parameterized polynomial instead of $p$, we get a homotopy between $f_r$ (with $t=1$) and the loop $s\mapsto e^{2\pi ins}$, which just goes around the circle $n$ times (with $t=0$). But since $f_r$ is homotopic to a constant loop on $S^1$, and going around the circle any non-zero number of times is not homotopic to a constant loop, we know that $n=0$.</p>
<p>So the only polynomials without roots in $\mathbb{C}$ are constant.</p>
<p>Kinda cool, huh?</p>
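<p>If you want to see the “going around the circle $n$ times” part with your own eyes, here is a numerical sketch (mine, not Hatcher’s) that counts how many times $p(re^{2\pi is})$ winds around the origin as $s$ runs over $[0,1]$. The helper names, the step count, and the sample polynomial $z^3+2z+1$ are all assumptions made for illustration. For a big circle the count should be the degree; for a tiny circle it should be zero, so something (a root!) has to get in the way in between.</p>

```python
import cmath
import math


def poly(coeffs, z):
    """Evaluate sum(coeffs[i] * z**i) with Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * z + c
    return acc


def winding_number(coeffs, r, steps=20000):
    """Count how often p(r * e^{2 pi i s}) winds around 0 for s in [0, 1].

    Assumes p has no root on the circle |z| = r and that `steps` is large
    enough for consecutive samples to differ by less than half a turn.
    """
    total = 0.0
    prev = poly(coeffs, complex(r, 0.0))
    for step in range(1, steps + 1):
        z = r * cmath.exp(2j * math.pi * step / steps)
        cur = poly(coeffs, z)
        total += cmath.phase(cur / prev)  # small signed change in angle
        prev = cur
    return round(total / (2 * math.pi))


# Hypothetical example: p(z) = z^3 + 2z + 1, stored lowest degree first.
p = [1, 2, 0, 1]
print(winding_number(p, 10.0), winding_number(p, 0.01))
```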
<p>This proof was courtesy of <em>Algebraic Topology</em> by Allen Hatcher.</p>Aaron NiskinJust a little dip into homotopy via the Fundamental Theorem of AlgebraLet’s Cover Homology (pt 1)2018-04-12T00:00:00-07:002018-04-12T00:00:00-07:00http://aaron.niskin.org/blog/2018/04/12/homology<h1 id="topology">Topology</h1>
<h2 id="what-is-it">What is it?</h2>
<p>Topology is essentially the study of proximity, which is a close analogue for shape. The specifics of how that’s done are not really the point of this post. That’s all for this section.</p>
<h2 id="how-is-it-related-to-analyzing-data">How is it related to analyzing data?</h2>
<p>Most data analysis techniques are already about quantifying some geometric attributes. For instance, when you fit a linear model, you’re really just imposing an affine model on the data, and computing the appropriate parameters (slope and intercept). But these sorts of things work best if you have an idea for what an appropriate shape would be. That’s one area where topology might be able to help. But to hear more about that, you’ll have to wait for my next post.</p>
<p>There is a notion of equivalence (called isomorphism) of topological spaces (bi-continuous bijections). Basically, we consider two topological spaces to be equivalent if one can be continuously transformed into the other and back. That’s actually a very powerful description that captures a lot of non-linearities and things like that. For instance, an egg shape, a sphere, a pyramid, and a tetrahedron are all isomorphic topological spaces, but not a doughnut (torus). And those kinda make sense. But there are some counter-intuitive examples. For instance, if you were to take one of those long clown balloons and tie it up into a poodle, it would also be isomorphic to the bunch above. And, as the joke goes, donuts are isomorphic to coffee cups. I know what you’re thinking. This topology stuff sounds super cool.</p>
<p>So topology would make a wonderful (admittedly incomplete) description of data. But there are some problems: mainly that this whole story breaks down in higher dimensions. Luckily, there is an analogue that can be computed.</p>
<h1 id="homology">Homology</h1>
<h2 id="what-is-it-1">What is it?</h2>
<p>Homology is (in an overly-simplified sentence) counting the number of holes. That’s actually a lot more information than you might initially think. It’s actually sufficient to completely classify the space in lower dimensions, and a good approximation in higher ones. If we knew that info about our data, we’d capture a lot of the non-linear relationships. Either way, it’ll be fun, so let’s get into it.</p>
<h1 id="how-to-computes">How to computes…</h1>
<h2 id="triangulation">Triangulation</h2>
<h3 id="simplices">Simplices</h3>
<p>A <a href="https://en.wikipedia.org/wiki/Simplex">simplex</a> is a collection of $n$ points with the property that no one point lies in the affine subspace spanned by the other $n-1$ points. The idea is that it’s the minimal number of points with an $n-1$ dimensional convex hull – or, said differently, the minimal number of points to make an $n-1$ dimensional shape. For instance, if we consider $\mathbb{R}^2$ as our ambient space, then the only simplices that exist are 0, 1, or 2 simplices: the 0 simplices being points, 1 simplices being line segments and 2 simplices being triangles. If we look at these simplices like building blocks, we can now make something isomorphic to any shape just by glueing a bunch of these simplices together. But we’ll get to that in a bit.</p>
<p>Let $\{v_0,v_1,v_2\}$ be a simplex. Firstly, since it’s a simplex with 3 points, it must be a 2-simplex. And since it’s a simplex, we know that $v_0-v_1\notin span(\{v_2-v_1\})$ (so $v_0$ is not on the line through $v_1$ and $v_2$), and the same relationship holds for all the rest of the points. This simplex is supposed to represent any 2 dimensional contiguous ball-like thing. For instance, since a filled-in circle can be continuously transformed into a filled-in triangle (which is, in essence what a 2-simplex is), our simplex becomes a proxy for such circles, and squares, and rhombi, and octagons, but not the Apple logo. But if you take that stem thing and throw it away so you’re only left with the main apple part of the logo, then yeah, that too.</p>
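<p>For three points in the plane, the simplex condition boils down to “not all on one line”. Here is a small sketch of that test using a 2-by-2 determinant; the function name and the tolerance are my own choices.</p>

```python
def is_2_simplex(v0, v1, v2, tol=1e-12):
    """True if the three planar points are not collinear.

    Equivalent to v0 - v1 not lying in the span of v2 - v1 (for v2 != v1).
    """
    ax, ay = v0[0] - v1[0], v0[1] - v1[1]
    bx, by = v2[0] - v1[0], v2[1] - v1[1]
    # The determinant of [a b] vanishes exactly when a and b are parallel.
    return abs(ax * by - ay * bx) > tol


print(is_2_simplex((0, 0), (1, 0), (0, 1)))  # a genuine triangle
print(is_2_simplex((0, 0), (1, 1), (2, 2)))  # three points on a line
```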
<p>So now if we have any single contiguous $N$-D object without any holes, we can represent it topologically by an $N$-simplex (a triangle when $N=2$). So simplices make the building blocks for general shapes, as we’ll see in a bit.</p>
<h3 id="simplicial-complexes">Simplicial Complexes</h3>
<p>So let’s build a simplicial complex! But wait! I haven’t even defined what that is yet! Meh, let’s just roll with it. So given a bunch of things we can represent by simplices (like the full apple logo, for instance), we can construct a simplicial complex to represent it. Some more examples would be: an air-filled balloon (which we’ll consider to be hollow) on a string, a filled water balloon (which we’ll not consider to be hollow) on a string, or a doughnut. The list goes on, obviously, but I think you get the point. We can represent pretty much anything this way.</p>
<p>The specifics really only come in when we try to construct the simplicial complex itself. In order to make the thing have the nice characteristics we’ll use later on, we’ll need to make sure our complexes have some properties. But first, we need to know what the face of a simplex is:</p>
<p>Given a simplex $\{v_0,v_1,…,v_n\}$, a face is the simplex defined by a subset (e.g. $\{v_1,v_2,…,v_n\}$ is a face, and so is the original simplex). If you think of a tetrahedron, for instance, the faces are the four triangular faces we’re used to calling faces as well as the six lines on the edges, and the four individual points. So the faces are kinda similar to our intuitive notion of a face. Now we can define a simplicial complex. A simplicial complex $K$ is a collection of simplices with the following properties (see <a href="http://mathworld.wolfram.com/SimplicialComplex.html">mathworld</a>):</p>
<ol>
<li>Each face of a simplex in $K$ is itself a simplex in $K$</li>
<li>The intersection of two simplices in $K$ is either empty or a face of both of them (hence a simplex in $K$)</li>
</ol>
<p>It’s the second rule that’s really the crux of everything, as we’ll see. But let’s go through some examples…</p>
<p>The Apple logo can be triangulated by two disjoint 2-simplices (along with all the appropriate faces of the two). So we basically forced the first property to hold, and since the two simplices are disjoint the second property trivially holds.</p>
<p>The filled water-balloon on a string can be triangulated by a 3-simplex attached to a 1-simplex along with all the appropriate faces. The filled water-balloon part is clearly the 3-simplex, and the string is the 1-simplex. So let’s prove that the two properties hold. The first one holds trivially, since we defined it as the union of all the faces. The second property holds because if you take any two simplices in this complex, either they’re both faces of the same simplex (so they intersect in a shared face or not at all), or one comes from the balloon and the other from the string, in which case they intersect at the point on top (a 0-simplex) or they don’t intersect at all! (Pardon the hand waving proof there).</p>
<p>The (hollow) air-filled balloon, on the other hand, is a bit harder to triangulate, but we’ll barrel through it. We can glue four 2-simplices together to be like a hollowed out 3-simplex, then attach a 1-simplex as the string to one of the vertices of the hollow tetrahedron, and we’ll be done. All that’s left is to show the two properties hold. That’s left as an exercise to the reader (ah, the good life).</p>
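<p>If you’d rather let a computer do (part of) the exercise, here is a sketch that treats simplices purely combinatorially, as sets of vertex labels, and checks the first property for the hollow balloon on a string. The vertex labels 0 to 4 and the helper names are made up. In this combinatorial setting the second property comes for free once the first holds (the common vertices of two simplices form a face of both), so the geometric content of property 2 is not being tested here.</p>

```python
from itertools import combinations


def faces(simplex):
    """All non-empty subsets of a simplex, each as a frozenset of vertex labels."""
    vertices = sorted(simplex)
    return {
        frozenset(subset)
        for size in range(1, len(vertices) + 1)
        for subset in combinations(vertices, size)
    }


def closure(top_simplices):
    """Smallest face-closed collection containing the given simplices."""
    complex_ = set()
    for simplex in top_simplices:
        complex_ |= faces(simplex)
    return complex_


def is_closed_under_faces(complex_):
    """Property 1: every face of every simplex is itself in the collection."""
    return all(faces(simplex) <= complex_ for simplex in complex_)


# Hollow tetrahedron on vertices 0..3 (four triangles, no solid 3-simplex),
# plus a string from vertex 3 to a new vertex 4.
balloon = closure([{0, 1, 2}, {0, 1, 3}, {0, 2, 3}, {1, 2, 3}, {3, 4}])

print(len(balloon), is_closed_under_faces(balloon))
```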
<p>The doughnut, on the other hand, is a lot more complicated to triangulate. To do that we’d have to specify if we’re talking about the shell of a doughnut or a filled in one. Although this is a fun exercise, the traditional way to triangulate the torus doesn’t have much to do with what we’re going to be using this stuff for, so I’ll take a more data-science type approach to this. If we have a bunch of data arranged in the shape of a torus, we can just take the union of the simplices defined by taking a point and the 3 nearest points to it, and creating the convex hull of those points (or considering them as a 3-simplex).</p>
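<p>Here is a rough sketch of that nearest-neighbour idea. Everything in it is an assumption for illustration: the grid of sample points on a torus, the function names, and the brute-force neighbour search. It also makes no attempt to check that the four points of each simplex are in general position, or that the result satisfies the two simplicial complex properties.</p>

```python
import math


def torus_points(n_u=8, n_v=4, big_r=2.0, small_r=1.0):
    """Sample a grid of points on a torus embedded in 3D space."""
    points = []
    for i in range(n_u):
        for j in range(n_v):
            u = 2 * math.pi * i / n_u
            v = 2 * math.pi * j / n_v
            ring = big_r + small_r * math.cos(v)
            points.append((ring * math.cos(u), ring * math.sin(u), small_r * math.sin(v)))
    return points


def nearest_neighbor_simplices(points, k=3):
    """For each point, group it with its k nearest points (by index)."""
    simplices = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        simplices.add(frozenset([i] + others[:k]))
    return simplices


points = torus_points()
simplices = nearest_neighbor_simplices(points)
print(len(points), len(simplices))
```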
<h3 id="associations">Associations</h3>
<p>But it would be useful to triangulate the torus using the more traditional approach. To get you started on the right track, you’ll need another tool: association. An association is basically a way of considering things to be equal. It’s kinda like defining specific teleportation or something. I think it’s best to just work out an example. So let’s triangulate some things (like a sphere – the air-filled balloon, for example) but without using the fact that we’re in 3-D space and using associations instead. Let’s start with two 2-simplices joined at one edge (to make something that looks like a square).</p>
<p><img src="/assets/pics/2018/04/12_triangulate_base.png" alt="sphere" /></p>
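<p>We then make the following association: we say anything along the line $(\langle1\rangle,\langle2\rangle)$ is associated with things on the line $(\langle4\rangle,\langle3\rangle)$ – where order is important. If we want to make it more formal, we could parameterize the line connecting $\langle1\rangle$ to $\langle2\rangle$ by a parameter $t$ such that $f_0(0) = \langle1\rangle$ and $f_0(1) = \langle2\rangle$, and likewise the line connecting $\langle4\rangle$ to $\langle3\rangle$ by $f_1$ with $f_1(0) = \langle4\rangle$ and $f_1(1) = \langle3\rangle$. The association then says that $f_0(t)$ and $f_1(t)$ are the same point for every $t$. We describe this visually like so:</p>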
<p><img src="/assets/pics/2018/04/12_tube.png" alt="tube" /></p>
<p>We’re considering all the points on the top line to be associated with (or equal to) their corresponding points on the bottom line. You can imagine this to be kinda like pacman: when you go off the top edge, you come back from the bottom edge in the same direction (because of the association that we’ve made). So, we might as well fold the paper over and tape the two sides together so as to better visualize the association we’ve made. But when we do that, we notice we have a tube!</p>
<p>So the association shown above can be described as a tube. Unfortunately, it’s not a simplicial complex though. This can be seen because if we call the top-left triangle $A$, and the bottom-right one $B$, the intersection of $A$ and $B$ is the diagonal line $(\langle1\rangle,\langle3\rangle)$ and the associated line (it’s only one because the two are associated – hence the same line) $(\langle1\rangle,\langle2\rangle) = (\langle4\rangle,\langle3\rangle)$. So the intersection is two lines, which is a simplicial complex, but not a simplex! So this triangulation doesn’t have the second property that we require all simplicial complexes to have.</p>
<p>We could make it a simplicial complex by dividing the thing into a 9 square grid (so 18 triangles) and then doing the same thing again with the top being associated with the bottom, but it’s a lot to write out. Try it out though. See if you can make it a simplicial complex. If you run into any issues, feel free to email me.</p>
<p>Moving on… Now if we just flipped the association, we get a möbius strip!</p>
<p><img src="/assets/pics/2018/04/12_mobius.png" alt="möbius" /></p>
<p>It might be useful to actually break out a piece of paper and do this if you’ve never done it before.</p>
<p>If we take the tube and add this one association it becomes a torus (a hollow doughnut)!</p>
<p><img src="/assets/pics/2018/04/12_torus.png" alt="torus" /></p>
<p>Notice that we identify which sides are connected by the fact that they share the same symbol (the double arrow is connected to the other double arrow, and the single arrow does likewise).</p>
<p>If you’re trying it out on paper, make sure you label the points so you can see them! If you reverse the direction of one side of the second association, you get something very different – a Klein bottle!</p>
<p><img src="/assets/pics/2018/04/12_klein.png" alt="klein bottle" /></p>
<p>I feel like it’s probably a bit cruel of me to have put this after the picture, in case you’ve been trying to make the last one with your paper… It turns out, you can’t make the last one in 3D space. You need to either use an extra dimension or tear the paper, unfortunately. But, it’s interesting to see that flipping the association completely changes the thing – kinda like chirality in chemistry.</p>
<p>Likewise, if we take the möbius strip and make the following additional association, we get something called the Projective Plane (\(\mathbb{P}^2\) – also not embeddable in 3D space).</p>
<p><img src="/assets/pics/2018/04/12_projective.png" alt="projective plane" /></p>
<p>And finally, the following association leads us to a sphere, or, more precisely, the surface of a sphere.</p>
<p><img src="/assets/pics/2018/04/12_sphere.png" alt="sphere" /></p>
<p>Then, since all the points on the top line (and the bottom, left, and right lines) are all associated, imagine contracting them while keeping the inside of the bag relatively unchanged (as far as area is concerned). So necessarily the bag will puff up like a bubble as the sides squish down to a smaller and smaller hole. Eventually, the edge becomes just a point, and what do we have? Yup, we’ve got a sphere. So that’s one way we can triangulate a sphere.</p>
<p>The problem is that none of these are simplicial complexes. Why not? Can you fix these so that they become simplicial complexes?</p>
<h2 id="summary">Summary</h2>
<p>So we’ve triangulated some things, defined what simplices and simplicial complexes are, but we haven’t yet found their utility. For that, just stay tuned! In the next episode we’ll go over Betti Numbers and how to compute them using these simplicial complexes!</p>Aaron NiskinWhat does topology have to do with data?! And what is this topology thing anyway?Decomposition of Autoregressive Models2017-10-25T13:52:39-07:002017-10-25T13:52:39-07:00http://aaron.niskin.org/blog/2017/10/25/Autoregression<h1 id="autoregression">Autoregression</h1>
<h2 id="background">Background</h2>
<p>This post will be a formal introduction to some of the theory of Autoregressive models. Specifically, we’ll tackle how to decompose an AR(p) model into a bunch of AR(1) models. We’ll also discuss how to interpret these models. This post will develop the subject using what seems to be an atypical approach, but one that I find to be very elegant.</p>
<!-- more -->
<h1 id="the-traditional-way">The traditional way</h1>
<p>Let $x_t$ be an AR(p) process. So $x_t = \sum\limits_{i=1}^pa_ix_{t-i} + w_t$. We can express $x_t$ thusly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& \sum\limits_{i=1}^pa_ix_{t-i} + w_t \\
x_t - \sum\limits_{i=1}^pa_ix_{t-i} =& w_t \\
\left(1 - \sum\limits_{i=1}^pa_iL^i\right)x_t =& w_t
\end{align*} %]]></script>
<p>Where $L$ is the <strong>lag operator</strong>. We define the AR polynomial $\Phi$ as $\Phi(L) := \left(1 - \sum\limits_{i=1}^pa_iL^i\right)$. Then as long as we’re considering this polynomial to be over an algebraically closed field (like $\mathbb{C}$), we can factor this polynomial. So</p>
<script type="math/tex; mode=display">\Phi(L) = c\prod\limits_{i=1}^p(L-\varphi_i)</script>
<p>Where $\varphi_i$ are the roots of $\Phi$. This polynomial will be the star of our show today. In fact, if we want to understand $\Phi$ (the AR process we’re considering), we need only understand each $x_t = \varphi_i^{-1}x_{t-1} + w_t$ model. We’ll actually prove this later.</p>
<p>We’ll be investigating several of its properties but first (just for fun) let’s investigate the conditions under which it’s invertible.</p>
<p>We can reduce the problem of determining whether or not the AR polynomial is invertible to the problem of inverting each factor. After all, the polynomial is invertible if and only if each factor is. So let’s restrict our considerations to just a single factor for the time being. When will $x-\lambda$ be invertible?</p>
<p>Now, we recall the Taylor Series Expansion formula from our trusty calculus class.</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(a)}{i!}(x-a)^i</script>
<p>In general, we tend to use $a=0$ as a good arbitrary choice (the Maclaurin Series), resulting in:</p>
<script type="math/tex; mode=display">f(x) = \sum\limits_{i=0}^\infty \frac{f^{(i)}(0)}{i!}x^i</script>
<p>Well, if we consider the Taylor expansion of $\frac{1}{x-\lambda}$, we find</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\frac{1}{x-\lambda} =& -\frac{1}{\lambda} - \frac{1}{\lambda^2}x - \frac{1}{\lambda^3}x^2 - \dots \\
=& -\sum\limits_{i=0}^\infty \frac{1}{\lambda^{i+1}}x^{i}
\end{align*} %]]></script>
<p>The coefficients of this series are square summable if and only if $|\lambda| > 1$. So we see that if the AR roots are all outside the unit circle, then the AR polynomial will be invertible (in such a case, the AR process is called <strong>causal</strong>).</p>
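<p>As a sanity check, here is a tiny sketch comparing partial sums of the geometric-series form $-\sum_{i\geq 0} x^i/\lambda^{i+1}$ against $\frac{1}{x-\lambda}$ directly, and adding up the squared coefficients. The function names and the sample values $\lambda=2$, $x=0.5$ are arbitrary choices of mine.</p>

```python
def inverse_partial_sum(x, lam, terms):
    """Partial sum of -sum_{i >= 0} x**i / lam**(i + 1), the expansion of 1 / (x - lam)."""
    return -sum(x**i / lam ** (i + 1) for i in range(terms))


def coefficient_square_sum(lam, terms):
    """Partial sum of the squared coefficients 1 / lam**(2 * (i + 1))."""
    return sum(1.0 / lam ** (2 * (i + 1)) for i in range(terms))


lam, x = 2.0, 0.5
print(inverse_partial_sum(x, lam, 60), 1.0 / (x - lam))

# For |lam| > 1 the squared coefficients form a convergent geometric series
# with limit 1 / (lam**2 - 1); for |lam| <= 1 the terms do not even shrink.
print(coefficient_square_sum(lam, 60), 1.0 / (lam**2 - 1.0))
```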
<h1 id="the-new-approach">The new(?) approach</h1>
<h2 id="definition">Definition</h2>
<p>Let $X_t\in \mathbb{R}^n, A_i\in\mathbb{R}^{n\times n}, B\in\mathbb{R}^{n\times k_0}$, and $p\in\mathbb{N}_{>0}$. Also, let $W_t$ be $k_0$ dimensional white noise. Then, $X_t$ is called a VAR(p) process if,</p>
<script type="math/tex; mode=display">X_t = \sum\limits_{i=1}^p A_iX_{t-i} + BW_t</script>
<p>Basically a VAR(p) model is an analogue of AR(p) models where the scalars are matrices and the variable itself is a vector. The noise in such a model need not be independent per coordinate, nor restricted to only one. In fact, we can have any number of driving noise processes that get mixed together linearly into $X_t$. Hence cometh $B$ – the linear transformation from the driving noise space into the $X_t$ space.</p>
<p>Furthermore, let’s say that we don’t observe $X_t$ directly, but rather we observe some $Y_t$ that’s a linear function of $X_t$ (plus some noise). Formally, we observe $Y_t$ s.t.:</p>
<script type="math/tex; mode=display">Y_t = C X_t + D U_t</script>
<p>Where $Y_t \in\mathbb{R}^m, C\in\mathbb{R}^{m\times n}, D\in \mathbb{R}^{m\times k_1}$. And again, $U_t$ is $k_1$ dimensional white noise.</p>
<p>Here $D$ serves the same purpose as $B$ did above, but for the observations themselves.</p>
<p>This is called a “State Space” model.</p>
<h2 id="examplificate">Examplificate</h2>
<p>How does this relate to our star (the AR polynomial $\Phi$)? Let’s try to express an AR(p) process as a VAR(1) process…</p>
<p>Let $x_t$ be an AR(p) process. That is to say, $x_t = \sum\limits_{i=1}^p a_ix_{t-i} + w_t$ where $w_t$ is some white noise process.</p>
<p>Let $W_t$ be a 1 dimensional white-noise process such that $W_t = w_t$. We define</p>
<script type="math/tex; mode=display">X_t := \left[\begin{matrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(p-1)} \end{matrix}\right]</script>
<p>Then, we can see that the first coordinate of $X_t$ is always $x_t$. Our notation for this will be $X_t[0] = x_t$.</p>
<p>Furthermore, let</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \left[\begin{matrix}
a_1 & a_2 & \dots & a_{p-1} & a_p \\
1 & 0 & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & 0 \\
\end{matrix} \right] %]]></script>
<p>We can see that for $0 < i < p$, we have $(AX_{t-1})[i] = x_{t-i}$, and $(AX_{t-1})[0] = \sum\limits_{i=1}^p a_i x_{t-i} = x_t - w_t$. As it stands, $AX_{t-1}$ is almost $X_t$, just without the added noise in the first coordinate. So let’s add noise to only the first coordinate. To this end, let</p>
<script type="math/tex; mode=display">B := \left[\begin{matrix} 1 \\ 0 \\ \vdots \\ 0 \end{matrix}\right]</script>
<p>Then</p>
<script type="math/tex; mode=display">X_t = AX_{t-1} + BW_t</script>
<p>and</p>
<script type="math/tex; mode=display">\forall t \in \mathbb{N}_{>p} (X_t[0] = x_t)</script>
<p>Now we want $Y_t = x_t$. So we set</p>
<script type="math/tex; mode=display">% <![CDATA[
C = \left[\begin{matrix}1 & 0 & 0 & \dots & 0\end{matrix}\right] %]]></script>
<p>That way, $Y_t = CX_t = x_t$ and all is right with the world.</p>
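<p>To convince ourselves that nothing got lost in translation, here is a small simulation sketch that runs the same AR(p) recursion twice: once directly, and once through the companion matrix $A$ with $B$ and $C$ both picking out the first coordinate. The coefficients, the starting values, and the helper names are all made up for illustration.</p>

```python
import random


def simulate_ar(a, noise, init):
    """Direct recursion x_t = sum_i a[i-1] * x_{t-i} + w_t, starting from init = [x_0, ..., x_{p-1}]."""
    xs = list(init)
    for w in noise:
        xs.append(sum(a[i] * xs[-1 - i] for i in range(len(a))) + w)
    return xs[len(init):]


def companion_matrix(a):
    """First row holds the AR coefficients; the sub-diagonal of ones shifts the state down."""
    p = len(a)
    A = [[0.0] * p for _ in range(p)]
    A[0] = list(a)
    for i in range(1, p):
        A[i][i - 1] = 1.0
    return A


def matvec(A, x):
    return [sum(r * c for r, c in zip(row, x)) for row in A]


def simulate_var1(a, noise, init):
    """Same process as X_t = A X_{t-1} + B W_t, observed through Y_t = C X_t."""
    A = companion_matrix(a)
    X = list(reversed(init))  # X = [x_{t-1}, x_{t-2}, ..., x_{t-p}]
    ys = []
    for w in noise:
        X = matvec(A, X)
        X[0] += w        # B = [1, 0, ..., 0]^T adds noise to the first coordinate only
        ys.append(X[0])  # C = [1, 0, ..., 0] reads the first coordinate back out
    return ys


rng = random.Random(0)
a = [0.5, -0.2, 0.1]          # hypothetical AR(3) coefficients
init = [1.0, 0.5, -0.3]       # hypothetical x_0, x_1, x_2
noise = [rng.gauss(0.0, 1.0) for _ in range(50)]

direct = simulate_ar(a, noise, init)
state_space = simulate_var1(a, noise, init)
print(max(abs(x - y) for x, y in zip(direct, state_space)))
```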
<p>So yay! We’ve expressed this AR(p) process as a $p$ dimensional VAR(1) process! So… Why?</p>
<h2 id="why-do">Why do?</h2>
<p>Well, one natural question to ask at this point is, what are the eigenvalues of $A$?</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& \det(A - \lambda I) \\
=& \left|\begin{matrix}
a_1 - \lambda & a_2 & \dots & a_{p-1} & a_p \\
1 & -\lambda & \dots & 0 & 0 \\
0 & 1 & \dots & 0 & 0 \\
& & \ddots & & \\
0 & \dots & 0 & 1 & -\lambda \\
\end{matrix} \right| \\
=& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p
\end{align*} %]]></script>
<p>This polynomial is the characteristic polynomial of $A$, so the roots of this polynomial are the eigenvalues of $A$. Let’s see how this relates to $\Phi$…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0 =& (-\lambda)^p + a_1(-\lambda)^{p-1} - a_2(-\lambda)^{p-2} + \dots + (-1)^{p-1}a_p \\
(-\lambda)^{-p}(0) =& 1 - a_1\lambda^{-1} - a_2\lambda^{-2} - \dots - a_p\lambda^{-p} \\
\text{Define } L :=& \lambda^{-1} \\
0 =& 1 - a_1L - a_2L^2 - \dots - a_pL^p \\
=& 1 - \sum\limits_{i=1}^pa_iL^i \\
=& \Phi(L)
\end{align*} %]]></script>
<p>So the characteristic polynomial is really just $\Phi(L)$ revisited! Furthermore, check this sweetness out…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\Phi(L) =& c\prod\limits_{i=1}^p\left(L-\varphi_i\right) \\
=& c\prod\limits_{i=1}^p\left(\lambda^{-1}-\varphi_i\right) \Rightarrow \\
\lambda_i =& \frac{1}{\varphi_i}
\end{align*} %]]></script>
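<p>This relationship is easy to check numerically. The following is only a sketch with made-up coefficients: it computes the eigenvalues of $A$ with NumPy, computes the roots of $\Phi$ separately, and checks that every eigenvalue is the reciprocal of some root.</p>

```python
import numpy as np

a = np.array([0.5, -0.3, 0.1])     # hypothetical AR(3) coefficients
p = len(a)
A = np.zeros((p, p))
A[0, :] = a
A[1:, :-1] = np.eye(p - 1)

eigvals = np.linalg.eigvals(A)

# Phi(L) = 1 - a_1 L - ... - a_p L^p; np.roots expects the highest power first.
phi_roots = np.roots(np.concatenate((-a[::-1], [1.0])))

# Each eigenvalue should match the reciprocal of one of the roots of Phi.
reciprocals = 1.0 / phi_roots
all_match = all(np.isclose(reciprocals, lam).any() for lam in eigvals)
```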
<p>So we can see that the eigenvalues are actually intimately related with the roots (in fact they’re reciprocals of each other)! With that in mind, let’s see what happens when we try to diagonalize this matrix $A$. Firstly, let’s assume that the matrix $A$ is diagonalizable. I.e., let</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
H^{-1}\bar AH =& A \Leftrightarrow \\
\bar A =& HAH^{-1}
\end{align*} %]]></script>
<p>where $H$ is a coordinate transformation (an invertible map) and $\bar A$ is diagonal.</p>
<p>Then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
X_t =& H^{-1}\bar AH X_{t-1} + BW_t \\
=& H^{-1}\left(\bar AHX_{t-1} + HBW_t\right)
\end{align*} %]]></script>
<p>So if we let $\tilde X_t = HX_t$, we get</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde X_t =& HX_t \\
=& HH^{-1}\left(\bar AHX_{t-1} + HBW_t\right) \\
=& \bar AHX_{t-1} + HBW_t \\
=& \bar A\tilde X_{t-1} + HBW_t
\end{align*} %]]></script>
<p>So $\tilde X_t$ is a bunch of AR(1) processes (because the matrix $\bar A$ is diagonal), and,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
Y_t =& CX_t \text{ because $U_t=0$ for our scenario}\\
=& CH^{-1}\tilde X_t \\
=& \left(CH^{-1}\right)\tilde X_t
\end{align*} %]]></script>
<p>Which is to say that $Y_t$ is a linear combination of AR(1) processes! So we started out with a general AR(p) process, and found that expressing it as a VAR(1) model allows us to see that this is just a linear combination of AR(1) processes. I think that’s pretty cool. But let’s investigate this a bit further…</p>
<h1 id="what-more">What more?!</h1>
<p>As per our previous section, we have that $Y_t$, which was really just $x_t$ – our original AR(p) variable, is a linear combination of AR(1) models. More specifically, if we let $F := CH^{-1}$, then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
x_t =& Y_t \\
=& F\tilde X_t
\end{align*} %]]></script>
<p>So indeed, $x_t$ is a linear combination of $p$ many AR(1) processes. In fact, since $\tilde X_t = \bar A \tilde X_{t-1} + \left(HB\right)W_t$, we know the AR(1) models explicitly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\tilde x_{t,i} =& \lambda_i\tilde x_{t-1,i} + w_{t,i} \\
\tilde x_{t,i} - \lambda_i\tilde x_{t-1,i} =& w_{t,i} \\
\left(1-\lambda_iL\right)\tilde x_{t,i} =& w_{t,i} \\
-\lambda_i\left(L-\frac{1}{\lambda_i}\right)\tilde x_{t,i} =& w_{t,i}
\end{align*} %]]></script>
<p>Where $\lambda_i$ is the $i$th eigenvalue of $A$ (hence the $(i,i)$th element of $\bar A$), $w_{t,i}$ is the $i$th coordinate of $HBW_t$, and $L$ here is the lag operator. Since $1/\lambda_i = \varphi_i$, we see that the AR(1) processes each have their respective roots from the roots of the original AR polynomial $\Phi$. Pretty cool, right?</p>
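<p>Here is a small numerical sketch of the decomposition, again with made-up coefficients. It uses NumPy’s eigendecomposition $A = V\bar AV^{-1}$, so $H = V^{-1}$ in the notation above, runs the $p$ decoupled AR(1) recursions next to the original stacked recursion, and tracks how far $F\tilde X_t$ drifts from $x_t$. Note that the eigenvalues (and so the AR(1) components) may be complex even when $x_t$ is real.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.5, -0.3, 0.1])     # hypothetical AR(3) coefficients
p, T = len(a), 300
A = np.zeros((p, p))
A[0, :] = a
A[1:, :-1] = np.eye(p - 1)
B = np.eye(p)[:, 0]
C = np.eye(p)[0, :]

lam, V = np.linalg.eig(A)          # A = V diag(lam) V^{-1}
H = np.linalg.inv(V)               # so H A H^{-1} = diag(lam)
F = C @ V                          # F = C H^{-1}
HB = H @ B

w = rng.normal(size=T)
X = np.zeros(p)                    # stacked state X_t
Xt = np.zeros(p, dtype=complex)    # transformed state, one AR(1) per coordinate
max_err = 0.0
for t in range(T):
    X = A @ X + B * w[t]
    Xt = lam * Xt + HB * w[t]      # elementwise update: p separate AR(1) steps
    max_err = max(max_err, abs(F @ Xt - X[0]))
```

<p>Under these assumptions <code>max_err</code> should stay at the level of floating point noise.</p>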
<h2 id="summary">Summary</h2>
<p>So we can express any AR(p) model as a linear combination of AR(1) processes (albeit with correlated noise terms), where each AR(1) process is determined by the roots of the AR polynomial (or the characteristic polynomial).</p>
<p>We’ve essentially reduced the problem of studying AR(p) models to studying their eigenvalues (or AR roots).</p>Aaron NiskinWe discuss the decomposition of AR models into components, and how eigenvalues are involved (because they always are).Markov Chain Monte Carlo2017-04-19T13:52:39-07:002017-04-19T13:52:39-07:00http://aaron.niskin.org/blog/2017/04/19/MCMC<h1 id="text--mc2"><script type="math/tex">\text{???} = (MC)^2</script></h1>
<h2 id="background">Background</h2>
<p>What if we know the relative likelihood, but want the probability distribution?</p>
<script type="math/tex; mode=display">\mathbb{P}(X=x) = \frac{f(x)}{\int_{-\infty}^\infty f(x)dx}</script>
<p>But what if \( \int f(x)dx \) is hard, or you can’t sample from \( f \) directly?</p>
<p>This is the problem we will be trying to solve.</p>
<!-- more -->
<h2 id="first-approach">First approach</h2>
<p>If the space is bounded (the integral is between \( a \) and \( b \)), we can use Monte Carlo to estimate \( \int\limits_a^b f(x)dx \):</p>
<ol>
<li>Pick \( \alpha \in (a,b) \) uniformly at random</li>
<li>Compute \( (b-a)f(\alpha) \)</li>
<li>Repeat as necessary</li>
<li>Compute the average of the computed values</li>
</ol>
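<p>A minimal sketch of this recipe follows. The function name <code>mc_integrate</code> and the test integrand are just illustrative choices; the estimator averages \( f \) at uniform draws and scales by the interval length \( b-a \).</p>

```python
import numpy as np

def mc_integrate(f, a, b, n=100_000, seed=0):
    """Estimate the integral of f over (a, b) by uniform Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(a, b, size=n)      # step 1, repeated n times
    return (b - a) * np.mean(f(alpha))     # scale by the interval length, then average

# Integral of x^2 over (0, 3); the exact value is 9.
estimate = mc_integrate(lambda x: x**2, 0.0, 3.0)
```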
<h2 id="what-if-its-a-big-ol-bag-of-nope">What if it’s a big ol’ bag of nope?</h2>
<p>What if we can’t sample from \( f(x) \) but can only determine likelihood ratios?</p>
<script type="math/tex; mode=display">\frac{f(x)}{f(y)}</script>
<p>Enter Markov Chain Monte Carlo</p>
<h1 id="the-obligatory-basics">The obligatory basics</h1>
<p>A <strong>Markov Chain</strong> is a stochastic process (a collection of indexed random variables) such that \( \mathbb{P}(X_n=x|X_0,X_1,\dots,X_{n-1}) = \mathbb{P}(X_n=x|X_{n-1}) \).</p>
<p>In other words, the conditional probabilities only depend on the last state, not on any deeper history.</p>
<p>We call the set of all possible values of \( X_i \) the <strong>state space</strong> and denote it by \( \chi \).</p>
<h3 id="transition-matrix">Transition Matrix</h3>
<p>Let \( p_{ij} = \mathbb{P}(X_1 = j | X_0 = i) \). We call the matrix \( (p_{ij}) \) the <strong>Transition Matrix</strong> of \( X \) and denote it \( P \).</p>
<p>Let \( \mu_n = \left(\mathbb{P}(X_n=1), \mathbb{P}(X_n=2),\dots, \mathbb{P}(X_n=l)\right) \) be the row vector of probabilities of being at each state at the \( n \)th point in time (iteration).</p>
<p>Claim: \( \mu_{i+1} = \mu_0P^{i+1} \)</p>
<p>Proof:</p>
<p>if \( i = 0 \):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
(\mu_0 P)_j =& \sum\limits_{k=1}^l\mu_0(k)p_{kj}\\
=&\sum\limits_{k=1}^l\mathbb{P}(X_0=k)\mathbb{P}(X_1=j | X_0=k)\\
=&\mathbb{P}(X_1 = j)\\
=&\mu_1(j)
\end{align*} %]]></script>
<p>So: \( \mu_1 = \mu_0P \)</p>
<p>if \( i > 0 \):</p>
<p>Then, \( \mu_{i+1} = \mu_iP = (\mu_0P^i)P = \mu_0P^{i+1} \)</p>
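<p>The claim is easy to check on a toy chain. The transition matrix below is a made-up 3-state example; the sketch steps the distribution forward one multiplication at a time and compares against \( \mu_0P^n \).</p>

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])
mu0 = np.array([1.0, 0.0, 0.0])    # start in the first state with certainty

n = 5
mu = mu0.copy()
for _ in range(n):
    mu = mu @ P                    # one step: mu_{i+1} = mu_i P

mu_closed_form = mu0 @ np.linalg.matrix_power(P, n)
```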
<h3 id="stationary-distributions">Stationary Distributions</h3>
<p>We say that a distribution \( \pi \) is <strong>stationary</strong> if</p>
<script type="math/tex; mode=display">\begin{gather*}\pi P = \pi\end{gather*}</script>
<h3 id="main-theorem">Main Theorem:</h3>
<p>An irreducible, ergodic Markov Chain \( \{X_n\} \) has a unique stationary distribution, \( \pi \). The limiting distribution exists and is equal to \( \pi \). And furthermore, if \( g \) is any bounded function, then with probability 1:</p>
<script type="math/tex; mode=display">\lim\limits_{N\to\infty}\frac{1}{N}\sum\limits_{n=1}^Ng(X_n) = E_\pi(g)</script>
<p>This really just means that if our Markov Chain has certain properties (basically just that any state can be reached from any other state, and that the chain doesn’t get stuck in a fixed cycle), then we can sample from this Markov Chain, and in the long run the samples will have the same distribution as a generic sample from our desired distribution.</p>
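<p>To make the theorem concrete, here is a rough sketch on a made-up 3-state chain. It finds the stationary distribution as a left eigenvector of \( P \) with eigenvalue 1, then simulates the chain and compares the time average of an arbitrary function \( g \) against \( E_\pi(g) \).</p>

```python
import numpy as np

# Hypothetical irreducible, aperiodic 3-state chain.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# pi P = pi means pi is an eigenvector of P transposed with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()                 # normalize so the entries sum to 1

# Simulate the chain and average g along the path.
rng = np.random.default_rng(0)
g = np.array([1.0, 4.0, 9.0])      # arbitrary bounded function of the state
state, total, N = 0, 0.0, 50_000
for _ in range(N):
    state = rng.choice(3, p=P[state])
    total += g[state]
time_avg = total / N
expected = pi @ g
```

<p>With enough iterations, <code>time_avg</code> should land close to <code>expected</code>, even though the chain started from a single fixed state.</p>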
<h2 id="random-walk-metropolis---hastings">Random-Walk-Metropolis - Hastings:</h2>
<p>Let \( f(x) \) be the relative likelihood function of our desired distribution.</p>
<p>And let \( q(y|x_i) \) be a known proposal distribution that is easy to sample from (generally taken to be \( N(x_i,b^2) \) ).</p>
<p>1) Given \( X_0,X_1,\dots,X_i \), pick \( Y \sim q(y|X_i) \)</p>
<p>2) Compute \( r(X_i,Y) = \min\left(\frac{f(Y)q(X_i|Y)}{f(X_i)q(Y|X_i)}, 1\right) \)</p>
<p>3) Pick \( a \sim U(0,1) \)</p>
<p>4) Set</p>
<script type="math/tex; mode=display">% <![CDATA[
X_{i+1} = \begin{cases} Y & \text{if } a < r \\ X_i & \text{otherwise}\end{cases} %]]></script>
<h2 id="the-confidence-builder">The Confidence Builder:</h2>
<p>We would like to sample from and obtain a histogram of the Cauchy distribution:</p>
<script type="math/tex; mode=display">f(x) = \frac{1}{\pi}\frac{1}{1+x^2}</script>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
q_{01} \sim& N(x,0.1)\\
q_1 \sim& N(x,1)\\
q_{10} \sim& N(x,10)\end{align*} %]]></script>
<p>Note: Since \( q \) is symmetric, \( q(x|y) = q(y|x) \).</p>
<h2 id="pseudocode">Pseudocode:</h2>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MC</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">X_i</span> <span class="o">=</span> <span class="n">initial_state</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">num_iters</span><span class="p">)):</span>
    <span class="n">Y</span> <span class="o">=</span> <span class="n">q</span><span class="p">(</span><span class="n">X_i</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span><span class="o">/</span><span class="n">f</span><span class="p">(</span><span class="n">X_i</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">()</span>
    <span class="k">if</span> <span class="n">a</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
        <span class="n">X_i</span> <span class="o">=</span> <span class="n">Y</span>
    <span class="n">MC</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">X_i</span><span class="p">)</span>
</code></pre></div></div>
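<p>For anyone who wants to run it, here is one way the pseudocode might be fleshed out for the Cauchy target. This is a sketch, not the code behind the plots in this post: the function name, the seed handling, and the choice \( b = 1 \) are all assumptions.</p>

```python
import numpy as np

def f(x):
    """Unnormalized Cauchy density; the 1/pi factor cancels in the ratio."""
    return 1.0 / (1.0 + x**2)

def random_walk_metropolis(f, b, num_iters, initial_state=0.0, seed=0):
    rng = np.random.default_rng(seed)
    chain = np.empty(num_iters)
    x = initial_state
    for i in range(num_iters):
        y = rng.normal(x, b)           # symmetric proposal N(x, b^2)
        r = min(f(y) / f(x), 1.0)      # the q terms cancel by symmetry
        if rng.uniform() < r:
            x = y                      # accept; otherwise keep the current state
        chain[i] = x
    return chain

chain = random_walk_metropolis(f, b=1.0, num_iters=50_000)
# The quartiles of the standard Cauchy are -1, 0, and 1: a quick sanity check.
quartiles = np.quantile(chain, [0.25, 0.5, 0.75])
```

<p>Comparing quantiles rather than the sample mean is deliberate here, since the Cauchy distribution has no mean to converge to.</p>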
<h2 id="estimation">Estimation:</h2>
<script type="math/tex; mode=display">q(y|x) \sim N(x,b^2)</script>
<p><img src="/assets/pics/2017/04/18-MCMC-Cauchy-Estimation.png" alt="The many distributions" /></p>
<p><img src="/assets/pics/2017/04/18-MCMC-Cauchy-Estimation_TS.png" alt="png" /></p>
<p>So you can clearly see that the different values of \( b \) lead to wildly different distributions, and therefore wildly different approximations of the end distribution. For example, if the standard deviation on our sampling distribution is very small, our prospective transitions won’t venture out very far, and hence the tails of our distribution will be woefully underrepresented. If, on the other hand, the standard deviation is too large, we will often venture out much further than we should, and we’ll end up with something like a multi-modal distribution. Basically, the prospective transition states will often be too far out to accept, in which case the current state persists, or it will be just close enough to be accepted. What you end up with is far too many samples from the tails and not enough samples from the beefier parts of the distribution.</p>
<p>Choosing the “best” value of \( b \) is more of an art than a science, really. This issue will rear its ugly head a little later. But even with that issue, we see clearly that something amazing is happening here.</p>
<h2 id="how-is-this-happening">How is this happening?</h2>
<p>A property called <strong>detailed balance</strong>, which means,</p>
<script type="math/tex; mode=display">\pi_ip_{ij} = p_{ji}\pi_j</script>
<p>or in the continuous case:</p>
<script type="math/tex; mode=display">f(x)P_{xy} = f(y)P_{yx}</script>
<h3 id="the-intuition">The intuition</h3>
<p>Since what we’re really after is the distribution of the samples and not the order in which they were picked, it’s reasonable to require a notion of (what I’m going to call) equal hopportunity:</p>
<script type="math/tex; mode=display">f(x)p(y|x) = f(y)p(x|y)</script>
<p>In other words, the probability of “hopping” from \( x \) to \( y \) should be the same as the reverse, because at the end of the day, both of these two scenarios only mean that \( x \) and \( y \) are in our sample.</p>
<p>If you work out the math therein (as we will do below), you find that this ratio guarantees equal hopportunity.</p>
<script type="math/tex; mode=display">r(x,y) = \min\left(\frac{f(y)q(x|y)}{f(x)q(y|x)},1\right)</script>
<h3 id="lets-prove-it">Let’s prove it!</h3>
<p>To reiterate, the claim was that this detailed balance property ( \( f(x)P_{xy} = P_{yx}f(y) \) ) guarantees that our likelihood function \( f \) is the stationary distribution for our Markov Chain.</p>
<p>Let \( f \) be the desired distribution (in our example, it was the Cauchy Distribution), and let \( q(y|x) \) be the distribution we draw from.</p>
<p>First we’ll show that detailed balance implies \( \pi \) (or \( f \) if the distribution is continuous) is the stationary distribution!</p>
<p>Suppose \( \pi_iP_{ij} = \pi_jP_{ji} \) in the discrete case, or \( f(x)p(y|x)=f(y)p(x|y) \) in the continuous case. Then:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
(\pi P)_i =& \sum\limits_j\pi_jP_{ji} & (fP)(x)=&\int f(y)p(x|y)dy\\
=&\sum\limits_j\pi_iP_{ij} & =& \int f(x)p(y|x)dy\\
=&\pi_i\sum\limits_jP_{ij} & =& f(x)\int p(y|x)dy\\
=&\pi_i & =& f(x)
\end{align*} %]]></script>
<p>Now we’ll show that the Markov Chain defined by the Metropolis-Hastings algorithm has the detailed balance property (and hence \( f \) is the stationary distribution).</p>
<p>Without any loss of generality, we’ll assume \( f(x)q(y|x) > f(y)q(x|y) \) (if this is not the case, the same proof works with the roles of \( x \) and \( y \) swapped).</p>
<p>Note: Since \( f(x)q(y|x) > f(y)q(x|y) \), we know, \( r(x,y) = \frac{f(y)q(x|y)}{f(x)q(y|x)} \), and \( r(y,x) = 1 \).</p>
<p>Then,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\pi_xp_{xy} =& f(x)\mathbb{P}(X_1=y|X_0=x) & \pi_yp_{yx} =& f(y)\mathbb{P}(X_1=x|X_0=y)\\
=& f(x)\left[q(y|x)\cdot r(x,y)\right] & =&f(y)\left[q(x|y)\cdot r(y,x)\right]\\
=& f(x)\left[q(y|x)\cdot \frac{f(y)q(x|y)}{f(x)q(y|x)}\right] & =&f(y)\left[q(x|y)\cdot 1\right]\\
=& f(y)q(x|y) & =&f(y)q(x|y)
\end{align*} %]]></script>
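<p>The same argument can be checked numerically on a small discrete example. Everything below is made up for illustration: a 5-state unnormalized target \( f \) and a symmetric proposal that steps left or right on a ring. The sketch builds the Metropolis-Hastings transition matrix explicitly and tests both detailed balance and stationarity.</p>

```python
import numpy as np

# Hypothetical unnormalized target on 5 states.
f = np.array([1.0, 3.0, 5.0, 2.0, 4.0])
n = len(f)

# Symmetric proposal: step left or right with probability 1/2, wrapping around.
Q = np.zeros((n, n))
for i in range(n):
    Q[i, (i - 1) % n] = 0.5
    Q[i, (i + 1) % n] = 0.5

# Metropolis-Hastings transition matrix: propose with Q, accept with probability r.
P = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j and Q[i, j] > 0:
            r = min(f[j] * Q[j, i] / (f[i] * Q[i, j]), 1.0)
            P[i, j] = Q[i, j] * r
    P[i, i] = 1.0 - P[i].sum()     # rejected proposals leave the state unchanged

pi = f / f.sum()
flow = pi[:, None] * P             # flow[i, j] = pi_i * P_ij
detailed_balance = np.allclose(flow, flow.T)
stationary = np.allclose(pi @ P, pi)
```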
<h2 id="modeling-change-point-models-in-astrostatistics">Modeling Change Point Models in Astrostatistics.</h2>
<p>This was taken from a Penn State statistics summer program website located at <a href="http://sites.stat.psu.edu/~mharan/MCMCtut/COUP551_rates.dat">stat.psu.edu</a>. A detailed walk-through of the process PSU took with this project can be found at that site. The data is a time series of light emission recorded and aggregated over periods of 10000 seconds (just under 3 hours). A plot of the time series can be seen below for your convenience.</p>
<p><img src="/assets/pics/2017/04/18_psu_ts.png" alt="png" /></p>
<p>The idea (with this data) is that the emissions can be modeled by two Poisson distributions and a change point. The data will be modeled by the first Poisson until the change point, and then it switches to the second Poisson.</p>
<p>So we start out with a multi-parameter model, and through some Bayesian inference you arrive at a pretty nasty looking posterior distribution. You’d like to sample this distribution to obtain, say, the mean \( k \) (change point) value along with a 95% confidence interval. Since you need that confidence interval, standard optimization methods won’t suffice: we’d need the distribution of \( k \)s. For this, we can use MCMC!</p>
<p>The posterior distribution we would like to sample from is this:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
f(k,\theta,\lambda,b_1,b_2 | Y) \propto& \prod\limits_{i=1}^k\frac{\theta^{Y_i}e^{-\theta}}{Y_i!} \prod\limits_{i=k+1}^n\frac{\lambda^{Y_i}e^{-\lambda}}{Y_i!} \\
&\times\frac{1}{\Gamma(0.5)b_1^{0.5}}\theta^{-0.5}e^{-\theta/b_1} \times\frac{1}{\Gamma(0.5)b_2^{0.5}}\lambda^{-0.5}e^{-\lambda/b_2}\\
&\times\frac{e^{-1/b_1}}{b_1}\frac{e^{-1/b_2}}{b_2}\times \frac{1}{n}
\end{align*} %]]></script>
<p>This distribution is pretty gnarly and since I don’t really know how to sample from this (beyond a blind search), we would like to use MCMC. Metropolis-Hastings (at least how we’ve discussed it) clearly doesn’t work here because this is multidimensional. For that we introduce another MCMC algorithm: <strong>Gibbs Sampling</strong>.</p>
<h3 id="gibbs-sampling">Gibbs Sampling</h3>
<p>The basic pseudocode is this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">initial_X</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">initial_Y</span>
<span class="o">...</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">initial_Z</span>
<span class="n">MC</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iters</span><span class="p">):</span>
    <span class="c1"># for each variable we compute</span>
    <span class="n">X</span> <span class="o">~</span> <span class="n">q_X</span><span class="p">(</span><span class="n">x</span><span class="o">|</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)</span>
    <span class="c1"># We do this for X, Y, ..., Z in turn</span>
    <span class="n">MC</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">))</span>
</code></pre></div></div>
<p>We can combine Gibbs with Metropolis quite easily:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">initial_X</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">initial_Y</span>
<span class="o">...</span>
<span class="n">Z</span> <span class="o">=</span> <span class="n">initial_Z</span>
<span class="n">MC</span> <span class="o">=</span> <span class="p">[(</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_iters</span><span class="p">):</span>
    <span class="c1"># for each variable we compute</span>
    <span class="n">prospect</span> <span class="o">~</span> <span class="n">q_X</span><span class="p">(</span><span class="n">x</span><span class="o">|</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">)</span>
    <span class="n">r</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">prospect</span><span class="p">)</span><span class="o">/</span><span class="n">f</span><span class="p">(</span><span class="n">X</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">a</span> <span class="o">~</span> <span class="n">uniform</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">a</span> <span class="o"><</span> <span class="n">r</span><span class="p">:</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">prospect</span>
    <span class="c1"># We do this for X, Y, ..., Z in turn</span>
    <span class="n">MC</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">X</span><span class="p">,</span><span class="n">Y</span><span class="p">,</span><span class="o">...</span><span class="p">,</span><span class="n">Z</span><span class="p">))</span>
</code></pre></div></div>
<p>So, basically, you update one at a time.</p>
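<p>As a runnable illustration of updating one coordinate at a time, here is a sketch of Metropolis-within-Gibbs on a toy target. The target (a two-dimensional Gaussian with correlation <code>RHO</code>), the proposal width, and the burn-in length are all assumptions picked for the example; this is not the change point model.</p>

```python
import numpy as np

RHO = 0.8  # hypothetical correlation of the toy 2-D Gaussian target

def f(x, y):
    """Unnormalized density of a standard bivariate normal with correlation RHO."""
    return np.exp(-(x**2 - 2 * RHO * x * y + y**2) / (2 * (1 - RHO**2)))

def metropolis_within_gibbs(num_iters, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    chain = np.empty((num_iters, 2))
    for i in range(num_iters):
        # Update x with y held fixed.
        prospect = rng.normal(x, b)
        if rng.uniform() < min(f(prospect, y) / f(x, y), 1.0):
            x = prospect
        # Update y with x held fixed.
        prospect = rng.normal(y, b)
        if rng.uniform() < min(f(x, prospect) / f(x, y), 1.0):
            y = prospect
        chain[i] = x, y
    return chain

chain = metropolis_within_gibbs(20_000)
sample_corr = np.corrcoef(chain[2_000:].T)[0, 1]   # drop a short burn-in
```

<p>If the sampler is behaving, <code>sample_corr</code> should come out near <code>RHO</code>.</p>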
<p>Using this algorithm, we can analyze the data from the PSU website we mentioned above. In fact we did that very thing. Below you’ll find some graphs depicting the distribution of \( k \) values.</p>
<!--
```python
def psu_mcmc(X, q, numIters=10000):
theta, lambd, k, b1, b2 = 1, 1, 20, 1, 1
thetas, lambds, ks, b1s, b2s = [], [], [], [], []
n = len(X)
def f_k(theta, lambd, k, b1, b2):
if 0 <= k and k <= n:
return theta**sum(X[:k])*lambd**sum(X[k:])*np.exp(-k*theta-(n-k)*lambd)
elif k < 0:
return lambd**sum(X)*np.exp(-n*lambd)
elif k > n:
return theta**sum(X)*np.exp(-n*theta)
def f_t(theta, k, b1):
return theta**(sum(X[:k])+0.5)*np.exp(-theta*(k+1.0)/b1)
def f_l(lambd, k, b2):
return lambd**(sum(X[k:])+0.5)*np.exp(-lambd*((n-k)+1.0)/b2)
def f_b(b, par):
return np.exp(-(1 + par) / b) / (b*np.sqrt(b))
for i in range(numIters):
tmp = q(theta)
if tmp < np.infty:
r = min(1, f_t(tmp,k,b1)/f_t(theta,k,b1))
if np.random.uniform(0,1) < r:
theta = tmp
tmp = q(lambd)
if tmp < np.infty:
r = min(1, f_l(tmp,k,b2)/f_l(lambd,k,b2))
if np.random.uniform(0,1) < r:
lambd = tmp
tmp = q(b1)
if tmp < np.infty:
r = min(1, f_b(tmp, theta)/f_b(b1, theta))
if np.random.uniform(0,1) < r:
b1 = tmp
tmp = q(b2)
if tmp < np.infty:
r = min(1, f_b(tmp, lambd)/f_b(b2, lambd))
if np.random.uniform(0,1) < r:
b2 = tmp
tmp = q(k)
if tmp < np.infty:
r = min(1, f_k(theta, lambd, tmp, b1, b2) /
f_k(theta, lambd, k, b1,b2))
if np.random.uniform(0,1) < r:
k = tmp
thetas.append(theta)
lambds.append(lambd)
b1s.append(b1)
b2s.append(b2)
ks.append(k)
return np.array([thetas,lambds,ks,b1s,b2s])
```
```python
%%bash
if [ ! -f tmp/psu_data.tsv ]
then
wget http://sites.stat.psu.edu/~mharan/MCMCtut/COUP551_rates.dat -O tmp/psu_data.tsv
fi
```
```python
psu_data = []
with open("tmp/psu_data.tsv", "r") as f:
title = f.readline()
for line in f:
tmpArr = [x.strip() for x in line.split(" ")]
psu_data.append([int(x) for x in tmpArr if x != ""][1])
psu_data = np.array(psu_data)
psu_data
```
array([11, 3, 5, 9, 3, 4, 5, 5, 5, 5, 13, 18, 27, 8, 4, 10, 8,
3, 12, 10, 10, 3, 9, 8, 5, 9, 4, 6, 1, 5, 14, 7, 9, 10,
8, 13, 8, 11, 11, 10, 11, 13, 10, 3, 8, 5])
```python
mcmc2 = psu_mcmc(psu_data, q(1), 1000)
fig = plt.figure()
fig.suptitle("MCMC values for Change Point")
plt.subplot(2,1,1)
plt.hist(mcmc2[2] % len(psu_data), normed=True)
plt.subplot(2,1,2)
plt.plot(mcmc2[2])
plb.savefig("tmp/psu_graphs1.png")
plt.show()
```
-->
<p><img src="/assets/pics/2017/04/18_psu_graphs1.png" alt="png" /></p>
<!--
```python
plt.plot(psu_data)
plt.title("PSU Data")
plb.savefig("tmp/psu_ts.png")
plt.show()
```
-->
<h2 id="summary">Summary</h2>
<ul>
<li>Random-Walk-Metropolis-Hastings
<ul>
<li>Can be used to sample difficult univariate distributions relatively easily
<ul>
<li>Have to tune the sampling parameter</li>
<li>Curse of dimensionality in tuning parameters</li>
</ul>
</li>
<li>Requires \( f \) to be defined on all of \( \mathbb{R} \)
<ul>
<li>Transform as needed</li>
</ul>
</li>
</ul>
</li>
<li>Gibbs Sampling
<ul>
<li>Turn high dimensional sampling into iterative one-dimensional sampling</li>
</ul>
</li>
<li>Gibbs with Metropolis-Hastings
<ul>
<li>Lovely</li>
</ul>
</li>
</ul>
<h2 id="bibliography">Bibliography</h2>
<p><a href="http://sites.stat.psu.edu/~mharan/MCMCtut/MCMC.html">Summer School in Astrostatistics</a></p>Aaron NiskinMarkov Chain Monte Carlo is a tool for sampling otherwise intractable distributions. In this post, we'll go through some descriptions, reasoning, and examples therein.Linear Regression – The Basics2017-02-20T23:24:43-08:002017-02-20T23:24:43-08:00http://aaron.niskin.org/blog/2017/02/20/What-Is-Linear-Regression<h3 id="the-basics">The basics</h3>
<p>Yeah. It’s not a good sign if I’m starting out already repeating myself. But
that’s how things seem to be with linear regression, so I guess it’s fitting.
It seems like every day one of my professors will talk about linear regression,
and it’s not due to laziness or lack of coordination. Indeed, it’s an
intentional part of the curriculum here at New College of Florida because of
how ubiquitous linear regression is. Not only is it an extremely simple yet
expressive formulation, it’s also the theoretical basis of a whole slew of
other tactics. Let’s just get right into it, shall we?</p>
<!-- more -->
<h3 id="linear-regression">Linear Regression:</h3>
<p>Let’s say you have some data from the real world (and hence riddled with
real-world error). A basic example for us to start with is this one:</p>
<p><img src="/assets/pics/2017/02/21-linear-regression_scatterPoints.png" alt="Data suitable for Linear Regression" title="Some Example Data" /></p>
<p>There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data-point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as:</p>
<script type="math/tex; mode=display">\frac{1}{N}\sum\limits_{i=1}^N\left(\hat Y_i - Y_i\right)^2</script>
<p>where \( \hat Y_i\) is our predicted value of \(Y_i\) for a given \(X_i\). Since we want a linear model (for simplicity and extensibility), we can write the above equation (dropping the constant factor \( \frac{1}{N} \), which doesn’t change where the minimum is) as,</p>
<script type="math/tex; mode=display">\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2</script>
<p>for some \( \alpha, \beta \) that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for \( \alpha, \beta \)! Let’s go ahead and do that! We want to minimize</p>
<script type="math/tex; mode=display">\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2</script>
<p>We can start by finding the \( \hat\alpha \) such that, \( \frac{d}{d\alpha}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0 \). And as long as we don’t forget the chain rule, we’ll be alright…</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^N2\left(\alpha + \beta X_i - Y_i\right) =& 0\Longrightarrow\\
\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right) =& 0 \Longrightarrow\\
N\alpha + N\beta\bar X - N\bar Y =& 0\Longrightarrow\\
\alpha =& \bar Y - \beta\bar X \end{align*} %]]></script>
<p>and we’ll find the \( \beta \) such that \( \frac{d}{d\beta}\sum\limits_{i=1}^N\left(\alpha + \beta X_i - Y_i\right)^2 = 0 \)</p>
<p>And following a similar pattern we find:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^N2X_i\left(\alpha + \beta X_i - Y_i\right) =& 0\Longrightarrow\\
\alpha\sum\limits_{i=1}^NX_i + \beta\sum\limits_{i=1}^N X_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\Longrightarrow\\
(\bar Y -\beta\bar X)N\bar X + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\\
N\bar Y\bar X -N\beta(\bar X)^2 + \beta\sum\limits_{i=1}^NX_i^2 - \sum\limits_{i=1}^NY_iX_i =& 0\\
\beta\left(\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2\right) =& \sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X\\
\beta =& \frac{\sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X}{\sum\limits_{i=1}^NX_i^2 - N(\bar X)^2}
\end{align*} %]]></script>
<p>But note:</p>
<script type="math/tex; mode=display">\text{VAR}(X) = \frac{1}{N}\sum\limits_{i=1}^NX_i^2 - (\bar X)^2</script>
<p>So,</p>
<script type="math/tex; mode=display">N\cdot\text{VAR}(X) = \sum\limits_{i=1}^NX_i^2 - N(\bar X)^2</script>
<p>And:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\sum\limits_{i=1}^NY_iX_i - N\bar Y\bar X =& \sum\limits_{i=1}^NY_iX_i - N(\frac{1}{N}\sum\limits_{i=1}^NY_i)\bar X \\
=& \sum\limits_{i=1}^NY_iX_i - \sum\limits_{i=1}^N(Y_i\bar X)\\
=& \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right) \\
=& \sum\limits_{i=1}^NY_i\left(X_i - \bar X\right) - \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right) + \sum\limits_{i=1}^N\bar Y\left(X_i - \bar X\right)\\
=& \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) + \bar Y\sum\limits_{i=1}^N\left(X_i - \bar X\right)\\
=& \sum\limits_{i=1}^N\left(Y_i-\bar Y\right)\left(X_i - \bar X\right) = N\cdot \text{COV}(X, Y) \end{align*} %]]></script>
<p>So,</p>
<script type="math/tex; mode=display">\beta = \frac{\text{COV}(X,Y)}{\text{VAR}(X)}</script>
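<p>As a sanity check on the algebra, here is a short sketch with made-up data (the true intercept 4 and slope 3 are arbitrary). It computes \( \beta \) as covariance over variance, computes \( \alpha \) from the means, and cross-checks both against NumPy’s degree-1 least-squares fit.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=200)
Y = 4.0 + 3.0 * X + rng.normal(scale=1.0, size=200)   # made-up noisy linear data

# beta = COV(X, Y) / VAR(X), both with the 1/N convention used above
beta = np.mean((X - X.mean()) * (Y - Y.mean())) / np.var(X)
alpha = Y.mean() - beta * X.mean()

# Cross-check against NumPy's least-squares polynomial fit of degree 1.
slope, intercept = np.polyfit(X, Y, 1)
```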
<p>And then we can find \( \alpha \) by substituting in our approximation of \( \beta \). Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation.</p>
<p><img src="/assets/pics/2017/02/21-linear-regression_scatterPoints_withLine.png" alt="Data With Linear Regression Line" title="Some Example Data With Regression Line" /></p>
<h3 id="now-we-have-it">Now we have it</h3>
<p>Okay, so now we have our line of “best fit”, but what does it mean? Well, it
means that this line predicts the data we gave it with the least error. That’s
really all it means. And sometimes, as we’ll see later, reading too much into
that can really get you into trouble.</p>
<p>But using this model we can now predict other data outside the model. So, for
instance, in the model pictured above, if we were to try and predict \( Y \)
when \( X=2 \), we wouldn’t do so badly by picking something around 10 for \(
Y \).</p>
<h3 id="an-example-perhaps">An example, perhaps?</h3>
<p>So I feel at this point, it’s probably best to give an example. Let’s say we’re
trying to predict stock price given the total market price. Well, in practice
this model is used to assess volatility, but that’s neither here nor there.
Right now, we’re really only interested in the model itself. But without
further ado, I present you with, the CAPM (Capital Asset Pricing Model):</p>
<p><script type="math/tex">r = \alpha + \beta r_m + \epsilon</script> (where \( \epsilon \) is the error in
our predictions).</p>
<p>And you can fit this using historical data or what-have-you. There are a bunch
of downsides to fitting it with historical data though, like the fact that data
from 3 days ago really doesn’t have much to say about the future anymore. There
are plenty of cool things you can do therein, but sadly, those are out of the
scope of this post.</p>
<p>For now, we move on to</p>
<h2 id="multiple-regression">Multiple Regression</h2>
<h3 id="what-is-this-anyway">What is this anyway?</h3>
<p>Well, multiple regression is really just a new name for the same thing: how do
we fit a linear model to our data given some set of predictors and a single
response variable? The only difference is that this time our linear model
doesn’t have to be one dimensional. Let’s get right into it, shall we?</p>
<p>So let’s say you have \( k \) predictors arranged in a vector (in other
words, our predictor is a vector in \( \mathbb{R}^k \)). Well, I wonder if a
similar formula would work… Let’s figure it out…</p>
<p>Firstly, we need to know what a derivative is in \( \mathbb{R}^n \). Well, if
\( f:\mathbb{R}^n\to\mathbb{R}^m \) is a differentiable function, then for
any \( x \) in the domain, \( f'(x) \) is the linear map \(
A:\mathbb{R}^n\to\mathbb{R}^m \) such that \( \lim_{h\to 0}\frac{||f(x+h) - f(x) - Ah||}{||h||} = 0 \). Basically, \( f'(x) \) is the linear map that describes the tangent plane.</p>
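<p>If that definition feels abstract, a quick numerical check can help. The sketch below uses an example function \( f:\mathbb{R}^2\to\mathbb{R}^2 \) picked purely for illustration, with its matrix of partial derivatives worked out by hand as the candidate \( A \), and watches the ratio from the definition shrink as \( h \) shrinks.</p>

```python
import numpy as np


def f(x):
    """Example map from R^2 to R^2, chosen only for illustration."""
    return np.array([x[0] ** 2 + x[1], x[0] * x[1]])


def jacobian(x):
    """Hand-computed matrix of partial derivatives of f at x (the candidate A)."""
    return np.array([[2.0 * x[0], 1.0],
                     [x[1], x[0]]])


x = np.array([1.0, 2.0])
A = jacobian(x)
direction = np.array([0.6, -0.8])  # a unit vector

# The ratio ||f(x+h) - f(x) - Ah|| / ||h|| should shrink as h shrinks.
ratios = []
for scale in [1e-1, 1e-2, 1e-3]:
    h = scale * direction
    ratios.append(np.linalg.norm(f(x + h) - f(x) - A @ h) / np.linalg.norm(h))

print(ratios)
```

<p>Checking one direction is not a proof, of course, but it is a handy sanity check on a hand-computed derivative.</p>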
<p>So, now that we got that out of the way, let’s use it! We want to find the
linear function that minimizes the Euclidean norm of the error terms (just like
before). But note: the error term is \( \epsilon = Y - \hat Y = Y - \alpha
- X\beta \), for some intercept \( \alpha \) and some coefficient vector \( \beta \).
Now, since it’s easier and it’ll give us the same answer, we’re going to
minimize the squared norm of the error instead of just the norm (like we did in
the one dimensional version). We’re also going to make one more simplification:
that \( \alpha=0 \). We can do this safely by simply appending (or
prepending) a 1 to each row of our data (thereby creating a constant term). So
for the following, assume we’ve done that.</p>
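<p>Here is a small sketch of that trick, using made-up numbers: prepending a column of ones to the data means the first coefficient does the job of \( \alpha \). The names <code>X_raw</code> and <code>beta_aug</code> are hypothetical.</p>

```python
import numpy as np

# Hypothetical data: 4 observations of k = 2 predictors.
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# Prepend a column of ones so the first coefficient plays the role of alpha.
X = np.column_stack([np.ones(len(X_raw)), X_raw])

alpha = 0.5
beta = np.array([2.0, -1.0])
beta_aug = np.concatenate([[alpha], beta])

# With the ones column, X @ beta_aug equals alpha + X_raw @ beta.
print(np.allclose(X @ beta_aug, alpha + X_raw @ beta))
```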
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}\langle\epsilon, \epsilon\rangle =& (Y - X\beta)^T(Y - X\beta) \\
\langle\epsilon, \epsilon\rangle =& (Y^T - \beta^TX^T)(Y - X\beta) \\
\langle\epsilon, \epsilon\rangle =& Y^TY - 2Y^TX\beta + \beta^TX^TX\beta \end{align*} %]]></script>
<p>So, let’s find the \( \beta \) that minimizes that.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0=&\lim\limits_{h\to0}\frac{||(Y^TY - 2Y^TX(\beta+h) + (\beta+h)^TX^TX(\beta+h)) - (Y^TY - 2Y^TX\beta + \beta^TX^TX\beta) - Ah||}{||h||} \\
0=&\lim\limits_{h\to0}\frac{||- 2Y^TXh + 2\beta^TX^TXh + h^TX^TXh - Ah||}{||h||}\\
0=&\lim\limits_{h\to0}\frac{||(- 2Y^TX + 2\beta^TX^TX - A)h + h^TX^TXh||}{||h||} \end{align*} %]]></script>
<p>The \( h^TX^TXh \) term is quadratic in \( h \), so after dividing by \( ||h|| \) it vanishes as \( h\to0 \). The limit is therefore zero exactly when \( A = -2Y^TX + 2\beta^TX^TX \). So, now we see that the derivative is \( -2Y^TX + 2\beta^TX^TX \) and we want to find where our error is minimized, so we want to set that derivative to zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
0=&- 2Y^TX + 2\beta^TX^TX \\
X^TX\beta =& X^TY \\
\beta =& (X^TX)^{-1}X^TY \end{align*} %]]></script>
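<p>And there we have it. The middle line, \( X^TX\beta = X^TY \), is called the <strong>normal equation</strong> for linear regression, and the last line solves it whenever \( X^TX \) is invertible.</p>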
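<p>To see the formula in action, here is a minimal sketch on simulated data (the data, the “true” coefficients, and the noise level are all assumptions made for illustration). It solves \( X^TX\beta = X^TY \), compares the answer with NumPy’s own least squares routine, and checks that the derivative \( -2Y^TX + 2\beta^TX^TX \) really is zero at the solution.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, invented for this sketch: 50 rows, a ones column plus 2 predictors.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve X^T X beta = X^T Y directly rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The derivative -2 Y^T X + 2 beta^T X^T X should vanish at beta_hat.
gradient = -2 * Y @ X + 2 * beta_hat @ X.T @ X

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq), np.allclose(gradient, 0.0, atol=1e-8))
```

<p>Using <code>np.linalg.solve</code> instead of computing \( (X^TX)^{-1} \) explicitly is a design choice in this sketch: it gives the same \( \beta \) and tends to be better behaved numerically.</p>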
<p>Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof.</p>
<p>Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests.</p>Aaron NiskinLinear Regression is the bedrock of Machine Learning. We cover the basics therein, some theory involved, and give some relevant examples.