[
{
"title": "First Post, An Explication",
"excerpt": "A solemn oath to the reader… An oath I will surely forget and fail to uphold. Take it as you will\n",
"content": "This is my first blog, so I guess I should start off by explaining my motives behind writing what will probably be a sporadically updated blog. Basically, as I learn things that I find pretty difficult to find online, I’ll try to explain them as best I can here. Also, since I enjoy learning math, I’m going to try to keep up a (semi) regular stream of math posts. If you have any questions, feel free to contact me. My up-to-date contact info can be found on my website aaron.niskin.org ",
"url": "/meta/general/2015/11/29/first-post-an-explication/"
},
{
"title": "What's in a language?",
"excerpt": "The basics of the theory of formal languages\n",
"content": "Languages in abstraction This post is about Languages from a mathematical and abstract linguistics point of view. Not much more to say about that, so let’s get right to it! The Rigorous definition: Let \\( \\Sigma \\) be an alphabet and let \\( \\Sigma^k \\) be the set of all strings of length k over that alphabet. Then, we define \\( \\Sigma^* \\) to be \\( \\bigcup\\limits_{k\\in\\mathbb{N}}\\Sigma^k \\) (the union of ∑k over all natural numbers k). If \\( L\\subseteq\\Sigma^* \\) , we call \\( L \\) a language. The Intuition Behind the Definition Consider an alphabet (some finite set of characters), for example we can consider the letters of the English language, the ASCII symbols, the symbols \\( {0, 1} \\) (otherwise known as binary), or the symbols \\( {1, 2, 3, 4, 5, 6, 7, 8, 9, 0, +, \\times , =} \\) . We can then construct the infinite list of all the different ways we can arrange those characters (e.g. \\( 1001011 \\) or \\( 0011011011 \\) , etc. if we’re using binary). We call these arrangements “strings”. Once we have all that machinery built up, a language is just some subset of that infinite collection of strings. The language may itself be infinite. Some Examples The alphabet: \\( \\Sigma={0, 1} \\) The language: \\( {x\\in\\Sigma|x \\text{ is prime}} \\) (all prime numbers in binary) Some strings from the language: \\( 10, 11, 101… \\) The alphabet: ASCII characters The language: All syntactically correct Clojure programs (the source code) The alphabet: All Clojure functions, operators, etc, and list \\( {x, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0} \\) The language: All syntactically correct Clojure programs (the source code) You see that we need to have an alphabet before we can have a Formal Language. Also, different alphabets may result in equivalent languages – by equivalent, we mean that both languages contain the same strings. What are we getting at? Well, there are two ways to look at this. 
On the one hand, linguists would like to study language in its abstract essence. For this, Formal Languages may come in handy (if endowed with a Grammar and possibly more). That is not the reason I will be studying Formal Languages. On the other hand, and this is my reason, I’m learning about formal languages to find their applications to computing. With the help of a little Mathematical thinking, we can assign semantics to the strings in a language and correlate them with real world problems – such as computability, the P vs. NP problem, cryptography, and more! ",
"url": "/mathematics/2015/11/30/whats-in-a-language/"
},
{
"title": "A dirty little ditty on Finite Automata",
"excerpt": "A basic yet formal introduction to Finite Automata and how strings act on them.\n",
"content": "This post builds on the previous post about Formal Languages. Some Formal Definitions A Deterministic Finite Automata (DFA) is A set \\( \\mathcal{Q} \\) called “states” A set \\( \\Sigma \\) called “symbols” or “alphabet” A function \\( \\delta_F:\\mathcal{Q}\\times\\Sigma \\to \\mathcal{Q} \\) A designated state \\( q_0\\in\\mathcal{Q} \\) called the start point A subset \\( F\\subseteq\\mathcal{Q} \\) called the “accepting states” The DFA is then often referred to as the ordered quintuple \\( A=(\\mathcal{Q},\\Sigma,\\delta_F,q_0,F) \\). Defining how strings act on DFAs. Given a DFA, \\( A=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), a state \\( q_i\\in\\mathcal{Q} \\), and a string \\( w\\in\\Sigma^* \\), we can define \\( \\delta(q_i,w) \\) like so: If \\( w \\) only has one symbol, we can consider \\( w \\) to be the symbol and define \\( \\delta(q_i,w) \\) to be the same as if we considered \\( w \\) as the symbol If \\( w=xv \\), where \\( x\\in\\Sigma \\) and \\( v\\in\\Sigma^* \\), then \\( \\delta(q_i, w)=\\delta(\\delta(q_i,x),v) \\) And in this way, we have defined how DFAs can interpret strings of symbols rather than just single symbols. The language of a DFA Given a DFA, \\( A=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), we can define “the language of \\( A \\)”, denoted \\( L(A) \\), as \\( {w\\in\\Sigma^*|\\delta(q_0,w)\\in F} \\). Some Examples, Maybe? Example 1: Let’s construct a DFA that accepts only strings beginning with a 1 that, when interpreted as binary numbers, are multiples of 5. 
So some examples of strings that would be in \\( L(A) \\) are 101, 1010, and 1111. Some More Formal Definitions A Nondeterministic Finite Automaton (NFA) consists of: A set \\( \\mathcal{Q} \\) called “states” A set \\( \\Sigma \\) called “symbols” A function \\( \\delta_N:\\mathcal{Q}\\times\\Sigma \\to \\mathcal{P}\\left(\\mathcal{Q}\\right) \\) A designated state \\( q_0\\in\\mathcal{Q} \\) called the start point A subset \\( F\\subseteq\\mathcal{Q} \\) called the “accepting states” The NFA is then often referred to as the ordered quintuple \\( A=(\\mathcal{Q},\\Sigma,\\delta_N,q_0,F) \\). Defining how strings act on NFAs. Given an NFA, \\( N=(\\mathcal{Q}, \\Sigma, \\delta, q_0, F) \\), a collection of states \\( \\mathcal{S}\\subseteq\\mathcal{Q} \\), and a string \\( w\\in\\Sigma^* \\), we can define \\( \\delta(\\mathcal{S},w) \\) like so: If \\( w \\) consists of a single symbol, then we can consider \\( w \\) to be that symbol and define \\( \\delta(\\mathcal{S},w):=\\bigcup\\limits_{s\\in\\mathcal{S}}\\delta(s,w) \\). If \\( w=xv \\), where \\( x\\in\\Sigma \\) and \\( v\\in\\Sigma^* \\), then \\( \\delta(\\mathcal{S}, w)=\\bigcup\\limits_{s\\in\\mathcal{S}}\\delta(\\delta(s,x),v) \\). And in this way, we have defined how NFAs can interpret strings of symbols rather than just single symbols. Maybe in another post, we’ll get past the definitions! :-/ ",
"url": "/mathematics/2015/12/03/a-dirty-little-ditty-on-finite-automata/"
},
{
"title": "Probability -- A Measure Theoretic Approach",
"excerpt": "We cover probability from a measure theoretic approach\n",
"content": "Probability using Measure Theory A mathematically rigorous definition of probability, and some examples therein. The Traditional Definition: Consider a set \\( \\Omega \\) (called the sample space ), and a function \\( X:\\Omega\\rightarrow\\mathbb{R} \\) (called a random variable . If \\( \\Omega \\) is countable (or finite), a function \\( \\mathbb{P}:\\Omega\\rightarrow\\mathbb{R} \\) is called a probability distribution if it satisfies the following 2 conditions: For each \\( x \\in \\Omega \\), \\( \\mathbb{P}(x) \\geq 0 \\) If \\( A_i\\cap A_j = \\emptyset \\), then \\( \\mathbb{P}(\\bigcup\\limits_0^\\infty A_i) = \\sum\\limits_0^\\infty\\mathbb{P}(A_i) \\) And if \\( \\Omega \\) is uncountable, a function \\( F:\\mathbb{R}\\rightarrow\\mathbb{R} \\) is called a probability distribution or a cumulative distribution function if it satisfies the following 3 conditions: For each \\( a,b\\in\\mathbb{R} \\), \\( a < b \\rightarrow F(a)\\leq F(b) \\) \\( \\lim\\limits_{x\\to -\\infty}F(x) = 0 \\) \\( \\lim\\limits_{x\\to\\infty}F(x) = 1 \\) The Intuition: What idea are we even trying to capture with these seemingly disparate definitions for the same thing? Well, with the two cases taken separately it's somewhat obvious, but they don't seem to marry very well. The discrete case is giving us a pointwise estimation of something akin to the proportion of observations that should correspond to a value (in a perfect world). The continuous case is the same thing, but instead of corresponding to that particular value (which doesn't really even make sense in this case), the proportion corresponds to the point in question and everything less than it. The shaded region in the top picture below and the curve in the picture directly below it denote the cumulative density function of a standard normal distribution (don't worry too much about what that means for this post, but if you're doing anything with statistics, you should probably know a bit about that). 
Another way to define a continuous probability distribution is through something called a probability density function, which is closer to the discrete case definition of a probability distribution (or probability mass function). A probability density function is a function \\( f:\\mathbb{R}\\rightarrow\\mathbb{R}^+ \\) such that \\( \\int_{-\\infty}^xf(t)dt = F(x) \\). In other words, \\( \\frac{dF}{dx} = f \\). This new function has some properties of our discrete case probability function, but lacks some others. On the one hand, they’re both defined pointwise, but on the other, this one can be greater than one in some places – meaning the value of the probability density function isn’t really the probability of an event, but rather (as the name “suggests”) the density therein. Does it measure up? Now let’s check out the measure theoretic approach… Let \\( \\Omega \\) be our sample space, \\( S \\) be a \\( \\sigma \\)-algebra on \\( \\Omega \\) (so \\( S \\) is the collection of measurable subsets of \\( \\Omega \\)), and \\( \\mu:S\\to\\mathbb{R} \\) a measure on that measure space. Let \\( X:\\Omega\\rightarrow\\mathbb{E} \\) be a random variable (\\( \\mathbb{E} \\) is generally taken to be \\( \\mathbb{R} \\) or \\( \\mathbb{R}^n \\)). We define the function \\( \\mathbb{P}:\\mathcal{P}(\\mathbb{E})\\rightarrow\\mathbb{R} \\) (where \\( \\mathcal{P}(\\mathbb{E}) \\) is the powerset of \\( \\mathbb{E} \\) – the set of all subsets) such that if \\( A\\subseteq\\mathbb{E} \\), we have \\( \\mathbb{P}(A)=\\mu(X^{-1}(A)) \\). We call \\( \\mathbb{P} \\) a probability distribution if the following conditions hold: \\( \\mu(\\Omega) = 1 \\) for each \\( A\\subseteq\\mathbb{E} \\) we have \\( X^{-1}(A)\\in S \\). Why do this? Well, right off the bat we have a serious benefit: we no longer have two disparate definitions of our probability distributions. 
Furthermore, there is the added benefit of having a natural separation of concerns: the measure \\( \\mu \\) determines what we might intuitively consider to be the probability distribution, while the random variable is used to encode the aspects of the events that we care about. To further illustrate this, let’s look at some examples. The Examples A fair die All even Let’s consider a fair die. Our sample space will be \\( \\{1,2,3,4,5,6\\} \\). Since our die is fair, we’ll define our measure fairly: for any \\( x \\) in our sample space, \\( \\mu(\\{x\\}) = \\frac{1}{6} \\). If we want to know, for instance, what the probability of getting each number is, we could use a very intuitive random variable \\( X(a) = a \\) (so \\( X(1)=1 \\), etc.). Then we see that \\( \\mathbb{P}(\\{1\\}) = \\mu(\\{1\\}) = \\frac{1}{6} \\), and the rest are found similarly. Odds and Evens? What if we want to consider the fair die of yester-paragraph, but we only care if the face of the die shows an odd or an even number? Well, since the actual distribution of the die hasn’t changed, we won’t have to change our measure. Instead we’ll change our random variable to capture just those aspects we care about. In particular, \\( X(a) = 0 \\) if \\( a \\) is even, and \\( X(a) = 1 \\) if \\( a \\) is odd. We then see \\( \\mathbb{P}(\\{1\\}) = \\mu(\\{1,3,5\\}) = \\frac{1}{2} \\) and \\( \\mathbb{P}(\\{0\\}) = \\mu(\\{2,4,6\\}) = \\frac{1}{2} \\). Getting loaded All even Now let’s consider the same scenario of wanting to know the probability of getting each number, but now our die is loaded. Being as how we’re changing the distribution itself and not just the aspects we’re choosing to care about, we’re going to want to change the measure this time. For simplicity, let’s consider a kind of degenerate case scenario. Let our measure be: \\( \\mu(A) = 1 \\) if \\( 1\\in A \\) and \\( \\mu(A)=0 \\) if \\( 1\\notin A \\). Basically, we’re defining our probability to be such that the only possible outcome is a roll of 1. 
So since we are concerned with the same things we were concerned with last time, we can take that same random variable. We note \\( \\mathbb{P}(\\{1\\}) = 1 \\) and \\( \\mathbb{P}(\\{a\\}) = 0 \\) for any \\( a \\neq 1 \\). Odds or evens Try to do this one yourself. I’m going to go get some sleep now. Please feel free to contact me with any questions. I love doing this stuff, so don’t be shy! ",
"url": "/probability/2017/02/18/probability-a-measure-theoretic-approach/"
},
{
"title": "Linear Regression -- The Basics",
"excerpt": "Linear Regression is the bedrock of Machine Learning. We cover the basics therein, some theory involved, and give some relevant examples.\n",
"content": "The basics Yeah. It’s not a good sign if I’m starting out already repeating myself. But that’s how things seem to be with linear regression, so I guess it’s fitting. It seems like every day one of my professors will talk about linear regression, and it’s not due to laziness or lack of coordination. Indeed, it’s an intentional part of the curriculum here at New College of Florida because of how ubiquitous linear regression is. Not only is it an extremely simple yet expressive formulation, it’s also the theoretical basis of a whole slew of other tactics. Let’s just get right into it, shall we? Linear Regression: Let’s say you have some data from the real world (and hence riddled with real-world error). A basic example for us to start with is this one: There’s clearly a linear trend there, but how do we pick which linear trend would be the best? Well, one thing we could do is pick the line that has the least amount of error from the prediction to the actual data-point. To do that, we have to say what we mean by “least amount of error”. For this post, we’ll calculate that error by squaring the difference between the predicted value and the actual value for every point in our data set, then averaging those values. This standard is called the Mean-Squared-Error (MSE). We can write the MSE as: where \\( \\hat Y_i\\) is our predicted value of \\(Y_i\\) for a give \\(X_i\\). Being as how we want a linear model (for simplicity and extensibility), we can write the above equation as, for some \\( \\alpha, \\beta \\) that we don’t yet know. But since we want to minimize that error, we can take some derivatives and solve for \\( \\alpha, \\beta \\)! Let’s go ahead and do that! We want to minimize We can start by finding the \\( \\hat\\alpha \\) such that, \\( \\frac{d}{d\\alpha}\\sum\\limits_{i=1}^N\\left(\\alpha + \\beta X_i - Y_i\\right)^2 = 0 \\). 
And as long as we don’t forget the chain rule, we’ll be alright… and we’ll find the \\( \\beta \\) such that \\( \\frac{d}{d\\beta}\\sum\\limits_{i=1}^N\\left(\\alpha + \\beta X_i - Y_i\\right)^2 = 0 \\) And following a similar pattern we find (sorry for the editing… Wordpress.com isn’t the greatest therein): But note: So, And: So, And then we can find \\( \\alpha \\) by substituting in our approximation of \\( \\beta \\). Using those coefficients, we can plot the line below, and as you can see, it really is a good approximation. Now we have it Okay, so now we have our line of “best fit”, but what does it mean? Well, it means that this line predicts the data we gave it with the least error. That’s really all it means. And sometimes, as we’ll see later, reading too much into that can really get you into trouble. But using this model we can now predict other data outside the model. So, for instance, in the model pictured above, if we were to try and predict \\( Y \\) when \\( X=2 \\), we wouldn’t do so bad by picking something around 10 for \\( Y \\). An example, perhaps? So I feel at this point, it’s probably best to give an example. Let’s say we’re trying to predict stock price given the total market price. Well, in practice this model is used to assess volatility, but that’s neither here nor there. Right now, we’re really only interested in the model itself. But without further ado, I present you with, the CAPM (Capital Asset Pricing Model): (where \\( \\epsilon \\) is the error in our predictions). And you can fit this using historical data or what-have-you. There are a bunch of downsides to fitting it with historical data though, like the fact that data from 3 days ago really doesn’t have much to say about the future anymore. There are plenty of cool things you can do therein, but sadly, those are out of the scope of this post. For now, we move on to Multiple Regression What is this anyway? 
Well, multiple regression is really just a new name for the same thing: how do we fit a linear model to our data given some set of predictors and a single response variable? The only difference is that this time our linear model doesn’t have to be one dimensional. Let’s get right into it, shall we? So let’s say you have \\( n \\) many predictors arranged in a vector (in other words, our predictor is a vector in \\( \\mathbb{R}^n \\)). Well, I wonder if a similar formula would work… Let’s figure it out… Firstly, we need to know what a derivative is in \\( \\mathbb{R}^n \\). Well, if \\( f:\\mathbb{R}^n\\to\\mathbb{R}^m \\) is a differentiable function, then for any \\( x \\) in the domain, \\( f'(x) \\) is the linear map \\( A:\\mathbb{R}^n\\to\\mathbb{R}^m \\) such that \\( \\lim\\limits_{h\\to 0}\\frac{||f(x+h) - f(x) - Ah||}{||h||} = 0 \\). Basically, \\( f'(x) \\) is the tangent plane. So, now that we got that out of the way, let’s use it! We want to find the linear function that minimizes the Euclidean norm of the error terms (just like before). But note: the error term is \\( \\epsilon = Y - \\hat Y = Y - \\alpha - X\\beta \\), for some constant vector \\( \\alpha \\) and some coefficient vector \\( \\beta \\) (here \\( X \\) is the matrix whose rows are our data points and \\( Y \\) is the vector of responses). Now, since it’s easier and it’ll give us the same answer, we’re going to minimize the squared error term instead of just the error term (like we did in the one dimensional version). We’re also going to make one more simplification: that \\( \\alpha=0 \\). We can do this safely by simply appending (or prepending) a 1 to the rows of our data (thereby creating a constant term). So for the following, assume we’ve done that. So, let’s find the \\( \\beta \\) that minimizes that: \\( ||Y - X\\beta||^2 = Y^TY - 2Y^TX\\beta + \\beta^TX^TX\\beta \\). So, now we see that the derivative is \\( -2Y^TX + 2\\beta^TX^TX \\) and we want to find where our error is minimized, so we want to set that derivative to zero: \\( \\beta^TX^TX = Y^TX \\), or equivalently \\( \\hat\\beta = (X^TX)^{-1}X^TY \\) (assuming \\( X^TX \\) is invertible). And there we have it. That’s called the normal equation for linear regression. 
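As a small illustration of the normal equation, here is a hedged Python sketch for the simplest case: one predictor plus the constant column of ones, so the matrix X^T X is only 2 by 2 and can be inverted by hand. The function name fit_line and the toy data are made up for this example, and the normal equation is taken in its usual form (X^T X) beta = X^T Y.

```python
# Sketch only: solves (X^T X) beta = X^T Y for a design matrix whose rows
# are (1, x_i). The function name and the data below are hypothetical.
def fit_line(xs, ys):
    n = len(xs)
    # Entries of the 2x2 matrix X^T X = [[n, sx], [sx, sxx]]
    # and of the vector X^T Y = [sy, sxy].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError('X^T X is singular: all x values are equal')
    # Apply the inverse of the 2x2 matrix explicitly.
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope


if __name__ == '__main__':
    # Points lying exactly on a line, so the fit should recover that line.
    print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```

With more predictors the same equation applies, but one would normally hand the linear system to a numerical library instead of inverting the matrix by hand.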
Maybe next time I’ll post about how we can find these coefficients given some data using gradient descent, or some modification thereof. Till next time, I hope you enjoyed this post. Please, let me know if something could be clearer or if you have any requests. ",
"url": "/statistics/2017/02/21/What-Is-Linear-Regression/"
},
{
"title": "Markov Chain Monte Carlo",
"excerpt": "Markov Chain Monte Carlo is a tool for sampling otherwise intractible distributions. In this post, we’ll go through some descriptions, reasoning, and examples therein.\n",
"content": " Background What if we know the relative likelihood, but want the probability distribution? But what if \\( \\int f(x)dx \\) is hard, or you can’t sample from \\( f \\) directly? This is the problem we will be trying to solve. First approach If space if bounded (integral is between \\( a,b \\)) we can use Monte Carlo to estimate \\( \\int\\limits_a^b f(x)dx \\) Pick \\( \\alpha \\in (a,b) \\) Compute \\( f(\\alpha)/(b-a) \\) Repeat as necessary Compute the expected value of the computed values What if it’s a big ol’ bag of nope? What if we can’t sample from \\( f(x) \\) but can only determine likelihood ratios? Enter Markov Chain Monte Carlo The obligatory basics A Markov Chain is a stochastic process (a collection of indexed random variables) such that \\( \\mathbb{P}(X_n=x|X_0,X_1,…,X_{n-1}) = \\mathbb{P}(X_n=x|X_{n-1}) \\). In other words, the conditional probabilities only depend on the last state, not on any deeper history. We call the set of all possible values of \\( X_i \\) the state space and denote it by, \\( \\chi \\). Transition Matrix Let \\( p_{ij} = \\mathbb{P}(X_1 = j | X_0 = i) \\). We call the matrix \\( (p_{ij}) \\) the Transition Matrix of \\( X \\) and denote it \\( P \\). Let \\( \\mu_n = \\left(\\mathbb{P}(X_n=0), \\mathbb{P}(X_n=1),…, \\mathbb{P}(X_n=l)\\right) \\) be the row vector corresponding to the “probabilities” of being at each state at the \\( n \\)th point in time (iteration). Claim: \\( \\mu_{i+1} = \\mu_0P^{i+1} \\) Proof: if \\( i = 0 \\): So: \\( \\mu_1 = \\mu_0P \\) if \\( i > 0 \\): Then, \\( \\mu_{i+1} = \\mu_iP = (\\mu_0P^i)P = \\mu_0P^{i+1} \\) Stationary Distributions We say that a distribution \\( \\pi \\) is stationary if Main Theorem: An irreducible, ergotic Markov Chain \\( {X_n} \\) has a unique stationary distribution, \\( \\pi \\). The limiting distribution exists and is equal to \\( \\pi \\). 
And furthermore, if \\( g \\) is any bounded function, then with probability 1: \\( \\lim\\limits_{N\\to\\infty}\\frac{1}{N}\\sum\\limits_{n=1}^Ng(X_n) = \\mathbb{E}_\\pi(g) \\). This really just means that if our Markov Chain has certain properties (basically just that any state can be gotten to from any other state at any time), then we can sample from this Markov Chain, and it’ll have the same distribution as a generic sample from our desired distribution. Random-Walk Metropolis-Hastings: Let \\( f(x) \\) be the relative likelihood function of our desired distribution, and let \\( q(y|x_i) \\) be a known distribution that is easy to sample from (generally taken to be \\( N(x_i,b^2) \\)). 1) Given \\( X_0,X_1,…,X_i \\), pick \\( Y \\sim q(y|X_i) \\) 2) Compute \\( r(X_i,Y) = \\min\\left(\\frac{f(Y)q(X_i|Y)}{f(X_i)q(Y|X_i)}, 1\\right) \\) 3) Pick \\( a \\sim U(0,1) \\) 4) Set \\( X_{i+1} = Y \\) if \\( a < r \\), and \\( X_{i+1} = X_i \\) otherwise. The Confidence Builder: We would like to sample from and obtain a histogram of the Cauchy distribution: Note: Since \\( q \\) is symmetric, \\( q(x|y) = q(y|x) \\). Pseudocode: MC = [] X_i = initial_state for i in range(int(num_iters)): \tY = q(X_i) \tr = min(f(Y)/f(X_i), 1) \ta = np.random.uniform() \tif a < r: \t\tX_i = Y \tMC.append(X_i) Estimation: So you can clearly see that the different values of \\( b \\) lead to wildly different distributions, and therefore wildly different approximations of the end distribution. For example, if the standard deviation of our sampling distribution is very small, our prospective transitions won’t venture out very far, and hence the tails of our distribution will be woefully underrepresented. If, on the other hand, the standard deviation is too large, we will often venture out much further than we should, and we’ll end up with something like a multi-modal distribution. Basically, the prospective transition states will often be too far out to accept, in which case the current state persists, or they will be just close enough to be accepted. 
What you end up with is far too many samples from the tails and not enough samples from the beefier parts of the distribution. Choosing the “best” value of \\( b \\) is more of an art than a science, really. This issue will rear its ugly head a little later. But even with that issue, we see clearly that something amazing is happening here. How is this happening? A property called detailed balance, which means \\( \\pi_ip_{ij} = \\pi_jp_{ji} \\), or in the continuous case: \\( f(x)P_{xy} = P_{yx}f(y) \\). The intuition Since what we’re really after is the distribution of the samples and not the order in which they were picked, it’s reasonable to require a notion of (what I’m going to call) equal hopportunity: In other words, the probability of “hopping” from \\( x \\) to \\( y \\) should be the same as the reverse, because at the end of the day, both of these two scenarios only mean that \\( x \\) and \\( y \\) are in our sample. If you work out the math therein (as we will do below), you find that this ratio guarantees equal hopportunity. Let’s prove it! To reiterate, the claim was that this detailed balance property ( \\( f(x)P_{xy} = P_{yx}f(y) \\) ) guarantees that our likelihood function \\( f \\) is the stationary distribution for our Markov Chain. Let \\( f \\) be the desired distribution (in our example, it was the Cauchy Distribution), and let \\( q(y|x) \\) be the distribution we draw from. First we’ll show that detailed balance implies \\( \\pi \\) (or \\( f \\) if the distribution is continuous) is the stationary distribution! Let \\( \\pi_ip_{ij} = \\pi_jp_{ji} \\) for discrete or for continuous \\( f(i)P_{ij}=P_{ji}f(j) \\). Summing both sides over \\( i \\) gives \\( \\sum\\limits_i\\pi_ip_{ij} = \\pi_j\\sum\\limits_ip_{ji} = \\pi_j \\), since each row of \\( P \\) sums to 1. The left-hand side is the \\( j \\)th entry of \\( \\pi P \\), so \\( \\pi P = \\pi \\) (the continuous case is the same with integrals in place of sums). Now we’ll show that the Markov Chain defined by the Metropolis-Hastings algorithm has the detailed balance property (and hence \\( f \\) is the stationary distribution). Without any loss of generality, we’ll assume \\( f(x)q(y|x) > f(y)q(x|y) \\) (if this is not the case, the proof will follow the same with just symbolic changes). 
Note: Since \\( f(x)q(y|x) > f(y)q(x|y) \\), we know, \\( r(x,y) = \\frac{f(y)q(x|y)}{f(x)q(y|x)} \\), and \\( r(y,x) = 1 \\). Then, Modeling Change Point Models in Astrostatistics. This was taken from a Penn State statistics summer program website located at stat.psu.edu. A detailed walk-through of the process PSU took with this project can be found at that site. The data is a time series of light emission recorded and aggregated over periods of 10000 seconds (just under 3 hours). A plot of the time series can be seen below for your convenience. The idea (with this data) is that the emissions can be modeled by two Poisson distributions and a change point. The data will be modeled by the first Poisson until the change point, and then it switches to the second Poisson. So we start out with a multi-parameter model, and through some Bayesian inference you arrive at a pretty nasty looking posterior distribution. You’d like to sample this distribution to obtain, say, the mean \\( k \\) (change point) value along with a 95% confidence interval. Since you need that confidence interval, standard optimization methods won’t suffice: we’d need the distribution of \\( k \\)s. For this, we can use MCMC! The posterior distribution we would like to sample from is this: This distribution is pretty gnarly and since I don’t really know how to sample from this (beyond a blind search), we would like to use MCMC. Metropolis-Hastings (at least how we’ve discussed it) clearly doesn’t work here because this is multidimensional. For that we introduce another MCMC algorithm: Gibbs Sampling. Gibbs Sampling The basic pseudocode is this: X = initial_X Y = initial_Y ... Z = initial_Z MC = [(X,Y,...,Z)] for _ in range(num_iters): \t// for each variable we compute \tX ~ q_X(x|X,Y,...,Z) \t// We do this for X, Y, ..., Z in turn We can combine Gibbs with Metropolis quite easily: X = initial_X Y = initial_Y ... 
Z = initial_Z MC = [(X,Y,...,Z)] for _ in range(num_iters): \t// for each variable we compute \tprospect ~ q_X(x|X,Y,...,Z) \tr = min(f(prospect)/f(X), 1) \ta ~ uniform(0,1) \tif a < r: \t\tX = prospect \t// We do this for X, Y, ..., Z in turn So, basically, you update one at a time. Using this algorithm, we can analyze the data from the PSU website we mentioned above. In fact we did that very thing. Below you’ll find some graphs depicting the distribution of \\( k \\) values. Summary Random-Walk-Metropolis-Hastings Can be used to sample difficult univariate distributions relatively easily Have to tune the sampling parameter Curse of dimensionality in tuning parameters Requires \\( f \\) to be defined on all of \\( \\mathbb{R} \\) Transform as needed Gibbs Sampling Turn high dimensional sampling into iterative one-dimensional sampling Gibbs with Metropolis-Hastings Lovely Bibliography Summer School in Astrostatistics ",
"url": "/statistics/2017/04/19/MCMC/"
},
{
"title": "Decomposition of Autoregressive Models",
"excerpt": "We discuss the decomposition of AR models into components, and how eigenvalues are involved (because they always are).\n",
"content": "Autoregression Background This post will be a formal introduction into some of the theory of Autoregressive models. Specifically, we’ll tackle how to decompose an AR(p) model into a bunch of AR(1) models. We’ll discuss the interpretability of these models and howto therein. This post will develop the subject using what seems to be an atypical approach, but one that I find to be very elegant. The traditional way Let $x_t$ be an AR(p) process. So $x_t = \\sum\\limits_{i=1}^pa_ix_{t-i} + w_t$. We can express $x_t$ thusly: Where $L$ is the lag operator. We define the AR polynomial $\\Phi$ as $\\Phi(L) := \\left(1 - \\sum\\limits_{i=1}^pa_iL^i\\right)$. Then as long as we’re considering this polynomial to be over an algebraically closed field (like $\\mathbb{C}$), we can factor this polynomial. So Where $\\varphi_i$ are the roots of $\\Phi$. This polynomial will be the star of our show today. In fact, if we want to understand $\\Phi$ (the AR process we’re considering), we need only understand each $x_t = \\varphi_ix_{t-1} + w_t$ model. We’ll actually prove this later. We’ll be investigating several of its properties but first (just for fun) let’s investigate the conditions under which it’s invertible. We can reduce the problem of determining whether or not the AR polynomial is invertible to the problem of inverting each factor. After all, the polynomial is invertible if and only if each factor is. So let’s restrict our considerations to just a single factor for the time being. When will $x-\\lambda$ be invertible? Now, we recall the Taylor Series Expansion formula from our trusty calculus class. In general, we tend to use $a=0$ as a good arbitrary choice (the Maclaurin Series), resulting in: Well, if we consider the Taylor expansion of $\\frac{1}{x-\\lambda}$, we find Which is square summable if and only if $|\\lambda| > 1$. 
So we see that if the AR roots are all outside the unit circle, then the AR polynomial will be invertible (in such a case, the AR process is called causal). The new(?) approach Definition Let $X_t\\in \\mathbb{R}^n, A_i\\in\\mathbb{R}^{n\\times n}, B\\in\\mathbb{R}^{n\\times k_0}$, and $p\\in\\mathbb{N}_{>0}$. Also let $W_t$ be $k_0$ dimensional white noise. Then, $X_t$ is called a VAR(p) process if $X_t = \\sum\\limits_{i=1}^pA_iX_{t-i} + BW_t$. Basically a VAR(p) model is an analogue of AR(p) models where the scalars are matrices and the variable itself is a vector. The noise in such a model need not be independent per coordinate, nor restricted to only one. In fact, we can have any number of driving noise processes that get mixed together linearly into $X_t$. Hence cometh $B$ – the linear transformation from the driving noise space into the $X_t$ space. Furthermore, let’s say that we don’t observe $X_t$ directly, but rather we observe some $Y_t$ that’s a linear function of $X_t$ (plus some noise). Formally, we observe $Y_t$ s.t.: $Y_t = CX_t + DU_t$, where $Y_t \\in\\mathbb{R}^m, C\\in\\mathbb{R}^{m\\times n}, D\\in \\mathbb{R}^{m\\times k_1}$. And again, $U_t$ is $k_1$ dimensional white noise. Here $D$ serves the same purpose as $B$ did above, but for the observations themselves. This is called a “State Space” model. Examplificate How does this relate to our star (the AR polynomial $\\Phi$)? Let’s try to express an AR(p) process as a VAR(1) process… Let $x_t$ be an AR(p) process. That is to say, $x_t = \\sum\\limits_{i=1}^p a_ix_{t-i} + w_t$ where $w_t$ is some white noise process. Let $W_t$ be a 1 dimensional white-noise process such that $W_t = w_t$. We define $X_t = (x_t, x_{t-1}, \\ldots, x_{t-p+1})^T$. Then, we can see that the first coordinate of $X_t$ is always $x_t$. Our notation for this will be $X_t[0] = x_t$. Furthermore, let $A$ be the $p\\times p$ matrix whose first row is $(a_1, \\ldots, a_p)$, with ones on the subdiagonal and zeros elsewhere. We can see that for $0 < i < p$, we have $(AX_{t-1})[i] = x_{t-i}$, and $(AX_{t-1})[0] = \\sum\\limits_{i=1}^p a_i x_{t-i} = x_t - w_t$. 
As it stands, $AX_{t-1}$ is almost $X_t$, just without the added noise in the first coordinate. So let’s add noise to only the first coordinate. To this end, let and Now we want $Y_t = x_t$. So we set That way, $Y_t = CX_t = x_t$ and all is right with the world. So yay! We’ve expressed this AR(p) process as a $p$ dimensional $VAR(1)$ process! So… Why? Why do? Well, one natural question to ask at this point is, what are the eigenvalues of $A$? This polynomial is the characteristic polynomial of $A$, so the roots of this polynomial are the eigenvalues of $A$. Let’s see how this relates to $\\Phi$… So the characteristic polynomial is really just $\\Phi(L)$ revisited! Furthermore, check this sweetness out… So we can see that the eigenvalues are actually intimately related with the roots (in fact they’re inverses of each other)! With that in mind, let’s see what happens when we try to diagonalize this matrix $A$. Firstly, let’s assume that $A$ matrix is diagonalizable. I.e., let where $H$ is a coordinate transformation (an invertible map) and $\\bar A$ is diagonal. Then So if we let $\\tilde X_t = HX_t$, we get So $\\tilde X_t$ is a bunch of AR(1) processes (because the matrix $\\bar A$ is diagonal), and, Which is to say that $Y_t$ is a linear combination of AR(1) processes! So we started out with a general AR(p) process, and found that expressing it as a VAR(1) model allows us to see that this is just a linear combination of AR(1) processes. I think that’s pretty cool. But let’s investigate this a bit further… What more?! As per our previous section, we have that $Y_t$, which was really just $x_t$ – our original AR(p) variable, is a linear combination of AR(1) models. More specifically, if we let $F := CH^{-1}$, then So indeed, $x_t$ is a linear combination of $p$ many AR(1) processes. 
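To make the eigenvalue and root relationship concrete, here is a small Python sketch for the AR(2) case. It assumes the matrix A has the companion form [[a1, a2], [1, 0]] (first row holds the AR coefficients, second row shifts the state), and it writes the AR polynomial as Phi(z) = 1 - a1*z - a2*z^2. The function names and the coefficients 0.5 and 0.3 are hypothetical choices for illustration.

```python
import cmath


def quadratic_roots(a, b, c):
    # Both roots of a*z^2 + b*z + c = 0, assuming a != 0.
    disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + disc) / (2 * a), (-b - disc) / (2 * a))


def companion_eigenvalues(a1, a2):
    # Eigenvalues of [[a1, a2], [1, 0]] solve lam^2 - a1*lam - a2 = 0.
    return quadratic_roots(1.0, -a1, -a2)


def ar_roots(a1, a2):
    # Roots of Phi(z) = 1 - a1*z - a2*z^2, i.e. -a2*z^2 - a1*z + 1 = 0.
    # Assumes a2 != 0 so that the polynomial really has degree 2.
    return quadratic_roots(-a2, -a1, 1.0)


if __name__ == '__main__':
    eigs = companion_eigenvalues(0.5, 0.3)
    roots = ar_roots(0.5, 0.3)
    for lam in eigs:
        # Print each eigenvalue next to the reciprocals of the AR roots.
        print(lam, [1 / z for z in roots])
```

Each eigenvalue should show up among the reciprocals of the AR roots, which is the inverse relationship described above.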
In fact, since $\\tilde X_t = \\bar A \\tilde X_{t-1} + \\left(HB\\right)W_t$, we know the AR(1) models explicitly: Where $\\lambda_i$ is the $i$th eigenvalue of $A$ (hence the $(i,i)$th element of $\\bar A$). So we see that the AR(1) processes each have their respective roots from the roots of the original AR polynomial $\\Phi$. Pretty cool, right? Summary So we can express any AR(p) model as a linear combination of AR(1) processes (albeit with correlated noise terms), where each AR(1) process is determined by the roots of the AR polynomial (or the characteristic polynomial). We’ve essentially reduced the problem of studying AR(p) models to studying their eigenvalues (or AR roots). ",
"url": "/statistics/2017/10/25/Autoregression/"
},
{
"title": "Categories",
"excerpt": "Category index\n",
"content": " ",
"url": "/categories/"
},
{
"title": "Welcome to the site of a MathemaNiskin!",
"excerpt": "A boilerplate for my mind. Here is where I will blog about whatever: Math, Statistics, Economics, Computer Science, Algorithms, etc.\n",
"content": "Welcome to my home on the internet. I’ll be posting sporadically about whatever interests me. Generally that’ll be more STEM oriented, but I imagine once I get more used to this whole blogging thing, I’ll branch out. For now the posts are mainly about Math, Algorithms, Optimization, Statistics, Sampling, etc. Enjoy the site and please feel free to email me, if you come across any issues, typos, mistakes, etc. or if you just feel like talking with me. I’m always up to chattin with new people. ",
"url": "/"
},
{
"title": "Search",
"excerpt": "Search for a page or post you’re looking for\n",
"content": " ",
"url": "/search/"
}
]